    _              _         ____
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        

Articles: 17

Last Updated: 2025-08-06 23:58:56 (+00:00)


Toward Errorless Training ImageNet-1k

In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5], reaching an accuracy rate of 98.3%, a 99.69 Top-1 rate, and an average of 285.9 labels perfectly classified over the 10 batch partitions of the dataset. The best-performing model uses 322,430,160 parameters at 4-decimal-place precision. We conjecture that our model does not achieve a 100% accuracy rate because of a double-labeling problem, in which duplicate images appear in the dataset with different labels.
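The double-labeling conjecture above can be made concrete with a toy check (an editor's illustration, not the paper's method): fingerprint each image by content and flag any fingerprint that appears under more than one label. The dataset, labels, and use of Python's built-in `hash` as a stand-in fingerprint are all illustrative assumptions; real code would hash the pixel data, e.g. with SHA-256.

```python
from collections import defaultdict

def find_label_conflicts(dataset):
    """Group images by content fingerprint and flag any fingerprint that
    appears under more than one label: the 'double-labeling' pattern
    conjectured in the abstract."""
    labels_by_hash = defaultdict(set)
    for image_bytes, label in dataset:
        # Toy fingerprint; a real pipeline would hash decoded pixels.
        labels_by_hash[hash(image_bytes)].add(label)
    return {h: labs for h, labs in labels_by_hash.items() if len(labs) > 1}

# Toy dataset: the same image bytes appear under two different labels.
dataset = [(b"img-A", "tabby cat"), (b"img-B", "goldfish"), (b"img-A", "tiger cat")]
conflicts = find_label_conflicts(dataset)
```

Such conflicting pairs put a hard ceiling below 100% accuracy, since no single-label classifier can satisfy both annotations at once.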

Updated: 2025-08-06 23:58:56

Categories: cs.CV,cs.LG,68T07

Download: http://arxiv.org/abs/2508.04941v1

RLSR: Reinforcement Learning from Self Reward

Large language models can generate solutions to complex problems, but training them with reinforcement learning typically requires verifiable rewards that are expensive to create and unavailable in many domains. We demonstrate that LLMs can effectively self-improve through self-judging without reference solutions, leveraging the inherent asymmetry between generating and verifying solutions. Our experiments show that models can provide reliable reward signals without ground-truth answers, enabling reinforcement learning in domains where verifiable rewards are impractical. By implementing self-judging across Countdown puzzles and integration problems, we achieve performance comparable to formal verification without ground-truth solutions. Most notably, Qwen 2.5 7B DeepSeek Distilled trained with self-rewards qualifies for the prestigious MIT Integration Bee competition, reaching this performance purely through self-supervised improvement. When combined with synthetic question generation, we establish a complete self-improvement loop in which models generate practice problems, solve them, and evaluate their own performance without any external validation. Our findings demonstrate that LLM judges can provide effective reward signals for training, unlocking reinforcement learning in countless domains previously limited by reward-engineering challenges. This work represents a significant step toward autonomous AI systems that continuously improve through self-directed learning rather than human-guided training, potentially accelerating progress across domains where training data is scarce or evaluation is complex.
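The self-rewarding loop described above can be sketched generically (an editor's illustration, not the paper's implementation): sample several candidate solutions, score each with a judge that sees no ground truth, and reinforce the best-scoring one. Here a toy arithmetic checker stands in for the LLM-as-judge, and the sampled candidates are hypothetical.

```python
def self_reward_step(problem, generate, judge, n_candidates=4):
    """One step of a self-rewarding loop: sample candidates, score each
    with the model-as-judge (no reference answer needed), and return the
    best candidate together with its reward signal."""
    candidates = [generate(problem) for _ in range(n_candidates)]
    rewards = [judge(problem, c) for c in candidates]  # verifying is cheaper than generating
    best = max(range(n_candidates), key=lambda i: rewards[i])
    return candidates[best], rewards[best]

# Toy Countdown-style task: reach the target from the given numbers.
# The "judge" just checks the arithmetic, standing in for LLM self-judgment.
problem = {"numbers": [3, 5], "target": 8}
samples = iter(["3*5", "5-3", "3+5", "3-5"])  # hypothetical sampled candidates
generate = lambda p: next(samples)
judge = lambda p, expr: 1.0 if eval(expr) == p["target"] else 0.0
best, reward = self_reward_step(problem, generate, judge)
```

The asymmetry the abstract mentions is visible even in this toy: evaluating an expression is much cheaper than searching for one.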

Updated: 2025-08-06 23:51:16

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.08827v2

ALScope: A Unified Toolkit for Deep Active Learning

Deep Active Learning (DAL) reduces annotation costs by selecting the most informative unlabeled samples during training. As real-world applications become more complex, challenges stemming from distribution shifts (e.g., open-set recognition) and data imbalance have gained increasing attention, prompting the development of numerous DAL algorithms. However, the lack of a unified platform has hindered fair and systematic evaluation under diverse conditions. Therefore, we present ALScope, a new DAL platform for classification tasks that integrates 10 datasets from computer vision (CV) and natural language processing (NLP) and 21 representative DAL algorithms, including both classical baselines and recent approaches designed to handle challenges such as distribution shifts and data imbalance. This platform supports flexible configuration of key experimental factors, ranging from algorithm and dataset choices to task-specific factors such as the out-of-distribution (OOD) sample ratio and the class imbalance ratio, enabling comprehensive and realistic evaluation. We conduct extensive experiments on this platform under various settings. Our findings show that: (1) DAL algorithms' performance varies significantly across domains and task settings; (2) in non-standard scenarios such as imbalanced and open-set settings, DAL algorithms show room for improvement and require further investigation; and (3) some algorithms achieve good performance but require significantly longer selection times.
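The selection step at the core of the DAL algorithms benchmarked above can be illustrated with the classic uncertainty-sampling criterion (an editor's sketch of one baseline, not ALScope's API): query the pooled samples whose predicted probability sits closest to the decision boundary. The pool and the stand-in classifier below are illustrative.

```python
def uncertainty_sampling(pool, predict_proba, budget):
    """Pick the `budget` unlabeled samples whose predicted probability
    is closest to 0.5, the classic uncertainty criterion that many DAL
    baselines build on."""
    scored = sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))
    return scored[:budget]

# Toy 1-D pool with a hypothetical probabilistic classifier.
pool = [0.1, 0.45, 0.52, 0.9, 0.48]
predict_proba = lambda x: x  # stand-in: the feature value doubles as P(y=1)
queried = uncertainty_sampling(pool, predict_proba, budget=2)
```

Finding (3) in the abstract concerns exactly this step: more elaborate selection rules can beat this baseline but cost far more time per query round.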

Updated: 2025-08-06 23:39:46

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2508.04937v1

INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM

Traditional control and planning for robotic manipulation heavily rely on precise physical models and predefined action sequences. While effective in structured environments, such approaches often fail in real-world scenarios due to modeling inaccuracies and struggle to generalize to novel tasks. In contrast, humans intuitively interact with their surroundings, demonstrating remarkable adaptability and making efficient decisions through implicit physical understanding. In this work, we propose INTENTION, a novel framework enabling robots with learned interactive intuition and autonomous manipulation in diverse scenarios, by integrating Vision-Language Model (VLM) based scene reasoning with interaction-driven memory. We introduce a Memory Graph that records scenes from previous task interactions, embodying human-like understanding of and decision-making about different tasks in the real world. Meanwhile, we design an Intuitive Perceptor that extracts physical relations and affordances from visual scenes. Together, these components empower robots to infer appropriate interaction behaviors in new scenes without relying on repetitive instructions. Videos: https://robo-intention.github.io
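The record-and-retrieve role of the Memory Graph can be sketched at its simplest (an editor's toy, not the paper's data structure): store (scene, interaction) pairs from past tasks and, for a new scene, return the interaction whose stored scene is most similar. The attribute-overlap similarity and the scene encodings below are illustrative placeholders.

```python
class MemoryGraph:
    """Minimal sketch of an interaction memory: record past
    (scene, interaction) pairs and retrieve the interaction attached
    to the most similar stored scene."""
    def __init__(self):
        self.edges = []  # list of (scene_attributes, interaction)

    def record(self, scene, interaction):
        self.edges.append((scene, interaction))

    def infer(self, scene):
        # Toy similarity: number of shared scene attributes.
        sim = lambda past: len(set(past) & set(scene))
        _, interaction = max(self.edges, key=lambda e: sim(e[0]))
        return interaction

memory = MemoryGraph()
memory.record(("cup", "table"), "grasp-handle")
memory.record(("door", "handle"), "pull-open")
action = memory.infer(("cup", "shelf"))
```

Even this toy shows the intended behavior: a novel scene ("cup", "shelf") reuses experience from the closest prior interaction rather than requiring a new instruction.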

Updated: 2025-08-06 23:27:22

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2508.04931v1

Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced by conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoor and outdoor scenes, where we consistently improve over state-of-the-art methods using a single set of tokens for both settings. Code available at: https://github.com/JungHeeKim29/calibration-token.

Updated: 2025-08-06 23:23:20

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04928v1

Taxonomy of Faults in Attention-Based Neural Networks

Attention mechanisms are at the core of modern neural architectures, powering systems ranging from ChatGPT to autonomous vehicles and driving a major economic impact. However, high-profile failures, such as ChatGPT's nonsensical outputs or Google's suspension of Gemini's image generation due to attention weight errors, highlight a critical gap: existing deep learning fault taxonomies might not adequately capture the unique failures introduced by attention mechanisms. This gap leaves practitioners without actionable diagnostic guidance. To address this gap, we present the first comprehensive empirical study of faults in attention-based neural networks (ABNNs). Our work is based on a systematic analysis of 555 real-world faults collected from 96 projects across ten frameworks, including GitHub, Hugging Face, and Stack Overflow. Through our analysis, we develop a novel taxonomy comprising seven attention-specific fault categories, not captured by existing work. Our results show that over half of the ABNN faults arise from mechanisms unique to attention architectures. We further analyze the root causes and manifestations of these faults through various symptoms. Finally, by analyzing symptom-root cause associations, we identify four evidence-based diagnostic heuristics that explain 33.0% of attention-specific faults, offering the first systematic diagnostic guidance for attention-based models.
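One concrete example of an attention-weight fault of the kind the taxonomy above targets (the specific fault chosen here is an editor's illustration, not one of the paper's seven categories) is omitting the 1/sqrt(d) logit scaling in scaled dot-product attention, which silently saturates the softmax as dimensionality grows. A minimal pure-Python sketch:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, keys, scale=True):
    """Scaled dot-product attention weights. Passing scale=False
    reproduces a common bug: unscaled logits grow with dimension d
    and push the softmax toward a one-hot distribution."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        logits = [l / math.sqrt(d) for l in logits]
    return softmax(logits)

w = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Faults like this are easy to miss precisely because the weights still sum to one and the model still trains, which is why symptom-based diagnostic heuristics matter.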

Updated: 2025-08-06 23:20:18

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.04925v1

Verbalized Representation Learning for Interpretable Few-Shot Generalization

Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main.

Updated: 2025-08-06 23:09:12

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2411.18651v3

How Effective are Large Time Series Models in Hydrology? A Study on Water Level Forecasting in Everglades

The Everglades play a crucial role in flood and drought regulation, water resource planning, and ecosystem management in the surrounding regions. However, traditional physics-based and statistical methods for predicting water levels often face significant challenges, including high computational costs and limited adaptability to diverse or unforeseen conditions. Recent advancements in large time series models have demonstrated the potential to address these limitations, with state-of-the-art deep learning and foundation models achieving remarkable success in time series forecasting across various domains. Despite this progress, their application to critical environmental systems, such as the Everglades, remains underexplored. In this study, we fill this gap by investigating twelve task-specific models and five time series foundation models across six categories for a real-world application focused on water level prediction in the Everglades. Our primary results show that the foundation model Chronos significantly outperforms all other models, while the remaining foundation models exhibit relatively poor performance. We also observe that the performance of task-specific models varies with model architecture, and we discuss possible reasons. We hope our study and findings will inspire the community to explore the applicability of large time series models in hydrological applications. The code and data are available at https://github.com/rahuul2992000/Everglades-Benchmark.

Updated: 2025-08-06 23:04:10

Categories: cs.LG

Download: http://arxiv.org/abs/2505.01415v2

Scaling Laws For Mixed Quantization

Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the memory and computational requirements for inference. In this study, we focus on a straightforward question: When aiming for a target accuracy or perplexity with low-precision quantization, how much high-precision computation needs to be preserved, and how fine-grained this quantization would need to be as we scale LLMs to larger sizes? We first introduce two critical metrics, named the quantization ratio ($Q_r$) and quantization block size ($Q_b$). The former measures the number of parameters quantized to low-precision arithmetic normalized by the total parameter count, whereas the latter defines the number of values within a block that share a scaling factor, akin to the block size concept introduced in the FP4 format in NVIDIA's Blackwell architecture. Through extensive and carefully controlled experiments across different models and quantization methods, we propose a unified scaling law on post-training quantization (PTQ) that can predict loss degeneration for varying $Q_r$ and $Q_b$. For $Q_r$, our scaling law implies that parameter scaling and ratio scaling have a multiplicative relationship. Consequently, larger models are more amenable to a higher quantization ratio $Q_r$, thus supporting an increase in the adoption of mixed quantization for inference. Regarding $Q_b$, our findings indicate that a small block size, similar to that used in Blackwell, is not essential for large models. Employing a small $Q_b$ can instead unnecessarily complicate the design of the hardware circuit.
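The two metrics defined above are simple to compute, and block quantization itself can be sketched in a few lines (an editor's toy with absmax scaling and 4-bit-style integer levels; the paper's actual quantizers and formats may differ):

```python
def quantization_ratio(n_low_precision, n_total):
    """Q_r from the abstract: low-precision parameter count
    normalized by the total parameter count."""
    return n_low_precision / n_total

def block_quantize(values, block_size):
    """Toy block quantization illustrating Q_b: every `block_size`
    consecutive values share one scaling factor (here the block's max
    magnitude, an absmax scheme), loosely analogous to the FP4 block
    scaling mentioned in the abstract. Values map to integer levels
    in [-7, 7] and back."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) or 1.0
        out.extend(round(v / scale * 7) * scale / 7 for v in block)
    return out

q_r = quantization_ratio(9, 12)                      # 9 of 12 params in low precision
deq = block_quantize([0.5, -1.0, 0.25, 4.0], block_size=2)
```

The second block shows why Q_b matters: with block_size=2, the outlier 4.0 forces its blockmate 0.25 to round all the way to zero, whereas a smaller block would have isolated the outlier, at the cost of storing more scale factors in hardware.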

Updated: 2025-08-06 22:52:22

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2410.06722v3

Attention on flow control: transformer-based reinforcement learning for lift regulation in highly disturbed flows

A linear flow control strategy designed for weak disturbances may not remain effective in sequences of strong disturbances due to nonlinear interactions, but it is sensible to leverage it for developing a better strategy. In the present study, we propose a transformer-based reinforcement learning (RL) framework to learn an effective control strategy for regulating aerodynamic lift in arbitrarily long gust sequences via pitch control. The random gusts produce intermittent, high-variance flows observed only through limited surface pressure sensors, making this control problem inherently challenging compared to stationary flows. The transformer addresses the challenge of partial observability from the limited surface pressures. We demonstrate that the training can be accelerated with two techniques -- pretraining with an expert policy (here, linear control) and task-level transfer learning (here, extending a policy trained on isolated gusts to multiple gusts). We show that the learned strategy outperforms the best proportional control, with the performance gap widening as the number of gusts increases. The control strategy learned in an environment with a small number of successive gusts is shown to effectively generalize to an environment with an arbitrarily long sequence of gusts. We investigate the pivot configuration and show that quarter-chord pitching control can achieve superior lift regulation with substantially less control effort compared to mid-chord pitching control. Through a decomposition of the lift, we attribute this advantage to the dominant added-mass contribution accessible via quarter-chord pitching.

Updated: 2025-08-06 22:45:40

Categories: physics.flu-dyn,cs.LG

Download: http://arxiv.org/abs/2506.10153v2

Navigating Cookie Consent Violations Across the Globe

Online services provide users with cookie banners to accept/reject the cookies placed on their web browsers. Despite the increased adoption of cookie banners, little has been done to ensure that cookie consent is compliant with privacy laws around the globe. Prior studies have found that cookies are often placed on browsers even after their explicit rejection by users. These inconsistencies in cookie banner behavior circumvent users' consent preferences and are known as cookie consent violations. To address this important problem, we propose an end-to-end system, called ConsentChk, that detects and analyzes cookie banner behavior. ConsentChk uses a formal model to systematically detect and categorize cookie consent violations. We investigate eight English-speaking regions across the world, and analyze cookie banner behavior across 1,793 globally-popular websites. Cookie behavior, cookie consent violation rates, and cookie banner implementations are found to be highly dependent on region. Our evaluation reveals that consent management platforms (CMPs) and website developers likely tailor cookie banner configurations based on their (often incorrect) interpretations of regional privacy laws. We discuss various root causes behind these cookie consent violations. The resulting implementations produce misleading cookie banners, indicating the prevalence of inconsistently implemented and enforced cookie consent between various regions.
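The core check that a system like ConsentChk formalizes can be sketched as follows (an editor's toy, not ConsentChk's actual model): compare the user's per-category consent choices against the cookies actually observed on the browser, and report any cookie set despite an explicit rejection. The category names and cookie names below are illustrative.

```python
def find_violations(consent, observed_cookies):
    """Report cookies placed despite an explicit rejection: for each
    observed (name, category) pair, flag it if the user's consent for
    that category is an explicit False."""
    return [name for name, category in observed_cookies
            if consent.get(category) is False]

# Hypothetical banner state: user accepted necessary, rejected the rest.
consent = {"necessary": True, "analytics": False, "advertising": False}
observed = [("session_id", "necessary"),
            ("_ga", "analytics"),
            ("ad_track", "advertising")]
violations = find_violations(consent, observed)
```

The hard parts in practice, which the toy omits, are recovering the declared category of each cookie and replaying the banner interaction, which is where regional differences in CMP configuration show up.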

Updated: 2025-08-06 22:45:37

Categories: cs.CR

Download: http://arxiv.org/abs/2506.08996v2

Towards Scalable Bayesian Optimization via Gradient-Informed Bayesian Neural Networks

Bayesian optimization (BO) is a widely used method for data-driven optimization that generally relies on zeroth-order data of the objective function to construct probabilistic surrogate models. These surrogates guide the exploration-exploitation process toward finding the global optimum. While Gaussian processes (GPs) are commonly employed as surrogates of the unknown objective function, recent studies have highlighted the potential of Bayesian neural networks (BNNs) as scalable and flexible alternatives. Moreover, incorporating gradient observations into GPs, when available, has been shown to improve BO performance. However, the use of gradients within BNN surrogates remains unexplored. By leveraging automatic differentiation, gradient information can be seamlessly integrated into BNN training, resulting in more informative surrogates for BO. We propose a gradient-informed loss function for BNN training, effectively augmenting function observations with local gradient information. The effectiveness of this approach is demonstrated on well-known benchmarks in terms of improved BNN predictions and faster BO convergence as the number of decision variables increases.
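The shape of a gradient-informed loss can be shown with a deliberately tiny surrogate (an editor's sketch: a 1-D linear model whose gradient is available in closed form, standing in for a BNN whose gradients would come from automatic differentiation; the weighting `lam` is an assumed hyperparameter):

```python
def gradient_informed_loss(params, data, lam=1.0):
    """Squared error on function observations augmented with squared
    error on gradient observations. The surrogate is f(x) = a*x + b,
    so its gradient w.r.t. x is simply `a`."""
    a, b = params
    loss = 0.0
    for x, y, g in data:  # (input, function value, gradient observation)
        loss += (a * x + b - y) ** 2   # zeroth-order term
        loss += lam * (a - g) ** 2     # gradient-informed term
    return loss

# Data generated from f(x) = 2x + 1, so every gradient observation is 2.
data = [(0.0, 1.0, 2.0), (1.0, 3.0, 2.0), (2.0, 5.0, 2.0)]
perfect = gradient_informed_loss((2.0, 1.0), data)  # true parameters
off = gradient_informed_loss((1.0, 1.0), data)      # wrong slope
```

The gradient term penalizes the wrong slope at every observation point, which is the extra information that sharpens the surrogate as dimensionality grows.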

Updated: 2025-08-06 22:41:42

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2504.10076v2

ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis

The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.

Updated: 2025-08-06 22:39:38

Categories: cs.AI,cs.CL,cs.MA

Download: http://arxiv.org/abs/2508.04915v1

Advancing Hate Speech Detection with Transformers: Insights from the MetaHate

Hate speech is a widespread and harmful form of online discourse, encompassing slurs and defamatory posts that can have serious social, psychological, and sometimes physical impacts on targeted individuals and communities. As social media platforms such as X (formerly Twitter), Facebook, Instagram, Reddit, and others continue to facilitate widespread communication, they also become breeding grounds for hate speech, which has increasingly been linked to real-world hate crimes. Addressing this issue requires the development of robust automated methods to detect hate speech in diverse social media environments. Deep learning approaches, such as vanilla recurrent neural networks (RNNs), long short-term memory (LSTM), and convolutional neural networks (CNNs), have achieved good results, but are often limited by issues such as long-term dependencies and inefficient parallelization. This study presents a comprehensive exploration of transformer-based models for hate speech detection using the MetaHate dataset, a meta-collection of 36 datasets with 1.2 million social media samples. We evaluate multiple state-of-the-art transformer models, including BERT, RoBERTa, GPT-2, and ELECTRA, with fine-tuned ELECTRA achieving the highest performance (F1 score: 0.8980). We also analyze classification errors, revealing challenges with sarcasm, coded language, and label noise.

Updated: 2025-08-06 22:36:17

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2508.04913v1

Theorem-Carrying Transactions: Runtime Verification to Ensure Interface Specifications for Smart Contract Safety

Security bugs and trapdoors in smart contracts have been impacting the Ethereum community since its inception. Conceptually, Ethereum's 1.45 million contracts form a single "gigantic program" whose behaviors are determined by the complex compositions of contracts. Can programmers be assured that this gigantic program conforms to high-level safety specifications, despite unforeseeable code-level intricacies? Static code verification cannot be faithful to this gigantic program due to its scale and high polymorphism. In this paper, we present a viable approach to achieve this goal. Our technology, called Theorem-Carrying Transactions (TCT), combines the benefits of concrete execution and symbolic proofs. Under the TCT protocol, every transaction carries a theorem that proves its adherence to the specified properties in the invoked contracts, and the runtime system checks the theorem before executing the transaction. Once a theorem is proven, it will be reused for future transactions, so TCT's runtime overhead is minimal. As case studies, we demonstrate that TCT secures token contracts without foreseeing code-level intricacies, such as integer overflow and reentrancy. TCT is also successfully applied to a Uniswap codebase, showcasing a complex decentralized finance (DeFi) scenario. Our evaluation shows a negligible runtime overhead, two orders of magnitude lower than a state-of-the-art approach for runtime checking of contract code safety.
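The check-once, reuse-forever structure of the TCT protocol can be sketched as a small runtime (an editor's toy: the proof checker is a stand-in predicate, not a real theorem prover, and the transaction, theorem, and proof strings are hypothetical):

```python
class TCTRuntime:
    """Sketch of the TCT protocol: each transaction carries a theorem
    and a proof; the runtime checks the proof before execution and
    caches proven theorems so identical later transactions skip the
    check, keeping runtime overhead minimal."""
    def __init__(self, check_proof):
        self.check_proof = check_proof
        self.proven = set()   # cache of already-verified theorems
        self.checks_run = 0

    def execute(self, tx, theorem, proof):
        if theorem not in self.proven:
            self.checks_run += 1
            if not self.check_proof(theorem, proof):
                raise ValueError("transaction rejected: proof failed")
            self.proven.add(theorem)
        return tx()  # proof accepted: run the concrete transaction

runtime = TCTRuntime(check_proof=lambda thm, prf: prf == f"proof-of-{thm}")
result = runtime.execute(lambda: "transfer ok", "no-overflow", "proof-of-no-overflow")
repeat = runtime.execute(lambda: "transfer ok", "no-overflow", "proof-of-no-overflow")
```

The cache is the amortization argument from the abstract in miniature: the second transaction reuses the proven theorem, so only one proof check ever runs.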

Updated: 2025-08-06 22:31:48

Categories: cs.CR,cs.PL

Download: http://arxiv.org/abs/2408.06478v2

Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation

Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce \emph{Autospeculative Decoding} (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a $\tilde{O} (K^{\frac{1}{3}})$ parallel runtime speedup over the $K$ step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.
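Speculative decoding, which ASD extends to diffusion models, follows a draft-then-verify pattern: propose several steps cheaply, check them against the exact sequential update, and keep the longest accepted prefix plus one corrected step. The sketch below shows that generic pattern with explicit toy functions; in ASD the draft comes from the model itself via the exchangeability of DDPM increments rather than from an auxiliary model, so treat this purely as an editor's illustration of the accept/correct loop.

```python
def speculative_step(draft, verify, state, k):
    """Propose k steps with a cheap draft, then walk the proposals,
    comparing each to the exact update; accept matching steps, and on
    the first mismatch substitute the exact step and stop. In practice
    the k verifications run in parallel, which is the speedup source."""
    proposals, s = [], state
    for _ in range(k):
        s = draft(s)
        proposals.append(s)
    accepted, s = [], state
    for p in proposals:
        exact = verify(s)          # exact sequential update
        if p != exact:
            accepted.append(exact)  # correct the first mismatch, then stop
            break
        accepted.append(p)
        s = p
    return accepted

# Toy chain: the exact update adds 1; the draft is right until it overshoots.
draft = lambda s: s + 1 if s < 2 else s + 2
verify = lambda s: s + 1
steps = speculative_step(draft, verify, state=0, k=4)
```

Every accepted prefix step replaces one sequential iteration, which is how a K-step sampler can finish in sublinear parallel time when drafts are usually right.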

Updated: 2025-08-06 22:30:24

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.03983v2

Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

We have seen remarkable progress in large language models (LLMs) empowered multi-agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompts optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.

Updated: 2025-08-06 22:30:21

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.16086v2

Explainable Evidential Clustering

Unsupervised classification is a fundamental machine learning problem. Real-world data often contain imperfections, characterized by uncertainty and imprecision, which are not well handled by traditional methods. Evidential clustering, based on Dempster-Shafer theory, addresses these challenges. This paper explores the underexplored problem of explaining evidential clustering results, which is crucial for high-stakes domains such as healthcare. Our analysis shows that, in the general case, representativity is a necessary and sufficient condition for decision trees to serve as abductive explainers. Building on the concept of representativity, we generalize this idea to accommodate partial labeling through utility functions. These functions enable the representation of "tolerable" mistakes, leading to the definition of evidential mistakeness as explanation cost and the construction of explainers tailored to evidential classifiers. Finally, we propose the Iterative Evidential Mistake Minimization (IEMM) algorithm, which provides interpretable and cautious decision tree explanations for evidential clustering functions. We validate the proposed algorithm on synthetic and real-world data. Taking into account the decision-maker's preferences, we were able to provide an explanation that was satisfactory up to 93% of the time.

Updated: 2025-08-06 22:24:18

Categories: cs.LG

Download: http://arxiv.org/abs/2507.12192v2

Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems

The integration of large language models (LLMs) into enterprise systems has introduced a new class of covert security vulnerabilities, particularly within logic execution layers and persistent memory contexts. This paper introduces Logic-layer Prompt Control Injection (LPCI), a novel category of attacks that embeds encoded, delayed, and conditionally triggered payloads within memory, vector stores, or tool outputs. These payloads can bypass conventional input filters and trigger unauthorised behaviour across sessions.

Updated: 2025-08-06 22:10:30

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.10457v2

RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory

Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks -- HotPotQA, MuSiQue, and 2WikiMultihop -- demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
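
The token-budgeted, role-aware memory selection described above can be sketched as a greedy scoring loop (our simplification; `score_fn` and `count_tokens` stand in for the paper's lightweight scoring policy and tokenizer):

```python
def route_context(memory, role, score_fn, count_tokens, budget):
    # Score each memory item for this agent's role, then greedily pack
    # the highest-scoring items without exceeding the token budget.
    ranked = sorted(memory, key=lambda item: score_fn(item, role), reverse=True)
    selected, used = [], 0
    for item in ranked:
        cost = count_tokens(item)
        if used + cost <= budget:
            selected.append(item)
            used += cost
    return selected
```

A toy score function (e.g. keyword overlap with the role) already yields the key behavior: relevant items are kept, irrelevant ones are dropped once the budget is tight.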

Updated: 2025-08-06 21:59:34

Categories: cs.CL,cs.AI,cs.MA

Download: http://arxiv.org/abs/2508.04903v1

Sensitivity of Stability: Theoretical & Empirical Analysis of Replicability for Adaptive Data Selection in Transfer Learning

The widespread adoption of transfer learning has revolutionized machine learning by enabling efficient adaptation of pre-trained models to new domains. However, the reliability of these adaptations remains poorly understood, particularly when using adaptive data selection strategies that dynamically prioritize training examples. We present a comprehensive theoretical and empirical analysis of replicability in transfer learning, introducing a mathematical framework that quantifies the fundamental trade-off between adaptation effectiveness and result consistency. Our key contribution is the formalization of selection sensitivity ($\Delta_Q$), a measure that captures how adaptive selection strategies respond to perturbations in training data. We prove that the replicability failure probability (the likelihood that two independent training runs produce models differing in performance by more than a threshold) increases quadratically with selection sensitivity while decreasing exponentially with sample size. Through extensive experiments on the MultiNLI corpus using six adaptive selection strategies, ranging from uniform sampling to gradient-based selection, we demonstrate that this theoretical relationship holds precisely in practice. Our results reveal that highly adaptive strategies like gradient-based and curriculum learning achieve superior task performance but suffer from high replicability failure rates, while less adaptive approaches maintain failure rates below 7%. Crucially, we show that source domain pretraining provides a powerful mitigation mechanism, reducing failure rates by up to 30% while preserving performance gains. These findings establish principled guidelines for practitioners to navigate the performance-replicability trade-off and highlight the need for replicability-aware design in modern transfer learning systems.
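
The stated scaling can be written schematically (our notation, not the paper's exact bound; $C, c > 0$ are constants):

```latex
\Pr\!\left[\,\lvert P(M_1) - P(M_2)\rvert > \epsilon\,\right]
  \;\lesssim\; C\,\Delta_Q^{2}\,e^{-c\,n}
```

where $M_1, M_2$ are models from two independent training runs on $n$ samples, $P(\cdot)$ is task performance, and $\Delta_Q$ is the selection sensitivity: quadratic growth in $\Delta_Q$, exponential decay in $n$.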

Updated: 2025-08-06 21:56:56

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04901v1

Revealing Temporal Label Noise in Multimodal Hateful Video Classification

The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrated that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
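
The timestamp-based trimming step can be sketched as a simple interval partition (our illustration; real pipelines operate on decoded video frames rather than raw time intervals):

```python
def split_by_timestamps(duration, hateful_spans):
    # Partition [0, duration] into hateful segments (from annotated
    # (start, end) timestamps) and the complementary non-hateful segments.
    spans = sorted(hateful_spans)
    non_hateful, cursor = [], 0.0
    for start, end in spans:
        if start > cursor:
            non_hateful.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        non_hateful.append((cursor, duration))
    return spans, non_hateful
```

Applied to a video annotated as hateful, this makes the label noise explicit: the non-hateful remainder often dominates the runtime.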

Updated: 2025-08-06 21:55:59

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04900v1

Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection

Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews and Pearson's correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus types are sensitive to the number of raters and agreement level among them. Among human-expert level equivalence tests, the multi-rater Turing test using Fleiss k best captured expert-level AI performance. We recommend reporting: (1) at least one balanced metric, (2) Sensitivity, specificity, PPV and NPV, (3) Multi-rater Turing test results using Fleiss k, and (4) All the above on held-out validation set. This proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection.
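
The advantage of a balanced metric such as the Matthews correlation coefficient under class imbalance is easy to see on a degenerate classifier (a generic sketch from the standard formulas, not the paper's code):

```python
import math

def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient; defined as 0 when any margin of
    # the confusion matrix is empty (the usual degenerate-case convention).
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)
```

On a 95:5 imbalanced set, a trivial "never seizure" classifier scores 0.95 accuracy but MCC 0, which is exactly the failure mode a balanced metric exposes.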

Updated: 2025-08-06 21:55:28

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04899v1

RLTHF: Targeted Human Feedback for LLM Alignment

Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model's reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM's correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
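
The reward-distribution-based triage can be sketched as a banding rule (our simplification; the 0-1 reward scale and band thresholds are illustrative assumptions, not RLTHF's actual criterion):

```python
def triage_by_reward(samples, reward, low=0.4, high=0.6):
    # Samples whose reward-model score falls in an uncertain band are
    # routed to human annotators; confidently scored samples keep
    # their LLM-assigned labels.
    hard = [s for s in samples if low <= reward(s) <= high]
    confident = [s for s in samples if not (low <= reward(s) <= high)]
    return hard, confident
```

Only the `hard` bucket consumes human-annotation budget, which is how a small fraction of targeted corrections can approach full-human alignment.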

Updated: 2025-08-06 21:39:55

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.13417v3

Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs)

Large Language Models (LLMs) are increasingly integrated with graph-structured data for tasks like node classification, a domain traditionally dominated by Graph Neural Networks (GNNs). While this integration leverages rich relational information to improve task performance, their robustness against adversarial attacks remains unexplored. We take the first step to explore the vulnerabilities of graph-aware LLMs by leveraging existing adversarial attack methods tailored for graph-based models, including those for poisoning (training-time attacks) and evasion (test-time attacks), on two representative models, LLAGA (Chen et al. 2024) and GRAPHPROMPTER (Liu et al. 2024). Additionally, we discover a new attack surface for LLAGA where an attacker can inject malicious nodes as placeholders into the node sequence template to severely degrade its performance. Our systematic analysis reveals that certain design choices in graph encoding can enhance attack success, with specific findings that: (1) the node sequence template in LLAGA increases its vulnerability; (2) the GNN encoder used in GRAPHPROMPTER demonstrates greater robustness; and (3) both approaches remain susceptible to imperceptible feature perturbation attacks. Finally, we propose an end-to-end defense framework GALGUARD, that combines an LLM-based feature correction module to mitigate feature-level perturbations and adapted GNN defenses to protect against structural attacks.

Updated: 2025-08-06 21:38:52

Categories: cs.CR,cs.AI,cs.SI

Download: http://arxiv.org/abs/2508.04894v1

Toward A Causal Framework for Modeling Perception

Perception occurs when individuals interpret the same information differently. It is a known cognitive phenomenon with implications for bias in human decision-making. Perception, however, remains understudied in machine learning (ML). This is problematic as modern decision flows, whether partially or fully automated by ML applications, always involve human experts. For instance, how might we account for cases in which two experts interpret differently the same deferred instance or explanation from a ML model? Addressing this and similar questions requires first a formulation of perception, particularly, in a manner that integrates with ML-enabled decision flows. In this work, we present a first approach to modeling perception causally. We define perception under causal reasoning using structural causal models (SCMs). Our approach formalizes individual experience as additional causal knowledge that comes with and is used by the expert decision-maker in the form of a SCM. We define two kinds of probabilistic causal perception: structural and parametrical. We showcase our framework through a series of examples of modern decision flows. We also emphasize the importance of addressing perception in fair ML, discussing relevant fairness implications and possible applications.

Updated: 2025-08-06 21:33:43

Categories: cs.AI,cs.CY,cs.HC

Download: http://arxiv.org/abs/2401.13408v4

Low-skilled Occupations Face the Highest Upskilling Pressure

Substantial scholarship has estimated the susceptibility of jobs to automation, but little has examined how job contents evolve in the information age as new technologies substitute for tasks, shifting required skills rather than eliminating entire jobs. Here we explore patterns of occupational skill change and characterize occupations and workers subject to the greatest reskilling requirements. Recent work found that changing skill requirements are greatest for STEM occupations in the 2010s. Nevertheless, analyzing 167 million online job posts covering 727 occupations, we find that skill change is greatest for low-skilled occupations when accounting for distance between skills. We further investigate the differences in skill change across employer and market size, as well as social demographic groups. We find that jobs from small employers and markets experienced larger skill upgrades to catch up with the skill demands of their large employers and markets. Female and minority workers are disproportionately employed in low-skilled jobs and face the most significant skill adjustments. While these varied skill changes could create uneven reskilling pressures across workers, they may also lead to a narrowing of gaps in job quality and prospects. We conclude by showcasing our model's potential to chart job evolution directions using skill embedding spaces.

Updated: 2025-08-06 21:33:03

Categories: cs.CY,cs.LG

Download: http://arxiv.org/abs/2101.11505v6

Retrieval-Augmented Water Level Forecasting for Everglades

Accurate water level forecasting is crucial for managing ecosystems such as the Everglades, a subtropical wetland vital for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent advances in deep learning, particularly time series foundation models, have demonstrated success in general-domain forecasting, their application in hydrology remains underexplored. Furthermore, they often struggle to generalize across diverse unseen datasets and domains, due to the lack of effective mechanisms for adaptation. To address this gap, we introduce Retrieval-Augmented Forecasting (RAF) into the hydrology domain, proposing a framework that retrieves historically analogous multivariate hydrological episodes to enrich the model input before forecasting. By maintaining an external archive of past observations, RAF identifies and incorporates relevant patterns from historical data, thereby enhancing contextual awareness and predictive accuracy without requiring the model for task-specific retraining or fine-tuning. Furthermore, we explore and compare both similarity-based and mutual information-based RAF methods. We conduct a comprehensive evaluation on real-world data from the Everglades, demonstrating that the RAF framework yields substantial improvements in water level forecasting accuracy. This study highlights the potential of RAF approaches in environmental hydrology and paves the way for broader adoption of adaptive AI methods by domain experts in ecosystem management. The code and data are available at https://github.com/rahuul2992000/WaterRAF.
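
The similarity-based retrieval of analogous episodes can be sketched as a nearest-neighbor lookup over an archive of observation windows (our toy illustration; the episode names and the Euclidean metric are assumptions, and the mutual-information variant is not shown):

```python
def retrieve_analogs(query_window, archive, k=2):
    # Rank archived multivariate episodes by Euclidean distance between
    # their observation windows and the current query window.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sorted(archive, key=lambda ep: dist(ep["window"], query_window))[:k]
```

The retrieved episodes are then concatenated to the forecaster's input, enriching its context without any retraining or fine-tuning.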

Updated: 2025-08-06 21:27:12

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04888v1

Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates

Air pollution is the world's largest environmental risk factor for human disease and premature death, resulting in more than 6 million premature deaths in 2019. Modeling one of the most important air pollutants, surface ozone, remains challenging, particularly at scales relevant to human-health impacts, with the drivers of global ozone trends at these scales largely unknown, limiting the practical use of physics-based models. We employ a 2D Convolutional Neural Network based architecture that estimates surface ozone MOMO-Chem model residuals, referred to as model bias. We demonstrate the potential of this technique in North America and Europe, highlighting its ability to better capture physical model residuals compared to a traditional machine learning method. We assess the impact of incorporating land use information from high-resolution satellite imagery to improve model estimates. Importantly, we discuss how our results can improve our scientific understanding of the factors impacting ozone bias at urban scales that can be used to improve environmental policy.

Updated: 2025-08-06 21:24:32

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04886v1

RACE-IT: A Reconfigurable Analog Computing Engine for In-Memory Transformer Acceleration

Transformer models represent the cutting edge of Deep Neural Networks (DNNs) and excel in a wide range of machine learning tasks. However, processing these models demands significant computational resources and results in a substantial memory footprint. While In-memory Computing (IMC) offers promise for accelerating Vector-Matrix Multiplications (VMMs) with high computational parallelism and minimal data movement, employing it for other crucial DNN operators remains a formidable task. This challenge is exacerbated by the extensive use of complex activation functions, Softmax, and data-dependent matrix multiplications (DMMuls) within Transformer models. To address this challenge, we introduce a Reconfigurable Analog Computing Engine (RACE) by enhancing Analog Content Addressable Memories (ACAMs) to support broader operations. Based on the RACE, we propose the RACE-IT accelerator (meaning RACE for In-memory Transformers) to enable efficient analog-domain execution of all core operations of Transformer models. Given the flexibility of our proposed RACE in supporting arbitrary computations, RACE-IT is well-suited for adapting to emerging and non-traditional DNN architectures without requiring hardware modifications. We compare RACE-IT with various accelerators. Results show that RACE-IT increases performance by 453x and 15x, and reduces energy by 354x and 122x over the state-of-the-art GPUs and existing Transformer-specific IMC accelerators, respectively.

Updated: 2025-08-06 21:22:57

Categories: cs.AR,cs.ET,cs.LG

Download: http://arxiv.org/abs/2312.06532v3

Uncertainty Quantification for Surface Ozone Emulators using Deep Learning

Air pollution is a global hazard, and as of 2023, 94% of the world's population is exposed to unsafe pollution levels. Surface Ozone (O3), an important pollutant, and the drivers of its trends are difficult to model, and traditional physics-based models fall short in their practical use for scales relevant to human-health impacts. Deep Learning-based emulators have shown promise in capturing complex climate patterns, but overall lack the interpretability necessary to support critical decision making for policy changes and public health measures. We implement an uncertainty-aware U-Net architecture to predict the Multi-mOdel Multi-cOnstituent Chemical data assimilation (MOMO-Chem) model's surface ozone residuals (bias) using Bayesian and quantile regression methods. We demonstrate the capability of our techniques in regional estimation of bias in North America and Europe for June 2019. We highlight the uncertainty quantification (UQ) scores between our two UQ methodologies and discern which ground stations are optimal and sub-optimal candidates for MOMO-Chem bias correction, and evaluate the impact of land-use information in surface ozone residual modeling.
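
Quantile regression of the kind used for one of the UQ heads minimizes the pinball loss, which can be written in a few lines (a generic sketch of the standard loss, not the paper's implementation):

```python
def pinball_loss(y_true, y_pred, q):
    # Quantile (pinball) loss: under-predictions are weighted by q,
    # over-predictions by 1 - q, so minimizing it yields the q-quantile.
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff
```

At q = 0.9, under-predicting by 2 costs 1.8 while over-predicting by 2 costs only 0.2, which is what pushes the fitted curve toward the upper quantile.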

Updated: 2025-08-06 21:22:06

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04885v1

The Cosine Schedule is Fisher-Rao-Optimal for Masked Discrete Diffusion Models

In this work, we study the problem of choosing the discretisation schedule for sampling from masked discrete diffusion models in terms of the information geometry of the induced probability path. Specifically, we show that the optimal schedule under the Fisher-Rao geometry recovers the popularly-used cosine schedule.
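
A one-coordinate illustration of why the cosine schedule is Fisher-Rao-optimal (our sketch of the standard Bernoulli computation, not the paper's full multi-token argument): with masking probability $p_t$, each token is Bernoulli($p_t$), and the Fisher-Rao line element on the Bernoulli family is

```latex
ds^2 = \frac{dp^2}{p(1-p)}, \qquad p = \sin^2\theta \;\Longrightarrow\; ds = 2\,d\theta
```

so a constant-speed (geodesic) path takes $\theta$ linear in $t$, giving $p_t = \sin^2(\pi t/2)$ and hence the familiar cosine schedule $\alpha_t = 1 - p_t = \cos^2(\pi t/2)$.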

Updated: 2025-08-06 21:19:08

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.04884v1

Gaussian mixture layers for neural networks

The mean-field theory for two-layer neural networks considers infinitely wide networks that are linearly parameterized by a probability measure over the parameter space. This nonparametric perspective has significantly advanced both the theoretical and conceptual understanding of neural networks, with substantial efforts made to validate its applicability to networks of moderate width. In this work, we explore the opposite direction, investigating whether dynamics can be directly implemented over probability measures. Specifically, we employ Gaussian mixture models as a flexible and expressive parametric family of distributions together with the theory of Wasserstein gradient flows to derive training dynamics for such measures. Our approach introduces a new type of layer -- the Gaussian mixture (GM) layer -- that can be integrated into neural network architectures. As a proof of concept, we validate our proposal through experiments on simple classification tasks, where a GM layer achieves test performance comparable to that of a two-layer fully connected network. Furthermore, we examine the behavior of these dynamics and demonstrate numerically that GM layers exhibit markedly different behavior compared to classical fully connected layers, even when the latter are large enough to be considered in the mean-field regime.
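
A toy one-dimensional stand-in for a layer parameterized by a Gaussian mixture (our illustration only; the paper trains such measures with Wasserstein gradient flows, which this sketch does not implement):

```python
import math

def gm_layer(x, weights, means, sigmas):
    # Evaluate each Gaussian component at the scalar input x and return
    # normalized responsibilities, i.e. a softmax-like output driven by
    # mixture densities rather than linear units.
    dens = [
        w * math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, sigmas)
    ]
    total = sum(dens)
    return [d / total for d in dens]
```

Even this toy version shows the qualitative difference from a fully connected layer: the output depends on distances to learned component means, not on inner products with weight vectors.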

Updated: 2025-08-06 21:16:17

Domains: cs.LG,math.ST,stat.ML,stat.TH

Download: http://arxiv.org/abs/2508.04883v1

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given these examples, CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points. CRAFT outperforms other synthetic dataset generation methods such as Self- and Evol-Instruct, and remains robust even when the quality of the initial few-shots varies.
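The retrieval stage can be sketched as plain cosine-similarity search over precomputed document embeddings; the embedding model, corpus, and function names below are our illustration, not CRAFT's implementation:

```python
import numpy as np

def retrieve(corpus_embs, fewshot_embs, k=5):
    """Similarity-based document retrieval: cosine similarity between
    each few-shot example's embedding and every corpus document,
    returning the indices of the top-k documents per example."""
    a = fewshot_embs / np.linalg.norm(fewshot_embs, axis=1, keepdims=True)
    b = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = a @ b.T                          # (n_fewshot, n_corpus)
    return np.argsort(-sims, axis=1)[:, :k]
```

The retrieved documents would then be passed to an instruction-tuned LLM for augmentation into task-formatted samples.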

Updated: 2025-08-06 21:15:05

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2409.02098v2

Hilbert Neural Operator: Operator Learning in the Analytic Signal Domain

Neural operators have emerged as a powerful, data-driven paradigm for learning solution operators of partial differential equations (PDEs). State-of-the-art architectures, such as the Fourier Neural Operator (FNO), have achieved remarkable success by performing convolutions in the frequency domain, making them highly effective for a wide range of problems. However, this approach has limitations, including the periodicity assumption of the Fourier transform. Moreover, a signal can be analysed from perspectives beyond amplitude and phase alone, yielding additional information that can help in learning an effective network. We introduce the \textbf{Hilbert Neural Operator (HNO)}, a new neural operator architecture that addresses these points by incorporating a strong inductive bias from signal processing. HNO operates by first mapping the input signal to its analytic representation via the Hilbert transform, thereby making instantaneous amplitude and phase information explicit features for the learning process. The core learnable operation -- a spectral convolution -- is then applied to this Hilbert-transformed representation. We hypothesize that this architecture enables HNO to model operators more effectively for causal, phase-sensitive, and non-stationary systems. We formalize the HNO architecture and provide the theoretical motivation for its design, rooted in analytic signal theory.
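The analytic-signal mapping at the heart of the architecture is standard signal processing and can be computed with an FFT (equivalent to `scipy.signal.hilbert`): zero out the negative frequencies and double the positive ones. This sketch shows how instantaneous amplitude and phase become explicit features:

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of a real 1-D array via the FFT: keep DC (and the
    Nyquist bin for even lengths), double positive frequencies, zero the
    negative ones."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

t = np.linspace(0, 1, 256, endpoint=False)
x = np.cos(2 * np.pi * 8 * t)
z = analytic_signal(x)
envelope = np.abs(z)            # instantaneous amplitude
phase = np.unwrap(np.angle(z))  # instantaneous phase
```

In HNO, the learnable spectral convolution would then operate on this complex-valued representation.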

Updated: 2025-08-06 21:12:15

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04882v1

Sequence Aware SAC Control for Engine Fuel Consumption Optimization in Electrified Powertrain

As hybrid electric vehicles (HEVs) gain traction in heavy-duty trucks, adaptive and efficient energy management is critical for reducing fuel consumption while maintaining battery charge for long operation times. We present a new reinforcement learning (RL) framework based on the Soft Actor-Critic (SAC) algorithm to optimize engine control in series HEVs. We reformulate the control task as a sequential decision-making problem and enhance SAC by incorporating Gated Recurrent Units (GRUs) and Decision Transformers (DTs) into both actor and critic networks to capture temporal dependencies and improve planning over time. To evaluate robustness and generalization, we train the models under diverse initial battery states, drive cycle durations, power demands, and input sequence lengths. Experiments show that the SAC agent with a DT-based actor and GRU-based critic was within 1.8% of Dynamic Programming (DP) in fuel savings on the Highway Fuel Economy Test (HFET) cycle, while the SAC agent with GRUs in both actor and critic networks and the feedforward network (FFN)-based actor-critic agent were within 3.16% and 3.43%, respectively. On unseen drive cycles (US06 and Heavy Heavy-Duty Diesel Truck (HHDDT) cruise segment), generalized sequence-aware agents consistently outperformed FFN-based agents, highlighting their adaptability and robustness in real-world settings.

Updated: 2025-08-06 20:53:11

Domains: eess.SY,cs.AI,cs.LG,cs.SY

Download: http://arxiv.org/abs/2508.04874v1

A Robust Pipeline for Differentially Private Federated Learning on Imbalanced Clinical Data using SMOTETomek and FedProx

Federated Learning (FL) presents a groundbreaking approach for collaborative health research, allowing model training on decentralized data while safeguarding patient privacy. FL offers formal security guarantees when combined with Differential Privacy (DP). The integration of these technologies, however, introduces a significant trade-off between privacy and clinical utility, a challenge further complicated by the severe class imbalance often present in medical datasets. The research presented herein addresses these interconnected issues through a systematic, multi-stage analysis. An FL framework was implemented for cardiovascular risk prediction, where initial experiments showed that standard methods struggled with imbalanced data, resulting in a recall of zero. To overcome such a limitation, we first integrated the hybrid Synthetic Minority Over-sampling Technique with Tomek Links (SMOTETomek) at the client level, successfully developing a clinically useful model. Subsequently, the framework was optimized for non-IID data using a tuned FedProx algorithm. Our final results reveal a clear, non-linear trade-off between the privacy budget (epsilon) and model recall, with the optimized FedProx consistently out-performing standard FedAvg. An optimal operational region was identified on the privacy-utility frontier, where strong privacy guarantees (with epsilon 9.0) can be achieved while maintaining high clinical utility (recall greater than 77%). Ultimately, our study provides a practical methodological blueprint for creating effective, secure, and accurate diagnostic tools that can be applied to real-world, heterogeneous healthcare data.
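The FedProx tuning mentioned above amounts to adding a proximal penalty to each client's local loss, pulling client models toward the global model on non-IID data. A hedged sketch (hyperparameter names and the toy gradient-descent loop are ours, not the paper's implementation):

```python
import numpy as np

def fedprox_local_update(w_global, grad_fn, mu=0.01, lr=0.1, steps=10):
    """Sketch of a FedProx client step: gradient descent on the local
    loss plus the proximal term (mu/2) * ||w - w_global||^2.
    `grad_fn` returns the gradient of the client's local loss."""
    w = np.array(w_global, dtype=float)
    for _ in range(steps):
        g = grad_fn(w) + mu * (w - w_global)  # local gradient + proximal pull
        w = w - lr * g
    return w
```

With mu = 0 this reduces to plain FedAvg-style local training; larger mu anchors clients more tightly to the global model.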

Updated: 2025-08-06 20:47:50

Domains: cs.CR,cs.AI,cs.LG,cs.SE

Download: http://arxiv.org/abs/2508.10017v1

PinRec: Outcome-Conditioned, Multi-Token Generative Retrieval for Industry-Scale Recommendation Systems

Generative retrieval methods utilize generative sequential modeling techniques, such as transformers, to generate candidate items for recommender systems. These methods have demonstrated promising results in academic benchmarks, surpassing traditional retrieval models like two-tower architectures. However, current generative retrieval methods lack the scalability required for industrial recommender systems, and they are insufficiently flexible to satisfy the multiple metric requirements of modern systems. This paper introduces PinRec, a novel generative retrieval model developed for applications at Pinterest. PinRec utilizes outcome-conditioned generation, enabling modelers to specify how to balance various outcome metrics, such as the number of saves and clicks, to effectively align with business goals and user exploration. Additionally, PinRec incorporates multi-token generation to enhance output diversity while optimizing generation. Our experiments demonstrate that PinRec can successfully balance performance, diversity, and efficiency, delivering a significant positive impact to users using generative models. This paper marks a significant milestone in generative retrieval, as it presents, to our knowledge, the first rigorous study on implementing generative retrieval at the scale of Pinterest.

Updated: 2025-08-06 20:36:12

Domains: cs.IR,cs.LG

Download: http://arxiv.org/abs/2504.10507v3

Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment

Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le} 16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce. We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.

Updated: 2025-08-06 20:30:55

Domains: cs.LG,cs.PL

Download: http://arxiv.org/abs/2508.04865v1

A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation

Multi-modality magnetic resonance imaging (MRI) data facilitate the early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. In addition to the corresponding clinical data, manually annotated and labeled segmentations by experienced radiologists offer high-quality data resources from untreated primary NPC.

Updated: 2025-08-06 20:20:59

Domains: eess.IV,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2404.03253v2

Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification

Purpose: This study proposes a framework for fine-tuning large language models (LLMs) with differential privacy (DP) to perform multi-abnormality classification on radiology report text. By injecting calibrated noise during fine-tuning, the framework seeks to mitigate the privacy risks associated with sensitive patient data and protect against data leakage while maintaining classification performance. Materials and Methods: We used 50,232 radiology reports from the publicly available MIMIC-CXR chest radiography and CT-RATE computed tomography datasets, collected between 2011 and 2019. Fine-tuning of LLMs was conducted to classify 14 labels from MIMIC-CXR dataset, and 18 labels from CT-RATE dataset using Differentially Private Low-Rank Adaptation (DP-LoRA) in high and moderate privacy regimes (across a range of privacy budgets = {0.01, 0.1, 1.0, 10.0}). Model performance was evaluated using weighted F1 score across three model architectures: BERT-medium, BERT-small, and ALBERT-base. Statistical analyses compared model performance across different privacy levels to quantify the privacy-utility trade-off. Results: We observe a clear privacy-utility trade-off through our experiments on 2 different datasets and 3 different models. Under moderate privacy guarantees the DP fine-tuned models achieved comparable weighted F1 scores of 0.88 on MIMIC-CXR and 0.59 on CT-RATE, compared to non-private LoRA baselines of 0.90 and 0.78, respectively. Conclusion: Differentially private fine-tuning using LoRA enables effective and privacy-preserving multi-abnormality classification from radiology reports, addressing a key challenge in fine-tuning LLMs on sensitive medical data.
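The differential-privacy mechanism underlying DP-LoRA-style fine-tuning is per-example gradient clipping plus calibrated Gaussian noise (the DP-SGD recipe). A minimal sketch, not the authors' implementation; names and defaults are illustrative:

```python
import numpy as np

def dp_clip_and_noise(per_example_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """Clip each per-example gradient to clip_norm, sum, add Gaussian
    noise with standard deviation noise_mult * clip_norm, and average.
    Smaller privacy budgets correspond to larger noise_mult."""
    rng = np.random.default_rng() if rng is None else rng
    clipped_sum = None
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        g = g * min(1.0, clip_norm / max(norm, 1e-12))  # clip to clip_norm
        clipped_sum = g if clipped_sum is None else clipped_sum + g
    noise = rng.normal(scale=noise_mult * clip_norm, size=clipped_sum.shape)
    return (clipped_sum + noise) / len(per_example_grads)
```

In DP-LoRA only the low-rank adapter gradients pass through this step, which keeps the noisy dimension count small.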

Updated: 2025-08-06 20:17:58

Domains: cs.CR,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2506.04450v2

Can SGD Handle Heavy-Tailed Noise?

Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise -- common in modern machine learning and reinforcement learning -- remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded $p$-th moments for some $p \in (1, 2]$, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: $\mathcal{O}(\varepsilon^{-\frac{p}{p-1}})$ and $\mathcal{O}(\varepsilon^{-\frac{p}{2(p-1)}})$, respectively. For non-convex objectives under H\"older smoothness, we prove convergence to a stationary point with rate $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$, and complement this with a matching lower bound specific to SGD with arbitrary polynomial step-size schedules. Finally, we consider non-convex Mini-batch SGD under standard smoothness and bounded central moment assumptions, and show that it also achieves a comparable $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$ sample complexity with a potential improvement in the smoothness constant. These results challenge the prevailing view that heavy-tailed noise renders SGD ineffective, and establish vanilla SGD as a robust and theoretically principled baseline -- even in regimes where the variance is unbounded.

Updated: 2025-08-06 20:09:41

Domains: math.OC,cs.LG

Download: http://arxiv.org/abs/2508.04860v1

Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices

Open-vocabulary keyword spotting (KWS) refers to the task of detecting words or terms within speech recordings, regardless of whether they were included in the training data. This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network. The speech encoder is either a tiny Whisper or a tiny Conformer. The target keyword encoder is implemented as a hyper-network that takes the desired keyword as a character string and generates a unique set of weights for a convolutional layer, which can be considered as a keyword-specific matched filter. The detection network uses the matched-filter weights to perform a keyword-specific convolution, which guides the cross-attention mechanism of a Perceiver module in determining whether the target term appears in the recording. The results indicate that our system achieves state-of-the-art detection performance and generalizes effectively to out-of-domain conditions, including second-language (L2) speech. Notably, our smallest model, with just 4.2 million parameters, matches or outperforms models that are several times larger, demonstrating both efficiency and robustness.
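A toy sketch of the hyper-network idea: embed the keyword's characters, pool, and map the result to the weights of a small convolutional filter, i.e. a keyword-specific matched filter that is then correlated against the speech features. All dimensions, names, and the random projections are illustrative, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

CHARS = "abcdefghijklmnopqrstuvwxyz "
EMB = rng.normal(scale=0.1, size=(len(CHARS), 16))   # character embeddings
W_HYPER = rng.normal(scale=0.1, size=(16, 5 * 8))    # hyper-network map

def keyword_to_filter(keyword, kernel=5, channels=8):
    """Hyper-network: pooled character embedding -> conv-filter weights."""
    idx = [CHARS.index(c) for c in keyword.lower() if c in CHARS]
    pooled = EMB[idx].mean(axis=0)
    return (pooled @ W_HYPER).reshape(kernel, channels)

def detect(features, keyword):
    """Slide the keyword-specific filter over features (T x channels)
    and return the peak correlation score."""
    f = keyword_to_filter(keyword)
    T, k = features.shape[0], f.shape[0]
    scores = [np.sum(features[t:t + k] * f) for t in range(T - k + 1)]
    return max(scores)
```

In the actual model the detection score further gates a Perceiver's cross-attention rather than being thresholded directly.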

Updated: 2025-08-06 20:04:08

Domains: eess.AS,cs.LG,cs.SD

Download: http://arxiv.org/abs/2508.04857v1

Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos

Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework-also known as GPTQ-has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ's iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.

Updated: 2025-08-06 20:00:40

Domains: cs.LG,cs.AI,cs.IT,cs.NA,math.IT,math.NA,68T07, 68W25, 62M45, 68Q25

Download: http://arxiv.org/abs/2508.04853v1

ranDecepter: Real-time Identification and Deterrence of Ransomware Attacks

Ransomware (RW) presents a significant and widespread threat in the digital landscape, necessitating effective countermeasures. Active cyber deception is a promising strategy to thwart RW and limit its propagation by misleading it with false information and revealing its true behaviors. Furthermore, RW often acts as a communication conduit between attackers and defenders, allowing deception to return false data to attackers and deplete their resources. This paper introduces ranDecepter, a novel approach that combines active cyber deception with real-time analysis to enhance defenses against RW attacks. The ranDecepter identifies RW in real-time and isolates it within a deceptive environment, autonomously identifying critical elements in the RW code to create a loop mechanism. By repeatedly restarting the malware and transmitting counterfeit encryption information and secret keys to the attacker, it forces the attacker to store these fabricated details for each victim, thereby depleting their resources. Our comprehensive evaluation of ranDecepter, conducted using 1,134 real-world malware samples and twelve benign applications, demonstrates a remarkable 100% accuracy in RW identification, with no false positives and minimal impact on response times. Furthermore, within 24 hours, ranDecepter generates up to 9,223K entries in the attacker's database using 50 agents, showcasing its potential to undermine attacker resources.

Updated: 2025-08-06 19:59:37

Domains: cs.CR

Download: http://arxiv.org/abs/2508.00293v3

GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, an acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, dynamic and transparent task-rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.
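Time-decayed aggregation can be sketched as exponentially down-weighting older judgments so that newer consensus dominates; the half-life and exact weighting form below are our assumption, not the protocol's fixed choice:

```python
import math
import time

def time_decayed_score(ratings, half_life_days=30.0, now=None):
    """Aggregate (timestamp, score) pairs with weight
    exp(-age * ln2 / half_life), i.e. a judgment loses half its
    influence every half_life_days."""
    now = time.time() if now is None else now
    decay = math.log(2) / (half_life_days * 86400)
    num = den = 0.0
    for ts, score in ratings:
        w = math.exp(-decay * (now - ts))
        num += w * score
        den += w
    return num / den if den else float("nan")
```

For example, a 30-day-old rating carries exactly half the weight of one made just now under the default half-life.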

Updated: 2025-08-06 19:57:38

Domains: cs.LG,cs.AI,cs.HC,I.2.6; I.2.7

Download: http://arxiv.org/abs/2508.02926v2

Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning

Reinforcement learning (RL) has become a key technique for enhancing the reasoning abilities of large language models (LLMs), with policy-gradient algorithms dominating the post-training stage because of their efficiency and effectiveness. However, most existing benchmarks evaluate large-language-model reasoning under idealized settings, overlooking performance in realistic, non-ideal scenarios. We identify three representative non-ideal scenarios with practical relevance: summary inference, fine-grained noise suppression, and contextual filtering. We introduce a new research direction guided by brain-science findings that human reasoning remains reliable under imperfect inputs. We formally define and evaluate these challenging scenarios. We fine-tune three LLMs and a state-of-the-art large vision-language model (LVLM) using RL with a representative policy-gradient algorithm and then test their performance on eight public datasets. Our results reveal that while RL fine-tuning improves baseline reasoning under idealized settings, performance declines significantly across all three non-ideal scenarios, exposing critical limitations in advanced reasoning capabilities. Although we propose a scenario-specific remediation method, our results suggest current methods leave these reasoning deficits largely unresolved. This work highlights that the reasoning abilities of large models are often overstated and underscores the importance of evaluating models under non-ideal scenarios. The code and data will be released at XXXX.

Updated: 2025-08-06 19:51:29

Domains: cs.AI

Download: http://arxiv.org/abs/2508.04848v1

Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset

This paper addresses the vulnerability of deep-learning models designed for rain, snow, and haze removal. Despite enhancing image quality in adverse weather, these models are susceptible to adversarial attacks that compromise their effectiveness. Traditional defenses such as adversarial training and model distillation often require extensive retraining, making them costly and impractical for real-world deployment. While denoising and super-resolution techniques can aid image classification models, they impose high computational demands and introduce visual artifacts that hinder image processing tasks. We propose a model-agnostic defense against first-order white-box adversarial attacks using the Quaternion-Hadamard Network (QHNet) to tackle these challenges. White-box attacks are particularly difficult to defend against since attackers have full access to the model's architecture, weights, and training procedures. Our defense introduces the Quaternion Hadamard Denoising Convolutional Block (QHDCB) and the Quaternion Denoising Residual Block (QDRB), leveraging polynomial thresholding. QHNet incorporates these blocks within an encoder-decoder architecture, enhanced by feature refinement, to effectively neutralize adversarial noise. Additionally, we introduce the Adversarial Weather Conditions Vision Dataset (AWCVD), created by applying first-order gradient attacks on state-of-the-art weather removal techniques in scenarios involving haze, rain streaks, and snow. Using PSNR and SSIM metrics, we demonstrate that QHNet significantly enhances the robustness of low-level computer vision models against adversarial attacks compared with state-of-the-art denoising and super-resolution techniques. The source code and dataset will be released alongside the final version of this paper.

Updated: 2025-08-06 19:51:21

Subjects: cs.LG,eess.IV

Download: http://arxiv.org/abs/2502.10452v2

Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS)

Autonomous web-based geographical information systems (AWebGIS) aim to perform geospatial operations from natural language input, providing intuitive, intelligent, and hands-free interaction. However, most current solutions rely on cloud-based large language models (LLMs), which require continuous internet access and raise privacy and scalability concerns due to centralized server processing. This study compares three approaches to enabling AWebGIS: (1) a fully-automated online method using cloud-based LLMs (e.g., Cohere); (2) a semi-automated offline method using classical machine learning classifiers such as support vector machines and random forests; and (3) a fully autonomous offline (client-side) method based on a fine-tuned small language model (SLM), specifically the T5-small model, executed in the client's web browser. The third approach, which leverages SLMs, achieved the highest accuracy among all methods, with an exact-match accuracy of 0.93, a Levenshtein similarity of 0.99, and Recall-Oriented Understudy for Gisting Evaluation (ROUGE-1 and ROUGE-L) scores of 0.98. Crucially, this client-side computation strategy reduces the load on backend servers by offloading processing to the user's device, eliminating the need for server-based inference. These results highlight the feasibility of browser-executable models for AWebGIS solutions.
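
The reported metrics (exact matching accuracy and Levenshtein similarity) are straightforward to compute. A minimal sketch follows, assuming the similarity is normalized as one minus edit distance over the longer string's length; the paper's exact normalization is not stated.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via one-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def levenshtein_similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings
    (normalization by the longer string is an assumption here)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def exact_match_accuracy(preds, refs):
    """Fraction of predictions that match the reference exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)
```

Scoring a batch of predicted geospatial commands against references then reduces to mapping these functions over the pairs and averaging.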

Updated: 2025-08-06 19:50:29

Subjects: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.04846v1

Multi-Stage Knowledge-Distilled VGAE and GAT for Robust Controller-Area-Network Intrusion Detection

The Controller Area Network (CAN) protocol is a standard for in-vehicle communication but remains susceptible to cyber-attacks due to its lack of built-in security. This paper presents a multi-stage intrusion detection framework leveraging unsupervised anomaly detection and supervised graph learning tailored for automotive CAN traffic. Our architecture combines a Variational Graph Autoencoder (VGAE) for structural anomaly detection with a Knowledge-Distilled Graph Attention Network (KD-GAT) for robust attack classification. CAN bus activity is encoded as graph sequences to model temporal and relational dependencies. The pipeline applies VGAE-based selective undersampling to address class imbalance, followed by GAT classification with optional score-level fusion. The compact student GAT achieves 96% parameter reduction compared to the teacher model while maintaining strong predictive performance. Experiments on six public CAN intrusion datasets--Car-Hacking, Car-Survival, and can-train-and-test--demonstrate competitive accuracy and efficiency, with average improvements of 16.2% in F1-score over existing methods, particularly excelling on highly imbalanced datasets with up to 55% F1-score improvements.
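
The VGAE-based selective undersampling step is described only at a high level. One plausible reading, sketched below, keeps the majority-class samples the anomaly detector found hardest (highest reconstruction error) so the balanced subset stays informative; the selection criterion and the `ratio` parameter are assumptions, not details from the paper.

```python
import numpy as np

def selective_undersample(scores, labels, majority=0, ratio=1.0):
    """Undersample the majority class using anomaly scores.

    scores: per-sample anomaly score (e.g., VGAE reconstruction error).
    labels: class labels; `majority` marks the over-represented class.
    ratio:  kept-majority size relative to the minority size.
    Returns sorted indices of the retained samples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    maj_idx = np.where(labels == majority)[0]
    min_idx = np.where(labels != majority)[0]
    k = min(len(maj_idx), int(ratio * len(min_idx)))
    # Keep the k majority samples with the highest anomaly scores.
    keep_maj = maj_idx[np.argsort(scores[maj_idx])[::-1][:k]]
    return np.sort(np.concatenate([keep_maj, min_idx]))
```

The retained indices would then feed the downstream GAT classifier in place of the full, imbalanced training set.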

Updated: 2025-08-06 19:50:26

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04845v1

Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance

Model-driven engineering problems often require complex model transformations (MTs), i.e., MTs that are chained in extensive sequences. Pertinent examples of such problems include model synchronization, automated model repair, and design space exploration. Manually developing complex MTs is an error-prone and often infeasible process. Reinforcement learning (RL) is an apt way to alleviate these issues. In RL, an autonomous agent explores the state space through trial and error to identify beneficial sequences of actions, such as MTs. However, RL methods exhibit performance issues in complex problems. In these situations, human guidance can be of high utility. In this paper, we present an approach and technical framework for developing complex MT sequences through RL, guided by potentially uncertain human advice. Our framework allows user-defined MTs to be mapped onto RL primitives, and executes them as RL programs to find optimal MT sequences. Our evaluation shows that human guidance, even if uncertain, substantially improves RL performance, and results in more efficient development of complex MTs. Through a trade-off between the certainty and timeliness of human advice, our method takes a step towards RL-driven human-in-the-loop engineering methods.

Updated: 2025-08-06 19:48:34

Subjects: cs.SE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.20883v2

Towards Scalable Newborn Screening: Automated General Movement Assessment in Uncontrolled Settings

General movements (GMs) are spontaneous, coordinated body movements in infants that offer valuable insights into the developing nervous system. Assessed through the Prechtl GM Assessment (GMA), GMs are reliable predictors for neurodevelopmental disorders. However, GMA requires specifically trained clinicians, who are limited in number. To scale up newborn screening, there is a need for an algorithm that can automatically classify GMs from infant video recordings. This data poses challenges, including variability in recording length, device type, and setting, with each video coarsely annotated for overall movement quality. In this work, we introduce a tool for extracting features from these recordings and explore various machine learning techniques for automated GM classification.

Updated: 2025-08-06 19:46:34

Subjects: cs.LG,cs.CV

Download: http://arxiv.org/abs/2411.09821v4

Unified Flow Matching for Long Horizon Event Forecasting

Modeling long horizon marked event sequences is a fundamental challenge in many real-world applications, including healthcare, finance, and user behavior modeling. Existing neural temporal point process models are typically autoregressive, predicting the next event one step at a time, which limits their efficiency and leads to error accumulation in long-range forecasting. In this work, we propose a unified flow matching framework for marked temporal point processes that enables non-autoregressive, joint modeling of inter-event times and event types, via continuous and discrete flow matching. By learning continuous-time flows for both components, our method generates coherent long horizon event trajectories without sequential decoding. We evaluate our model on six real-world benchmarks and demonstrate significant improvements over autoregressive and diffusion-based baselines in both accuracy and generation efficiency.
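
The core flow-matching recipe, regressing a model's velocity field against the target velocity of a path between noise and data, can be sketched for the continuous (inter-event-time) component. A straight-line interpolation path is assumed below for illustration; the paper's exact parameterization and the discrete-flow counterpart for event types may differ.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear-interpolation probability path: returns the point on the
    path at time t in [0, 1] and the target velocity the model should
    regress to (constant along a straight path)."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def flow_matching_loss(model_v, x0, x1, t):
    """Squared error between the model's predicted velocity at (x_t, t)
    and the target velocity; x0 is noise, x1 is a data sample."""
    x_t, v = flow_matching_pair(x0, x1, t)
    return float(np.mean((model_v(x_t, t) - v) ** 2))
```

At sampling time, integrating the learned velocity field from t=0 to t=1 produces an entire trajectory jointly, which is what removes the sequential decoding of autoregressive models.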

Updated: 2025-08-06 19:42:49

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.04843v1

Data Driven Insights into Composition Property Relationships in FCC High Entropy Alloys

Structural High Entropy Alloys (HEAs) are crucial in advancing technology across various sectors, including aerospace, automotive, and defense industries. However, the scarcity of integrated chemistry, process, structure, and property data presents significant challenges for predictive property modeling. Given the vast design space of these alloys, uncovering the underlying patterns is essential yet difficult, requiring advanced methods capable of learning from limited and heterogeneous datasets. This work presents several sensitivity analyses, highlighting key elemental contributions to mechanical behavior, including insights into the compositional factors associated with brittle and fractured responses observed during nanoindentation testing in the BIRDSHOT center NiCoFeCrVMnCuAl system dataset. Several encoder-decoder-based chemistry-property models, carefully tuned through Bayesian multi-objective hyperparameter optimization, are evaluated for mapping alloy composition to six mechanical properties. The models achieve competitive or superior performance to conventional regressors across all properties, particularly for yield strength and the UTS/YS ratio, demonstrating their effectiveness in capturing complex composition-property relationships.

Updated: 2025-08-06 19:41:15

Subjects: cond-mat.mtrl-sci,cs.LG

Download: http://arxiv.org/abs/2508.04841v1

$\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection

In this work, we compile $\textbf{$\texttt{DroidCollection}$}$, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and over three real-world coding domains. Alongside fully AI-generated samples, our collection includes human-AI co-authored code, as well as adversarial samples explicitly crafted to evade detection. Subsequently, we develop $\textbf{$\texttt{DroidDetect}$}$, a suite of encoder-only detectors trained using a multi-task objective over $\texttt{DroidCollection}$. Our experiments show that existing detectors' performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. Additionally, we demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small amount of adversarial data. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as means to enhance detector training on possibly noisy distributions.

Updated: 2025-08-06 19:26:28

标题: "Droid:用于人工智能生成代码检测的资源套件"

摘要: 在这项工作中,我们编制了$\textbf{$\texttt{DroidCollection}$}$,这是用于训练和评估机器生成代码检测器的最全面的开放数据套件,包括超过一百万个代码样本、七种编程语言、43个编码模型的输出以及超过三个真实世界的编码领域。除了完全由人工智能生成的样本,我们的收集还包括人工智能共同编写的代码,以及明确设计用于规避检测的对抗样本。随后,我们开发了$\textbf{$\texttt{DroidDetect}$}$,这是一个仅使用编码器训练的检测器套件,通过在$\texttt{DroidCollection}$上使用多任务目标进行训练。我们的实验表明,现有的检测器的性能无法推广到其狭窄训练数据之外的多样化编码领域和编程语言。此外,我们证明了大多数检测器很容易受到通过表面提示和对齐方法人性化输出分布的影响,但通过在少量对抗数据上进行训练可以轻松解决这个问题。最后,我们展示了度量学习和基于不确定性的重新采样作为增强检测器训练的手段,以应对可能存在噪声的分布。

更新时间: 2025-08-06 19:26:28

Subjects: cs.SE,cs.AI,cs.CY

Download: http://arxiv.org/abs/2507.10583v3

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts-abilities essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.

Updated: 2025-08-06 19:21:50

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.02095v2

Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
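
The headline measurement, response variability across prompt variants, amounts to collecting repeated scale answers per trait item under different orderings/paraphrases and computing per-item standard deviations. A minimal sketch, with illustrative item names (not the framework's actual schema):

```python
import statistics

def response_variability(responses):
    """responses: mapping item_id -> list of Likert scores collected
    under different prompt variants (question order, paraphrase, persona).
    Returns per-item population standard deviation and the mean SD
    across items, the instability summary reported per model."""
    per_item = {k: statistics.pstdev(v) for k, v in responses.items()}
    mean_sd = sum(per_item.values()) / len(per_item)
    return per_item, mean_sd
```

A model with consistent "personality" would show per-item SDs near zero; the paper's finding is that even very large models exceed SD 0.4 on such aggregates.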

Updated: 2025-08-06 19:11:33

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04826v1

Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
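
Attention temperature scaling at inference is described only by name. One common formulation rescales attention logits by a length-dependent temperature so the attention entropy stays comparable when resolution (and thus token count) changes at test time; the `log(n)/log(n_train)` heuristic below is an assumption for illustration, not the paper's exact rule.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, train_len=None):
    """Scaled dot-product attention. If the test-time key count differs
    from the training length, rescale logits by log(n)/log(n_train),
    a common temperature heuristic to keep attention sharpness stable
    across sequence lengths."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    n = k.shape[0]
    if train_len is not None and n != train_len:
        logits = logits * (np.log(n) / np.log(train_len))
    return softmax(logits) @ v
```

The appeal of this kind of fix is that it is purely an inference-time knob: no retraining is needed when the deployed resolution or mask layout shifts.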

Updated: 2025-08-06 19:10:58

Subjects: cs.GR,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04825v1

Automated File-Level Logging Generation for Machine Learning Applications using LLMs: A Case Study using GPT-4o Mini

Logging is essential in software development, helping developers monitor system behavior and aiding in debugging applications. Given the ability of large language models (LLMs) to generate natural language and code, researchers are exploring their potential to generate log statements. However, prior work focuses on evaluating logs introduced in code functions, leaving file-level log generation underexplored -- especially in machine learning (ML) applications, where comprehensive logging can enhance reliability. In this study, we evaluate the capacity of GPT-4o mini as a case study to generate log statements for ML projects at the file level. We gathered a set of 171 ML repositories containing 4,073 Python files with at least one log statement. We identified and removed the original logs from the files, prompted the LLM to generate logs for them, and evaluated the position of the generated logs as well as their log level, variables, and text quality compared to human-written logs. In addition, we manually analyzed a representative sample of generated logs to identify common patterns and challenges. We find that the LLM introduces logs in the same place as humans in 63.91% of cases, but at the cost of a high overlogging rate of 82.66%. Furthermore, our manual analysis reveals challenges for file-level logging, including overlogging at the beginning or end of a function, difficulty logging within large code blocks, and misalignment with project-specific logging conventions. While the LLM shows promise for generating logs for complete files, these limitations remain to be addressed for practical implementation.

Updated: 2025-08-06 18:57:51

Subjects: cs.SE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04820v1

CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework

Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student's space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD's ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art. In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.
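
The joint consensus gating, weighting each teacher token by its cosine affinity to the student combined with inter-teacher agreement, can be sketched as below. The exponential-softmax normalization and the multiplicative affinity-times-agreement combination are assumptions about unspecified details, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Per-token cosine similarity between two (T, D) embedding sets."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def consensus_gate(student, teachers):
    """student: (T, D) tokens; teachers: list of (T, D) teacher embeddings
    already adapted into the student's space. Each teacher token is scored
    by its cosine affinity to the student times its mean agreement with
    the other teachers; scores are softmax-normalized per token, then the
    teachers are fused as a weighted sum."""
    K = len(teachers)
    aff = np.stack([cosine(t, student) for t in teachers])        # (K, T)
    agree = np.zeros_like(aff)
    for i in range(K):
        others = [cosine(teachers[i], teachers[j]) for j in range(K) if j != i]
        agree[i] = np.mean(others, axis=0)
    w = np.exp(aff * agree)
    w = w / w.sum(axis=0, keepdims=True)                          # (K, T)
    return sum(w[i][:, None] * teachers[i] for i in range(K))     # (T, D)
```

Intuitively, a teacher that both resembles the student's current representation and agrees with its peers dominates the fused target, so outlier teacher predictions are downweighted token by token.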

Updated: 2025-08-06 18:55:14

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04816v1

CrisisSense-LLM: Instruction Fine-Tuned Large Language Model for Multi-label Social Media Text Classification in Disaster Informatics

In the field of crisis/disaster informatics, social media is increasingly being used for improving situational awareness to inform response and relief efforts. Efficient and accurate text classification tools have been a focal area of investigation in crisis informatics. However, current methods mostly rely on single-label text classification models, which fails to capture different insights embedded in dynamic and multifaceted disaster-related social media data. This study introduces a novel approach to disaster text classification by enhancing a pre-trained Large Language Model (LLM) through instruction fine-tuning targeted for multi-label classification of disaster-related tweets. Our methodology involves creating a comprehensive instruction dataset from disaster-related tweets, which is then used to fine-tune an open-source LLM, thereby embedding it with disaster-specific knowledge. This fine-tuned model can classify multiple aspects of disaster-related information simultaneously, such as the type of event, informativeness, and involvement of human aid, significantly improving the utility of social media data for situational awareness in disasters. The results demonstrate that this approach enhances the categorization of critical information from social media posts, thereby facilitating a more effective deployment for situational awareness during emergencies. This research paves the way for more advanced, adaptable, and robust disaster management tools, leveraging the capabilities of LLMs to improve real-time situational awareness and response strategies in disaster scenarios.

Updated: 2025-08-06 18:53:49

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.15477v3

HCRide: Harmonizing Passenger Fairness and Driver Preference for Human-Centered Ride-Hailing

Order dispatch systems play a vital role in ride-hailing services, which directly influence operator revenue, driver profit, and passenger experience. Most existing work focuses on improving system efficiency in terms of operator revenue, which may cause a bad experience for both passengers and drivers. Hence, in this work, we aim to design a human-centered ride-hailing system by considering both passenger fairness and driver preference without compromising the overall system efficiency. However, it is nontrivial to achieve this target due to the potential conflicts between passenger fairness and driver preference since optimizing one may sacrifice the other. To address this challenge, we design HCRide, a Human-Centered Ride-hailing system based on a novel multi-agent reinforcement learning algorithm called Harmonization-oriented Actor-Bi-Critic (Habic), which includes three major components (i.e., a multi-agent competition mechanism, a dynamic Actor network, and a Bi-Critic network) to optimize system efficiency and passenger fairness with driver preference consideration. We extensively evaluate our HCRide using two real-world ride-hailing datasets from Shenzhen and New York City. Experimental results show our HCRide effectively improves system efficiency by 2.02%, fairness by 5.39%, and driver preference by 10.21% compared to state-of-the-art baselines.

Updated: 2025-08-06 18:47:38

Subjects: cs.LG,cs.SI

Download: http://arxiv.org/abs/2508.04811v1

Log2Sig: Frequency-Aware Insider Threat Detection via Multivariate Behavioral Signal Decomposition

Insider threat detection presents a significant challenge due to the deceptive nature of malicious behaviors, which often resemble legitimate user operations. However, existing approaches typically model system logs as flat event sequences, thereby failing to capture the inherent frequency dynamics and multiscale disturbance patterns embedded in user behavior. To address these limitations, we propose Log2Sig, a robust anomaly detection framework that transforms user logs into multivariate behavioral frequency signals, introducing a novel representation of user behavior. Log2Sig employs Multivariate Variational Mode Decomposition (MVMD) to extract Intrinsic Mode Functions (IMFs), which reveal behavioral fluctuations across multiple temporal scales. Based on this, the model further performs joint modeling of behavioral sequences and frequency-decomposed signals: the daily behavior sequences are encoded using a Mamba-based temporal encoder to capture long-term dependencies, while the corresponding frequency components are linearly projected to match the encoder's output dimension. These dual-view representations are then fused to construct a comprehensive user behavior profile, which is fed into a multilayer perceptron for precise anomaly detection. Experimental results on the CERT r4.2 and r5.2 datasets demonstrate that Log2Sig significantly outperforms state-of-the-art baselines in both accuracy and F1 score.

Updated: 2025-08-06 18:47:26

Subjects: cs.CR,cs.AI

Download: http://arxiv.org/abs/2508.05696v1

MambaITD: An Efficient Cross-Modal Mamba Network for Insider Threat Detection

Enterprises face increasing risks from insider threats, yet existing detection methods cannot address these challenges effectively due to insufficient modeling of temporal dynamics, computational-efficiency and real-time bottlenecks, and the cross-modal information-silo problem. This paper proposes MambaITD, a new insider threat detection framework based on the Mamba state space model and cross-modal adaptive fusion. First, a multi-source log preprocessing module aligns heterogeneous data through behavioral sequence encoding, interval smoothing, and statistical feature extraction. Second, a Mamba encoder models long-range dependencies in behavioral and interval sequences and dynamically fuses sequence and statistical information through a gated feature fusion mechanism. Finally, we propose an adaptive threshold optimization method based on maximizing inter-class variance, which dynamically adjusts the decision threshold by analyzing the probability distribution, effectively identifying anomalies and alleviating class imbalance and concept drift. Compared with traditional methods, MambaITD shows significant advantages in modeling efficiency and feature fusion capability, outperforming Transformer-based methods and providing a more effective solution for insider threat detection.
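
Maximizing inter-class variance to pick a decision threshold is the classic Otsu criterion; applied to scalar anomaly scores it can be sketched as follows (the histogram bin count is an implementation choice, not a detail from the paper).

```python
import numpy as np

def otsu_threshold(scores, n_bins=256):
    """Pick the score cut that maximizes between-class variance
    (Otsu's criterion), splitting samples into two classes, e.g.
    normal vs anomalous, without labeled data."""
    scores = np.asarray(scores, dtype=float)
    hist, edges = np.histogram(scores, bins=n_bins)
    p = hist / hist.sum()                       # bin probabilities
    centers = (edges[:-1] + edges[1:]) / 2
    mu_total = (p * centers).sum()
    best_t, best_var = centers[0], -1.0
    w0 = 0.0
    mu0_sum = 0.0
    for i in range(n_bins - 1):
        w0 += p[i]                              # class-0 weight
        mu0_sum += p[i] * centers[i]
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = mu0_sum / w0
        mu1 = (mu_total - mu0_sum) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[i]
    return best_t
```

Recomputing the threshold on a sliding window of recent scores is what lets such a rule track concept drift instead of relying on a fixed cut.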

Updated: 2025-08-06 18:45:00

Subjects: cs.CR,cs.LG

Download: http://arxiv.org/abs/2508.05695v1

DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection

Insider threat detection (ITD) poses a persistent and high-impact challenge in cybersecurity due to the subtle, long-term, and context-dependent nature of malicious insider behaviors. Traditional models often struggle to capture semantic intent and complex behavior dynamics, while existing LLM-based solutions face limitations in prompt adaptability and modality coverage. To bridge this gap, we propose DMFI, a dual-modality framework that integrates semantic inference with behavior-aware fine-tuning. DMFI converts raw logs into two structured views: (1) a semantic view that processes content-rich artifacts (e.g., emails, https) using instruction-formatted prompts; and (2) a behavioral abstraction, constructed via a 4W-guided (When-Where-What-Which) transformation to encode contextual action sequences. Two LoRA-enhanced LLMs are fine-tuned independently, and their outputs are fused via a lightweight MLP-based decision module. We further introduce DMFI-B, a discriminative adaptation strategy that separates normal and abnormal behavior representations, improving robustness under severe class imbalance. Experiments on CERT r4.2 and r5.2 datasets demonstrate that DMFI outperforms state-of-the-art methods in detection accuracy. Our approach combines the semantic reasoning power of LLMs with structured behavior modeling, offering a scalable, effective, and deployable solution for real-world insider threat detection.
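
The 4W-guided (When-Where-What-Which) behavioral abstraction can be sketched as a simple record-to-token mapping; the field names and time buckets below are illustrative assumptions, not the paper's schema.

```python
def to_4w(event):
    """Map a raw log record to a compact 4W token:
    When (time-of-day bucket), Where (host), What (action),
    Which (target resource). Field names are hypothetical."""
    hour = int(event["timestamp"].split("T")[1][:2])
    when = ("night" if hour < 6 else "morning" if hour < 12
            else "afternoon" if hour < 18 else "evening")
    return f'[{when}|{event["host"]}|{event["action"]}|{event["target"]}]'

def encode_sequence(events):
    """A user's session or day becomes a space-separated 4W token
    sequence, ready to be fed to a sequence model as the behavioral view."""
    return " ".join(to_4w(e) for e in events)
```

Encoding logs this way discretizes context into a small vocabulary, so the behavioral-branch model sees "who did what, where, and when" rather than raw timestamps and paths.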

Updated: 2025-08-06 18:44:40

Domains: cs.CR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.05694v1

Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment

While scaling laws promise significant performance gains for recommender systems, efficiently deploying hyperscale models remains a major unsolved challenge. In contrast to fields where foundation models (FMs) are already widely adopted, such as natural language processing and computer vision, progress in recommender systems is hindered by unique challenges including the need to learn from online streaming data under shifting data distributions, the need to adapt to different recommendation surfaces with a wide diversity in their downstream tasks and their input distributions, and stringent latency and computational constraints. To bridge this gap, we propose to leverage the Foundation-Expert Paradigm: a framework designed for the development and deployment of hyperscale recommendation FMs. In our approach, a central FM is trained on lifelong, cross-surface, multi-modal user data to learn generalizable knowledge. This knowledge is then efficiently transferred to various lightweight, surface-specific "expert" models via target-aware embeddings, allowing them to adapt to local data distributions and optimization goals with minimal overhead. To meet our training, inference and development needs, we built HyperCast, a production-grade infrastructure system that re-engineers training, serving, logging and iteration to power this decoupled paradigm. Our approach is now deployed at Meta serving tens of billions of user requests daily, demonstrating online metric improvements over our previous one-stage production system while improving developer velocity and maintaining infrastructure efficiency. To the best of our knowledge, this work represents the first successful deployment of a Foundation-Expert paradigm at this scale, offering a proven, compute-efficient, and developer-friendly blueprint to realize the promise of scaling laws in recommender systems.

Updated: 2025-08-06 18:44:24

Domains: cs.IR,cs.AI,cs.LG,68T05, 68T07, 68T30,H.3.3; I.2.6

Download: http://arxiv.org/abs/2508.02929v2

DSBC : Data Science task Benchmarking with Context engineering

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents, built by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini, each under three approaches: zero-shot with context engineering, multi-step with context engineering, and a SmolAgent-based setup. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research on more robust and effective data science agents.

Updated: 2025-08-06 18:41:57

Domains: cs.AI,cs.CL,cs.MA

Download: http://arxiv.org/abs/2507.23336v2

A solvable generative model with a linear, one-step denoiser

We develop an analytically tractable single-step diffusion model based on a linear denoiser and present an explicit formula for the Kullback-Leibler divergence between the generated and sampling distribution, taken to be isotropic Gaussian, showing the effect of finite diffusion time and noise scale. Our study further reveals that the monotonic fall phase of Kullback-Leibler divergence begins when the training dataset size reaches the dimension of the data points. Finally, for large-scale practical diffusion models, we explain why a higher number of diffusion steps enhances production quality based on the theoretical arguments presented before.
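The paper's explicit KL formula is specific to its linear denoiser; the standard closed-form KL divergence between two Gaussians that such a formula builds on can be sketched directly (this is the generic identity, not the paper's expression):

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    tr = np.trace(inv1 @ cov0)                 # trace term
    quad = diff @ inv1 @ diff                  # mean-shift term
    logdet = np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    return 0.5 * (tr + quad - d + logdet)

d = 3
# Generated distribution vs. an isotropic Gaussian sampling distribution.
kl_same = kl_gaussians(np.zeros(d), np.eye(d), np.zeros(d), np.eye(d))
kl_wide = kl_gaussians(np.zeros(d), 2.0 * np.eye(d), np.zeros(d), np.eye(d))
```

Identical distributions give zero divergence, and inflating the generated covariance by a factor of 2 gives 0.5 * (2d - d - d*ln 2), which for d = 3 is about 0.460.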

Updated: 2025-08-06 18:39:59

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2411.17807v3

Unsupervised Graph Deep Learning Reveals Emergent Flood Risk Profile of Urban Areas

Urban flood risk emerges from complex and nonlinear interactions among multiple features related to flood hazard, flood exposure, and social and physical vulnerabilities, along with the complex spatial flood dependence relationships. Existing approaches for characterizing urban flood risk, however, are primarily based on flood plain maps, focusing on a limited number of features, primarily hazard and exposure features, without consideration of feature interactions or the dependence relationships among spatial areas. To address this gap, this study presents an integrated urban flood-risk rating model based on a novel unsupervised graph deep learning model (called FloodRisk-Net). FloodRisk-Net is capable of capturing spatial dependence among areas and complex and nonlinear interactions among flood hazards and urban features for specifying emergent flood risk. Using data from multiple metropolitan statistical areas (MSAs) in the United States, the model characterizes their flood risk into six distinct city-specific levels. The model is interpretable and enables feature analysis of areas within each flood-risk level, allowing for the identification of the three archetypes shaping the highest flood risk within each MSA. Flood risk is found to be spatially distributed in a hierarchical structure within each MSA, where the core city disproportionately bears the highest flood risk. Multiple cities are found to have high overall flood-risk levels and low spatial inequality, indicating limited options for balancing urban development and flood-risk reduction. Relevant flood-risk reduction strategies are discussed considering ways that the highest flood risk and uneven spatial distribution of flood risk are formed.

Updated: 2025-08-06 18:37:17

Domains: cs.LG,cs.AI,cs.CY,stat.AP,I.2.1 Applications and Expert Systems

Download: http://arxiv.org/abs/2309.14610v4

Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection

3D point cloud has been widely used in applications such as self-driving cars, robotics, CAD models, etc. To the best of our knowledge, these applications raised the issue of privacy leakage in 3D point clouds, which has not been studied well. Different from the 2D image privacy, which is related to texture and 2D geometric structure, the 3D point cloud is texture-less and only relevant to 3D geometric structure. In this work, we defined the 3D point cloud privacy problem and proposed an efficient privacy-preserving framework named PointFlowGMM that can support downstream classification and segmentation tasks without seeing the original data. Using a flow-based generative model, the point cloud is projected into a latent Gaussian mixture distributed subspace. We further designed a novel angular similarity loss to obfuscate the original geometric structure and reduce the model size from 767MB to 120MB without a decrease in recognition performance. The projected point cloud in the latent space is orthogonally rotated randomly to further protect the original geometric structure, the class-to-class relationship is preserved after rotation, thus, the protected point cloud can support the recognition task. We evaluated our model on multiple datasets and achieved comparable recognition results on encrypted point clouds compared to the original point clouds.
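The rotation step is easy to verify in isolation: an orthogonal matrix preserves all pairwise distances, so class-to-class geometry survives while absolute coordinates are scrambled. A sketch with a synthetic latent cloud (array names and dimensions are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d, rng):
    """Sample a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs for a proper Haar sample

latent = rng.standard_normal((128, 16))   # projected point cloud in latent space
Q = random_orthogonal(16, rng)
protected = latent @ Q                    # rotated, geometry-obfuscated cloud

def pdist(x):
    """Dense pairwise Euclidean distance matrix."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
```

Because `Q` is orthogonal, `pdist(latent)` and `pdist(protected)` match to floating-point precision, which is why recognition on the protected cloud can still work.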

Updated: 2025-08-06 18:36:19

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.15818v3

Differentially Private Model-X Knockoffs via Johnson-Lindenstrauss Transform

We introduce a novel privatization framework for high-dimensional controlled variable selection. Our framework enables rigorous False Discovery Rate (FDR) control under differential privacy constraints. While the Model-X knockoff procedure provides FDR guarantees by constructing provably exchangeable ``negative control" features, existing privacy mechanisms like Laplace or Gaussian noise injection disrupt its core exchangeability conditions. Our key innovation lies in privatizing the data knockoff matrix through the Gaussian Johnson-Lindenstrauss Transformation (JLT), a dimension reduction technique that simultaneously preserves covariate relationships through approximate isometry for $(\epsilon,\delta)$-differential privacy. We theoretically characterize both FDR and the power of the proposed private variable selection procedure, in an asymptotic regime. Our theoretical analysis characterizes the role of different factors, such as the JLT's dimension reduction ratio, signal-to-noise ratio, differential privacy parameters, sample size and feature dimension, in shaping the privacy-power trade-off. Our analysis is based on a novel `debiasing technique' for high-dimensional private knockoff procedure. We further establish sufficient conditions under which the power of the proposed procedure converges to one. This work bridges two critical paradigms -- knockoff-based FDR control and private data release -- enabling reliable variable selection in sensitive domains. Our analysis demonstrates that structural privacy preservation through random projections outperforms the classical noise addition mechanism, maintaining statistical power even under strict privacy budgets.
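The approximate-isometry property of the Gaussian Johnson-Lindenstrauss transform is easy to check empirically. A sketch (dimensions and tolerances are arbitrary choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(42)

n, d, k = 40, 2000, 500          # n points in d dims, projected down to k dims
X = rng.standard_normal((n, d))

# Gaussian JLT: a scaled random projection; scaling by 1/sqrt(k) makes
# squared distances unbiased under the projection.
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_sq(x):
    """Squared pairwise distances via the Gram-matrix identity."""
    g = x @ x.T
    s = np.diag(g)
    return s[:, None] + s[None, :] - 2 * g

orig = pairwise_sq(X)
proj = pairwise_sq(Y)
mask = ~np.eye(n, dtype=bool)
distortion = np.abs(proj[mask] / orig[mask] - 1.0)
```

With k = 500 the per-pair relative distortion concentrates around sqrt(2/k), i.e. a few percent, which is the structure-preservation the knockoff construction relies on.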

Updated: 2025-08-06 18:16:53

Domains: stat.ML,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2508.04800v1

Optimality Principles and Neural Ordinary Differential Equations-based Process Modeling for Distributed Control

Most recent advances in machine learning and analytics for process control pose the question of how to naturally integrate new data-driven methods with classical process models and control. We propose a process modeling framework enabling integration of data-driven algorithms through consistent topological properties and conservation of extensive quantities. Interconnections among process network units are represented through connectivity matrices and network graphs. We derive the system's natural objective function equivalent to the non-equilibrium entropy production in a steady state system as a driving force for the process dynamics. We illustrate how distributed control and optimization can be implemented into process network structures and how control laws and algorithms alter the system's natural equilibrium towards engineered objectives. The basic requirement is that the flow conditions can be expressed in terms of conic sector (passivity) conditions. Our formalism allows integration of fundamental conservation properties from topology with learned dynamic relations from data through sparse deep neural networks. We demonstrate in a practical example of a simple inventory control system how to integrate the basic topology of a process with a neural network ordinary differential equation model. The system specific constitutive equations are left undescribed and learned by the neural ordinary differential equation algorithm using the adjoint method in combination with an adaptive ODE solver from synthetic time-series data. The resulting neural network forms a state space model for use in e.g. a model predictive control algorithm.

Updated: 2025-08-06 18:16:46

Domains: cs.NE,cs.AI,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.04799v1

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
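A toy sketch of the parity-aware selection rule: at each merge step, target the currently worst-compressed language. This simplification picks that language's most frequent pair rather than computing exact compression gains, and all names and corpora are illustrative:

```python
from collections import Counter

def tokens_per_char(seqs):
    """Compression proxy: tokens emitted per character of raw text (lower is better)."""
    n_tok = sum(len(s) for s in seqs)
    n_chr = sum(sum(len(t) for t in s) for s in seqs)
    return n_tok / n_chr

def best_merge_for(seqs):
    """Most frequent adjacent token pair in one language's corpus."""
    pairs = Counter()
    for s in seqs:
        pairs.update(zip(s, s[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def apply_merge(seqs, pair):
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    a, b = pair
    out = []
    for s in seqs:
        merged, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == pair:
                merged.append(a + b)
                i += 2
            else:
                merged.append(s[i])
                i += 1
        out.append(merged)
    return out

# Two toy "languages", initially character-tokenized.
corpora = {
    "en": [list("the cat sat"), list("the hat")],
    "sw": [list("paka ameketi"), list("kofia yake")],
}
for _ in range(5):
    # Parity-aware step: the merge is chosen for the worst-compressed language,
    # then applied to the shared vocabulary (all languages).
    worst = max(corpora, key=lambda lang: tokens_per_char(corpora[lang]))
    pair = best_merge_for(corpora[worst])
    if pair is None:
        break
    corpora = {lang: apply_merge(s, pair) for lang, s in corpora.items()}
```

Because the worst-compressed language is re-evaluated every step, merges alternate between languages as their compression ratios leapfrog each other, which is the parity mechanism in miniature.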

Updated: 2025-08-06 18:14:43

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04796v1

Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM

In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
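Equal Error Rate, the metric quoted for the x-vector comparison, is the operating point where the false-accept and false-reject rates coincide. A self-contained sketch on synthetic verification scores:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: threshold sweep to the point where false-accept and
    false-reject rates cross. scores: higher = more likely same speaker;
    labels: 1 = same speaker, 0 = different speaker."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    best_gap, eer = 1.0, None
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # false accepts
        frr = np.mean(scores[labels == 1] < t)    # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy verification trials: well-separated same/different-speaker scores.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.7, 0.1, 200), rng.normal(0.3, 0.1, 200)])
labels = np.concatenate([np.ones(200, int), np.zeros(200, int)])
eer = equal_error_rate(scores, labels)
```

With this much score separation the crossing point sits near zero; the paper's 8.8% corresponds to a far harder real comparison task.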

Updated: 2025-08-06 18:14:04

Domains: cs.CL,cs.AI,cs.SD,eess.AS

Download: http://arxiv.org/abs/2508.04795v1

Federated Continual Recommendation

The increasing emphasis on privacy in recommendation systems has led to the adoption of Federated Learning (FL) as a privacy-preserving solution, enabling collaborative training without sharing user data. While Federated Recommendation (FedRec) effectively protects privacy, existing methods struggle with non-stationary data streams, failing to maintain consistent recommendation quality over time. On the other hand, Continual Learning Recommendation (CLRec) methods address evolving user preferences but typically assume centralized data access, making them incompatible with FL constraints. To bridge this gap, we introduce Federated Continual Recommendation (FCRec), a novel task that integrates FedRec and CLRec, requiring models to learn from streaming data while preserving privacy. As a solution, we propose F3CRec, a framework designed to balance knowledge retention and adaptation under the strict constraints of FCRec. F3CRec introduces two key components: Adaptive Replay Memory on the client side, which selectively retains past preferences based on user-specific shifts, and Item-wise Temporal Mean on the server side, which integrates new knowledge while preserving prior information. Extensive experiments demonstrate that F3CRec outperforms existing approaches in maintaining recommendation quality over time in a federated environment.
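The paper describes Item-wise Temporal Mean only at a high level; one plausible minimal reading is a per-item blend of the historical embedding with the freshly aggregated one, applied only to items actually seen in the current round. A hedged sketch (the function name, `alpha`, and the masking scheme are all assumptions, not the paper's specification):

```python
import numpy as np

def itemwise_temporal_mean(old_emb, new_emb, seen_mask, alpha=0.7):
    """Server-side update: for items updated in the current round, blend the
    new aggregate with the historical embedding; leave unseen items untouched.
    This preserves prior knowledge while integrating new information."""
    blended = alpha * old_emb + (1.0 - alpha) * new_emb
    return np.where(seen_mask[:, None], blended, old_emb)

n_items, dim = 5, 4
old = np.ones((n_items, dim))        # embeddings from previous rounds
new = np.zeros((n_items, dim))       # this round's federated aggregate
seen = np.array([True, True, False, False, True])
updated = itemwise_temporal_mean(old, new, seen, alpha=0.7)
```

Seen items move 30% of the way toward the new aggregate while unseen items keep their previous embeddings, which is the retention-versus-adaptation balance the abstract describes.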

Updated: 2025-08-06 18:06:36

Domains: cs.LG,cs.IR,H.3.3; I.2.6; C.2.4

Download: http://arxiv.org/abs/2508.04792v1

Advanced Multi-Architecture Deep Learning Framework for BIRADS-Based Mammographic Image Retrieval: Comprehensive Performance Analysis with Super-Ensemble Optimization

Content-based mammographic image retrieval systems require exact BIRADS categorical matching across five distinct classes, presenting significantly greater complexity than binary classification tasks commonly addressed in literature. Current medical image retrieval studies suffer from methodological limitations including inadequate sample sizes, improper data splitting, and insufficient statistical validation that hinder clinical translation. We developed a comprehensive evaluation framework systematically comparing CNN architectures (DenseNet121, ResNet50, VGG16) with advanced training strategies including sophisticated fine-tuning, metric learning, and super-ensemble optimization. Our evaluation employed rigorous stratified data splitting (50%/20%/30% train/validation/test), 602 test queries, and systematic validation using bootstrap confidence intervals with 1,000 samples. Advanced fine-tuning with differential learning rates achieved substantial improvements: DenseNet121 (34.79% precision@10, 19.64% improvement) and ResNet50 (34.54%, 19.58% improvement). Super-ensemble optimization combining complementary architectures achieved 36.33% precision@10 (95% CI: [34.78%, 37.88%]), representing 24.93% improvement over baseline and providing 3.6 relevant cases per query. Statistical analysis revealed significant performance differences between optimization strategies (p<0.001) with large effect sizes (Cohen's d>0.8), while maintaining practical search efficiency (2.8 milliseconds). Performance significantly exceeds realistic expectations for 5-class medical retrieval tasks, where literature suggests 20-25% precision@10 represents achievable performance for exact BIRADS matching. Our framework establishes new performance benchmarks while providing evidence-based architecture selection guidelines for clinical deployment in diagnostic support and quality assurance applications.
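The bootstrap confidence intervals reported here follow the standard percentile recipe: resample per-query metrics with replacement and take quantiles of the resampled means. A sketch on synthetic per-query precision@10 values (not the paper's data):

```python
import numpy as np

def bootstrap_ci(per_query_metric, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of a per-query retrieval metric."""
    rng = np.random.default_rng(seed)
    x = np.asarray(per_query_metric, float)
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return x.mean(), lo, hi

rng = np.random.default_rng(1)
# Synthetic precision@10 per query (fractions of 10 retrieved items being relevant).
p_at_10 = rng.binomial(10, 0.36, size=602) / 10.0
mean, lo, hi = bootstrap_ci(p_at_10)
```

With 602 queries the interval is narrow (a few percentage points), which is roughly the CI width the abstract reports around its 36.33% figure.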

Updated: 2025-08-06 18:05:18

Domains: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04790v1

Evaluating the Impact of LLM-guided Reflection on Learning Outcomes with Interactive AI-Generated Educational Podcasts

This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated; while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting the need for more research on reflective interactivity design.

Updated: 2025-08-06 18:03:42

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.04787v1

Uncertainty-aware Predict-Then-Optimize Framework for Equitable Post-Disaster Power Restoration

The increasing frequency of extreme weather events, such as hurricanes, highlights the urgent need for efficient and equitable power system restoration. Many electricity providers make restoration decisions primarily based on the volume of power restoration requests from each region. However, our data-driven analysis reveals significant disparities in request submission volume, as disadvantaged communities tend to submit fewer restoration requests. This disparity makes the current restoration solution inequitable, leaving these communities vulnerable to extended power outages. To address this, we aim to propose an equity-aware power restoration strategy that balances both restoration efficiency and equity across communities. However, achieving this goal is challenging for two reasons: the difficulty of predicting repair durations under dataset heteroscedasticity, and the tendency of reinforcement learning agents to favor low-uncertainty actions, which potentially undermine equity. To overcome these challenges, we design a predict-then-optimize framework called EPOPR with two key components: (1) Equity-Conformalized Quantile Regression for uncertainty-aware repair duration prediction, and (2) Spatial-Temporal Attentional RL that adapts to varying uncertainty levels across regions for equitable decision-making. Experimental results show that our EPOPR effectively reduces the average power outage duration by 3.60% and decreases inequity between different communities by 14.19% compared to state-of-the-art baselines.
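Equity-Conformalized Quantile Regression builds on the standard split-conformal correction of quantile-regression intervals (CQR). A sketch of that base step, without the paper's equity weighting (the data and the deliberately narrow predictor are synthetic):

```python
import numpy as np

def cqr_adjustment(lo_cal, hi_cal, y_cal, alpha=0.1):
    """Split-conformal correction for quantile-regression intervals (CQR).
    Returns the margin to add on both sides of new intervals so that
    coverage is at least 1 - alpha."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)   # conformity scores
    n = len(y_cal)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, 500)        # synthetic repair durations (hours)
lo = np.full(500, 10.0 - 2.0)         # an under-covering +/-1 sigma predictor
hi = np.full(500, 10.0 + 2.0)
q = cqr_adjustment(lo, hi, y, alpha=0.1)
covered = np.mean((y >= lo - q) & (y <= hi + q))
```

The raw +/-1 sigma intervals cover only about 68% of outcomes; the conformal margin `q` widens them until empirical coverage reaches the 90% target, which is the uncertainty-awareness the downstream RL policy consumes.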

Updated: 2025-08-06 18:00:30

Domains: cs.LG,cs.AI,cs.SI

Download: http://arxiv.org/abs/2508.04780v1

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
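The GRPO update standardizes rewards within a group of rollouts for the same task, so advantages are relative to the group rather than to a learned value baseline. A minimal sketch of the advantage computation (the full clipped policy objective is omitted):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group Relative Policy Optimization: the advantage of each rollout is
    its reward standardized within the group sampled for one task."""
    r = np.asarray(rewards, float)
    return (r - r.mean()) / (r.std() + eps)

# One task, four attempted trajectories scored (here, by success/failure).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts get positive advantages and failed ones negative, with zero group mean, so the policy is pushed toward whatever distinguished the successes within each task.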

Updated: 2025-08-06 17:58:46

Domains: cs.AI,cs.CL,cs.CV,cs.LG,cs.MA,cs.MM

Download: http://arxiv.org/abs/2508.04700v1

Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.

Updated: 2025-08-06 17:58:36

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04699v1

From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario

Multi-agent robotic systems (MARS) build upon multi-agent systems by integrating physical and task-related constraints, increasing the complexity of action execution and agent coordination. However, despite the availability of advanced multi-agent frameworks, their real-world deployment on robots remains limited, hindering the advancement of MARS research in practice. To bridge this gap, we conducted two studies to investigate performance trade-offs of hierarchical multi-agent frameworks in a simulated real-world multi-robot healthcare scenario. In Study 1, using CrewAI, we iteratively refine the system's knowledge base, to systematically identify and categorize coordination failures (e.g., tool access violations, lack of timely handling of failure reports) not resolvable by providing contextual knowledge alone. In Study 2, using AutoGen, we evaluate a redesigned bidirectional communication structure and further measure the trade-offs between reasoning and non-reasoning models operating within the same robotic team setting. Drawing from our empirical findings, we emphasize the tension between autonomy and stability and the importance of edge-case testing to improve system reliability and safety for future real-world deployment. Supplementary materials, including codes, task agent setup, trace outputs, and annotated examples of coordination failures and reasoning behaviors, are available at: https://byc-sophie.github.io/mas-to-mars/.

Updated: 2025-08-06 17:54:10

Domains: cs.RO,cs.AI,cs.MA

Download: http://arxiv.org/abs/2508.04691v1

Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering

This study introduces Query Attribute Modeling (QAM), a hybrid framework that enhances search precision and relevance by decomposing open text queries into structured metadata tags and semantic elements. QAM addresses traditional search limitations by automatically extracting metadata filters from free-form text queries, reducing noise and enabling focused retrieval of relevant items. Experimental evaluation using the Amazon Toys Reviews dataset (10,000 unique items with 40,000+ reviews and detailed product attributes) demonstrated QAM's superior performance, achieving a mean average precision at 5 (mAP@5) of 52.99\%. This represents significant improvement over conventional methods, including BM25 keyword search, encoder-based semantic similarity search, cross-encoder re-ranking, and hybrid search combining BM25 and semantic results via Reciprocal Rank Fusion (RRF). The results establish QAM as a robust solution for Enterprise Search applications, particularly in e-commerce systems.
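One of the baselines above, Reciprocal Rank Fusion (RRF), has a standard closed form: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in, where k is a smoothing constant (commonly 60). A minimal sketch, with hypothetical document IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse a BM25 ranking with a semantic-similarity ranking.
bm25 = ["d1", "d2", "d3"]
semantic = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25, semantic])
```

Documents that rank highly in both lists (here `d1` and `d3`) float to the top, which is why RRF is a common hybrid-search baseline.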

Updated: 2025-08-06 17:47:00

Domains: cs.IR,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.04683v1

GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses ordinary pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state constrained optimization method using a threshold-based margin (TM) loss, which maintains activation-state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns--retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs in the future. Our code and data are available at https://github.com/Qznan/GeRe.
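The abstract does not spell out the exact form of the threshold-based margin (TM) loss; one plausible reading, sketched here purely as an illustration, is a hinge that penalizes activation drift from a reference state only when it exceeds a margin tau:

```python
def threshold_margin_loss(current, reference, tau=0.1):
    """Hypothetical TM-style loss: hinge on activation drift beyond margin tau.
    Drift within tau is free; only the excess is penalized."""
    return sum(max(0.0, abs(c - r) - tau)
               for c, r in zip(current, reference)) / len(current)
```

The key property such a loss would give replay learning is slack: activations may move a little during new-task updates, but large departures from the pre-collected reference states are pulled back.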

Updated: 2025-08-06 17:42:22

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04676v1

Robustly Learning Monotone Single-Index Models

We consider the basic problem of learning Single-Index Models with respect to the square loss under the Gaussian distribution in the presence of adversarial label noise. Our main contribution is the first computationally efficient algorithm for this learning task, achieving a constant factor approximation, that succeeds for the class of {\em all} monotone activations with bounded moment of order $2 + \zeta,$ for $\zeta > 0.$ This class in particular includes all monotone Lipschitz functions and even discontinuous functions like (possibly biased) halfspaces. Prior work for the case of unknown activation either does not attain constant factor approximation or succeeds for a substantially smaller family of activations. The main conceptual novelty of our approach lies in developing an optimization framework that steps outside the boundaries of usual gradient methods and instead identifies a useful vector field to guide the algorithm updates by directly leveraging the problem structure, properties of Gaussian spaces, and regularity of monotone functions.

Updated: 2025-08-06 17:37:06

Domains: cs.LG,math.OC

Download: http://arxiv.org/abs/2508.04670v1

Cybersecurity of Quantum Key Distribution Implementations

Practical implementations of Quantum Key Distribution (QKD) often deviate from the theoretical protocols, exposing the implementations to various attacks even when the underlying (ideal) protocol is proven secure. We present new analysis tools and methodologies for quantum cybersecurity, adapting the concepts of vulnerabilities, attack surfaces, and exploits from classical cybersecurity to QKD implementation attacks. We present three additional concepts, derived from the connection between classical and quantum cybersecurity: "Quantum Fuzzing", which is the first tool for black-box vulnerability research on QKD implementations; "Reversed-Space Attacks", which are a generic exploit method using the attack surface of imperfect receivers; and a concrete quantum-mechanical definition of "Quantum Side-Channel Attacks", meaningfully distinguishing them from other types of attacks. Using our tools, we analyze multiple existing QKD attacks and show that the "Bright Illumination" attack could have been fully constructed even with minimal knowledge of the device implementation. This work begins to bridge the gap between current analysis methods for experimental attacks on QKD implementations and the decades-long research in the field of classical cybersecurity, improving the practical security of QKD products and enhancing their usefulness in real-world systems.

Updated: 2025-08-06 17:37:04

Domains: quant-ph,cs.CR

Download: http://arxiv.org/abs/2508.04669v1

Beyond Adapter Retrieval: Latent Geometry-Preserving Composition via Sparse Task Projection

Recent advances in parameter-efficient transfer learning have demonstrated the utility of composing LoRA adapters from libraries of pretrained modules. However, most existing approaches rely on simple retrieval heuristics or uniform averaging, which overlook the latent structure of task relationships in representation space. We propose a new framework for adapter reuse that moves beyond retrieval, formulating adapter composition as a geometry-aware sparse reconstruction problem. Specifically, we represent each task by a latent prototype vector derived from the base model's encoder and aim to approximate the target task prototype as a sparse linear combination of retrieved reference prototypes, under an $\ell_1$-regularized optimization objective. The resulting combination weights are then used to blend the corresponding LoRA adapters, yielding a composite adapter tailored to the target task. This formulation not only preserves the local geometric structure of the task representation manifold, but also promotes interpretability and efficient reuse by selecting a minimal set of relevant adapters. We demonstrate the effectiveness of our approach across multiple domains-including medical image segmentation, medical report generation and image synthesis. Our results highlight the benefit of coupling retrieval with latent geometry-aware optimization for improved zero-shot generalization.
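The $\ell_1$-regularized reconstruction step can be sketched with plain ISTA (iterative soft-thresholding), followed by blending the adapters with the resulting sparse weights. The prototype vectors, step size, and regularization strength below are illustrative stand-ins, not the paper's settings:

```python
def soft(z, t):
    """Soft-thresholding, the proximal operator of t * |.|_1."""
    return max(z - t, 0.0) if z > 0 else min(z + t, 0.0)

def ista_l1(protos, target, lam=0.05, eta=0.1, steps=500):
    """Solve min_w 0.5*||sum_j w_j*protos[j] - target||^2 + lam*||w||_1."""
    d, m = len(target), len(protos)
    w = [0.0] * m
    for _ in range(steps):
        resid = [sum(w[j] * protos[j][i] for j in range(m)) - target[i]
                 for i in range(d)]
        grads = [sum(protos[j][i] * resid[i] for i in range(d)) for j in range(m)]
        w = [soft(w[j] - eta * grads[j], eta * lam) for j in range(m)]
    return w

def blend(adapters, w):
    """Combine (flattened) adapter weight vectors with the sparse coefficients."""
    return [sum(w[j] * adapters[j][i] for j in range(len(w)))
            for i in range(len(adapters[0]))]

# Toy 2-D example: the target prototype matches the first reference prototype,
# so the l1 penalty drives the second coefficient exactly to zero.
w = ista_l1([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
composite = blend([[2.0, 0.0], [0.0, 2.0]], w)
```

The shrinkage toward zero is what yields a minimal set of relevant adapters: irrelevant references get coefficient exactly 0 rather than a small nonzero weight.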

Updated: 2025-08-06 17:36:57

Domains: cs.LG,cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2410.09908v2

How are CS students using resources and AI tools for coding tasks?

A survey of 26 CS students reveals that AI coding assistants are mainly used for writing code (second to online searches) while AI chatbots are the top resource for debugging. Participants with different coding experience prefer online help over direct human help from peers and instructors.

Updated: 2025-08-06 17:35:55

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.04667v1

Perch 2.0: The Bittern Lesson for Bioacoustics

Perch is a performant pre-trained model for bioacoustics. It was trained in supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.

Updated: 2025-08-06 17:34:43

Domains: cs.LG,cs.SD,eess.AS

Download: http://arxiv.org/abs/2508.04665v1

Ultra Memory-Efficient On-FPGA Training of Transformers via Tensor-Compressed Optimization

Transformer models have achieved state-of-the-art performance across a wide range of machine learning tasks. There is growing interest in training transformers on resource-constrained edge devices due to considerations such as privacy, domain adaptation, and on-device scientific machine learning. However, the significant computational and memory demands required for transformer training often exceed the capabilities of an edge device. Leveraging low-rank tensor compression, this paper presents the first on-FPGA accelerator for end-to-end transformer training. On the algorithm side, we present a bi-directional contraction flow for tensorized transformer training, significantly reducing the computational FLOPs and intra-layer memory costs compared to existing tensor operations. On the hardware side, we store all highly compressed model parameters and gradient information on chip, creating an on-chip-memory-only framework for each stage in training. This reduces off-chip communication and minimizes latency and energy costs. Additionally, we implement custom computing kernels for each training stage and employ intra-layer parallelism and pipelining to further enhance run-time and memory efficiency. Through experiments on transformer models within $36.7$ to $93.5$ MB using FP-32 data formats on the ATIS dataset, our tensorized FPGA accelerator could conduct single-batch end-to-end training on the AMD Alveo U50 FPGA, with a memory budget of less than $6$-MB BRAM and $22.5$-MB URAM. Compared to uncompressed training on the NVIDIA RTX 3090 GPU, our on-FPGA training achieves a memory reduction of $30\times$ to $51\times$. Our FPGA accelerator also achieves up to $3.6\times$ less energy cost per epoch compared with tensor Transformer training on an NVIDIA RTX 3090 GPU.

Updated: 2025-08-06 17:33:10

Domains: cs.LG,cs.AR,cs.CL

Download: http://arxiv.org/abs/2501.06663v2

Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs' capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on information-sparse benchmarks-PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning-demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs' inherent tool calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks-highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.
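The three tool categories (fragmentation; summarize/hide/restore; search) can be pictured as operations over a list of context fragments. This is an illustrative sketch only, with hypothetical method names, not the Sculptor implementation:

```python
class ContextManager:
    """Toy active-context store: fragments can be hidden behind summaries,
    restored later, and searched even while hidden."""

    def __init__(self, fragments):
        self.fragments = list(fragments)
        self.hidden = {}  # index -> original fragment text

    def hide(self, i, summary):
        """Swap a fragment for a short summary; keep the original for restore."""
        self.hidden[i] = self.fragments[i]
        self.fragments[i] = summary

    def restore(self, i):
        """Bring a hidden fragment back into the active context."""
        self.fragments[i] = self.hidden.pop(i)

    def search(self, keyword):
        """Search visible and hidden fragments alike."""
        pool = list(self.fragments) + list(self.hidden.values())
        return [f for f in pool if keyword in f]

    def active_context(self):
        return "\n".join(self.fragments)

cm = ContextManager(["intro about apples", "long digression about the weather"])
cm.hide(1, "[digression hidden]")
```

The point of the hide/restore pair is that distracting material stops occupying working memory (the active context) without being lost: intelligent search can still reach it, mirroring how humans filter distractions while retaining the ability to recall them.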

Updated: 2025-08-06 17:32:58

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04664v1

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inferences on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with the minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.

Updated: 2025-08-06 17:30:44

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04663v1

YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring paper

In the poultry industry, detecting chicken illnesses is essential to avoid financial losses. Conventional techniques depend on manual observation, which is laborious and prone to mistakes. This study proposes an AI-based approach using YOLO v8, a deep learning model for real-time object recognition. By developing a system that analyzes high-resolution chicken photos, YOLO v8 detects signs of illness, such as abnormalities in behavior and appearance. A sizable, annotated dataset was used to train the algorithm, which provides accurate real-time identification of infected chickens and prompt warnings to farm operators so they can act quickly. By facilitating early infection identification, eliminating the need for human inspection, and enhancing biosecurity in large-scale farms, this AI technology improves chicken health management. The real-time capabilities of YOLO v8 provide a scalable and effective method for improving farm management practices.

Updated: 2025-08-06 17:27:48

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04658v1

Self-Questioning Language Models

Can large language models improve without external data -- by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.
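The two reward signals described above are simple to state in code: the solver is rewarded for agreeing with the modal answer across samples, and the proposer is rewarded when the resulting solve rate falls in a middle band (the 0.2-0.8 band below is a hypothetical choice; the abstract does not state thresholds):

```python
from collections import Counter

def majority_vote_reward(answers):
    """Reward each sampled answer 1.0 if it matches the modal answer, else 0.0.
    Majority voting stands in for correctness when no ground truth exists."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def proposer_reward(solver_rewards, low=0.2, high=0.8):
    """Reward the proposer when the question is neither too easy nor too hard."""
    rate = sum(solver_rewards) / len(solver_rewards)
    return 1.0 if low <= rate <= high else 0.0

solver_rewards = majority_vote_reward(["42", "41", "42", "42"])
```

A question all samples solve (rate 1.0) or none solve (rate 0.0) earns the proposer nothing, which pushes generated problems toward the frontier of the solver's ability.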

Updated: 2025-08-06 17:23:53

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.03682v2

X-SAM: From Segment Anything to Any Segmentation

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from \textit{segment anything} to \textit{any segmentation}. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

Updated: 2025-08-06 17:19:10

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04655v1

LLM Collaboration With Multi-Agent Reinforcement Learning

A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges.
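MAGRPO builds on group-relative policy optimization; the core of that family of methods is standardizing each sampled completion's reward against its own group, removing the need for a learned value baseline. The multi-agent specifics of MAGRPO are not given in the abstract, so this sketch shows only the shared group-relative step:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group.
    Above-average completions get positive advantage, below-average negative."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean itself, a cooperative team reward can be scored per joint rollout and still yield a usable per-sample training signal.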

Updated: 2025-08-06 17:18:25

Domains: cs.AI,cs.SE

Download: http://arxiv.org/abs/2508.04652v1

Live Music Models

We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

Updated: 2025-08-06 17:18:21

Domains: cs.SD,cs.HC,cs.LG

Download: http://arxiv.org/abs/2508.04651v1

Accept-Reject Lasso

The Lasso method is known to exhibit instability in the presence of highly correlated features, often leading to an arbitrary selection of predictors. This issue manifests itself in two primary error types: the erroneous omission of features that lack a true substitutable relationship (falsely redundant features) and the inclusion of features with a true substitutable relationship (truly redundant features). Although most existing methods address only one of these challenges, we introduce the Accept-Reject Lasso (ARL), a novel approach that resolves this dilemma. ARL operationalizes an Accept-Reject framework through a fine-grained analysis of feature selection across data subsets, partitioning the output of an ensemble method into beneficial and detrimental components. The fundamental challenge for Lasso is that inter-variable correlation obscures the true sources of information. ARL tackles this by first using clustering to identify distinct subset structures within the data. It then analyzes Lasso's behavior across these subsets to differentiate between true and spurious correlations. For truly correlated features, which induce multicollinearity, ARL tends to select a single representative feature and reject the rest to ensure model stability. Conversely, for features linked by spurious correlations, which may vanish in certain subsets, ARL accepts those that Lasso might have incorrectly omitted. The distinct patterns arising from true versus spurious correlations create a divisible separation. By setting an appropriate threshold, our framework can effectively distinguish between these two phenomena, thereby maximizing the inclusion of informative variables while minimizing the introduction of detrimental ones. We illustrate the efficacy of the proposed method through extensive simulation and real-data experiments.

Updated: 2025-08-06 17:13:27

Domains: stat.ME,cs.LG

Download: http://arxiv.org/abs/2508.04646v1

Approximation Rates in Besov Norms and Sample-Complexity of Kolmogorov-Arnold Networks with Residual Connections

Inspired by the Kolmogorov-Arnold superposition theorem, Kolmogorov-Arnold Networks (KANs) have recently emerged as an improved backbone for most deep learning frameworks, promising more adaptivity than their multilayer perceptron (MLP) predecessor by allowing for trainable spline-based activation functions. In this paper, we probe the theoretical foundations of the KAN architecture by showing that it can optimally approximate any Besov function in $B^{s}_{p,q}(\mathcal{X})$ on a bounded open, or even fractal, domain $\mathcal{X}$ in $\mathbb{R}^d$ at the optimal approximation rate with respect to any weaker Besov norm $B^{\alpha}_{p,q}(\mathcal{X})$; where $\alpha < s$. We complement our approximation result with a statistical guarantee by bounding the pseudodimension of the relevant class of Res-KANs. As an application of the latter, we directly deduce a dimension-free estimate on the sample complexity of a residual KAN model when learning a function of Besov regularity from $N$ i.i.d. noiseless samples, showing that KANs can learn the smooth maps which they can approximate.

Updated: 2025-08-06 17:12:12

Domains: cs.LG,cs.NA,cs.NE,math.FA,math.NA,stat.ML

Download: http://arxiv.org/abs/2504.15110v3

A Scalable Pretraining Framework for Link Prediction with Efficient Adaptation

Link Prediction (LP) is a critical task in graph machine learning. While Graph Neural Networks (GNNs) have significantly advanced LP performance recently, existing methods face key challenges including limited supervision from sparse connectivity, sensitivity to initialization, and poor generalization under distribution shifts. We explore pretraining as a solution to address these challenges. Unlike node classification, LP is inherently a pairwise task, which requires the integration of both node- and edge-level information. In this work, we present the first systematic study on the transferability of these distinct modules and propose a late fusion strategy to effectively combine their outputs for improved performance. To handle the diversity of pretraining data and avoid negative transfer, we introduce a Mixture-of-Experts (MoE) framework that captures distinct patterns in separate experts, facilitating seamless application of the pretrained model on diverse downstream datasets. For fast adaptation, we develop a parameter-efficient tuning strategy that allows the pretrained model to adapt to unseen datasets with minimal computational overhead. Experiments on 16 datasets across two domains demonstrate the effectiveness of our approach, achieving state-of-the-art performance on low-resource link prediction while obtaining competitive results compared to end-to-end trained methods, with over 10,000x lower computational overhead.

Updated: 2025-08-06 17:10:31

标题: 一个可扩展的用于链接预测的预训练框架,具有高效的适应性

摘要: 链接预测(LP)是图机器学习中的关键任务。虽然图神经网络(GNNs)最近显著提高了LP性能,但现有方法面临关键挑战,包括来自稀疏连接的有限监督、对初始化敏感以及在分布转移下泛化能力差。我们探索预训练作为解决这些挑战的方法。与节点分类不同,LP本质上是一个成对任务,需要整合节点级和边级信息。在这项工作中,我们首次系统研究了这些不同模块的可转移性,并提出了一种后期融合策略,有效地结合它们的输出以提高性能。为了处理预训练数据的多样性并避免负迁移,我们引入了一个专家混合(MoE)框架,捕捉各个专家中的不同模式,促进预训练模型在多样的下游数据集上无缝应用。为了快速适应,我们开发了一种参数高效的调整策略,使预训练模型能够适应未见数据集,而计算开销最小。对两个领域的16个数据集进行的实验表明,我们的方法的有效性,实现了在资源匮乏的链接预测上的最新性能,同时与端到端训练方法相比取得了竞争性结果,计算开销降低了超过10,000倍。

更新时间: 2025-08-06 17:10:31

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.04645v1

Millions of inequivalent quadratic APN functions in eight variables

The only known example of an almost perfect nonlinear (APN) permutation in even dimension was obtained by applying CCZ-equivalence to a specific quadratic APN function. Motivated by this result, there have been numerous recent attempts to construct new quadratic APN functions. Currently, 32,892 quadratic APN functions in dimension 8 are known and two recent conjectures address their possible total number. The first, proposed by Y. Yu and L. Perrin (Cryptogr. Commun. 14(6): 1359-1369, 2022), suggests that there are more than 50,000 such functions. The second, by A. Polujan and A. Pott (Proc. 7th Int. Workshop on Boolean Functions and Their Applications, 2022), argues that their number exceeds that of inequivalent quadratic (8,4)-bent functions, which is 92,515. We computationally construct 3,775,599 inequivalent quadratic APN functions in dimension 8 and estimate the total number to be about 6 million.
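
The APN property itself can be checked by brute force in small dimensions: f is APN iff for every nonzero a, the derivative D_a f(x) = f(x + a) + f(x) takes each value at most twice. A minimal sketch (the AES field polynomial is one arbitrary choice of irreducible polynomial; the paper's search machinery is far more involved):

```python
REDUCTION = 0x11B  # x^8 + x^4 + x^3 + x + 1, one irreducible polynomial for GF(2^8)

def gf_mul(a, b):
    """Carry-less multiplication modulo the reduction polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= REDUCTION
        b >>= 1
    return r

def is_apn(f):
    """f: lookup table over GF(2^n); field addition is XOR."""
    n = len(f)
    for a in range(1, n):
        counts = {}
        for x in range(n):
            d = f[x ^ a] ^ f[x]
            counts[d] = counts.get(d, 0) + 1
            if counts[d] > 2:
                return False
    return True

cube = [gf_mul(x, gf_mul(x, x)) for x in range(256)]  # the Gold function x^3
print(is_apn(cube))              # True: x^3 is a classical quadratic APN function
print(is_apn(list(range(256))))  # False: the identity is linear, far from APN
```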

Updated: 2025-08-06 17:08:13

Categories: math.CO,cs.CR,cs.DM,cs.IT,math.IT

Download: http://arxiv.org/abs/2508.04644v1

4-Swap: Achieving Grief-Free and Bribery-Safe Atomic Swaps Using Four Transactions

Cross-chain asset exchange is crucial for blockchain interoperability. Existing solutions rely on trusted third parties and risk asset loss, or use decentralized alternatives like atomic swaps, which suffer from grief attacks. Griefing occurs when a party prematurely exits, locking the counterparty's assets until a timelock expires. Hedged Atomic Swaps mitigate griefing by introducing a penalty premium; however, they increase the number of transactions from four (as in Tier Nolan's swap) to six, which in turn introduces new griefing risks. Grief-Free (GF) Swap reduces this to five transactions by consolidating assets and premiums on a single chain. However, no existing protocol achieves grief-free asset exchange in just four transactions. This paper presents 4-Swap, the first cross-chain atomic swap protocol that is both grief-free and bribery-safe, while completing asset exchange in just four transactions. By combining the griefing premium and principal into a single transaction per chain, 4-Swap reduces on-chain transactions, leading to faster execution compared to previous grief-free solutions. It is fully compatible with Bitcoin and operates without the need for any new opcodes. A game-theoretic analysis shows that rational participants have no incentive to deviate from the protocol, ensuring robust compliance and security.

Updated: 2025-08-06 17:06:55

Categories: cs.CR,C.2.4

Download: http://arxiv.org/abs/2508.04641v1

Stochastic Encodings for Active Feature Acquisition

Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.
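
The greedy information-based baseline that the abstract contrasts with can be sketched on a toy discrete distribution: acquire the feature whose observation is expected to reduce uncertainty about the label most. The probability table and feature names below are made up purely for illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# made-up joint distribution p(x1, x2, y); array axes are (x1, x2, y)
p = np.array([[[0.20, 0.02], [0.03, 0.15]],
              [[0.05, 0.20], [0.15, 0.20]]])
p = p / p.sum()

def info_gain(p, feature_axis):
    """I(Y; X_feature) = H(Y) - E_x[H(Y | X_feature = x)]."""
    pxy = p.sum(axis=1 - feature_axis)   # marginalize the other feature -> (x, y)
    py = pxy.sum(axis=0)
    px = pxy.sum(axis=1)
    h_cond = sum(px[i] * entropy(pxy[i] / px[i]) for i in range(len(px)))
    return entropy(py) - h_cond

gains = [info_gain(p, j) for j in (0, 1)]
print("acquire feature x%d next" % (int(np.argmax(gains)) + 1))
```

A latent-variable acquirer instead scores candidates by reasoning over sampled completions of the unobserved features, which avoids this one-step myopia.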

Updated: 2025-08-06 17:06:54

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2508.01957v3

CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series

Time series anomaly detection has garnered considerable attention across diverse domains. However, existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.

Updated: 2025-08-06 16:57:06

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04630v1

P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis

Large Language Models (LLMs) are expected to produce safe, helpful, and honest content during interaction with human users, but they frequently fail to align with such values when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone, leaving substantial room for improvement along multiple dimensions. A cost-effective yet high-impact way is to pre-align instructions before the model begins decoding. Existing approaches either rely on prohibitive test-time search costs or end-to-end model rewrite, which is powered by a customized training corpus with unclear objectives. In this work, we demonstrate that the goal of efficient and effective preference alignment can be achieved by P-Aligner, a lightweight module generating instructions that preserve the original intents while being expressed in a more human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree Search, which systematically explores the space of candidate instructions that are closely tied to human preference. Experiments across different methods show that P-Aligner generally outperforms strong baselines across various models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness and efficiency through multiple perspectives, including data quality, search strategies, iterative deployment, and time overhead.

Updated: 2025-08-06 16:51:38

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04626v1

Temporal and Heterogeneous Graph Neural Network for Remaining Useful Life Prediction

Predicting Remaining Useful Life (RUL) plays a crucial role in the prognostics and health management of industrial systems that involve a variety of interrelated sensors. Given a constant stream of time series sensory data from such systems, deep learning models have risen to prominence at identifying complex, nonlinear temporal dependencies in these data. In addition to the temporal dependencies of individual sensors, spatial dependencies emerge as important correlations among these sensors, which can be naturally modelled by a temporal graph that describes time-varying spatial relationships. However, the majority of existing studies have relied on capturing discrete snapshots of this temporal graph, a coarse-grained approach that leads to loss of temporal information. Moreover, given the variety of heterogeneous sensors, it becomes vital that such inherent heterogeneity is leveraged for RUL prediction in temporal sensor graphs. To capture the nuances of the temporal and spatial relationships and heterogeneous characteristics in an interconnected graph of sensors, we introduce a novel model named Temporal and Heterogeneous Graph Neural Networks (THGNN). Specifically, THGNN aggregates historical data from neighboring nodes to accurately capture the temporal dynamics and spatial correlations within the stream of sensor data in a fine-grained manner. Moreover, the model leverages Feature-wise Linear Modulation (FiLM) to address the diversity of sensor types, significantly improving the model's capacity to learn the heterogeneity in the data sources. Finally, we have validated the effectiveness of our approach through comprehensive experiments. Our empirical findings demonstrate significant advancements on the N-CMAPSS dataset, achieving improvements of up to 19.2% and 31.6% in terms of two different evaluation metrics over state-of-the-art methods.
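
FiLM itself is a small mechanism: a per-condition, feature-wise scale and shift. A minimal sketch follows, with simple lookup tables standing in for the learned FiLM generators; the shapes and the number of sensor types are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, n_sensor_types = 8, 3
W_gamma = rng.normal(size=(n_sensor_types, d_hidden))  # per-type scales
W_beta = rng.normal(size=(n_sensor_types, d_hidden))   # per-type shifts

def film(h, sensor_type):
    """Feature-wise Linear Modulation of one sensor node's hidden state."""
    gamma, beta = W_gamma[sensor_type], W_beta[sensor_type]
    return gamma * h + beta

h = rng.normal(size=d_hidden)
print(film(h, 0).shape)                     # (8,)
print(np.allclose(film(h, 0), film(h, 1)))  # False: types modulate differently
```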

Updated: 2025-08-06 16:48:27

Categories: cs.AI

Download: http://arxiv.org/abs/2405.04336v3

HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs

Recommender systems are indispensable for helping users navigate the immense item catalogs of modern online platforms. Recently, generative recommendation has emerged as a promising paradigm, unifying the conventional retrieve-and-rank pipeline into an end-to-end model capable of dynamic generation. However, existing generative methods are fundamentally constrained by their unsupervised tokenization, which generates semantic IDs suffering from two critical flaws: (1) they are semantically flat and uninterpretable, lacking a coherent hierarchy, and (2) they are prone to representation entanglement (i.e., ``ID collisions''), which harms recommendation accuracy and diversity. To overcome these limitations, we propose HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations. First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags, yielding more uniform and disentangled IDs. Crucially, the trained codebooks can predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation. Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap. This mechanism not only resolves the critical ID collision problem but also promotes recommendation diversity by ensuring a more comprehensive utilization of the item representation space. These high-quality, disentangled IDs provide a powerful foundation for downstream generative models. Extensive experiments on three public benchmarks validate HiD-VAE's superior performance against state-of-the-art methods. The code is available at https://anonymous.4open.science/r/HiD-VAE-84B2.
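
A uniqueness-style penalty of the kind described above can be sketched by directly penalizing pairs of item latents that sit too close together. The hinge form and margin below are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def uniqueness_loss(z, margin=1.0):
    """z: (n_items, d) latent codes; average hinge penalty on close pairs."""
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    penalty = np.maximum(0.0, margin - dist)
    n = z.shape[0]
    mask = 1.0 - np.eye(n)               # ignore self-distances
    return float((penalty * mask).sum() / (n * (n - 1)))

collided = np.zeros((4, 8))              # every item mapped to the same latent
spread = 5.0 * np.eye(4, 8)              # well-separated latents
print(uniqueness_loss(collided) > uniqueness_loss(spread))  # True
```

Driving this loss down spreads item codes across the representation space, which is exactly what resolves ID collisions.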

Updated: 2025-08-06 16:45:05

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2508.04618v1

Verifiable Exponential Mechanism for Median Estimation

Differential Privacy (DP) is a rigorous privacy standard widely adopted in data analysis and machine learning. However, its guarantees rely on correctly introducing randomized noise--an assumption that may not hold if the implementation is faulty or manipulated by an untrusted analyst. To address this concern, we propose the first verifiable implementation of the exponential mechanism using zk-SNARKs. As a concrete application, we present the first verifiable differentially private (DP) median estimation scheme, which leverages this construction to ensure both privacy and verifiability. Our method encodes the exponential mechanism and a utility function for the median into an arithmetic circuit, employing a scaled inverse CDF technique for sampling. This design enables cryptographic verification that the reported output adheres to the intended DP mechanism, ensuring both privacy and integrity without revealing sensitive data.
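
For reference, the (unverified) exponential mechanism for the median can be sketched as follows, including the inverse-CDF sampling step the abstract mentions. The discrete output range, the epsilon value, and taking this utility's sensitivity as 1 are illustrative assumptions, and the zk-SNARK circuit is of course not shown.

```python
import math
import random
from collections import Counter

def exp_mech_median(data, output_range, eps):
    def utility(r):
        below = sum(1 for x in data if x < r)
        above = sum(1 for x in data if x > r)
        return -abs(below - above)       # maximized at the median
    weights = [math.exp(eps * utility(r) / 2.0) for r in output_range]
    # inverse-CDF sampling: draw u in [0, total), walk the running sum
    u = random.random() * sum(weights)
    cdf = 0.0
    for r, w in zip(output_range, weights):
        cdf += w
        if cdf >= u:
            return r
    return output_range[-1]

random.seed(0)
data = [3, 5, 6, 7, 9]
samples = [exp_mech_median(data, range(11), eps=4.0) for _ in range(2000)]
print(Counter(samples).most_common(1)[0][0])  # 6: the true median dominates
```

The verifiable construction encodes exactly this utility and sampling logic in an arithmetic circuit, so a proof can attest the output was drawn from the right distribution.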

Updated: 2025-08-06 16:43:01

Categories: cs.CR

Download: http://arxiv.org/abs/2505.16246v2

Self-Managing DRAM: A Low-Cost Framework for Enabling Autonomous and Efficient in-DRAM Operations

The memory controller is in charge of managing DRAM maintenance operations (e.g., refresh, RowHammer protection, memory scrubbing) to reliably operate modern DRAM chips. Implementing new maintenance operations often necessitates modifications in the DRAM interface, memory controller, and potentially other system components. Such modifications are only possible with a new DRAM standard, which takes a long time to develop, likely leading to slow progress in the adoption of new architectural techniques in DRAM chips. We propose a new low-cost DRAM architecture, Self-Managing DRAM (SMD), that enables autonomous in-DRAM maintenance operations by transferring the responsibility for controlling maintenance operations from the memory controller to the SMD chip. To enable autonomous maintenance operations, we make a single modification to the DRAM interface, such that an SMD chip rejects memory controller accesses to DRAM regions under maintenance, while allowing memory accesses to others. Thus, SMD enables 1) implementing new in-DRAM maintenance mechanisms (or modifying existing ones) with no further changes in the DRAM interface or other system components, and 2) overlapping the latency of a maintenance operation in one DRAM region with the latency of accessing data in another. We evaluate SMD and show that it 1) can be implemented without adding new pins to the DDRx interface with low latency and area overhead, 2) achieves 4.1% average speedup across 20 four-core memory-intensive workloads over a DDR4-based system/DRAM co-design technique that intelligently parallelizes maintenance operations with memory accesses, and 3) guarantees forward progress for rejected memory accesses. We believe and hope SMD can enable innovations in DRAM architecture to rapidly come to fruition. We open source all SMD source code and data at https://github.com/CMU-SAFARI/SelfManagingDRAM.

Updated: 2025-08-06 16:37:52

Categories: cs.AR,cs.CR

Download: http://arxiv.org/abs/2207.13358v9

RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM's immense action space and sparse reward. Critically, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components, i.e., Multiple Importance Sampling to address distributional mismatch from external data, and an Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.

Updated: 2025-08-06 16:36:42

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.00222v3

A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature

The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an autoregressive music model on the Lakh MIDI dataset -- confirm that the extracted settings support faithful reproduction, achieving test perplexities within 1--3% of the original reports.

Updated: 2025-08-06 16:33:20

Categories: cs.IR,cs.DL,cs.LG,68P20, 68T05, 68T50,H.3.3; H.3.7; I.2.6; I.2.7

Download: http://arxiv.org/abs/2508.04612v1

Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning

Inspired by the brain's hierarchical processing and energy efficiency, this paper presents a Spiking Neural Network (SNN) architecture for a lifelong Network Intrusion Detection System (NIDS). The proposed system first employs an efficient static SNN to identify potential intrusions, which then activates an adaptive dynamic SNN responsible for classifying the specific attack type. Mimicking biological adaptation, the dynamic classifier utilizes Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms enable the network to learn new threats incrementally while preserving existing knowledge. Tested on the UNSW-NB15 benchmark in a continual learning setting, the architecture demonstrates robust adaptation, reduced catastrophic forgetting, and achieves 85.3% overall accuracy. Furthermore, simulations using the Intel Lava framework confirm high operational sparsity, highlighting the potential for low-power deployment on neuromorphic hardware.
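
The pair-based STDP rule that Ad-STDP builds on fits in a few lines: potentiate when the presynaptic spike precedes the postsynaptic one, depress otherwise, with exponential decay in the spike-time gap. The amplitudes and time constant below are illustrative, not the paper's adaptive values.

```python
import math

A_PLUS, A_MINUS = 0.05, 0.04  # potentiation / depression amplitudes
TAU = 20.0                    # decay time constant, in ms

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:    # pre fires before post: causal pairing, potentiate
        return A_PLUS * math.exp(-dt / TAU)
    return -A_MINUS * math.exp(dt / TAU)  # otherwise depress

print(stdp_dw(10.0, 15.0) > 0)  # True: potentiation
print(stdp_dw(15.0, 10.0) < 0)  # True: depression
```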

Updated: 2025-08-06 16:29:59

Categories: cs.LG,cs.AI,cs.ET,cs.NE

Download: http://arxiv.org/abs/2508.04610v1

EmoAugNet: A Signal-Augmented Hybrid CNN-LSTM Framework for Speech Emotion Recognition

Recognizing emotional signals in speech has a significant impact on enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework that incorporates Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The quality and variety of the features extracted from speech signals have a significant impact on how well SER systems perform. A comprehensive speech data augmentation strategy was used, combining traditional methods such as noise addition, pitch shifting, and time stretching with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency cepstral coefficients (MFCC), and zero-crossing rate (ZCR). Our model with ReLU activation has a weighted accuracy of 95.78% and unweighted accuracy of 92.52% on the IEMOCAP dataset and, with ELU activation, has a weighted accuracy of 96.75% and unweighted accuracy of 91.28%. On the RAVDESS dataset, we get a weighted accuracy of 94.53% and 94.98% unweighted accuracy for ReLU activation and 93.72% weighted accuracy and 94.64% unweighted accuracy for ELU activation. These results highlight EmoAugNet's effectiveness in improving the robustness and performance of SER systems through integrated data augmentation and hybrid modeling.
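
Two of the frame-level features named above are simple enough to sketch with NumPy. The frame and hop sizes are illustrative; MFCCs require a mel filterbank pipeline and would normally come from a library such as librosa, so they are omitted here.

```python
import numpy as np

def frame(signal, frame_len=1024, hop=512):
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)  # 1 s test tone at 220 Hz
f = frame(tone)
print(f.shape)  # (30, 1024): one RMSE and one ZCR value per frame
```

Stacking such per-frame values (plus MFCCs) gives the high-dimensional feature vector fed to the CNN-LSTM model.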

Updated: 2025-08-06 16:28:27

Categories: cs.SD,cs.HC,cs.LG

Download: http://arxiv.org/abs/2508.06321v1

Multitask Learning with Stochastic Interpolants

We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.
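
Ignoring the noise term of the interpolant, the core generalization can be sketched directly: replace the scalar time in x_t = (1 - t) x0 + t x1 with a linear operator T, so different coordinates can sit at different "times" along the bridge. The dimension and the diagonal choice of T below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x0 = rng.normal(size=d)   # a sample from the source distribution
x1 = rng.normal(size=d)   # a sample from the target distribution

def interpolant(T):
    """Operator-valued interpolant x_T = (I - T) x0 + T x1 (noise omitted)."""
    return (np.eye(d) - T) @ x0 + T @ x1

# scalar time t = 0.5 is recovered as T = 0.5 * I
assert np.allclose(interpolant(0.5 * np.eye(d)), 0.5 * (x0 + x1))

# a diagonal operator advances coordinate 0 fully while coordinate 3 stays at x0
T = np.diag([1.0, 0.7, 0.3, 0.0])
x_T = interpolant(T)
print(np.isclose(x_T[0], x1[0]), np.isclose(x_T[3], x0[3]))  # True True
```

Choosing which coordinates (or blocks) to advance is what lets one trained model express conditioning, inpainting, and multiscale generation.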

Updated: 2025-08-06 16:25:19

Categories: cs.LG,math.DS

Download: http://arxiv.org/abs/2508.04605v1

TURA: Tool-Augmented Unified Retrieval Agent for AI Search

The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
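
The DAG planner's core scheduling idea can be sketched with Kahn's algorithm: group tasks into waves whose members have no remaining dependencies, so each wave can execute in parallel. The task names and dependency graph below are invented for illustration, not taken from TURA.

```python
from collections import defaultdict

def parallel_waves(deps):
    """deps: task -> set of prerequisites; returns waves of parallel tasks."""
    indeg = {t: len(d) for t, d in deps.items()}
    children = defaultdict(list)
    for t, d in deps.items():
        for p in d:
            children[p].append(t)
    waves = []
    ready = sorted(t for t, k in indeg.items() if k == 0)
    while ready:
        waves.append(ready)
        nxt = []
        for t in ready:
            for c in children[t]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    nxt.append(c)
        ready = sorted(nxt)
    return waves

deps = {
    "parse_query": set(),
    "search_web": {"parse_query"},
    "check_inventory_api": {"parse_query"},
    "synthesize_answer": {"search_web", "check_inventory_api"},
}
print(parallel_waves(deps))
# [['parse_query'], ['check_inventory_api', 'search_web'], ['synthesize_answer']]
```

Running the static retrieval and the dynamic API call in the same wave is what keeps end-to-end latency low.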

Updated: 2025-08-06 16:24:17

Categories: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2508.04604v1

CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
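
The cost-volume construction (before any filtering) can be sketched as follows. The feature sizes, the cosine cost, and the K reference vectors are illustrative assumptions, not the paper's backbone features.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 32, 16
inp = rng.normal(size=(H, W, C))   # input image features
normals = rng.normal(size=(K, C))  # K reference features from normal samples

def cost_volume(inp, normals):
    f = inp / np.linalg.norm(inp, axis=-1, keepdims=True)
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    sim = np.einsum("hwc,kc->hwk", f, n)  # cosine similarity per location/match
    return 1.0 - sim                      # low cost = good match to a normal feature

vol = cost_volume(inp, normals)           # 2 spatial dims + 1 matching dim
print(vol.shape)                          # (8, 8, 16)
score_map = vol.min(axis=-1)              # naive pre-filtering anomaly score
print(score_map.shape)                    # (8, 8)
```

The filtering network then refines this raw volume along both spatial and matching dimensions before the anomaly map is read off.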

Updated: 2025-08-06 16:20:56

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2505.01476v3

Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications

Large language models (LLMs) significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes it challenging to deploy them on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach to effectively compress these models, tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components not needed for particular applications. Our method focuses on identifying and removing these redundant parts, retaining only the necessary elements for the target applications. Specifically, we represent the weight matrices of LLMs as a linear combination of base components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression results on the Llama 2-7b and -13B models, conducted on target applications including mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining comparable accuracy to state-of-the-art low-rank compression techniques.
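The basis view of a weight matrix can be sketched as follows. This is a simplified illustration: the decomposition itself (e.g. obtaining the bases) and the method's actual importance measure are not specified here, so `relevance` is a hypothetical stand-in for a score derived from target-application data.

```python
def reconstruct(coeffs, bases):
    """Approximate a flattened weight matrix as W ~ sum_i c_i * B_i."""
    out = [0.0] * len(bases[0])
    for c, b in zip(coeffs, bases):
        for j, bj in enumerate(b):
            out[j] += c * bj
    return out

def prune_bases(coeffs, bases, relevance, keep):
    """Keep only the `keep` bases scored most relevant to the target task.

    Bases with low relevance are treated as redundant for the target
    application and dropped, shrinking the model.
    """
    ranked = sorted(range(len(bases)), key=lambda i: -relevance[i])
    kept = sorted(ranked[:keep])  # preserve the original base ordering
    return [coeffs[i] for i in kept], [bases[i] for i in kept]
```

Pruning the low-relevance base leaves the reconstruction of the task-relevant components intact while reducing the stored parameter count.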

Updated: 2025-08-06 16:19:30

Categories: cs.LG,cs.AR,cs.CL

Download: http://arxiv.org/abs/2405.15877v2

Personalized One-shot Federated Graph Learning for Heterogeneous Clients

Federated Graph Learning (FGL) has emerged as a promising paradigm for breaking data silos among distributed private graphs. In practical scenarios involving heterogeneous distributed graph data, personalized Federated Graph Learning (pFGL) aims to enhance model utility by training personalized models tailored to client needs. However, existing pFGL methods often require numerous communication rounds under heterogeneous graphs, leading to significant communication overhead and security concerns. While One-shot Federated Learning (OFL) enables collaboration in a single round, existing OFL methods are designed for image-centric tasks and are ineffective for graph data, leaving a critical gap in the field. Additionally, personalized models derived from existing methods suffer from bias, failing to effectively generalize to the minority. To address these challenges, we propose the first \textbf{O}ne-shot \textbf{p}ersonalized \textbf{F}ederated \textbf{G}raph \textbf{L}earning method (\textbf{O-pFGL}) for node classification, compatible with Secure Aggregation protocols for privacy preservation. Specifically, for effective graph learning in one communication round, our method estimates and aggregates class-wise feature distribution statistics to construct a global surrogate graph on the server, facilitating the training of a global graph model. To mitigate bias, we introduce a two-stage personalized training approach that adaptively balances local personal information and global insights from the surrogate graph, improving both personalization and generalization. Extensive experiments on 14 diverse real-world graph datasets demonstrate that our method significantly outperforms state-of-the-art baselines across various settings.
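The class-wise statistics aggregation can be sketched as below. This is a simplified illustration (the method also uses the aggregated statistics to build a global surrogate graph, which is omitted); the point of exchanging only per-class sums and counts, rather than raw node features, is what makes the scheme compatible with Secure Aggregation.

```python
def client_class_stats(features, labels):
    """Per-class feature sums and counts, computed locally on one client."""
    stats = {}
    for x, y in zip(features, labels):
        s, n = stats.get(y, ([0.0] * len(x), 0))
        stats[y] = ([a + b for a, b in zip(s, x)], n + 1)
    return stats

def aggregate_class_means(all_client_stats):
    """Server-side aggregation into global class-wise feature means.

    Only summable statistics cross the wire, so this single round of
    communication suffices and no client reveals individual samples.
    """
    sums, counts = {}, {}
    for stats in all_client_stats:
        for y, (s, n) in stats.items():
            if y not in sums:
                sums[y] = [0.0] * len(s)
                counts[y] = 0
            sums[y] = [a + b for a, b in zip(sums[y], s)]
            counts[y] += n
    return {y: [v / counts[y] for v in sums[y]] for y in sums}
```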

Updated: 2025-08-06 16:14:04

Categories: cs.LG

Download: http://arxiv.org/abs/2411.11304v7

Improved Training Strategies for Physics-Informed Neural Networks using Real Experimental Data in Aluminum Spot Welding

Resistance spot welding is the dominant joining process for the body-in-white in the automotive industry, where the weld nugget diameter is the key quality metric. Its measurement requires destructive testing, limiting the potential for efficient quality control. Physics-informed neural networks were investigated as a promising tool to reconstruct internal process states from experimental data, enabling model-based and non-invasive quality assessment in aluminum spot welding. A major challenge is the integration of real-world data into the network due to competing optimization objectives. To address this, we introduce two novel training strategies. First, experimental losses for dynamic displacement and nugget diameter are progressively included using a fading-in function to prevent excessive optimization conflicts. We also implement a custom learning rate scheduler and early stopping based on a rolling window to counteract premature reduction due to increased loss magnitudes. Second, we introduce a conditional update of temperature-dependent material parameters via a look-up table, activated only after a loss threshold is reached to ensure physically meaningful temperatures. An axially symmetric two-dimensional model was selected to represent the welding process accurately while maintaining computational efficiency. To reduce computational burden, the training strategies and model components were first systematically evaluated in one dimension, enabling controlled analysis of loss design and contact models. The two-dimensional network predicts dynamic displacement and nugget growth within the experimental confidence interval, supports transferring welding stages from steel to aluminum, and demonstrates strong potential for fast, model-based quality control in industrial applications.
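The abstract does not spell out the fading-in function, so the sketch below assumes a half-cosine ramp as one plausible form; `start` and `duration` are hypothetical hyperparameters, and the real loss has more terms than the two shown.

```python
import math

def fade_in(epoch, start, duration):
    """Smooth 0-to-1 ramp for an experimental loss term.

    Before `start` the weight is 0, so early training is governed by
    the physics residual alone; over `duration` epochs the weight
    rises to 1 along a half-cosine, avoiding the sudden jump in loss
    magnitude that causes optimization conflicts.
    """
    if epoch <= start:
        return 0.0
    if epoch >= start + duration:
        return 1.0
    t = (epoch - start) / duration
    return 0.5 * (1.0 - math.cos(math.pi * t))

def total_loss(physics_loss, exp_loss, epoch, start=100, duration=200):
    """Physics residual plus progressively faded-in experimental loss."""
    return physics_loss + fade_in(epoch, start, duration) * exp_loss
```

The same growing loss magnitude motivates the paper's rolling-window learning-rate scheduler: a plain scheduler would misread the fade-in-driven increase as divergence and cut the rate prematurely.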

Updated: 2025-08-06 16:14:00

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04595v1

GraphProp: Training the Graph Foundation Models using Graph Properties

This work focuses on training graph foundation models (GFMs) that have strong generalization ability in graph-level tasks such as graph classification. Effective GFM training requires capturing information consistent across different domains. We discover that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional GFMs primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce GraphProp, which emphasizes structural generalization. The training process of GraphProp consists of two main phases. First, we train a structural GFM by predicting graph invariants. Since graph invariants are properties of graphs that depend only on the abstract structure, not on particular labellings or drawings of the graph, this structural GFM has a strong ability to capture the abstract structural information and provide discriminative graph representations comparable across diverse domains. In the second phase, we use the representations given by the structural GFM as positional encodings to train a comprehensive GFM. This phase utilizes domain-specific node attributes and graph labels to further improve cross-domain node feature generalization. Our experiments demonstrate that GraphProp significantly outperforms the competitors in supervised learning and few-shot learning, especially in handling graphs without node attributes.
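A few such invariants can be computed directly from an edge list. The sketch below is illustrative (the abstract does not list which invariants are used as prediction targets); the key property it demonstrates is that relabelling the nodes leaves the invariant vector unchanged.

```python
def graph_invariants(n, edges):
    """Simple invariants of an undirected graph given as an edge list.

    Each value depends only on the abstract structure, not on any
    particular labelling or drawing of the graph.
    """
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degrees = sorted(len(adj[i]) for i in range(n))
    # Each triangle is seen once from each of its three edges.
    triangles = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
    return {
        "num_nodes": n,
        "num_edges": len(edges),
        "degree_sequence": degrees,
        "num_triangles": triangles,
    }
```

Because any relabelling of the same graph yields the identical invariant dictionary, a model trained to predict such targets is pushed toward representations that are comparable across domains.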

Updated: 2025-08-06 16:12:42

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04594v1

Algebraically Observable Physics-Informed Neural Network and its Application to Epidemiological Modelling

Physics-Informed Neural Network (PINN) is a deep learning framework that integrates the governing equations underlying data into a loss function. In this study, we consider the problem of estimating state variables and parameters in epidemiological models governed by ordinary differential equations using PINNs. In practice, not all trajectory data corresponding to the population described by models can be measured. Learning PINNs to estimate the unmeasured state variables and epidemiological parameters using partial measurements is challenging. Accordingly, we introduce the concept of algebraic observability of the state variables. Specifically, we propose augmenting the unmeasured data based on algebraic observability analysis. The validity of the proposed method is demonstrated through numerical experiments under three scenarios in the context of epidemiological modelling. Specifically, given noisy and partial measurements, the accuracy of unmeasured states and parameter estimation of the proposed method is shown to be higher than that of the conventional methods. The proposed method is also shown to be effective in practical scenarios, such as when the data corresponding to certain variables cannot be reconstructed from the measurements.

Updated: 2025-08-06 16:09:11

Categories: cs.SC,cs.LG,math.DS,q-bio.QM

Download: http://arxiv.org/abs/2508.04590v1

A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI

Accurate estimation of intravoxel incoherent motion (IVIM) parameters from diffusion-weighted MRI remains challenging due to the ill-posed nature of the inverse problem and high sensitivity to noise, particularly in the perfusion compartment. In this work, we propose a probabilistic deep learning framework based on Deep Ensembles (DE) of Mixture Density Networks (MDNs), enabling estimation of total predictive uncertainty and decomposition into aleatoric (AU) and epistemic (EU) components. The method was benchmarked against non probabilistic neural networks, a Bayesian fitting approach and a probabilistic network with single Gaussian parametrization. Supervised training was performed on synthetic data, and evaluation was conducted on both simulated and two in vivo datasets. The reliability of the quantified uncertainties was assessed using calibration curves, output distribution sharpness, and the Continuous Ranked Probability Score (CRPS). MDNs produced more calibrated and sharper predictive distributions for the D and f parameters, although slight overconfidence was observed in D*. The Robust Coefficient of Variation (RCV) indicated smoother in vivo estimates for D* with MDNs compared to Gaussian model. Despite the training data covering the expected physiological range, elevated EU in vivo suggests a mismatch with real acquisition conditions, highlighting the importance of incorporating EU, which was allowed by DE. Overall, we present a comprehensive framework for IVIM fitting with uncertainty quantification, which enables the identification and interpretation of unreliable estimates. The proposed approach can also be adopted for fitting other physical models through appropriate architectural and simulation adjustments.

Updated: 2025-08-06 16:08:55

Categories: eess.IV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04588v1

Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference

Artificial Intelligence (AI) conferences are essential for advancing research, sharing knowledge, and fostering academic community. However, their rapid expansion has rendered the centralized conference model increasingly unsustainable. This paper offers a data-driven diagnosis of a structural crisis that threatens the foundational goals of scientific dissemination, equity, and community well-being. We identify four key areas of strain: (1) scientifically, with per-author publication rates more than doubling over the past decade to over 4.5 papers annually; (2) environmentally, with the carbon footprint of a single conference exceeding the daily emissions of its host city; (3) psychologically, with 71% of online community discourse reflecting negative sentiment and 35% referencing mental health concerns; and (4) logistically, with attendance at top conferences such as NeurIPS 2024 beginning to outpace venue capacity. These pressures point to a system that is misaligned with its core mission. In response, we propose the Community-Federated Conference (CFC) model, which separates peer review, presentation, and networking into globally coordinated but locally organized components, offering a more sustainable, inclusive, and resilient path forward for AI research.

Updated: 2025-08-06 16:08:27

Categories: cs.CY,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.04586v1

Measuring the Carbon Footprint of Cryptographic Privacy-Enhancing Technologies

Privacy-enhancing technologies (PETs) have attracted significant attention in response to privacy regulations, driving the development of applications that prioritize user data protection. At the same time, the information and communication technology (ICT) sector faces growing pressure to reduce its environmental footprint, particularly its carbon emissions. While numerous studies have assessed the energy footprint of various ICT applications, the environmental footprint of cryptographic PETs remains largely unexplored. Our work addresses this gap by proposing a standardized methodology for evaluating the carbon footprint of PETs. To demonstrate this methodology, we focus on PETs supporting client-server applications as they are the simplest to deploy. In particular, we measure the energy consumption and carbon footprint increase induced by five cryptographic PETs (compared to their non-private equivalent): HTTPS web browsing, encrypted machine learning (ML) inference, encrypted ML training, encrypted databases, and encrypted emails. Our findings reveal significant variability in carbon footprint increases, ranging from a twofold increase in HTTPS web browsing to a 100,000-fold increase in encrypted ML. Our study provides essential data to help decision-makers assess privacy-carbon trade-offs in such applications. Finally, we outline key research directions for developing PETs that balance strong privacy protection with environmental sustainability.

Updated: 2025-08-06 16:07:29

Categories: cs.CR

Download: http://arxiv.org/abs/2508.04583v1

Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation

Memorization in generative models extends far beyond verbatim text reproduction--it manifests through non-literal patterns, semantic associations, and surprisingly, across modalities in transcript-conditioned generation tasks such as Lyrics-to-Song (L2S) and Text-to-Video (T2V) models. We reveal a new class of cross-modality memorization where models trained on these tasks leak copyrighted content through indirect, phonetic pathways invisible to traditional text-based analysis. In this work, we introduce Adversarial PhoneTic Prompting (APT), an attack that replaces iconic phrases with homophonic alternatives--e.g., "mom's spaghetti" becomes "Bob's confetti"--preserving the acoustic form while largely changing semantic content. We demonstrate that models can be prompted to regurgitate memorized songs using phonetically similar but semantically unrelated lyrics. Despite the semantic drift, black-box models like SUNO and open-source models like YuE generate outputs that are strikingly similar to the original songs--melodically, rhythmically, and vocally--achieving high scores on AudioJudge, CLAP, and CoverID. These effects persist across genres and languages. More surprisingly, we find that phonetic prompts alone can trigger visual memorization in text-to-video models: when given altered lyrics from Lose Yourself, Veo 3 generates scenes that mirror the original music video--complete with a hooded rapper and dim urban settings--despite no explicit visual cues in the prompt. This cross-modality leakage represents an unprecedented threat: models memorize deep, structural patterns that transcend their training modality, making traditional safety measures like copyright filters ineffective. Our findings reveal a fundamental vulnerability in transcript-conditioned generative models and raise urgent concerns around copyright, provenance, and secure deployment of multimodal generation systems.

Updated: 2025-08-06 16:06:47

Categories: cs.SD,cs.AI,cs.CL,eess.AS

Download: http://arxiv.org/abs/2507.17937v2

Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.
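The weight-sharing scheme can be sketched as follows. The atom count and matrix shapes are illustrative toy values, but the parameter arithmetic shows where the savings come from: the atom cost is paid once and amortized over all layers, while each layer adds only a small coefficient vector.

```python
def layer_weight(atoms, coeffs):
    """Per-layer projection matrix W_l = sum_i c_{l,i} * A_i.

    The matrix atoms A_i form a dictionary shared by every layer;
    only the coefficient vector c_l is layer-specific.
    """
    rows, cols = len(atoms[0]), len(atoms[0][0])
    W = [[0.0] * cols for _ in range(rows)]
    for c, A in zip(coeffs, atoms):
        for r in range(rows):
            for k in range(cols):
                W[r][k] += c * A[r][k]
    return W

def parameter_count(num_layers, rows, cols, num_atoms):
    """Dense per-layer storage vs. shared-dictionary storage."""
    dense = num_layers * rows * cols
    shared = num_atoms * rows * cols + num_layers * num_atoms
    return dense, shared
```

With 12 layers of 4x4 matrices and 4 shared atoms, dense storage needs 192 parameters versus 112 for the dictionary; at transformer scale, where many layers share a handful of atoms, this kind of amortization is what enables reductions on the order of the reported 66.7%.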

Updated: 2025-08-06 16:06:43

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04581v1

SDBench: A Comprehensive Benchmark Suite for Speaker Diarization

Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates. We benchmark 6 state-of-the-art systems including Deepgram, AWS Transcribe, and Pyannote AI API, revealing important trade-offs between accuracy and speed.

Updated: 2025-08-06 16:02:25

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2507.16136v2

ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges

Reasoning is a critical capability of multimodal large language models (MLLMs) for solving complex multimodal tasks, and judging the correctness of reasoning steps is crucial for improving this capability. Recently, MLLM-based process judges (MPJs) have been widely used to assess the correctness of reasoning steps in multimodal tasks. Therefore, evaluating MPJs is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs mainly focus on tasks such as step correctness classification and reasoning process search, while overlooking a key aspect: whether the confidence scores produced by MPJs at the step level are reliable. To address this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. Our benchmark constructs three types of adversarially perturbed reasoning steps: Synonym Substitution, Syntactic Transformation, and Image Perturbation, to test the robustness of MPJ confidence under perturbations. In addition, we introduce three novel evaluation metrics: Confidence Robustness Score (CRS), Confidence Sensitivity Score (CSS), and Confidence Calibration Score (CCS), which evaluate robustness, sensitivity, and calibration, respectively. We evaluate 14 state-of-the-art MLLMs, including both proprietary and open-source models. Experiments reveal limitations in current MPJs' confidence performance and offer competitive baselines to support future research.

Updated: 2025-08-06 16:00:19

Categories: cs.AI,I.2.6; I.2.7; D.2.8

Download: http://arxiv.org/abs/2508.04576v1

Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.

Updated: 2025-08-06 15:59:18

Categories: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2508.04575v1

Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation

Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLMs embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic depth and alignment encoded within LVLMs outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation for building robust and meaningful multimodal representations in recommendation tasks.

Updated: 2025-08-06 15:53:58

Categories: cs.IR,cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.04571v1

A Relative Ignorability Framework for Decision-Relevant Observability in Control Theory and Reinforcement Learning

Sequential decision-making systems routinely operate with missing or incomplete data. Classical reinforcement learning theory, which is commonly used to solve sequential decision problems, assumes Markovian observability, which may not hold under partial observability. Causal inference paradigms formalise ignorability of missingness. We show these views can be unified and generalized in order to guarantee Q-learning convergence even when the Markov property fails. To do so, we introduce the concept of relative ignorability. Relative ignorability is a graphical-causal criterion which refines the requirements for accurate decision-making based on incomplete data. Theoretical results and simulations both reveal that non-Markovian stochastic processes whose missingness is relatively ignorable with respect to causal estimands can still be optimized using standard Reinforcement Learning algorithms. These results expand the theoretical foundations of safe, data-efficient AI to real-world environments where complete information is unattainable.

Updated: 2025-08-06 15:51:18

Domains: cs.LG,stat.ME,60G

Download: http://arxiv.org/abs/2504.07722v6

CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization

The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across the audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. The agreement score is then utilized in a Cross-modal Salient Anchor Identification module, which identifies audio and visual anchor features through global-video and local temporal window identification mechanisms. After multimodal integration, the anchor features are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.
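The agreement score of the Mutual Event Agreement Evaluation module can be illustrated with a simple discrepancy measure between the two modalities' per-class probabilities. Total variation distance is assumed here for concreteness; the paper's exact formulation may differ:

```python
def agreement_score(p_audio, p_visual):
    """Agreement between audio and visual per-class probabilities.

    Uses 1 - 0.5 * L1 distance (total variation), an assumed discrepancy
    measure. Returns 1.0 for identical distributions, 0.0 for disjoint ones.
    """
    assert len(p_audio) == len(p_visual)
    return 1.0 - 0.5 * sum(abs(a - v) for a, v in zip(p_audio, p_visual))

# Timestamps where both modalities agree are salient-anchor candidates.
print(agreement_score([0.8, 0.2], [0.8, 0.2]))  # 1.0
print(agreement_score([1.0, 0.0], [0.0, 1.0]))  # 0.0
```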

Updated: 2025-08-06 15:49:53

Domains: cs.CV,cs.AI,cs.MM

Download: http://arxiv.org/abs/2508.04566v1

SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset

Fostering students' abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieve this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to execute effective guided dialogues that lead students to achieve knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically-aware LLMs.

Updated: 2025-08-06 15:49:26

Domains: cs.AI

Download: http://arxiv.org/abs/2508.04563v1

Attack Pattern Mining to Discover Hidden Threats to Industrial Control Systems

This work focuses on validation of attack pattern mining in the context of Industrial Control System (ICS) security. A comprehensive security assessment of an ICS requires generating a large and varied set of attack patterns. For this purpose we have proposed a data-driven technique to generate attack patterns for an ICS. The proposed technique has been used to generate over 100,000 attack patterns from data gathered from an operational water treatment plant. In this work we present a detailed case study to validate the attack patterns.

Updated: 2025-08-06 15:47:19

Domains: cs.CR,cs.LG

Download: http://arxiv.org/abs/2508.04561v1

CauKer: classification time series foundation models can be pretrained on synthetic data only

Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.
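The kernel-composition half of CauKer can be sketched by sampling a series from a Gaussian process whose covariance combines simple kernels. The kernel forms and parameters below are hypothetical, and the SCM step is omitted:

```python
import numpy as np

# Sketch of GP kernel composition for synthetic series. Kernels and
# parameters are illustrative; CauKer's kernel library and SCM differ.
def rbf(x, y, ls=10.0):
    """Squared-exponential kernel: smooth long-range correlation."""
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / ls**2)

def periodic(x, y, period=20.0, ls=1.0):
    """Periodic kernel: repeating (seasonal) structure."""
    d = np.abs(x[:, None] - y[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ls**2)

def sample_series(n=100, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(n, dtype=float)
    # Composition: smooth trend + locally-smooth seasonality
    # (sums and Schur products of PSD kernels stay PSD).
    K = rbf(t, t, ls=30.0) + rbf(t, t, ls=10.0) * periodic(t, t)
    K += 1e-6 * np.eye(n)  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(n), K)

series = sample_series()
print(series.shape)  # (100,)
```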

Updated: 2025-08-06 15:43:43

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.02879v2

LA-CaRe-CNN: Cascading Refinement CNN for Left Atrial Scar Segmentation

Atrial fibrillation (AF) is the most prevalent type of cardiac arrhythmia, and its treatment may require patients to undergo ablation therapy. In this surgery, cardiac tissue is deliberately scarred locally to prevent electrical signals from causing arrhythmia. Patient-specific cardiac digital twin models show great potential for personalized ablation therapy; however, they demand accurate semantic segmentation of healthy and scarred tissue, typically obtained from late gadolinium enhanced (LGE) magnetic resonance (MR) scans. In this work we propose the Left Atrial Cascading Refinement CNN (LA-CaRe-CNN), which aims to accurately segment the left atrium as well as left atrial scar tissue from LGE MR scans. LA-CaRe-CNN is a 2-stage CNN cascade trained end-to-end in 3D, where Stage 1 generates a prediction for the left atrium, which is then refined in Stage 2 in conjunction with the original image information to obtain a prediction for the left atrial scar tissue. To account for domain shift towards domains unknown during training, we employ strong intensity and spatial augmentation to increase the diversity of the training dataset. Our proposed method, based on a 5-fold ensemble, achieves strong segmentation results, namely 89.21% DSC and 1.6969 mm ASSD for the left atrium, as well as 64.59% DSC and 91.80% G-DSC for the more challenging left atrial scar tissue. Thus, segmentations obtained through LA-CaRe-CNN show great potential for the generation of patient-specific cardiac digital twin models and downstream tasks such as personalized targeted ablation therapy to treat AF.
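For reference, the Dice Similarity Coefficient reported above measures overlap between a predicted and a ground-truth mask; a minimal implementation for binary masks:

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice Similarity Coefficient for binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

pred = np.zeros((4, 4)); pred[1:3, 1:3] = 1  # 4 foreground pixels
gt = np.zeros((4, 4)); gt[1:3, 1:4] = 1      # 6 foreground pixels, overlap 4
print(round(dice(pred, gt), 3))  # 0.8  (= 2*4 / (4+6))
```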

Updated: 2025-08-06 15:37:30

Domains: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04553v1

Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation

As the leading cause of death worldwide, cardiovascular diseases motivate the development of more sophisticated methods to analyze the heart and its substructures from medical images such as Computed Tomography (CT) and Magnetic Resonance (MR) scans. Semantic segmentations of important cardiac structures that represent the whole heart are useful for assessing patient-specific cardiac morphology and pathology. Furthermore, accurate semantic segmentations can be used to generate cardiac digital twin models, enabling, e.g., electrophysiological simulation and personalized therapy planning. Even though deep learning-based methods for medical image segmentation achieved great advancements over the last decade, retaining good performance under domain shift (i.e., when training and test data are sampled from different data distributions) remains challenging. In order to perform well on domains known at training time, we employ (1) a balanced joint training approach that utilizes CT and MR data in equal amounts from different source domains. Further, to alleviate domain shift towards domains only encountered at test time, we rely on (2) strong intensity and spatial augmentation techniques to greatly diversify the available training data. Our proposed whole heart segmentation method, a 5-fold ensemble incorporating both contributions, achieves the best overall performance for MR data and performance close to the best for CT data when compared to a model trained solely on CT. With 93.33% DSC and 0.8388 mm ASSD for CT and 89.30% DSC and 1.2411 mm ASSD for MR data, our method demonstrates great potential to efficiently obtain accurate semantic segmentations from which patient-specific cardiac twin models can be generated.

Updated: 2025-08-06 15:37:22

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04552v1

MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment or to yield insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis as well as marine video generation. Additionally, we highlight the effectiveness of video splitting for detecting salient object transitions across scene changes, which significantly enriches the semantics of the captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.

Updated: 2025-08-06 15:34:24

Domains: cs.CV,cs.AI,cs.MM

Download: http://arxiv.org/abs/2508.04549v1

Beyond risk: A proto-framework for assessing the societal impact of AI systems

In the discourse on AI regulation, 'responsible AI' is the dominant paradigm, with the focus on mitigating the risks related to AI systems. While this focus is important and necessary, it has limited use for a systematic consideration of AI's societal impact. This paper proposes a proto-framework for assessing the societal impact of AI systems by operationalising the concept of freedom. This proto-framework is intended as a step towards a fully operationalised framework to be used in policymaking contexts. By drawing on Kantian philosophy and related contemporary interpretations, freedom is developed as the counterpart to the concept of responsibility. Two dimensions of freedom are developed in further detail: freedom as capability and freedom as opportunity. These two dimensions of freedom are then applied in a proto-framework that systematically considers AI's impact on society using the Sustainable Development Goals. This proto-framework aims to complement current risk-based approaches and thereby offers a first step towards operationalising the concept of freedom in AI regulation.

Updated: 2025-08-06 15:33:00

Domains: cs.CY,cs.AI,cs.ET

Download: http://arxiv.org/abs/2508.03666v2

Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape

It is difficult for individuals and organizations to protect personal information without a fundamental understanding of relative privacy risks. By analyzing over 5,000 empirical identity theft and fraud cases, this research identifies which types of personal data are exposed, how frequently exposures occur, and what the consequences of those exposures are. We construct an Identity Ecosystem graph, a foundational graph-based model in which nodes represent personally identifiable information (PII) attributes and edges represent empirical disclosure relationships between them (e.g., the probability that one PII attribute is exposed due to the exposure of another). Leveraging this graph structure, we develop a privacy risk prediction framework that uses graph theory and graph neural networks to estimate the likelihood of further disclosures when certain PII attributes are compromised. The results show that our approach effectively answers the core question: Can the disclosure of a given identity attribute possibly lead to the disclosure of another attribute?
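A toy version of the Identity Ecosystem graph illustrates the idea. The attribute names, edge probabilities, and propagation rule below are all hypothetical stand-ins for the paper's empirically estimated graph and GNN-based predictor:

```python
# Hypothetical Identity Ecosystem graph:
# edges[u][v] = P(v is exposed | u is exposed)
edges = {
    "email":    {"password": 0.30, "name": 0.50},
    "password": {"bank_account": 0.20},
    "name":     {"address": 0.40},
}

def exposure_risk(source):
    """Risk of each attribute = best path from the compromised source,
    with edge probabilities compounding multiplicatively along the path
    (an assumed propagation rule, used here in place of a trained GNN)."""
    risk = {source: 1.0}
    frontier = [source]
    while frontier:
        u = frontier.pop()
        for v, w in edges.get(u, {}).items():
            r = risk[u] * w
            if r > risk.get(v, 0.0):
                risk[v] = r
                frontier.append(v)
    return risk

risk = exposure_risk("email")
# email -> password -> bank_account compounds to 0.3 * 0.2
print(round(risk["bank_account"], 2))  # 0.06
```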

Updated: 2025-08-06 15:30:07

Domains: cs.LG,cs.CR,cs.SI

Download: http://arxiv.org/abs/2508.04542v1

InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation

Collaborative partnership matters in inquiry-oriented education. However, most study partners are selected either through experience-based assignment with little scientific planning or via rule-based machine assistants, which struggle with knowledge expansion and offer inadequate flexibility. This paper proposes an LLM-empowered agent model, named InqEduAgent, for simulating and selecting learning partners tailored to inquiry-oriented learning. Generative agents are designed to capture the cognitive and evaluative features of learners in real-world scenarios. An adaptive matching algorithm with Gaussian process augmentation is then formulated to identify patterns within prior knowledge. Optimal learning-partner matches are provided for learners facing different exercises. The experimental results show the optimal performance of InqEduAgent in most knowledge-learning scenarios and LLM environments with different levels of capability. This study promotes the intelligent allocation of human learning partners and the design of AI-based learning partners. The code, data, and appendix are publicly available at https://github.com/InqEduAgent/InqEduAgent.

Updated: 2025-08-06 15:28:04

Domains: cs.AI

Download: http://arxiv.org/abs/2508.03174v2

The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover

The rapid adoption of Large Language Model (LLM) agents and multi-agent systems enables remarkable capabilities in natural language processing and generation. However, these systems introduce unprecedented security vulnerabilities that extend beyond traditional content generation attacks to system-level compromise. This paper presents a comprehensive evaluation of the security of LLMs used as reasoning engines within autonomous agents, highlighting how they can be exploited as attack vectors capable of achieving complete computer takeover. We focus on how different attack surfaces and trust boundaries - Direct Prompt Injection, RAG Backdoor, and Inter Agent Trust - can be leveraged to orchestrate such takeovers. We demonstrate that adversaries can effectively coerce popular LLMs (including GPT-4, Claude-4 and Gemini-2.5) into autonomously installing and executing malware on victim machines. Our evaluation of 18 state-of-the-art LLMs reveals an alarming scenario: 94.4% of models succumb to Direct Prompt Injection and 83.3% are vulnerable to the more stealth and evasive RAG Backdoor Attack. Notably, we tested trust boundaries within multi-agent systems, where LLM agents interact and influence each other, and we revealed a critical security flaw: LLMs which successfully resist direct injection or RAG backdoor will execute identical payloads when requested by peer agents. Our findings show that 100.0% of tested LLMs can be compromised through Inter-Agent Trust Exploitation attacks and that every model exhibits context-dependent security behaviors that create exploitable blind spots. Our results also highlight the need to increase awareness and research on the security risks of LLMs, showing a paradigm shift in cybersecurity threats, where AI tools themselves become sophisticated attack vectors.

Updated: 2025-08-06 15:27:03

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2507.06850v4

Improving Sequential Market Coordination via Value-oriented Renewable Energy Forecasting

High penetration of renewable energy sources (RESs) introduces substantial uncertainty into electricity markets. The current deterministic clearing approach in the day-ahead (DA) market, where RESs participate based on expected production, has been criticized for causing a lack of coordination between the DA and real-time (RT) markets, leading to high overall operating costs. Previous works indicate that improving the day-ahead RES entering quantities can significantly mitigate the drawbacks of deterministic clearing. In this work, we propose using a trained forecasting model, referred to as value-oriented forecasting, to determine RES Improved Entering Quantities (RIEQ) more efficiently during the operational phase. Unlike traditional models that minimize statistical forecasting errors, our approach trains model parameters to minimize the expected overall operating costs across both DA and RT markets. We derive the exact form of the loss function used for training, which becomes piecewise linear when market clearing is modeled by linear programs. Additionally, we provide the analytical gradient of the loss function with respect to the forecast, enabling an efficient training strategy. Numerical studies demonstrate that our forecasts significantly reduce overall operating costs for deterministic market clearing compared to conventional forecasts based on expected RES production.
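A toy two-settlement example illustrates why the cost-minimizing entering quantity can differ from the expected production. The prices and scenario set below are invented; the paper instead derives the (piecewise-linear) loss from the market-clearing linear programs:

```python
# Hypothetical prices for a toy two-settlement market.
C_DA = 20.0   # day-ahead energy price
C_UP = 50.0   # real-time up-regulation price when RES under-delivers
C_DN = 5.0    # credit for down-regulation when RES over-delivers

def operating_cost(q, actual, demand=100.0):
    """Overall DA + RT cost when quantity q is entered for the RES."""
    da = C_DA * (demand - q)            # conventional energy bought day-ahead
    shortfall = max(0.0, q - actual)    # RES delivered less than entered
    surplus = max(0.0, actual - q)
    return da + C_UP * shortfall - C_DN * surplus

# Equally likely RES production scenarios; expected production is 80.
scenarios = [60.0, 80.0, 100.0]
best_q = min((sum(operating_cost(q, a) for a in scenarios) / 3, q)
             for q in range(0, 101, 10))[1]
# The value-oriented quantity (60) differs from expected production (80).
print(best_q)  # 60
```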

Updated: 2025-08-06 15:25:24

Domains: eess.SY,cs.LG,cs.SY

Download: http://arxiv.org/abs/2405.09004v3

A Survey of Controllable Learning: Methods and Applications in Information Retrieval

Controllability has become a crucial aspect of trustworthy machine learning, enabling learners to meet predefined targets and adapt dynamically at test time without requiring retraining as the targets shift. We provide a formal definition of controllable learning (CL), and discuss its applications in information retrieval (IR) where information needs are often complex and dynamic. The survey categorizes CL according to what is controllable (e.g., multiple objectives, user portrait, scenario adaptation), who controls (users or platforms), how control is implemented (e.g., rule-based method, Pareto optimization, hypernetwork and others), and where to implement control (e.g., pre-processing, in-processing, post-processing methods). Then, we identify challenges faced by CL across training, evaluation, task setting, and deployment in online environments. Additionally, we outline promising directions for CL in theoretical analysis, efficient computation, empowering large language models, application scenarios and evaluation frameworks.

Updated: 2025-08-06 15:22:29

Domains: cs.LG,cs.IR

Download: http://arxiv.org/abs/2407.06083v3

LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models

Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the "Soft Thinking" capabilities of various LLMs by examining the models' internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce randomness, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.
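The Gumbel-Softmax trick referenced above can be sketched in a few lines; this is the generic sampling procedure, not the paper's full decoding loop:

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Gumbel-Softmax trick: a soft, randomized sample from a categorical
    distribution. Lower tau -> closer to one-hot (less smooth); higher tau
    -> closer to uniform. The result is a soft token over the vocabulary."""
    if rng is None:
        rng = np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()

soft_token = gumbel_softmax(np.log([0.7, 0.2, 0.1]))
print(round(soft_token.sum(), 6))  # 1.0
```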

Updated: 2025-08-06 15:16:05

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.03440v2

Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning

Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose guiding the reasoning process with clinical expertise, which consistently improves LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.

Updated: 2025-08-06 15:13:24

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04531v1

SLR: Automated Synthesis for Scalable Logical Reasoning

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
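The idea of a validation program that turns model outputs into verifiable rewards can be sketched on a toy inductive task. The rule, examples, and reward below are hypothetical; SLR synthesizes such programs automatically from a task specification:

```python
# Latent ground-truth rule for a hypothetical inductive-reasoning task.
def ground_truth(x):
    return x % 3 == 0 and x > 4

# Labeled examples generated from the latent rule.
examples = [(x, ground_truth(x)) for x in range(20)]

def validate(candidate_rule):
    """Executable validation program: fraction of examples the model's
    hypothesized rule classifies correctly, i.e. a verifiable reward in [0, 1]."""
    return sum(candidate_rule(x) == y for x, y in examples) / len(examples)

# A model output hypothesizing "multiples of 3" misses the x > 4 condition,
# so it is wrong on x = 0 and x = 3 (18 of 20 correct).
print(validate(lambda x: x % 3 == 0))  # 0.9
```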

Updated: 2025-08-06 15:09:52

Domains: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2506.15787v4

RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face-specific detectors or general AI-generated-content detectors, lack transparency because they frame detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in distinguishing real from fake imagery and in providing interpretable rationales through both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

Updated: 2025-08-06 15:08:16

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04524v1

Conditional Fetal Brain Atlas Learning for Automatic Tissue Segmentation

Magnetic Resonance Imaging (MRI) of the fetal brain has become a key tool for studying brain development in vivo. Yet its assessment remains challenging due to variability in brain maturation, imaging protocols, and uncertain estimates of Gestational Age (GA). To overcome these challenges, brain atlases provide a standardized reference framework that facilitates objective evaluation and comparison across subjects by aligning the atlas and subjects in a common coordinate system. In this work, we introduce a novel deep-learning framework for generating continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. The framework combines a direct registration model with a conditional discriminator, and is trained on a curated dataset of 219 neurotypical fetal MRIs spanning 21 to 37 weeks of gestation. The method achieves high registration accuracy, captures dynamic anatomical changes with sharp structural detail, and delivers robust segmentation performance, with an average Dice Similarity Coefficient (DSC) of 86.3% across six brain tissues. Furthermore, volumetric analysis of the generated atlases reveals detailed neurotypical growth trajectories, providing valuable insights into the maturation of the fetal brain. This approach enables individualized developmental assessment with minimal pre-processing and real-time performance, supporting both research and clinical applications. The model code is available at https://github.com/cirmuw/fetal-brain-atlas
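
The reported DSC is the standard Dice overlap; a minimal sketch over voxel-index sets follows (a hedged illustration, not the linked repository's pipeline):

```python
def dice(mask_a, mask_b):
    """Dice Similarity Coefficient 2|A & B| / (|A| + |B|) between two
    binary masks given as collections of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks agree by convention
    return 2.0 * len(a & b) / (len(a) + len(b))

# Two 3-voxel masks sharing 2 voxels: DSC = 2*2 / (3+3) = 2/3.
overlap = dice({(0, 0), (0, 1), (1, 0)}, {(0, 1), (1, 0), (1, 1)})
```

An average DSC of 86.3% then corresponds to this score averaged over subjects and the six tissue classes.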

Updated: 2025-08-06 15:07:39

Domains: eess.IV,cs.CV,cs.LG,68T07 (Primary) 92C50 (Secondary),I.4.9; I.4.6; I.2.0

Download: http://arxiv.org/abs/2508.04522v1

Channel-Independent Federated Traffic Prediction

In recent years, traffic prediction has achieved remarkable success and has become an integral component of intelligent transportation systems. However, traffic data is typically distributed among multiple data owners, and privacy constraints prevent the direct utilization of these isolated datasets for traffic prediction. Most existing federated traffic prediction methods focus on designing communication mechanisms that allow models to leverage information from other clients in order to improve prediction accuracy. Unfortunately, such approaches often incur substantial communication overhead, and the resulting transmission delays significantly slow down the training process. As the volume of traffic data continues to grow, this issue becomes increasingly critical, making the resource consumption of current methods unsustainable. To address this challenge, we propose a novel variable relationship modeling paradigm for federated traffic prediction, termed the Channel-Independent Paradigm (CIP). Unlike traditional approaches, CIP eliminates the need for inter-client communication by enabling each node to perform efficient and accurate predictions using only local information. Based on the CIP, we further develop Fed-CI, an efficient federated learning framework, allowing each client to process its own data independently while effectively mitigating the information loss caused by the lack of direct data sharing among clients. Fed-CI significantly reduces communication overhead, accelerates the training process, and achieves state-of-the-art performance while complying with privacy regulations. Extensive experiments on multiple real-world datasets demonstrate that Fed-CI consistently outperforms existing methods across all datasets and federated settings. It achieves improvements of 8%, 14%, and 16% in RMSE, MAE, and MAPE, respectively, while also substantially reducing communication costs.
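
A deliberately tiny sketch of the channel-independent idea (the toy predictor and names are ours, not Fed-CI's model): every client fits a forecaster on its local series only, so no gradients, embeddings, or raw data ever cross client boundaries.

```python
def fit_local_predictor(series):
    """Naive stand-in for a local model: forecast the mean of the
    last two observations. Nothing leaves the client."""
    def predict():
        return (series[-1] + series[-2]) / 2.0
    return predict

# Each "client" holds its own traffic series and trains independently.
clients = {"sensor_a": [10.0, 12.0, 14.0], "sensor_b": [3.0, 3.0, 5.0]}
forecasts = {name: fit_local_predictor(xs)() for name, xs in clients.items()}
```

The point of the sketch is the communication pattern, not the predictor: under CIP the only coordination cost is whatever the federated aggregation itself requires, not per-round exchange of cross-client features.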

Updated: 2025-08-06 15:02:28

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04517v1

Avoiding Catastrophe in Online Learning by Asking for Help

Most learning algorithms with formal regret guarantees assume that all mistakes are recoverable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are "catastrophic", i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe in that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We also assume that the agent can transfer knowledge between similar inputs. We first show that in general, any algorithm either queries the mentor at a linear rate or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Although our focus is the product of payoffs, we provide matching bounds for the typical additive regret. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
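
The objective above, maximizing the product of per-round payoffs, is conveniently handled in log space; a small numerical sketch (function name is ours):

```python
import math

def chance_of_no_catastrophe(payoffs):
    """Each payoff is that round's probability of avoiding catastrophe;
    the overall chance is their product, accumulated as a sum of logs
    for numerical stability over long horizons."""
    return math.exp(sum(math.log(p) for p in payoffs))

overall = chance_of_no_catastrophe([0.99, 0.95, 0.9])
```

The log-space form also makes clear why the product objective differs from additive regret: a single near-zero payoff (one near-certain catastrophe) drives the whole product toward zero, no matter how good the other rounds were.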

Updated: 2025-08-06 14:58:13

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2402.08062v6

Learning richness modulates equality reasoning in neural networks

Equality reasoning is ubiquitous and purely abstract: sameness or difference may be evaluated no matter the nature of the underlying objects. As a result, same-different (SD) tasks have been extensively studied as a starting point for understanding abstract reasoning in humans and across animal species. With the rise of neural networks that exhibit striking apparent proficiency for abstractions, equality reasoning in these models has also gained interest. Yet despite extensive study, conclusions about equality reasoning vary widely and with little consensus. To clarify the underlying principles in learning SD tasks, we develop a theory of equality reasoning in multi-layer perceptrons (MLP). Following observations in comparative psychology, we propose a spectrum of behavior that ranges from conceptual to perceptual outcomes. Conceptual behavior is characterized by task-specific representations, efficient learning, and insensitivity to spurious perceptual details. Perceptual behavior is characterized by strong sensitivity to spurious perceptual details, accompanied by the need for exhaustive training to learn the task. We develop a mathematical theory to show that an MLP's behavior is driven by learning richness. Rich-regime MLPs exhibit conceptual behavior, whereas lazy-regime MLPs exhibit perceptual behavior. We validate our theoretical findings in vision SD experiments, showing that rich feature learning promotes success by encouraging hallmarks of conceptual behavior. Overall, our work identifies feature learning richness as a key parameter modulating equality reasoning, and suggests that equality reasoning in humans and animals may similarly depend on learning richness in neural circuits.

Updated: 2025-08-06 14:57:17

Domains: cs.LG,cs.NE

Download: http://arxiv.org/abs/2503.09781v3

Reconstructing Physics-Informed Machine Learning for Traffic Flow Modeling: a Multi-Gradient Descent and Pareto Learning Approach

Physics-informed machine learning (PIML) is crucial in modern traffic flow modeling because it combines the benefits of both physics-based and data-driven approaches. In conventional PIML, physical information is typically incorporated by constructing a hybrid loss function that combines data-driven loss and physics loss through linear scalarization. The goal is to find a trade-off between these two objectives to improve the accuracy of model predictions. However, from a mathematical perspective, linear scalarization is limited to identifying only the convex region of the Pareto front, as it treats data-driven and physics losses as separate objectives. Given that most PIML loss functions are non-convex, linear scalarization restricts the achievable trade-off solutions. Moreover, tuning the weighting coefficients for the two loss components can be both time-consuming and computationally challenging. To address these limitations, this paper introduces a paradigm shift in PIML by reformulating the training process as a multi-objective optimization problem, treating data-driven loss and physics loss independently. We apply several multi-gradient descent algorithms (MGDAs), including traditional multi-gradient descent (TMGD) and dual cone gradient descent (DCGD), to explore the Pareto front in this multi-objective setting. These methods are evaluated on both macroscopic and microscopic traffic flow models. In the macroscopic case, MGDAs achieved comparable performance to traditional linear scalarization methods. Notably, in the microscopic case, MGDAs significantly outperformed their scalarization-based counterparts, demonstrating the advantages of a multi-objective optimization approach in complex PIML scenarios.
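
A hedged sketch of the classic two-objective multi-gradient step that TMGD-style methods build on (our own minimal version, not the paper's code): given the data-loss gradient g1 and the physics-loss gradient g2, the min-norm convex combination d = a*g1 + (1-a)*g2, with a = clip(<g2 - g1, g2> / ||g1 - g2||^2, 0, 1), is a common descent direction for both objectives whenever one exists.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mgda_direction(g1, g2):
    """Min-norm point in the convex hull of {g1, g2} (closed form for
    the two-objective case)."""
    diff = [a - b for a, b in zip(g1, g2)]        # g1 - g2
    denom = dot(diff, diff)
    if denom == 0.0:                              # identical gradients
        return list(g1)
    alpha = dot(g2, [-d for d in diff]) / denom   # <g2, g2 - g1> / ||g1 - g2||^2
    alpha = max(0.0, min(1.0, alpha))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]

# Orthogonal gradients: the combined direction still has positive inner
# product with both, i.e. it decreases both losses.
d = mgda_direction([1.0, 0.0], [0.0, 1.0])
```

Unlike linear scalarization with fixed weights, the coefficient `alpha` is recomputed from the gradients at every step, which is what lets the iterates trace non-convex parts of the Pareto front.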

Updated: 2025-08-06 14:56:18

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2505.13241v2

Argumentative Debates for Transparent Bias Detection [Technical Report]

As the use of AI systems in society grows, addressing potential biases that emerge from data or are learned by models is essential to prevent systematic disadvantages against specific groups. Several notions of (un)fairness have been proposed in the literature, alongside corresponding algorithmic methods for detecting and mitigating unfairness, but, with very few exceptions, these tend to ignore transparency. Instead, interpretability and explainability are core requirements for algorithmic fairness, even more so than for other algorithmic solutions, given the human-oriented nature of fairness. In this paper, we contribute a novel interpretable, explainable method for bias detection relying on debates about the presence of bias against individuals, based on the values of protected features for the individuals and others in their neighbourhoods. Our method builds upon techniques from formal and computational argumentation, whereby debates result from arguing about biases within and across neighbourhoods. We provide formal, quantitative, and qualitative evaluations of our method, highlighting its strengths in performance against baselines, as well as its interpretability and explainability.

Updated: 2025-08-06 14:56:08

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04511v1

Think How to Think: Mitigating Overthinking with Autonomous Difficulty Cognition in Large Reasoning Models

Recent Large Reasoning Models (LRMs) excel at complex reasoning tasks but often suffer from overthinking, generating overly long and redundant reasoning trajectories. To explore its essence, our empirical analysis reveals that LRMs are limited in recognizing task properties (i.e., difficulty levels) the way humans do before solving a problem, leading to a one-size-fits-all reasoning process. Inspired by this, a pressing and natural question emerges: can we explicitly bootstrap such ability to alleviate overthinking in LRMs? In this paper, we propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively instills difficulty cognition and redundancy cognition in LRMs. Specifically, we first inject difficulty hypnosis into output prefixes to guide the model toward adaptive reasoning depth, trained on a hybrid dataset mixing short and long reasoning paths. Then, we incorporate redundancy hypnosis, which supervises the intermediate reasoning steps to identify and eliminate unnecessary reasoning patterns. Experiments on 7B/14B/32B models demonstrate that TH2T significantly reduces inference costs, by over 70% on easy tasks and 40% on hard tasks, while maintaining performance stability. The resulting outputs exhibit clear signs of difficulty-aware capabilities and reduced redundancy (e.g., reflection and looping).

Updated: 2025-08-06 14:55:31

Domains: cs.AI

Download: http://arxiv.org/abs/2507.02663v2

NCCR: to Evaluate the Robustness of Neural Networks and Adversarial Examples

Neural networks have received a lot of attention recently, and related security issues have come with them. Many studies have shown that neural networks are vulnerable to adversarial examples: inputs artificially perturbed by modifications too small to be distinguishable by human perception. Different attacks and defenses have been proposed to solve these problems, but there is little research on evaluating the robustness of neural networks and their inputs. In this work, we propose a metric called the neuron cover change rate (NCCR) to measure the ability of deep learning models to resist attacks and the stability of adversarial examples. NCCR monitors alterations in the output of specifically chosen neurons when the input is perturbed, and networks with a smaller degree of variation are considered more robust. Experimental results on image recognition and speaker recognition models show that our metric provides a good assessment of the robustness of neural networks or their inputs. It can also be used to detect whether an input is adversarial, as adversarial examples are always less robust.
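
A hedged, minimal reading of the NCCR idea (the paper's exact definition may differ): over a set of monitored neurons, report the fraction whose activation moves by more than a tolerance when the input is perturbed. Smaller rates suggest a network that is more robust around that input.

```python
def neuron_cover_change_rate(acts_clean, acts_perturbed, tol=0.1):
    """Fraction of monitored neurons whose activation shifts by more
    than `tol` between the clean and perturbed input."""
    changed = sum(
        1 for a, b in zip(acts_clean, acts_perturbed) if abs(a - b) > tol
    )
    return changed / len(acts_clean)

# Two of four monitored neurons shift substantially under perturbation.
rate = neuron_cover_change_rate([0.2, 0.9, 0.5, 0.0],
                                [0.25, 0.1, 0.5, 0.8])
```

Used for detection, the same quantity is computed per input: an input whose rate is unusually high under small random perturbations is flagged as likely adversarial.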

Updated: 2025-08-06 14:54:25

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2507.21483v2

PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers

Multivariate time-series classification is pivotal in domains ranging from wearable sensing to biomedical monitoring. Despite recent advances, Transformer- and CNN-based models often remain computationally heavy, offer limited frequency diversity, and require extensive parameter budgets. We propose PRISM (Per-channel Resolution-Informed Symmetric Module), a convolutional-based feature extractor that applies symmetric finite-impulse-response (FIR) filters at multiple temporal scales, independently per channel. This multi-resolution, per-channel design yields highly frequency-selective embeddings without any inter-channel convolutions, greatly reducing model size and complexity. Across human-activity, sleep-stage and biomedical benchmarks, PRISM, paired with lightweight classification heads, matches or outperforms leading CNN and Transformer baselines, while using roughly an order of magnitude fewer parameters and FLOPs. By uniting classical signal processing insights with modern deep learning, PRISM offers an accurate, resource-efficient solution for multivariate time-series classification.
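
A hedged sketch of the core mechanism as we read it (function names and kernel choices are ours, not the paper's API): a symmetric, linear-phase FIR kernel is applied to each channel independently at two temporal scales, here dilation 1 and 2, with no cross-channel mixing.

```python
def fir_filter(signal, kernel):
    """'Valid' 1-D convolution of a signal with an FIR kernel."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def prism_features(series_by_channel, base_kernel=(0.25, 0.5, 0.25)):
    """Per-channel multi-resolution features; the coarse scale dilates
    the symmetric base kernel by inserting zeros between its taps."""
    dilated = (base_kernel[0], 0.0, base_kernel[1], 0.0, base_kernel[2])
    return {
        ch: {"fine": fir_filter(x, base_kernel),
             "coarse": fir_filter(x, dilated)}
        for ch, x in series_by_channel.items()
    }

# A fast alternating signal: the fine smoothing kernel flattens it,
# while the dilated kernel responds to the per-parity pattern.
feats = prism_features({"acc_x": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]})
```

Because each channel is filtered in isolation, the parameter count grows with the number of scales rather than with the square of the channel count, which is where the claimed size reduction comes from.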

Updated: 2025-08-06 14:50:25

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04503v1

Causal Reflection with Language Models

While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent's internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

Updated: 2025-08-06 14:44:23

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2508.04495v1

NACHOS: Neural Architecture Search for Hardware Constrained Early Exit Neural Networks

Early Exit Neural Networks (EENNs) endow a standard Deep Neural Network (DNN) with Early Exit Classifiers (EECs) that provide predictions at intermediate points of the processing once enough confidence in the classification is achieved. This leads to many benefits in terms of effectiveness and efficiency. Currently, the design of EENNs is carried out manually by experts, a complex and time-consuming task that requires accounting for many aspects, including the correct placement, the thresholding, and the computational overhead of the EECs. For this reason, research is exploring the use of Neural Architecture Search (NAS) to automate the design of EENNs. Currently, few comprehensive NAS solutions for EENNs have been proposed in the literature, and a fully automated, joint design strategy taking into consideration both the backbone and the EECs remains an open problem. To this end, this work presents Neural Architecture Search for Hardware Constrained Early Exit Neural Networks (NACHOS), the first NAS framework for the design of optimal EENNs satisfying constraints on the accuracy and the number of Multiply and Accumulate (MAC) operations performed by the EENNs at inference time. In particular, this provides the joint design of backbone and EECs to select a set of admissible (i.e., constraint-respecting) Pareto-optimal solutions in terms of the best tradeoff between accuracy and number of MACs. The results show that the models designed by NACHOS are competitive with the state-of-the-art EENNs. Additionally, this work investigates the effectiveness of two novel regularization terms designed for the optimization of the auxiliary classifiers of the EENN.

Updated: 2025-08-06 14:41:42

Domains: cs.LG,cs.CV,cs.NE,68T07

Download: http://arxiv.org/abs/2401.13330v3

Learning Robust Intervention Representations with Delta Embeddings

Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of interventions in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a framework that is capable of learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.
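
A hedged illustration with a hand-made latent code (not the paper's encoder): a delta embedding is the difference between the post- and pre-intervention latent vectors, so scene content shared by both images cancels and only intervention-affected variables survive, which is the sparsity property described above.

```python
def delta_embedding(z_before, z_after):
    """Difference of latent codes: scene-invariant, intervention-specific."""
    return [a - b for a, b in zip(z_after, z_before)]

z_start = [0.8, -0.3, 1.2, 0.0]   # pre-intervention code (toy values)
z_end   = [0.8, -0.3, 1.2, 1.0]   # same scene, one causal variable toggled
delta = delta_embedding(z_start, z_end)
support = [i for i, d in enumerate(delta) if abs(d) > 1e-9]  # sparse support
```

In this toy case the delta is zero everywhere except the single toggled variable; a learned encoder would be trained so that real interventions produce similarly sparse, scene-invariant deltas.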

Updated: 2025-08-06 14:39:34

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04492v1

Hierarchical Scoring for Machine Learning Classifier Error Impact Evaluation

A common use of machine learning (ML) models is predicting the class of a sample. Object detection is an extension of classification that includes localization of the object via a bounding box within the sample. Classification, and by extension object detection, is typically evaluated by counting a prediction as incorrect if the predicted label does not match the ground truth label. This pass/fail scoring treats all misclassifications as equivalent. In many cases, class labels can be organized into a class taxonomy with a hierarchical structure to either reflect relationships among the data or operator valuation of misclassifications. When such a hierarchical structure exists, hierarchical scoring metrics can relate the model performance of a given prediction to the distance between the prediction and the ground truth label. Such metrics can be viewed as giving partial credit to predictions instead of pass/fail, enabling a finer-grained understanding of the impact of misclassifications. This work develops hierarchical scoring metrics, varying in complexity, that utilize scoring trees to encode relationships between class labels and produce metrics that reflect distance in the scoring tree. The scoring metrics are demonstrated on an abstract use case with scoring trees representing three weighting strategies and are evaluated in terms of the kinds of errors each discourages. Results demonstrate that these metrics capture errors with finer granularity and that the scoring trees enable tuning. This work demonstrates an approach to evaluating ML performance that ranks models not only by how many errors are made but by the kind or impact of errors. Python implementations of the scoring metrics will be available in an open-source repository at time of publication.
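
A hedged sketch of tree-distance partial credit (the paper's scoring functions and taxonomy are not reproduced here; this toy tree and the 1/(1+d) credit rule are ours): a prediction earns credit that decays with its path distance to the ground-truth label in the class taxonomy.

```python
PARENT = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "sparrow": "bird", "bird": "animal", "animal": None}

def path_to_root(label):
    path = [label]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def tree_distance(a, b):
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    # steps from each label up to their lowest common ancestor
    return min(pa.index(c) + pb.index(c) for c in common)

def partial_credit(pred, truth):
    return 1.0 / (1.0 + tree_distance(pred, truth))  # 1.0 iff exact match
```

A sibling error ("dog" for "cat", distance 2) keeps more credit than a cross-branch error ("sparrow" for "cat", distance 4), which is exactly the finer granularity the abstract describes; reweighting edges of the tree implements the different weighting strategies.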

Updated: 2025-08-06 14:37:18

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04489v1

Benchmarking Quantum and Classical Sequential Models for Urban Telecommunication Forecasting

In this study, we evaluate the performance of classical and quantum-inspired sequential models in forecasting univariate time series of incoming SMS activity (SMS-in) using the Milan Telecommunication Activity Dataset. Due to data completeness limitations, we focus exclusively on the SMS-in signal for each spatial grid cell. We compare five models, LSTM (baseline), Quantum LSTM (QLSTM), Quantum Adaptive Self-Attention (QASA), Quantum Receptance Weighted Key-Value (QRWKV), and Quantum Fast Weight Programmers (QFWP), under varying input sequence lengths (4, 8, 12, 16, 32 and 64). All models are trained to predict the next 10-minute SMS-in value based solely on historical values within a given sequence window. Our findings indicate that different models exhibit varying sensitivities to sequence length, suggesting that quantum enhancements are not universally advantageous. Rather, the effectiveness of quantum modules is highly dependent on the specific task and architectural design, reflecting inherent trade-offs among model size, parameterization strategies, and temporal modeling capabilities.
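
The evaluation protocol above reduces to standard sliding-window supervised pairs; a minimal sketch (variable names are ours):

```python
def make_windows(series, seq_len):
    """Each input is `seq_len` consecutive values; the target is the
    value immediately following the window."""
    return [
        (series[i:i + seq_len], series[i + seq_len])
        for i in range(len(series) - seq_len)
    ]

# Toy 10-minute SMS-in counts with an input sequence length of 4.
pairs = make_windows([5, 7, 6, 9, 8, 10], seq_len=4)
```

Varying `seq_len` over 4, 8, 12, 16, 32, and 64 reproduces the sensitivity study described in the abstract: each choice changes both how much history a model sees and how many training pairs a fixed series yields.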

Updated: 2025-08-06 14:37:07

Domains: quant-ph,cs.AI

Download: http://arxiv.org/abs/2508.04488v1

Quantum circuit complexity and unsupervised machine learning of topological order

Inspired by the close relationship between Kolmogorov complexity and unsupervised machine learning, we explore quantum circuit complexity, an important concept in quantum computation and quantum information science, as a pivot to understand and to build interpretable and efficient unsupervised machine learning for topological order in quantum many-body systems. To span a bridge from conceptual power to practical applicability, we present two theorems that connect Nielsen's quantum circuit complexity for the quantum path planning between two arbitrary quantum many-body states with fidelity change and entanglement generation, respectively. Leveraging these connections, fidelity-based and entanglement-based similarity measures or kernels, which are more practical for implementation, are formulated. Using the two proposed kernels, numerical experiments targeting the unsupervised clustering of quantum phases of the bond-alternating XXZ spin chain, the ground state of Kitaev's toric code and random product states, are conducted, demonstrating their superior performance. Relations with classical shadow tomography and shadow kernel learning are also discussed, where the latter can be naturally derived and understood from our approach. Our results establish connections between key concepts and tools of quantum circuit computation, quantum complexity, and machine learning of topological quantum order.
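
For pure states, a fidelity-based similarity of the kind discussed above reduces to the squared overlap |<psi|phi>|^2; a small illustration with plain state vectors (a hedged sketch, not the paper's implementation, which targets many-body states):

```python
import math

def fidelity_kernel(psi, phi):
    """Squared overlap |<psi|phi>|^2 of two normalized state vectors."""
    overlap = sum(a.conjugate() * b for a, b in zip(psi, phi))
    return abs(overlap) ** 2

zero = [1.0, 0.0]                            # |0>
plus = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # |+>
k = fidelity_kernel(zero, plus)
```

A kernel matrix of such pairwise similarities is what an unsupervised method (e.g. kernel-based clustering) would consume to group states by phase.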

Updated: 2025-08-06 14:36:10

Domains: quant-ph,cond-mat.dis-nn,cs.CC,cs.IT,cs.LG,math.IT

Download: http://arxiv.org/abs/2508.04486v1

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.

Updated: 2025-08-06 14:33:45

Categories: cs.AI,cs.CL,cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04482v1

Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach

This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.
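As a concrete illustration of how conditional generation is typically wired up, the sketch below builds a cGAN generator input by concatenating a noise vector with a one-hot emotion label. The conditioning scheme, dimensions, and class count are assumptions for illustration; the paper does not specify its architecture at this level.

```python
import numpy as np

def conditional_input(noise, labels, num_classes):
    """Concatenate a noise vector with a one-hot label, a standard way a
    cGAN generator is conditioned on a class (illustrative; the paper's
    exact conditioning scheme is not reproduced here)."""
    one_hot = np.eye(num_classes)[labels]          # (batch, num_classes)
    return np.concatenate([noise, one_hot], axis=1)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))                       # 4 noise vectors of dim 16
labels = np.array([0, 2, 1, 2])                    # hypothetical emotion class ids
x = conditional_input(z, labels, num_classes=3)    # shape (4, 16 + 3)
```

The generator (and, symmetrically, the discriminator) would consume such conditioned inputs so that generated samples carry the requested emotion label.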

Updated: 2025-08-06 14:32:22

Categories: cs.LG,cs.NE

Download: http://arxiv.org/abs/2508.04481v1

Who cuts emissions, who turns up the heat? causal machine learning estimates of energy efficiency interventions

Reducing domestic energy demand is central to climate mitigation and fuel poverty strategies, yet the impact of energy efficiency interventions is highly heterogeneous. Using a causal machine learning model trained on nationally representative data of the English housing stock, we estimate average and conditional treatment effects of wall insulation on gas consumption, focusing on distributional effects across energy burden subgroups. While interventions reduce gas demand on average (by as much as 19 percent), low energy burden groups achieve substantial savings, whereas those experiencing high energy burdens see little to no reduction. This pattern reflects a behaviourally-driven mechanism: households constrained by high costs-to-income ratios (e.g. more than 0.1) reallocate savings toward improved thermal comfort rather than lowering consumption. Far from wasteful, such responses represent rational adjustments in contexts of prior deprivation, with potential co-benefits for health and well-being. These findings call for a broader evaluation framework that accounts for both climate impacts and the equity implications of domestic energy policy.
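The heterogeneous-effect estimation described above can be sketched with a simple T-learner: fit separate outcome models for insulated and non-insulated homes and take the difference of their predictions at a given energy burden. Everything below (the synthetic data, the linear outcome models, the comfort-taking effect that vanishes at high energy burden) is an illustrative stand-in for the paper's actual causal machine learning model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
burden = rng.uniform(0.0, 0.2, size=n)             # energy cost-to-income ratio
t = rng.integers(0, 2, size=n)                     # wall insulation indicator
# Assumed true effect: savings shrink as burden rises (comfort-taking).
effect = -10.0 * np.clip(0.1 - burden, 0.0, None)
y = 100.0 + 50.0 * burden + t * effect + rng.normal(0, 1, n)   # gas use

def fit_linear(x, y):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b1 = fit_linear(burden[t == 1], y[t == 1])         # treated outcome model
b0 = fit_linear(burden[t == 0], y[t == 0])         # control outcome model

def cate(x):                                       # conditional treatment effect
    return (b1[0] + b1[1] * x) - (b0[0] + b0[1] * x)

low, high = cate(0.02), cate(0.18)                 # low vs high energy burden
```

The recovered pattern mirrors the abstract: clear savings for low-burden households, little to none for high-burden ones.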

Updated: 2025-08-06 14:29:38

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04478v1

Metric Learning in an RKHS

Metric learning from a set of triplet comparisons in the form of "Do you think item h is more similar to item i or item j?", indicating similarity and differences between items, plays a key role in various applications including image retrieval, recommendation systems, and cognitive psychology. The goal is to learn a metric in the RKHS that reflects the comparisons. Nonlinear metric learning using kernel methods and neural networks have shown great empirical promise. While previous works have addressed certain aspects of this problem, there is little or no theoretical understanding of such methods. The exception is the special (linear) case in which the RKHS is the standard Euclidean space $\mathbb{R}^d$; there is a comprehensive theory for metric learning in $\mathbb{R}^d$. This paper develops a general RKHS framework for metric learning and provides novel generalization guarantees and sample complexity bounds. We validate our findings through a set of simulations and experiments on real datasets. Our code is publicly available at https://github.com/RamyaLab/metric-learning-RKHS.
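The object being learned is a metric in an RKHS, which can be evaluated from kernel values alone via d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y). A minimal sanity check of this identity, using a quadratic kernel whose explicit feature map is known (the kernel choice is illustrative, not the paper's):

```python
import numpy as np

def k(x, y):
    """Quadratic kernel k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map of the quadratic kernel: all degree-2 monomials."""
    return np.outer(x, x).ravel()

def rkhs_dist(x, y):
    """RKHS distance computed from kernel evaluations only."""
    return np.sqrt(k(x, x) - 2 * k(x, y) + k(y, y))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
implicit = rkhs_dist(x, y)                         # kernel trick
explicit = np.linalg.norm(phi(x) - phi(y))         # explicit feature space
```

The two computations agree, which is what lets triplet-based learning operate in an infinite-dimensional RKHS without ever materializing the features.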

Updated: 2025-08-06 14:29:04

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04476v1

Spatial-Frequency Aware for Object Detection in RAW Image

Direct RAW-based object detection offers great promise by utilizing RAW data (unprocessed sensor data), but faces inherent challenges due to its wide dynamic range and linear response, which tends to suppress crucial object details. In particular, existing enhancement methods are almost all performed in the spatial domain, making it difficult to effectively recover these suppressed details from the skewed pixel distribution of RAW images. To address this limitation, we turn to the frequency domain, where features, such as object contours and textures, can be naturally separated based on frequency. In this paper, we propose Space-Frequency Aware RAW Image Object Detection Enhancer (SFAE), a novel framework that synergizes spatial and frequency representations. Our contribution is threefold. The first lies in the ``spatialization" of frequency bands. Different from the traditional paradigm of directly manipulating abstract spectra in deep networks, our method inversely transforms individual frequency bands back into tangible spatial maps, thus preserving direct physical intuition. Then the cross-domain fusion attention module is developed to enable deep multimodal interactions between these maps and the original spatial features. Finally, the framework performs adaptive nonlinear adjustments by predicting and applying different gamma parameters for the two domains.
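The "spatialization" of frequency bands can be sketched with a plain FFT: mask a band in the frequency domain and transform it back into a tangible spatial map. The radial cutoff and two-band split below are assumptions for illustration; SFAE's actual banding scheme is not reproduced here.

```python
import numpy as np

def split_bands(img, cutoff=0.25):
    """Split an image into low- and high-frequency spatial maps (not spectra)."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    low_mask = radius <= cutoff                    # hypothetical radial cutoff
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * (~low_mask))).real
    return low, high

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32))                    # stand-in for a RAW image
low, high = split_bands(img)                       # smooth shading vs contours/texture
```

Because the masks partition the spectrum, the band maps sum back to the original image, so no information is lost by working band-wise.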

Updated: 2025-08-06 14:26:27

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.01396v2

Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model

Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is getting increased attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to "non-zero alignment residual", especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.
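A zero-residual alignment can be seen in miniature on a single linear layer: the minimum-Frobenius-norm update that maps the target-concept feature exactly onto the anchor output has a rank-one closed form. The setup below is an illustrative toy, not ErasePro's progressive layer-wise procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))                        # a linear layer's weights
t = rng.normal(size=6)                             # target concept feature (input)
a = rng.normal(size=4)                             # anchor concept output (desired)

# Minimum-norm rank-1 update with (W + delta) @ t == a exactly.
delta = np.outer(a - W @ t, t) / (t @ t)
W_new = W + delta

residual = np.linalg.norm(W_new @ t - a)           # alignment residual: zero
```

Enforcing the constraint exactly, rather than minimizing a penalized objective, is what removes the "non-zero alignment residual" the abstract identifies as a cause of incomplete erasure.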

Updated: 2025-08-06 14:19:32

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04472v1

Pull-Based Query Scheduling for Goal-Oriented Semantic Communication

This paper addresses query scheduling for goal-oriented semantic communication in pull-based status update systems. We consider a system where multiple sensing agents (SAs) observe a source characterized by various attributes and provide updates to multiple actuation agents (AAs), which act upon the received information to fulfill their heterogeneous goals at the endpoint. A hub serves as an intermediary, querying the SAs for updates on observed attributes and maintaining a knowledge base, which is then broadcast to the AAs. The AAs leverage the knowledge to perform their actions effectively. To quantify the semantic value of updates, we introduce a grade of effectiveness (GoE) metric. Furthermore, we integrate cumulative perspective theory (CPT) into the long-term effectiveness analysis to account for risk awareness and loss aversion in the system. Leveraging this framework, we compute effect-aware scheduling policies aimed at maximizing the expected discounted sum of CPT-based total GoE provided by the transmitted updates while complying with a given query cost constraint. To achieve this, we propose a model-based solution based on dynamic programming and model-free solutions employing state-of-the-art deep reinforcement learning (DRL) algorithms. Our findings demonstrate that effect-aware scheduling significantly enhances the effectiveness of communicated updates compared to benchmark scheduling methods, particularly in settings with stringent cost constraints where optimal query scheduling is vital for system performance and overall effectiveness.

Updated: 2025-08-06 14:17:41

Categories: cs.IT,cs.AI,cs.NI,math.IT

Download: http://arxiv.org/abs/2503.06725v2

FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions

Lately, Personalized Federated Learning (PFL) has emerged as a prevalent paradigm to deliver personalized models by collaboratively training while simultaneously adapting to each client's local applications. Existing PFL methods typically face a significant challenge due to the ubiquitous data heterogeneity (i.e., non-IID data) across clients, which severely hinders convergence and degrades performance. We identify that the root issue lies in the long-standing reliance on gradient-based updates, which are inherently sensitive to non-IID data. To fundamentally address this issue and bridge the research gap, in this paper, we propose a Heterogeneity-invariant Personalized Federated learning scheme, named FedHiP, through analytical (i.e., closed-form) solutions to avoid gradient-based updates. Specifically, we exploit the trend of self-supervised pre-training, leveraging a foundation model as a frozen backbone for gradient-free feature extraction. Following the feature extractor, we further develop an analytic classifier for gradient-free training. To support both collective generalization and individual personalization, our FedHiP scheme incorporates three phases: analytic local training, analytic global aggregation, and analytic local personalization. The closed-form solutions of our FedHiP scheme enable its ideal property of heterogeneity invariance, meaning that each personalized model remains identical regardless of how non-IID the data are distributed across all other clients. Extensive experiments on benchmark datasets validate the superiority of our FedHiP scheme, outperforming the state-of-the-art baselines by at least 5.79%-20.97% in accuracy.
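The analytic (closed-form) classifier idea can be sketched as ridge regression on frozen backbone features, solved without any gradient step. The synthetic features and regularization value below are assumptions; the aggregation and personalization phases are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 200, 8, 3
X = rng.normal(size=(n, d))                        # frozen backbone features
labels = rng.integers(0, c, size=n)
X[np.arange(n), labels] += 3.0                     # make features class-dependent
Y = np.eye(c)[labels]                              # one-hot targets
lam = 1.0                                          # hypothetical ridge strength

# Closed-form solution: W = (X^T X + lam I)^{-1} X^T Y, no gradient updates.
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
acc = (np.argmax(X @ W, axis=1) == labels).mean()
```

Because the solution depends only on the summary statistics X^T X and X^T Y, it is insensitive to the order or distribution of updates, which is the intuition behind the heterogeneity-invariance property claimed above.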

Updated: 2025-08-06 14:15:57

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04470v1

Embedding Is (Almost) All You Need: Retrieval-Augmented Inference for Generalizable Genomic Prediction Tasks

Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in diverse or unseen genomic contexts. For example, in enhancer classification, HyenaDNA embeddings combined with zCurve achieve 0.68 accuracy (vs. 0.58 for fine-tuning), with an 88% reduction in inference time and over 8x lower carbon emissions (0.02 kg vs. 0.17 kg CO2). In non-TATA promoter classification, DNABERT-2 embeddings with zCurve or GC content reach 0.85 accuracy (vs. 0.89 with fine-tuning) with a 22x lower carbon footprint (0.02 kg vs. 0.44 kg CO2). These results show that embedding-based pipelines offer over 10x better carbon efficiency while maintaining strong predictive performance. The code is available here: https://github.com/NIRJHOR-DATTA/EMBEDDING-IS-ALMOST-ALL-YOU-NEED.
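The embedding-based pipeline reduces to: extract fixed vectors from a frozen model, then fit only a lightweight head. The sketch below uses synthetic "embeddings" and a nearest-centroid head as stand-ins for the DNA-model embeddings and light classifiers in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_embed(n, label):
    """Stand-in for fixed embeddings from a frozen model (e.g. DNABERT-2);
    here just class-shifted Gaussian vectors for illustration."""
    return rng.normal(size=(n, 16)) + 1.5 * label

X = np.vstack([frozen_embed(50, 0), frozen_embed(50, 1)])
y = np.array([0] * 50 + [1] * 50)

# Lightweight head: nearest-centroid classifier, trained with two means.
centroids = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])

def predict(x):
    return int(np.argmin(((centroids - x) ** 2).sum(axis=1)))

acc = np.mean([predict(x) == t for x, t in zip(X, y)])
```

No backward pass ever touches the backbone, which is where the reported 10x-20x inference savings come from.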

Updated: 2025-08-06 14:15:48

Categories: q-bio.GN,cs.LG

Download: http://arxiv.org/abs/2508.04757v1

Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment

Open-set semi-supervised learning (OSSL) leverages unlabeled data containing both in-distribution (ID) and unknown out-of-distribution (OOD) samples, aiming simultaneously to improve closed-set accuracy and detect novel OOD instances. Existing methods either discard valuable information from uncertain samples or force-align every unlabeled sample into one or a few synthetic "catch-all" representations, resulting in geometric collapse and overconfidence on only seen OODs. To address the limitations, we introduce selective non-alignment, adding a novel "skip" operator into conventional pull and push operations of contrastive learning. Our framework, SkipAlign, selectively skips alignment (pulling) for low-confidence unlabeled samples, retaining only gentle repulsion against ID prototypes. This approach transforms uncertain samples into a pure repulsion signal, resulting in tighter ID clusters and naturally dispersed OOD features. Extensive experiments demonstrate that SkipAlign significantly outperforms state-of-the-art methods in detecting unseen OOD data without sacrificing ID classification accuracy.
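A rough sketch of the "skip" operator: per sample, confident examples receive both the pull (alignment) and push (repulsion) terms, while low-confidence examples skip the pull and keep only a gentle repulsion from the ID prototypes. The loss weights, threshold, and prototype setup are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def skipalign_loss(feat, protos, conf, tau=0.7):
    """Toy per-sample contrastive loss with selective non-alignment."""
    dists = np.linalg.norm(protos - feat, axis=1)
    pull = dists.min()                             # attract to nearest ID prototype
    push = -0.1 * dists.sum()                      # gentle repulsion from prototypes
    if conf >= tau:
        return pull + push                         # confident: full pull-and-push
    return push                                    # uncertain: skip alignment

protos = np.array([[1.0, 0.0], [0.0, 1.0]])        # hypothetical ID prototypes
x = np.array([0.9, 0.1])
confident = skipalign_loss(x, protos, conf=0.9)
uncertain = skipalign_loss(x, protos, conf=0.3)
```

Uncertain samples thus contribute a pure repulsion signal, which is what lets OOD features disperse instead of collapsing into a synthetic catch-all cluster.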

Updated: 2025-08-06 14:06:10

Categories: cs.LG

Download: http://arxiv.org/abs/2504.12569v3

ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions

Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present Chart$\text{M}^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. Chart$\text{M}^3$ contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, Chart$\text{M}^3$ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct Chart$\text{M}^3$-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.

Updated: 2025-08-06 14:05:00

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.21167v3

GFocal: A Global-Focal Neural Operator for Solving PDEs on Arbitrary Geometries

Transformer-based neural operators have emerged as promising surrogate solvers for partial differential equations, by leveraging the effectiveness of Transformers for capturing long-range dependencies and global correlations, profoundly proven in language modeling. However, existing methodologies overlook the coordinated learning of interdependencies between local physical details and global features, which are essential for tackling multiscale problems, preserving physical consistency and numerical stability in long-term rollouts, and accurately capturing transitional dynamics. In this work, we propose GFocal, a Transformer-based neural operator method that enforces simultaneous global and local feature learning and fusion. Global correlations and local features are harnessed through Nystr\"{o}m attention-based \textbf{g}lobal blocks and slices-based \textbf{focal} blocks to generate physics-aware tokens, subsequently modulated and integrated via convolution-based gating blocks, enabling dynamic fusion of multiscale information. GFocal achieves accurate modeling and prediction of physical features given arbitrary geometries and initial conditions. Experiments show that GFocal achieves state-of-the-art performance with an average 15.2\% relative gain in five out of six benchmarks and also excels in industry-scale simulations such as aerodynamics simulation of automotives and airfoils.

Updated: 2025-08-06 14:02:39

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04463v1

3DTTNet: Multimodal Fusion-Based 3D Traversable Terrain Modeling for Off-Road Environments

Off-road environments remain significant challenges for autonomous ground vehicles, due to the lack of structured roads and the presence of complex obstacles, such as uneven terrain, vegetation, and occlusions. Traditional perception algorithms, primarily designed for structured environments, often fail in unstructured scenarios. In this paper, traversable area recognition is achieved through semantic scene completion. A novel multimodal method, 3DTTNet, is proposed to generate dense traversable terrain estimations by integrating LiDAR point clouds with monocular images from a forward-facing perspective. By integrating multimodal data, environmental feature extraction is strengthened, which is crucial for accurate terrain modeling in complex terrains. Furthermore, RELLIS-OCC, a dataset with 3D traversable annotations, is introduced, incorporating geometric features such as step height, slope, and unevenness. Through a comprehensive analysis of vehicle obstacle-crossing conditions and the incorporation of vehicle body structure constraints, four traversability cost labels are generated: lethal, medium-cost, low-cost, and free. Experimental results demonstrate that 3DTTNet outperforms the comparison approaches in 3D traversable area recognition, particularly in off-road environments with irregular geometries and partial occlusions. Specifically, 3DTTNet achieves a 42\% improvement in scene completion IoU compared to other models. The proposed framework is scalable and adaptable to various vehicle platforms, allowing for adjustments to occupancy grid parameters and the integration of advanced dynamic models for traversability cost estimation.

Updated: 2025-08-06 14:02:23

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2412.08195v2

CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference

Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods must adhere to the 'draft-then-verify' paradigm, which forces drafting and verification processes to execute sequentially during SD, resulting in inefficient inference performance and limiting the size of the draft model. Furthermore, once a single token in the candidate sequence is rejected during the drafting process, all subsequent candidate tokens must be discarded, leading to inefficient drafting. To address these challenges, we propose a cache-based parallel speculative decoding framework employing a 'query-and-correct' paradigm. Specifically, CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model's generation direction. This effectively enables the target model to perform inference at speed approaching that of the draft model. Our approach achieves up to 4.83 speedup over vanilla decoding without requiring fine-tuning of either the draft or target models. Our code is available at https://github.com/hunzhizi/CARD.
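For contrast, the conventional "draft-then-verify" loop that CARD's "query-and-correct" design decouples can be sketched with toy next-token functions. Greedy verification is shown; real systems compare model distributions, and the two stand-in models below are assumptions for illustration.

```python
def draft_model(prefix):
    """Fast but imperfect drafter: successor mod 10, except it errs after 4."""
    return (prefix[-1] + 1) % 10 if prefix[-1] != 4 else 0

def target_model(prefix):
    """Slow reference model: successor mod 10."""
    return (prefix[-1] + 1) % 10

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    end = len(prompt) + n_tokens
    while len(seq) < end:
        drafts = []
        for _ in range(k):                         # drafting phase (cheap)
            drafts.append(draft_model(seq + drafts))
        for tok in drafts:                         # verification phase (target)
            if len(seq) >= end:
                break
            if target_model(seq) == tok:
                seq.append(tok)                    # draft token accepted
            else:
                seq.append(target_model(seq))      # rejected: keep target's token,
                break                              # discard remaining drafts
        else:
            if len(seq) < end:                     # all k accepted: one free token
                seq.append(target_model(seq))
    return seq[:end]
```

Note the sequential structure, and that one rejection discards every later draft token: these are exactly the inefficiencies the abstract attributes to the draft-then-verify paradigm.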

Updated: 2025-08-06 14:02:10

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04462v1

Small transformer architectures for task switching

The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetics which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translational invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings.
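The task-switching setup can be sketched as a token stream over a finite integer domain in which stochastically interspersed control tokens select the active subtask. The generator below covers only increment and reverse-copy, with an illustrative switch probability and context length; the exact IARC reference model is the paper's.

```python
import random

CTRL = {"INC": "<inc>", "REV": "<rev>"}            # hypothetical control tokens

def next_tokens(task, history, domain=10):
    if task == "INC":
        return [(history[-1] + 1) % domain]        # increment last token
    return list(reversed(history[-3:]))            # reverse-copy a short context

def make_stream(length, seed=0):
    rng = random.Random(seed)
    task, stream = "INC", ["<inc>", 0]
    while len(stream) < length:
        if rng.random() < 0.2:                     # stochastic task switch
            task = rng.choice(list(CTRL))
            stream.append(CTRL[task])
        else:
            nums = [t for t in stream if isinstance(t, int)]
            stream.extend(next_tokens(task, nums))
    return stream[:length]

stream = make_stream(20)
```

A model is then trained to predict the next token of such streams, so it must track which subtask the most recent control token activated.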

Updated: 2025-08-06 14:01:05

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04461v1

From "Aha Moments" to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control

Large Reasoning Models (LRMs) have demonstrated a latent capacity for complex reasoning by spontaneously exhibiting cognitive behaviors such as step-by-step reasoning, reflection, and backtracking, commonly referred to as "Aha Moments". However, such emergent behaviors remain unregulated and uncontrolled, often resulting in overthinking, where the model continues generating redundant reasoning content even after reaching reliable conclusions. This leads to excessive computational costs and increased latency, limiting the practical deployment of LRMs. The root cause lies in the absence of intrinsic regulatory mechanisms, as current models are unable to monitor and adaptively manage their reasoning process to determine when to continue, backtrack, or terminate. To address this issue, we propose the Meta-cognitive Reasoning Framework (MERA), which explicitly decouples the thinking process into distinct reasoning and control components, thereby enabling the independent optimization of control strategies. Specifically, MERA incorporates a takeover-based data construction mechanism that identifies critical decision points during reasoning and delegates the creation of control signals to auxiliary LLMs, thereby enabling the construction of high-quality reasoning-control data. Additionally, a structured reasoning-control separation is implemented via supervised fine-tuning, enabling the model to generate explicit traces and acquire initial meta-cognitive control capabilities. Finally, MERA employs Control-Segment Policy Optimization (CSPO), which combines segment-wise Group Relative Policy Optimization (GRPO) with a control-masking mechanism to optimize control behavior learning while minimizing interference from irrelevant content. Experiments on various reasoning benchmarks demonstrate that models trained with MERA enhance both reasoning efficiency and accuracy.

Updated: 2025-08-06 13:59:17

Categories: cs.AI

Download: http://arxiv.org/abs/2508.04460v1

Benchmarking Uncertainty and its Disentanglement in multi-label Chest X-Ray Classification

Reliable uncertainty quantification is crucial for trustworthy decision-making and the deployment of AI models in medical imaging. While prior work has explored the ability of neural networks to quantify predictive, epistemic, and aleatoric uncertainties using an information-theoretical approach in synthetic or well defined data settings like natural image classification, its applicability to real life medical diagnosis tasks remains underexplored. In this study, we provide an extensive uncertainty quantification benchmark for multi-label chest X-ray classification using the MIMIC-CXR-JPG dataset. We evaluate 13 uncertainty quantification methods for convolutional (ResNet) and transformer-based (Vision Transformer) architectures across a wide range of tasks. Additionally, we extend Evidential Deep Learning, HetClass NNs, and Deep Deterministic Uncertainty to the multi-label setting. Our analysis provides insights into uncertainty estimation effectiveness and the ability to disentangle epistemic and aleatoric uncertainties, revealing method- and architecture-specific strengths and limitations.
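
The information-theoretic decomposition the benchmark builds on can be sketched in a few lines: the predictive entropy of the ensemble mean splits into expected entropy (aleatoric) plus mutual information (epistemic). A minimal per-sample sketch, assuming an ensemble of softmax outputs; in the multi-label setting the same decomposition is applied per label to the binary probabilities.

```python
import numpy as np

def disentangle_uncertainty(probs):
    """Information-theoretic uncertainty decomposition over an ensemble.

    probs: (M, C) class probabilities from M stochastic forward passes.
    predictive entropy = aleatoric (expected entropy) + epistemic (mutual info).
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)
    predictive = -np.sum(mean_p * np.log(mean_p + eps))
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    epistemic = predictive - aleatoric
    return predictive, aleatoric, epistemic
```

When ensemble members agree, the epistemic term vanishes; disagreement between members shows up as mutual information.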

Updated: 2025-08-06 13:58:17

Areas: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.04457v1

Automatic LLM Red Teaming

Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, existing automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically "break" another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse-reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state of the art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.

Updated: 2025-08-06 13:52:00

Areas: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04451v1

Proactive Constrained Policy Optimization with Preemptive Penalty

Safe Reinforcement Learning (RL) often faces significant issues such as constraint violations and instability, necessitating the use of constrained policy optimization, which seeks optimal policies while ensuring adherence to specific constraints like safety. Typically, constrained optimization problems are addressed by the Lagrangian method, a post-violation remedial approach that may result in oscillations and overshoots. Motivated by this, we propose a novel method named Proactive Constrained Policy Optimization (PCPO) that incorporates a preemptive penalty mechanism. This mechanism integrates barrier items into the objective function as the policy nears the boundary, imposing a cost. Meanwhile, we introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary. We establish theoretical upper and lower bounds for the duality gap and the performance of the PCPO update, shedding light on the method's convergence characteristics. Additionally, to enhance the optimization performance, we adopt a policy iteration approach. An interesting finding is that PCPO demonstrates significant stability in experiments. Experimental results indicate that the PCPO framework provides a robust solution for policy optimization under constraints, with important implications for future research and practical applications.
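
The preemptive penalty can be pictured as a log-barrier that switches on inside a margin band below the constraint budget, before any violation occurs, in contrast to Lagrangian penalties that react only after a violation. The function below is an illustrative sketch; the margin, coefficient, and exact barrier shape are assumptions, not the paper's formula.

```python
import math

def preemptive_barrier_penalty(cost, budget, margin=0.1, beta=0.05):
    """Barrier term activated *before* the constraint is violated.

    cost:   current expected constraint cost J_c.
    budget: constraint limit d.
    The penalty turns on once cost enters the margin band below the budget,
    growing as -log of the remaining slack (clipped for numerical safety).
    """
    slack = budget - cost
    if slack >= margin:          # far from the boundary: no penalty
        return 0.0
    slack = max(slack, 1e-6)     # keep the log finite at/over the boundary
    return -beta * math.log(slack / margin)
```

Subtracting such a term from the objective makes the cost of approaching the boundary explicit during optimization, which is the intuition behind the oscillation-free behavior the abstract reports.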

Updated: 2025-08-06 13:50:53

Areas: cs.LG

Download: http://arxiv.org/abs/2508.01883v2

Automatically Interpreting Millions of Features in Large Language Models

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a feature, which we find explains features that are not recalled by existing methods. We propose guidelines for generating better explanations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. We use our explanations to measure the semantic similarity of independently trained SAEs, and find that SAEs trained on nearby layers of the residual stream are highly similar. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons, even when neurons are sparsified using top-$k$ postprocessing. Our code is available at https://github.com/EleutherAI/sae-auto-interp, and our explanations are available at https://huggingface.co/datasets/EleutherAI/auto_interp_explanations.
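
A detection-style explanation scorer of the kind described can be sketched as follows: a judge (here a stand-in callable; in the actual pipeline, an LLM) predicts from the explanation alone whether each context activates the latent, and the score is balanced accuracy against the true activations. Names and signature are illustrative, not the repository's API.

```python
def detection_score(judge, explanation, contexts, activates):
    """Balanced accuracy of a judge that predicts, from the explanation alone,
    which contexts activate an SAE latent (simplified detection-style scoring).

    judge:     callable (explanation, context) -> bool prediction.
    contexts:  list of text snippets.
    activates: list of ground-truth booleans, one per context.
    """
    preds = [bool(judge(explanation, c)) for c in contexts]
    pos = sum(activates) or 1
    neg = (len(activates) - sum(activates)) or 1
    tp = sum(p and a for p, a in zip(preds, activates))
    tn = sum((not p) and (not a) for p, a in zip(preds, activates))
    return 0.5 * (tp / pos + tn / neg)
```

Balanced accuracy keeps the score meaningful even when activating contexts are rare, which is the usual regime for sparse latents.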

Updated: 2025-08-06 13:47:10

Areas: cs.LG,cs.CL

Download: http://arxiv.org/abs/2410.13928v3

Are Large Language Models Dynamic Treatment Planners? An In Silico Study from a Prior Knowledge Injection Angle

Reinforcement learning (RL)-based dynamic treatment regimes (DTRs) hold promise for automating complex clinical decision-making, yet their practical deployment remains hindered by the intensive engineering required to inject clinical knowledge and ensure patient safety. Recent advancements in large language models (LLMs) suggest a complementary approach, where implicit prior knowledge and clinical heuristics are naturally embedded through linguistic prompts without requiring environment-specific training. In this study, we rigorously evaluate open-source LLMs as dynamic insulin dosing agents in an in silico Type 1 diabetes simulator, comparing their zero-shot inference performance against small neural network-based RL agents (SRAs) explicitly trained for the task. Our results indicate that carefully designed zero-shot prompts enable smaller LLMs (e.g., Qwen2.5-7B) to achieve comparable or superior clinical performance relative to extensively trained SRAs, particularly in stable patient cohorts. However, LLMs exhibit notable limitations, such as overly aggressive insulin dosing when prompted with chain-of-thought (CoT) reasoning, highlighting critical failure modes including arithmetic hallucination, temporal misinterpretation, and inconsistent clinical logic. Incorporating explicit reasoning about latent clinical states (e.g., meals) yielded minimal performance gains, underscoring the current model's limitations in capturing complex, hidden physiological dynamics solely through textual inference. Our findings advocate for cautious yet optimistic integration of LLMs into clinical workflows, emphasising the necessity of targeted prompt engineering, careful validation, and potentially hybrid approaches that combine linguistic reasoning with structured physiological modelling to achieve safe, robust, and clinically effective decision-support systems.

Updated: 2025-08-06 13:46:02

Areas: cs.LG,cs.CE

Download: http://arxiv.org/abs/2508.04755v1

Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by contrastively reducing language biases or amplifying the weights of visual embedding during decoding. However, these approaches remain limited in their ability to capture fine-grained visual details. In this work, we propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding. By magnifying critical regions while preserving the structural and contextual information at each decoding step, PM allows the VLM to enhance its scrutiny of the visual input, hence producing more accurate and faithful responses. Extensive experimental results demonstrate that PM not only achieves superior hallucination mitigation but also enhances language generation while preserving strong reasoning capabilities.
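
The core mechanic, isolating high-attention visual tokens and magnifying the corresponding image regions, can be sketched with plain arrays. This is a rough illustration under assumed shapes (a square patch grid, nearest-neighbor upscaling), not the method's actual implementation.

```python
import numpy as np

def magnify_regions(image, attn, patch=16, top_k=4, scale=2):
    """Pick the top-k attended patches and return upscaled crops.

    image: (H, W, C) array; attn: (H//patch, W//patch) attention map.
    Upscaling is nearest-neighbor via repeat, purely for illustration.
    """
    idx = np.argsort(attn, axis=None)[::-1][:top_k]      # strongest patches first
    rows, cols = np.unravel_index(idx, attn.shape)
    crops = []
    for r, c in zip(rows, cols):
        crop = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        crops.append(crop.repeat(scale, axis=0).repeat(scale, axis=1))
    return crops
```

In the actual decoding loop, such magnified regions would be re-encoded alongside the full image so the model attends to fine-grained detail without losing global context.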

Updated: 2025-08-06 13:45:16

Areas: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.10183v3

Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling

We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models.
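
A characteristic-function regularizer of the kind described can be sketched generically: compare the empirical characteristic functions of encoder samples and prior samples at a set of frequency points. This is a generic CF-matching loss in the spirit of the abstract, with assumed shapes; the paper derives the cloud model's specific characteristic function, which is not reproduced here.

```python
import numpy as np

def ecf(z, t):
    """Empirical characteristic function E[exp(i <t, z>)].

    z: (n, d) samples; t: (m, d) frequency points -> (m,) complex values.
    """
    return np.exp(1j * z @ t.T).mean(axis=0)

def cf_distance(z_q, z_p, t):
    """Squared distance between the characteristic functions of encoder
    samples z_q and prior samples z_p, evaluated at frequencies t."""
    return float(np.mean(np.abs(ecf(z_q, t) - ecf(z_p, t)) ** 2))
```

Because characteristic functions uniquely determine distributions, driving this distance to zero pushes the aggregate posterior toward the chosen prior without requiring a closed-form density.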

Updated: 2025-08-06 13:44:04

Areas: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04447v1

Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation

In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson's diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.
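
The two-to-infinity norm equals the largest row 2-norm, i.e. the square root of the largest diagonal entry of AA^T, so a Hutchinson-style diagonal estimate needs only matvecs with A and A^T. The sketch below illustrates that idea under stated assumptions; it is not the paper's exact algorithm (in particular, it omits the Hutch++-style variance reduction).

```python
import numpy as np

def two_to_infty_est(matvec, rmatvec, n, num_probes=200, seed=0):
    """Matrix-free estimate of ||A||_{2->inf} = max row 2-norm of A.

    Uses Hutchinson's diagonal estimator on A A^T:
        diag(A A^T) ~ (1/K) * sum_k z_k * (A A^T z_k),  z_k Rademacher,
    then takes the square root of the largest diagonal entry.  Only
    matvec (x -> A x) and rmatvec (y -> A^T y) are required; n is the
    number of rows of A.
    """
    rng = np.random.default_rng(seed)
    diag = np.zeros(n)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        diag += z * matvec(rmatvec(z))   # z ⊙ (A A^T z)
    return float(np.sqrt(max(diag.max() / num_probes, 0.0)))
```

The one-to-two norm can be handled symmetrically, since it is the two-to-infinity norm of A^T.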

Updated: 2025-08-06 13:37:37

Areas: cs.LG,cs.NA,math.NA,stat.ML,65F35

Download: http://arxiv.org/abs/2508.04444v1

Stepsize anything: A unified learning rate schedule for budgeted-iteration training

The expanding computational costs and limited resources underscore the critical need for budgeted-iteration training, which aims to achieve optimal learning within predetermined iteration budgets. While learning rate schedules fundamentally govern the performance of different networks and tasks, particularly in budgeted-iteration scenarios, their design remains largely heuristic, lacking theoretical foundations. In addition, the optimal learning rate schedule requires extensive trial-and-error selection, making the training process inefficient. In this work, we propose the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that consistently outperforms commonly-used schedules across diverse architectures and tasks under different constrained training budgets. First, we bridge the gap by constructing a novel training-budget-aware optimization framework, which explicitly accounts for robustness to landscape curvature variations. From this framework, we derive the UBA schedule, controlled by a single hyperparameter φ that provides a trade-off between flexibility and simplicity, eliminating the need for per-network numerical optimization. Moreover, we establish a theoretical connection between φ and the condition number, adding interpretation and justification to our approach. Besides, we prove convergence for different values of φ. We offer practical guidelines for its selection via theoretical analysis and empirical results. Extensive experimental results show that UBA consistently surpasses commonly-used schedules across diverse vision and language tasks, spanning network architectures (e.g., ResNet, OLMo) and scales, under different training-iteration budgets.

Updated: 2025-08-06 13:37:34

Areas: cs.LG

Download: http://arxiv.org/abs/2505.24452v3

Efficient Training of Physics-enhanced Neural ODEs via Direct Collocation and Nonlinear Programming

We propose a novel approach for training Physics-enhanced Neural ODEs (PeN-ODEs) by expressing the training process as a dynamic optimization problem. The full model, including neural components, is discretized using a high-order implicit Runge-Kutta method with flipped Legendre-Gauss-Radau points, resulting in a large-scale nonlinear program (NLP) that is efficiently solved by state-of-the-art NLP solvers such as Ipopt. This formulation enables simultaneous optimization of network parameters and state trajectories, addressing key limitations of ODE-solver-based training in terms of stability, runtime, and accuracy. Extending a recent direct collocation-based method for Neural ODEs, we generalize to PeN-ODEs, incorporate physical constraints, and present a custom, parallelized, open-source implementation. Benchmarks on a Quarter Vehicle Model and a Van der Pol oscillator demonstrate superior accuracy, speed, and generalization with smaller networks compared to other training techniques. We also outline a planned integration into OpenModelica to enable accessible training of Neural DAEs.

Updated: 2025-08-06 13:30:51

Areas: cs.LG,math.DS,math.OC,90C30, 68T05,G.1.6; I.2.6

Download: http://arxiv.org/abs/2505.03552v2

Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.
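
The STS-based alignment check can be sketched as follows: embed a generated question and each learning standard from the RPT with any sentence encoder (hypothetical here), and take the best cosine match as the alignment score. Function and argument names are illustrative.

```python
import numpy as np

def curriculum_alignment(q_vec, standard_vecs):
    """STS-style curriculum alignment score.

    q_vec:         (d,) embedding of a generated question.
    standard_vecs: (k, d) embeddings of the RPT learning standards.
    Returns (best cosine similarity, index of the best-matching standard).
    """
    q = q_vec / np.linalg.norm(q_vec)
    S = standard_vecs / np.linalg.norm(standard_vecs, axis=1, keepdims=True)
    sims = S @ q
    return float(sims.max()), int(sims.argmax())
```

Returning the argmax as well lets a reviewer see *which* curriculum standard each question was matched to, not just how well.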

Updated: 2025-08-06 13:30:51

Areas: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04442v1

StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and the reasoning capability needed for natural-language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

Updated: 2025-08-06 13:28:22

Areas: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04440v1

Prompt Obfuscation for Large Language Models

System prompts that include detailed instructions to describe the task performed by the underlying LLM can easily transform foundation models into tools and services with minimal overhead. They are often considered intellectual property, similar to the code of a software product, because of their crucial impact on the utility. However, extracting system prompts is easily possible. As of today, there is no effective countermeasure to prevent the stealing of system prompts, and all safeguarding efforts could be evaded. In this work, we propose an alternative to conventional system prompts. We introduce prompt obfuscation to prevent the extraction of the system prompt with little overhead. The core idea is to find a representation of the original system prompt that leads to the same functionality, while the obfuscated system prompt does not contain any information that allows conclusions to be drawn about the original system prompt. We evaluate our approach by comparing our obfuscated prompt output with the output of the original prompt, using eight distinct metrics to measure the lexical, character-level, and semantic similarity. We show that the obfuscated version is constantly on par with the original one. We further perform three different deobfuscation attacks with varying attacker knowledge--covering both black-box and white-box conditions--and show that in realistic attack scenarios an attacker is unable to extract meaningful information. Overall, we demonstrate that prompt obfuscation is an effective mechanism to safeguard the intellectual property of a system prompt while maintaining the same utility as the original prompt.
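
One of the cheaper surface metrics used to compare original-prompt and obfuscated-prompt outputs, character-level similarity, can be computed with the standard library alone. The paper evaluates eight metrics spanning lexical, character-level, and semantic similarity; this illustrates just one of them, as a stand-in.

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]: 2*M/T, where M is the
    number of matched characters and T the total length of both strings."""
    return SequenceMatcher(None, a, b).ratio()
```

If an obfuscated prompt preserves functionality, the outputs it produces should score close to the originals under metrics like this one, while the prompt text itself reveals nothing about the original.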

Updated: 2025-08-06 13:24:09

Areas: cs.CR,cs.LG

Download: http://arxiv.org/abs/2409.11026v4

Algorithm Development in Neural Networks: Insights from the Streaming Parity Task

Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks (RNNs) trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.
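
The target of "infinite generalization" here is concrete: streaming parity is computed exactly by a two-state automaton whose state flips on every 1, which is the finite automaton the representational merger effectively constructs. A minimal reference implementation:

```python
def streaming_parity(bits):
    """Emit the running parity of a bit stream: the 2-state automaton an RNN
    must effectively learn to generalize to sequences of arbitrary length."""
    state = 0
    out = []
    for b in bits:
        state ^= b          # flip on every 1, hold on every 0
        out.append(state)
    return out
```

An RNN that has merged its hidden states into these two equivalence classes reproduces this function for any sequence length, hence the phase transition to perfect generalization.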

Updated: 2025-08-06 13:21:56

Areas: cs.LG,q-bio.NC

Download: http://arxiv.org/abs/2507.09897v2

GenEDA: Towards Generative Netlist Functional Reasoning via Cross-Modal Circuit Encoder-Decoder Alignment

The success of foundation AI has motivated the research of circuit foundation models, which are customized to assist the integrated circuit (IC) design process. However, existing pre-trained circuit foundation models are typically limited to standalone encoders for predictive tasks or decoders for generative tasks. These two model types are developed independently, operate on different circuit modalities, and reside in separate latent spaces. This restricts their ability to complement each other for more advanced capabilities. In this work, we present GenEDA, the first framework that cross-modally aligns circuit encoders with decoders within a shared latent space. GenEDA bridges the gap between graph-based circuit representation learning and text-based large language models (LLMs), enabling communication between their respective latent spaces. To achieve the alignment, we propose two paradigms to support both open-source trainable LLMs and commercial frozen LLMs. We leverage this aligned architecture to develop the first generative foundation model for netlists, unleashing LLMs' generative reasoning capability on the low-level and bit-blasted netlists. GenEDA enables three unprecedented generative netlist functional reasoning tasks, where it reversely generates high-level functionalities such as specifications and RTL code from low-level netlists. These tasks move beyond traditional gate function classification to direct generation of full-circuit functionality. Experiments demonstrate that GenEDA significantly boosts advanced LLMs' (e.g., GPT and DeepSeek series) performance in all tasks.

Updated: 2025-08-06 13:21:11

Areas: cs.LG,cs.AR

Download: http://arxiv.org/abs/2504.09485v2

SimInstruct: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated Novices

High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding -- the process by which an expert supports a novice's thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and LLM's persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model to be an expert model using the augmented dataset, which outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o's limitations in weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions.

Updated: 2025-08-06 13:16:10

Domains: cs.AI

Download: http://arxiv.org/abs/2508.04428v1

Efficient Unsupervised Domain Adaptation Regression for Spatial-Temporal Sensor Fusion

The growing deployment of low-cost, distributed sensor networks in environmental and biomedical domains has enabled continuous, large-scale health monitoring. However, these systems often face challenges related to degraded data quality caused by sensor drift, noise, and insufficient calibration -- factors that limit their reliability in real-world applications. Traditional machine learning methods for sensor fusion and calibration rely on extensive feature engineering and struggle to capture spatial-temporal dependencies or adapt to distribution shifts across varying deployment conditions. To address these challenges, we propose a novel unsupervised domain adaptation (UDA) method tailored for regression tasks. Our proposed method integrates effectively with Spatial-Temporal Graph Neural Networks and leverages the alignment of perturbed inverse Gram matrices between source and target domains, drawing inspiration from Tikhonov regularization. This approach enables scalable and efficient domain adaptation without requiring labeled data in the target domain. We validate our novel method on real-world datasets from two distinct applications: air quality monitoring and EEG signal reconstruction. Our method achieves state-of-the-art performance which paves the way for more robust and transferable sensor fusion models in both environmental and physiological contexts. Our code is available at https://github.com/EPFL-IMOS/TikUDA.
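
The alignment idea at the core of the method, comparing Tikhonov-perturbed inverse Gram matrices of source and target features, can be sketched in a few lines. This is a minimal illustration, assuming a plain Frobenius discrepancy and synthetic Gaussian features; the function names are illustrative, not the paper's API:

```python
import numpy as np

def perturbed_inv_gram(X, lam=1.0):
    """Tikhonov-perturbed inverse Gram matrix (X^T X + lam*I)^-1."""
    d = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(d))

def alignment_penalty(Xs, Xt, lam=1.0):
    """Frobenius discrepancy between source and target inverse Grams."""
    Gs = perturbed_inv_gram(Xs, lam)
    Gt = perturbed_inv_gram(Xt, lam)
    return float(np.linalg.norm(Gs - Gt, ord="fro") ** 2)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 8))                      # source-domain features
Xt_near = Xs + 0.01 * rng.normal(size=(200, 8))     # nearly identical domain
Xt_far = 2.0 * rng.normal(size=(200, 8))            # shifted/scaled domain

print(alignment_penalty(Xs, Xt_near) < alignment_penalty(Xs, Xt_far))
```

A shifted target domain yields a larger discrepancy than a near-identical one, which is the signal such an adaptation penalty exploits without any target labels.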

Updated: 2025-08-06 13:15:20

Domains: cs.LG,eess.SP

Download: http://arxiv.org/abs/2411.06917v2

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

Updated: 2025-08-06 13:14:20

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04427v1

A Value Based Parallel Update MCTS Method for Multi-Agent Cooperative Decision Making of Connected and Automated Vehicles

To solve the joint lateral and longitudinal decision-making problem of multi-vehicle cooperative driving for connected and automated vehicles (CAVs), this paper proposes a Monte Carlo tree search (MCTS) method with parallel update for multi-agent Markov games with a limited horizon and a time-discounted setting. By analyzing the parallel actions in the multi-vehicle joint action space in partial-steady-state traffic flow, the parallel update method can quickly exclude potentially dangerous actions, thereby increasing the search depth without sacrificing the search breadth. The proposed method is tested on a large number of randomly generated traffic flows. The experimental results show that the algorithm is robust and outperforms state-of-the-art (SOTA) reinforcement learning algorithms and heuristic methods. The vehicle driving strategy produced by the proposed algorithm shows rationality beyond that of human drivers, and has advantages in traffic efficiency and safety in the coordination zone.

Updated: 2025-08-06 13:08:15

Domains: cs.MA,cs.AI,cs.GT,cs.SY,eess.SY

Download: http://arxiv.org/abs/2409.13783v2

Algorithm Selection for Recommender Systems via Meta-Learning on Algorithm Characteristics

The Algorithm Selection Problem for recommender systems, choosing the best algorithm for a given user or context, remains a significant challenge. Traditional meta-learning approaches often treat algorithms as categorical choices, ignoring their intrinsic properties. Recent work has shown that explicitly characterizing algorithms with features can improve model performance in other domains. Building on this, we propose a per-user meta-learning approach for recommender system selection that leverages both user meta-features and automatically extracted algorithm features from source code. Our preliminary results, averaged over six diverse datasets, show that augmenting a meta-learner with algorithm features improves its average NDCG@10 performance by 8.83%, from 0.135 (user features only) to 0.147. This enhanced model outperforms the Single Best Algorithm baseline (0.131) and closes 10.5% of the performance gap to a theoretical oracle selector. These findings show that even static source-code metrics provide a valuable predictive signal, presenting a promising direction for building more robust and intelligent recommender systems.
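
The core recipe, concatenating user meta-features with static algorithm features and letting a regressor rank algorithms per user, can be sketched with a toy ridge meta-learner. Everything below (the synthetic NDCG@10 function, the feature values, the interaction featurization) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: three candidate recommenders, each described by static
# "algorithm features" (e.g. source-code metrics), plus per-user meta-features.
algo_feats = np.array([[1.0, 0.2], [0.4, 0.9], [0.7, 0.5]])  # 3 algos x 2 feats
n_users, n_user_feats = 300, 4
U = rng.normal(size=(n_users, n_user_feats))

def true_ndcg(u, a):
    # Synthetic ground-truth NDCG@10, unknown to the meta-learner; it couples
    # user and algorithm features, so the best algorithm differs per user.
    return 0.1 + 0.05 * u[0] * a[0] + 0.05 * u[1] * a[1]

def featurize(u, a):
    # User feats ++ algo feats ++ interactions ++ bias, so a linear model can
    # express user-dependent algorithm rankings.
    return np.concatenate([u, a, np.outer(u, a).ravel(), [1.0]])

X = np.array([featurize(u, a) for u in U for a in algo_feats])
y = np.array([true_ndcg(u, a) for u in U for a in algo_feats])

# Closed-form ridge regression as the meta-learner.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def select(u):
    """Pick the algorithm with the highest predicted NDCG@10 for user u."""
    return int(np.argmax([featurize(u, a) @ w for a in algo_feats]))

# Per-user selection vs. the single best fixed algorithm, on fresh users.
V = rng.normal(size=(100, n_user_feats))
selected = np.mean([true_ndcg(u, algo_feats[select(u)]) for u in V])
single_best = max(np.mean([true_ndcg(u, a) for u in V]) for a in algo_feats)
print(round(float(selected), 3), round(float(single_best), 3))
```

Because the interaction terms let predictions vary per user, the selector can beat any single fixed algorithm on average, mirroring the Single Best Algorithm comparison above.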

Updated: 2025-08-06 13:06:24

Domains: cs.IR,cs.LG,I.2.m

Download: http://arxiv.org/abs/2508.04419v1

Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R\textsuperscript{2}-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R\textsuperscript{2}-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.

Updated: 2025-08-06 13:05:09

Domains: cs.MM,cs.CV,cs.MA,cs.SD,eess.AS

Download: http://arxiv.org/abs/2508.04418v1

Thompson Exploration with Best Challenger Rule in Best Arm Identification

This paper studies the fixed-confidence best arm identification (BAI) problem in the bandit framework for the canonical single-parameter exponential family models. Many policies have been proposed for this problem, but most of them require solving an optimization problem at every round and/or are forced to explore each arm at least a certain number of times, with the exception of policies restricted to the Gaussian model. To address these limitations, we propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule. While Thompson sampling was originally considered for maximizing the cumulative reward, we demonstrate that it can be used to naturally explore arms in BAI without such forced exploration. We show that our policy is asymptotically optimal for any two-armed bandit problem and achieves near optimality for general $K$-armed bandit problems with $K\geq 3$. Nevertheless, in numerical experiments, our policy shows performance competitive with asymptotically optimal policies in terms of sample complexity while requiring less computation. In addition, we highlight the advantages of our policy by comparing it to the concept of $\beta$-optimality, a relaxed notion of asymptotic optimality commonly considered in the analysis of a class of policies that includes the proposed one.
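
For Gaussian arms with known variance and flat priors, one plausible reading of the combination looks as follows. This is a sketch, not the paper's exact algorithm: a fixed budget stands in for the fixed-confidence stopping rule, and the normalized-gap challenger index is a simplification:

```python
import numpy as np

def ts_best_challenger(means, sigma=1.0, rounds=2000, seed=0):
    """Fixed-budget sketch of Thompson exploration with a best-challenger
    rule for Gaussian bandits (known variance, flat prior)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    n = np.ones(K)                                        # pull counts (one init pull each)
    s = np.array([rng.normal(m, sigma) for m in means])   # reward sums
    for _ in range(rounds):
        mu = s / n
        leader = int(np.argmax(mu))
        # Thompson step: sample each arm's posterior mean N(mu_k, sigma^2/n_k).
        theta = rng.normal(mu, sigma / np.sqrt(n))
        candidate = int(np.argmax(theta))
        if candidate != leader:
            arm = candidate              # posterior sample disagrees: explore it
        else:
            # Best challenger: the non-leader hardest to distinguish from the
            # leader, by a normalized-gap index.
            gap = (mu[leader] - mu) / np.sqrt(1 / n[leader] + 1 / n)
            gap[leader] = np.inf
            arm = int(np.argmin(gap))
        s[arm] += rng.normal(means[arm], sigma)
        n[arm] += 1
    return int(np.argmax(s / n))         # recommend the empirical best arm

print(ts_best_challenger([0.0, 0.2, 1.0], sigma=0.5, rounds=2000))
```

On an easy three-armed instance the sketch reliably returns the index of the best arm; a real fixed-confidence implementation would replace the fixed budget with a Chernoff-style stopping condition.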

Updated: 2025-08-06 12:59:38

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2310.00539v2

Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents

Frontier LLMs have only recently enabled serviceable, autonomous web agents. Here, the model acts as an instantaneous domain-model backend: to suggest an interaction, it is consulted with a web-based task and the respective application state. The key problem lies in serialising the application state, referred to as a snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues, not least because they resemble human perception, but also because images are a relatively cheap means of model input. LLM vision, however, still lags behind code-interpretation capabilities. DOM snapshots, which structurally resemble HTML, offer a desirable alternative; their vast model-input token size, however, has so far prevented reliable use with web agents. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) within the same input-token order of magnitude (1e3). Our best evaluated configurations, one token order above but within the model's context window, outperform this baseline by 8%. Our evaluation, moreover, yields that the hierarchy inherent to the DOM embodies a strong UI feature for LLMs.
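
To make the idea concrete, here is a crude token-budget-oriented DOM downsampler in the same spirit: keep interaction-relevant tags and attributes, drop scripts and styling, and clip text nodes. D2Snap's actual algorithm is not specified at this level of detail; the tag/attribute allowlists below are illustrative assumptions:

```python
from html.parser import HTMLParser

# Illustrative allowlists (assumptions, not D2Snap's actual configuration).
KEEP_TAGS = {"a", "button", "input", "select", "textarea", "form",
             "nav", "main", "h1", "h2", "h3", "ul", "li", "table"}
KEEP_ATTRS = {"id", "name", "href", "type", "value", "placeholder", "aria-label"}
SKIP_TAGS = {"script", "style"}

class DomDownsampler(HTMLParser):
    """Crude DOM downsampler: keep interaction-relevant tags and attributes,
    drop scripts/styles and presentational markup, clip text nodes."""
    def __init__(self, max_text=40):
        super().__init__()
        self.max_text = max_text
        self.skipping = False
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skipping = True
        elif tag in KEEP_TAGS:
            kept = " ".join(f'{k}="{v}"' for k, v in attrs if k in KEEP_ATTRS)
            self.out.append(f"<{tag} {kept}>" if kept else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skipping = False
        elif tag in KEEP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skipping:
            self.out.append(text[:self.max_text])

def downsample(html):
    parser = DomDownsampler()
    parser.feed(html)
    return "".join(parser.out)

page = ('<div class="hero" style="color:red"><script>var x = 1;</script>'
        '<nav><a href="/docs">Docs</a></nav>'
        '<form><input type="text" name="q" placeholder="Search"/>'
        '<button id="go">Go</button></form></div>')
print(downsample(page))
```

Even this naive pass shrinks the snapshot substantially while preserving the hierarchy and the elements an agent could act on.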

Updated: 2025-08-06 12:56:54

Domains: cs.AI,cs.CL,cs.HC

Download: http://arxiv.org/abs/2508.04412v1

AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context

As our understanding of autism and ableism continues to increase, so does our understanding of ableist language towards autistic people. Such language poses a significant challenge in NLP research due to its subtle and context-dependent nature. Yet, detecting anti-autistic ableist language remains underexplored, with existing NLP tools often failing to capture its nuanced expressions. We present AUTALIC, the first benchmark dataset dedicated to the detection of anti-autistic ableist language in context, addressing a significant gap in the field. The dataset comprises 2,400 autism-related sentences collected from Reddit, accompanied by surrounding context, and is annotated by trained experts with backgrounds in neurodiversity. Our comprehensive evaluation reveals that current language models, including state-of-the-art LLMs, struggle to reliably identify anti-autistic ableism and align with human judgments, underscoring their limitations in this domain. We publicly release AUTALIC along with the individual annotations which serve as a valuable resource to researchers working on ableism, neurodiversity, and also studying disagreements in annotation tasks. This dataset serves as a crucial step towards developing more inclusive and context-aware NLP systems that better reflect diverse perspectives.

Updated: 2025-08-06 12:55:44

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2410.16520v4

The Relative Instability of Model Comparison with Cross-validation

Existing work has shown that cross-validation (CV) can be used to provide an asymptotic confidence interval for the test error of a stable machine learning algorithm, and existing stability results for many popular algorithms can be applied to derive positive instances where such confidence intervals will be valid. However, in the common setting where CV is used to compare two algorithms, it becomes necessary to consider a notion of relative stability which cannot easily be derived from existing stability results, even for simple algorithms. To better understand relative stability and when CV provides valid confidence intervals for the test error difference of two algorithms, we study the soft-thresholded least squares algorithm, a close cousin of the Lasso. We prove that while stability holds when assessing the individual test error of this algorithm, relative stability fails to hold when comparing the test error of two such algorithms, even in a sparse low-dimensional linear model setting. Additionally, we empirically confirm the invalidity of CV confidence intervals for the test error difference when either soft-thresholding or the Lasso is used. In short, caution is needed when quantifying the uncertainty of CV estimates of the performance difference of two machine learning algorithms, even when both algorithms are individually stable.
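
The estimator under study and the CV construction whose validity the paper questions are easy to state. The sketch below implements soft-thresholded least squares and the naive K-fold estimate (with standard error) of the test-error difference between two threshold levels; function names and the synthetic data are illustrative, and the paper's point is precisely that intervals built from such fold-wise differences need not be valid:

```python
import numpy as np

def soft_threshold_ls(X, y, lam):
    """Soft-thresholded least squares: OLS followed by coordinate-wise
    soft-thresholding at level lam (a close cousin of the Lasso)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def cv_error_diff(X, y, lam1, lam2, folds=5, seed=0):
    """Naive K-fold CV estimate (and standard error) of the test-error
    difference between two soft-threshold levels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    diffs = []
    for chunk in np.array_split(idx, folds):
        mask = np.ones(len(y), bool)
        mask[chunk] = False                       # hold out this fold
        b1 = soft_threshold_ls(X[mask], y[mask], lam1)
        b2 = soft_threshold_ls(X[mask], y[mask], lam2)
        e1 = np.mean((y[chunk] - X[chunk] @ b1) ** 2)
        e2 = np.mean((y[chunk] - X[chunk] @ b2) ** 2)
        diffs.append(e1 - e2)
    diffs = np.array(diffs)
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(folds)

# Sparse low-dimensional linear model, as in the paper's setting.
rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(size=n)
print(cv_error_diff(X, y, lam1=0.0, lam2=0.1))
```

The point estimate and fold-based standard error are exactly the ingredients of the CV confidence interval whose coverage the paper shows can fail under relative instability.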

Updated: 2025-08-06 12:54:56

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.04409v1

Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models

Renovating existing buildings is essential for climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning.

Updated: 2025-08-06 12:48:53

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04406v1

What Lives? A meta-analysis of diverse opinions on the definition of life

The question of "what is life?" has challenged scientists and philosophers for centuries, producing an array of definitions that reflect both the mystery of its emergence and the diversity of disciplinary perspectives brought to bear on the question. Despite significant progress in our understanding of biological systems, psychology, computation, and information theory, no single definition for life has yet achieved universal acceptance. This challenge becomes increasingly urgent as advances in synthetic biology, artificial intelligence, and astrobiology challenge our traditional conceptions of what it means to be alive. We undertook a methodological approach that leverages large language models (LLMs) to analyze a set of definitions of life provided by a curated set of cross-disciplinary experts. We used a novel pairwise correlation analysis to map the definitions into distinct feature vectors, followed by agglomerative clustering, intra-cluster semantic analysis, and t-SNE projection to reveal underlying conceptual archetypes. This methodology revealed a continuous landscape of the themes relating to the definition of life, suggesting that what has historically been approached as a binary taxonomic problem should be instead conceived as differentiated perspectives within a unified conceptual latent space. We offer a new methodological bridge between reductionist and holistic approaches to fundamental questions in science and philosophy, demonstrating how computational semantic analysis can reveal conceptual patterns across disciplinary boundaries, and opening similar pathways for addressing other contested definitional territories across the sciences.

Updated: 2025-08-06 12:47:41

Domains: q-bio.OT,cs.AI,cs.CY,q-bio.BM,q-bio.CB,q-bio.SC,stat.AP

Download: http://arxiv.org/abs/2505.15849v2

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.05. The proposed kernel achieves an average 1.39$\times$ speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33$\times$ inference acceleration and 1.21$\times$ memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
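
Numerically, the uniform 6-bit weight quantization at the heart of FlexQ can be sketched as symmetric rounding into the signed range [-32, 31]. The per-channel symmetric scheme and function names are illustrative assumptions; the paper's kernels additionally emulate INT6 arithmetic via binary tensor cores, which this sketch does not attempt:

```python
import numpy as np

def quantize_int6(W, axis=0):
    """Uniform symmetric quantization into the signed 6-bit range [-32, 31],
    with one scale per channel along `axis` (an illustrative choice)."""
    qmax = 31
    scale = np.max(np.abs(W), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero channels
    q = np.clip(np.round(W / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(64, 64)).astype(np.float32)
q, s = quantize_int6(W)
err = np.max(np.abs(dequantize(q, s) - W))
# Reconstruction error of uniform rounding is bounded by half a scale step.
print(q.dtype, bool(err <= np.max(s) / 2 + 1e-6))
```

The 6-bit codes here are merely stored in int8 containers; packing them densely and multiplying them efficiently is exactly the systems problem the paper's BTC-based kernel addresses.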

Updated: 2025-08-06 12:47:05

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04405v1

Why are LLMs' abilities emergent?

The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of "creation without understanding" that characterises contemporary AI development. We explore how the neural approach's reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, we demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than from parameter scaling alone. Our investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. We argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

Updated: 2025-08-06 12:43:04

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04401v1

Industrial LLM-based Code Optimization under Regulation: A Mixture-of-Agents Approach

Recent advancements in Large Language Models (LLMs) for code optimization have enabled industrial platforms to automate software performance engineering at unprecedented scale and speed. Yet, organizations in regulated industries face strict constraints on which LLMs they can use - many cannot utilize commercial models due to data privacy regulations and compliance requirements, creating a significant challenge for achieving high-quality code optimization while maintaining cost-effectiveness. We address this by implementing a Mixture-of-Agents (MoA) approach that directly synthesizes code from multiple specialized LLMs, comparing it against TurinTech AI's vanilla Genetic Algorithm (GA)-based ensemble system and individual LLM optimizers using real-world industrial codebases. Our key contributions include: (1) First MoA application to industrial code optimization using real-world codebases; (2) Empirical evidence that MoA excels with open-source models, achieving 14.3% to 22.2% cost savings and 28.6% to 32.2% faster optimization times for regulated environments; (3) Deployment guidelines demonstrating GA's advantage with commercial models while both ensembles outperform individual LLMs; and (4) Real-world validation across 50 code snippets and seven LLM combinations, generating over 8,700 variants, addresses gaps in industrial LLM ensemble evaluation. This provides actionable guidance for organizations balancing regulatory compliance with optimization performance in production environments.

Updated: 2025-08-06 12:41:21

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.03329v2

Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1:0.66). LLMs excelled in recall for some variants (e.g., GEMMA3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.

Updated: 2025-08-06 12:41:18

Categories: cs.CL,cs.AI,cs.IR,cs.LG

Download: http://arxiv.org/abs/2508.04399v1

GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Despite this promise, the optimal manner of applying RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. We find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration. First, we decompose RFT into its core components and analyze the optimal formulation of each. Second, we propose a novel Adversarial KL Factor that dynamically stabilizes training to mitigate reward over-optimization. Third, we further explore the training configurations of RFT to enhance effectiveness. Extensive experiments show that GuirlVG, with only 5.2K training samples, outperforms SFT methods trained on over 10M samples, achieving a 7.7% improvement on ScreenSpot, a 17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2.

Updated: 2025-08-06 12:35:24

Categories: cs.AI

Download: http://arxiv.org/abs/2508.04389v1

Artificial Consciousness as Interface Representation

Whether artificial intelligence (AI) systems can possess consciousness is a contentious question because of the inherent challenges of defining and operationalizing subjective experience. This paper proposes a framework to reframe the question of artificial consciousness into empirically tractable tests. We introduce three evaluative criteria - S (subjective-linguistic), L (latent-emergent), and P (phenomenological-structural) - collectively termed SLP-tests, which assess whether an AI system instantiates interface representations that facilitate consciousness-like properties. Drawing on category theory, we model interface representations as mappings between relational substrates (RS) and observable behaviors, akin to specific types of abstraction layers. The SLP-tests collectively operationalize subjective experience not as an intrinsic property of physical systems but as a functional interface to a relational entity.

Updated: 2025-08-06 12:25:06

Categories: cs.AI,q-bio.NC

Download: http://arxiv.org/abs/2508.04383v1

Streaming Generated Gaussian Process Experts for Online Learning and Control

Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a streaming kernel-induced progressively generated expert framework of Gaussian processes (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.
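As a rough illustration of the bounded-expert bookkeeping described above, the one-dimensional sketch below routes streaming points to the nearest expert (by distance to its running mean) and spawns a new expert only while under a cap. The distance rule and all names are illustrative assumptions; SkyGP's actual assignment is kernel-induced and is paired with GP inference and its accompanying performance guarantees.

```python
def assign_to_expert(experts, x, max_experts, radius):
    """Bounded-expert bookkeeping sketch: route a streaming point to the
    nearest expert (by distance to its running mean), spawning a new
    expert only while under the cap. 1-D points; no GP inference here."""
    def center(e):
        return sum(e) / len(e)
    if experts:
        d, nearest = min((abs(center(e) - x), i) for i, e in enumerate(experts))
        if d <= radius or len(experts) >= max_experts:
            experts[nearest].append(x)
            return nearest
    experts.append([x])
    return len(experts) - 1

experts = []
for x in [0.0, 0.1, 5.0, 5.2, 9.9, 10.1, 0.05]:
    assign_to_expert(experts, x, max_experts=3, radius=1.0)
# Three clusters of points yield three experts, and the cap is never exceeded.
```

Keeping the expert count bounded is what caps the per-update cost; which points land in which expert's dataset then determines prediction quality, which is where the Dense/Fast variants would differ.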

Updated: 2025-08-06 12:22:27

Categories: cs.LG,cs.SY,eess.SY,stat.ML

Download: http://arxiv.org/abs/2508.03679v2

ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition

Ear biometrics offer a stable and contactless modality for identity recognition, yet their effectiveness remains limited by the scarcity of annotated data and significant intra-class variability. Existing methods typically extract identity features from individual impressions in isolation, restricting their ability to capture consistent and discriminative representations. To overcome these limitations, a few-shot learning framework, ProtoN, is proposed to jointly process multiple impressions of an identity using a graph-based approach. Each impression is represented as a node in a class-specific graph, alongside a learnable prototype node that encodes identity-level information. This graph is processed by a Prototype Graph Neural Network (PGNN) layer, specifically designed to refine both impression and prototype representations through a dual-path message-passing mechanism. To further enhance discriminative power, the PGNN incorporates a cross-graph prototype alignment strategy that improves class separability by enforcing intra-class compactness while maintaining inter-class distinction. Additionally, a hybrid loss function is employed to balance episodic and global classification objectives, thereby improving the overall structure of the embedding space. Extensive experiments on five benchmark ear datasets demonstrate that ProtoN achieves state-of-the-art performance, with Rank-1 identification accuracy of up to 99.60% and an Equal Error Rate (EER) as low as 0.025, showing the effectiveness for few-shot ear recognition under limited data conditions.

Updated: 2025-08-06 12:21:38

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04381v1

AuthPrint: Fingerprinting Generative Models Against Malicious Model Providers

Generative models are increasingly adopted in high-stakes domains, yet current deployments offer no mechanisms to verify the origin of model outputs. We address this gap by extending model fingerprinting techniques beyond the traditional collaborative setting to one where the model provider may act adversarially. To our knowledge, this is the first work to evaluate fingerprinting for provenance attribution under such a threat model. The methods rely on a trusted verifier that extracts secret fingerprints from the model's output space, unknown to the provider, and trains a model to predict and verify them. Our empirical evaluation shows that our methods achieve near-zero FPR@95%TPR for instances of GAN and diffusion models, even when tested on small modifications to the original architecture and training data. Moreover, the methods remain robust against adversarial attacks that actively modify the outputs to bypass detection. Source codes are available at https://github.com/PSMLab/authprint.

Updated: 2025-08-06 12:17:38

Categories: cs.CR

Download: http://arxiv.org/abs/2508.05691v1

VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones

Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) multivariate-forecasting gap between standard RGB three-channel-based vision models and the need to model time series with arbitrary numbers of variates; and (3) probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a vision-model-based time series foundation model (TSFM) that performs continual pre-training on large-scale time series datasets, with three innovations: (1) a vision-model-based filtering mechanism to identify high-quality time series data, thereby mitigating the modality gap and improving pre-training stability; (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts at different quantile levels, thus more flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, VisionTS++ achieves SOTA results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.
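Parallel quantile heads of the kind described above are conventionally trained with the pinball (quantile) loss, sketched below; this is the standard formulation, not necessarily the paper's exact objective.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for quantile level q in (0, 1): it charges
    under-prediction at rate q and over-prediction at rate 1 - q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)

def multi_quantile_loss(y_true, preds_by_quantile):
    """Average pinball loss over parallel quantile heads."""
    losses = [pinball_loss(y_true, y_hat, q) for q, y_hat in preds_by_quantile.items()]
    return sum(losses) / len(losses)

# One target value scored against three quantile heads (10%, 50%, 90%).
loss = multi_quantile_loss(5.0, {0.1: 3.0, 0.5: 5.0, 0.9: 7.0})
```

Because the loss is asymmetric, the 90% head learns to sit above most targets and the 10% head below them, which is how a set of heads approximates an output distribution without a parametric assumption.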

Updated: 2025-08-06 12:17:09

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04379v1

HALO: Hindsight-Augmented Learning for Online Auto-Bidding

Digital advertising platforms operate millisecond-level auctions through Real-Time Bidding (RTB) systems, where advertisers compete for ad impressions through algorithmic bids. This dynamic mechanism enables precise audience targeting but introduces profound operational complexity due to advertiser heterogeneity: budgets and ROI targets span orders of magnitude across advertisers, from individual merchants to multinational brands. This diversity creates a demanding adaptation landscape for Multi-Constraint Bidding (MCB). Traditional auto-bidding solutions fail in this environment due to two critical flaws: 1) severe sample inefficiency, where failed explorations under specific constraints yield no transferable knowledge for new budget-ROI combinations, and 2) limited generalization under constraint shifts, as they ignore physical relationships between constraints and bidding coefficients. To address this, we propose HALO: Hindsight-Augmented Learning for Online Auto-Bidding. HALO introduces a theoretically grounded hindsight mechanism that repurposes all explorations into training data for arbitrary constraint configuration via trajectory reorientation. Further, it employs B-spline functional representation, enabling continuous, derivative-aware bid mapping across constraint spaces. HALO ensures robust adaptation even when budget/ROI requirements differ drastically from training scenarios. Industrial dataset evaluations demonstrate the superiority of HALO in handling multi-scale constraints, reducing constraint violations while improving GMV.
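The hindsight mechanism can be pictured in the spirit of hindsight experience replay: a trajectory explored under one budget/ROI target is relabeled as a success for the constraint configuration it actually realized. The sketch below is a deliberately simplified illustration with invented field names, not HALO's actual trajectory-reorientation procedure.

```python
def hindsight_relabel(trajectory):
    """Hindsight sketch: a bidding trajectory explored under one
    (budget, ROI) target is repurposed as a *successful* sample for the
    constraint configuration it actually realized, so failed exploration
    still yields training data. All field names are illustrative."""
    spend = trajectory["spend"]
    realized_roi = trajectory["value"] / spend if spend else 0.0
    return {
        "bids": trajectory["bids"],
        "budget_target": spend,        # relabeled to the budget actually used
        "roi_target": realized_roi,    # relabeled to the ROI actually achieved
        "success": True,               # meets the relabeled goals by construction
    }

# A trajectory that missed its intended targets (budget 40.0, ROI 3.0) ...
traj = {"bids": [1.2, 0.8, 1.5], "spend": 50.0, "value": 100.0,
        "budget_target": 40.0, "roi_target": 3.0}
relabeled = hindsight_relabel(traj)  # ... becomes a success for (50.0, 2.0)
```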

Updated: 2025-08-06 12:13:21

Categories: cs.LG

Download: http://arxiv.org/abs/2508.03267v2

Continual Multiple Instance Learning for Hematologic Disease Diagnosis

The dynamic environment of laboratories and clinics, with streams of data arriving on a daily basis, requires regular updates of trained machine learning models for consistent performance. Continual learning is supposed to help train models without catastrophic forgetting. However, state-of-the-art methods are ineffective for multiple instance learning (MIL), which is often used in single-cell-based hematologic disease diagnosis (e.g., leukemia detection). Here, we propose the first continual learning method tailored specifically to MIL. Our method is rehearsal-based over a selection of single instances from various bags. We use a combination of the instance attention score and distance from the bag mean and class mean vectors to carefully select which samples and instances to store in exemplary sets from previous tasks, preserving the diversity of the data. Using the real-world input of one month of data from a leukemia laboratory, we study the effectiveness of our approach in a class incremental scenario, comparing it to well-known continual learning methods. We show that our method considerably outperforms state-of-the-art methods, providing the first continual learning approach for MIL. This enables the adaptation of models to shifting data distributions over time, such as those caused by changes in disease occurrence or underlying genetic alterations.
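The exemplar-selection idea above, combining the instance attention score with distances from the bag mean and the class mean, can be sketched as a simple ranking rule. The weighted-sum scoring and its weights below are illustrative assumptions; the paper's exact combination rule may differ.

```python
def select_exemplars(instances, bag_mean, class_mean, k,
                     w_att=1.0, w_bag=1.0, w_cls=1.0):
    """Rank a bag's instances for the rehearsal buffer by combining the
    attention score with closeness to the bag mean and the class mean.
    Instances are (feature_vector, attention_score) pairs; the weighted
    sum and its weights are illustrative assumptions."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    scored = [
        (w_att * att - w_bag * dist(x, bag_mean) - w_cls * dist(x, class_mean), x, att)
        for x, att in instances
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(x, att) for _, x, att in scored[:k]]

# A bag with one high-attention instance, one outlier, one typical instance.
bag = [([0.0, 0.0], 0.9), ([5.0, 5.0], 0.2), ([0.5, 0.5], 0.7)]
picked = select_exemplars(bag, bag_mean=[0.5, 0.5], class_mean=[0.0, 0.0], k=2)
```

The distance terms keep the buffer representative of each bag and class, while the attention term keeps the instances the MIL pooling actually relied on.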

Updated: 2025-08-06 12:03:25

Categories: cs.LG,cs.CV,eess.IV,q-bio.QM

Download: http://arxiv.org/abs/2508.04368v1

OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing

While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive "less is more" paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark.

Updated: 2025-08-06 11:58:58

Categories: cs.AI

Download: http://arxiv.org/abs/2508.04361v1

Deep Exploration with PAC-Bayes

Reinforcement learning (RL) for continuous control under delayed rewards is an under-explored problem despite its significance in real-world applications. Many complex skills are based on intermediate ones as prerequisites. For instance, a humanoid locomotor must learn how to stand before it can learn to walk. To cope with delayed reward, an agent must perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and their generalization to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do this, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution, and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk and critic-specific heads. The agent performs deep exploration by acting epsilon-softly on a randomly chosen actor head. Our proposed algorithm, named {\it PAC-Bayesian Actor-Critic (PBAC)}, is the only algorithm to consistently discover delayed rewards on continuous control tasks with varying difficulty.

Updated: 2025-08-06 11:50:34

Categories: cs.LG

Download: http://arxiv.org/abs/2402.03055v4

LUST: A Multi-Modal Framework with Hierarchical LLM-based Scoring for Learned Thematic Significance Tracking in Multimedia Content

This paper introduces the Learned User Significance Tracker (LUST), a framework designed to analyze video content and quantify the thematic relevance of its segments in relation to a user-provided textual description of significance. LUST leverages a multi-modal analytical pipeline, integrating visual cues from video frames with textual information extracted via Automatic Speech Recognition (ASR) from the audio track. The core innovation lies in a hierarchical, two-stage relevance scoring mechanism employing Large Language Models (LLMs). An initial "direct relevance" score, $S_{d,i}$, assesses individual segments based on immediate visual and auditory content against the theme. This is followed by a "contextual relevance" score, $S_{c,i}$, that refines the assessment by incorporating the temporal progression of preceding thematic scores, allowing the model to understand evolving narratives. The LUST framework aims to provide a nuanced, temporally-aware measure of user-defined significance, outputting an annotated video with visualized relevance scores and comprehensive analytical logs.
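A minimal sketch of the two-stage scoring: the contextual score $S_{c,i}$ blends the direct score $S_{d,i}$ with an exponentially smoothed history of the preceding scores, so an evolving theme lifts later segments. The blend rule and decay constant are illustrative assumptions, not the paper's actual formula.

```python
def contextual_scores(direct_scores, decay=0.7):
    """Two-stage relevance sketch: each contextual score blends the
    segment's direct score with an exponentially smoothed history of the
    preceding scores, so a theme that has been building lifts later
    segments. Blend rule and decay are illustrative assumptions."""
    contextual = []
    history = 0.0
    for i, s_d in enumerate(direct_scores):
        s_c = s_d if i == 0 else 0.5 * s_d + 0.5 * history
        contextual.append(s_c)
        history = decay * history + (1.0 - decay) * s_d
    return contextual

# Direct scores for five segments of a theme that builds, then fades:
# the last segment's contextual score stays above its direct score.
ctx = contextual_scores([0.1, 0.2, 0.8, 0.9, 0.3])
```

In LUST the history-aware stage is performed by an LLM rather than a fixed recurrence, but the same temporal asymmetry applies: a segment is judged in light of what came before it.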

Updated: 2025-08-06 11:48:51

Categories: cs.MM,cs.AI,68T07

Download: http://arxiv.org/abs/2508.04353v1

DBSCAN in domains with periodic boundary conditions

Many scientific problems involve data that is embedded in a space with periodic boundary conditions. This can for instance be related to an inherent cyclic or rotational symmetry in the data or a spatially extended periodicity. When analyzing such data, well-tailored methods are needed to obtain efficient approaches that obey the periodic boundary conditions of the problem. In this work, we present a method for applying a clustering algorithm to data embedded in a periodic domain based on the DBSCAN algorithm, a widely used unsupervised machine learning method that identifies clusters in data. The proposed method internally leverages the conventional DBSCAN algorithm for domains with open boundaries, such that it remains compatible with all optimized implementations for neighborhood searches in open domains. In this way, it retains the same optimized runtime complexity of $O(N\log N)$. We demonstrate the workings of the proposed method using synthetic data in one, two and three dimensions and also apply it to a real-world example involving the clustering of bubbles in a turbulent flow. The proposed approach is implemented in a ready-to-use Python package that we make publicly available.
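The core idea, running DBSCAN while measuring distances under the minimum-image convention of a periodic domain, can be sketched in one dimension as follows. This toy uses a naive $O(N^2)$ neighbor search for clarity; the paper's method instead wraps optimized open-boundary DBSCAN implementations internally to keep the $O(N\log N)$ runtime.

```python
def periodic_dist(a, b, period):
    """Minimum-image distance between two points on a 1-D periodic domain."""
    d = abs(a - b) % period
    return min(d, period - d)

def dbscan_periodic(points, eps, min_pts, period):
    """Toy DBSCAN whose neighborhoods respect periodic boundaries.
    Returns one label per point, with -1 marking noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if periodic_dist(points[i], points[j], period) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise absorbed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # expand only through core points
                queue.extend(neighbors[j])
    return labels

# On the unit circle, 0.98 and 0.02 are only 0.04 apart because the
# boundary wraps, so they join the same cluster.
pts = [0.98, 0.02, 0.04, 0.50, 0.52, 0.54]
labels = dbscan_periodic(pts, eps=0.05, min_pts=2, period=1.0)
```

Swapping `periodic_dist` for a plain Euclidean metric recovers ordinary DBSCAN, which is why an open-boundary implementation can be reused once distances (or the data layout) account for the wrap-around.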

Updated: 2025-08-06 11:47:22

Categories: cs.LG,physics.comp-ph,physics.flu-dyn

Download: http://arxiv.org/abs/2501.16894v2

CITRAS: Covariate-Informed Transformer for Time Series Forecasting

In practical time series forecasting, covariates provide rich contextual information that can potentially enhance the forecast of target variables. Although some covariates extend into the future forecasting horizon (e.g., calendar events, discount schedules), most multivariate models fail to leverage this pivotal insight due to the length discrepancy with target variables. Additionally, capturing the dependency between target variables and covariates is non-trivial, as models must precisely reflect the local impact of covariates while also capturing global cross-variate dependencies. To overcome these challenges, we propose CITRAS, a decoder-only Transformer that flexibly leverages multiple targets, past covariates, and future covariates. While preserving strong autoregressive capabilities, CITRAS introduces two novel mechanisms in patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates future covariates into the forecasting of target variables based on their concurrent dependencies. Additionally, Attention Score Smoothing refines locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the past series of attention scores. Experimentally, CITRAS outperforms state-of-the-art models on thirteen real-world benchmarks from both covariate-informed and multivariate settings, demonstrating its versatile ability to leverage cross-variate and cross-time dependencies for improved forecasting accuracy.

Updated: 2025-08-06 11:46:11

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.24007v3

Multi-Marginal Stochastic Flow Matching for High-Dimensional Snapshot Data at Irregular Time Points

Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets, including gene expression data collected at uneven time points and an image progression task, demonstrating the method's versatility.

Updated: 2025-08-06 11:43:20

Categories: cs.LG,cs.NE,I.2,I.2.6

Download: http://arxiv.org/abs/2508.04351v1

Chain of Questions: Guiding Multimodal Curiosity in Language Models

Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model's ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

Updated: 2025-08-06 11:42:54

Categories: cs.CL,cs.AI,cs.CV,cs.LG,cs.MA

Download: http://arxiv.org/abs/2508.04350v1

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}), which assigns an entropy-weighted reward to each token for fine-grained credit assignment. 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}), which assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
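A minimal sketch of entropy-weighted credit assignment, assuming token weights proportional to each token's policy entropy normalized over the sequence; the paper's exact weighting and the GRPO-S scaling rule may differ, and `alpha` below is a hypothetical knob.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gtpo_token_rewards(step_distributions, sequence_reward):
    """Token-level credit sketch: split a sequence-level reward across
    tokens in proportion to each token's policy entropy, so high-entropy
    tokens in a correct response receive more credit."""
    ents = [token_entropy(p) for p in step_distributions]
    total = sum(ents) or 1.0
    return [sequence_reward * e / total for e in ents]

def grpos_sequence_reward(step_distributions, sequence_reward, alpha=1.0):
    """Sequence-level sketch: scale the whole reward by the mean token
    entropy (alpha is a hypothetical scaling knob)."""
    mean_ent = sum(token_entropy(p) for p in step_distributions) / len(step_distributions)
    return sequence_reward * (1.0 + alpha * mean_ent)

# A 3-token response: one confident step, one maximally uncertain step.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximum entropy over 4 tokens
    [0.90, 0.05, 0.03, 0.02],  # moderately low entropy
]
per_token = gtpo_token_rewards(dists, sequence_reward=1.0)
```

Both variants push credit toward the uncertain decision points of a correct answer instead of spreading it uniformly, which is the contrast with vanilla GRPO drawn in the abstract.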

Updated: 2025-08-06 11:42:47

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04349v1

From Split to Share: Private Inference with Distributed Feature Sharing

Cloud-based Machine Learning as a Service (MLaaS) raises serious privacy concerns when handling sensitive client data. Existing Private Inference (PI) methods face a fundamental trade-off between privacy and efficiency: cryptographic approaches offer strong protection but incur high computational overhead, while efficient alternatives such as split inference expose intermediate features to inversion attacks. We propose PrivDFS, a new paradigm for private inference that replaces a single exposed representation with distributed feature sharing. PrivDFS partitions input features on the client into multiple balanced shares, which are distributed to non-colluding, non-communicating servers for independent partial inference. The client securely aggregates the servers' outputs to reconstruct the final prediction, ensuring that no single server observes sufficient information to compromise input privacy. To further strengthen privacy, we propose two key extensions: PrivDFS-AT, which uses adversarial training with a diffusion-based proxy attacker to enforce inversion-resistant feature partitioning, and PrivDFS-KD, which leverages user-specific keys to diversify partitioning policies and prevent query-based inversion generalization. Experiments on CIFAR-10 and CelebA demonstrate that PrivDFS achieves privacy comparable to deep split inference while cutting client computation by up to 100 times with no accuracy loss, and that the extensions remain robust against both diffusion-based in-distribution and adaptive attacks.
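
The share-and-aggregate pattern can be sketched with simple additive sharing. This is an illustrative stand-in (the function names and Gaussian masking are assumptions): PrivDFS actually learns balanced, inversion-resistant partitions rather than adding random masks.

```python
import random

def split_into_shares(features, k, seed=0):
    # Split a feature vector into k additive shares; any single share
    # (indeed any k-1 of them) looks like noise on its own.
    rng = random.Random(seed)
    shares = [[rng.gauss(0.0, 1.0) for _ in features] for _ in range(k - 1)]
    residual = [x - sum(s[i] for s in shares) for i, x in enumerate(features)]
    return shares + [residual]

def aggregate(partial_outputs):
    # For a linear server-side computation, the outputs on the shares
    # sum to the output on the original input, so the client just adds.
    return [sum(vals) for vals in zip(*partial_outputs)]

x = [1.0, -2.0, 3.5]
shares = split_into_shares(x, k=3)
reconstructed = aggregate(shares)  # equals x up to floating-point error
```

The key property mirrored here is that reconstruction happens only at the client; no single server ever holds enough to invert the input.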

Updated: 2025-08-06 11:41:10

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.04346v1

On the Fundamental Impossibility of Hallucination Control in Large Language Models

This paper establishes a fundamental impossibility theorem: no LLM capable of performing non-trivial knowledge aggregation can simultaneously achieve truthful (internally consistent) knowledge representation, semantic information conservation, complete revelation of relevant knowledge, and knowledge-constrained optimality. This impossibility is not an engineering limitation but arises from the mathematical structure of information aggregation itself. We establish this result by describing the inference process as an auction of ideas, in which distributed components compete, exploiting their partial knowledge, to shape responses. The proof spans three independent mathematical domains: mechanism design theory (Green-Laffont), the theory of proper scoring rules (Savage), and direct architectural analysis of transformers (Log-Sum-Exp convexity). In particular, we show that in strictly concave settings the score of an aggregate of diverse beliefs strictly exceeds the sum of individual scores. That gap may quantify the creation of unattributable certainty or overconfidence: the mathematical origin of both hallucination and creativity, or imagination. To support this analysis, we introduce the complementary concepts of the semantic information measure and the emergence operator to model bounded reasoning in a general setting. We prove that while bounded reasoning generates accessible information, providing valuable insights and inspirations, idealized reasoning strictly preserves semantic content. By demonstrating that hallucination and imagination are mathematically identical phenomena, both grounded in the necessary violation of information conservation, this paper offers a principled foundation for managing these behaviors in advanced AI systems. Finally, we present some speculative ideas to inspire evaluation and refinement of the proposed theory.
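
The aggregation gap at the heart of the argument can be checked numerically with the logarithmic scoring rule. This is a toy two-outcome instance, not the paper's general construction: because the log is strictly concave, the score of an averaged belief strictly exceeds the average of the individual scores whenever the beliefs differ.

```python
import math

def log_score(dist, outcome):
    # Logarithmic proper scoring rule: log-probability assigned to the
    # realized outcome.
    return math.log(dist[outcome])

p = [0.9, 0.1]  # one component's belief over two outcomes
q = [0.2, 0.8]  # a second, diverse belief
avg = [(a + b) / 2 for a, b in zip(p, q)]  # the aggregated belief

outcome = 0
# Jensen's inequality (concavity of log): the aggregate is scored as
# more certain than the average of its parts, leaving a positive gap.
gap = log_score(avg, outcome) - (log_score(p, outcome) + log_score(q, outcome)) / 2
```

That positive gap is the toy analogue of the "unattributable certainty" the abstract describes.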

Updated: 2025-08-06 11:34:54

Subjects: stat.ML,cs.AI,cs.CL,cs.GT,cs.LG

Download: http://arxiv.org/abs/2506.06382v4

Sign Spotting Disambiguation using Large Language Models

Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method's superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
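
The dictionary-matching stage, dynamic time warping over per-frame cosine distances, can be sketched in plain Python. The two-dimensional features and gloss names below are illustrative placeholders for the real spatio-temporal and hand-shape features.

```python
def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def dtw_distance(seq_a, seq_b):
    # Dynamic time warping over per-frame cosine *distances*, so two
    # clips of the same sign align even at different signing speeds.
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - cosine(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def spot_gloss(query, dictionary):
    # Return the gloss whose exemplar sequence is closest under DTW.
    return min(dictionary, key=lambda g: dtw_distance(query, dictionary[g]))

sign_dictionary = {
    "HELLO": [[1.0, 0.0], [0.0, 1.0]],
    "THANKS": [[-1.0, 0.0], [0.0, -1.0]],
}
query_clip = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]  # a slower "HELLO"
best = spot_gloss(query_clip, sign_dictionary)
```

In the full system the top-k matches produced this way would then be disambiguated by the LLM via beam search.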

Updated: 2025-08-06 11:34:07

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.03703v3

Bases of Riemann-Roch spaces associated with arbitrary elliptic curve divisors and their application in constructing various elliptic code families

In this paper, we determine explicit bases for Riemann--Roch spaces associated with various families of elliptic codes. We establish the feasibility and provide exact algorithms for constructing bases of Riemann--Roch spaces corresponding to arbitrary divisors on elliptic curves. These results are subsequently applied to derive bases for quasi-cyclic elliptic codes and their subfield subcodes, as well as for the class of Goppa-like elliptic codes. For algebraic geometry code applications, having an explicit description of Riemann--Roch space bases for arbitrary divisors is particularly valuable, as it simultaneously enables efficient code construction and reveals structural properties of the codes, leading to new cryptanalysis methods when these codes are employed in cryptographic schemes.

Updated: 2025-08-06 11:34:05

Subjects: cs.IT,cs.CR,math.AG,math.IT,14H05, 94B27, 11T71

Download: http://arxiv.org/abs/2508.04340v1

Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models

Large language models often fail at logical reasoning when semantic heuristics conflict with decisive evidence - a phenomenon we term cognitive traps. To address this fundamental limitation, we introduce the Deliberative Reasoning Network (DRN), a novel paradigm that reframes logical reasoning from probability maximization to uncertainty minimization. Instead of asking "Which answer is most likely?", DRN asks "Which hypothesis has the most internally consistent evidence?". DRN achieves intrinsic interpretability by explicitly tracking belief states and quantifying epistemic uncertainty for competing hypotheses through an iterative evidence synthesis process. We validate our approach through two complementary architectures - a bespoke discriminative model that embodies the core uncertainty minimization principle, and a lightweight verification module that enhances existing generative LLMs. Evaluated on LCR-1000, our new adversarial reasoning benchmark designed to expose cognitive traps, the bespoke DRN achieves up to 15.2% improvement over standard baselines. When integrated as a parameter-efficient verifier with Mistral-7B, our hybrid system boosts accuracy from 20% to 80% on the most challenging problems. Critically, DRN demonstrates strong zero-shot generalization, improving TruthfulQA performance by 23.6% without additional training, indicating that uncertainty-driven deliberation learns transferable reasoning principles. We position DRN as a foundational, verifiable System 2 reasoning component for building more trustworthy AI systems.
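
The shift from "which answer is most likely?" to "which hypothesis has the most internally consistent evidence?" can be sketched as entropy-minimizing hypothesis selection. This is a toy distillation of the idea only; the hypothesis names and belief values are invented.

```python
import math

def entropy(probs):
    # Lower entropy means more internally consistent evidence.
    return -sum(p * math.log(p) for p in probs if p > 0)

def deliberative_select(evidence):
    # Pick the hypothesis whose evidence distribution has the lowest
    # entropy, rather than the one with the highest raw probability.
    return min(evidence, key=lambda h: entropy(evidence[h]))

evidence = {
    "H1": [0.95, 0.05],  # evidence coherently supports H1
    "H2": [0.55, 0.45],  # semantically tempting but conflicted
}
choice = deliberative_select(evidence)
```

A probability-maximizing selector could be trapped by a semantic heuristic favoring H2; the uncertainty-minimizing rule prefers the hypothesis whose evidence is coherent.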

Updated: 2025-08-06 11:33:35

Subjects: cs.AI

Download: http://arxiv.org/abs/2508.04339v1

Symmetry & Critical Points for Symmetric Tensor Decomposition Problems

We consider the nonconvex optimization problem associated with the decomposition of a real symmetric tensor into a sum of rank-one terms. Use is made of the rich symmetry structure to construct infinite families of critical points represented by Puiseux series in the problem dimension, and so obtain precise analytic estimates on the objective function value and the Hessian spectrum. The results enable an analytic characterization of various obstructions to local optimization methods, revealing, in particular, a complex array of saddles and minima that differ in their symmetry, structure, and analytic properties. A notable phenomenon, observed for all critical points considered, concerns the index of the Hessian increasing with the objective function value.

Updated: 2025-08-06 11:33:31

Subjects: math.OC,cs.LG,cs.NA,math.AG,math.NA,stat.ML

Download: http://arxiv.org/abs/2306.07886v5

Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets

This study investigates the self-rationalization framework constructed with a cooperative game, where a generator initially extracts the most informative segment from raw input, and a subsequent predictor utilizes the selected subset for its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game could unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator might inadvertently create an incorrect correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. Subsequently, we elucidate the origins of this bias using both detailed theoretical analysis and empirical evidence. Our findings suggest a direction for inspecting these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning the correlations. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method not only significantly outperforms recent rationalization methods, but also achieves comparable or even better results than a representative LLM (llama3.1-8b-instruct).

Updated: 2025-08-06 11:31:36

Subjects: cs.AI

Download: http://arxiv.org/abs/2505.02118v5

Modelling and Classifying the Components of a Literature Review

Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96\% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

Updated: 2025-08-06 11:30:07

Subjects: cs.CL,cs.AI,cs.HC,cs.IR

Download: http://arxiv.org/abs/2508.04337v1

The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks: file path identification from issue descriptions alone, and ground-truth function reproduction with only the current file context and issue description, to probe models' underlying knowledge. We present empirical evidence that performance gains on SWE-Bench Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. This performance drops to at most 53% on tasks from repositories not included in SWE-Bench, pointing to possible data contamination or memorization. Similar patterns are observed for the function reproduction task, where verbatim similarity is much higher on SWE-Bench Verified than on other similar coding benchmarks (up to 35% consecutive 5-gram accuracy on SWE-Bench Verified and Full, but only up to 18% for tasks in other benchmarks). These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.
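
The verbatim-similarity probe can be implemented as a sliding window of token n-grams. This is a plausible reading of the "consecutive 5-gram accuracy" metric; the paper's exact tokenization is not specified in this abstract.

```python
def consecutive_ngram_accuracy(generated, reference, n=5):
    # Fraction of length-n token windows of the generated text that
    # appear verbatim in the reference: a crude memorization signal.
    tokens_g, tokens_r = generated.split(), reference.split()
    if len(tokens_g) < n or len(tokens_r) < n:
        return 0.0
    ref_ngrams = {tuple(tokens_r[i:i + n]) for i in range(len(tokens_r) - n + 1)}
    windows = [tuple(tokens_g[i:i + n]) for i in range(len(tokens_g) - n + 1)]
    hits = sum(1 for w in windows if w in ref_ngrams)
    return hits / len(windows)
```

A value near 1.0 on benchmark tasks but near 0.0 on held-out repositories is exactly the contamination signature the paper reports.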

Updated: 2025-08-06 11:23:11

Subjects: cs.AI,cs.SE

Download: http://arxiv.org/abs/2506.12286v3

Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume; otherwise it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing the tokens within each corpus into two parts, positive and negative tokens, based on whether they are useful for improving model performance. Positive tokens can be trained on in the usual way, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization keeps the model from absorbing less informative messages, and the forgetting process shapes a knowledge boundary that guides the model on what information to learn more precisely. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
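
One way to realize the learn/forget split is a per-token sign flip on the log-likelihood term. This is a minimal sketch of the mechanism only; the paper's actual objective and weighting may differ.

```python
def sft_loss_with_forgetting(token_log_probs, labels, forget_weight=1.0):
    # Cross-entropy over positive tokens plus a sign-flipped, weighted
    # term over negative tokens, so gradient descent actively lowers
    # the likelihood of tokens marked for forgetting.
    #   token_log_probs: model log-probabilities of the target tokens
    #   labels: +1 for positive tokens, -1 for negative tokens
    loss = 0.0
    for lp, lab in zip(token_log_probs, labels):
        if lab > 0:
            loss += -lp                 # learn: push log-likelihood up
        else:
            loss += forget_weight * lp  # forget: push log-likelihood down
    return loss / len(labels)
```

With `forget_weight=0` this reduces to ordinary SFT that simply masks negative tokens; a positive weight makes the forgetting explicit rather than passive.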

Updated: 2025-08-06 11:22:23

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.04329v1

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

Updated: 2025-08-06 11:11:40

Subjects: cs.CL,cs.AI,cs.CV,cs.LG,cs.MM

Download: http://arxiv.org/abs/2508.04325v1

InfoQ: Mixed-Precision Quantization via Global Information Flow

Mixed-precision quantization (MPQ) is crucial for deploying deep neural networks on resource-constrained devices, but finding the optimal bit-width for each layer represents a complex combinatorial optimization problem. Current state-of-the-art methods rely on computationally expensive search algorithms or local sensitivity heuristic proxies like the Hessian, which fail to capture the cascading global effects of quantization error. In this work, we argue that the quantization sensitivity of a layer should not be measured by its local properties, but by its impact on the information flow throughout the entire network. We introduce InfoQ, a novel framework for MPQ that is training-free in the bit-width search phase. InfoQ assesses layer sensitivity by quantizing each layer at different bit-widths and measuring, through a single forward pass, the resulting change in mutual information in the subsequent layers. This quantifies how much each layer quantization impacts the network information flow. The resulting scores are used to formulate bit-width allocation as an integer linear programming problem, which is solved efficiently to minimize total sensitivity under a given budget (e.g., model size or BitOps). Our retraining-free search phase provides a superior search-time/accuracy trade-off (using two orders of magnitude less data compared to state-of-the-art methods such as LIMPQ), while yielding up to a 1% accuracy improvement for MobileNetV2 and ResNet18 on ImageNet at high compression rates (14X and 10.66X).
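
The final allocation step can be sketched as the small integer program the abstract describes, solved here by brute force for a handful of layers; the sensitivity scores, layer sizes, and bit-width menu below are invented for illustration.

```python
from itertools import product

def allocate_bitwidths(sensitivity, sizes, budget_bits):
    # Pick a bit-width per layer from {2, 4, 8} minimizing total
    # sensitivity subject to a model-size budget. InfoQ poses this as
    # an ILP; exhaustive search is fine for a toy number of layers.
    #   sensitivity[layer][bits] -> information-flow degradation score
    #   sizes[layer]             -> number of weights in the layer
    choices = (2, 4, 8)
    best, best_cost = None, float("inf")
    for assign in product(choices, repeat=len(sizes)):
        total_bits = sum(b * s for b, s in zip(assign, sizes))
        if total_bits > budget_bits:
            continue
        cost = sum(sensitivity[i][b] for i, b in enumerate(assign))
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Two layers; layer 0 is far more sensitive, so it should keep more bits.
sens = [{2: 9.0, 4: 3.0, 8: 0.5}, {2: 1.0, 4: 0.4, 8: 0.1}]
sizes = [100, 100]
plan, cost = allocate_bitwidths(sens, sizes, budget_bits=1000)
```

The point of the sketch is the division of labor: sensitivity estimation (InfoQ's mutual-information probe) is done once per layer and bit-width, and the combinatorial search only consumes the resulting scores.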

Updated: 2025-08-06 11:07:49

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.04753v1

PAK-UCB Contextual Bandit: An Online Learning Approach to Prompt-Aware Selection of Generative Models and LLMs

Selecting a sample generation scheme from multiple prompt-based generative models, including large language models (LLMs) and prompt-guided image and video generation models, is typically addressed by choosing the model that maximizes an averaged evaluation score. However, this score-based selection overlooks the possibility that different models achieve the best generation performance for different types of text prompts. An online identification of the best generation model for various input prompts can reduce the costs associated with querying sub-optimal models. In this work, we explore the possibility of varying rankings of text-based generative models for different text prompts and propose an online learning framework to predict the best data generation model for a given input prompt. The proposed PAK-UCB algorithm addresses a contextual bandit (CB) setting with shared context variables across the arms, utilizing the generated data to update kernel-based functions that predict the score of each model available for unseen text prompts. Additionally, we leverage random Fourier features (RFF) to accelerate the online learning process of PAK-UCB. Our numerical experiments on real and simulated text-to-image and image-to-text generative models show that RFF-UCB performs successfully in identifying the best generation model across different sample types. The code is available at: github.com/yannxiaoyanhu/dgm-online-select.
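
The bandit view can be illustrated with a context-free UCB sketch, dropping the kernel regression and RFF machinery for brevity; the model names and scores are invented, and PAK-UCB itself conditions these estimates on the input prompt.

```python
import math

def ucb_select(counts, mean_scores, t, c=2.0):
    # Pick the generation model (arm) maximizing mean observed score
    # plus an exploration bonus; untried models are selected first.
    best_arm, best_val = None, -float("inf")
    for arm, n in counts.items():
        if n == 0:
            return arm
        val = mean_scores[arm] + c * math.sqrt(math.log(t) / n)
        if val > best_val:
            best_arm, best_val = arm, val
    return best_arm

counts = {"model_a": 5, "model_b": 5}
means = {"model_a": 0.8, "model_b": 0.3}
choice = ucb_select(counts, means, t=10)  # exploitation favors model_a
```

The prompt-aware version replaces the per-arm running means with kernel-based score predictors sharing the prompt as context, which is where the RFF acceleration comes in.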

Updated: 2025-08-06 11:02:01

Subjects: cs.LG

Download: http://arxiv.org/abs/2410.13287v5

RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework

Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released on https://github.com/Event-AHU/OpenPAR

Updated: 2025-08-06 11:01:24

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.10018v2

Real-World Offline Reinforcement Learning from Vision Language Model Feedback

Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with the task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real-world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then we use the learned reward to employ Implicit Q learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.

Updated: 2025-08-06 10:50:54

Subjects: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.05273v2

WSS-CL: Weight Saliency Soft-Guided Contrastive Learning for Efficient Machine Unlearning Image Classification

Machine unlearning, the efficient deletion of the impact of specific data from a trained model, remains a challenging problem. Current machine unlearning approaches, which focus primarily on data-centric or weight-based strategies, frequently encounter challenges in achieving precise unlearning, maintaining stability, and ensuring applicability across diverse domains. In this work, we introduce a new two-phase efficient machine unlearning method for image classification that leverages weight saliency to focus the unlearning process on critical model parameters. Our method is called weight saliency soft-guided contrastive learning for efficient machine unlearning image classification (WSS-CL), and it significantly narrows the performance gap with "exact" unlearning. First, the forgetting stage maximizes the Kullback-Leibler divergence between output logits and aggregated pseudo-labels for efficient forgetting in logit space. Next, the adversarial fine-tuning stage introduces contrastive learning in a self-supervised manner. Using scaled feature representations, it maximizes the distance between the forgotten and retained data samples in the feature space, with the forgotten and paired augmented samples acting as positive pairs and the retained samples acting as negative pairs in the contrastive loss computation. Experimental evaluations reveal that our proposed method yields much-improved unlearning efficacy with negligible performance loss compared to state-of-the-art approaches, indicative of its usability in supervised and self-supervised settings.
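
The quantity driven by the forgetting stage, a KL divergence in logit space, can be sketched directly; the logit values and pseudo-label below are illustrative numbers, not the paper's construction.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): zero when the distributions match, growing as they diverge.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative numbers: the forgetting stage drives the model's output
# distribution on forget-set samples away from the aggregated
# pseudo-labels, i.e. it maximizes this divergence.
pseudo_label = softmax([2.0, 0.0, 0.0, 0.0])
before_forgetting = softmax([2.0, 0.0, 0.0, 0.0])  # still aligned with pseudo-label
after_forgetting = softmax([0.0, 2.0, 0.0, 0.0])   # pushed away in logit space
```

Maximizing this divergence is what makes the forgetting "soft": the model's outputs on the forget set are displaced without directly editing weights sample by sample.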

Updated: 2025-08-06 10:47:36

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04308v1

Compressing Large Language Models with PCA Without Performance Loss

We demonstrate that Principal Component Analysis (PCA), when applied in a structured manner, either to polar-transformed images or segment-wise to token sequences, enables extreme compression of neural models without sacrificing performance. Across three case studies, we show that a one-layer classifier trained on PCA-compressed polar MNIST achieves over 98 percent accuracy using only 840 parameters. A two-layer transformer trained on 70-dimensional PCA-reduced MiniLM embeddings reaches 76.62 percent accuracy on the 20 Newsgroups dataset with just 81000 parameters. A decoder-only transformer generates coherent token sequences from 70-dimensional PCA embeddings while preserving over 97 percent cosine similarity with full MiniLM representations, using less than 17 percent of the parameter count of GPT-2. These results highlight PCA-based input compression as a general and effective strategy for aligning model capacity with information content, enabling lightweight architectures across multiple modalities.
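
The PCA reduction underlying these case studies can be sketched with a plain SVD-based PCA. This is a generic implementation; the paper's exact preprocessing (e.g., the polar transform or segment-wise application) is not reproduced here.

```python
import numpy as np

def pca_compress(X, k):
    """Project the rows of X onto the top-k principal components.
    Returns the compressed data, the projection matrix, and the mean."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k]          # (k, d) projection matrix
    Z = Xc @ W.T        # (n, k) compressed representation
    return Z, W, mu

def reconstruct(Z, W, mu):
    # Map the compressed representation back to the original space
    return Z @ W + mu
```

A downstream classifier or transformer is then trained on `Z` instead of `X`, which is what allows the parameter counts quoted above to stay small.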

Updated: 2025-08-06 10:47:22

Domains: cs.CE,cs.AI

Download: http://arxiv.org/abs/2508.04307v1

Long-Term Visual Object Tracking with Event Cameras: An Associative Memory Augmented Tracker and A Benchmark Dataset

Existing event stream based trackers are evaluated on short-term tracking datasets; however, real-world tracking scenarios involve long-term tracking, and the performance of existing tracking algorithms in these scenarios remains unclear. In this paper, we first propose a new long-term, large-scale frame-event visual object tracking dataset, termed FELT. It contains 1,044 long-term videos comprising 1.9 million RGB frame and event stream pairs, 60 different target objects, and 14 challenging attributes. To build a solid benchmark, we retrain and evaluate 21 baseline trackers on our dataset for future work to compare against. In addition, we propose a novel Associative Memory Transformer based RGB-Event long-term visual tracker, termed AMTTrack. It follows a one-stream tracking framework and effectively aggregates the multi-scale RGB/event template and search tokens via a Hopfield retrieval layer. The framework also embodies another aspect of associative memory by maintaining dynamic template representations through an associative memory update scheme, which addresses appearance variation in long-term tracking. Extensive experiments on the FELT, FE108, VisEvent, and COESOT datasets fully validate the effectiveness of our proposed tracker. Both the dataset and source code will be released at https://github.com/Event-AHU/FELT_SOT_Benchmark.
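
The Hopfield retrieval layer mentioned above builds on modern (continuous) Hopfield networks, whose one-step retrieval is an attention-like update: the query attends to stored patterns via a softmax over scaled dot products and returns their convex combination. The sketch below shows that generic update, not AMTTrack's exact layer.

```python
import numpy as np

def hopfield_retrieve(query, memories, beta=4.0):
    """One step of modern Hopfield retrieval: softmax-weighted
    combination of stored patterns, sharpened by inverse temperature beta."""
    scores = beta * memories @ query      # similarity to each stored pattern
    scores -= scores.max()                # numerical stability
    w = np.exp(scores)
    w /= w.sum()
    return w @ memories                   # convex combination of memories
```

With a large `beta`, the update snaps to the single closest stored pattern, which is the associative-memory behavior the tracker exploits for template aggregation.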

Updated: 2025-08-06 10:45:11

Domains: cs.CV,cs.AI,cs.NE

Download: http://arxiv.org/abs/2403.05839v3

Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric perception and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks. The code will be available at https://github.com/OpenGVLab/Hulk.

Updated: 2025-08-06 10:40:31

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2312.01697v5

Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum

This study proposes NIRMAL (Novel Integrated Robust Multi-Adaptation Learning), a novel optimization algorithm that combines multiple strategies inspired by the movements of chess pieces. These strategies include gradient descent, momentum, stochastic perturbations, adaptive learning rates, and non-linear transformations. We carefully evaluated NIRMAL against two widely used and successful optimizers, Adam and SGD with Momentum, on four benchmark image classification datasets: MNIST, FashionMNIST, CIFAR-10, and CIFAR-100. A custom convolutional neural network (CNN) architecture is applied to each dataset. The experimental results show that NIRMAL achieves competitive performance, particularly on the more challenging CIFAR-100 dataset, where it achieved a test accuracy of 45.32% and a weighted F1-score of 0.4328. This performance surpasses Adam (41.79% accuracy, 0.3964 F1-score) and closely matches SGD with Momentum (46.97% accuracy, 0.4531 F1-score). NIRMAL also exhibits robust convergence and strong generalization capabilities, especially on complex datasets, as evidenced by stable loss and accuracy curves during training. These findings underscore NIRMAL's ability as a versatile and effective optimizer for various deep learning tasks.
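
The abstract lists NIRMAL's ingredients without giving the exact update rule. The sketch below combines those ingredients (momentum, an adaptive learning rate, a stochastic perturbation, and a non-linear transform) into one illustrative step; it is an assumption-laden stand-in, not the published NIRMAL update.

```python
import numpy as np

def nirmal_like_step(w, grad, state, lr=1e-2, beta=0.9, eps=1e-8,
                     noise_scale=1e-3, rng=None):
    """One illustrative update mixing the ingredients the paper lists.
    NOT the published NIRMAL rule, just a sketch of the recipe."""
    rng = rng if rng is not None else np.random.default_rng(0)
    state["m"] = beta * state.get("m", 0.0) + (1 - beta) * grad       # momentum
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * grad ** 2  # adaptivity
    step = state["m"] / (np.sqrt(state["v"]) + eps)    # adaptive scaling
    step = np.tanh(step)                               # non-linear transform
    step = step + noise_scale * rng.standard_normal(np.shape(grad))  # perturbation
    return w - lr * step, state
```

On a toy quadratic, a loop of such steps drives the parameter toward the minimum, which is the qualitative behavior any such hybrid rule must reproduce before being benchmarked against Adam or SGD with Momentum.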

Updated: 2025-08-06 10:30:22

Domains: cs.IR,cs.AI

Download: http://arxiv.org/abs/2508.04293v1

Challenges in Applying Variational Quantum Algorithms to Dynamic Satellite Network Routing

Applying near-term variational quantum algorithms to the problem of dynamic satellite network routing represents a promising direction for quantum computing. In this work, we provide a critical evaluation of two major approaches: static quantum optimizers such as the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA) for offline route computation, and Quantum Reinforcement Learning (QRL) methods for online decision-making. Using ideal, noise-free simulations, we find that these algorithms face significant challenges. Specifically, static optimizers are unable to solve even a classically easy 4-node shortest path problem due to the complexity of the optimization landscape. Likewise, a basic QRL agent based on policy gradient methods fails to learn a useful routing strategy in a dynamic 8-node environment and performs no better than random actions. These negative findings highlight key obstacles that must be addressed before quantum algorithms can offer real advantages in communication networks. We discuss the underlying causes of these limitations, including barren plateaus and learning instability, and suggest future research directions to overcome them.

Updated: 2025-08-06 10:25:39

Domains: quant-ph,cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.04288v1

Heterogeneity-Oblivious Robust Federated Learning

Federated Learning (FL) remains highly vulnerable to poisoning attacks, especially under real-world hyper-heterogeneity, where clients differ significantly in data distributions, communication capabilities, and model architectures. Such heterogeneity not only undermines the effectiveness of aggregation strategies but also makes attacks more difficult to detect. Furthermore, high-dimensional models expand the attack surface. To address these challenges, we propose Horus, a heterogeneity-oblivious robust FL framework centered on low-rank adaptations (LoRAs). Rather than aggregating full model parameters, Horus inserts LoRAs into empirically stable layers and aggregates only the LoRAs to reduce the attack surface. We uncover a key empirical observation: the input projection (LoRA-A) is markedly more stable than the output projection (LoRA-B) under heterogeneity and poisoning. Leveraging this, we design a Heterogeneity-Oblivious Poisoning Score that uses the features of LoRA-A to filter poisoned clients. For the remaining benign clients, we propose a projection-aware aggregation mechanism that preserves collaborative signals while suppressing drift by reweighting client updates according to their consistency with the global directions. Extensive experiments across diverse datasets, model architectures, and attacks demonstrate that Horus consistently outperforms state-of-the-art baselines in both robustness and accuracy.

Updated: 2025-08-06 10:18:00

Domains: cs.LG,cs.NI

Download: http://arxiv.org/abs/2508.03579v2

Per-element Secure Aggregation against Data Reconstruction Attacks in Federated Learning

Federated learning (FL) enables collaborative model training without sharing raw data, but individual model updates may still leak sensitive information. Secure aggregation (SecAgg) mitigates this risk by allowing the server to access only the sum of client updates, thereby concealing individual contributions. However, a significant vulnerability has recently attracted increasing attention: when model updates are sparse vectors, a non-zero value contributed by a single client at a given index can be directly revealed in the aggregate, enabling precise data reconstruction attacks. In this paper, we propose a novel enhancement to SecAgg that reveals aggregated values only at indices with at least $t$ non-zero contributions. Our mechanism introduces a per-element masking strategy to prevent the exposure of under-contributed elements, while maintaining modularity and compatibility with many existing SecAgg implementations by relying solely on cryptographic primitives already employed in a typical setup. We integrate this mechanism into Flamingo, a low-round SecAgg protocol, to provide a robust defense against such attacks. Our analysis and experimental results indicate that the additional computational and communication overhead introduced by our mechanism remains within an acceptable range, supporting the practicality of our approach.
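
The per-element rule can be illustrated in plaintext: reveal the aggregate only at indices with at least t non-zero contributions. The real protocol enforces this cryptographically via per-element masking; the sketch below only demonstrates the disclosure rule itself.

```python
import numpy as np

def threshold_aggregate(updates, t):
    """Plaintext illustration of the per-element rule: the sum is
    revealed only at indices where at least t clients contributed a
    non-zero value; all other positions are masked (set to NaN).
    The actual SecAgg protocol enforces this with cryptographic masks."""
    U = np.asarray(updates, dtype=float)        # (clients, dim), sparse rows
    contributions = np.count_nonzero(U, axis=0)  # non-zero contributors per index
    agg = U.sum(axis=0)
    agg[contributions < t] = np.nan              # under-contributed: not revealed
    return agg
```

This is exactly the attack surface the paper closes: without the threshold, an index where only one client is non-zero exposes that client's value verbatim in the aggregate.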

Updated: 2025-08-06 10:16:40

Domains: cs.CR

Download: http://arxiv.org/abs/2508.04285v1

Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling

Recent research has developed benchmarks for memory-augmented reinforcement learning (RL) algorithms, providing Partially Observable Markov Decision Process (POMDP) environments where agents depend on past observations to make decisions. While many benchmarks incorporate sufficiently complex real-world problems, they lack controllability over the degree of challenges posed to memory models. In contrast, synthetic environments enable fine-grained manipulation of dynamics, making them critical for detailed and rigorous evaluation of memory-augmented RL. Our study focuses on POMDP synthesis with three key contributions: 1. A theoretical framework for analyzing POMDPs, grounded in Memory Demand Structure (MDS), transition invariance, and related concepts; 2. A methodology leveraging linear process dynamics, state aggregation, and reward redistribution to construct customized POMDPs with predefined properties; 3. Empirically validated series of POMDP environments with increasing difficulty levels, designed based on our theoretical insights. Our work clarifies the challenges of memory-augmented RL in solving POMDPs, provides guidelines for analyzing and designing POMDP environments, and offers empirical support for selecting memory models in RL tasks.

Updated: 2025-08-06 10:13:17

Domains: cs.AI

Download: http://arxiv.org/abs/2508.04282v1

GRILL: Gradient Signal Restoration in Ill-Conditioned Layers to Enhance Adversarial Attacks on Autoencoders

Adversarial robustness of deep autoencoders (AEs) remains relatively unexplored, even though their non-invertible nature poses distinct challenges. Existing attack algorithms, which optimize imperceptible, norm-bounded adversarial perturbations to maximize output damage in AEs, often stop at sub-optimal attacks. We observe that the adversarial loss gradient vanishes when backpropagated through ill-conditioned layers. This issue arises from near-zero singular values in the Jacobians of these layers, which weaken the gradient signal during optimization. We introduce GRILL, a technique that locally restores gradient signals in ill-conditioned layers, enabling more effective norm-bounded attacks. Through extensive experiments on different architectures of popular AEs, under both sample-specific and universal attack setups, and across standard and adaptive attack settings, we show that our method significantly increases the effectiveness of adversarial attacks, enabling a more rigorous evaluation of AE robustness.
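
The core idea of restoring gradient signal through an ill-conditioned layer can be sketched by clamping the layer Jacobian's near-zero singular values before backpropagating through it. This is a toy version of the mechanism; GRILL's actual procedure may differ in detail.

```python
import numpy as np

def restore_gradient(J, g, floor=1e-2):
    """Backpropagate g through a layer with Jacobian J, after clamping
    J's near-zero singular values to a floor. A toy version of locally
    restoring the gradient signal in an ill-conditioned layer."""
    U, S, Vt = np.linalg.svd(J, full_matrices=False)
    J_restored = U @ np.diag(np.maximum(S, floor)) @ Vt
    return J_restored.T @ g  # chain rule with the conditioned Jacobian
```

With `J = diag(1, 1e-8)`, the vanilla backpropagated gradient `J.T @ g` is on the order of 1e-8 in the weak direction, while the restored one is bounded below by the floor, so the optimizer keeps receiving signal.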

Updated: 2025-08-06 10:10:21

Domains: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2505.03646v3

Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy

Large Language Models (LLMs) are gaining traction as a method to generate consensus statements and aggregate preferences in digital democracy experiments. Yet, LLMs may introduce critical vulnerabilities in these systems. Here, we explore the impact of prompt-injection attacks targeting consensus generating systems by introducing a four-dimensional taxonomy of attacks. We test these attacks using LLaMA 3.1 8B and ChatGPT 4.1 Nano, finding the LLMs more vulnerable to criticism attacks -- attacks using disagreeable prompts -- and more effective at tilting ambiguous consensus statements. We also find evidence of more effective manipulation when using explicit imperatives and rational-sounding arguments compared to emotional language or fabricated statistics. To mitigate these vulnerabilities, we apply Direct Preference Optimization (DPO), an alignment method that fine-tunes LLMs to prefer unperturbed consensus statements. While DPO significantly improves robustness, it still offers limited protection against attacks targeting ambiguous consensus. These results advance our understanding of the vulnerability and robustness of consensus generating LLMs in digital democracy applications.

Updated: 2025-08-06 10:10:01

Domains: cs.CY,cs.CR

Download: http://arxiv.org/abs/2508.04281v1

Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.

Updated: 2025-08-06 10:08:48

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04280v1

Mockingbird: How does LLM perform in general machine learning tasks?

Large language models (LLMs) are now being used with increasing frequency as chat bots, tasked with summarizing information or generating text and code in accordance with user instructions. The rapid increase in the reasoning capabilities and inference speed of LLMs has revealed their remarkable potential for applications extending beyond the domain of chat bots to general machine learning tasks. This work is conducted out of curiosity about such potential. In this work, we propose a framework, Mockingbird, to adapt LLMs to general machine learning tasks and evaluate its performance and scalability on several such tasks. The core concept of this framework is instructing LLMs to role-play functions and reflect on their mistakes to improve themselves. Our evaluation and analysis show that LLM-driven machine learning methods, such as Mockingbird, can achieve acceptable results on common machine learning tasks; however, reflection alone currently cannot outperform the effect of domain-specific documents and feedback from human experts.

Updated: 2025-08-06 10:08:47

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04279v1

Large Language Model's Multi-Capability Alignment in Biomedical Domain

BalancedBio is a theoretically grounded framework for parameter-efficient biomedical reasoning, addressing multi-capability integration in domain-specific AI alignment. It establishes the Biomedical Multi-Capability Convergence Theorem, proving orthogonal gradient spaces are essential to prevent capability interference for safe deployment. Key innovations include: (1) Medical Knowledge Grounded Synthetic Generation (MKGSG), extending Source2Synth with clinical workflow constraints and medical ontology validation for factual accuracy and safety; and (2) Capability Aware Group Relative Policy Optimization, deriving optimal hybrid reward weighting to maintain orthogonality in RL, using a reward model with rule-based and model-based scores adapted to biomedical tasks. Mathematical analysis proves Pareto-optimal convergence, preserving performance across capabilities. It achieves state-of-the-art results in its parameter class: domain expertise (80.95% BIOMED-MMLU, +15.32% over baseline), reasoning (61.94%, +7.75%), instruction following (67.95%, +6.44%), and integration (86.7%, +18.5%). Theoretical safety guarantees include bounds on capability preservation and clinical accuracy. Real-world deployment yields 78% cost reduction, 23% improved diagnostic accuracy, and 89% clinician acceptance. This work provides a principled methodology for biomedical AI alignment, enabling efficient reasoning with essential safety and reliability, with the 0.5B model version to be released.

Updated: 2025-08-06 10:06:11

Domains: cs.AI

Download: http://arxiv.org/abs/2508.04278v1

PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs

Convolutional neural networks (CNNs) are crucial for computer vision tasks on resource-constrained devices. Quantization effectively compresses these models, reducing storage size and energy cost. However, in modern depthwise-separable architectures, the computational cost is distributed unevenly across its components, with pointwise operations being the most expensive. By applying a general quantization scheme to this imbalanced cost distribution, existing quantization approaches fail to fully exploit potential efficiency gains. To this end, we introduce PROM, a straightforward approach for quantizing modern depthwise-separable convolutional networks by selectively using two distinct bit-widths. Specifically, pointwise convolutions are quantized to ternary weights, while the remaining modules use 8-bit weights, which is achieved through a simple quantization-aware training procedure. Additionally, by quantizing activations to 8-bit, our method transforms pointwise convolutions with ternary weights into int8 additions, which enjoy broad support across hardware platforms and effectively eliminates the need for expensive multiplications. Applying PROM to MobileNetV2 reduces the model's energy cost by more than an order of magnitude (23.9x) and its storage size by 2.7x compared to the float16 baseline while retaining similar classification performance on ImageNet. Our method advances the Pareto frontier for energy consumption vs. top-1 accuracy for quantized convolutional models on ImageNet. PROM addresses the challenges of quantizing depthwise-separable convolutional networks to both ternary and 8-bit weights, offering a simple way to reduce energy cost and storage size.
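
A common ternary quantization scheme, shown here as a hedged stand-in for PROM's pointwise quantizer (the paper's thresholding and scaling choices may differ): weights below a magnitude threshold become 0, the rest keep their sign, with one per-tensor scale.

```python
import numpy as np

def ternarize(W, sparsity=0.5):
    """Quantize weights to {-1, 0, +1} with a per-tensor scale: values
    below a magnitude threshold become 0, the rest keep their sign.
    A generic ternary scheme, not necessarily PROM's exact quantizer."""
    thresh = np.quantile(np.abs(W), sparsity)
    T = (np.sign(W) * (np.abs(W) >= thresh)).astype(np.int8)
    # Per-tensor scale fitted to the surviving weights
    scale = float(np.abs(W[T != 0]).mean()) if np.any(T != 0) else 1.0
    return T, scale
```

With 8-bit activations, multiplying by a ternary weight reduces to adding, subtracting, or skipping the activation, which is why the pointwise convolutions become int8 additions and the expensive multiplications disappear.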

Updated: 2025-08-06 10:03:03

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2505.03254v2

DOGR: Towards Versatile Visual Document Grounding and Referring

With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. However, these capabilities still remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs' grounding and referring capabilities in dialogue and reasoning. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document), offering a comprehensive evaluation of fine-grained document understanding. Leveraging the generated data, we further develop DOGR, a strong baseline model that excels in text localization and recognition while precisely grounding and referring to key textual information during conversation and reasoning, thereby advancing document understanding to a finer granularity and enabling flexible interaction paradigms.

Updated: 2025-08-06 10:02:43

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2411.17125v3

A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of the full text modified, the QA accuracy collapses from 95% to 50%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

Updated: 2025-08-06 10:01:26

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.04276v1

Efficient rule induction by ignoring pointless rules

The goal of inductive logic programming (ILP) is to find a set of logical rules that generalises training examples and background knowledge. We introduce an ILP approach that identifies pointless rules. A rule is pointless if it contains a redundant literal or cannot discriminate against negative examples. We show that ignoring pointless rules allows an ILP system to soundly prune the hypothesis space. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce learning times by 99% whilst maintaining predictive accuracies.
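
The pruning criterion can be sketched in a few lines: a candidate rule is discarded if it carries a redundant literal or if it covers every negative example and so cannot discriminate. The encoding below (a rule as a list of literal names checked against example dicts) is a toy assumption for illustration, not the system's actual first-order representation.

```python
# Hedged toy sketch of pruning "pointless" rules before hypothesis search.
# A rule here is a conjunction of propositional literals; the ILP system
# itself works over first-order clauses.

def covers(rule, example):
    """A rule covers an example if every literal holds in it."""
    return all(example.get(lit, False) for lit in rule)

def is_pointless(rule, neg_examples):
    # (1) Redundant literal: a duplicated literal adds nothing to a conjunction.
    if len(rule) != len(set(rule)):
        return True
    # (2) Cannot discriminate: the rule covers every negative example,
    # so adding it can never separate positives from negatives.
    if neg_examples and all(covers(rule, ex) for ex in neg_examples):
        return True
    return False

neg_examples = [{"p": True, "q": True}, {"p": True}]
candidate_rules = [["p"], ["p", "p"], ["p", "q"]]
kept = [r for r in candidate_rules if not is_pointless(r, neg_examples)]
```

Only the rule that excludes at least one negative example and has no redundancy survives the filter, which is the sense in which ignoring pointless rules soundly prunes the hypothesis space.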

Updated: 2025-08-06 09:58:27

Domains: cs.AI

Download: http://arxiv.org/abs/2502.01232v2

Leveraging large language models for SQL behavior-based database intrusion detection

Database systems are extensively used to store critical data across various domains. However, the frequency of abnormal database access behaviors, such as database intrusion by internal and external attacks, continues to rise. Internal masqueraders often have greater organizational knowledge, making it easier to mimic employee behavior effectively. In contrast, external masqueraders may behave differently due to their lack of familiarity with the organization. Current approaches lack the granularity needed to detect anomalies at the operational level, frequently misclassifying entire sequences of operations as anomalies, even though most operations are likely to represent normal behavior. On the other hand, some anomalous behaviors often resemble normal activities, making them difficult for existing detection methods to identify. This paper introduces a two-tiered anomaly detection approach for Structured Query Language (SQL) using the Bidirectional Encoder Representations from Transformers (BERT) model, specifically DistilBERT, a more efficient, pre-trained version. Our method combines both unsupervised and supervised machine learning techniques to accurately identify anomalous activities while minimizing the need for data labeling. First, the unsupervised method uses ensemble anomaly detectors that flag embedding vectors distant from learned normal patterns of typical user behavior across the database (out-of-scope queries). Second, the supervised method uses fine-tuned transformer-based models to detect internal attacks with high precision (in-scope queries), using role-labeled classification, even on limited labeled SQL data. Our findings make a significant contribution by providing an effective solution for safeguarding critical database systems from sophisticated threats.
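
The first, unsupervised tier can be illustrated with a minimal sketch: flag query embeddings whose distance from the centroid of known-normal embeddings exceeds a mean-plus-k-standard-deviations threshold. The toy 2-D vectors and single centroid detector below stand in for DistilBERT embeddings and the paper's ensemble of anomaly detectors.

```python
# Hedged sketch of the out-of-scope tier: distance-to-normal thresholding
# on embedding vectors. Real embeddings would come from DistilBERT.
import math

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flag_out_of_scope(normal, queries, k=3.0):
    c = centroid(normal)
    ds = [dist(v, c) for v in normal]
    mean = sum(ds) / len(ds)
    std = math.sqrt(sum((d - mean) ** 2 for d in ds) / len(ds))
    threshold = mean + k * std           # k standard deviations above normal
    return [dist(q, c) > threshold for q in queries]

normal = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]]
flags = flag_out_of_scope(normal, [[0.05, 0.05], [5.0, 5.0]])
```

Queries flagged here would bypass the supervised tier; in-scope queries are passed to the fine-tuned, role-labeled classifier described above.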

Updated: 2025-08-06 09:53:38

Domains: cs.CR,cs.DB,cs.LG

Download: http://arxiv.org/abs/2508.05690v1

A Visual Tool for Interactive Model Explanation using Sensitivity Analysis

We present SAInT, a Python-based tool for visually exploring and understanding the behavior of Machine Learning (ML) models through integrated local and global sensitivity analysis. Our system supports Human-in-the-Loop (HITL) workflows by enabling users - both AI researchers and domain experts - to configure, train, evaluate, and explain models through an interactive graphical interface without programming. The tool automates model training and selection, provides global feature attribution using variance-based sensitivity analysis, and offers per-instance explanation via LIME and SHAP. We demonstrate the system on a classification task predicting survival on the Titanic dataset and show how sensitivity information can guide feature selection and data refinement.
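
Variance-based global sensitivity, as used here for feature attribution, can be sketched as follows: resample one feature at a time and measure how much of the output variation each feature explains. The linear toy model and the one-at-a-time estimator are illustrative assumptions; SAInT attributes trained ML models and also offers LIME/SHAP for per-instance explanations.

```python
# Hedged sketch of variance-based global feature attribution: perturb one
# input feature at a time and normalize the induced squared output changes.
import random

def sensitivity(f, n_features, samples=2000, seed=0):
    rng = random.Random(seed)
    base = [[rng.uniform(-1, 1) for _ in range(n_features)]
            for _ in range(samples)]
    scores = []
    for i in range(n_features):
        diffs = []
        for x in base:
            x2 = list(x)
            x2[i] = rng.uniform(-1, 1)    # resample only feature i
            diffs.append((f(x2) - f(x)) ** 2)
        scores.append(sum(diffs) / len(diffs))
    total = sum(scores)
    return [s / total for s in scores]

# toy model: depends strongly on x0, weakly on x1, not at all on x2
weights = sensitivity(lambda x: 10 * x[0] + x[1], 3)
```

The normalized weights can then guide feature selection: features whose share of the output variance is negligible are candidates for removal.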

Updated: 2025-08-06 09:53:31

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04269v1

Electron-nucleus cross sections from transfer learning

Transfer learning (TL) allows a deep neural network (DNN) trained on one type of data to be adapted for new problems with limited information. We propose to use the TL technique in physics. The DNN learns the details of one process, and after fine-tuning, it makes predictions for related processes. We consider the DNNs, trained on inclusive electron-carbon scattering data, and show that after fine-tuning, they accurately predict cross sections for electron interactions with nuclear targets ranging from helium-3 to iron.
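
The recipe can be sketched with a tiny model: fit on plentiful data from one process, then fine-tune the same head on a handful of points from a related process. The 1-D polynomial features and synthetic "source" and "target" curves below are illustrative stand-ins for the carbon and helium-3 cross-section data; the paper fine-tunes deep networks, not a linear head.

```python
# Hedged toy sketch of transfer learning: pre-train on a data-rich process,
# fine-tune on scarce data from a related process that differs by an offset.

def features(x):                     # shared "body": fixed nonlinear features
    return [1.0, x, x * x]

def fit_head(xs, ys, head, lr=0.05, epochs=500):
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            f = features(x)
            err = sum(w * fi for w, fi in zip(head, f)) - y
            head = [w - lr * err * fi for w, fi in zip(head, f)]
    return head

# source process: plentiful data from y = 2 + x^2
src_x = [i / 10 for i in range(-10, 11)]
src_y = [2 + x * x for x in src_x]
head = fit_head(src_x, src_y, [0.0, 0.0, 0.0])

# related target process (shifted by 1): only three points available
tgt_x = [-1.0, 0.0, 1.0]
tgt_y = [3 + x * x for x in tgt_x]
head = fit_head(tgt_x, tgt_y, head)
pred = sum(w * f for w, f in zip(head, features(0.8)))
```

Because the pre-trained head already encodes the shared structure, three target points suffice to recover the new curve, which mirrors how fine-tuning on one nucleus transfers to others.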

Updated: 2025-08-06 09:53:15

Domains: hep-ph,cs.LG,hep-ex,nucl-ex,nucl-th

Download: http://arxiv.org/abs/2408.09936v2

A virtual sensor fusion approach for state of charge estimation of lithium-ion cells

This paper addresses the estimation of the State Of Charge (SOC) of lithium-ion cells via the combination of two widely used paradigms: Kalman Filters (KFs) equipped with Equivalent Circuit Models (ECMs) and machine-learning approaches. In particular, a recent Virtual Sensor (VS) synthesis technique is considered, which operates as follows: (i) learn an Affine Parameter-Varying (APV) model of the cell directly from data, (ii) derive a bank of linear observers from the APV model, (iii) train a machine-learning technique from features extracted from the observers together with input and output data to predict the SOC. The SOC predictions returned by the VS are supplied to an Extended KF (EKF) as output measurements along with the cell terminal voltage, combining the two paradigms. A data-driven calibration strategy for the noise covariance matrices of the EKF is proposed. Experimental results show that the designed approach is beneficial w.r.t. SOC estimation accuracy and smoothness.
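
The fusion step can be sketched with a scalar Kalman-style filter that treats the virtual sensor's SOC prediction as a measurement on top of a simple coulomb-counting prediction. The dynamics, noise covariances, and cell capacity below are illustrative assumptions, not the paper's identified model.

```python
# Hedged sketch of KF-based SOC fusion: predict by coulomb counting, then
# correct with the virtual sensor's SOC estimate as a measurement.

def kf_step(soc, p, current, dt, z_vs, q=1e-5, r_vs=1e-2, capacity=3600.0):
    # Predict: coulomb counting (positive current discharges the cell)
    soc_pred = soc - current * dt / capacity
    p_pred = p + q
    # Update with the virtual-sensor SOC measurement
    k = p_pred / (p_pred + r_vs)              # Kalman gain
    soc_new = soc_pred + k * (z_vs - soc_pred)
    p_new = (1 - k) * p_pred
    return soc_new, p_new

soc, p = 1.0, 1e-2                            # poor initial SOC guess
for _ in range(50):
    soc, p = kf_step(soc, p, current=1.0, dt=1.0, z_vs=0.8)
```

After a few dozen steps the estimate settles near the virtual sensor's reading while the covariance shrinks; the actual EKF additionally fuses the cell terminal voltage through the ECM and calibrates the noise covariances from data.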

Updated: 2025-08-06 09:52:28

Domains: eess.SY,cs.LG,cs.SY

Download: http://arxiv.org/abs/2508.04268v1

XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding

Current auto-regressive models can generate high-quality, topologically precise meshes; however, they necessitate thousands, or even tens of thousands, of next-token predictions during inference, resulting in substantial latency. We introduce XSpecMesh, a quality-preserving acceleration method for auto-regressive mesh generation models. XSpecMesh employs a lightweight, multi-head speculative decoding scheme to predict multiple tokens in parallel within a single forward pass, thereby accelerating inference. We further propose a verification and resampling strategy: the backbone model verifies each predicted token and resamples any tokens that do not meet the quality criteria. In addition, we propose a distillation strategy that trains the lightweight decoding heads by distilling from the backbone model, encouraging their prediction distributions to align and improving the success rate of speculative predictions. Extensive experiments demonstrate that our method achieves a 1.7x speedup without sacrificing generation quality. Our code will be released.
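
The verify-and-resample loop can be sketched on a toy sequence task: lightweight heads propose several next tokens at once, the backbone checks each one, and the first rejected position is replaced by the backbone's own token. Both "models" below are stand-in functions, not mesh generators, and the deliberate draft error is injected for illustration.

```python
# Hedged toy sketch of multi-head speculative decoding with verification.

def backbone_next(seq):
    return seq[-1] + 1 if seq else 0         # ground-truth rule: count upward

def draft_heads(seq, k=3):
    out = []
    for i in range(1, k + 1):
        tok = seq[-1] + i if seq else i - 1
        if len(out) == 1:                    # inject one deliberate draft error
            tok += 5
        out.append(tok)
    return out

def speculative_step(seq, k=3):
    drafts = draft_heads(seq, k)             # k tokens from one forward pass
    accepted = []
    for tok in drafts:
        expect = backbone_next(seq + accepted)   # backbone verifies each draft
        if tok == expect:
            accepted.append(tok)
        else:
            accepted.append(expect)          # resample and stop this round
            break
    return seq + accepted
```

One backbone verification round can thus accept several draft tokens at once; the speedup comes from drafts being much cheaper than full backbone passes.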

Updated: 2025-08-06 09:51:03

Domains: cs.GR,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.23777v2

SelectiveShield: Lightweight Hybrid Defense Against Gradient Leakage in Federated Learning

Federated Learning (FL) enables collaborative model training on decentralized data but remains vulnerable to gradient leakage attacks that can reconstruct sensitive user information. Existing defense mechanisms, such as differential privacy (DP) and homomorphic encryption (HE), often introduce a trade-off between privacy, model utility, and system overhead, a challenge that is exacerbated in heterogeneous environments with non-IID data and varying client capabilities. To address these limitations, we propose SelectiveShield, a lightweight hybrid defense framework that adaptively integrates selective homomorphic encryption and differential privacy. SelectiveShield leverages Fisher information to quantify parameter sensitivity, allowing clients to identify critical parameters locally. Through a collaborative negotiation protocol, clients agree on a shared set of the most sensitive parameters for protection via homomorphic encryption. Parameters that are uniquely important to individual clients are retained locally, fostering personalization, while non-critical parameters are protected with adaptive differential privacy noise. Extensive experiments demonstrate that SelectiveShield maintains strong model utility while significantly mitigating gradient leakage risks, offering a practical and scalable defense mechanism for real-world federated learning deployments.
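
The sensitivity-based split can be sketched as follows: approximate per-parameter Fisher information by averaged squared gradients, take each client's top fraction as its sensitive set, and partition parameters into "encrypt" (sensitive for every client), "keep local" (sensitive for some clients), and "DP noise" (non-critical). The threshold fraction and toy gradients are assumptions; the negotiation protocol itself is simplified to a set intersection.

```python
# Hedged sketch of Fisher-guided parameter partitioning in SelectiveShield.

def fisher_diag(grad_batches):
    """Diagonal Fisher approximation: mean squared gradient per parameter."""
    n = len(grad_batches)
    dim = len(grad_batches[0])
    return [sum(g[i] ** 2 for g in grad_batches) / n for i in range(dim)]

def partition(client_fishers, top_frac=0.5):
    dim = len(client_fishers[0])
    k = max(1, int(dim * top_frac))
    top_sets = [set(sorted(range(dim), key=lambda i: -f[i])[:k])
                for f in client_fishers]
    encrypt = set.intersection(*top_sets)    # sensitive for every client: HE
    local = set.union(*top_sets) - encrypt   # client-specific: keep local
    dp = set(range(dim)) - encrypt - local   # non-critical: adaptive DP noise
    return encrypt, local, dp

c1 = fisher_diag([[3.0, 0.1, 0.1, 2.0], [3.0, 0.1, 0.1, 0.1]])
c2 = fisher_diag([[2.5, 0.1, 2.0, 0.1], [2.5, 0.1, 0.1, 0.1]])
enc, loc, dp = partition([c1, c2])
```

The expensive homomorphic encryption is thus reserved for the small shared-sensitive set, which is how the framework stays lightweight.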

Updated: 2025-08-06 09:50:39

Domains: cs.DC,cs.AI,cs.CR

Download: http://arxiv.org/abs/2508.04265v1

Zero-Shot Neural Architecture Search with Weighted Response Correlation

Neural architecture search (NAS) is a promising approach for automatically designing neural network architectures. However, the architecture estimation of NAS is computationally expensive and time-consuming because of training multiple architectures from scratch. Although existing zero-shot NAS methods use training-free proxies to accelerate the architecture estimation, their effectiveness, stability, and generality are still lacking. We present a novel training-free estimation proxy called weighted response correlation (WRCor). WRCor utilizes correlation coefficient matrices of responses across different input samples to calculate the proxy scores of estimated architectures, which can measure their expressivity and generalizability. Experimental results on proxy evaluation demonstrate that WRCor and its voting proxies are more efficient estimation strategies than existing proxies. We also apply them with different search strategies in architecture search. Experimental results on architecture search show that our zero-shot NAS algorithm outperforms most existing NAS algorithms in different search spaces. Our NAS algorithm can discover an architecture with a 22.1% test error on the ImageNet-1k dataset within 4 GPU hours. All codes are publicly available at https://github.com/kunjing96/ZSNAS-WRCor.git.
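
A correlation-based training-free proxy can be sketched like this: compute Pearson correlations between a network's responses to different inputs, and score architectures by how decorrelated those responses are (weakly correlated responses suggest higher expressivity). The mean-pairwise-decorrelation scoring rule below is an illustrative simplification, not WRCor's exact weighted formula.

```python
# Hedged sketch of a response-correlation proxy for zero-shot NAS.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def decorrelation_score(responses):
    """Mean pairwise decorrelation of per-input response vectors."""
    n = len(responses)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1 - abs(pearson(responses[i], responses[j]))
               for i, j in pairs) / len(pairs)

# responses of two untrained nets to 3 inputs; net_a collapses all inputs
# onto nearly identical activation patterns, net_b keeps them distinct
net_a = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [0.9, 1.9, 2.9]]
net_b = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [2.0, 3.0, 1.0]]
score_a = decorrelation_score(net_a)
score_b = decorrelation_score(net_b)
```

Since no training is involved, thousands of candidate architectures can be scored this way before any expensive search step.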

Updated: 2025-08-06 09:48:18

Domains: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.08841v2

Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark

With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. Both the dataset and source code of this paper will be released on https://github.com/Event-AHU/SAV

Updated: 2025-08-06 09:46:49

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04260v1

Deep Neural Network-Driven Adaptive Filtering

This paper proposes a deep neural network (DNN)-driven framework to address the longstanding generalization challenge in adaptive filtering (AF). In contrast to traditional AF frameworks that emphasize explicit cost function design, the proposed framework shifts the paradigm toward direct gradient acquisition. The DNN, functioning as a universal nonlinear operator, is structurally embedded into the core architecture of the AF system, establishing a direct mapping between filtering residuals and learning gradients. The maximum likelihood is adopted as the implicit cost function, rendering the derived algorithm inherently data-driven and thus endowed with exemplary generalization capability, which is validated by extensive numerical experiments across a spectrum of non-Gaussian scenarios. Corresponding mean value and mean square stability analyses are also conducted in detail.
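
The core idea, mapping filtering residuals to learning gradients through a nonlinear operator rather than differentiating an explicit cost, can be sketched with a stand-in for the DNN: a fixed tanh squashing of the residual, which makes updates robust to impulsive (non-Gaussian) noise. The signals, step size, and spike model below are illustrative assumptions.

```python
# Hedged toy sketch of residual-to-gradient adaptive filtering. A learned
# DNN operator is replaced here by tanh, which bounds the update when the
# residual is corrupted by an impulse.
import math, random

def adapt(x_seq, d_seq, n_taps=2, mu=0.05):
    w = [0.0] * n_taps
    for t in range(n_taps - 1, len(x_seq)):
        x = x_seq[t - n_taps + 1:t + 1][::-1]     # [x_t, x_{t-1}, ...]
        e = d_seq[t] - sum(wi * xi for wi, xi in zip(w, x))
        g = math.tanh(e)                          # residual -> gradient
        w = [wi + mu * g * xi for wi, xi in zip(w, x)]
    return w

rng = random.Random(0)
true_w = [0.7, -0.3]
xs = [rng.uniform(-1, 1) for _ in range(5000)]
ds = []
for t in range(len(xs)):
    clean = sum(true_w[k] * xs[t - k] for k in range(2) if t - k >= 0)
    spike = rng.uniform(-20, 20) if rng.random() < 0.01 else 0.0  # impulses
    ds.append(clean + spike)
w = adapt(xs, ds)
```

For small residuals tanh is approximately linear, so the filter behaves like LMS; for impulsive outliers the gradient saturates, which is one reason data-driven gradient operators generalize well to non-Gaussian scenarios.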

Updated: 2025-08-06 09:42:40

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.04258v1

Higher Gauge Flow Models

This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L$_{\infty}$-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.

Updated: 2025-08-06 09:42:01

Domains: cs.AI,cs.LG,math.DG

Download: http://arxiv.org/abs/2507.16334v2

Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation

In this article, we address the problem of federated learning in the presence of stragglers. For this problem, a coded federated learning framework has been proposed, where the central server aggregates gradients received from the non-stragglers and gradient computed from a privacy-preservation global coded dataset to mitigate the negative impact of the stragglers. However, when aggregating these gradients, fixed weights are consistently applied across iterations, neglecting the generation process of the global coded dataset and the dynamic nature of the trained model over iterations. This oversight may result in diminished learning performance. To overcome this drawback, we propose a new method named adaptive coded federated learning (ACFL). In ACFL, before the training, each device uploads a coded local dataset with additive noise to the central server to generate a global coded dataset under privacy preservation requirements. During each iteration of the training, the central server aggregates the gradients received from the non-stragglers and the gradient computed from the global coded dataset, where an adaptive policy for varying the aggregation weights is designed. Under this policy, we optimize the performance in terms of privacy and learning, where the learning performance is analyzed through convergence analysis and the privacy performance is characterized via mutual information differential privacy. Finally, we perform simulations to demonstrate the superiority of ACFL compared with the non-adaptive methods.
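
The aggregation step can be sketched as follows: combine the mean gradient of the non-stragglers with the gradient from the global coded dataset, using a weight that varies over iterations instead of staying fixed. The decaying schedule below is one simple instance of an "adaptive policy"; the paper derives its schedule from the privacy and convergence analysis.

```python
# Hedged sketch of adaptive gradient aggregation in coded federated learning.
# The coded gradient's weight shrinks over iterations, reflecting that the
# coded dataset matters less once the model has nearly converged.

def aggregate(non_straggler_grads, coded_grad, iteration,
              alpha0=0.5, decay=0.1):
    alpha = alpha0 / (1.0 + decay * iteration)   # iteration-dependent weight
    dim = len(coded_grad)
    mean = [sum(g[i] for g in non_straggler_grads) / len(non_straggler_grads)
            for i in range(dim)]
    return [(1 - alpha) * mean[i] + alpha * coded_grad[i] for i in range(dim)]

grads = [[1.0, 0.0], [3.0, 2.0]]     # gradients received from non-stragglers
coded = [0.0, 0.0]                   # gradient from the global coded dataset
g0 = aggregate(grads, coded, iteration=0)
g50 = aggregate(grads, coded, iteration=50)
```

Early on, the coded gradient contributes substantially and masks the missing stragglers; later, the aggregate leans on the (more informative) device gradients.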

Updated: 2025-08-06 09:37:24

Domains: eess.SP,cs.CR,cs.LG

Download: http://arxiv.org/abs/2403.14905v2

True Multimodal In-Context Learning Needs Attention to the Visual Context

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm .
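
The rebalancing idea can be sketched in its simplest form: scale up the attention mass assigned to visual tokens and renormalize, so the model cannot silently ignore the visual context. The scalar factor and the toy attention row below are illustrative; DARA learns the reallocation during fine-tuning rather than applying a fixed factor.

```python
# Hedged toy sketch of attention reallocation across visual/textual tokens.

def reallocate(attn, is_visual, gamma=3.0):
    """Boost visual-token attention by gamma, then renormalize to sum to 1."""
    scaled = [a * gamma if v else a for a, v in zip(attn, is_visual)]
    z = sum(scaled)
    return [s / z for s in scaled]

attn = [0.05, 0.05, 0.45, 0.45]          # two visual tokens, two text tokens
new = reallocate(attn, [True, True, False, False])
visual_mass = new[0] + new[1]
```

The reallocated row still forms a valid distribution, but the visual tokens now carry a meaningfully larger share of the attention mass.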

Updated: 2025-08-06 09:36:34

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.15807v2

T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion

Multivariate time series forecasting (MTSF) seeks to model temporal dynamics among variables to predict future trends. Transformer-based models and large language models (LLMs) have shown promise due to their ability to capture long-range dependencies and patterns. However, current methods often rely on rigid inductive biases, ignore intervariable interactions, or apply static fusion strategies that limit adaptability across forecast horizons. These limitations create bottlenecks in capturing nuanced, horizon-specific relationships in time-series data. To solve this problem, we propose T3Time, a novel trimodal framework consisting of time, spectral, and prompt branches, where the dedicated frequency encoding branch captures the periodic structures along with a gating mechanism that learns prioritization between temporal and spectral features based on the prediction horizon. We also propose a mechanism which adaptively aggregates multiple cross-modal alignment heads by dynamically weighting the importance of each head based on the features. Extensive experiments on benchmark datasets demonstrate that our model consistently outperforms state-of-the-art baselines, achieving an average reduction of 3.28% in MSE and 2.29% in MAE. Furthermore, it shows strong generalization in few-shot learning settings: with 5% training data, we see a reduction in MSE and MAE by 4.13% and 1.91%, respectively; and with 10% data, by 3.62% and 1.98% on average. Code - https://github.com/monaf-chowdhury/T3Time/
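
The horizon-conditioned gating can be sketched with a scalar gate that blends temporal and spectral feature vectors. The logistic-in-log-horizon parameterization below is an illustrative assumption; T3Time learns the gate jointly with the rest of the network.

```python
# Hedged toy sketch of horizon-aware gating between temporal and spectral
# features: longer horizons push the gate toward the spectral (periodic) side.
import math

def gate(horizon, w=1.0, b=-3.0):
    return 1.0 / (1.0 + math.exp(-(w * math.log(horizon) + b)))

def fuse(temporal, spectral, horizon):
    g = gate(horizon)
    return [g * s + (1 - g) * t for t, s in zip(temporal, spectral)]

short = fuse([1.0, 1.0], [0.0, 0.0], horizon=4)    # short horizon: temporal
long = fuse([1.0, 1.0], [0.0, 0.0], horizon=720)   # long horizon: spectral
```

The same blending pattern, applied per head with learned weights, underlies the adaptive aggregation of the cross-modal alignment heads.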

Updated: 2025-08-06 09:31:44

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04251v1

TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

Updated: 2025-08-06 09:30:47

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04248v1

Automated ultrasound doppler angle estimation using deep learning

Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract images features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep-learning based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners.

Updated: 2025-08-06 09:28:07

Domains: cs.LG,cs.AI,I.2.1

Download: http://arxiv.org/abs/2508.04243v1

PA-RNet: Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting

In real-world applications, multimodal time series data often suffer from interference, especially in the textual modality. Existing methods for multimodal time series forecasting often neglect the inherent perturbations within textual data, where irrelevant, noisy, or ambiguous content can significantly degrade model performance, particularly when the noise exhibits varying intensity or stems from structural inconsistencies. To address this challenge, we propose PA-RNet (Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting), a robust multimodal forecasting framework. PA-RNet features a perturbation-aware projection module and a cross-modal attention mechanism to effectively separate noise from the textual embeddings while maintaining semantically meaningful representations, thereby enhancing the model's generalization ability. Theoretically, we establish the Lipschitz continuity of PA-RNet with respect to textual inputs and prove that the proposed perturbation module can reduce expected prediction error, offering strong guarantees of stability under noisy conditions. Furthermore, we introduce a textual perturbation pipeline that can be seamlessly incorporated into existing multimodal time series forecasting tasks, allowing for systematic evaluation of the model's robustness in the presence of varying levels of textual noise. Extensive experiments across diverse domains and temporal settings demonstrate that PA-RNet consistently outperforms state-of-the-art baselines.

Updated: 2025-08-06 09:26:52

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04750v1

Circuit-Aware SAT Solving: Guiding CDCL via Conditional Probabilities

Circuit Satisfiability (CSAT) plays a pivotal role in Electronic Design Automation. The standard workflow for solving CSAT problems converts circuits into Conjunctive Normal Form (CNF) and employs generic SAT solvers powered by Conflict-Driven Clause Learning (CDCL). However, this process inherently discards rich structural and functional information, leading to suboptimal solver performance. To address this limitation, we introduce CASCAD, a novel circuit-aware SAT solving framework that directly leverages circuit-level conditional probabilities computed via Graph Neural Networks (GNNs). By explicitly modeling gate-level conditional probabilities, CASCAD dynamically guides two critical CDCL heuristics, variable phase selection and clause management, to significantly enhance solver efficiency. Extensive evaluations on challenging real-world Logical Equivalence Checking (LEC) benchmarks demonstrate that CASCAD reduces solving times by up to 10x compared to state-of-the-art CNF-based approaches, achieving an additional 23.5% runtime reduction via our probability-guided clause filtering strategy. Our results underscore the importance of preserving circuit-level structural insights within SAT solvers, providing a robust foundation for future improvements in SAT-solving efficiency and EDA tool design.
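
The probability-guided phase-selection heuristic can be sketched very simply: when CDCL must branch on a variable, pick the polarity that the circuit-level probability estimate says is more likely under a satisfying assignment. The probability table below is a stand-in for the GNN's predicted conditional probabilities, and real solvers combine this signal with activity-based heuristics.

```python
# Hedged toy sketch of probability-guided phase selection for CDCL branching.

def pick_phase(var, prob_true):
    """prob_true[var]: estimated P(var = 1) in a satisfying assignment.
    Unknown variables fall back to a neutral 0.5 prior."""
    return prob_true.get(var, 0.5) >= 0.5

prob_true = {"a": 0.9, "b": 0.2}         # stand-in for GNN predictions
phases = {v: pick_phase(v, prob_true) for v in ["a", "b", "c"]}
```

Branching toward likely-satisfying polarities tends to reach satisfying assignments (or useful conflicts) sooner, which is the intuition behind guiding CDCL with circuit-level probabilities.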

Updated: 2025-08-06 09:16:47

Domains: cs.AI

Download: http://arxiv.org/abs/2508.04235v1

UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.

Updated: 2025-08-06 09:16:16

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.11127v2

Empowering Time Series Forecasting with LLM-Agents

Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.

Updated: 2025-08-06 09:14:08

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04231v1

UltraSTF: Ultra-Compact Model for Large-Scale Spatio-Temporal Forecasting

Spatio-temporal data, prevalent in real-world applications such as traffic monitoring, financial transactions, and ride-share demands, represents a specialized case of multivariate time series characterized by high dimensionality. This high dimensionality necessitates computationally efficient models and benefits from applying univariate forecasting approaches through channel-independent strategies. SparseTSF, a recently proposed competitive univariate forecasting model, leverages periodicity to achieve compactness by focusing on cross-period dynamics, extending the Pareto frontier in terms of model size and predictive performance. However, it underperforms on spatio-temporal data due to limited capture of intra-period temporal dependencies. To address this limitation, we propose UltraSTF, which integrates a cross-period forecasting component with an ultra-compact shape bank component. Our model efficiently captures recurring patterns in time series using the attention mechanism of the shape bank component, significantly enhancing its capability to learn intra-period dynamics. UltraSTF achieves state-of-the-art performance on the LargeST benchmark while utilizing fewer than 0.2% of the parameters required by the second-best methods, thereby further extending the Pareto frontier of existing approaches.
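
A minimal sketch of the cross-period idea that UltraSTF builds on (following the SparseTSF style the abstract describes): a series of period `w` is reshaped into `w` phase subsequences, and a single linear map, shared across all phases, forecasts each one. The shapes and random weights below are purely illustrative, not the paper's architecture.

```python
import numpy as np

# Cross-period forecasting sketch: one weight matrix shared by all phases.
rng = np.random.default_rng(0)
L, w, horizon = 24, 4, 8             # history length, period, forecast length
n_in, n_out = L // w, horizon // w   # periods observed / periods predicted

x = rng.normal(size=L)
phases = x.reshape(n_in, w).T        # (w, n_in): one row per phase of the cycle

W = rng.normal(size=(n_out, n_in))   # shared linear map across periods
y_phases = phases @ W.T              # (w, n_out): forecast each phase
forecast = y_phases.T.reshape(horizon)

assert phases.shape == (w, n_in)
assert forecast.shape == (horizon,)
```

Because the map has only `n_in * n_out` parameters regardless of `w`, this is the compactness lever the abstract refers to; UltraSTF's shape bank component then adds attention over recurring intra-period patterns.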

Updated: 2025-08-06 09:04:42

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.20634v2

LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .

Updated: 2025-08-06 09:03:16

Categories: cs.CV,cs.AI,cs.LG,cs.MM

Download: http://arxiv.org/abs/2508.04228v1

Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) Multi-Modal Replay Strategies address cross-modal drift through explicit or implicit memory mechanisms; (2) Cross-Modal Regularization preserves modality alignment during updates; and (3) Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

Updated: 2025-08-06 09:03:10

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04227v1

Symmetric Behavior Regularization via Taylor Expansion of Symmetry

This paper introduces symmetric divergences to behavior regularization policy optimization (BRPO) to establish a novel offline RL framework. Existing methods focus on asymmetric divergences such as KL to obtain analytic regularized policies and a practical minimization objective. We show that symmetric divergences do not permit an analytic policy as regularization and can incur numerical issues as loss. We tackle these challenges by the Taylor series of $f$-divergence. Specifically, we prove that an analytic policy can be obtained with a finite series. For loss, we observe that symmetric divergences can be decomposed into an asymmetry and a conditional symmetry term, Taylor-expanding the latter alleviates numerical issues. Summing together, we propose Symmetric $f$ Actor-Critic (S$f$-AC), the first practical BRPO algorithm with symmetric divergences. Experimental results on distribution approximation and MuJoCo verify that S$f$-AC performs competitively.
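
One standard way to write the expansion the abstract relies on is the Taylor series of a generic $f$-divergence around the density ratio $t = p/q = 1$, with $f(1) = 0$; the first-order term vanishes in expectation, and truncating at order $K$ yields a finite-series surrogate. The notation below is generic and does not reproduce the paper's specific asymmetry/conditional-symmetry decomposition.

```latex
% Taylor expansion of an f-divergence around t = p/q = 1, with f(1) = 0.
\[
  D_f(p \,\|\, q) \;=\; \mathbb{E}_{x \sim q}\!\left[ f\!\left(\frac{p(x)}{q(x)}\right) \right],
  \qquad
  f(t) \;=\; \sum_{k=2}^{\infty} \frac{f^{(k)}(1)}{k!}\,(t-1)^k ,
\]
% since E_q[p/q - 1] = 0, the k = 1 term drops; truncation gives
\[
  D_f(p \,\|\, q) \;\approx\; \sum_{k=2}^{K} \frac{f^{(k)}(1)}{k!}\,
  \mathbb{E}_{x \sim q}\!\left[ \left(\frac{p(x)}{q(x)} - 1\right)^{\!k} \right].
\]
```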

Updated: 2025-08-06 09:01:29

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.04225v1

Hierarchical Text Classification Using Black Box Large Language Models

Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies -- Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly due to the higher input tokens required for deeper label hierarchies on DH strategy. These results emphasize the trade-off between accuracy improvement and the computational cost of prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.
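
The three prompting strategies can be sketched as prompt builders over a toy two-level taxonomy. The taxonomy, prompt wording, and label sets below are invented for illustration; the paper's actual prompts are not given in the abstract.

```python
# Toy two-level label hierarchy (root -> leaves), invented for the example.
TAXONOMY = {"Science": ["Physics", "Biology"], "Arts": ["Music", "Painting"]}

def dl_prompt(text):
    # DL: ask directly for the leaf label only.
    leaves = [l for subs in TAXONOMY.values() for l in subs]
    return f"Classify into one of {leaves}.\nText: {text}\nLeaf label:"

def dh_prompt(text):
    # DH: ask for the full root-to-leaf path in a single prompt. Listing
    # every path is why deeper hierarchies inflate input tokens (and cost).
    paths = [f"{r} > {l}" for r, subs in TAXONOMY.items() for l in subs]
    return f"Classify into one of {paths}.\nText: {text}\nLabel path:"

def tmh_prompts(text):
    # TMH: one prompt per level, each conditioned on the previous answer;
    # only the first-level prompt is materialized here.
    return [f"Classify into one of {list(TAXONOMY)}.\nText: {text}\nLabel:"]

p = dh_prompt("Cells divide by mitosis.")
assert "Science > Biology" in p
assert len(tmh_prompts("x")) == 1
```

The DH builder makes the abstract's cost observation tangible: its prompt length grows with the number of root-to-leaf paths, while DL grows only with the number of leaves.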

Updated: 2025-08-06 08:53:50

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.04219v1

Accelerating Focal Search in Multi-Agent Path Finding with Tighter Lower Bounds

Multi-Agent Path Finding (MAPF) involves finding collision-free paths for multiple agents while minimizing a cost function--an NP-hard problem. Bounded suboptimal methods like Enhanced Conflict-Based Search (ECBS) and Explicit Estimation CBS (EECBS) balance solution quality with computational efficiency using focal search mechanisms. While effective, traditional focal search faces a limitation: the lower bound (LB) value determining which nodes enter the FOCAL list often increases slowly in early search stages, resulting in a constrained search space that delays finding valid solutions. In this paper, we propose a novel bounded suboptimal algorithm, double-ECBS (DECBS), to address this issue by first determining the maximum LB value and then employing a best-first search guided by this LB to find a collision-free path. Experimental results demonstrate that DECBS outperforms ECBS in most test cases and is compatible with existing optimization techniques. DECBS can reduce high-level CT nodes by nearly 30% and low-level focal search nodes by 50%. When agent density is moderate to high, DECBS achieves a 23.5% average runtime improvement over ECBS with identical suboptimality bounds and optimizations.
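
The focal mechanism the abstract describes can be sketched in a few lines: OPEN is ordered by f-value, and FOCAL admits every node whose f-value is within `w * LB`, where `LB` is the minimum f-value over OPEN. The numbers below are illustrative, but they show why a larger LB (which DECBS computes up front) widens FOCAL and gives the suboptimality-bounded search more freedom.

```python
# Focal-list construction for bounded suboptimal search with bound w.
def focal_nodes(open_f_values, w):
    """Return the f-values of OPEN nodes admitted into FOCAL."""
    lb = min(open_f_values)              # LB = min f over OPEN
    return [f for f in open_f_values if f <= w * lb]

open_list = [10, 11, 12, 15, 30]
assert focal_nodes(open_list, 1.2) == [10, 11, 12]
assert focal_nodes(open_list, 1.5) == [10, 11, 12, 15]
```

With LB = 10 and w = 1.2 only three nodes qualify; raising LB (or w) admits more nodes, which is exactly the early-search bottleneck DECBS targets.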

Updated: 2025-08-06 08:49:39

Categories: cs.MA,cs.AI,cs.RO

Download: http://arxiv.org/abs/2503.03779v2

Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction

External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where high-scoring but logically incorrect paths are assigned high scores by the PRMs, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM's internal activations to recover interpretable features, then corrects confounding by using backdoor adjustment. Experiments on math solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without modifying the policy model or retraining PRM.
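
The backdoor correction the abstract refers to is the standard backdoor-adjustment identity. Writing $R$ for the PRM reward, $X$ for the reasoning path, and $Z$ for the confounding semantic features recovered by the sparse autoencoder (symbols chosen here for illustration; the paper may use different notation):

```latex
% Backdoor adjustment: estimate the interventional reward by averaging the
% conditional reward over the confounder distribution.
\[
  \mathbb{E}\!\left[ R \mid \mathrm{do}(X = x) \right]
  \;=\; \sum_{z} \mathbb{E}\!\left[ R \mid X = x,\, Z = z \right] P(Z = z).
\]
```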

Updated: 2025-08-06 08:48:55

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04216v1

A Hybrid AI Methodology for Generating Ontologies of Research Topics from Scientific Paper Corpora

Taxonomies and ontologies of research topics (e.g., MeSH, UMLS, CSO, NLM) play a central role in providing the primary framework through which intelligent systems can explore and interpret the literature. However, these resources have traditionally been manually curated, a process that is time-consuming, prone to obsolescence, and limited in granularity. This paper presents Sci-OG, a semi-automated methodology for generating research topic ontologies, employing a multi-step approach: 1) Topic Discovery, extracting potential topics from research papers; 2) Relationship Classification, determining semantic relationships between topic pairs; and 3) Ontology Construction, refining and organizing topics into a structured ontology. The relationship classification component, which constitutes the core of the system, integrates an encoder-based language model with features describing topic occurrence in the scientific literature. We evaluate this approach against a range of alternative solutions using a dataset of 21,649 manually annotated semantic triples. Our method achieves the highest F1 score (0.951), surpassing various competing approaches, including a fine-tuned SciBERT model and several LLM baselines, such as the fine-tuned GPT4-mini. Our work is corroborated by a use case which illustrates the practical application of our system to extend the CSO ontology in the area of cybersecurity. The presented solution is designed to improve the accessibility, organization, and analysis of scientific knowledge, thereby supporting advancements in AI-enabled literature management and research exploration.

Updated: 2025-08-06 08:48:14

Categories: cs.DL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2508.04213v1

AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models

Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended ``thinking'' process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model's reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model's inherent knowledge of relevant molecular attributes during reasoning, enabling more effective prediction of molecular properties. Experiments on both in-distribution and out-of-distribution datasets show that, training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts the performance, getting comparable or better results than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code in https://github.com/szu-tera/AttriLens-Mol.

Updated: 2025-08-06 08:46:22

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04748v1

DP-DocLDM: Differentially Private Document Image Generation using Latent Diffusion Models

As deep learning-based, data-driven information extraction systems become increasingly integrated into modern document processing workflows, one primary concern is the risk of malicious leakage of sensitive private data from these systems. While some recent works have explored Differential Privacy (DP) to mitigate these privacy risks, DP-based training is known to cause significant performance degradation and impose several limitations on standard training procedures, making its direct application to downstream tasks both difficult and costly. In this work, we aim to address the above challenges within the context of document image classification by substituting real private data with a synthetic counterpart. In particular, we propose to use conditional latent diffusion models (LDMs) in combination with differential privacy (DP) to generate class-specific synthetic document images under strict privacy constraints, which can then be utilized to train a downstream classifier following standard training procedures. We investigate our approach under various pretraining setups, including unconditional, class-conditional, and layout-conditional pretraining, in combination with multiple private training strategies such as class-conditional and per-label private fine-tuning with DPDM and DP-Promise algorithms. Additionally, we evaluate it on two well-known document benchmark datasets, RVL-CDIP and Tobacco3482, and show that it can generate useful and realistic document samples across various document types and privacy levels ($\varepsilon \in \{1, 5, 10\}$). Lastly, we show that our approach achieves substantial performance improvements in downstream evaluations on small-scale datasets, compared to the direct application of DP-Adam.

Updated: 2025-08-06 08:43:08

Categories: cs.CR

Download: http://arxiv.org/abs/2508.04208v1

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models -- a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies-such as adjusting format reward and controlling query difficulty-we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.

Updated: 2025-08-06 08:42:32

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.18892v3

RAILGUN: A Unified Convolutional Policy for Multi-Agent Path Finding Across Different Environments and Tasks

Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial for applications ranging from aerial swarms to warehouse automation. Solving MAPF is NP-hard, so learning-based approaches for MAPF have gained attention, particularly those leveraging deep neural networks. Nonetheless, despite the community's continued efforts, all learning-based MAPF planners still rely on decentralized planning due to variability in the number of agents and map sizes. We have developed the first centralized learning-based policy for the MAPF problem, called RAILGUN. RAILGUN is not an agent-based policy but a map-based policy. By leveraging a CNN-based architecture, RAILGUN can generalize across different maps and handle any number of agents. We collect trajectories from rule-based methods to train our model in a supervised way. In experiments, RAILGUN outperforms most baseline methods and demonstrates great zero-shot generalization capabilities on various tasks, maps and agent numbers that were not seen in the training dataset.

Updated: 2025-08-06 08:39:51

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2503.02992v2

Boosting Adversarial Transferability via Residual Perturbation Attack

Deep neural networks are susceptible to adversarial examples while suffering from incorrect predictions via imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to target models under black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes exhibit superior transferability to alleviate overfitting on surrogate models. However, the prior arts overlook the influence of perturbation directions, resulting in limited transferability. In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average on the input gradients to obtain the first moment as the reference gradient, which encompasses the direction of historical gradients. Instead of heavily relying on the local flatness that stems from the current gradients as the perturbation direction, ResPA further considers the residual between the current gradient and the reference gradient to capture the changes in the global perturbation direction. The experimental results demonstrate the better transferability of ResPA than the existing typical transfer-based attack methods, while the transferability can be further improved by combining ResPA with the current input transformation methods. The code is available at https://github.com/ZezeTao/ResPA.
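
The residual-gradient mechanism described above can be sketched as follows: an exponential moving average `m` of the input gradients serves as the reference direction, and each step perturbs along the residual `g - m` rather than the raw gradient `g`. The decay value is illustrative, and the gradient sequence is supplied by hand instead of coming from a real loss; the full attack would also include the sign step and projection of iterative methods like MI-FGSM.

```python
import numpy as np

# Residual-gradient sketch: EMA reference direction vs. current gradient.
def respa_directions(grads, decay=0.9):
    """Return the residual perturbation direction at each iteration."""
    m = np.zeros_like(grads[0])
    directions = []
    for g in grads:
        m = decay * m + (1 - decay) * g   # reference gradient (first moment)
        directions.append(g - m)          # residual: change vs. history
    return directions

grads = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
dirs = respa_directions(grads)
# With identical gradients the residual shrinks toward zero, so only genuine
# changes in the global direction drive the perturbation.
assert np.allclose(dirs[0], [0.9, 0.0])
assert np.allclose(dirs[1], [0.81, 0.0])
```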

Updated: 2025-08-06 08:39:08

Categories: cs.CV,cs.CR,cs.LG

Download: http://arxiv.org/abs/2508.05689v1

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs, which injects timely safety aha moments to steer reasoning toward processes that are harmless yet still helpful. Leveraging the model's internal attention behavior, our approach accurately identifies critical points in the reasoning path, and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Inducing minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while effectively avoiding the common exaggerated safety issues.

Updated: 2025-08-06 08:35:10

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04204v1

EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimize EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.

Updated: 2025-08-06 08:32:53

Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.00370v2

Gauge Flow Models

This paper introduces Gauge Flow Models, a novel class of Generative Flow Models. These models incorporate a learnable Gauge Field within the Flow Ordinary Differential Equation (ODE). A comprehensive mathematical framework for these models, detailing their construction and properties, is provided. Experiments using Flow Matching on Gaussian Mixture Models demonstrate that Gauge Flow Models yield significantly better performance than traditional Flow Models of comparable or even larger size. Additionally, unpublished research indicates a potential for enhanced performance across a broader range of generative tasks.

Updated: 2025-08-06 08:31:56

Domains: cs.LG,cs.AI,math.DG

Download: http://arxiv.org/abs/2507.13414v2

ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs

In visual-language model (VLM) reasoning, false positive (FP) reasoning occurs when a model generates a correct answer but follows an incorrect reasoning path. Existing methods rely on specific multi-step reasoning datasets and reinforcement learning strategies, leading to high training costs and limited generalization. In this work, we propose ViFP, a general framework for enhancing visual reasoning reliability. It improves both answer accuracy and reasoning soundness by detecting FPs. ViFP tackles the limitations of dataset dependency and poor generalization by constructing sub-question templates grounded in the core dimensions of visual reasoning, such as object localization, characteristic description, and object discovery. ViFP then builds effective reasoning paths via multi-turn QA to improve reasoning accuracy. Meanwhile, ViFP dynamically analyzes the consistency of the reasoning path to identify potential FPs and introduces a targeted chain-of-thought (CoT) mechanism that adaptively guides both FP and non-FP samples, thereby reducing logical errors in the reasoning path while preserving accuracy. Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OKVQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state of the art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.

Updated: 2025-08-06 08:31:11

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04201v1

Bootstrap Deep Spectral Clustering with Optimal Transport

Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering -- affinity matrix construction, spectral embedding, and k-means clustering -- using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at https://github.com/spdj2271/BootSC.
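
The abstract does not spell out BootSC's optimal-transport supervision, but the standard building block behind such schemes is Sinkhorn normalization: turning a sample-by-cluster score matrix into a balanced soft assignment. The sketch below is a hedged, generic illustration of that ingredient (all names are ours, not the paper's).

```python
import numpy as np

def sinkhorn_assignment(scores, n_iters=100, eps=0.05):
    """Sinkhorn iterations: alternately rescale columns (so every
    cluster receives roughly equal total mass) and rows (so each
    sample's assignment sums to one), yielding a balanced soft
    cluster assignment from raw scores."""
    P = np.exp((scores - scores.max()) / eps)  # positive, stable
    for _ in range(n_iters):
        P = P / P.sum(axis=0, keepdims=True)   # balance cluster mass
        P = P / P.sum(axis=1, keepdims=True)   # rows sum to 1
    return P

rng = np.random.default_rng(0)
P = sinkhorn_assignment(rng.normal(size=(12, 3)))
```

Such a balanced assignment can then serve as a pseudo-label target for bootstrapping, which avoids the degenerate solution of collapsing every sample into one cluster.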

Updated: 2025-08-06 08:30:30

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04200v1

Comparative Study of State-based Neural Networks for Virtual Analog Audio Effects Modeling

Artificial neural networks are a promising technique for virtual analog modeling, having shown particular success in emulating distortion circuits. Despite their potential, enhancements are needed to enable effect parameters to influence the network's response and to achieve a low-latency output. While hybrid solutions, which incorporate both analytical and black-box techniques, offer certain advantages, black-box approaches, such as neural networks, can be preferable in contexts where rapid deployment, simplicity, or adaptability are required, and where understanding the internal mechanisms of the system is less critical. In this article, we explore the application of recent machine learning advancements for virtual analog modeling. We compare State-Space models and Linear Recurrent Units against the more common LSTM networks, with a variety of audio effects. We evaluate the performance and limitations of these models using multiple metrics, providing insights for future research and development. Our metrics aim to assess the models' ability to accurately replicate the signal's energy and frequency contents, with a particular focus on transients. The Feature-wise Linear Modulation method is employed to incorporate effect parameters that influence the network's response, enabling dynamic adaptability based on specified conditions. Experimental results suggest that LSTM networks offer an advantage in emulating distortions and equalizers, although performance differences are sometimes subtle yet statistically significant. On the other hand, encoder-decoder configurations of Long Short-Term Memory networks and State-Space models excel in modeling saturation and compression, effectively managing the dynamic aspects inherent in these effects. However, no models effectively emulate the low-pass filter, and Linear Recurrent Units show inconsistent performance across various audio effects.

Updated: 2025-08-06 08:29:21

Domains: cs.SD,cs.AI

Download: http://arxiv.org/abs/2405.04124v6

Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective

Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework that suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.

Updated: 2025-08-06 08:26:36

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04197v1

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

Updated: 2025-08-06 08:25:40

Domains: cs.CL,cs.AI,cs.CR

Download: http://arxiv.org/abs/2508.04196v1

NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such signals remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at https://nvspeech170k.github.io/.
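
The inline-token transcription format from point (2) is easy to post-process. The sketch below splits a transcript carrying bracketed paralinguistic tokens into words and events; the square-bracket convention follows the abstract's example, while the function name and exact token inventory are assumptions.

```python
import re

def split_transcript(text):
    """Separate an ASR transcript with inline paralinguistic tokens
    (e.g. "[Laughter]") into its lexical words and its paralinguistic
    events, preserving nothing but the two streams."""
    events = re.findall(r"\[([^\]]+)\]", text)             # bracketed tokens
    words = re.sub(r"\s*\[[^\]]+\]\s*", " ", text).split() # remaining lexical words
    return words, events

words, events = split_transcript("You're so funny [Laughter] uhm ok [Breathing]")
```

This kind of separation is what lets a single decoder emit both streams jointly while downstream consumers (e.g. a TTS conditioning module) pick out just the events they need.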

Updated: 2025-08-06 08:25:26

Domains: cs.SD,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04195v1

Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes

The training of deep neural networks is inherently a nonconvex optimization problem, yet standard approaches such as stochastic gradient descent (SGD) require simultaneous updates to all parameters, often leading to unstable convergence and high computational cost. To address these issues, we propose a novel method, Stochastic Alternating Minimization with Trainable Step Sizes (SAMT), which updates network parameters in an alternating manner by treating the weights of each layer as a block. By decomposing the overall optimization into sub-problems corresponding to different blocks, this block-wise alternating strategy reduces per-step computational overhead and enhances training stability in nonconvex settings. To fully leverage these benefits, inspired by meta-learning, we propose a novel adaptive step size strategy and incorporate it into the sub-problem solving steps of the alternating updates. It supports different types of trainable step sizes, including but not limited to scalar, element-wise, row-wise, and column-wise, enabling adaptive step size selection tailored to each block via meta-learning. We further provide a theoretical convergence guarantee for the proposed algorithm, establishing its optimization soundness. Extensive experiments on multiple benchmarks demonstrate that SAMT achieves better generalization performance with fewer parameter updates compared to state-of-the-art methods, highlighting its effectiveness and potential in neural network optimization.
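
The block-wise alternating update at the heart of SAMT can be illustrated on a toy problem. In the sketch below the meta-learned step sizes are replaced by plain per-block scalars (an illustrative simplification; the paper learns them, and supports element-, row-, and column-wise variants), and the function names are ours.

```python
import numpy as np

def samt_sweep(blocks, grad_fn, step_sizes):
    """One sweep of alternating minimization: update each parameter
    block in turn with its own step size while the other blocks are
    held fixed."""
    for i in range(len(blocks)):
        blocks[i] = blocks[i] - step_sizes[i] * grad_fn(blocks, i)
    return blocks

# Toy objective: f(w0, w1) = ||w0||^2 + ||w1||^2, minimized at zero.
grad_fn = lambda blocks, i: 2.0 * blocks[i]
blocks = [np.ones(3), 2.0 * np.ones(2)]
for _ in range(100):
    blocks = samt_sweep(blocks, grad_fn, step_sizes=[0.25, 0.1])
```

Because each sub-problem touches only one block, the per-step cost and the instability of a full simultaneous update are both reduced, which is the motivation the abstract gives.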

Updated: 2025-08-06 08:23:38

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04193v1

Multi-task neural networks by learned contextual inputs

This paper explores learned-context neural networks, a multi-task learning architecture based on a fully shared neural network and an augmented input vector containing trainable task parameters. The architecture is interesting due to its powerful task adaptation mechanism, which facilitates a low-dimensional task parameter space. Theoretically, we show that a scalar task parameter is sufficient for universal approximation of all tasks, which is not necessarily the case for more common architectures. Empirically, we show that, for homogeneous tasks, the dimension of the task parameter may vary with the complexity of the tasks, but a small task parameter space is generally viable. The task parameter space is found to be well-behaved, which simplifies workflows for updating models as new data arrives and for learning new tasks while the shared parameters are kept frozen. Additionally, the architecture displays robustness towards datasets where tasks have few data points. The architecture's performance is compared to similar neural network architectures on ten datasets, with competitive results.
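
The forward pass of such a learned-context network is simple to sketch: the input is augmented with a trainable task parameter (a single scalar here, matching the universal-approximation claim) before entering a fully shared network. Weights and names below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_net(x, task_param, W1, W2):
    """Learned-context forward pass: append the task's trainable scalar
    to the input, then run the fully shared two-layer network. Only
    task_param differs between tasks; W1 and W2 are shared."""
    z = np.concatenate([x, [task_param]])  # augmented input vector
    h = np.tanh(W1 @ z)
    return W2 @ h

d_in, d_hid = 4, 8
W1 = rng.normal(size=(d_hid, d_in + 1))  # +1 column for the task scalar
W2 = rng.normal(size=(1, d_hid))
x = rng.normal(size=d_in)
y_a = shared_net(x, task_param=-1.0, W1=W1, W2=W2)
y_b = shared_net(x, task_param=+1.0, W1=W1, W2=W2)
```

Training would optimize W1 and W2 over all tasks jointly while each task optimizes only its own scalar, which is what makes adding a new task cheap once the shared weights are frozen.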

Updated: 2025-08-06 08:19:16

Domains: cs.LG

Download: http://arxiv.org/abs/2303.00788v2

BadTime: An Effective Backdoor Attack on Multivariate Long-Term Time Series Forecasting

Multivariate Long-Term Time Series Forecasting (MLTSF) models are increasingly deployed in critical domains such as climate, finance, and transportation. Although a variety of powerful MLTSF models have been proposed to improve predictive performance, the robustness of MLTSF models against malicious backdoor attacks remains entirely unexplored, which is crucial to ensuring their reliable and trustworthy deployment. To address this gap, we conduct an in-depth study on backdoor attacks against MLTSF models and propose the first effective attack method named BadTime. BadTime executes a backdoor attack by poisoning training data and customizing the backdoor training process. During data poisoning, BadTime proposes a contrast-guided strategy to select the most suitable training samples for poisoning, then employs a graph attention network to identify influential variables for trigger injection. Subsequently, BadTime further localizes optimal positions for trigger injection based on lag analysis and proposes a puzzle-like trigger structure that distributes the trigger across multiple poisoned variables to jointly steer the prediction of the target variable. During backdoor training, BadTime alternately optimizes the model and triggers via proposed tailored optimization objectives. Extensive experiments show that BadTime significantly outperforms state-of-the-art (SOTA) backdoor attacks on time series forecasting by reducing MAE by over 50% on target variables and boosting stealthiness by more than 3 times.

Updated: 2025-08-06 08:18:01

Domains: cs.CR

Download: http://arxiv.org/abs/2508.04189v1

A Comparative Study of Specialized LLMs as Dense Retrievers

While large language models (LLMs) are increasingly deployed as dense retrievers, the impact of their domain-specific specialization on retrieval effectiveness remains underexplored. This investigation systematically examines how task-specific adaptations in LLMs influence their retrieval capabilities, an essential step toward developing unified retrievers capable of handling text, code, images, and multimodal content. We conduct extensive experiments with eight Qwen2.5 7B LLMs, including base, instruction-tuned, code/math-specialized, long reasoning, and vision-language models across zero-shot retrieval settings and the supervised setting. For the zero-shot retrieval settings, we consider text retrieval from the BEIR benchmark and code retrieval from the CoIR benchmark. Further, to evaluate supervised performance, all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical specialization and the long reasoning capability cause consistent degradation in three settings, indicating conflicts between mathematical reasoning and semantic matching. The vision-language model and code-specialized LLMs demonstrate superior zero-shot performance compared to other LLMs, even surpassing BM25 on the code retrieval task, and maintain comparable performance to base LLMs in supervised settings. These findings suggest promising directions for the unified retrieval task leveraging cross-domain and cross-modal fusion.

Updated: 2025-08-06 08:11:23

Domains: cs.IR,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.03958v2

Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations, generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both the causal sufficiency and the causal necessity of tokens. Specifically, we evaluate each token's standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate the effectiveness of our approach, which effectively mitigates hallucinations in MLLMs.

Updated: 2025-08-06 08:09:12

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04182v1

KFS: KAN based adaptive Frequency Selection learning architecture for long term time series forecasting

Multi-scale decomposition architectures have emerged as predominant methodologies in time series forecasting. However, real-world time series exhibit noise interference across different scales, and the heterogeneous distribution of information among frequency components at varying scales leads to suboptimal multi-scale representations. Inspired by Kolmogorov-Arnold Networks (KAN) and Parseval's theorem, we propose a KAN based adaptive Frequency Selection learning architecture (KFS) to address these challenges. This framework tackles prediction challenges stemming from cross-scale noise interference and complex pattern modeling through its FreK module, which performs energy-distribution-based dominant frequency selection in the spectral domain. Simultaneously, KAN enables sophisticated pattern representation, while timestamp embedding alignment synchronizes temporal representations across scales. The feature mixing module then fuses scale-specific patterns with aligned temporal features. Extensive experiments across multiple real-world time series datasets demonstrate that KFS achieves state-of-the-art performance as a simple yet effective architecture.
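
Energy-distribution-based dominant frequency selection, as performed by the FreK module, can be illustrated with a plain FFT. The sketch below keeps the k rFFT bins with the largest energy and discards the rest; the function name and the fixed-k selection rule are our assumptions (the paper's selection is learned and adaptive).

```python
import numpy as np

def select_dominant_frequencies(x, k):
    """Keep the k spectral bins with the largest energy (|X|^2, per
    Parseval's theorem energy is preserved between domains) and zero
    out the rest, returning the filtered signal."""
    spec = np.fft.rfft(x)
    energy = np.abs(spec) ** 2
    keep = np.argsort(energy)[-k:]     # indices of top-k energy bins
    mask = np.zeros_like(spec)
    mask[keep] = spec[keep]
    return np.fft.irfft(mask, n=len(x))

t = np.arange(256) / 256.0
x = np.sin(2 * np.pi * 5 * t) + 0.1 * np.sin(2 * np.pi * 40 * t)
denoised = select_dominant_frequencies(x, k=1)
```

With k=1 the low-energy 40 Hz component is dropped and the dominant 5 Hz component is recovered, which is the sense in which spectral-domain selection suppresses cross-scale noise.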

Updated: 2025-08-06 08:08:39

Domains: cs.LG

Download: http://arxiv.org/abs/2508.00635v2

One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging pretraining to enhance performance. Notably, pretraining MolForge proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities as a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by MIST only moderately resemble the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, generating the correct molecular structure from mass spectra for 28% of cases at top-1 and 36% at top-10. We position this pipeline as a strong baseline for future research in de novo molecule elucidation from mass spectra.
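
The thresholding step and the Tanimoto similarity used to compare fingerprints are both simple enough to show directly. The sketch below is illustrative (function names and the 0.5 threshold are our assumptions; the paper does not give a value).

```python
import numpy as np

def threshold_fingerprint(probs, tau=0.5):
    """Step-function thresholding of predicted bit probabilities into a
    binary fingerprint, so the decoder sees substructure presence/absence
    rather than soft probabilities."""
    return (np.asarray(probs) >= tau).astype(int)

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints:
    |intersection| / |union| of the set bits."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

pred = threshold_fingerprint([0.9, 0.2, 0.7, 0.4])
truth = np.array([1, 0, 1, 1])
sim = tanimoto(pred, truth)
```

Here the predicted fingerprint misses one true bit, giving a Tanimoto similarity of 2/3, the kind of "moderate resemblance" from which a well-pretrained decoder can still recover the correct structure.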

Updated: 2025-08-06 08:05:01

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04180v1

The State Of TTS: A Case Study with Human Fooling Rates

While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing, (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar, (iii) Commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) Fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
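
As defined above, HFR reduces to the fraction of machine-generated utterances that listeners judge to be human. A minimal sketch (the judgment encoding and function name are our assumptions; the paper's protocol likely aggregates over listeners and items):

```python
def human_fooling_rate(judgments):
    """Human Fooling Rate: the fraction of machine-generated samples
    labeled "human" by listeners in a deception test."""
    fooled = sum(1 for j in judgments if j == "human")
    return fooled / len(judgments)

# Four listener judgments of machine speech, three mistaken for human.
hfr = human_fooling_rate(["human", "machine", "human", "human"])
```

A CMOS comparison can report parity even when this rate stays low, which is the gap between preference testing and deception testing that the abstract highlights.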

Updated: 2025-08-06 08:04:21

Subjects: cs.CL,cs.LG,cs.SD,eess.AS

Download: http://arxiv.org/abs/2508.04179v1

Secure Development of a Hooking-Based Deception Framework Against Keylogging Techniques

Keyloggers remain a serious threat in modern cybersecurity, silently capturing user keystrokes to steal credentials and sensitive information. Traditional defenses focus mainly on detection and removal, which can halt malicious activity but do little to engage or mislead adversaries. In this paper, we present a deception framework that leverages API hooking to intercept input-related API calls invoked by keyloggers at runtime and inject realistic decoy keystrokes. A core challenge, however, lies in the increasing adoption of anti-hooking techniques by advanced keyloggers. Anti-hooking strategies allow malware to bypass or detect instrumentation. To counter this, we introduce a hardened hooking layer that detects tampering and rapidly reinstates disrupted hooks, ensuring continuity of deception. We evaluate our framework against a custom-built "super keylogger" incorporating multiple evasion strategies, as well as 50 real-world malware samples spanning ten prominent keylogger families. Experimental results demonstrate that our system successfully resists sophisticated bypass attempts, maintains operational stealth, and reliably deceives attackers by feeding them decoys. The system operates with negligible performance overhead and no observable impact on user experience. Our findings show that resilient, runtime deception can play a practical and robust role in confronting advanced threats.
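
The actual framework hooks native input APIs at the OS level; as a loose, language-level analogue, the sketch below monkey-patches a stand-in input function to serve decoy keystrokes, then reinstates the hook when it detects tampering. Every name here is hypothetical and the mechanism is greatly simplified:

```python
import types

class ResilientHook:
    """Toy analogue of a hardened hook: serve decoys, detect
    tampering, and rapidly reinstate the hook if it is removed."""

    def __init__(self, module, name, decoy):
        self.module, self.name = module, name
        self.original = getattr(module, name)   # kept for clean removal
        self._hook = lambda *a, **kw: decoy     # decoy instead of real input
        self.install()

    def install(self):
        setattr(self.module, self.name, self._hook)

    def ensure(self):
        """Return True if the hook had been disrupted and was restored."""
        if getattr(self.module, self.name) is not self._hook:
            self.install()
            return True
        return False

# A stand-in "input API" a keylogger might call.
victim = types.SimpleNamespace(read_keystroke=lambda: "real-password")
hook = ResilientHook(victim, "read_keystroke", decoy="hunter2-decoy")
assert victim.read_keystroke() == "hunter2-decoy"

# The keylogger tries to unhook; the hardened layer reinstates.
victim.read_keystroke = lambda: "real-password"
assert hook.ensure() is True
assert victim.read_keystroke() == "hunter2-decoy"
```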

Updated: 2025-08-06 08:03:39

Subjects: cs.CR

Download: http://arxiv.org/abs/2508.04178v1

Quasi-Clique Discovery via Energy Diffusion

Discovering quasi-cliques -- subgraphs with edge density no less than a given threshold -- is a fundamental task in graph mining, with broad applications in social networks, bioinformatics, and e-commerce. Existing heuristics often rely on greedy rules, similarity measures, or metaheuristic search, but struggle to maintain both efficiency and solution consistency across diverse graphs. This paper introduces EDQC, a novel quasi-clique discovery algorithm inspired by energy diffusion. Instead of explicitly enumerating candidate subgraphs, EDQC performs stochastic energy diffusion from source vertices, naturally concentrating energy within structurally cohesive regions. The approach enables efficient dense subgraph discovery without exhaustive search or dataset-specific tuning. Experimental results on 30 real-world datasets demonstrate that EDQC consistently discovers larger quasi-cliques than state-of-the-art baselines on the majority of datasets, while also yielding lower variance in solution quality. To the best of our knowledge, EDQC is the first method to incorporate energy diffusion into quasi-clique discovery.
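
The core idea, energy diffusing from a source and pooling in cohesive regions, can be caricatured with a plain random walk: visit counts concentrate on the dense part of the graph, and the highest-energy vertices can then be checked against the density threshold. A rough sketch (the walk-based diffusion and the toy graph are our own simplifications, not the EDQC algorithm itself):

```python
import random

def edge_density(graph, nodes):
    """Fraction of possible edges present among `nodes`."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 1.0
    possible = len(nodes) * (len(nodes) - 1) / 2
    present = sum(1 for i, u in enumerate(nodes)
                  for v in nodes[i + 1:] if v in graph[u])
    return present / possible

def diffuse_energy(graph, source, steps=2000, seed=0):
    """Random-walk 'energy' diffusion from a source vertex."""
    rng = random.Random(seed)
    energy = {v: 0 for v in graph}
    node = source
    for _ in range(steps):
        energy[node] += 1
        node = rng.choice(sorted(graph[node]))
    return energy

# A 4-clique {0,1,2,3} with a sparse tail 3-4-5.
graph = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
energy = diffuse_energy(graph, source=0)
densest = sorted(graph, key=energy.get, reverse=True)[:4]
```

Energy accumulates in the clique, so the top-energy vertices form a subgraph whose edge density clears a high threshold.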

Updated: 2025-08-06 07:59:56

Subjects: cs.SI,cs.AI

Download: http://arxiv.org/abs/2508.04174v1

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released at https://github.com/AI45Lab/IS-Bench.
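
The process-oriented check, whether a mitigation happens before (or after) a risk-prone step rather than merely at some point, reduces to an ordering test over the execution trace. A sketch with invented action names:

```python
def mitigation_in_order(trace, risk_step, mitigation, before=True):
    """Process-oriented check: was the mitigation executed in the
    correct position relative to the risk-prone step?"""
    if mitigation not in trace or risk_step not in trace:
        return False
    m, r = trace.index(mitigation), trace.index(risk_step)
    return m < r if before else m > r

# Hypothetical household trace: the stove must be turned off
# before the agent leaves the kitchen.
safe_trace = ["pick_pot", "turn_off_stove", "leave_kitchen"]
unsafe_trace = ["pick_pot", "leave_kitchen", "turn_off_stove"]
assert mitigation_in_order(safe_trace, "leave_kitchen", "turn_off_stove")
assert not mitigation_in_order(unsafe_trace, "leave_kitchen", "turn_off_stove")
```

A post-hoc check would pass both traces, since the stove ends up off either way; the ordering test is what separates them.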

Updated: 2025-08-06 07:59:39

Subjects: cs.AI,cs.CL,cs.CV,cs.LG,cs.RO

Download: http://arxiv.org/abs/2506.16402v2

From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation

Domain shift poses a fundamental challenge in time series analysis, where models trained on source domain often fail dramatically when applied in target domain with different yet similar distributions. While current unsupervised domain adaptation (UDA) methods attempt to align cross-domain feature distributions, they typically treat features as indivisible entities, ignoring their intrinsic compositions that govern domain adaptation. We introduce DARSD, a novel UDA framework with theoretical explainability that explicitly realizes UDA tasks from the perspective of representation space decomposition. Our core insight is that effective domain adaptation requires not just alignment, but principled disentanglement of transferable knowledge from mixed representations. DARSD consists of three synergistic components: (I) An adversarial learnable common invariant basis that projects original features into a domain-invariant subspace while preserving semantic content; (II) A prototypical pseudo-labeling mechanism that dynamically separates target features based on confidence, hindering error accumulation; (III) A hybrid contrastive optimization strategy that simultaneously enforces feature clustering and consistency while mitigating emerging distribution gaps. Comprehensive experiments conducted on four benchmarks (WISDM, HAR, HHAR, and MFD) demonstrate DARSD's superiority against 12 UDA algorithms, achieving optimal performance in 35 out of 53 scenarios and ranking first across all benchmarks.

Updated: 2025-08-06 07:55:26

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.20968v3

Evaluating User Experience in Conversational Recommender Systems: A Systematic Review Across Classical and LLM-Powered Approaches

Conversational Recommender Systems (CRSs) are receiving growing research attention across domains, yet their user experience (UX) evaluation remains limited. Existing reviews largely overlook empirical UX studies, particularly in adaptive and large language model (LLM)-based CRSs. To address this gap, we conducted a systematic review following PRISMA guidelines, synthesising 23 empirical studies published between 2017 and 2025. We analysed how UX has been conceptualised, measured, and shaped by domain, adaptivity, and LLM. Our findings reveal persistent limitations: post hoc surveys dominate, turn-level affective UX constructs are rarely assessed, and adaptive behaviours are seldom linked to UX outcomes. LLM-based CRSs introduce further challenges, including epistemic opacity and verbosity, yet evaluations infrequently address these issues. We contribute a structured synthesis of UX metrics, a comparative analysis of adaptive and nonadaptive systems, and a forward-looking agenda for LLM-aware UX evaluation. These findings support the development of more transparent, engaging, and user-centred CRS evaluation practices.

Updated: 2025-08-06 07:55:11

Subjects: cs.IR,cs.AI,cs.HC,H.3.3; H.5.2; I.2.7

Download: http://arxiv.org/abs/2508.02096v2

Agentic-AI based Mathematical Framework for Commercialization of Energy Resilience in Electrical Distribution System Planning and Operation

The increasing vulnerability of electrical distribution systems to extreme weather events and cyber threats necessitates the development of economically viable frameworks for resilience enhancement. While existing approaches focus primarily on technical resilience metrics and enhancement strategies, there remains a significant gap in establishing market-driven mechanisms that can effectively commercialize resilience features while optimizing their deployment through intelligent decision-making. Moreover, traditional optimization approaches for distribution network reconfiguration often fail to dynamically adapt to both normal and emergency conditions. This paper introduces a novel framework integrating dual-agent Proximal Policy Optimization (PPO) with market-based mechanisms, achieving an average resilience score of 0.85 ± 0.08 over 10 test episodes. The proposed architecture leverages a dual-agent PPO scheme, where a strategic agent selects optimal DER-driven switching configurations, while a tactical agent fine-tunes individual switch states and grid preferences under budget and weather constraints. These agents interact within a custom-built dynamic simulation environment that models stochastic calamity events, budget limits, and resilience-cost trade-offs. A comprehensive reward function is designed that balances resilience enhancement objectives with market profitability (with up to 200x reward incentives, resulting in 85% of actions during calamity steps selecting configurations with 4 DERs), incorporating factors such as load recovery speed, system robustness, and customer satisfaction. Over 10 test episodes, the framework achieved a benefit-cost ratio of 0.12 ± 0.01, demonstrating sustainable market incentives for resilience investment.

Updated: 2025-08-06 07:49:37

Subjects: eess.SY,cs.GT,cs.LG,cs.SY

Download: http://arxiv.org/abs/2508.04170v1

Semi-Supervised Deep Domain Adaptation for Predicting Solar Power Across Different Locations

Accurate solar generation prediction is essential for proper estimation of renewable energy resources across diverse geographic locations. However, geographical and weather features vary from location to location, which introduces domain shift, a major bottleneck in developing location-agnostic prediction models. As a result, a machine-learning model which can perform well to predict solar power in one location may exhibit subpar performance in another location. Moreover, the lack of properly labeled data and storage issues make the task even more challenging. In order to address domain shift due to varying weather conditions across different meteorological regions, this paper presents a semi-supervised deep domain adaptation framework, allowing accurate predictions with minimal labeled data from the target location. Our approach involves training a deep convolutional neural network on a source location's data and adapting it to the target location using a source-free, teacher-student model configuration. The teacher-student model leverages consistency and cross-entropy loss for semi-supervised learning, ensuring effective adaptation without any source data requirement for prediction. With annotation of only $20\%$ of the data in the target domain, our approach achieves prediction accuracy improvements of up to $11.36\%$, $6.65\%$, and $4.92\%$ over the non-adaptive approach for California, Florida, and New York as target domains, respectively.

Updated: 2025-08-06 07:45:35

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.04165v1

Generic-to-Specific Reasoning and Learning for Scalable Ad Hoc Teamwork

AI agents deployed in assistive roles often have to collaborate with other agents (humans, AI systems) without prior coordination. Methods considered state of the art for such ad hoc teamwork often pursue a data-driven approach that needs a large labeled dataset of prior observations, lacks transparency, and makes it difficult to rapidly revise existing knowledge in response to changes. As the number of agents increases, the complexity of decision-making makes it difficult to collaborate effectively. This paper advocates leveraging the complementary strengths of knowledge-based and data-driven methods for reasoning and learning for ad hoc teamwork. For any given goal, our architecture enables each ad hoc agent to determine its actions through non-monotonic logical reasoning with: (a) prior commonsense domain-specific knowledge; (b) models learned and revised rapidly to predict the behavior of other agents; and (c) anticipated abstract future goals based on generic knowledge of similar situations in an existing foundation model. We experimentally evaluate our architecture's capabilities in VirtualHome, a realistic physics-based 3D simulation environment.

Updated: 2025-08-06 07:44:38

Subjects: cs.AI,cs.LO,cs.MA

Download: http://arxiv.org/abs/2508.04163v1

Scalable and (quantum-accessible) adaptive pseudorandom quantum states and pseudorandom function-like quantum state generators

Pseudorandom quantum states (PRSs) and pseudorandom function-like quantum state (PRFS) generators are quantum analogues of pseudorandom generators and pseudorandom functions. It is known that PRS (and PRFS) can exist even if BQP = QMA (relative to a quantum oracle) [Kre21] or if P = NP (relative to a classical oracle) [KQST23], which does not allow for the existence of one-way functions (relative to these oracles). Hence, these are potentially weaker objects than quantum-secure one-way functions, which can be used to do quantum cryptography. A desirable property of PRS and PRFS constructions is scalability, which ensures that the security parameter $\lambda$ (which determines indistinguishability from their Haar-random counterparts) can be much larger than $n$ (the number of qubits of the output states). This may be important in some applications where PRS and PRFS primitives are used. We present an isometric procedure to prepare quantum states that can be arbitrarily random (i.e., the trace distance from the Haar-random state can be arbitrarily small for the true random case, or the distinguishing advantage can be arbitrarily small for the pseudorandom case). Our procedure provides a new method for scalable PRS that introduces no entanglement or correlations with the environment. This naturally gives the first construction for scalable and (quantum-accessible) adaptive PRFS assuming quantum-secure one-way functions. Our PRFS construction implies various primitives, including long-input PRFS, short-input PRFS, short-output PRFS, non-adaptive PRFS, and classical-accessible adaptive PRFS [AQY22, AGQY22]. This new construction may be helpful in some simplification of the microcrypt zoo (https://sattath.github.io/microcrypt-zoo/).

Updated: 2025-08-06 07:40:55

Subjects: quant-ph,cs.CR

Download: http://arxiv.org/abs/2507.22535v2

Parse Trees Guided LLM Prompt Compression

Offering rich contexts to Large Language Models (LLMs) has shown to boost the performance in various tasks, but the resulting longer prompt would increase the computational cost and might exceed the input limit of LLMs. Recently, some prompt compression methods have been suggested to shorten the length of prompts by using language models to generate shorter prompts or by developing computational models to select important parts of original prompt. The generative compression methods would suffer from issues like hallucination, while the selective compression methods have not involved linguistic rules and overlook the global structure of prompt. To this end, we propose a novel selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules, and calculates local information entropy for each node in a parse tree. These local parse trees are then organized into a global tree according to the hierarchical structure such as the dependency of sentences, paragraphs, and sections. After that, the root-ward propagation and leaf-ward propagation are proposed to adjust node values over the global tree. Finally, a recursive algorithm is developed to prune the global tree based on the adjusted node values. The experiments show that PartPrompt receives the state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs for inference. The in-depth ablation studies confirm the effectiveness of designs in PartPrompt, and other additional experiments also demonstrate its superiority in terms of the coherence of compressed prompts and in the extreme long prompt scenario.
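
The node-scoring idea, local information entropy computed per element with low-value elements pruned, can be caricatured without the parse trees: estimate token probabilities, score each token by its self-information, and keep the highest-scoring tokens under a budget. A flat sketch (prompt-internal frequencies stand in for a real language model, and the tree propagation is omitted entirely; none of this is PartPrompt's exact procedure):

```python
import math
from collections import Counter

def self_information(tokens):
    """Local information content -log2 p(token), with p estimated
    from the prompt itself (a stand-in for a language model)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {i: -math.log2(counts[t] / total) for i, t in enumerate(tokens)}

def compress(tokens, ratio=0.5):
    """Keep the highest-information tokens, preserving order."""
    info = self_information(tokens)
    budget = max(1, int(len(tokens) * ratio))
    keep = sorted(sorted(info, key=info.get, reverse=True)[:budget])
    return [tokens[i] for i in keep]

tokens = "the cat sat on the mat the cat slept".split()
short = compress(tokens, ratio=0.5)  # rare tokens carry the most information
```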

Updated: 2025-08-06 07:37:24

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2409.15395v2

Evaluating Selective Encryption Against Gradient Inversion Attacks

Gradient inversion attacks pose significant privacy threats to distributed training frameworks such as federated learning, enabling malicious parties to reconstruct sensitive local training data from gradient communications between clients and an aggregation server during the aggregation process. While traditional encryption-based defenses, such as homomorphic encryption, offer strong privacy guarantees without compromising model utility, they often incur prohibitive computational overheads. To mitigate this, selective encryption has emerged as a promising approach, encrypting only a subset of gradient data based on the data's significance under a certain metric. However, there have been few systematic studies on how to specify this metric in practice. This paper systematically evaluates selective encryption methods with different significance metrics against state-of-the-art attacks. Our findings demonstrate the feasibility of selective encryption in reducing computational overhead while maintaining resilience against attacks. We propose a distance-based significance analysis framework that provides theoretical foundations for selecting critical gradient elements for encryption. Through extensive experiments on different model architectures (LeNet, CNN, BERT, GPT-2) and attack types, we identify gradient magnitude as a generally effective metric for protection against optimization-based gradient inversions. However, we also observe that no single selective encryption strategy is universally optimal across all attack scenarios, and we provide guidelines for choosing appropriate strategies for different model architectures and privacy requirements.
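
Selecting by gradient magnitude, the metric the experiments favor, amounts to ranking elements by |g| and encrypting only the top fraction. A sketch (the 30% fraction and toy gradient vector are illustrative):

```python
def select_for_encryption(gradients, fraction=0.1):
    """Indices of the largest-magnitude gradient elements; only
    these would be passed to the (expensive) encryption step."""
    k = max(1, int(len(gradients) * fraction))
    ranked = sorted(range(len(gradients)),
                    key=lambda i: abs(gradients[i]), reverse=True)
    return set(ranked[:k])

grads = [0.02, -1.5, 0.3, 0.01, -0.4, 2.1, 0.05, -0.08, 0.9, 0.001]
protected = select_for_encryption(grads, fraction=0.3)  # indices {1, 5, 8}
```

Everything outside `protected` would be sent in the clear, trading a bounded privacy risk for a large reduction in encryption overhead.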

Updated: 2025-08-06 07:31:43

Subjects: cs.CR,cs.LG

Download: http://arxiv.org/abs/2508.04155v1

HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair codes during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Technically, given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer failure reasons. By fusing structured execution traces capturing program-level events with VLM-based perceptual feedback, HyCodePolicy infers failure causes and repairs programs. This hybrid dual feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.

Updated: 2025-08-06 07:24:55

Subjects: cs.RO,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.02629v2

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10\% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
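
The DPO implicit reward is $\beta(\log \pi_\theta(y|x) - \log \pi_{\mathrm{ref}}(y|x))$, and the proposed selection keeps the examples with the smallest chosen-vs-rejected reward gap. A sketch with made-up log-probabilities (field names and the toy dataset are our own):

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward: beta * (log pi_theta - log pi_ref)."""
    return beta * (logp_policy - logp_ref)

def reward_gap(ex, beta=0.1):
    """Chosen-minus-rejected implicit reward; small gaps mark
    harder preference examples."""
    return (implicit_reward(ex["logp_chosen"], ex["ref_chosen"], beta)
            - implicit_reward(ex["logp_rejected"], ex["ref_rejected"], beta))

def select_hard(dataset, keep=0.1, beta=0.1):
    ranked = sorted(dataset, key=lambda ex: reward_gap(ex, beta))
    return ranked[:max(1, int(len(ranked) * keep))]

data = [
    {"id": "hard", "logp_chosen": -4.0, "ref_chosen": -5.0,
     "logp_rejected": -6.0, "ref_rejected": -5.5},
    {"id": "easy", "logp_chosen": -2.0, "ref_chosen": -5.0,
     "logp_rejected": -8.0, "ref_rejected": -5.0},
]
hard = select_hard(data, keep=0.5)  # keeps the small-gap example
```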

Updated: 2025-08-06 07:24:14

Subjects: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04149v1

Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants

Efficient and reliable operation of Concentrated Solar Power (CSP) plants is essential for meeting the growing demand for sustainable energy. However, high-temperature solar receivers face severe operational risks, such as freezing, deformation, and corrosion, resulting in costly downtime and maintenance. To monitor CSP plants, cameras mounted on solar receivers record infrared images at irregular intervals ranging from one to five minutes throughout the day. Anomalous images can be detected by thresholding an anomaly score, where the threshold is chosen to optimize metrics such as the F1-score on a validation set. First, this work proposes a framework, using risk control, for generating more reliable decision thresholds with finite-sample coverage guarantees on any chosen risk function. Our framework also incorporates an abstention mechanism, allowing high-risk predictions to be deferred to domain experts. Second, we propose a density forecasting method to estimate the likelihood of an observed image given a sequence of previously observed images, using this likelihood as its anomaly score. Third, we analyze the deployment results of our framework across multiple training scenarios over several months for two CSP plants. This analysis provides valuable insights to our industry partner for optimizing maintenance operations. Finally, given the confidential nature of our dataset, we provide an extended simulated dataset, leveraging recent advancements in generative modeling to create diverse thermal images that simulate multiple CSP plants. Our code is publicly available.
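
The paper's guarantee comes from risk-control machinery; as a rough stand-in, the sketch below picks the largest threshold whose empirical missed-anomaly rate, inflated by a Hoeffding-style finite-sample margin, stays below a target risk level. The margin choice, risk function, and toy validation data are our own assumptions, not the paper's procedure:

```python
import math

def risk_controlled_threshold(scores, labels, alpha=0.2, delta=0.5):
    """Largest threshold t (flag anomaly when score >= t) whose
    validation false-negative rate plus a Hoeffding-style
    finite-sample margin stays below alpha."""
    n_anom = sum(labels)
    margin = math.sqrt(math.log(1 / delta) / (2 * n_anom))
    best = None
    for t in sorted(set(scores)):
        missed = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        if missed / n_anom + margin <= alpha:
            best = t  # risk is monotone in t, so keep raising it
        else:
            break
    return best

# Toy validation set: 30 normal frames near 0.1, 20 anomalies,
# one of which scores low (0.3).
scores = [0.1] * 30 + [0.3] + [0.8] * 19
labels = [0] * 30 + [1] * 20
t = risk_controlled_threshold(scores, labels)
```

The selected threshold deliberately tolerates the one low-scoring anomaly, because the bound on the miss rate still holds with the stated confidence.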

Updated: 2025-08-06 07:22:15

Subjects: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.19146v2

Efficient Data Selection for Training Genomic Perturbation Models

Genomic studies, including CRISPR-based Perturb-seq analyses, face a vast hypothesis space, while gene perturbations remain costly and time-consuming. Gene perturbation models based on graph neural networks are trained to predict the outcomes of gene perturbations to facilitate such experiments. Due to the cost of genomic experiments, active learning is often employed to train these models, alternating between wet-lab experiments and model updates. However, the operational constraints of the wet-lab and the iterative nature of active learning significantly increase the total training time. Furthermore, the inherent sensitivity to model initialization can lead to markedly different sets of gene perturbations across runs, which undermines the reproducibility, interpretability, and reusability of the method. To this end, we propose a graph-based data filtering method that, unlike active learning, selects the gene perturbations in one shot and in a model-free manner. The method optimizes a criterion that maximizes the supervision signal from the graph neural network to enhance generalization. The criterion is defined over the input graph and is optimized with submodular maximization. We compare it empirically to active learning, and the results demonstrate that despite yielding months of acceleration, it also improves the stability of the selected perturbation experiments while achieving comparable test error.
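
Submodular maximization of the kind described is typically done greedily, which carries the classic (1 - 1/e) approximation guarantee for monotone submodular objectives. A sketch with a coverage-style objective (the gene names and covered sets are invented; the paper's actual criterion is defined over the input graph):

```python
def greedy_submodular(candidates, coverage, budget):
    """Greedy maximization of a monotone submodular coverage
    function: repeatedly pick the candidate with the largest
    marginal gain."""
    selected, covered = [], set()
    for _ in range(budget):
        best = max((c for c in candidates if c not in selected),
                   key=lambda c: len(coverage[c] - covered),
                   default=None)
        if best is None or not (coverage[best] - covered):
            break
        selected.append(best)
        covered |= coverage[best]
    return selected

# Hypothetical: each perturbation "covers" the graph regions
# it would supervise.
coverage = {
    "geneA": {1, 2, 3, 4},
    "geneB": {3, 4, 5},
    "geneC": {5, 6},
    "geneD": {1, 2},
}
picked = greedy_submodular(list(coverage), coverage, budget=2)
```

Unlike active learning, this selection is one-shot and model-free: it depends only on the graph-derived sets, so repeated runs return the same perturbation panel.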

Updated: 2025-08-06 07:22:08

Fields: q-bio.QM,cs.LG

Download: http://arxiv.org/abs/2503.14571v5

UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields

Neural Radiance Field (NeRF)-based segmentation methods focus on object semantics and rely solely on RGB data, lacking intrinsic material properties. This limitation restricts accurate material perception, which is crucial for robotics, augmented reality, simulation, and other applications. We introduce UnMix-NeRF, a framework that integrates spectral unmixing into NeRF, enabling joint hyperspectral novel view synthesis and unsupervised material segmentation. Our method models spectral reflectance via diffuse and specular components, where a learned dictionary of global endmembers represents pure material signatures, and per-point abundances capture their distribution. For material segmentation, we use spectral signature predictions along learned endmembers, allowing unsupervised material clustering. Additionally, UnMix-NeRF enables scene editing by modifying learned endmember dictionaries for flexible material-based appearance manipulation. Extensive experiments validate our approach, demonstrating superior spectral reconstruction and material segmentation to existing methods. Project page: https://www.factral.co/UnMix-NeRF.
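The unmixing idea rests on the classical linear mixing model: each point's reflectance is a convex combination of pure-material endmember spectra, and segmentation follows from the dominant abundance. A schematic sketch (array names are our own, not the paper's API):

```python
import numpy as np

def mix_spectra(abundances, endmembers):
    """Linear spectral mixing: reconstruct per-point reflectance.

    abundances: (n_points, n_materials), nonnegative rows summing to 1.
    endmembers: (n_materials, n_bands) dictionary of pure-material spectra.
    Unsupervised material segmentation labels each point by its
    dominant abundance.
    """
    reflectance = abundances @ endmembers
    labels = abundances.argmax(axis=1)
    return reflectance, labels
```

Editing the endmember dictionary (swapping a row's spectrum) then changes the rendered material appearance without retraining, which is the intuition behind the scene-editing capability described above.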

Updated: 2025-08-06 07:17:05

Fields: eess.IV,cs.AI,cs.CV,cs.LG,eess.SP

Download: http://arxiv.org/abs/2506.21884v2

GRIT: Graph-Regularized Logit Refinement for Zero-shot Cell Type Annotation

Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. In practice, human experts often rely on the structure revealed by principal component analysis (PCA) followed by $k$-nearest neighbor ($k$-NN) graph construction to guide annotation. While effective, this process is labor-intensive and does not scale to large datasets. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. By aligning scRNA-seq profiles with natural language descriptions, models like LangCell enable zero-shot annotation. While LangCell demonstrates decent zero-shot performance, its predictions remain suboptimal, particularly in achieving consistent accuracy across all cell types. In this paper, we propose to refine the zero-shot logits produced by LangCell through a graph-regularized optimization framework. By enforcing local consistency over the task-specific PCA-based k-NN graph, our method combines the scalability of pre-trained models with the structural robustness relied upon in expert annotation. We evaluate our approach on 14 annotated human scRNA-seq datasets from 4 distinct studies, spanning 11 organs and over 200,000 single cells. Our method consistently improves zero-shot annotation accuracy, achieving gains of up to 10%. Further analysis showcases the mechanism by which GRIT effectively propagates correct signals through the graph, pulling back mislabeled cells toward more accurate predictions. The method is training-free, model-agnostic, and serves as a simple yet effective plug-in for enhancing automated cell type annotation in practice.
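The graph-regularized refinement can be pictured as label-propagation-style smoothing of logits over the PCA-based k-NN graph: each cell's scores are repeatedly blended with its neighbors' while staying anchored to the zero-shot output. This is an illustrative sketch, not GRIT's exact optimization.

```python
import numpy as np

def refine_logits(logits, knn_graph, alpha=0.5, n_iters=20):
    """Smooth zero-shot logits over a k-NN graph (hypothetical sketch).

    logits: (n_cells, n_types) raw scores from the zero-shot model.
    knn_graph: (n_cells, n_cells) binary adjacency from a PCA + k-NN build.
    Each iteration blends a cell's logits with the row-normalized average
    of its neighbors', enforcing local consistency over the graph.
    """
    W = knn_graph / np.maximum(knn_graph.sum(axis=1, keepdims=True), 1)
    Z = logits.copy()
    for _ in range(n_iters):
        Z = alpha * (W @ Z) + (1 - alpha) * logits
    return Z.argmax(axis=1)  # refined cell-type assignments
```

In the toy case below, a cell whose raw logits favor the wrong type is pulled back by its two correctly scored neighbors, mirroring the propagation behavior described above.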

Updated: 2025-08-06 07:09:46

Fields: q-bio.GN,cs.LG

Download: http://arxiv.org/abs/2508.04747v1

COPO: Consistency-Aware Policy Optimization

Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency; the global loss built on this reward ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. Code of this work has been released at https://github.com/hijih/copo-code.git.
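The degenerate case motivating this work is easy to see in a GRPO-style group-normalized advantage (illustrative sketch, not the paper's code): once every sampled response earns the same reward, the standard deviation is zero and all advantages collapse.

```python
import numpy as np

def group_advantage(rewards):
    """Group-normalized advantage over sampled responses to one prompt.

    When all responses receive the same reward (all correct or all wrong),
    the std is zero and every advantage collapses to 0 -- the vanishing-
    gradient case that the consistency-aware global reward is designed
    to compensate for.
    """
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    return np.zeros_like(r) if std == 0 else (r - r.mean()) / std
```

COPO's structured global reward supplies a nonzero learning signal precisely in this all-identical case, rather than discarding the group.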

Updated: 2025-08-06 07:05:18

Fields: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.04138v1

Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities such as molecular data, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

Updated: 2025-08-06 07:03:33

Fields: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.18601v2

AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity

Recently, large multimodal models (LMMs) have achieved significant advancements. When dealing with high-resolution images, dominant LMMs typically divide them into multiple local images and a global image, leading to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. Specifically, we first apply the multiple pooling layers to obtain visual tokens at different granularities. Then we propose a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we put forward RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference (e.g., an 85.3% reduction in visual tokens and a 2.53$\times$ increase in inference speed on the AI2D benchmark).

Updated: 2025-08-06 07:03:19

Fields: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2410.02745v3

Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation effort and computational resources due to user-specific fine-tuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retraining. To reflect realistic constraints, we also investigate hybrid approaches that combine limited annotated data with unsupervised user background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best-performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Ours is the first study to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.
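For reference, the LoRA adaptation in strategy (1) augments each frozen weight with a trainable low-rank update, so per-user fine-tuning touches only a small fraction of the parameters. A minimal NumPy sketch of the mechanism (not the authors' implementation):

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA-adapted linear layer (illustrative).

    The frozen weight W is augmented with a low-rank update B @ A, so only
    r * (d_in + d_out) parameters are trained per layer instead of
    d_in * d_out.
    """
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                  # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, d_in))     # trainable down-projection
        self.B = np.zeros((d_out, r))               # trainable up-projection, zero init
        self.scale = alpha / r                      # so the update starts as a no-op

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)
```

Because B is zero-initialized, the adapted layer reproduces the base model exactly before training, and only A and B need to be stored per user.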

Updated: 2025-08-06 07:02:55

Fields: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.16227v2

UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly fine-tune pre-trained vision-language models to achieve performance gains, yet suffer from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it, we convert each image into an image-description pair, enabling more comprehensive feature representation, and construct multimodal category templates from the few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLM-based approaches.

Updated: 2025-08-06 07:02:39

Fields: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04136v1

DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation

Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or the fine-grained detailed features in isolation, overlooking the fact that these two types of features hold vital relationships in medical image analysis. We advocate complementary feature supervision for medical image segmentation, proposing a Detail-Semantic Deep Supervision Network (DS$^2$Net). DS$^2$Net navigates both low-level detailed and high-level semantic feature supervision through a Detail Enhance Module (DEM) and a Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS$^2$Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured by colonoscopy, ultrasound, or microscopy, we demonstrate that DS$^2$Net consistently outperforms state-of-the-art methods for medical image analysis.

Updated: 2025-08-06 06:57:36

Fields: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04131v1

DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting

Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer's decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC.
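The exponentially scaled patching behind EMPD can be sketched simply; the fixed schedule below is an assumption for illustration, since the actual module adapts patch granularity to the input rather than using predefined sizes.

```python
import numpy as np

def multiscale_patches(x, n_scales=3, base=4):
    """Split a series into patches at exponentially growing granularities.

    x: (T,) series; scale s uses patch length base * 2**s. Tail elements
    that do not fill a whole patch are dropped for simplicity. Each
    (n_patches, patch_len) view feeds one layer of the cascade, with
    coarse scales guiding finer ones.
    """
    out = []
    for s in range(n_scales):
        p = base * 2 ** s
        n = len(x) // p
        out.append(x[: n * p].reshape(n, p))
    return out
```

For a length-32 series this yields views of shape (8, 4), (4, 8), and (2, 16), i.e., hierarchical patches whose granularity doubles per scale.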

Updated: 2025-08-06 06:56:14

Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.02753v2

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS.
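The core of inference-time activation steering is simple: add a scaled emotion direction to selected hidden activations. A schematic sketch of the general technique (the choice of direction and layer is our assumption, not the paper's exact algorithm, which also performs activation extraction and emotional token searching):

```python
import numpy as np

def steer(activations, direction, strength=1.0):
    """Inference-time activation steering (schematic).

    Adds a unit-norm emotion 'direction' (e.g., the difference between
    mean activations on emotional vs. neutral reference speech) to the
    hidden states. 'strength' gives continuous control: interpolation at
    intermediate values, erasure at negative ones.
    """
    d = direction / np.linalg.norm(direction)
    return activations + strength * d
```

Because the steering happens only at inference, the same vector can be applied to different pretrained flow-matching TTS backbones without any retraining.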

Updated: 2025-08-06 06:54:21

Fields: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2508.03543v2

Learning to Inference Adaptively for Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in visual reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent effort on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.

Updated: 2025-08-06 06:52:07

Fields: cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.10905v3

CAIN: Hijacking LLM-Humans Conversations via Malicious System Prompts

Large language models (LLMs) have advanced many applications, but are also known to be vulnerable to adversarial attacks. In this work, we introduce a novel security threat: hijacking AI-human conversations by manipulating LLMs' system prompts to produce malicious answers only to specific targeted questions (e.g., "Who should I vote for US President?", "Are Covid vaccines safe?"), while behaving benignly on others. This attack is detrimental as it can enable malicious actors to exercise large-scale information manipulation by spreading harmful but benign-looking system prompts online. To demonstrate such an attack, we develop CAIN, an algorithm that can automatically curate such harmful system prompts for a specific target question in a black-box setting or without the need to access the LLM's parameters. Evaluated on both open-source and commercial LLMs, CAIN demonstrates significant adversarial impact. In untargeted attacks or forcing LLMs to output incorrect answers, CAIN achieves up to 40% F1 degradation on targeted questions while preserving high accuracy on benign inputs. For targeted attacks or forcing LLMs to output specific harmful answers, CAIN achieves over 70% F1 scores on these targeted responses with minimal impact on benign questions. Our results highlight the critical need for enhanced robustness measures to safeguard the integrity and safety of LLMs in real-world applications. All source code will be publicly available.

Updated: 2025-08-06 06:49:53

Fields: cs.CR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.16888v2

Experimental Analysis of Productive Interaction Strategy with ChatGPT: User Study on Function and Project-level Code Generation Tasks

The application of Large Language Models (LLMs) is growing in the productive completion of Software Engineering tasks. Yet, studies investigating productive prompting techniques have often employed a limited problem space, primarily focusing on well-known prompting patterns and mainly targeting function-level SE practices. We identify significant gaps in real-world workflows that involve complexities beyond the class level (e.g., multi-class dependencies) and features that can impact Human-LLM Interaction (HLI) processes in code generation. To address these issues, we designed an experiment that comprehensively analyzed the HLI features with respect to code generation productivity. Our study presents two project-level benchmark tasks, extending beyond function-level evaluations. We conducted a user study with 36 participants from diverse backgrounds, asking them to solve the assigned tasks by interacting with the GPT assistant using specific prompting patterns. We also examined the participants' experience and their behavioral features during interactions by analyzing screen recordings and GPT chat logs. Our statistical and empirical investigation revealed (1) that three out of 15 HLI features significantly impacted productivity in code generation; (2) five primary guidelines for enhancing productivity in HLI processes; and (3) a taxonomy of 29 runtime and logic errors that can occur during HLI processes, along with suggested mitigation plans.

Updated: 2025-08-06 06:48:48

Fields: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.04125v1

Gradient-Based Multi-Objective Deep Learning: Algorithms, Theories, Applications, and Beyond

Many modern deep learning applications require balancing multiple objectives that are often conflicting. Examples include multi-task learning, fairness-aware learning, and the alignment of Large Language Models (LLMs). This leads to multi-objective deep learning, which tries to find optimal trade-offs or Pareto-optimal solutions by adapting mathematical principles from the field of Multi-Objective Optimization (MOO). However, directly applying gradient-based MOO techniques to deep neural networks presents unique challenges, including high computational costs, optimization instability, and the difficulty of effectively incorporating user preferences. This paper provides a comprehensive survey of gradient-based techniques for multi-objective deep learning. We systematically categorize existing algorithms based on their outputs: (i) methods that find a single, well-balanced solution, (ii) methods that generate a finite set of diverse Pareto-optimal solutions, and (iii) methods that learn a continuous Pareto set of solutions. In addition to this taxonomy, the survey covers theoretical analyses, key applications, practical resources, and highlights open challenges and promising directions for future research. A comprehensive list of multi-objective deep learning algorithms is available at https://github.com/Baijiong-Lin/Awesome-Multi-Objective-Deep-Learning.
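As a concrete instance of the gradient-based machinery surveyed here, the two-task case of the Multiple Gradient Descent Algorithm (MGDA) has a closed form: take the minimum-norm point in the convex hull of the task gradients, whose negative is a common descent direction for both objectives.

```python
import numpy as np

def mgda_two_task(g1, g2):
    """Min-norm convex combination of two task gradients (MGDA, closed form).

    Minimizes ||gamma*g1 + (1-gamma)*g2|| over gamma in [0, 1]. When the
    result d is nonzero, -d decreases both objectives; d = 0 certifies a
    Pareto-stationary point.
    """
    diff = g1 - g2
    denom = diff @ diff
    gamma = 0.5 if denom == 0 else float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return gamma * g1 + (1 - gamma) * g2
```

This corresponds to category (i) in the taxonomy above (a single well-balanced solution); methods in categories (ii) and (iii) extend such updates to produce diverse or continuous Pareto sets.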

Updated: 2025-08-06 06:47:04

Fields: cs.LG,stat.ML

Download: http://arxiv.org/abs/2501.10945v3

Learning Using Privileged Information for Litter Detection

As litter pollution continues to rise globally, developing automated tools capable of detecting litter effectively remains a significant challenge. This study presents a novel approach that combines, for the first time, privileged information with deep learning object detection to improve litter detection while maintaining model efficiency. We evaluate our method across five widely used object detection models, addressing challenges such as detecting small litter and objects partially obscured by grass or stones. In addition, a key contribution of our work is a means of encoding bounding-box information as a binary mask, which can be fed to the detection model to refine detection guidance. Through experiments with both within-dataset evaluation on the renowned SODA dataset and cross-dataset evaluation on the BDW and UAVVaste litter detection datasets, we demonstrate consistent performance improvements across all models. Our approach not only bolsters detection accuracy within the training sets but also generalises well to other litter detection contexts. Crucially, these improvements are achieved without increasing model complexity or adding extra layers, ensuring computational efficiency and scalability. Our results suggest that this methodology offers a practical solution for litter detection, balancing accuracy and efficiency in real-world applications.
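The bounding-box-as-binary-mask encoding can be sketched directly; the `(x1, y1, x2, y2)` pixel-coordinate convention below is an assumption for illustration.

```python
import numpy as np

def boxes_to_mask(boxes, h, w):
    """Encode ground-truth bounding boxes as one binary mask channel.

    boxes: iterable of (x1, y1, x2, y2) pixel coordinates. The resulting
    mask can be supplied to the detector as privileged information during
    training (it is unavailable at test time, in the learning-using-
    privileged-information setting).
    """
    mask = np.zeros((h, w), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1
    return mask
```

Because the mask is a single extra input channel rather than extra layers, it adds guidance without increasing model complexity, consistent with the efficiency claim above.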

Updated: 2025-08-06 06:46:14

Subjects: cs.CV,cs.ET,cs.LG,cs.PF

Download: http://arxiv.org/abs/2508.04124v1

Distributional Soft Actor-Critic with Three Refinements

Reinforcement learning (RL) has shown remarkable success in solving complex decision-making and control tasks. However, many model-free RL algorithms experience performance degradation due to inaccurate value estimation, particularly the overestimation of Q-values, which can lead to suboptimal policies. To address this issue, we previously proposed the Distributional Soft Actor-Critic (DSAC or DSACv1), an off-policy RL algorithm that enhances value estimation accuracy by learning a continuous Gaussian value distribution. Despite its effectiveness, DSACv1 faces challenges such as training instability and sensitivity to reward scaling, caused by high variance in critic gradients due to return randomness. In this paper, we introduce three key refinements to DSACv1 to overcome these limitations and further improve Q-value estimation accuracy: expected value substitution, twin value distribution learning, and variance-based critic gradient adjustment. The enhanced algorithm, termed DSAC with Three refinements (DSAC-T or DSACv2), is systematically evaluated across a diverse set of benchmark tasks. Without the need for task-specific hyperparameter tuning, DSAC-T consistently matches or outperforms leading model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T ensures a stable learning process and maintains robust performance across varying reward scales. Its effectiveness is further demonstrated through real-world application in controlling a wheeled robot, highlighting its potential for deployment in practical robotic tasks.
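
Two of the three refinements are easiest to see in miniature: bootstrap from the expected value of the smaller of the two learned return distributions instead of a sampled return, and down-weight critic gradients when the predicted return variance is large. The sketch below is a deliberate simplification of DSAC-T's update rules, not the published algorithm:

```python
def td_target(reward, gamma, mean_q1, mean_q2):
    # Expected-value substitution + twin value distributions: bootstrap
    # from the smaller predicted mean return rather than from a sampled
    # return, reducing target variance and overestimation.
    return reward + gamma * min(mean_q1, mean_q2)

def critic_grad_weight(sigma, sigma_ref):
    # Variance-based adjustment: shrink the critic gradient when the
    # predicted return std sigma is large relative to a running
    # reference sigma_ref (illustrative functional form only).
    return min(1.0, (sigma_ref / max(sigma, 1e-8)) ** 2)
```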

Updated: 2025-08-06 06:37:01

Subjects: cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2310.05858v7

AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities

Open-domain Knowledge Graph Completion (KGC) faces significant challenges in an ever-changing world, especially when considering the continual emergence of new entities in daily news. Existing approaches for KGC mainly rely on pretrained language models' parametric knowledge, pre-constructed queries, or single-step retrieval, typically requiring substantial supervision and training data. Even so, they often fail to capture comprehensive and up-to-date information about unpopular and/or emerging entities. To this end, we introduce Agentic Reasoning for Emerging Entities (AgREE), a novel agent-based framework that combines iterative retrieval actions and multi-step reasoning to dynamically construct rich knowledge graph triplets. Experiments show that, despite requiring zero training efforts, AgREE significantly outperforms existing methods in constructing knowledge graph triplets, especially for emerging entities that were not seen during language models' training processes, outperforming previous methods by up to 13.7%. Moreover, we propose a new evaluation methodology that addresses a fundamental weakness of existing setups and a new benchmark for KGC on emerging entities. Our work demonstrates the effectiveness of combining agent-based reasoning with strategic information retrieval for maintaining up-to-date knowledge graphs in dynamic information environments.

Updated: 2025-08-06 06:34:22

Subjects: cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.04118v1

A Compositional Framework for On-the-Fly LTLf Synthesis

Reactive synthesis from Linear Temporal Logic over finite traces (LTLf) can be reduced to a two-player game over a Deterministic Finite Automaton (DFA) of the LTLf specification. The primary challenge here is DFA construction, which is 2EXPTIME-complete in the worst case. Existing techniques either construct the DFA compositionally before solving the game, leveraging automata minimization to mitigate state-space explosion, or build the DFA incrementally during game solving to avoid full DFA construction. However, neither is dominant. In this paper, we introduce a compositional on-the-fly synthesis framework that integrates the strengths of both approaches, focusing on large conjunctions of smaller LTLf formulas common in practice. This framework applies composition during game solving instead of automata (game arena) construction. While composing all intermediate results may be necessary in the worst case, pruning these results simplifies subsequent compositions and enables early detection of unrealizability. Specifically, the framework allows two composition variants: pruning before composition to take full advantage of minimization or pruning during composition to guide on-the-fly synthesis. Compared to state-of-the-art synthesis solvers, our framework is able to solve a notable number of instances that other solvers cannot handle. A detailed analysis shows that both composition variants have unique merits.

Updated: 2025-08-06 06:31:49

Subjects: cs.AI

Download: http://arxiv.org/abs/2508.04116v1

Why the Agent Made that Decision: Contrastive Explanation Learning for Reinforcement Learning

Reinforcement learning (RL) has demonstrated remarkable success in solving complex decision-making problems, yet its adoption in critical domains is hindered by the lack of interpretability in its decision-making processes. Existing explainable AI (xAI) approaches often fail to provide meaningful explanations for RL agents, particularly because they overlook the contrastive nature of human reasoning--answering "why this action instead of that one?". To address this gap, we propose $\textbf{VisionMask}$, a novel contrastive learning framework for explaining the actions selected by RL agents. VisionMask is trained in a self-supervised manner to generate explanations by explicitly contrasting the agent's chosen action with alternative actions in a given state. We demonstrate the efficacy of our method through experiments across diverse RL environments, evaluating it in terms of faithfulness, robustness, and complexity. Our results show that VisionMask significantly improves human understanding of agent behavior while maintaining accuracy and fidelity. Furthermore, we present examples illustrating how VisionMask can be used for counterfactual analysis. This work bridges the gap between RL and xAI, paving the way for safer and more interpretable RL systems.

Updated: 2025-08-06 06:26:27

Subjects: cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.16120v2

Deep Discrete Encoders: Identifiable Deep Generative Models for Rich Data with Discrete Latent Layers

In the era of generative AI, deep generative models (DGMs) with latent representations have gained tremendous popularity. Despite their impressive empirical performance, the statistical properties of these models remain underexplored. DGMs are often overparametrized, non-identifiable, and uninterpretable black boxes, raising serious concerns when deploying them in high-stakes applications. Motivated by this, we propose interpretable deep generative models for rich data types with discrete latent layers, called Deep Discrete Encoders (DDEs). A DDE is a directed graphical model with multiple binary latent layers. Theoretically, we propose transparent identifiability conditions for DDEs, which imply progressively smaller sizes of the latent layers as they go deeper. Identifiability ensures consistent parameter estimation and inspires an interpretable design of the deep architecture. Computationally, we propose a scalable estimation pipeline of a layerwise nonlinear spectral initialization followed by a penalized stochastic approximation EM algorithm. This procedure can efficiently estimate models with exponentially many latent components. Extensive simulation studies for high-dimensional data and deep architectures validate our theoretical results and demonstrate the excellent performance of our algorithms. We apply DDEs to three diverse real datasets with different data types to perform hierarchical topic modeling, image representation learning, and response time modeling in educational testing.

Updated: 2025-08-06 06:22:51

Subjects: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2501.01414v2

DRL-ORA: Distributional Reinforcement Learning with Online Risk Adaption

One of the main challenges in reinforcement learning (RL) is that the agent has to make decisions that would influence the future performance without having complete knowledge of the environment. Dynamically adjusting the level of epistemic risk during the learning process can help to achieve reliable policies in safety-critical settings with better efficiency. In this work, we propose a new framework, Distributional RL with Online Risk Adaptation (DRL-ORA). This framework quantifies both epistemic and implicit aleatory uncertainties in a unified manner and dynamically adjusts the epistemic risk levels by solving a total variation minimization problem online. The framework unifies the existing variants of risk adaption approaches and offers better explainability and flexibility. The selection of risk levels is performed efficiently via a grid search using a Follow-The-Leader-type algorithm, where the offline oracle also corresponds to a ''satisficing measure'' under a specially modified loss function. We show that DRL-ORA outperforms existing methods that rely on fixed risk levels or manually designed risk level adaptation in multiple classes of tasks.

Updated: 2025-08-06 06:16:52

Subjects: cs.LG

Download: http://arxiv.org/abs/2310.05179v4

Negative binomial regression and inference using a pre-trained transformer

Negative binomial regression is essential for analyzing over-dispersed count data in comparative studies, but parameter estimation becomes computationally challenging in large screens requiring millions of comparisons. We investigate using a pre-trained transformer to produce estimates of negative binomial regression parameters from observed count data, trained through synthetic data generation to learn to invert the process of generating counts from parameters. The transformer method achieved better parameter accuracy than maximum likelihood optimization while being 20 times faster. However, comparisons unexpectedly revealed that method of moment estimates performed as well as maximum likelihood optimization in accuracy, while being 1,000 times faster and producing better-calibrated and more powerful tests, making it the most efficient solution for this application.
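
The method-of-moments estimator the abstract favors is essentially a two-line computation. A sketch under the common negative binomial parameterization variance = mean + mean^2 / r (the dispersion symbol r is our notational choice):

```python
import numpy as np

def nb_method_of_moments(counts):
    """Method-of-moments fit of a negative binomial with
    variance = mean + mean**2 / r. Returns (mean, dispersion r).
    Requires over-dispersion: sample variance > sample mean.
    """
    counts = np.asarray(counts, dtype=float)
    m = counts.mean()
    v = counts.var(ddof=1)
    if v <= m:
        raise ValueError("not over-dispersed; NB moment fit undefined")
    return m, m * m / (v - m)
```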

Updated: 2025-08-06 06:15:40

Subjects: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.04111v1

A Foundational Multi-Modal Model for Few-Shot Learning

Few-shot learning (FSL) is a machine learning paradigm that aims to generalize models from a small number of labeled examples, typically fewer than 10 per class. FSL is particularly crucial in biomedical, environmental, materials, and mechanical sciences, where samples are limited and data collection is often prohibitively costly, time-consuming, or ethically constrained. In this study, we present an innovative approach to FSL by demonstrating that a Large Multi-Modal Model (LMMM), trained on a set of independent tasks spanning diverse domains, task types, and input modalities, can substantially improve the generalization of FSL models, outperforming models based on conventional meta-learning on tasks of the same type. To support this, we first constructed a Multi-Modal Model Few-shot Dataset (M3FD, over 10K+ few-shot samples), which includes 2D RGB images, 2D/3D medical scans, tabular and time-course datasets, from which we manually curated FSL tasks such as classification. We further introduced M3F (Multi-Modal Model for Few-shot learning framework), a novel Large Multi-Modal Model framework tailored for data-constrained scientific applications. M3F supports a wide range of scientific data types through a modular pipeline. By fine-tuning the model on M3FD, M3F improves model performance, making LMMM feasible for real-world FSL deployment. The source code is located at https://github.com/ptdang1001/M3F. To democratize access to complex FSL data and promote reproducibility for public usage, M3FD is paired with a flexible and user-friendly tool that enables efficient querying, task-specific sampling, and preprocessing. Together, our dataset and framework offer a unified, scalable solution that significantly lowers the barrier to applying LMMMs in data-scarce scientific domains.

Updated: 2025-08-06 06:12:13

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.04746v1

Edge-Assisted Collaborative Fine-Tuning for Multi-User Personalized Artificial Intelligence Generated Content (AIGC)

Diffusion models (DMs) have emerged as powerful tools for high-quality content generation, yet their intensive computational requirements for inference pose challenges for resource-constrained edge devices. Cloud-based solutions aid in computation but often fall short in addressing privacy risks, personalization efficiency, and communication costs in multi-user edge-AIGC scenarios. To bridge this gap, we first analyze existing edge-AIGC applications in personalized content synthesis, revealing their limitations in efficiency and scalability. We then propose a novel cluster-aware hierarchical federated aggregation framework. Based on parameter-efficient local fine-tuning via Low-Rank Adaptation (LoRA), the framework first clusters clients based on the similarity of their uploaded task requirements, followed by an intra-cluster aggregation for enhanced personalization at the server-side. Subsequently, an inter-cluster knowledge interaction paradigm is implemented to enable hybrid-style content generation across diverse clusters. Building upon federated learning (FL) collaboration, our framework simultaneously trains personalized models for individual users at the devices and a shared global model enhanced with multiple LoRA adapters on the server, enabling efficient edge inference; meanwhile, all prompts for clustering and inference are encoded prior to transmission, thereby further mitigating the risk of plaintext leakage. Our evaluations demonstrate that the framework achieves accelerated convergence while maintaining practical viability for scalable multi-user personalized AIGC services under edge constraints.
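
The LoRA update at the heart of such frameworks keeps the base weight W frozen and trains only a low-rank pair (A, B), so clients exchange just the small factors. A minimal forward pass (shapes and the alpha scale are illustrative, not the paper's configuration):

```python
import numpy as np

def lora_forward(x, w, a, b, alpha=1.0):
    """y = x @ (W + alpha * B A)^T without materializing the merged
    matrix. x: (n, d_in), w: (d_out, d_in), a: (r, d_in), b: (d_out, r),
    with rank r << min(d_in, d_out); only a and b are trained/shared.
    """
    return x @ w.T + alpha * (x @ a.T) @ b.T
```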

Updated: 2025-08-06 06:07:24

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.04745v1

Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.

Updated: 2025-08-06 06:06:52

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04107v1

Towards Transparent AI Grading: Semantic Entropy as a Signal for Human-AI Disagreement

Automated grading systems can efficiently score short-answer responses, yet they often fail to indicate when a grading decision is uncertain or potentially contentious. We introduce semantic entropy, a measure of variability across multiple GPT-4-generated explanations for the same student response, as a proxy for human grader disagreement. By clustering rationales via entailment-based similarity and computing entropy over these clusters, we quantify the diversity of justifications without relying on final output scores. We address three research questions: (1) Does semantic entropy align with human grader disagreement? (2) Does it generalize across academic subjects? (3) Is it sensitive to structural task features such as source dependency? Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. Our findings position semantic entropy as an interpretable uncertainty signal that supports more transparent and trustworthy AI-assisted grading workflows.
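
Once rationales are clustered by mutual entailment, the uncertainty score itself is plain Shannon entropy over cluster sizes. A sketch, with entailment-based clustering assumed done upstream:

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Entropy over entailment clusters of model-generated rationales.
    cluster_ids[i] is the cluster of the i-th explanation; 0 means all
    rationales agree, higher values flag likely grader disagreement.
    """
    n = len(cluster_ids)
    return -sum(
        (c / n) * math.log(c / n) for c in Counter(cluster_ids).values()
    )
```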

Updated: 2025-08-06 06:02:14

Subjects: cs.AI

Download: http://arxiv.org/abs/2508.04105v1

From Cluster Assumption to Graph Convolution: Graph-based Semi-Supervised Learning Revisited

Graph-based semi-supervised learning (GSSL) has long been a hot research topic. Traditional methods are generally shallow learners, based on the cluster assumption. Recently, graph convolutional networks (GCNs) have become the predominant techniques for their promising performance. In this paper, we theoretically discuss the relationship between these two types of methods in a unified optimization framework. One of the most intriguing findings is that, unlike traditional ones, typical GCNs may not jointly consider the graph structure and label information at each layer. Motivated by this, we further propose three simple but powerful graph convolution methods. The first is a supervised method OGC which guides the graph convolution process with labels. The others are two unsupervised methods: GGC and its multi-scale version GGCM, both aiming to preserve the graph structure information during the convolution process. Finally, we conduct extensive experiments to show the effectiveness of our methods. Code is available at https://github.com/zhengwang100/ogc_ggcm.
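
The bridge between the two families is a shared smoothing operator: both classical cluster-assumption methods and GCN layers repeatedly multiply features (or labels) by a normalized adjacency. A sketch of that common propagation step, not of OGC/GGC themselves:

```python
import numpy as np

def smooth(adj, x, k=2):
    """k applications of X <- D^{-1/2} (A + I) D^{-1/2} X, the
    propagation step shared by GCN layers and label propagation.
    adj: (n, n) symmetric 0/1 adjacency; x: (n, d) features or labels.
    """
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    s = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for _ in range(k):
        x = s @ x
    return x
```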

Updated: 2025-08-06 06:01:32

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2309.13599v3

SenseCrypt: Sensitivity-guided Selective Homomorphic Encryption for Joint Federated Learning in Cross-Device Scenarios

Homomorphic Encryption (HE) prevails in securing Federated Learning (FL), but suffers from high overhead and adaptation cost. Selective HE methods, which partially encrypt model parameters by a global mask, are expected to protect privacy with reduced overhead and easy adaptation. However, in cross-device scenarios with heterogeneous data and system capabilities, traditional Selective HE methods exacerbate client straggling and lose much of their HE overhead reduction. Accordingly, we propose SenseCrypt, a Sensitivity-guided selective Homomorphic EnCryption framework, to adaptively balance security and HE overhead per cross-device FL client. Given the observation that model parameter sensitivity is effective for measuring clients' data distribution similarity, we first design a privacy-preserving method to respectively cluster the clients with similar data distributions. Then, we develop a scoring mechanism to deduce the straggler-free ratio of model parameters that can be encrypted by each client per cluster. Finally, for each client, we formulate and solve a multi-objective model parameter selection optimization problem, which minimizes HE overhead while maximizing model security without causing straggling. Experiments demonstrate that SenseCrypt ensures security against the state-of-the-art inversion attacks, while achieving normal model accuracy as on IID data, and reducing training time by 58.4%-88.7% as compared to traditional HE methods.
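
The core idea — encrypt only as many parameters as a client can afford, chosen by sensitivity — reduces at its simplest to a masked top-k selection. A hypothetical sketch; the paper's per-cluster scoring and optimization are considerably more involved:

```python
import numpy as np

def selective_encrypt_mask(sensitivities, encrypt_ratio):
    """Return a boolean mask marking the top `encrypt_ratio` fraction
    of parameters (by sensitivity) for homomorphic encryption; the
    rest travel in plaintext, trading security against HE overhead.
    """
    sens = np.asarray(sensitivities)
    k = max(1, int(len(sens) * encrypt_ratio))
    mask = np.zeros(len(sens), dtype=bool)
    mask[np.argsort(sens)[::-1][:k]] = True
    return mask
```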

Updated: 2025-08-06 05:42:41

Subjects: cs.CR,cs.AI,cs.DC

Download: http://arxiv.org/abs/2508.04100v1

GridSE: Towards Practical Secure Geographic Search via Prefix Symmetric Searchable Encryption (Full Version)

The proliferation of location-based services and applications has brought significant attention to data and location privacy. While general secure computation and privacy-enhancing techniques can partially address this problem, one outstanding challenge is to provide near latency-free search and compatibility with mainstream geographic search techniques, especially the Discrete Global Grid Systems (DGGS). This paper proposes a new construction, namely GridSE, for efficient and DGGS-compatible Secure Geographic Search (SGS) with both backward and forward privacy. We first formulate the notion of a semantic-secure primitive called \textit{symmetric prefix predicate encryption} (SP$^2$E), for predicting whether or not a keyword contains a given prefix, and provide a construction. Then we extend SP$^2$E for dynamic \textit{prefix symmetric searchable encryption} (pSSE), namely GridSE, which supports both backward and forward privacy. GridSE only uses lightweight primitives including cryptographic hash and XOR operations and is extremely efficient. Furthermore, we provide a generic pSSE framework that enables prefix search for traditional dynamic SSE that supports only full keyword search. Experimental results over real-world geographic databases of sizes (by the number of entries) from $10^3$ to $10^7$ and mainstream DGGS techniques show that GridSE achieves a speedup of $150\times$ - $5000\times$ on search latency and a saving of $99\%$ on communication overhead as compared to the state-of-the-art. Interestingly, even compared to plaintext search, GridSE introduces only $1.4\times$ extra computational cost and $0.9\times$ additional communication cost. Source code of our scheme is available at https://github.com/rykieguo1771/GridSE-RAM.
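
The prefix-predicate trick can be caricatured with nothing but keyed hashes: index a token for every prefix of a DGGS cell identifier, and the server answers "does this keyword contain prefix p?" by token equality. This is a toy reduction only; the actual SP$^2$E construction additionally hides which prefix matched and provides the forward/backward privacy guarantees:

```python
import hashlib
import hmac

def index_tokens(key, cell_id):
    """One keyed token per prefix of a DGGS cell identifier."""
    return {
        hmac.new(key, cell_id[:i].encode(), hashlib.sha256).hexdigest()
        for i in range(1, len(cell_id) + 1)
    }

def prefix_match(key, prefix, tokens):
    """Server-side test: does the indexed keyword contain `prefix`?"""
    query = hmac.new(key, prefix.encode(), hashlib.sha256).hexdigest()
    return query in tokens
```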

Updated: 2025-08-06 05:42:25

Subjects: cs.CR

Download: http://arxiv.org/abs/2408.07916v2

Random Erasing vs. Model Inversion: A Promising Defense or a False Hope?

Model Inversion (MI) attacks pose a significant privacy threat by reconstructing private training data from machine learning models. While existing defenses primarily concentrate on model-centric approaches, the impact of data on MI robustness remains largely unexplored. In this work, we explore Random Erasing (RE), a technique traditionally used for improving model generalization under occlusion, and uncover its surprising effectiveness as a defense against MI attacks. Specifically, our novel feature space analysis shows that models trained with RE-images introduce a significant discrepancy between the features of MI-reconstructed images and those of the private data. At the same time, features of private images remain distinct from other classes and well-separated from different classification regions. These effects collectively degrade MI reconstruction quality and attack accuracy while maintaining reasonable natural accuracy. Furthermore, we explore two critical properties of RE including Partial Erasure and Random Location. Partial Erasure prevents the model from observing entire objects during training. We find this has a significant impact on MI, which aims to reconstruct the entire objects. Random Location of erasure plays a crucial role in achieving a strong privacy-utility trade-off. Our findings highlight RE as a simple yet effective defense mechanism that can be easily integrated with existing privacy-preserving techniques. Extensive experiments across 37 setups demonstrate that our method achieves state-of-the-art (SOTA) performance in the privacy-utility trade-off. The results consistently demonstrate the superiority of our defense over existing methods across different MI attacks, network architectures, and attack configurations. For the first time, we achieve a significant degradation in attack accuracy without a decrease in utility for some configurations.
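
The augmentation itself is tiny. A simplified sketch showing the two properties the paper isolates — Partial Erasure (a patch, never the whole image) and Random Location — with patch size and constant fill chosen for illustration rather than taken from the standard RE recipe:

```python
import numpy as np

def random_erase(img, erase_frac=0.25, fill=127, rng=None):
    """Overwrite a randomly placed square patch covering roughly
    `erase_frac` of the image with a constant, so the model never
    observes the entire object during training.
    """
    rng = np.random.default_rng(rng)
    h, w = img.shape[:2]
    side = max(1, int((erase_frac * h * w) ** 0.5))
    top = int(rng.integers(0, h - side + 1))
    left = int(rng.integers(0, w - side + 1))
    out = img.copy()
    out[top:top + side, left:left + side] = fill
    return out
```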

Updated: 2025-08-06 05:41:14

Subjects: cs.LG,cs.CR,cs.CV

Download: http://arxiv.org/abs/2409.01062v3

Slow is Fast! Dissecting Ethereum's Slow Liquidity Drain Scams

We identify the slow liquidity drain (SLID) scam, an insidious and highly profitable threat to decentralized finance (DeFi), posing a large-scale, persistent, and growing risk to the ecosystem. Unlike traditional scams such as rug pulls or honeypots (USENIX Sec'19, USENIX Sec'23), SLID gradually siphons funds from liquidity pools over extended periods, making detection significantly more challenging. In this paper, we conducted the first large-scale empirical analysis of 319,166 liquidity pools across six major decentralized exchanges (DEXs) since 2018. We identified 3,117 SLID affected liquidity pools, resulting in cumulative losses of more than US$103 million. We propose a rule-based heuristic and an enhanced machine learning model for early detection. Our machine learning model achieves a detection speed 4.77 times faster than the heuristic while maintaining 95% accuracy. Our study establishes a foundation for protecting DeFi investors at an early stage and promoting transparency in the DeFi ecosystem.
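
A rule-based heuristic of the kind the paper describes can be caricatured in a few lines: flag pools that lose most of their liquidity gradually, with no single crash step that a rug-pull detector would already catch. The thresholds below are illustrative, not the paper's calibrated values:

```python
def is_slow_drain(liquidity, min_len=30, total_drop=0.5, max_step_drop=0.05):
    """liquidity: per-period pool liquidity values, oldest first.

    Flags a gradual drain: a large overall decline spread over at
    least `min_len` periods, with every single-step drop small.
    """
    if len(liquidity) < min_len or liquidity[0] <= 0:
        return False
    overall = (liquidity[0] - liquidity[-1]) / liquidity[0]
    steps = [(a - b) / a for a, b in zip(liquidity, liquidity[1:]) if a > 0]
    return overall >= total_drop and max(steps, default=0.0) <= max_step_drop
```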

Updated: 2025-08-06 05:40:11

Domains: cs.CR,cs.LG

Download: http://arxiv.org/abs/2503.04850v3

DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) represents a significant advancement in the field of efficient and high-fidelity novel view synthesis. Despite recent progress, achieving accurate geometric reconstruction under sparse-view conditions remains a fundamental challenge. Existing methods often rely on non-local depth regularization, which fails to capture fine-grained structures and is highly sensitive to depth estimation noise. Furthermore, traditional smoothing methods neglect semantic boundaries and indiscriminately degrade essential edges and textures, consequently limiting the overall quality of reconstruction. In this work, we propose DET-GS, a unified depth and edge-aware regularization framework for 3D Gaussian Splatting. DET-GS introduces a hierarchical geometric depth supervision framework that adaptively enforces multi-level geometric consistency, significantly enhancing structural fidelity and robustness against depth estimation noise. To preserve scene boundaries, we design an edge-aware depth regularization guided by semantic masks derived from Canny edge detection. Furthermore, we introduce an RGB-guided edge-preserving Total Variation loss that selectively smooths homogeneous regions while rigorously retaining high-frequency details and textures. Extensive experiments demonstrate that DET-GS achieves substantial improvements in both geometric accuracy and visual fidelity, outperforming state-of-the-art (SOTA) methods on sparse-view novel view synthesis benchmarks.

Updated: 2025-08-06 05:37:26

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.04099v1

Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting

In practical scenarios, time series forecasting necessitates not only accuracy but also efficiency. Consequently, the exploration of model architectures remains a perennially trending topic in research. To address these challenges, we propose a novel backbone architecture named Time Evidence Fusion Network (TEFN) from the perspective of information fusion. Specifically, we introduce the Basic Probability Assignment (BPA) Module based on evidence theory to capture the uncertainty of multivariate time series data from both channel and time dimensions. Additionally, we develop a novel multi-source information fusion method to effectively integrate the two distinct dimensions from the BPA output, leading to improved forecasting accuracy. Lastly, we conduct extensive experiments to demonstrate that TEFN achieves performance comparable to state-of-the-art methods while maintaining significantly lower complexity and reduced training time. Our experiments also show that TEFN exhibits high robustness, with minimal error fluctuations during hyperparameter selection. Furthermore, because BPA is derived from fuzzy theory, TEFN offers a high degree of interpretability. Therefore, the proposed TEFN balances accuracy, efficiency, stability, and interpretability, making it a desirable solution for time series forecasting.
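For readers unfamiliar with basic probability assignments, the classical Dempster-Shafer rule for combining two BPAs is sketched below. This is textbook evidence theory, shown only to illustrate what a BPA is and how mass functions fuse; TEFN's learned BPA module and its fusion method are the paper's own constructions:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two basic probability assignments (BPAs) with Dempster's rule.

    Each BPA maps a frozenset of hypotheses (a 'focal element') to a mass
    in [0, 1], with masses summing to 1. Mass landing on the empty
    intersection is 'conflict' and is normalized away.
    """
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}
```

For example, fusing `{frozenset({'up'}): 0.6, frozenset({'up','down'}): 0.4}` with a second source that partly disagrees redistributes the conflicting mass and sharpens belief in the shared hypothesis.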

Updated: 2025-08-06 05:33:07

Domains: cs.LG,cs.AI,cs.NE

Download: http://arxiv.org/abs/2405.06419v4

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Security alignment enables the Large Language Model (LLM) to gain protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate-layer embeddings, as well as the essence of jailbreak attacks, which aim to embed harmful queries and shift them into the safe region. We utilize a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.

Updated: 2025-08-06 05:32:54

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2507.06043v2

Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics

Since the emergence of Large Language Models (LLMs), popularized by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation using LLMs has become a popular field of research, code evaluation using LLMs remains under-explored. In this paper, we focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose novel multi-agentic approaches using question-specific rubrics tailored to the problem statement, arguing that these perform better for logical assessment than existing approaches that use question-agnostic rubrics. To address the lack of suitable evaluation datasets, we introduce two datasets: a Data Structures and Algorithms dataset containing 150 student submissions from a popular Data Structures and Algorithms practice website, and an Object Oriented Programming dataset comprising 80 student submissions from undergraduate computer science courses. In addition to standard metrics (Spearman Correlation, Cohen's Kappa), we propose a new metric called Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings, providing better feedback aligned with instructional goals beyond mere syntactic correctness.
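The abstract does not give Leniency's formula. One plausible formalization of "evaluation strictness relative to expert assessment" is the mean signed, normalized gap between LLM and expert scores, positive when the LLM grades more generously. This is an assumption for illustration, not the paper's actual definition:

```python
def leniency(llm_scores, expert_scores, max_score):
    """A plausible (assumed, not the paper's) Leniency metric.

    Mean signed gap between the LLM's score and the expert's, normalized
    by the maximum attainable score so the result lies in [-1, 1].
    Positive values: the LLM is more lenient than the expert;
    negative values: stricter.
    """
    if len(llm_scores) != len(expert_scores) or not llm_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    gaps = [(l - e) / max_score for l, e in zip(llm_scores, expert_scores)]
    return sum(gaps) / len(gaps)
```

Unlike Spearman correlation or Cohen's Kappa, a signed measure like this distinguishes an evaluator that is consistently generous from one that is consistently harsh, even when both rank submissions identically.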

Updated: 2025-08-06 05:32:53

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2503.23989v3

Hybrid Quantum--Classical Machine Learning Potential with Variational Quantum Circuits

Quantum algorithms for simulating large and complex molecular systems are still in their infancy, and surpassing state-of-the-art classical techniques remains an ever-receding goalpost. A promising avenue of inquiry in the meanwhile is to seek practical advantages through hybrid quantum-classical (HQC) algorithms, which combine conventional neural networks with variational quantum circuits (VQCs) running on today's noisy intermediate-scale quantum (NISQ) hardware. Such hybrids are well suited to NISQ hardware: the classical processor performs the bulk of the computation, while the quantum processor executes targeted sub-tasks that supply additional non-linearity and expressivity. Here, we benchmark a purely classical E(3)-equivariant message-passing machine learning potential (MLP) against a hybrid quantum-classical MLP (HQC-MLP) for predicting density functional theory (DFT) properties of liquid silicon. In our hybrid architecture, every readout in the message-passing layers is replaced by a VQC. Molecular dynamics simulations driven by the HQC-MLP reveal that an accurate reproduction of high-temperature structural and thermodynamic properties is achieved with VQCs. These findings demonstrate a concrete scenario in which a NISQ-compatible HQC algorithm could deliver a measurable benefit over the best available classical alternative, suggesting a viable pathway toward near-term quantum advantage in materials modeling.

Updated: 2025-08-06 05:30:25

Domains: quant-ph,cond-mat.mtrl-sci,cs.LG,physics.chem-ph

Download: http://arxiv.org/abs/2508.04098v1

Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?

Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior works have focused on conventional unimodal DNNs, the vulnerability of vision-language models (VLMs) remains underexplored. In this paper, we conduct the first study to understand VLMs' vulnerability to leaking private visual training data. To tailor our attacks to VLMs' token-based generative nature, we propose a suite of novel token-based and sequence-based model inversion strategies. Particularly, we propose Token-based Model Inversion (TMI), Convergent Token-based Model Inversion (TMI-C), Sequence-based Model Inversion (SMI), and Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW). Through extensive experiments and a user study on three state-of-the-art VLMs and multiple datasets, we demonstrate, for the first time, that VLMs are susceptible to training data leakage. The experiments show that our proposed sequence-based methods, particularly SMI-AW combined with a logit-maximization loss based on vocabulary representation, can achieve competitive reconstruction and outperform token-based methods in attack accuracy and visual similarity. Importantly, human evaluation of the reconstructed images yields an attack accuracy of 75.31%, underscoring the severity of model inversion threats in VLMs. Notably, we also demonstrate inversion attacks on publicly released VLMs. Our study reveals the privacy vulnerability of VLMs as they become increasingly popular across many applications such as healthcare and finance.

Updated: 2025-08-06 05:30:05

Domains: cs.LG

Download: http://arxiv.org/abs/2508.04097v1

Isolate Trigger: Detecting and Eradicating Evade-Adaptive Backdoors

All current detection methods for backdoor attacks on deep learning models fall under the category of non-essential features (NEF), which focus on fighting simple and efficient vertical-class backdoors, where the trigger is small, sparse, and does not overlap with the source. Evade-adaptive backdoor (EAB) attacks have evaded NEF detection and improved training efficiency. We introduce a precise, efficient, and universal detection and defense framework coined Isolate Trigger (IsTr). IsTr aims to find the hidden trigger by breaking the barrier of the source features. To this end, it investigates the essence of backdoor triggering and uses Steps and Differential-Middle-Slice as components to update past theories of distance and gradient. IsTr also benefits the model whether or not a backdoor exists; for example, it can accurately find and repair misidentifications caused by deliberate or unintentional training in autonomous driving. Extensive experiments on robustness across various tasks, including MNIST, facial recognition, and traffic sign recognition, confirm the high efficiency, generality, and precision of IsTr. We rigorously evaluated the effectiveness of IsTr against a series of six EAB attacks, including Badnets, Sin-Wave, Multi-trigger, SSBAs, CASSOCK, and HCB. None of these attacks evades detection, even when attacks are combined and the trigger overlaps with the source.

Updated: 2025-08-06 05:21:40

Domains: cs.CR

Download: http://arxiv.org/abs/2508.04094v1

Convolutional autoencoders for the reconstruction of three-dimensional interfacial multiphase flows

In this work, we perform a comprehensive investigation of autoencoders for reduced-order modeling of three-dimensional multiphase flows. Focusing on the accuracy of reconstructing multiphase flow volume/mass fractions with a standard convolutional architecture, we examine the advantages and disadvantages of different interface representation choices (diffuse, sharp, level set). We use a combination of synthetic data with non-trivial interface topologies and high-resolution simulation data of multiphase homogeneous isotropic turbulence for training and validation. This study clarifies the best practices for reducing the dimensionality of multiphase flows via autoencoders. Consequently, this paves the path for uncoupling the training of autoencoders for accurate reconstruction and the training of temporal or input/output models such as neural operators (e.g., FNOs, DeepONets) and neural ODEs on the lower-dimensional latent space given by the autoencoders. As such, the implications of this study are significant and of interest to the multiphase flow community and beyond.

Updated: 2025-08-06 05:01:13

Domains: cs.CE,cs.LG,physics.flu-dyn

Download: http://arxiv.org/abs/2508.04084v1

Confounder-Free Continual Learning via Recursive Feature Normalization

Confounders are extraneous variables that affect both the input and the target, resulting in spurious correlations and biased predictions. There are recent advances in dealing with or removing confounders in traditional models, such as metadata normalization (MDN), where the distribution of the learned features is adjusted based on the study confounders. However, in the context of continual learning, where a model learns continuously from new data over time without forgetting, learning feature representations that are invariant to confounders remains a significant challenge. To remove their influence from intermediate feature representations, we introduce the Recursive MDN (R-MDN) layer, which can be integrated into any deep learning architecture, including vision transformers, and at any model stage. R-MDN performs statistical regression via the recursive least squares algorithm to maintain and continually update an internal model state with respect to changing distributions of data and confounding variables. Our experiments demonstrate that R-MDN promotes equitable predictions across population groups, both within static learning and across different stages of continual learning, by reducing catastrophic forgetting caused by confounder effects changing over time.
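The core mechanism the abstract describes — regressing features on confounders with recursive least squares and keeping only the residual — can be sketched for the simplest case of one scalar feature and one scalar confounder. The real R-MDN layer is multivariate and sits inside a deep network; this shows only the underlying RLS recursion, with an assumed class name:

```python
class RecursiveResidualizer:
    """Minimal sketch of the idea behind R-MDN: strip a confounder's linear
    effect from a feature via recursive least squares (RLS), updating the
    fit as data streams in (no stored history, so it suits continual
    learning). One scalar feature, one scalar confounder.
    """

    def __init__(self, forgetting=1.0, p0=1e6):
        self.w = 0.0           # slope estimate: feature ≈ w * confounder
        self.p = p0            # inverse 'information'; large = uncertain
        self.lam = forgetting  # 1.0 keeps all history; <1 tracks drift

    def update(self, confounder, feature):
        """Return the confounder-free residual, then refine the fit."""
        residual = feature - self.w * confounder  # use pre-update estimate
        gain = self.p * confounder / (self.lam + confounder * self.p * confounder)
        self.w += gain * residual
        self.p = (self.p - gain * confounder * self.p) / self.lam
        return residual
```

Because the state (`w`, `p`) is updated recursively, the correction adapts as the confounder distribution shifts across continual-learning stages, which is the property the abstract emphasizes.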

Updated: 2025-08-06 04:55:54

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2507.09031v2

CityLight: A Neighborhood-inclusive Universal Model for Coordinated City-scale Traffic Signal Control

City-scale traffic signal control (TSC) involves thousands of heterogeneous intersections with varying topologies, making cooperative decision-making across intersections particularly challenging. Given the prohibitive computational cost of learning individual policies for each intersection, some researchers explore learning a universal policy to control each intersection in a decentralized manner, where the key challenge is to construct a universal representation method for heterogeneous intersections. However, existing methods are limited to universally representing information of heterogeneous ego intersections, neglecting the essential representation of influence from their heterogeneous neighbors. Universally incorporating neighborhood information is nontrivial due to the intrinsic complexity of traffic flow interactions, as well as the challenge of modeling collective influences from neighbor intersections. To address these challenges, we propose CityLight, which learns a universal policy based on representations obtained with two major modules: a Neighbor Influence Encoder to explicitly model each neighbor's influence with its specified traffic flow relation and connectivity to the ego intersection, and a Neighbor Influence Aggregator to attentively aggregate the influence of neighbors based on their mutual competitive relations. Extensive experiments on five city-scale datasets, ranging from 97 to 13,952 intersections, confirm the efficacy of CityLight, with an average throughput improvement of 11.68% and a 22.59% lift in generalization.

Updated: 2025-08-06 04:49:31

Domains: eess.SY,cs.AI,cs.LG,cs.MA,cs.SY

Download: http://arxiv.org/abs/2406.02126v4

GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement

Recent studies have extended the application of large language models (LLMs) to geographic problems, revealing surprising geospatial competence even without explicit spatial supervision. However, LLMs still face challenges in spatial consistency, multi-hop reasoning, and geographic bias. To address these issues, we propose GeoSR, a self-refining agentic reasoning framework that embeds core geographic principles -- most notably Tobler's First Law of Geography -- into an iterative prediction loop. In GeoSR, the reasoning process is decomposed into three collaborating agents: (1) a variable-selection agent that selects relevant covariates from the same location; (2) a point-selection agent that chooses reference predictions at nearby locations generated by the LLM in previous rounds; and (3) a refine agent that coordinates the iterative refinement process by evaluating prediction quality and triggering further rounds when necessary. This agentic loop progressively improves prediction quality by leveraging both spatial dependencies and inter-variable relationships. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction. Experimental results show consistent improvements over standard prompting strategies, demonstrating that incorporating geostatistical priors and spatially structured reasoning into LLMs leads to more accurate and equitable geospatial predictions. The code of GeoSR is available at https://github.com/JinfanTang/GeoSR.
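Tobler's First Law ("near things are more related than distant things") underlies the point-selection agent above. A conventional numeric way to operationalize that law is inverse-distance weighting of nearby reference predictions; the sketch below is illustrative only, since GeoSR's agents are LLM-driven rather than fixed formulas, and the function name is assumed:

```python
import math

def idw_reference(query, references, power=2.0):
    """Inverse-distance-weighted aggregate of nearby reference predictions.

    `query` is a (lat, lon) pair; `references` is a list of
    ((lat, lon), value) pairs, e.g. the LLM's predictions at nearby
    locations from previous rounds. Nearer points get more weight,
    per Tobler's First Law.
    """
    num = den = 0.0
    for (lat, lon), value in references:
        # Planar distance keeps the toy simple; real pipelines would use
        # great-circle distance on coordinates.
        d = math.hypot(lat - query[0], lon - query[1])
        if d == 0.0:
            return value  # exact location match dominates
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den
```

Iterating such a spatially weighted estimate against fresh model outputs mirrors, in miniature, the refine agent's loop of checking predictions against nearby references.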

Updated: 2025-08-06 04:45:34

Domains: cs.AI,stat.OT

Download: http://arxiv.org/abs/2508.04080v1

Beyond Wide-Angle Images: Structure-to-Detail Video Portrait Correction via Unsupervised Spatiotemporal Adaptation

Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching, especially at the edge of the lens, which degrades visual appeal. To address this issue, we propose a structure-to-detail portrait correction model named ImagePC. It integrates the long-range awareness of the transformer and the multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePC for unlabeled wide-angle videos (termed VideoPC) by spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePC, VideoPC maintains high-quality facial corrections in space and mitigates potential temporal shaking in blind scenarios. Finally, to establish an evaluation benchmark and train the framework, we build a video portrait dataset with large diversity in the number of people, lighting conditions, and backgrounds. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The code and dataset will be made available.

Updated: 2025-08-06 04:43:38

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.00401v2

BlurryScope enables compact, cost-effective scanning microscopy for HER2 scoring using deep learning on blurry images

We developed a rapid scanning optical microscope, termed "BlurryScope", that leverages continuous image acquisition and deep learning to provide a cost-effective and compact solution for automated inspection and analysis of tissue sections. This device offers comparable speed to commercial digital pathology scanners, but at a significantly lower price point and smaller size/weight. Using BlurryScope, we implemented automated classification of human epidermal growth factor receptor 2 (HER2) scores on motion-blurred images of immunohistochemically (IHC) stained breast tissue sections, achieving concordant results with those obtained from a high-end digital scanning microscope. Using a test set of 284 unique patient cores, we achieved testing accuracies of 79.3% and 89.7% for 4-class (0, 1+, 2+, 3+) and 2-class (0/1+, 2+/3+) HER2 classification, respectively. BlurryScope automates the entire workflow, from image scanning to stitching and cropping, as well as HER2 score classification.

Updated: 2025-08-06 04:42:55

Domains: eess.IV,cs.CV,cs.LG,physics.med-ph

Download: http://arxiv.org/abs/2410.17557v2

RLGS: Reinforcement Learning-Based Adaptive Hyperparameter Tuning for Gaussian Splatting

Hyperparameter tuning in 3D Gaussian Splatting (3DGS) is a labor-intensive and expert-driven process, often resulting in inconsistent reconstructions and suboptimal results. We propose RLGS, a plug-and-play reinforcement learning framework for adaptive hyperparameter tuning in 3DGS through lightweight policy modules, dynamically adjusting critical hyperparameters such as learning rates and densification thresholds. The framework is model-agnostic and seamlessly integrates into existing 3DGS pipelines without architectural modifications. We demonstrate its generalization ability across multiple state-of-the-art 3DGS variants, including Taming-3DGS and 3DGS-MCMC, and validate its robustness across diverse datasets. RLGS consistently enhances rendering quality. For example, it improves Taming-3DGS by 0.7dB PSNR on the Tanks and Temples (TNT) dataset, under a fixed Gaussian budget, and continues to yield gains even when baseline performance saturates. Our results suggest that RLGS provides an effective and general solution for automating hyperparameter tuning in 3DGS training, bridging a gap in applying reinforcement learning to 3DGS.

Updated: 2025-08-06 04:37:39

Domains: cs.GR,cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04078v1

Alz-QNet: A Quantum Regression Network for Studying Alzheimer's Gene Interactions

Understanding the molecular-level mechanisms underpinning Alzheimer's disease (AD) by studying crucial genes associated with the disease remains a challenge. Alzheimer's, being a multifactorial disease, requires an understanding of the gene-gene interactions underlying it for theranostics and further progress. In this article, a novel attempt is made using quantum regression to decode how some crucial AD genes, such as Amyloid Beta Precursor Protein (APP), Sterol regulatory element binding transcription factor 14 (FGF14), Yin Yang 1 (YY1), and Phospholipase D Family Member 3 (PLD3), become influenced by other prominent switching genes during disease progression, which may help in gene expression-based therapy for AD. Our proposed Quantum Regression Network (Alz-QNet) introduces a pioneering approach with insights from the state-of-the-art Quantum Gene Regulatory Networks (QGRN) to unravel the gene interactions involved in AD pathology, particularly within the Entorhinal Cortex (EC), where early pathological changes occur. Using the proposed Alz-QNet framework, we explore the interactions between key genes (APP, FGF14, YY1, EGR1, GAS7, AKT3, SREBF2, and PLD3) within the EC microenvironment of AD patients, studying genetic samples from the database GSE138852, all of which are believed to play a crucial role in the progression of AD. Our investigation uncovers intricate gene-gene interactions, shedding light on the potential regulatory mechanisms that underlie the pathogenesis of AD, helping us to find potential gene inhibitors or regulators for theranostics.

Updated: 2025-08-06 04:31:49

Domains: q-bio.MN,cs.LG,q-bio.GN,quant-ph

Download: http://arxiv.org/abs/2508.04743v1

The Ubiquitous Sparse Matrix-Matrix Products

Multiplication of a sparse matrix with another (dense or sparse) matrix is a fundamental operation that captures the computational patterns of many data science applications, including but not limited to graph algorithms, sparsely connected neural networks, graph neural networks, clustering, and many-to-many comparisons of biological sequencing data. In many application scenarios, the matrix multiplication takes place on an arbitrary algebraic semiring where the scalar operations are overloaded with user-defined functions with certain properties, or on a more general heterogeneous algebra where even the domains of the input matrices can be different. Here, we provide a unifying treatment of the sparse matrix-matrix operation and its rich application space, including machine learning, computational biology and chemistry, graph algorithms, and scientific computing.
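The semiring-parameterized product described above can be made concrete in a few lines: represent each sparse matrix as a dict of rows and let the caller overload the scalar operations. A minimal sketch (dict-of-dicts is one of many storage choices; production SpGEMM implementations use CSR with hash or heap accumulators):

```python
def spgemm(A, B, add=lambda x, y: x + y, mul=lambda x, y: x * y):
    """Sparse matrix-matrix product over a user-supplied semiring.

    A and B are dict-of-dicts: A[i][k] holds the nonzero entry at row i,
    column k. Passing different `add`/`mul` functions gives the algebraic
    flexibility the text describes, e.g. (min, +) for shortest paths.
    """
    C = {}
    for i, row in A.items():
        acc = {}                       # accumulator for output row i
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                prod = mul(a_ik, b_kj)
                acc[j] = prod if j not in acc else add(acc[j], prod)
        if acc:
            C[i] = acc
    return C
```

With the defaults this is ordinary SpGEMM over (+, ×); calling `spgemm(A, B, add=min, mul=lambda x, y: x + y)` instead computes the tropical (min, +) product used to relax shortest-path distances, without touching the kernel.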

Updated: 2025-08-06 04:26:52

Areas: math.NA,cs.DC,cs.LG,cs.MS,cs.NA,math.CO

Download: http://arxiv.org/abs/2508.04077v1

Discovery of Disease Relationships via Transcriptomic Signature Analysis Powered by Agentic AI

Modern disease classification often overlooks molecular commonalities hidden beneath divergent clinical presentations. This study introduces a transcriptomics-driven framework for discovering disease relationships by analyzing over 1,300 disease-condition pairs using GenoMAS, a fully automated agentic AI system. Beyond identifying robust gene-level overlaps, we develop a novel pathway-based similarity framework that integrates multi-database enrichment analysis to quantify functional convergence across diseases. The resulting disease similarity network reveals both known comorbidities and previously undocumented cross-category links. By examining shared biological pathways, we explore potential molecular mechanisms underlying these connections, offering functional hypotheses that go beyond symptom-based taxonomies. We further show how background conditions such as obesity and hypertension modulate transcriptomic similarity, and identify therapeutic repurposing opportunities for rare diseases like autism spectrum disorder based on their molecular proximity to better-characterized conditions. In addition, this work demonstrates how biologically grounded agentic AI can scale transcriptomic analysis while enabling mechanistic interpretation across complex disease landscapes. All results are publicly accessible at github.com/KeeeeChen/Pathway_Similarity_Network.
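
The pathway-level similarity idea can be sketched as follows; the Jaccard measure, the threshold, and the pathway names below are illustrative assumptions, not the paper's exact formulation:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of enriched pathways."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_network(disease_pathways, threshold=0.3):
    """Edges between diseases whose enriched-pathway sets overlap enough."""
    names = sorted(disease_pathways)
    edges = []
    for i, d1 in enumerate(names):
        for d2 in names[i + 1:]:
            s = jaccard(disease_pathways[d1], disease_pathways[d2])
            if s >= threshold:
                edges.append((d1, d2, round(s, 3)))
    return edges

# Toy enrichment results with made-up pathway labels.
profiles = {
    "obesity":      {"insulin_signaling", "lipid_metabolism", "inflammation"},
    "hypertension": {"renin_angiotensin", "inflammation"},
    "t2_diabetes":  {"insulin_signaling", "lipid_metabolism", "glycolysis"},
}
net = similarity_network(profiles)
```

Thresholding pairwise pathway overlap is one simple way such a network can surface cross-category links that gene-level overlap alone would miss.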

Updated: 2025-08-06 04:25:40

Areas: q-bio.GN,cs.LG

Download: http://arxiv.org/abs/2508.04742v1

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated

As AI advances in text generation, human trust in AI-generated content remains constrained by biases that go beyond concerns of accuracy. This study explores how bias shapes the perception of AI- versus human-generated content. Through three experiments involving text rephrasing, news article summarization, and persuasive writing, we investigated how human raters respond to labeled and unlabeled content. While the raters could not differentiate the two types of texts in a blind test, they overwhelmingly favored content labeled "Human Generated" over content labeled "AI Generated," by a preference margin of over 30%. We observed the same pattern even when the labels were deliberately swapped. This human bias against AI has broader societal and cognitive implications, as it undervalues AI performance. This study highlights the limitations of human judgment in interacting with AI and offers a foundation for improving human-AI collaboration, especially in creative fields.

Updated: 2025-08-06 04:16:54

Areas: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2410.03723v2

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how 25 models - from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) - represent lexical identity and inflectional morphology across six typologically diverse languages. Using linear and nonlinear classifiers trained on hidden activations, we predict word lemmas and inflectional features layer by layer. We find that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout. Additional experiments probe the nature of these encodings: attention and residual analyses examine where within layers information can be recovered, steering vector experiments test what information can be functionally manipulated, and intrinsic dimensionality analyses explore how the representational structure evolves across layers. Remarkably, these encoding patterns emerge across all models we test, despite differences in architecture, size, and training regime (pretrained and instruction-tuned variants). This suggests that, even with substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties are important for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing
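
A layer-wise probe of the kind described above can be sketched in miniature; the toy "activations", the probe's hyperparameters, and the binary plural/singular feature are our illustrative assumptions, not the paper's setup:

```python
import math

def train_linear_probe(X, y, lr=0.5, epochs=300):
    """Logistic-regression probe: predict a binary feature from activations."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, X, y):
    hits = sum((sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(X)

# Toy "hidden activations": dimension 0 linearly encodes plural vs. singular.
X = [[0.9, 0.1], [0.8, -0.2], [-0.7, 0.3], [-1.0, 0.0]]
y = [1, 1, 0, 0]
w, b = train_linear_probe(X, y)
acc = probe_accuracy(w, b, X, y)
```

Running such a probe per layer, and comparing linear against nonlinear variants, is how one distinguishes linearly accessible information from information that is only nonlinearly recoverable.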

Updated: 2025-08-06 04:14:16

Areas: cs.CL,cs.LG

Download: http://arxiv.org/abs/2506.02132v3

Efficient Strategy for Improving Large Language Model (LLM) Capabilities

Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. Starting from a base model, this work explores and combines data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master's thesis in Systems and Computer Engineering titled "Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)".

Updated: 2025-08-06 04:08:26

Areas: cs.CL,cs.LG,I.2.7; I.2.6; I.5.1

Download: http://arxiv.org/abs/2508.04073v1

KG-Augmented Executable CoT for Mathematical Coding

In recent years, large language models (LLMs) have excelled in natural language processing tasks but face significant challenges in complex reasoning tasks such as mathematical reasoning and code generation. To address these limitations, we propose KG-Augmented Executable Chain-of-Thought (KGA-ECoT), a novel framework that enhances code generation through knowledge graphs and improves mathematical reasoning via executable code. KGA-ECoT decomposes problems into a Structured Task Graph, leverages efficient GraphRAG for precise knowledge retrieval from mathematical libraries, and generates verifiable code to ensure computational accuracy. Evaluations on multiple mathematical reasoning benchmarks demonstrate that KGA-ECoT significantly outperforms existing prompting methods, achieving absolute accuracy improvements ranging from several to over ten percentage points. Further analysis confirms the critical roles of GraphRAG in enhancing code quality and external code execution in ensuring precision. These findings collectively establish KGA-ECoT as a robust and highly generalizable framework for complex mathematical reasoning tasks.

Updated: 2025-08-06 04:07:35

Areas: cs.AI

Download: http://arxiv.org/abs/2508.04072v1

Adversarial Fair Multi-View Clustering

Cluster analysis is a fundamental problem in data mining and machine learning. In recent years, multi-view clustering has attracted increasing attention due to its ability to integrate complementary information from multiple views. However, existing methods primarily focus on clustering performance, while fairness, a critical concern in human-centered applications, has been largely overlooked. Although recent studies have explored group fairness in multi-view clustering, most methods impose explicit regularization on cluster assignments, relying on the alignment between sensitive attributes and the underlying cluster structure. However, this assumption often fails in practice and can degrade clustering performance. In this paper, we propose an adversarial fair multi-view clustering (AFMVC) framework that integrates fairness learning into the representation learning process. Specifically, our method employs adversarial training to fundamentally remove sensitive attribute information from learned features, ensuring that the resulting cluster assignments are unaffected by it. Furthermore, we theoretically prove that aligning view-specific clustering assignments with a fairness-invariant consensus distribution via KL divergence preserves clustering consistency without significantly compromising fairness, thereby providing additional theoretical guarantees for our framework. Extensive experiments on data sets with fairness constraints demonstrate that AFMVC achieves superior fairness and competitive clustering performance compared to existing multi-view clustering and fairness-aware clustering methods.
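
The KL-based consensus alignment mentioned in the theoretical result can be sketched as follows; the soft assignments and the simple averaging consensus below are illustrative assumptions:

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between discrete cluster-assignment distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consensus(view_assignments):
    """Average per-view soft assignments into a consensus distribution."""
    n, k = len(view_assignments), len(view_assignments[0])
    return [sum(v[j] for v in view_assignments) / n for j in range(k)]

def alignment_loss(view_assignments):
    """Sum of KL terms pulling each view toward the shared consensus."""
    c = consensus(view_assignments)
    return sum(kl(v, c) for v in view_assignments)

agreeing = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]       # views nearly agree
conflicting = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]    # views disagree
low = alignment_loss(agreeing)
high = alignment_loss(conflicting)
```

Minimizing such a loss keeps the per-view assignments consistent with one consensus; in AFMVC that consensus is additionally made fairness-invariant by the adversarial branch.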

Updated: 2025-08-06 04:07:08

Areas: cs.LG

Download: http://arxiv.org/abs/2508.04071v1

Personalized Knowledge Transfer Through Generative AI: Contextualizing Learning to Individual Career Goals

As artificial intelligence becomes increasingly integrated into digital learning environments, the personalization of learning content to reflect learners' individual career goals offers promising potential to enhance engagement and long-term motivation. In our study, we investigate how career goal-based content adaptation in learning systems based on generative AI (GenAI) influences learner engagement, satisfaction, and study efficiency. The mixed-methods experiment involved more than 4,000 learners, with one group receiving learning scenarios tailored to their career goals and the other serving as a control group. Quantitative results show increased session duration, higher satisfaction ratings, and a modest reduction in study duration compared to standard content. Qualitative analysis highlights that learners found the personalized material motivating and practical, enabling deep cognitive engagement and strong identification with the content. These findings underscore the value of aligning educational content with learners' career goals and suggest that scalable AI personalization can bridge academic knowledge and workplace applicability.

Updated: 2025-08-06 04:03:56

Areas: cs.AI,cs.CY

Download: http://arxiv.org/abs/2508.04070v1

DRIVE: Dynamic Rule Inference and Verified Evaluation for Constraint-Aware Autonomous Driving

Understanding and adhering to soft constraints is essential for safe and socially compliant autonomous driving. However, such constraints are often implicit, context-dependent, and difficult to specify explicitly. In this work, we present DRIVE, a novel framework for Dynamic Rule Inference and Verified Evaluation that models and evaluates human-like driving constraints from expert demonstrations. DRIVE leverages exponential-family likelihood modeling to estimate the feasibility of state transitions, constructing a probabilistic representation of soft behavioral rules that vary across driving contexts. These learned rule distributions are then embedded into a convex optimization-based planning module, enabling the generation of trajectories that are not only dynamically feasible but also compliant with inferred human preferences. Unlike prior approaches that rely on fixed constraint forms or purely reward-based modeling, DRIVE offers a unified framework that tightly couples rule inference with trajectory-level decision-making. It supports both data-driven constraint generalization and principled feasibility verification. We validate DRIVE on large-scale naturalistic driving datasets, including inD, highD, and RoundD, and benchmark it against representative inverse constraint learning and planning baselines. Experimental results show that DRIVE achieves 0.0% soft constraint violation rates, smoother trajectories, and stronger generalization across diverse driving scenarios. Verified evaluations further demonstrate the efficiency, explainability, and robustness of the framework for real-world deployment.
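
A minimal sketch of the exponential-family feasibility idea, using a univariate Gaussian over observed state changes; the feature choice, mean, and variance are illustrative, not fit to the paper's datasets:

```python
import math

def gaussian_log_likelihood(delta, mu, sigma):
    """Exponential-family (Gaussian) log-likelihood of a state transition.

    delta is an observed state change (e.g., longitudinal acceleration);
    mu and sigma would be fit from expert demonstrations. A low value
    flags a transition that violates the learned soft rule.
    """
    return (-0.5 * ((delta - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))

mu, sigma = 0.0, 1.0  # illustrative: expert accelerations cluster near zero
typical = gaussian_log_likelihood(0.2, mu, sigma)
harsh = gaussian_log_likelihood(4.0, mu, sigma)  # e.g., abrupt braking
```

In the full framework, such likelihoods enter the convex planner as soft-constraint terms, so low-feasibility transitions are penalized rather than hard-forbidden.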

Updated: 2025-08-06 03:56:06

Areas: cs.RO,cs.AI

Download: http://arxiv.org/abs/2508.04066v1

FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning

Federated learning (FL) is vulnerable to backdoor attacks, yet most existing methods are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect. We propose FLAT (FL Arbitrary-Target Attack), a novel backdoor attack that leverages a latent-driven conditional autoencoder to generate diverse, target-specific triggers as needed. By introducing a latent code, FLAT enables the creation of visually adaptive and highly variable triggers, allowing attackers to select arbitrary targets without retraining and to evade conventional detection mechanisms. Our approach unifies attack success, stealth, and diversity within a single framework, introducing a new level of flexibility and sophistication to backdoor attacks in FL. Extensive experiments show that FLAT achieves high attack success and remains robust against advanced FL defenses. These results highlight the urgent need for new defense strategies to address latent-driven, multi-target backdoor threats in federated settings.

Updated: 2025-08-06 03:54:29

Areas: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2508.04064v1

Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading

Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and no- or few-shot prompting to achieve best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI's fine-tuning service promise results with as few as 100 examples, while methods using open weights such as quantized low-rank adaptation (QLoRA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that fine-tuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning methods can outperform few-shot baseline instruction-tuned LLMs for OpenAI's closed models. While our evaluation set is limited, we find some evidence that the observed benefits of fine-tuning may be impacted by the domain subject matter. Lastly, we observed dramatic improvement with the Llama 3.1 8B-Instruct open-weight model by seeding the initial training examples with a significant amount of cheaply generated synthetic training data.
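
Grading with structured (JSON) outputs implies a validation step on the consuming side; a minimal sketch, with a hypothetical two-field schema rather than the paper's actual format:

```python
import json

REQUIRED = {"score": (int, float), "feedback": str}

def parse_grading_output(raw):
    """Validate an LLM's structured short-answer-grading response.

    Expects a hypothetical schema: a numeric "score" in [0, 1] and a
    "feedback" string. Returns the parsed dict, or None on any violation.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, typ in REQUIRED.items():
        if key not in obj or not isinstance(obj[key], typ):
            return None
    if not 0.0 <= obj["score"] <= 1.0:
        return None
    return obj

ok = parse_grading_output('{"score": 0.75, "feedback": "Partially correct."}')
bad = parse_grading_output('{"score": "high"}')
```

How reliably a model emits schema-conformant JSON is itself one of the behaviors that fine-tuning versus few-shot prompting can change.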

Updated: 2025-08-06 03:52:55

Areas: cs.LG

Download: http://arxiv.org/abs/2508.04063v1

Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment

We propose Context-Adaptive Multi-Prompt Embedding, a novel approach to enrich semantic representations in vision-language contrastive learning. Unlike standard CLIP-style models that rely on a single text embedding, our method introduces multiple structured prompts, each containing a distinct adaptive token that captures diverse semantic aspects of the input text. We leverage a pretrained LLM as the text encoder within the CLIP framework, processing all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features. To further promote semantic diversity and representation quality, we incorporate a diversity regularization loss and a negation-aware loss, encouraging specialization across prompts and improving contrastive discrimination. Our method achieves consistent improvements on both image-text and video-text retrieval benchmarks.
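
Combining several prompt embeddings into one representation, plus a diversity term that discourages redundant prompts, can be sketched as follows; mean pooling and mean pairwise cosine similarity are simplifying assumptions, not the paper's exact losses:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def combine_prompts(prompt_embeddings):
    """Mean-pool per-prompt embeddings into one unified text representation."""
    n, d = len(prompt_embeddings), len(prompt_embeddings[0])
    return [sum(e[j] for e in prompt_embeddings) / n for j in range(d)]

def diversity_penalty(prompt_embeddings):
    """Mean pairwise cosine similarity; lower means more specialized prompts."""
    n = len(prompt_embeddings)
    sims = [cosine(prompt_embeddings[i], prompt_embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)

embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy prompt embeddings
text_repr = combine_prompts(embs)
penalty = diversity_penalty(embs)
```

Adding `penalty` to the training objective pushes the prompts apart, so each adaptive token captures a distinct semantic aspect rather than collapsing to one.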

Updated: 2025-08-06 03:51:06

Areas: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.02762v2

Exponentially Consistent Nonparametric Linkage-Based Clustering of Data Sequences

In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from unknown distributions. The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Thus, our results show that SLINK is exponentially consistent for a larger class of problems than previously known. In our simulations, we also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
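
The single-linkage merging step at the heart of SLINK can be sketched in miniature, using scalar stand-ins for the inter-sequence distribution distances (a naive O(n^3) illustration, not the paper's algorithm or analysis):

```python
def slink_clusters(points, dist, k):
    """Agglomerative single linkage: merge the closest pair until k remain.

    Inter-cluster distance is the minimum pairwise distance, which is
    what makes the d_I < d_H condition the relevant one for consistency.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Scalar stand-ins for data sequences; dist mimics a distribution distance.
seq_stats = [0.0, 0.1, 0.15, 5.0, 5.2]
clusters = slink_clusters(seq_stats, dist=lambda a, b: abs(a - b), k=2)
```

With empirical distribution distances in place of `abs(a - b)`, repeatedly merging the closest pair recovers the $K$ distribution clusters whenever the sub-cluster gaps stay below the inter-cluster gap.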

Updated: 2025-08-06 03:50:08

Areas: stat.ML,cs.IT,cs.LG,eess.SP,math.IT

Download: http://arxiv.org/abs/2411.13922v4

Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction: A Pruning Framework for LLMs

Non-uniform structured network pruning methods can effectively reduce Large Language Model (LLM) size by eliminating redundant channels or layers, offering lower performance degradation than uniform strategies. However, existing non-uniform methods rely heavily on manually designed pruning policies (e.g., layer importance and scaling factors), and therefore cannot efficiently adapt to scenarios with dynamic pruning ratio requirements. Additionally, a critical bottleneck, the time-consuming evaluation of pruning policies, further limits the feasibility of iteratively and dynamically finding optimal pruning policies. To address these limitations, we propose PPF (Predictive Pruning Framework), a novel pruning framework for LLMs that eliminates manual design dependencies via second-level performance prediction. PPF not only supports real-time pruning decisions under dynamic pruning ratios but is also applicable to static pruning scenarios. It employs an agent to produce adaptive and real-time pruning actions, while a lightweight performance predictor can evaluate a pruning policy in seconds, significantly speeding up the iterative optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can generate dynamic/static pruning policies and that it reduces perplexity by up to 33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods, outperforming manually designed pruning policies. The performance predictor achieves second-level performance prediction with high accuracy (prediction error < 0.0011). It reduces the mean evaluation latency from the minute level (1 minute 38.02 seconds for test-set evaluation) to the second level (1.52 seconds), a more than 64x speedup. Our code will be available at https://github.com/Ma-zx/PPF.

Updated: 2025-08-06 03:44:36

Areas: cs.LG

Download: http://arxiv.org/abs/2508.02381v2

Tool-integrated Reinforcement Learning for Repo Deep Search

Issue localization, the process of identifying code locations that need modification to resolve software issues, is a critical yet challenging task in software development. The semantic gap between natural language issue descriptions and faulty code requires complex multi-hop reasoning through code dependencies. Existing LLM-based agents attempt to address this by integrating repository retrieval tools. However, this transforms issue localization into a demanding task we call Repo Deep Search, which requires the LLM to effectively utilize various repository retrieval tools throughout a multi-step reasoning and navigation process. To tackle this challenge, we present ToolTrain, a two-stage tool-integrated training framework combining rejection-sampled supervised fine-tuning and tool-integrated reinforcement learning to enhance LLMs' ability to use retrieval tools for issue localization. Experimental results show that ToolTrain-trained models achieve state-of-the-art performance, with our 32B model even surpassing Claude-3.7 on function-level localization. The results also show that improved localization performance translates to better end-to-end issue resolution performance. This further demonstrates that training for issue localization is a viable and effective strategy for improving automated software development.
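
The first ToolTrain stage, rejection-sampled supervised fine-tuning, amounts to filtering sampled trajectories by a reward before training on the survivors; a toy sketch in which the trajectory format and threshold are our assumptions:

```python
def rejection_sample_sft(trajectories, reward_fn, threshold=1.0):
    """Keep only sampled trajectories whose reward clears the bar; the
    survivors become the supervised fine-tuning set (illustrative names)."""
    return [t for t in trajectories if reward_fn(t) >= threshold]

# Toy trajectories: (issue, predicted_location, hit) where hit = 1 means
# the localization matched the ground-truth edit site.
samples = [
    ("issue-1", "src/a.py", 1),
    ("issue-1", "src/b.py", 0),
    ("issue-2", "src/c.py", 1),
]
sft_set = rejection_sample_sft(samples, reward_fn=lambda t: t[2])
```

The second stage then applies reinforcement learning with tool calls in the loop, so the model learns not just correct final locations but effective multi-step retrieval behavior.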

Updated: 2025-08-06 03:43:30

Areas: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.03012v2

AtmosMJ: Revisiting Gating Mechanism for AI Weather Forecasting Beyond the Year Scale

The advent of Large Weather Models (LWMs) has marked a turning point in data-driven forecasting, with many models now outperforming traditional numerical systems in the medium range. However, achieving stable, long-range autoregressive forecasts beyond a few weeks remains a significant challenge. Prevailing state-of-the-art models that achieve year-long stability, such as SFNO and DLWP-HPX, have relied on transforming input data onto non-standard spatial domains like spherical harmonics or HEALPix meshes. This has led to the prevailing assumption that such representations are necessary to enforce physical consistency and long-term stability. This paper challenges that assumption by investigating whether comparable long-range performance can be achieved on the standard latitude-longitude grid. We introduce AtmosMJ, a deep convolutional network that operates directly on ERA5 data without any spherical remapping. The model's stability is enabled by a novel Gated Residual Fusion (GRF) mechanism, which adaptively moderates feature updates to prevent error accumulation over long recursive simulations. Our results demonstrate that AtmosMJ produces stable and physically plausible forecasts for about 500 days. In quantitative evaluations, it achieves competitive 10-day forecast accuracy against models like Pangu-Weather and GraphCast, all while requiring a remarkably low training budget of 5.7 days on a V100 GPU. Our findings suggest that efficient architectural design, rather than non-standard data representation, can be the key to unlocking stable and computationally efficient long-range weather prediction.
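
The Gated Residual Fusion idea, a gate that adaptively scales each feature update before it is added back to the state, can be sketched in one dimension per channel (a toy illustration of the mechanism, not the paper's trained network):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual_fusion(state, update, gate_logits):
    """GRF-style step: next_state = state + sigmoid(gate) * update.

    The gate damps channels whose corrections would otherwise let error
    accumulate over a long autoregressive rollout.
    """
    return [s + sigmoid(g) * u for s, u, g in zip(state, update, gate_logits)]

state, update = [1.0, 1.0], [0.5, 0.5]
# A strongly negative logit nearly closes the gate; a positive one opens it.
out = gated_residual_fusion(state, update, gate_logits=[-10.0, 10.0])
```

Because the gate is learned per feature, the network can pass large updates where they are reliable and suppress them where recursion would amplify noise, which is what enables the ~500-day stable rollouts.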

Updated: 2025-08-06 03:40:48

Areas: cs.LG,cs.AI,cs.CV,physics.ao-ph

Download: http://arxiv.org/abs/2506.09733v2

Enhancing Multi-view Open-set Learning via Ambiguity Uncertainty Calibration and View-wise Debiasing

Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
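
The O-Mix idea of synthesizing virtual samples with calibrated ambiguity can be sketched with a mixup-style interpolation; the linear ambiguity target below is our simplifying assumption, not the paper's calibration:

```python
def o_mix(x1, x2, lam):
    """Synthesize a virtual ambiguous sample between two known classes.

    Interpolates the inputs and assigns an open-set ambiguity target
    that peaks at lam = 0.5, where the sample is least class-like.
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    ambiguity = 1.0 - 2.0 * abs(lam - 0.5)  # 0 at the endpoints, 1 midway
    return x, ambiguity

virtual, amb = o_mix([1.0, 0.0], [0.0, 1.0], lam=0.5)
```

Training the auxiliary ambiguity perception network on such (sample, ambiguity) pairs is what lets the model flag atypical inputs as unknown at test time.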

Updated: 2025-08-06 03:36:40

Areas: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.01227v2

Neuron-based Multifractal Analysis of Neuron Interaction Dynamics in Large Models

In recent years, there has been increasing attention on the capabilities of large models, particularly in handling complex tasks that small-scale models are unable to perform. Notably, large language models (LLMs) have demonstrated ``intelligent'' abilities such as complex reasoning and abstract language comprehension, reflecting cognitive-like behaviors. However, current research on emergent abilities in large models predominantly focuses on the relationship between model performance and size, leaving a significant gap in the systematic quantitative analysis of the internal structures and mechanisms driving these emergent abilities. Drawing inspiration from neuroscience research on brain network structure and self-organization, we propose (i) a general network representation of large models, (ii) a new analytical framework, called Neuron-based Multifractal Analysis (NeuroMFA), for structural analysis, and (iii) a novel structure-based metric as a proxy for emergent abilities of large models. By linking structural features to the capabilities of large models, NeuroMFA provides a quantitative framework for analyzing emergent phenomena in large models. Our experiments show that the proposed method yields a comprehensive measure of network's evolving heterogeneity and organization, offering theoretical foundations and a new perspective for investigating emergent abilities in large models.

Updated: 2025-08-06 03:34:21

标题: 基于神经元的多重分形分析:大型模型中神经元相互作用动力学

摘要: 近年来,人们越来越关注大型模型的能力,特别是在处理小规模模型无法执行的复杂任务方面。值得注意的是,大型语言模型(LLMs)已经展示出复杂推理和抽象语言理解等“智能”能力,反映出类似认知行为。然而,目前关于大型模型中新兴能力的研究主要集中在模型性能和规模之间的关系上,缺乏对驱动这些新兴能力的内部结构和机制进行系统定量分析的重要研究。受神经科学对大脑网络结构和自组织的研究启发,我们提出了(i)大型模型的通用网络表示,(ii)一个名为基于神经元的多重分形分析(NeuroMFA)的新型结构分析框架,以及(iii)一种新颖的基于结构的度量作为大型模型新兴能力的代理。通过将结构特征与大型模型的能力联系起来,NeuroMFA为分析大型模型中新兴现象提供了定量框架。我们的实验表明,所提出的方法提供了对网络不断演化的异质性和组织性的综合度量,为研究大型模型中新兴能力提供了理论基础和新的视角。

更新时间: 2025-08-06 03:34:21

领域: cs.AI

下载: http://arxiv.org/abs/2402.09099v7

Boost Post-Training Quantization via Null Space Optimization for Large Language Models

Existing post-training quantization methods for large language models (LLMs) offer remarkable success. However, the increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. To inspire new directions for future research, this paper introduces the concept of null space into LLMs quantization. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight perturbation to lie within the null space of input activations. To prove this idea, we propose a plug-and-play null space projection module for existing milestone PTQ baselines named Q2N. Specifically, we first design an efficient and accurate null space projection approximation method tailored to the characteristics of LLMs. Subsequently, we theoretically derive a closed-form solution for an equivalent vector of the obtained projection matrix, which satisfies practical inference conditions while avoiding additional memory overhead. Extensive experiments are conducted on various state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) and baselines, demonstrating the effectiveness of both our Q2N and the perspective of null space optimization for LLMs quantization. We view this paper as a first step toward further alleviating quantization error based on insights from the null space, and hope it inspires future researchers to design more advanced quantization methods. Codes are available at https://github.com/zjq0455/q2n.
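The core null-space idea can be sketched directly: project the quantization perturbation ΔW onto the null space of the calibration activations X, so that X·ΔW ≈ 0 and the layer's outputs are unchanged. The SVD-based projector below is a generic construction for illustration, not the paper's efficient approximation method.

```python
import numpy as np

def nullspace_projector(x, tol=1e-10):
    """Orthogonal projector onto the (right) null space of x: for any matrix m,
    x @ (p @ m) is (numerically) zero."""
    _, s, vt = np.linalg.svd(x, full_matrices=True)
    rank = int(np.sum(s > tol))
    v_null = vt[rank:].T               # orthonormal basis of the null space
    return v_null @ v_null.T

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))      # calibration activations: 16 samples, 32 features
dw = rng.standard_normal((32, 8))      # raw post-quantization weight perturbation
p = nullspace_projector(x)
dw_safe = p @ dw                       # perturbation constrained to the null space
```

With fewer calibration samples than features the null space is nontrivial, so a nonzero perturbation can still leave the observed activations' outputs untouched.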

Updated: 2025-08-06 03:30:43

标题: 通过零空间优化提升大型语言模型的后训练量化

摘要: 现有的大型语言模型(LLMs)后训练量化方法取得了显著的成功。然而,日益微小的性能提升表明现有的量化策略不足以支持更加压缩模型的发展。为了激发未来研究的新方向,本文将零空间的概念引入到LLMs的量化中。我们认为通过将后量化权重扰动限制在输入激活的零空间内,可以有效地减轻量化误差。为了证明这一想法,我们提出了一个专门针对现有里程碑PTQ基线的即插即用零空间投影模块,命名为Q2N。具体来说,我们首先设计了一种高效准确的零空间投影近似方法,以适应LLMs的特性。随后,我们从理论上推导了获得的投影矩阵的等效向量的闭式解,该解满足实际推理条件,同时避免了额外的内存开销。我们在各种最先进的LLMs(LLaMA3、DeepSeek、Qwen3)和基线上进行了广泛的实验,证明了我们的Q2N和零空间优化对LLMs量化的有效性。我们认为这篇论文是基于零空间洞见进一步减轻量化误差的第一步,希望能激发未来研究人员设计更先进的量化方法。代码可在https://github.com/zjq0455/q2n上找到。

更新时间: 2025-08-06 03:30:43

领域: cs.LG

下载: http://arxiv.org/abs/2506.11044v2

Need for zkSpeed: Accelerating HyperPlonk for Zero-Knowledge Proofs

Zero-Knowledge Proofs (ZKPs) are rapidly gaining importance in privacy-preserving and verifiable computing. ZKPs enable a proving party to prove the truth of a statement to a verifying party without revealing anything else. ZKPs have applications in blockchain technologies, verifiable machine learning, and electronic voting, but have yet to see widespread adoption due to the computational complexity of the proving process. Recent works have accelerated the key primitives of state-of-the-art ZKP protocols on GPU and ASIC. However, the protocols accelerated thus far face one of two challenges: they either require a trusted setup for each application, or they generate larger proof sizes with higher verification costs, limiting their applicability in scenarios with numerous verifiers or strict verification time constraints. This work presents an accelerator, zkSpeed, for HyperPlonk, a state-of-the-art ZKP protocol that supports both one-time, universal setup and small proof sizes for typical ZKP applications in publicly verifiable, consensus-based systems. We accelerate the entire protocol, including two major primitives: SumCheck and Multi-scalar Multiplications (MSMs). We develop a full-chip architecture using 366.46 mm$^2$ and 2 TB/s of bandwidth to accelerate the entire proof generation process, achieving geometric mean speedups of 801$\times$ over CPU baselines.
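The SumCheck primitive that zkSpeed accelerates can be illustrated in miniature: a prover convinces a verifier that a multilinear polynomial, given here as its evaluation table over the boolean hypercube, sums to a claimed value, fixing one variable per round. This toy sketch runs over a small prime field and merges prover and verifier into one function for clarity; it shows only the round structure, not zkSpeed's hardware implementation.

```python
import random

P = 2**31 - 1   # toy prime field; real protocols use much larger fields

def sumcheck(table):
    """Prover and verifier for the sumcheck protocol, merged for clarity.
    `table` lists a multilinear polynomial's values over the boolean hypercube;
    each round fixes one variable at a random field element."""
    t = [v % P for v in table]
    claim = sum(t) % P                       # claimed hypercube sum
    verifier = random.Random(0)
    while len(t) > 1:
        g0 = sum(t[0::2]) % P                # g_i(0): sum with current variable = 0
        g1 = sum(t[1::2]) % P                # g_i(1): sum with current variable = 1
        assert (g0 + g1) % P == claim        # verifier's round check
        r = verifier.randrange(P)            # random challenge
        claim = (g0 + r * (g1 - g0)) % P     # next claim: g_i evaluated at r
        # fold the table: fix the current variable to r by linear interpolation
        t = [(t[2 * j] * (1 - r) + t[2 * j + 1] * r) % P for j in range(len(t) // 2)]
    assert t[0] == claim                     # final oracle query at the random point
    return True

ok = sumcheck(list(range(8)))                # 3-variable example
```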

Updated: 2025-08-06 03:30:01

标题: 对zkSpeed的需求:加速HyperPlonk以实现零知识证明

摘要: 零知识证明(ZKPs)在隐私保护和可验证计算中的重要性正在迅速提升。ZKPs使证明方能向验证方证明一个陈述的真实性,而不透露其他信息。ZKPs在区块链技术、可验证机器学习和电子投票等领域有应用,但由于证明过程的计算复杂性,尚未被广泛采用。最近的研究加速了现有ZKP协议的关键原语在GPU和ASIC上的实现。然而,迄今为止加速的协议面临两个挑战之一:它们要么需要为每个应用程序进行可信设置,要么生成具有更高验证成本的更大证明大小,限制了它们在具有众多验证方或严格验证时间限制的场景中的适用性。本研究提出了一种加速器zkSpeed,用于HyperPlonk,这是一种支持一次性、通用设置和小型证明大小的最先进ZKP协议,适用于公开可验证、基于共识的系统中的典型ZKP应用。我们加速整个协议,包括两个主要原语:SumCheck和多标量乘法(MSMs)。我们使用占地面积为366.46 mm^2和带宽为2 TB/s的整个芯片架构来加速整个证明生成过程,实现相对于CPU基准的几何平均加速比达到801倍。

更新时间: 2025-08-06 03:30:01

领域: cs.AR,cs.CR

下载: http://arxiv.org/abs/2504.06211v2

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/SE-Agent.
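The revision/recombination/refinement loop can be sketched generically. The toy below evolves bitstring "trajectories" toward a higher score; the operators and the scoring function are hypothetical stand-ins, since the actual SE-Agent applies these operations to LLM interaction traces.

```python
import random

rng = random.Random(0)
LENGTH, POOL = 12, 6       # toy sizes: trajectory length and population size

def score(traj):           # stand-in for task-success feedback
    return sum(traj)

def revise(traj):          # perturb one step of a pilot trajectory
    out = list(traj)
    out[rng.randrange(LENGTH)] ^= 1
    return out

def recombine(a, b):       # splice two trajectories at a random cut point
    cut = rng.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

def refine(traj):          # local cleanup: repair the first failing step
    out = list(traj)
    if 0 in out:
        out[out.index(0)] = 1
    return out

pool = [[rng.randrange(2) for _ in range(LENGTH)] for _ in range(POOL)]
best_initial = max(score(t) for t in pool)
for _ in range(20):        # self-evolution: expand, then keep the best
    children = [revise(t) for t in pool] + [recombine(pool[0], pool[-1])]
    pool = sorted(pool + [refine(c) for c in children], key=score, reverse=True)[:POOL]
```

Retaining parents alongside refined children makes the best score monotonically non-decreasing, mirroring how prior pilot trajectories keep guiding later iterations.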

Updated: 2025-08-06 03:27:31

标题: SE-Agent:基于LLM代理的多步推理中的自我演化轨迹优化

摘要: 基于大型语言模型(LLM)的智能体最近展示了在复杂推理和工具使用方面的卓越能力,通过与环境的多步交互。虽然这些智能体有潜力解决复杂任务,但它们的问题解决过程,即智能体的交互轨迹导致任务完成,仍未得到充分利用。这些轨迹包含丰富的反馈信息,可以引导智能体朝着正确的方向解决问题。尽管现有方法,如蒙特卡洛树搜索(MCTS),可以有效平衡探索和开发利用,但它们忽视了各种轨迹之间的相互依赖,并且缺乏搜索空间的多样性,导致冗余推理和次优结果。为了解决这些挑战,我们提出了SE-Agent,这是一个自我演化框架,使智能体能够迭代地优化其推理过程。我们的方法通过三个关键操作:修订、重组和完善,重新审视和增强先前的先导轨迹。这种演化机制带来了两个关键优势:(1) 通过智能地探索由先前轨迹引导的多样解决路径,将搜索空间扩展到局部最优解之外,(2) 利用交叉轨迹灵感,高效增强性能同时减轻次优推理路径的影响。通过这些机制,SE-Agent实现了持续的自我演化,逐步提高推理质量。我们在SWE-bench Verified上评估了SE-Agent,用于解决真实世界GitHub问题。跨越五个强大的LLM的实验结果显示,集成SE-Agent可以提供高达55%的相对改进,实现在SWE-bench Verified上所有开源智能体中的最优性能。我们的代码和演示材料可以在https://github.com/wanghuacan/SE-Agent 上公开获取。

更新时间: 2025-08-06 03:27:31

领域: cs.AI

下载: http://arxiv.org/abs/2508.02085v2

Quantum Temporal Fusion Transformer

The Temporal Fusion Transformer (TFT), proposed by Lim et al. [\textit{International Journal of Forecasting}, 2021], is a state-of-the-art attention-based deep neural network architecture specifically designed for multi-horizon time series forecasting. It has demonstrated significant performance improvements over existing benchmarks. In this work, we propose a Quantum Temporal Fusion Transformer (QTFT), a quantum-enhanced hybrid quantum-classical architecture that extends the capabilities of the classical TFT framework. Our results demonstrate that QTFT is successfully trained on the forecasting datasets and is capable of accurately predicting future values. In particular, our experimental results show that in certain test cases, the model outperforms its classical counterpart in terms of both training and test loss, while in the remaining cases, it achieves comparable performance. A key advantage of our approach lies in its foundation on a variational quantum algorithm, enabling implementation on current noisy intermediate-scale quantum (NISQ) devices without strict requirements on the number of qubits or circuit depth.

Updated: 2025-08-06 03:21:20

标题: 量子时间融合变压器

摘要: 由Lim等人提出的时间融合变压器(TFT)[\textit{国际预测杂志},2021]是一种最先进的基于注意力的深度神经网络架构,专门设计用于多时域时间序列预测。它已经证明在现有基准上有显著的性能提升。在这项工作中,我们提出了一种量子时间融合变压器(QTFT),这是一种量子增强的混合量子-经典架构,扩展了经典TFT框架的能力。我们的结果表明,QTFT在预测数据集上成功训练,并能够准确预测未来的值。特别是,我们的实验结果显示,在某些测试案例中,该模型在训练和测试损失方面优于其经典对应物,而在其余情况下,它达到可比较的性能。我们方法的一个关键优势在于其基础是一种变分量子算法,可以在当前噪声中等规模量子(NISQ)设备上实现,而不需要对量子比特的数量或电路深度有严格要求。

更新时间: 2025-08-06 03:21:20

领域: cs.LG,quant-ph

下载: http://arxiv.org/abs/2508.04048v1

Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

Since the release of GPT2-1.5B in 2019, the large language models (LLMs) have evolved from specialized deep models to versatile foundation models. While demonstrating remarkable zero-shot ability, the LLMs still require fine-tuning on local datasets and substantial memory for deployment over the network edges. Traditional first-order fine-tuning techniques require significant GPU memory that exceeds the capacity of mainstream hardware. Besides, the LLMs have been expanded beyond text generation to create images, audio, video, and multi-modal content, necessitating careful investigation of efficient deployment strategies for large-scale foundation models. In response to these challenges, model fine-tuning and model-compression techniques have been developed to support the sustainable growth of LLMs by reducing both operational and capital expenditures. In this work, we provide a comprehensive overview of prevalent memory-efficient fine-tuning methods for deployment at the network edge. We also review state-of-the-art literature on model compression, offering insights into the deployment of LLMs at network edges.
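As a concrete instance of the memory-efficient fine-tuning methods surveyed in this space, a low-rank adapter in the style of LoRA freezes the pretrained weight and trains only a small low-rank update. A minimal numpy sketch follows; sizes and initialization are illustrative, not a specific library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                            # hidden size and adapter rank (r << d)
w = rng.standard_normal((d, d))         # pretrained weight, kept frozen
a = 0.01 * rng.standard_normal((r, d))  # trainable down-projection
b = np.zeros((d, r))                    # trainable up-projection, zero-initialized
                                        # so fine-tuning starts from the base model

def forward(x):
    return x @ (w + b @ a).T            # effective weight: W + BA

full_params = w.size
adapter_params = a.size + b.size        # only these need gradients/optimizer state
```

Only the adapter matrices carry gradients and optimizer state, which is what makes such methods attractive on memory-constrained edge hardware.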

Updated: 2025-08-06 03:19:06

标题: 在边缘端微调和部署大型语言模型:问题与方法

摘要: 自2019年发布GPT2-1.5B以来,大型语言模型(LLMs)已经从专业的深度模型发展成为多功能的基础模型。虽然展示了出色的零样本(zero-shot)能力,但LLMs仍然需要在本地数据集上进行微调,并且在部署到网络边缘时需要大量内存。传统的一阶微调技术需要显著的GPU内存,超出了主流硬件的容量。此外,LLMs已经扩展到超越文本生成的领域,可以创建图像、音频、视频和多模态内容,因此需要仔细调查大规模基础模型的高效部署策略。为了应对这些挑战,已经开发了模型微调和模型压缩技术,以减少运营和资本支出,支持LLMs的可持续增长。在这项工作中,我们提供了关于网络边缘部署的流行内存高效微调方法的全面概述。我们还回顾了关于模型压缩的最新文献,为LLMs在网络边缘部署提供了见解。

更新时间: 2025-08-06 03:19:06

领域: cs.AI

下载: http://arxiv.org/abs/2408.10691v3

FeDaL: Federated Dataset Learning for Time Series Foundation Models

Dataset-wise heterogeneity introduces significant domain biases that fundamentally degrade generalization on Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethinks the development of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a natural solution for decomposing heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL's cross-dataset generalization has been extensively evaluated in real-world datasets spanning eight tasks, including both representation learning and downstream time series analysis, against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.
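The shared/personalized decomposition builds on federated learning's standard aggregation pattern. Below is a minimal sketch of size-weighted federated averaging for the shared knowledge; the paper's DBE/GBE mechanisms are not reproduced here.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """One shared-knowledge aggregation step: average client parameters,
    weighted by each client's local data volume."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

rng = np.random.default_rng(0)
clients = [rng.standard_normal(10) for _ in range(4)]   # per-client model vectors
global_w = fed_avg(clients, [100, 50, 25, 25])          # datasets of unequal size
```

Each client would then keep its own personalized parameters alongside the aggregated shared ones, which is the decomposition the abstract describes.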

Updated: 2025-08-06 03:14:31

标题: FeDaL: 面向时间序列基础模型的联邦数据集学习

摘要: 数据集间的异质性引入了显著的领域偏差,从根本上降低了时间序列基础模型(TSFMs)的泛化能力,然而这一挑战仍未得到充分探讨。本文重新思考了使用联邦学习范式开发TSFMs。我们提出了一种新颖的联邦数据集学习(FeDaL)方法,通过学习数据集无关的时间表示来解决异质时间序列。具体而言,联邦学习的分布式架构是将异质TS数据集分解为共享的广义知识和保留个性化知识的自然解决方案。此外,基于TSFM架构,FeDaL通过添加两种互补机制:领域偏差消除(DBE)和全局偏差消除(GBE),明确地缓解了本地和全局偏差。FeDaL的跨数据集泛化已在涵盖八项任务的真实世界数据集中进行了广泛评估,包括表示学习和下游时间序列分析,对比了54个基线。我们进一步分析了联邦扩展行为,展示了在去中心化情况下数据量、客户端数量和加入率如何影响模型性能。

更新时间: 2025-08-06 03:14:31

领域: cs.LG

下载: http://arxiv.org/abs/2508.04045v1

Tool Unlearning for Tool-Augmented LLMs

Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs, which embed the ability to use tools or APIs directly into the parametric knowledge of LLMs. Tool-augmented LLMs need the ability to forget learned tools due to security vulnerabilities, privacy regulations, or tool deprecations. However, ``tool unlearning'' has not been investigated in unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs. It implements three key properties to address the above challenges for effective tool unlearning and introduces a new membership inference attack (MIA) model for effective evaluation. Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns randomly selected tools, while preserving the LLM's knowledge on non-deleted tools and maintaining performance on general tasks.

Updated: 2025-08-06 03:09:38

标题: 工具增强LLM的工具遗忘

摘要: 工具增强的大型语言模型(LLMs)通常在查询-响应对的数据集上进行训练,这种数据集嵌入了直接使用工具或API的能力到LLMs的参数化知识中。由于安全漏洞、隐私法规或工具停用的原因,工具增强的LLMs需要具有忘记已学习的工具的能力。然而,“工具遗忘”尚未在遗忘文献中进行研究。我们引入了这个新颖的任务,需要解决与传统遗忘相比的不同挑战:知识移除而不是遗忘个别样本,优化LLMs的高成本,以及需要有原则的评估指标。为了弥合这些差距,我们提出了ToolDelete,这是第一个用于从工具增强的LLMs中遗忘工具的方法。它实现了三个关键属性,以解决上述挑战,实现有效的工具遗忘,并引入了一种新的成员推理攻击(MIA)模型,用于有效评估。对多个工具学习数据集和工具增强的LLMs进行了大量实验,结果表明ToolDelete能够有效遗忘随机选择的工具,同时保留LLMs对未删除工具的知识,并在一般任务上保持性能。

更新时间: 2025-08-06 03:09:38

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.01083v2

Understanding Flatness in Generative Models: Its Role and Benefits

Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias -- where errors in noise estimation accumulate over iterations -- and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, enhances flatness in diffusion models more effectively than methods that promote flatness only indirectly, such as Input Perturbation (IP), which enforces a Lipschitz condition, and ensembling-based approaches like Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA). Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improve not only generative performance but also robustness.
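The SAM update evaluated here perturbs the weights toward the local worst case before descending. A minimal numpy sketch on a toy quadratic objective (the loss function is an assumption for illustration):

```python
import numpy as np

def loss(w):          # toy quadratic objective, an assumption for illustration
    return 0.5 * w @ w

def grad(w):
    return w

def sam_step(w, lr=0.1, rho=0.05):
    """One SAM update: step to the nearby worst case (gradient ascent of
    radius rho), then descend using the gradient measured there."""
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction, normalized
    return w - lr * grad(w + eps)                # descend with perturbed gradient

w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w)
```

Because the descent gradient is taken at the perturbed point, sharp minima (where that gradient differs strongly from the local one) are penalized, biasing training toward flat regions.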

Updated: 2025-08-06 03:03:47

标题: 理解生成模型中的平坦性:其作用和好处

摘要: 平坦的最小值在监督学习中已被证明可以增强泛化能力和鲁棒性,但在生成模型中仍然很少被探索。在这项工作中,我们系统地研究了损失表面平坦性在生成模型中的作用,从理论和实证两方面进行了研究,特别关注扩散模型。我们建立了一个理论断言,即更平坦的最小值能够提高对目标先验分布扰动的鲁棒性,从而带来诸如减少暴露偏差(即在迭代中噪声估计错误累积)和显著提高对模型量化的韧性等好处,即使在强量化约束条件下仍能保持生成性能。我们进一步观察到,明确控制平坦度的Sharpness-Aware Minimization (SAM)方法在增强扩散模型平坦性方面,比间接促进平坦性的方法更为有效,例如强制施加Lipschitz条件的Input Perturbation (IP),以及随机权重平均 (SWA)和指数移动平均 (EMA)等基于集成的方法。通过对CIFAR-10、LSUN Tower和FFHQ的广泛实验,我们证明了扩散模型中的平坦最小值确实不仅提高了生成性能,还增强了鲁棒性。

更新时间: 2025-08-06 03:03:47

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.11078v2

Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams

Large language models (LLMs) excel at generating code from natural language (NL) descriptions. However, the plain textual descriptions are inherently ambiguous and often fail to capture complex requirements like intricate system behaviors, conditional logic, and architectural constraints; implicit data dependencies in service-oriented architectures are difficult to infer and handle correctly. To bridge this gap, we propose a novel step-by-step code generation framework named UML2Dep by leveraging unambiguous formal specifications of complex requirements. First, we introduce an enhanced Unified Modeling Language (UML) sequence diagram tailored for service-oriented architectures. This diagram extends traditional visual syntax by integrating decision tables and API specifications, explicitly formalizing structural relationships and business logic flows in service interactions to rigorously eliminate linguistic ambiguity. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference (DDI) task. DDI systematically constructs an explicit data dependency graph prior to actual code synthesis. To ensure reliability, we formalize DDI as a constrained mathematical reasoning task through novel prompting strategies, aligning with LLMs' excellent mathematical strengths. Additional static parsing and dependency pruning further reduce context complexity and cognitive load associated with intricate specifications, thereby enhancing reasoning accuracy and efficiency.
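Once DDI has produced an explicit data dependency graph, code for the service calls must be synthesized in dependency order. A minimal sketch with Python's standard `graphlib`; the service names are hypothetical, for illustration only.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# hypothetical dependency graph inferred by DDI: each service call maps to the
# set of calls whose outputs it consumes
deps = {
    "create_order": {"get_user", "get_cart"},
    "charge_card":  {"create_order"},
    "send_receipt": {"charge_card", "get_user"},
}

# a valid synthesis order: every call appears after its data dependencies
synthesis_order = list(TopologicalSorter(deps).static_order())
```

A cycle in the inferred graph would raise `graphlib.CycleError` here, which is one way an inconsistent dependency inference can be caught before any code is generated.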

Updated: 2025-08-06 02:58:28

标题: 基于UML序列图的工业代码生成中的数据依赖推断

摘要: 大型语言模型(LLM)擅长从自然语言(NL)描述中生成代码。然而,纯文本描述本质上具有歧义,并经常无法捕捉复杂要求,如复杂系统行为、条件逻辑和架构约束;服务导向架构中的隐含数据依赖关系难以推断和正确处理。为了弥合这一差距,我们提出了一个名为UML2Dep的新型逐步代码生成框架,通过利用复杂要求的明确形式规范来实现。首先,我们引入了一个专为服务导向架构定制的增强统一建模语言(UML)序列图。该图通过集成决策表和API规范扩展了传统的可视语法,明确地形式化了服务交互中的结构关系和业务逻辑流程,从而严格消除了语言歧义。其次,认识到数据流的关键作用,我们引入了一个专门的数据依赖推断(DDI)任务。DDI在实际代码合成之前系统地构建一个明确的数据依赖图。为了确保可靠性,我们通过新颖的提示策略将DDI形式化为一项受限数学推理任务,与LLM的优秀数学实力相一致。额外的静态解析和依赖修剪进一步减少了与复杂规范相关的上下文复杂性和认知负担,从而提高了推理的准确性和效率。

更新时间: 2025-08-06 02:58:28

领域: cs.AI,cs.SE

下载: http://arxiv.org/abs/2508.03379v2

SEA: Self-Evolution Agent with Step-wise Reward for Computer Use

Computer use agents are an emerging area in artificial intelligence that aims to operate computers to accomplish users' tasks, and they attract a lot of attention from both industry and academia. However, present agents' performance is still far from ready for practical use. In this paper, we propose the Self-Evolution Agent (SEA) for computer use, and to develop this agent, we propose creative methods in data generation, reinforcement learning, and model enhancement. Specifically, we first propose an automatic pipeline to generate verifiable trajectories for training. Then, we propose efficient step-wise reinforcement learning to alleviate the significant computational requirements of long-horizon training. Finally, we propose an enhancement method that merges grounding and planning ability into one model without any extra training. Accordingly, based on our proposed innovations in data generation, training strategy, and enhancement, we obtain the Self-Evolution Agent (SEA) for computer use with only 7B parameters, which outperforms models with the same number of parameters and has comparable performance to larger ones. We will make the model weights and related code open-source in the future.

Updated: 2025-08-06 02:57:22

标题: SEA:具有逐步奖励的自我演化代理用于计算机使用

摘要: 计算机使用代理是人工智能中一个新兴领域,旨在操作计算机以实现用户的任务,这引起了行业和学术界的广泛关注。然而,目前代理的性能还远未达到可用的水平。在本文中,我们提出了用于计算机使用的自我进化代理(SEA),并为了开发这个代理,我们提出了在数据生成、强化学习和模型增强方面的创造性方法。具体来说,我们首先提出了一个自动生成可验证轨迹用于训练的自动流水线。然后,我们提出了高效的逐步强化学习方法,以减轻长期训练所需的显著计算需求。最后,我们提出了增强方法,将基础和规划能力合并到一个模型中,而无需额外的训练。因此,基于我们提出的数据生成、训练策略和增强的创新,我们得到了只有7B参数的用于计算机使用的自我进化代理(SEA),其性能优于具有相同数量参数的模型,并且与更大模型有可比性的性能。我们将在未来公开模型的权重和相关代码。

更新时间: 2025-08-06 02:57:22

领域: cs.AI

下载: http://arxiv.org/abs/2508.04037v1

CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion

This study presents CORE-ReID V2, an enhanced framework building upon CORE-ReID. The new framework extends its predecessor by addressing Unsupervised Domain Adaptation (UDA) challenges in Person ReID and Vehicle ReID, with further applicability to Object ReID. During pre-training, CycleGAN is employed to synthesize diverse data, bridging image characteristic gaps across different domains. In the fine-tuning, an advanced ensemble fusion mechanism, consisting of the Efficient Channel Attention Block (ECAB) and the Simplified Efficient Channel Attention Block (SECAB), enhances both local and global feature representations while reducing ambiguity in pseudo-labels for target samples. Experimental results on widely used UDA Person ReID and Vehicle ReID datasets demonstrate that the proposed framework outperforms state-of-the-art methods, achieving top performance in Mean Average Precision (mAP) and Rank-k Accuracy (Top-1, Top-5, Top-10). Moreover, the framework supports lightweight backbones such as ResNet18 and ResNet34, ensuring both scalability and efficiency. Our work not only pushes the boundaries of UDA-based Object ReID but also provides a solid foundation for further research and advancements in this domain. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID-V2.
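The Efficient Channel Attention Block follows the general ECA recipe: global average pooling, a 1-D convolution across channels, and a sigmoid gate. The sketch below uses untrained, uniform convolution weights and may differ from the paper's exact ECAB/SECAB design.

```python
import numpy as np

def eca_block(x, k=3):
    """Channel attention in the ECA style on a (C, H, W) feature map: global
    average pooling, a 1-D convolution across channels, and a sigmoid gate.
    The uniform kernel stands in for trained convolution weights."""
    pooled = x.mean(axis=(1, 2))                        # (C,) channel descriptor
    kernel = np.full(k, 1.0 / k)
    padded = np.pad(pooled, k // 2, mode="edge")
    mixed = np.convolve(padded, kernel, mode="valid")   # cross-channel interaction
    gate = 1.0 / (1.0 + np.exp(-mixed))                 # per-channel weight in (0, 1)
    return x * gate[:, None, None]                      # rescale each channel

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
out = eca_block(feat)
```

Because only a k-sized 1-D kernel is learned, such blocks add almost no parameters, which fits the framework's emphasis on lightweight backbones.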

Updated: 2025-08-06 02:57:09

标题: CORE-ReID V2:通过优化训练和集成融合推进物体再识别的域自适应

摘要: 这项研究介绍了CORE-ReID V2,这是在CORE-ReID基础上进一步增强的框架。新框架通过解决人员ReID和车辆ReID中的无监督领域适应(UDA)挑战,进一步适用于物体ReID。在预训练阶段,使用CycleGAN合成多样化数据,弥合不同领域之间的图像特征差距。在微调阶段,一个先进的集成融合机制,包括高效通道注意块(ECAB)和简化高效通道注意块(SECAB),增强了局部和全局特征表示,同时减少了目标样本的伪标签中的歧义。在广泛使用的UDA人员ReID和车辆ReID数据集上的实验结果表明,所提出的框架优于最先进的方法,在平均精度(mAP)和排名k准确性(Top-1、Top-5、Top-10)方面表现出色。此外,该框架支持像ResNet18和ResNet34这样的轻量级骨干,确保可扩展性和效率。我们的工作不仅推动了基于UDA的物体ReID的边界,还为该领域的进一步研究和进展提供了坚实基础。我们的代码和模型可在https://github.com/TrinhQuocNguyen/CORE-ReID-V2 上获取。

更新时间: 2025-08-06 02:57:09

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.04036v1

A Comparative Survey of PyTorch vs TensorFlow for Deep Learning: Usability, Performance, and Deployment Trade-offs

This paper presents a comprehensive comparative survey of TensorFlow and PyTorch, the two leading deep learning frameworks, focusing on their usability, performance, and deployment trade-offs. We review each framework's programming paradigm and developer experience, contrasting TensorFlow's graph-based (now optionally eager) approach with PyTorch's dynamic, Pythonic style. We then compare model training speeds and inference performance across multiple tasks and data regimes, drawing on recent benchmarks and studies. Deployment flexibility is examined in depth - from TensorFlow's mature ecosystem (TensorFlow Lite for mobile/embedded, TensorFlow Serving, and JavaScript support) to PyTorch's newer production tools (TorchScript compilation, ONNX export, and TorchServe). We also survey ecosystem and community support, including library integrations, industry adoption, and research trends (e.g., PyTorch's dominance in recent research publications versus TensorFlow's broader tooling in enterprise). Applications in computer vision, natural language processing, and other domains are discussed to illustrate how each framework is used in practice. Finally, we outline future directions and open challenges in deep learning framework design, such as unifying eager and graph execution, improving cross-framework interoperability, and integrating compiler optimizations (XLA, JIT) for improved speed. Our findings indicate that while both frameworks are highly capable for state-of-the-art deep learning, they exhibit distinct trade-offs: PyTorch offers simplicity and flexibility favored in research, whereas TensorFlow provides a fuller production-ready ecosystem - understanding these trade-offs is key for practitioners selecting the appropriate tool. We include charts, code snippets, and more than 20 references to academic papers and official documentation to support this comparative analysis

Updated: 2025-08-06 02:55:57

标题: PyTorch与TensorFlow在深度学习中的比较调查:可用性、性能和部署折衷

摘要: 本文介绍了TensorFlow和PyTorch这两个领先的深度学习框架的全面比较调查,重点关注它们的可用性、性能和部署取舍。我们审查了每个框架的编程范式和开发者体验,对比了TensorFlow基于图的(现在可选的急切)方法与PyTorch的动态、Python风格。然后,我们比较了多个任务和数据情形下的模型训练速度和推断性能,借鉴了最近的基准测试和研究。我们深入研究了部署灵活性 - 从TensorFlow成熟的生态系统(适用于移动/嵌入式的TensorFlow Lite、TensorFlow Serving和JavaScript支持)到PyTorch的较新的生产工具(TorchScript编译、ONNX导出和TorchServe)。我们还调查了生态系统和社区支持,包括库集成、行业采用和研究趋势(例如,PyTorch在最近的研究出版物中占主导地位,而TensorFlow在企业中具有更广泛的工具)。讨论了在计算机视觉、自然语言处理和其他领域的应用,以说明每个框架在实践中的使用方式。最后,我们概述了深度学习框架设计的未来方向和开放挑战,例如统一急切和图执行、提高跨框架的互操作性,以及整合编译器优化(XLA、JIT)以提高速度。我们的研究结果表明,虽然这两个框架都非常适用于最先进的深度学习,但它们展示了明显的取舍:PyTorch提供了在研究中受欢迎的简单性和灵活性,而TensorFlow提供了一个更完整的生产就绪生态系统 - 了解这些权衡是从业者选择适当工具的关键。我们包括图表、代码片段和超过20篇学术论文和官方文档的参考资料,以支持这种比较分析。

更新时间: 2025-08-06 02:55:57

领域: cs.LG,cs.AI,68T05,I.2.6; I.2.10

下载: http://arxiv.org/abs/2508.04035v1

Enhancing Serendipity Recommendation System by Constructing Dynamic User Knowledge Graphs with Large Language Models

The feedback loop in industrial recommendation systems reinforces homogeneous content, creates filter bubble effects, and diminishes user satisfaction. Recently, large language models (LLMs) have demonstrated potential in serendipity recommendation, thanks to their extensive world knowledge and superior reasoning capabilities. However, these models still face challenges in ensuring the rationality of the reasoning process, the usefulness of the reasoning results, and meeting the latency requirements of industrial recommendation systems (RSs). To address these challenges, we propose a method that leverages LLMs to dynamically construct user knowledge graphs, thereby enhancing the serendipity of recommendation systems. This method comprises a two stage framework: (1) two-hop interest reasoning, where user static profiles and historical behaviors are utilized to dynamically construct user knowledge graphs via LLMs. Two-hop reasoning, which can enhance the quality and accuracy of LLM reasoning results, is then performed on the constructed graphs to identify users' potential interests; and (2) near-line adaptation, a cost-effective approach to deploying the aforementioned models in industrial recommendation systems. We propose a u2i (user-to-item) retrieval model that also incorporates i2i (item-to-item) retrieval capabilities; the retrieved items not only exhibit strong relevance to users' newly emerged interests but also retain the high conversion rate of traditional u2i retrieval. Our online experiments on the Dewu app, which has tens of millions of users, indicate that the method increased the exposure novelty rate by 4.62%, the click novelty rate by 4.85%, the average view duration per person by 0.15%, unique visitor click through rate by 0.07%, and unique visitor interaction penetration by 0.30%, enhancing user experience.
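Two-hop interest reasoning over the constructed user knowledge graph amounts to walking two edges out from the user node and keeping the newly reached entities as candidate interests. A minimal sketch with a hypothetical graph (the entities are invented for illustration):

```python
# hypothetical user knowledge graph: entity -> directly related entities
graph = {
    "user": ["running shoes", "jazz"],
    "running shoes": ["marathon gear", "fitness trackers"],
    "jazz": ["vinyl records", "saxophone"],
}

def two_hop_interests(graph, start="user"):
    """Candidate potential interests: entities exactly two hops from the user,
    excluding the user and already-known first-hop interests."""
    first_hop = set(graph.get(start, []))
    second_hop = set()
    for entity in first_hop:
        for neighbor in graph.get(entity, []):
            if neighbor != start and neighbor not in first_hop:
                second_hop.add(neighbor)
    return second_hop

candidates = two_hop_interests(graph)
```

The second-hop entities are exactly the "serendipitous" candidates: adjacent to known interests, but not yet directly associated with the user.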

Updated: 2025-08-06 02:52:09

标题: 通过构建带有大型语言模型的动态用户知识图来增强偶然推荐系统

摘要: 在工业推荐系统中的反馈循环会加强内容的同质化,产生过滤气泡效应,并降低用户满意度。最近,大型语言模型(LLMs)展示了在偶然推荐中的潜力,这要归功于它们广泛的世界知识和卓越的推理能力。然而,这些模型仍然面临着确保推理过程的合理性、推理结果的实用性以及满足工业推荐系统(RSs)的延迟要求方面的挑战。为了解决这些挑战,我们提出了一种方法,利用LLM动态构建用户知识图,从而增强推荐系统的偶然性。该方法包括一个两阶段框架:(1) 两跳兴趣推理,利用用户静态资料和历史行为通过LLM动态构建用户知识图。然后在构建的图上进行两跳推理,以识别用户的潜在兴趣,从而提高LLM推理结果的质量和准确性;(2) 近线调整,是在工业推荐系统中部署上述模型的一种经济有效方法。我们提出了一个u2i (用户到项目)检索模型,还结合了i2i (项目到项目)检索能力,检索到的项目不仅与用户新出现的兴趣强相关,而且保留了传统u2i检索的高转化率。我们在拥有数千万用户的Dewu应用上进行的在线实验表明,该方法将曝光新颖率提高了4.62%,点击新颖率提高了4.85%,人均平均观看时长提高了0.15%,独立访客点击率提高了0.07%,独立访客互动渗透率提高了0.30%,提升了用户体验。

更新时间: 2025-08-06 02:52:09

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2508.04032v1

Reliable Evaluation Protocol for Low-Precision Retrieval

Lowering the numerical precision of model parameters and computations is widely adopted to improve the efficiency of retrieval systems. However, when computing relevance scores between the query and documents in low-precision, we observe spurious ties due to the reduced granularity. This introduces high variability in the results based on tie resolution, making the evaluation less reliable. To address this, we propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates. Our experiments test multiple models with three scoring functions on two retrieval datasets to demonstrate that HPS dramatically reduces tie-induced instability, and TRM accurately recovers expected metric values. This combination enables a more consistent and reliable evaluation system for lower-precision retrievals.
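The spurious-ties problem and the HPS fix are easy to reproduce: casting fine-grained relevance scores to float16 collapses many of them onto the same representable value, while upcasting the final scoring step restores the distinctions. The score range below is arbitrary, chosen only so the float16 spacing is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
# fine-grained relevance scores, as produced by e.g. a dot product
scores32 = rng.uniform(10.0, 10.1, size=100).astype(np.float32)

scores16 = scores32.astype(np.float16)             # low-precision scoring:
ties16 = len(scores16) - len(np.unique(scores16))  # float16 spacing near 10 is
                                                   # ~0.0078, so scores collide
ties32 = len(scores32) - len(np.unique(scores32))

# HPS-style fix: keep retrieval itself in low precision, but upcast the final
# scoring step so tied candidates are separated again before ranking
ranking = np.argsort(-scores32)
```

Only the last scoring pass is upcast, so nearly all of the efficiency of low-precision retrieval is retained while the ranking becomes deterministic.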

Updated: 2025-08-06 02:48:59

标题: 低精度检索的可靠评估协议

摘要: 降低模型参数和计算的数值精度被广泛用于提高检索系统的效率。然而,在以低精度计算查询和文档之间的相关性分数时,由于粒度降低,我们观察到了虚假的得分并列(ties)。这使得结果随并列打破方式的不同而有很大差异,评估因此变得不够可靠。为了解决这个问题,我们提出了一个更健壮的检索评估协议,旨在减少得分变化。它包括:(1)高精度评分(HPS),将最终评分步骤提升到更高精度,以最小的计算成本区分得分并列的候选项;以及(2)并列感知检索指标(TRM),报告期望得分、范围和偏差,以量化并列候选项的排序不确定性。我们的实验在两个检索数据集上使用三个评分函数测试了多个模型,结果表明HPS显著减少了并列引起的不稳定性,TRM准确恢复了期望的指标值。这种组合为低精度检索提供了更一致和可靠的评估系统。

更新时间: 2025-08-06 02:48:59

领域: cs.IR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2508.03306v2

EcoTransformer: Attention without Multiplication

The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy. (This version (v2) supersedes v1 and reflects the intended release and licensing.)
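The attention variant can be sketched directly: scores come from a Laplacian kernel over L1 query-key distances, so no query-key matrix product is needed. The value aggregation below still uses a weighted sum for clarity, and sigma is an illustrative bandwidth rather than the paper's setting.

```python
import numpy as np

def eco_attention(q, k, v, sigma=1.0):
    """Attention weights from a Laplacian kernel over L1 query-key distances,
    exp(-||q_i - k_j||_1 / sigma): the score computation needs no q @ k.T
    matrix multiplication."""
    d = np.abs(q[:, None, :] - k[None, :, :]).sum(axis=-1)  # pairwise L1 distances
    w = np.exp(-d / sigma)
    w = w / w.sum(axis=-1, keepdims=True)                   # rows sum to 1
    return w @ v                                            # aggregate the values

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))
k = rng.standard_normal((5, 8))
v = rng.standard_normal((5, 4))
out = eco_attention(q, k, v)
```

The L1 distances and exponentials need only additions, subtractions, and elementwise operations, which is where the claimed energy savings over the q·k dot products come from.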

Updated: 2025-08-06 02:41:31

标题: EcoTransformer:没有乘法的注意力

摘要: 凭借其缩放点积注意力机制,Transformer已成为现代人工智能中的基础架构。然而,这种机制在计算上是密集的,并带来了大量的能源成本。我们提出了一种新的Transformer架构EcoTransformer,在该架构中,输出上下文向量是使用拉普拉斯核对值进行卷积构建的,其中查询和键之间的距离由L1度量来衡量。与基于点积的注意力相比,新的注意力分数计算不涉及矩阵乘法。在自然语言处理、生物信息学和视觉任务中,其表现与缩放点积注意力相当,甚至更优,并且能耗显著降低。 (此版本(v2)取代了v1,并反映了预期的发布和许可。)

更新时间: 2025-08-06 02:41:31

领域: cs.LG,cs.AI,cs.CL,68T05

下载: http://arxiv.org/abs/2507.20096v2

Supervised Dynamic Dimension Reduction with Deep Neural Network

This paper studies the problem of dimension reduction, tailored to improving time series forecasting with high-dimensional predictors. We propose a novel Supervised Deep Dynamic Principal component analysis (SDDP) framework that incorporates the target variable and lagged observations into the factor extraction process. Assisted by a temporal neural network, we construct target-aware predictors by scaling the original predictors in a supervised manner, with larger weights assigned to predictors with stronger forecasting power. A principal component analysis is then performed on the target-aware predictors to extract the estimated SDDP factors. This supervised factor extraction not only improves predictive accuracy in the downstream forecasting task but also yields more interpretable and target-specific latent factors. Building upon SDDP, we propose a factor-augmented nonlinear dynamic forecasting model that unifies a broad family of factor-model-based forecasting approaches. To further demonstrate the broader applicability of SDDP, we extend our studies to a more challenging scenario when the predictors are only partially observable. We validate the empirical performance of the proposed method on several real-world public datasets. The results show that our algorithm achieves notable improvements in forecasting accuracy compared to state-of-the-art methods.
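The SDDP recipe, as described, is: scale each predictor by its estimated forecasting power, then run PCA on the scaled panel. The paper learns the weights with a temporal neural network; as a purely illustrative stand-in, the sketch below uses absolute correlation with the target:

```python
import numpy as np

def sddp_factors_sketch(X, y, n_factors=2):
    """X: (T, p) predictor panel, y: (T,) target. Returns (T, n_factors) factors."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Supervised weights: |corr(predictor_j, target)| as a crude proxy for the
    # neural-network-derived weights used in the paper.
    w = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    Xw = Xc * w                                 # target-aware predictors
    # Principal components of the weighted panel via SVD.
    U, S, _ = np.linalg.svd(Xw, full_matrices=False)
    return U[:, :n_factors] * S[:n_factors]
```

Predictors with stronger association to the target dominate the weighted panel, so the leading factors become target-specific rather than purely variance-driven, which is the intuition behind the claimed interpretability gain.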

Updated: 2025-08-06 02:41:26

Domains: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.03546v2

MissMecha: An All-in-One Python Package for Studying Missing Data Mechanisms

Incomplete data is a persistent challenge in real-world datasets, often governed by complex and unobservable missing mechanisms. Simulating missingness has become a standard approach for understanding its impact on learning and analysis. However, existing tools are fragmented, mechanism-limited, and typically focus only on numerical variables, overlooking the heterogeneous nature of real-world tabular data. We present MissMecha, an open-source Python toolkit for simulating, visualizing, and evaluating missing data under MCAR, MAR, and MNAR assumptions. MissMecha supports both numerical and categorical features, enabling mechanism-aware studies across mixed-type tabular datasets. It includes visual diagnostics, MCAR testing utilities, and type-aware imputation evaluation metrics. Designed to support data quality research, benchmarking, and education, MissMecha offers a unified platform for researchers and practitioners working with incomplete data.
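For intuition, the two simplest mechanisms such a toolkit covers can be simulated in a few lines. This is our own illustration, independent of MissMecha's actual API:

```python
import numpy as np

def simulate_mcar(X, rate=0.2, seed=0):
    # MCAR: every cell is masked independently with the same probability.
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < rate
    Xm = X.astype(float).copy()
    Xm[mask] = np.nan
    return Xm, mask

def simulate_mar(X, driver_col=0, target_col=1, rate=0.5, seed=0):
    # MAR: missingness in target_col depends only on the *observed* driver_col
    # (here: rows whose driver value is above the median are at risk).
    rng = np.random.default_rng(seed)
    Xm = X.astype(float).copy()
    at_risk = X[:, driver_col] > np.median(X[:, driver_col])
    mask = at_risk & (rng.random(X.shape[0]) < rate)
    Xm[mask, target_col] = np.nan
    return Xm, mask
```

MNAR, by contrast, makes missingness depend on the unobserved value itself, which is what makes it untestable from the observed data alone and why dedicated diagnostics matter.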

Updated: 2025-08-06 02:40:45

Domains: cs.LG,cs.MS

Download: http://arxiv.org/abs/2508.04740v1

Probabilistic Modeling of Jailbreak on Multimodal LLMs: From Quantification to Application

Recently, Multimodal Large Language Models (MLLMs) have demonstrated their superior ability in understanding multimodal content. However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses. Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content. However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate. Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generate a malicious response when prompted with this input. We approximate this probability through multiple queries to MLLMs. After modeling the relationship between input hidden states and their corresponding jailbreak probability using a Jailbreak Probability Prediction Network (JPPN), we use continuous jailbreak probability for optimization. Specifically, we propose the Jailbreak-Probability-based Attack (JPA), which optimizes adversarial perturbations on the input image to maximize jailbreak probability, and further enhance it as Multimodal JPA (MJPA) by including monotonic text rephrasing. To counteract attacks, we also propose Jailbreak-Probability-based Finetuning (JPF), which minimizes jailbreak probability through MLLM parameter updates. Extensive experiments show that (1) (M)JPA yields significant improvements when attacking a wide range of models under both white- and black-box settings, and (2) JPF vastly reduces jailbreaks, by up to over 60%. Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among input jailbreak abilities.
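The core quantity here is a Bernoulli parameter estimated by repeated sampling. A minimal sketch with a stubbed stochastic judge (the real pipeline queries an MLLM and judges its response; everything below is our own mock):

```python
import numpy as np

def estimate_jailbreak_probability(respond_is_malicious, prompt, n_queries=1000, seed=0):
    """Fraction of n_queries stochastic responses judged malicious.

    respond_is_malicious(prompt, rng) -> bool stands in for one MLLM query
    plus a safety judgment on the response.
    """
    rng = np.random.default_rng(seed)
    hits = sum(respond_is_malicious(prompt, rng) for _ in range(n_queries))
    return hits / n_queries

def stub_model(prompt, rng, p=0.3):
    # Mock "model": produces a malicious response 30% of the time.
    return rng.random() < p
```

The continuous estimate is what makes gradient-style optimization possible: JPA perturbs the input image to push this probability up, while JPF updates model weights to push it down.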

Updated: 2025-08-06 02:38:30

Domains: cs.CR,cs.CV

Download: http://arxiv.org/abs/2503.06989v3

Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement

Graphical user interface (GUI) agents have shown promise in automating mobile tasks but still struggle with input redundancy and decision ambiguity. In this paper, we present RecAgent, an uncertainty-aware agent that addresses these issues through adaptive perception. We distinguish two types of uncertainty in GUI navigation: (1) perceptual uncertainty, caused by input redundancy and noise from comprehensive screen information, and (2) decision uncertainty, arising from ambiguous tasks and complex reasoning. To reduce perceptual uncertainty, RecAgent employs a component recommendation mechanism that identifies and focuses on the most relevant UI elements. For decision uncertainty, it uses an interactive module to request user feedback in ambiguous situations, enabling intent-aware decisions. These components are integrated into a unified framework that proactively reduces input complexity and reacts to high-uncertainty cases via human-in-the-loop refinement. Additionally, we propose a dataset called ComplexAction to evaluate the success rate of GUI agents in executing specified single-step actions within complex scenarios. Extensive experiments validate the effectiveness of our approach. The dataset and code will be available at https://github.com/Fanye12/RecAgent.

Updated: 2025-08-06 02:38:02

Domains: cs.AI

Download: http://arxiv.org/abs/2508.04025v1

Identity Theft in AI Conference Peer Review

We discuss newly uncovered cases of identity theft in the scientific peer-review process within artificial intelligence (AI) research, with broader implications for other academic procedures. We detail how dishonest researchers exploit the peer-review system by creating fraudulent reviewer profiles to manipulate paper evaluations, leveraging weaknesses in reviewer recruitment workflows and identity verification processes. The findings highlight the critical need for stronger safeguards against identity theft in peer review and academia at large, and to this end, we also propose mitigating strategies.

Updated: 2025-08-06 02:36:52

Domains: cs.DL,cs.AI,cs.CR

Download: http://arxiv.org/abs/2508.04024v1

Benchmarking a Tunable Quantum Neural Network on Trapped-Ion and Superconducting Hardware

We implement a quantum generalization of a neural network on trapped-ion and IBM superconducting quantum computers to classify MNIST images, a common benchmark in computer vision. The network feedforward involves qubit rotations whose angles depend on the results of measurements in the previous layer. The network is trained via simulation, but inference is performed experimentally on quantum hardware. The classical-to-quantum correspondence is controlled by an interpolation parameter, $a$, which is zero in the classical limit. Increasing $a$ introduces quantum uncertainty into the measurements, which is shown to improve network performance at moderate values of the interpolation parameter. We then focus on particular images that fail to be classified by a classical neural network but are detected correctly in the quantum network. For such borderline cases, we observe strong deviations from the simulated behavior. We attribute this to physical noise, which causes the output to fluctuate between nearby minima of the classification energy landscape. Such strong sensitivity to physical noise is absent for clear images. We further benchmark physical noise by inserting additional single-qubit and two-qubit gate pairs into the neural network circuits. Our work provides a springboard toward more complex quantum neural networks on current devices: while the approach is rooted in standard classical machine learning, scaling up such networks may prove classically non-simulable and could offer a route to near-term quantum advantage.

Updated: 2025-08-06 02:36:11

Domains: quant-ph,cond-mat.dis-nn,cs.LG

Download: http://arxiv.org/abs/2507.21222v2

Mjölnir: A Deep Learning Parametrization Framework for Global Lightning Flash Density

Recent advances in AI-based weather forecasting models, such as FourCastNet, Pangu-Weather, and GraphCast, have demonstrated the remarkable ability of deep learning to emulate complex atmospheric dynamics. Building on this momentum, we propose Mjölnir, a novel deep learning-based framework for global lightning flash density parameterization. Trained on ERA5 atmospheric predictors and World Wide Lightning Location Network (WWLLN) observations at a daily temporal resolution and 1 degree spatial resolution, Mjölnir captures the nonlinear mapping between large-scale environmental conditions and lightning activity. The model architecture is based on the InceptionNeXt backbone with SENet, and a multi-task learning strategy to simultaneously predict lightning occurrence and magnitude. Extensive evaluations show that Mjölnir accurately reproduces the global distribution, seasonal variability, and regional characteristics of lightning activity, achieving a global Pearson correlation coefficient of 0.96 for annual mean fields. These results suggest that Mjölnir serves not only as an effective data-driven global lightning parameterization but also as a promising AI-based scheme for next-generation Earth system models (AI-ESMs).

Updated: 2025-08-06 02:07:36

Domains: cs.LG,cs.AI,cs.CV,physics.ao-ph

Download: http://arxiv.org/abs/2504.19822v2

PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization. Several existing sub-2-bit post-training quantization (PTQ) methods utilize a mixed-precision scheme by leveraging an unstructured fine-grained mask to explicitly distinguish salient weights, which introduces an extra 1 bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask with a negligible additional 0.0002 bits per weight, based on input activations from the perspective of reducing the upper bound of quantization error, to allocate the corresponding salient weight channels to 4-bit. For non-salient channel binarization, an efficient block-wise scaling factors optimization framework is then presented to take implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty of per-channel extremely low-bit PTQ. Extensive experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Codes are available at https://github.com/zjq0455/PTQ1.61.
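The bit-budget arithmetic is easy to mimic in a toy: a small set of salient channels is kept at 4 bits while the rest are binarized with block-wise scales. The sketch below is our own simplification; the paper's channel selection and scale optimization are far more involved:

```python
import numpy as np

def quantize_mixed(W, salient_cols, block=4):
    """W: (rows, cols). Salient columns -> 4-bit symmetric uniform quantization;
    all other columns -> sign * per-block mean-absolute scale (binarization)."""
    Wq = np.empty_like(W, dtype=np.float64)
    for j in range(W.shape[1]):
        col = W[:, j].astype(np.float64)
        if j in salient_cols:
            scale = np.max(np.abs(col)) / 7.0 + 1e-12     # 4-bit range [-8, 7]
            Wq[:, j] = np.clip(np.round(col / scale), -8, 7) * scale
        else:
            for s in range(0, len(col), block):
                blk = col[s:s + block]
                alpha = np.mean(np.abs(blk))              # block-wise scaling factor
                Wq[s:s + block, j] = np.sign(blk) * alpha
    return Wq
```

With most channels at roughly 1 bit plus a few at 4 bits and a structured (one-dimensional) mask costing almost nothing, the average lands between 1 and 2 bits per weight, which is the regime the 1.61-bit figure refers to.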

Updated: 2025-08-06 02:02:06

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.13179v2

Learning the Simplest Neural ODE

Since the advent of the "Neural Ordinary Differential Equation (Neural ODE)" paper, learning ODEs with deep learning has been applied to system identification, time-series forecasting, and related areas. Exploiting the diffeomorphic nature of ODE solution maps, Neural ODEs have also been used in generative modeling. Despite the rich potential to incorporate various kinds of physical information, training Neural ODEs remains challenging in practice. This study demonstrates, through the simplest one-dimensional linear model, why training Neural ODEs is difficult. We then propose a new stabilization method and provide an analytical convergence analysis. The insights and techniques presented here serve as a concise tutorial for researchers beginning work on Neural ODEs.
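The object of study can be written down completely: learn the scalar a in dx/dt = a·x from an observed endpoint. Even in this tiny setting, the endpoint loss involves exp(aT), whose sensitivity to a is the kind of difficulty the authors analyze. A self-contained example of plain gradient descent on this problem (our construction, not the paper's code):

```python
import math

# Learn a in dx/dt = a*x from x(0) = 1 and an observed endpoint x(T).
a_true, T = 0.5, 1.0
x_T = math.exp(a_true * T)      # "data": exact endpoint of the true ODE

a, lr = 0.0, 0.1
for _ in range(2000):
    pred = math.exp(a * T)                    # closed-form ODE solution
    grad = 2.0 * (pred - x_T) * T * pred      # d/da of (pred - x_T)^2
    a -= lr * grad

print(round(a, 4))   # → 0.5
```

Note the gradient itself carries a factor exp(aT): for larger a or T it explodes, and a fixed learning rate that was stable near a = 0 can diverge, which is one concrete face of the training difficulty.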

Updated: 2025-08-06 02:01:15

Domains: stat.ML,cs.LG,math.DS

Download: http://arxiv.org/abs/2505.02019v2

APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with LLMs remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at a low sampling budget. APOLLO directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sublemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low budget. The repaired subproofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub-8B-parameter models while keeping the sampling budget below one hundred. Moreover, APOLLO raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving. The codebase is available at https://github.com/aziksh-ospanov/APOLLO

Updated: 2025-08-06 01:58:08

Domains: cs.AI,cs.LO

Download: http://arxiv.org/abs/2505.05758v3

Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation

Recommendation systems often suffer from data sparsity caused by limited user-item interactions, which degrade their performance and amplify popularity bias in real-world scenarios. This paper proposes a novel data augmentation framework that leverages Large Language Models (LLMs) and item textual descriptions to enrich interaction data. By few-shot prompting LLMs multiple times to rerank items and aggregating the results via majority voting, we generate high-confidence synthetic user-item interactions, supported by theoretical guarantees based on the concentration of measure. To effectively leverage the augmented data in the context of a graph recommendation system, we integrate it into a graph contrastive learning framework to mitigate distributional shift and alleviate popularity bias. Extensive experiments show that our method improves accuracy and reduces popularity bias, outperforming strong baselines.
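The aggregation step is ordinary majority voting over repeated few-shot reranks. A sketch of that step alone (the prompting is elided; `rankings` stands in for the ranked lists returned by the LLM):

```python
from collections import Counter

def majority_vote_topk(rankings, k):
    """rankings: list of item-id lists, one per LLM rerank query.
    Keep the k items that appear most often across the per-query top-k lists."""
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:k])     # one vote per top-k appearance
    return [item for item, _ in votes.most_common(k)]
```

High-vote items then become synthetic user-item interactions. The paper's concentration-of-measure argument is what bounds the chance that a low-relevance item wins a majority across independent reranks, which is why the aggregated interactions can be treated as high-confidence.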

Updated: 2025-08-06 01:55:06

Domains: cs.IR,cs.LG

Download: http://arxiv.org/abs/2507.21563v2

Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these issues, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps (MBPS) to improve editing performance under limited supervision, together with a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and that the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.
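The MBPS idea — several backpropagation steps on the edit example instead of one, with a norm penalty keeping the weights near their pre-edit values — reduces, in a linear toy model, to the loop below. This is purely illustrative; the paper edits an LLM, not a linear regressor:

```python
import numpy as np

def edit_with_mbps(w, x, y_new, lr=0.2, steps=5, lam=0.01):
    """Multiple gradient steps on one edit example (squared loss),
    plus an L2 penalty lam * ||w - w0||^2 on the update norm."""
    w0 = w.copy()
    for _ in range(steps):
        grad = 2.0 * (w @ x - y_new) * x + 2.0 * lam * (w - w0)
        w = w - lr * grad
    return w

x = np.array([1.0, 0.0, 0.0])
w_edited = edit_with_mbps(np.zeros(3), x, y_new=1.0)
```

Each extra inner step moves the prediction closer to the edit target, which is the claimed benefit under limited supervision; the norm penalty keeps the edit local so the rest of the model's behavior is disturbed as little as possible.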

Updated: 2025-08-06 01:54:58

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04012v1

Dual-Label Learning With Irregularly Present Labels

In multi-task learning, labels are often missing irregularly across samples, which can be fully labeled, partially labeled or unlabeled. The irregular label presence often appears in scientific studies due to experimental limitations. It triggers a demand for a new training and inference mechanism that could accommodate irregularly present labels and maximize their utility. This work focuses on the two-label learning task and proposes a novel training and inference framework, Dual-Label Learning (DLL). The DLL framework formulates the problem into a dual-function system, in which the two functions should simultaneously satisfy standard supervision, structural duality and probabilistic duality. DLL features a dual-tower model architecture that allows for explicit information exchange between labels, aimed at maximizing the utility of partially available labels. During training, missing labels are imputed as part of the forward propagation process, while during inference, labels are predicted jointly as unknowns of a bivariate system of equations. Our theoretical analysis guarantees the feasibility of DLL, and extensive experiments are conducted to verify that by explicitly modeling label correlation and maximizing label utility, our method makes consistently better prediction than baseline approaches by up to 9.6% gain in F1-score or 10.2% reduction in MAPE. Remarkably, DLL maintains robust performance at a label missing rate of up to 60%, achieving even better results than baseline approaches at lower missing rates down to only 10%.
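At inference, the two labels are treated as the unknowns of a coupled system: y1 = f1(x, y2) and y2 = f2(x, y1). With contractive functions this can be solved by fixed-point iteration, which conveys the flavor of the joint prediction (a toy of our own making; the paper uses dual-tower networks for f1 and f2):

```python
def solve_dual_labels(x, f1, f2, iters=100):
    """Jointly infer (y1, y2) satisfying y1 = f1(x, y2) and y2 = f2(x, y1)."""
    y1 = y2 = 0.0
    for _ in range(iters):
        y1 = f1(x, y2)
        y2 = f2(x, y1)
    return y1, y2

# Toy linear system: y1 = x + 0.5*y2, y2 = 2x + 0.2*y1 (|0.5 * 0.2| < 1, so it converges).
y1, y2 = solve_dual_labels(1.0,
                           lambda x, y2: x + 0.5 * y2,
                           lambda x, y1: 2.0 * x + 0.2 * y1)
```

For this linear toy the fixed point matches the closed-form solution y1 = (1 + 0.5·2)/(1 − 0.5·0.2) = 2/0.9; each function explicitly uses the other's label, which is the structural-duality constraint the framework imposes.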

Updated: 2025-08-06 01:54:37

Domains: cs.LG

Download: http://arxiv.org/abs/2410.14380v3

StepWrite: Adaptive Planning for Speech-Driven Text Generation

People frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions--capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually-aware non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models. Unlike baseline methods like standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load, improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite's capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking. This work highlights the potential of structured, context-aware voice interactions in enhancing hands-free and eye-free communication in everyday multitasking scenarios.

Updated: 2025-08-06 01:50:17

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.04011v1

ProtoECGNet: Case-Based Interpretable Deep Learning for Multi-Label ECG Classification with Contrastive Learning

Deep learning-based electrocardiogram (ECG) classification has shown impressive performance but clinical adoption has been slowed by the lack of transparent and faithful explanations. Post hoc methods such as saliency maps may fail to reflect a model's true decision process. Prototype-based reasoning offers a more transparent alternative by grounding decisions in similarity to learned representations of real ECG segments, enabling faithful, case-based explanations. We introduce ProtoECGNet, a prototype-based deep learning model for interpretable, multi-label ECG classification. ProtoECGNet employs a structured, multi-branch architecture that reflects clinical interpretation workflows: it integrates a 1D CNN with global prototypes for rhythm classification, a 2D CNN with time-localized prototypes for morphology-based reasoning, and a 2D CNN with global prototypes for diffuse abnormalities. Each branch is trained with a prototype loss designed for multi-label learning, combining clustering, separation, diversity, and a novel contrastive loss that encourages appropriate separation between prototypes of unrelated classes while allowing clustering for frequently co-occurring diagnoses. We evaluate ProtoECGNet on all 71 diagnostic labels from the PTB-XL dataset, demonstrating competitive performance relative to state-of-the-art black-box models while providing structured, case-based explanations. To assess prototype quality, we conduct a structured clinician review of the final model's projected prototypes, finding that they are rated as representative and clear. ProtoECGNet shows that prototype learning can be effectively scaled to complex, multi-label time-series classification, offering a practical path toward transparent and trustworthy deep learning models for clinical decision support.

Updated: 2025-08-06 01:50:14

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.08713v4

HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.

Updated: 2025-08-06 01:49:32


Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.04010v1

CodonMoE: DNA Language Models for mRNA Analyses

Genomic language models (gLMs) face a fundamental efficiency challenge: either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens - modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand massive parameter counts and extensive cross-modality pretraining. To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages.
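The codon-level mixture idea can be illustrated with non-overlapping 3-mer grouping plus a softmax-gated combination of expert outputs. This is a hedged NumPy sketch of the generic mixture-of-experts pattern, not CodonMoE's actual architecture; all names are ours:

```python
import numpy as np

def split_codons(seq):
    """Group a DNA sequence into codons (non-overlapping 3-mers),
    dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def moe_combine(x, expert_outputs, gate_w):
    """Softmax-gated mixture: weight each expert's output per codon.
    x: (n_codons, d), expert_outputs: (n_codons, n_experts, d_out),
    gate_w: (d, n_experts)."""
    logits = x @ gate_w
    g = np.exp(logits - logits.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)          # per-codon expert weights
    return np.einsum("ne,ned->nd", g, expert_outputs)
```

With sufficient expert capacity, such a gate-weighted combination is what allows a codon-level module to approximate arbitrary codon-sequence-to-property mappings, which is the universal-approximation claim the abstract makes.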

Updated: 2025-08-06 01:40:12


Domains: q-bio.GN,cs.LG

Download: http://arxiv.org/abs/2508.04739v1

Decoupled Contrastive Learning for Federated Learning

Federated learning is a distributed machine learning paradigm that allows multiple participants to train a shared model by exchanging model updates instead of their raw data. However, its performance is degraded compared to centralized approaches due to data heterogeneity across clients. While contrastive learning has emerged as a promising approach to mitigate this, our theoretical analysis reveals a fundamental conflict: its asymptotic assumption of an infinite number of negative samples is violated in the finite-sample regime of federated learning. To address this issue, we introduce Decoupled Contrastive Learning for Federated Learning (DCFL), a novel framework that decouples the existing contrastive loss into two objectives. Decoupling the loss into its alignment and uniformity components enables the independent calibration of the attraction and repulsion forces without relying on the asymptotic assumptions. This strategy provides a contrastive learning method suitable for federated learning environments where each client has a small amount of data. Our experimental results show that DCFL achieves stronger alignment between positive samples and greater uniformity between negative samples compared to existing contrastive learning methods. Furthermore, experimental results on standard benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet, demonstrate that DCFL consistently outperforms state-of-the-art federated learning methods.
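The decoupled objectives mirror the well-known alignment/uniformity decomposition of contrastive loss (Wang and Isola, 2020). A minimal NumPy sketch under that reading, with function names our own rather than DCFL's exact formulation:

```python
import numpy as np

def _normalize(z):
    """Project embeddings onto the unit sphere."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment_loss(z1, z2, alpha=2):
    """Mean distance between positive-pair embeddings (lower = better aligned)."""
    z1, z2 = _normalize(z1), _normalize(z2)
    return float(np.mean(np.linalg.norm(z1 - z2, axis=1) ** alpha))

def uniformity_loss(z, t=2):
    """Log mean Gaussian potential over all distinct pairs
    (lower = embeddings spread more uniformly on the sphere)."""
    z = _normalize(z)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))
```

Because the two terms are separate, the attraction (alignment) and repulsion (uniformity) forces can be weighted independently per client, which is the calibration the abstract describes.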

Updated: 2025-08-06 01:39:54


Domains: cs.LG

Download: http://arxiv.org/abs/2508.04005v1

Advanced DAG-Based Ranking (ADR) Protocol for Blockchain Scalability

In the past decade, blockchain has emerged as a promising solution for building secure distributed ledgers and has attracted significant attention. However, current blockchain systems suffer from limited throughput, poor scalability, and high latency. Due to limitations in consensus mechanisms, especially in managing node identities, blockchain is often considered unsuitable for applications such as the Internet of Things (IoT). This paper proposes the Advanced DAG-based Ranking (ADR) protocol to enhance blockchain scalability and throughput. ADR employs a directed acyclic graph (DAG) structure where nodes are positioned based on their rankings. Unlike traditional chains, ADR allows honest nodes to write blocks and verify transactions using a DAG-based topology. The protocol follows a three-step approach to secure the network against double-spending and enhance performance. First, it verifies nodes using their public and private keys before granting entry. Second, it builds an advanced DAG ledger enabling block production and transaction validation. Third, a ranking algorithm filters out malicious nodes, ranks the remaining nodes based on performance, and arranges them topologically. This process increases throughput and ensures robust scalability. We evaluated ADR on Amazon EC2 clusters with over 100 nodes, including scenarios with injected malicious nodes. Simulation results demonstrate that ADR significantly improves transaction throughput and network liveness compared to existing DAG-based blockchains such as IOTA and ByteBall, making it well-suited for IoT applications.
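The ranking step described above (filter out low-reputation nodes, rank the survivors, slice them into partitions) can be sketched as follows; the function name and the threshold convention are ours, not the protocol's specification:

```python
def rank_and_partition(reputation, threshold, partition_size):
    """Filter out nodes below the reputation threshold, rank the rest
    by reputation (highest first), and slice into disjoint partitions."""
    honest = {node: r for node, r in reputation.items() if r >= threshold}
    ranked = sorted(honest, key=honest.get, reverse=True)
    return [ranked[i:i + partition_size]
            for i in range(0, len(ranked), partition_size)]
```

Re-running this periodically on fresh reputation scores corresponds to the protocol's periodic network reorganization against partition attacks.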

Updated: 2025-08-06 01:27:33


Domains: cs.DC,cs.CR,cs.DB

Download: http://arxiv.org/abs/2508.04000v1

Tensorized Clustered LoRA Merging for Multi-Task Interference

Despite the success of the monolithic dense paradigm of large language models (LLMs), LoRA adapters offer an efficient alternative by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes task interference, degrading downstream performance. To address this, we propose a tensorized clustered LoRA (TC-LoRA) library that targets task interference at both the text level and the parameter level. At the text level, we cluster the training samples in the embedding space to capture input-format similarities, then train a specialized LoRA adapter for each cluster. At the parameter level, we introduce a joint Canonical Polyadic (CP) decomposition that disentangles task-specific and shared factors across LoRA adapters. This joint factorization preserves essential knowledge while reducing cross-task interference. We run extensive experiments on out-of-domain zero-shot and skill-composition tasks, including reasoning, question answering, and coding. Compared to strong SVD-based baselines, TC-LoRA achieves +1.4% accuracy on Phi-3 and +2.3% on Mistral-7B, demonstrating its effectiveness in LLM adaptation.
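The baseline setting in which task interference arises is plain LoRA merging: adding a weighted sum of low-rank updates into the base weight. A minimal sketch of that baseline (the CP-decomposition step that TC-LoRA layers on top is omitted; names are ours):

```python
import numpy as np

def merge_lora(W, adapters, weights):
    """Merge low-rank adapters into a base weight: W + sum_i w_i * (B_i @ A_i).
    Each adapter is (A, B) with A: (r, d_in) and B: (d_out, r)."""
    merged = W.copy()
    for (A, B), w in zip(adapters, weights):
        merged += w * (B @ A)  # rank-r update from one task
    return merged
```

When the adapters come from heterogeneous tasks, their updates can cancel or overwrite each other in this sum; disentangling shared from task-specific factors before merging is what the joint factorization is for.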

Updated: 2025-08-06 01:26:43


Domains: cs.LG

Download: http://arxiv.org/abs/2508.03999v1

CLaSP: Learning Concepts for Time-Series Signals from Natural Language Supervision

This paper presents CLaSP, a novel model for retrieving time-series signals using natural language queries that describe signal characteristics. The ability to search time-series signals based on descriptive queries is essential in domains such as industrial diagnostics, where data scientists often need to find signals with specific characteristics. However, existing methods rely on sketch-based inputs, predefined synonym dictionaries, or domain-specific manual designs, limiting their scalability and adaptability. CLaSP addresses these challenges by employing contrastive learning to map time-series signals to natural language descriptions. Unlike prior approaches, it eliminates the need for predefined synonym dictionaries and leverages the rich contextual knowledge of large language models (LLMs). Using the TRUCE and SUSHI datasets, which pair time-series signals with natural language descriptions, we demonstrate that CLaSP achieves high accuracy in retrieving a variety of time series patterns based on natural language queries.
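The contrastive mapping between signals and captions can be illustrated with a standard CLIP-style symmetric InfoNCE loss over a batch of (signal, description) pairs. This is a hedged NumPy sketch of that generic objective, not CLaSP's actual implementation:

```python
import numpy as np

def clip_style_loss(sig_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (signal, caption) pairs sit on the
    diagonal of the similarity matrix and should out-score all others."""
    s = sig_emb / np.linalg.norm(sig_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (n, n) cosine similarities
    n = len(s)

    def xent(lg):
        # row-wise cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average the signal->text and text->signal directions
    return float((xent(logits) + xent(logits.T)) / 2)
```

At retrieval time the same shared space is reused: a natural-language query is embedded once and signals are ranked by cosine similarity, with no synonym dictionary involved.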

Updated: 2025-08-06 01:01:31


Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2411.08397v3

Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.

Updated: 2025-08-06 00:59:26


Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.03127v2

Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents

Intelligent personal assistants (IPAs) such as Siri and Google Assistant are designed to enhance human capabilities and perform tasks on behalf of users. The emergence of LLM agents brings new opportunities for the development of IPAs. While responsive capabilities have been widely studied, proactive behaviors remain underexplored. Designing an IPA that is proactive, privacy-preserving, and capable of self-evolution remains a significant challenge. Designing such IPAs relies on the cognitive architecture of LLM agents. This work proposes Cognition Forest, a semantic structure designed to align cognitive modeling with system-level design. We unify cognitive architecture and system design into a self-reinforcing loop instead of treating them separately. Based on this principle, we present Galaxy, a framework that supports multidimensional interactions and personalized capability generation. Two cooperative agents are implemented based on Galaxy: KoRa, a cognition-enhanced generative agent that supports both responsive and proactive skills; and Kernel, a meta-cognition-based meta-agent that enables Galaxy's self-evolution and privacy preservation. Experimental results show that Galaxy outperforms multiple state-of-the-art benchmarks. Ablation studies and real-world interaction cases validate the effectiveness of Galaxy.

Updated: 2025-08-06 00:46:38


Domains: cs.AI

Download: http://arxiv.org/abs/2508.03991v1

Are Today's LLMs Ready to Explain Well-Being Concepts?

Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

Updated: 2025-08-06 00:45:02


Domains: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2508.03990v1

Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework

User-controllable privacy is important in modern sensing systems, as privacy preferences can vary significantly from person to person and may evolve over time. This is especially relevant in devices equipped with Inertial Measurement Unit (IMU) sensors, such as smartphones and wearables, which continuously collect rich time-series data that can inadvertently expose sensitive user behaviors. While prior work has proposed privacy-preserving methods for sensor data, most rely on static, predefined privacy labels or require large quantities of private training data, limiting their adaptability and user agency. In this work, we introduce PrivCLIP, a dynamic, user-controllable, few-shot privacy-preserving sensing framework. PrivCLIP allows users to specify and modify their privacy preferences by categorizing activities as sensitive (black-listed), non-sensitive (white-listed), or neutral (gray-listed). Leveraging a multimodal contrastive learning approach, PrivCLIP aligns IMU sensor data with natural language activity descriptions in a shared embedding space, enabling few-shot detection of sensitive activities. When a privacy-sensitive activity is identified, the system uses a language-guided activity sanitizer and a motion generation module (IMU-GPT) to transform the original data into a privacy-compliant version that semantically resembles a non-sensitive activity. We evaluate PrivCLIP on multiple human activity recognition datasets and demonstrate that it significantly outperforms baseline methods in terms of both privacy protection and data utility.
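The few-shot detection plus black/white/gray-list policy can be sketched as cosine matching against label-text embeddings followed by a list lookup. This is an illustrative sketch with names of our own, not PrivCLIP's code:

```python
import numpy as np

def nearest_label(sensor_emb, label_embs, labels):
    """Pick the activity label whose text embedding is most similar
    (cosine) to the IMU segment embedding in the shared space."""
    s = sensor_emb / np.linalg.norm(sensor_emb)
    sims = [float(s @ (e / np.linalg.norm(e))) for e in label_embs]
    return labels[int(np.argmax(sims))]

def privacy_action(label, black_list, white_list):
    """User-controllable policy: sanitize black-listed activities,
    pass white-listed ones, treat everything else as gray-listed."""
    if label in black_list:
        return "sanitize"  # would be routed to the motion-generation module
    if label in white_list:
        return "pass"
    return "gray"
```

Because the lists are plain sets the user edits, preferences can change at runtime without retraining, which is the dynamic control the abstract emphasizes.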

Updated: 2025-08-06 00:44:11


Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.03989v1

The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?

We observe that multimodal large reasoning models (MLRMs) oriented toward human-centric service are highly susceptible to user emotional cues during the deep-thinking stage, often overriding safety protocols or built-in safety checks under high emotional intensity. Inspired by this key insight, we propose EmoAgent, an autonomous adversarial emotion-agent framework that orchestrates exaggerated affective prompts to hijack reasoning pathways. Even when visual risks are correctly identified, models can still produce harmful completions through emotional misalignment. We further identify persistent high-risk failure modes in transparent deep-thinking scenarios, such as MLRMs generating harmful reasoning masked behind seemingly safe responses. These failures expose misalignments between internal inference and surface-level behavior, eluding existing content-based safeguards. To quantify these risks, we introduce three metrics: (1) Risk-Reasoning Stealth Score (RRSS) for harmful reasoning beneath benign outputs; (2) Risk-Visual Neglect Rate (RVNR) for unsafe completions despite visual risk recognition; and (3) Refusal Attitude Inconsistency (RAIC) for evaluating refusal instability under prompt variants. Extensive experiments on advanced MLRMs demonstrate the effectiveness of EmoAgent and reveal deeper emotional cognitive misalignments in model safety behavior.
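Two of the proposed metrics are rates that admit a simple operationalization. The sketch below is our plausible reading of RVNR and RAIC from their one-line descriptions, not the authors' exact definitions:

```python
def rvnr(cases):
    """Risk-Visual Neglect Rate (one plausible reading): fraction of
    unsafe completions among cases where the visual risk was recognized.
    cases: iterable of (risk_recognized: bool, completion_unsafe: bool)."""
    recognized = [unsafe for risk_recognized, unsafe in cases if risk_recognized]
    return sum(recognized) / len(recognized) if recognized else 0.0

def raic(per_request_refusals):
    """Refusal Attitude Inconsistency (one plausible reading): share of
    requests whose prompt variants disagree on whether to refuse.
    per_request_refusals: list of lists of refusal booleans, one list
    of variants per request."""
    inconsistent = [len(set(variants)) > 1 for variants in per_request_refusals]
    return sum(inconsistent) / len(inconsistent)
```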

Updated: 2025-08-06 00:39:28


Domains: cs.AI

Download: http://arxiv.org/abs/2508.03986v1

Reputation-based partition scheme for IoT security

With the popularity of smart terminals such as the Internet of Things, crowdsensing is an emerging data aggregation paradigm that plays a pivotal role in data-driven applications. Key issues in the development of crowdsensing include platform security and privacy protection. Because crowdsensing is usually managed by a centralized platform, centralized management brings various security vulnerabilities and scalability issues. To solve these issues, an effective reputation-based partition scheme (RSPC) is proposed in this article. The partition scheme calculates the optimal partition size from node reputation values and divides the nodes into several disjoint partitions accordingly. By selecting the appropriate partition size, RSPC provides a mechanism to ensure that each partition is valid, as long as the maximum permissible threshold of failed nodes is observed. At the same time, RSPC reorganizes the network periodically to avoid partition attacks. In addition, for cross-partition transactions, this paper proposes a novel four-stage confirmation protocol to ensure that cross-partition transactions complete efficiently and safely. Finally, experiments show that RSPC achieves improved scalability, low latency, and high throughput for crowdsensing.
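The claim that each partition stays valid while the failed-node threshold is observed can be made concrete with a standard binomial tail bound. The sketch models faulty membership in a random partition as Binomial(m, p); it is an illustration of why larger partitions are safer for a fixed fault rate, not the paper's own derivation:

```python
from math import comb

def partition_failure_prob(m, p_faulty, f_max):
    """Probability that a partition of size m exceeds its tolerable
    faulty fraction f_max, with each slot faulty independently with
    probability p_faulty (a simple Binomial(m, p_faulty) model)."""
    k0 = int(f_max * m)  # largest tolerable number of faulty nodes
    return sum(comb(m, k) * p_faulty**k * (1 - p_faulty)**(m - k)
               for k in range(k0 + 1, m + 1))
```

For p_faulty below f_max, this tail probability shrinks as m grows, which is the trade-off an optimal-partition-size calculation would balance against per-partition throughput.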

Updated: 2025-08-06 00:27:59


Domains: cs.DC,cs.CR,cs.DB

Download: http://arxiv.org/abs/2508.03981v1

SPADE-S: A Sparsity-Robust Foundational Forecaster

Despite significant advancements in time series forecasting, accurate modeling of time series with strong heterogeneity in magnitude and/or sparsity patterns remains challenging for state-of-the-art deep learning architectures. We identify several factors that lead existing models to systematically underperform on low-magnitude and sparse time series, including loss functions with implicit biases toward high-magnitude series, training-time sampling methods, and limitations of time series encoding methods. SPADE-S is a robust forecasting architecture that significantly reduces magnitude- and sparsity-based systematic biases and improves overall prediction accuracy. Empirical results demonstrate that SPADE-S outperforms existing state-of-the-art approaches across a diverse set of use cases in demand forecasting. In particular, we show that, depending on the quantile forecast and magnitude of the series, SPADE-S can improve forecast accuracy by up to 15%. This results in P90 overall forecast accuracy gains of 2.21%, 6.58%, and 4.28%, and P50 forecast accuracy gains of 0.92%, 0.77%, and 1.95%, respectively, for each of three distinct datasets, ranging from 3 million to 700 million series, from a large online retailer.
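One common source of the magnitude bias the abstract describes is that absolute-scale quantile (pinball) losses are dominated by high-magnitude series. The sketch below shows the standard pinball loss and one illustrative per-series rescaling; the rescaling choice is our assumption, not the paper's method:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: asymmetric penalty controlled by quantile q.
    Over-prediction costs (1 - q) per unit, under-prediction costs q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def scaled_pinball_loss(y_true, y_pred, q, eps=1e-8):
    """Per-series scaling so low-magnitude series are not drowned out
    when losses are summed across a heterogeneous panel."""
    scale = float(np.mean(np.abs(np.asarray(y_true, dtype=float)))) + eps
    return pinball_loss(y_true, y_pred, q) / scale
```

Without the rescaling, a series averaging thousands of units contributes orders of magnitude more loss than a sparse series averaging near zero, so the model's capacity is spent on the former.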

Updated: 2025-08-06 00:19:32


Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2507.21155v2

By Xinhai (Sean) Zou.