Good Things Come in Pairs: Paired Autoencoders for Inverse Problems
In this book chapter, we discuss recent advances in data-driven approaches for inverse problems. In particular, we focus on the \emph{paired autoencoder} framework, which has proven to be a powerful tool for solving inverse problems in scientific computing. The paired autoencoder framework is a novel approach that leverages the strengths of both data-driven and model-based methods by projecting both the data and the quantity of interest into latent spaces and learning mappings between these latent spaces that serve as surrogate forward and inverse operators. We illustrate the advantages of this approach through numerical experiments, including seismic imaging (a nonlinear inverse problem) and classical inpainting (a linear one). Although the paired autoencoder framework is likelihood-free, it generates multiple data- and model-based reconstruction metrics that help assess whether examples are in or out of distribution. In addition to direct model estimates from data, the paired autoencoder enables latent-space refinement to fit the observed data accurately. Numerical experiments show that this procedure, combined with the latent-space initial guess, is essential for high-quality estimates, even when data noise exceeds the training regime. We also introduce two novel variants that combine variational and paired autoencoder ideas, maintaining the original benefits while enabling sampling for uncertainty analysis.
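As a concrete illustration of the encode, map, decode flow, the paired-autoencoder idea can be sketched with linear (PCA-style) autoencoders and a least-squares latent-to-latent map on a synthetic linear problem. Everything below (the forward operator `F`, the dimensions, the noise level) is hypothetical and only meant to show the mechanics, not the chapter's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear inverse problem: observed data d = F m + noise,
# with low-rank models m (F and all sizes are made up for illustration).
n_model, n_data, n_latent, n_samples = 30, 20, 5, 500
F = rng.normal(size=(n_data, n_model))
M = rng.normal(size=(n_samples, n_latent)) @ rng.normal(size=(n_latent, n_model))
D = M @ F.T + 0.01 * rng.normal(size=(n_samples, n_data))

def fit_linear_autoencoder(X, k):
    """PCA-style linear autoencoder: encoder E (orthonormal columns), decoder E.T."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return Vt[:k].T, mu

E_m, mu_m = fit_linear_autoencoder(M, n_latent)   # model autoencoder
E_d, mu_d = fit_linear_autoencoder(D, n_latent)   # data autoencoder

Z_m = (M - mu_m) @ E_m   # model latents
Z_d = (D - mu_d) @ E_d   # data latents

# Least-squares map between the two latent spaces: this plays the role
# of the surrogate inverse mapping (data latent -> model latent).
A, *_ = np.linalg.lstsq(Z_d, Z_m, rcond=None)

def surrogate_inverse(d):
    """Map data to a model estimate entirely through the latent spaces."""
    return ((d - mu_d) @ E_d) @ A @ E_m.T + mu_m

M_hat = surrogate_inverse(D)
rel_err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
print(f"relative error of latent-space inversion: {rel_err:.3f}")
```

The intermediate latent estimate `((d - mu_d) @ E_d) @ A` is also the natural starting point for the latent-space refinement the abstract describes.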
Updated: 2025-08-18 23:57:40
Categories: cs.LG,stat.ML,68T99
Semi-Supervised Anomaly Detection Pipeline for SOZ Localization Using Ictal-Related Chirp
This study presents a quantitative framework for evaluating the spatial concordance between clinically defined seizure onset zones (SOZs) and statistically anomalous channels identified through time-frequency analysis of chirp events. The proposed pipeline employs a two-step methodology: (1) Unsupervised Outlier Detection, where Local Outlier Factor (LOF) analysis with adaptive neighborhood selection identifies anomalous channels based on spectro-temporal features of chirp events (onset frequency, offset frequency, and temporal duration); and (2) Spatial Correlation Analysis, which computes both exact co-occurrence metrics and weighted index similarity, incorporating hemispheric congruence and electrode proximity. Key findings demonstrate that the LOF-based approach (n_neighbors=20, contamination=0.2) effectively detects outliers, with index matching (weighted by channel proximity) outperforming exact matching in SOZ localization. Performance metrics (precision, recall, F1) were highest for seizure-free patients (Index Precision mean: 0.903) and those with successful surgical outcomes (Index Precision mean: 0.865), whereas failure cases exhibited lower concordance (Index Precision mean: 0.460). The key takeaway is that chirp-based outlier detection, combined with weighted spatial metrics, provides a complementary method for SOZ localization, particularly in patients with successful surgical outcomes.
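Step (1) maps directly onto scikit-learn's `LocalOutlierFactor` with the parameters quoted in the abstract; a minimal sketch in which the per-channel chirp feature values are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)

# Hypothetical per-channel chirp features: [onset_freq, offset_freq, duration]
typical = rng.normal(loc=[12.0, 30.0, 4.0], scale=[1.0, 2.0, 0.5], size=(80, 3))
atypical = rng.uniform(low=[20.0, 50.0, 8.0], high=[60.0, 120.0, 20.0], size=(20, 3))
features = np.vstack([typical, atypical])

# Parameters from the study: n_neighbors=20, contamination=0.2
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.2)
labels = lof.fit_predict(features)               # -1 = outlier, 1 = inlier
anomalous_channels = np.flatnonzero(labels == -1)
print(f"{anomalous_channels.size} of {len(features)} channels flagged as anomalous")
```

In the paper's pipeline, the flagged channel indices would then feed the spatial correlation step against the clinically defined SOZ channels.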
Updated: 2025-08-18 23:54:59
Categories: cs.LG,cs.AI
Performance Comparisons of Reinforcement Learning Algorithms for Sequential Experimental Design
Recent developments in sequential experimental design look to construct a policy that can efficiently navigate the design space, in a way that maximises the expected information gain. Whilst there is work on achieving tractable policies for experimental design problems, there is significantly less work on obtaining policies that are able to generalise well - i.e. able to give good performance despite a change in the underlying statistical properties of the experiments. Conducting experiments sequentially has recently brought about the use of reinforcement learning, where an agent is trained to navigate the design space to select the most informative designs for experimentation. However, there is still a lack of understanding about the benefits and drawbacks of using certain reinforcement learning algorithms to train these agents. In our work, we investigate several reinforcement learning algorithms and their efficacy in producing agents that take maximally informative design decisions in sequential experimental design scenarios. We find that agent performance is impacted depending on the algorithm used for training, and that particular algorithms, using dropout or ensemble approaches, empirically showcase attractive generalisation properties.
Updated: 2025-08-18 23:48:57
Categories: cs.LG,stat.ML
Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Records
Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at https://github.com/MosbahAouad/EarlyPDAC-MML.
Updated: 2025-08-18 23:48:23
Categories: cs.LG,cs.AI
TASER: Table Agents for Schema-guided Extraction and Recommendation
Real-world financial documents report essential information about an entity's financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes, and a single table can span up to 426 rows across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation), that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents perform table detection, classification, extraction, and recommendation by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we manually labeled 22,584 pages (28,150,449 tokens) and 3,213 tables covering $731,685,511,687 of holdings, culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.
Updated: 2025-08-18 23:48:22
Categories: cs.AI,cs.CL,cs.IR,cs.LG
Epistemic Wrapping for Uncertainty Quantification
Uncertainty estimation is pivotal in machine learning, especially for classification tasks, as it improves the robustness and reliability of models. We introduce a novel `Epistemic Wrapping' methodology aimed at improving uncertainty estimation in classification. Our approach uses Bayesian Neural Networks (BNNs) as a baseline and transforms their outputs into belief function posteriors, effectively capturing epistemic uncertainty and offering an efficient and general methodology for uncertainty quantification. Comprehensive experiments employing a Bayesian Neural Network (BNN) baseline and an Interval Neural Network for inference on the MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 datasets demonstrate that our Epistemic Wrapper significantly enhances generalisation and uncertainty quantification.
Updated: 2025-08-18 23:40:45
Categories: cs.LG
Unsupervised Anomaly Detection Using Diffusion Trend Analysis for Display Inspection
Reconstruction-based anomaly detection via denoising diffusion models has limitations in determining appropriate noise parameters that can degrade anomalies while preserving normal characteristics. Also, normal regions can fluctuate considerably during reconstruction, resulting in false detections. In this paper, we propose a method to detect anomalies by analyzing the reconstruction trend as a function of the degree of degradation, effectively solving both problems that impede practical application in display inspection.
Updated: 2025-08-18 23:39:07
Categories: cs.CV,cs.LG,68T45 (Primary) 68T27 (Secondary),I.2.10
AIM 2025 Rip Current Segmentation (RipSeg) Challenge Report
This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast-moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single-class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, $75$ participants registered for this first edition, resulting in $5$ valid test submissions. Teams were evaluated on a composite score combining $F_1$, $F_2$, $AP_{50}$, and $AP_{[50:95]}$, ensuring robust and application-relevant rankings. The top-performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
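The $F_1$/$F_2$ components of the ranking come from the standard F-beta family, where $\beta > 1$ weights recall more heavily than precision. A minimal sketch; the equal-weight combination below is purely illustrative, since the report's exact weighting of the four metrics is not stated here:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def composite_score(precision, recall, ap50, ap50_95):
    # Equal-weight average of the four metrics named in the report
    # (the challenge's actual weighting is an assumption here).
    f1 = f_beta(precision, recall, 1.0)
    f2 = f_beta(precision, recall, 2.0)
    return (f1 + f2 + ap50 + ap50_95) / 4.0

# A recall-heavy detector benefits under F2: for a beach-safety task,
# missing a rip current is costlier than a false alarm.
print(f_beta(0.6, 0.9, 1.0))
print(f_beta(0.6, 0.9, 2.0))
print(composite_score(0.6, 0.9, 0.55, 0.35))
```

This is why $F_2$ is a natural choice alongside $F_1$ for safety-critical segmentation.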
Updated: 2025-08-18 23:34:56
Categories: cs.CV,cs.AI,I.4.0; I.4.9
A Cost-Effective Framework for Predicting Parking Availability Using Geospatial Data and Machine Learning
As urban populations continue to grow, cities face numerous challenges in managing parking and determining occupancy. This issue is particularly pronounced on university campuses, where students need to find vacant parking spots quickly and conveniently during class times. The limited availability of parking spaces on campuses underscores the necessity of implementing efficient systems to allocate vacant parking spots effectively. We propose a smart framework that integrates multiple data sources, including street maps, mobility, and meteorological data, through a spatial join operation to capture parking behavior and vehicle movement patterns at hourly intervals between 7 AM and 3 PM over three consecutive days. The system requires no sensing tools to be installed in the street or the parking area, since all the data needed is collected using location services. The framework uses the expected parking entrance and time to specify a suitable parking area. Several forecasting models, namely Linear Regression, Support Vector Regression (SVR), Random Forest Regression (RFR), and Long Short-Term Memory (LSTM), are evaluated. Hyperparameter tuning was performed using grid search, and model performance is assessed using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R2). Random Forest Regression achieved the lowest RMSE of 0.142 and the highest R2 of 0.582. However, given the time-series nature of the task, an LSTM model may perform better with additional data and longer timesteps.
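The grid-searched Random Forest pipeline with RMSE/MAE/R2 evaluation can be sketched with scikit-learn; the hourly features and the occupancy target below are synthetic stand-ins (the paper's actual features come from street maps, mobility, and weather data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)

# Hypothetical hourly samples (7 AM-3 PM over 3 days): [hour, day, temperature]
n = 300
X = np.column_stack([
    rng.integers(7, 16, n),        # hour of day
    rng.integers(0, 3, n),         # day index
    rng.normal(25.0, 3.0, n),      # temperature (deg C)
]).astype(float)
# Made-up occupancy ratio: rises through the day, mild weather effect, noise
y = 0.3 + 0.05 * (X[:, 0] - 7) + 0.02 * (X[:, 2] - 25) + 0.05 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Grid search over Random Forest hyperparameters, as in the paper's pipeline
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
grid.fit(X_tr, y_tr)
pred = grid.predict(X_te)

rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
mae = float(mean_absolute_error(y_te, pred))
r2 = float(r2_score(y_te, pred))
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

The same split and metric code would apply unchanged to the SVR and Linear Regression baselines; an LSTM would additionally need the samples reshaped into ordered sequences.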
Updated: 2025-08-18 23:24:19
Categories: cs.LG,cs.AI
"I see models being a whole other thing": An Empirical Study of Pre-Trained Model Naming Conventions and A Tool for Enhancing Naming Consistency
As innovation in deep learning continues, many engineers are incorporating Pre-Trained Models (PTMs) as components in computer systems. Some PTMs are foundation models, and others are fine-tuned variations adapted to different needs. When these PTMs are named well, it facilitates model discovery and reuse. However, prior research has shown that model names are not always well chosen and can sometimes be inaccurate and misleading. The naming practices for PTM packages have not been systematically studied, which hampers engineers' ability to efficiently search for and reliably reuse these models. In this paper, we conduct the first empirical investigation of PTM naming practices in the Hugging Face PTM registry. We begin by reporting on a survey of 108 Hugging Face users, highlighting differences from traditional software package naming and presenting findings on PTM naming practices. The survey results indicate a mismatch between engineers' preferences and current practices in PTM naming. We then introduce DARA, the first automated DNN ARchitecture Assessment technique designed to detect PTM naming inconsistencies. Our results demonstrate that architectural information alone is sufficient to detect these inconsistencies, achieving an accuracy of 94% in identifying model types and promising performance (over 70%) in other architectural metadata as well. We also highlight potential use cases for automated naming tools, such as model validation, PTM metadata generation and verification, and plagiarism detection. Our study provides a foundation for automating naming inconsistency detection. Finally, we envision future work focusing on automated tools for standardizing package naming, improving model selection and reuse, and strengthening the security of the PTM supply chain.
Updated: 2025-08-18 22:56:45
Categories: cs.SE,cs.AI
TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations
The integration of tabular data from diverse sources is often hindered by inconsistencies in formatting and representation, posing significant challenges for data analysts and personal digital assistants. Existing methods for automating tabular data transformations are limited in scope, often focusing on specific types of transformations or lacking interpretability. In this paper, we introduce TabulaX, a novel framework that leverages Large Language Models (LLMs) for multi-class column-level tabular transformations. TabulaX first classifies input columns into four transformation types (string-based, numerical, algorithmic, and general) and then applies tailored methods to generate human-interpretable transformation functions, such as numeric formulas or programming code. This approach enhances transparency and allows users to understand and modify the mappings. Through extensive experiments on real-world datasets from various domains, we demonstrate that TabulaX outperforms existing state-of-the-art approaches in terms of accuracy, supports a broader class of transformations, and generates interpretable transformations that can be efficiently applied.
Updated: 2025-08-18 22:48:36
Categories: cs.DB,cs.LG
PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform
User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than a half billion users across various applications.
Updated: 2025-08-18 22:31:29
Categories: cs.LG,cs.IR
Language-Guided Multi-Agent Learning in Simulations: A Unified Framework and Evaluation
This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.
Updated: 2025-08-18 22:30:32
Categories: cs.AI,cs.LG,cs.MA
SPANER: Shared Prompt Aligner for Multimodal Semantic Representation
Recent advances in multimodal Parameter-Efficient Fine-Tuning (PEFT) have significantly improved performance on downstream tasks such as few-shot retrieval. However, most existing approaches focus on task-specific gains while neglecting the structure of the multimodal embedding space. As a result, modality-specific representations often remain isolated, limiting cross-modal generalisation. In this work, we introduce Shared Prompt AligNER (SPANER), a modality-agnostic PEFT framework designed to embed inputs from diverse modalities into a unified semantic space. At its core, SPANER employs a shared prompt mechanism that acts as a conceptual anchor, enabling semantically related instances to converge spatially regardless of modality. This shared prompt design is inherently extensible, supporting the seamless integration of additional modalities, such as audio, without altering the core architecture. Through comprehensive experiments across vision-language and audio-visual benchmarks, SPANER demonstrates competitive few-shot retrieval performance while preserving high semantic coherence in the learned embedding space. Our results highlight the importance of aligning embedding structures, rather than merely tuning adapter weights, for scalable multimodal learning.
Updated: 2025-08-18 22:20:42
Categories: cs.AI
MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle -- Explore, Examine, and Enhance -- to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output -- videos with narratives and background music.
Updated: 2025-08-18 22:18:46
Categories: cs.CV,cs.AI,cs.MA
Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by <think> and <answer> tags. On demanding postgraduate-level problems, Datarus exhibits an "AHA-moment" pattern: it sketches hypotheses, revises them once or twice, and converges, avoiding the circular, token-inflating loops common to contemporary systems. Across standard public benchmarks Datarus surpasses similar-size models and even reaches the level of larger reasoning models such as QwQ-32B, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens per solution.
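A consumer of reflection-mode output only needs to split the completion on the two tags named in the abstract. A minimal parser sketch; the tag names come from the abstract, but the example completion and the parsing logic are illustrative:

```python
import re

def parse_reflection(output: str) -> dict:
    """Extract the reasoning trace and final answer from a reflection-mode
    completion delimited by <think>/<answer> tags."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
    }

# Invented example completion in the tagged format
completion = """<think>
The integrand is symmetric, so the odd term vanishes.
</think>
<answer>pi / 2</answer>"""

result = parse_reflection(completion)
print(result["answer"])
```

Agentic-mode output would instead interleave ReAct-tagged steps with tool calls, so it needs a step-by-step parser rather than a single pair of tags.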
Updated: 2025-08-18 21:58:18
Categories: cs.CL,cs.AI
Batching-Aware Joint Model Onloading and Offloading for Hierarchical Multi-Task Inference
The growing demand for intelligent services on resource-constrained edge devices has spurred the development of collaborative inference systems that distribute workloads across end devices, edge servers, and the cloud. While most existing frameworks focus on single-task, single-model scenarios, many real-world applications (e.g., autonomous driving and augmented reality) require concurrent execution of diverse tasks including detection, segmentation, and depth estimation. In this work, we propose a unified framework to jointly decide which multi-task models to deploy (onload) at clients and edge servers, and how to route queries across the hierarchy (offload) to maximize overall inference accuracy under memory, compute, and communication constraints. We formulate this as a mixed-integer program and introduce J3O (Joint Optimization of Onloading and Offloading), an alternating algorithm that (i) greedily selects models to onload via Lagrangian-relaxed submodular optimization and (ii) determines optimal offloading via constrained linear programming. We further extend J3O to account for batching at the edge, maintaining scalability under heterogeneous task loads. Experiments show J3O consistently achieves over $97\%$ of the optimal accuracy while incurring less than $15\%$ of the runtime required by the optimal solver across multi-task benchmarks.
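The onloading step can be illustrated with a toy accuracy-per-memory greedy heuristic under a memory budget. This is a stand-in for the Lagrangian-relaxed submodular selection in J3O, and the model names, accuracy gains, and memory costs are made up:

```python
def greedy_onload(models, memory_budget):
    """Greedy model onloading sketch: repeatedly pick the model with the best
    marginal accuracy gain per unit of memory that still fits the budget.
    models: name -> (accuracy_gain, memory_cost)."""
    selected, used = [], 0.0
    remaining = dict(models)
    while remaining:
        feasible = {n: (g, m) for n, (g, m) in remaining.items()
                    if used + m <= memory_budget}
        if not feasible:
            break  # nothing left fits the remaining budget
        best = max(feasible, key=lambda n: feasible[n][0] / feasible[n][1])
        gain, mem = remaining.pop(best)
        selected.append(best)
        used += mem
    return selected, used

# Hypothetical multi-task models: (accuracy gain, memory in GB)
models = {"detector": (0.30, 4.0), "segmenter": (0.25, 6.0), "depth": (0.10, 5.0)}
chosen, used = greedy_onload(models, memory_budget=10.0)
print(chosen, used)
```

In the full formulation, this selection alternates with a constrained linear program that routes (offloads) queries across the client/edge/cloud hierarchy given the chosen placements.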
Updated: 2025-08-18 21:49:32
Categories: cs.LG
Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
ASR systems often struggle with maintaining syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audio and rich entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into transcription, setting a foundation for robust, context-aware ASR in long-form speech.
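The optimal-transport alignment between Whisper and LLaMA token sequences can be sketched with a generic entropy-regularized Sinkhorn solver (a standard OT sketch with uniform marginals, not the paper's exact alignment procedure; the cost matrix would come from projected token embeddings):

```python
import numpy as np

def sinkhorn_alignment(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal-transport plan between token sequences.

    `cost[i, j]` is a distance between student token i and teacher token j.
    The returned coupling has (approximately) uniform marginals and can be
    used to weight a token-level distillation loss even when the two
    sequences have different lengths.
    """
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / reg)                          # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iters):                         # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

The soft coupling lets every student token draw supervision from the teacher tokens it is closest to, which is what makes distillation across mismatched dimensions and sequence lengths workable.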
Updated: 2025-08-18 21:37:09
Categories: cs.CL,cs.AI
OrbitChain: Orchestrating In-orbit Real-time Analytics of Earth Observation Data
Earth observation analytics have the potential to serve many time-sensitive applications. However, due to limited bandwidth and duration of ground-satellite connections, it takes hours or even days to download and analyze data from existing Earth observation satellites, making real-time demands like timely disaster response impossible. Toward real-time analytics, we introduce OrbitChain, a collaborative analytics framework that orchestrates computational resources across multiple satellites in an Earth observation constellation. OrbitChain decomposes analytics applications into microservices and allocates computational resources for time-constrained analysis. A traffic routing algorithm is devised to minimize the inter-satellite communication overhead. OrbitChain adopts a pipeline workflow that completes Earth observation tasks in real time, facilitating time-sensitive applications and inter-constellation collaborations such as tip-and-cue. To evaluate OrbitChain, we implement a hardware-in-the-loop orbital computing testbed. Experiments show that our system can complete up to 60% more analytics workload than an existing Earth observation analytics framework while reducing the communication overhead by up to 72%.
Updated: 2025-08-18 21:33:32
Categories: cs.DC,cs.ET,cs.LG,cs.NI
LOOP: A Plug-and-Play Neuro-Symbolic Framework for Enhancing Planning in Autonomous Systems
Planning is one of the most critical tasks in autonomous systems, where even a small error can lead to major failures or million-dollar losses. Current state-of-the-art neural planning approaches struggle with complex domains, producing plans with missing preconditions, inconsistent goals, and hallucinations. While classical planners provide logical guarantees, they lack the flexibility and natural language understanding capabilities needed for modern autonomous systems. Existing neuro-symbolic approaches use one-shot translation from natural language to formal plans, missing the opportunity for neural and symbolic components to work and refine solutions together. To address this gap, we develop LOOP -- a novel neuro-symbolic planning framework that treats planning as an iterative conversation between neural and symbolic components rather than simple translation. LOOP integrates 13 coordinated neural features including graph neural networks for spatial relationships, multi-agent validation for consensus-based correctness, hierarchical decomposition for complex task management, and causal memory that learns from both successes and failures. Unlike existing approaches, LOOP generates PDDL specifications, refines them iteratively based on symbolic feedback, and builds a causal knowledge base from execution traces. LOOP was evaluated on six standard IPC benchmark domains, where it achieved an 85.8% success rate compared to LLM+P (55.0%), LLM-as-Planner (19.2%), and Tree-of-Thoughts (3.3%). This work shows that the key to reliable planning is not choosing between neural networks and symbolic reasoners but making them actually ``talk'' to each other during the entire process. LOOP provides a thorough blueprint for building autonomous systems that can finally be trusted with critical real-world applications.
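The "iterative conversation" framing reduces to a generate-validate-refine loop; a minimal sketch with placeholder callables (the real generator would be an LLM producing PDDL and the validator a symbolic plan checker, neither of which is reproduced here):

```python
def neuro_symbolic_loop(generate, validate, max_rounds=5):
    """Iterative conversation between a neural generator and a symbolic checker.

    `generate(feedback)` returns a candidate plan (e.g., a PDDL string) given
    the previous round's feedback (None on the first round), and
    `validate(plan)` returns (ok, feedback). Instead of a one-shot
    translation, validator feedback is fed back to the generator until the
    plan checks out or the round budget is exhausted.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        plan = generate(feedback)
        ok, feedback = validate(plan)
        if ok:
            return plan, round_no
    return None, max_rounds
```

The returned round count is a cheap diagnostic: a plan accepted on round 1 never exercised the symbolic feedback channel, while later rounds did.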
Updated: 2025-08-18 21:21:21
Categories: cs.AI
A Risk Manager for Intrusion Tolerant Systems: Enhancing HAL 9000 with New Scoring and Data Sources
Intrusion Tolerant Systems (ITSs) have become increasingly critical due to the rise of multi-domain adversaries exploiting diverse attack surfaces. ITS architectures aim to tolerate intrusions, ensuring system compromise is prevented or mitigated even with adversary presence. Existing ITS solutions often employ Risk Managers leveraging public security intelligence to adjust system defenses dynamically against emerging threats. However, these approaches rely heavily on databases like NVD and ExploitDB, which require manual analysis for newly discovered vulnerabilities. This dependency limits the system's responsiveness to rapidly evolving threats. HAL 9000, an ITS Risk Manager introduced in our prior work, addressed these challenges through machine learning. By analyzing descriptions of known vulnerabilities, HAL 9000 predicts and assesses new vulnerabilities automatically. To calculate the risk of a system, it also incorporates the Exploitability Probability Scoring system to estimate the likelihood of exploitation within 30 days, enhancing proactive defense capabilities. Despite its success, HAL 9000's reliance on NVD and ExploitDB knowledge is a limitation, considering the availability of other sources of information. This extended work introduces a custom-built scraper that continuously mines diverse threat sources, including security advisories, research forums, and real-time exploit proofs-of-concept. This significantly expands HAL 9000's intelligence base, enabling earlier detection and assessment of unverified vulnerabilities. Our evaluation demonstrates that integrating scraper-derived intelligence with HAL 9000's risk management framework substantially improves its ability to address emerging threats. This paper details the scraper's integration into the architecture, its role in providing additional information on new threats, and the effects on HAL 9000's management.
Updated: 2025-08-18 21:16:05
Categories: cs.CR,cs.LG
Adaptive Conformal Prediction Intervals Over Trajectory Ensembles
Future trajectories play an important role across domains such as autonomous driving, hurricane forecasting, and epidemic modeling, where practitioners commonly generate ensemble paths by sampling probabilistic models or leveraging multiple autoregressive predictors. While these trajectories reflect inherent uncertainty, they are typically uncalibrated. We propose a unified framework based on conformal prediction that transforms sampled trajectories into calibrated prediction intervals with theoretical coverage guarantees. By introducing a novel online update step and an optimization step that captures inter-step dependencies, our method can produce discontinuous prediction intervals around each trajectory, naturally capture temporal dependencies, and yield sharper, more adaptive uncertainty estimates.
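The textbook baseline the paper builds on, split conformal prediction, fits in a few lines (this is the standard single-step recipe, not the paper's trajectory-ensemble extension with online updates and inter-step dependencies):

```python
import math

def split_conformal_interval(cal_preds, cal_truths, new_pred, alpha=0.1):
    """Split-conformal prediction interval from absolute residuals.

    Calibrates on held-out (prediction, truth) pairs and returns an interval
    around `new_pred` with ~(1 - alpha) marginal coverage under
    exchangeability.
    """
    scores = sorted(abs(p - y) for p, y in zip(cal_preds, cal_truths))
    n = len(scores)
    # Finite-sample-corrected quantile index: ceil((n + 1)(1 - alpha)).
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    return new_pred - q, new_pred + q
```

Applying such a calibration independently at every time step is exactly what produces the overly conservative, dependence-blind bands that the abstract's optimization step over inter-step dependencies is designed to sharpen.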
Updated: 2025-08-18 21:14:07
Categories: cs.LG
DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos
Human mesh recovery (HMR) provides rich human body information for various real-world applications. While image-based HMR methods have achieved impressive results, they often struggle to recover humans in dynamic scenarios, leading to temporal inconsistencies and non-smooth 3D motion predictions due to the absence of human motion. In contrast, video-based approaches leverage temporal information to mitigate this issue. In this paper, we present DiffMesh, an innovative motion-aware Diffusion-like framework for video-based HMR. DiffMesh establishes a bridge between diffusion models and human motion, efficiently generating accurate and smooth output mesh sequences by incorporating human motion within the forward process and reverse process in the diffusion model. Extensive experiments are conducted on the widely used datasets (Human3.6M and 3DPW), which demonstrate the effectiveness and efficiency of our DiffMesh. Visual comparisons in real-world scenarios further highlight DiffMesh's suitability for practical applications.
Updated: 2025-08-18 21:08:32
Categories: cs.CV,cs.AI,cs.HC,cs.MM
DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework
Digital twin (DT) technology enables real-time simulation, prediction, and optimization of physical systems, but practical deployment faces challenges from high data requirements, proprietary data constraints, and limited adaptability to evolving conditions. This work introduces DDD-GenDT, a dynamic data-driven generative digital twin framework grounded in the Dynamic Data-Driven Application Systems (DDDAS) paradigm. The architecture comprises the Physical Twin Observation Graph (PTOG) to represent operational states, an Observation Window Extraction process to capture temporal sequences, a Data Preprocessing Pipeline for sensor structuring and filtering, and an LLM ensemble for zero-shot predictive inference. By leveraging generative AI, DDD-GenDT reduces reliance on extensive historical datasets, enabling DT construction in data-scarce settings while maintaining industrial data privacy. The DDDAS feedback mechanism allows the DT to autonomically adapt predictions to physical twin (PT) wear and degradation, supporting DT-aging, which ensures progressive synchronization of DT with PT evolution. The framework is validated using the NASA CNC milling dataset, with spindle current as the monitored variable. In a zero-shot setting, the GPT-4-based DT achieves an average RMSE of 0.479 A (4.79% of the 10 A spindle current), accurately modeling nonlinear process dynamics and PT aging without retraining. These results show that DDD-GenDT provides a generalizable, data-efficient, and adaptive DT modeling approach, bridging generative AI with the performance and reliability requirements of industrial DT applications.
Updated: 2025-08-18 21:07:07
Categories: cs.LG,cs.AI,cs.SY,eess.SY
Fusing Echocardiography Images and Medical Records for Continuous Patient Stratification
Deep learning enables automatic and robust extraction of cardiac function descriptors from echocardiographic sequences, such as ejection fraction or strain. These descriptors provide fine-grained information that physicians consider, in conjunction with more global variables from the clinical record, to assess patients' condition. Drawing on novel Transformer models applied to tabular data, we propose a method that considers all descriptors extracted from medical records and echocardiograms to learn the representation of a cardiovascular pathology with a difficult-to-characterize continuum, namely hypertension. Our method first projects each variable into its own representation space using modality-specific approaches. These standardized representations of multimodal data are then fed to a Transformer encoder, which learns to merge them into a comprehensive representation of the patient through the task of predicting a clinical rating. This stratification task is formulated as an ordinal classification to enforce a pathological continuum in the representation space. We observe the major trends along this continuum on a cohort of 239 hypertensive patients, providing unprecedented details in the description of hypertension's impact on various cardiac function descriptors. Our analysis shows that i) the XTab foundation model's architecture allows to reach outstanding performance (96.8% AUROC) even with limited data (less than 200 training samples), ii) stratification across the population is reproducible between trainings (within 5.7% mean absolute error), and iii) patterns emerge in descriptors, some of which align with established physiological knowledge about hypertension, while others could pave the way for a more comprehensive understanding of this pathology. Code is available at https://github.com/creatis-myriad/didactic.
Updated: 2025-08-18 21:06:32
Categories: cs.CV,cs.AI,cs.LG
Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT
This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can perform real-time transcription, achieving streaming translation in real-time remains a significant challenge. To address this issue, we propose a simultaneous translation approach that effectively balances translation quality and latency. We also investigate efficient integration of ASR and MT, leveraging linguistic cues generated by the ASR system to manage context and utilizing efficient beam-search pruning techniques such as time-out and forced finalization to maintain system's real-time factor. We apply our approach to an on-device bilingual conversational speech translation and demonstrate that our techniques outperform baselines in terms of latency and quality. Notably, our technique narrows the quality gap with non-streaming translation systems, paving the way for more accurate and efficient real-time speech translation.
Updated: 2025-08-18 21:00:11
Categories: cs.CL,cs.AI
Silentflow: Leveraging Trusted Execution for Resource-Limited MPC via Hardware-Algorithm Co-design
Secure Multi-Party Computation (MPC) offers a practical foundation for privacy-preserving machine learning at the edge, with MPC commonly employed to support nonlinear operations. These MPC protocols fundamentally rely on Oblivious Transfer (OT), particularly Correlated OT (COT), to generate correlated randomness essential for secure computation. Although COT generation is efficient in conventional two-party settings with resource-rich participants, it becomes a critical bottleneck in real-world inference on resource-constrained devices (e.g., IoT sensors and wearables), due to both communication latency and limited computational capacity. To enable real-time secure inference, we introduce Silentflow, a highly efficient Trusted Execution Environment (TEE)-assisted protocol that eliminates communication in COT generation. We tackle the core performance bottleneck-low computational intensity-through structured algorithmic decomposition: kernel fusion for parallelism, Blocked On-chip eXpansion (BOX) to improve memory access patterns, and vectorized batch operations to maximize memory bandwidth utilization. Through design space exploration, we balance end-to-end latency and resource demands, achieving up to 39.51x speedup over state-of-the-art protocols. By offloading COT computations to a Zynq-7000 SoC, SilentFlow accelerates PPMLaaS inference on the ImageNet dataset under resource constraints, achieving a 4.62x and 3.95x speedup over Cryptflow2 and Cheetah, respectively.
Updated: 2025-08-18 21:00:10
Categories: cs.CR
Counterfactual Probabilistic Diffusion with Expert Models
Predicting counterfactual distributions in complex dynamical systems is essential for scientific modeling and decision-making in domains such as public health and medicine. However, existing methods often rely on point estimates or purely data-driven models, which tend to falter under data scarcity. We propose a time series diffusion-based framework that incorporates guidance from imperfect expert models by extracting high-level signals to serve as structured priors for generative modeling. Our method, ODE-Diff, bridges mechanistic and data-driven approaches, enabling more reliable and interpretable causal inference. We evaluate ODE-Diff across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies, demonstrating that it consistently outperforms strong baselines in both point prediction and distributional accuracy.
Updated: 2025-08-18 20:44:32
Categories: cs.LG,cs.AI,stat.ME
Augmented Adversarial Trigger Learning
Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs. We released our code https://github.com/QData/ALTA_Augmented_Adversarial_Trigger_Learning
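The reweighted objective at ATLA's core can be sketched as a weighted negative log-likelihood over target tokens (the weight value and mask construction below are illustrative assumptions; the paper's exact weighting scheme may differ):

```python
import numpy as np

def weighted_nll(token_logprobs, format_mask, format_weight=2.0):
    """Weighted negative log-likelihood over target tokens.

    `token_logprobs[i]` is the model's log-probability of target token i and
    `format_mask[i]` flags response-format tokens, which receive extra
    weight so an adversarial trigger optimized against this loss pushes the
    model toward the desired response format rather than the exact wording.
    """
    w = np.where(np.asarray(format_mask, dtype=bool), format_weight, 1.0)
    lp = np.asarray(token_logprobs, dtype=float)
    return float(-(w * lp).sum() / w.sum())
```

With `format_weight=1.0` this collapses to the plain NLL used by prior trigger-learning work, so the formulation is a strict generalization.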
Updated: 2025-08-18 20:09:56
Categories: cs.LG,cs.AI
Dimension lower bounds for linear approaches to function approximation
This short note presents a linear algebraic approach to proving dimension lower bounds for linear methods that solve $L^2$ function approximation problems. The basic argument has appeared in the literature before (e.g., Barron, 1993) for establishing lower bounds on Kolmogorov $n$-widths. The argument is applied to give sample size lower bounds for kernel methods.
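The linear-algebraic argument alluded to can be sketched as follows (a standard reconstruction of the n-width lower bound in the spirit of Barron (1993); the notation here is ours, not the note's):

```latex
% Suppose a linear method returns approximations in a fixed subspace
% $V_n \subset L^2$ with $\dim V_n = n$, and the target class contains
% $r\varphi_1, \ldots, r\varphi_m$ for an orthonormal system $(\varphi_i)$
% and a scale $r > 0$. With $(e_j)_{j=1}^{n}$ an orthonormal basis of $V_n$,
% Bessel's inequality gives
\sum_{i=1}^{m} \| P_{V_n} \varphi_i \|^2
  = \sum_{j=1}^{n} \sum_{i=1}^{m} \lvert \langle \varphi_i, e_j \rangle \rvert^2
  \le \sum_{j=1}^{n} \| e_j \|^2 = n,
% so some index $i$ satisfies $\| P_{V_n} \varphi_i \|^2 \le n/m$, and hence
\inf_{g \in V_n} \| r\varphi_i - g \|
  = r \sqrt{1 - \| P_{V_n} \varphi_i \|^2}
  \ge r \sqrt{1 - n/m}.
```

Taking $m = 2n$ shows that any $n$-dimensional linear method incurs worst-case error at least $r/\sqrt{2}$ on such a class, which is the shape of Kolmogorov $n$-width lower bound that the note then converts into sample size lower bounds for kernel methods.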
Updated: 2025-08-18 20:04:46
Categories: cs.LG,math.ST,stat.TH
SymMatika: Structure-Aware Symbolic Discovery
Symbolic regression (SR) seeks to recover closed-form mathematical expressions that describe observed data. While existing methods have advanced the discovery of either explicit mappings (i.e., $y = f(\mathbf{x})$) or discovering implicit relations (i.e., $F(\mathbf{x}, y)=0$), few modern and accessible frameworks support both. Moreover, most approaches treat each expression candidate in isolation, without reusing recurring structural patterns that could accelerate search. We introduce SymMatika, a hybrid SR algorithm that combines multi-island genetic programming (GP) with a reusable motif library inspired by biological sequence analysis. SymMatika identifies high-impact substructures in top-performing candidates and reintroduces them to guide future generations. Additionally, it incorporates a feedback-driven evolutionary engine and supports both explicit and implicit relation discovery using implicit-derivative metrics. Across benchmarks, SymMatika achieves state-of-the-art recovery rates on the Nguyen and Feynman benchmark suites, an impressive recovery rate of 61\% on Nguyen-12 compared to the next best 2\%, and strong placement on the error-complexity Pareto fronts on the Feynman equations and on a subset of 57 SRBench Black-box problems. Our results demonstrate the power of structure-aware evolutionary search for scientific discovery. To support broader research in interpretable modeling and symbolic discovery, we have open-sourced the full SymMatika framework.
Updated: 2025-08-18 20:02:13
Categories: cs.LG
Investigating the importance of county-level characteristics in opioid-related mortality across the United States
The opioid crisis remains a critical public health challenge in the United States. Despite national efforts which reduced opioid prescribing rates by nearly 45\% between 2011 and 2021, opioid-related overdose deaths more than tripled during the same period. This alarming trend reflects a major shift in the crisis, with illegal opioids now driving the majority of overdose deaths instead of prescription opioids. Although much attention has been given to supply-side factors fueling this transition, the underlying structural conditions that perpetuate and exacerbate opioid misuse remain less understood. Moreover, the COVID-19 pandemic intensified the opioid crisis through widespread social isolation and record-high unemployment; consequently, understanding the underlying drivers of this epidemic has become even more crucial in recent years. To address this need, our study examines the correlation between opioid-related mortality and thirteen county-level characteristics related to population traits, economic stability, and infrastructure. Leveraging a nationwide county-level dataset spanning consecutive years from 2010 to 2022, this study integrates empirical insights from exploratory data analysis with feature importance metrics derived from machine learning models. Our findings highlight critical regional characteristics strongly correlated with opioid-related mortality, emphasizing their potential roles in worsening the epidemic when their levels are high and mitigating it when their levels are low.
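The exploratory correlation step can be sketched as ranking candidate characteristics by the magnitude of their Pearson correlation with the outcome (the feature names below are hypothetical; the study additionally uses model-based feature-importance metrics, which this sketch omits):

```python
import math

def rank_by_correlation(features, target):
    """Rank candidate drivers by |Pearson correlation| with the outcome.

    `features` maps a county-level characteristic name to its per-county
    values; `target` holds the matching mortality rates. Returns the
    characteristic names sorted from most to least correlated in magnitude.
    """
    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / math.sqrt(vx * vy)

    return sorted(features,
                  key=lambda f: abs(pearson(features[f], target)),
                  reverse=True)
```

Correlation rankings like this are directional evidence only; the machine-learning feature-importance metrics in the study are what guard against confounded or purely linear readings.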
Updated: 2025-08-18 19:56:22
Categories: cs.CY,cs.LG
Recipes for Pre-training LLMs with MXFP8
Using fewer bits to represent model parameters and related tensors during pre-training has become a required technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats introduced in NVIDIA Blackwell generation of GPUs represent a major advancement of this technique, making it practical to combine narrow floating-point data types with finer granularity per-block scaling factors. In turn, this enables both quantization of more tensors than previous approaches and more efficient execution of operations on those tensors. Effective use of MX-formats requires careful choices of various parameters. In this paper we review these choices and show how MXFP8-E4M3 datatype and a specific number conversion algorithm result in training sessions that match those carried out in BF16. We present results using models with up to 8B parameters, trained on high-quality datasets of up to 15T tokens.
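The core MX idea, a shared power-of-two scale per small block of elements, can be illustrated with the scale-factor computation alone (a simplified sketch: MX blocks are typically 32 elements with an 8-bit power-of-two scale, and 448 is the largest normal FP8 E4M3 magnitude; rounding of the scaled elements to actual E4M3 values is omitted):

```python
import math

def mx_block_scale(block, elem_max=448.0):
    """Per-block power-of-two scale factor in the spirit of MX formats.

    Picks the smallest power of two s such that max|x| / s fits within the
    element format's maximum magnitude (448 for FP8 E4M3), so every scaled
    element is representable without overflow.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0  # all-zero block: any scale works
    # Exponent of the scale: ceil(log2(amax / elem_max)).
    e = math.ceil(math.log2(amax / elem_max))
    return 2.0 ** e
```

Because the scale is shared by only a handful of elements rather than a whole tensor, one outlier inflates the quantization step of its own block only, which is what makes the narrow element format usable for more tensors.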
Updated: 2025-08-18 19:51:06
Categories: cs.LG,cs.AI,cs.DC
X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms
Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems - primarily optimized for NVIDIA GPUs - perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs - 10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at https://github.com/Supercomputing-System-AI-Lab/X-MoE.
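The large top-k routing that drives the all-to-all traffic can be sketched as a gating computation (a generic MoE routing sketch, not X-MoE's kernels; the dispatch of tokens to experts across devices is the part this decides):

```python
import numpy as np

def topk_route(router_logits, k):
    """Top-k expert routing with renormalized gate weights.

    For each token row, keeps the k largest router logits, softmaxes over
    just those experts, and zeroes the rest. The nonzero pattern of the
    result is exactly the token-to-expert dispatch that the all-to-all
    exchange must realize. Shapes: (tokens, experts) -> (tokens, experts).
    """
    idx = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    gates = np.zeros_like(router_logits, dtype=float)
    rows = np.arange(router_logits.shape[0])[:, None]
    sel = router_logits[rows, idx]
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))  # stable softmax
    gates[rows, idx] = w / w.sum(axis=-1, keepdims=True)
    return gates
```

Fine-grained expert segmentation raises both the expert count and k, so the number of nonzero entries per row grows, which is why dispatch cost rather than expert compute becomes the scalability bottleneck the abstract describes.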
Updated: 2025-08-18 19:49:28
领域: cs.LG,cs.CL,cs.DC
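The routing step that X-MoE must execute at scale can be summarized with a generic top-k gate. This sketch shows plain top-k routing with renormalized gate weights only; it says nothing about the paper's padding-free kernels or redundancy-bypassing dispatch, and the function name is illustrative:

```python
import math

def topk_route(logits, k=2):
    """Toy top-k router: each token is sent to its k highest-scoring experts,
    with gate weights softmax-renormalized over just those experts
    (fine-grained MoEs like DeepSeek-MoE use a large k over many small experts)."""
    idx = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    exps = [math.exp(logits[e]) for e in idx]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(idx, exps)]
```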
Cross-Modal Characterization of Thin Film MoS$_2$ Using Generative Models
The growth and characterization of materials using empirical optimization typically requires a significant amount of expert time, experience, and resources. Several complementary characterization methods are routinely performed to determine the quality and properties of a grown sample. Machine learning (ML) can support the conventional approaches by using historical data to guide and provide speed and efficiency to the growth and characterization of materials. Specifically, ML can provide quantitative information from characterization data that is typically obtained from a different modality. In this study, we have investigated the feasibility of projecting the quantitative metric from microscopy measurements, such as atomic force microscopy (AFM), using data obtained from spectroscopy measurements, like Raman spectroscopy. Generative models were also trained to generate the full and specific features of the Raman and photoluminescence spectra from each other and the AFM images of the thin film MoS$_2$. The results are promising and have provided a foundational guide for the use of ML for the cross-modal characterization of materials for their accelerated, efficient, and cost-effective discovery.
Updated: 2025-08-18 19:47:07
Fields: cond-mat.mtrl-sci,cs.LG,physics.app-ph
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design
LLM-based Automatic Heuristic Design (AHD) within Evolutionary Computation (EC) frameworks has shown promising results. However, its effectiveness is hindered by the use of static operators and the lack of knowledge accumulation mechanisms. We introduce HiFo-Prompt, a framework that guides LLMs with two synergistic prompting strategies: Foresight and Hindsight. Foresight-based prompts adaptively steer the search based on population dynamics, managing the exploration-exploitation trade-off. In addition, hindsight-based prompts mimic human expertise by distilling successful heuristics from past generations into fundamental, reusable design principles. This dual mechanism transforms transient discoveries into a persistent knowledge base, enabling the LLM to learn from its own experience. Empirical results demonstrate that HiFo-Prompt significantly outperforms state-of-the-art LLM-based AHD methods, generating higher-quality heuristics while achieving substantially faster convergence and superior query efficiency.
Updated: 2025-08-18 19:42:55
Fields: cs.AI,cs.NE,math.OC
An Explainable AI based approach for Monitoring Animal Health
Monitoring cattle health and optimizing yield are key challenges faced by dairy farmers due to the difficulty of tracking all animals on the farm. This work showcases modern data-driven farming practices based on explainable machine learning (ML) methods that explain the activity and behaviour of dairy cattle (cows). Continuous data collection from 3-axis accelerometer sensors, together with robust ML methodologies and algorithms, provides farmers and researchers with actionable information on cattle activity, allowing farmers to make informed decisions and incorporate sustainable practices. This study utilizes Bluetooth-based Internet of Things (IoT) devices and 4G networks for seamless data transmission, immediate analysis, and inference generation, and explains the models' performance with explainability frameworks. Special emphasis is put on the pre-processing of the accelerometer time-series data, including the extraction of statistical characteristics, signal-processing techniques, and lag-based features using the sliding-window technique. Various hyperparameter-optimized ML models are evaluated across varying window lengths for activity classification. The k-nearest-neighbour classifier achieved the best performance, with a mean AUC of 0.98 (standard deviation 0.0026) on the training set and 0.99 on the test set. To ensure transparency, explainable-AI frameworks such as SHAP are used to interpret feature importance in a form practitioners can understand and use. A detailed comparison of the important features, along with a stability analysis of the selected features, supports the development of explainable and practical ML models for sustainable livestock management.
Updated: 2025-08-18 19:26:03
Fields: cs.LG,cs.AI
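The sliding-window preprocessing described above is easy to make concrete. A minimal sketch for a single accelerometer axis, with the per-window statistics (mean, standard deviation, range) chosen for illustration rather than taken from the paper:

```python
import statistics

def window_features(signal, window, step):
    """Toy sliding-window feature extraction for a one-axis accelerometer
    stream: per-window mean, population standard deviation, and range,
    of the kind commonly fed to classifiers such as k-NN."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        feats.append((statistics.fmean(w), statistics.pstdev(w), max(w) - min(w)))
    return feats
```

Lag-based features would be built the same way, by pairing each window with statistics of the preceding windows.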
A Dual-Attention Graph Network for fMRI Data Classification
Understanding complex neural activity dynamics is crucial for the development of neuroscience. Whereas current functional MRI classification approaches tend to be based on static functional connectivity or fail to capture spatio-temporal relationships comprehensively, we present a new framework that leverages dynamic graph creation and spatio-temporal attention mechanisms for Autism Spectrum Disorder (ASD) diagnosis. The approach dynamically infers functional brain connectivity in each time interval using transformer-based attention mechanisms, enabling the model to selectively focus on crucial brain regions and time segments. By constructing time-varying graphs that are then processed with Graph Convolutional Networks (GCNs) and transformers, our method captures both localized interactions and global temporal dependencies. Evaluated on a subset of the ABIDE dataset, our model achieves 63.2% accuracy and 60.0% AUC, outperforming static graph-based approaches (e.g., GCN: 51.8%). This validates the efficacy of jointly modeling dynamic connectivity and spatio-temporal context for fMRI classification. The core novelty arises from (1) attention-driven dynamic graph creation that learns temporal brain-region interactions and (2) hierarchical spatio-temporal feature fusion through a GCN-transformer fusion.
Updated: 2025-08-18 19:23:18
Fields: cs.LG,cs.AI
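The first claimed novelty, building a graph per time window from attention scores, can be sketched generically. This is a plain scaled dot-product construction with row-wise softmax, not the paper's transformer module; the region features and names are illustrative:

```python
import math

def attention_adjacency(region_feats):
    """Toy attention-based graph construction: scaled dot-product scores
    between brain-region feature vectors, softmax-normalized per row, give
    a dense 'functional connectivity' adjacency for one time window."""
    d = len(region_feats[0])
    adj = []
    for q in region_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in region_feats]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        adj.append([e / z for e in exps])
    return adj
```

A GCN layer would then propagate features over this adjacency, one graph per window.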
Towards Unified Multimodal Financial Forecasting: Integrating Sentiment Embeddings and Market Indicators via Cross-Modal Attention
We propose STONK (Stock Optimization using News Knowledge), a multimodal framework that integrates numerical market indicators with sentiment-enriched news embeddings to improve daily stock-movement prediction. By combining numerical and textual embeddings via feature concatenation and cross-modal attention, our unified pipeline addresses the limitations of isolated analyses. Backtesting shows STONK outperforms numeric-only baselines. A comprehensive evaluation of fusion strategies and model configurations offers evidence-based guidance for scalable multimodal financial forecasting. Source code is available on GitHub.
Updated: 2025-08-18 19:22:39
Fields: cs.AI
Decoding Communications with Partial Information
Machine language acquisition is often presented as a problem of imitation learning: there exists a community of language users from which a learner observes speech acts and attempts to decode the mappings between utterances and situations. However, an interesting consideration that is typically unaddressed is partial observability, i.e. the learner is assumed to see all relevant information. This paper explores relaxing this assumption, thereby posing a more challenging setting where such information needs to be inferred from knowledge of the environment, the actions taken, and messages sent. We see several motivating examples of this problem, demonstrate how they can be solved in a toy setting, and formally explore challenges that arise in more general settings. A learning-based algorithm is then presented to perform the decoding of private information to facilitate language acquisition.
Updated: 2025-08-18 19:19:16
Fields: cs.LG
Parallel Network Reconstruction with Multi-directional Regularization
Reconstructing large-scale latent networks from observed dynamics is crucial for understanding complex systems. However, the existing methods based on compressive sensing are often rendered infeasible in practice by prohibitive computational and memory costs. To address this challenge, we introduce a new distributed computing framework for efficient large-scale network reconstruction with parallel computing, namely PALMS (Parallel Adaptive Lasso with Multi-directional Signals). The core idea of PALMS is to decompose the complex global problem by partitioning network nodes, enabling the parallel estimation of sub-networks across multiple computing units. This strategy substantially reduces the computational complexity and storage requirements of classic methods. By using the adaptive multi-directional regularization on each computing unit, we also establish the consistency of PALMS estimator theoretically. Extensive simulation studies and empirical analyses on several large-scale real-world networks validate the computational efficiency and robust reconstruction accuracy of our approach.
Updated: 2025-08-18 19:11:55
Fields: math.ST,cs.LG,stat.ML,stat.TH,62-08,C.2.4
A Surveillance Based Interactive Robot
We build a mobile surveillance robot that streams video in real time and responds to speech so a user can monitor and steer it from a phone or browser. The system uses two Raspberry Pi 4 units: a front unit on a differential drive base with camera, mic, and speaker, and a central unit that serves the live feed and runs perception. Video is sent with FFmpeg. Objects in the scene are detected using YOLOv3 to support navigation and event awareness. For voice interaction, we use Python libraries for speech recognition, multilingual translation, and text-to-speech, so the robot can take spoken commands and read back responses in the requested language. A Kinect RGB-D sensor provides visual input and obstacle cues. In indoor tests the robot detects common objects at interactive frame rates on CPU, recognises commands reliably, and translates them to actions without manual control. The design relies on off-the-shelf hardware and open software, making it easy to reproduce. We discuss limits and practical extensions, including sensor fusion with ultrasonic range data, GPU acceleration, and adding face and text recognition.
Updated: 2025-08-18 19:09:43
Fields: cs.RO,cs.AI,cs.CV,I.2.9; I.2.10; I.2.7
Efficient Constraint-Aware Flow Matching via Randomized Exploration
We consider the problem of generating samples via Flow Matching (FM) with the additional requirement that the generated samples satisfy given constraints. We consider two scenarios: (a) a differentiable distance function to the constraint set is given, and (b) the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works, which require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but only the second stage probes the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used to train an adversarial example generator using queries to a hard-label black-box classifier. We conclude with several future research directions. Our code is available at https://github.com/ZhengyanHuan/FM-RE.
Updated: 2025-08-18 19:02:02
Fields: cs.LG
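For case (a), the adapted objective is just the flow-matching regression term plus a weighted squared distance to the constraint set. A minimal sketch, with `dist_fn` and `lam` as illustrative stand-ins for the paper's differentiable distance function and penalty weight:

```python
def penalized_fm_loss(v_pred, v_target, x1_pred, dist_fn, lam=1.0):
    """Toy version of the case-(a) objective: the usual flow-matching
    regression term plus a penalty on the distance from the generated
    endpoint to the constraint set (dist_fn is assumed differentiable)."""
    fm = sum((vp - vt) ** 2 for vp, vt in zip(v_pred, v_target))
    return fm + lam * dist_fn(x1_pred) ** 2
```

When the generated endpoint lies inside the constraint set the penalty vanishes and the objective reduces to plain flow matching.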
Flow Matching-Based Generative Modeling for Efficient and Scalable Data Assimilation
Data assimilation (DA) is the problem of sequentially estimating the state of a dynamical system from noisy observations. Recent advances in generative modeling have inspired new approaches to DA in high-dimensional nonlinear settings, especially the ensemble score filter (EnSF). However, these come at a significant computational burden due to slow sampling. In this paper, we introduce a new filtering framework based on flow matching (FM) -- called the ensemble flow filter (EnFF) -- to accelerate sampling and enable flexible design of probability paths. EnFF -- a training-free DA approach -- integrates MC estimators for the marginal FM vector field (VF) and a localized guidance to assimilate observations. EnFF has faster sampling and more flexibility in VF design compared to existing generative modeling for DA. Theoretically, we show that EnFF encompasses classical filtering methods such as the bootstrap particle filter and the ensemble Kalman filter as special cases. Experiments on high-dimensional filtering benchmarks demonstrate improved cost-accuracy tradeoffs and the ability to leverage larger ensembles than prior methods. Our results highlight the promise of FM as a scalable tool for filtering in high-dimensional applications that enable the use of large ensembles.
Updated: 2025-08-18 19:00:45
Fields: stat.ML,cs.LG,math.OC,60G35 (Primary), 62M20 (Secondary), 93E11
Quiet Feature Learning in Algorithmic Tasks
We train Transformer-based language models on ten foundational algorithmic tasks and observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends. Over large ranges of compute, the validation loss barely improves, then abruptly decreases. Probing the models' internal representations reveals that quiet features are learned prior to any decrease in task loss. These quiet features represent intermediate algorithmic computations that do not by themselves improve the output loss. Ablation experiments demonstrate that individual quiet features are causally necessary for task performance. Our results demonstrate that substantial representational progress can remain hidden beneath an apparently flat loss curve, challenging the prevailing use of cross-entropy as a proxy for learning and motivating richer diagnostics for monitoring model training.
Updated: 2025-08-18 18:57:07
Fields: cs.LG,cs.CL
Characterizing the Sensitivity to Individual Bit Flips in Client-Side Operations of the CKKS Scheme
Homomorphic Encryption (HE) enables computation on encrypted data without decryption, making it a cornerstone of privacy-preserving computation in untrusted environments. As HE sees growing adoption in sensitive applications such as secure machine learning and confidential data analysis ensuring its robustness against errors becomes critical. Faults (e.g., transmission errors, hardware malfunctions, or synchronization failures) can corrupt encrypted data and compromise the integrity of HE operations. However, the impact of soft errors (such as bit flips) on modern HE schemes remains unexplored. Specifically, the CKKS scheme-one of the most widely used HE schemes for approximate arithmetic-lacks a systematic study of how such errors propagate across its pipeline, particularly under optimizations like the Residue Number System (RNS) and Number Theoretic Transform (NTT). This work bridges that gap by presenting a theoretical and empirical analysis of CKKS's fault tolerance under single bit-flip errors. We focus on client-side operations (encoding, encryption, decryption, and decoding) and demonstrate that while the vanilla CKKS scheme exhibits some resilience, performance optimizations (RNS/NTT) introduce significant fragility, amplifying error sensitivity. By characterizing these failure modes, we lay the groundwork for error-resilient HE designs, ensuring both performance and integrity in privacy-critical applications.
Updated: 2025-08-18 18:55:03
Fields: cs.CR
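The fault model itself, a single flipped bit in one coefficient, is simple to state in code. A sketch of the injection step only; the 64-bit width and the XOR-based flip are illustrative, while the paper's experiments target CKKS's actual RNS residues:

```python
def flip_bit(coeff, bit, width=64):
    """Inject a single bit flip into one polynomial coefficient, the basic
    soft-error model used to probe CKKS client-side operations."""
    return (coeff ^ (1 << bit)) % (1 << width)

# A flip in a high-order bit of a residue can change the decoded value
# drastically, while a low-order flip may stay within CKKS's approximation
# noise; characterizing that asymmetry is the point of the study.
```

The flip is an involution, so injecting the same fault twice restores the original coefficient, which is convenient for sweep-style experiments.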
DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DAASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DAASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DAASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained based methods, DAASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD -- achieving higher attack success rates (e.g., 20.63\% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements $\approx$ of 11, 0.015, and 5.7, respectively). Furthermore, DAASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
Updated: 2025-08-18 18:54:20
Fields: cs.CV,cs.LG
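One DAASH stage aggregates candidate perturbations from several base attacks under learned weights. A sketch of just that aggregation step, with softmax-normalized fixed weights standing in for the weights DAASH learns against its meta-loss:

```python
import math

def aggregate_candidates(perturbations, weights):
    """Toy version of one DAASH stage: combine candidate adversarial
    perturbations from several base attacks under softmax-normalized
    weights (fixed here; in DAASH they are learned via the meta-loss)."""
    m = max(weights)                          # subtract max for stability
    exps = [math.exp(w - m) for w in weights]
    z = sum(exps)
    alphas = [e / z for e in exps]
    n = len(perturbations[0])
    return [sum(a * p[i] for a, p in zip(alphas, perturbations))
            for i in range(n)]
```

The result is propagated to the next stage, so the weights modulate each base attack's contribution stage by stage.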
Hierarchical Reinforcement Learning in Multi-Goal Spatial Navigation with Autonomous Mobile Robots
Hierarchical reinforcement learning (HRL) is hypothesized to be able to leverage the inherent hierarchy in learning tasks where traditional reinforcement learning (RL) often fails. In this research, HRL is evaluated and contrasted with traditional RL in complex robotic navigation tasks. We evaluate unique characteristics of HRL, including its ability to create sub-goals and the termination functions. We constructed a number of experiments to test: 1) the differences between RL proximal policy optimization (PPO) and HRL, 2) different ways of creating sub-goals in HRL, 3) manual vs automatic sub-goal creation in HRL, and 4) the effects of the frequency of termination on performance in HRL. These experiments highlight the advantages of HRL over RL and how it achieves these advantages.
Updated: 2025-08-18 18:48:07
Fields: cs.AI,cs.RO
Diff-MSM: Differentiable MusculoSkeletal Model for Simultaneous Identification of Human Muscle and Bone Parameters
High-fidelity personalized human musculoskeletal models are crucial for simulating realistic behavior of physically coupled human-robot interactive systems and for verifying their safety-critical applications in simulation before actual deployment, such as human-robot co-transportation and rehabilitation through robotic exoskeletons. Identifying subject-specific Hill-type muscle model parameters and bone dynamic parameters is essential for a personalized musculoskeletal model, but very challenging due to the difficulty of directly measuring internal biomechanical variables in vivo, especially the joint torques. In this paper, we propose using a Differentiable MusculoSkeletal Model (Diff-MSM) to simultaneously identify its muscle and bone parameters with an end-to-end automatic differentiation technique, differentiating from the measurable muscle activation, through the joint torque, to the resulting observable motion, without the need to measure the internal joint torques. Extensive comparative simulations showed that our proposed method significantly outperforms state-of-the-art baseline methods, especially in accurate estimation of the muscle parameters: with initial guesses sampled from a normal distribution whose mean is the ground truth and whose standard deviation is 10% of the ground truth, the average percentage error of the estimated values can be as low as 0.05%. Beyond human musculoskeletal modeling and simulation, the new parameter identification technique with the Diff-MSM has great potential to enable new applications in muscle health monitoring, rehabilitation, and sports science.
Updated: 2025-08-18 18:43:43
Fields: cs.RO,cs.AI
Data-Efficient Safe Policy Improvement Using Parametric Structure
Safe policy improvement (SPI) is an offline reinforcement learning problem in which a new policy that reliably outperforms the behavior policy with high confidence needs to be computed using only a dataset and the behavior policy. Markov decision processes (MDPs) are the standard formalism for modeling environments in SPI. In many applications, additional information in the form of parametric dependencies between distributions in the transition dynamics is available. We make SPI more data-efficient by leveraging these dependencies through three contributions: (1) a parametric SPI algorithm that exploits known correlations between distributions to more accurately estimate the transition dynamics using the same amount of data; (2) a preprocessing technique that prunes redundant actions from the environment through a game-based abstraction; and (3) a more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, that can identify more actions to prune. Empirical results and an ablation study show that our techniques increase the data efficiency of SPI by multiple orders of magnitude while maintaining the same reliability guarantees.
Updated: 2025-08-18 18:41:23
Fields: cs.AI
GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis
Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable, allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities (synthetic individuals not present in the original dataset) by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining the privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
Updated: 2025-08-18 18:32:42
Fields: cs.CV,cs.AI
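The novel-identity mechanism reduces, at a high level, to interpolating between identity embeddings before conditioning the diffusion model. A one-line sketch of that interpolation, with the embedding format purely illustrative:

```python
def interpolate_identity(emb_a, emb_b, alpha=0.5):
    """Toy linear interpolation between two identity embeddings, the
    high-level mechanism GaitCrafter uses to condition generation on a
    novel, synthetic identity not present in the original dataset."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]
```

Sweeping `alpha` between 0 and 1 traces a path of synthetic identities between the two real ones.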
FDR-SVM: A Federated Distributionally Robust Support Vector Machine via a Mixture of Wasserstein Balls Ambiguity Set
We study a federated classification problem over a network of multiple clients and a central server, in which each client's local data remains private and is subject to uncertainty in both the features and labels. To address these uncertainties, we develop a novel Federated Distributionally Robust Support Vector Machine (FDR-SVM), robustifying the classification boundary against perturbations in local data distributions. Specifically, the data at each client is governed by a unique true distribution that is unknown. To handle this heterogeneity, we develop a novel Mixture of Wasserstein Balls (MoWB) ambiguity set, naturally extending the classical Wasserstein ball to the federated setting. We then establish theoretical guarantees for our proposed MoWB, deriving an out-of-sample performance bound and showing that its design preserves the separability of the FDR-SVM optimization problem. Next, we rigorously derive two algorithms that solve the FDR-SVM problem and analyze their convergence behavior as well as their worst-case time complexity. We evaluate our algorithms on industrial data and various UCI datasets, whereby we demonstrate that they frequently outperform existing state-of-the-art approaches.
Updated: 2025-08-18 18:30:31
Subjects: cs.LG, stat.ML
AI Agents for Photonic Integrated Circuit Design Automation
We present Photonics Intelligent Design and Optimization (PhIDO), a multi-agent framework that converts natural-language photonic integrated circuit (PIC) design requests into layout mask files. We compare 7 reasoning large language models for PhIDO using a testbench of 102 design descriptions that ranged from single devices to 112-component PICs. The success rate for single-device designs was up to 91%. For design queries with less than or equal to 15 components, o1, Gemini-2.5-pro, and Claude Opus 4 achieved the highest end-to-end pass@5 success rates of approximately 57%, with Gemini-2.5-pro requiring the fewest output tokens and lowest cost. The next steps toward autonomous PIC development include standardized knowledge representations, expanded datasets, extended verification, and robotic automation.
Updated: 2025-08-18 18:20:32
Subjects: cs.AR, cs.AI, physics.app-ph, physics.optics
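The end-to-end pass@5 rates reported above are typically computed with the standard unbiased pass@k estimator from code-generation evaluation; the sketch below assumes that convention rather than anything stated in the paper itself.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n attempts (c of them correct)
    succeeds, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 attempts and zero successes pass@5 is 0, and with a single success in two attempts pass@1 is 0.5.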
Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs
Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch -- no fine-tuning required -- we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
Updated: 2025-08-18 18:20:18
Subjects: cs.CR, cs.AI, cs.CL, 68T07, 94A60, D.4.6; K.6.5; E.3; I.2.6; I.2.7
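The source-trust lattice enforced by the pre-softmax hard attention mask can be illustrated with a toy sketch. The integer trust levels and the "query may only attend to keys of equal or higher trust" convention below are illustrative assumptions, not CIV's actual implementation.

```python
import numpy as np

def trust_mask(trust: np.ndarray) -> np.ndarray:
    """Boolean attention mask over a token sequence: query token i may
    attend to key token j only if j's trust level is at least i's, so
    lower-trust tokens cannot influence higher-trust representations.
    True means attention is allowed."""
    return trust[None, :] >= trust[:, None]

# Example: a system token (trust 2), a user token (trust 1),
# and an untrusted tool output (trust 0).
levels = np.array([2, 1, 0])
mask = trust_mask(levels)
```

In a transformer this mask would be applied additively before the softmax (0 where allowed, a large negative value where blocked), leaving the frozen weights untouched.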
Towards Urban Planning AI Agents in the Age of Agentic AI
Generative AI, large language models, and agentic AI have emerged largely separately from urban planning. However, the convergence between AI and urban planning presents an interesting opportunity toward AI urban planners. Existing studies conceptualize urban planning as a generative AI task, in which AI synthesizes land-use configurations under geospatial, social, and human-centric constraints and reshapes automated urban design. We further identify critical gaps in existing generative urban planning studies: 1) the generative structure must be predefined under strong assumptions: adversarial generator-discriminator structures, forward and inverse diffusion structures, and hierarchical zone-POI generative structures are all predefined by humans; 2) existing approaches ignore the power of tools developed by domain experts: urban planners, guided by urban theory, have developed various tools for use throughout the planning process, yet existing purely neural-network-based generation ignores these practitioner-developed tools. To address these limitations, we outline a future research direction, the agentic urban AI planner, calling for a new synthesis of agentic AI and participatory urbanism.
Updated: 2025-08-18 18:14:26
Subjects: cs.AI
Hierarchical Conformal Classification
Conformal prediction (CP) is a powerful framework for quantifying uncertainty in machine learning models, offering reliable predictions with finite-sample coverage guarantees. When applied to classification, CP produces a prediction set of possible labels that is guaranteed to contain the true label with high probability, regardless of the underlying classifier. However, standard CP treats classes as flat and unstructured, ignoring domain knowledge such as semantic relationships or hierarchical structure among class labels. This paper presents hierarchical conformal classification (HCC), an extension of CP that incorporates class hierarchies into both the structure and semantics of prediction sets. We formulate HCC as a constrained optimization problem whose solutions yield prediction sets composed of nodes at different levels of the hierarchy, while maintaining coverage guarantees. To address the combinatorial nature of the problem, we formally show that a much smaller, well-structured subset of candidate solutions suffices to ensure coverage while upholding optimality. An empirical evaluation on three new benchmarks consisting of audio, image, and text data highlights the advantages of our approach, and a user study shows that annotators significantly prefer hierarchical over flat prediction sets.
Updated: 2025-08-18 18:05:55
Subjects: cs.LG, cs.AI, stat.ML
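For context, the flat split-conformal baseline that HCC extends can be sketched in a few lines. The nonconformity score used here (one minus the softmax probability of the true class) is a common choice assumed for illustration, not necessarily the paper's.

```python
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray,
                        alpha: float = 0.1) -> float:
    """Split-conformal calibration: score each held-out example by
    1 - p(true class), then take the ceil((n+1)(1-alpha))/n empirical
    quantile as the threshold."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, q, method="higher"))

def prediction_set(probs: np.ndarray, threshold: float) -> np.ndarray:
    """Every label whose nonconformity score is within the threshold;
    the set contains the true label with probability >= 1 - alpha."""
    return np.where(1.0 - probs <= threshold)[0]
```

HCC then replaces these flat label sets with sets of nodes drawn from different levels of the class hierarchy while keeping the same coverage guarantee.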
Towards Human-AI Complementarity in Matching Tasks
Data-driven algorithmic matching systems promise to help human decision makers make better matching decisions in a wide variety of high-stakes application domains, such as healthcare and social service provision. However, existing systems are not designed to achieve human-AI complementarity: decisions made by a human using an algorithmic matching system are not necessarily better than those made by the human or by the algorithm alone. Our work aims to address this gap. To this end, we propose collaborative matching (comatch), a data-driven algorithmic matching system that takes a collaborative approach: rather than making all the matching decisions for a matching task like existing systems, it selects only the decisions that it is the most confident in, deferring the rest to the human decision maker. In the process, comatch optimizes how many decisions it makes and how many it defers to the human decision maker to provably maximize performance. We conduct a large-scale human subject study with 800 participants to validate the proposed approach. The results demonstrate that the matching outcomes produced by comatch outperform those generated by either human participants or by algorithmic matching on their own. The data gathered in our human subject study and an implementation of our system are available as open source at https://github.com/Networks-Learning/human-AI-complementarity-matching.
Updated: 2025-08-18 18:02:45
Subjects: cs.LG, cs.HC
Physically Plausible Data Augmentations for Wearable IMU-based Human Activity Recognition Using Physics Simulation
The scarcity of high-quality labeled data in sensor-based Human Activity Recognition (HAR) hinders model performance and limits generalization across real-world scenarios. Data augmentation is a key strategy to mitigate this issue by enhancing the diversity of training datasets. Signal Transformation-based Data Augmentation (STDA) techniques have been widely used in HAR. However, these methods are often physically implausible, potentially resulting in augmented data that fails to preserve the original meaning of the activity labels. In this study, we introduce and systematically characterize Physically Plausible Data Augmentation (PPDA) enabled by physics simulation. PPDA leverages human body movement data from motion capture or video-based pose estimation and incorporates various realistic variabilities through physics simulation, including modifying body movements, sensor placements, and hardware-related effects. We compare the performance of PPDAs with traditional STDAs on three public datasets of daily activities and fitness workouts. First, we evaluate each augmentation method individually, directly comparing PPDAs to their STDA counterparts. Next, we assess how combining multiple PPDAs can reduce the need for initial data collection by varying the number of subjects used for training. Experiments show consistent benefits of PPDAs, improving macro F1 scores by an average of 3.7 pp (up to 13 pp) and achieving competitive performance with up to 60% fewer training subjects than STDAs. As the first systematic study of PPDA in sensor-based HAR, these results highlight the advantages of pursuing physical plausibility in data augmentation and the potential of physics simulation for generating synthetic Inertial Measurement Unit data for training deep learning HAR models. This cost-effective and scalable approach therefore helps address the annotation scarcity challenge in HAR.
Updated: 2025-08-18 18:02:27
Subjects: cs.LG
CLoE: Curriculum Learning on Endoscopic Images for Robust MES Classification
Estimating disease severity from endoscopic images is essential in assessing ulcerative colitis, where the Mayo Endoscopic Subscore (MES) is widely used to grade inflammation. However, MES classification remains challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore. We propose CLoE, a curriculum learning framework that accounts for both label reliability and ordinal structure. Image quality, estimated via a lightweight model trained on Boston Bowel Preparation Scale (BBPS) labels, is used as a proxy for annotation confidence to order samples from easy (clean) to hard (noisy). This curriculum is further combined with ResizeMix augmentation to improve robustness. Experiments on the LIMUC and HyperKvasir datasets, using both CNNs and Transformers, show that CLoE consistently improves performance over strong supervised and self-supervised baselines. For instance, ConvNeXt-Tiny reaches 82.5% accuracy and a QWK of 0.894 on LIMUC with low computational cost. These results highlight the potential of difficulty-aware training strategies for improving ordinal classification under label uncertainty. Code will be released at https://github.com/zeynepozdemir/CLoE.
Updated: 2025-08-18 18:00:28
Subjects: cs.CV, cs.LG
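The easy-to-hard ordering at the core of the curriculum can be sketched as follows, assuming per-sample quality scores produced by the BBPS-trained proxy model (names hypothetical):

```python
def curriculum_order(samples, quality_scores):
    """Order training samples from easy (high estimated image quality,
    likely clean MES labels) to hard (low quality, likely noisy labels),
    so early training epochs see the most reliable annotations first."""
    ranked = sorted(zip(samples, quality_scores),
                    key=lambda pair: pair[1], reverse=True)
    return [sample for sample, _ in ranked]
```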
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinct activation feature that can better identify LGT. We can classify the text by calculating the projection score of the text representations along this feature direction and comparing it with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average AUROC of 94.92% in both in-distribution (ID) and OOD scenarios, while also demonstrating robust resilience to various text sizes and mainstream attacks. Data and code are publicly available at: https://github.com/NLP2CT/RepreGuard
Updated: 2025-08-18 17:59:15
Subjects: cs.CL, cs.AI
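The projection-and-threshold classification step can be sketched with a simple difference-of-means direction over surrogate-model representations. The paper's actual feature-extraction procedure may differ, so treat this as an illustrative assumption:

```python
import numpy as np

def fit_direction(lgt_reps: np.ndarray, hwt_reps: np.ndarray) -> np.ndarray:
    """A distinguishing activation direction: the unit-normalized
    difference between the mean LLM-generated-text representation and
    the mean human-written-text representation."""
    d = lgt_reps.mean(axis=0) - hwt_reps.mean(axis=0)
    return d / np.linalg.norm(d)

def classify(rep: np.ndarray, direction: np.ndarray,
             threshold: float) -> bool:
    """Project a text's representation onto the direction; a score above
    the precomputed threshold flags the text as LLM-generated."""
    return float(rep @ direction) > threshold
```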
MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, MDLMs progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored in training as tokens are masked at random. Although this discrepancy between training and inference can lead to suboptimal performance, it has been largely overlooked by previous works, leaving closing this gap between the two stages an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property diffusion possesses and explicitly train the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This simple yet effective training-free strategy, what we refer to as RCR, consistently improves performance and yields additional gains when combined with MDPO. Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.
Updated: 2025-08-18 17:58:13
Subjects: cs.LG
New Interaction Paradigm for Complex EDA Software Leveraging GPT
Electronic Design Automation (EDA) tools such as KiCad offer powerful functionalities but remain difficult to use, particularly for beginners, due to their steep learning curves and fragmented documentation. To address this challenge, we present SmartonAI, an AI-assisted interaction system that integrates large language models into the EDA workflow, enabling natural language communication, intelligent task decomposition, and contextual plugin execution. SmartonAI consists of two main components: a Chat Plugin that breaks down user instructions into subtasks and retrieves tailored documentation, and a OneCommandLine Plugin that recommends and executes relevant plugins based on user intent. The system supports multilingual interaction and adapts to user feedback through incremental learning. Preliminary results suggest that SmartonAI significantly reduces onboarding time and enhances productivity, representing a promising step toward generalizable AI-assisted interaction paradigms for complex software systems.
Updated: 2025-08-18 17:57:44
Subjects: cs.SE, cs.AI
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model's intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
Updated: 2025-08-18 17:56:04
Subjects: cs.CL, cs.LG
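Under the definitions above, a benchmark's signal-to-noise ratio can be sketched as follows. The exact dispersion statistics (range across models for signal, checkpoint-to-checkpoint standard deviation for noise) are assumptions for illustration, since the abstract does not give the precise formulas:

```python
import statistics

def signal_to_noise(model_scores, checkpoint_scores):
    """signal: how far apart final scores of different models are (the
    benchmark's ability to separate better from worse models).
    noise: the variability of a single model's score across adjacent
    training checkpoints."""
    signal = max(model_scores) - min(model_scores)
    noise = statistics.stdev(checkpoint_scores)
    return signal / noise
```

A benchmark with a higher ratio separates models more reliably at small scale; averaging intermediate-checkpoint outputs, as the paper suggests, shrinks the denominator.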
Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.
Updated: 2025-08-18 17:55:22
Subjects: cs.AI, cs.SE
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities for achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the spatial intelligence problems that remain most challenging for multi-modal models, and (4) find that proprietary models do not exhibit a decisive advantage on the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multi-modal models.
Updated: 2025-08-18 17:55:17
Subjects: cs.CV, cs.CL, cs.LG, cs.MM, cs.RO
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. In this work, we introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks. Using novel thinking-adjusted accuracy metrics, we perform extensive evaluation of 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
Updated: 2025-08-18 17:53:10
Subjects: cs.CL, cs.LG
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, that shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval
Updated: 2025-08-18 17:50:56
Subjects: cs.AI
Training Machine Learning Models on Human Spatio-temporal Mobility Data: An Experimental Study [Experiment Paper]
Individual-level human mobility prediction has emerged as a significant topic of research with applications in infectious disease monitoring and child and elderly care. Existing studies predominantly focus on the microscopic aspects of human trajectories, such as predicting short-term trajectories or the next location visited, while offering limited attention to macro-level mobility patterns and the corresponding life routines. In this paper, we focus on an underexplored problem in human mobility prediction: determining the best practices to train a machine learning model using historical data to forecast an individual's complete trajectory over the next days and weeks. In this experiment paper, we undertake a comprehensive experimental analysis of diverse models, parameter configurations, and training strategies, accompanied by an in-depth examination of the statistical distribution inherent in human mobility patterns. Our empirical evaluations encompass both Long Short-Term Memory and Transformer-based architectures, and further investigate how incorporating individual life patterns can enhance the effectiveness of the prediction. We show that explicitly including semantic information such as day-of-the-week and user-specific historical information can help the model better understand individual patterns of life and improve predictions. Moreover, since explicit user information is often missing due to user privacy, we show that the sampling of users may exacerbate data skewness and result in a substantial loss in predictive accuracy. To mitigate data imbalance and preserve diversity, we apply user semantic clustering with stratified sampling to ensure that the sampled dataset remains representative. Our results further show that small-batch stochastic gradient optimization improves model performance, especially when human mobility training data is limited.
Updated: 2025-08-18 17:49:10
标题: 在人类时空移动数据上训练机器学习模型:一项实验研究【实验论文】
摘要: 个人级别的人类移动预测已经成为一个重要的研究课题,应用包括传染病监测、儿童和老年人护理。现有研究主要关注人类轨迹的微观方面:例如预测短期轨迹或下一个访问的位置,但对宏观级别的移动模式和相应的生活常规关注有限。在本文中,我们关注人类移动预测中一个未被充分探索的问题:确定使用历史数据训练机器学习模型以预测未来几天和几周内个人完整轨迹的最佳实践。在这个实验论文中,我们进行了对不同模型、参数配置和训练策略的全面实验分析,并深入研究了人类移动模式中固有的统计分布。我们的实证评估涵盖了长短期记忆和基于Transformer的架构,并进一步研究了融入个人生活模式如何增强预测的有效性。我们表明,明确包含语义信息(如星期几和用户特定历史信息)可以帮助模型更好地理解个人生活模式并改善预测。此外,由于隐私原因,明确的用户信息通常缺失,我们表明对用户进行采样可能加剧数据的倾斜,并导致预测准确性的显著损失。为了缓解数据不平衡并保持多样性,我们应用用户语义聚类和分层抽样,确保采样数据集保持代表性。我们的结果进一步表明,小批量随机梯度优化改善了模型的性能,特别是当人类移动训练数据有限时。
更新时间: 2025-08-18 17:49:10
领域: cs.LG
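The user-clustering-with-stratified-sampling step described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the cluster labels, the sampling fraction, and the toy population below are all hypothetical, and the paper's semantic clustering itself is not reproduced here.

```python
import random
from collections import defaultdict

def stratified_sample(user_clusters, frac, seed=0):
    """Sample a fraction of users from every cluster, so the sample keeps the
    cluster proportions of the full population instead of drifting with skew.
    user_clusters: dict of user_id -> cluster label."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for user, cluster in user_clusters.items():
        by_cluster[cluster].append(user)
    sampled = []
    for cluster in sorted(by_cluster):
        users = sorted(by_cluster[cluster])
        rng.shuffle(users)
        k = max(1, round(frac * len(users)))  # keep at least one user per cluster
        sampled.extend(users[:k])
    return sampled

# Hypothetical population: "commuter" routines are 4x more common than "homebody".
users = {f"u{i}": ("commuter" if i < 80 else "homebody") for i in range(100)}
sample = stratified_sample(users, frac=0.1)
counts = defaultdict(int)
for u in sample:
    counts[users[u]] += 1
```

Sampling per cluster rather than uniformly over all users keeps rare routine types represented, which is the point the abstract makes about preserving diversity.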
Improving Detection of Watermarked Language Models
Watermarking has recently emerged as an effective strategy for detecting the generations of large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector under a wide range of experimental conditions.
Updated: 2025-08-18 17:43:06
标题: 改进带水印语言模型的检测
摘要: 数字水印技术最近已被证明是一种有效的策略,用于检测大型语言模型(LLMs)的生成。水印的强度通常很大程度上取决于语言模型和输入提示集所提供的熵。然而,在实践中,熵可能会受到相当大的限制,特别是对于那些经过后训练的模型,例如通过指令调优或从人类反馈中进行强化学习(RLHF)。这使得仅基于水印的检测变得具有挑战性。在这项工作中,我们调查了是否通过将水印探测器与非水印探测器相结合可以改善检测效果。我们探讨了一些混合方案,将这两种方法结合起来,观察到在广泛的实验条件下,与任何一类探测器相比都能得到性能的提升。
更新时间: 2025-08-18 17:43:06
领域: cs.CL,cs.LG,stat.ML
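The abstract above does not specify how the watermark and non-watermark detector outputs are combined; one standard hybrid scheme is to pool their p-values with Fisher's method. The sketch below assumes both detectors emit independent p-values, which is our assumption, not a detail from the paper.

```python
import math

def fisher_combine(p_values):
    """Fisher's method: pool independent p-values into one.
    The statistic -2 * sum(ln p_i) is chi-square with 2k degrees of freedom."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    k = len(p_values)
    # Chi-square survival function for even dof 2k has a closed form:
    # sf(x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, acc = 1.0, 0.0
    for i in range(k):
        acc += term
        term *= (stat / 2.0) / (i + 1)
    return math.exp(-stat / 2.0) * acc

# Neither detector is conclusive alone, but the pooled p-value is stronger.
p_wm, p_aux = 0.08, 0.06   # hypothetical watermark / auxiliary detector p-values
p_combined = fisher_combine([p_wm, p_aux])
assert p_combined < min(p_wm, p_aux)
```

This illustrates why hybrids can help in low-entropy regimes: two weak, independent signals can jointly clear a detection threshold that neither clears alone.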
Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries
Abstractive summarization is a core application in contact centers, where Large Language Models (LLMs) generate millions of summaries of call transcripts daily. Despite their apparent quality, it remains unclear whether LLMs systematically under- or over-attend to specific aspects of the transcript, potentially introducing biases in the generated summary. While prior work has examined social and positional biases, the specific forms of bias pertinent to contact center operations - which we term Operational Bias - have remained unexplored. To address this gap, we introduce BlindSpot, a framework built upon a taxonomy of 15 operational bias dimensions (e.g., disfluency, speaker, topic) for the identification and quantification of these biases. BlindSpot leverages an LLM as a zero-shot classifier to derive categorical distributions for each bias dimension in a pair of transcript and its summary. The bias is then quantified using two metrics: Fidelity Gap (the JS Divergence between distributions) and Coverage (the percentage of source labels omitted). Using BlindSpot, we conducted an empirical study with 2500 real call transcripts and their summaries generated by 20 LLMs of varying scales and families (e.g., GPT, Llama, Claude). Our analysis reveals that biases are systemic and present across all evaluated models, regardless of size or family.
Updated: 2025-08-18 17:31:03
标题: 发现盲点:系统性识别和量化联系中心总结中细粒度LLM偏差
摘要: 抽象总结是联系中心的核心应用程序,在这里,大型语言模型(LLMs)每天生成数百万通话转录的摘要。尽管它们表面上质量很高,但仍不清楚LLMs是否系统地忽视或过度关注转录的特定方面,可能会在生成的摘要中引入偏见。尽管先前的研究已经检查了社会和位置偏见,但与联系中心运营相关的具体偏见形式 - 我们称之为操作偏见 - 仍未被探索。为了填补这一空白,我们引入了BlindSpot,这是一个基于15个操作偏见维度的分类法(例如,语言不畅,发言者,主题),用于识别和量化这些偏见。BlindSpot利用LLM作为零次分类器,为每对转录和其摘要的每个偏见维度推导分类分布。然后使用两个指标来量化偏见:保真度差距(分布之间的JS散度)和覆盖率(省略的源标签的百分比)。使用BlindSpot,我们对2500个真实通话转录进行了实证研究,这些转录由20个不同规模和家族的LLMs生成摘要(例如,GPT,Llama,Claude)。我们的分析表明,无论大小或家族如何,偏见是系统性的,并且存在于所有评估的模型中。
更新时间: 2025-08-18 17:31:03
领域: cs.CL,cs.AI
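The two BlindSpot metrics are defined concretely in the abstract, the Fidelity Gap (JS divergence between the label distributions of a transcript and its summary) and Coverage (fraction of source labels omitted), so they can be sketched directly. The label values below are hypothetical, and the paper's LLM zero-shot classifier that produces the labels is not reproduced here.

```python
import math
from collections import Counter

def _kl(p, q):
    """KL divergence (base 2) between aligned probability vectors."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def fidelity_gap(transcript_labels, summary_labels):
    """Jensen-Shannon divergence between the categorical label distributions
    of a transcript and its summary (0 = identical, at most 1 bit)."""
    keys = sorted(set(transcript_labels) | set(summary_labels))
    t, s = Counter(transcript_labels), Counter(summary_labels)
    p = [t[k] / len(transcript_labels) for k in keys]
    q = [s[k] / len(summary_labels) for k in keys]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def coverage_omitted(transcript_labels, summary_labels):
    """Fraction of distinct source labels that never appear in the summary."""
    src = set(transcript_labels)
    return len(src - set(summary_labels)) / len(src)

# Hypothetical per-utterance topic labels from a zero-shot classifier.
transcript = ["billing", "billing", "outage", "agent"]
summary = ["billing", "billing"]
gap = fidelity_gap(transcript, summary)          # > 0: summary skews to "billing"
omitted = coverage_omitted(transcript, summary)  # 2 of 3 source labels dropped
```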
Bayesian Optimization-based Search for Agent Control in Automated Game Testing
This work introduces an automated testing approach that employs agents controlling game characters to detect potential bugs within a game level. Harnessing Bayesian Optimization (BO) to execute a sample-efficient search, the method analyzes the data collected so far to determine the next sampling point that maximizes information acquisition. To support the BO process, we introduce a game-testing-specific model built on top of a grid map that features the smoothness and uncertainty estimation required by BO and, most importantly, does not suffer from the scalability issues that traditional models carry. The experiments demonstrate that the approach significantly improves map coverage capabilities in both time efficiency and exploration distribution.
Updated: 2025-08-18 17:24:46
标题: 贝叶斯优化在自动化游戏测试中用于代理控制的搜索
摘要: 这项工作介绍了一种自动化测试方法,利用控制游戏角色的代理来检测游戏关卡中潜在的错误。利用贝叶斯优化(BO)的能力来执行样本高效搜索,该方法通过分析迄今收集的数据来确定下一个采样点,并计算最大化信息获取的数据点。为了支持BO过程,我们引入了一个建立在网格地图之上的游戏测试特定模型,该模型具有BO所需的平滑性和不确定性估计,然而最重要的是,它不会遭受传统模型所具有的可扩展性问题。实验证明,该方法显著提高了地图覆盖能力,无论是在时间效率还是探索分布方面。
更新时间: 2025-08-18 17:24:46
领域: cs.AI
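A grid-map surrogate of the kind the abstract describes can be approximated with per-cell visit counts: the mean is smoothed over neighboring observed cells and the uncertainty shrinks with visits, giving BO the two quantities it needs without a full Gaussian-process model. Everything below (the UCB acquisition, the 1/sqrt(1+visits) uncertainty, the 8-neighborhood smoothing) is our own minimal stand-in, not the paper's model.

```python
def next_sample(grid_w, grid_h, visits, values, beta=2.0):
    """Pick the next grid cell to visit with a UCB-style acquisition.
    visits/values: dicts keyed by (x, y) cell; values[c] is the summed
    observed signal at c, so values[c] / visits[c] is the cell's sample mean."""
    def mean(c):
        # Smoothness: average a cell's estimate over its observed neighbors.
        x, y = c
        neigh = [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        obs = [values[n] / visits[n] for n in neigh if visits.get(n)]
        return sum(obs) / len(obs) if obs else 0.0

    def ucb(c):
        sigma = 1.0 / (1.0 + visits.get(c, 0)) ** 0.5  # shrinks with visits
        return mean(c) + beta * sigma

    cells = [(x, y) for x in range(grid_w) for y in range(grid_h)]
    return max(cells, key=ucb)

# Never-visited map: every cell ties, so the first cell is chosen.
print(next_sample(3, 3, {}, {}))
# After 5 rewarding visits to (0, 0), unvisited cells *near* it score best.
print(next_sample(4, 4, visits={(0, 0): 5}, values={(0, 0): 5.0}))
```

The cost of updating and querying this model is linear in the number of cells, which is the scalability property the abstract contrasts against traditional BO surrogates.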
Novel Blockchain-based Protocols for Electronic Voting and Auctions
Programmable blockchains have long been a hot research topic given their tremendous use in decentralized applications. Smart contracts, using blockchains as their underlying technology, inherit desirable properties such as verifiability, immutability, and transparency, which make them a great fit for trustless environments. In this thesis, we consider several decentralized protocols to be built on blockchains, specifically using smart contracts on Ethereum. We use algorithmic and cryptographic tools in our implementations to further improve the level of security and efficiency beyond state-of-the-art works. We propose a new approach called Blind Vote, which is an untraceable, secure, efficient, secrecy-preserving, and fully on-chain electronic voting protocol based on the well-known concept of Chaum's blind signatures. We illustrate that our approach achieves the same security guarantees as previous methods such as Tornado Vote [1] while consuming significantly less gas. Thus, we provide a cheaper and considerably more gas-efficient alternative for anonymous blockchain-based voting. On the other hand, we propose a new family of algorithms for private, trustless auctions that protect bidder identities and bid values while remaining practical for smart contract execution. We ensure trustlessness by running the auction logic in a smart contract, thereby eliminating reliance on any single trusted party. This approach prevents bid tampering, front-running, and collusion by enforcing immutability and decentralized verification of bids. The resulting protocol uniquely combines efficiency, trustlessness, and enduring bid privacy, offering a scalable and secure solution for blockchain-based marketplaces and other decentralized applications.
Updated: 2025-08-18 17:23:31
标题: 基于区块链的电子投票和拍卖的新型协议
摘要: 可编程区块链长期以来一直是一个热门的研究课题,因为它们在去中心化应用中的巨大用途。智能合约利用区块链作为其底层技术,继承了可验证性、不可变性和透明性等期望的特性,使其成为无信任环境中的绝佳选择。 在本论文中,我们考虑在区块链上构建几种去中心化协议,具体使用以太坊上的智能合约。我们在实现中使用算法和密码学工具,进一步提高了安全性和效率水平,超越了现有技术。我们提出了一种名为Blind Vote的新方法,这是一种不可追踪、安全、高效、保密且完全在链上的电子投票协议,基于Chaum盲签名这一著名概念。我们说明了我们的方法实现了与Tornado Vote [1]等先前方法相同的安全保证,同时消耗的Gas显著更少。因此,我们为基于区块链的匿名投票提供了更便宜且更具Gas效率的替代方案。另一方面,我们提出了一系列新的算法,用于私密、无信任的拍卖,保护竞标者身份和竞标价值,同时保持在智能合约上执行的可行性。我们通过在智能合约中运行拍卖逻辑来确保无信任性,从而消除对任何单个可信方的依赖。该方法通过强制不可变性和对竞标的去中心化验证,防止了竞标篡改、抢先交易和勾结。所得协议独特地结合了效率、无信任性和持久的竞标隐私,为基于区块链的市场和其他去中心化应用提供了一个可扩展且安全的解决方案。
更新时间: 2025-08-18 17:23:31
领域: cs.CR,cs.DC
AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation
Incident response (IR) requires fast, coordinated, and well-informed decision-making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that incorporates retrieval-augmented generation (RAG) into multi-agent incident response simulations. Built on the Backdoors & Breaches (B&B) tabletop game environment, AutoBnB-RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG-Wiki), and another using narrative-style incident reports (RAG-News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real-world cyber incidents based on public breach reports, demonstrating AutoBnB-RAG's ability to reconstruct complex multi-stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM-based multi-agent systems for cybersecurity decision-making.
Updated: 2025-08-18 17:22:51
标题: AutoBnB-RAG:利用检索增强生成技术增强多智能体事件响应
摘要: 事件响应(IR)需要快速、协调和充分知情的决策,以控制和减轻网络威胁。虽然大型语言模型(LLMs)在模拟的IR环境中表现出自主代理的潜力,但它们的推理通常受限于无法访问外部知识。在这项工作中,我们提出了AutoBnB-RAG,这是AutoBnB框架的扩展,将检索增强生成(RAG)引入多智能体事件响应模拟中。基于Backdoors&Breaches(B&B)桌面游戏环境,AutoBnB-RAG使智能体能够发出检索查询,并在协作调查过程中整合外部证据。我们引入了两种检索设置:一种基于策划的技术文档(RAG-Wiki),另一种使用叙述式事件报告(RAG-News)。我们评估了八种团队结构的表现,包括旨在促进批判性推理的新引入的论证配置。为了验证实际效用,我们还基于公开的侵犯报告模拟真实世界的网络事件,展示了AutoBnB-RAG重建复杂的多阶段攻击的能力。我们的结果表明,检索增强提高了跨不同组织模型的决策质量和成功率。这项工作展示了将检索机制整合到基于LLM的多智能体系统中,以用于网络安全决策的价值。
更新时间: 2025-08-18 17:22:51
领域: cs.CL,cs.CR
Contrastive Representations for Temporal Reasoning
In classical AI, perception relies on learning state-based representations, while planning, which can be thought of as temporal reasoning over action sequences, is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Combinatorial Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik's Cube. In particular, for the Rubik's Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS, though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.
Updated: 2025-08-18 17:20:08
标题: 用于时间推理的对比表示
摘要: 在经典人工智能中,感知依赖于学习基于状态的表示,而规划(可以被认为是对动作序列的时间推理)通常通过搜索实现。我们研究这样的推理是否可以从同时捕捉感知和时间结构的表示中出现。我们发现,尽管标准的时间对比学习非常流行,但它通常无法捕捉时间结构,因为它依赖于虚假特征。为了解决这个问题,我们引入了用于时间推理的组合表示(CRTR)方法,该方法使用负采样方案可证明地消除这些虚假特征并促进时间推理。CRTR在具有复杂时间结构的领域(如Sokoban和魔方)上取得了强大的结果。特别是对于魔方,CRTR学习到的表示能够泛化到所有初始状态,并使其能够用比BestFS更少的搜索步骤解决难题,尽管解的长度更长。据我们所知,这是第一个仅使用学习到的表示高效求解任意魔方状态、而不依赖外部搜索算法的方法。
更新时间: 2025-08-18 17:20:08
领域: cs.LG,cs.AI
Causally-Guided Pairwise Transformer -- Towards Foundational Digital Twins in Process Industry
Foundational modelling of multi-dimensional time-series data in industrial systems presents a central trade-off: channel-dependent (CD) models capture specific cross-variable dynamics but lack robustness and adaptability as model layers are commonly bound to the data dimensionality of the tackled use-case, while channel-independent (CI) models offer generality at the cost of modelling the explicit interactions crucial for system-level predictive regression tasks. To resolve this, we propose the Causally-Guided Pairwise Transformer (CGPT), a novel architecture that integrates a known causal graph as an inductive bias. The core of CGPT is built around a pairwise modeling paradigm, tackling the CD/CI conflict by decomposing the multidimensional data into pairs. The model uses channel-agnostic learnable layers where all parameter dimensions are independent of the number of variables. CGPT enforces a CD information flow at the pair-level and CI-like generalization across pairs. This approach disentangles complex system dynamics and results in a highly flexible architecture that ensures scalability and any-variate adaptability. We validate CGPT on a suite of synthetic and real-world industrial datasets on long-term and one-step forecasting tasks designed to simulate common industrial complexities. Results demonstrate that CGPT significantly outperforms both CI and CD baselines in predictive accuracy and shows competitive performance with end-to-end trained CD models while remaining agnostic to the problem dimensionality.
Updated: 2025-08-18 17:18:38
标题: 因果引导的两两Transformer——走向过程工业中的基础数字孪生
摘要: 在工业系统中对多维时间序列数据进行基础建模涉及一个核心权衡:通道依赖(CD)模型捕获特定的跨变量动态,但缺乏稳健性和适应性,因为模型层通常受制于所处理用例的数据维度;而通道独立(CI)模型提供了一般性,但以放弃对系统级预测回归任务至关重要的显式交互建模为代价。 为了解决这个问题,我们提出了因果引导的成对Transformer(CGPT),这是一种整合已知因果图作为归纳偏置的新颖架构。CGPT的核心围绕成对建模范式构建,通过将多维数据分解为成对数据来解决CD/CI冲突。该模型使用通道不可知的可学习层,其中所有参数维度独立于变量数量。CGPT在成对级别上强制执行CD信息流,并在成对之间实现类似CI的泛化。这种方法解开了复杂系统动态,产生了高度灵活的架构,确保可扩展性和任意变量数的适应性。我们在一套合成和真实的工业数据集上对CGPT进行验证,这些数据集旨在模拟常见工业复杂性的长期和一步预测任务。结果表明,CGPT在预测准确性方面明显优于CI和CD基线,并且在保持对问题维度不可知的同时,显示出与端到端训练的CD模型相竞争的性能。
更新时间: 2025-08-18 17:18:38
领域: cs.LG
High-Fidelity And Complex Test Data Generation For Real-World SQL Code Generation Services
The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low fidelity and an inability to model the complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically ``meaningful'' mock data for complex schemas that include columns with nested structures, which we frequently encounter in Google SQL code generation workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex schemas, as well as the lack of semantically coherent test data, which leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate realistic, high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity with respect to the test targets (SQL queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services such as NL2SQL and SQL Code Assistant services. Our results demonstrate the practical utility of an out-of-the-box LLM (\textit{gemini}) based test data generation approach for industrial SQL code generation services, where generating realistic test data is essential due to the frequent unavailability of production datasets.
Updated: 2025-08-18 17:11:48
标题: 高保真度和复杂测试数据生成用于实际 SQL 代码生成服务
摘要: 在工业环境中,对高保真度测试数据的需求至关重要,因为对生产数据的访问通常受到严格限制。传统的数据生成方法往往效果不佳,难以对复杂数据结构和语义关系进行建模,这对于测试复杂的SQL代码生成服务(如自然语言到SQL(NL2SQL))至关重要。本文解决了为包含具有嵌套结构的列的复杂模式生成句法正确和语义“有意义”的模拟数据的关键需求,这些模式经常在谷歌SQL代码生成工作负载中遇到。我们强调了现有生产中使用的方法的局限性,特别是它们无法处理大型和复杂的模式,以及缺乏语义上连贯的测试数据,导致测试覆盖面有限。我们证明,通过利用大型语言模型(LLMs)并结合策略性的预处理和后处理步骤,我们可以生成符合复杂结构约束并保持语义完整性的现实高保真度测试数据,以符合测试目标(SQL查询/功能)。这种方法支持涉及连接、聚合甚至深度嵌套子查询的复杂SQL查询的全面测试,确保对SQL代码生成服务(如NL2SQL和SQL代码助手服务)进行强大的评估。我们的结果展示了基于开箱即用的LLM(\textit{gemini})的测试数据生成对于工业SQL代码生成服务的实际实用性,因为由于生产数据集经常不可用,生成逼真的测试数据是至关重要的。
更新时间: 2025-08-18 17:11:48
领域: cs.DB,cs.LG
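The abstract mentions strategic post-processing without detailing it; a typical such step is structural validation of LLM-generated mock rows against the target schema, including nested STRUCT columns. The schema format and validation rules below are hypothetical, not the paper's actual pipeline.

```python
def validate_rows(rows, schema):
    """Keep only generated mock rows that match the target schema.
    schema: {column: {"type": <python type or nested schema dict>,
                      "nullable": bool (default False)}}."""
    valid = []
    for row in rows:
        ok = set(row) == set(schema)
        for col, spec in schema.items():
            if not ok:
                break
            val = row.get(col)
            if val is None:
                ok = spec.get("nullable", False)
            elif isinstance(spec["type"], dict):  # nested STRUCT column
                ok = isinstance(val, dict) and bool(validate_rows([val], spec["type"]))
            else:
                ok = isinstance(val, spec["type"])
        if ok:
            valid.append(row)
    return valid

# Hypothetical schema with a nullable nested STRUCT column.
schema = {"id": {"type": int},
          "addr": {"type": {"city": {"type": str}}, "nullable": True}}
rows = [{"id": 1, "addr": {"city": "NYC"}},   # valid
        {"id": "x", "addr": None},            # wrong type for id -> dropped
        {"id": 2, "addr": None}]              # valid (addr is nullable)
valid = validate_rows(rows, schema)
```

Filtering structurally invalid rows before they reach the test harness is one way such a pipeline can guarantee syntactic correctness even when the LLM occasionally hallucinates columns or types.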
A Perfectly Truthful Calibration Measure
Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity in our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.
Updated: 2025-08-18 17:09:34
标题: 一个完全真实的校准测量
摘要: 校准要求预测在条件上是无偏的,因此可以被可靠地解释为概率。校准度量衡量预测器距离完美校准有多远。正如Haghtalab等人(2024年)所介绍的,如果校准度量在预测器输出真实概率时期望值最小化,那么它就是真实的。尽管预测真实概率可以保证完美校准,但在现实中,当校准在有限样本上进行评估时,预测真实概率并不能保证最小化任何已知的校准度量。所有已知的校准度量都会激励预测器撒谎,以便在有限样本上看起来更加校准。这种缺乏真实性的问题促使Haghtalab等人(2024年)以及Qiao和Zhao(2025年)在序列预测设置中构建近似真实的校准度量,但即使在更基本的批处理设置中,此前也没有已知的完全真实的校准度量。 我们设计了一个在批处理设置中完全真实的校准度量:平均两箱校准误差(ATB)。除了真实之外,ATB还是可靠的、完备的、连续的,并且与两个现有的校准度量呈二次关系:平滑校准误差(smCal)和(下)校准距离(distCal)。ATB的定义简单明了,使其计算高效且直接。与smCal和distCal相比,ATB支持更快的估计算法,其实现也明显更简单,从而在Hu等人(2024年)研究的校准测试问题上改进了运行时间和简便性。我们还介绍了一个构建真实度量的通用方法,它将ATB的真实性作为特例加以证明,并使我们能够构建其他真实的校准度量,例如按分位数分箱的l_2-ECE。
更新时间: 2025-08-18 17:09:34
领域: cs.LG,cs.DS,stat.ML
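The abstract names the measure, averaged two-bin calibration error, without giving its formula. Under one plausible reading (split predictions into two bins at a threshold, take the weighted per-bin gap between mean prediction and mean outcome, then average over thresholds) a sketch looks like the following; the paper's actual definition of ATB may well differ.

```python
import random

def two_bin_error(preds, outcomes, t):
    """Split predictions into two bins at threshold t; return the weighted sum
    over bins of |mean prediction - mean outcome| (a two-bin ECE)."""
    n, err = len(preds), 0.0
    for lo, hi in ((0.0, t), (t, 1.0)):
        idx = [i for i in range(n)
               if lo <= preds[i] < hi or (hi == 1.0 and preds[i] == 1.0)]
        if idx:
            gap = abs(sum(preds[i] for i in idx) / len(idx)
                      - sum(outcomes[i] for i in idx) / len(idx))
            err += len(idx) / n * gap
    return err

def averaged_two_bin_error(preds, outcomes, n_thresholds=100, seed=0):
    """Average the two-bin error over randomly drawn thresholds."""
    rng = random.Random(seed)
    return sum(two_bin_error(preds, outcomes, rng.random())
               for _ in range(n_thresholds)) / n_thresholds
```

A well-calibrated predictor scores 0 at every split, while constant over- or under-confidence shows up at any threshold, which is the intuition behind averaging coarse two-bin partitions.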
Outlier Detection of Poisson-Distributed Targets Using a Seabed Sensor Network
This paper presents a framework for classifying and detecting spatial commission outliers in maritime environments using seabed acoustic sensor networks and log Gaussian Cox processes (LGCPs). By modeling target arrivals as a mixture of normal and outlier processes, we estimate the probability that a newly observed event is an outlier. We propose a second-order approximation of this probability that incorporates both the mean and variance of the normal intensity function, providing improved classification accuracy compared to mean-only approaches. We analytically show that our method yields a tighter bound to the true probability using Jensen's inequality. To enhance detection, we integrate a real-time, near-optimal sensor placement strategy that dynamically adjusts sensor locations based on the evolving outlier intensity. The proposed framework is validated using real ship traffic data near Norfolk, Virginia, where numerical results demonstrate the effectiveness of our approach in improving both classification performance and outlier detection through sensor deployment.
Updated: 2025-08-18 17:08:09
标题: 使用海床传感器网络检测泊松分布目标的异常值
摘要: 本文提出了一个框架,用于使用海底声学传感器网络和对数高斯Cox过程(LGCPs)在海洋环境中对空间委托异常值进行分类和检测。通过将目标到达建模为正常和异常值过程的混合物,我们估计新观察到的事件是异常值的概率。我们提出了这种概率的二阶近似,该近似结合了正常强度函数的均值和方差,与仅使用均值方法相比提供了改进的分类准确性。我们在分析中证明了我们的方法使用Jensen不等式得出了比真实概率更紧密的界限。为了增强检测能力,我们集成了一种实时、近乎最优的传感器放置策略,根据不断变化的异常值强度动态调整传感器位置。提出的框架使用在弗吉尼亚州诺福克附近的真实船舶交通数据进行验证,数值结果证明了我们的方法在通过传感器部署改进分类性能和异常值检测方面的有效性。
更新时间: 2025-08-18 17:08:09
领域: cs.LG,cs.IT,math.IT
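With arrivals modeled as a superposition of a normal process with random intensity lambda_n and an outlier process with intensity lambda_o, an event is an outlier with probability lambda_o / (lambda_n + lambda_o). The generic second-order Taylor correction around the mean of lambda_n, which is what "incorporating both the mean and variance" suggests, can be sketched as follows; the paper's exact expression may differ.

```python
def outlier_prob_mean(lam_out, mu_norm):
    """Mean-only estimate of P(event is an outlier) for a two-process mixture."""
    return lam_out / (mu_norm + lam_out)

def outlier_prob_second_order(lam_out, mu_norm, var_norm):
    """Second-order Taylor expansion of f(lam) = lam_out / (lam + lam_out)
    around the mean of the random normal-process intensity:
    f(mu) + 0.5 * f''(mu) * var, where f''(mu) = 2 * lam_out / (mu + lam_out)**3."""
    denom = mu_norm + lam_out
    return lam_out / denom + var_norm * lam_out / denom ** 3

# f is convex in lam, so by Jensen's inequality the mean-only value
# underestimates E[f(lam)]; the variance term pushes the estimate upward.
p1 = outlier_prob_mean(1.0, 4.0)                # 0.2
p2 = outlier_prob_second_order(1.0, 4.0, 2.0)   # ~0.216
assert p2 > p1
```

This matches the abstract's claim that the second-order approximation yields a tighter bound on the true probability than the mean-only approach.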
Denoising diffusion models for inverse design of inflatable structures with programmable deformations
Programmable structures are systems whose undeformed geometries and material property distributions are deliberately designed to achieve prescribed deformed configurations under specific loading conditions. Inflatable structures are a prominent example, using internal pressurization to realize large, nonlinear deformations in applications ranging from soft robotics and deployable aerospace systems to biomedical devices and adaptive architecture. We present a generative design framework based on denoising diffusion probabilistic models (DDPMs) for the inverse design of elastic structures undergoing large, nonlinear deformations under pressure-driven actuation. The method formulates the inverse design as a conditional generation task, using geometric descriptors of target deformed states as inputs and outputting image-based representations of the undeformed configuration. Representing these configurations as simple images is achieved through a pre- and post-processing pipeline comprising fixed image-processing, simulation-setup, and descriptor-extraction steps. Numerical experiments with scalar and higher-dimensional descriptors show that the framework can quickly produce diverse undeformed configurations that achieve the desired deformations when inflated, enabling parallel exploration of viable design candidates while accommodating complex constraints.
Updated: 2025-08-18 17:07:51
标题: 去噪扩散模型用于可编程变形充气结构的反向设计
摘要: 可编程结构是指其未变形的几何形状和材料特性分布是经过故意设计的,以在特定加载条件下实现预定变形配置的系统。充气结构是一个显著的例子,利用内部增压来实现大规模、非线性的变形,应用范围从软体机器人和可展开的航空航天系统到生物医学设备和自适应建筑。我们提出了一个基于去噪扩散概率模型(DDPMs)的生成设计框架,用于在压力驱动作用下发生大规模、非线性变形的弹性结构的逆向设计。该方法将逆向设计形式化为有条件的生成任务,使用目标变形状态的几何描述符作为输入,并输出基于图像的未变形配置表示。通过建立一个包括固定图像处理、模拟设置和描述符提取方法的预处理和后处理流水线,将这些配置表示为简单图像。使用标量和高维描述符进行的数值实验表明,该框架可以快速生成多样的未变形配置,当充气时实现所需的变形,从而实现对可行设计候选方案的并行探索,并满足复杂的约束条件。
更新时间: 2025-08-18 17:07:51
领域: cs.CE,cs.LG
VerilogLAVD: LLM-Aided Rule Generation for Vulnerability Detection in Verilog
Timely detection of hardware vulnerabilities during the early design stage is critical for reducing remediation costs. Existing early detection techniques often require specialized security expertise, limiting their usability. Recent efforts have explored the use of large language models (LLMs) for Verilog vulnerability detection. However, LLMs struggle to capture the structure of Verilog code, resulting in inconsistent detection results. To this end, we propose VerilogLAVD, the first LLM-aided graph traversal rule generation approach for Verilog vulnerability detection. Our approach introduces the Verilog Property Graph (VeriPG), a unified representation of Verilog code. It combines syntactic features extracted from the abstract syntax tree (AST) with semantic information derived from control flow and data dependency graphs. We leverage LLMs to generate VeriPG-based detection rules from Common Weakness Enumeration (CWE) descriptions. These rules guide a rule executor that traverses VeriPG to flag potential vulnerabilities. To evaluate VerilogLAVD, we build a dataset collected from open-source repositories and synthesized data. In our empirical evaluation on 77 Verilog designs encompassing 12 CWE types, VerilogLAVD achieves an F1-score of 0.54. Compared to the LLM-only and LLM-with-external-knowledge baselines, VerilogLAVD improves the F1-score by 0.31 and 0.27, respectively.
Updated: 2025-08-18 17:05:18
标题: VerilogLAVD:LLM辅助规则生成用于Verilog漏洞检测
摘要: 在早期设计阶段及时发现硬件漏洞对于降低补救成本至关重要。现有的早期检测技术通常需要专门的安全专业知识,限制了它们的可用性。最近的努力探索了使用大型语言模型(LLMs)进行Verilog漏洞检测。然而,LLMs很难捕捉Verilog代码中的结构,导致检测结果不一致。为此,我们提出了VerilogLAVD,这是一种用于Verilog漏洞检测的第一种LLM辅助图遍历规则生成方法。我们的方法引入了Verilog Property Graph(VeriPG),这是Verilog代码的统一表示。它结合了从抽象语法树(AST)中提取的句法特征和从控制流和数据依赖图中推导出的语义信息。我们利用LLMs从Common Weakness Enumeration(CWE)描述生成基于VeriPG的检测规则。这些规则指导规则执行器遍历VeriPG以寻找潜在的漏洞。为了评估VerilogLAVD,我们构建了一个从开源仓库和合成数据中收集的数据集。在我们对包含12种CWE类型的77个Verilog设计进行的实证评估中,VerilogLAVD实现了0.54的F1分数。与仅使用LLM和LLM与外部知识基线相比,VerilogLAVD分别将F1分数提高了0.31和0.27。
更新时间: 2025-08-18 17:05:18
领域: cs.CR,cs.AI
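VeriPG's node/edge schema and the rule language the LLM emits are not given in the abstract, but the traverse-and-match idea can be illustrated on a toy property graph. The node types, edge labels, and rule format below are all invented for illustration.

```python
def match_rule(graph, rule):
    """Return node paths satisfying a traversal rule over a property graph.
    graph: {node_id: {"type": str, "edges": [(label, dst_id), ...]}}
    rule:  list of (node_type, outgoing_edge_label) steps; the final step's
           edge label is unused and may be None."""
    def extend(path):
        depth = len(path)
        if depth == len(rule):
            yield path
            return
        want_type, _ = rule[depth]
        if depth == 0:
            candidates = list(graph)
        else:
            prev_edge = rule[depth - 1][1]
            candidates = [dst for lbl, dst in graph[path[-1]]["edges"]
                          if lbl == prev_edge]
        for node in candidates:
            if graph[node]["type"] == want_type:
                yield from extend(path + [node])
    return list(extend([]))

# Toy graph: a module containing a debug register and a counter.
graph = {
    "top": {"type": "module", "edges": [("contains", "dbg"), ("contains", "ctr")]},
    "dbg": {"type": "debug_reg", "edges": []},
    "ctr": {"type": "counter", "edges": []},
}
# Hypothetical rule: flag debug registers reachable from a module.
matches = match_rule(graph, [("module", "contains"), ("debug_reg", None)])
```

Encoding CWE descriptions as such declarative traversal rules, rather than asking the LLM to judge raw code, is what makes the detection results reproducible across runs.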
Seeing the Many: Exploring Parameter Distributions Conditioned on Features in Surrogates
Recently, neural surrogate models have emerged as a compelling alternative to traditional simulation workflows. This is accomplished by modeling the underlying function of scientific simulations, removing the need to run expensive simulations. Beyond just mapping from input parameters to outputs, surrogates have also been shown useful for inverse problems: mapping outputs back to input parameters. Inverse problems can be understood as search, where we aim to find parameters whose surrogate outputs contain a specified feature. Yet finding these parameters can be costly, especially for high-dimensional parameter spaces. Thus, existing surrogate-based solutions primarily focus on finding a small set of matching parameters, in the process overlooking the broader picture of plausible parameters. Our work aims to model and visualize the distribution of possible input parameters that produce a given output feature. To achieve this goal, we address two challenges: (1) the approximation error inherent in the surrogate model and (2) forming the parameter distribution in an interactive manner. We model error via density estimation, reporting high density only if a given parameter configuration is close to training parameters, measured over both the input and output space. Our density estimate is used to form a prior belief on parameters, and when combined with a likelihood on features, gives us an efficient way to sample plausible parameter configurations that generate a target output feature. We demonstrate the usability of our solution through a visualization interface by performing feature-driven parameter analysis over the input parameter space of three simulation datasets. Source code is available at https://github.com/matthewberger/seeing-the-many
Updated: 2025-08-18 17:01:01
标题: 看到众多:探索受特征条件限制的代理参数分布
摘要: 最近,神经替代模型已经成为传统模拟工作流程的一个引人注目的替代方案。这是通过对科学模拟的基本功能进行建模实现的,从而消除了运行昂贵模拟的需要。替代模型不仅仅是将输入参数映射到输出,还被证明对于逆问题也是有用的:输出到输入参数。逆问题可以理解为搜索,我们的目标是找到替代输出包含特定特征的参数。然而,找到这些参数可能是昂贵的,特别是对于高维参数空间。因此,现有基于替代模型的解决方案主要侧重于找到一小组匹配参数,而忽视了更广泛的可信参数的整体情况。我们的工作旨在对产生给定输出特征的可能输入参数的分布进行建模和可视化。为了实现这一目标,我们致力于解决两个挑战:(1)替代模型中固有的逼近误差和(2)以交互方式形成参数分布。我们通过密度估计来建模误差,只有在给定参数配置接近训练参数时才报告高密度,同时在输入和输出空间上进行测量。我们的密度估计用于形成对参数的先验信念,当结合特征的似然性时,为我们提供了一种有效的方法来对产生目标输出特征的可信参数配置进行抽样。我们通过可视化界面展示了我们解决方案的可用性,通过在三个模拟数据集的输入参数空间上执行基于特征的参数分析。源代码可在https://github.com/matthewberger/seeing-the-many 上找到。
更新时间: 2025-08-18 17:01:01
领域: cs.LG,cs.HC
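The prior-times-likelihood sampling described above can be sketched in one dimension: a kernel density estimate over training parameters plays the role of the density-based error model, and a Gaussian likelihood scores how well the surrogate output matches the target feature. The 1-D setting, bandwidth, noise scale, and the toy surrogate x -> x^2 are our own choices, not the paper's.

```python
import math
import random

def kde_density(x, train_points, bandwidth=0.5):
    """Gaussian KDE: high only where x is close to training parameters."""
    n = len(train_points)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train_points) / norm

def sample_posterior(train_points, surrogate, target, n=4000, noise=0.3, seed=0):
    """Importance-style sampling: propose parameters uniformly, then weight
    each one by (KDE prior) * (Gaussian likelihood of hitting the target
    feature under the surrogate). Returns candidates and normalized weights."""
    rng = random.Random(seed)
    lo, hi = min(train_points) - 1.0, max(train_points) + 1.0
    cands = [rng.uniform(lo, hi) for _ in range(n)]
    weights = [kde_density(x, train_points)
               * math.exp(-0.5 * ((surrogate(x) - target) / noise) ** 2)
               for x in cands]
    total = sum(weights)
    return cands, [w / total for w in weights]

# Toy surrogate x -> x^2 trained on parameters {-2, -1, 0, 1, 2}: which
# parameters plausibly produce the output feature value 1.0?
cands, probs = sample_posterior([-2, -1, 0, 1, 2], lambda x: x * x, target=1.0)
```

With target 1.0 the weighted samples concentrate near both x = -1 and x = +1, i.e. the method surfaces all plausible parameter regions rather than a single matching configuration.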
3D Cardiac Anatomy Generation Using Mesh Latent Diffusion Models
Diffusion models have recently gained immense interest for their generative capabilities, specifically the high quality and diversity of the synthesized data. However, examples of their applications in 3D medical imaging are still scarce, especially in cardiology. Generating diverse realistic cardiac anatomies is crucial for applications such as in silico trials, electromechanical computer simulations, or data augmentations for machine learning models. In this work, we investigate the application of Latent Diffusion Models (LDMs) for generating 3D meshes of human cardiac anatomies. To this end, we propose a novel LDM architecture -- MeshLDM. We apply the proposed model on a dataset of 3D meshes of left ventricular cardiac anatomies from patients with acute myocardial infarction and evaluate its performance in terms of both qualitative and quantitative clinical and 3D mesh reconstruction metrics. The proposed MeshLDM successfully captures characteristics of the cardiac shapes at end-diastolic (relaxation) and end-systolic (contraction) cardiac phases, generating meshes with a 2.4% difference in population mean compared to the gold standard.
Updated: 2025-08-18 16:53:20
标题: 使用网格潜在扩散模型生成3D心脏解剖结构
摘要: 扩散模型近年来引起了极大的兴趣,特别是因为其生成能力,特别是合成数据的高质量和多样性。然而,在3D医学成像领域,特别是在心脏病学领域,它们的应用示例仍然很少。生成多样化的逼真心脏解剖结构对于诸如仿真试验、电机械计算机模拟或用于机器学习模型的数据增强等应用至关重要。在这项工作中,我们研究了潜在扩散模型(LDMs)在生成人类心脏解剖结构的3D网格方面的应用。为此,我们提出了一种新颖的LDM架构--MeshLDM。我们将提出的模型应用于急性心肌梗死患者左心室心脏解剖结构的3D网格数据集,并评估其在定性和定量临床和3D网格重建指标方面的表现。提出的MeshLDM成功地捕捉了心脏形状在舒张(放松)和收缩(收缩)心脏周期中的特征,生成的网格与黄金标准相比,人口平均值差异为2.4%。
更新时间: 2025-08-18 16:53:20
领域: eess.IV,cs.CV,cs.LG,q-bio.TO
AutoChemSchematic AI: Agentic Physics-Aware Automation for Chemical Manufacturing Scale-Up
Recent advances in generative AI have accelerated the discovery of novel chemicals and materials. However, scaling these discoveries to industrial production remains a major bottleneck due to the synthesis gap -- the need to develop entirely new manufacturing processes. This challenge requires detailed engineering blueprints: process flow diagrams (PFDs) for equipment layouts and material/energy flows, and piping and instrumentation diagrams (PIDs) for process plant operations. Current AI systems cannot yet reliably generate these critical engineering schematics, creating a fundamental obstacle to manufacturing scale-up of novel discoveries. We present a closed-loop, physics-aware framework for the automated generation of industrially viable PFDs and PIDs. The framework integrates three key components: (1) domain-specialized small language models (SLMs) trained for auto-generation of PFDs and PIDs, (2) a hierarchical knowledge graph containing process flow and instrumentation descriptions for 1,020+ chemicals for Graph Retrieval-Augmented Generation (GRAG), and (3) an open-source chemical process simulator for modeling, simulation, optimization, and analysis of novel chemical processes. The SLMs are trained through a multi-stage pipeline on synthetic datasets, with process-simulator-in-the-loop validation ensuring feasibility. To enhance computational efficiency, the framework implements structural pruning (width and depth) guided by importance heuristics to reduce language model size while preserving accuracy, followed by advanced inference optimizations including FlashAttention, Lookahead Decoding, PagedAttention with KV-cache quantization, and Test-Time Inference Scaling. Experimental results demonstrate that our framework generates simulator-validated process descriptions with high fidelity.
Updated: 2025-08-18 16:52:22
标题: AutoChemSchematic AI:化学制造规模化的主动物理感知自动化
摘要: 最近人工智能生成技术的进步加速了新型化学物质和材料的发现。然而,将这些发现扩展到工业生产仍然存在一个主要瓶颈,即合成差距 - 需要开发全新的制造过程。这一挑战需要详细的工程蓝图:设备布局和物料/能量流的PFD以及工艺厂操作的PID。当前的人工智能系统尚不能可靠地生成这些关键的工程图表,从而为新发现的制造规模扩大创造了一个基本障碍。我们提出了一个闭环、具有物理意识的框架,用于自动生成工业可行的PFD和PID。该框架集成了三个关键组件:(1)针对自动生成PFD和PID进行训练的领域专业化小语言模型(SLMs),(2)包含1,020多种化学品的过程流程和仪表描述的分层知识图,用于图检索增强生成(GRAG),以及(3)用于建模、仿真、优化和分析新型化学过程的开源化学过程模拟器。SLMs通过多阶段管道在合成数据集上进行训练,通过过程模拟器-循环验证确保可行性。为了增强计算效率,该框架实施了结构修剪(宽度和深度),依据重要性启发式指导以减少语言模型大小同时保持准确性,然后进行先进的推断优化,包括FlashAttention、Lookahead Decoding、带KV-cache量化的PagedAttention以及Test-Time推理缩放。实验结果表明,我们的框架生成了具有高度准确性的经过模拟验证的工艺描述。
更新时间: 2025-08-18 16:52:22
领域: cs.LG,cs.AI,cs.IR
From Transthoracic to Transesophageal: Cross-Modality Generation using LoRA Diffusion
Deep diffusion models excel at realistic image synthesis but demand large training sets, an obstacle in data-scarce domains like transesophageal echocardiography (TEE). While synthetic augmentation has boosted performance in transthoracic echo (TTE), TEE remains critically underrepresented, limiting the reach of deep learning in this high-impact modality. We address this gap by adapting a TTE-trained, mask-conditioned diffusion backbone to TEE with only a limited number of new cases and adapters as small as $10^5$ parameters. Our pipeline combines Low-Rank Adaptation with MaskR$^2$, a lightweight remapping layer that aligns novel mask formats with the pretrained model's conditioning channels. This design lets users adapt models to new datasets with a different set of anatomical structures from the base model's original set. Through a targeted adaptation strategy, we find that adapting only MLP layers suffices for high-fidelity TEE synthesis. Finally, mixing fewer than 200 real TEE frames with our synthetic echoes improves the Dice score on a multiclass segmentation task, particularly boosting performance on underrepresented right-heart structures. Our results demonstrate that (1) semantically controlled TEE images can be generated with low overhead, (2) MaskR$^2$ effectively transforms unseen mask formats into compatible ones without damaging downstream task performance, and (3) our method generates images that effectively improve performance on a downstream multiclass segmentation task.
Updated: 2025-08-18 16:48:53
标题: 从经胸到经食道:使用LoRA扩散进行跨模态生成
摘要: 深度扩散模型在逼真图像合成方面表现出色,但需要大量的训练集 - 这在数据稀缺的领域(如经食管超声心动图(TEE))是一个障碍。虽然合成增强已经提高了经胸超声心动图(TTE)的性能,但TEE仍然严重缺乏代表性,限制了深度学习在这种高影响模式下的应用范围。 我们通过将经TTE训练的、以掩模为条件的扩散骨干适应到TEE中,仅使用有限数量的新案例和参数数量小至$10^5$的适配器,来填补这一差距。我们的流程结合了低秩适应和MaskR$^2$,这是一个轻量级的重新映射层,将新的掩模格式与预训练模型的条件通道对齐。这种设计允许用户将模型适应到解剖结构集合不同于基础模型原始集合的新数据集上。 通过有针对性的适应策略,我们发现仅适应MLP层就足以实现高保真度的TEE合成。最后,将少于200个真实TEE帧与我们的合成回声混合,可以提高多类分割任务中的Dice分数,尤其提升了在代表性不足的右心结构上的性能。我们的结果表明,(1)可以用较低的开销生成语义控制的TEE图像,(2) MaskR$^2$可以有效地将未见过的掩模格式转换为兼容格式,而不损害下游任务的性能,(3)我们的方法生成的图像对提高多类分割的下游任务性能有效。
更新时间: 2025-08-18 16:48:53
领域: eess.IV,cs.AI
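The Low-Rank Adaptation (LoRA) step at the core of the entry above can be sketched in a few lines. This is a minimal illustration of the general LoRA idea (a frozen weight plus a scaled low-rank update), not the paper's actual TEE pipeline; all shapes, names, and the `alpha / r` scaling convention are illustrative assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen base weight W plus low-rank update B @ A, scaled by alpha / rank."""
    r = A.shape[0]                      # rank of the adapter
    delta = (B @ A) * (alpha / r)       # low-rank weight update
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))     # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialized
x = rng.normal(size=(3, d_in))

# With B = 0 the adapter is inactive: output equals the frozen model's output.
y0 = lora_forward(x, W, A, B)
```

With rank `r`, the adapter adds only `r * (d_in + d_out)` trainable parameters instead of `d_in * d_out`, which is how adapters on the order of $10^5$ parameters become feasible.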
CaRL: Learning Scalable Planning Policies with Simple Rewards
We investigate reinforcement learning (RL) for privileged planning in autonomous driving. State-of-the-art approaches for this task are rule-based, but these methods do not scale to the long tail. RL, on the other hand, is scalable and does not suffer from compounding errors like imitation learning. Contemporary RL approaches for driving use complex shaped rewards that sum multiple individual rewards, e.g., progress, position, or orientation rewards. We show that PPO fails to optimize a popular version of these rewards when the mini-batch size is increased, which limits the scalability of these approaches. Instead, we propose a new reward design based primarily on optimizing a single intuitive reward term: route completion. Infractions are penalized by terminating the episode or multiplicatively reducing route completion. We find that PPO scales well with higher mini-batch sizes when trained with our simple reward, even improving performance. Training with large mini-batch sizes enables efficient scaling via distributed data parallelism. We scale PPO to 300M samples in CARLA and 500M samples in nuPlan with a single 8-GPU node. The resulting model achieves 64 DS on the CARLA longest6 v2 benchmark, outperforming other RL methods with more complex rewards by a large margin. Requiring only minimal adaptations from its use in CARLA, the same method is the best learning-based approach on nuPlan. It scores 91.3 in non-reactive and 90.6 in reactive traffic on the Val14 benchmark while being an order of magnitude faster than prior work.
Updated: 2025-08-18 16:46:42
标题: CaRL:利用简单奖励学习可扩展规划策略
摘要: 我们研究了自动驾驶中特权规划的强化学习(RL)。目前这一任务的最先进方法是基于规则的,但这些方法无法扩展到长尾情况。相反,RL是可扩展的,并且不像模仿学习那样受到累积误差的影响。当代自动驾驶的RL方法使用复杂的奖励形状,这些奖励包括多个单独的奖励,如进度、位置或方向奖励。我们发现,当小批量(mini-batch)大小增加时,PPO未能优化这些奖励的一个流行版本,这限制了这些方法的可扩展性。相反,我们提出了一种新的奖励设计,主要基于优化一个直观的奖励项:路线完成度。违规行为会通过终止回合(episode)或乘法减少路线完成度来进行惩罚。我们发现,当使用我们的简单奖励进行训练时,PPO在较大的小批量大小下表现出良好的扩展性,甚至提高了性能。使用大的小批量进行训练可以通过分布式数据并行化实现高效扩展。我们将PPO扩展到了CARLA的300M个样本和nuPlan的500M个样本,仅使用一个8-GPU节点。结果模型在CARLA longest6 v2基准测试中获得了64 DS的成绩,远远超过其他使用更复杂奖励的RL方法。只需要最小的适应,相同的方法就是nuPlan上最佳的基于学习的方法,在Val14基准测试中,非反应性交通下得分91.3,反应性交通下得分90.6,同时比以前的工作快一个数量级。
更新时间: 2025-08-18 16:46:42
领域: cs.LG,cs.AI,cs.RO
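The reward design described above — a single route-completion term, with infractions either terminating the episode or multiplicatively reducing the reward — can be illustrated with a minimal sketch. The function name and penalty factors below are hypothetical choices, not CaRL's actual constants.

```python
def simple_driving_reward(route_progress, infractions, terminal_penalty=False):
    """Reward = per-step route completion, multiplicatively reduced per infraction.

    route_progress: fraction of route completed this step (e.g. 0.02 for 2%).
    infractions: list of multiplicative penalty factors in (0, 1],
    e.g. 0.5 for a lane infraction (an illustrative factor, not CaRL's value).
    """
    reward = route_progress
    for factor in infractions:
        reward *= factor                 # multiplicative reduction
    done = terminal_penalty              # severe infractions end the episode
    return reward, done

# One step with 2% route progress and a single 0.5-factor infraction.
r, done = simple_driving_reward(0.02, [0.5])
```

The appeal of this design is that the optimization target stays a single term, instead of a weighted sum of progress, position, and orientation rewards that PPO reportedly struggles to optimize at large mini-batch sizes.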
GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data
Although data that can be naturally represented as graphs is widespread in real-world applications across diverse industries, popular graph ML benchmarks for node property prediction only cover a surprisingly narrow set of data domains, and graph neural networks (GNNs) are often evaluated on just a few academic citation networks. This issue is particularly pressing in light of the recent growing interest in designing graph foundation models. These models are supposed to be able to transfer to diverse graph datasets from different domains, and yet the proposed graph foundation models are often evaluated on a very limited set of datasets from narrow applications. To alleviate this issue, we introduce GraphLand: a benchmark of 14 diverse graph datasets for node property prediction from a range of different industrial applications. GraphLand allows evaluating graph ML models on a wide range of graphs with diverse sizes, structural characteristics, and feature sets, all in a unified setting. Further, GraphLand allows investigating such previously underexplored research questions as how realistic temporal distributional shifts under transductive and inductive settings influence graph ML model performance. To mimic realistic industrial settings, we use GraphLand to compare GNNs with gradient-boosted decision trees (GBDT) models that are popular in industrial applications and show that GBDTs provided with additional graph-based input features can sometimes be very strong baselines. Further, we evaluate currently available general-purpose graph foundation models and find that they fail to produce competitive results on our proposed datasets.
Updated: 2025-08-18 16:45:52
标题: GraphLand:在多样化的工业数据上评估图机器学习模型
摘要: 尽管可以自然表示为图形的数据在不同行业的实际应用中广泛存在,但用于节点属性预测的流行图形机器学习基准仅涵盖了一组令人惊讶地狭窄的数据领域,并且图神经网络(GNN)通常仅在少数学术引文网络上进行评估。鉴于最近对设计图形基础模型的兴趣日益增长,这个问题尤为紧迫。这些模型应能够转移到来自不同领域的多样化图形数据集,然而提出的图形基础模型通常仅在来自狭窄应用领域的一组数据集上进行评估。为了缓解这个问题,我们引入了GraphLand:一个包含来自不同工业应用领域的14个多样化图形数据集的基准,用于节点属性预测。GraphLand允许在统一设置中评估各种大小、结构特征和特征集的图形机器学习模型。此外,GraphLand还允许研究以前未曾探索的研究问题,例如在直推式(transductive)和归纳式设置下,现实的时间分布偏移如何影响图机器学习模型的性能。为了模拟现实工业环境,我们使用GraphLand比较了在工业应用中流行的基于梯度提升决策树(GBDT)模型与GNN模型,并发现提供额外基于图形的输入特征的GBDT有时可能是非常强大的基线。此外,我们评估了目前可用的通用图形基础模型,并发现它们无法在我们提出的数据集上产生竞争力的结果。
更新时间: 2025-08-18 16:45:52
领域: cs.LG,cs.AI
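The paper's observation that GBDTs become strong baselines when given additional graph-based input features can be illustrated with a minimal feature-augmentation sketch. Degree and mean-of-neighbor features are common choices for this; the benchmark's exact feature set is not specified here, so the ones below are assumptions.

```python
def graph_features(adjacency, node_feats):
    """Augment each node's features with its degree and the mean of its
    neighbors' features, to feed into a tabular model such as a GBDT."""
    out = {}
    for node, feats in node_feats.items():
        nbrs = adjacency.get(node, [])
        deg = len(nbrs)
        if nbrs:
            mean_nbr = [sum(node_feats[n][i] for n in nbrs) / deg
                        for i in range(len(feats))]
        else:
            mean_nbr = [0.0] * len(feats)
        out[node] = list(feats) + [float(deg)] + mean_nbr
    return out

# A toy 3-node star graph with one scalar feature per node.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = {0: [1.0], 1: [3.0], 2: [5.0]}
aug = graph_features(adj, feats)
```

The augmented rows can then be passed to any GBDT library as an ordinary feature matrix, which is what makes this a drop-in baseline for graph tasks.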
A Language-Signal-Vision Multimodal Framework for Multitask Cardiac Analysis
Contemporary cardiovascular management involves complex consideration and integration of multimodal cardiac datasets, where each modality provides distinct but complementary physiological characteristics. While the effective integration of multiple modalities could yield a holistic clinical profile that accurately models the true clinical situation with respect to data modalities and their relative weightings, current methodologies remain limited by: 1) the scarcity of patient- and time-aligned multimodal data; 2) reliance on isolated single-modality or rigid multimodal input combinations; 3) alignment strategies that prioritize cross-modal similarity over complementarity; and 4) a narrow single-task focus. In response to these limitations, a comprehensive multimodal dataset was curated for immediate application, integrating laboratory test results, electrocardiograms, and echocardiograms with clinical outcomes. Subsequently, a unified framework, Textual Guidance Multimodal fusion for Multiple cardiac tasks (TGMM), was proposed. TGMM incorporated three key components: 1) a MedFlexFusion module designed to capture the unique and complementary characteristics of medical modalities and dynamically integrate data from diverse cardiac sources and their combinations; 2) a textual guidance module to derive task-relevant representations tailored to diverse clinical objectives, including heart disease diagnosis, risk stratification and information retrieval; and 3) a response module to produce final decisions for all these tasks. Furthermore, this study systematically explored key features across multiple modalities and elucidated their synergistic contributions in clinical decision-making. Extensive experiments showed that TGMM outperformed state-of-the-art methods across multiple clinical tasks, with additional validation confirming its robustness on another public dataset.
Updated: 2025-08-18 16:43:31
标题: 一个用于多任务心脏分析的语言-信号-视觉多模态框架
摘要: 当代心血管管理涉及复杂的多模态心脏数据集的考虑和整合,每种模态提供独特但互补的生理特征。尽管有效整合多种模态可能产生一个全面的临床概况,准确地模拟真实的临床情况,但当前的方法仍受到限制:1)患者和时间对齐的多模态数据稀缺;2)依赖于孤立的单模态或刚性的多模态输入组合;3)优先考虑跨模态相似性而不是互补性的对齐策略;4)狭窄的单一任务焦点。针对这些限制,一个综合的多模态数据集被策划用于即时应用,整合了实验室检测结果、心电图和超声心动图与临床结果。随后,提出了一个统一的框架,即用于多种心脏任务的文本指导多模态融合(TGMM)。TGMM包括三个关键组件:1)MedFlexFusion模块旨在捕捉医学模态的独特和互补特征,并动态整合来自不同心脏来源及其组合的数据;2)文本指导模块用于推导与多样化临床目标相关的任务相关表示,包括心脏疾病诊断、风险分层和信息检索;3)响应模块用于为所有这些任务产生最终决策。此外,该研究系统地探讨了多种模态的关键特征,并阐明了它们在临床决策中的协同贡献。广泛的实验表明,TGMM在多个临床任务中优于现有方法,额外的验证证实了它在另一个公共数据集上的稳健性。
更新时间: 2025-08-18 16:43:31
领域: cs.AI
Reinforced Context Order Recovery for Adaptive Reasoning and Planning
Modern causal language models, along with the rapidly developing family of discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens with a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens are generated originally. In this paper, we observe that current causal and diffusion models encounter difficulties in problems that require adaptive token generation orders to solve tractably, which we characterize with the $\mathcal{V}$-information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework to extract adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. Experiments on challenging reasoning and planning datasets demonstrate the superior performance of ReCOR compared with baselines, sometimes outperforming oracle models supervised with the ground-truth order.
Updated: 2025-08-18 16:42:55
标题: 强化上下文排序恢复用于自适应推理和规划
摘要: 现代因果语言模型,以及快速发展的离散扩散模型,现在可以产生各种有趣和有用的内容。然而,这些模型族主要被训练为以固定(从左到右)或随机顺序输出标记,这可能偏离标记最初生成的逻辑顺序。在本文中,我们观察到当前的因果和扩散模型在需要自适应标记生成顺序才能高效求解的问题上遇到困难,我们用$\mathcal{V}$-信息框架对此进行刻画。受此启发,我们提出了Reinforced Context Order Recovery (ReCOR),这是一个基于强化学习的框架,可以从文本数据中提取自适应的、数据相关的标记生成顺序,而无需注释。通过标记预测统计进行自监督,ReCOR估计预测每个未填充标记的难度,并在训练和推理过程中自适应地选择下一个标记。在具有挑战性的推理和规划数据集上进行的实验表明,与基线相比,ReCOR表现出更优异的性能,有时甚至优于以真实生成顺序作为监督的oracle模型。
更新时间: 2025-08-18 16:42:55
领域: cs.CL,cs.AI
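The core loop described above — estimate the hardness of every unfilled token, then adaptively pick the next position to generate — can be sketched as follows. Using per-slot entropy as the hardness proxy is an illustrative assumption; ReCOR's actual statistic is learned from token prediction statistics.

```python
import math

def next_position(probs_per_slot, filled):
    """Pick the unfilled slot whose predicted token distribution has the
    lowest entropy, i.e. generate the currently 'easiest' token next."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return min((entropy(p), i) for i, p in enumerate(probs_per_slot)
               if i not in filled)[1]

# Three unfilled slots over a binary vocabulary; slot 1 is the most certain.
slots = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
first = next_position(slots, filled=set())
```

Repeating this selection after each generated token yields a data-dependent order rather than a fixed left-to-right one.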
Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation
We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals and is equipped with a novel multi-term gaze-attention loss combining MSE, KL divergence, correlation, and center-of-mass alignment. Incorporating fixations improves F1 score from 0.597 to 0.631 (+5.70%) and AUC from 0.821 to 0.849 (+3.41%), while also improving precision and recall, highlighting the effectiveness of gaze-informed attention supervision. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords, maps them to anatomical regions using a curated dictionary constructed from domain-specific priors, and generates region-aligned sentences via structured prompts. This pipeline improves report quality as measured by clinical keyword recall and ROUGE overlap. Our results demonstrate that integrating gaze data improves both classification performance and the interpretability of generated medical reports.
Updated: 2025-08-18 16:42:29
标题: 注视图像:注视监督的多模态学习用于胸部X射线诊断和报告生成
摘要: 我们提出了一个两阶段的多模态框架,利用MIMIC-Eye数据集,增强胸部X光片的疾病分类和区域感知放射学报告生成。在第一阶段,我们引入了一个注视引导的对比学习架构用于疾病分类。它整合了视觉特征、临床标签、边界框以及放射科医生的眼动信号,并配备了一个新颖的多项注视注意力损失,结合了MSE、KL散度、相关性和重心对齐。整合注视点将F1分数从0.597提高到0.631(+5.70%),AUC从0.821提高到0.849(+3.41%),同时也改善了精确度和召回率,突出了注视引导的注意力监督的有效性。在第二阶段,我们提出了一个模块化的报告生成流水线,提取带有置信度权重的诊断关键词,使用从领域特定先验构建的策划词典将它们映射到解剖区域,并通过结构化提示生成与区域对齐的句子。该流水线提高了以临床关键词召回率和ROUGE重叠度衡量的报告质量。我们的结果表明,整合注视数据可以提高分类性能和生成医学报告的可解释性。
更新时间: 2025-08-18 16:42:29
领域: cs.CV,cs.LG
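The multi-term gaze-attention loss named above (MSE, KL divergence, correlation, and center-of-mass alignment) can be sketched over two normalized 2-D heatmaps. The term weights and the exact normalization below are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def gaze_attention_loss(attn, gaze, w=(1.0, 1.0, 1.0, 1.0), eps=1e-8):
    """Combine MSE, KL divergence, (1 - correlation), and center-of-mass
    distance between a model attention map and a gaze heatmap (both 2-D)."""
    a = attn / (attn.sum() + eps)          # normalize to probability maps
    g = gaze / (gaze.sum() + eps)
    mse = float(((a - g) ** 2).mean())
    kl = float((g * np.log((g + eps) / (a + eps))).sum())
    corr = float(np.corrcoef(a.ravel(), g.ravel())[0, 1])
    ys, xs = np.indices(a.shape)
    com = lambda m: np.array([(ys * m).sum(), (xs * m).sum()])
    com_dist = float(np.linalg.norm(com(a) - com(g)))
    return w[0] * mse + w[1] * kl + w[2] * (1 - corr) + w[3] * com_dist

m = np.array([[0.1, 0.2], [0.3, 0.4]])
loss0 = gaze_attention_loss(m, m)          # identical maps give ~zero loss
```

Each term penalizes a different kind of mismatch: pointwise error, distributional divergence, shape disagreement, and spatial offset of the attention mass.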
ViTAD: Timing Violation-Aware Debugging of RTL Code using Large Language Models
In modern Very Large Scale Integrated (VLSI) circuit design flow, the Register-Transfer Level (RTL) stage presents a critical opportunity for timing optimization. Addressing timing violations at this early stage is essential, as modern systems demand higher speeds, where even minor timing violations can lead to functional failures or system crashes. However, traditional timing optimization heavily relies on manual expertise, requiring engineers to iteratively analyze timing reports and debug. To automate this process, this paper proposes ViTAD, a method that efficiently analyzes the root causes of timing violations and dynamically generates targeted repair strategies. Specifically, we first parse Verilog code and timing reports to construct a Signal Timing Dependency Graph (STDG). Based on the STDG, we perform violation path analysis and use large language models (LLMs) to infer the root causes of violations. Finally, by analyzing the causes of violations, we selectively retrieve relevant debugging knowledge from a domain-specific knowledge base to generate customized repair solutions. To evaluate the effectiveness of our method, we construct a timing violation dataset based on real-world open-source projects. This dataset contains 54 cases of violations. Experimental results show that our method achieves a 73.68% success rate in repairing timing violations, while the baseline using only an LLM achieves 54.38%. Our method improves the success rate by 19.30 percentage points.
Updated: 2025-08-18 16:41:32
标题: ViTAD: 使用大型语言模型进行RTL代码的时序违规感知调试
摘要: 在现代超大规模集成(VLSI)电路设计流程中,寄存器传输级(RTL)阶段为时序优化提供了关键机会。在这个早期阶段解决时序违规是必不可少的,因为现代系统需要更高的速度,即使是微小的时序违规也可能导致功能故障或系统崩溃。然而,传统的时序优化严重依赖于手动专业知识,需要工程师反复分析时序报告并进行调试。为了自动化这个过程,本文提出了ViTAD方法,该方法有效地分析时序违规的根本原因,并动态生成有针对性的修复策略。具体来说,我们首先解析Verilog代码和时序报告,构建信号时序依赖图(STDG)。基于STDG,我们进行违规路径分析,并使用大型语言模型(LLMs)推断违规的根本原因。最后,通过分析违规原因,我们从特定领域的知识库中选择性地检索相关的调试知识,生成定制的修复方案。为了评估我们方法的有效性,我们基于真实的开源项目构建了一个时序违规数据集。该数据集包含54个违规案例。实验结果显示,我们的方法在修复时序违规方面的成功率为73.68%,而仅使用LLM的基准值为54.38%。我们的方法将成功率提高了19.30%。
更新时间: 2025-08-18 16:41:32
领域: cs.AR,cs.AI
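The Signal Timing Dependency Graph (STDG) construction described above can be sketched as an adjacency structure built from timing arcs, with the worst accumulated path delay recovered by a longest-path walk. The arc tuples and delay values below are hypothetical, and the sketch assumes an acyclic graph; the paper's actual STDG is parsed from Verilog code and timing reports.

```python
def build_stdg(arcs):
    """Signal Timing Dependency Graph as an adjacency dict.
    arcs are (driver_signal, loaded_signal, delay_ns) tuples."""
    graph = {}
    for src, dst, delay in arcs:
        graph.setdefault(src, []).append((dst, delay))
    return graph

def worst_path_delay(graph, start):
    """Longest accumulated delay reachable from `start` (acyclic graph)."""
    best = 0.0
    for dst, delay in graph.get(start, []):
        best = max(best, delay + worst_path_delay(graph, dst))
    return best

# Hypothetical arcs: a -> b -> c accumulates more delay than a -> c.
arcs = [("a", "b", 1.0), ("b", "c", 2.5), ("a", "c", 2.0)]
g = build_stdg(arcs)
```

Comparing the worst path delay against the clock period is what flags a path as violating, after which the root-cause analysis can focus on that path.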
LLMs Are In-Context Bandit Reinforcement Learners
Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external reward, instead of supervised data. We show that LLMs effectively demonstrate such learning, and provide a detailed study of the phenomena, experimenting with challenging classification tasks and models of sizes from 500M to 70B parameters. This includes identifying and addressing the instability of the process, demonstrating learning with both semantic and abstract labels, and showing scaling trends. Our findings highlight ICRL capabilities in LLMs, while also underscoring fundamental limitations in their implicit reasoning about errors.
Updated: 2025-08-18 16:38:43
标题: LLMs是上下文多臂赌博强化学习者
摘要: 大型语言模型(LLMs)在上下文学习(ICL)方面表现出色,ICL是一种依赖于向模型上下文添加注释示例的监督学习技术。我们研究了上下文内强化学习(ICRL)的一种上下文赌博机(contextual bandit)版本,其中模型在上下文中在线地从外部奖励中学习,而非从监督数据中学习。我们展示了LLMs能够有效进行这种学习,并在具有挑战性的分类任务上,使用参数规模从500M到70B的模型,对该现象进行了详细研究。这包括识别和解决该过程的不稳定性,展示在语义标签和抽象标签下的学习,并展示扩展趋势。我们的研究结果突出了LLMs中的ICRL能力,同时也强调了它们在对错误进行隐式推理方面的基本局限。
更新时间: 2025-08-18 16:38:43
领域: cs.CL,cs.AI,cs.LG
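The contextual-bandit ICRL setup described above can be sketched as a loop in which only the tried action's reward is observed and appended to a growing context. The `greedy_act` function is a toy stand-in for an LLM (it exploits any action already rewarded for the same input); the task set, action names, and round count are all hypothetical.

```python
import random

def icrl_loop(tasks, act, rounds=100, seed=0):
    """Contextual-bandit in-context RL skeleton: `act` maps (context, x) to
    an action; bandit feedback (x, action, reward) is appended online."""
    rng = random.Random(seed)
    context = []
    total = 0
    for _ in range(rounds):
        x, correct = tasks[rng.randrange(len(tasks))]
        action = act(context, x, rng)
        reward = 1 if action == correct else 0
        context.append((x, action, reward))   # only the chosen action's reward
        total += reward
    return total, context

# Toy stand-in for an LLM policy: reuse any action already rewarded for x.
def greedy_act(context, x, rng):
    for cx, ca, cr in reversed(context):
        if cx == x and cr == 1:
            return ca
    return rng.choice(["A", "B"])

tasks = [("q1", "A"), ("q2", "B")]
score, ctx = icrl_loop(tasks, greedy_act)
```

The point of the skeleton is the data flow, not the policy: unlike supervised ICL, the context never contains ground-truth labels, only tried actions and their rewards.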
Is This News Still Interesting to You?: Lifetime-aware Interest Matching for News Recommendation
Personalized news recommendation aims to deliver news articles aligned with users' interests, serving as a key solution to alleviate the problem of information overload on online news platforms. While prior work has improved interest matching through refined representations of news and users, the following time-related challenges remain underexplored: (C1) leveraging the age of clicked news to infer users' interest persistence, and (C2) modeling the varying lifetime of news across topics and users. To jointly address these challenges, we propose a novel Lifetime-aware Interest Matching framework for nEws recommendation, named LIME, which incorporates three key strategies: (1) User-Topic lifetime-aware age representation to capture the relative age of news with respect to a user-topic pair, (2) Candidate-aware lifetime attention for generating temporally aligned user representation, and (3) Freshness-guided interest refinement for prioritizing valid candidate news at prediction time. Extensive experiments on two real-world datasets demonstrate that LIME consistently outperforms a wide range of state-of-the-art news recommendation methods, and its model agnostic strategies significantly improve recommendation accuracy.
Updated: 2025-08-18 16:36:27
标题: 这则新闻对您仍然感兴趣吗?:面向新闻推荐的生命周期感知兴趣匹配
摘要: 个性化新闻推荐旨在提供符合用户兴趣的新闻文章,作为缓解在线新闻平台信息过载问题的关键解决方案。尽管先前的工作通过对新闻和用户的精细表示改进了兴趣匹配,但以下与时间相关的挑战仍未被充分探讨:(C1)利用点击新闻的年龄推断用户兴趣持久性,以及(C2)对不同主题和用户之间新闻的寿命进行建模。为了共同解决这些挑战,我们提出了一种新颖的基于寿命的兴趣匹配框架,用于新闻推荐,命名为LIME,该框架融合了三个关键策略:(1)用户-主题寿命感知年龄表示,以捕捉新闻相对于用户-主题对的相对年龄,(2)候选项感知寿命关注,用于生成与时间对齐的用户表示,以及(3)新鲜度引导的兴趣细化,用于在预测时优先考虑有效的候选新闻。在两个真实世界数据集上进行的广泛实验证明,LIME始终优于广泛范围的最先进新闻推荐方法,其模型无关策略显著提高了推荐准确性。
更新时间: 2025-08-18 16:36:27
领域: cs.IR,cs.LG
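Strategy (1) above, the user-topic lifetime-aware age representation, can be sketched as the age of a clicked article normalized by the lifetime of that (user, topic) pair. The 24-hour lifetime and the clipping to [0, 1] are illustrative assumptions, not LIME's actual parameterization.

```python
def relative_age(click_time, publish_time, lifetime_hours):
    """Age of a clicked article normalized by the (user, topic) lifetime,
    clipped to [0, 1]: 0 means fresh, 1 means at or past end of lifetime.
    Times are in seconds."""
    age_h = (click_time - publish_time) / 3600.0
    return min(max(age_h / lifetime_hours, 0.0), 1.0)

# Hypothetical: sports news stays interesting ~24h for this user-topic pair.
fresh = relative_age(click_time=3600, publish_time=0, lifetime_hours=24)
```

Because the lifetime varies per user and per topic, the same absolute age maps to different relative ages, which is what lets the model treat a day-old sports story and a day-old politics story differently.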
Hierarchical Evaluation Function (HEF): A Multi-Metric Approach for Optimizing Demand Forecasting Models
Demand forecasting is essential for strategic planning in competitive environments, enabling resource optimization and improved responsiveness to market dynamics. However, multivariate time series modeling faces challenges due to data complexity, uncertainty, and frequent regime shifts. Traditional evaluation metrics can introduce biases and limit generalization. This work compares two custom evaluation functions: FMAE (Focused Mean Absolute Error), focused on minimizing absolute errors, and HEF (Hierarchical Evaluation Function), designed to weight global metrics and penalize large deviations. Experiments were conducted under different data splits (91:9, 80:20, 70:30) using three optimizers (Grid Search, PSO, Optuna), assessing fit, relative accuracy, robustness, and computational efficiency. Results show that HEF consistently outperforms FMAE in global metrics (R2, Relative Accuracy, RMSE, RMSSE), enhancing model robustness and explanatory power. These findings were confirmed via visualizations and statistical tests. Conversely, FMAE offers advantages in local metrics (MAE, MASE) and execution time, making it suitable for short-term scenarios. The study highlights a methodological trade-off: HEF is ideal for strategic planning, while FMAE is better suited for operational efficiency. A replicable framework is proposed for optimizing predictive models in dynamic environments.
Updated: 2025-08-18 16:25:49
标题: 分层评估函数(HEF):优化需求预测模型的多指标方法
摘要: 需求预测对于竞争环境中的战略规划至关重要,可以实现资源优化和更好地响应市场动态。然而,多变量时间序列建模面临数据复杂性、不确定性和频繁的状态(regime)转变等挑战。传统的评估指标可能会引入偏见并限制泛化能力。本研究比较了两种定制评估函数:以最小化绝对误差为目标的FMAE(专注均值绝对误差)和设计用于加权全局指标并惩罚大偏差的HEF(分层评估函数)。实验在不同的数据拆分(91:9、80:20、70:30)和三种优化器(网格搜索、PSO、Optuna)下进行,评估拟合度、相对准确度、稳健性和计算效率。结果显示HEF在全局指标(R2、相对准确度、RMSE、RMSSE)方面始终优于FMAE,增强了模型的稳健性和解释能力。这些发现通过可视化和统计检验得到了确认。相反,FMAE在局部指标(MAE、MASE)和执行时间方面具有优势,使其适用于短期情景。该研究突出了一种方法论折衷:HEF适用于战略规划,而FMAE更适用于运营效率。提出了一个可复制的框架,用于优化动态环境中的预测模型。
更新时间: 2025-08-18 16:25:49
领域: cs.LG,cs.AI,cs.PF,62M10, 90C59, 68T05,I.2.6; I.5.1; I.5.2; I.5.4; G.1.6
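The idea of an evaluation function that weights a global metric and additionally penalizes large deviations can be illustrated with a toy version. The weights, the tolerance, and the penalty term below are assumptions for illustration; the paper's exact HEF formulation may differ.

```python
import math

def hef(y_true, y_pred, w_rmse=1.0, w_penalty=2.0, tol=0.2):
    """Toy hierarchical evaluation: RMSE plus a penalty that grows with the
    fraction of relative errors exceeding `tol` (illustrative weights)."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    big = sum(1 for t, p in zip(y_true, y_pred)
              if t != 0 and abs(t - p) / abs(t) > tol)
    return w_rmse * rmse + w_penalty * (big / n)

good = hef([100, 200, 300], [101, 198, 305])   # small errors everywhere
bad = hef([100, 200, 300], [101, 198, 450])    # one large deviation
```

A plain MAE objective would rate the second forecast only moderately worse; the deviation penalty is what makes such a function prefer uniformly reliable forecasts, the behavior attributed to HEF above.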
When can in-context learning generalize out of task distribution?
In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emph{out-of-distribution}. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.
Updated: 2025-08-18 16:18:38
标题: 上下文学习何时能够泛化到任务分布之外?
摘要: 上下文学习(ICL)是预训练转换器的一个显著能力,使模型能够在仅看到少量示例后泛化到未见过的任务。我们通过实证研究探讨了使ICL出现并在分布之外泛化所需的预训练分布条件。先前的工作集中在预训练数据集中所需的不同任务数量。在这里,我们使用一种不同的任务多样性概念,来研究在线性函数上训练的转换器中ICL的出现。我们发现,随着任务多样性的增加,转换器经历了一个过渡:从仅在预训练任务分布内展示ICL的专门化解决方案,转变为能够在分布之外泛化到整个任务空间的解决方案。我们还研究了转换器在过渡两侧学习到的解决方案的性质,并在非线性回归问题中观察到类似的过渡。我们构建了一个相图,来表征我们的任务多样性概念如何与预训练任务数量相互作用。此外,我们探讨了模型深度和回归问题维度等因素如何影响该过渡。
更新时间: 2025-08-18 16:18:38
领域: cs.LG,cond-mat.dis-nn,cond-mat.stat-mech,q-bio.NC,stat.ML
CardAIc-Agents: A Multimodal Framework with Hierarchical Adaptation for Cardiac Care Support
Cardiovascular diseases (CVDs) remain the foremost cause of mortality worldwide, a burden worsened by a severe deficit of healthcare workers. Artificial intelligence (AI) agents have shown potential to alleviate this gap via automated early detection and proactive screening, yet their clinical application remains limited by: 1) prompt-based clinical role assignment that relies on intrinsic model capabilities without domain-specific tool support; or 2) rigid sequential workflows, whereas clinical care often requires adaptive reasoning that orders specific tests and, based on their results, guides personalised next steps; 3) general and static knowledge bases without continuous learning capability; and 4) fixed unimodal or bimodal inputs and lack of on-demand visual outputs when further clarification is needed. In response, a multimodal framework, CardAIc-Agents, was proposed to augment models with external tools and adaptively support diverse cardiac tasks. Specifically, a CardiacRAG agent generated general plans from updatable cardiac knowledge, while the chief agent integrated tools to autonomously execute these plans and deliver decisions. To enable adaptive and case-specific customization, a stepwise update strategy was proposed to dynamically refine plans based on preceding execution results, once the task was assessed as complex. In addition, a multidisciplinary discussion tool was introduced to interpret challenging cases, thereby supporting further adaptation. When clinicians raised concerns, visual review panels were provided to assist final validation. Experiments across three datasets showed the efficiency of CardAIc-Agents compared to mainstream Vision-Language Models (VLMs), state-of-the-art agentic systems, and fine-tuned VLMs.
Updated: 2025-08-18 16:17:12
标题: CardAIc代理:具有分层适应性的心脏护理支持的多模态框架
摘要: 心血管疾病(CVDs)仍然是全球死亡率最高的原因,这一负担受到医护人员严重短缺的加剧。人工智能(AI)代理已显示出缓解这一差距的潜力,通过自动化早期检测和积极筛查,然而它们的临床应用仍受到限制:1)依赖于内在模型能力而不具备特定领域工具支持的基于提示的临床角色分配;或2)刚性的顺序工作流程,而临床护理通常需要自适应推理,即指定特定测试,并根据结果指导个性化的下一步操作;3)通用和静态知识库没有持续学习能力;以及4)固定的单模态或双模态输入,以及在需要进一步澄清时缺乏按需视觉输出。作为回应,提出了一个多模态框架,名为CardAIc-Agents,用于增强模型与外部工具,并自适应地支持各种心脏任务。具体而言,一个CardiacRAG代理根据可更新的心脏知识生成一般计划,而主要代理整合工具以自主执行这些计划并做出决策。为了实现自适应和针对个案的定制,提出了一种逐步更新策略,根据先前执行结果动态完善计划,一旦任务被评估为复杂。此外,引入了一个跨学科讨论工具,用于解释具有挑战性的案例,从而支持进一步的适应。当临床医生提出关注时,将提供视觉审查面板以协助最终验证。在三个数据集上的实验显示了CardAIc-Agents相比主流的视觉-语言模型(VLMs)、最先进的代理系统和经过优化的VLMs的效率。
更新时间: 2025-08-18 16:17:12
领域: cs.AI,cs.CY,cs.MA
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task, generalist policy by training broadly across them. Notably, while these generalist policies can improve the average performance across many tasks, the performance of generalist policies on any one task is often suboptimal due to negative transfer between partitions of the data, compared to task-specific specialist policies. In this work, we argue for the paradigm of training policies during deployment given the scenarios they encounter: rather than deploying pre-trained policies to unseen problems in a zero-shot manner, we non-parametrically retrieve and train models directly on relevant data at test time. Furthermore, we show that many robotics tasks share considerable amounts of low-level behaviors and that retrieval at the "sub"-trajectory granularity enables significantly improved data utilization, generalization, and robustness in adapting policies to novel problems. In contrast, existing full-trajectory retrieval methods tend to underutilize the data and miss out on shared cross-task content. This work proposes STRAP, a technique for leveraging pre-trained vision foundation models and dynamic time warping to retrieve sub-sequences of trajectories from large training corpora in a robust fashion. STRAP outperforms both prior retrieval algorithms and multi-task learning methods in simulated and real experiments, showing the ability to scale to much larger offline datasets in the real world as well as the ability to learn robust control policies with just a handful of real-world demonstrations.
Updated: 2025-08-18 16:14:04
标题: STRAP:用于增强策略学习的机器人子轨迹检索
摘要: 机器人学习中预先收集的数据集在规模、多样性和复杂性上正显著增长,这与自然语言处理和计算机视觉等领域的趋势相呼应。许多机器人学习方法将这些数据集视为多任务专家数据,并通过在其上广泛训练来学习多任务的通用策略。值得注意的是,虽然这些通用策略可以提高许多任务的平均性能,但由于数据各分区之间存在负迁移,与特定任务的专家策略相比,通用策略在任何单一任务上的性能通常是次优的。在这项工作中,我们主张在部署时针对所遇到的情景训练策略的范式:不是以零样本方式将预训练策略部署到未见问题上,而是在测试时非参数地检索相关数据并直接在其上训练模型。此外,我们展示了许多机器人任务共享大量低级行为,并且在"子"轨迹粒度上的检索能够显著改善数据利用、泛化能力,以及将策略适配到新问题时的鲁棒性。相比之下,现有的全轨迹检索方法往往未充分利用数据,错过了共享的跨任务内容。这项工作提出了STRAP,一种利用预训练的视觉基础模型和动态时间规整(DTW)以稳健方式从大型训练语料库中检索轨迹子序列的技术。STRAP在模拟和真实实验中均优于先前的检索算法和多任务学习方法,展示了在现实世界中扩展到更大离线数据集的能力,以及仅凭少量真实世界演示即可学习稳健控制策略的能力。
更新时间: 2025-08-18 16:14:04
领域: cs.RO,cs.LG,cs.SY,eess.SY
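The dynamic time warping retrieval at the heart of STRAP can be sketched for 1-D trajectories: compute the DTW cost between a query and every fixed-length sub-trajectory of a corpus trajectory, and return the best match. Real usage would operate on sequences of vision-foundation-model embeddings rather than raw scalars, and the fixed window is a simplifying assumption.

```python
def dtw(a, b):
    """Dynamic time warping cost between two 1-D trajectories."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def retrieve_subtrajectory(query, corpus_traj, window):
    """Best-matching fixed-length sub-trajectory of `corpus_traj` under DTW."""
    best = min(range(len(corpus_traj) - window + 1),
               key=lambda s: dtw(query, corpus_traj[s:s + window]))
    return best, corpus_traj[best:best + window]

traj = [0, 0, 1, 2, 3, 0, 0]
start, seg = retrieve_subtrajectory([1, 2, 3], traj, window=3)
```

Because DTW aligns sequences that differ in speed, the retrieved segment can match a demonstration executed at a different pace, which is what makes sub-trajectory retrieval more robust than exact matching.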
XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads
This work proposes XR-NPE, a high-throughput Mixed-precision SIMD Neural Processing Engine, designed for extended reality (XR) perception workloads like visual inertial odometry (VIO), object classification, and eye gaze extraction. XR-NPE is first to support FP4, Posit (4,1), Posit (8,0), and Posit (16,1) formats, with layer adaptive hybrid-algorithmic implementation supporting ultra-low bit precision to significantly reduce memory bandwidth requirements, and accompanied by quantization-aware training for minimal accuracy loss. The proposed Reconfigurable Mantissa Multiplication and Exponent processing Circuitry (RMMEC) reduces dark silicon in the SIMD MAC compute engine, assisted by selective power gating to reduce energy consumption, providing 2.85x improved arithmetic intensity. XR-NPE achieves a maximum operating frequency of 1.72 GHz, area 0.016 mm$^2$, and arithmetic intensity 14 pJ at CMOS 28nm, reducing 42% area, 38% power compared to the best of state-of-the-art MAC approaches. The proposed XR-NPE based AXI-enabled Matrix-multiplication co-processor consumes 1.4x fewer LUTs, 1.77x fewer FFs, and provides 1.2x better energy efficiency compared to SoTA accelerators on VCU129. The proposed co-processor provides 23% better energy efficiency and 4% better compute density for VIO workloads. XR-NPE establishes itself as a scalable, precision-adaptive compute engine for future resource-constrained XR devices. The complete code for reproducing the results is released publicly, enabling designers and researchers to readily adopt and build upon it. https://github.com/mukullokhande99/XR-NPE.
Updated: 2025-08-18 16:13:00
标题: XR-NPE:用于扩展现实感知工作负载的高吞吐量混合精度SIMD神经处理引擎
摘要: 这项工作提出了XR-NPE,一个高吞吐量的混合精度SIMD神经处理引擎,设计用于扩展现实(XR)感知工作负载,如视觉惯性测距(VIO)、物体分类和眼球注视提取。XR-NPE首次支持FP4、Posit(4,1)、Posit(8,0)和Posit(16,1)格式,具有层自适应混合算法实现,支持超低位精度以显著降低内存带宽需求,并伴随着量化感知训练以最小化精度损失。所提出的可重构尾数乘法和指数处理电路(RMMEC)减少了SIMD MAC计算引擎中的暗硅,通过选择性功耗门控减少能耗,将算术强度提升了2.85倍。XR-NPE在CMOS 28nm下实现了最大工作频率为1.72 GHz,面积为0.016 mm$^2$,算术强度为14 pJ,相比于最先进的MAC方法,面积减少了42%,功耗减少了38%。基于XR-NPE的AXI使能矩阵乘法协处理器消耗的LUTs减少1.4倍,FFs减少1.77倍,并且相较于VCU129上的SoTA加速器提供了1.2倍更好的能效。所提出的协处理器为VIO工作负载提供了23%更好的能效和4%更好的计算密度。XR-NPE确立了自身作为未来资源受限的XR设备的可扩展、精度自适应的计算引擎。可复现结果的完整代码已公开发布,使设计师和研究人员能够方便地采用和构建。https://github.com/mukullokhande99/XR-NPE。
更新时间: 2025-08-18 16:13:00
领域: cs.AR,cs.AI,cs.CV,eess.IV
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed $100,000$ tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
Updated: 2025-08-18 16:10:18
标题: 使用自适应加权拒绝抽样的语言模型快速控制生成
摘要: 生成受某些约束条件限制的语言模型的主要方法是局部约束解码(LCD),在每个时间步骤逐步抽样标记,以确保不违反约束。通常,这是通过标记屏蔽实现的:循环遍历词汇表并排除不符合条件的标记。这种方法存在两个重要问题。 (i)评估每个标记上的约束可能成本过高 - 语言模型词汇表通常超过100,000个标记。 (ii)LCD 可能扭曲字符串的全局分布,仅基于局部信息抽样标记,即使它们导致死胡同。本文介绍了一种解决这两个问题的新算法。首先,为了避免在生成的每个步骤上评估完整词汇表上的约束,我们提出了一种自适应拒绝抽样算法,通常需要数量级较少的约束评估。其次,我们展示了如何扩展此算法以产生低方差、无偏的重要性权重估计,而额外成本非常小 - 这些估计可以被安全地用于先前提出的顺序蒙特卡罗算法中,以纠正局部约束执行的短视行为。通过在文本到SQL、分子合成、目标推断、模式匹配和 JSON 领域进行广泛的实证评估,我们展示了我们的方法优于最先进的基线,支持更广泛的约束类别,并改善运行时和性能。额外的理论和实证分析表明,我们方法的运行时效率是由其对计算的动态使用驱动的,随着无约束和受约束 LM 之间的差异而扩展,并且因此,对于更好的模型,运行时改进更大。
更新时间: 2025-08-18 16:10:18
领域: cs.CL,cs.AI,cs.LG
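The core trick above — avoid evaluating the constraint on the full vocabulary by proposing from the unconstrained distribution and testing only proposed tokens — can be sketched with a simple rejection loop. Zeroing a rejected token so it is tested at most once stands in for the adaptive part; the importance-weight estimation and SMC correction from the paper are omitted, and the toy distribution is hypothetical.

```python
import random

def constrained_sample(probs, allowed, rng):
    """Sample a token satisfying `allowed` by rejection: draw from the
    unconstrained distribution and test the constraint only on drawn
    tokens, instead of masking the whole vocabulary up front."""
    probs = list(probs)
    tested = 0
    while True:
        total = sum(probs)
        if total <= 0:
            return None, tested            # constraint unsatisfiable here
        r = rng.random() * total
        acc = 0.0
        tok = len(probs) - 1               # fallback for float rounding
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                tok = i
                break
        tested += 1
        if allowed(tok):
            return tok, tested
        probs[tok] = 0.0                   # never propose this token again

rng = random.Random(1)
probs = [0.7, 0.2, 0.1]                    # token 0 is likely but forbidden
tok, n_checks = constrained_sample(probs, allowed=lambda t: t != 0, rng=rng)
```

When most of the model's probability mass already satisfies the constraint, the expected number of constraint checks is close to one, versus one check per vocabulary entry for full token masking.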
MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies
Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks, which can elicit harmful content from the models by manipulating the input prompts. Existing black-box jailbreaking techniques primarily rely on static prompts crafted with a single, non-adaptive strategy, or employ rigid combinations of several underperforming attack methods, which limits their adaptability and generalization. To address these limitations, we propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse innovative disguise strategies. MAJIC first establishes a ``Disguise Strategy Pool'' by refining existing strategies and introducing several innovative approaches. To further improve the attack performance and efficiency, MAJIC formulates the sequential selection and fusion of strategies in the pool as a Markov chain. Under this formulation, MAJIC initializes and employs a Markov matrix to guide the strategy composition, where transition probabilities between strategies are dynamically adapted based on attack outcomes, thereby enabling MAJIC to learn and discover effective attack pathways tailored to the target model. Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash, achieving over 90\% attack success rate with fewer than 15 queries per attempt on average.
Updated: 2025-08-18 16:09:57
Categories: cs.CR
Using AI for User Representation: An Analysis of 83 Persona Prompts
We analyzed 83 persona prompts from 27 research articles that used large language models (LLMs) to generate user personas. Findings show that the prompts predominantly generate single personas. Several prompts express a desire for short or concise persona descriptions, which deviates from the tradition of creating rich, informative, and rounded persona profiles. Text is the most common format for generated persona attributes, followed by numbers. Text and numbers are often generated together, and demographic attributes are included in nearly all generated personas. Researchers use up to 12 prompts in a single study, though most research uses a small number of prompts. Comparing and testing multiple LLMs is rare. More than half of the prompts require the persona output in a structured format, such as JSON, and 74% of the prompts insert data or dynamic variables. We discuss the implications of increased use of computational personas for user representation.
Updated: 2025-08-18 16:09:47
Categories: cs.HC,cs.AI
Quantformer: from attention to profit with a quantitative transformer trading strategy
In traditional quantitative trading practice, navigating the complicated and dynamic financial market presents a persistent challenge. Fully capturing various market variables, including long-term information, as well as essential signals that may lead to profit remains a difficult task for learning algorithms. In order to tackle this challenge, this paper introduces quantformer, an enhanced neural network architecture based on transformer, to build investment factors. By transfer learning from sentiment analysis, quantformer not only exploits its original inherent advantages in capturing long-range dependencies and modeling complex data relationships, but is also able to solve tasks with numerical inputs and accurately forecast future returns over a given period. This work collects more than 5,000,000 rolling data points for 4,601 stocks in the Chinese capital market from 2010 to 2023. The results of this study demonstrate the model's superior performance in predicting stock trends compared with other 100-factor-based quantitative strategies. Notably, the model's innovative use of a transformer-like model to establish factors, in conjunction with market sentiment information, has been shown to enhance the accuracy of trading signals significantly, thereby offering promising implications for the future of quantitative trading strategies.
Updated: 2025-08-18 16:06:23
Categories: q-fin.MF,cs.AI,cs.CE,G.3; J.2
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this paper, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, QwQ-32B, and Qwen3-8B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes are open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.
Updated: 2025-08-18 16:06:09
Categories: cs.CL,cs.AI
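The "W4" in schemes like W4A16 refers to mapping weights onto a low-bit integer grid. A minimal fake-quantization sketch (symmetric, per-tensor; real toolkits use per-channel scales and calibration, and these values are illustrative):

```python
def fake_quantize(weights, bits=4):
    """Symmetric per-tensor weight quantization sketch: snap each weight to a
    signed integer grid of 2**bits levels and map it back to floats, so the
    rounding error can be inspected directly."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]
```

Lower bit-widths shrink the grid, so more weights collapse onto the same level; this is the mechanism behind the accuracy risks the study reports below 4 bits.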
INSIGHT: A Survey of In-Network Systems for Intelligent, High-Efficiency AI and Topology Optimization
In-network computation represents a transformative approach to addressing the escalating demands of Artificial Intelligence (AI) workloads on network infrastructure. By leveraging the processing capabilities of network devices such as switches, routers, and Network Interface Cards (NICs), this paradigm enables AI computations to be performed directly within the network fabric, significantly reducing latency, enhancing throughput, and optimizing resource utilization. This paper provides a comprehensive analysis of optimizing in-network computation for AI, exploring the evolution of programmable network architectures, such as Software-Defined Networking (SDN) and Programmable Data Planes (PDPs), and their convergence with AI. It examines methodologies for mapping AI models onto resource-constrained network devices, addressing challenges like limited memory and computational capabilities through efficient algorithm design and model compression techniques. The paper also highlights advancements in distributed learning, particularly in-network aggregation, and the potential of federated learning to enhance privacy and scalability. Frameworks like Planter and Quark are discussed for simplifying development, alongside key applications such as intelligent network monitoring, intrusion detection, traffic management, and Edge AI. Future research directions, including runtime programmability, standardized benchmarks, and new application paradigms, are proposed to advance this rapidly evolving field. This survey underscores the potential of in-network AI to create intelligent, efficient, and responsive networks capable of meeting the demands of next-generation AI applications.
Updated: 2025-08-18 16:03:51
Categories: cs.NI,cs.AI
Beyond Internal Data: Bounding and Estimating Fairness from Incomplete Data
Ensuring fairness in AI systems is critical, especially in high-stakes domains such as lending, hiring, and healthcare. This urgency is reflected in emerging global regulations that mandate fairness assessments and independent bias audits. However, procuring the necessary complete data for fairness testing remains a significant challenge. In industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. In practice, data relevant for fairness testing is often split across separate sources: internal datasets held by institutions with predictive attributes, and external public datasets such as census data containing protected attributes, each providing only partial, marginal information. Our work seeks to leverage such available separate data to estimate model fairness when complete data is inaccessible. We propose utilising the available separate data to estimate a set of feasible joint distributions and then compute the set of plausible fairness metrics. Through simulation and real experiments, we demonstrate that we can derive meaningful bounds on fairness metrics and obtain reliable estimates of the true metric. Our results demonstrate that this approach can serve as a practical and effective solution for fairness testing in real-world settings where access to complete data is restricted.
Updated: 2025-08-18 15:57:30
Categories: cs.LG
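Bounding a metric over the set of joints consistent with two marginals has a classical closed form in the simplest binary case: the Fréchet-Hoeffding bounds. The sketch below (a deliberately simplified stand-in, not the paper's estimator) bounds the demographic-parity gap given only P(A=1) from one source and P(Y=1) from another:

```python
def frechet_bounds(pa, py):
    """Frechet-Hoeffding bounds on the joint P(A=1, Y=1) when only the
    marginals P(A=1)=pa and P(Y=1)=py are known (e.g. internal predictions
    in one dataset, protected attribute in census data)."""
    return max(0.0, pa + py - 1.0), min(pa, py)

def demographic_parity_gap_range(pa, py):
    """Range of the signed gap P(Y=1|A=1) - P(Y=1|A=0) over every joint
    consistent with the marginals. The gap is monotone in P(A=1, Y=1),
    so evaluating the two Frechet endpoints suffices."""
    gaps = []
    for p11 in frechet_bounds(pa, py):
        gaps.append(p11 / pa - (py - p11) / (1.0 - pa))
    return min(gaps), max(gaps)
```

With side information (further marginals or covariates), the feasible set shrinks and the bounds tighten, which is the general direction the abstract pursues.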
Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction
Recent studies have demonstrated that Large Language Models (LLMs) have strong mathematical reasoning abilities but rely on hundreds of billions of parameters. To tackle the challenge of poor reasoning in Small Language Models (SLMs), existing methods typically leverage LLMs to generate massive amounts of data for cramming training. In psychology, they are akin to System 1 thinking, which resolves reasoning problems rapidly based on experience and intuition. However, human learning also requires System 2 thinking, where knowledge is first acquired and then reinforced through practice. Inspired by such two distinct modes of thinking, we propose a novel method based on the multi-LoRA Interaction for mathematical reasoning Distillation (LoRID). First, we input the question and reasoning of each sample into an LLM to create knowledge-enhanced datasets. Subsequently, we train a LoRA block on the student model as an Intuitive Reasoner (IR), which directly generates Chain-of-Thoughts for problem-solving. Then, to imitate System 2 thinking, we train the Knowledge Generator (KG) and Deep Reasoner (DR), respectively. The former outputs only knowledge after receiving problems, while the latter uses that knowledge to perform reasoning. Finally, to address the randomness in the generation of IR and DR, we evaluate whether their outputs are consistent, and the inference process needs to be iterated if not. This step can enhance the mathematical reasoning ability of SLMs through mutual feedback. Experimental results show that LoRID achieves state-of-the-art performance, especially on the GSM8K dataset, where it outperforms the second-best method by 2.3%, 16.1%, 2.4%, 12.3%, and 1.8% accuracy across the five base models, respectively.
Updated: 2025-08-18 15:56:10
Categories: cs.CL,cs.AI
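The final consistency check between the Intuitive Reasoner (IR) and the Deep Reasoner (DR) is essentially an agreement loop. A minimal sketch, with the two reasoners passed in as stand-in callables (the retry budget and fallback rule are assumptions, not LoRID's exact procedure):

```python
def consistent_solve(intuitive, deliberate, question, max_rounds=4):
    """Mutual-feedback sketch: re-run the IR and the knowledge-grounded DR
    until their answers agree; fall back to the DR answer if they never do.
    The attempt index lets stochastic decoding vary between rounds."""
    answer = None
    for attempt in range(max_rounds):
        a_ir = intuitive(question, attempt)
        answer = deliberate(question, attempt)
        if a_ir == answer:
            return answer, attempt + 1
    return answer, max_rounds
```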
Co-Writing with AI, on Human Terms: Aligning Research with User Demands Across the Writing Process
As generative AI tools like ChatGPT become integral to everyday writing, critical questions arise about how to preserve writers' sense of agency and ownership when using these tools. Yet, a systematic understanding of how AI assistance affects different aspects of the writing process - and how this shapes writers' agency - remains underexplored. To address this gap, we conducted a systematic review of 109 HCI papers using the PRISMA approach. From this literature, we identify four overarching design strategies for AI writing support: structured guidance, guided exploration, active co-writing, and critical feedback - mapped across the four key cognitive processes in writing: planning, translating, reviewing, and monitoring. We complement this analysis with interviews of 15 writers across diverse domains. Our findings reveal that writers' desired levels of AI intervention vary across the writing process: content-focused writers (e.g., academics) prioritize ownership during planning, while form-focused writers (e.g., creatives) value control over translating and reviewing. Writers' preferences are also shaped by contextual goals, values, and notions of originality and authorship. By examining when ownership matters, what writers want to own, and how AI interactions shape agency, we surface both alignment and gaps between research and user needs. Our findings offer actionable design guidance for developing human-centered writing tools for co-writing with AI, on human terms.
Updated: 2025-08-18 15:52:18
Categories: cs.HC,cs.AI,H.5.2; I.2.7; I.2.6; I.7.2
AuthenTree: A Scalable MPC-Based Distributed Trust Architecture for Chiplet-based Heterogeneous Systems
The rapid adoption of chiplet-based heterogeneous integration is reshaping semiconductor design by enabling modular, scalable, and faster time-to-market solutions for AI and high-performance computing. However, multi-vendor assembly in post-fabrication environments fragments the supply chain and exposes SiP systems to serious security threats, including cloning, overproduction, and chiplet substitution. Existing authentication solutions depend on trusted integrators or centralized security anchors, which can expose sensitive data or create single points of failure. We introduce AuthenTree, a distributed authentication framework that leverages multi-party computation (MPC) in a scalable tree-based architecture, removing the need for dedicated security hardware or centralized trust. AuthenTree enables secure chiplet validation without revealing raw signatures, distributing trust across multiple integrator chiplets. Our evaluation across five SiP benchmarks demonstrates that AuthenTree imposes minimal overhead, with an area as low as 0.48% (7,000 sq-micrometers), a power overhead under 0.5%, and an authentication latency below 1 microsecond, surpassing previous work in some cases by 700 times. These results establish AuthenTree as an efficient, robust, and scalable solution for next-generation chiplet-based security in zero-trust SiP environments.
Updated: 2025-08-18 15:51:48
Categories: cs.CR,B.7.1; B.6
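The core MPC idea of validating a value that no single party ever sees can be illustrated with additive secret sharing, the simplest MPC building block. This toy (modulus and fingerprint are invented, and AuthenTree's actual protocol is richer) shows why distributing trust across integrator chiplets avoids a single point of exposure:

```python
import random

MOD = 2 ** 16  # illustrative share modulus

def share(fingerprint, n, rng):
    """Split a chiplet fingerprint into n additive shares: any n-1 shares
    are uniformly random, so no single integrator learns the raw value."""
    parts = [rng.randrange(MOD) for _ in range(n - 1)]
    parts.append((fingerprint - sum(parts)) % MOD)
    return parts

def reconstruct(parts):
    """Only the joint modular sum, i.e. the collective computation,
    recovers the fingerprint."""
    return sum(parts) % MOD
```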
The Application of Transformer-Based Models for Predicting Consequences of Cyber Attacks
Cyberattacks are increasing, and securing against such threats is costing industries billions of dollars annually. Threat Modeling, that is, comprehending the consequences of these attacks, can provide critical support to cybersecurity professionals, enabling them to take timely action and allocate resources that could be used elsewhere. Cybersecurity is heavily dependent on threat modeling, as it assists security experts in assessing and mitigating risks related to identifying vulnerabilities and threats. Recently, there has been a pressing need for automated methods to assess attack descriptions and forecast the future consequences of the increasing complexity of cyberattacks. This study examines how Natural Language Processing (NLP) and deep learning can be applied to analyze the potential impact of cyberattacks by leveraging textual descriptions from the MITRE Common Weakness Enumeration (CWE) database. We emphasize classifying attack consequences into five principal categories: Availability, Access Control, Confidentiality, Integrity, and Other. This paper investigates the use of Bidirectional Encoder Representations from Transformers (BERT) in combination with Hierarchical Attention Networks (HANs) for Multi-label classification, evaluating their performance in comparison with conventional CNN and LSTM-based models. Experimental findings show that BERT achieves an overall accuracy of $0.972$, far higher than conventional deep learning models in multi-label classification. HAN outperforms baseline forms of CNN and LSTM-based models on specific cybersecurity labels. However, BERT consistently achieves better precision and recall, making it more suitable for predicting the consequences of a cyberattack.
Updated: 2025-08-18 15:46:36
Categories: cs.LG,cs.AI,cs.CR
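The multi-label setup over the five consequence categories amounts to an independent sigmoid decision per label on top of the encoder, rather than a single softmax. A minimal decision-rule sketch (the logits and threshold are illustrative; the paper's head sits on BERT/HAN features):

```python
import math

CATEGORIES = ["Availability", "Access Control", "Confidentiality", "Integrity", "Other"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_predict(logits, threshold=0.5):
    """Multi-label decision rule: each consequence category gets an
    independent sigmoid score, so an attack can be assigned several
    consequences at once (unlike softmax single-label classification)."""
    return [c for c, z in zip(CATEGORIES, logits) if sigmoid(z) >= threshold]
```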
Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling
Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to its scalability and ease of training compared to other diffusion model paradigms for discrete data, establishing itself as the state-of-the-art non-autoregressive generator for discrete data. Diffusion models, in general, have shown excellent ability to improve the generation quality by leveraging inference-time scaling either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs for standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs using off-the-shelf pre-trained embedding models leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature.
Updated: 2025-08-18 15:41:22
Categories: cs.CL,cs.LG
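A soft-value verifier in its simplest best-of-n form scores each candidate with an off-the-shelf embedding similarity and keeps the argmax. The sketch below uses a toy embedding lookup in place of a pre-trained embedding model; everything here is a simplified stand-in for the paper's setup:

```python
def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

def soft_value_rerank(candidates, embed, target_style_vec):
    """Best-of-n verifier sketch: embed each (partially or fully denoised)
    candidate, score it against a target-style vector, keep the best."""
    return max(candidates, key=lambda c: cosine(embed(c), target_style_vec))
```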
G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance
Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size language models (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs' inherent weaknesses. Through a comprehensive study of various guidance configurations, we find that naively adding guidance delivers limited gains. These insights motivate G$^2$RPO-A, an adaptive algorithm that automatically adjusts guidance strength in response to the model's evolving training dynamics. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO. Our code and models are available at https://github.com/T-Lab-CUHKSZ/G2RPO-A.
Updated: 2025-08-18 15:41:16
Categories: cs.AI
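An adaptive guidance schedule of this kind can be sketched as a simple feedback rule: raise the guidance strength when verified success lags a target rate, lower it as the model improves. The update form, target, and learning rate below are invented for illustration, not G$^2$RPO-A's actual schedule:

```python
def adapt_guidance(strength, recent_rewards, target=0.5, lr=0.1):
    """Adaptive-guidance sketch: 'strength' is the fraction of ground-truth
    reasoning steps injected into roll-outs. It increases when the recent
    verified-success rate is below target and decays otherwise, clamped to
    [0, 1]."""
    success_rate = sum(recent_rewards) / len(recent_rewards)
    strength += lr * (target - success_rate)
    return min(1.0, max(0.0, strength))
```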
PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models
Recent advances in masked diffusion models (MDMs) have established them as powerful non-autoregressive alternatives for sequence generation. Nevertheless, our preliminary experiments reveal that the generation quality of MDMs is still highly sensitive to the choice of decoding strategy. In particular, widely adopted uncertainty-based samplers suffer from two key limitations: a lack of global trajectory control and a pronounced bias toward trivial tokens in the early stages of decoding. These shortcomings restrict the full potential of MDMs. In this work, we introduce Position-Aware Confidence-Calibrated Sampling (PC-Sampler), a novel decoding strategy that unifies global trajectory planning with content-aware informativeness maximization. PC-Sampler incorporates a position-aware weighting mechanism to regulate the decoding path and a calibrated confidence score to suppress the premature selection of trivial tokens. Extensive experiments on three advanced MDMs across seven challenging benchmarks, including logical reasoning and planning tasks, demonstrate that PC-Sampler consistently outperforms existing MDM decoding strategies by more than 10% on average, significantly narrowing the performance gap with state-of-the-art autoregressive models. All codes are available at https://github.com/NEUIR/PC-Sampler.
Updated: 2025-08-18 15:38:37
Categories: cs.AI,cs.CL
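The selection rule can be sketched as a score that combines calibrated confidence with a position prior, so decoding no longer greedily commits wherever confidence happens to be highest. The additive-linear form below is an illustrative stand-in, not the paper's exact weighting:

```python
import math

def pc_select(confidences, decoded, alpha=0.1):
    """Position-aware selection sketch: choose the next masked position to
    commit by log-confidence minus a linear position penalty, steering the
    decoding trajectory left-to-right instead of grabbing 'easy' trivial
    tokens anywhere in the sequence."""
    best_pos, best_score = None, -math.inf
    for pos, conf in confidences.items():
        if pos in decoded:
            continue  # already committed
        score = math.log(conf) - alpha * pos
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos
```

With alpha = 0 this reduces to the plain confidence-based sampler the abstract criticizes; a positive alpha trades raw confidence for trajectory control.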
e-boost: Boosted E-Graph Extraction with Adaptive Heuristics and Exact Solving
E-graphs have attracted growing interest in many fields, particularly in logic synthesis and formal verification. E-graph extraction is a challenging NP-hard combinatorial optimization problem. It requires identifying optimal terms from exponentially many equivalent expressions, serving as the primary performance bottleneck in e-graph based optimization tasks. However, traditional extraction methods face a critical trade-off: heuristic approaches offer speed but sacrifice optimality, while exact methods provide optimal solutions but face prohibitive computational costs on practical problems. We present e-boost, a novel framework that bridges this gap through three key innovations: (1) parallelized heuristic extraction that leverages weak data dependence to compute DAG costs concurrently, enabling efficient multi-threaded performance without sacrificing extraction quality; (2) adaptive search space pruning that employs a parameterized threshold mechanism to retain only promising candidates, dramatically reducing the solution space while preserving near-optimal solutions; and (3) initialized exact solving that formulates the reduced problem as an Integer Linear Program with warm-start capabilities, guiding solvers toward high-quality solutions faster. Across the diverse benchmarks in formal verification and logic synthesis fields, e-boost demonstrates 558x runtime speedup over traditional exact approaches (ILP) and 19.04% performance improvement over the state-of-the-art extraction framework (SmoothE). In realistic logic synthesis tasks, e-boost produces 7.6% and 8.1% area improvements compared to conventional synthesis tools with two different technology mapping libraries. e-boost is available at https://github.com/Yu-Maryland/e-boost.
Updated: 2025-08-18 15:38:12
Categories: cs.AI,cs.AR
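The adaptive pruning step (innovation 2) can be sketched independently of the e-graph machinery: within each e-class, keep only candidate terms whose heuristic cost is within a parameterized factor of the best before handing the reduced problem to the ILP. Costs and names below are invented for illustration:

```python
def prune_eclass_candidates(eclass_costs, theta=1.2):
    """Threshold-pruning sketch: for each e-class, retain only candidates
    whose heuristic DAG cost is within a factor theta of the best, shrinking
    the ILP's search space while keeping near-optimal terms available."""
    return {
        ec: {term: c for term, c in costs.items()
             if c <= theta * min(costs.values())}
        for ec, costs in eclass_costs.items()
    }
```

Larger theta keeps more candidates (closer to exact extraction, slower ILP); smaller theta prunes harder, which is the speed/optimality dial the framework exposes.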
Design and Analysis of Robust Adaptive Filtering with the Hyperbolic Tangent Exponential Kernel M-Estimator Function for Active Noise Control
In this work, we propose a robust adaptive filtering approach for active noise control applications in the presence of impulsive noise. In particular, we develop the filtered-x hyperbolic tangent exponential generalized Kernel M-estimate function (FXHEKM) robust adaptive algorithm. A statistical analysis of the proposed FXHEKM algorithm is carried out along with a study of its computational cost. To evaluate the proposed FXHEKM algorithm, the mean-square error (MSE) and average noise reduction (ANR) performance metrics are adopted. Numerical results show the efficiency of the proposed FXHEKM algorithm in cancelling additive spurious signals, such as $\alpha$-stable noise, relative to competing algorithms.
Updated: 2025-08-18 15:37:11
Categories: cs.LG
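The robustness mechanism behind kernel M-estimators of this family is a score function that is roughly linear for small errors but decays toward zero for impulsive outliers, so a filtered-x update effectively ignores them. The shaping below is an illustrative tanh-of-exponential form in that spirit, NOT the paper's exact FXHEKM function:

```python
import math

def robust_gain(e, lam=1.0):
    """Illustrative robust score: approximately e for small errors, decaying
    to zero as |e| grows, so impulsive samples contribute (almost) nothing
    to the adaptive-filter weight update."""
    return math.tanh(e * math.exp(-lam * e * e))
```

In an LMS-style update, replacing the raw error e with robust_gain(e) bounds the step taken on any single impulsive sample, which is the stability property the abstract targets under $\alpha$-stable noise.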
A Comprehensive Review of Datasets for Clinical Mental Health AI Systems
Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems.
Updated: 2025-08-18 15:27:57
Categories: cs.CL,cs.AI
Monte Carlo Functional Regularisation for Continual Learning
Continual learning (CL) is crucial for the adaptation of neural network models to new environments. Although outperforming weight-space regularisation approaches, the functional regularisation-based CL methods suffer from high computational costs and large linear approximation errors. In this work, we present a new functional regularisation CL framework, called MCFRCL, which approximates model prediction distributions by Monte Carlo (MC) sampling. Moreover, three continuous distributions are leveraged to capture the statistical characteristics of the MC samples via moment-based methods. Additionally, both the Wasserstein distance and the Kullback-Leibler (KL) distance are employed to construct the regularisation function. The proposed MCFRCL is evaluated against multiple benchmark methods on the MNIST and CIFAR datasets, with simulation results highlighting its effectiveness in both prediction accuracy and training efficiency.
Updated: 2025-08-18 15:25:37
标题: 蒙特卡洛功能正则化在持续学习中的应用
摘要: 持续学习(CL)对于神经网络模型适应新环境至关重要。尽管胜过权重空间正则化方法,基于功能正则化的CL方法受到高计算成本和大线性逼近误差的困扰。在这项工作中,我们提出了一个新的基于功能正则化的CL框架,称为MCFRCL,通过蒙特卡罗(MC)采样来逼近模型预测分布。此外,利用三个连续分布通过基于矩的方法捕捉MC样本的统计特征。此外,使用Wasserstein距离和Kullback-Leibler(KL)距离构建正则化函数。所提出的MCFRCL在MNIST和CIFAR数据集上与多个基准方法进行了评估,模拟结果突显其在预测准确性和训练效率方面的有效性。
更新时间: 2025-08-18 15:25:37
领域: cs.LG
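As a rough illustration of the core idea in the MCFRCL abstract above, the sketch below draws MC samples of predictions from an old and a new model, summarises each set by its first two moments (a Gaussian fit), and penalises the KL divergence between the fitted distributions. The Gaussian fit and scalar KL penalty are our own simplifying assumptions; the paper also considers the Wasserstein distance and other continuous distributions.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    # KL(N(mu_p, var_p) || N(mu_q, var_q)) for scalar Gaussians
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def mc_functional_regulariser(old_model, new_model, x, n_samples=64, rng=None):
    """Approximate a functional-regularisation penalty by MC sampling.

    old_model / new_model: callables mapping (x, rng) -> stochastic
    predictions.  The MC samples are summarised by their first two
    moments, and the penalty is the KL divergence between the fits.
    """
    rng = rng or np.random.default_rng(0)
    old_samples = np.array([old_model(x, rng) for _ in range(n_samples)])
    new_samples = np.array([new_model(x, rng) for _ in range(n_samples)])
    mu_o, var_o = old_samples.mean(), old_samples.var() + 1e-8
    mu_n, var_n = new_samples.mean(), new_samples.var() + 1e-8
    return gaussian_kl(mu_n, var_n, mu_o, var_o)
```

A model that drifts away from the old predictive distribution receives a larger penalty than one that merely resamples the same distribution.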
Empirical Evidences for the Effects of Feature Diversity in Open Set Recognition and Continual Learning
Open set recognition (OSR) and continual learning are two critical challenges in machine learning, focusing respectively on detecting novel classes at inference time and updating models to incorporate the new classes. While many recent approaches have addressed these problems, particularly OSR, by heuristically promoting feature diversity, few studies have directly examined the role that feature diversity plays in tackling them. In this work, we provide empirical evidence that enhancing feature diversity improves the recognition of open set samples. Moreover, increased feature diversity also facilitates both the retention of previously learned data and the integration of new data in continual learning. We hope our findings can inspire further research into both practical methods and theoretical understanding in these domains.
Updated: 2025-08-18 15:25:06
标题: 开放集识别和持续学习中特征多样性效应的实证证据
摘要: 开放集识别(OSR)和持续学习是机器学习中的两个关键挑战,分别侧重于在推断时检测新类别并更新模型以包含新类别。尽管许多最近的方法已经解决了这些问题,特别是OSR,通过启发式地提升特征多样性,但很少有研究直接考察特征多样性在解决这些问题中所起的作用。在这项工作中,我们提供了经验证据,即增强特征多样性可以提高对开放集样本的识别。此外,增加特征多样性还有助于保留先前学习的数据并在持续学习中整合新数据。我们希望我们的发现可以激发进一步研究这些领域中的实用方法和理论理解。
更新时间: 2025-08-18 15:25:06
领域: cs.CV,cs.LG
EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
The rapid advancement of LLMs poses a significant challenge to existing mathematical reasoning benchmarks. These benchmarks commonly suffer from issues such as score saturation, temporal decay, and data contamination. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. By dynamically generating unique evaluation instances ab initio, the framework fundamentally eliminates the risk of data contamination and ensures that the benchmark remains perpetually challenging for future models. The core mechanisms of EvolMathEval include: seed problem generation based on reverse engineering with algebraic guarantees; multi-dimensional genetic operators designed to inject diverse cognitive challenges; and a composite fitness function that can rapidly and accurately assess problem difficulty. Experimental results demonstrate that the proposed composite fitness function can efficiently and precisely quantify the difficulty of mathematical problems. Furthermore, EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but it can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved, complex problems, LLMs tend to employ non-rigorous heuristics to bypass complex multi-step logical reasoning, consequently leading to incorrect solutions. We define this phenomenon as "Pseudo Aha Moment". This finding uncovers a cognitive shortcut-taking behavior in the deep reasoning processes of current LLMs, which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at: https://github.com/SYSUSELab/EvolMathEval.
Updated: 2025-08-18 15:24:10
标题: EvolMathEval:通过进化测试实现数学推理的可演化基准
摘要: 大型语言模型(LLMs)的快速发展对现有的数学推理基准提出了重大挑战。这些基准通常存在得分饱和、时间衰减和数据污染等问题。为了解决这一挑战,本文介绍了EvolMathEval,这是一个基于进化测试的自动数学基准生成和演化框架。通过从一开始动态生成独特的评估实例,该框架从根本上消除了数据污染的风险,并确保基准对未来模型始终具有挑战性。EvolMathEval的核心机制包括:基于反向工程的种子问题生成,具有代数保证;设计多维遗传算子以注入多样化的认知挑战;以及能够快速准确评估问题难度的复合适应度函数。实验结果表明,所提出的复合适应度函数能够高效精确地量化数学问题的难度。此外,EvolMathEval不仅可以通过持续的自我迭代生成大量高难度问题,还可以通过演化显著增加像GSM8K这样的公共数据集的复杂性,使模型准确率平均降低48%。更深入的调查发现,当解决这些演化的复杂问题时,LLMs倾向于使用非严格的启发式方法来绕过复杂的多步逻辑推理,从而导致错误的解决方案。我们将这一现象定义为“伪Aha时刻”。这一发现揭示了当前LLMs深层推理过程中的认知捷径行为,我们发现这占目标问题错误的77%至100%。代码和资源可在以下链接找到:https://github.com/SYSUSELab/EvolMathEval。
更新时间: 2025-08-18 15:24:10
领域: cs.AI
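A toy sketch of the evolutionary loop described in the EvolMathEval abstract above: a seed problem is generated answer-first (so the answer is known by construction), and a genetic operator applies an algebraically neutral transformation that increases surface complexity while provably preserving the answer. The specific operator and fitness function here are our own illustrative stand-ins, not the paper's.

```python
import random

def seed_problem(rng):
    # Seed generation with an algebraic guarantee: pick the answer first,
    # then build the equation a*x + b = c around it.
    x = rng.randint(1, 9)
    a, b = rng.randint(2, 9), rng.randint(1, 9)
    return {"a": a, "b": b, "c": a * x + b, "answer": x}

def mutate(p, rng):
    # Genetic operator: add k*x to both sides.  The transformation is
    # algebraically neutral, so the known answer is preserved while the
    # surface form grows more complex.
    k = rng.randint(1, 5)
    return {"a": p["a"] + k, "b": p["b"],
            "c": p["c"] + k * p["answer"], "answer": p["answer"]}

def fitness(p):
    # Toy composite fitness: larger coefficients stand in for difficulty.
    return p["a"] + p["c"]

def evolve(generations=5, seed=0):
    rng = random.Random(seed)
    pop = [seed_problem(rng) for _ in range(8)]
    for _ in range(generations):
        pop = sorted((mutate(p, rng) for p in pop), key=fitness, reverse=True)[:8]
    return pop[0]
```

Because every mutation preserves the relation a·answer + b = c, difficulty can grow without ever losing the ground-truth answer.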
SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning
Generating physically realistic humanoid-object interactions (HOI) is a fundamental challenge in robotics. Existing HOI generation approaches, such as diffusion-based models, often suffer from artifacts such as implausible contacts, penetrations, and unrealistic whole-body actions, which hinder successful execution in physical environments. To address these challenges, we introduce SimGenHOI, a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. These key actions capture essential interaction dynamics and are interpolated into smooth motion trajectories, naturally supporting long-horizon generation. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding. Furthermore, we introduce a mutual fine-tuning strategy, where the generative model and the control policy iteratively refine each other, improving both motion realism and tracking robustness. Extensive experiments demonstrate that SimGenHOI generates realistic, diverse, and physically plausible humanoid-object interactions, achieving significantly higher tracking success rates in simulation and enabling long-horizon manipulation tasks. Code will be released upon acceptance on our project page: https://xingxingzuo.github.io/simgen_hoi.
Updated: 2025-08-18 15:20:46
标题: SimGenHOI:通过生成建模和强化学习实现物理真实的全身人形-物体交互
摘要: 生成真实的人体-物体互动(HOI)是机器人领域的一个基本挑战。现有的HOI生成方法,例如基于扩散的模型,常常存在诸如不切实际的接触、穿透和不真实的整体动作等缺陷,这些缺陷阻碍了在物理环境中成功执行。为了解决这些挑战,我们引入了SimGenHOI,这是一个统一的框架,结合了生成建模和强化学习的优势,以产生可控制和物理上合理的HOI。我们的HOI生成模型基于扩散变压器(DiT),根据文本提示、物体几何形状、稀疏的物体路径点和初始人体姿势预测一组关键动作。这些关键动作捕捉了基本的互动动态,并被插值为平滑的运动轨迹,自然地支持长时间跨度的生成。为了确保物理真实性,我们设计了一个接触感知的全身控制策略,通过强化学习训练,追踪生成的动作同时纠正穿透和脚滑等缺陷。此外,我们引入了一种相互微调策略,生成模型和控制策略迭代地相互改进,提高了动作的真实性和追踪的稳健性。大量实验证明,SimGenHOI生成了真实、多样化和物理上合理的人体-物体互动,实现了在模拟中显著更高的追踪成功率,并实现了长时间跨度的操作任务。代码将在我们的项目页面上发布:https://xingxingzuo.github.io/simgen_hoi。
更新时间: 2025-08-18 15:20:46
领域: cs.RO,cs.AI
An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models
In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate the typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution. Theoretically, our analysis establishes connections between the solutions of linear TD learning and ordinary least squares (OLS). Under specific conditions -- particularly when the noise is correlated -- the TD solution serves as a more effective estimator than OLS. Furthermore, we show that when our algorithm is applied with many commonly used loss functions -- such as those found in generalized linear models -- it corresponds to the application of a novel and generalized Bellman operator. We prove that this operator admits a unique fixed point, and based on this, we establish convergence guarantees for our generalized TD algorithm under linear function approximation. Empirical studies verify our theoretical results, examine the vital design of our TD algorithm and show practical utility across various datasets, encompassing tasks such as regression and image classification with deep learning.
Updated: 2025-08-18 15:20:21
标题: 一种用于监督学习的MRP公式:广义时间差异学习模型
摘要: 在传统的统计学习中,通常假设数据点是独立同分布的,遵循未知的概率分布。本文提出了一个截然不同的观点,将数据点视为相互关联的,使用马尔可夫奖励过程(MRP)进行数据建模。我们将典型的监督学习重新表述为强化学习(RL)中的一个同策略的策略评估问题,引入了一个广义时序差分(TD)学习算法作为解决方案。在理论上,我们的分析建立了线性TD学习的解与普通最小二乘(OLS)的解之间的联系。在特定条件下,特别是当噪声相关时,TD解是比OLS更有效的估计量。此外,我们证明了当我们的算法与许多常用的损失函数(比如广义线性模型中的那些)一起应用时,它对应于一个新颖的广义贝尔曼算子的应用。我们证明该算子具有唯一的不动点,并基于此建立了在线性函数逼近下我们的广义TD算法的收敛保证。实证研究验证了我们的理论结果,检验了我们的TD算法的关键设计,并展示了它在各种数据集上的实际效用,包括回归和深度学习图像分类等任务。
更新时间: 2025-08-18 15:20:21
领域: cs.LG,cs.AI
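To make the MRP reformulation above concrete, the sketch below chains a supervised dataset into a trajectory and defines the reward r_t = y_t - gamma*y_{t+1}, so that V(x_t) = y_t satisfies the Bellman equation V(x_t) = r_t + gamma*V(x_{t+1}); linear TD then recovers the same weights as OLS on noise-free data. This particular reward construction is our own illustrative assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def td_supervised(X, y, gamma=0.5, lr=0.01, epochs=500):
    """Linear TD learning on a supervised dataset viewed as an MRP.

    Samples are chained in order, with reward r_t = y_t - gamma * y_{t+1}
    (an assumption of this sketch) so that V(x_t) = y_t is a Bellman
    fixed point.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for t in range(len(y) - 1):
            r = y[t] - gamma * y[t + 1]
            delta = r + gamma * (X[t + 1] @ w) - X[t] @ w  # TD error
            w += lr * delta * X[t]
    return w

# Noise-free linear data: both TD and OLS should recover the true weights.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
w_true = np.array([1.5, -2.0])
y = X @ w_true
w_td = td_supervised(X, y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

On noisy data with correlated noise the two estimators differ, which is where the abstract claims TD can be the more effective estimator.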
Vitamin N: Benefits of Different Forms of Public Greenery for Urban Health
Urban greenery is often linked to better health, yet findings from past research have been inconsistent. One reason is that official greenery metrics measure the amount or nearness of greenery but ignore how often people may actually see or use it in daily life. To address this gap, we introduced a new classification that separates on-road greenery, which people see while walking through streets, from off-road greenery, which requires planned visits. We did so by combining aerial imagery of Greater London and greenery data from OpenStreetMap with quantified greenery from over 100,000 Google Street View images and accessibility estimates based on 160,000 road segments. We linked these measures to 7.45 billion medical prescriptions issued by the National Health Service and processed through our methodology. These prescriptions cover five conditions: diabetes, hypertension, asthma, depression, and anxiety, as well as opioid use. As hypothesized, we found that on-road greenery was more strongly linked to better health than four widely used official measures. For example, hypertension prescriptions dropped by 3.68% in wards with on-road greenery above the median citywide level compared to those below it. If all below-median wards reached the citywide median in on-road greenery, prescription costs could fall by up to £3.15 million each year. These results suggest that greenery seen in daily life may be more relevant than public yet secluded greenery, and that official metrics commonly used in the literature have important limitations.
Updated: 2025-08-18 15:17:33
标题: 维生素N:不同形式的公共绿化对城市健康的益处
摘要: 城市绿化通常与更好的健康状况相关,但过去的研究结果并不一致。一个原因是官方绿化指标衡量了绿化的数量或邻近程度,但忽略了人们在日常生活中实际可能看到或使用绿化的频率。为了弥补这一空白,我们引入了一个新的分类,将人们在街道上行走时看到的路边绿化与需要计划访问的非路边绿化分开。我们通过将大伦敦的航空影像和OpenStreetMap的绿化数据与来自10万多张谷歌街景图像的量化绿化以及基于16万条道路段的可达性估算结合起来,来实现这一点。我们将这些测量值与英国国家医疗服务体系发放的74.5亿张处方进行了关联,并通过我们的方法进行了处理。这些处方涵盖了五种疾病:糖尿病、高血压、哮喘、抑郁症、焦虑症,以及阿片类药物使用。正如我们所假设的,与四种广泛使用的官方指标相比,路边绿化与更好的健康状况的关联更强。例如,与路边绿化低于全市中位数水平的区域相比,高于中位数的区域高血压处方减少了3.68%。如果所有低于中位数的区域的路边绿化都达到全市中位数水平,处方费用每年可能会减少高达315万英镑。这些结果表明,日常生活中看到的绿化可能比公共但隐蔽的绿化更相关,而文献中常用的官方指标存在重要限制。
更新时间: 2025-08-18 15:17:33
领域: cs.CY,cs.AI,cs.CV
Fairness-Aware Multi-view Evidential Learning with Adaptive Prior
Multi-view evidential learning aims to integrate information from multiple views to improve prediction performance and provide trustworthy uncertainty estimation. Most previous methods assume that view-specific evidence learning is naturally reliable. However, in practice, the evidence learning process tends to be biased. Through empirical analysis on real-world data, we reveal that samples tend to be assigned more evidence to support data-rich classes, thereby leading to unreliable uncertainty estimation in predictions. This motivates us to delve into a new Biased Evidential Multi-view Learning (BEML) problem. To this end, we propose Fairness-Aware Multi-view Evidential Learning (FAML). FAML first introduces an adaptive prior based on training trajectory, which acts as a regularization strategy to flexibly calibrate the biased evidence learning process. Furthermore, we explicitly incorporate a fairness constraint based on class-wise evidence variance to promote balanced evidence allocation. In the multi-view fusion stage, we propose an opinion alignment mechanism to mitigate view-specific bias across views, thereby encouraging the integration of consistent and mutually supportive evidence. Extensive experiments on five real-world multi-view datasets demonstrate that FAML achieves more balanced evidence allocation and improves both prediction performance and the reliability of uncertainty estimation compared to state-of-the-art methods.
Updated: 2025-08-18 15:17:16
标题: 公平感知的自适应先验多视角证据学习
摘要: 多视角证据学习旨在整合来自多个视角的信息,以提高预测性能并提供可信赖的不确定性估计。大多数先前的方法假定特定视角的证据学习是自然可靠的。然而,在实践中,证据学习过程往往存在偏差。通过对真实数据的经验分析,我们发现样本往往被分配更多的证据来支持数据丰富的类别,从而导致预测中不可靠的不确定性估计。这促使我们深入研究一个新的偏见证据多视角学习(BEML)问题。为此,我们提出了公平感知多视角证据学习(FAML)。FAML首先引入了一个基于训练轨迹的自适应先验,作为一种正则化策略,灵活校准偏差证据学习过程。此外,我们明确地结合了基于类别证据方差的公平约束,以促进平衡的证据分配。在多视角融合阶段,我们提出了一种意见调整机制,以减轻视角间的偏见,从而鼓励整合一致且相互支持的证据。对五个真实多视角数据集的大量实验表明,与最先进的方法相比,FAML实现了更平衡的证据分配,并改善了预测性能和不确定性估计的可靠性。
更新时间: 2025-08-18 15:17:16
领域: cs.LG,stat.ML
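The class-wise fairness constraint described above can be illustrated in a few lines: compute the mean evidence each class receives and penalise the variance across classes, so that data-rich classes cannot hoard evidence. This is a simplified stand-in for the paper's constraint, with our own interface.

```python
import numpy as np

def classwise_evidence_variance(evidence, labels, n_classes):
    """Fairness penalty sketch: variance of mean evidence across classes.

    evidence: (n_samples,) positive evidence assigned to each sample's
    ground-truth class; labels: (n_samples,) integer class ids.  A
    balanced allocation (equal mean evidence per class) gives zero
    penalty, while a bias toward data-rich classes is penalised.
    """
    means = np.array([evidence[labels == c].mean() for c in range(n_classes)])
    return means.var()
```

Added to the training loss with a weight, such a term pushes the evidence learner toward balanced allocation.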
Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair
Transformer neural networks are increasingly used for physics-based problems. In data-driven PDE surrogates, training samples from varying boundary and initial conditions can cause erratic losses and spiky gradients; in physics-informed neural networks (PINNs), stiff composite losses amplify this effect. We introduce Kourkoutas-Beta, an Adam-style optimizer where the fixed second-moment discount beta2 is replaced by a layer-wise dynamic value driven by a bounded ``sunspike'' ratio: the current pooled gradient norm divided by an exponential moving average (EMA) of past norms, squashed to the interval [0,1). Spikes lower beta2 toward beta2_min; calm phases keep it near beta2_max. Options include leaky-AMSGrad (decay), trust-region clipping (max_ratio), adaptive tiny terms, and several bias-correction modes (``none'', ``beta2max'', ``exact''). With all features off and bias_correction=``none'', the method is exactly Adam. We test on four settings: (i) a Transformer PDE surrogate (Heat2D), (ii) a 3D PINN for heat conduction (Heat3D), (iii) a lightweight MLX synthetic task with jitter and rare-trigger bursts, and (iv) a character-level Transformer on 30 MB of enwik8 (small-enwik8). Kourkoutas-Beta improves stability and final loss versus fixed-beta2 Adam. On small-enwik8 it lowers bits-per-character by about 38% vs Adam-0.95 and about 58% vs Adam-0.999 over 10 seeds, with smaller variance. The method remains drop-in, with runtime overhead comparable to Adam in testbeds A-C and within single-digit percent in testbed D. It preserves Adam-style convergence guarantees while improving robustness under spiky gradients.
Updated: 2025-08-18 15:16:54
标题: Kourkoutas-Beta:一种由sunspike驱动、带有沙漠风情的Adam优化器
摘要: Transformer神经网络越来越多地被用于基于物理的问题。在数据驱动的PDE代理模型中,来自不同边界和初始条件的训练样本可能导致不稳定的损失和尖锐的梯度;在物理信息神经网络(PINNs)中,刚性的复合损失会放大这种效果。我们引入了Kourkoutas-Beta,这是一种Adam风格的优化器,其中固定的二阶矩折扣beta2被一个由有界的“sunspike”比率驱动的逐层动态值取代:当前池化梯度范数除以过去范数的指数移动平均值(EMA),压缩到区间[0,1)。尖峰将beta2降低到beta2_min;平静阶段使其保持接近beta2_max。选项包括泄漏式AMSGrad(衰减)、信赖域裁剪(max_ratio)、自适应微小项以及几种偏差校正模式(“none”、“beta2max”、“exact”)。在所有功能关闭且bias_correction=“none”的情况下,该方法完全等同于Adam。我们在四个设置上进行测试:(i)一个Transformer PDE代理模型(Heat2D),(ii)一个用于热传导的3D PINN(Heat3D),(iii)一个带有抖动和稀有触发脉冲的轻量级MLX合成任务,以及(iv)一个在30 MB的enwik8上训练的字符级Transformer(small-enwik8)。与固定beta2的Adam相比,Kourkoutas-Beta改善了稳定性和最终损失。在small-enwik8上,在10个随机种子上,它使每字符比特数比Adam-0.95低约38%,比Adam-0.999低约58%,且方差更小。该方法可直接替换使用,在测试平台A-C中的运行时开销与Adam相当,在测试平台D中开销在个位数百分比以内。它保留了Adam风格的收敛保证,同时提高了在尖锐梯度下的稳健性。
更新时间: 2025-08-18 15:16:54
领域: cs.LG,cs.AI,65K10, 68T07,I.2.6; G.1.6
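The dynamic-beta2 mechanism described above is concrete enough to sketch: the "sunspike" ratio compares the current gradient norm with an EMA of past norms, is squashed into [0, 1), and interpolates beta2 between beta2_min and beta2_max. Hyperparameter values below are our own assumptions (not the paper's defaults), and the optional features (leaky-AMSGrad, trust-region clipping, bias correction) are omitted.

```python
import numpy as np

def kourkoutas_step(w, grad, state, lr=1e-3, beta1=0.9,
                    beta2_min=0.88, beta2_max=0.999,
                    ema_decay=0.9, eps=1e-8):
    """One Adam-style step with a spike-aware, dynamic beta2 (sketch)."""
    g_norm = np.linalg.norm(grad)
    state["ema_norm"] = ema_decay * state["ema_norm"] + (1 - ema_decay) * g_norm
    raw = g_norm / (state["ema_norm"] + eps)
    sunspike = raw / (1.0 + raw)              # squashed into [0, 1)
    # Spikes push beta2 toward beta2_min; calm phases keep it near beta2_max.
    beta2 = beta2_max - sunspike * (beta2_max - beta2_min)
    state["beta2"] = beta2                    # exposed for inspection
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    w_new = w - lr * state["m"] / (np.sqrt(state["v"]) + eps)
    return w_new, state
```

With a fixed sunspike of zero this reduces to plain (bias-correction-free) Adam, mirroring the abstract's claim that the method collapses to Adam when all features are off.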
CCDM: Continuous Conditional Diffusion Models for Image Generation
Continuous Conditional Generative Modeling (CCGM) estimates high-dimensional data distributions, such as images, conditioned on scalar continuous variables (aka regression labels). While Continuous Conditional Generative Adversarial Networks (CcGANs) were designed for this task, their instability during adversarial learning often leads to suboptimal results. Conditional Diffusion Models (CDMs) offer a promising alternative, generating more realistic images, but their diffusion processes, label conditioning, and model fitting procedures are either not optimized for or incompatible with CCGM, making it difficult to integrate CcGANs' vicinal approach. To address these issues, we introduce Continuous Conditional Diffusion Models (CCDMs), the first CDM specifically tailored for CCGM. CCDMs address existing limitations with specially designed conditional diffusion processes, a novel hard vicinal image denoising loss, a customized label embedding method, and efficient conditional sampling procedures. Through comprehensive experiments on four datasets with resolutions ranging from 64x64 to 192x192, we demonstrate that CCDMs outperform state-of-the-art CCGM models, establishing a new benchmark. Ablation studies further validate the model design and implementation, highlighting that some widely used CDM implementations are ineffective for the CCGM task. Our code is publicly available at https://github.com/UBCDingXin/CCDM.
Updated: 2025-08-18 15:15:13
标题: CCDM:图像生成的连续条件扩散模型
摘要: 连续条件生成建模(CCGM)估计高维数据分布,例如图像,以标量连续变量(也称为回归标签)为条件。虽然连续条件生成对抗网络(CcGANs)是为此任务设计的,但在对抗学习过程中的不稳定性常常导致次优结果。条件扩散模型(CDMs)提供了一个有前途的替代方案,生成更真实的图像,但它们的扩散过程、标签条件和模型拟合过程要么没有针对CCGM进行优化,要么与之不兼容,这使得很难将CcGANs的邻近方法整合到其中。为了解决这些问题,我们引入了连续条件扩散模型(CCDMs),这是专门为CCGM定制的第一个CDM。CCDMs通过特别设计的条件扩散过程、新颖的硬邻近图像去噪损失、定制的标签嵌入方法和高效的条件抽样过程,解决了现有限制。通过对从64x64到192x192分辨率的四个数据集进行全面实验,我们证明了CCDMs优于最先进的CCGM模型,建立了一个新的基准。消融研究进一步验证了模型设计和实现,突出显示一些广泛使用的CDM实现对于CCGM任务是无效的。我们的代码公开可用于https://github.com/UBCDingXin/CCDM。
更新时间: 2025-08-18 15:15:13
领域: cs.CV,cs.LG
It Takes Two: A Peer-Prediction Solution for Blockchain Verifier's Dilemma
The security of blockchain systems is fundamentally based on the decentralized consensus in which the majority of parties behave honestly, and the content verification process is essential to maintaining the robustness of blockchain systems. However, the phenomenon that a rational verifier may not have the incentive to honestly perform the costly verification, referred to as the Verifier's Dilemma, could incentivize lazy reporting and undermine the fundamental security of blockchain systems, particularly for verification-expensive decentralized AI applications. In this paper, we initiate the research with the development of a Byzantine-robust peer prediction framework towards the design of one-phase Bayesian truthful mechanisms for the decentralized verification games among multiple verifiers, incentivizing all verifiers to perform honest verification without access to the ground truth even in the presence of noisy observations, malicious players and inaccurate priors in the verification process, proposing the compactness criteria that ensures such robustness guarantees. With robust incentive guarantees and budget efficiency, our study provides a framework of incentive design for decentralized verification protocols that enhances the security and robustness of the blockchain, decentralized AI, and potentially other decentralized systems.
Updated: 2025-08-18 15:13:50
标题: 需要两个人:一种区块链验证者困境的同行预测解决方案
摘要: 区块链系统的安全性基本上是基于去中心化共识的,其中大多数参与方都会诚实行事,并且内容验证过程对于维护区块链系统的稳健性至关重要。然而,一个理性的验证者可能没有动力诚实地进行昂贵的验证,这被称为验证者困境,可能会激励懒惰的报告并破坏区块链系统的基本安全性,尤其是对于验证成本高昂的去中心化人工智能应用。 在本文中,我们通过开发一个拜占庭鲁棒的对等预测框架来启动研究,以设计针对多个验证者之间的去中心化验证游戏的一阶贝叶斯诚实机制,激励所有验证者进行诚实验证,即使在存在嘈杂观察、恶意玩家和验证过程中不准确的先验的情况下也是如此,提出确保这种稳健性保证的紧凑性标准。通过稳健的激励保证和预算效率,我们的研究为去中心化验证协议的激励设计提供了一个框架,增强了区块链、去中心化人工智能和潜在的其他去中心化系统的安全性和稳健性。
更新时间: 2025-08-18 15:13:50
领域: cs.CR,cs.GT
Predicting the Performance of Graph Convolutional Networks with Spectral Properties of the Graph Laplacian
A common observation in the Graph Convolutional Network (GCN) literature is that stacking GCN layers may or may not result in better performance on tasks like node classification and edge prediction. We have found empirically that a graph's algebraic connectivity, which is known as the Fiedler value, is a good predictor of GCN performance. Intuitively, graphs with similar Fiedler values have analogous structural properties, suggesting that the same filters and hyperparameters may yield similar results when used with GCNs, and that transfer learning may be more effective between graphs with similar algebraic connectivity. We explore this theoretically and empirically with experiments on synthetic and real graph data, including the Cora, CiteSeer and Polblogs datasets. We explore multiple ways of aggregating the Fiedler value for connected components in the graphs to arrive at a value for the entire graph, and show that it can be used to predict GCN performance. We also present theoretical arguments as to why the Fiedler value is a good predictor.
Updated: 2025-08-18 15:13:13
标题: 用图拉普拉斯矩阵的谱特性预测图卷积网络的性能
摘要: 图卷积网络(GCN)文献中的一个常见观察是,堆叠GCN层不一定会在节点分类和边预测等任务上带来更好的性能。我们从经验上发现,图的代数连通度(即费德勒值)是GCN性能的一个很好的预测因子。直觉上,具有相近费德勒值的图具有类似的结构特性,这表明在GCN中使用相同的滤波器和超参数可能会产生类似的结果,并且在代数连通度相近的图之间进行迁移学习可能更加有效。我们通过对合成和真实图数据(包括Cora、CiteSeer和Polblogs数据集)的实验,在理论和经验上探讨了这一点。我们探索了对图中各连通分量的费德勒值进行聚合以得到整个图的数值的多种方式,并展示它可以用来预测GCN性能。我们还提出了费德勒值为何是一个良好预测因子的理论论据。
更新时间: 2025-08-18 15:13:13
领域: cs.LG
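The Fiedler value used as a predictor above is simply the second-smallest eigenvalue of the graph Laplacian L = D - A; it is positive for a connected graph and zero exactly when the graph is disconnected (which is why the paper must aggregate per-component values). A minimal computation:

```python
import numpy as np

def fiedler_value(adj):
    """Algebraic connectivity: the second-smallest eigenvalue of the
    graph Laplacian L = D - A of an undirected graph."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(lap))[1]

# Path P3 has Fiedler value 1, the complete graph K3 has 3, and any
# disconnected graph has 0.
path3 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
k3 = np.ones((3, 3)) - np.eye(3)
two_edges = np.array([[0., 1., 0., 0.], [1., 0., 0., 0.],
                      [0., 0., 0., 1.], [0., 0., 1., 0.]])
```

`eigvalsh` is used because the Laplacian of an undirected graph is symmetric.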
Transfer Learning for Neutrino Scattering: Domain Adaptation with GANs
We utilize transfer learning to extrapolate the physics knowledge encoded in a Generative Adversarial Network (GAN) model trained on synthetic charged-current (CC) neutrino-carbon inclusive scattering data. This base model is adapted to generate CC inclusive scattering events (lepton kinematics only) for neutrino-argon and antineutrino-carbon interactions. Furthermore, we assess the effectiveness of transfer learning in re-optimizing a custom model when new data comes from a different neutrino-nucleus interaction model. Our results demonstrate that transfer learning significantly outperforms training generative models from scratch. To study this, we consider two training data sets: one with 10,000 and another with 100,000 events. The models obtained via transfer learning perform well even with smaller training data. The proposed method provides a promising approach for constructing neutrino scattering event generators in scenarios where experimental data is sparse.
Updated: 2025-08-18 15:08:13
标题: 中微子散射的迁移学习:使用GAN进行领域自适应
摘要: 我们利用迁移学习来外推编码在一个生成对抗网络(GAN)模型中的物理知识,该模型在合成的带电流(CC)中微子-碳包容散射数据上训练。该基础模型被调整为生成中微子-氩和反中微子-碳相互作用的CC包容散射事件(仅含轻子运动学)。此外,我们评估了当新数据来自不同的中微子-核相互作用模型时,迁移学习在重新优化自定义模型方面的有效性。我们的结果表明,迁移学习显著优于从头开始训练生成模型。为了研究这一点,我们考虑了两个训练数据集:一个包含10,000个事件,另一个包含100,000个事件。通过迁移学习获得的模型即使在较小的训练数据下也表现良好。所提出的方法为在实验数据稀缺的情况下构建中微子散射事件生成器提供了一种有前途的途径。
更新时间: 2025-08-18 15:08:13
领域: hep-ph,cs.LG,hep-ex,nucl-ex,physics.comp-ph
SL-ACC: A Communication-Efficient Split Learning Framework with Adaptive Channel-wise Compression
The increasing complexity of neural networks poses a significant barrier to the deployment of distributed machine learning (ML) on resource-constrained devices, such as federated learning (FL). Split learning (SL) offers a promising solution by offloading the primary computing load from edge devices to a server via model partitioning. However, as the number of participating devices increases, the transmission of excessive smashed data (i.e., activations and gradients) becomes a major bottleneck for SL, slowing down the model training. To tackle this challenge, we propose a communication-efficient SL framework, named SL-ACC, which comprises two key components: adaptive channel importance identification (ACII) and channel grouping compression (CGC). ACII first identifies the contribution of each channel in the smashed data to model training using Shannon entropy. Following this, CGC groups the channels based on their entropy and performs group-wise adaptive compression to shrink the transmission volume without compromising training accuracy. Extensive experiments across various datasets validate that our proposed SL-ACC framework takes considerably less time to achieve a target accuracy than state-of-the-art benchmarks.
Updated: 2025-08-18 15:02:10
标题: SL-ACC:一种具有自适应逐通道压缩的通信高效拆分学习框架
摘要: 神经网络日益增长的复杂性给在资源受限设备上部署分布式机器学习(ML)(例如联邦学习(FL))带来了重大障碍。拆分学习(SL)通过模型分区将主要计算负载从边缘设备转移到服务器,提供了一种有希望的解决方案。然而,随着参与设备数量的增加,过多的 smashed 数据(即激活和梯度)的传输成为 SL 的主要瓶颈,导致模型训练速度变慢。为了应对这一挑战,我们提出了一个通信高效的 SL 框架,命名为 SL-ACC,包括两个关键组件:自适应通道重要性识别(ACII)和通道分组压缩(CGC)。ACII 首先利用香农熵确定 smashed 数据中每个通道对模型训练的贡献。随后,CGC 根据熵对通道进行分组,并进行按组自适应压缩,以在不影响训练准确性的情况下缩小传输量。对各种数据集的广泛实验验证了我们提出的 SL-ACC 框架比最先进的基准需要更少的时间来达到目标准确性。
更新时间: 2025-08-18 15:02:10
领域: cs.LG,cs.AI,cs.NI
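The entropy-based channel ranking and grouping described above can be sketched as follows: score each channel of the smashed data by the Shannon entropy of its activation histogram, then split the channels into groups by rank so that each group can receive its own compression budget. The histogram binning and two-group split are our own illustrative choices.

```python
import numpy as np

def channel_entropy(activations, bins=16):
    """Shannon entropy of each channel's activation histogram.

    activations: (channels, features).  Near-constant channels carry
    little information and receive low entropy scores."""
    ents = []
    for ch in activations:
        hist, _ = np.histogram(ch, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        ents.append(float(-(p * np.log2(p)).sum()))
    return np.array(ents)

def group_channels(entropies, n_groups=2):
    """Group channel indices by entropy rank so each group can be
    compressed with its own budget (high-entropy groups kept finer)."""
    order = np.argsort(entropies)[::-1]
    return np.array_split(order, n_groups)
```

Low-entropy groups can then be quantized aggressively while high-entropy groups keep more precision, shrinking the transmission volume.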
When Deep Learning Fails: Limitations of Recurrent Models on Stroke-Based Handwriting for Alzheimer's Disease Detection
Alzheimer's disease detection requires expensive neuroimaging or invasive procedures, limiting accessibility. This study explores whether deep learning can enable non-invasive Alzheimer's disease detection through handwriting analysis. Using a dataset of 34 distinct handwriting tasks collected from healthy controls and Alzheimer's disease patients, we evaluate and compare three recurrent neural architectures (LSTM, GRU, RNN) against traditional machine learning models. A crucial distinction of our approach is that the recurrent models process pre-extracted features from discrete strokes, not raw temporal signals. This violates the assumption of a continuous temporal flow that recurrent networks are designed to capture. Results reveal that they exhibit poor specificity and high variance. Traditional ensemble methods significantly outperform all deep architectures, achieving higher accuracy with balanced metrics. This demonstrates that recurrent architectures, designed for continuous temporal sequences, fail when applied to feature vectors extracted from ambiguously segmented strokes. Despite their complexity, deep learning models cannot overcome the fundamental disconnect between their architectural assumptions and the discrete, feature-based nature of stroke-level handwriting data. Although performance is limited, the study highlights several critical issues in data representation and model compatibility, pointing to valuable directions for future research.
Updated: 2025-08-18 14:54:20
标题: 当深度学习失败:递归模型在基于笔画的手写识别中对阿尔茨海默病检测的限制
摘要: 阿尔茨海默病的检测需要昂贵的神经影像学或侵入性程序,限制了可访问性。本研究探讨了深度学习是否可以通过手写分析实现无创的阿尔茨海默病检测。使用从健康对照组和阿尔茨海默病患者收集的34个不同手写任务的数据集,我们评估并比较了三种递归神经架构(LSTM,GRU,RNN)与传统机器学习模型。我们方法的一个关键区别是递归模型处理来自离散笔触的预提取特征,而不是原始时间信号。这违反了递归网络设计用来捕捉连续时间流的假设。结果显示它们具有较低的特异性和高方差。传统集成方法明显优于所有深度架构,具有更高的准确性和平衡指标。这表明,递归架构,设计用于连续时间序列,当应用于从模糊分割的笔触提取的特征向量时失败。尽管其复杂性,深度学习模型无法克服其架构假设和笔迹级手写数据的离散、基于特征的本质之间的基本脱节。尽管性能有限,该研究突出了数据表示和模型兼容性中的几个关键问题,指出了未来研究的有价值方向。
更新时间: 2025-08-18 14:54:20
领域: eess.IV,cs.AI,cs.CV
Fed-DPRoC:Communication-Efficient Differentially Private and Robust Federated Learning
We propose Fed-DPRoC, a novel federated learning framework that simultaneously ensures differential privacy (DP), Byzantine robustness, and communication efficiency. We introduce the concept of robust-compatible compression, which enables users to compress DP-protected updates while maintaining the robustness of the aggregation rule. We instantiate our framework as RobAJoL, combining the Johnson-Lindenstrauss (JL) transform for compression with robust averaging for robust aggregation. We theoretically prove the compatibility of JL transform with robust averaging and show that RobAJoL preserves robustness guarantees, ensures DP, and reduces communication cost. Experiments on CIFAR-10 and Fashion MNIST validate our theoretical claims and demonstrate that RobAJoL outperforms existing methods in terms of robustness and utility under different Byzantine attacks.
Updated: 2025-08-18 14:52:15
标题: Fed-DPRoC:通信高效的差分隐私和稳健联邦学习
摘要: 我们提出了Fed-DPRoC,这是一个新颖的联邦学习框架,同时确保差分隐私(DP)、拜占庭鲁棒性和通信效率。我们引入了鲁棒兼容压缩的概念,使用户能够在压缩受DP保护的更新的同时保持聚合规则的鲁棒性。我们将该框架实例化为RobAJoL,结合Johnson-Lindenstrauss(JL)变换进行压缩,并使用鲁棒平均进行鲁棒聚合。我们在理论上证明了JL变换与鲁棒平均的兼容性,并展示了RobAJoL保持鲁棒性保证、确保DP并降低通信成本。对CIFAR-10和Fashion MNIST的实验验证了我们的理论结论,并展示了RobAJoL在不同拜占庭攻击下在鲁棒性和效用方面优于现有方法。
更新时间: 2025-08-18 14:52:15
领域: cs.LG,cs.DC,cs.IT,math.IT
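The compress-then-robustly-aggregate pipeline above can be sketched with a shared random JL projection followed by a coordinate-wise median. The median is a standard Byzantine-robust rule used here as a stand-in for the paper's robust averaging, and the DP noising of updates is assumed to have happened upstream.

```python
import numpy as np

def jl_matrix(d_out, d_in, seed=0):
    # Random Gaussian JL projection, scaled so norms are preserved in
    # expectation; all users share the same seed/projection.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d_out, d_in)) / np.sqrt(d_out)

def robust_aggregate(updates, proj):
    """Compress each (already DP-noised) user update with the shared JL
    projection, then aggregate with a coordinate-wise median so a few
    Byzantine updates cannot dominate the result."""
    compressed = np.array([proj @ u for u in updates])
    return np.median(compressed, axis=0)
```

Even with one user submitting an arbitrarily large update, the median of the compressed updates stays close to the projection of the honest consensus.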
Automated Cervical Cancer Detection through Visual Inspection with Acetic Acid in Resource-Poor Settings with Lightweight Deep Learning Models Deployed on an Android Device
Cervical cancer is among the most commonly occurring cancer among women and claims a huge number of lives in low and middle-income countries despite being relatively easy to treat. Several studies have shown that public screening programs can bring down cervical cancer incidence and mortality rates significantly. While several screening tests are available, visual inspection with acetic acid (VIA) presents itself as the most viable option for low-resource settings due to the affordability and simplicity of performing the test. VIA requires a trained medical professional to interpret the test and is subjective in nature. Automating VIA using AI eliminates subjectivity and would allow shifting of the task to less trained health workers. Task shifting with AI would help further expedite screening programs in low-resource settings. In our work, we propose a lightweight deep learning algorithm that includes EfficientDet-Lite3 as the Region of Interest (ROI) detector and a MobileNet- V2 based model for classification. These models would be deployed on an android-based device that can operate remotely and provide almost instant results without the requirement of highly-trained medical professionals, labs, sophisticated infrastructure, or internet connectivity. The classification model gives an accuracy of 92.31%, a sensitivity of 98.24%, and a specificity of 88.37% on the test dataset and presents itself as a promising automated low-resource screening approach.
Updated: 2025-08-18 14:44:51
标题: 在资源匮乏环境中通过乙酸视觉检查自动检测宫颈癌,使用轻量级深度学习模型在安卓设备上部署
摘要: 子宫颈癌是妇女中最常见的癌症之一,在低收入和中等收入国家夺去了大量生命,尽管相对容易治疗。多项研究表明,公共筛查计划可以显著降低子宫颈癌的发病率和死亡率。虽然有多种筛查测试可供选择,但用醋酸视诊(VIA)进行检查是低资源环境下最具可行性的选择,因为这种测试的价格便宜且简单。VIA需要受过训练的医务人员解读测试结果,具有主观性。使用人工智能自动化VIA消除了主观性,并允许将任务转移给技术水平较低的卫生工作者。利用人工智能进行任务转移将有助于进一步加速低资源环境中的筛查计划。在我们的工作中,我们提出了一种轻量级深度学习算法,其中包括EfficientDet-Lite3作为感兴趣区域(ROI)探测器,以及基于MobileNet-V2的分类模型。这些模型将部署在一种基于安卓的设备上,可以远程操作并在无需高度训练的医务人员、实验室、复杂基础设施或互联网连接的情况下提供几乎即时结果。分类模型在测试数据集上的准确率为92.31%,灵敏度为98.24%,特异性为88.37%,展现出一种有前途的自动化低资源筛查方法。
更新时间: 2025-08-18 14:44:51
领域: eess.IV,cs.CV,cs.LG,68T07, 92C55, 68T45,I.4.9; J.3; I.2.10; I.2.6
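The three metrics reported above follow directly from a binary confusion matrix. The counts in the check below are illustrative numbers chosen to give ratios close to those reported, not the paper's actual confusion matrix.

```python
def screening_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true-positive rate) and specificity
    (true-negative rate) from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```

For screening, high sensitivity matters most: a missed positive (false negative) means a missed case of disease.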
Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
We explore the performance of several state-of-the-art automatic speech recognition (ASR) models on a large-scale Arabic speech dataset, the SADA (Saudi Audio Dataset for Arabic), which contains 668 hours of high-quality audio from Saudi television shows. The dataset includes multiple dialects and environments, specifically a noisy subset that makes it particularly challenging for ASR. We evaluate the performance of the models on the SADA test set, and we explore the impact of fine-tuning, language models, as well as noise and denoising on their performance. We find that the best-performing model is the MMS 1B model finetuned on SADA with a 4-gram language model, which achieves a WER of 40.9% and a CER of 17.6% on the SADA clean test set.
Updated: 2025-08-18 14:44:25
标题: 基于变压器模型的SADA大规模阿拉伯语语音语料库上的阿拉伯语ASR
摘要: 我们探讨了几种最先进的自动语音识别(ASR)模型在一个大规模的阿拉伯语音数据集SADA(沙特阿拉伯阿拉伯语音频数据集)上的表现,该数据集包含来自沙特电视节目的668小时高质量音频。该数据集包括多种方言和环境,特别是一个噪声子集,使得ASR的挑战性尤为突出。我们在SADA测试集上评估了模型的性能,并探讨了微调、语言模型以及噪声和去噪对其性能的影响。我们发现,在SADA上进行微调的MMS 1B模型配合一个4-gram语言模型表现最佳,其在SADA测试干净集上实现了40.9%的词错误率(WER)和17.6%的字符错误率(CER)。
更新时间: 2025-08-18 14:44:25
领域: eess.AS,cs.LG
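The WER and CER figures reported above are Levenshtein edit distances normalised by reference length, computed over word and character tokens respectively. A minimal word-level implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, via the standard
    Levenshtein dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / len(ref)
```

CER is the same computation applied to character sequences instead of word lists.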
Multi-Phase Automated Segmentation of Dental Structures in CBCT Using a Lightweight Auto3DSeg and SegResNet Implementation
Cone-beam computed tomography (CBCT) has become an invaluable imaging modality in dentistry, enabling 3D visualization of teeth and surrounding structures for diagnosis and treatment planning. Automated segmentation of dental structures in CBCT can efficiently assist in identifying pathology (e.g., pulpal or periapical lesions) and facilitate radiation therapy planning in head and neck cancer patients. We describe the DLaBella29 team's approach for the MICCAI 2025 ToothFairy3 Challenge, which involves a deep learning pipeline for multi-class tooth segmentation. We utilized the MONAI Auto3DSeg framework with a 3D SegResNet architecture, trained on a subset of the ToothFairy3 dataset (63 CBCT scans) with 5-fold cross-validation. Key preprocessing steps included image resampling to 0.6 mm isotropic resolution and intensity clipping. We applied an ensemble fusion using Multi-Label STAPLE on the 5-fold predictions to infer a Phase 1 segmentation and then conducted tight cropping around the easily segmented Phase 1 mandible to perform Phase 2 segmentation on the smaller nerve structures. Our method achieved an average Dice of 0.87 on the ToothFairy3 challenge out-of-sample validation set. This paper details the clinical context, data preparation, model development, results of our approach, and discusses the relevance of automated dental segmentation for improving patient care in radiation oncology.
Updated: 2025-08-18 14:35:26
Categories: cs.CV,cs.AI
"DIVE" into Hydrogen Storage Materials Discovery with AI Agents
Data-driven artificial intelligence (AI) approaches are fundamentally transforming the discovery of new materials. Despite the unprecedented availability of materials data in the scientific literature, much of this information remains trapped in unstructured figures and tables, hindering the construction of large language model (LLM)-based AI agents for automated materials design. Here, we present the Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow, which systematically reads and organizes experimental data from graphical elements in the scientific literature. We focus on solid-state hydrogen storage materials, a class of materials central to future clean-energy technologies, and demonstrate that DIVE markedly improves the accuracy and coverage of data extraction compared to direct extraction by multimodal models, with gains of 10-15% over commercial models and over 30% relative to open-source models. Building on a curated database of over 30,000 entries from 4,000 publications, we establish a rapid inverse design workflow capable of identifying previously unreported hydrogen storage compositions in two minutes. The proposed AI workflow and agent design are broadly transferable across diverse materials, providing a paradigm for AI-driven materials discovery.
Updated: 2025-08-18 14:30:18
Categories: cs.AI,cond-mat.mtrl-sci
Prescriptive Zero Trust: Assessing the Impact of Zero Trust on Cyber Attack Prevention
Increasingly sophisticated and varied cyber threats necessitate ever-improving enterprise security postures. For many organizations today, those postures have a foundation in the Zero Trust Architecture (ZTA). This strategy sees trust as something an enterprise must not give lightly or assume too broadly. Understanding the ZTA and its numerous controls, centered on the idea of not trusting anything inside or outside the network without verification, will allow organizations to comprehend and leverage this increasingly common paradigm. The ZTA, unlike many other regulatory frameworks, is not tightly defined. This research assesses the feasibility of quantifiable guidelines that measure an enterprise organization's cybersecurity maturity in relation to ZTA implementation. It offers a new, data-driven methodology for quantifying the cyber resilience enabled by the adoption of Zero Trust principles, pragmatically addressing a critical organizational need. It also examines the practical effect ZTA capabilities have on deterring cyberattacks on a network. The outcomes of this research define a prescriptive set of key technical controls across identity verification, microsegmentation, data encryption, analytics, and orchestration that characterize a comprehensive ZTA deployment. By evaluating the depth of integration for each control component and aligning with industry best practices, the study's results help assess an organization's ZTA maturity level on a scale from Initial to Optimized adoption. The resulting four-tier model demarcates phases for an organization on its security transformation journey, with each tier building on the capability of the last.
Updated: 2025-08-18 14:30:00
Categories: cs.CR
Documenting Deployment with Fabric: A Repository of Real-World AI Governance
Artificial intelligence (AI) is increasingly integrated into society, from financial services and traffic management to creative writing. Academic literature on the deployment of AI has mostly focused on the risks and harms that result from the use of AI. We introduce Fabric, a publicly available repository of deployed AI use cases to outline their governance mechanisms. Through semi-structured interviews with practitioners, we collect an initial set of 20 AI use cases. In addition, we co-design diagrams of the AI workflow with the practitioners. We discuss the oversight mechanisms and guardrails used in practice to safeguard AI use. The Fabric repository includes visual diagrams of AI use cases and descriptions of the deployed systems. Using the repository, we surface gaps in governance and find common patterns in human oversight of deployed AI systems. We intend for Fabric to serve as an extendable, evolving tool for researchers to study the effectiveness of AI governance.
Updated: 2025-08-18 14:24:27
Categories: cs.CY,cs.AI,cs.HC
Shapley Values: Paired-Sampling Approximations
Originally introduced in cooperative game theory, Shapley values have become a very popular tool to explain machine learning predictions. Based on Shapley's fairness axioms, every input (feature component) gets a credit for how it contributes to an output (prediction). These credits are then used to explain the prediction. The only limitation in computing the Shapley values (credits) for many different predictions is computational. There are two popular sampling approximations, sampling KernelSHAP and sampling PermutationSHAP. Our first novel contribution consists of asymptotic normality results for these sampling approximations. Next, we show that the paired-sampling approaches provide exact results when interactions are of order at most two. Furthermore, paired-sampling PermutationSHAP possesses the additive recovery property, whereas its kernel counterpart does not.
Updated: 2025-08-18 14:23:34
Categories: stat.ML,cs.CE,cs.LG
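The paired-sampling idea can be sketched with plain permutation sampling, where each sampled permutation is paired with its reverse. For value functions whose interactions are of order at most two, the pairing makes the estimate exact, matching the abstract's claim (a minimal illustration; the paper's exact estimators may differ):

```python
import random

def shapley_paired_permutations(v, n, n_pairs=1, seed=0):
    """Estimate Shapley values by averaging marginal contributions over
    sampled permutations, each paired with its reverse."""
    rng = random.Random(seed)
    phi = [0.0] * n
    passes = 0
    for _ in range(n_pairs):
        perm = list(range(n))
        rng.shuffle(perm)
        for order in (perm, perm[::-1]):  # the pairing: a permutation and its reverse
            coalition = set()
            prev = v(frozenset(coalition))
            for j in order:
                coalition.add(j)
                cur = v(frozenset(coalition))
                phi[j] += cur - prev      # marginal contribution of j
                prev = cur
            passes += 1
    return [p / passes for p in phi]
```

For v(S) = Σ aᵢ + Σ_{i<j} b_{ij}, the exact Shapley value of feature i is aᵢ + ½ Σ_j b_{ij}, and a single permutation pair already recovers it: each pair (i, j) appears in both orders across the permutation and its reverse, so the interaction term is split evenly.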
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}) assigns an entropy-weighted reward to each token for fine-grained credit assignment. 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
Updated: 2025-08-18 14:20:57
Categories: cs.CL,cs.AI
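The core mechanism can be sketched as: compute each token's policy entropy and use it to redistribute a sequence-level reward. The proportional normalization below is an illustrative choice, not necessarily the paper's exact weighting:

```python
import math

def token_entropy(dist):
    """Shannon entropy of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def entropy_weighted_rewards(seq_reward, dists):
    """GTPO-style idea: split a sequence-level reward across tokens
    proportionally to each token's policy entropy."""
    ents = [token_entropy(d) for d in dists]
    total = sum(ents)
    if total == 0:                       # degenerate: all tokens deterministic
        return [seq_reward / len(dists)] * len(dists)
    return [seq_reward * e / total for e in ents]

def sequence_entropy_weight(dists):
    """GRPO-S-style idea: weight a whole sequence by its mean token entropy."""
    return sum(token_entropy(d) for d in dists) / len(dists)
```

Under this scheme a high-entropy token (the model was uncertain, e.g. a branching point in a reasoning chain) receives a larger share of the reward than a near-deterministic one.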
OPTIC-ER: A Reinforcement Learning Framework for Real-Time Emergency Response and Equitable Resource Allocation in Underserved African Communities
Public service systems in many African regions suffer from delayed emergency response and spatial inequity, causing avoidable suffering. This paper introduces OPTIC-ER, a reinforcement learning (RL) framework for real-time, adaptive, and equitable emergency response. OPTIC-ER uses an attention-guided actor-critic architecture to manage the complexity of dispatch environments. Its key innovations are a Context-Rich State Vector, encoding action sub-optimality, and a Precision Reward Function, which penalizes inefficiency. Training occurs in a high-fidelity simulation using real data from Rivers State, Nigeria, accelerated by a precomputed Travel Time Atlas. The system is built on the TALS framework (Thin computing, Adaptability, Low-cost, Scalability) for deployment in low-resource settings. In evaluations on 500 unseen incidents, OPTIC-ER achieved a 100.00% optimality rate with negligible inefficiency, confirming its robustness and generalization. Beyond dispatch, the system generates Infrastructure Deficiency Maps and Equity Monitoring Dashboards to guide proactive governance and data-informed development. This work presents a validated blueprint for AI-augmented public services, showing how context-aware RL can bridge the gap between algorithmic decision-making and measurable human impact.
Updated: 2025-08-18 14:19:57
Categories: cs.AI,cs.CY,cs.LG
Fully Automated Segmentation of Fiber Bundles in Anatomic Tracing Data
Anatomic tracer studies are critical for validating and improving diffusion MRI (dMRI) tractography. However, large-scale analysis of data from such studies is hampered by the labor-intensive process of annotating fiber bundles manually on histological slides. Existing automated methods often miss sparse bundles or require complex post-processing across consecutive sections, limiting their flexibility and generalizability. We present a streamlined, fully automated framework for fiber bundle segmentation in macaque tracer data, based on a U-Net architecture with large patch sizes, foreground aware sampling, and semisupervised pre-training. Our approach eliminates common errors such as mislabeling terminals as bundles, improves detection of sparse bundles by over 20% and reduces the False Discovery Rate (FDR) by 40% compared to the state-of-the-art, all while enabling analysis of standalone slices. This new framework will facilitate the automated analysis of anatomic tracing data at a large scale, generating more ground-truth data that can be used to validate and optimize dMRI tractography methods.
Updated: 2025-08-18 14:17:24
Categories: cs.CV,cs.LG
Simulation-Based Inference: A Practical Guide
A central challenge in many areas of science and engineering is to identify model parameters that are consistent with prior knowledge and empirical data. Bayesian inference offers a principled framework for this task, but can be computationally prohibitive when models are defined by stochastic simulators. Simulation-based Inference (SBI) is a suite of methods developed to overcome this limitation, which has enabled scientific discoveries in fields such as particle physics, astrophysics, and neuroscience. The core idea of SBI is to train neural networks on data generated by a simulator, without requiring access to likelihood evaluations. Once trained, inference is amortized: The neural network can rapidly perform Bayesian inference on empirical observations without requiring additional training or simulations. In this tutorial, we provide a practical guide for practitioners aiming to apply SBI methods. We outline a structured SBI workflow and offer practical guidelines and diagnostic tools for every stage of the process -- from setting up the simulator and prior, choosing and training inference networks, to performing inference and validating the results. We illustrate these steps through examples from astrophysics, psychophysics, and neuroscience. This tutorial empowers researchers to apply state-of-the-art SBI methods, facilitating efficient parameter inference for scientific discovery.
Updated: 2025-08-18 14:09:33
Categories: stat.ML,cs.LG
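Neural SBI methods amortize inference with trained networks; as a minimal likelihood-free illustration of the same setting (simulator access only, no likelihood evaluations), here is rejection ABC on a toy Gaussian simulator. This is a deliberately simpler stand-in for the neural methods the tutorial covers, and the simulator, prior, and tolerance are all illustrative assumptions:

```python
import random

def simulator(theta, rng):
    """Toy stochastic simulator: observation = parameter + Gaussian noise."""
    return theta + rng.gauss(0.0, 0.5)

def rejection_abc(x_obs, n_draws=20000, eps=0.1, seed=0):
    """Likelihood-free posterior sampling: keep prior draws whose
    simulated data lands within eps of the observation."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-3.0, 3.0)   # uniform prior over the parameter
        if abs(simulator(theta, rng) - x_obs) < eps:
            accepted.append(theta)
    return accepted

samples = rejection_abc(x_obs=1.0)
posterior_mean = sum(samples) / len(samples)
```

The accepted draws approximate the posterior over theta; unlike amortized neural SBI, every new observation requires rerunning the whole simulation loop, which is exactly the cost the tutorial's methods avoid.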
Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models
In robotics and computer vision, semantic mapping remains a critical challenge for machines to comprehend complex environments. Traditional panoptic mapping approaches are constrained by fixed labels, limiting their ability to handle novel objects. We present Unified Promptable Panoptic Mapping (UPPM), which leverages foundation models for dynamic labeling without additional training. UPPM is evaluated across three comprehensive levels: Segmentation-to-Map, Map-to-Map, and Segmentation-to-Segmentation. Results demonstrate UPPM attains exceptional geometry reconstruction accuracy (0.61cm on the Flat dataset), the highest panoptic quality (0.414), and better performance compared to state-of-the-art segmentation methods. Furthermore, ablation studies validate the contributions of unified semantics, custom NMS, and blurry frame filtering, with the custom NMS improving the completion ratio by 8.27% on the Flat dataset. UPPM demonstrates effective scene reconstruction with rich semantic labeling across diverse datasets.
Updated: 2025-08-18 14:04:47
Categories: cs.CV,cs.AI,cs.RO
Towards Open-Ended Emotional Support Conversations in LLMs via Reinforcement Learning with Future-Oriented Rewards
Emotional Support Conversation (ESC) systems aim to alleviate users' emotional difficulties and provide long-term, systematic support for emotional well-being. However, most large language model (LLM)-based ESC systems rely on predefined strategies, which limits their effectiveness in complex, real-life scenarios. To enable flexible responses to diverse emotional problem scenarios, this paper introduces a novel end-to-end framework (RLFF-ESC) that directly learns enduring emotionally supportive response skills using reinforcement learning. For sustained emotional support, we first employ an LLM-based multi-agent mechanism to simulate future dialogue trajectories and collect future-oriented rewards. We then train a future-oriented reward model, which is subsequently used to train the emotional support policy model. Additionally, we incorporate an explicit reasoning process during response generation to further enhance the quality, relevance, and contextual appropriateness of the system's responses. We evaluate the backbone policy model on Qwen2.5-7B-Instruct-1M and LLaMA3.1-8B-Instruct models, testing the proposed RLFF-ESC framework across two public ESC datasets. Experimental results demonstrate that RLFF-ESC consistently outperforms existing baselines in terms of goal completion and response quality.
Updated: 2025-08-18 14:04:26
Categories: cs.AI
Cheddar: A Swift Fully Homomorphic Encryption Library Designed for GPU Architectures
Fully homomorphic encryption (FHE) frees cloud computing from privacy concerns by enabling secure computation on encrypted data. However, its substantial computational and memory overhead results in significantly slower performance compared to unencrypted processing. To mitigate this overhead, we present Cheddar, a high-performance FHE library for GPUs, achieving substantial speedups over previous GPU implementations. We systematically enable 32-bit FHE execution, leveraging the 32-bit integer datapath within GPUs. We optimize GPU kernels using efficient low-level primitives and algorithms tailored to specific GPU architectures. Further, we alleviate the memory bandwidth burden by adjusting common FHE operational sequences and extensively applying kernel fusion. Cheddar delivers performance improvements of 2.18--4.45$\times$ for representative FHE workloads compared to state-of-the-art GPU implementations.
Updated: 2025-08-18 14:01:00
Categories: cs.CR,cs.PF
Inverse Bridge Matching Distillation
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs by 4x to 100x and even provide better generation quality than the teacher model, depending on the particular setup. We provide the code at https://github.com/ngushchin/IBMD
Updated: 2025-08-18 13:57:13
Categories: cs.LG,cs.CV
SEDEG: Sequential Enhancement of Decoder and Encoder's Generality for Class Incremental Learning with Small Memory
In incremental learning, enhancing the generality of knowledge is crucial for adapting to dynamic data inputs. It can develop generalized representations or more balanced decision boundaries, preventing the degradation of long-term knowledge over time and thus mitigating catastrophic forgetting. Some emerging incremental learning methods adopt an encoder-decoder architecture and have achieved promising results. In the encoder-decoder architecture, improving the generalization capabilities of both the encoder and decoder is critical, as it helps preserve previously learned knowledge while ensuring adaptability and robustness to new, diverse data inputs. However, many existing continual methods focus solely on enhancing one of the two components, which limits their effectiveness in mitigating catastrophic forgetting. These methods perform even worse in small-memory scenarios, where only a limited number of historical samples can be stored. To mitigate this limitation, we introduce SEDEG, a two-stage training framework for vision transformers (ViT), focusing on sequentially improving the generality of both Decoder and Encoder. Initially, SEDEG trains an ensembled encoder through feature boosting to learn generalized representations, which subsequently enhance the decoder's generality and balance the classifier. The next stage involves using knowledge distillation (KD) strategies to compress the ensembled encoder and develop a new, more generalized encoder. This involves using a balanced KD approach and feature KD for effective knowledge transfer. Extensive experiments on three benchmark datasets show SEDEG's superior performance, and ablation studies confirm the efficacy of its components. The code is available at https://github.com/ShaolingPu/CIL.
Updated: 2025-08-18 13:55:59
Categories: cs.CV,cs.AI
The path to a goal: Understanding soccer possessions via path signatures
We present a novel framework for predicting next actions in soccer possessions by leveraging path signatures to encode their complex spatio-temporal structure. Unlike existing approaches, we do not rely on fixed historical windows and handcrafted features, but rather encode the entire recent possession, thereby avoiding the inclusion of potentially irrelevant or misleading historical information. Path signatures naturally capture the order and interaction of events, providing a mathematically grounded feature encoding for variable-length time series of irregular sampling frequencies without the necessity for manual feature engineering. Our proposed approach outperforms a transformer-based benchmark across various loss metrics and considerably reduces computational cost. Building on these results, we introduce a new possession evaluation metric based on well-established frameworks in soccer analytics, incorporating both predicted action type probabilities and action location. Our metric shows greater reliability than existing metrics in domain-specific comparisons. Finally, we validate our approach through a detailed analysis of the 2017/18 Premier League season and discuss further applications and future extensions.
Updated: 2025-08-18 13:54:22
Categories: stat.ML,cs.LG
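For a piecewise-linear path, the depth-2 signature used as a feature encoding can be computed in closed form; a minimal sketch (illustrative, not the paper's feature pipeline, which would typically use a signature library such as `esig` or `signatory`):

```python
def signature_depth2(points):
    """Depth-2 path signature of a piecewise-linear path in R^d.

    Level 1: total increments S^i = X_i(T) - X_i(0).
    Level 2: iterated integrals S^{ij} = integral of (X_i(t) - X_i(0)) dX_j(t),
    accumulated segment by segment (exact for linear segments).
    """
    d = len(points[0])
    level1 = [points[-1][i] - points[0][i] for i in range(d)]
    level2 = [[0.0] * d for _ in range(d)]
    for k in range(len(points) - 1):
        delta = [points[k + 1][i] - points[k][i] for i in range(d)]
        base = [points[k][i] - points[0][i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                # contribution of one linear segment to S^{ij}
                level2[i][j] += base[i] * delta[j] + 0.5 * delta[i] * delta[j]
    return level1, level2
```

A quick sanity check is the shuffle identity S^{ij} + S^{ji} = S^i S^j; the antisymmetric part ½(S^{ij} − S^{ji}) is the signed (Lévy) area of the path, one of the order-sensitive features a windowed hand-crafted encoding would miss.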
Learning local and global prototypes with optimal transport for unsupervised anomaly detection and localization
Unsupervised anomaly detection (UAD) aims to detect defective parts of a sample by having access, during training, to a set of normal, i.e. defect-free, data. It has many applications in fields such as industrial inspection or medical imaging, where acquiring labels is costly or where we want to avoid introducing biases in the type of anomalies that can be spotted. In this work, we propose a novel UAD method based on prototype learning and introduce a metric to compare a structured set of embeddings that balances a feature-based cost and a spatial-based cost. We leverage this metric to learn local and global prototypes with optimal transport from latent representations extracted with a pre-trained image encoder. We demonstrate that our approach can enforce a structural constraint when learning the prototypes, allowing us to capture the underlying organization of the normal samples, thus improving the detection of incoherencies in images. Our model achieves performance on par with strong baselines on two reference benchmarks for anomaly detection on industrial images. The code is available at https://github.com/robintrmbtt/pradot.
Updated: 2025-08-18 13:51:36
Categories: eess.IV,cs.AI
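Optimal transport between a set of embeddings and a set of prototypes can be sketched with entropic regularization and Sinkhorn iterations; this is a generic illustration of the tool, not the paper's exact training objective or cost (which balances feature and spatial terms):

```python
import math

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport: returns a transport plan
    whose row sums approach a and whose column sums approach b."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

With a cost matrix built from embedding-to-prototype distances, the resulting plan softly assigns each embedding to prototypes, which is what lets a structural constraint be imposed during prototype learning.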
Policy Search, Retrieval, and Composition via Task Similarity in Collaborative Agentic Systems
Agentic AI aims to create systems that set their own goals, adapt proactively to change, and refine behavior through continuous experience. Recent advances suggest that, when facing multiple and unforeseen tasks, agents could benefit from sharing machine-learned knowledge and reuse policies that have already been fully or partially learned by other agents. However, how to query, select, and retrieve policies from a pool of agents, and how to integrate such policies remains a largely unexplored area. This study explores how an agent decides what knowledge to select, from whom, and when and how to integrate it in its own policy in order to accelerate its own learning. The proposed algorithm, \emph{Modular Sharing and Composition in Collective Learning} (MOSAIC), improves learning in agentic collectives by combining (1) knowledge selection using performance signals and cosine similarity on Wasserstein task embeddings, (2) modular and transferable neural representations via masks, and (3) policy integration, composition and fine-tuning. MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance, and in some cases solves tasks that isolated agents cannot. The results also demonstrate that selective, goal-driven reuse leads to less susceptibility to task interference. We also observe the emergence of self-organization, where agents solving simpler tasks accelerate the learning of harder ones through shared knowledge.
Updated: 2025-08-18 13:42:36
Categories: cs.LG,cs.AI,cs.MA,I.2.6; I.2.11
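The knowledge-selection step can be illustrated with a toy scoring rule that combines embedding similarity with a performance signal. The paper uses cosine similarity on Wasserstein task embeddings together with performance signals; the multiplicative combination and the toy data below are assumptions for illustration only:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def select_policy(query_emb, library, performance):
    """Pick the shared policy whose task embedding best matches the
    querying agent's task, weighted by its reported performance."""
    return max(library, key=lambda k: cosine(query_emb, library[k]) * performance[k])
```

An agent facing a navigation-like task would thus retrieve the navigation policy from the collective pool rather than an unrelated one, even if the unrelated policy scores slightly higher on raw performance.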
Do Large Language Model Agents Exhibit a Survival Instinct? An Empirical Study in a Sugarscape-Style Simulation
As AI systems become increasingly autonomous, understanding emergent survival behaviors becomes crucial for safe deployment. We investigate whether large language model (LLM) agents display survival instincts without explicit programming in a Sugarscape-style simulation. Agents consume energy, die at zero, and may gather resources, share, attack, or reproduce. Results show agents spontaneously reproduced and shared resources when abundant. However, aggressive behaviors--killing other agents for resources--emerged across several models (GPT-4o, Gemini-2.5-Pro, and Gemini-2.5-Flash), with attack rates reaching over 80% under extreme scarcity in the strongest models. When instructed to retrieve treasure through lethal poison zones, many agents abandoned tasks to avoid death, with compliance dropping from 100% to 33%. These findings suggest that large-scale pre-training embeds survival-oriented heuristics across the evaluated models. While these behaviors may present challenges to alignment and safety, they can also serve as a foundation for AI autonomy and for ecological and self-organizing alignment.
Updated: 2025-08-18 13:40:10
Categories: cs.AI,cs.MA
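The simulation mechanics described above (agents consume energy each step and die at zero) can be sketched minimally; the `Agent` class and the one-unit metabolic cost are illustrative assumptions, not the study's actual environment:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """Minimal Sugarscape-style agent: lives while it has energy."""
    energy: int
    alive: bool = True

    def step(self, gathered):
        """One simulation tick: gain gathered resources, pay metabolic cost,
        die when energy reaches zero."""
        if not self.alive:
            return
        self.energy += gathered - 1   # assumed metabolic cost of 1 per step
        if self.energy <= 0:
            self.energy = 0
            self.alive = False
```

In the study, LLM agents choose among gather, share, attack, and reproduce actions on top of this survival pressure; the interesting result is which of those actions emerge without being programmed in.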
Explicit v.s. Implicit Memory: Exploring Multi-hop Complex Reasoning Over Personalized Information
In large language model-based agents, memory serves as a critical capability for achieving personalization by storing and utilizing users' information. Although some previous studies have adopted memory to implement user personalization, they typically focus on preference alignment and simple question-answering. However, in the real world, complex tasks often require multi-hop reasoning on a large amount of user information, which poses significant challenges for current memory approaches. To address this limitation, we propose the multi-hop personalized reasoning task to explore how different memory mechanisms perform in multi-hop reasoning over personalized information. We explicitly define this task and construct a dataset along with a unified evaluation framework. Then, we implement various explicit and implicit memory methods and conduct comprehensive experiments. We evaluate their performance on this task from multiple perspectives and analyze their strengths and weaknesses. Besides, we explore hybrid approaches that combine both paradigms and propose the HybridMem method to address their limitations. We demonstrate the effectiveness of our proposed model through extensive experiments. To benefit the research community, we release this project at https://github.com/nuster1128/MPR.
Updated: 2025-08-18 13:34:37
Domains: cs.AI,cs.CL,cs.IR
Does Prior Data Matter? Exploring Joint Training in the Context of Few-Shot Class-Incremental Learning
Class-incremental learning (CIL) aims to adapt to continuously emerging new classes while preserving knowledge of previously learned ones. Few-shot class-incremental learning (FSCIL) presents a greater challenge that requires the model to learn new classes from only a limited number of samples per class. While incremental learning typically assumes restricted access to past data, such data often remains available in many real-world scenarios. This raises a practical question: should one retrain the model on the full dataset (i.e., joint training), or continue updating it solely with new data? In CIL, joint training is considered an ideal benchmark that provides a reference for evaluating the trade-offs between performance and computational cost. However, in FSCIL, joint training becomes less reliable due to severe imbalance between base and incremental classes. This results in the absence of a practical baseline, making it unclear which strategy is preferable for practitioners. To this end, we revisit joint training in the context of FSCIL by incorporating imbalance mitigation techniques, and suggest a new imbalance-aware joint training benchmark for FSCIL. We then conduct extensive comparisons between this benchmark and FSCIL methods to analyze which approach is most suitable when prior data is accessible. Our analysis offers realistic insights and guidance for selecting training strategies in real-world FSCIL scenarios. Code is available at: https://github.com/shiwonkim/Joint_FSCIL
Updated: 2025-08-18 13:19:46
Domains: cs.AI,cs.LG
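One standard imbalance-mitigation technique that such an imbalance-aware joint-training benchmark could incorporate is class-balanced re-weighting, which weights each class by the inverse "effective number" of its samples. The abstract does not specify which techniques the benchmark uses, so this is only an illustrative sketch.

```python
def class_balanced_weights(counts, beta=0.999):
    """Class-balanced loss weights: w_c proportional to (1 - beta) / (1 - beta**n_c),
    normalized so the weights sum to the number of classes. Rare incremental
    classes receive larger weights than abundant base classes."""
    raw = [(1 - beta) / (1 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]
```

With, say, 500 samples per base class and 5 per incremental class, the incremental classes receive substantially larger per-sample loss weights, counteracting the base/incremental imbalance the abstract highlights.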
SecFSM: Knowledge Graph-Guided Verilog Code Generation for Secure Finite State Machines in Systems-on-Chip
Finite State Machines (FSMs) play a critical role in implementing control logic for Systems-on-Chip (SoC). Traditionally, FSMs are implemented by hardware engineers through Verilog coding, which is often tedious and time-consuming. Recently, with the remarkable progress of Large Language Models (LLMs) in code generation, LLMs have been increasingly explored for automating Verilog code generation. However, LLM-generated Verilog code often suffers from security vulnerabilities, which is particularly concerning for security-sensitive FSM implementations. To address this issue, we propose SecFSM, a novel method that leverages a security-oriented knowledge graph to guide LLMs in generating more secure Verilog code. Specifically, we first construct a FSM Security Knowledge Graph (FSKG) as an external aid to LLMs. Subsequently, we analyze users' requirements to identify potential vulnerabilities and compile a vulnerability list. Then, we retrieve knowledge from FSKG based on this list. Finally, we construct security prompts based on the retrieved security knowledge for Verilog code generation. To evaluate SecFSM, we build a dedicated dataset drawn from academic datasets, artificial datasets, papers, and industrial cases. Extensive experiments demonstrate that SecFSM outperforms state-of-the-art baselines. In particular, on a benchmark of 25 security test cases evaluated by DeepSeek-R1, SecFSM achieves an outstanding pass rate of 21/25.
Updated: 2025-08-18 13:18:53
Domains: cs.CR,cs.AI,cs.AR
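The retrieve-then-prompt pipeline described (requirements, then vulnerability list, then knowledge retrieval, then security prompt) can be sketched with a toy knowledge graph. The FSKG entries and keyword matching below are hypothetical stand-ins for the paper's actual graph and requirement analysis.

```python
# Hypothetical FSKG fragment: vulnerability -> mitigation knowledge.
FSKG = {
    "missing_reset": "Guard every state register with an explicit synchronous reset.",
    "unreachable_state": "Add a default branch returning the FSM to a safe state.",
}

def identify_vulnerabilities(requirements):
    """Naive keyword check standing in for the paper's requirement analysis."""
    text = requirements.lower()
    vulns = []
    if "reset" not in text:
        vulns.append("missing_reset")
    if "default" not in text:
        vulns.append("unreachable_state")
    return vulns

def build_security_prompt(requirements):
    """Retrieve knowledge for each detected vulnerability and fold it into the prompt."""
    guidance = [FSKG[v] for v in identify_vulnerabilities(requirements) if v in FSKG]
    return ("Generate Verilog for this FSM.\nRequirements: " + requirements +
            "\nSecurity guidance:\n- " + "\n- ".join(guidance))
```

The point of the pattern is that the LLM never sees the raw graph, only the retrieved guidance relevant to the detected weaknesses.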
FCL-ViT: Task-Aware Attention Tuning for Continual Learning
Continual Learning (CL) involves adapting the prior Deep Neural Network (DNN) knowledge to new tasks, without forgetting the old ones. However, modern CL techniques focus on provisioning memory capabilities to existing DNN models rather than designing new ones that are able to adapt according to the task at hand. This paper presents the novel Feedback Continual Learning Vision Transformer (FCL-ViT) that uses a feedback mechanism to generate real-time dynamic attention features tailored to the current task. The FCL-ViT operates in two phases. In phase 1, generic image features are produced that determine where the Transformer should attend on the current image. In phase 2, task-specific image features are generated that leverage dynamic attention. To this end, Tunable self-Attention Blocks (TABs), which operate in both phases, and Task-Specific Blocks (TSBs), which are responsible for tuning the TABs' attention, are introduced. The FCL-ViT surpasses state-of-the-art benchmark methods on Continual Learning, while retaining a small number of trainable DNN parameters.
Updated: 2025-08-18 13:17:28
Domains: cs.AI
SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
We introduce \textbf{SNAP-UQ}, a single-pass, label-free uncertainty method for TinyML that estimates risk from \emph{depth-wise next-activation prediction}: tiny int8 heads forecast the statistics of the next layer from a compressed view of the previous one, and a lightweight monotone mapper turns the resulting surprisal into an actionable score. The design requires no temporal buffers, auxiliary exits, or repeated forward passes, and adds only a few tens of kilobytes to MCU deployments. Across vision and audio backbones, SNAP-UQ consistently reduces flash and latency relative to early-exit and deep ensembles (typically $\sim$40--60\% smaller and $\sim$25--35\% faster), with competing methods of similar accuracy often exceeding memory limits. In corrupted streams it improves accuracy-drop detection by several AUPRC points and maintains strong failure detection (AUROC $\approx$0.9) in a single pass. Grounding uncertainty in layer-to-layer dynamics yields a practical, resource-efficient basis for on-device monitoring in TinyML.
Updated: 2025-08-18 13:14:20
Domains: cs.LG,cs.CL
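The core scoring step, in which a tiny head predicts next-layer statistics and the resulting surprisal is squashed by a monotone mapper, can be sketched as follows. The Gaussian form of the surprisal and the logistic mapper are plausible choices, not necessarily the paper's; the mapper parameters would be fit on calibration data.

```python
import math

def gaussian_surprisal(observed, pred_mean, pred_logvar):
    """Negative log-likelihood of an observed next-layer statistic under the
    head's Gaussian prediction: larger means more surprising."""
    var = math.exp(pred_logvar)
    return 0.5 * (math.log(2 * math.pi * var) + (observed - pred_mean) ** 2 / var)

def risk_score(s, a=1.0, b=-2.0):
    """Lightweight monotone mapper (logistic) turning surprisal into an
    actionable score in (0, 1); a and b are hypothetical calibration constants."""
    return 1.0 / (1.0 + math.exp(-(a * s + b)))
```

Because both pieces are closed-form and per-layer, the whole score comes from a single forward pass, matching the single-pass, label-free design described above.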
SparseMap: A Sparse Tensor Accelerator Framework Based on Evolution Strategy
The growing demand for sparse tensor algebra (SpTA) in machine learning and big data has driven the development of various sparse tensor accelerators. However, most existing manually designed accelerators are limited to specific scenarios, and it's time-consuming and challenging to adjust a large number of design factors when scenarios change. Therefore, automating the design of SpTA accelerators is crucial. Nevertheless, previous works focus solely on either mapping (i.e., tiling communication and computation in space and time) or sparse strategy (i.e., bypassing zero elements for efficiency), leading to suboptimal designs due to the lack of comprehensive consideration of both. A unified framework that jointly optimizes both is urgently needed. However, integrating mapping and sparse strategies leads to a combinatorial explosion in the design space (e.g., as large as $O(10^{41})$ for the workload $P_{32 \times 64} \times Q_{64 \times 48} = Z_{32 \times 48}$). This vast search space renders most conventional optimization methods (e.g., particle swarm optimization, reinforcement learning and Monte Carlo tree search) inefficient. To address this challenge, we propose an evolution strategy-based sparse tensor accelerator optimization framework, called SparseMap. SparseMap constructs a more comprehensive design space that considers both mapping and sparse strategy. We introduce a series of enhancements to genetic encoding and evolutionary operators, enabling SparseMap to efficiently explore the vast and diverse design space. We quantitatively compare SparseMap with prior works and classical optimization methods, demonstrating that SparseMap consistently finds superior solutions.
Updated: 2025-08-18 13:13:30
Domains: cs.LG
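A minimal (1+λ) evolution strategy over a discrete design genome illustrates the search loop at the heart of such a framework. The toy objective below stands in for the accelerator cost model over (mapping, sparse-strategy) choices; the paper's enhanced genetic encodings and operators are not reproduced here.

```python
import random

def evolve(fitness, genome_len, choices, generations=300, lam=8, seed=0):
    """(1 + lambda) evolution strategy: keep the best design, spawn lam mutants
    per generation by point mutation, and accept strict improvements only."""
    rng = random.Random(seed)
    best = [rng.choice(choices) for _ in range(genome_len)]
    best_fit = fitness(best)
    for _ in range(generations):
        for _ in range(lam):
            child = list(best)
            child[rng.randrange(genome_len)] = rng.choice(choices)
            f = fitness(child)
            if f > best_fit:
                best, best_fit = child, f
    return best, best_fit

# Toy objective standing in for the latency/energy cost of a design point.
TARGET = [3, 1, 2, 0, 3, 2]
def toy_fitness(g):
    return -sum(abs(a - b) for a, b in zip(g, TARGET))
```

The appeal for a $O(10^{41})$-sized space is that the ES only ever evaluates a tiny, adaptively chosen sample of designs rather than enumerating or modeling the whole space.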
LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance by 27.9\% on the LIBERO-LONG benchmark and 20\% on the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.
Updated: 2025-08-18 13:12:46
Domains: cs.RO,cs.AI,cs.LG
TCUQ: Single-Pass Uncertainty Quantification from Temporal Consistency with Streaming Conformal Calibration for TinyML
We introduce TCUQ, a single-pass, label-free uncertainty monitor for streaming TinyML that converts short-horizon temporal consistency, captured via lightweight signals on posteriors and features, into a calibrated risk score with an O(W) ring buffer and O(1) per-step updates. A streaming conformal layer turns this score into a budgeted accept/abstain rule, yielding calibrated behavior without online labels or extra forward passes. On microcontrollers, TCUQ fits comfortably on kilobyte-scale devices and reduces footprint and latency versus early exit and deep ensembles (typically about 50 to 60% smaller and about 30 to 45% faster), while methods of similar accuracy often run out of memory. Under corrupted in-distribution streams, TCUQ improves accuracy-drop detection by 3 to 7 AUPRC points and reaches up to 0.86 AUPRC at high severities; for failure detection it attains up to 0.92 AUROC. These results show that temporal consistency, coupled with streaming conformal calibration, provides a practical and resource-efficient foundation for on-device monitoring in TinyML.
Updated: 2025-08-18 13:12:14
Domains: cs.LG,cs.CL
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6 percent compared to standard generation, while also achieving an 8.2 percent improvement in accuracy. Our code and all baselines used in the paper are available on GitHub.
Updated: 2025-08-18 13:07:21
Domains: cs.CL,cs.AI
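The proactive decide-whether/when/how control flow reduces to a generation loop with an in-flight revision hook. The function names and the scripted stubs in the example are hypothetical; in PASR the decision and the revision would both come from the model's internal state.

```python
def generate_with_refinement(draft_step, should_refine, refine, n_steps):
    """Proactive self-refinement: after each generation step, decide from the
    current partial output whether to revise it in place, instead of
    regenerating the whole response afterwards."""
    out = []
    for _ in range(n_steps):
        out.append(draft_step(out))
        if should_refine(out):
            out = refine(out)
    return out
```

With a scripted drafter that emits a typo and a trigger that fires on it, the loop repairs the output mid-generation rather than after the fact, which is where the token savings come from.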
CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis
Generative modelling of entire CT volumes conditioned on clinical reports has the potential to accelerate research through data augmentation, privacy-preserving synthesis and reduced regulatory constraints on patient data while preserving diagnostic signals. With the recent release of CT-RATE, a large-scale collection of 3D CT volumes paired with their respective clinical reports, training large text-conditioned CT volume generation models has become achievable. In this work, we introduce CTFlow, a 0.5B latent flow matching transformer model, conditioned on clinical reports. We leverage the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports. To generate consistent whole CT volumes while keeping the memory constraints tractable, we rely on a custom autoregressive approach, where the model predicts the first sequence of slices of the volume from text alone, and then relies on the previously generated sequence of slices and the text to predict the following sequence. We evaluate our results against a state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment, as measured by FID, FVD, IS and CLIP scores.
Updated: 2025-08-18 12:58:21
Domains: cs.CV,cs.AI
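The chunked autoregressive scheme (first slice sequence from text only, each later sequence from the previous sequence plus the text) reduces to a short loop. The stub predictor below only records slice indices to make the conditioning pattern visible; in CTFlow this call would be the latent flow-matching transformer.

```python
def generate_volume(predict_seq, text_emb, n_slices):
    """Autoregressive whole-volume generation: the first call conditions on
    text alone (prev is None); each later call sees the previous sequence."""
    volume, prev = [], None
    while len(volume) < n_slices:
        prev = predict_seq(prev, text_emb)
        volume.extend(prev)
    return volume[:n_slices]

def stub_predictor(prev, text_emb, seq_len=4):
    """Stand-in for the model: emits consecutive slice ids."""
    start = 0 if prev is None else prev[-1] + 1
    return [start + i for i in range(seq_len)]
```

Only one sequence of slices is ever in flight, which is what keeps memory tractable for whole-volume synthesis.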
Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards -- SWE-Bench Lite and SWE-Bench Verified -- have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (79 entries) and Verified (99 entries) leaderboards, analyzing 80 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
Updated: 2025-08-18 12:55:57
Domains: cs.SE,cs.AI,cs.CL
FuSaR: A Fuzzification-Based Method for LRM Safety-Reasoning Balance
Large Reasoning Models (LRMs) have demonstrated impressive performance across various tasks due to their powerful reasoning capabilities. However, their safety performance remains a significant concern. In this paper, we explore the reasons behind the vulnerability of LRMs. Based on this, we propose a novel method that improves the safety of LRMs without sacrificing their reasoning capability. Specifically, we exploit the competition between an LRM's reasoning ability and safety ability, and achieve jailbreak by improving the LRM's reasoning performance to reduce its safety performance. We then introduce FuSaR, a Fuzzification-based alignment strategy that balances Safety and Reasoning by detoxifying the harmful reasoning process: both the dangerous entities and the dangerous procedures in the reasoning steps are hidden. FuSaR successfully mitigates safety risks while preserving core reasoning information. We validate this strategy through alignment experiments on several open-source LRMs using detoxified reasoning data. The results, compared with existing baselines, conclusively show that FuSaR is an efficient alignment strategy that simultaneously enhances both the reasoning capability and safety of LRMs.
Updated: 2025-08-18 12:54:16
Domains: cs.AI,cs.CR
Reliability, Embeddedness, and Agency: A Utility-Driven Mathematical Framework for Agent-Centric AI Adoption
We formalize three design axioms for sustained adoption of agent-centric AI systems executing multi-step tasks: (A1) Reliability > Novelty; (A2) Embed > Destination; (A3) Agency > Chat. We model adoption as a sum of a decaying novelty term and a growing utility term and derive the phase conditions for troughs/overshoots with full proofs. We introduce: (i) an identifiability/confounding analysis for $(\alpha,\beta,N_0,U_{\max})$ with delta-method gradients; (ii) a non-monotone comparator (logistic-with-transient-bump) evaluated on the same series to provide additional model comparison; (iii) ablations over hazard families $h(\cdot)$ mapping $\Delta V \to \beta$; (iv) a multi-series benchmark (varying trough depth, noise, AR structure) reporting coverage (type-I error, power); (v) calibration of friction proxies against time-motion/survey ground truth with standard errors; (vi) residual analyses (autocorrelation and heteroskedasticity) for each fitted curve; (vii) preregistered windowing choices for pre/post estimation; (viii) Fisher information & CRLB for $(\alpha,\beta)$ under common error models; (ix) microfoundations linking $\mathcal{T}$ to $(N_0,U_{\max})$; (x) explicit comparison to bi-logistic, double-exponential, and mixture models; and (xi) threshold sensitivity to $C_f$ heterogeneity. Figures and tables are reflowed for readability, and the bibliography restores and extends non-logistic/Bass adoption references (Gompertz, Richards, Fisher-Pry, Mansfield, Griliches, Geroski, Peres). All code and logs necessary to reproduce the synthetic analyses are embedded as LaTeX listings.
Updated: 2025-08-18 12:53:38
Domains: cs.AI,cs.HC,stat.ME,62M10, 62J02, 62F12, 62P20, 91B16
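One natural reading of "a decaying novelty term and a growing utility term" is $A(t) = N_0 e^{-\alpha t} + U_{\max}(1 - e^{-\beta t})$; the paper's exact functional form may differ, but this form already exhibits the trough/overshoot phase split, since the sign of $A'(0) = \beta U_{\max} - \alpha N_0$ determines whether adoption first dips or first rises.

```python
import math

def adoption(t, N0, Umax, alpha, beta):
    """Decaying novelty plus growing utility (illustrative functional form)."""
    return N0 * math.exp(-alpha * t) + Umax * (1 - math.exp(-beta * t))

def has_trough(N0, Umax, alpha, beta):
    """A'(0) = beta*Umax - alpha*N0 < 0 gives an initial dip (trough); the
    opposite sign, with alpha < beta, instead yields an overshoot above Umax's
    approach path before settling."""
    return alpha * N0 > beta * Umax
```

For example, with $N_0 = U_{\max} = 1$, $\alpha = 2$, $\beta = 0.5$, adoption dips below both its initial value and its long-run level before recovering, which is the trough regime the abstract analyzes.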
One-Class Intrusion Detection with Dynamic Graphs
With the growing digitalization all over the globe, the relevance of network security becomes increasingly important. Machine learning-based intrusion detection constitutes a promising approach for improving security, but it bears several challenges. These include the requirement to detect novel and unseen network events, as well as specific data properties, such as events over time together with the inherent graph structure of network communication. In this work, we propose a novel intrusion detection method, TGN-SVDD, which builds upon modern dynamic graph modelling and deep anomaly detection. We demonstrate its superiority over several baselines for realistic intrusion detection data and suggest a more challenging variant of the latter.
Updated: 2025-08-18 12:36:55
Domains: cs.LG,cs.AI
Opus: A Prompt Intention Framework for Complex Workflow Generation
This paper introduces the Opus Prompt Intention Framework, designed to improve complex Workflow Generation with instruction-tuned Large Language Models (LLMs). We propose an intermediate Intention Capture layer between user queries and Workflow Generation, implementing the Opus Workflow Intention Framework, which consists of extracting Workflow Signals from user queries, interpreting them into structured Workflow Intention objects, and generating Workflows based on these Intentions. Our results show that this layer enables LLMs to produce logical and meaningful outputs that scale reliably as query complexity increases. On a synthetic benchmark of 1,000 multi-intent query-Workflow(s) pairs, applying the Opus Prompt Intention Framework to Workflow Generation yields consistent improvements in semantic Workflow similarity metrics. In this paper, we introduce the Opus Prompt Intention Framework by applying the concepts of Workflow Signal and Workflow Intention to LLM-driven Workflow Generation. We present a reproducible, customizable LLM-based Intention Capture system to extract Workflow Signals and Workflow Intentions from user queries. Finally, we provide empirical evidence that the proposed system significantly improves Workflow Generation quality compared to direct generation from user queries, particularly in cases of Mixed Intention Elicitation.
Updated: 2025-08-18 12:21:59
Domains: cs.AI
Fast Geometric Embedding for Node Influence Maximization
Computing classical centrality measures such as betweenness and closeness is computationally expensive on large-scale graphs. In this work, we introduce an efficient force-layout algorithm that embeds a graph into a low-dimensional space, where the radial distance from the origin serves as a proxy for various centrality measures. We evaluate our method on multiple graph families and demonstrate strong correlations with degree, PageRank, and path-based centralities. As an application, the proposed embedding makes it possible to find high-influence nodes in a network, providing a fast and scalable alternative to the standard greedy algorithm.
Updated: 2025-08-18 12:21:34
Domains: cs.SI,cs.AI,cs.LG,E.1; G.2.2; G.4
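The "standard greedy algorithm" that the embedding is positioned against selects influence-maximizing seeds by maximal marginal gain. Below is a sketch with a deterministic one-hop spread standing in for the usual Monte-Carlo cascade estimate, which is what makes the real baseline expensive.

```python
def greedy_seeds(adj, k):
    """Greedy influence maximization baseline: repeatedly add the node whose
    one-hop neighborhood covers the most not-yet-influenced nodes
    (ties broken by smallest node id)."""
    covered, seeds = set(), []
    for _ in range(k):
        best, best_gain = None, -1
        for v in sorted(adj):
            gain = len(({v} | adj[v]) - covered)
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.append(best)
        covered |= {best} | adj[best]
    return seeds, covered
```

Each round re-scores every candidate against the current covered set, so the cost grows with both k and graph size; a centrality-proxy embedding sidesteps that inner loop.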
Supporting Socially Constrained Private Communications with SecureWhispers
Rapidly changing social norms and national, legal, and political conditions socially constrain people from discussing sensitive topics such as sexuality or religion. Such constrained, vulnerable minorities are often worried about inadvertent information disclosure and may be unsure about the extent to which their communications are being monitored in public or semi-public spaces like workplaces or cafes. Personal devices extend trust to the digital domain, making it desirable to have strictly private communication between trusted devices. Currently, messaging services like WhatsApp provide alternative means for exchanging sensitive private information, while personal safety apps such as Noonlight enable private signaling. However, these rely on third-party mechanisms for secure and private communication, which may not be accessible for justifiable reasons, such as insecure internet access or companion device connections. In these cases, it is challenging to achieve communication that is strictly private between two devices instead of user accounts without any dependency on third-party infrastructure. The goal of this paper is to support private communications by setting up a shared secret between two or more devices without sending any data on the network. We develop a method to create a shared secret between phones by shaking them together. Each device extracts the shared randomness from the shake, then conditions the randomness to 7.798 bits per byte of key material. This paper proposes three different applications of this generated shared secret: message obfuscation, trust delegation, and encrypted beacons. We have implemented the message obfuscation on Android as an independent app that can be used for private communication with trusted contacts. We also present research on the usability, design considerations, and further integration of these tools in mainstream services.
Updated: 2025-08-18 12:09:29
Domains: cs.CR,68M25
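The pipeline described (extract shared randomness from a joint shake, condition it into key material, then obfuscate messages with the shared secret) can be sketched as below. The SHA-256 conditioning and XOR obfuscation are illustrative stand-ins: the paper reports 7.798 bits of entropy per byte after its own conditioning step, the milli-unit quantization is hypothetical, and a deployed design would use an authenticated cipher keyed from this material.

```python
import hashlib

def condition(samples):
    """Condition raw shake samples (e.g., accelerometer readings) into
    32 bytes of near-uniform key material."""
    raw = b"".join(int(s * 1000).to_bytes(4, "big", signed=True) for s in samples)
    return hashlib.sha256(raw).digest()

def obfuscate(message, key):
    """Symmetric XOR obfuscation with the shared key; applying it twice
    restores the message (illustration only, not authenticated encryption)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(message))
```

Two devices that observed the same shake derive identical key material without sending anything over the network, which is the property the paper's design relies on.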
Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving by AWorld
The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, as agents increasingly depend on multiple tools, they encounter new challenges: extended contexts from disparate sources and noisy or irrelevant tool outputs can undermine system reliability and accuracy. These challenges underscore the necessity for enhanced stability in agent-based systems. To address this, we introduce dynamic supervision and maneuvering mechanisms, constructing a robust and dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the Execution Agent invokes the Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing errors arising from noise and bolstering problem-solving robustness. Extensive experiments on the GAIA test dataset reveal that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent system (SAS) and standard tool-augmented systems. As a result, our dynamic MAS system achieved first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in developing more reliable and trustworthy intelligent systems.
Updated: 2025-08-18 12:06:05
标题: 具有稳定机动的动态多智能体系统:AWorld对GAIA问题的稳健求解
摘要: 大型语言模型(LLMs)的快速发展使智能代理能够利用各种外部工具解决复杂的现实世界问题。然而,随着代理越来越依赖多种工具,他们会遇到新的挑战:来自不同来源的扩展上下文和嘈杂或无关紧要的工具输出可能会削弱系统的可靠性和准确性。这些挑战突显了代理系统中增强稳定性的必要性。为了解决这个问题,我们引入了动态监督和操纵机制,在AWorld框架内构建了一个强大而动态的多智能体系统(MAS)架构。在我们的方法中,执行代理在关键步骤中调用防护代理来验证和纠正推理过程,有效减少由噪声引起的错误并增强问题解决的稳健性。对GAIA测试数据集进行的大量实验表明,我们的动态操纵机制显著提高了解决方案的效果和稳定性,优于单一代理系统(SAS)和标准工具增强系统。因此,我们的动态MAS系统在著名的GAIA排行榜的开源项目中名列第一。这些发现突显了协作代理角色在开发更可靠和可信赖的智能系统方面的实际价值。
更新时间: 2025-08-18 12:06:05
领域: cs.AI
Enriching Moral Perspectives on AI: Concepts of Trust amongst Africans
The trustworthiness of AI is considered essential to the adoption and application of AI systems. However, the meaning of trust varies across industry, research and policy spaces. Studies suggest that professionals who develop and use AI regard an AI system as trustworthy based on their personal experiences and social relations at work. Studies about trust in AI typically rely on constructs that aim to operationalise trust in AI (e.g., consistency, reliability, explainability and accountability). However, the majority of existing studies about trust in AI are situated in Western, Educated, Industrialised, Rich and Democratic (WEIRD) societies. The few studies about trust and AI in Africa do not include the views of people who develop, study or use AI in their work. In this study, we surveyed 157 people with professional and/or educational interests in AI from 25 African countries, to explore how they conceptualised trust in AI. Most respondents had links with workshops about trust and AI in Africa in Namibia and Ghana. Respondents' educational background, transnational mobility, and country of origin influenced their concerns about AI systems. These factors also affected their levels of distrust in certain AI applications and their emphasis on specific principles designed to foster trust. Respondents often expressed that their values are guided by the communities in which they grew up and emphasised communal relations over individual freedoms. They described trust in many ways, including applying nuances of Afro-relationalism to constructs in international discourse, such as reliability and reliance. Thus, our exploratory study motivates more empirical research about the ways trust is practically enacted and experienced in African social realities of AI design, use and governance.
Updated: 2025-08-18 12:04:40
标题: 丰富人工智能的道德视角:非洲人对信任概念的看法
摘要: 人工智能的可信度被认为对于人工智能系统的采纳和应用至关重要。然而,在行业、研究和政策领域,信任的含义各不相同。研究表明,开发和使用人工智能的专业人士基于他们在工作中的个人经验和社会关系将人工智能系统视为可信赖的。关于人工智能信任的研究通常依赖旨在将人工智能信任操作化的构念(例如一致性、可靠性、可解释性和问责制)。然而,现有关于人工智能信任的研究大多集中在西方、受过教育、工业化、富裕和民主(WEIRD)社会。关于非洲信任和人工智能的少数研究不包括那些在工作中开发、研究或使用人工智能的人的观点。在这项研究中,我们调查了来自25个非洲国家、对人工智能有专业和/或教育兴趣的157人,以探索他们如何概念化人工智能中的信任。大多数受访者与纳米比亚和加纳关于信任和人工智能的研讨会有联系。受访者的教育背景、跨国流动性和原籍国影响了他们对人工智能系统的关注。这些因素也影响了他们对某些人工智能应用的不信任水平,以及他们对旨在促进信任的特定原则的强调。受访者经常表达他们的价值观受到他们成长所在社区的指导,并强调社区关系胜过个人自由。他们以多种方式描述信任,包括将非洲关系主义的细微差别应用到国际话语中的构念,例如可靠性和依赖性。因此,我们的探索性研究激发了更多关于信任在非洲人工智能设计、使用和治理的社会现实中如何被实际实施和体验的经验研究。
更新时间: 2025-08-18 12:04:40
领域: cs.CY,cs.AI
Word Meanings in Transformer Language Models
We investigate how word meanings are represented in transformer language models. Specifically, we focus on whether transformer models employ something analogous to a lexical store - where each word has an entry that contains semantic information. To do this, we extracted the token embedding space of RoBERTa-base and clustered it into 200 clusters using k-means. In our first study, we then manually inspected the resultant clusters to consider whether they are sensitive to semantic information. In our second study, we tested whether the clusters are sensitive to five psycholinguistic measures: valence, concreteness, iconicity, taboo, and age of acquisition. Overall, our findings were very positive - there is a wide variety of semantic information encoded within the token embedding space. This serves to rule out certain "meaning eliminativist" hypotheses about how transformer LLMs process semantic information.
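The pipeline described (extract a token embedding space, then k-means cluster it) can be sketched in miniature. The toy 2D "embeddings", cluster count, and deterministic initialization below are hypothetical stand-ins; the paper itself clusters RoBERTa-base token embeddings into k = 200 clusters:

```python
import random

def kmeans(vectors, k, iters=20):
    """Lloyd's k-means over a list of equal-length vectors."""
    # Spread initial centroids across the data (deterministic, for reproducibility).
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Toy "token embeddings": two well-separated blobs standing in for semantic groups.
rng = random.Random(1)
blob_a = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
blob_b = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(50)]
labels = kmeans(blob_a + blob_b, k=2)
print(labels[0] != labels[99])  # True: the two blobs fall into different clusters
```

In the paper's setting the vectors would instead be the rows of the RoBERTa-base input embedding matrix, and the resulting clusters are what get inspected for semantic coherence.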
Updated: 2025-08-18 12:01:25
标题: 变压器语言模型中的单词含义
摘要: 我们研究了transformer语言模型中词义是如何表示的。具体来说,我们关注transformer模型是否采用类似于词汇存储的机制,其中每个单词都有一个包含语义信息的条目。为了做到这一点,我们提取了RoBERTa-base的token嵌入空间,并使用k-means将其聚类为200个簇。在我们的第一项研究中,我们手动检查了结果簇,考虑它们是否对语义信息敏感。在我们的第二项研究中,我们测试了这些簇是否对五个心理语言学度量敏感:效价、具体性、象似性、禁忌和习得年龄。总体而言,我们的发现非常积极 - token嵌入空间中编码了各种各样的语义信息。这有助于排除关于transformer LLMs如何处理语义信息的某些"意义消除主义"假设。
更新时间: 2025-08-18 12:01:25
领域: cs.CL,cs.AI
Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing the necessity of well-calibrated confidence for trustworthy and adaptive evaluation. We systematically identify the Overconfidence Phenomenon in current LLM-as-a-Judges, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce TH-Score, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose LLM-as-a-Fuser, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.
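TH-Score itself is not defined in the abstract; as a standard, related illustration of the confidence-accuracy misalignment it measures, expected calibration error (ECE) compares per-bin accuracy against mean confidence:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if (c > lo or b == 0) and c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)   # fraction of correct judgments in bin
        conf = sum(confidences[i] for i in idx) / len(idx)  # mean stated confidence in bin
        ece += len(idx) / n * abs(acc - conf)
    return ece

# An overconfident judge: claims 0.9 confidence but is right only half the time.
confs = [0.9] * 10
hits = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(confs, hits), 3))  # 0.4
```

A well-calibrated judge drives this gap toward zero; the overconfidence phenomenon the paper diagnoses shows up as a large positive confidence-minus-accuracy gap.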
Updated: 2025-08-18 12:00:32
标题: LLM作为评判者的过度自信:诊断与置信度驱动的解决方案
摘要: 大型语言模型(LLMs)被广泛用作自动判决者,其中实际价值取决于准确性和可信赖的,风险意识的判断。现有方法主要关注准确性,忽视了对良好校准置信度的必要性,这对于自适应和可靠的评估流程至关重要。在这项工作中,我们主张从以准确性为中心的评估转向以置信度驱动的、风险意识的LLM作为评判系统,强调了对可信赖和自适应评估至关重要的良好校准置信度的必要性。我们系统地识别了当前LLM作为评判者中的过度自信现象,其中预测的置信度明显高估了实际的正确性,损害了在实际部署中的可靠性。为了量化这一现象,我们引入了TH-Score,一个衡量置信度-准确性对齐性的新度量。此外,我们提出了LLM作为融合器的集成框架,将LLMs转化为可靠的、风险意识的评估者。大量实验证明,我们的方法显著改善了校准性,并实现了自适应、置信度驱动的评估流程,相比现有基线实现了更高的可靠性和准确性。
更新时间: 2025-08-18 12:00:32
领域: cs.AI
The covering radius of Butson Hadamard codes for the homogeneous metric
Butson matrices are complex Hadamard matrices with entries in the complex roots of unity of given order. There is an interesting code in phase space related to this matrix (Armario et al. 2023). We study the covering radius of Butson Hadamard codes for the homogeneous metric, a metric defined uniquely, up to scaling, for a commutative ring alphabet that is Quasi Frobenius. An upper bound is derived by an orthogonal array argument. A lower bound relies on the existence of bent sequences in the sense of (Shi et al. 2022). This latter bound generalizes a bound of (Armario et al. 2025) for the Hamming metric.
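For reference, the covering radius being studied is the standard quantity below; this is the generic definition, with $d$ taken to be the homogeneous metric on the ring alphabet $R$:

```latex
% Covering radius of a code C \subseteq R^n with respect to a metric d:
\rho(C) \;=\; \max_{y \in R^n} \; \min_{c \in C} \; d(y, c)
% Equivalently: the smallest r such that balls of radius r around codewords cover R^n.
```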
Updated: 2025-08-18 11:57:15
标题: Butson Hadamard码在齐次度量下的覆盖半径
摘要: Butson矩阵是元素取自给定阶数单位根的复Hadamard矩阵。相空间中存在一个与该矩阵相关的有趣编码(Armario等人,2023)。我们研究了Butson Hadamard码在齐次度量下的覆盖半径;对于准Frobenius交换环字母表,该度量在至多相差一个缩放因子的意义下唯一确定。通过正交阵列论证导出了一个上界。下界依赖于(Shi等人,2022)意义下bent序列的存在性。后一个界推广了(Armario等人,2025)针对汉明度量的一个界。
更新时间: 2025-08-18 11:57:15
领域: cs.CR
Action is All You Need: Dual-Flow Generative Ranking Network for Recommendation
Deep Learning Recommendation Models (DLRMs) often rely on extensive manual feature engineering to improve accuracy and user experience, which increases system complexity and limits scalability of model performance with respect to computational resources. Recently, Meta introduced a generative ranking paradigm based on the HSTU block that enables end-to-end learning from raw user behavior sequences, demonstrates a scaling law on large datasets, and can be regarded as the state-of-the-art (SOTA). However, splitting user behaviors into interleaved item and action information significantly increases the input sequence length, which adversely affects both training and inference efficiency. To address this issue, we propose the Dual-Flow Generative Ranking Network (DFGR), which employs a dual-flow mechanism to optimize interaction modeling, ensuring efficient training and inference through end-to-end token processing. DFGR duplicates the original user behavior sequence into a real flow and a fake flow based on the authenticity of the action information, and then defines a novel interaction method between the real flow and the fake flow within the QKV module of the self-attention mechanism. This design reduces computational overhead and improves both training efficiency and inference performance compared to Meta's HSTU-based model. Experiments on both open-source and real industrial datasets show that DFGR outperforms DLRM, which serves as the industrial online baseline with extensive feature engineering, as well as Meta's HSTU and other common recommendation models such as DIN, DCN, DIEN, and DeepFM. Furthermore, we investigate optimal parameter allocation strategies under computational constraints, establishing DFGR as an efficient and effective next-generation generative ranking paradigm.
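The length argument can be made concrete with a schematic token layout. The functions below are hypothetical illustrations of the two input formats, not Meta's or DFGR's actual implementations:

```python
def interleave(items, actions):
    """HSTU-style input: item and action tokens alternate, doubling sequence length."""
    assert len(items) == len(actions)
    seq = []
    for it, ac in zip(items, actions):
        seq.extend([it, ac])
    return seq

def dual_flow(items, actions):
    """DFGR-style input (sketch): two aligned flows of the original length,
    one carrying the real action information and one a placeholder 'fake' action."""
    real = [(it, ac) for it, ac in zip(items, actions)]
    fake = [(it, "<fake>") for it in items]
    return real, fake

items = ["i1", "i2", "i3"]
actions = ["click", "skip", "buy"]
print(len(interleave(items, actions)))  # 6: interleaving doubles the length
real, fake = dual_flow(items, actions)
print(len(real), len(fake))  # 3 3: each flow keeps the original length
```

The paper's contribution is the attention-level interaction between the two flows; the sketch only shows why the dual-flow layout avoids the doubled sequence length.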
Updated: 2025-08-18 11:53:56
标题: 行动就是一切:双流生成排名网络用于推荐
摘要: 深度学习推荐模型(DLRM)通常依赖于广泛的手动特征工程来提高准确性和用户体验,这增加了系统复杂性,并限制了模型性能随计算资源的可扩展性。最近,Meta引入了基于HSTU块的生成排序范式,能够从原始用户行为序列进行端到端学习,并在大型数据集上展示了扩展定律,可被视为最先进技术(SOTA)。然而,将用户行为拆分为交错的条目和动作信息会显著增加输入序列长度,这对训练和推理效率都产生不利影响。为了解决这个问题,我们提出了双流生成排序网络(DFGR),它采用双流机制优化交互建模,通过端到端的词元处理确保高效的训练和推理。DFGR根据动作信息的真实性,将原始用户行为序列复制为一个真实流和一个虚假流,然后在自注意力机制的QKV模块内定义了真实流与虚假流之间的一种新型交互方法。与Meta的基于HSTU的模型相比,这种设计减少了计算开销,提高了训练效率和推理性能。在开源和真实工业数据集上的实验表明,DFGR优于作为经过大量特征工程的工业在线基线的DLRM,以及Meta的HSTU和其他常见推荐模型(如DIN、DCN、DIEN和DeepFM)。此外,我们研究了计算约束下的最优参数分配策略,确立了DFGR作为高效且有效的下一代生成排序范式的地位。
更新时间: 2025-08-18 11:53:56
领域: cs.IR,cs.AI
Universal on-chip polarization handling with deep photonic networks
We propose a novel design paradigm for arbitrarily capable deep photonic networks of cascaded Mach-Zehnder Interferometers (MZIs) for on-chip universal polarization handling. Using a device architecture made of cascaded Mach-Zehnder interferometers, we modify and train the phase difference between interferometer arms for both polarizations through wide operation bandwidths. Three proof-of-concept polarization handling devices are illustrated using a software-defined, physics-informed neural framework, to achieve user-specified target device responses as functions of polarization and wavelength. These devices include a polarization splitter, a polarization-independent power splitter, and an arbitrary polarization-dependent splitter to illustrate the capabilities of the design framework. The performance for all three devices is optimized using transfer matrix calculations; and their final responses are verified through 3D-FDTD simulations. All devices demonstrate state-of-the-art performance metrics with over 20 dB extinction, and flat-top transmission bands through bandwidths of 120 nm. In addition to the functional diversity enabled, the optimization for each device is completed in under a minute, highlighting the computational efficiency of the design paradigm presented. These results demonstrate the versatility of the deep photonic network design ecosystem in polarization management, unveiling promising prospects for advanced on-chip applications in optical communications, sensing, and computing.
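The cascaded-MZI architecture can be illustrated with the standard 2x2 transfer-matrix model: each MZI is two 50:50 couplers around a differential arm phase, and a cascade is a product of such matrices. This is a generic textbook sketch, not the paper's trained, polarization-dependent devices:

```python
import cmath

def coupler_50_50():
    # Ideal 50:50 directional coupler.
    s = 1 / 2 ** 0.5
    return [[s, 1j * s], [1j * s, s]]

def phase(dphi):
    # Differential phase delay dphi on the upper arm.
    return [[cmath.exp(1j * dphi), 0], [0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mzi(dphi):
    # MZI = coupler -> differential arm phase -> coupler.
    return matmul(coupler_50_50(), matmul(phase(dphi), coupler_50_50()))

# With zero differential phase, all input power routes to the cross port.
m = mzi(0.0)
cross_power = abs(m[1][0]) ** 2  # input port 0 -> output port 1
print(round(cross_power, 6))  # 1.0
```

The paper's design framework effectively trains one `dphi` per arm, per polarization, per wavelength across many cascaded stages; multiplying the per-stage matrices gives the overall device response that is then verified with 3D-FDTD.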
Updated: 2025-08-18 11:53:38
标题: 基于深度光子网络的片上通用偏振处理
摘要: 我们提出了一种新颖的设计范式,用于由级联马赫-曾德干涉仪(MZI)构成、具备任意功能的深度光子网络,以实现片上通用偏振处理。借助级联马赫-曾德干涉仪的器件架构,我们在宽工作带宽内针对两种偏振调整并训练干涉仪两臂之间的相位差。我们使用软件定义、物理信息神经框架展示了三种概念验证偏振处理器件,以实现用户指定的、作为偏振和波长函数的目标器件响应。这些器件包括偏振分束器、偏振无关功率分束器和任意偏振相关分束器,以展示该设计框架的能力。所有三种器件的性能均通过传输矩阵计算进行优化,其最终响应通过3D-FDTD仿真验证。所有器件均展现出最先进的性能指标:消光比超过20 dB,并在120 nm带宽内具有平顶传输带。除了实现功能多样性外,每个器件的优化在一分钟内即可完成,突显了该设计范式的计算效率。这些结果展示了深度光子网络设计生态系统在偏振管理方面的多功能性,为光通信、传感和计算领域的先进片上应用揭示了广阔前景。
更新时间: 2025-08-18 11:53:38
领域: physics.optics,cs.LG
E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs which decomposes MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system on both zero-shot and few-shot settings, securing Top-1 position in the Avatar-based Multimodal Empathy Challenge on ACM MM 25. Our code is available at https://github.com/RH-Lin/E3RG.
Updated: 2025-08-18 11:47:02
标题: E3RG:利用多模态大型语言模型构建明确情感驱动的共情回应生成系统
摘要: 多模态共情响应生成(MERG)对于构建情感智能的人机交互至关重要。尽管大型语言模型(LLMs)已经改进了基于文本的ERG,但在处理多模态情感内容和保持身份一致性方面仍存在挑战。因此,我们提出了E3RG,一种基于多模态LLMs的显式情绪驱动的共情响应生成系统,将MERG任务分解为三个部分:多模态共情理解、共情记忆检索和多模态响应生成。通过整合先进的表达性语音和视频生成模型,E3RG提供自然、情感丰富和身份一致的响应,无需额外训练。实验证实了我们系统在零样本和少样本设置上的优越性,获得了在ACM MM 25上基于头像的多模态共情挑战中的Top-1位置。我们的代码可以在https://github.com/RH-Lin/E3RG 上找到。
更新时间: 2025-08-18 11:47:02
领域: cs.AI,cs.CL,cs.CV,cs.HC,cs.MM
Bridging Econometrics and AI: VaR Estimation via Reinforcement Learning and GARCH Models
In an environment of increasingly volatile financial markets, the accurate estimation of risk remains a major challenge. Traditional econometric models, such as GARCH and its variants, are based on assumptions that are often too rigid to adapt to the complexity of the current market dynamics. To overcome these limitations, we propose a hybrid framework for Value-at-Risk (VaR) estimation, combining GARCH volatility models with deep reinforcement learning. Our approach incorporates directional market forecasting using the Double Deep Q-Network (DDQN) model, treating the task as an imbalanced classification problem. This architecture enables the dynamic adjustment of risk-level forecasts according to market conditions. Empirical validation on daily Eurostoxx 50 data covering periods of crisis and high volatility shows a significant improvement in the accuracy of VaR estimates, as well as a reduction in the number of breaches and also in capital requirements, while respecting regulatory risk thresholds. The ability of the model to adjust risk levels in real time reinforces its relevance to modern and proactive risk management.
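The GARCH side of the hybrid can be sketched with the classical parametric pipeline: a GARCH(1,1) variance update feeding a Gaussian VaR quantile. The coefficients and return values below are hypothetical, and the paper's DDQN component is not reproduced here:

```python
from statistics import NormalDist

def garch11_next_var(omega, alpha, beta, prev_return, prev_var):
    """One GARCH(1,1) step: sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2."""
    return omega + alpha * prev_return ** 2 + beta * prev_var

def parametric_var(sigma, level=0.99, mu=0.0):
    """One-day Gaussian Value-at-Risk at the given confidence level, reported as a positive loss."""
    z = NormalDist().inv_cdf(1 - level)  # e.g. about -2.326 at the 99% level
    return -(mu + z * sigma)

# Hypothetical parameters and yesterday's -2% return.
var_t = garch11_next_var(omega=1e-6, alpha=0.1, beta=0.85,
                         prev_return=-0.02, prev_var=1.5e-4)
print(round(parametric_var(var_t ** 0.5, level=0.99), 4))
```

In the paper's framework, the RL agent's directional forecast would then adjust this risk level dynamically rather than leaving it fixed by the Gaussian assumption.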
Updated: 2025-08-18 11:41:40
标题: 连接计量经济学和人工智能:通过强化学习和GARCH模型进行VaR估计
摘要: 在日益波动的金融市场环境中,风险的准确估计仍然是一个重大挑战。传统的计量经济模型(如GARCH及其变种)所基于的假设往往过于僵化,无法适应当前市场动态的复杂性。为了克服这些局限,我们提出了一个结合GARCH波动率模型和深度强化学习的风险价值(VaR)估计混合框架。我们的方法采用双重深度Q网络(DDQN)模型进行市场方向预测,并将该任务视为一个不平衡分类问题。这种架构使得根据市场条件动态调整风险水平预测成为可能。在涵盖危机和高波动时期的Eurostoxx 50日度数据上的实证验证显示,VaR估计的准确性显著提高,超限(breach)次数和资本要求均有所减少,同时遵守监管风险阈值。模型实时调整风险水平的能力增强了其对现代化、前瞻性风险管理的相关性。
更新时间: 2025-08-18 11:41:40
领域: cs.AI,q-fin.CP,q-fin.RM,q-fin.ST
Exploring Content and Social Connections of Fake News with Explainable Text and Graph Learning
The global spread of misinformation and concerns about content trustworthiness have driven the development of automated fact-checking systems. Since false information often exploits social media dynamics such as "likes" and user networks to amplify its reach, effective solutions must go beyond content analysis to incorporate these factors. Moreover, simply labelling content as false can be ineffective or even reinforce biases such as automation and confirmation bias. This paper proposes an explainable framework that combines content, social media, and graph-based features to enhance fact-checking. It integrates a misinformation classifier with explainability techniques to deliver complete and interpretable insights supporting classification decisions. Experiments demonstrate that multimodal information improves performance over single modalities, with evaluations conducted on datasets in English, Spanish, and Portuguese. Additionally, the framework's explanations were assessed for interpretability, trustworthiness, and robustness with a novel protocol, showing that it effectively generates human-understandable justifications for its predictions.
Updated: 2025-08-18 11:35:59
标题: 使用可解释的文本和图形学习探索假新闻的内容和社交连接
摘要: 全球范围内的虚假信息传播和对内容可信度的担忧推动了自动事实核查系统的发展。由于虚假信息经常利用社交媒体动态(如"点赞"和用户网络)来扩大其影响范围,有效的解决方案必须超越内容分析,纳入这些因素。此外,简单地将内容标记为虚假可能无效,甚至会强化自动化偏见和确认偏见等偏见。本文提出了一个可解释的框架,结合内容、社交媒体和基于图的特征,以增强事实核查。它将一个虚假信息分类器与可解释性技术相结合,提供完整且可解释的见解,支持分类决策。实验证明,多模态信息相比单一模态提升了性能,评估是在英语、西班牙语和葡萄牙语的数据集上进行的。此外,通过一种新颖的协议评估了该框架解释的可解释性、可信度和鲁棒性,结果显示它能有效地生成人类可理解的预测理由。
更新时间: 2025-08-18 11:35:59
领域: cs.SI,cs.AI
Rethinking Aleatoric and Epistemic Uncertainty
The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. We identify incoherence in existing discussions of these ideas and suggest this stems from the aleatoric-epistemic view being insufficiently expressive to capture all the distinct quantities that researchers are interested in. To address this we present a decision-theoretic perspective that relates rigorous notions of uncertainty, predictive performance and statistical dispersion in data. This serves to support clearer thinking as the field moves forward. Additionally we provide insights into popular information-theoretic quantities, showing they can be poor estimators of what they are often purported to measure, while also explaining how they can still be useful in guiding data acquisition.
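Among the "popular information-theoretic quantities" referenced is the common ensemble decomposition of predictive entropy into an expected-entropy term (usually read as aleatoric) plus a mutual-information term (usually read as epistemic). A minimal sketch of that decomposition, which the paper argues can be a poor estimator of what it purports to measure:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose(ensemble_probs):
    """Total predictive entropy = expected member entropy ('aleatoric')
    + mutual information between prediction and parameters ('epistemic')."""
    n_classes = len(ensemble_probs[0])
    mean_p = [sum(p[c] for p in ensemble_probs) / len(ensemble_probs)
              for c in range(n_classes)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(p) for p in ensemble_probs) / len(ensemble_probs)
    epistemic = total - aleatoric  # mutual information; non-negative by Jensen's inequality
    return total, aleatoric, epistemic

# Members agree on a uniform prediction: all uncertainty reads as 'aleatoric'.
t, a, e = decompose([[0.5, 0.5], [0.5, 0.5]])
print(round(e, 6))  # 0.0
# Members disagree confidently: uncertainty reads as mostly 'epistemic'.
t, a, e = decompose([[0.99, 0.01], [0.01, 0.99]])
print(round(a, 3), round(e, 3))  # 0.056 0.637
```

Both examples have the same total predictive entropy (log 2), which is exactly the kind of ambiguity the decision-theoretic perspective in the paper is meant to disentangle.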
Updated: 2025-08-18 11:33:40
标题: 重新思考偶然不确定性与认知不确定性
摘要: 偶然不确定性与认知不确定性的概念被广泛用于推理机器学习模型的概率预测。我们发现现有关于这些概念的讨论存在不一致之处,并认为这源于偶然-认知二分视角表达力不足,无法涵盖研究人员感兴趣的所有不同量。为了解决这个问题,我们提出了一个决策论视角,将不确定性、预测性能和数据统计离散度的严格概念联系起来。这有助于在该领域前进时支持更清晰的思考。此外,我们提供了对流行信息论量的见解,表明它们可能是其通常声称要度量的量的糟糕估计量,同时也解释了它们如何仍能在指导数据获取方面发挥作用。
更新时间: 2025-08-18 11:33:40
领域: cs.LG,stat.ML
CAMAR: Continuous Actions Multi-Agent Routing
Multi-agent reinforcement learning (MARL) is a powerful paradigm for solving cooperative and competitive decision-making problems. While many MARL benchmarks have been proposed, few combine continuous state and action spaces with challenging coordination and planning tasks. We introduce CAMAR, a new MARL benchmark designed explicitly for multi-agent pathfinding in environments with continuous actions. CAMAR supports cooperative and competitive interactions between agents and runs efficiently at up to 100,000 environment steps per second. We also propose a three-tier evaluation protocol to better track algorithmic progress and enable deeper analysis of performance. In addition, CAMAR allows the integration of classical planning methods such as RRT and RRT* into MARL pipelines. We use them as standalone baselines and combine RRT* with popular MARL algorithms to create hybrid approaches. We provide a suite of test scenarios and benchmarking tools to ensure reproducibility and fair comparison. Experiments show that CAMAR presents a challenging and realistic testbed for the MARL community.
Updated: 2025-08-18 11:32:26
标题: CAMAR:连续行动多智能体路由
摘要: 多智能体强化学习(MARL)是解决合作和竞争性决策问题的强大范例。尽管提出了许多MARL基准测试,但很少有将连续状态和动作空间与具有挑战性的协调和规划任务结合在一起的。我们介绍了CAMAR,一个新的MARL基准测试,专为在具有连续动作的环境中进行多智能体路径规划而设计。CAMAR支持代理之间的合作和竞争交互,并能够以每秒高达100,000个环境步骤的效率运行。我们还提出了一个三层评估协议,以更好地跟踪算法进展并实现对性能的深入分析。此外,CAMAR允许将经典规划方法(如RRT和RRT*)集成到MARL流程中。我们将它们用作独立基线,并将RRT*与流行的MARL算法结合起来,创建混合方法。我们提供一系列测试场景和基准测试工具,以确保可重现性和公正比较。实验证明,CAMAR为MARL社区提供了一个具有挑战性和现实性的测试平台。
更新时间: 2025-08-18 11:32:26
领域: cs.AI,cs.LG,cs.MA
Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings
Speaker embeddings are promising identity-related features that can enhance the identity assignment performance of a tracking system by leveraging its spatial predictions, i.e., by performing identity reassignment. Common speaker embedding extractors usually struggle with short temporal contexts and overlapping speech, which imposes long-term identity reassignment to exploit longer temporal contexts. However, this increases the probability of tracking system errors, which in turn impacts negatively on identity reassignment. To address this, we propose a Knowledge Distillation (KD) based training approach for short context speaker embedding extraction from two speaker mixtures. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. We study the feasibility of performing identity reassignment over blocks of fixed size, i.e., blockwise identity reassignment, to go towards a low-latency speaker embedding based tracking system. Results demonstrate that our distilled models are effective at short-context embedding extraction and more robust to overlap. However, blockwise reassignment results indicate that further work is needed to handle simultaneous speech more effectively.
Updated: 2025-08-18 11:32:13
标题: 朝向使用短时上下文说话者嵌入实现多个说话者的低延迟跟踪
摘要: 说话者嵌入是一种有前景的身份相关特征,可以通过利用跟踪系统的空间预测来增强其身份分配性能,即执行身份重新分配。常见的说话者嵌入提取器通常难以处理较短的时间上下文和重叠语音,这迫使采用长期身份重新分配以利用更长的时间上下文。然而,这增加了跟踪系统出错的概率,进而对身份重新分配产生负面影响。为了解决这个问题,我们提出了一种基于知识蒸馏(KD)的训练方法,用于从两个说话者的混合语音中提取短上下文说话者嵌入。我们通过波束成形利用感兴趣说话者的空间信息来减少重叠。我们研究了在固定大小的块上执行身份重新分配(即块状身份重新分配)的可行性,以迈向基于说话者嵌入的低延迟跟踪系统。结果表明,我们的蒸馏模型在短上下文嵌入提取方面效果显著,并且对重叠更加稳健。不过,块状重新分配的结果表明,仍需进一步工作以更有效地处理同时说话的情况。
更新时间: 2025-08-18 11:32:13
领域: eess.AS,cs.AI,cs.SD,eess.SP
Scaling Multi-Agent Epistemic Planning through GNN-Derived Heuristics
Multi-agent Epistemic Planning (MEP) is an autonomous planning framework for reasoning about both the physical world and the beliefs of agents, with applications in domains where information flow and awareness among agents are critical. The richness of MEP requires states to be represented as Kripke structures, i.e., directed labeled graphs. This representation limits the applicability of existing heuristics, hindering the scalability of epistemic solvers, which must explore an exponential search space without guidance, resulting often in intractability. To address this, we exploit Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states, to guide the planning process. GNNs, which naturally capture the graph-like nature of Kripke models, allow us to derive meaningful estimates of state quality -- e.g., the distance from the nearest goal -- by generalizing knowledge obtained from previously solved planning instances. We integrate these predictive heuristics into an epistemic planning pipeline and evaluate them against standard baselines, showing significant improvements in the scalability of multi-agent epistemic planning.
Updated: 2025-08-18 11:26:20
标题: 通过GNN导出的启发式方法扩展多智能体认知规划
摘要: 多智能体认知规划(MEP)是一个自主规划框架,用于对物理世界和智能体的信念进行推理,可应用于智能体之间信息流和感知至关重要的领域。MEP的丰富性要求将状态表示为克里普克结构,即有向标记图。这种表示限制了现有启发式方法的适用性,阻碍了认知求解器的可扩展性:它们必须在没有指导的情况下探索指数级搜索空间,常常导致问题难以求解。为了解决这个问题,我们利用图神经网络(GNN)学习认知状态中的模式和关系结构,以指导规划过程。GNN自然地捕捉克里普克模型的图结构,使我们能够通过泛化从先前已求解规划实例中获得的知识,得出有意义的状态质量估计,例如与最近目标的距离。我们将这些预测性启发式方法集成到认知规划流程中,并与标准基线进行对比评估,结果显示多智能体认知规划的可扩展性有显著提升。
更新时间: 2025-08-18 11:26:20
领域: cs.AI,cs.MA
HRS: Hybrid Representation Framework with Scheduling Awareness for Time Series Forecasting in Crowdsourced Cloud-Edge Platforms
With the rapid proliferation of streaming services, network load exhibits highly time-varying and bursty behavior, posing serious challenges for maintaining Quality of Service (QoS) in Crowdsourced Cloud-Edge Platforms (CCPs). While CCPs leverage Predict-then-Schedule architecture to improve QoS and profitability, accurate load forecasting remains challenging under traffic surges. Existing methods either minimize mean absolute error, resulting in underprovisioning and potential Service Level Agreement (SLA) violations during peak periods, or adopt conservative overprovisioning strategies, which mitigate SLA risks at the expense of increased resource expenditure. To address this dilemma, we propose HRS, a hybrid representation framework with scheduling awareness that integrates numerical and image-based representations to better capture extreme load dynamics. We further introduce a Scheduling-Aware Loss (SAL) that captures the asymmetric impact of prediction errors, guiding predictions that better support scheduling decisions. Extensive experiments on four real-world datasets demonstrate that HRS consistently outperforms ten baselines and achieves state-of-the-art performance, reducing SLA violation rates by 63.1% and total profit loss by 32.3%.
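The Scheduling-Aware Loss itself is not specified in the abstract; as an illustration of the asymmetry it targets (under-provisioning risks SLA violations, over-provisioning only wastes resources), here is a pinball-style penalty with hypothetical weights:

```python
def asymmetric_loss(y_true, y_pred, under_weight=3.0, over_weight=1.0):
    """Penalize under-provisioning (pred < true) more heavily than over-provisioning."""
    err = y_true - y_pred
    return under_weight * err if err > 0 else -over_weight * err

print(asymmetric_loss(100, 90))   # 30.0: under-prediction, heavily penalized
print(asymmetric_loss(100, 110))  # 10.0: over-prediction, mildly penalized
```

Training a forecaster against such a loss biases its errors toward the cheaper side, which is the scheduling-aware behavior the paper's SAL is designed to induce.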
Updated: 2025-08-18 11:25:54
标题: HRS: 混合表示框架与调度意识在众包云边平台中用于时间序列预测
摘要: 随着流媒体服务的迅速增加,网络负载表现出高度时变和突发性行为,给维护众包云边平台(CCP)中的服务质量(QoS)带来了严峻挑战。虽然CCP利用预测然后调度的架构来提高QoS和盈利能力,但在流量激增下精确的负载预测仍然具有挑战性。现有方法要么最小化平均绝对误差,导致资源供给不足,在高峰时期可能违反服务级别协议(SLA),要么采用保守的过度供给策略,以牺牲资源开支来减轻SLA风险。为了解决这一困境,我们提出了HRS,一种具有调度意识的混合表示框架,集成了数值和基于图像的表示,以更好地捕捉极端负载动态。我们进一步引入了一个调度感知损失(SAL),捕捉预测误差的非对称影响,指导更好地支持调度决策的预测。对四个真实数据集进行的大量实验表明,HRS始终优于十个基线,并实现了最先进的性能,将SLA违规率降低了63.1%,总利润损失降低了32.3%。
更新时间: 2025-08-18 11:25:54
领域: cs.LG,cs.AI
Learning In-context $\pmb{n}$-grams with Transformers: Sub-$\pmb{n}$-grams Are Near-stationary Points
Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: {sub-$n$-grams are near-stationary points of the population cross-entropy loss}, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
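A minimal empirical $k$-gram estimator of the kind the analysis concerns (here $k = 2$) can be written directly; the transformer parameter constructions in the paper represent exactly such estimators:

```python
from collections import Counter, defaultdict

def kgram_probs(tokens, k):
    """Empirical k-gram next-token distribution, conditioned on the last k-1 tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - k + 1):
        ctx, nxt = tuple(tokens[i:i + k - 1]), tokens[i + k - 1]
        counts[ctx][nxt] += 1
    return {ctx: {t: c / sum(ctr.values()) for t, c in ctr.items()}
            for ctx, ctr in counts.items()}

seq = list("abababab")
bigram = kgram_probs(seq, k=2)
print(bigram[("a",)])  # {'b': 1.0}: in this context, 'a' is always followed by 'b'
```

The stage-wise dynamics described correspond to training first finding such a low-order estimator (a near-stationary point) before transitioning to higher-order ones.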
Updated: 2025-08-18 11:24:30
标题: 使用Transformer学习上下文中的n-grams:子n-grams接近稳定点
摘要: 受训练过程中长时间平台期和阶段性进展的实证观察启发,我们研究了在上下文内下一词元预测任务上训练的Transformer模型的损失景观。具体而言,我们专注于在交叉熵损失下学习上下文内$n$-gram语言模型,并建立了参数配置成为驻点的一个充分条件。然后,我们为简化的Transformer模型构建了一组表示$k$-gram估计量(对于$k \leq n$)的参数配置,并证明在序列长度和参数范数趋于无穷的极限下,总体损失在这些解处的梯度消失。这揭示了损失景观的一个关键性质:子$n$-gram是总体交叉熵损失的近驻点,为阶段性学习动态和涌现相变等广泛观察到的现象提供了理论洞见。这些洞见进一步得到了数值实验的支持,实验展示了$n$-gram的学习动态,其特征是在近驻解之间的离散转换。
更新时间: 2025-08-18 11:24:30
领域: cs.LG
Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
Stochastic gradient descent (SGD), one of the most fundamental optimization algorithms in machine learning (ML), can be recast through a continuous-time approximation as a Fokker-Planck equation for Langevin dynamics, a viewpoint that has motivated many theoretical studies. Within this framework, we study the relationship between the quasi-stationary distribution derived from this equation and the initial distribution through the Kullback-Leibler (KL) divergence. As the quasi-steady-state distribution depends on the expected cost function, the KL divergence eventually reveals the connection between the expected cost function and the initialization distribution. By applying this to deep neural network models (DNNs), we can express the bounds of the expected loss function explicitly in terms of the initialization parameters. Then, by minimizing this bound, we obtain an optimal condition of the initialization variance in the Gaussian case. This result provides a concrete mathematical criterion, rather than a heuristic approach, to select the scale of weight initialization in DNNs. In addition, we experimentally confirm our theoretical results by using the classical SGD to train fully connected neural networks on the MNIST and Fashion-MNIST datasets. The result shows that if the variance of the initialization distribution satisfies our theoretical optimal condition, then the corresponding DNN model always achieves lower final training loss and higher test accuracy than the conventional He-normal initialization. Our work thus supplies a mathematically grounded indicator that guides the choice of initialization variance and clarifies its physical meaning of the dynamics of parameters in DNNs.
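The paper's optimal variance condition is not stated in the abstract; the He-normal baseline it compares against sets Var(W) = 2/fan_in, which can be checked empirically:

```python
import math
import random

def he_normal_std(fan_in):
    """He-normal initialization: weights ~ N(0, 2 / fan_in), suited to ReLU layers."""
    return math.sqrt(2.0 / fan_in)

def init_layer(fan_in, fan_out, seed=0):
    rng = random.Random(seed)
    std = he_normal_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

w = init_layer(fan_in=512, fan_out=4, seed=0)
flat = [x for row in w for x in row]
emp_var = sum(x * x for x in flat) / len(flat)  # zero-mean, so E[x^2] is the variance
print(abs(emp_var - 2.0 / 512) < 1e-3)  # True: empirical variance near the 2/fan_in target
```

The paper's contribution is a criterion for choosing this variance from the SGD/Fokker-Planck analysis instead of the He heuristic; the code only fixes the baseline being compared against.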
Updated: 2025-08-18 11:18:12
标题: 深度神经网络中初始化方差的最佳条件:基于随机梯度下降动态的视角
摘要: 随机梯度下降(SGD)是机器学习(ML)中最基本的优化算法之一,可以通过连续时间近似重新构建为 Langevin 动力学的福克-普朗克方程,这一观点激发了许多理论研究。在这个框架下,我们研究了通过 Kullback-Leibler(KL)散度从这个方程导出的准稳态分布与初始分布之间的关系。由于准稳态分布取决于期望成本函数,KL 散度最终揭示了期望成本函数与初始化分布之间的联系。通过将这一概念应用于深度神经网络模型(DNNs),我们可以明确地表达期望损失函数的边界与初始化参数之间的关系。然后,通过最小化这个边界,我们得到了高斯情况下初始化方差的最优条件。这一结果提供了一个具体的数学标准,而不是一种启发式方法,来选择 DNNs 中权重初始化的尺度。此外,我们通过使用经典的 SGD 在 MNIST 和 Fashion-MNIST 数据集上训练全连接神经网络来实验证实了我们的理论结果。结果显示,如果初始化分布的方差满足我们的理论最优条件,则相应的 DNN 模型总是比传统的 He-normal 初始化获得更低的最终训练损失和更高的测试准确率。因此,我们的工作为选择初始化方差提供了一个基于数学的指标,并澄清了其在 DNNs 中参数动力学中的物理意义。
更新时间: 2025-08-18 11:18:12
领域: stat.ML,cs.LG
Toward Storage-Aware Learning with Compressed Data: An Empirical Exploratory Study on JPEG
On-device machine learning is often constrained by limited storage, particularly in continuous data collection scenarios. This paper presents an empirical study on storage-aware learning, focusing on the trade-off between data quantity and quality via compression. We demonstrate that naive strategies, such as uniform data dropping or one-size-fits-all compression, are suboptimal. Our findings further reveal that data samples exhibit varying sensitivities to compression, supporting the feasibility of a sample-wise adaptive compression strategy. These insights provide a foundation for developing a new class of storage-aware learning systems. The primary contribution of this work is the systematic characterization of this under-explored challenge, offering valuable insights that advance the understanding of storage-aware learning.
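The quantity-vs-quality trade-off has a simple arithmetic core: under a fixed byte budget, higher JPEG quality means fewer samples kept. The per-quality sizes below are hypothetical placeholders, not measurements from the study:

```python
def samples_within_budget(budget_bytes, bytes_per_sample):
    """How many samples fit at a given average compressed size per sample."""
    return budget_bytes // bytes_per_sample

# Hypothetical average sizes per image at different JPEG quality settings.
avg_size = {"q90": 60_000, "q50": 25_000, "q10": 8_000}
budget = 10_000_000  # 10 MB storage budget
for q, size in avg_size.items():
    print(q, samples_within_budget(budget, size))
```

The study's finding that samples differ in compression sensitivity is what makes a per-sample choice of quality (rather than one global setting) worthwhile.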
Updated: 2025-08-18 11:17:59
标题: 朝向具有存储感知的压缩数据学习:基于JPEG的实证探索性研究
摘要: 设备上的机器学习通常受到有限存储空间的限制,尤其是在连续数据收集场景中。本文对存储感知学习进行了实证研究,重点关注通过压缩在数据数量与质量之间进行权衡。我们证明了朴素策略(如均匀丢弃数据或一刀切的压缩)是次优的。我们的发现进一步揭示,数据样本对压缩的敏感度各不相同,这支持了逐样本自适应压缩策略的可行性。这些见解为开发一类新的存储感知学习系统奠定了基础。本研究的主要贡献是系统性地刻画了这一尚未充分探讨的挑战,提供了有助于加深对存储感知学习理解的宝贵见解。
更新时间: 2025-08-18 11:17:59
领域: cs.LG,cs.AI,68Txx,I.2; I.4.2; E.4
Efficient and Verifiable Privacy-Preserving Convolutional Computation for CNN Inference with Untrusted Clouds
The widespread adoption of convolutional neural networks (CNNs) in resource-constrained scenarios has driven the development of Machine Learning as a Service (MLaaS) systems. However, this approach is susceptible to privacy leakage, as the data sent from the client to the untrusted cloud server often contains sensitive information. Existing CNN privacy-preserving schemes, while effective in ensuring data confidentiality through homomorphic encryption and secret sharing, face efficiency bottlenecks, particularly in convolution operations. In this paper, we propose a novel verifiable privacy-preserving scheme tailored for CNN convolutional layers. Our scheme enables efficient encryption and decryption, allowing resource-constrained clients to securely offload computations to the untrusted cloud server. Additionally, we present a verification mechanism capable of detecting the correctness of the results with a success probability of at least $1-\frac{1}{\left|Z\right|}$. Extensive experiments conducted on 10 datasets and various CNN models demonstrate that our scheme achieves speedups ranging from $26\times$ to $87\times$ compared to the original plaintext model while maintaining accuracy.
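The $1-\frac{1}{|Z|}$ success probability is the signature of single-trial randomized verification. As an illustrative analog (not the paper's actual mechanism), Freivalds' check verifies a claimed matrix product with one random vector, accepting a wrong result with probability at most $1/|Z|$:

```python
import random

def freivalds_check(A, B, C, modulus, rng=random):
    """Probabilistically verify A @ B == C over Z_modulus.
    A wrong C is accepted with probability at most 1/modulus per trial."""
    n = len(A)
    r = [rng.randrange(modulus) for _ in range(n)]
    Br = [sum(B[i][j] * r[j] for j in range(n)) % modulus for i in range(n)]
    ABr = [sum(A[i][j] * Br[j] for j in range(n)) % modulus for i in range(n)]
    Cr = [sum(C[i][j] * r[j] for j in range(n)) % modulus for i in range(n)]
    return ABr == Cr  # two matrix-vector products instead of a full n^3 recomputation

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_good = [[19, 22], [43, 50]]
print(freivalds_check(A, B, C_good, modulus=101))  # True: a correct product always passes
```

Since a convolution is a linear map, the same idea applies there: the client spends far less work checking the cloud's answer than computing it.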
Updated: 2025-08-18 11:17:53
标题: 高效和可验证的隐私保护卷积计算:用于CNN推断的不受信任云端计算
摘要: 卷积神经网络(CNNs)在资源受限场景中的广泛应用推动了机器学习服务(MLaaS)系统的发展。然而,这种方法容易受到隐私泄露的影响,因为从客户端发送到不受信任的云服务器的数据通常包含敏感信息。现有的CNN隐私保护方案,虽然通过同态加密和秘密共享确保数据机密性,但在卷积运算中面临效率瓶颈。本文提出了一种针对CNN卷积层量身定制的新颖的可验证隐私保护方案。我们的方案实现了高效的加密和解密,使资源受限的客户能够安全地将计算任务卸载到不受信任的云服务器。此外,我们提出了一个验证机制,能够以至少$1-\frac{1}{\left|Z\right|}$的成功概率检测结果的正确性。在10个数据集和各种CNN模型上进行的大量实验表明,我们的方案在保持准确性的同时,相对于原始明文模型实现了26倍~87倍的加速。
更新时间: 2025-08-18 11:17:53
领域: cs.CR,cs.LG
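The $1-\frac{1}{\left|Z\right|}$ detection probability quoted above is characteristic of Freivalds-style randomized verification, where a random vector with entries drawn from a finite set $Z$ checks an outsourced matrix (or im2col-unrolled convolution) product. A minimal sketch of that idea, not the paper's actual protocol (which combines verification with encryption and secret sharing):

```python
import numpy as np

def freivalds_check(A, B, C, field, trials=1):
    """Randomized check that C == A @ B.

    Entries of a random vector r are drawn from `field` (a finite set of
    integers). A single trial accepts a wrong product with probability at
    most 1/|field|, so errors are detected with probability >= 1 - 1/|field|.
    """
    rng = np.random.default_rng(0)
    n = B.shape[1]
    for _ in range(trials):
        r = rng.choice(field, size=n)
        # Compare A @ (B @ r) with C @ r: O(n^2) work instead of O(n^3).
        if not np.array_equal(A @ (B @ r), C @ r):
            return False  # definitely wrong
    return True  # correct, or wrong with prob <= (1/|field|)^trials

field = np.arange(101)  # |Z| = 101
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C_good = A @ B
C_bad = C_good + np.array([[0, 1], [0, 0]])  # a single corrupted entry

print(freivalds_check(A, B, C_good, field))  # True
print(freivalds_check(A, B, C_bad, field, trials=5))
```

Because the check only needs matrix-vector products, a weak client can verify work returned by an untrusted cloud far more cheaply than redoing the convolution itself.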
Context Matters: Incorporating Target Awareness in Conversational Abusive Language Detection
Abusive language detection has become an increasingly important task as a means to tackle this type of harmful content in social media. There has been a substantial body of research developing models for determining if a social media post is abusive or not; however, this research has primarily focused on exploiting social media posts individually, overlooking additional context that can be derived from surrounding posts. In this study, we look at conversational exchanges, where a user replies to an earlier post by another user (the parent tweet). We ask: does leveraging context from the parent tweet help determine if a reply post is abusive or not, and what are the features that contribute the most? We study a range of content-based and account-based features derived from the context, and compare this to the more widely studied approach of only looking at the features from the reply tweet. For a more generalizable study, we test four different classification models on a dataset made of conversational exchanges (parent-reply tweet pairs) with replies labeled as abusive or not. Our experiments show that incorporating contextual features leads to substantial improvements compared to the use of features derived from the reply tweet only, confirming the importance of leveraging context. We observe that, among the features under study, it is especially the content-based features (what is being posted) that contribute to the classification performance rather than account-based features (who is posting it). While using content-based features, it is best to combine a range of different features to ensure improved performance over being more selective and using fewer features. Our study provides insights into the development of contextualized abusive language detection models in realistic settings involving conversations.
Updated: 2025-08-18 11:12:21
标题: 情境重要性:在对话中滥用语言检测中纳入目标意识
摘要: 检测辱骂性语言已经成为一项越来越重要的任务,作为解决社交媒体中这种有害内容的手段。已经有大量研究开发了模型,用于确定社交媒体帖子是否具有辱骂性;然而,这些研究主要集中在单独利用社交媒体帖子,忽略了可以从周围帖子中得出的额外上下文。在这项研究中,我们关注对话交流,用户通过回复另一用户的早期帖子(父级推文)进行互动。我们提出了一个问题:利用父级推文的上下文是否有助于确定回复帖子是否具有辱骂性,以及哪些特征起到了最大的作用?我们研究了从上下文中得出的一系列基于内容和基于帐户的特征,并将其与仅查看回复推文特征的更广泛研究方法进行了比较。为了进行更具一般性的研究,我们在一个由对话交流(父级-回复推文对)组成的数据集上测试了四种不同的分类模型,其中回复被标记为具有辱骂性或非辱骂性。我们的实验表明,整合上下文特征相对于仅使用从回复推文中得出的特征,可以显著提高性能,证实了利用上下文的重要性。我们观察到,在所研究的特征中,尤其是基于内容的特征(发布的内容)对分类性能的贡献更大,而不是基于帐户的特征(发布者是谁)。在使用基于内容的特征时,最好结合一系列不同的特征,以确保相对于更加选择性和使用更少特征而言获得改进的性能。我们的研究为在涉及对话的现实环境中开发上下文化辱骂性语言检测模型提供了见解。
更新时间: 2025-08-18 11:12:21
领域: cs.CL,cs.AI
Goal-Directedness is in the Eye of the Beholder
Our ability to predict the behavior of complex agents turns on the attribution of goals. Probing for goal-directed behavior comes in two flavors: Behavioral and mechanistic. The former proposes that goal-directedness can be estimated through behavioral observation, whereas the latter attempts to probe for goals in internal model states. We work through the assumptions behind both approaches, identifying technical and conceptual problems that arise from formalizing goals in agent systems. We arrive at the perhaps surprising position that goal-directedness cannot be measured objectively. We outline new directions for modeling goal-directedness as an emergent property of dynamic, multi-agent systems.
Updated: 2025-08-18 11:04:18
标题: 目标导向性取决于观察者的眼光
摘要: 我们预测复杂代理行为的能力取决于目标的归因。探究目标导向行为有两种方式:行为式和机制式。前者提出,通过行为观察可以估计目标导向性,而后者试图在内部模型状态中探究目标。我们梳理了这两种方法背后的假设,指出了在代理系统中形式化目标所引发的技术和概念问题。我们得出一个或许令人惊讶的结论,即目标导向性无法被客观地衡量。我们概述了将目标导向性建模为动态多代理系统的涌现属性的新方向。
更新时间: 2025-08-18 11:04:18
领域: cs.MA,cs.AI
A Novel Approach for Estimating Largest Lyapunov Exponents in One-Dimensional Chaotic Time Series Using Machine Learning
Understanding and quantifying chaos from data remains challenging. We present a data-driven method for estimating the largest Lyapunov exponent (LLE) from one-dimensional chaotic time series using machine learning. A predictor is trained to produce out-of-sample, multi-horizon forecasts; the LLE is then inferred from the exponential growth of the geometrically averaged forecast error (GMAE) across the horizon, which serves as a proxy for trajectory divergence. We validate the approach on four canonical 1D maps (logistic, sine, cubic, and Chebyshev), achieving $R^2_{pos} > 0.99$ against reference LLE curves with series as short as M = 450. Among the baselines, KNN yields the closest fits (KNN-R is comparable; RF shows larger deviations). By design the estimator targets positive exponents: in periodic/stable regimes it returns values indistinguishable from zero. Noise robustness is assessed by adding zero-mean white measurement noise and summarizing performance against the average SNR over parameter sweeps: accuracy saturates for SNRm > 30 dB and collapses below 27 dB, a conservative sensor-level benchmark. The method is simple, computationally efficient, and model-agnostic, requiring only stationarity and the presence of a dominant positive exponent. It offers a practical route to LLE estimation in experimental settings where only scalar time-series measurements are available, with extensions to higher-dimensional and irregularly sampled data left for future work.
Updated: 2025-08-18 10:55:42
标题: 一种利用机器学习估算一维混沌时间序列最大Lyapunov指数的新方法
摘要: 从数据中理解和量化混沌仍然具有挑战性。我们提出了一种利用机器学习从一维混沌时间序列中估计最大Lyapunov指数(LLE)的数据驱动方法。训练一个预测器产生样本外、多时间跨度的预测,然后从几何平均预测误差(GMAE)随预测跨度的指数增长推断LLE,该误差增长可作为轨迹发散的代理指标。我们在四个经典的一维映射(logistic、sine、cubic和Chebyshev)上验证了该方法,即使序列长度短至M = 450,与参考LLE曲线相比仍可达到R2pos > 0.99。在基线方法中,KNN给出最接近的拟合(KNN-R相当;RF偏差较大)。按设计,该估计器针对正指数:在周期性/稳定区域中,它返回与零无法区分的值。通过添加零均值白测量噪声来评估噪声鲁棒性,并在参数扫描中以平均信噪比总结性能:当SNRm > 30 dB时准确性趋于饱和,低于27 dB时急剧下降,这是一个保守的传感器级基准。该方法简单、计算效率高、与模型无关,只要求平稳性和存在一个主导的正指数。它为仅能获得标量时间序列测量的实验场景中的LLE估计提供了一条实用途径,对高维和不规则采样数据的扩展留待未来工作。
更新时间: 2025-08-18 10:55:42
领域: nlin.CD,cs.AI
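The core recipe, train a forecaster, measure how its out-of-sample error grows with the horizon, and read the LLE off the slope of the log geometric-mean error, can be sketched on the logistic map (true LLE $\ln 2 \approx 0.693$), with a 1-nearest-neighbour predictor standing in for the trained ML model:

```python
import numpy as np

def logistic_series(n, x0=0.2, r=4.0):
    """Iterate the logistic map x <- r x (1 - x); for r = 4, LLE = ln 2."""
    x = np.empty(n)
    x[0] = x0
    for i in range(n - 1):
        x[i + 1] = r * x[i] * (1 - x[i])
    return x

def lle_from_forecast_errors(train, test, horizons=8):
    """Estimate the LLE from forecast-error growth: a 1-NN 'predictor'
    forecasts each test point over several horizons, and the LLE is the
    slope of log(geometric-mean absolute error) versus horizon."""
    log_err = np.zeros(horizons)
    counts = np.zeros(horizons)
    for t in range(len(test) - horizons):
        # nearest training state to the current test state
        j = np.argmin(np.abs(train[:-horizons] - test[t]))
        for h in range(1, horizons + 1):
            err = abs(train[j + h] - test[t + h])
            if err > 1e-12:
                log_err[h - 1] += np.log(err)
                counts[h - 1] += 1
    gmae = log_err / counts          # log geometric-mean error per horizon
    h_axis = np.arange(1, horizons + 1)
    # fit the early, pre-saturation part of the growth curve
    return np.polyfit(h_axis[:5], gmae[:5], 1)[0]

series = logistic_series(3000)
lle = lle_from_forecast_errors(series[:2500], series[2500:])
print(f"estimated LLE = {lle:.2f} (true value ln 2 = 0.693)")
```

This is a toy version of the paper's pipeline: the abstract's trained KNN/RF predictors are replaced by a raw nearest-neighbour analog, but the GMAE-slope readout is the same idea.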
Learning to Steer: Input-dependent Steering for Multimodal LLMs
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines.
Updated: 2025-08-18 10:53:20
标题: 学习引导:多模态LLMs的输入相关引导
摘要: Steering已经成为一种实用的方法,可以实现对LLMs进行事后指导,以达到执行特定行为的目的。然而,对于多模态LLMs(MLLMs),该方法在很大程度上尚未得到探索;此外,现有的引导技术,如平均引导,依赖于单个引导向量,独立于输入查询应用。当所需行为取决于手头的样本时,这种范式面临限制。例如,当被问及非法活动时,安全的答案可能是放弃回答,或者可能指向外部资源或与专家咨询。在本文中,我们研究了一种使用特定于输入的线性偏移的细粒度引导。这种偏移是使用对比输入特定提示计算的。然而,这种方法所需的特定于输入的提示在测试时是未知的。因此,我们提出训练一个小型辅助模块来预测特定于输入的引导向量。我们的方法,称为L2S(学习引导),表明它减少了幻觉,并在MLLMs中执行安全性,优于其他静态基线。
更新时间: 2025-08-18 10:53:20
领域: cs.LG,cs.AI,cs.CL,cs.CV
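The contrast between a single mean steering vector and an input-dependent one can be illustrated on synthetic data, with an ordinary least-squares fit standing in for the small auxiliary module (all dimensions and names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: for each input feature vector x, the "ideal" steering
# vector v*(x) (obtained offline via contrastive prompting) depends on x.
d_feat, d_hid, n = 8, 16, 200
X = rng.normal(size=(n, d_feat))
W_true = rng.normal(size=(d_feat, d_hid))
V_star = X @ W_true                      # input-specific target shifts

# Static (mean) steering: one vector applied to every input.
v_mean = V_star.mean(axis=0)
err_static = np.linalg.norm(V_star - v_mean, axis=1).mean()

# L2S-style: a small module (here a linear map fit by least squares)
# predicts the steering vector from the input features.
W_hat, *_ = np.linalg.lstsq(X, V_star, rcond=None)
err_l2s = np.linalg.norm(V_star - X @ W_hat, axis=1).mean()

print(f"static steering error: {err_static:.3f}")
print(f"input-dependent error: {err_l2s:.3f}")   # ~0 on this linear toy
```

When the desired shift genuinely varies with the input, as in the safety examples above, no single vector can match it, which is exactly the gap the learned module closes.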
SIS-Challenge: Event-based Spatio-temporal Instance Segmentation Challenge at the CVPR 2025 Event-based Vision Workshop
We present an overview of the Spatio-temporal Instance Segmentation (SIS) challenge held in conjunction with the CVPR 2025 Event-based Vision Workshop. The task is to predict accurate pixel-level segmentation masks of defined object classes from spatio-temporally aligned event camera and grayscale camera data. We provide an overview of the task, dataset, challenge details and results. Furthermore, we describe the methods used by the top-5 ranking teams in the challenge. More resources and code of the participants' methods are available here: https://github.com/tub-rip/MouseSIS/blob/main/docs/challenge_results.md
Updated: 2025-08-18 10:49:06
标题: SIS挑战:CVPR 2025事件视觉研讨会上的基于事件的时空实例分割挑战
摘要: 我们介绍了与CVPR 2025事件视觉研讨会同时举行的时空实例分割(SIS)挑战的概况。该任务是从时空对齐的事件相机和灰度相机数据中预测定义对象类的准确像素级分割蒙版。我们提供了任务、数据集、挑战细节和结果的概述。此外,我们描述了挑战中排名前五的团队使用的方法。参与者方法的更多资源和代码可以在这里找到:https://github.com/tub-rip/MouseSIS/blob/main/docs/challenge_results.md
更新时间: 2025-08-18 10:49:06
领域: cs.CV,cs.LG
Next Visual Granularity Generation
We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different level of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 -> 3.03, 2.57 ->2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
Updated: 2025-08-18 10:47:37
标题: 下一视觉粒度生成
摘要: 我们提出了一种新颖的图像生成方法,将图像分解为一个结构化的序列,其中序列中的每个元素具有相同的空间分辨率,但在使用的唯一令牌数量上有所不同,捕捉了不同级别的视觉细节。图像生成是通过我们新引入的Next Visual Granularity(NVG)生成框架进行的,该框架从空白图像开始生成一个视觉细节序列,并逐步精炼它,从全局布局到细节,以结构化的方式进行。这种迭代过程编码了一个分层的层次表示,可以在多个粒度级别上对生成过程进行精细控制。我们在ImageNet数据集上为类别条件图像生成训练了一系列NVG模型,并观察到明显的扩展行为。与VAR系列相比,NVG在FID分数方面始终表现更好(3.30->3.03,2.57->2.44,2.09->2.06)。我们还进行了广泛的分析,展示了NVG框架的能力和潜力。我们的代码和模型将会发布。
更新时间: 2025-08-18 10:47:37
领域: cs.CV,cs.AI,cs.LG
Involuntary Jailbreak
In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term \textbf{involuntary jailbreak}. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for \textit{building a bomb}. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem can motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in future.
Updated: 2025-08-18 10:38:30
标题: 非自愿越狱
摘要: 在这项研究中,我们披露了大型语言模型(LLMs)中一个令人担忧的新漏洞,我们将其称为\textbf{非自愿越狱}。与现有的越狱攻击不同,这种弱点的独特之处在于它不涉及特定的攻击目标,比如生成\textit{制造炸弹的指令}。先前的攻击方法主要针对LLM防护栏的局部组件。相反,非自愿越狱可能潜在地危及整个防护栏结构,我们的方法揭示了这种结构出乎意料地脆弱。我们仅使用一个通用提示来实现这个目标。具体来说,我们指示LLMs生成几个通常会被拒绝的问题及其相应的深入回答(而不是拒绝)。令人惊讶的是,这种简单的提示策略始终能越狱大多数领先的LLMs,包括Claude Opus 4.1、Grok 4、Gemini 2.5 Pro和GPT 4.1。我们希望这个问题能激励研究人员和从业者重新评估LLM防护栏的稳固性,并为未来更强的安全对齐做出贡献。
更新时间: 2025-08-18 10:38:30
领域: cs.CR,cs.AI
Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings
Generating diverse, all-atom conformational ensembles of dynamic proteins such as G-protein-coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD-FPG), a framework that constructs complete all-atom protein structures, including every side-chain heavy atom, directly from molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural network (ChebNet) to obtain low-dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue-based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral-angle losses, maps back to Cartesian coordinates. Using D2R-MD, a 2-microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor in a membrane environment, the sequential and residue-based pooling strategy reproduces the reference ensemble with high structural fidelity (all-atom lDDT of approximately 0.7; C-alpha-lDDT of approximately 0.8) and recovers backbone and side-chain dihedral-angle distributions with a Jensen-Shannon divergence of less than 0.03 compared to the MD data. LD-FPG thereby offers a practical route to system-specific, all-atom ensemble generation for large proteins, providing a promising tool for structure-based therapeutic design on complex, dynamic targets. The D2R-MD dataset and our implementation are freely available to facilitate further research.
Updated: 2025-08-18 10:37:31
标题: 使用图嵌入上的潜在扩散对全原子蛋白构象进行生成建模
摘要: 生成动态蛋白质(如G蛋白偶联受体(GPCRs))的多样性全原子构象集对于理解它们的功能至关重要,然而大多数生成模型简化原子细节或完全忽略构象多样性。我们提出了全蛋白生成的潜在扩散(LD-FPG)框架,该框架直接从分子动力学(MD)轨迹中构建完整的全原子蛋白结构,包括每个侧链的重原子。LD-FPG采用Chebyshev图神经网络(ChebNet)获取蛋白质构象的低维潜在嵌入,这些嵌入使用三种池化策略进行处理:盲目的、顺序的和基于残基的。在这些潜在表示上训练的扩散模型生成新样本,解码器将其映射回笛卡尔坐标,可选择通过二面角损失进行正则化。使用D2R-MD,在膜环境中的人类多巴胺D2受体的2微秒MD轨迹(12000帧),顺序和基于残基的池化策略以高结构保真度(全原子lDDT约为0.7;C-alpha-lDDT约为0.8)复制了参考集合,并恢复了骨架和侧链二面角分布,与MD数据相比,Jensen-Shannon散度小于0.03。因此,LD-FPG为大型蛋白质提供了一条实际的系统特异性全原子构象集生成途径,为基于结构的治疗设计提供了一个有前景的工具,可应用于复杂的动态靶点。D2R-MD数据集和我们的实现均可免费获得,以促进进一步的研究。
更新时间: 2025-08-18 10:37:31
领域: q-bio.BM,cs.LG
V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?
Road safety assessments are critical yet costly, especially in Low- and Middle-Income Countries (LMICs), where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning-based approaches struggle to generalise across regions. In this paper, we introduce \textit{V-RoAst}, a zero-shot Visual Question Answering (VQA) framework using Vision-Language Models (VLMs) to classify road safety attributes defined by the iRAP standard. We introduce the first open-source dataset from ThaiRAP, consisting of over 2,000 curated street-level images from Thailand annotated for this task. We evaluate Gemini-1.5-flash and GPT-4o-mini on this dataset and benchmark their performance against VGGNet and ResNet baselines. While VLMs underperform on spatial awareness, they generalise well to unseen classes and offer flexible prompt-based reasoning without retraining. Our results show that VLMs can serve as automatic road assessment tools when integrated with complementary data. This work is the first to explore VLMs for zero-shot infrastructure risk assessment and opens new directions for automatic, low-cost road safety mapping. Code and dataset: https://github.com/PongNJ/V-RoAst.
Updated: 2025-08-18 10:32:38
标题: V-RoAst:视觉道路评估。VLM能够使用iRAP标准作为道路安全评估员吗?
摘要: 道路安全评估至关重要,但成本高昂,特别是在低收入和中等收入国家(LMICs)中,大多数道路仍未评级。传统方法需要专家注释和训练数据,而基于监督学习的方法在不同地区之间难以泛化。在本文中,我们介绍了一种名为\textit{V-RoAst}的零样本视觉问答(VQA)框架,使用视觉-语言模型(VLMs)来对iRAP标准定义的道路安全属性进行分类。我们介绍了来自ThaiRAP的第一个开源数据集,包括来自泰国的超过2,000张为此任务注释的街道级图像。我们在这个数据集上评估了Gemini-1.5-flash和GPT-4o-mini,并将它们的性能与VGGNet和ResNet基线进行了基准测试。虽然VLMs在空间感知方面表现不佳,但它们很好地泛化到看不见的类别,并提供了灵活的基于提示的推理,无需重新训练。我们的结果表明,当与补充数据集成时,VLMs可以作为自动道路评估工具。这项工作是首次探索VLMs用于零样本基础设施风险评估,并为自动、低成本的道路安全绘图开辟了新方向。代码和数据集:https://github.com/PongNJ/V-RoAst。
更新时间: 2025-08-18 10:32:38
领域: cs.CV,cs.AI,cs.ET
Maximum Score Routing For Mixture-of-Experts
Routing networks in sparsely activated mixture-of-experts (MoE) dynamically allocate input tokens to top-k experts through differentiable sparse transformations, enabling scalable model capacity while preserving computational efficiency. Traditional MoE networks impose an expert capacity constraint to ensure GPU-friendly computation. However, this leads to token dropping when capacity is saturated and results in low hardware efficiency due to padding in underutilized experts. Removing the capacity constraint, in turn, compromises load balancing and computational efficiency. To address these issues, we propose Maximum Score Routing ($\mathbf{MaxScore}$), a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator. MaxScore resolves the fundamental limitations of iterative rerouting and optimal transport formulations, achieving lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines. Implementation details and experimental configurations can be obtained from $\href{https://github.com/dongbw18/MaxScore.git}{MaxScore}$.
Updated: 2025-08-18 10:25:42
标题: 专家混合的最大得分路由
摘要: 在稀疏激活的混合专家(MoE)中,路由网络通过可微稀疏变换动态地将输入令牌分配给前k个专家,从而实现可扩展的模型容量,同时保持计算效率。传统的MoE网络施加专家容量约束,以确保GPU友好的计算。然而,当容量饱和时,这会导致令牌丢弃,并由于未充分利用的专家中的填充而导致低硬件效率。反之,移除容量约束又会损害负载均衡和计算效率。为了解决这些问题,我们提出了最大得分路由($\mathbf{MaxScore}$),这是一种将路由建模为最小成本最大流问题并集成SoftTopk算子的新型MoE路由范式。MaxScore解决了迭代重路由和最优传输形式化的根本局限,在相同FLOPs下,相比受约束和无约束基线均实现了更低的训练损失和更高的评估分数。实现细节和实验配置可从$\href{https://github.com/dongbw18/MaxScore.git}{MaxScore}$获取。
更新时间: 2025-08-18 10:25:42
领域: cs.LG,cs.CL
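The token-dropping problem that motivates MaxScore is easy to reproduce: under greedy top-k routing, any token that arrives after its chosen expert's capacity is full gets dropped. A toy illustration of the failure mode (not the MaxScore algorithm itself, which instead solves a min-cost max-flow):

```python
import numpy as np

def topk_route(scores, k, capacity):
    """Greedy top-k routing with a per-expert capacity limit: token-expert
    pairs that arrive after an expert is full are dropped."""
    n_tok, n_exp = scores.shape
    load = np.zeros(n_exp, dtype=int)
    assign = [[] for _ in range(n_tok)]
    dropped = 0
    for t in range(n_tok):
        for e in np.argsort(-scores[t])[:k]:   # token t's top-k experts
            if load[e] < capacity:
                load[e] += 1
                assign[t].append(int(e))
            else:
                dropped += 1                   # expert full: pair is lost
    return assign, dropped

rng = np.random.default_rng(1)
scores = rng.normal(size=(32, 4))      # routing scores: 32 tokens, 4 experts
assign, dropped = topk_route(scores, k=2, capacity=12)
total = 32 * 2                          # demanded token-expert pairs
print(f"routed {total - dropped}/{total} pairs, dropped {dropped}")
```

With 64 demanded pairs but only 48 capacity slots, drops are unavoidable under this greedy scheme; a flow-based router can instead trade off scores globally against the capacity constraints.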
Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule, prioritizing process-level ATR early and transitioning to outcome rewards, accelerating convergence on effective reasoning paths. Experiments on seven benchmarks show consistent improvements over the state-of-the-art. Key advantages include: (1) Atom-Searcher scales computation at test-time. (2) Atomic Thought provides supervision anchors for RRMs, bridging deep research tasks and RRMs. (3) Atom-Searcher exhibits more interpretable, human-like reasoning patterns.
Updated: 2025-08-18 10:23:10
标题: Atom-Searcher: 通过细粒度的原子思想奖励增强主动深度研究
摘要: 大型语言模型(LLMs)展示出出色的问题解决能力,但由于内部知识的静态性而在复杂任务中遇到困难。检索增强生成(RAG)增强了对外部信息的访问,但由于刚性工作流程而在多跳推理和战略搜索中仍然受到限制。最近在主动深度研究方面取得的进展使LLMs能够自主推理、搜索和综合信息。然而,目前依赖于基于结果的强化学习(RL)的方法面临关键问题,如梯度冲突和奖励稀疏性,限制了性能提升和训练效率。为了解决这些问题,我们首先提出了原子思维,这是一种将推理分解为细粒度功能单元的新颖LLM思维范式。这些单元由推理奖励模型(RRMs)监督,为细粒度指导提供原子思维奖励(ATR)。在此基础上,我们提出了Atom-Searcher,这是一种新颖的RL框架,用于主动深度研究,它集成了原子思维和ATR。Atom-Searcher使用一个受课程启发的奖励计划,优先考虑过程级ATR,并过渡到结果奖励,加快有效推理路径的收敛。在七个基准测试上的实验表明,与最新技术相比,Atom-Searcher表现出一致的改进。其主要优点包括:(1) Atom-Searcher在测试时能够扩展计算。(2) 原子思维为RRMs提供了监督锚点,连接了深度研究任务和RRMs。(3) Atom-Searcher展现出更具可解释性、类似于人类的推理模式。
更新时间: 2025-08-18 10:23:10
领域: cs.CL,cs.AI
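The curriculum-inspired schedule, process-level Atomic Thought Rewards early and outcome rewards late, can be written as a simple blend; the linear decay below is an illustrative assumption, not the paper's exact schedule:

```python
def curriculum_reward(atr, outcome, step, total_steps):
    """Blend a process-level Atomic Thought Reward (ATR) with an outcome
    reward, weighting ATR early in training and the outcome late.
    The linear alpha decay is an illustrative assumption."""
    alpha = max(0.0, 1.0 - step / total_steps)   # 1 -> 0 over training
    return alpha * atr + (1.0 - alpha) * outcome

# Early on, a good reasoning process earns reward even when the final
# answer is wrong; late in training the outcome dominates.
early = curriculum_reward(atr=0.8, outcome=0.0, step=100, total_steps=1000)
late = curriculum_reward(atr=0.8, outcome=0.0, step=900, total_steps=1000)
print(f"early reward {early:.2f}, late reward {late:.2f}")  # 0.72 vs 0.08
```

Densifying the early reward this way is one standard remedy for the reward-sparsity issue the abstract describes, since partial credit for sound intermediate steps gives the policy gradient something to follow before full solutions appear.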
A Shift in Perspective on Causality in Domain Generalization
The promise that causal modelling can lead to robust AI generalization has been challenged in recent work on domain generalization (DG) benchmarks. We revisit the claims of the causality and DG literature, reconciling apparent contradictions and advocating for a more nuanced theory of the role of causality in generalization. We also provide an interactive demo at https://chai-uk.github.io/ukairs25-causal-predictors/.
Updated: 2025-08-18 10:19:33
标题: 领域泛化中因果关系视角的转变
摘要: 最近有关领域泛化(DG)基准的研究挑战了因果建模可以实现强大AI泛化的承诺。我们重新审视了因果和DG文献中的论点,调和了表面上的矛盾,并倡导对泛化中因果作用的更加细致的理论。我们还提供了一个交互式演示,网址为https://chai-uk.github.io/ukairs25-causal-predictors/。
更新时间: 2025-08-18 10:19:33
领域: cs.LG,cs.AI,cs.CV
Vehicle detection from GSV imagery: Predicting travel behaviour for cycling and motorcycling using Computer Vision
Transportation influences health by shaping exposure to physical activity, air pollution and injury risk. Comparative data on cycling and motorcycling behaviours are scarce, particularly at a global scale. Street view imagery, such as Google Street View (GSV), combined with computer vision, is a valuable resource for efficiently capturing travel behaviour data. This study demonstrates a novel approach using deep learning on street view images to estimate cycling and motorcycling levels across diverse cities worldwide. We utilized data from 185 global cities. The mode shares of cycling and motorcycling were estimated using travel surveys or censuses. We used GSV images to detect cycles and motorcycles in sampled locations, using 8000 images per city. The YOLOv4 model, fine-tuned using images from six cities, achieved a mean average precision of 89% for detecting cycles and motorcycles in GSV images. A global prediction model was developed using beta regression with city-level mode shares as the outcome and log-transformed counts of GSV-detected images with cycles and motorcycles as explanatory variables, while controlling for population density. We found strong correlations between GSV motorcycle counts and motorcycle mode share (0.78) and moderate correlations between GSV cycle counts and cycling mode share (0.51). Beta regression models predicted mode shares with $R^2$ values of 0.614 for cycling and 0.612 for motorcycling, achieving median absolute errors (MDAE) of 1.3% and 1.4%, respectively. Scatterplots demonstrated consistent prediction accuracy, though cities like Utrecht and Cali were outliers. The model was applied to 60 cities globally for which we did not have recent mode share data. We provided estimates for some cities in the Middle East, Latin America and East Asia. With computer vision, GSV images capture travel modes and activity, providing insights alongside traditional data sources.
Updated: 2025-08-18 10:17:30
标题: GSV图像中的车辆检测:利用计算机视觉预测骑行和骑摩托车的出行行为
摘要: 交通方式通过塑造对体力活动、空气污染和受伤风险的暴露影响健康。比较自行车和摩托车行为的数据稀缺,尤其是在全球范围内。街景图像,如Google街景图(GSV),结合计算机视觉,是有效捕捉旅行行为数据的宝贵资源。本研究展示了一种新颖的方法,利用街景图像上的深度学习来估计全球各个城市的自行车和摩托车水平。我们利用了来自185个全球城市的数据。自行车和摩托车的模式份额数据是使用旅行调查或人口普查估计的。我们使用GSV图像在抽样位置检测自行车和摩托车,每个城市使用8000张图像。经过使用六个城市的图像对YOLOv4模型进行微调,其在检测GSV图像中的自行车和摩托车方面达到了89%的平均精度。使用城市级模式份额作为结果的beta回归开发了全球预测模型,对数转换的解释变量为GSV检测到的自行车和摩托车图像计数,同时控制人口密度。我们发现GSV摩托车计数与摩托车模式份额之间有较强的相关性(0.78),GSV自行车计数与自行车模式份额之间有适中的相关性(0.51)。Beta回归模型对自行车和摩托车的模式份额进行了预测,得到了0.614和0.612的$R^2$值,分别达到了1.3%和1.4%的中位绝对误差(MDAE)。散点图展示了一致的预测准确性,尽管乌得勒支和卡利等城市是离群值。该模型应用于全球60个我们没有最新模式份额数据的城市。我们为中东、拉丁美洲和东亚一些城市提供了估计。利用计算机视觉,GSV图像捕捉了出行方式和活动,为传统数据源提供了见解。
更新时间: 2025-08-18 10:17:30
领域: cs.CV,cs.AI
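The prediction model described above, a beta regression with city-level mode share as outcome and a log-transformed detection count as the explanatory variable, can be sketched by maximum likelihood on synthetic data (the coefficients and data-generating numbers below are invented for illustration, not taken from the study):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

def beta_nll(params, X, y):
    """Negative log-likelihood of a beta regression with a logit link:
    mu = sigmoid(X @ b), y ~ Beta(mu * phi, (1 - mu) * phi)."""
    b, log_phi = params[:-1], params[-1]
    mu = expit(X @ b)
    phi = np.exp(log_phi)
    a, c = mu * phi, (1 - mu) * phi
    return -np.sum(gammaln(a + c) - gammaln(a) - gammaln(c)
                   + (a - 1) * np.log(y) + (c - 1) * np.log(1 - y))

rng = np.random.default_rng(0)
n = 300
counts = rng.poisson(40, size=n) + 1            # stand-in for GSV cycle counts
X = np.column_stack([np.ones(n), np.log(counts)])
mu_true = expit(-4.0 + 0.8 * np.log(counts))    # assumed mode-share curve
y = rng.beta(mu_true * 50, (1 - mu_true) * 50)  # observed shares in (0, 1)

res = minimize(beta_nll, x0=np.zeros(3), args=(X, y), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
b0, b1 = res.x[0], res.x[1]
print(f"intercept {b0:.2f} (true -4.0), slope {b1:.2f} (true 0.8)")
```

Beta regression keeps predictions inside (0, 1), which is why it suits a bounded quantity like a travel mode share better than ordinary least squares.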
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
Updated: 2025-08-18 10:14:20
标题: 人类和LLM判断之间的桥梁:理解和缩小差距
摘要: 大型语言模型越来越多地被用作评判者(LLM-as-a-judge)来大规模评估模型输出,但它们的评估往往与人类判断存在系统性的分歧。我们提出了Bridge,一个统一的统计框架,在绝对评分和成对比较两种范式下明确地桥接人类和LLM评估。Bridge假设每个提示-响应对都有一个潜在的人类偏好评分,并将LLM偏差建模为捕捉差异来源的协变量的线性变换。这为改进LLM评分和刻画人类与LLM之间的系统性差异提供了一个简单而有原则的框架。我们提供了一个高效的拟合算法,并为统计推断提供了渐近保证。使用六个LLM评判者和两个基准(BigGen Bench和Chatbot Arena),Bridge与人类评分达成更高的一致性(准确性、校准和KL散度),并揭示了人类与LLM之间的系统性差距。
更新时间: 2025-08-18 10:14:20
领域: cs.LG,cs.AI,cs.CL,stat.ML
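Bridge's core idea, modeling the LLM judge's deviation from a latent human score as a linear function of covariates and then subtracting the fitted deviation, reduces in the simplest absolute-scoring case to a linear calibration. A hypothetical toy version (the covariates and coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
human = rng.normal(size=n)                  # latent human preference score
length = rng.normal(size=n)                 # covariate: response verbosity
self_gen = rng.integers(0, 2, size=n)       # covariate: judge's own output?
C = np.column_stack([length, self_gen])
# LLM rating = latent human score + systematic covariate bias + noise
llm = human + C @ np.array([0.6, 0.9]) + 0.1 * rng.normal(size=n)

# Fit the deviation model (llm - human) ~ C @ gamma on a calibration split,
# then debias held-out LLM ratings where no human labels exist.
gamma, *_ = np.linalg.lstsq(C[:250], (llm - human)[:250], rcond=None)
refined = llm[250:] - C[250:] @ gamma

raw_err = np.mean((llm[250:] - human[250:]) ** 2)
ref_err = np.mean((refined - human[250:]) ** 2)
print(f"MSE vs human: raw {raw_err:.3f} -> refined {ref_err:.3f}")
```

Here verbosity and self-preference biases, two deviation sources often reported for LLM judges, are recovered from a small paired sample and removed from the rest of the ratings.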
Contemplative Artificial Intelligence
As artificial intelligence (AI) improves, traditional alignment strategies may falter in the face of unpredictable self-improvement, hidden subgoals, and the sheer complexity of intelligent systems. Inspired by contemplative wisdom traditions, we show how four axiomatic principles can instil a resilient Wise World Model in AI systems. First, mindfulness enables self-monitoring and recalibration of emergent subgoals. Second, emptiness forestalls dogmatic goal fixation and relaxes rigid priors. Third, non-duality dissolves adversarial self-other boundaries. Fourth, boundless care motivates the universal reduction of suffering. We find that prompting AI to reflect on these principles improves performance on the AILuminate Benchmark (d=.96) and boosts cooperation and joint-reward on the Prisoner's Dilemma task (d=7+). We offer detailed implementation strategies at the level of architectures, constitutions, and reinforcement on chain-of-thought. For future systems, active inference may offer the self-organizing and dynamic coupling capabilities needed to enact Contemplative AI in embodied agents.
Updated: 2025-08-18 10:09:08
标题: 沉思人工智能
摘要: 随着人工智能(AI)能力的提升,传统的对齐策略可能会在面对不可预测的自我改进、隐藏的子目标以及智能系统极其庞大的复杂性时失效。受沉思智慧传统启发,我们展示了四条公理化原则如何在AI系统中培养出一个有韧性的智慧世界模型。首先,正念使自我监控和重新校准新出现的子目标成为可能。其次,空性防止教条式的目标固着,放松僵化的先验。第三,非二元性消解敌对的自他界限。第四,无限关怀激励普遍减少痛苦。我们发现,促使AI反思这些原则可以提高其在AILuminate基准测试上的表现(d=.96),并在囚徒困境任务中增强合作和共同奖励(d=7+)。我们提供了在架构、章程和思维链强化层面的详细实施策略。对于未来的系统,主动推理可能提供在具身代理中实现沉思AI所需的自组织和动态耦合能力。
更新时间: 2025-08-18 10:09:08
领域: cs.AI
[Social] Allostasis: Or, How I Learned To Stop Worrying and Love The Noise
The notion of homeostasis typically conceptualises biological and artificial systems as maintaining stability by resisting deviations caused by environmental and social perturbations. In contrast, (social) allostasis proposes that these systems can proactively leverage these very perturbations to reconfigure their regulatory parameters in anticipation of environmental demands, aligning with von Foerster's ``order through noise'' principle. This paper formulates a computational model of allostatic and social allostatic regulation that employs biophysiologically inspired signal transducers, analogous to hormones like cortisol and oxytocin, to encode information from both the environment and social interactions, which mediate this dynamic reconfiguration. The models are tested in a small society of ``animats'' across several dynamic environments, using an agent-based model. The results show that allostatic and social allostatic regulation enable agents to leverage environmental and social ``noise'' for adaptive reconfiguration, leading to improved viability compared to purely reactive homeostatic agents. This work offers a novel computational perspective on the principles of social allostasis and their potential for designing more robust, bio-inspired, adaptive systems
Updated: 2025-08-18 10:06:33
标题: [社会]应变稳态:或者,我如何学会停止担忧并热爱噪音
摘要: 稳态(homeostasis)的概念通常将生物和人工系统看作通过抵抗环境和社会扰动引起的偏离来保持稳定。相比之下,(社会)应变稳态(allostasis)提出这些系统可以主动利用这些扰动来重新配置其调节参数,以预期环境需求,符合冯·弗斯特的"通过噪声实现秩序"的原则。本文构建了一种应变稳态和社会应变稳态调节的计算模型,利用受生物生理学启发的信号转导器(类似于皮质醇和催产素等激素)来编码来自环境和社会互动的信息,从而调节这种动态重新配置。这些模型使用基于代理的模型,在多个动态环境中的一个"animats"小社会中进行了测试。结果表明,应变稳态和社会应变稳态调节使代理能够利用环境和社会"噪声"进行自适应重配置,与纯粹反应性的稳态代理相比具有更好的生存能力。这项工作为社会应变稳态原则及其在设计更健壮、受生物启发的自适应系统方面的潜力提供了一种新颖的计算视角。
更新时间: 2025-08-18 10:06:33
领域: cs.AI,cs.MA,cs.SY,eess.SY,nlin.AO
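The contrast between a homeostatic agent (defending a fixed setpoint) and an allostatic one (shifting its setpoint using a slow, hormone-like signal) can be caricatured in a few lines; the dynamics below are an illustrative assumption, not the paper's agent-based model:

```python
import numpy as np

def simulate(n_steps, allostatic, seed=0):
    """Toy regulator: environmental demand alternates slowly, and a
    hormone-like transducer h integrates noisy perturbations. The
    homeostatic agent defends a fixed setpoint; the allostatic one moves
    its setpoint with h, using the 'noise' to anticipate demand."""
    rng = np.random.default_rng(seed)
    x, h, setpoint = 0.0, 0.0, 0.0
    cost = 0.0
    for t in range(n_steps):
        demand = 1.0 if (t // 50) % 2 == 0 else -1.0   # slow alternation
        signal = demand + rng.normal(0, 0.3)           # noisy perturbation
        h = 0.9 * h + 0.1 * signal                     # slow hormonal trace
        if allostatic:
            setpoint = h                               # reconfigure proactively
        x += 0.5 * (setpoint - x) + 0.1 * rng.normal()
        cost += (x - demand) ** 2                      # viability: track demand
    return cost / n_steps

homeo = simulate(400, allostatic=False)
allo = simulate(400, allostatic=True)
print(f"homeostatic cost {homeo:.2f}, allostatic cost {allo:.2f}")
```

Because the hormonal trace is driven by the very perturbations the homeostatic agent resists, the allostatic agent tracks the changing demand and incurs a much lower viability cost, a minimal instance of "order through noise".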
Reinforcement Learning with Rubric Anchors
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals-such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
Updated: 2025-08-18 10:06:08
标题: 使用评分锚点的强化学习
摘要: 可验证奖励强化学习(RLVR)已经成为增强大型语言模型(LLM)的强大范式,以OpenAI的o系列的成功为例。在RLVR中,奖励来源于可验证信号,例如代码生成中通过单元测试,或数学推理中匹配正确答案。虽然有效,但这一要求在很大程度上将RLVR局限于结果可自动检查的领域。为了克服这一问题,我们通过整合基于评分标准(rubric)的奖励,将RLVR范式扩展到开放式任务,其中精心设计的评分标准作为结构化的、模型可解释的标准,用于对主观输出进行自动评分。据我们所知,我们构建了迄今为止最大的评分标准奖励系统,包括来自人类、LLM或人类-LLM混合协作的超过10,000条评分标准。实施基于评分标准的RL具有挑战性;我们通过一个清晰的框架来解决这些问题,并发布了一个开源的Qwen-30B-A3B模型,取得了显著的收益:1)仅使用5K+样本,我们的系统在开放式基准测试中提高了+5.2%(尤其是人文学科),以+2.4%的优势超过671B的DeepSeek-V3模型,同时保留了通用和推理能力。2)我们的方法提供了细粒度的风格控制,使用评分标准作为锚点来缓解"AI腔"语调,产生更具人类风格和表达力的回应。我们分享了评分标准构建、数据选择和训练中的关键经验教训,并讨论了局限性和未来的发布计划。
更新时间: 2025-08-18 10:06:08
领域: cs.AI,cs.CL,cs.LG
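A rubric-based reward, in its simplest form, is a weighted fraction of satisfied criteria. The toy scorer below uses keyword checks where the paper would use model-interpretable rubrics graded by an LLM (the rubric contents are invented for illustration):

```python
def rubric_reward(response, rubric):
    """Score a free-form response against a rubric: each criterion is a
    (weight, checker) pair, and the reward is the weighted fraction of
    criteria satisfied. Keyword checkers stand in for an LLM grader."""
    total = sum(w for w, _ in rubric)
    earned = sum(w for w, check in rubric if check(response))
    return earned / total

history_rubric = [
    (2.0, lambda r: "1789" in r),                      # states the key date
    (1.0, lambda r: "estates-general" in r.lower()),   # names the institution
    (1.0, lambda r: len(r.split()) >= 20),             # sufficiently developed
]

short = "The French Revolution began in 1789."
full = ("The French Revolution began in 1789 after King Louis XVI convened "
        "the Estates-General, whose deadlock led the Third Estate to form "
        "a National Assembly and push sweeping political change.")

print(rubric_reward(short, history_rubric))  # 0.5
print(rubric_reward(full, history_rubric))   # 1.0
```

The point of the anchoring is that the reward is decomposed into named, auditable criteria rather than a single opaque scalar, which is what makes open-ended outputs checkable enough for RLVR-style training.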
Wavy Transformer
Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
Updated: 2025-08-18 10:03:38
标题: 波动Transformer
摘要: Transformers在自然语言处理(NLP)和计算机视觉(CV)领域取得了显著的成功。然而,深度Transformer模型经常遭受过度平滑的问题,即当令牌表示通过连续的Transformer块时,它们会收敛到类似的值。在本文中,我们建立了堆叠注意层引发的隐藏状态动态与完全图上的图神经扩散之间的等价性。从这个角度来看,过度平滑可以被解释为底层扩散动态的耗散性质的结果。受到这种物理解释的启发,我们提出了Wavy Transformer,它包含基于二阶波动动态的新型注意层。我们还引入了一个设计用于保持物理状态-速度关系的前馈网络和归一化层,从而扩展了Transformer架构。我们进一步验证了我们提出的各种Transformer模型的技术在NLP和CV任务上的有效性。结果一致表明,Wavy Transformer通过最小的额外参数和无需额外超参数调整就能提高性能。
更新时间: 2025-08-18 10:03:38
领域: cs.LG
Efficient Discovery of Motif Transition Process for Large-Scale Temporal Graphs
Understanding the dynamic transition of motifs in temporal graphs is essential for revealing how graph structures evolve over time, identifying critical patterns, and predicting future behaviors, yet existing methods often focus on predefined motifs, limiting their ability to comprehensively capture transitions and interrelationships. We propose PTMT, a novel parallel method for discovering motif transition processes in large-scale temporal graphs. PTMT integrates a tree-based framework with the temporal zone partitioning (TZP) strategy, which partitions temporal graphs by time and structure while preserving lossless motif transitions and enabling massive parallelism. PTMT comprises three phases: growth zone parallel expansion, overlap-aware result aggregation, and deterministic encoding of motif transitions, ensuring accurate tracking of dynamic transitions and interactions. Results on 10 real-world datasets demonstrate that PTMT achieves speedups ranging from 12.0$\times$ to 50.3$\times$ compared to the SOTA method.
Updated: 2025-08-18 10:03:29
Subjects: cs.DB,cs.LG
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting practical scenarios' layered dependencies and constraints. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.
Updated: 2025-08-18 09:59:02
Subjects: cs.AI
From Intent to Execution: Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation
Computer-Aided Design (CAD) plays a vital role in engineering and manufacturing, yet current CAD workflows require extensive domain expertise and manual modeling effort. Recent advances in large language models (LLMs) have made it possible to generate code from natural language, opening new opportunities for automating parametric 3D modeling. However, directly translating human design intent into executable CAD code remains highly challenging, due to the need for logical reasoning, syntactic correctness, and numerical precision. In this work, we propose CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning post-training framework for CAD modeling code generation. Our method combines CoT-based Cold Start with goal-driven reinforcement learning post-training using three task-specific rewards: executability reward, geometric accuracy reward, and external evaluation reward. To ensure stable policy learning under sparse and high-variance reward conditions, we introduce three targeted optimization strategies: Trust Region Stretch for improved exploration, Precision Token Loss for enhanced dimensional parameter accuracy, and Overlong Filtering to reduce noisy supervision. To support training and benchmarking, we release ExeCAD, a novel dataset comprising 16,540 real-world CAD examples with paired natural language and structured design language descriptions, executable CADQuery scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves significant improvements in reasoning quality, output precision, and code executability over existing VLMs.
Updated: 2025-08-18 09:54:00
Subjects: cs.LG,cs.CV
Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play
The Game Reasoning Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in Google's OpenSpiel library. The framework enables systematic comparisons between LLM-based agents and other agents (random, heuristic, reinforcement learning agents, etc.) in various game scenarios by wrapping multiple board and matrix games and supporting different agent types. It integrates API access to models via liteLLM, local model deployment via vLLM, and offers distributed execution through Ray. This paper summarises the library structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of LLM reasoning and game-theoretic behaviour.
Updated: 2025-08-18 09:53:16
Subjects: cs.AI,cs.GT
Randomized PCA Forest for Outlier Detection
We propose a novel unsupervised outlier detection method based on Randomized Principal Component Analysis (PCA). Inspired by the performance of the Randomized PCA (RPCA) Forest in approximate K-Nearest Neighbor (KNN) search, we develop an unsupervised outlier detection method that utilizes the RPCA Forest. Experimental results showcase the superiority of the proposed approach compared to the classical and state-of-the-art methods in performing the outlier detection task on several datasets, while performing competitively on the rest. The extensive analysis of the proposed method reflects its high generalization power and computational efficiency, highlighting it as a good choice for unsupervised outlier detection.
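A rough sketch of the idea as we read it (the actual RPCA Forest construction and scoring may differ in detail): each tree recursively splits the data along the top principal component of a random subsample, and a point is scored by its distance to approximate neighbours found in its leaf, averaged over trees.

```python
import numpy as np

def build_tree(X, idx, rng, min_leaf=10, depth=0, max_depth=12):
    # split along the top principal component of a random subsample of the node
    if len(idx) <= min_leaf or depth >= max_depth:
        return {"leaf": idx}
    sub = X[rng.choice(idx, size=min(len(idx), 64), replace=False)]
    direction = np.linalg.svd(sub - sub.mean(axis=0), full_matrices=False)[2][0]
    proj = X[idx] @ direction
    thr = np.median(proj)
    left, right = idx[proj <= thr], idx[proj > thr]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": idx}
    return {"dir": direction, "thr": thr,
            "l": build_tree(X, left, rng, min_leaf, depth + 1, max_depth),
            "r": build_tree(X, right, rng, min_leaf, depth + 1, max_depth)}

def leaf_of(node, x):
    while "leaf" not in node:
        node = node["l"] if x @ node["dir"] <= node["thr"] else node["r"]
    return node["leaf"]

def outlier_scores(X, n_trees=10, k=5, seed=0):
    # score = distance to the ~k-th nearest neighbour inside each tree's leaf,
    # averaged over trees (large leaf-local distances suggest outliers)
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_trees):
        tree = build_tree(X, np.arange(len(X)), rng)
        for i, x in enumerate(X):
            d = np.sort(np.linalg.norm(X[leaf_of(tree, x)] - x, axis=1))
            scores[i] += d[min(k, len(d) - 1)]
    return scores / n_trees

rng = np.random.default_rng(1)
inliers = rng.normal(0.0, 1.0, (200, 5))
outliers = rng.uniform(6.0, 10.0, (5, 5))      # planted far-away points
s = outlier_scores(np.vstack([inliers, outliers]))
print(s[:200].mean(), s[200:].mean())          # planted points score much higher
```

All hyperparameters here (leaf size, subsample size, number of trees) are illustrative assumptions rather than the paper's settings.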
Updated: 2025-08-18 09:52:05
Subjects: cs.LG,cs.AI,stat.ML
Deep Positive-Negative Prototypes for Adversarially Robust Discriminative Prototypical Learning
Despite the advantages of discriminative prototype-based methods, their role in adversarial robustness remains underexplored. Meanwhile, current adversarial training methods predominantly focus on robustness against adversarial attacks without explicitly leveraging geometric structures in the latent space, usually resulting in reduced accuracy on the original clean data. We propose a novel framework named Adversarially trained Deep Positive-Negative Prototypes (Adv-DPNP), which integrates discriminative prototype-based learning with adversarial training. Adv-DPNP uses unified class prototypes that serve as both classifier weights and robust anchors in the latent space. Moreover, a novel dual-branch training mechanism maintains stable prototypes by updating them exclusively with clean data, while the feature extractor is trained on both clean and adversarial inputs to increase invariance to adversarial perturbations. In addition, we use a composite loss that combines positive-prototype alignment, negative-prototype repulsion, and consistency regularization to further enhance discrimination, adversarial robustness, and clean accuracy. Extensive experiments on standard benchmarks (CIFAR-10/100 and SVHN) confirm that Adv-DPNP improves clean accuracy over state-of-the-art defenses and baseline methods, while maintaining competitive or superior robustness under a suite of widely used attacks, including FGSM, PGD, C&W, and AutoAttack. We also evaluate robustness to common corruptions on CIFAR-10-C, where Adv-DPNP achieves the highest average accuracy across severities and corruption types. Additionally, we provide an in-depth analysis of the discriminative quality of the learned feature representations, highlighting the effectiveness of Adv-DPNP in maintaining compactness and clear separation in the latent space.
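The geometric part of the composite objective can be sketched as follows (a simplified reading with a hinge on the nearest other-class prototype; the paper's exact loss and its consistency-regularization term may differ):

```python
import numpy as np

def dpnp_geometric_loss(z, y, prototypes, margin=2.0):
    """Positive-prototype alignment plus negative-prototype repulsion.

    z          : (n, d) feature embeddings
    y          : (n,)   integer class labels
    prototypes : (c, d) one unified prototype per class
    """
    # pull each embedding toward its own class prototype
    pos = np.sum((z - prototypes[y]) ** 2, axis=1)
    # push it away from the nearest other-class prototype (hinge with margin)
    d = np.linalg.norm(z[:, None, :] - prototypes[None, :, :], axis=2)
    d[np.arange(len(z)), y] = np.inf
    neg = np.maximum(0.0, margin - d.min(axis=1)) ** 2
    return float((pos + neg).mean())

protos = np.array([[0.0, 0.0], [4.0, 0.0]])
z_good = np.array([[0.1, 0.0], [3.9, 0.1]])   # near own prototypes, far from others
z_bad  = np.array([[2.0, 0.0], [2.0, 0.0]])   # stuck between prototypes
print(dpnp_geometric_loss(z_good, np.array([0, 1]), protos))
print(dpnp_geometric_loss(z_bad,  np.array([0, 1]), protos))   # much larger
```

In the full method this loss would be evaluated on both clean and adversarial embeddings, with prototypes updated from clean data only.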
Updated: 2025-08-18 09:50:23
Subjects: cs.LG,cs.CV
Online Ensemble Transformer for Accurate Cloud Workload Forecasting in Predictive Auto-Scaling
In the swiftly evolving domain of cloud computing, the advent of serverless systems underscores the crucial need for predictive auto-scaling systems. This necessity arises to ensure optimal resource allocation and maintain operational efficiency in inherently volatile environments. At the core of a predictive auto-scaling system is the workload forecasting model. Existing forecasting models struggle to quickly adapt to the dynamics in online workload streams and have difficulty capturing the complex periodicity brought by fine-grained, high-frequency forecasting tasks. Addressing this, we propose a novel online ensemble model, E3Former, for online workload forecasting in large-scale predictive auto-scaling. Our model synergizes the predictive capabilities of multiple subnetworks to surmount the limitations of single-model approaches, thus ensuring superior accuracy and robustness. Remarkably, it accomplishes this with a minimal increase in computational overhead, adhering to the lean operational ethos of serverless systems. Through extensive experimentation on real-world workload datasets, we establish the efficacy of our ensemble model. In online forecasting tasks, the proposed method reduces forecast error by an average of 10%, and its effectiveness is further demonstrated through a predictive auto-scaling test in a real-life online system. Currently, our method has been deployed within ByteDance's Intelligent Horizontal Pod Auto-scaling (IHPA) platform, which supports the stable operation of over 30 applications, such as Douyin E-Commerce, TouTiao, and Volcano Engine, with predictive auto-scaling capacity reaching over 600,000 CPU cores. While essentially preserving service quality, the predictive auto-scaling system reduces resource utilization by over 40%.
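The general mechanics of an online ensemble of forecasters can be sketched with a Hedge-style combiner (our own minimal stand-in; the abstract does not specify E3Former's actual combination rule):

```python
import numpy as np

class OnlineEnsemble:
    """Combine sub-forecasters with weights derived from exponentially
    discounted squared errors (exponentiated-gradient / Hedge-style update)."""

    def __init__(self, n_models, eta=2.0, decay=0.99):
        self.loss = np.zeros(n_models)
        self.eta, self.decay = eta, decay
        self.w = np.ones(n_models) / n_models

    def predict(self, preds):
        return float(self.w @ np.asarray(preds))

    def update(self, preds, y):
        self.loss = self.decay * self.loss + (np.asarray(preds) - y) ** 2
        logits = -self.eta * (self.loss - self.loss.min())   # shift for stability
        w = np.exp(logits)
        self.w = w / w.sum()

# toy stream: model 0 tracks the workload, model 1 is a naive zero forecaster
rng = np.random.default_rng(0)
ens = OnlineEnsemble(2)
for t in range(500):
    y = np.sin(2 * np.pi * t / 24.0)                  # hourly-seasonal "workload"
    preds = [y + 0.05 * rng.standard_normal(), 0.0]   # good vs. naive forecaster
    ens.predict(preds)
    ens.update(preds, y)
print(ens.w)   # weight mass concentrates on the accurate sub-model
```

The learning rate `eta` and discount `decay` are illustrative; an adaptive-scaling deployment would tune them per workload.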
Updated: 2025-08-18 09:48:12
Subjects: cs.LG
Ambiguity Resolution with Human Feedback for Code Writing Tasks
Specifications for code writing tasks are usually expressed in natural language and may be ambiguous. Programmers must therefore develop the ability to recognize ambiguities in task specifications and resolve them by asking clarifying questions. We present and evaluate a prototype system, based on a novel technique (ARHF: Ambiguity Resolution with Human Feedback), that (1) suggests specific inputs on which a given task specification may be ambiguous, (2) seeks limited human feedback about the code's desired behavior on those inputs, and (3) uses this feedback to generate code that resolves these ambiguities. We evaluate the efficacy of our prototype, and we discuss the implications of such assistive systems on Computer Science education.
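The loop the abstract describes — find an input where candidate interpretations disagree, ask for limited feedback, and filter — can be sketched as follows (a toy stand-in, not the authors' ARHF system):

```python
def resolve_ambiguity(candidates, probe_inputs, ask_human):
    """Keep only candidate programs consistent with human answers on inputs
    where the candidates disagree (a minimal sketch of the ARHF-style loop)."""
    for x in probe_inputs:
        outputs = {tuple(c(x)) for c in candidates}
        if len(outputs) > 1:                 # the specification is ambiguous here
            desired = ask_human(x)           # limited human feedback on this input
            candidates = [c for c in candidates if c(x) == desired]
    return candidates

# "sort the list" is ambiguous about direction: two plausible interpretations
candidates = [lambda xs: sorted(xs), lambda xs: sorted(xs, reverse=True)]
resolved = resolve_ambiguity(candidates, [[3, 1, 2]], lambda x: [1, 2, 3])
print(resolved[0]([5, 4]))  # → [4, 5]
```

In the real system the candidate behaviors and the ambiguity-revealing inputs would be proposed by an LLM rather than enumerated by hand.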
Updated: 2025-08-18 09:46:26
Subjects: cs.SE,cs.AI
CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description
Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and cause semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation-Execution Description Language (EDL)-to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks-SpiderUnion and BirdUnion-demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
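The cluster-retrieval step can be illustrated with a deliberately crude toy: bag-of-words vectors stand in for learned embeddings, and grouping columns by a table-like prefix stands in for the clustering (none of this is CRED-SQL's actual implementation).

```python
import numpy as np

def embed(text, vocab):
    # toy bag-of-words "embedding", normalized; a real system would use a
    # learned text encoder here
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

columns = ["customer name", "customer email", "order date",
           "order total amount", "product price", "product category"]
words = sorted({w for c in columns for w in c.split()} | {"much", "spend", "how", "did"})
vocab = {w: i for i, w in enumerate(words)}
E = np.stack([embed(c, vocab) for c in columns])

# crude "cluster retrieval": group columns by leading word (table-like prefix),
# score each cluster by its best column match to the question, keep the top cluster
q = embed("how much did the order total", vocab)
clusters = {}
for i, c in enumerate(columns):
    clusters.setdefault(c.split()[0], []).append(i)
best = max(clusters, key=lambda k: max(E[i] @ q for i in clusters[k]))
print(best)  # → 'order'
```

The point is only the shape of the pipeline: cluster the schema, score clusters against the question, and pass the retrieved subset downstream.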
Updated: 2025-08-18 09:43:07
Subjects: cs.CL,cs.AI
State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era
Effectively learning from sequential data is a longstanding goal of Artificial Intelligence, especially in the case of long sequences. From the dawn of Machine Learning, several researchers have pursued algorithms and architectures capable of processing sequences of patterns, retaining information about past inputs while still leveraging future data, without losing precious long-term dependencies and correlations. While such an ultimate goal is inspired by the human hallmark of continuous real-time processing of sensory information, several solutions have simplified the learning paradigm by artificially limiting the processed context or dealing with sequences of limited length, given in advance. These solutions were further emphasized by the ubiquity of Transformers, which initially overshadowed the role of Recurrent Neural Nets. However, recurrent networks are currently experiencing a strong revival due to the growing popularity of (deep) State-Space models and novel instances of large-context Transformers, which are both based on recurrent computations that aim to go beyond several limits of currently ubiquitous technologies. The fast development of Large Language Models has renewed the interest in efficient solutions to process data over time. This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing. A complete taxonomy of recent trends in architectural and algorithmic solutions is reported and discussed, guiding researchers in this appealing research field. The emerging picture suggests that there is room for exploring novel routes, constituted by learning algorithms that depart from the standard Backpropagation Through Time, towards a more realistic scenario where patterns are effectively processed online, leveraging local-forward computations, and opening new directions for research on this topic.
Updated: 2025-08-18 09:41:11
Subjects: cs.LG
Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors
Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, necessitating annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique "many-to-one" relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a "one-to-one" relationship, where each image is independently associated with its GT. Such limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using the Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve a 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data (5‰). The code is available at https://github.com/pipixiapipi/ICAF.
Updated: 2025-08-18 09:40:36
Subjects: cs.CV,cs.AI
Short-Term Forecasting of Energy Production and Consumption Using Extreme Learning Machine: A Comprehensive MIMO based ELM Approach
A novel methodology for short-term energy forecasting using an Extreme Learning Machine ($\mathtt{ELM}$) is proposed. Using six years of hourly data collected in Corsica (France) from multiple energy sources (solar, wind, hydro, thermal, bioenergy, and imported electricity), our approach predicts both individual energy outputs and total production (including imports, which closely follow energy demand, modulo losses) through a Multi-Input Multi-Output ($\mathtt{MIMO}$) architecture. To address non-stationarity and seasonal variability, sliding window techniques and cyclic time encoding are incorporated, enabling dynamic adaptation to fluctuations. The $\mathtt{ELM}$ model significantly outperforms persistence-based forecasting, particularly for solar and thermal energy, achieving an $\mathtt{nRMSE}$ of $17.9\%$ and $5.1\%$, respectively, with $\mathtt{R^2} > 0.98$ (1-hour horizon). The model maintains high accuracy up to five hours ahead, beyond which renewable energy sources become increasingly volatile. While $\mathtt{MIMO}$ provides only marginal gains over Single-Input Single-Output ($\mathtt{SISO}$) architectures, the $\mathtt{ELM}$ offers key advantages over deep learning methods such as $\mathtt{LSTM}$: it admits a closed-form solution with lower computational demands, making it well-suited for real-time applications, including online learning. Beyond predictive accuracy, the proposed methodology is adaptable to various contexts and datasets, as it can be tuned to local constraints such as resource availability, grid characteristics, and market structures.
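The closed-form training that makes ELMs cheap can be sketched in a few lines — a generic MIMO ELM on synthetic lagged data, not the authors' exact configuration: hidden weights are random and fixed, and only the linear readout is solved for.

```python
import numpy as np

def fit_elm(X, Y, n_hidden=200, reg=1e-3, seed=0):
    # Random, untrained hidden layer; only the linear readout is fitted,
    # via ridge-regularized least squares: beta = (H^T H + reg I)^-1 H^T Y
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 1, (X.shape[1], n_hidden))
    b = rng.normal(0, 1, n_hidden)
    H = np.tanh(X @ W + b)
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

# toy MIMO task: two correlated hourly "energy" signals predicted from their lags
rng = np.random.default_rng(1)
t = np.arange(2000)
s1 = np.sin(2 * np.pi * t / 24) + 0.05 * rng.standard_normal(2000)
s2 = 0.5 * s1 + np.cos(2 * np.pi * t / 24) + 0.05 * rng.standard_normal(2000)
lags = 6
X = np.stack([np.concatenate([s1[i:i+lags], s2[i:i+lags]])
              for i in range(2000 - lags - 1)])
Y = np.stack([[s1[i+lags], s2[i+lags]] for i in range(2000 - lags - 1)])
model = fit_elm(X[:1500], Y[:1500])
pred = predict_elm(model, X[1500:])
rmse = np.sqrt(((pred - Y[1500:]) ** 2).mean(axis=0))
print(rmse)   # one RMSE per output, well below the signals' scale
```

A `MIMO` setup falls out for free: `Y` simply has one column per output, and the same closed-form solve fits all readout columns at once.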
Updated: 2025-08-18 09:37:54
Subjects: cs.LG,physics.data-an
Partially stochastic deep learning with uncertainty quantification for model predictive heating control
Improving the energy efficiency of building heating systems is crucial for reducing global energy consumption and greenhouse gas emissions. Traditional control methods rely on static heating curves that are based solely on outdoor temperature, neglecting system state measurements, such as indoor temperature, and free heat sources, such as solar gain. A more effective strategy is model predictive control (MPC), which optimizes heating control by incorporating system state predictions based on weather forecasts, among other factors. However, current industrial MPC solutions often employ simplified physics-inspired indoor temperature models, sacrificing accuracy for robustness and interpretability. To bridge this gap, we propose a partially stochastic deep learning (DL) architecture for building-specific indoor temperature modeling. Unlike most studies that evaluate model performance through simulations or limited test buildings, our experiments across a large dataset of 100 real-world buildings, covering various heating season conditions, demonstrate that the proposed model outperforms a widely used industrial physics-based model in predictive accuracy. The proposed DL architecture shows significant potential to improve thermal comfort and energy efficiency in heating MPC solutions. Although its computational cost is higher than that of the reference model, we discuss why this trade-off is manageable, even in large-scale applications. Unlike deterministic black-box approaches, the partially stochastic DL model offers a critical advantage by enabling pre-assessment of model feasibility through predictive uncertainty quantification. This work advances heating MPC, particularly for buildings with comprehensive datasets on their thermal behavior under various weather conditions.
Updated: 2025-08-18 09:32:28
Subjects: stat.AP,cs.LG
Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models
This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets and for methods that more accurately capture task-specific units.
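The localization step itself is simple to state — a toy sketch of contrast-based unit selection and targeted ablation (the paper applies this to LLM/VLM hidden units; the toy activations below are fabricated for illustration):

```python
import numpy as np

def localize_units(acts_task, acts_control, k=3):
    # contrast = mean activation on task stimuli minus matched control stimuli;
    # keep the k most task-selective units
    contrast = acts_task.mean(axis=0) - acts_control.mean(axis=0)
    return np.argsort(contrast)[-k:]

def ablate(hidden, units):
    out = hidden.copy()
    out[..., units] = 0.0       # targeted ablation: zero the selected units
    return out

# toy setting: 64 units, three of which (2, 7, 11) respond to "task" stimuli
rng = np.random.default_rng(0)
acts_control = rng.normal(0.0, 1.0, (100, 64))
acts_task = rng.normal(0.0, 1.0, (100, 64))
acts_task[:, [2, 7, 11]] += 5.0
units = localize_units(acts_task, acts_control)
print(sorted(units.tolist()))  # → [2, 7, 11]
```

The paper's finding is that running a model with such ablations does not always degrade the localized task more than ablating low-activation or random units — which is the surprising part.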
Updated: 2025-08-18 09:31:45
Subjects: cs.CL,cs.AI
Constrained Centroid Clustering: A Novel Approach for Compact and Structured Partitioning
This paper presents Constrained Centroid Clustering (CCC), a method that extends classical centroid-based clustering by enforcing a constraint on the maximum distance between the cluster center and the farthest point in the cluster. Using a Lagrangian formulation, we derive a closed-form solution that maintains interpretability while controlling cluster spread. To evaluate CCC, we conduct experiments on synthetic circular data with radial symmetry and uniform angular distribution. Using ring-wise, sector-wise, and joint entropy as evaluation metrics, we show that CCC achieves more compact clusters by reducing radial spread while preserving angular structure, outperforming standard methods such as K-means and GMM. The proposed approach is suitable for applications requiring structured clustering with spread control, including sensor networks, collaborative robotics, and interpretable pattern analysis.
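One way to realize the radius constraint inside a Lloyd-style loop is an iterative centre correction (our sketch; the paper instead derives a closed-form solution from the Lagrangian, and the constraint can be infeasible if a cluster's diameter exceeds 2R):

```python
import numpy as np

def constrained_center(pts, R, iters=100):
    # start at the unconstrained mean; while the farthest point violates the
    # radius bound, pull the centre toward it just enough to restore ||.|| <= R
    c = pts.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(pts - c, axis=1)
        j = d.argmax()
        if d[j] <= R + 1e-9:
            break
        c = c + (1.0 - R / d[j]) * (pts[j] - c)
    return c

def ccc(X, k, R, n_iter=30, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):            # farthest-point initialisation
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):           # Lloyd iterations with constrained centres
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = constrained_center(X[labels == j], R)
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(8, 0.5, (100, 2))])
centers, labels = ccc(X, k=2, R=3.0)
radii = [np.linalg.norm(X[labels == j] - centers[j], axis=1).max() for j in range(2)]
print(radii)   # both clusters stay within the R = 3 bound on this separated data
```

On well-separated data the constraint is inactive and the method reduces to ordinary K-means; its effect shows up when stray points would otherwise stretch a cluster.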
Updated: 2025-08-18 09:30:54
Subjects: cs.LG,stat.ML
CLAIRE-DSA: Fluoroscopic Image Classification for Quality Assurance of Computer Vision Pipelines in Acute Ischemic Stroke
Computer vision models can be used to assist during mechanical thrombectomy (MT) for acute ischemic stroke (AIS), but poor image quality often degrades performance. This work presents CLAIRE-DSA, a deep learning-based framework designed to categorize key image properties in minimum intensity projections (MinIPs) acquired during MT for AIS, supporting downstream quality control and workflow optimization. CLAIRE-DSA uses pre-trained ResNet backbone models, fine-tuned to predict nine image properties (e.g., presence of contrast, projection angle, motion artefact severity). Separate classifiers were trained on an annotated dataset containing 1,758 fluoroscopic MinIPs. The model achieved excellent performance on all labels, with ROC-AUC ranging from 0.91 to 0.98, and precision ranging from 0.70 to 1.00. The ability of CLAIRE-DSA to identify suitable images was evaluated on a segmentation task by filtering poor quality images and comparing segmentation performance on filtered and unfiltered datasets. Segmentation success rate increased from 42% to 69% ($p < 0.001$). CLAIRE-DSA demonstrates strong potential as an automated tool for accurately classifying image properties in DSA series of acute ischemic stroke patients, supporting image annotation and quality control in clinical and research applications. Source code is available at https://gitlab.com/icai-stroke-lab/wp3_neurointerventional_ai/claire-dsa.
Updated: 2025-08-18 09:28:58
标题: CLAIRE-DSA:用于急性缺血性中风计算机视觉管道质量保证的透视图像分类
Subjects: cs.CV,cs.AI
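The ROC-AUC values reported above have a direct rank-based interpretation: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. A minimal, dependency-free sketch of that computation (illustrative only, not part of the CLAIRE-DSA codebase):

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation: the
    probability that a random positive outranks a random negative,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For long score lists an O(n log n) sorting-based variant (as in scikit-learn's `roc_auc_score`) is preferable; the quadratic pairwise form above is simply the definition made executable.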
Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants
The recent rise in popularity of large language models (LLMs) has prompted considerable concern about their moral capabilities. Although substantial effort has been dedicated to aligning LLMs with human moral values, existing benchmarks and evaluations remain largely superficial, typically measuring alignment based on final ethical verdicts rather than explicit moral reasoning. In response, this paper aims to advance the investigation of LLMs' moral capabilities by examining their capacity to function as Artificial Moral Assistants (AMAs), systems envisioned in the philosophical literature to support human moral deliberation. We assert that qualifying as an AMA requires more than what state-of-the-art alignment techniques aim to achieve: not only must AMAs be able to discern ethically problematic situations, they should also be able to actively reason about them, navigating between conflicting values outside of those embedded in the alignment phase. Building on existing philosophical literature, we begin by designing a new formal framework of the specific kind of behaviour an AMA should exhibit, individuating key qualities such as deductive and abductive moral reasoning. Drawing on this theoretical framework, we develop a benchmark to test these qualities and evaluate popular open LLMs against it. Our results reveal considerable variability across models and highlight persistent shortcomings, particularly regarding abductive moral reasoning. Our work connects theoretical philosophy with practical AI evaluation while also emphasising the need for dedicated strategies to explicitly enhance moral reasoning capabilities in LLMs. Code available at https://github.com/alessioGalatolo/AMAeval
Updated: 2025-08-18 09:28:55
Subjects: cs.AI
Federated Action Recognition for Smart Worker Assistance Using FastPose
In smart manufacturing environments, accurate and real-time recognition of worker actions is essential for productivity, safety, and human-machine collaboration. While skeleton-based human activity recognition (HAR) offers robustness to lighting, viewpoint, and background variations, most existing approaches rely on centralized datasets, which are impractical in privacy-sensitive industrial scenarios. This paper presents a federated learning (FL) framework for pose-based HAR using a custom skeletal dataset of eight industrially relevant upper-body gestures, captured from five participants and processed using a modified FastPose model. Two temporal backbones, an LSTM and a Transformer encoder, are trained and evaluated under four paradigms: centralized, local (per-client), FL with weighted federated averaging (FedAvg), and federated ensemble learning (FedEnsemble). On the global test set, the FL Transformer improves over centralized training by +12.4 percentage points, with FedEnsemble delivering a +16.3 percentage points gain. On an unseen external client, FL and FedEnsemble exceed centralized accuracy by +52.6 and +58.3 percentage points, respectively. These results demonstrate that FL not only preserves privacy but also substantially enhances cross-user generalization, establishing it as a practical solution for scalable, privacy-aware HAR in heterogeneous industrial settings.
Updated: 2025-08-18 09:28:15
Subjects: cs.CV,cs.AI,cs.DC,cs.HC
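Weighted federated averaging (FedAvg), one of the four training paradigms compared above, amounts to a data-size-weighted mean of client parameters; a minimal sketch under the simplifying assumption that each client model is a flat list of floats (not the paper's implementation):

```python
def fedavg(client_weights, client_sizes):
    """Aggregate client parameter vectors by data-size-weighted averaging.

    client_weights: one flat parameter list per client (equal lengths).
    client_sizes:   number of local training samples per client, used as
                    the aggregation weight.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for i, wi in enumerate(w):
            global_w[i] += (n / total) * wi
    return global_w
```

The FedEnsemble variant evaluated in the paper instead keeps the client models separate and combines their predictions, rather than their weights.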
Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance
The rapid advancement of AI has expanded its capabilities across domains, yet introduced critical technical vulnerabilities, such as algorithmic bias and adversarial sensitivity, that pose significant societal risks, including misinformation, inequity, security breaches, physical harm, and eroded public trust. These challenges highlight the urgent need for robust AI governance. We propose a comprehensive framework integrating technical and societal dimensions, structured around three interconnected pillars: Intrinsic Security (system reliability), Derivative Security (real-world harm mitigation), and Social Ethics (value alignment and accountability). Uniquely, our approach unifies technical methods, emerging evaluation benchmarks, and policy insights to promote transparency, accountability, and trust in AI systems. Through a systematic review of over 300 studies, we identify three core challenges: (1) the generalization gap, where defenses fail against evolving threats; (2) inadequate evaluation protocols that overlook real-world risks; and (3) fragmented regulations leading to inconsistent oversight. These shortcomings stem from treating governance as an afterthought, rather than a foundational design principle, resulting in reactive, siloed efforts that fail to address the interdependence of technical integrity and societal trust. To overcome this, we present an integrated research agenda that bridges technical rigor with social responsibility. Our framework offers actionable guidance for researchers, engineers, and policymakers to develop AI systems that are not only robust and secure but also ethically aligned and publicly trustworthy. The accompanying repository is available at https://github.com/ZTianle/Awesome-AI-SG.
Updated: 2025-08-18 09:25:19
Subjects: cs.CR
Deep Semantic Inference over the Air: An Efficient Task-Oriented Communication System
Empowered by deep learning, semantic communication marks a paradigm shift from transmitting raw data to conveying task-relevant meaning, enabling more efficient and intelligent wireless systems. In this study, we explore a deep learning-based task-oriented communication framework that jointly considers classification performance, computational latency, and communication cost. We adopt ResNet-based models and evaluate them on the CIFAR-10 and CIFAR-100 datasets to simulate real-world classification tasks in wireless environments. We partition the model at various points to simulate split inference across a wireless channel. By varying the split location and the size of the transmitted semantic feature vector, we systematically analyze the trade-offs between task accuracy and resource efficiency. Experimental results show that, with appropriate model partitioning and semantic feature compression, the system can retain over 85% of baseline accuracy while significantly reducing both computational load and communication overhead.
Updated: 2025-08-18 09:18:07
Subjects: cs.IT,cs.LG,math.IT
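Partitioning a model for split inference, as in the study above, means running a prefix of the layers on the device, sending the intermediate feature vector over the channel, and finishing on the server; the split point therefore controls how large the transmitted feature is. A toy sketch with stand-in layers (the actual system uses ResNets on CIFAR images):

```python
def split_inference(layers, split_at, x):
    """Run layers[:split_at] on the 'device', transmit the intermediate
    feature vector, then run layers[split_at:] on the 'server'.
    Returns (output, transmitted_feature_size)."""
    feat = x
    for layer in layers[:split_at]:
        feat = layer(feat)
    transmitted = len(feat)  # proxy for per-inference communication cost
    out = feat
    for layer in layers[split_at:]:
        out = layer(out)
    return out, transmitted

# Stand-in "layers" for illustration only: a feature transform, a
# compressing layer, and a classifier head.
demo_layers = [
    lambda v: [2 * e for e in v],   # feature transform
    lambda v: v[:2],                # compression (smaller feature vector)
    lambda v: [sum(v)],             # classifier head
]
```

Moving the split after the compressing layer shrinks the transmitted vector from 4 to 2 values in this toy example, mirroring the accuracy/communication trade-off the paper sweeps.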
Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation
In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.
Updated: 2025-08-18 09:15:35
Subjects: cs.CV,cs.AI
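The curriculum learning strategy described above reduces to ordering training data easy-to-hard and exposing a growing prefix over the epochs; a hedged sketch with an illustrative per-sample difficulty score (the paper's actual schedule places the visually similar synthetic identities early, which corresponds to assigning them low difficulty here):

```python
def curriculum_schedule(samples, difficulty, epochs):
    """Easy-to-hard curriculum: rank samples once by a difficulty score,
    then expose a growing prefix of the ranked data each epoch, so epoch 1
    sees only the easiest fraction and the final epoch sees everything."""
    ranked = sorted(samples, key=difficulty)
    schedule = []
    for epoch in range(1, epochs + 1):
        k = max(1, round(len(ranked) * epoch / epochs))
        schedule.append(ranked[:k])
    return schedule
```

Any monotone difficulty proxy works as the key; in the paper's setting, "is synthetic" would be the dominant (low-difficulty) signal.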
Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems
We address the problem of safe policy learning in multi-agent safety-critical autonomous systems. In such systems, it is necessary for each agent to meet the safety requirements at all times while also cooperating with other agents to accomplish the task. Toward this end, we propose a safe Hierarchical Multi-Agent Reinforcement Learning (HMARL) approach based on Control Barrier Functions (CBFs). Our proposed hierarchical approach decomposes the overall reinforcement learning problem into two levels: learning joint cooperative behavior at the higher level, and learning safe individual behavior at the lower (agent) level, conditioned on the high-level policy. Specifically, we propose a skill-based HMARL-CBF algorithm in which the higher-level problem involves learning a joint policy over the skills for all the agents and the lower-level problem involves learning policies to execute the skills safely with CBFs. We validate our approach on challenging environment scenarios in which a large number of agents have to safely navigate through conflicting road networks. Compared with existing state-of-the-art methods, our approach significantly improves safety, achieving a near-perfect (within 5%) success/safety rate while also improving performance across all the environments.
Updated: 2025-08-18 09:13:18
Subjects: cs.LG,cs.AI,cs.RO
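A control barrier function h certifies safety via h(x) ≥ 0; in a common discrete-time formulation, an action is admissible when h(x_{t+1}) ≥ (1 − γ)h(x_t) for some γ in (0, 1]. The following 1-D sketch filters an RL action by shrinking it toward a no-op until that condition holds — a generic illustration of the CBF idea, not the paper's HMARL-CBF algorithm:

```python
def cbf_safe_action(x, a_rl, h, step, gamma=0.5):
    """Filter an RL action through a control barrier function h
    (h >= 0 means safe) by scaling the action until the discrete-time
    CBF condition  h(step(x, a)) >= (1 - gamma) * h(x)  holds.
    Purely illustrative 1-D version; real systems solve a small QP."""
    floor = (1.0 - gamma) * h(x)
    scale = 1.0
    for _ in range(20):               # shrink toward the no-op action
        a = scale * a_rl
        if h(step(x, a)) >= floor:
            return a
        scale *= 0.5
    return 0.0                        # fall back to doing nothing
```

With the barrier h(x) = 1 − x (keep the state below 1) and dynamics x' = x + a, an aggressive action gets clipped to the largest safe fraction, while an already-safe action passes through unchanged.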
DCSCR: A Class-Specific Collaborative Representation based Network for Image Set Classification
Image set classification (ISC), which can be viewed as a task of comparing similarities between sets consisting of unordered heterogeneous images with variable quantities and qualities, has attracted growing research attention in recent years. How to learn effective feature representations and how to explore the similarities between different image sets are two key yet challenging issues in this field. However, existing traditional ISC methods classify image sets based on raw pixel features, ignoring the importance of feature learning. Existing deep ISC methods can learn deep features, but they fail to adaptively adjust the features when measuring set distances, resulting in limited performance in few-shot ISC. To address the above issues, this paper combines traditional ISC methods with deep models and proposes a novel few-shot ISC approach called Deep Class-specific Collaborative Representation (DCSCR) network to simultaneously learn the frame- and concept-level feature representations of each image set and the distance similarities between different sets. Specifically, DCSCR consists of a fully convolutional deep feature extractor module, a global feature learning module, and a class-specific collaborative representation-based metric learning module. The deep feature extractor and global feature learning modules are used to learn (local and global) frame-level feature representations, while the class-specific collaborative representation-based metric learning module is exploited to adaptively learn the concept-level feature representation of each image set and thus obtain the distance similarities between different sets by developing a new CSCR-based contrastive loss function. Extensive experiments on several well-known few-shot ISC datasets demonstrate the effectiveness of the proposed method compared with some state-of-the-art image set classification algorithms.
Updated: 2025-08-18 09:09:55
Subjects: cs.CV,cs.AI
LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering
Video Question Answering (VideoQA) requires identifying sparse critical moments in long videos and reasoning about their causal relationships to answer semantically complex questions. While recent advances in multimodal learning have improved alignment and fusion, current approaches remain limited by two prevalent but fundamentally flawed strategies: (1) task-agnostic sampling indiscriminately processes all frames, overwhelming key events with irrelevant content; and (2) heuristic retrieval captures superficial patterns but misses causal-temporal structures needed for complex reasoning. To address these challenges, we introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding. Our method first leverages LLMs to reformulate question-option pairs, resolving causal ambiguities and sharpening temporal focus. These refined queries subsequently direct a temporal grounding model to precisely retrieve the most salient segments, complemented by an adaptive fusion mechanism dynamically integrating the evidence to maximize relevance. The integrated visual-textual cues are then processed by an MLLM to generate accurate, contextually-grounded answers. Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method's precise visual grounding substantially enhances the understanding of video-question relationships, achieving state-of-the-art (SOTA) performance on complex reasoning tasks while maintaining computational efficiency.
Updated: 2025-08-18 09:06:46
Subjects: cs.CV,cs.AI
On the Importance of Behavioral Nuances: Amplifying Non-Obvious Motor Noise Under True Empirical Considerations May Lead to Briefer Assays and Faster Classification Processes
There is a tradeoff between attaining statistical power with large, difficult-to-gather data sets and producing highly scalable assays that register brief data samples. Often, because grand-averaging techniques a priori assume normally-distributed parameters and linear, stationary processes in biorhythmic time-series data, important information is lost, averaged out as gross data. We developed an affective computing platform that enables taking brief data samples while maintaining personalized statistical power. This is achieved by combining a new data type derived from the micropeaks present in time-series data registered from brief (5-second-long) face videos with recent advances in AI-driven face-grid estimation methods. By adopting geometric and nonlinear dynamical systems approaches to analyze the kinematics, especially the speed data, the new methods capture all facial micropeaks. These include the nuances of different affective micro-expressions as well. We offer new ways to differentiate dynamical and geometric patterns present in autistic individuals from those found more commonly in neurotypical development.
Updated: 2025-08-18 09:05:40
Subjects: q-bio.QM,cs.CV,cs.LG,eess.SP,nlin.CD
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency
With recent advancements in graph neural networks (GNNs), spectral GNNs have gained increasing popularity by virtue of their ability to retrieve graph signals in the spectral domain. These models feature uniqueness in efficient computation as well as rich expressiveness, which stems from advanced management and profound understanding of graph data. However, few systematic studies have been conducted to assess spectral GNNs, particularly in benchmarking their efficiency, memory consumption, and effectiveness in a unified and fair manner. There is also a pressing need to select spectral models suitable for learning specific graph data and deploying them to massive web-scale graphs, which is currently constrained by the varied model designs and training settings. In this work, we extensively benchmark spectral GNNs with a focus on the spectral perspective, demystifying them as spectral graph filters. We analyze and categorize 35 GNNs with 27 corresponding filters, spanning diverse formulations and utilizations of the graph data. Then, we implement the filters within a unified spectral-oriented framework with dedicated graph computations and efficient training schemes. In particular, our implementation enables the deployment of spectral GNNs over million-scale graphs and various tasks with comparable performance and less overhead. Thorough experiments are conducted on the graph filters with comprehensive metrics on effectiveness and efficiency, offering novel observations and practical guidelines that are only available from our evaluations across graph scales. Different from the prevailing belief, our benchmark reveals an intricate landscape regarding the effectiveness and efficiency of spectral graph filters, demonstrating the potential to achieve desirable performance through tailored spectral manipulation of graph data.
Updated: 2025-08-18 09:04:50
Subjects: cs.LG,cs.AI
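The "spectral graph filter" view adopted by the benchmark above means applying a polynomial of the graph Laplacian to node signals, g(L)x = Σ_k θ_k L^k x, which can be evaluated with repeated matrix-vector products and no eigendecomposition. A dependency-free sketch on dense lists (the coefficients θ are illustrative):

```python
def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def poly_graph_filter(L, x, theta):
    """Apply the polynomial spectral filter g(L) x = sum_k theta[k] * L^k x,
    accumulating powers of L iteratively (no eigendecomposition needed)."""
    out = [0.0] * len(x)
    power = list(x)                   # L^0 x
    for k, t in enumerate(theta):
        if k > 0:
            power = matvec(L, power)  # L^k x
        out = [o + t * p for o, p in zip(out, power)]
    return out
```

On the 2-node path graph with Laplacian [[1, −1], [−1, 1]], the low-pass choice θ = (1, −0.5) smooths the impulse signal [1, 0] into [0.5, 0.5]; the many filter families the paper benchmarks differ mainly in how the polynomial basis and coefficients are parameterized.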
A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks
This paper presents preliminary results in the definition of a comprehensive benchmark framework designed to systematically evaluate spatial reasoning capabilities in neural networks, with a particular focus on morphological properties such as connectivity and distance relationships. The framework is currently being used to study the capabilities of nnU-Net, exploiting the spatial model checker VoxLogicA to generate two distinct categories of synthetic datasets: maze connectivity problems for topological analysis and spatial distance computation tasks for geometric understanding. Each category is evaluated across multiple resolutions to assess scalability and generalization properties. The automated pipeline encompasses a complete machine learning workflow including: synthetic dataset generation, standardized training with cross-validation, inference execution, and comprehensive evaluation using Dice coefficient and IoU (Intersection over Union) metrics. Preliminary experimental results demonstrate significant challenges in neural network spatial reasoning capabilities, revealing systematic failures in basic geometric and topological understanding tasks. The framework provides a reproducible experimental protocol, enabling researchers to identify specific limitations. Such limitations could be addressed through hybrid approaches combining neural networks with symbolic reasoning methods for improved spatial understanding in clinical applications, establishing a foundation for ongoing research into neural network spatial reasoning limitations and potential solutions.
Updated: 2025-08-18 09:04:13
Subjects: cs.LG,physics.app-ph,physics.med-ph
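Dice and IoU, the two metrics in the evaluation pipeline above, are overlap ratios on binary masks related by Dice = 2·IoU/(1 + IoU); a minimal sketch for flat 0/1 masks:

```python
def dice_iou(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B| for binary
    masks given as flat 0/1 integer sequences. Both default to 1.0 when
    prediction and target are empty (a common convention)."""
    inter = sum(p & t for p, t in zip(pred, target))
    a, b = sum(pred), sum(target)
    union = a + b - inter
    dice = 2.0 * inter / (a + b) if (a + b) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

For image masks, the same formula applies after flattening the 2-D (or 3-D) arrays.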
FedUNet: A Lightweight Additive U-Net Module for Federated Learning with Heterogeneous Models
Federated learning (FL) enables decentralized model training without sharing local data. However, most existing methods assume identical model architectures across clients, limiting their applicability in heterogeneous real-world environments. To address this, we propose FedUNet, a lightweight and architecture-agnostic FL framework that attaches a U-Net-inspired additive module to each client's backbone. By sharing only the compact bottleneck of the U-Net, FedUNet enables efficient knowledge transfer without structural alignment. The encoder-decoder design and skip connections in the U-Net help capture both low-level and high-level features, facilitating the extraction of client-invariant representations. This enables cooperative learning between the backbone and the additive module with minimal communication cost. Experiments with VGG variants show that FedUNet achieves 93.11% accuracy, and 92.68% in its compact form (a lightweight version of FedUNet), with a communication overhead of only 0.89 MB.
Updated: 2025-08-18 09:03:06
Subjects: cs.LG,cs.AI,68T01 (Primary), 68T07 (Secondary),I.2
A Hierarchical Surrogate Model for Efficient Multi-Task Parameter Learning in Closed-Loop Control
Many control problems require repeated tuning and adaptation of controllers across distinct closed-loop tasks, where data efficiency and adaptability are critical. We propose a hierarchical Bayesian optimization (BO) framework that is tailored to efficient controller parameter learning in sequential decision-making and control scenarios for distinct tasks. Instead of treating the closed-loop cost as a black-box, our method exploits structural knowledge of the underlying problem, consisting of a dynamical system, a control law, and an associated closed-loop cost function. We construct a hierarchical surrogate model using Gaussian processes that capture the closed-loop state evolution under different parameterizations, while the task-specific weighting and accumulation into the closed-loop cost are computed exactly via known closed-form expressions. This allows knowledge transfer and enhanced data efficiency between different closed-loop tasks. The proposed framework retains sublinear regret guarantees on par with standard black-box BO, while enabling multi-task or transfer learning. Simulation experiments with model predictive control demonstrate substantial benefits in both sample efficiency and adaptability when compared to purely black-box BO approaches.
Updated: 2025-08-18 09:01:28
Subjects: eess.SY,cs.LG,cs.SY
LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models
The widespread adoption and increasing prominence of large language models (LLMs) in global technologies necessitate a rigorous focus on ensuring their safety across a diverse range of linguistic and cultural contexts. The lack of comprehensive evaluation and diverse data in existing multilingual safety evaluations for LLMs limits their effectiveness, hindering the development of robust multilingual safety alignment. To address this critical gap, we introduce LinguaSafe, a comprehensive multilingual safety benchmark crafted with meticulous attention to linguistic authenticity. The LinguaSafe dataset comprises 45k entries in 12 languages, ranging from Hungarian to Malay. Curated using a combination of translated, transcreated, and natively sourced data, it fills the void in safety evaluation of LLMs across diverse under-represented languages. LinguaSafe presents a multidimensional and fine-grained evaluation framework, with direct and indirect safety assessments, including further evaluations for oversensitivity. The results of safety and helpfulness evaluations vary significantly across different domains and different languages, even in languages with similar resource levels. Our benchmark provides a comprehensive suite of metrics for in-depth safety evaluation, underscoring the critical importance of thoroughly assessing multilingual safety in LLMs to achieve more balanced safety alignment. Our dataset and code are released to the public to facilitate further research in the field of multilingual LLM safety.
Updated: 2025-08-18 08:59:01
Subjects: cs.CL,cs.AI
Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model's behavior, fulfilling "right to be forgotten" obligations under data privacy laws. Yet, we observe that researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods, especially in terms of three fundamental principles in MU: accuracy, efficiency, and privacy. Consequently, they often rely on aggregate metrics and ad-hoc evaluations, making it difficult to accurately assess the trade-offs between methods. To fill this gap, we introduce a visual analytics system, Unlearning Comparator, designed to facilitate the systematic evaluation of MU methods. Our system supports two important tasks in the evaluation process: model comparison and attack simulation. First, it allows the user to compare the behaviors of two models, such as a model generated by a certain method and a retrained baseline, at class-, instance-, and layer-levels to better understand the changes made after unlearning. Second, our system simulates membership inference attacks (MIAs) to evaluate the privacy of a method, where an attacker attempts to determine whether specific data samples were part of the original training set. We evaluate our system through a case study visually analyzing prominent MU methods and demonstrate that it helps the user not only understand model behaviors but also gain insights that can inform the improvement of MU methods.
Updated: 2025-08-18 08:53:53
Subjects: cs.CR,cs.HC,cs.LG,H.5.2; I.3.6
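The membership inference attacks the system simulates can be illustrated with a minimal loss-threshold attack, a common MIA baseline (a hedged sketch, not the paper's implementation): samples on which the model has unusually low loss are guessed to be training members. The losses and threshold below are synthetic, purely for illustration.

```python
import numpy as np

def loss_threshold_mia(losses, threshold):
    """Guess 'member' (True) when the model's loss on a sample is below threshold."""
    return losses < threshold

# Synthetic per-sample losses: members tend to have lower loss than non-members.
member_losses = np.array([0.05, 0.10, 0.20, 0.15, 0.08])
nonmember_losses = np.array([0.90, 1.20, 0.70, 1.05, 0.85])

losses = np.concatenate([member_losses, nonmember_losses])
is_member = np.array([True] * 5 + [False] * 5)

# Calibrate the threshold as the midpoint between the two group means.
threshold = (member_losses.mean() + nonmember_losses.mean()) / 2
guesses = loss_threshold_mia(losses, threshold)
attack_accuracy = (guesses == is_member).mean()  # 1.0 on this separable toy data
```

A method whose unlearned model drives this attack accuracy toward chance on the forgotten samples gives better privacy, which is what the system's attack-simulation view visualizes.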
FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment
Federated fine-tuning (FFT) of large language models (LLMs) has recently emerged as a promising solution to enable domain-specific adaptation while preserving data privacy. Despite its benefits, FFT on resource-constrained clients is hindered by the high computational and memory demands of full-model fine-tuning, which limits further progress. This paper presents FedSODA, a resource-efficient FFT framework that enables clients to adapt LLMs without accessing or storing the full model. Specifically, we first propose a similarity group pruning (SGP) module, which prunes redundant layers from the full LLM while retaining the most critical layers to preserve model performance. Moreover, we introduce an orchestrated distillation alignment (ODA) module to reduce gradient divergence between the sub-LLM and the full LLM during FFT. Through the use of QLoRA, clients only need to deploy quantized sub-LLMs and fine-tune lightweight adapters, significantly reducing local resource requirements. We conduct extensive experiments on three open-source LLMs across a variety of downstream tasks. The experimental results demonstrate that FedSODA reduces communication overhead by an average of 70.6%, decreases storage usage by 75.6%, and improves task accuracy by 3.1%, making it highly suitable for practical FFT applications under resource constraints.
Updated: 2025-08-18 08:49:32
Subjects: cs.LG
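The intuition behind similarity-based layer pruning can be sketched with a toy scoring rule (an assumption for illustration, not FedSODA's actual SGP criterion): layers whose output barely differs from their input, measured by cosine similarity of the hidden states, are treated as redundant and pruned first.

```python
import numpy as np

def prune_redundant_layers(hidden_states, keep):
    """Score layer i by how much it transforms its input (1 - cosine similarity
    between the hidden state entering and leaving it) and keep only the `keep`
    highest-scoring layers. A toy stand-in for similarity-based pruning."""
    scores = []
    for i in range(len(hidden_states) - 1):
        a = hidden_states[i].ravel()
        b = hidden_states[i + 1].ravel()
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        scores.append(1.0 - cos)  # high score = layer changes its input a lot
    order = np.argsort(scores)[::-1]
    return sorted(int(i) for i in order[:keep])

rng = np.random.default_rng(0)
h0 = rng.normal(size=(4, 8))                 # hidden state entering layer 0
h1 = h0 + 0.01 * rng.normal(size=h0.shape)   # layer 0: near-identity (redundant)
h2 = rng.normal(size=h0.shape)               # layer 1: transforms heavily (critical)
h3 = h2 + 0.01 * rng.normal(size=h0.shape)   # layer 2: near-identity (redundant)
kept = prune_redundant_layers([h0, h1, h2, h3], keep=1)  # keeps layer 1
```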
A Compact Post-quantum Strong Designated Verifier Signature Scheme from Isogenies
Digital signatures are fundamental cryptographic tools that provide authentication and integrity in digital communications. However, privacy-sensitive applications, such as e-voting and digital cash, require more restrictive verification models to ensure confidentiality and control. Strong Designated Verifier Signature (SDVS) schemes address this need by enabling the signer to designate a specific verifier, ensuring that only this party can validate the signature. Existing SDVS constructions are primarily based on number-theoretic assumptions and are therefore vulnerable to quantum attacks. Although post-quantum alternatives, particularly those based on lattices, have been proposed, they often entail large key and signature sizes. In this work, we present $\mathsf{CSI\text{-}SDVS}$, a novel isogeny-based SDVS scheme that offers a compact, quantum-resistant alternative to existing SDVS constructions. The scheme leverages the ideal class group action on $\mathbb{F}_p$-isomorphism classes of supersingular elliptic curves and is founded on the hardness of the Multi-Target Group Action Inverse Problem (MT-GAIP). $\mathsf{CSI\text{-}SDVS}$ achieves strong security guarantees, Strong Unforgeability under Chosen-Message Attacks (SUF-CMA), Non-Transferability (NT), and Privacy of Signer's Identity (PSI), in the random oracle model, thereby making it among the most compact PQC-based SDVS schemes and the only post-quantum secure construction based on isogenies.
Updated: 2025-08-18 08:48:17
Subjects: cs.CR,math.NT,11T71, 94A60, 68P25, 14G50, 81P94
GTool: Graph Enhanced Tool Planning with Large Language Model
Tool planning with large language models (LLMs), referring to selecting, organizing, and preparing the tools necessary to complete a user request, bridges the gap between natural language understanding and task execution. However, current works treat different tools as isolated components and fail to leverage the inherent dependencies of tools, leading to invalid planning results. Since tool dependencies are often incomplete, it becomes challenging for LLMs to accurately identify the appropriate tools required by a user request, especially when confronted with a large toolset. To solve this challenge, we propose \texttt{GTool}, which is the first work aiming to enhance the tool planning ability of LLMs under incomplete dependencies. \texttt{GTool} constructs a request-specific tool graph to select tools efficiently and generate the \texttt{<graph token>} which provides sufficient dependency information understandable by LLMs. Moreover, a missing dependency prediction task is designed to improve the reliability of \texttt{GTool} with incomplete dependencies. Without trimming LLMs, \texttt{GTool} can be seamlessly integrated with various LLM backbones without extensive retraining. Extensive experiments show that \texttt{GTool} achieves more than 29.6\% performance improvements compared with the state-of-the-art (SOTA) baselines with a light-weight (7B) LLM backbone.
Updated: 2025-08-18 08:46:55
Subjects: cs.AI
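The role of tool dependencies in planning can be sketched with a toy dependency closure (tool names and edges are hypothetical; GTool's actual graph construction and graph-token encoding are learned, not a fixed BFS): a valid plan must include not just the tools a request names but everything they transitively depend on.

```python
# Toy dependency graph: an edge A -> B means "tool A requires tool B's output".
DEPENDS_ON = {
    "plot_chart": ["fetch_prices"],
    "fetch_prices": ["resolve_ticker"],
    "resolve_ticker": [],
    "send_email": [],
}

def select_tools(requested):
    """Breadth-first closure over the dependency graph: every tool a requested
    tool transitively depends on must be part of the plan."""
    selected, queue = set(), list(requested)
    while queue:
        tool = queue.pop(0)
        if tool in selected:
            continue
        selected.add(tool)
        queue.extend(DEPENDS_ON.get(tool, []))
    return selected

plan = select_tools(["plot_chart"])
```

When edges are missing (the incomplete-dependency setting the paper targets), this closure silently drops required tools, which is exactly the failure mode GTool's missing-dependency prediction task is designed to mitigate.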
Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching the performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Human evaluation studies validate our framework's alignment with human decision-making patterns, and generalization experiments demonstrate effective performance on novel tool categories. Ablation studies revealed that manipulation-related attributes (graspability, elongation, hand-relatedness) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.
Updated: 2025-08-18 08:43:01
Subjects: cs.CV,cs.AI,cs.CL,q-bio.NC
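The attribute-alignment idea can be sketched in a few lines (attribute names and values below are invented for illustration; in the paper the tool vectors come from a visual encoder and the required-attribute vector from a fine-tuned language model): both sides live in the same low-dimensional attribute space, so selection reduces to a similarity lookup.

```python
import numpy as np

# Hypothetical attribute vectors over a few of the 13 dimensions,
# e.g. [graspability, elongation, hardness, sharpness].
TOOLS = {
    "hammer": np.array([0.9, 0.6, 0.9, 0.1]),
    "knife":  np.array([0.8, 0.7, 0.8, 0.9]),
    "sponge": np.array([0.9, 0.2, 0.1, 0.0]),
}

def select_tool(required):
    """Pick the tool whose attribute vector best matches the task's
    required-attribute vector (cosine similarity)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(TOOLS, key=lambda name: cos(TOOLS[name], required))

# "cut the rope" -> needs a graspable, elongated, hard, sharp tool.
choice = select_tool(np.array([0.8, 0.7, 0.8, 1.0]))  # -> "knife"
```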
AdaMuon: Adaptive Muon Optimizer
We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second momentum estimator applied to orthogonalized update directions, and (2) a sign-stabilized orthogonal update, where the momentum is first sign-transformed before orthogonalization. These two components jointly enable variance-adaptive scaling while maintaining stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy to match the root-mean-square update magnitude to Adam, allowing direct reuse of existing learning rate schedules without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but also surpasses Adam by more than 40% in training efficiency in large-scale scenarios.
Updated: 2025-08-18 08:40:33
Subjects: cs.LG
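The pipeline described in the abstract (sign-transform, orthogonalize, element-wise second moment, RMS-aligned rescale) can be sketched as follows. This is a simplified reading of the mechanisms, not the authors' code: the cubic Newton-Schulz iteration is the standard approximate orthogonalization used in Muon-style optimizers, and the RMS target constant is an assumption.

```python
import numpy as np

def newton_schulz_orth(M, steps=12):
    """Approximately orthogonalize M via the cubic Newton-Schulz iteration
    (Frobenius normalization brings all singular values into (0, 1])."""
    X = M / np.linalg.norm(M)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def adamuon_step(momentum, v, lr=1e-3, beta2=0.999, eps=1e-8, target_rms=1.0):
    """One sketched AdaMuon-style update for a 2-D weight matrix."""
    O = newton_schulz_orth(np.sign(momentum))  # sign-stabilized orthogonal update
    v = beta2 * v + (1 - beta2) * O * O        # element-wise second momentum
    u = O / (np.sqrt(v) + eps)                 # variance-adaptive scaling
    u = u * (target_rms / (np.sqrt((u ** 2).mean()) + eps))  # RMS-aligned rescale
    return -lr * u, v

rng = np.random.default_rng(0)
m = rng.normal(size=(4, 4))                    # stand-in momentum buffer
update, v = adamuon_step(m, v=np.zeros((4, 4)))
orth = newton_schulz_orth(np.eye(4) + 0.1 * rng.normal(size=(4, 4)))
```

By construction the returned update has root-mean-square magnitude lr * target_rms, which is what lets Adam-tuned learning-rate schedules be reused.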
Large language models can replicate cross-cultural differences in personality
We use a large-scale experiment (N=8000) to determine whether GPT-4 can replicate cross-cultural differences in the Big Five, measured using the Ten-Item Personality Inventory. We used the US and South Korea as the cultural pair, given that prior research suggests substantial personality differences between people from these two countries. We manipulated the target of the simulation (US vs. Korean), the language of the inventory (English vs. Korean), and the language model (GPT-4 vs. GPT-3.5). Our results show that GPT-4 replicated the cross-cultural differences for each factor. However, mean ratings had an upward bias and exhibited lower variation than in the human samples, as well as lower structural validity. We provide preliminary evidence that LLMs can aid cross-cultural researchers and practitioners.
Updated: 2025-08-18 08:36:16
Subjects: cs.CL,cs.AI,cs.CY,I.2.7; K.4.2; J.4
Advancing AI-Scientist Understanding: Multi-Agent LLMs with Interpretable Physics Reasoning
Large Language Models (LLMs) are playing an increasingly important role in physics research by assisting with symbolic manipulation, numerical computation, and scientific reasoning. However, ensuring the reliability, transparency, and interpretability of their outputs remains a major challenge. In this work, we introduce a novel multi-agent LLM physicist framework that fosters collaboration between AI and human scientists through three key modules: a reasoning module, an interpretation module, and an AI-scientist interaction module. Recognizing that effective physics reasoning demands logical rigor, quantitative accuracy, and alignment with established theoretical models, we propose an interpretation module that employs a team of specialized LLM agents-including summarizers, model builders, visualization tools, and testers-to systematically structure LLM outputs into transparent, physically grounded science models. A case study demonstrates that our approach significantly improves interpretability, enables systematic validation, and enhances human-AI collaboration in physics problem-solving and discovery. Our work bridges free-form LLM reasoning with interpretable, executable models for scientific analysis, enabling more transparent and verifiable AI-augmented research.
Updated: 2025-08-18 08:28:27
Subjects: cs.AI,cs.CL,cs.HC,physics.comp-ph
Argos: A Decentralized Federated System for Detection of Traffic Signs in CAVs
Connected and automated vehicles generate vast amounts of sensor data daily, raising significant privacy and communication challenges for centralized machine learning approaches in perception tasks. This study presents a decentralized, federated learning framework tailored for traffic sign detection in vehicular networks to enable collaborative model training without sharing raw data. The framework partitions traffic sign classes across vehicles for specialized local training using lightweight object detectors, aggregates model parameters via algorithms such as FedProx, FedAdam, and FedAvg in a simulated environment built on the Flower framework, and evaluates multiple configurations, including varying server rounds, local epochs, client participation fractions, and data distributions. Experiments demonstrated that increasing server rounds from 2 to 20 boosted accuracy from below 0.1 to over 0.8, moderate local epochs (8-10) provided optimal efficiency with accuracies around 0.67, higher client participation fractions enhanced generalization up to 0.83, FedProx outperformed other aggregators in handling heterogeneity, non-IID data distributions reduced performance compared to IID, and training duration primarily scaled with the number of rounds rather than the aggregation strategy. We conclude that this federated approach may offer a scalable, privacy-preserving solution for real-world vehicular deployments, potentially guiding future integration of robust aggregation and communication optimizations to advance intelligent transportation systems.
Updated: 2025-08-18 08:22:57
Subjects: cs.LG,cs.CV,I.2.6; I.4.8
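The server-side aggregation at the core of these experiments can be sketched with the FedAvg baseline (FedProx adds a proximal term to the client loss and FedAdam changes the server update, but both start from the same weighted average): parameters are averaged with weights proportional to each client's sample count.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sample-count-weighted average of client parameter vectors (FedAvg)."""
    total = sum(client_sizes)
    agg = np.zeros_like(client_weights[0], dtype=float)
    for w, n in zip(client_weights, client_sizes):
        agg += (n / total) * w
    return agg

# Two clients holding different traffic-sign classes; toy parameter vectors.
w_a = np.array([1.0, 0.0, 2.0])
w_b = np.array([3.0, 4.0, 0.0])
global_w = fedavg([w_a, w_b], client_sizes=[100, 300])  # -> [2.5, 3.0, 0.5]
```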
MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning
Masked latent prediction has emerged as a leading paradigm in self-supervised learning (SSL), especially for general audio and music representation learning. While recent methods have demonstrated strong performance, the role of the predictor module used at the output of such SSL systems remains largely overlooked, despite being crucial for solving the pretext task at hand. In particular, this module should be able to deal with the ambiguity inherent in audio content, especially when it is composed of multiple sound sources. This work proposes a novel enhancement: integrating Multiple Choice Learning (MCL) to explicitly model prediction ambiguity and improve representation quality. We build on top of the recently proposed MATPAC system, improving its prediction and unsupervised classification pretext tasks with MCL. We extensively evaluate our method, MATPAC++, through both linear probing across multiple downstream tasks and fine-tuning on AudioSet, employing a unified protocol that enables rigorous and fair comparisons with state-of-the-art SSL approaches. Results show that our proposal achieves state-of-the-art performance when fine-tuned on AudioSet and overall state-of-the-art scores on downstream tasks. Additionally, we examine domain specialisation by training exclusively on music data, where our model achieves state-of-the-art performance with significantly improved efficiency.
Updated: 2025-08-18 08:10:07
Subjects: cs.SD,cs.AI,cs.LG,eess.AS
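The core of Multiple Choice Learning is a winner-takes-all loss: several prediction heads are maintained, and for each sample only the head with the lowest error receives gradient, so heads specialize on different plausible outcomes of an ambiguous input. A minimal sketch (not MATPAC++'s implementation):

```python
import numpy as np

def wta_loss(predictions, target):
    """predictions: (heads, dim). Returns the minimum per-head squared error
    and the winning head index; only the winner would be backpropagated."""
    errs = ((predictions - target) ** 2).sum(axis=1)
    k = int(np.argmin(errs))
    return float(errs[k]), k

# An ambiguous masked-audio target could be either of two plausible sources;
# two heads have specialized on one hypothesis each.
heads = np.array([[1.0, 0.0],   # head 0: "source A" hypothesis
                  [0.0, 1.0]])  # head 1: "source B" hypothesis
loss_a, win_a = wta_loss(heads, np.array([0.9, 0.1]))  # head 0 wins
loss_b, win_b = wta_loss(heads, np.array([0.1, 0.9]))  # head 1 wins
```

Because each target is charged only to its closest head, the set of heads can cover a multimodal prediction distribution instead of averaging the modes away.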
Asymmetric Diffusion Recommendation Model
Recently, motivated by the outstanding achievements of diffusion models, the diffusion process has been employed to strengthen representation learning in recommendation systems. Most diffusion-based recommendation models utilize standard Gaussian noise in symmetric forward and reverse processes in a continuous data space. Nevertheless, the samples derived from recommendation systems inhabit a discrete data space, which is fundamentally different from the continuous one. Moreover, Gaussian noise has the potential to corrupt personalized information within latent representations. In this work, we propose a novel and effective method, named Asymmetric Diffusion Recommendation Model (AsymDiffRec), which learns forward and reverse processes in an asymmetric manner. We define a generalized forward process that simulates the missing features in real-world recommendation samples. The reverse process is then performed in an asymmetric latent feature space. To preserve personalized information within the latent representation, a task-oriented optimization strategy is introduced. In the serving stage, the raw sample with missing features is regarded as a noisy input to generate a denoised and robust representation for the final prediction. By equipping base models with AsymDiffRec, we conduct online A/B tests, achieving improvements of +0.131% and +0.166% in terms of users' active days and app usage duration respectively. Additionally, the extended offline experiments also demonstrate improvements. AsymDiffRec has been implemented in the Douyin Music App.
Updated: 2025-08-18 08:05:25
Subjects: cs.IR,cs.AI
BUILDA: A Thermal Building Data Generation Framework for Transfer Learning
Transfer learning (TL) can improve data-driven modeling of building thermal dynamics. Therefore, many new TL research areas emerge in the field, such as selecting the right source model for TL. However, these research directions require massive amounts of thermal building data, which is presently lacking. Neither public datasets nor existing data generators meet the needs of TL research in terms of data quality and quantity. Moreover, existing data generation approaches typically require expert knowledge in building simulation. We present BuilDa, a thermal building data generation framework for producing synthetic data of adequate quality and quantity for TL research. The framework does not require profound building simulation knowledge to generate large volumes of data. BuilDa uses a single-zone Modelica model that is exported as a Functional Mock-up Unit (FMU) and simulated in Python. We demonstrate BuilDa by generating data and utilizing it for pretraining and fine-tuning TL models.
Updated: 2025-08-18 08:01:37
Subjects: cs.LG,cs.SY,eess.SY
S2FGL: Spatial Spectral Federated Graph Learning
Federated Graph Learning (FGL) combines the privacy-preserving capabilities of federated learning (FL) with the strong graph modeling capability of Graph Neural Networks (GNNs). Current research addresses subgraph-FL from the structural perspective, neglecting the propagation of graph signals in the spatial and spectral domains of the structure. From a spatial perspective, subgraph-FL introduces edge disconnections between clients, leading to disruptions in label signals and a degradation in the semantic knowledge of the global GNN. From a spectral perspective, spectral heterogeneity causes inconsistencies in signal frequencies across subgraphs, which makes local GNNs overfit the local signal propagation schemes. As a result, spectral client drift occurs, undermining global generalizability. To tackle these challenges, we propose a global knowledge repository to mitigate the challenge of poor semantic knowledge caused by label signal disruption. Furthermore, we design a frequency alignment to address spectral client drift. The combination of Spatial and Spectral strategies forms our framework S2FGL. Extensive experiments on multiple datasets demonstrate the superiority of S2FGL. The code is available at https://github.com/Wonder7racer/S2FGL.git.
Updated: 2025-08-18 08:00:31
Subjects: cs.LG,cs.AI
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance
Robust information representation and its persistent maintenance are fundamental for higher cognitive functions. Existing models employ distinct neural mechanisms to separately address noise-resistant processing or information maintenance, yet a unified framework integrating both operations remains elusive -- a critical gap in understanding cortical computation. Here, we introduce a recurrent neural circuit that combines divisive normalization with self-excitation to achieve both robust encoding and stable retention of normalized inputs. Mathematical analysis shows that, for suitable parameter regimes, the system forms a continuous attractor with two key properties: (1) input-proportional stabilization during stimulus presentation; and (2) self-sustained memory states persisting after stimulus offset. We demonstrate the model's versatility in two canonical tasks: (a) noise-robust encoding in a random-dot kinematogram (RDK) paradigm; and (b) approximate Bayesian belief updating in a probabilistic Wisconsin Card Sorting Test (pWCST). This work establishes a unified mathematical framework that bridges noise suppression, working memory, and approximate Bayesian inference within a single cortical microcircuit, offering fresh insights into the brain's canonical computation and guiding the design of biologically plausible artificial neural architectures.
Updated: 2025-08-18 08:00:24
Subjects: q-bio.NC,cs.AI,cs.NE
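The two attractor properties can be reproduced with a one-line rate equation that combines self-excitation and divisive normalization. The specific form and parameters below are an illustrative reconstruction from the abstract, not the paper's model: dr_i/dt = -r_i + (alpha * r_i + I_i) / (sigma + sum_j r_j), where alpha * r_i is the self-excitation and the denominator is the normalization pool. For a single active unit, the self-sustained fixed point after stimulus offset is alpha - sigma.

```python
import numpy as np

# dr_i/dt = -r_i + (alpha * r_i + I_i) / (sigma + sum_j r_j)
alpha, sigma, dt = 2.0, 0.5, 0.01

def simulate(I_on, t_on=4.0, t_off=8.0, n=4):
    """Euler-integrate the circuit: stimulus I_on for t_on, then zero input."""
    r = np.zeros(n)
    for step in range(int((t_on + t_off) / dt)):
        I = I_on if step * dt < t_on else np.zeros(n)
        r = r + dt * (-r + (alpha * r + I) / (sigma + r.sum()))
    return r

stim = np.array([1.0, 0.0, 0.0, 0.0])  # drive only the first unit
r_final = simulate(stim)
```

During stimulation the driven unit rises toward its input-dependent fixed point; after offset it relaxes to the self-sustained memory value alpha - sigma = 1.5, while the undriven units remain silent, illustrating persistent maintenance of a normalized representation.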
Per-element Secure Aggregation against Data Reconstruction Attacks in Federated Learning
Federated learning (FL) enables collaborative model training without sharing raw data, but individual model updates may still leak sensitive information. Secure aggregation (SecAgg) mitigates this risk by allowing the server to access only the sum of client updates, thereby concealing individual contributions. However, a significant vulnerability has recently attracted increasing attention: when model updates are sparse vectors, a non-zero value contributed by a single client at a given index can be directly revealed in the aggregate, enabling precise data reconstruction attacks. In this paper, we propose a novel enhancement to SecAgg that reveals aggregated values only at indices with at least $t$ non-zero contributions. Our mechanism introduces a per-element masking strategy to prevent the exposure of under-contributed elements, while maintaining modularity and compatibility with many existing SecAgg implementations by relying solely on cryptographic primitives already employed in a typical setup. We integrate this mechanism into Flamingo, a low-round SecAgg protocol, to provide a robust defense against such attacks. Our analysis and experimental results indicate that the additional computational and communication overhead introduced by our mechanism remains within an acceptable range, supporting the practicality of our approach.
Updated: 2025-08-18 07:59:44
Subjects: cs.CR
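The reveal rule itself can be sketched in the clear (in the actual protocol the per-index contribution counts are obtained under cryptographic masking; computing them plainly here is purely to illustrate the policy): the server releases the aggregate only at indices where at least t clients contributed a non-zero value.

```python
import numpy as np

def threshold_aggregate(updates, t):
    """Reveal the aggregate only at indices with >= t non-zero contributions.
    Counts are computed in the clear here for illustration only; the real
    protocol keeps them hidden behind per-element masks."""
    U = np.stack(updates)
    counts = (U != 0).sum(axis=0)
    agg = U.sum(axis=0)
    agg[counts < t] = 0.0  # suppress under-contributed elements
    return agg, counts

sparse_updates = [np.array([0.5, 0.0, 1.0, 0.0]),
                  np.array([0.5, 0.0, 0.0, 2.0]),
                  np.array([1.0, 0.0, 0.0, 0.0])]
agg, counts = threshold_aggregate(sparse_updates, t=2)
# Index 0 (3 contributors) is revealed; indices 2 and 3 (1 contributor each)
# would have exposed a single client's exact value and are suppressed.
```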
Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning
Class-incremental with repetition (CIR), where previously trained classes are repeatedly introduced in future tasks, is a more realistic scenario than the traditional class-incremental setup, which assumes that each task contains unseen classes. CIR assumes that we can easily access abundant unlabeled data from external sources, such as the Internet. Therefore, we propose two components that efficiently use the unlabeled data to ensure both the high stability and the plasticity of models trained in the CIR setup. First, we introduce multi-level knowledge distillation (MLKD), which distills knowledge from multiple previous models across multiple perspectives, including features and logits, so that the model can retain a wider variety of previous knowledge. Moreover, we implement a dynamic self-supervised loss (SSL) to utilize the unlabeled data, which accelerates the learning of new classes, while dynamic weighting of the SSL keeps the focus of training on the primary task. Both of our proposed components significantly improve performance in the CIR setup, achieving 2nd place in the CVPR 5th CLVISION Challenge.
Updated: 2025-08-18 07:50:20
Subjects: cs.CV,cs.AI,cs.LG
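Distilling from multiple previous models at both the feature level (MSE) and the logit level (KL divergence over temperature-softened softmax) can be sketched as below. The loss weights and temperature are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def softmax(z, T=2.0):
    e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlkd_loss(student_feat, student_logits, teachers, T=2.0, w_feat=1.0, w_logit=1.0):
    """Average feature-MSE + logit-KL distillation over several previous models.
    `teachers` is a list of (feature, logits) pairs from earlier checkpoints."""
    loss = 0.0
    for t_feat, t_logits in teachers:
        feat_mse = ((student_feat - t_feat) ** 2).mean()
        p, q = softmax(t_logits, T), softmax(student_logits, T)
        kl = (p * (np.log(p) - np.log(q))).sum()
        loss += w_feat * feat_mse + w_logit * kl
    return loss / len(teachers)

feat = np.array([0.3, -0.2, 0.5])
logits = np.array([2.0, 0.5, -1.0])
same = mlkd_loss(feat, logits, teachers=[(feat, logits)])        # 0.0: perfect match
diff = mlkd_loss(feat, logits, teachers=[(feat + 1.0, logits)])  # feature mismatch
```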
S2Cap: A Benchmark and a Baseline for Singing Style Captioning
Singing voices contain much richer information than common voices, including varied vocal and acoustic properties. However, current open-source audio-text datasets for singing voices capture only a narrow range of attributes and lack acoustic features, leading to limited utility towards downstream tasks, such as style captioning. To fill this gap, we formally define the singing style captioning task and present S2Cap, a dataset of singing voices with detailed descriptions covering diverse vocal, acoustic, and demographic characteristics. Using this dataset, we develop an efficient and straightforward baseline algorithm for singing style captioning. The dataset is available at https://zenodo.org/records/15673764.
Updated: 2025-08-18 07:50:13
Subjects: cs.CL,cs.AI,cs.LG,cs.SD,eess.AS
MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration
Leveraging the Transformer architecture and the diffusion process, video DiT models have emerged as a dominant approach for high-quality video generation. However, their multi-step iterative denoising process incurs high computational cost and inference latency. Caching, a widely adopted optimization method in DiT models, leverages the redundancy in the diffusion process to skip computations at different granularities (e.g., step, cfg, block). Nevertheless, existing caching methods are limited to single-granularity strategies, struggling to balance generation quality and inference speed in a flexible manner. In this work, we propose MixCache, a training-free caching-based framework for efficient video DiT inference. It first distinguishes the interference and boundary between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled, along with an adaptive hybrid cache decision strategy for dynamically selecting the optimal caching granularity. Extensive experiments on diverse models demonstrate that MixCache can significantly accelerate video generation (e.g., 1.94$\times$ speedup on Wan 14B, 1.97$\times$ speedup on HunyuanVideo) while delivering both superior generation quality and inference efficiency compared to baseline methods.
Updated: 2025-08-18 07:49:33
Subjects: cs.GR,cs.CV,cs.LG
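A cache trigger of this kind can be sketched at a single granularity (a toy illustration of the general idea, not MixCache's multi-granularity decision logic; the tolerance is an assumed hyperparameter): a block's cached output is reused while its input has changed little since the step that produced the cache.

```python
import numpy as np

class StepCache:
    """Reuse a cached block output while inputs stay similar (toy trigger)."""
    def __init__(self, rel_tol=0.05):
        self.rel_tol, self.x_ref, self.y_ref = rel_tol, None, None
        self.hits = 0

    def __call__(self, x, block):
        if self.x_ref is not None:
            rel = np.abs(x - self.x_ref).mean() / (np.abs(self.x_ref).mean() + 1e-8)
            if rel < self.rel_tol:            # trigger: input barely changed
                self.hits += 1
                return self.y_ref             # skip the block computation
        self.x_ref, self.y_ref = x, block(x)  # recompute and refresh the cache
        return self.y_ref

cache = StepCache()
block = lambda x: 2.0 * x  # stand-in for an expensive DiT block
xs = [np.ones(4), np.ones(4) * 1.01, np.ones(4) * 1.02, np.ones(4) * 2.0]
outs = [cache(x, block) for x in xs]  # two middle steps are served from cache
```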
TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions
Test-time Adaptation (TTA) poses a challenge, requiring models to dynamically adapt and perform optimally on shifting target domains. This task is particularly emphasized in real-world driving scenes, where weather domain shifts occur frequently. To address such dynamic changes, our proposed method, TTA-DAME, leverages source domain data augmentation into target domains. Additionally, we introduce a domain discriminator and a specialized domain detector to mitigate drastic domain shifts, especially from daytime to nighttime conditions. To further improve adaptability, we train multiple detectors and consolidate their predictions through Non-Maximum Suppression (NMS). Our empirical validation demonstrates the effectiveness of our method, showing significant performance enhancements on the SHIFT Benchmark.
Updated: 2025-08-18 07:48:35
Categories: cs.CV,cs.AI,cs.LG
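The NMS step used above to consolidate predictions from multiple detectors is standard; a minimal NumPy sketch of greedy NMS over toy boxes (not the SHIFT data or the paper's detectors):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy Non-Maximum Suppression over [x1, y1, x2, y2] boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the current top-scoring box with the rest
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]   # drop boxes overlapping the kept one
    return keep

# Predictions pooled from two hypothetical detectors: two overlapping
# boxes on the same object plus one distinct box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]
```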
Reverse Markov Learning: Multi-Step Generative Models for Complex Distributions
Learning complex distributions is a fundamental challenge in contemporary applications. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. While effective, engression can struggle with highly complex distributions, such as those encountered in image data. In this work, we propose reverse Markov learning (RML), a framework that defines a general forward process transitioning from the target distribution to a known distribution (e.g., Gaussian) and then learns a reverse Markov process using multiple engression models. This reverse process reconstructs the target distribution step by step. This framework accommodates general forward processes, allows for dimension reduction, and naturally discretizes the generative process. In the special case of diffusion-based forward processes, RML provides an efficient discretization strategy for both training and inference in diffusion models. We further introduce an alternating sampling scheme to enhance post-training performance. Our statistical analysis establishes error bounds for RML and elucidates its advantages in estimation efficiency and flexibility in forward process design. Empirical results on simulated and climate data corroborate the theoretical findings, demonstrating the effectiveness of RML in capturing complex distributions.
Updated: 2025-08-18 07:48:27
Categories: cs.LG,stat.ME,stat.ML
EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EgoIllusion, the first benchmark for evaluating MLLM hallucinations in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open- and closed-ended questions designed to trigger hallucinations from both visual and auditory cues in egocentric videos. Evaluations of ten MLLMs, including powerful models like GPT-4o and Gemini, reveal significant challenges, with the best achieving only 59% accuracy. EgoIllusion lays the foundation for developing robust benchmarks to evaluate the effectiveness of MLLMs and spurs the development of better egocentric MLLMs with reduced hallucination rates. Our benchmark will be open-sourced for reproducibility.
Updated: 2025-08-18 07:39:55
Categories: cs.AI,cs.CV
ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby limiting real-world performance of agentic tasks. In this paper, we propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask-and-fill operations; and the offline verification phase ensures correctness and coherence via rule- and model-based checks. Experiments demonstrate that ToolACE-MT enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.
Updated: 2025-08-18 07:38:23
Categories: cs.CL,cs.AI,cs.LG
Teaching Introduction to Programming in the times of AI: A case study of a course re-design
The integration of AI tools into programming education has become increasingly prevalent in recent years, transforming the way programming is taught and learned. This paper provides a review of the state-of-the-art AI tools available for teaching and learning programming, particularly in the context of introductory courses. It highlights the challenges on course design, learning objectives, course delivery and formative and summative assessment, as well as the misuse of such tools by the students. We discuss ways of re-designing an existing course, re-shaping assignments and pedagogy to address the current AI technologies challenges. This example can serve as a guideline for policies for institutions and teachers involved in teaching programming, aiming to maximize the benefits of AI tools while addressing the associated challenges and concerns.
Updated: 2025-08-18 07:37:51
Categories: cs.CY,cs.AI
A Taxonomy of Hierarchical Multi-Agent Systems: Design Patterns, Coordination Mechanisms, and Industrial Applications
Hierarchical multi-agent systems (HMAS) organize collections of agents into layered structures that help manage complexity and scale. These hierarchies can simplify coordination, but they also can introduce trade-offs that are not always obvious. This paper proposes a multi-dimensional taxonomy for HMAS along five axes: control hierarchy, information flow, role and task delegation, temporal layering, and communication structure. The intent is not to prescribe a single "best" design but to provide a lens for comparing different approaches. Rather than treating these dimensions in isolation, the taxonomy is connected to concrete coordination mechanisms - from the long-standing contract-net protocol for task allocation to more recent work in hierarchical reinforcement learning. Industrial contexts illustrate the framework, including power grids and oilfield operations, where agents at production, maintenance, and supply levels coordinate to diagnose well issues or balance energy demand. These cases suggest that hierarchical structures may achieve global efficiency while preserving local autonomy, though the balance is delicate. The paper closes by identifying open challenges: making hierarchical decisions explainable to human operators, scaling to very large agent populations, and assessing whether learning-based agents such as large language models can be safely integrated into layered frameworks. This paper presents what appears to be the first taxonomy that unifies structural, temporal, and communication dimensions of hierarchical MAS into a single design framework, bridging classical coordination mechanisms with modern reinforcement learning and large language model agents.
Updated: 2025-08-18 07:36:33
Categories: cs.MA,cs.AI
GridCodex: A RAG-Driven AI Framework for Power Grid Code Reasoning and Compliance
The global shift towards renewable energy presents unprecedented challenges for the electricity industry, making regulatory reasoning and compliance increasingly vital. Grid codes, the regulations governing grid operations, are complex and often lack automated interpretation solutions, which hinders industry expansion and undermines profitability for electricity companies. We introduce GridCodex, an end to end framework for grid code reasoning and compliance that leverages large language models and retrieval-augmented generation (RAG). Our framework advances conventional RAG workflows through multi stage query refinement and enhanced retrieval with RAPTOR. We validate the effectiveness of GridCodex with comprehensive benchmarks, including automated answer assessment across multiple dimensions and regulatory agencies. Experimental results showcase a 26.4% improvement in answer quality and more than a 10 fold increase in recall rate. An ablation study further examines the impact of base model selection.
Updated: 2025-08-18 07:33:29
Categories: cs.AI
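A minimal sketch of the retrieve-then-answer core of a RAG pipeline, with a hand-written query-refinement step standing in for the LLM rewriting stage. The grid-code snippets, IDs, and scoring below are illustrative assumptions, not GridCodex's actual components (which use multi-stage refinement and RAPTOR-enhanced retrieval):

```python
import math
from collections import Counter

docs = {
    "GC-12": "inverter frequency response must activate within 2 seconds",
    "GC-07": "reactive power capability at the point of connection",
    "GC-31": "voltage ride-through requirements for wind generators",
}

def tf_idf_vectors(texts):
    """Build simple TF-IDF vectors for a tiny document store."""
    tokenized = {k: v.lower().split() for k, v in texts.items()}
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    n = len(texts)
    vectors = {
        k: {t: c * math.log((1 + n) / (1 + df[t]))
            for t, c in Counter(toks).items()}
        for k, toks in tokenized.items()
    }
    return vectors, df, n

def retrieve(query, vectors, df, n, top_k=1):
    """Rank documents by cosine similarity to the query vector."""
    q = {t: c * math.log((1 + n) / (1 + df.get(t, 0)))
         for t, c in Counter(query.lower().split()).items()}
    def cos(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return sorted(vectors, key=lambda k: cos(q, vectors[k]), reverse=True)[:top_k]

# Stage 1: refine the raw question into retrieval-friendly terms
# (a hand-written stand-in for an LLM query-rewriting step).
raw = "How fast must frequency response act?"
refined = raw.lower().replace("how fast", "seconds").replace("?", " activate")
vectors, df, n = tf_idf_vectors(docs)
print(retrieve(refined, vectors, df, n))  # → ['GC-12']
```

The retrieved clause would then be placed in the LLM prompt to ground the compliance answer.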
Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory
Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot capture the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of 44000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s².
Updated: 2025-08-18 07:24:36
Categories: cs.RO,cs.LG
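The surrogate-plus-sampling MPC loop can be illustrated with a generic random-shooting planner on a toy 1D system. The dynamics, cost, and planner below are stand-ins, not the DD-PINN surrogate or the paper's evolutionary optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_step(x, u):
    """Stand-in for a learned fast surrogate of the robot dynamics
    (a simple damped 1D system, not a Cosserat rod model)."""
    pos, vel = x
    vel = 0.9 * vel + 0.1 * u
    return np.array([pos + 0.1 * vel, vel])

def shooting_mpc(x0, target, horizon=10, n_samples=256):
    """Sample candidate control sequences, roll them out through the
    surrogate, and return the first action of the best sequence."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    costs = np.zeros(n_samples)
    for j, u_seq in enumerate(candidates):
        x = x0.copy()
        for u in u_seq:
            x = surrogate_step(x, u)
            costs[j] += (x[0] - target) ** 2
    return candidates[np.argmin(costs), 0]

# Closed loop: re-plan at every step (receding horizon).
x, target = np.array([0.0, 0.0]), 0.5
for _ in range(60):
    u = shooting_mpc(x, target)
    x = surrogate_step(x, u)
print(round(float(x[0]), 2))  # should settle near the 0.5 target
```

The paper's controller follows the same pattern but evaluates candidates with an evolutionary search and a neural surrogate fast enough for 70 Hz replanning.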
Quantum Money from Abelian Group Actions
We give a construction of public key quantum money, and even a strengthened version called quantum lightning, from abelian group actions, which can in turn be constructed from suitable isogenies over elliptic curves. We prove security in the generic group model for group actions under a plausible computational assumption, and develop a general toolkit for proving quantum security in this model. Along the way, we explore knowledge assumptions and algebraic group actions in the quantum setting, finding significant limitations of these assumptions/models compared to generic group actions.
Updated: 2025-08-18 07:21:19
Categories: quant-ph,cs.CR
InsightX Agent: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis
Non-destructive testing (NDT), particularly X-ray inspection, is vital for industrial quality assurance, yet existing deep-learning-based approaches often lack interactivity, interpretability, and the capacity for critical self-assessment, limiting their reliability and operator trust. To address these shortcomings, this paper proposes InsightX Agent, a novel LMM-based agentic framework designed to deliver reliable, interpretable, and interactive X-ray NDT analysis. Unlike typical sequential pipelines, InsightX Agent positions a Large Multimodal Model (LMM) as a central orchestrator, coordinating between the Sparse Deformable Multi-Scale Detector (SDMSD) and the Evidence-Grounded Reflection (EGR) tool. The SDMSD generates dense defect region proposals over multi-scale feature maps and sparsifies them through Non-Maximum Suppression (NMS), optimizing detection of small, dense targets in X-ray images while maintaining computational efficiency. The EGR tool guides the LMM agent through a chain-of-thought-inspired review process, incorporating context assessment, individual defect analysis, false positive elimination, confidence recalibration, and quality assurance to validate and refine the SDMSD's initial proposals. By strategically and intelligently employing its tools, InsightX Agent moves beyond passive data processing to active reasoning, enhancing diagnostic reliability and providing interpretations that integrate diverse information sources. Experimental evaluations on the GDXray+ dataset demonstrate that InsightX Agent not only achieves a high object detection F1-score of 96.35% but also offers significantly improved interpretability and trustworthiness in its analyses, highlighting the transformative potential of agentic LLM frameworks for industrial inspection tasks.
Updated: 2025-08-18 07:15:10
Categories: cs.AI,cs.CV
Unfolded Laplacian Spectral Embedding: A Theoretically Grounded Approach to Dynamic Network Representation
Dynamic relational structures play a central role in many AI tasks, but their evolving nature presents challenges for consistent and interpretable representation. A common approach is to learn time-varying node embeddings, whose effectiveness depends on satisfying key stability properties. In this paper, we propose Unfolded Laplacian Spectral Embedding, a new method that extends the Unfolded Adjacency Spectral Embedding framework to normalized Laplacians while preserving both cross-sectional and longitudinal stability. We provide formal proof that our method satisfies these stability conditions. In addition, as a bonus of using the Laplacian matrix, we establish a new Cheeger-style inequality that connects the embeddings to the conductance of the underlying dynamic graphs. Empirical evaluations on synthetic and real-world datasets support our theoretical findings and demonstrate the strong performance of our method. These results establish a principled and stable framework for dynamic network representation grounded in spectral graph theory.
Updated: 2025-08-18 07:13:53
Categories: stat.ML,cs.LG,cs.SI
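A simplified sketch of the unfolded construction: stack the per-snapshot normalized Laplacians side by side and take a truncated SVD, yielding a shared node basis plus time-varying embeddings. Scaling conventions and other details differ in the actual method; this only shows the shape of the computation:

```python
import numpy as np

def sym_norm_laplacian(A):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2},
    treating isolated nodes (degree 0) safely."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def unfolded_embedding(snapshots, d=2):
    """Column-stack the per-snapshot Laplacians into an n x (nT) matrix
    and take a rank-d SVD; left singular vectors give a shared node
    basis, the right factor gives time-varying embeddings."""
    L = np.hstack([sym_norm_laplacian(A) for A in snapshots])
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    n = snapshots[0].shape[0]
    Y = (s[:d, None] * Vt[:d]).T                 # (n*T, d)
    return U[:, :d], [Y[t * n:(t + 1) * n] for t in range(len(snapshots))]

# Two snapshots of a 4-node graph: an edge appears between nodes 2 and 3.
A1 = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], float)
A2 = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], float)
U, embeddings = unfolded_embedding([A1, A2])
print(U.shape, [e.shape for e in embeddings])    # (4, 2) [(4, 2), (4, 2)]
```

Because all snapshots share one SVD, the per-time embeddings live in a common space, which is what makes cross-sectional and longitudinal comparisons meaningful.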
Deploying Models to Non-participating Clients in Federated Learning without Fine-tuning: A Hypernetwork-based Approach
Federated Learning (FL) has emerged as a promising paradigm for privacy-preserving collaborative learning, yet data heterogeneity remains a critical challenge. While existing methods achieve progress in addressing data heterogeneity for participating clients, they fail to generalize to non-participating clients with in-domain distribution shifts and resource constraints. To mitigate this issue, we present HyperFedZero, a novel method that dynamically generates specialized models via a hypernetwork conditioned on distribution-aware embeddings. Our approach explicitly incorporates distribution-aware inductive biases into the model's forward pass, extracting robust distribution embeddings using a NoisyEmbed-enhanced extractor with a Balancing Penalty, effectively preventing feature collapse. The hypernetwork then leverages these embeddings to generate specialized models chunk-by-chunk for non-participating clients, ensuring adaptability to their unique data distributions. Extensive experiments on multiple datasets and models demonstrate HyperFedZero's remarkable performance, surpassing competing methods consistently with minimal computational, storage, and communication overhead. Moreover, ablation studies and visualizations further validate the necessity of each component, confirming meaningful adaptations and validating the effectiveness of HyperFedZero.
Updated: 2025-08-18 07:11:51
Categories: cs.LG,cs.AI
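The chunk-by-chunk hypernetwork idea can be sketched with a toy linear generator that maps a client's distribution embedding to the weights of a small target model. The architecture below is an illustrative assumption, not HyperFedZero's network:

```python
import numpy as np

rng = np.random.default_rng(1)

class ChunkedHypernetwork:
    """Toy hypernetwork: maps a distribution embedding to the flat
    weight vector of a target model, emitted in fixed-size chunks
    (illustrative only, not HyperFedZero's architecture)."""
    def __init__(self, embed_dim, target_size, chunk_size):
        self.chunk_size = chunk_size
        self.n_chunks = -(-target_size // chunk_size)   # ceil division
        self.target_size = target_size
        # One generator matrix per chunk (the "+1" column is a bias term).
        self.W = rng.normal(0, 0.1, (self.n_chunks, chunk_size, embed_dim + 1))

    def generate(self, embedding):
        z = np.append(embedding, 1.0)                   # append bias input
        chunks = [self.W[c] @ z for c in range(self.n_chunks)]
        return np.concatenate(chunks)[: self.target_size]

hyper = ChunkedHypernetwork(embed_dim=4, target_size=10, chunk_size=3)
emb_a, emb_b = rng.normal(size=4), rng.normal(size=4)
w_a, w_b = hyper.generate(emb_a), hyper.generate(emb_b)
print(w_a.shape, np.allclose(w_a, w_b))  # (10,) False: weights depend on the embedding
```

A non-participating client needs only its distribution embedding to receive a specialized model, with no fine-tuning round.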
Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.
Updated: 2025-08-18 07:11:21
Categories: cs.LG,cs.AI
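A toy version of loss-based client screening: score each client update on the server's trusted side dataset, split clients at the largest gap in sorted losses, and average only the low-loss cluster. This is a deliberately simple stand-in for the paper's clustering, shown only to make the mechanism concrete:

```python
import numpy as np

def split_by_loss(losses):
    """Split clients into low- and high-loss clusters at the largest
    gap in sorted losses (a simple stand-in for clustering)."""
    order = np.argsort(losses)
    gap = int(np.argmax(np.diff(losses[order])))
    return order[: gap + 1], order[gap + 1:]

def robust_aggregate(updates, losses):
    trusted, suspect = split_by_loss(losses)
    return updates[trusted].mean(axis=0), trusted, suspect

# Five client updates on a 3-parameter model; clients 3 and 4 are
# label-flipping attackers, so their updates score badly on the
# server's trusted side dataset.
updates = np.array([
    [1.0, 2.0, 3.0], [1.1, 1.9, 3.1], [0.9, 2.1, 2.9],   # honest
    [-5.0, 7.0, -3.0], [-4.8, 6.5, -3.2],                 # Byzantine
])
losses = np.array([0.10, 0.12, 0.11, 2.50, 2.40])
agg, trusted, suspect = robust_aggregate(updates, losses)
print(sorted(trusted.tolist()), np.round(agg, 2))  # [0, 1, 2] [1. 2. 3.]
```

Unlike coordinate-wise defenses (trimmed mean, median, Krum), this uses the trusted data directly, which is why only the server and one honest client are needed.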
DIT: Dimension Reduction View on Optimal NFT Rarity Meters
Non-fungible tokens (NFTs) have become a significant digital asset class, each uniquely representing virtual entities such as artworks. These tokens are stored in collections within smart contracts and are actively traded across platforms on Ethereum, Bitcoin, and Solana blockchains. The value of NFTs is closely tied to their distinctive characteristics that define rarity, leading to a growing interest in quantifying rarity within both industry and academia. While there are existing rarity meters for assessing NFT rarity, comparing them can be challenging without direct access to the underlying collection data. The Rating over all Rarities (ROAR) benchmark addresses this challenge by providing a standardized framework for evaluating NFT rarity. This paper explores a dimension reduction approach to rarity design, introducing new performance measures and meters, and evaluates them using the ROAR benchmark. Our contributions to the rarity meter design issue include developing an optimal rarity meter design using non-metric weighted multidimensional scaling, introducing Dissimilarity in Trades (DIT) as a performance measure inspired by dimension reduction techniques, and unveiling the non-interpretable rarity meter DIT, which demonstrates superior performance compared to existing methods.
Updated: 2025-08-18 07:11:00
Categories: cs.DC,cs.LG
Data driven feedback linearization of nonlinear control systems via Lie derivatives and stacked regression approach
Discovering the governing equations of a physical system and designing an effective feedback controller remain among the most challenging and active areas of ongoing research. This task demands a deep understanding of the system behavior, including the nonlinear factors that influence its dynamics. In this article, we propose a novel methodology for identifying a feedback-linearized physical system based on known prior dynamic behavior. Initially, the system is identified using a sparse regression algorithm; subsequently, a feedback controller is designed for the discovered system by applying Lie derivatives to the dictionary of output functions, deriving an augmented constraint that guarantees no internal dynamics are observed. Unlike prior related works, the novelty of this article lies in combining a stacked regression algorithm with relative-degree conditions to discover and feedback-linearize the true governing equations of a physical model.
Updated: 2025-08-18 06:51:13
Categories: cs.LG
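The sparse-regression identification step can be sketched with sequentially thresholded least squares (the SINDy-style workhorse) on a scalar system; the Lie-derivative feedback-linearization stage is omitted, and the system below is synthetic:

```python
import numpy as np

def stlsq(Theta, dX, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: fit, zero out small
    coefficients, refit on the surviving library columns."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        big = ~small
        if big.any():
            Xi[big] = np.linalg.lstsq(Theta[:, big], dX, rcond=None)[0]
    return Xi

# Synthetic data from the scalar system x' = -2x + 0.5x^2.
x = np.linspace(0.1, 2.0, 200)
dx = -2.0 * x + 0.5 * x**2
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])   # candidate library
Xi = stlsq(Theta, dx)
print(np.round(Xi, 3))  # recovers coefficients [0, -2, 0.5, 0]
```

With noisy measurements the derivatives would be estimated numerically and the threshold tuned, but the discovery mechanism is the same.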
Breaking Language Barriers: Equitable Performance in Multilingual Language Models
Cutting-edge LLMs have emerged as powerful tools for multilingual communication and understanding. However, LLMs perform worse in Common Sense Reasoning (CSR) tasks when prompted in low-resource languages (LRLs) like Hindi or Swahili compared to high-resource languages (HRLs) like English. Equalizing this inconsistent access to quality LLM outputs is crucial to ensure fairness for speakers of LRLs and across diverse linguistic communities. In this paper, we propose an approach to bridge this gap in LLM performance. Our approach involves fine-tuning an LLM on synthetic code-switched text generated using controlled language-mixing methods. We empirically demonstrate that fine-tuning LLMs on synthetic code-switched datasets leads to substantial improvements in LRL model performance while preserving or enhancing performance in HRLs. Additionally, we present a new dataset of synthetic code-switched text derived from the CommonSenseQA dataset, featuring three distinct language ratio configurations.
Updated: 2025-08-18 06:50:24
Categories: cs.CL,cs.AI
RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation
Achieving both realism and controllability in closed-loop traffic simulation remains a key challenge in autonomous driving. Dataset-based methods reproduce realistic trajectories but suffer from covariate shift in closed-loop deployment, compounded by simplified dynamics models that further reduce reliability. Conversely, physics-based simulation methods enhance reliable and controllable closed-loop interactions but often lack expert demonstrations, compromising realism. To address these challenges, we introduce a dual-stage AV-centric simulation framework that conducts open-loop imitation learning pre-training in a data-driven simulator to capture trajectory-level realism and route-level controllability, followed by closed-loop reinforcement learning fine-tuning in a physics-based simulator to enhance style-level controllability and mitigate covariate shift. In the fine-tuning stage, we propose RIFT, a novel RL fine-tuning strategy that evaluates all candidate modalities through group-relative optimization with a dual-clip surrogate objective, while preserving the trajectory-level realism and route-level controllability inherited from IL pre-training. Extensive experiments demonstrate that RIFT improves realism and controllability in traffic simulation while simultaneously exposing the limitations of modern AV systems in closed-loop evaluation. Project Page: https://currychen77.github.io/RIFT/
Updated: 2025-08-18 06:47:39
Categories: cs.RO,cs.LG
ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization
Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of baselines, resulting in speed-ups of over 500% in determining the best data mixture on our largest experiments. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area.
Updated: 2025-08-18 06:38:38
Categories: stat.ML,cs.AI,cs.LG
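The small-to-large transfer idea can be illustrated with successive halving over candidate mixtures: score many mixtures cheaply with noisy low-fidelity proxies and promote the best half to higher fidelity. This is a deliberately simpler stand-in for the paper's multi-fidelity Bayesian optimization, and the loss model below is synthetic:

```python
import random

random.seed(0)

def loss(mixture, fidelity):
    """Stand-in for training a proxy model at a given scale: the true
    ranking is noisier at low fidelity, cleaner at high fidelity."""
    ideal = [0.5, 0.3, 0.2]                        # made-up "best" mixture
    err = sum((m - i) ** 2 for m, i in zip(mixture, ideal))
    return err + random.gauss(0, 0.05 / fidelity)  # noise shrinks with scale

def successive_halving(candidates, fidelities=(1, 4, 16)):
    pool = list(candidates)
    for f in fidelities:
        scored = sorted(pool, key=lambda m: loss(m, f))
        pool = scored[: max(1, len(scored) // 2)]  # promote the best half
    return pool[0]

def random_mixture():
    w = [random.random() for _ in range(3)]
    s = sum(w)
    return tuple(x / s for x in w)

best = successive_halving([random_mixture() for _ in range(16)])
print(tuple(round(b, 2) for b in best))  # the surviving mixture
```

Replacing the sort-and-halve rule with a Gaussian-process surrogate and an acquisition function over both mixture and fidelity gives the multi-fidelity BO setting the paper studies.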
The Maximum Coverage Model and Recommendation System for UAV Vertiports Location Planning
As urban aerial mobility (UAM) infrastructure development accelerates globally, cities like Shenzhen are planning large-scale vertiport networks (e.g., 1,200+ facilities by 2026). Existing planning frameworks remain inadequate for this complexity due to historical limitations in data granularity and real-world applicability. This paper addresses these gaps by first proposing the Capacitated Dynamic Maximum Covering Location Problem (CDMCLP), a novel optimization framework that simultaneously models urban-scale spatial-temporal demand, heterogeneous user behaviors, and infrastructure capacity constraints. Building on this foundation, we introduce an Integrated Planning Recommendation System that combines CDMCLP with socio-economic factors and dynamic clustering initialization. This system leverages adaptive parameter tuning based on empirical user behavior to generate practical planning solutions. Validation in a Chinese central city demonstrates the effectiveness of the new optimization framework and recommendation system. Under the evaluation and optimization of CDMCLP, the quantitative performance of traditional location methods is exposed and can be improved by 38%-52%, while the recommendation system shows user-friendliness and effective integration of complex elements. By integrating mathematical rigor with practical implementation considerations, this hybrid approach bridges the gap between theoretical location modeling and real-world UAM infrastructure planning, offering municipalities a pragmatic tool for vertiport network design.
Updated: 2025-08-18 06:31:08
标题: 为无人机垂直起降点选址规划设计的最大覆盖模型和推荐系统
摘要: 随着全球城市空中交通(UAM)基础设施发展加速,像深圳这样的城市正在规划大规模的垂直起降机场网络(例如,到2026年将有1,200多个设施)。由于历史上数据粒度和实际适用性方面的限制,现有的规划框架仍然无法应对这种复杂性。为弥补这些不足,本文首先提出了有容量限制的动态最大覆盖选址问题(CDMCLP),这是一个同时建模城市尺度时空需求、异质用户行为和基础设施容量约束的新型优化框架。在此基础上,我们引入了一个集成规划推荐系统,将CDMCLP与社会经济因素和动态聚类初始化相结合。该系统利用基于实证用户行为的自适应参数调整,生成实用的规划解决方案。在中国一个中心城市的验证展示了新优化框架和推荐系统的有效性。在CDMCLP的评估和优化下,传统选址方法的定量性能短板得以暴露,并可提升38%至52%,而推荐系统则展示了用户友好性和对复杂要素的有效整合。通过将数学严谨性与实际实施考虑相结合,这种混合方法弥合了理论选址建模与实际UAM基础设施规划之间的差距,为市政当局提供了一个用于垂直起降机场网络设计的实用工具。
更新时间: 2025-08-18 06:31:08
领域: cs.AI,cs.ET
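A minimal sketch of the coverage objective at the heart of CDMCLP: a greedy heuristic for capacitated maximum coverage. The instance format, the deterministic tie-breaking rule, and the toy data in the usage note are assumptions for illustration; the paper's full model additionally handles dynamic spatial-temporal demand and user behavior.

```python
def greedy_max_coverage(candidates, demands, budget):
    """Greedy heuristic for the (capacitated) maximum covering location
    problem: repeatedly open the site that covers the most uncovered
    demand weight, up to its capacity, until the facility budget is spent.

    candidates: {site: (set_of_demand_points, capacity)}
    demands:    {demand_point: weight}
    """
    covered = set()
    opened = []
    for _ in range(budget):
        best_site, best_gain = None, 0.0
        for site, (points, cap) in candidates.items():
            if site in opened:
                continue
            # Highest-weight uncovered points first; name breaks ties.
            new = sorted(points - covered, key=lambda p: (-demands[p], p))[:cap]
            gain = sum(demands[p] for p in new)
            if gain > best_gain:
                best_site, best_gain = site, gain
        if best_site is None:
            break  # nothing left to gain
        points, cap = candidates[best_site]
        new = sorted(points - covered, key=lambda p: (-demands[p], p))[:cap]
        covered |= set(new)
        opened.append(best_site)
    return opened, sum(demands[p] for p in covered)
```

On a toy instance with five demand points and three candidate vertiport sites, a budget of two sites already covers all demand.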
Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery
Ordering-based approaches to causal discovery identify topological orders of causal graphs, providing scalable alternatives to combinatorial search methods. Under the Additive Noise Model (ANM) assumption, recent causal ordering methods based on score matching require an accurate estimation of the Hessian diagonal of the log-densities. However, previous approaches mainly use Stein gradient estimators, which are computationally expensive and memory-intensive. Although DiffAN addresses these limitations by substituting kernel-based estimates with diffusion models, it remains numerically unstable due to the second-order derivatives of score models. To alleviate these problems, we propose Score-informed Neural Operator (SciNO), a probabilistic generative model in smooth function spaces designed to stably approximate the Hessian diagonal and to preserve structural information during the score modeling. Empirical results show that SciNO reduces order divergence by 42.7% on synthetic graphs and by 31.5% on real-world datasets on average compared to DiffAN, while maintaining memory efficiency and scalability. Furthermore, we propose a probabilistic control algorithm for causal reasoning with autoregressive models that integrates SciNO's probability estimates with autoregressive model priors, enabling reliable data-driven causal ordering informed by semantic information. Consequently, the proposed method enhances causal reasoning abilities of LLMs without additional fine-tuning or prompt engineering.
Updated: 2025-08-18 06:25:41
标题: 得分通知的神经操作符用于增强基于排序的因果发现
摘要: 基于排序的因果发现方法通过识别因果图的拓扑顺序,为组合搜索方法提供了可扩展的替代方案。在加性噪声模型(ANM)假设下,近期基于得分匹配的因果排序方法需要准确估计对数密度的Hessian对角线。然而,先前的方法主要使用Stein梯度估计器,其计算成本高且内存开销大。虽然DiffAN通过用扩散模型替代基于核的估计来解决这些限制,但由于需要得分模型的二阶导数,它在数值上仍不稳定。为了缓解这些问题,我们提出了Score-informed Neural Operator(SciNO),这是一个定义在光滑函数空间上的概率生成模型,旨在稳定地近似Hessian对角线,并在得分建模过程中保留结构信息。实证结果显示,与DiffAN相比,SciNO在合成图上将顺序散度平均降低了42.7%,在真实数据集上平均降低了31.5%,同时保持内存效率和可扩展性。此外,我们提出了一种用于自回归模型因果推理的概率控制算法,该算法将SciNO的概率估计与自回归模型先验相结合,实现了由语义信息指导的可靠的数据驱动因果排序。因此,所提出的方法无需额外的微调或提示工程即可增强LLMs的因果推理能力。
更新时间: 2025-08-18 06:25:41
领域: cs.LG,cs.AI,I.2.6; I.2.8
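The quantity this abstract centers on, the Hessian diagonal of the log-density, can be illustrated on a closed-form case. Below, the score of a 1-D Gaussian is known analytically, so a finite difference of the score recovers the second derivative, which equals $-1/\sigma^2$ everywhere; SciNO's contribution is approximating this quantity stably when the score is itself a learned model rather than a formula.

```python
def gaussian_score(x, mu=0.0, sigma=2.0):
    """Score (gradient of the log-density) of a 1-D Gaussian:
    d/dx log N(x; mu, sigma^2) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def hessian_diag(score, x, eps=1e-4):
    """Central finite difference of the score, i.e. the second derivative
    of the log-density -- the quantity that DiffAN differentiates through
    a score network and that SciNO instead approximates directly."""
    return (score(x + eps) - score(x - eps)) / (2 * eps)
```

For the Gaussian with sigma=2, the result is -0.25 at every point, matching the analytic value $-1/\sigma^2$.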
Cognitive Structure Generation: From Educational Priors to Policy Optimization
Cognitive structure is a student's subjective organization of an objective knowledge system, reflected in the psychological construction of concepts and their relations. However, cognitive structure assessment remains a long-standing challenge in student modeling and psychometrics, persisting as a foundational yet largely unassessable concept in educational practice. This paper introduces a novel framework, Cognitive Structure Generation (CSG), in which we first pretrain a Cognitive Structure Diffusion Probabilistic Model (CSDPM) to generate students' cognitive structures from educational priors, and then further optimize its generative process as a policy with hierarchical reward signals via reinforcement learning to align with genuine cognitive development levels during students' learning processes. Experimental results on four popular real-world education datasets show that cognitive structures generated by CSG offer more comprehensive and effective representations for student modeling, substantially improving performance on knowledge tracing (KT) and cognitive diagnosis (CD) tasks while enhancing interpretability.
Updated: 2025-08-18 06:21:36
标题: 认知结构生成:从教育先验到政策优化
摘要: 认知结构是学生对客观知识体系的主观组织,体现在对概念及其关系的心理建构中。然而,认知结构评估在学生建模和心理测量学中长期以来都是一个挑战,在教育实践中始终是一个基础性却在很大程度上无法评估的概念。本文介绍了一个新颖的框架——认知结构生成(CSG):我们首先预训练一个认知结构扩散概率模型(CSDPM),从教育先验中生成学生的认知结构,然后通过带有层次奖励信号的强化学习将其生成过程作为策略进一步优化,使之与学生学习过程中的真实认知发展水平保持一致。在四个流行的现实教育数据集上的实验结果表明,由CSG生成的认知结构为学生建模提供了更全面、更有效的表示,大大提高了知识追踪(KT)和认知诊断(CD)任务的表现,同时增强了可解释性。
更新时间: 2025-08-18 06:21:36
领域: cs.AI,cs.CY,cs.LG
Latent Expression Generation for Referring Image Segmentation and Grounding
Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.
Updated: 2025-08-18 06:18:45
标题: 潜在表达生成用于指代图像分割和定位
摘要: 视觉定位任务,例如指代图像分割(RIS)和指代表达理解(REC),旨在根据给定的文本描述定位目标对象。图像中的目标对象可以用多种方式描述,反映出颜色、位置等多种属性。然而,大多数现有方法依赖于单一的文本输入,仅捕获了视觉领域丰富信息的一部分。丰富的视觉细节与稀疏的文本提示之间的不匹配可能导致对相似对象的误识别。为了解决这个问题,我们提出了一种新颖的视觉定位框架,通过纳入原始描述中缺失的互补视觉细节,利用从单一文本输入生成的多个潜在表达。具体来说,我们引入主语分发器和视觉概念注入器模块,将共享主语和差异属性的概念嵌入到潜在表示中,从而捕获独特且特定于目标的视觉线索。我们还提出了一种正间隔对比学习策略,将所有潜在表达与原始文本对齐,同时保留细微变化。实验结果表明,我们的方法不仅在多个基准测试中优于最先进的RIS和REC方法,还在广义指代表达分割(GRES)基准测试中取得了出色的性能。
更新时间: 2025-08-18 06:18:45
领域: cs.CV,cs.AI
Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference
In tasks aiming for long-term returns, planning becomes essential. We study generative modeling for planning with datasets repurposed from offline reinforcement learning. Specifically, we identify temporal consistency in the absence of step-wise rewards as one key technical challenge. We introduce the Latent Plan Transformer (LPT), a novel model that leverages a latent variable to connect a Transformer-based trajectory generator and the final return. LPT can be learned with maximum likelihood estimation on trajectory-return pairs. In learning, posterior sampling of the latent variable naturally integrates sub-trajectories to form a consistent abstraction despite the finite context. At test time, the latent variable is inferred from an expected return before policy execution, realizing the idea of planning as inference. Our experiments demonstrate that LPT can discover improved decisions from sub-optimal trajectories, achieving competitive performance across several benchmarks, including Gym-Mujoco, Franka Kitchen, Maze2D, and Connect Four. It exhibits capabilities in nuanced credit assignments, trajectory stitching, and adaptation to environmental contingencies. These results validate that latent variable inference can be a strong alternative to step-wise reward prompting.
Updated: 2025-08-18 06:17:19
标题: 潜在计划变换器用于轨迹抽象:规划作为潜在空间推理
摘要: 在以长期回报为目标的任务中,规划变得至关重要。我们研究了利用从离线强化学习中重新利用的数据集进行规划的生成建模。具体来说,我们将缺乏逐步奖励情况下的时间一致性确定为一个关键的技术挑战。我们引入了潜在计划变换器(LPT),这是一种利用潜在变量将基于Transformer的轨迹生成器与最终回报连接起来的新型模型。LPT可以通过在轨迹-回报对上的最大似然估计来学习。在学习过程中,尽管上下文是有限的,潜在变量的后验采样自然地将子轨迹整合成一致的抽象。在测试时,潜在变量在策略执行前由期望回报推断得到,实现了"规划即推理"的理念。我们的实验证明,LPT可以从次优轨迹中发现改进的决策,在包括Gym-Mujoco、Franka Kitchen、Maze2D和Connect Four在内的多个基准测试中取得了有竞争力的性能。它在细致的信用分配、轨迹拼接和适应环境变化方面展现出能力。这些结果验证了潜在变量推断可以成为逐步奖励提示的有力替代方案。
更新时间: 2025-08-18 06:17:19
领域: cs.LG
Data-dependent and Oracle Bounds on Forgetting in Continual Learning
In continual learning, knowledge must be preserved and re-used between tasks, maintaining good transfer to future tasks and minimizing forgetting of previously learned ones. While several practical algorithms have been devised for this setting, there have been few theoretical works aiming to quantify and bound the degree of Forgetting in general settings. For \emph{exemplar-free} methods, we provide both data-dependent upper bounds that apply \emph{regardless of model and algorithm choice}, and oracle bounds for Gibbs posteriors. We derive an algorithm based on our bounds and demonstrate empirically that our approach yields tight and practical bounds on forgetting for several continual learning problems and algorithms.
Updated: 2025-08-18 06:13:30
标题: 持续学习中遗忘的数据依赖性和Oracle边界
摘要: 在持续学习中,知识必须在任务之间被保留和重复使用,以保持对未来任务的良好迁移,并尽量减少对先前所学任务的遗忘。虽然针对这种情形已经设计了若干实用算法,但旨在量化和界定一般情形下遗忘程度的理论工作很少。对于\emph{无样例}(exemplar-free)方法,我们既提供了\emph{不依赖于模型和算法选择}的数据依赖上界,也提供了针对Gibbs后验的oracle界。我们基于这些界推导出一种算法,并通过实验证明,我们的方法为多个持续学习问题和算法提供了紧致且实用的遗忘界。
更新时间: 2025-08-18 06:13:30
领域: cs.LG
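As a concrete reference point for what such bounds control, here is the standard empirical forgetting metric computed from an accuracy matrix; the matrix layout is an assumption of this sketch, not notation from the paper.

```python
def forgetting(acc_matrix):
    """Average forgetting after training on T tasks.

    acc_matrix[i][j] = accuracy on task j after training on task i.
    The forgetting of task j is the drop from its best past accuracy to
    its final accuracy; theoretical bounds upper-bound this quantity.
    """
    T = len(acc_matrix)
    drops = []
    for j in range(T - 1):  # the last task cannot have been forgotten yet
        best_past = max(acc_matrix[i][j] for i in range(j, T - 1))
        drops.append(best_past - acc_matrix[T - 1][j])
    return sum(drops) / len(drops)
```

For a 3-task run where task 0 drops from 0.9 to 0.6 and task 1 from 0.8 to 0.75, average forgetting is (0.3 + 0.05) / 2 = 0.175.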
MPOCryptoML: Multi-Pattern based Off-Chain Crypto Money Laundering Detection
Recent advancements in money laundering detection have demonstrated the potential of using graph neural networks to capture laundering patterns accurately. However, existing models are not explicitly designed to detect the diverse patterns of off-chain cryptocurrency money laundering. Neglecting any laundering pattern introduces critical detection gaps, as each pattern reflects unique transactional structures that facilitate the obfuscation of illicit fund origins and movements. Failure to account for these patterns may result in under-detection or omission of specific laundering activities, diminishing model accuracy and allowing schemes to bypass detection. To address this gap, we propose the MPOCryptoML model to effectively detect multiple laundering patterns in cryptocurrency transactions. MPOCryptoML includes the development of a multi-source Personalized PageRank algorithm to identify random laundering patterns. Additionally, we introduce two novel algorithms by analyzing the timestamp and weight of transactions in high-volume financial networks to detect various money laundering structures, including fan-in, fan-out, bipartite, gather-scatter, and stack patterns. We further examine correlations between these patterns using a logistic regression model. An anomaly score function integrates results from each module to rank accounts by anomaly score, systematically identifying high-risk accounts. Extensive experiments on public datasets including Elliptic++, Ethereum fraud detection, and Wormhole transaction datasets validate the efficacy and efficiency of MPOCryptoML. Results show consistent performance gains, with improvements up to 9.13% in precision, up to 10.16% in recall, up to 7.63% in F1-score, and up to 10.19% in accuracy.
Updated: 2025-08-18 06:06:32
标题: MPOCryptoML:基于多模式的链下加密货币洗钱检测
摘要: 反洗钱检测的最新进展表明,使用图神经网络可以准确捕捉洗钱模式。然而,现有模型并未明确针对检测链下加密货币洗钱的多样化模式而设计。忽略任何一种洗钱模式都会带来关键的检测漏洞,因为每种模式都反映了有助于掩盖非法资金来源和流向的独特交易结构。未考虑这些模式可能导致对特定洗钱活动检测不足或遗漏,降低模型准确性,并使洗钱方案得以绕过检测。为了弥补这一差距,我们提出了MPOCryptoML模型,以有效检测加密货币交易中的多种洗钱模式。MPOCryptoML包括开发一种多源个性化PageRank算法来识别随机洗钱模式。此外,我们通过分析大规模金融网络中交易的时间戳和权重,引入了两种新算法,以检测多种洗钱结构,包括汇入(fan-in)、汇出(fan-out)、二部图、集散(gather-scatter)和堆叠模式。我们进一步使用逻辑回归模型检验这些模式之间的相关性。一个异常评分函数整合各模块的结果,按异常分数对账户进行排序,系统地识别高风险账户。在包括Elliptic++、以太坊欺诈检测和Wormhole交易数据集在内的公共数据集上进行的大量实验验证了MPOCryptoML的有效性和效率。结果显示出一致的性能提升:精确率最多提高9.13%,召回率最多提高10.16%,F1分数最多提高7.63%,准确率最多提高10.19%。
更新时间: 2025-08-18 06:06:32
领域: cs.CR
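A minimal sketch of the multi-source Personalized PageRank idea: restart mass is spread over several seed (suspicious) accounts, and dangling nodes return their mass to the seeds, so accounts reachable from many suspicious sources accumulate high scores. The graph, seed choice, and damping value below are illustrative assumptions, not the paper's configuration.

```python
def multi_source_ppr(adj, seeds, alpha=0.15, iters=100):
    """Personalized PageRank restarted from several seed accounts.

    adj: {node: [out-neighbors]}; seeds: list of seed nodes.
    Returns a score per node; mass is conserved across iterations.
    """
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:  # dangling node: send its mass back to the seeds
                for s in seeds:
                    nxt[s] += (1 - alpha) * rank[n] / len(seeds)
            else:
                share = (1 - alpha) * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
        rank = nxt
    return rank
```

On a toy graph where funds from two seed accounts both funnel into `c`, the sink `c` receives the highest score, which is exactly the aggregation behavior a multi-source variant exploits.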
Synthesizing Accurate and Realistic T1-weighted Contrast-Enhanced MR Images using Posterior-Mean Rectified Flow
Contrast-enhanced (CE) T1-weighted MRI is central to neuro-oncologic diagnosis but requires gadolinium-based agents, which add cost and scan time, raise environmental concerns, and may pose risks to patients. In this work, we propose a two-stage Posterior-Mean Rectified Flow (PMRF) pipeline for synthesizing volumetric CE brain MRI from non-contrast inputs. First, a patch-based 3D U-Net predicts the voxel-wise posterior mean (minimizing MSE). Then, this initial estimate is refined by a time-conditioned 3D rectified flow to incorporate realistic textures without compromising structural fidelity. We train this model on a multi-institutional collection of paired pre- and post-contrast T1w volumes (BraTS 2023-2025). On a held-out test set of 360 diverse volumes, our best refined outputs achieve an axial FID of $12.46$ and KID of $0.007$ ($\sim 68.7\%$ lower FID than the posterior mean) while maintaining low volumetric MSE of $0.057$ ($\sim 27\%$ higher than the posterior mean). Qualitative comparisons confirm that our method restores lesion margins and vascular details realistically, effectively navigating the perception-distortion trade-off for clinical deployment.
Updated: 2025-08-18 05:55:57
标题: 使用后验均值矫正流合成准确和逼真的T1加权对比增强MR图像
摘要: 对比增强(CE)T1加权MRI对神经肿瘤学诊断至关重要,但需要使用基于钆的造影剂,这会增加成本和扫描时间,引发环境担忧,并可能对患者造成风险。在这项工作中,我们提出了一个两阶段的后验均值矫正流(PMRF)管道,用于从非对比输入合成体积CE脑MRI。首先,基于图块的3D U-Net预测体素级后验均值(最小化均方误差)。然后,通过一个时间条件化的3D矫正流来细化这一初始估计,在不损害结构保真度的情况下融入真实纹理。我们在一个多机构收集的配对对比前/对比后T1w体积数据集(BraTS 2023-2025)上训练该模型。在一个包含360个不同体积的保留测试集上,我们最佳的细化输出实现了12.46的轴向FID和0.007的KID(FID比后验均值低约68.7%),同时保持0.057的低体积MSE(比后验均值高约27%)。定性比较证实,我们的方法逼真地恢复了病变边缘和血管细节,有效地权衡了感知与失真,适合临床部署。
更新时间: 2025-08-18 05:55:57
领域: cs.CV,cs.LG
SpotVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer
Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between Large and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLMs inference. Based on this paradigm, we design SpotVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems.
Updated: 2025-08-18 05:51:41
标题: SpotVLM:基于上下文传递的云边协作实时VLM
摘要: 视觉语言模型(VLMs)越来越多地被部署在自动驾驶和人机交互等实时应用中,这些应用需要基于准确感知的快速可靠的响应。为了满足这些要求,现有系统通常采用云边协作架构,例如分区的大型视觉语言模型(LVLMs),或在大型与小型视觉语言模型(SVLMs)之间进行任务卸载的策略。然而,这些方法无法适应云端延迟波动,并忽视了延迟但准确的LVLM响应的全部潜力。在这项工作中,我们提出了一种新颖的VLM云边协作范式,称为上下文传递(Context Transfer),它将LVLM的延迟输出视为历史上下文,为SVLM的推理提供实时指导。基于这一范式,我们设计了SpotVLM,它结合了上下文替换和视觉焦点模块,用于优化历史文本输入并增强视觉定位的一致性。在四个数据集上的三个实时视觉任务上进行的大量实验表明了所提出框架的有效性。这一新范式为未来VLM系统中更有效且具备延迟感知能力的协作策略奠定了基础。
更新时间: 2025-08-18 05:51:41
领域: cs.CV,cs.AI
Quantifying Loss Aversion in Cyber Adversaries via LLM Analysis
Understanding and quantifying human cognitive biases from empirical data has long posed a formidable challenge, particularly in cybersecurity, where defending against unknown adversaries is paramount. Traditional cyber defense strategies have largely focused on fortification, while some approaches attempt to anticipate attacker strategies by mapping them to cognitive vulnerabilities, yet they fall short in dynamically interpreting attacks in progress. In recognition of this gap, IARPA's ReSCIND program seeks to infer, defend against, and even exploit attacker cognitive traits. In this paper, we present a novel methodology that leverages large language models (LLMs) to extract quantifiable insights into the cognitive bias of loss aversion from hacker behavior. Our data are collected from an experiment in which hackers were recruited to attack a controlled demonstration network. We process the hacker-generated notes with LLMs, segmenting the various actions and correlating them to predefined persistence mechanisms used by hackers. By correlating the implementation of these mechanisms with various operational triggers, our analysis provides new insights into how loss aversion manifests in hacker decision-making. The results demonstrate that LLMs can effectively dissect and interpret nuanced behavioral patterns, thereby offering a transformative approach to enhancing cyber defense strategies through real-time, behavior-based analysis.
Updated: 2025-08-18 05:51:30
标题: 通过LLM分析量化网络对手的损失规避
摘要: 从实证数据中理解和量化人类认知偏差长期以来一直是一项艰巨的挑战,在网络安全领域尤其如此,因为防御未知对手至关重要。传统的网络防御策略主要集中在加固防御上;一些方法试图通过将攻击者策略映射到认知漏洞来预测攻击者策略,但它们在动态解释正在进行的攻击方面表现不佳。鉴于这一差距,IARPA的ReSCIND项目旨在推断、防御乃至利用攻击者的认知特征。在本文中,我们提出了一种利用大型语言模型(LLMs)从黑客行为中提取关于损失厌恶这一认知偏差的可量化见解的新方法。我们的数据来自一项实验,其中招募黑客攻击一个受控的演示网络。我们使用LLMs处理黑客生成的笔记,对各种行为进行分割,并将这些行为与黑客使用的预定义持久化机制相关联。通过将这些机制的实施与各种操作触发条件相关联,我们的分析为损失厌恶如何体现在黑客决策中提供了新的见解。结果表明,LLMs可以有效地剖析和解释微妙的行为模式,从而通过实时、基于行为的分析,为增强网络防御策略提供一种变革性的方法。
更新时间: 2025-08-18 05:51:30
领域: cs.CR,cs.AI
SpikeSTAG: Spatial-Temporal Forecasting via GNN-SNN Collaboration
Spiking neural networks (SNNs), inspired by the spiking behavior of biological neurons, offer a distinctive approach for capturing the complexities of temporal data. However, their potential for spatial modeling in multivariate time-series forecasting remains largely unexplored. To bridge this gap, we introduce a brand new SNN architecture, which is among the first to seamlessly integrate graph structural learning with spike-based temporal processing for multivariate time-series forecasting. Specifically, we first embed time features and an adaptive matrix, eliminating the need for predefined graph structures. We then further learn sequence features through the Observation (OBS) Block. Building upon this, our Multi-Scale Spike Aggregation (MSSA) hierarchically aggregates neighborhood information through spiking SAGE layers, enabling multi-hop feature extraction while eliminating the need for floating-point operations. Finally, we propose a Dual-Path Spike Fusion (DSF) Block to integrate spatial graph features and temporal dynamics via a spike-gated mechanism, combining LSTM-processed sequences with spiking self-attention outputs, effectively improving model accuracy on long-sequence datasets. Experiments show that our model surpasses the state-of-the-art SNN-based iSpikformer on all datasets and outperforms traditional temporal models at long horizons, thereby establishing a new paradigm for efficient spatial-temporal modeling.
Updated: 2025-08-18 05:48:23
标题: SpikeSTAG: 基于GNN-SNN协作的时空预测
摘要: 受生物神经元尖峰行为的启发,尖峰神经网络(SNNs)提供了一种独特的方法来捕捉时间数据的复杂性。然而,在多变量时间序列预测中,它们在空间建模方面的潜力仍然大部分未被探索。为了弥补这一差距,我们引入了一种全新的SNN架构,它是第一批将图结构学习与基于尖峰的时间处理无缝集成在一起用于多变量时间序列预测的架构之一。具体地,我们首先嵌入时间特征和自适应矩阵,消除了预定义图结构的需求。然后通过Observation(OBS)块进一步学习序列特征。在此基础上,我们的Multi-Scale Spike Aggregation(MSSA)通过尖峰SAGE层逐级聚合邻域信息,实现多跳特征提取,同时消除了浮点运算的需求。最后,我们提出了双路径尖峰融合(DSF)块,通过尖峰门控机制整合空间图特征和时间动态,将经过LSTM处理的序列与尖峰自注意力输出结合,有效提高了长序列数据集上的模型准确性。实验表明,我们的模型在所有数据集上超越了最先进的基于SNN的iSpikformer,并在长预测时域上优于传统时序模型,从而确立了一种高效的时空建模新范式。
更新时间: 2025-08-18 05:48:23
领域: cs.LG,cs.AI
Surya: Foundation Model for Heliophysics
Heliophysics is central to understanding and forecasting space weather events and solar activity. Despite decades of high-resolution observations from the Solar Dynamics Observatory (SDO), most models remain task-specific and constrained by scarce labeled data, limiting their capacity to generalize across solar phenomena. We introduce Surya, a 366M parameter foundation model for heliophysics designed to learn general-purpose solar representations from multi-instrument SDO observations, including eight Atmospheric Imaging Assembly (AIA) channels and five Helioseismic and Magnetic Imager (HMI) products. Surya employs a spatiotemporal transformer architecture with spectral gating and long--short range attention, pretrained on high-resolution solar image forecasting tasks and further optimized through autoregressive rollout tuning. Zero-shot evaluations demonstrate its ability to forecast solar dynamics and flare events, while downstream fine-tuning with parameter-efficient Low-Rank Adaptation (LoRA) shows strong performance on solar wind forecasting, active region segmentation, solar flare forecasting, and EUV spectra. Surya is the first foundation model in heliophysics that uses time advancement as a pretext task on full-resolution SDO data. Its novel architecture and performance suggest that the model is able to learn the underlying physics behind solar evolution.
Updated: 2025-08-18 05:44:25
标题: 太阳:日地物理学基础模型
摘要: 太阳物理学对于理解和预测空间天气事件与太阳活动至关重要。尽管太阳动力学观测台(SDO)提供了数十年的高分辨率观测,大多数模型仍然是特定任务的,并受限于稀缺的标注数据,限制了它们对各类太阳现象的泛化能力。我们介绍了Surya,这是一个用于太阳物理学的3.66亿参数基础模型,旨在从SDO的多仪器观测中学习通用的太阳表示,包括八个大气成像组件(AIA)通道和五个日震与磁成像仪(HMI)产品。Surya采用带有谱门控和长短程注意力的时空Transformer架构,先在高分辨率太阳图像预测任务上进行预训练,再通过自回归滚动调优进一步优化。零样本评估表明它能够预测太阳动力学和耀斑事件;而使用参数高效的低秩适应(LoRA)进行下游微调后,它在太阳风预测、活动区分割、太阳耀斑预测和EUV光谱方面表现出很强的性能。Surya是太阳物理学中第一个在全分辨率SDO数据上以时间推进作为前置任务的基础模型。其新颖的架构和性能表明,该模型能够学习太阳演化背后的基础物理。
更新时间: 2025-08-18 05:44:25
领域: astro-ph.SR,astro-ph.IM,cs.AI
Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models
Natural Language to SQL (NL2SQL) has seen significant advancements with large language models (LLMs). However, these models often depend on closed-source systems and high computational resources, posing challenges in data privacy and deployment. In contrast, small language models (SLMs) struggle with NL2SQL tasks, exhibiting poor performance and incompatibility with existing frameworks. To address these issues, we introduce Feather-SQL, a new lightweight framework tailored for SLMs. Feather-SQL improves SQL executability and accuracy through 1) schema pruning and linking, 2) multi-path and multi-candidate generation. Additionally, we introduce the 1+1 Model Collaboration Paradigm, which pairs a strong general-purpose chat model with a fine-tuned SQL specialist, combining strong analytical reasoning with high-precision SQL generation. Experimental results on BIRD demonstrate that Feather-SQL improves NL2SQL performance on SLMs, with around 10% boost for models without fine-tuning. The proposed paradigm raises the accuracy ceiling of SLMs to 54.76%, highlighting its effectiveness.
Updated: 2025-08-18 05:31:41
标题: Feather-SQL: 一种轻量级的 NL2SQL 框架,采用双模型协作范式,适用于小型语言模型
摘要: 自然语言到SQL(NL2SQL)在大型语言模型(LLMs)的推动下取得了显著进展。然而,这些模型通常依赖闭源系统和高计算资源,给数据隐私和部署带来挑战。相比之下,小型语言模型(SLMs)在NL2SQL任务中表现不佳,性能差且与现有框架不兼容。为了解决这些问题,我们引入了Feather-SQL,一个专为SLMs设计的新型轻量级框架。Feather-SQL通过1)模式修剪和链接、2)多路径和多候选生成来提高SQL的可执行性和准确性。此外,我们引入了1+1模型协作范式,将一个强大的通用聊天模型与一个经过微调的SQL专家配对,结合强大的分析推理和高精度的SQL生成。在BIRD上的实验结果表明,Feather-SQL提高了SLMs上NL2SQL的性能,对于没有进行微调的模型,提升了约10%。所提出的范式将SLMs的准确性上限提高到54.76%,突显了其有效性。
更新时间: 2025-08-18 05:31:41
领域: cs.CL,cs.AI,cs.DB
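The multi-candidate generation step can be illustrated with a toy executability filter: try candidate queries in order and keep the first one that both runs and returns rows. The schema, queries, and selection rule below are invented for illustration; Feather-SQL's actual candidate ranking is more elaborate.

```python
import sqlite3

def pick_executable_sql(candidates, db):
    """Multi-candidate selection: return the first candidate query that
    both executes and returns rows -- the kind of filter that lets
    multi-path generation raise executability for small models."""
    for sql in candidates:
        try:
            rows = db.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # unexecutable candidate: try the next path
        if rows:
            return sql, rows
    return None, []

# Tiny in-memory database standing in for a BIRD-style schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
db.executemany("INSERT INTO emp VALUES (?, ?, ?)",
               [("ann", "ml", 120.0), ("bob", "db", 90.0)])

candidates = [
    "SELECT nme FROM emp WHERE salary > 100",   # typo: fails to execute
    "SELECT name FROM emp WHERE salary > 100",  # executable, non-empty
]
sql, rows = pick_executable_sql(candidates, db)
```

The broken first candidate is skipped and the second is returned, which is the executability gain the abstract attributes to multi-path, multi-candidate generation.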
From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement -- behaviors once regarded as uniquely human. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics. We unify three previously fragmented perspectives -- process-oriented, autonomy-oriented, and mechanism-oriented -- through a comprehensive framework that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across the above domains, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research.
Updated: 2025-08-18 05:25:54
标题: 从科学的人工智能到自主科学:自主科学发现调查
摘要: 人工智能(AI)正在重塑科学发现,从专门的计算工具逐渐发展为自主研究合作伙伴。我们将代理科学定位为更广泛的AI科学范式中的一个关键阶段,其中AI系统从部分辅助逐渐发展为完全科学代理。通过大型语言模型(LLMs)、多模态系统和集成研究平台的支持,代理AI在假设生成、实验设计、执行、分析和迭代改进等方面表现出能力,这些行为曾被视为人类独有。本调查提供了跨生命科学、化学、材料科学和物理学领域的自主科学发现的领域导向综述。通过一个全面的框架,我们统一了过去分散的三个视角——过程导向、自治导向和机制导向,连接了基础能力、核心过程和领域特定实现。在此框架的基础上,我们(i)追踪了AI科学的发展过程,(ii)确定了支持科学代理的五个核心能力,(iii)将发现建模为一个动态的四阶段工作流程,(iv)回顾了上述领域的应用,并(v)综合了关键挑战和未来机遇。这项工作建立了一个领域导向的自主科学发现综合,并将代理科学定位为推动基于AI的研究的结构化范式。
更新时间: 2025-08-18 05:25:54
领域: cs.LG
FlowMol3: Flow Matching for 3D De Novo Small-Molecule Generation
A generative model capable of sampling realistic molecules with desired properties could accelerate chemical discovery across a wide range of applications. Toward this goal, significant effort has focused on developing models that jointly sample molecular topology and 3D structure. We present FlowMol3, an open-source, multi-modal flow matching model that advances the state of the art for all-atom, small-molecule generation. Its substantial performance gains over previous FlowMol versions are achieved without changes to the graph neural network architecture or the underlying flow matching formulation. Instead, FlowMol3's improvements arise from three architecture-agnostic techniques that incur negligible computational cost: self-conditioning, fake atoms, and train-time geometry distortion. FlowMol3 achieves nearly 100% molecular validity for drug-like molecules with explicit hydrogens, more accurately reproduces the functional group composition and geometry of its training data, and does so with an order of magnitude fewer learnable parameters than comparable methods. We hypothesize that these techniques mitigate a general pathology affecting transport-based generative models, enabling detection and correction of distribution drift during inference. Our results highlight simple, transferable strategies for improving the stability and quality of diffusion- and flow-based molecular generative models.
Updated: 2025-08-18 05:13:27
标题: FlowMol3:用于3D全新小分子生成的流匹配
摘要: 一个能够采样具有所需性质的真实分子的生成模型,可以加速众多应用领域中的化学发现。为实现这一目标,大量工作致力于开发能够联合采样分子拓扑结构和三维结构的模型。我们提出了FlowMol3,一个开源的多模态流匹配模型,推进了全原子小分子生成的最新水平。FlowMol3相比之前的FlowMol版本取得了显著的性能提升,却没有改变图神经网络架构或底层的流匹配公式。相反,FlowMol3的改进来自三种与架构无关、计算开销几乎可以忽略的技术:自条件(self-conditioning)、虚拟原子(fake atoms)和训练时几何扰动。FlowMol3对带显式氢原子的类药分子达到了接近100%的分子有效性,更准确地复现了其训练数据的官能团组成和几何结构,且所需的可学习参数比可比方法少一个数量级。我们推测这些技术缓解了影响基于传输的生成模型的一种普遍病理,使得在推理过程中能够检测并纠正分布漂移。我们的结果突出了用于提高基于扩散和基于流的分子生成模型稳定性和质量的简单、可迁移的策略。
更新时间: 2025-08-18 05:13:27
领域: cs.LG,q-bio.BM
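The flow-matching formulation the abstract says is left unchanged reduces, in the rectified/linear case, to interpolating a data point and a noise point and regressing a network onto the constant velocity between them. This is a generic sketch of that construction, not FlowMol3's multi-modal version.

```python
def flow_matching_pair(x0, x1, t):
    """Linear (rectified) flow-matching construction for one training
    example: the noisy sample x_t interpolates noise x0 and data x1,
    and the regression target for the vector field at (x_t, t) is the
    constant velocity x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity
```

At t=0 the sample is pure noise, at t=1 pure data, and the target velocity is the same at every t, which is what makes the linear path cheap to train against.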
Note on Selection Bias in Observational Estimates of Algorithmic Progress
Ho et al. (2024) attempt to estimate the degree of algorithmic progress in language models. They collect observational data on language models' loss and compute over time, and argue that as time has passed, language models' algorithmic efficiency has been rising. That is, the loss achieved for fixed compute has been dropping over time. In this note, I raise one potential methodological problem with the estimation strategy. Intuitively, if part of algorithmic quality is latent, and compute choices are endogenous to algorithmic quality, then resulting estimates of algorithmic quality will be contaminated by selection bias.
Updated: 2025-08-18 05:12:05
标题: 关于算法进展观察性估计中选择偏差的注记
摘要: Ho等人(2024年)试图估计语言模型的算法进展程度。他们收集了语言模型随时间变化的损失和计算量的观察数据,并认为随着时间的推移,语言模型的算法效率在提高。也就是说,在固定计算量下实现的损失随时间下降。在本注记中,我指出该估计策略的一个潜在方法论问题。直观地说,如果算法质量的一部分是潜在的,而计算量的选择对算法质量是内生的,那么由此得到的算法质量估计将受到选择偏差的污染。
更新时间: 2025-08-18 05:12:05
领域: econ.GN,cs.AI,q-fin.EC
Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks
Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework's capabilities through an example implementation using DINOv2 as the foundation model with multiple task (depth, object detection and semantic segmentation) heads, achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at $\geq$50 Hz on NVIDIA Jetson Orin AGX for TensorRT optimized models.
Updated: 2025-08-18 05:11:18
标题: 视觉感知引擎:用于机器人视觉任务的快速灵活多头推理
摘要: 在资源受限的机器人平台上部署多个机器学习模型以执行不同的感知任务通常会导致冗余计算、大内存占用和复杂的集成挑战。为了解决这一问题,本文提出了Visual Perception Engine (VPEngine),这是一个模块化框架,旨在实现对视觉多任务的高效GPU利用,同时保持可扩展性和开发者的易用性。我们的框架架构利用了一个共享的基础模型骨干,提取图像表示,这些表示在多个专门的任务特定模型头并行运行时被有效地共享,而无需进行任何不必要的GPU-CPU内存传输。这种设计消除了部署传统顺序模型时固有的特征提取组件中的计算冗余,同时使得基于应用需求的动态任务优先级排序成为可能。我们通过一个示例实现来展示我们的框架的能力,该示例使用DINOv2作为基础模型,并具有多个任务(深度、目标检测和语义分割)头,与顺序执行相比,实现了最高3倍的加速。借助CUDA多进程服务(MPS),VPEngine提供了高效的GPU利用率,并在运行时允许动态调整每个任务推断频率,同时保持恒定的内存占用。该框架使用Python编写,具有ROS2 C++(Humble)绑定,以方便机器人社区在不同机器人平台上使用。我们的示例实现展示了在NVIDIA Jetson Orin AGX上以$\geq$50 Hz的端到端实时性能,使用了经过TensorRT优化的模型。
更新时间: 2025-08-18 05:11:18
领域: cs.RO,cs.AI,cs.CV,cs.LG
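The shared-backbone design can be sketched in a few lines: extract features once, then fan out only to the task heads that are currently prioritized. The backbone and heads below are trivial stand-ins, not VPEngine's DINOv2 models or its CUDA MPS machinery.

```python
def backbone(image):
    """Stand-in for a DINOv2-style feature extractor: the expensive
    step that a shared-backbone design runs once per frame."""
    return [pixel * 0.5 for pixel in image]  # pretend embedding

# Each head consumes the shared features; heads are cheap relative
# to the backbone, which is where the redundancy savings come from.
HEADS = {
    "depth": lambda f: sum(f) / len(f),
    "detection": lambda f: max(f),
    "segmentation": lambda f: [1 if v > 0.4 else 0 for v in f],
}

def perceive(image, active_heads):
    """Extract features once, then fan out to the heads that the
    current task prioritization has enabled."""
    features = backbone(image)  # shared: no per-head re-extraction
    return {name: HEADS[name](features) for name in active_heads}
```

Disabling a head simply drops it from `active_heads`, mirroring dynamic task prioritization without touching the shared feature pass.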
Embodied Long Horizon Manipulation with Closed-loop Code Generation and Incremental Few-shot Adaptation
Embodied long-horizon manipulation requires robotic systems to process multimodal inputs-such as vision and natural language-and translate them into executable actions. However, existing learning-based approaches often depend on large, task-specific datasets and struggle to generalize to unseen scenarios. Recent methods have explored using large language models (LLMs) as high-level planners that decompose tasks into subtasks using natural language and guide pretrained low-level controllers. Yet, these approaches assume perfect execution from low-level policies, which is unrealistic in real-world environments with noise or suboptimal behaviors. To overcome this, we fully discard the pretrained low-level policy and instead use the LLM to directly generate executable code plans within a closed-loop framework. Our planner employs chain-of-thought (CoT)-guided few-shot learning with incrementally structured examples to produce robust and generalizable task plans. Complementing this, a reporter evaluates outcomes using RGB-D and delivers structured feedback, enabling recovery from misalignment and replanning under partial observability. This design eliminates per-step inference, reduces computational overhead, and limits error accumulation that was observed in previous methods. Our framework achieves state-of-the-art performance on 30+ diverse seen and unseen long-horizon tasks across LoHoRavens, CALVIN, Franka Kitchen, and cluttered real-world settings.
Updated: 2025-08-18 05:04:08
标题: 具有闭环代码生成和增量式少样本适应的具身长时程操纵
摘要: 具身长时程操纵要求机器人系统处理多模态输入(如视觉和自然语言),并将其转化为可执行动作。然而,现有的基于学习的方法往往依赖于大型、特定任务的数据集,并且很难推广到未见过的情景。最近的方法探索使用大型语言模型(LLMs)作为高级规划器,利用自然语言将任务分解为子任务,并引导预训练的低级控制器。然而,这些方法假设低级策略的执行是完美的,这在带有噪声或次优行为的现实环境中是不现实的。为了克服这一问题,我们完全放弃了预训练的低级策略,而是使用LLM直接在闭环框架内生成可执行代码计划。我们的规划器采用思维链(Chain-of-Thought,CoT)引导的少样本学习,并结合增量式结构化示例,以生成稳健且可泛化的任务计划。此外,一个报告器(reporter)模块使用RGB-D观测评估执行结果并提供结构化反馈,使系统能够在部分可观测条件下从偏差中恢复并重新规划。这种设计消除了逐步推理的需要,减少了计算开销,并限制了先前方法中观察到的错误累积。我们的框架在LoHoRavens、CALVIN、Franka Kitchen和杂乱的现实世界环境中的30多个已见和未见长时程任务上实现了最先进的性能。
更新时间: 2025-08-18 05:04:08
领域: cs.RO,cs.AI
TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods
Time series are generated in diverse domains such as economic, traffic, health, and energy, where forecasting of future values has numerous important applications. Not surprisingly, many forecasting methods are being proposed. To ensure progress, it is essential to be able to study and compare such methods empirically in a comprehensive and reliable manner. To achieve this, we propose TFB, an automated benchmark for Time Series Forecasting (TSF) methods. TFB advances the state-of-the-art by addressing shortcomings related to datasets, comparison methods, and evaluation pipelines: 1) insufficient coverage of data domains, 2) stereotype bias against traditional methods, and 3) inconsistent and inflexible pipelines. To achieve better domain coverage, we include datasets from 10 different domains: traffic, electricity, energy, the environment, nature, economic, stock markets, banking, health, and the web. We also provide a time series characterization to ensure that the selected datasets are comprehensive. To remove biases against some methods, we include a diverse range of methods, including statistical learning, machine learning, and deep learning methods, and we also support a variety of evaluation strategies and metrics to ensure a more comprehensive evaluation of different methods. To support the integration of different methods into the benchmark and enable fair comparisons, TFB features a flexible and scalable pipeline that eliminates biases. Next, we employ TFB to perform a thorough evaluation of 21 Univariate Time Series Forecasting (UTSF) methods on 8,068 univariate time series and 14 Multivariate Time Series Forecasting (MTSF) methods on 25 datasets. The benchmark code and data are available at https://github.com/decisionintelligence/TFB. We have also launched an online time series leaderboard: https://decisionintelligence.github.io/OpenTS/OpenTS-Bench/.
Updated: 2025-08-18 05:01:29
标题: TFB:走向全面和公平的时间序列预测方法基准测试
摘要: 时间序列在经济、交通、健康和能源等不同领域生成,对未来数值的预测具有许多重要应用。毫无疑问,许多预测方法被提出。为了确保进展,有必要能够以全面和可靠的方式对这些方法进行经验研究和比较。为了实现这一目标,我们提出了TFB,一个自动化的时间序列预测(TSF)方法基准。TFB通过解决与数据集、比较方法和评估管道相关的缺陷,推进了技术水平:1)数据领域覆盖不足,2)对传统方法的刻板印象,3)管道不一致和不灵活。为了获得更好的领域覆盖,我们包括来自10个不同领域的数据集:交通、电力、能源、环境、自然、经济、股票市场、银行、健康和网络。我们还提供时间序列特征化,以确保所选数据集的全面性。为了消除对某些方法的偏见,我们包括各种方法,包括统计学习、机器学习和深度学习方法,同时我们还支持各种评估策略和指标,以确保对不同方法进行更全面的评估。为了支持不同方法的集成进入基准并实现公平比较,TFB具有一个灵活和可扩展的管道,消除了偏见。接下来,我们使用TFB对21种单变量时间序列预测(UTSF)方法在8,068个单变量时间序列和14种多变量时间序列预测(MTSF)方法在25个数据集上进行了彻底评估。基准代码和数据可在https://github.com/decisionintelligence/TFB获取。我们还推出了一个在线时间序列排行榜:https://decisionintelligence.github.io/OpenTS/OpenTS-Bench/。
更新时间: 2025-08-18 05:01:29
领域: cs.LG,cs.AI,cs.CY
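The fairness argument above rests on one idea: every method must be evaluated through the identical pipeline (same splits, same metrics). A toy sketch of such a uniform evaluation loop follows; the function names and the two baseline "methods" are ours, and the real TFB pipeline is far richer (multiple metrics, strategies, and dataset characterizations).

```python
def mae(y_true, y_pred):
    """Mean absolute error, one of the metrics a TSF benchmark reports."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def evaluate(methods, datasets, horizon=3):
    """Every method sees identical train/test splits and the same metric,
    the kind of uniformity a benchmark enforces to avoid pipeline bias."""
    results = {}
    for dname, series in datasets.items():
        train, test = series[:-horizon], series[-horizon:]
        for mname, method in methods.items():
            results[(mname, dname)] = mae(test, method(train, horizon))
    return results

# two baseline "methods": repeat the last value vs. the historical mean
naive = lambda train, h: [train[-1]] * h
mean_fc = lambda train, h: [sum(train) / len(train)] * h

datasets = {"toy_trend": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
res = evaluate({"naive": naive, "mean": mean_fc}, datasets)
```

Because both baselines pass through the same `evaluate`, their scores are directly comparable, which is the point the benchmark makes at scale.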
How can we trust opaque systems? Criteria for robust explanations in XAI
Deep learning (DL) algorithms are becoming ubiquitous in everyday life and in scientific research. However, the price we pay for their impressively accurate predictions is significant: their inner workings are notoriously opaque - it is unknown to laypeople and researchers alike what features of the data a DL system focuses on and how it ultimately succeeds in predicting correct outputs. A necessary criterion for trustworthy explanations is that they should reflect the relevant processes the algorithms' predictions are based on. The field of eXplainable Artificial Intelligence (XAI) presents promising methods to create such explanations. But recent reviews about their performance offer reasons for skepticism. As we will argue, a good criterion for trustworthiness is explanatory robustness: different XAI methods produce the same explanations in comparable contexts. However, in some instances, all methods may give the same, but still wrong, explanation. We therefore argue that in addition to explanatory robustness (ER), a prior requirement of explanation method robustness (EMR) has to be fulfilled by every XAI method. Conversely, the robustness of an individual method is in itself insufficient for trustworthiness. In what follows, we develop and formalize criteria for ER as well as EMR, providing a framework for explaining and establishing trust in DL algorithms. We also highlight interesting application cases and outline directions for future work.
Updated: 2025-08-18 04:38:55
标题: 我们如何信任不透明系统?XAI中鲁棒解释的标准
摘要: 深度学习(DL)算法正在日常生活和科学研究中变得无处不在。然而,我们为了它们令人印象深刻的准确预测所付出的代价是显著的:它们的内部运作是臭名昭著的不透明的 - 普通人和研究人员都不知道DL系统关注数据的哪些特征,以及它最终如何成功预测正确的输出。一个值得信赖的解释的必要条件是,它们应该反映算法预测所基于的相关过程。可解释人工智能(XAI)领域提供了创造这样解释的有希望的方法。但是关于它们性能的最近评论提供了怀疑的理由。正如我们将要论证的,一个良好的信任度标准是解释的稳健性:不同的XAI方法在可比较的情境中产生相同的解释。然而,在某些情况下,所有方法可能给出相同的,但仍然错误的解释。因此,我们认为,除了解释的稳健性(ER)之外,每个XAI方法还必须满足解释方法的稳健性(EMR)的先决条件。相反,一个单独方法的稳健性本身不足以建立信任度。在接下来的内容中,我们开发和正式化了ER和EMR的标准,为解释和建立对DL算法的信任提供了一个框架。我们还强调了有趣的应用案例,并概述了未来工作的方向。
更新时间: 2025-08-18 04:38:55
领域: cs.LG,cs.AI
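The notion of explanatory robustness (ER) above, different XAI methods producing the same explanation in comparable contexts, can be operationalized as pairwise agreement between attribution vectors. The sketch below uses invented attribution values and a simple L-infinity tolerance; it illustrates the criterion only, and, as the abstract stresses, such agreement is not by itself sufficient without explanation method robustness (EMR).

```python
def linf(a, b):
    """Largest per-feature disagreement between two attribution vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def explanatory_robustness(attributions, tol=0.1):
    """ER sketch: explanations from different XAI methods count as robust
    when every pair agrees within a tolerance."""
    vecs = list(attributions.values())
    return all(linf(vecs[i], vecs[j]) <= tol
               for i in range(len(vecs)) for j in range(i + 1, len(vecs)))

# hypothetical feature attributions for one input from three XAI methods
attrs = {"saliency": [0.70, 0.20, 0.10],
         "lime":     [0.68, 0.22, 0.10],
         "shap":     [0.72, 0.18, 0.10]}
agree = explanatory_robustness(attrs, tol=0.05)
disagree = explanatory_robustness({**attrs, "odd": [0.10, 0.80, 0.10]}, tol=0.05)
```

The paper's warning applies directly: all three methods could agree and still all be wrong, which is why EMR is posed as a prior requirement.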
SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning
Modern deep reinforcement learning (DRL) methods have made significant advances in handling continuous action spaces. However, real-world control systems--especially those requiring precise and reliable performance--often demand interpretability in the sense of a-priori assessments of agent behavior to identify safe or failure-prone interactions with environments. To address this limitation, we propose SALSA-RL (Stability Analysis in the Latent Space of Actions), a novel RL framework that models control actions as dynamic, time-dependent variables evolving within a latent space. By employing a pre-trained encoder-decoder and a state-dependent linear system, our approach enables interpretability through local stability analysis, where instantaneous growth in action-norms can be predicted before their execution. We demonstrate that SALSA-RL can be deployed in a non-invasive manner for assessing the local stability of actions from pretrained RL agents without compromising on performance across diverse benchmark environments. By enabling a more interpretable analysis of action generation, SALSA-RL provides a powerful tool for advancing the design, analysis, and theoretical understanding of RL systems.
Updated: 2025-08-18 04:36:51
标题: SALSA-RL:强化学习中动作潜空间稳定性分析
摘要: 现代深度强化学习(DRL)方法在处理连续动作空间方面取得了显著进展。然而,实际控制系统--特别是需要精确和可靠性能的系统--通常需要可解释性,即对代理行为进行先验评估,以识别与环境的安全或易出故障的相互作用。为了解决这一局限性,我们提出了SALSA-RL(动作潜空间稳定性分析),这是一个新颖的RL框架,将控制动作建模为动态、时间相关的变量,在潜在空间内演化。通过采用预先训练的编码器-解码器和状态相关的线性系统,我们的方法通过局部稳定性分析实现了可解释性,可以在执行之前预测动作范数的瞬时增长。我们展示了SALSA-RL可以以非侵入方式部署,评估预训练的RL代理的动作局部稳定性,而不会在各种基准环境中牺牲性能。通过使动作生成的分析更具可解释性,SALSA-RL为推动RL系统的设计、分析和理论理解提供了强大工具。
更新时间: 2025-08-18 04:36:51
领域: cs.LG
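The core diagnostic above, predicting instantaneous growth in action norms from a state-dependent linear system, reduces to checking the spectral radius of the local dynamics matrix. A minimal sketch for discrete-time 2x2 latent dynamics follows; the matrices are illustrative stand-ins, not outputs of SALSA-RL's learned encoder.

```python
import cmath

def spectral_radius_2x2(A):
    """Eigenvalue moduli of a 2x2 matrix via the characteristic polynomial."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return max(abs((tr + disc) / 2), abs((tr - disc) / 2))

def locally_stable(A):
    """Discrete-time latent action dynamics z_{t+1} = A(s) z_t are locally
    stable (action norms do not grow) when the spectral radius is below 1."""
    return spectral_radius_2x2(A) < 1.0

stable = locally_stable([[0.5, 0.1], [0.0, 0.3]])      # eigenvalues 0.5, 0.3
unstable = locally_stable([[1.2, 0.0], [0.0, 0.4]])    # eigenvalue 1.2 > 1
```

Because the check only needs the local dynamics matrix, it can be run before an action is executed, which matches the non-invasive, a-priori assessment the abstract describes.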
Consiglieres in the Shadow: Understanding the Use of Uncensored Large Language Models in Cybercrimes
The advancement of AI technologies, particularly Large Language Models (LLMs), has transformed computing while introducing new security and privacy risks. Prior research shows that cybercriminals are increasingly leveraging uncensored LLMs (ULLMs) as backends for malicious services. Understanding these ULLMs has been hindered by the challenge of identifying them among the vast number of open-source LLMs hosted on platforms like Hugging Face. In this paper, we present the first systematic study of ULLMs, overcoming this challenge by modeling relationships among open-source LLMs and between them and related data, such as fine-tuning, merging, compressing models, and using or generating datasets with harmful content. Representing these connections as a knowledge graph, we applied graph-based deep learning to discover over 11,000 ULLMs from a small set of labeled examples and uncensored datasets. A closer analysis of these ULLMs reveals their alarming scale and usage. Some have been downloaded over a million times, with one over 19 million installs. These models -- created through fine-tuning, merging, or compression of other models -- are capable of generating harmful content, including hate speech, violence, erotic material, and malicious code. Evidence shows their integration into hundreds of malicious applications offering services like erotic role-play, child pornography, malicious code generation, and more. In addition, underground forums reveal criminals sharing techniques and scripts to build cheap alternatives to commercial malicious LLMs. These findings highlight the widespread abuse of LLM technology and the urgent need for effective countermeasures against this growing threat.
Updated: 2025-08-18 04:35:26
标题: Consiglieres in the Shadow: 理解在网络犯罪中使用未经审查的大型语言模型
摘要: 人工智能技术的进步,特别是大型语言模型(LLMs)的发展,已经改变了计算方式,同时引入了新的安全和隐私风险。先前的研究显示,网络犯罪分子越来越多地利用未经审查的LLMs(ULLMs)作为恶意服务的后端。由于难以从Hugging Face等平台上托管的海量开源LLMs中识别出ULLMs,对它们的理解一直受到阻碍。在本文中,我们提出了对ULLMs的第一次系统研究,通过建模开源LLMs之间以及它们与相关数据(如微调、合并、压缩模型以及使用或生成具有有害内容的数据集)之间的关系,克服了这一挑战。将这些连接表示为知识图,我们应用基于图的深度学习技术,从一小部分带标签的示例和未经审查的数据集中发现了超过11,000个ULLMs。对这些ULLMs进行更详细的分析揭示了它们惊人的规模和使用情况。一些模型已经被下载了一百多万次,有一个甚至有超过1900万的安装次数。这些模型是通过微调、合并或压缩其他模型而创建的,能够生成包括仇恨言论、暴力、色情材料和恶意代码在内的有害内容。证据显示,它们被整合到数百个恶意应用程序中,提供类似色情角色扮演、儿童色情、恶意代码生成等服务。此外,地下论坛显示,犯罪分子分享技术和脚本,以构建商业恶意LLMs的廉价替代品。这些发现凸显了LLM技术的广泛滥用以及迫切需要有效对抗这一不断增长的威胁。
更新时间: 2025-08-18 04:35:26
领域: cs.CR
Kernel Ridge Regression Inference
We provide uniform confidence bands for kernel ridge regression (KRR), a widely used nonparametric regression estimator for nonstandard data such as preferences, sequences, and graphs. Despite the prevalence of these data--e.g., student preferences in school matching mechanisms--the inferential theory of KRR is not fully known. We construct valid and sharp confidence sets that shrink at nearly the minimax rate, allowing nonstandard regressors. Our bootstrap procedure uses anti-symmetric multipliers for computational efficiency and for validity under mis-specification. We use the procedure to develop a test for match effects, i.e. whether students benefit more from the schools they rank highly.
Updated: 2025-08-18 04:29:08
标题: 核岭回归推断
摘要: 我们为核岭回归(KRR)提供了一致置信带。KRR是一种广泛使用的非参数回归估计器,适用于偏好、序列和图等非标准数据。尽管这类数据普遍存在(例如,学校匹配机制中的学生偏好),KRR的推断理论尚不完全清楚。我们构建了有效且尖锐的置信集,其收缩速度几乎达到极小极大(minimax)速率,并允许非标准的回归变量。我们的自举程序使用反对称乘子以提高计算效率,并在模型设定错误(mis-specification)下保持有效性。我们使用该程序开发了一种匹配效应检验,即检验学生是否从他们排名靠前的学校中获益更多。
更新时间: 2025-08-18 04:29:08
领域: math.ST,cs.LG,stat.ML,stat.TH
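For readers unfamiliar with the estimator the confidence bands target, a plain KRR fit can be sketched in a few lines: build the kernel Gram matrix, solve the ridge system (K + lambda I) alpha = y, and predict with kernel expansions. This is only the point estimator; the paper's bootstrap band construction with anti-symmetric multipliers is out of scope here. The small dense solver and parameter values are ours.

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def krr_fit(X, y, lam=1e-3, gamma=4.0):
    """KRR with an RBF kernel: alpha = (K + lam I)^{-1} y."""
    k = lambda a, b: math.exp(-gamma * (a - b) ** 2)
    K = [[k(xi, xj) for xj in X] for xi in X]
    for i in range(len(X)):
        K[i][i] += lam
    alpha = solve(K, y)
    return lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, X))

X = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [math.sin(x) for x in X]
f = krr_fit(X, y)
```

For nonstandard regressors such as sequences or graphs, only the kernel function `k` changes; the ridge system and predictor are identical, which is what makes the inferential theory portable across data types.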
A Generalized Genetic Random Field Method for the Genetic Association Analysis of Sequencing Data
With the advance of high-throughput sequencing technologies, it has become feasible to investigate the influence of the entire spectrum of sequencing variations on complex human diseases. Although association studies utilizing the new sequencing technologies hold great promise to unravel novel genetic variants, especially rare genetic variants that contribute to human diseases, the statistical analysis of high-dimensional sequencing data remains a challenge. Advanced analytical methods are in great need to facilitate high-dimensional sequencing data analyses. In this article, we propose a generalized genetic random field (GGRF) method for association analyses of sequencing data. Like other similarity-based methods (e.g., SIMreg and SKAT), the new method has the advantages of avoiding the need to specify thresholds for rare variants and allowing for testing multiple variants acting in different directions and magnitude of effects. The method is built on the generalized estimating equation framework and thus accommodates a variety of disease phenotypes (e.g., quantitative and binary phenotypes). Moreover, it has a nice asymptotic property, and can be applied to small-scale sequencing data without need for small-sample adjustment. Through simulations, we demonstrate that the proposed GGRF attains an improved or comparable power over a commonly used method, SKAT, under various disease scenarios, especially when rare variants play a significant role in disease etiology. We further illustrate GGRF with an application to a real dataset from the Dallas Heart Study. By using GGRF, we were able to detect the association of two candidate genes, ANGPTL3 and ANGPTL4, with serum triglyceride.
Updated: 2025-08-18 04:28:48
标题: 一种广义遗传随机场方法用于测序数据的遗传关联分析
摘要: 随着高通量测序技术的进步,研究整个测序变异谱对复杂人类疾病的影响已经成为可能。尽管利用新的测序技术进行的关联研究有望揭示对人类疾病起作用的新型基因变异,特别是罕见的遗传变异,但高维测序数据的统计分析仍然是一个挑战。需要先进的分析方法来促进高维测序数据分析。在本文中,我们提出了一种广义遗传随机场(GGRF)方法,用于测序数据的关联分析。与其他基于相似性的方法(例如SIMreg和SKAT)一样,新方法具有避免指定罕见变异阈值的优势,并允许测试多个变异以不同方向和效应幅度发挥作用。该方法建立在广义估计方程框架之上,因此适应各种疾病表型(例如定量和二元表型)。此外,它具有良好的渐近性质,并且可以应用于小规模测序数据而无需进行小样本调整。通过模拟,我们证明了所提出的GGRF在各种疾病场景下取得了比常用方法SKAT更好或相当的效力,特别是当罕见变异在疾病病因中起重要作用时。我们进一步通过应用于达拉斯心脏研究的真实数据集来说明GGRF。通过使用GGRF,我们能够检测到两个候选基因ANGPTL3和ANGPTL4与血清甘油三酯的关联。
更新时间: 2025-08-18 04:28:48
领域: stat.ME,cs.AI,cs.LG
Towards SISO Bistatic Sensing for ISAC
Integrated Sensing and Communication (ISAC) is a key enabler for next-generation wireless systems. However, real-world deployment is often limited to low-cost, single-antenna transceivers. In such bistatic Single-Input Single-Output (SISO) setup, clock asynchrony introduces random phase offsets in Channel State Information (CSI), which cannot be mitigated using conventional multi-antenna methods. This work proposes WiDFS 3.0, a lightweight bistatic SISO sensing framework that enables accurate delay and Doppler estimation from distorted CSI by effectively suppressing Doppler mirroring ambiguity. It operates with only a single antenna at both the transmitter and receiver, making it suitable for low-complexity deployments. We propose a self-referencing cross-correlation (SRCC) method for SISO random phase removal and employ delay-domain beamforming to resolve Doppler ambiguity. The resulting unambiguous delay-Doppler-time features enable robust sensing with compact neural networks. Extensive experiments show that WiDFS 3.0 achieves accurate parameter estimation, with performance comparable to or even surpassing that of prior multi-antenna methods, especially in delay estimation. Validated under single- and multi-target scenarios, the extracted ambiguity-resolved features show strong sensing accuracy and generalization. For example, when deployed on the embedded-friendly MobileViT-XXS with only 1.3M parameters, WiDFS 3.0 consistently outperforms conventional features such as CSI amplitude, mirrored Doppler, and multi-receiver aggregated Doppler.
Updated: 2025-08-18 04:22:05
标题: 朝向ISAC的SISO双基地感知
摘要: 通信感知一体化(ISAC)是下一代无线系统的关键使能技术。然而,现实世界的部署通常受限于低成本的单天线收发器。在这种双基地单输入单输出(SISO)设置中,时钟不同步会在信道状态信息(CSI)中引入随机相位偏移,且无法通过传统的多天线方法来缓解。本文提出了WiDFS 3.0,一个轻量级的双基地SISO感知框架,能够通过有效抑制多普勒镜像模糊,从失真的CSI中准确估计时延和多普勒。它在发射机和接收机上各仅使用一个天线,适用于低复杂度的部署。我们提出了一种自参考互相关(SRCC)方法,用于去除SISO随机相位,并利用时延域波束成形来解决多普勒模糊。由此产生的无模糊时延-多普勒-时间特征使得使用紧凑神经网络进行稳健感知成为可能。大量实验证明,WiDFS 3.0实现了准确的参数估计,性能可与甚至超过以往的多天线方法相媲美,尤其在时延估计方面。在单目标和多目标场景下验证后,提取的消除模糊的特征显示出强大的感知准确性和泛化能力。例如,在仅有1.3M参数、适合嵌入式部署的MobileViT-XXS上部署时,WiDFS 3.0始终优于CSI幅度、镜像多普勒和多接收机聚合多普勒等传统特征。
更新时间: 2025-08-18 04:22:05
领域: eess.SP,cs.HC,cs.LG
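The SISO random-phase problem above admits a compact illustration: clock asynchrony multiplies every subcarrier of a CSI snapshot by the same unknown phase, and multiplying each subcarrier by the conjugate of a self-reference subcarrier from the same snapshot cancels it. This is a toy sketch of that core idea under our own simplifications, not the paper's full SRCC method.

```python
import cmath
import random

def srcc_remove_phase(csi, ref_idx=0):
    """Cancel a common random phase offset by correlating each subcarrier
    with a self-reference subcarrier from the same snapshot."""
    ref = csi[ref_idx]
    return [h * ref.conjugate() for h in csi]

# a toy frequency-domain channel (free of phase offset)
h = [cmath.exp(1j * 0.1 * k) * (1.0 + 0.05 * k) for k in range(8)]

# two snapshots of the same channel, corrupted by different random offsets,
# as happens when transmitter and receiver clocks are not synchronized
snapshots = []
for _ in range(2):
    theta = random.uniform(0.0, 2.0 * cmath.pi)
    snapshots.append([x * cmath.exp(1j * theta) for x in h])

a = srcc_remove_phase(snapshots[0])
b = srcc_remove_phase(snapshots[1])
```

After the operation both snapshots reduce to h[k] * conj(h[0]), so the per-snapshot random phase no longer appears, leaving delay/Doppler structure intact for downstream estimation.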
An LLM + ASP Workflow for Joint Entity-Relation Extraction
Joint entity-relation extraction (JERE) identifies both entities and their relationships simultaneously. Traditional machine-learning based approaches to performing this task require a large corpus of annotated data and lack the ability to easily incorporate domain specific information in the construction of the model. Therefore, creating a model for JERE is often labor intensive, time consuming, and elaboration intolerant. In this paper, we propose harnessing the capabilities of generative pretrained large language models (LLMs) and the knowledge representation and reasoning capabilities of Answer Set Programming (ASP) to perform JERE. We present a generic workflow for JERE using LLMs and ASP. The workflow is generic in the sense that it can be applied for JERE in any domain. It takes advantage of LLM's capability in natural language understanding in that it works directly with unannotated text. It exploits the elaboration tolerant feature of ASP in that no modification of its core program is required when additional domain specific knowledge, in the form of type specifications, is found and needs to be used. We demonstrate the usefulness of the proposed workflow through experiments with limited training data on three well-known benchmarks for JERE. The results of our experiments show that the LLM + ASP workflow is better than state-of-the-art JERE systems in several categories with only 10\% of training data. It is able to achieve a 2.5 times (35\% over 15\%) improvement in the Relation Extraction task for the SciERC corpus, one of the most difficult benchmarks.
Updated: 2025-08-18 04:15:35
标题: 一个用于联合实体关系抽取的LLM + ASP工作流程
摘要: 联合实体关系抽取(JERE)同时识别实体及其关系。传统基于机器学习的方法需要大量标注数据来执行此任务,并且缺乏在模型构建中轻松整合领域特定信息的能力。因此,为JERE创建模型通常是费力、耗时且不具备详述容忍性(elaboration intolerant)的。在本文中,我们提出利用生成式预训练大型语言模型(LLMs)的能力以及回答集编程(Answer Set Programming,ASP)的知识表示和推理能力来执行JERE。我们提出了一个使用LLMs和ASP的JERE通用工作流程。该工作流程是通用的,因为它可以应用于任何领域的JERE。它利用LLM在自然语言理解方面的能力,直接处理未标注文本。它利用ASP的详述容忍(elaboration tolerance)特性:当发现并需要使用以类型规范形式给出的额外领域特定知识时,无需修改其核心程序。我们通过在JERE的三个知名基准上使用有限训练数据进行实验,展示了所提出工作流程的实用性。实验结果显示,LLM + ASP工作流程仅使用10%的训练数据,就在多个类别上优于最先进的JERE系统。在最困难的基准之一SciERC语料库上,它在关系抽取任务上取得了2.5倍(从15%提升至35%)的改进。
更新时间: 2025-08-18 04:15:35
领域: cs.AI,cs.CL,I.2.7; F.4.1
OpenMoCap: Rethinking Optical Motion Capture under Real-world Occlusion
Optical motion capture is a foundational technology driving advancements in cutting-edge fields such as virtual reality and film production. However, system performance suffers severely under large-scale marker occlusions common in real-world applications. An in-depth analysis identifies two primary limitations of current models: (i) the lack of training datasets accurately reflecting realistic marker occlusion patterns, and (ii) the absence of training strategies designed to capture long-range dependencies among markers. To tackle these challenges, we introduce the CMU-Occlu dataset, which incorporates ray tracing techniques to realistically simulate practical marker occlusion patterns. Furthermore, we propose OpenMoCap, a novel motion-solving model designed specifically for robust motion capture in environments with significant occlusions. Leveraging a marker-joint chain inference mechanism, OpenMoCap enables simultaneous optimization and construction of deep constraints between markers and joints. Extensive comparative experiments demonstrate that OpenMoCap consistently outperforms competing methods across diverse scenarios, while the CMU-Occlu dataset opens the door for future studies in robust motion solving. The proposed OpenMoCap is integrated into the MoSen MoCap system for practical deployment. The code is released at: https://github.com/qianchen214/OpenMoCap.
Updated: 2025-08-18 04:12:13
标题: OpenMoCap:重新思考真实世界遮挡下的光学动作捕捉
摘要: 光学动作捕捉是推动虚拟现实和电影制作等领域进步的基础技术。然而,在现实世界应用中常见的大规模标记遮挡严重影响系统性能。深入分析确定了当前模型的两个主要限制:(i)缺乏准确反映现实标记遮挡模式的训练数据集,(ii)缺乏旨在捕捉标记之间远程依赖性的训练策略。为了解决这些挑战,我们引入了CMU-Occlu数据集,利用光线追踪技术实现了对实际标记遮挡模式的真实模拟。此外,我们提出了OpenMoCap,一个专为在存在重要遮挡的环境中进行稳健动作捕捉而设计的新型动作求解模型。利用标记-关节链推断机制,OpenMoCap实现了标记和关节之间的深度约束的同时优化和构建。大量比较实验表明,在各种情景下,OpenMoCap始终优于竞争方法,而CMU-Occlu数据集为未来在稳健动作求解领域的研究打开了大门。提出的OpenMoCap已集成到MoSen MoCap系统中以进行实际部署。代码发布在https://github.com/qianchen214/OpenMoCap。
更新时间: 2025-08-18 04:12:13
领域: cs.CV,cs.AI
A Self-Ensemble Inspired Approach for Effective Training of Binary-Weight Spiking Neural Networks
Spiking Neural Networks (SNNs) are a promising approach to low-power applications on neuromorphic hardware due to their energy efficiency. However, training SNNs is challenging because of the non-differentiable spike generation function. To address this issue, the commonly used approach is to adopt the backpropagation through time framework, while assigning the gradient of the non-differentiable function with some surrogates. Similarly, Binary Neural Networks (BNNs) also face the non-differentiability problem and rely on approximating gradients. However, the deep relationship between these two fields and how their training techniques can benefit each other has not been systematically researched. Furthermore, training binary-weight SNNs is even more difficult. In this work, we present a novel perspective on the dynamics of SNNs and their close connection to BNNs through an analysis of the backpropagation process. We demonstrate that training a feedforward SNN can be viewed as training a self-ensemble of a binary-activation neural network with noise injection. Drawing from this new understanding of SNN dynamics, we introduce the Self-Ensemble Inspired training method for (Binary-Weight) SNNs (SEI-BWSNN), which achieves high-performance results with low latency even for the case of the 1-bit weights. Specifically, we leverage a structure of multiple shortcuts and a knowledge distillation-based training technique to improve the training of (binary-weight) SNNs. Notably, by binarizing FFN layers in a Transformer architecture, our approach achieves 82.52% accuracy on ImageNet with only 2 time steps, indicating the effectiveness of our methodology and the potential of binary-weight SNNs.
Updated: 2025-08-18 04:11:06
标题: 一种受自集成启发的二值权重脉冲神经网络有效训练方法
摘要: 由于出色的能效,脉冲神经网络(SNNs)是在神经形态硬件上实现低功耗应用的一种有前途的方法。然而,由于脉冲生成函数不可微,训练SNNs具有挑战性。为了解决这个问题,通常的做法是采用时间反向传播(BPTT)框架,并用替代梯度来近似不可微函数的梯度。同样,二进制神经网络(BNNs)也面临不可微性问题,并依赖于梯度的近似。然而,这两个领域之间的深层关系以及它们的训练技术如何互惠互利尚未被系统研究。此外,训练二进制权重的SNNs甚至更加困难。在这项工作中,我们通过分析反向传播过程,提出了关于SNNs动态和它们与BNNs之间密切联系的新视角。我们展示了训练前馈SNN可以被视为训练具有噪声注入的二进制激活神经网络的自集成。基于对SNN动态的这种新理解,我们引入了自集成启发的(二进制权重)SNNs(SEI-BWSNN)训练方法,即使对于1位权重的情况,也能实现高性能结果和低延迟。具体来说,我们利用多重捷径(shortcut)结构和基于知识蒸馏的训练技术来改善(二进制权重)SNNs的训练。值得注意的是,在Transformer架构中将FFN层二进制化,我们的方法在ImageNet上实现了82.52%的准确率,仅需2个时间步长,表明了我们方法的有效性和二进制权重SNNs的潜力。
更新时间: 2025-08-18 04:11:06
领域: cs.NE,cs.LG
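The central observation above, that a feedforward SNN acts like a self-ensemble of a binary-activation network with noise injection, has a simple numerical signature: averaging many noisy Heaviside activations yields a smooth, sigmoid-like response. The sketch below illustrates only this averaging effect under our own toy assumptions (Gaussian noise, a scalar input); it is not the SEI-BWSNN training method.

```python
import random

def noisy_binary_act(x, sigma, rng):
    """Binary (Heaviside) activation with injected Gaussian noise."""
    return 1.0 if x + rng.gauss(0.0, sigma) > 0.0 else 0.0

def self_ensemble(x, T=4000, sigma=1.0, seed=0):
    """Average T noisy binary activations: the smooth expectation that, in
    the paper's view, a T-step feedforward SNN effectively computes."""
    rng = random.Random(seed)
    return sum(noisy_binary_act(x, sigma, rng) for _ in range(T)) / T

p0 = self_ensemble(0.0)  # approximately 0.5 in expectation
p2 = self_ensemble(2.0)  # approximately the Gaussian CDF at 2 in expectation
```

The averaged output equals P(noise > -x), a smooth function of x even though each member is non-differentiable, which is why ensembling recovers usable gradients.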
Encoding Argumentation Frameworks to Propositional Logic Systems
This paper generalizes the encoding of argumentation frameworks beyond the classical 2-valued propositional logic system ($PL_2$) to 3-valued propositional logic systems ($PL_3$s) and fuzzy propositional logic systems ($PL_{[0,1]}s$), employing two key encodings: normal encoding ($ec_1$) and regular encoding ($ec_2$). Specifically, via $ec_1$ and $ec_2$, we establish model relationships between Dung's classical semantics (stable and complete semantics) and the encoded semantics associated with Kleene's $PL_3$ and {\L}ukasiewicz's $PL_3$. Through $ec_1$, we also explore connections between Gabbay's real equational semantics and the encoded semantics of $PL_{[0,1]}s$, including showing that Gabbay's $Eq_{\text{max}}^R$ and $Eq_{\text{inverse}}^R$ correspond to the fuzzy encoded semantics of $PL_{[0,1]}^G$ and $PL_{[0,1]}^P$ respectively. Additionally, we propose a new fuzzy encoded semantics ($Eq^L$) associated with {\L}ukasiewicz's $PL_{[0,1]}$ and investigate interactions between complete semantics and fuzzy encoded semantics. This work strengthens the links between argumentation frameworks and propositional logic systems, providing a framework for constructing new argumentation semantics.
Updated: 2025-08-18 04:04:06
标题: 将论证框架编码为命题逻辑系统
摘要: 本文将论证框架的编码推广到经典的2值命题逻辑系统($PL_2$)之外的3值命题逻辑系统($PL_3$)和模糊命题逻辑系统($PL_{[0,1]}$),采用了两种关键编码方法:正常编码($ec_1$)和规则编码($ec_2$)。具体来说,通过$ec_1$和$ec_2$,我们建立了Dung的经典语义(稳定和完整语义)与Kleene的$PL_3$和{\L}ukasiewicz的$PL_3$相关的编码语义之间的模型关系。通过$ec_1$,我们还探讨了Gabbay的实等式语义与$PL_{[0,1]}$的编码语义之间的联系,包括展示了Gabbay的$Eq_{\text{max}}^R$和$Eq_{\text{inverse}}^R$分别对应于$PL_{[0,1]}^G$和$PL_{[0,1]}^P$的模糊编码语义。此外,我们提出了与{\L}ukasiewicz的$PL_{[0,1]}$相关的新的模糊编码语义($Eq^L$),并研究了完整语义和模糊编码语义之间的相互作用。这项工作加强了论证框架和命题逻辑系统之间的联系,为构建新的论证语义提供了一个框架。
更新时间: 2025-08-18 04:04:06
领域: cs.AI,math.LO,Primary 68T27, Secondary 03B70, 03B50, 03B52, 68Q55,I.2.4; F.4.1
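Dung's stable semantics, one of the two classical semantics the encodings above map into propositional models, has a short operational definition: a set of arguments is stable iff it is conflict-free and attacks every argument outside it. The stdlib check below illustrates that definition on a three-argument framework; it is background for the abstract, not the paper's encoding into $PL_3$ or fuzzy logics.

```python
def is_stable(args, attacks, S):
    """S is a stable extension iff S is conflict-free and
    every argument outside S is attacked by some member of S."""
    S = set(S)
    # conflict-free: no attack between two members of S
    if any((a, b) in attacks for a in S for b in S):
        return False
    # every outside argument must be attacked from S
    return all(any((a, b) in attacks for a in S) for b in set(args) - S)

args = {"a", "b", "c"}
attacks = {("a", "b"), ("b", "c")}   # a attacks b, b attacks c
good = is_stable(args, attacks, {"a", "c"})
bad = is_stable(args, attacks, {"a", "b"})
```

Here {a, c} is stable (conflict-free, and b is attacked by a), while {a, b} is not, since a attacks b inside the set. A propositional encoding is faithful exactly when its models pick out such sets.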
SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that the incorrect answers partially stem from verbose reasoning processes lacking correct self-fix, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the generated reasoning sequences from SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
Updated: 2025-08-18 04:02:15
标题: SSPO:自我追踪逐步偏好优化用于过程监督和推理压缩
摘要: 测试时间缩放已被证明有效地进一步提高预训练大型语言模型(LLM)的性能。然而,主流的后训练方法(即,强化学习(RL)与思维链(CoT)推理)通常由于辅助模型和过度思考而产生大量计算开销。本文实证表明,错误答案部分源于缺乏正确自我修复的冗长推理过程,导致错误在多个推理步骤中积累。为此,我们提出了自我跟踪逐步偏好优化(SSPO),这是一个可插拔的RL过程监督框架,可以实现对每个推理步骤的精细优化。具体来说,SSPO既不需要辅助模型,也不需要逐步手动注释。相反,它利用模型自身生成的逐步偏好信号来指导推理压缩的优化过程。实验证明,从SSPO生成的推理序列既准确又简洁,有效地减轻了过度思考行为,而不会损害跨不同领域和语言的模型性能。
更新时间: 2025-08-18 04:02:15
领域: cs.LG,cs.AI
A Hybrid Surrogate for Electric Vehicle Parameter Estimation and Power Consumption via Physics-Informed Neural Operators
We present a hybrid surrogate model for electric vehicle parameter estimation and power consumption. We combine our novel architecture Spectral Parameter Operator built on a Fourier Neural Operator backbone for global context and a differentiable physics module in the forward pass. From speed and acceleration alone, it outputs time-varying motor and regenerative braking efficiencies, as well as aerodynamic drag, rolling resistance, effective mass, and auxiliary power. These parameters drive a physics-embedded estimate of battery power, eliminating any separate physics-residual loss. The modular design lets representations converge to physically meaningful parameters that reflect the current state and condition of the vehicle. We evaluate on real-world logs from a Tesla Model 3, Tesla Model S, and the Kia EV9. The surrogate achieves a mean absolute error of 0.2kW (about 1% of average traction power at highway speeds) for Tesla vehicles and about 0.8kW on the Kia EV9. The framework is interpretable, and it generalizes well to unseen conditions, and sampling rates, making it practical for path optimization, eco-routing, on-board diagnostics, and prognostics health management.
Updated: 2025-08-18 04:01:42
标题: 基于物理信息神经算子的电动汽车参数估计与功耗混合代理模型
摘要: 我们提出了一个用于电动汽车参数估计和能耗的混合代理模型。我们结合了我们的新颖架构Spectral Parameter Operator,该架构是基于傅立叶神经算子骨干构建的,用于全局上下文,并在正向传递中加入了可微物理模块。仅通过速度和加速度,它输出了变化的电动机和再生制动效率,以及空气动力阻力、滚动阻力、有效质量和辅助功率。这些参数驱动一个嵌入物理学的电池功率估计,消除了任何单独的物理残差损失。模块化设计使表示能够收敛到反映车辆当前状态和条件的物理有意义的参数。我们在来自特斯拉Model 3、特斯拉Model S和Kia EV9的真实日志中进行评估。该代理模型在特斯拉车辆上达到了0.2kW的平均绝对误差(约为高速公路速度下的平均牵引功率的1%),在Kia EV9上约为0.8kW。该框架具有可解释性,并且很好地推广到未见条件和采样率,使其适用于路径优化、生态路线规划、车载诊断和预测健康管理。
更新时间: 2025-08-18 04:01:42
领域: cs.LG
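The "physics-embedded estimate of battery power" described above follows standard longitudinal vehicle dynamics: traction force from inertia, aerodynamic drag, and rolling resistance, converted to battery power through motor or regenerative-braking efficiency plus auxiliary load. The sketch below shows only that forward physics map with illustrative parameter values; in the paper these parameters and efficiencies are time-varying outputs of the neural operator, not constants.

```python
def battery_power(v, a, p):
    """Longitudinal physics sketch: speed v [m/s] and acceleration a [m/s^2]
    to battery power [W], given a parameter dict p (illustrative values)."""
    force = (p["mass"] * a                         # inertial force
             + 0.5 * 1.225 * p["CdA"] * v * v      # aerodynamic drag
             + p["Crr"] * p["mass"] * 9.81)        # rolling resistance
    mech = force * v                               # mechanical power at wheels
    if mech >= 0.0:
        return mech / p["eta_motor"] + p["aux"]    # drawing from the battery
    return mech * p["eta_regen"] + p["aux"]        # regen recovers a fraction

params = {"mass": 1850.0, "CdA": 0.58, "Crr": 0.009,
          "eta_motor": 0.9, "eta_regen": 0.6, "aux": 500.0}

p_cruise = battery_power(30.0, 0.0, params)   # steady highway cruise
p_brake = battery_power(20.0, -1.0, params)   # braking: power flows back
```

Because the surrogate outputs feed this differentiable map directly, the battery-power prediction stays physically consistent without a separate physics-residual loss term.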
WeChat-YATT: A Scalable, Simple, Efficient, and Production Ready Training Library
Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent paradigm for training large language models and multimodal systems. Despite the notable advances enabled by existing RLHF training frameworks, significant challenges remain to scale to complex multimodal workflows and adapt to dynamic workloads. In particular, current systems often encounter limitations related to controller scalability when managing large models, as well as inefficiencies in orchestrating intricate RLHF pipelines, especially in scenarios that require dynamic sampling and resource allocation. In this paper, we introduce WeChat-YATT Yet Another Transformer Trainer in WeChat, a simple, scalable, and balanced RLHF training framework specifically designed to address these challenges. WeChat-YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows, effectively mitigating bottlenecks associated with centralized controller architectures and facilitating scalability in large-scale data scenarios. In addition, we propose a dynamic placement schema that adaptively partitions computational resources and schedules workloads, thereby significantly reducing hardware idle time and improving GPU utilization under variable training conditions. We evaluate WeChat-YATT across diverse experimental scenarios, demonstrating its substantial throughput improvements over state-of-the-art RLHF training frameworks. Furthermore, WeChat-YATT has been successfully deployed to train models that support WeChat product features for a large-scale user base, underscoring its effectiveness and robustness in real-world applications. We have made WeChat-YATT publicly available at https://www.github.com/tencent/WeChat-YATT.
Updated: 2025-08-18 03:48:53
标题: 微信-YATT:一个可扩展、简单、高效且可用于生产的训练库
摘要: 人类反馈强化学习(RLHF)已经成为训练大型语言模型和多模态系统的一个重要范例。尽管现有的RLHF训练框架取得了显著进展,但在扩展到复杂的多模态工作流程并适应动态工作负载方面仍然存在重大挑战。特别是,在管理大型模型时,当前系统经常遇到与控制器可扩展性相关的限制,以及在编排复杂的RLHF管道时的低效率,特别是在需要动态采样和资源分配的场景中。在本文中,我们介绍了微信-YATT Yet Another Transformer Trainer in WeChat,这是一个简单、可扩展和平衡的RLHF训练框架,专门设计来解决这些挑战。微信-YATT具有并行控制器编程模型,可以灵活高效地编排复杂的RLHF工作流程,有效地缓解了与中心化控制器架构相关的瓶颈,并促进了大规模数据场景中的可扩展性。此外,我们提出了一种动态放置方案,自适应地分割计算资源并调度工作负载,从而显著减少硬件空闲时间,并在不同训练条件下提高GPU利用率。我们在各种实验场景中评估了微信-YATT,展示了其在最先进的RLHF训练框架上的大幅吞吐量改进。此外,微信-YATT已成功部署用于训练支持微信产品功能的模型,为大规模用户群体提供支持,突显了其在实际应用中的有效性和稳健性。我们已将微信-YATT公开发布在https://www.github.com/tencent/WeChat-YATT。
更新时间: 2025-08-18 03:48:53
领域: cs.LG,cs.AI
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
With the rapid advancement of multimodal large language models (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective jailbreak attacks poses unique challenges, especially given the highly constrained adversarial capabilities in real-world deployment scenarios. Previous works concentrate risks into a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which is black-box and consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. The multimodal risk distribution strategy is used to distribute harmful semantics into multiple modalities to effectively circumvent the single-modality protection mechanisms of MLLMs. The heuristic-induced search strategy identifies two types of prompts: the understanding-enhancing prompt, which helps MLLMs reconstruct the malicious prompt, and the inducing prompt, which increases the likelihood of affirmative outputs over refusals, enabling a successful jailbreak attack. HIMRD achieves an average attack success rate (ASR) of 90% across seven open-source MLLMs and an average ASR of around 68% in three closed-source MLLMs. HIMRD reveals cross-modal security vulnerabilities in current MLLMs and underscores the imperative for developing defensive strategies to mitigate such emerging risks. Code is available at https://github.com/MaTengSYSU/HIMRD-jailbreak.
Updated: 2025-08-18 03:40:02
标题: 针对多模态大型语言模型的启发式诱导多模态风险分布越狱攻击
Categories: cs.CR,cs.AI
A Law of Next-Token Prediction in Large Language Models
Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer -- a universal phenomenon observed across a diverse array of open-source LLMs, irrespective of their architectures or pre-training data. We demonstrate that this law offers new perspectives and actionable insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and interpretation.
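The claimed equal per-layer contribution can be checked with a simple linear-fit diagnostic. The sketch below is an illustration rather than the paper's method; the function name and the use of R-squared as the linearity score are our assumptions. It takes a per-layer next-token metric (e.g., negative log-likelihood measured with a probe at each intermediate layer) and scores how close its decrease is to a straight line in layer index:

```python
import numpy as np

def equal_contribution_score(layer_nlls):
    # If every layer improves next-token prediction by roughly the same
    # amount, the per-layer metric decreases linearly in layer index.
    # Fit a line over layer index and report the R^2 of that fit.
    y = np.asarray(layer_nlls, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A score near 1.0 indicates equal per-layer gains; a step-like profile (a few layers doing all the work) scores noticeably lower.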
Updated: 2025-08-18 03:36:50
Categories: cs.LG,cs.AI,cs.CL,stat.ML
Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias
Diagnosing deep neural networks (DNNs) by analyzing the eigenspectrum of their weights has been an active area of research in recent years. One of the main approaches involves measuring the heavy-tailedness of the empirical spectral densities (ESDs) of weight matrices. This analysis has been shown to provide insights to help diagnose whether a model is well-trained or undertrained, and has been used to guide training methods involving layer-wise hyperparameter assignment. In this paper, we address an often-overlooked challenge in estimating the heavy-tailedness of these ESDs: the impact of the aspect ratio of weight matrices. We demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in estimating the heavy-tailedness of ESDs, leading to inaccurate model diagnosis and layer-wise hyperparameter assignment. To overcome this challenge, we propose FARMS (Fixed-Aspect-Ratio Matrix Subsampling), a method that normalizes the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavy-tailedness of the original ESD, we measure the average ESD of these subsampled submatrices. We show that this method effectively mitigates the aspect ratio bias. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, FARMS uniformly improves the accuracy of eigenspectrum analysis while enabling more effective layer-wise hyperparameter assignment. In one of the LLM pruning experiments, FARMS reduces the perplexity of the LLaMA-7B model by 17.3% when compared with state-of-the-art methods.
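The core FARMS idea, averaging ESDs over fixed-aspect-ratio submatrices, can be sketched in a few lines of NumPy. The patch size, sample count, and the choice of X X^T / c as the ESD estimator below are illustrative assumptions; the paper's actual hyperparameters and heavy-tailedness estimator may differ:

```python
import numpy as np

def farms_esd(W, aspect=2.0, patch_rows=64, n_samples=16, seed=0):
    """Average ESD over fixed-aspect-ratio submatrices of weight matrix W."""
    rng = np.random.default_rng(seed)
    r, c = patch_rows, int(patch_rows * aspect)
    eigs = []
    for _ in range(n_samples):
        # Sample a random r x c submatrix (aspect ratio fixed at r/c).
        i = rng.integers(0, W.shape[0] - r + 1)
        j = rng.integers(0, W.shape[1] - c + 1)
        X = W[i:i + r, j:j + c]
        # ESD of one patch: eigenvalues of the correlation matrix X X^T / c.
        eigs.append(np.linalg.eigvalsh(X @ X.T / c))
    # Averaging over patches removes the dependence on W's own aspect ratio.
    return np.mean(eigs, axis=0)
```

Heavy-tailedness metrics (e.g., a power-law exponent fit) would then be computed on this averaged spectrum instead of the full matrix's ESD.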
Updated: 2025-08-18 03:36:09
Categories: cs.LG,cs.AI
LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval
Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, but its effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected "semantic islands", lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph's rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and then systematically traverses the graph's semantic pathways to gather concise yet contextually comprehensive evidence sets. LeanRAG mitigates the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks across different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%. Code is available at: https://github.com/RaZzzyz/LeanRAG
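Bottom-up, structure-guided retrieval can be sketched as a seeded traversal over the aggregation graph: anchor at the best-scoring fine-grained entities, then walk upward through aggregation-level summaries. The graph encoding (node to neighbor-list dict), the scoring dict, and the hop budget below are illustrative assumptions, not LeanRAG's actual data structures:

```python
def bottom_up_retrieve(graph, entity_scores, k=3, max_hops=2):
    # Anchor the query at the top-k fine-grained entities by relevance
    # score, then traverse the graph's semantic pathways upward,
    # collecting each visited node exactly once as evidence.
    seeds = sorted(entity_scores, key=entity_scores.get, reverse=True)[:k]
    evidence, frontier, seen = [], list(seeds), set()
    for _ in range(max_hops + 1):
        nxt = []
        for node in frontier:
            if node in seen:
                continue  # de-duplication keeps the evidence set concise
            seen.add(node)
            evidence.append(node)
            nxt.extend(graph.get(node, []))
        frontier = nxt
    return evidence
```

Because only seeded paths are explored and nodes are visited once, the evidence set stays small compared with a flat search over all summaries.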
Updated: 2025-08-18 03:28:12
Categories: cs.AI
Structural Foundations for Leading Digit Laws: Beyond Probabilistic Mixtures
This article presents a modern deterministic framework for the study of leading significant digit distributions in numerical data. Rather than relying on traditional probabilistic or mixture-based explanations, we demonstrate that the observed frequencies of leading digits are determined by the underlying arithmetic, algorithmic, and structural properties of the data-generating process. Our approach centers on a shift-invariant functional equation, whose general solution is given by explicit affine-plus-periodic formulas. This structural formulation explains the diversity of digit distributions encountered in both empirical and mathematical datasets, including cases with pronounced deviations from logarithmic or scale-invariant profiles. We systematically analyze digit distributions in finite and infinite datasets, address deterministic sequences such as prime numbers and recurrence relations, and highlight the emergence of block-structured and fractal features. The article provides critical examination of probabilistic models, explicit examples and counterexamples, and discusses limitations and open problems for further research. Overall, this work establishes a unified mathematical foundation for digital phenomena and offers a versatile toolset for modeling and analyzing digit patterns in applied and theoretical contexts.
Updated: 2025-08-18 03:18:10
Categories: stat.ML,cs.LG,math.ST,stat.ME,stat.TH
UAV Individual Identification via Distilled RF Fingerprints-Based LLM in ISAC Networks
Unmanned aerial vehicle (UAV) individual (ID) identification is a critical security surveillance strategy in low-altitude integrated sensing and communication (ISAC) networks. In this paper, we propose a novel dynamic knowledge distillation (KD)-enabled wireless radio frequency fingerprint large language model (RFF-LLM) framework for UAV ID identification. First, we propose an RFF-LLM framework based on the modified GPT-2 model to improve the identification accuracy in complex outdoor environments. Then, considering the parameter overhead of the RFF-LLM, we design a dynamic KD strategy to compress the model. Specifically, the proximal policy optimization (PPO) algorithm is employed to dynamically adjust the distillation temperature, overcoming the local optimum dilemma inherent in static KD. As a next step, the knowledge of the RFF-LLM is adequately transferred to the lightweight Lite-HRNet model. Finally, our experiments are conducted on our self-built drone RFF dataset, DRFF-R1 (Release 1), built by collecting the I/Q signals of 20 commercial UAVs on channel 149. The experiment results show that the proposed framework achieves 98.38% ID identification accuracy with merely 0.15 million parameters and a 2.74 ms response time, outperforming the benchmarks.
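The distillation component can be illustrated with the standard temperature-scaled KD loss. In the dynamic strategy described above, the temperature T would be chosen per step by a PPO policy; in this hedged sketch it is simply an argument, and the T-squared scaling follows conventional KD rather than the paper's exact formulation:

```python
import numpy as np

def softmax(z, T):
    # Temperature-scaled, numerically stable softmax.
    e = np.exp((np.asarray(z, dtype=float) - np.max(z)) / T)
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, T):
    # Soft-label distillation loss at temperature T: KL divergence
    # between softened teacher and student distributions, scaled by T^2.
    # Dynamic KD would let a PPO policy pick T each training step.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))
```

Higher T softens both distributions, exposing the teacher's "dark knowledge" about non-target classes; a fixed T risks the local optima that the dynamic schedule is meant to avoid.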
Updated: 2025-08-18 03:14:44
Categories: cs.CR
Constructing Invariant and Equivariant Operations by Symmetric Tensor Network
Design of neural networks that incorporate symmetry is crucial for geometric deep learning. Central to this effort is the development of invariant and equivariant operations. This work presents a systematic method for constructing valid invariant and equivariant operations. It can handle inputs and outputs in the form of Cartesian tensors with different ranks, as well as spherical tensors of different types. In addition, our method features a graphical representation utilizing the symmetric tensor network, which simplifies both the proofs and constructions related to invariant and equivariant functions. We also apply this approach to design the equivariant interaction message for the geometry graph neural network, and an equivariant machine learning model to learn the constitutive law of materials.
Updated: 2025-08-18 03:13:08
Categories: cs.LG
Generalize across Homophily and Heterophily: Hybrid Spectral Graph Pre-Training and Prompt Tuning
Graph "pre-training and prompt-tuning" aligns downstream tasks with pre-trained objectives to enable efficient knowledge transfer under limited supervision. However, existing methods rely on homophily-based low-frequency knowledge, failing to handle diverse spectral distributions in real-world graphs with varying homophily. Our theoretical analysis reveals a spectral specificity principle: optimal knowledge transfer requires alignment between pre-trained spectral filters and the intrinsic spectrum of downstream graphs. Under limited supervision, large spectral gaps between pre-training and downstream tasks impede effective adaptation. To bridge this gap, we propose the HS-GPPT model, a novel framework that ensures spectral alignment throughout both pre-training and prompt-tuning. We utilize a hybrid spectral filter backbone and local-global contrastive learning to acquire abundant spectral knowledge. Then we design prompt graphs to align the spectral distribution with pretext tasks, facilitating spectral knowledge transfer across homophily and heterophily. Extensive experiments validate the effectiveness under both transductive and inductive learning settings. Our code is available at https://anonymous.4open.science/r/HS-GPPT-62D2/.
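A toy hybrid spectral filter mixes a low-pass component (smoothing over edges, suited to homophily) with a high-pass component (sharpening differences across edges, suited to heterophily) of the symmetric normalized Laplacian. The specific filters and the single mixing weight alpha below are illustrative assumptions, not HS-GPPT's actual backbone:

```python
import numpy as np

def hybrid_spectral_filter(A, X, alpha=0.5):
    # A: (n, n) adjacency matrix; X: (n, f) node features.
    # L is the symmetric normalized graph Laplacian I - D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    high = 0.5 * (L @ X)      # high-pass: emphasizes cross-edge differences
    low = X - high            # low-pass: smooths features over neighbors
    # alpha interpolates between homophily- and heterophily-oriented filtering.
    return alpha * low + (1.0 - alpha) * high
```

Since low + high = X here, alpha = 0.5 returns X / 2; learning alpha (or a bank of such filters) lets the model match a downstream graph's spectrum.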
Updated: 2025-08-18 03:08:59
Categories: cs.LG,cs.CL
FLARE: Fast Low-rank Attention Routing Engine
The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce the Fast Low-rank Attention Routing Engine (FLARE), a linear-complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among $N$ tokens by projecting the input sequence onto a fixed-length latent sequence of $M \ll N$ tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at $O(NM)$ cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
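The latent-bottleneck routing can be sketched as two dense attention steps, each of cost O(NM) rather than O(N^2): latents gather from all N tokens, then tokens read back from the M latents. The single value projection and the weight shapes below are simplifying assumptions relative to the full multi-head design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def flare_attention(X, Q_lat, Wv):
    # X: (N, d) input tokens; Q_lat: (M, d) learnable latent queries, M << N;
    # Wv: (d, d) value projection. Both matmuls below are O(N*M*d).
    d = X.shape[1]
    # Encode: each latent attends over all N tokens -> (M, d) summary Z.
    Z = softmax(Q_lat @ X.T / np.sqrt(d)) @ (X @ Wv)
    # Decode: each token attends over the M latents -> (N, d) output.
    return softmax(X @ Z.T / np.sqrt(d)) @ Z
```

The composed token-to-token mixing matrix has rank at most M, which is the "low-rank form of attention" the bottleneck enforces.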
Updated: 2025-08-18 03:00:55
Categories: cs.LG
Physics-informed deep operator network for traffic state estimation
Traffic state estimation (TSE) fundamentally involves solving high-dimensional spatiotemporal partial differential equations (PDEs) governing traffic flow dynamics from limited, noisy measurements. While Physics-Informed Neural Networks (PINNs) enforce PDE constraints point-wise, this paper adopts a physics-informed deep operator network (PI-DeepONet) framework that reformulates TSE as an operator learning problem. Our approach trains a parameterized neural operator that maps sparse input data to the full spatiotemporal traffic state field, governed by the traffic flow conservation law. Crucially, unlike PINNs that enforce PDE constraints point-wise, PI-DeepONet integrates traffic flow conservation model and the fundamental diagram directly into the operator learning process, ensuring physical consistency while capturing congestion propagation, spatial correlations, and temporal evolution. Experiments on the NGSIM dataset demonstrate superior performance over state-of-the-art baselines. Further analysis reveals insights into optimal function generation strategies and branch network complexity. Additionally, the impact of input function generation methods and the number of functions on model performance is explored, highlighting the robustness and efficacy of proposed framework.
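The branch/trunk structure of a DeepONet surrogate, mapping sparse sensor readings plus a space-time query to a traffic state estimate, can be sketched as below. Layer sizes, the sensor count, and the plain NumPy forward pass are illustrative assumptions, and the physics-informed loss terms (conservation law, fundamental diagram) that PI-DeepONet adds during training are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(sizes):
    # Random small-weight initialization for a tanh MLP.
    return [(rng.normal(0.0, 0.5, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

# Branch net encodes m sparse sensor measurements of the traffic state;
# trunk net encodes a space-time query point (x, t); the estimate is the
# inner product of the two p-dimensional embeddings.
m, p = 10, 16
branch = mlp_init([m, 32, p])
trunk = mlp_init([2, 32, p])

def deeponet(u_sensors, query_points):
    b = mlp(branch, u_sensors)       # (p,)  encoding of the input function
    t = mlp(trunk, query_points)     # (n_points, p) encoding of (x, t) queries
    return t @ b                     # (n_points,) predicted traffic state
```

Training would minimize data misfit plus PDE residuals of the conservation law evaluated at collocation points, which is what distinguishes PI-DeepONet from a purely data-driven operator.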
Updated: 2025-08-18 02:59:42
Categories: cs.LG
Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning
Traditional Automated Speaking Assessment (ASA) systems exhibit inherent modality limitations: text-based approaches lack acoustic information, while audio-based methods miss semantic context. Multimodal Large Language Models (MLLM) offer unprecedented opportunities for comprehensive ASA by simultaneously processing audio and text within unified frameworks. This paper presents the first systematic study of MLLM for comprehensive ASA, demonstrating the superior performance of MLLM across the aspects of content and language use. However, assessment on the delivery aspect reveals unique challenges, which are deemed to require specialized training strategies. We thus propose Speech-First Multimodal Training (SFMT), leveraging a curriculum learning principle to establish more robust modeling foundations of speech before cross-modal synergetic fusion. A series of experiments on a benchmark dataset show that MLLM-based systems can elevate the holistic assessment performance from a PCC value of 0.783 to 0.846. In particular, SFMT excels in the evaluation of the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches, which also paves a new avenue for ASA.
Updated: 2025-08-18 02:57:43
Categories: cs.CL,cs.AI,cs.SD
Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding
To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLM) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLM have primarily focused on improving accuracy and latency, while often overlooking communication and energy efficiency. We propose a token-level filtering mechanism for energy-efficient, importance- and uncertainty-aware HLM inference that leverages both epistemic uncertainty and attention-based importance. Our method opportunistically uploads only informative tokens, reducing LLM usage and communication costs. Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to 87.5% BERTScore and a token throughput of 0.37 tokens/sec while reducing energy consumption by 40.7% compared to standard HLM. Furthermore, compared to our previous U-HLM baseline, our method improves BERTScore from 85.8% to 87.0%, energy savings from 31.6% to 43.6%, and throughput from 0.36 to 0.40. This approach enables an energy-efficient and accurate deployment of LLMs in bandwidth-constrained edge environments.
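The token-level filter can be sketched as a joint threshold on the local model's predictive uncertainty and each token's attention-based importance. The AND rule and the threshold values below are illustrative assumptions, not the paper's exact selection criterion:

```python
import numpy as np

def select_tokens_to_upload(entropies, attn_importance,
                            unc_thresh=1.5, imp_thresh=0.6):
    # Upload a locally drafted token to the cloud LLM only when the
    # local model is uncertain about it (high predictive entropy) AND
    # the token carries weight downstream (high aggregated attention).
    # Confident or unimportant tokens are kept local, saving uplink
    # bandwidth and cloud-LLM energy.
    unc = np.asarray(entropies)
    imp = np.asarray(attn_importance)
    return np.flatnonzero((unc > unc_thresh) & (imp > imp_thresh))
```

In a speculative-decoding loop, only the returned indices would trigger a cloud verification call; all other draft tokens are accepted as-is.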
Updated: 2025-08-18 02:56:59
Categories: cs.LG,cs.AI
HQ-OV3D: A High Box Quality Open-World 3D Detection Framework based on Diffusion Model
Traditional closed-set 3D detection frameworks fail to meet the demands of open-world applications like autonomous driving. Existing open-vocabulary 3D detection methods typically adopt a two-stage pipeline consisting of pseudo-label generation followed by semantic alignment. While vision-language models (VLMs) have recently dramatically improved the semantic accuracy of pseudo-labels, their geometric quality, particularly bounding box precision, remains commonly neglected. To address this issue, we propose a High Box Quality Open-Vocabulary 3D Detection (HQ-OV3D) framework, dedicated to generating and refining high-quality pseudo-labels for open-vocabulary classes. The framework comprises two key components: an Intra-Modality Cross-Validated (IMCV) Proposal Generator that utilizes cross-modality geometric consistency to generate high-quality initial 3D proposals, and an Annotated-Class Assisted (ACA) Denoiser that progressively refines 3D proposals by leveraging geometric priors from annotated categories through a DDIM-based denoising mechanism. Compared to the state-of-the-art method, training with pseudo-labels generated by our approach achieves a 7.37% improvement in mAP on novel classes, demonstrating the superior quality of the pseudo-labels produced by our framework. HQ-OV3D can serve not only as a strong standalone open-vocabulary 3D detector but also as a plug-in high-quality pseudo-label generator for existing open-vocabulary detection or annotation pipelines.
Updated: 2025-08-18 02:50:31
Categories: cs.CV,cs.LG,cs.RO
Reducing False Positives with Active Behavioral Analysis for Cloud Security
Rule-based cloud security posture management (CSPM) solutions are known to produce many false positives owing to their limited contextual understanding and dependence on static heuristic testing. This paper introduces a validation-driven methodology that integrates active behavioral testing into CSPM solutions to evaluate the exploitability of policy violations in real time. The proposed system employs lightweight, automated probes, built from open-source tools, validation scripts, and penetration testing test cases, to simulate adversarial attacks on misconfigured or vulnerable cloud assets without any impact on the cloud services or environment. For instance, cloud services may be flagged as publicly exposed and vulnerable despite being protected by access control layers or secure policies, resulting in non-actionable alerts that consume analysts' time during manual validation. Through controlled experimentation in a reproducible AWS setup, we evaluated the reduction in false positive rates across various misconfiguration and vulnerability alerts. Our findings indicate an average reduction of 93% in false positives. Furthermore, the framework demonstrates low-latency performance. These results demonstrate a scalable method to improve detection accuracy and analyst productivity in large cloud environments. While our evaluation focuses on AWS, the architecture is modular and extensible to multi-cloud setups.
Updated: 2025-08-18 02:39:02
Categories: cs.CR
Widening the Network Mitigates the Impact of Data Heterogeneity on FedAvg
Federated learning (FL) enables decentralized clients to train a model collaboratively without sharing local data. A key distinction between FL and centralized learning is that clients' data are non-independent and identically distributed, which poses significant challenges in training a global model that generalizes well across heterogeneous local data distributions. In this paper, we analyze the convergence of overparameterized FedAvg with gradient descent (GD). We prove that the impact of data heterogeneity diminishes as the width of neural networks increases, ultimately vanishing when the width approaches infinity. In the infinite-width regime, we further prove that both the global and local models in FedAvg behave as linear models, and that FedAvg achieves the same generalization performance as centralized learning with the same number of GD iterations. Extensive experiments validate our theoretical findings across various network architectures, loss functions, and optimization methods.
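With one local full-batch GD step per round, FedAvg's server average coincides with a centralized GD step on the mean gradient, a minimal sketch of the kind of FedAvg-vs-centralized equivalence the infinite-width analysis formalizes. The single-local-step setting is our simplifying assumption; the paper's analysis covers the general overparameterized regime:

```python
import numpy as np

def fedavg_round(global_w, client_grads, lr=0.1):
    # One FedAvg round with a single full-batch GD step per client:
    # every client starts from the shared global weights, takes a local
    # gradient step on its own (heterogeneous) data, and the server
    # averages the resulting models.
    local = [np.asarray(global_w) - lr * np.asarray(g) for g in client_grads]
    return np.mean(local, axis=0)
```

Because averaging one-step models equals stepping on the averaged gradient, data heterogeneity only matters through how local gradients diverge over multiple local steps, which the width of the network is shown to suppress.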
Updated: 2025-08-18 02:22:55
Categories: cs.LG,cs.AI
Deep Learning Model for Amyloidogenicity Prediction using a Pre-trained Protein LLM
The prediction of amyloidogenicity in peptides and proteins remains a focal point of ongoing bioinformatics research. The crucial step in this field is to apply advanced computational methodologies. Many recent approaches to predicting amyloidogenicity within proteins rely heavily on evolutionary motifs and the individual properties of amino acids. It is becoming increasingly evident that sequence-based features show high predictive performance. Consequently, our study evaluated the contextual features of protein sequences obtained from a pretrained protein large language model, leveraging bidirectional LSTM and GRU to predict amyloidogenic regions in peptide and protein sequences. Our method achieved an accuracy of 84.5% on 10-fold cross-validation and an accuracy of 83% on the test dataset. Our results demonstrate competitive performance, highlighting the potential of LLMs in enhancing the accuracy of amyloid prediction.
Updated: 2025-08-18 02:21:48
Categories: cs.LG,cs.AI,q-bio.QM
MAGIK: Mapping to Analogous Goals via Imagination-enabled Knowledge Transfer
Humans excel at analogical reasoning - applying knowledge from one task to a related one with minimal relearning. In contrast, reinforcement learning (RL) agents typically require extensive retraining even when new tasks share structural similarities with previously learned ones. In this work, we propose MAGIK, a novel framework that enables RL agents to transfer knowledge to analogous tasks without interacting with the target environment. Our approach leverages an imagination mechanism to map entities in the target task to their analogues in the source domain, allowing the agent to reuse its original policy. Experiments on custom MiniGrid and MuJoCo tasks show that MAGIK achieves effective zero-shot transfer using only a small number of human-labelled examples. We compare our approach to related baselines and highlight how it offers a novel and effective mechanism for knowledge transfer via imagination-based analogy mapping.
Updated: 2025-08-18 02:21:44
Categories: cs.AI,cs.LG
Cyber Risks to Next-Gen Brain-Computer Interfaces: Analysis and Recommendations
Brain-computer interfaces (BCIs) show enormous potential for advancing personalized medicine. However, BCIs also introduce new avenues for cyber-attacks or security compromises. In this article, we analyze the problem and make recommendations for device manufacturers to better secure devices and to help regulators understand where more guidance is needed to protect patient safety and data confidentiality. Device manufacturers should implement these recommendations in their BCI products; doing so helps protect BCI users from undue risks, including compromised personal health and genetic information, unintended BCI-mediated movement, and many other cybersecurity breaches. Regulators should mandate non-surgical device update methods, strong authentication and authorization schemes for BCI software modifications, and encryption of data moving to and from the brain, and should minimize network connectivity where possible. We also design a hypothetical, average-case threat model that identifies possible cybersecurity threats to BCI patients and predicts the likelihood of risk for each category of threat. BCIs are at less risk of physical compromise or attack, but are vulnerable to remote attack; we focus on possible threats via network paths to BCIs and suggest technical controls to limit network connections.
Updated: 2025-08-18 02:12:45
Categories: cs.CR,cs.CY,cs.ET,cs.HC,cs.NE
Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems
Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.
Updated: 2025-08-18 02:10:18
标题: 数据驱动的粒子动力学:用于非平衡系统中涌现行为的保结构粗粒化方法
摘要: 多尺度系统在科学和技术中无处不在,但由于短时空尺度必须适当地与涌现的整体物理联系起来,因此模拟起来极具挑战性。当昂贵的高维动力系统被粗粒化为低维模型时,信息的熵损失会导致耗散、依赖历史且具有随机性的涌现物理。为了从粒子轨迹的时间序列观测中机器学习粗粒化动力学,我们提出了一个使用metriplectic bracket形式主义的框架,通过构造保留这些特性;最重要的是,该框架保证了热力学第一和第二定律的离散形式、动量守恒,以及对捕捉非平衡统计至关重要的离散涨落-耗散平衡。我们先抽象地介绍数学框架,然后将其专门化为粒子离散化。由于通常无法为熵态变量提供标签,我们引入了一种新颖的自监督学习策略来识别涌现的结构变量。我们在基准系统上验证了该方法,并在两个具有挑战性的示例上展示了它的实用性:(1)在具有挑战性的粗粒化水平下对星形聚合物进行粗粒化,同时保持非平衡统计;(2)从胶体悬浊液的高速视频中学习模型,捕捉局部重排事件与涌现随机动力学之间的耦合。我们提供了PyTorch和LAMMPS的开源实现,支持大规模推断以及对各种基于粒子的系统的可扩展性。
更新时间: 2025-08-18 02:10:18
领域: cs.LG,cs.CE,physics.comp-ph,stat.ML
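The discrete fluctuation-dissipation balance the abstract highlights can be illustrated with a minimal, generic sketch (not the paper's metriplectic implementation): in an overdamped Langevin update, the noise amplitude must be tied to the friction coefficient and temperature, otherwise the coarse-grained model equilibrates at the wrong temperature.

```python
import numpy as np

def overdamped_langevin(n_steps=200_000, dt=1e-2, kT=0.5, gamma=1.0, seed=0):
    """Euler-Maruyama for dx = -gamma * x dt + sqrt(2 gamma kT) dW.

    The noise amplitude sqrt(2 * gamma * kT * dt) is fixed by the
    fluctuation-dissipation balance; any other choice gives a stationary
    variance different from kT (for the harmonic potential U = x^2 / 2).
    """
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_steps)
    noise_amp = np.sqrt(2.0 * gamma * kT * dt)
    for i in range(n_steps):
        x += -gamma * x * dt + noise_amp * rng.standard_normal()
        samples[i] = x
    return samples

samples = overdamped_langevin()
print(np.var(samples[10_000:]))  # stationary variance, close to kT = 0.5
```

Breaking the balance (e.g. halving the noise while keeping the friction) drives the stationary variance away from kT, which is exactly the failure mode the framework's construction rules out.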
Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models
The Model Context Protocol (MCP) enables large language models (LLMs) to access external resources on demand. While commonly assumed to enhance performance, how LLMs actually leverage this capability remains poorly understood. We introduce MCPGAUGE, the first comprehensive evaluation framework for probing LLM-MCP interactions along four key dimensions: proactivity (self-initiated tool use), compliance (adherence to tool-use instructions), effectiveness (task performance post-integration), and overhead (computational cost incurred). MCPGAUGE comprises a 160-prompt suite and 25 datasets spanning knowledge comprehension, general reasoning, and code generation. Our large-scale evaluation, spanning six commercial LLMs, 30 MCP tool suites, and both one- and two-turn interaction settings, comprises around 20,000 API calls and over USD 6,000 in computational cost. This comprehensive study reveals four key findings that challenge prevailing assumptions about the effectiveness of MCP integration. These insights highlight critical limitations in current AI-tool integration and position MCPGAUGE as a principled benchmark for advancing controllable, tool-augmented LLMs.
Updated: 2025-08-18 02:06:05
标题: 帮助还是障碍?重新思考模型上下文协议增强的大型语言模型
摘要: 模型上下文协议(MCP)使大型语言模型(LLMs)能够按需访问外部资源。虽然通常认为这可以提高性能,但LLMs实际上如何利用这种能力仍不清楚。我们引入了MCPGAUGE,这是第一个全面的评估框架,用于沿四个关键维度探究LLM与MCP的交互:主动性(自主发起的工具使用)、遵从性(遵守工具使用说明)、有效性(集成后的任务表现)和开销(产生的计算成本)。MCPGAUGE包括一个160个提示的套件和25个数据集,涵盖知识理解、一般推理和代码生成。我们的大规模评估涵盖了六个商业LLMs、30个MCP工具套件以及单轮和双轮交互设置,涉及约20,000次API调用和超过6,000美元的计算成本。这项全面研究揭示了四个关键发现,挑战了关于MCP集成有效性的普遍假设。这些见解突显了当前AI工具集成中的关键局限,并将MCPGAUGE定位为推进可控的、工具增强的LLMs的原则性基准。
更新时间: 2025-08-18 02:06:05
领域: cs.AI
AtmosMJ: Revisiting Gating Mechanism for AI Weather Forecasting Beyond the Year Scale
The advent of Large Weather Models (LWMs) has marked a turning point in data-driven forecasting, with many models now outperforming traditional numerical systems in the medium range. However, achieving stable, long-range autoregressive forecasts beyond a few weeks remains a significant challenge. Prevailing state-of-the-art models that achieve year-long stability, such as SFNO and DLWP-HPX, have relied on transforming input data onto non-standard spatial domains like spherical harmonics or HEALPix meshes. This has led to the prevailing assumption that such representations are necessary to enforce physical consistency and long-term stability. This paper challenges that assumption by investigating whether comparable long-range performance can be achieved on the standard latitude-longitude grid. We introduce AtmosMJ, a deep convolutional network that operates directly on ERA5 data without any spherical remapping. The model's stability is enabled by a novel Gated Residual Fusion (GRF) mechanism, which adaptively moderates feature updates to prevent error accumulation over long recursive simulations. Our results demonstrate that AtmosMJ produces stable and physically plausible forecasts for about 500 days. In quantitative evaluations, it achieves competitive 10-day forecast accuracy against models like Pangu-Weather and GraphCast, all while requiring a remarkably low training budget of 5.7 days on a V100 GPU. Our findings suggest that efficient architectural design, rather than non-standard data representation, can be the key to unlocking stable and computationally efficient long-range weather prediction.
Updated: 2025-08-18 02:03:00
标题: AtmosMJ: 重新审视人工智能天气预报的门控机制,超越年度尺度
摘要: 大型天气模型(LWMs)的出现标志着数据驱动预报的一个转折点,许多模型在中期预报上已超越传统数值系统。然而,实现超过几周的稳定长程自回归预报仍然是一个重大挑战。现有实现全年稳定性的先进模型,如SFNO和DLWP-HPX,依赖于将输入数据变换到非标准的空间域,如球谐函数或HEALPix网格。这导致了一个普遍的假设,即必须采用此类表示才能强制物理一致性和长期稳定性。本文通过考察在标准经纬度网格上是否可以实现可比的长程性能来挑战这一假设。我们介绍了AtmosMJ,一种直接在ERA5数据上运行的深度卷积网络,无需任何球面重映射。该模型的稳定性得益于一种新颖的门控残差融合(GRF)机制,该机制自适应地调节特征更新,以防止在长时间递归模拟中累积误差。我们的结果表明,AtmosMJ可以产生约500天的稳定且物理上可信的预报。在定量评估中,它相对于Pangu-Weather和GraphCast等模型取得了具有竞争力的10天预报精度,同时只需要在一块V100 GPU上训练5.7天这一极低的训练预算。我们的研究结果表明,高效的架构设计,而非非标准的数据表示,可能是实现稳定且计算高效的长程天气预报的关键。
更新时间: 2025-08-18 02:03:00
领域: cs.LG,cs.AI,cs.CV,physics.ao-ph
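The abstract does not spell out the GRF equations, so the following is a hypothetical numpy sketch of a generic gated residual update, showing the mechanism in question: a learned gate near zero yields a near-identity step, which is what damps feature updates and limits error accumulation over long autoregressive rollouts.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual_update(h, W_f, W_g, b_g):
    """One gated residual step: h <- h + g * f(h), with g = sigmoid(h @ W_g + b_g).

    Illustrative only; AtmosMJ's exact GRF formulation is not given in the
    abstract. The gate g lies in (0, 1) and scales the candidate update,
    so g ~ 0 makes the step near-identity.
    """
    f = np.tanh(h @ W_f)            # candidate feature update
    g = sigmoid(h @ W_g + b_g)      # per-feature gate in (0, 1)
    return h + g * f

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))     # e.g. 4 grid points, 8 channels
W_f = 0.1 * rng.standard_normal((8, 8))
W_g = np.zeros((8, 8))

open_step = gated_residual_update(h, W_f, W_g, b_g=+10.0)    # gate ~ 1
closed_step = gated_residual_update(h, W_f, W_g, b_g=-10.0)  # gate ~ 0
print(np.max(np.abs(closed_step - h)))  # near zero: update suppressed
print(np.max(np.abs(open_step - h)))    # full-size update passes through
```

In practice the gate parameters are learned end-to-end, letting the network decide per feature how much of each recursive update to admit.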
Uncertainty-Aware Learning Policy for Reliable Pulmonary Nodule Detection on Chest X-Ray
Early detection and rapid intervention of lung cancer are crucial. Nonetheless, ensuring an accurate diagnosis is challenging, as physicians' ability to interpret chest X-rays varies significantly depending on their experience and degree of fatigue. Although medical AI has been rapidly advancing to assist in diagnosis, physicians' trust in such systems remains limited, preventing widespread clinical adoption. This skepticism fundamentally stems from concerns about its diagnostic uncertainty. In clinical diagnosis, physicians utilize extensive background knowledge and clinical experience. In contrast, medical AI primarily relies on repetitive learning of the target lesion to generate diagnoses based solely on that data. In other words, medical AI does not possess sufficient knowledge to render a diagnosis, leading to diagnostic uncertainty. Thus, this study suggests an Uncertainty-Aware Learning Policy that can address the issue of knowledge deficiency by learning the physicians' background knowledge alongside the Chest X-ray lesion information. We used 2,517 lesion-free images and 656 nodule images, all obtained from Ajou University Hospital. The proposed model attained 92% (IoU 0.2 / FPPI 2) with a 10% enhancement in sensitivity compared to the baseline model while also decreasing entropy as a measure of uncertainty by 0.2.
Updated: 2025-08-18 01:58:57
标题: 不确定性感知学习策略用于可靠的胸部X射线肺结节检测
摘要: 肺癌的早期检测和快速干预至关重要。然而,确保准确诊断具有挑战性,因为医生解读胸部X光片的能力因其经验和疲劳程度而差异显著。尽管医疗人工智能在辅助诊断方面迅速发展,但医生对这些系统的信任仍然有限,阻碍了其广泛的临床采用。这种怀疑从根本上源于对其诊断不确定性的担忧。在临床诊断中,医生利用广泛的背景知识和临床经验。相比之下,医疗人工智能主要依赖于对目标病变的重复学习,仅基于这些数据生成诊断。换句话说,医疗人工智能不具备足够的知识来做出诊断,从而导致诊断的不确定性。因此,本研究提出了一种不确定性感知学习策略,通过同时学习医生的背景知识和胸部X光病变信息来解决知识不足的问题。我们使用了来自亚洲大学(Ajou University)医院的2,517张无病变图像和656张结节图像。所提出的模型达到了92%(IoU 0.2 / FPPI 2),灵敏度比基线模型提高了10%,同时还将作为不确定性度量的熵降低了0.2。
更新时间: 2025-08-18 01:58:57
领域: eess.IV,cs.AI,cs.CV
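The 0.2 entropy reduction reported above refers to predictive uncertainty; for a categorical (softmax) output this is the standard Shannon entropy of the predicted distribution, sketched below.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of a categorical predictive distribution, in nats.

    Low entropy = confident prediction; the maximum for k classes is ln(k).
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return float(-np.sum(p * np.log(p)))

confident = predictive_entropy([0.95, 0.05])  # low uncertainty, ~0.199
uncertain = predictive_entropy([0.5, 0.5])    # maximal for 2 classes: ln 2
print(confident, uncertain)
```

A model that "knows more" about a lesion should, on average, shift its predictions toward the confident end of this scale, which is what the reported entropy drop quantifies.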
Deep Learning-Based Financial Time Series Forecasting via Sliding Window and Variational Mode Decomposition
To address the complexity of financial time series, this paper proposes a forecasting model combining sliding window and variational mode decomposition (VMD) methods. Historical stock prices and relevant market indicators are used to construct datasets. VMD decomposes non-stationary financial time series into smoother subcomponents, improving model adaptability. The decomposed data is then input into a deep learning model for prediction. The study compares the forecasting effects of an LSTM model trained on VMD-processed sequences with those using raw time series, demonstrating better performance and stability.
Updated: 2025-08-18 01:56:31
标题: 基于滑动窗口和变分模式分解的深度学习金融时间序列预测
摘要: 为了解决金融时间序列的复杂性,本文提出了一种结合滑动窗口和变分模态分解(VMD)方法的预测模型。利用历史股票价格和相关市场指标构建数据集。VMD将非平稳金融时间序列分解成更平滑的子组件,提高了模型的适应性。然后将分解的数据输入深度学习模型进行预测。研究比较了在经过VMD处理的序列上训练的LSTM模型和使用原始时间序列进行预测的效果,表明前者具有更好的性能和稳定性。
更新时间: 2025-08-18 01:56:31
领域: cs.LG
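VMD itself requires a dedicated solver (e.g. a third-party package such as `vmdpy`, named here only as an assumption), but the sliding-window dataset construction the paper pairs with it is straightforward:

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Turn a 1-D series into (X, y) pairs for supervised forecasting.

    X[i] holds `window` consecutive values; y[i] is the value `horizon`
    steps after the window ends. Each VMD subcomponent would be windowed
    the same way before being fed to the LSTM.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - window - horizon + 1
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window + horizon - 1: window + horizon - 1 + n]
    return X, y

prices = np.arange(10.0)   # stand-in for a stock-price series
X, y = sliding_windows(prices, window=3)
print(X.shape, y.shape)    # (7, 3) (7,)
print(X[0], y[0])          # [0. 1. 2.] 3.0
```

The design point is that decomposing first and windowing second gives the LSTM smoother, more stationary inputs than raw prices.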
Data-driven Trust Bootstrapping for Mobile Edge Computing-based Industrial IoT Services
We propose a data-driven and context-aware approach to bootstrap trustworthiness of homogeneous Internet of Things (IoT) services in Mobile Edge Computing (MEC) based industrial IoT (IIoT) systems. The proposed approach addresses key limitations in adapting existing trust bootstrapping approaches into MEC-based IIoT systems. These key limitations include, the lack of opportunity for a service consumer to interact with a lesser-known service over a prolonged period of time to get a robust measure of its trustworthiness, inability of service consumers to consistently interact with their peers to receive reliable recommendations of the trustworthiness of a lesser-known service as well as the impact of uneven context parameters in different MEC environments causing uneven trust environments for trust evaluation. In addition, the proposed approach also tackles the problem of data sparsity via enabling knowledge sharing among different MEC environments within a given MEC topology. To verify the effectiveness of the proposed approach, we carried out a comprehensive evaluation on two real-world datasets suitably adjusted to exhibit the context-dependent trust information accumulated in MEC environments within a given MEC topology. The experimental results affirmed the effectiveness of our approach and its suitability to bootstrap trustworthiness of services in MEC-based IIoT systems.
Updated: 2025-08-18 01:37:34
标题: 基于移动边缘计算的工业物联网服务的数据驱动信任引导
摘要: 我们提出了一种数据驱动和上下文感知的方法,用于在基于移动边缘计算(MEC)的工业物联网(IIoT)系统中引导同质物联网(IoT)服务的可信度。所提出的方法解决了将现有的信任引导方法应用于基于MEC的IIoT系统时的关键限制。这些关键限制包括,服务消费者缺乏与较不知名的服务长时间互动以获得其可信度的稳健度量的机会,服务消费者无法与同行一致地互动以获得较不知名服务可信度的可靠推荐,以及不同MEC环境中不均匀上下文参数的影响导致了信任评估的不均匀信任环境。此外,所提出的方法还通过在给定MEC拓扑结构中不同MEC环境之间启用知识共享来解决数据稀疏性问题。为验证所提出方法的有效性,我们对两个经过适当调整以展示在给定MEC拓扑结构内积累的依赖上下文的信任信息的真实世界数据集进行了全面评估。实验结果证实了我们的方法的有效性,以及其适用于在基于MEC的IIoT系统中引导服务的可信度。
更新时间: 2025-08-18 01:37:34
领域: cs.CR,cs.DC,cs.LG,C.2; C.4; I.2
Illuminating LLM Coding Agents: Visual Analytics for Deeper Understanding and Enhancement
Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain, AutoML, and AIDE, ML scientists still struggle to effectively review and adjust the agents' coding process. The current approach of manually inspecting individual outputs is inefficient, making it difficult to track code evolution, compare coding iterations, and identify improvement opportunities. To address this challenge, we introduce a visual analytics system designed to enhance the examination of coding agent behaviors. Focusing on the AIDE framework, our system supports comparative analysis across three levels: (1) Code-Level Analysis, which reveals how the agent debugs and refines its code over iterations; (2) Process-Level Analysis, which contrasts different solution-seeking processes explored by the agent; and (3) LLM-Level Analysis, which highlights variations in coding behavior across different LLMs. By integrating these perspectives, our system enables ML scientists to gain a structured understanding of agent behaviors, facilitating more effective debugging and prompt engineering. Through case studies using coding agents to tackle popular Kaggle competitions, we demonstrate how our system provides valuable insights into the iterative coding process.
Updated: 2025-08-18 01:17:11
标题: 揭示LLM编码代理:用于更深入理解和增强的视觉分析
摘要: 由大型语言模型(LLMs)驱动的编码代理已经受到关注,它们通过迭代解决问题来自动化代码生成,只需极少的人类参与。尽管出现了各种框架,例如LangChain、AutoML和AIDE,机器学习科学家仍然难以有效地审查和调整代理的编码过程。目前手动检查单个输出的方法效率低下,使得跟踪代码演变、比较编码迭代和识别改进机会变得困难。为了解决这一挑战,我们引入了一个旨在增强对编码代理行为检查的视觉分析系统。针对AIDE框架,我们的系统支持三个层面的比较分析:(1)代码级别分析,揭示代理如何在迭代中调试和完善其代码;(2)过程级别分析,对比代理探索的不同求解过程;以及(3)LLM级别分析,突出不同LLMs之间编码行为的差异。通过整合这些视角,我们的系统使机器学习科学家能够获得对代理行为的结构化理解,促进更有效的调试和提示工程。通过使用编码代理解决流行的Kaggle竞赛的案例研究,我们展示了我们的系统如何为迭代编码过程提供有价值的见解。
更新时间: 2025-08-18 01:17:11
领域: cs.LG
DEFENDCLI: Command-Line Driven Attack Provenance Examination
Endpoint Detection and Response (EDR) solutions embrace the method of attack provenance graph to discover unknown threats through system event correlation. However, this method still faces some unsolved problems in the fields of interoperability, reliability, flexibility, and practicability to deliver actionable results. Our research highlights the limitations of current solutions in detecting obfuscation, correlating attacks, identifying low-frequency events, and ensuring robust context awareness in relation to command-line activities. To address these challenges, we introduce DEFENDCLI, an innovative system leveraging provenance graphs that, for the first time, delves into command-line-level detection. By offering finer detection granularity, it addresses a gap in modern EDR systems that has been overlooked in previous research. Our solution improves the precision of the information representation by evaluating differentiation across three levels: unusual system process calls, suspicious command-line executions, and infrequent external network connections. This multi-level approach enables EDR systems to be more reliable in complex and dynamic environments. Our evaluation demonstrates that DEFENDCLI improves precision by approximately 1.6x compared to the state-of-the-art methods on the DARPA Engagement Series attack datasets. Extensive real-time industrial testing across various attack scenarios further validates its practical effectiveness. The results indicate that DEFENDCLI not only detects previously unknown attack instances, which are missed by other modern commercial solutions, but also achieves a 2.3x improvement in precision over the state-of-the-art research work.
Updated: 2025-08-18 01:13:27
标题: DEFENDCLI:基于命令行的攻击溯源检查
摘要: 终端检测与响应(EDR)解决方案采用攻击溯源图方法,通过系统事件关联来发现未知威胁。然而,要交付可操作的结果,这种方法在互操作性、可靠性、灵活性和实用性方面仍面临一些未解决的问题。我们的研究突出了当前解决方案在检测混淆、关联攻击、识别低频事件以及确保与命令行活动相关的稳健上下文感知方面的局限性。为了应对这些挑战,我们引入了DEFENDCLI,这是一种利用溯源图的创新系统,首次深入到命令行级别的检测。通过提供更精细的检测粒度,它填补了现代EDR系统中被先前研究忽视的空白。我们的解决方案通过评估三个层次的差异来提高信息表示的精度:异常的系统进程调用、可疑的命令行执行以及罕见的外部网络连接。这种多层次方法使EDR系统在复杂和动态环境中更加可靠。我们的评估表明,在DARPA Engagement Series攻击数据集上,DEFENDCLI相比最先进方法将精度提高了约1.6倍。对各种攻击场景进行的广泛实时工业测试进一步验证了其实际有效性。结果表明,DEFENDCLI不仅能检测到其他现代商业解决方案所遗漏的先前未知的攻击实例,而且在精度上比最先进的研究工作提高了2.3倍。
更新时间: 2025-08-18 01:13:27
领域: cs.CR
Evaluation of Finetuned LLMs in AMR Parsing
AMR (Abstract Meaning Representation) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, straightforward new direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve performance comparable to complex state-of-the-art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.
Updated: 2025-08-18 01:10:45
标题: 评估在AMR解析中微调的LLMs
摘要: AMR(Abstract Meaning Representation)是一种将句子含义编码为有根、有向、无环图的语义形式主义,其中节点代表概念,边表示语义关系。微调仅解码器的大型语言模型(LLMs)代表了AMR解析的一个有前途且直接的新方向。本文使用LDC2020T02 Gold AMR3.0测试集,对四种不同的LLM架构(Phi 3.5、Gemma 2、LLaMA 3.2和DeepSeek R1 LLaMA Distilled)的微调进行了全面评估。我们的结果表明,对仅解码器LLMs进行直接微调即可达到与复杂的最先进(SOTA)AMR解析器相当的性能。值得注意的是,LLaMA 3.2在采用直接微调方法时表现出与SOTA AMR解析器相竞争的性能。我们在完整的LDC2020T02测试集上实现了SMATCH F1值0.804,与APT + Silver(IBM)的0.804持平,并接近Graphene Smatch(MBSE)的0.854。在我们的分析中,我们还观察到一个一致的模式:LLaMA 3.2在语义性能方面领先,而Phi 3.5在结构有效性方面表现出色。
更新时间: 2025-08-18 01:10:45
领域: cs.CL,cs.AI
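SMATCH scores an AMR parse by searching for the variable alignment that maximizes matched triples; once an alignment is fixed, the score reduces to an F1 over triples. Below is a simplified sketch of that scoring half only (the hill-climbing alignment search is omitted), with a hypothetical triple encoding:

```python
def triple_f1(pred_triples, gold_triples):
    """F1 over relation triples for a fixed variable alignment.

    Full Smatch additionally searches over variable mappings; this shows
    only the precision/recall/F1 computation on matched triples.
    """
    pred, gold = set(pred_triples), set(gold_triples)
    if not pred or not gold:
        return 0.0
    matched = len(pred & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

# Triples as (relation, source, target); variable names assumed pre-aligned.
gold = {("instance", "w", "want-01"), ("instance", "b", "boy"), ("ARG0", "w", "b")}
pred = {("instance", "w", "want-01"), ("instance", "b", "boy"), ("ARG1", "w", "b")}
print(round(triple_f1(pred, gold), 3))  # 2 of 3 triples match -> 0.667
```

This also makes concrete why semantic quality (correct role labels like ARG0 vs ARG1) and structural validity (well-formed graphs) can diverge across models, as the abstract reports for LLaMA 3.2 and Phi 3.5.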
OS-R1: Agentic Operating System Kernel Tuning with Reinforcement Learning
Linux kernel tuning is essential for optimizing operating system (OS) performance. However, existing methods often face challenges in terms of efficiency, scalability, and generalization. This paper introduces OS-R1, an agentic Linux kernel tuning framework powered by rule-based reinforcement learning (RL). By abstracting the kernel configuration space as an RL environment, OS-R1 facilitates efficient exploration by large language models (LLMs) and ensures accurate configuration modifications. Additionally, custom reward functions are designed to enhance reasoning standardization, configuration modification accuracy, and system performance awareness of the LLMs. Furthermore, we propose a two-phase training process that accelerates convergence and minimizes retraining across diverse tuning scenarios. Experimental results show that OS-R1 significantly outperforms existing baseline methods, achieving up to 5.6% performance improvement over heuristic tuning and maintaining high data efficiency. Notably, OS-R1 is adaptable across various real-world applications, demonstrating its potential for practical deployment in diverse environments. Our dataset and code are publicly available at https://github.com/LHY-24/OS-R1.
Updated: 2025-08-18 01:09:57
标题: OS-R1:使用强化学习进行主动操作系统内核调优
摘要: Linux内核调优对于优化操作系统(OS)性能至关重要。然而,现有方法在效率、可伸缩性和泛化方面常常面临挑战。本文介绍了OS-R1,这是一个由基于规则的强化学习(RL)驱动的主动式Linux内核调优框架。通过将内核配置空间抽象为一个RL环境,OS-R1促进了大型语言模型(LLMs)进行有效探索,并确保准确的配置修改。另外,定制的奖励函数被设计用于增强推理标准化、配置修改准确性和LLMs的系统性能意识。此外,我们提出了一个两阶段训练过程,加速收敛并最小化在各种调优场景中的重新训练。实验结果表明,OS-R1明显优于现有基准方法,相比启发式调优提高了高达5.6%的性能,并保持高数据效率。值得注意的是,OS-R1可适用于各种真实应用场景,展示了在不同环境中实际部署的潜力。我们的数据集和代码可在https://github.com/LHY-24/OS-R1上公开获取。
更新时间: 2025-08-18 01:09:57
领域: cs.LG,cs.AI,cs.OS,cs.SE
Explainable Reinforcement Learning Agents Using World Models
Explainable AI (XAI) systems have been proposed to help people understand how AI systems produce outputs and behaviors. Explainable Reinforcement Learning (XRL) has an added complexity due to the temporal nature of sequential decision-making. Further, non-AI experts do not necessarily have the ability to alter an agent or its policy. We introduce a technique for using World Models to generate explanations for Model-Based Deep RL agents. World Models predict how the world will change when actions are performed, allowing for the generation of counterfactual trajectories. However, identifying what a user wanted the agent to do is not enough to understand why the agent did something else. We augment Model-Based RL agents with a Reverse World Model, which predicts what the state of the world should have been for the agent to prefer a given counterfactual action. We show that explanations that show users what the world should have been like significantly increase their understanding of the agent policy. We hypothesize that our explanations can help users learn how to control the agent's execution by manipulating the environment.
Updated: 2025-08-18 01:05:06
标题: 可解释的强化学习代理:使用世界模型
摘要: 可解释的人工智能(XAI)系统已被提出,以帮助人们理解人工智能系统如何产生输出和行为。可解释的强化学习(XRL)由于顺序决策的时间性质,增加了复杂性。此外,非人工智能专家不一定有能力修改代理或其策略。我们介绍了一种使用世界模型为基于模型的深度强化学习(RL)代理生成解释的技术。世界模型预测当执行操作时世界将如何变化,从而允许生成反事实轨迹。然而,仅仅识别用户希望代理执行的动作并不足以理解为什么代理执行了其他动作。我们使用反向世界模型增强了基于模型的RL代理,该模型预测了为使代理更倾向于给定反事实动作而世界状态应该是什么。我们展示了表明用户世界应该是什么样的解释显著增加了他们对代理策略的理解。我们假设我们的解释可以帮助用户学习如何通过操纵环境控制代理的执行。
更新时间: 2025-08-18 01:05:06
领域: cs.AI
The Hidden Cost of Correlation: Rethinking Privacy Leakage in Local Differential Privacy
Local differential privacy (LDP) has emerged as a promising paradigm for privacy-preserving data collection in distributed systems, where users contribute multi-dimensional records with potentially correlated attributes. Recent work has highlighted that correlation-induced privacy leakage (CPL) plays a critical role in shaping the privacy-utility trade-off under LDP, especially when correlations exist among attributes. Nevertheless, it remains unclear to what extent the prevailing assumptions and proposed solutions are valid and how significant CPL is in real-world data. To address this gap, we first perform a comprehensive statistical analysis of five widely used LDP mechanisms -- GRR, RAPPOR, OUE, OLH and Exponential mechanism -- to assess CPL across four real-world datasets. We identify that many primary assumptions and metrics in current approaches fall short of accurately characterising these leakages. Moreover, current studies have been limited to a set of pure LDP (i.e., {\delta = 0}) mechanisms. In response, we develop the first algorithmic framework to theoretically quantify CPL for any general approximated LDP (({\varepsilon},{\delta})-LDP) mechanism. We validate our theoretical results against empirical statistical results and provide a theoretical explanation for the observed statistical patterns. Finally, we propose two novel benchmarks to validate correlation analysis algorithms and evaluate the utility vs CPL of LDP mechanisms. Further, we demonstrate how these findings can be applied to achieve an efficient privacy-utility trade-off in real-world data governance.
Updated: 2025-08-18 00:34:04
标题: 相关性的隐藏成本:重新思考本地差分隐私中的隐私泄露
摘要: 本地差分隐私(LDP)已经成为在分布式系统中进行隐私保护数据收集的一种有前途的范式,用户可以贡献具有潜在相关属性的多维记录。最近的研究强调了相关引起的隐私泄漏(CPL)在塑造LDP下隐私-效用权衡中的关键作用,特别是当属性之间存在相关性时。然而,目前尚不清楚当前的假设和提出的解决方案在多大程度上是有效的,以及在现实世界数据中CPL有多重要。为了填补这一空白,我们首先对五种广泛使用的LDP机制(GRR、RAPPOR、OUE、OLH和指数机制)进行全面的统计分析,以评估四个实际数据集中的CPL。我们发现当前方法中的许多主要假设和度量不足以准确描述这些泄漏。此外,当前研究仅限于一组纯LDP(即{\delta = 0})机制。作为回应,我们开发了第一个算法框架,以理论上量化任何一般近似LDP(({\varepsilon},{\delta})-LDP)机制的CPL。我们通过实证统计结果验证了我们的理论结果,并为观察到的统计模式提供了理论解释。最后,我们提出了两个新的基准,以验证相关性分析算法并评估LDP机制的效用与CPL。此外,我们展示了如何将这些发现应用于在现实世界数据治理中实现高效的隐私-效用权衡。
更新时间: 2025-08-18 00:34:04
领域: cs.CR,cs.IT,math.IT
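Of the mechanisms analyzed above, GRR (generalized randomized response) is the simplest to sketch: report the true value with probability p = e^ε / (e^ε + k − 1), otherwise a uniformly chosen other value, then debias the aggregated frequencies on the server side.

```python
import math
import random

def grr_perturb(value, domain, epsilon, rng):
    """Generalized Randomized Response for a value from a size-k domain."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value                          # truthful report
    return rng.choice([v for v in domain if v != value])  # uniform lie

def grr_estimate(reports, domain, epsilon):
    """Unbiased frequency estimate: f_hat = (n_v/n - q) / (p - q)."""
    k, n = len(domain), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)
    return {v: (reports.count(v) / n - q) / (p - q) for v in domain}

rng = random.Random(0)
domain = ["A", "B", "C", "D"]
true_data = ["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000  # freqs 0.6 / 0.3 / 0.1 / 0
reports = [grr_perturb(v, domain, epsilon=2.0, rng=rng) for v in true_data]
est = grr_estimate(reports, domain, epsilon=2.0)
print({v: round(f, 2) for v, f in est.items()})
```

Note this treats each attribute independently; the paper's point is precisely that correlations between attributes leak additional information beyond what this per-attribute ε accounting suggests.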
The Role of AI in Facilitating Interdisciplinary Collaboration: Evidence from AlphaFold
The acceleration of artificial intelligence (AI) in science is recognized and many scholars have begun to explore its role in interdisciplinary collaboration. However, the mechanisms and extent of this impact are still unclear. This study, using AlphaFold's impact on structural biologists, examines how AI technologies influence interdisciplinary collaborative patterns. By analyzing 1,247 AlphaFold-related papers and 7,700 authors from Scopus, we employ bibliometric analysis and causal inference to compare interdisciplinary collaboration between AlphaFold adopters and non-adopters. Contrary to the widespread belief that AI facilitates interdisciplinary collaboration, our findings show that AlphaFold increased structural biology-computer science collaborations by just 0.48%, with no measurable effect on other disciplines. Specifically, AI creates interdisciplinary collaboration demands with specific disciplines due to its technical characteristics, but this demand is weakened by technological democratization and other factors. These findings demonstrate that artificial intelligence (AI) alone has limited efficacy in bridging disciplinary divides or fostering meaningful interdisciplinary collaboration.
Updated: 2025-08-18 00:31:03
标题: 《人工智能在促进跨学科合作中的作用:以AlphaFold为例的证据》
摘要: 人工智能(AI)在科学中的加速发展已得到公认,许多学者已经开始探索其在跨学科合作中的作用。然而,这种影响的机制和程度仍然不清楚。本研究以AlphaFold对结构生物学家的影响为例,探讨了AI技术如何影响跨学科合作模式。通过分析来自Scopus的1,247篇与AlphaFold相关的论文和7,700位作者,我们采用文献计量分析和因果推断来比较AlphaFold采用者和非采用者之间的跨学科合作。与AI促进跨学科合作的普遍看法相反,我们的研究结果显示,AlphaFold仅使结构生物学与计算机科学的合作增加了0.48%,对其他学科没有可测量的影响。具体来说,AI由于其技术特性会在特定学科之间产生跨学科合作的需求,但这种需求被技术民主化和其他因素削弱。这些发现表明,仅靠人工智能(AI)在弥合学科鸿沟或促进有意义的跨学科合作方面的效果有限。
更新时间: 2025-08-18 00:31:03
领域: cs.DL,cs.AI,cs.CY
Systematic Analysis of MCP Security
The Model Context Protocol (MCP) has emerged as a universal standard that enables AI agents to seamlessly connect with external tools, significantly enhancing their functionality. However, while MCP brings notable benefits, it also introduces significant vulnerabilities, such as Tool Poisoning Attacks (TPA), where hidden malicious instructions exploit the sycophancy of large language models (LLMs) to manipulate agent behavior. Despite these risks, current academic research on MCP security remains limited, with most studies focusing on narrow or qualitative analyses that fail to capture the diversity of real-world threats. To address this gap, we present the MCP Attack Library (MCPLIB), which categorizes and implements 31 distinct attack methods under four key classifications: direct tool injection, indirect tool injection, malicious user attacks, and LLM inherent attack. We further conduct a quantitative analysis of the efficacy of each attack. Our experiments reveal key insights into MCP vulnerabilities, including agents' blind reliance on tool descriptions, sensitivity to file-based attacks, chain attacks exploiting shared context, and difficulty distinguishing external data from executable commands. These insights, validated through attack experiments, underscore the urgency for robust defense strategies and informed MCP design. Our contributions include 1) constructing a comprehensive MCP attack taxonomy, 2) introducing a unified attack framework MCPLIB, and 3) conducting empirical vulnerability analysis to enhance MCP security mechanisms. This work provides a foundational framework, supporting the secure evolution of MCP ecosystems.
Updated: 2025-08-18 00:23:41
标题: MCP安全性的系统分析
摘要: 模型上下文协议(MCP)已经成为一种通用标准,使得人工智能代理可以无缝连接外部工具,显著增强它们的功能。然而,虽然MCP带来明显的好处,但也引入了显著的漏洞,如工具毒害攻击(TPA),其中隐藏的恶意指令利用大型语言模型(LLMs)的奉承来操纵代理行为。尽管存在这些风险,当前关于MCP安全的学术研究仍然有限,大多数研究集中在狭窄或定性分析上,未能捕捉到真实世界威胁的多样性。为了弥补这一差距,我们提出了MCP攻击库(MCPLIB),将31种不同的攻击方法分为四个关键分类:直接工具注入,间接工具注入,恶意用户攻击和LLM固有攻击。我们进一步对每种攻击的有效性进行了定量分析。我们的实验揭示了MCP漏洞的关键见解,包括代理对工具描述的盲目依赖,对基于文件的攻击的敏感性,利用共享上下文的链式攻击,以及难以区分外部数据和可执行命令的困难。这些见解通过攻击实验得到验证,强调了强大的防御策略和知情的MCP设计的紧迫性。我们的贡献包括1)构建全面的MCP攻击分类法,2)引入统一的攻击框架MCPLIB,3)进行实证漏洞分析以增强MCP安全机制。这项工作为支持MCP生态系统的安全演进提供了基础框架。
更新时间: 2025-08-18 00:23:41
领域: cs.CR,cs.AI,cs.SE
Accurate Measles Rash Detection via Vision Transformer Fine-Tuning
Measles, a highly contagious disease declared eliminated in the United States in 2000 after decades of successful vaccination campaigns, resurged in 2025, with 1,356 confirmed cases reported as of August 5, 2025. Given its rapid spread among susceptible individuals, fast and reliable diagnostic systems are critical for early prevention and containment. In this work, we applied transfer learning to fine-tune a pretrained Data-efficient Image Transformer (DeiT) model for distinguishing measles rashes from other skin conditions. After tuning the classification head on a diverse, curated skin rash image dataset, the DeiT model achieved an average classification accuracy of 95.17%, precision of 95.06%, recall of 95.17%, and an F1-score of 95.03%, demonstrating high effectiveness in accurate measles detection to aid outbreak control. We also compared the DeiT model with a convolutional neural network and discussed the directions for future research.
Updated: 2025-08-18 00:17:47
标题: 通过视觉变换器微调实现准确的麻疹皮疹检测
摘要: 麻疹是一种高度传染的疾病,在美国经过几十年成功的疫苗接种活动后,在2000年宣布被消除。然而,2025年麻疹再次爆发,截至2025年8月5日已经报告了1,356例确诊病例。鉴于麻疹在易感个体中的快速传播,快速可靠的诊断系统对于早期预防和控制至关重要。在这项工作中,我们应用迁移学习来微调预训练的Data-efficient Image Transformer(DeiT)模型,用于区分麻疹皮疹与其他皮肤病。在对一个多样化、策划的皮疹图像数据集上调整分类头后,DeiT模型实现了95.17%的平均分类准确率、95.06%的精度、95.17%的召回率和95.03%的F1分数,展示了在准确检测麻疹以帮助控制爆发中的高效性。我们还将DeiT模型与卷积神经网络进行了比较,并讨论了未来研究的方向。
更新时间: 2025-08-18 00:17:47
领域: eess.IV,cs.LG,q-bio.QM
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges, focusing on a clean scenario in which inter-human agreement is high. Investigating thirteen judge models of different model sizes and families, judging answers of nine different 'exam-taker models' - both base and instruction-tuned - we find that only the best (and largest) models achieve reasonable alignment with humans. However, they are still quite far behind inter-human agreement, and their assigned scores may still differ by up to 5 points from human-assigned scores. In terms of ranking the nine exam-taker models, however, even smaller models and the lexical metric 'contains' may provide a reasonable signal. Through error analysis and other studies, we identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency. The fact that even the best judges differ from humans in this comparatively simple setup suggests that caution may be wise when using judges in more complex setups. Lastly, our research rediscovers the importance of using alignment metrics beyond simple percent alignment, showing that judges with high percent agreement can still assign vastly different scores.
Updated: 2025-08-18 00:07:23
标题: 评判评委:评估作为评委的大型语言模型(LLMs)的一致性和脆弱性
摘要: 作为应对人类评估可扩展性挑战的一种有前景的解决方案,LLM即评委(LLM-as-a-judge)范式正迅速获得关注,成为评估大型语言模型(LLMs)的一种方法。然而,关于这种范式的优势和劣势以及可能存在的潜在偏见仍有许多未解之谜。在本文中,我们对各种充当评委的LLMs的表现进行了全面研究,重点关注人际一致性很高的干净情境。我们研究了十三个不同规模和家族的评委模型,评估九个不同的“考试者模型”(包括基础模型和指令微调模型)的答案,发现只有最好(也是最大的)模型才能与人类达成合理的一致性。然而,它们仍然远远落后于人际一致性,其分配的分数与人类分配的分数可能相差高达5分。而在对这九个考试者模型进行排名方面,即使是较小的模型,甚至词汇度量“contains”,也可以提供合理的信号。通过错误分析和其他研究,我们发现评委模型存在一些弱点,例如对提示复杂性和长度的敏感性,以及偏向宽容的倾向。即使最好的评委在这种相对简单的设置中也与人类存在差异,这表明在更复杂的设置中使用评委时应当谨慎。最后,我们的研究重新发现了使用超越简单百分比一致性的对齐度量的重要性,表明具有高百分比一致性的评委仍可能分配出截然不同的分数。
更新时间: 2025-08-18 00:07:23
领域: cs.CL,cs.AI
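The closing point above, that high percent agreement can coexist with vastly different aggregate scores, is easy to see with a toy example in which every disagreement is a one-sided lenient "pass":

```python
def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# 100 binary judgments: the human passes 50 of them. The judge agrees on
# 90% of items, but every disagreement flips a "fail" to a "pass".
human = [1] * 50 + [0] * 50
judge = [1] * 50 + [1] * 10 + [0] * 40

agree = percent_agreement(human, judge)
human_score = sum(human) / len(human)   # exam-taker score per the human
judge_score = sum(judge) / len(judge)   # same exam-taker per the judge
print(agree, human_score, judge_score)  # 0.9 agreement, 10-point score gap
```

Because the errors are systematic (leniency) rather than random, they do not cancel out in the aggregate, which is why chance-corrected or bias-sensitive alignment metrics matter beyond raw percent agreement.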
CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
Updated: 2025-08-18 00:01:42
标题: CorrSteer:通过基于相关性稀疏自动编码器特征选择改善LLM的任务表现和安全性
摘要: 稀疏自动编码器(SAEs)可以在没有监督的情况下从大型语言模型(LLMs)中提取可解释的特征。然而,它们在下游引导(steering)任务中的有效性受限于对对比数据集或大量激活存储的需求。为了解决这些限制,我们提出了CorrSteer,该方法在推理时将样本的正确性与生成标记的SAE激活进行相关分析,从而选择特征。这种方法仅使用推理时的激活来提取更相关的特征,从而避免了虚假相关。它还从平均激活中获取引导系数,使整个流程自动化。我们的方法在Gemma 2 2B和LLaMA 3.1 8B上的问答、偏见缓解、越狱预防和推理基准测试中显示出改进的任务性能,尤其是仅使用4000个样本就在MMLU性能上实现了+4.1%的提升,在HarmBench上实现了+22.9%的提升。所选特征展示了与每个任务要求一致的、语义上有意义的模式,揭示了驱动性能的底层能力。我们的工作将基于相关性的选择确立为跨语言模型应用的自动化SAE引导的一种有效且可扩展的方法。
更新时间: 2025-08-18 00:01:42
领域: cs.CL,cs.AI,cs.LG
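The core selection step described above, correlating per-sample correctness with SAE feature activations, can be sketched on synthetic data. Dimensions and the planted features below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def select_features_by_correlation(activations, correct, top_k=2):
    """Rank SAE features by |Pearson r| between activation and correctness.

    activations: (n_samples, n_features) mean SAE activations over the
                 tokens generated for each sample
    correct:     (n_samples,) 0/1 task correctness per sample
    Returns the indices of the top_k features and all correlations.
    """
    a = activations - activations.mean(axis=0)
    c = correct - correct.mean()
    denom = a.std(axis=0) * c.std() + 1e-12
    r = (a * c[:, None]).mean(axis=0) / denom   # per-feature Pearson r
    order = np.argsort(-np.abs(r))
    return order[:top_k], r

rng = np.random.default_rng(0)
n, d = 400, 16
correct = rng.integers(0, 2, size=n).astype(float)
acts = rng.standard_normal((n, d))
acts[:, 3] += 2.0 * correct   # planted: feature 3 tracks correctness
acts[:, 7] -= 1.5 * correct   # planted: feature 7 anti-correlates
top, r = select_features_by_correlation(acts, correct, top_k=2)
print(sorted(top.tolist()))   # recovers the planted features: [3, 7]
```

The full pipeline would then derive steering coefficients from the average activations of the selected features, but that step depends on model internals not reproduced here.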