Arxiv Day: Article

RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, because its inherently on-policy strategy must cope with the LLM's immense action space and sparse rewards. Further, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel approach that synergizes internal exploitation (i.e., Thinking) with external data (i.e., Learning) to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling to address distributional mismatch from external data, and an Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. The results show that RL-PLUS achieves state-of-the-art performance compared with existing RLVR methods on six math reasoning benchmarks and exhibits superior performance on six out-of-distribution reasoning tasks. It also achieves consistent and significant gains across diverse model families, with average relative improvements ranging from 21.1% to 69.2%. Moreover, Pass@k curves across multiple benchmarks indicate that RL-PLUS effectively resolves the capability boundary collapse problem.

Updated: 2025-07-31 23:55:29

Categories: cs.AI, cs.CL, cs.LG

Download: http://arxiv.org/abs/2508.00222v1
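
The RL-PLUS abstract above reports Pass@k curves but does not say how they are computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator (draw n samples per problem, count c correct ones); whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard combinatorial estimator,
    1 - C(n - c, k) / C(n, k); the abstract does not confirm it.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no correct sample; math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 16 samples, 2 of them correct.
    for k in (1, 2, 4, 8, 16):
        print(k, round(pass_at_k(16, 2, k), 4))
```

Plotting this value against k for a base model and its RL-tuned variant gives the kind of curve the abstract refers to: a tuned model that wins at k = 1 but falls below the base model at large k would show a narrowed problem-solving scope.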

Object-Centric Cropping for Visual Few-Shot Classification

In the domain of Few-Shot Image Classification, operating with as little as one example per class, the presence of image ambiguities stemming from multiple objects or complex backgrounds can significantly deteriorate performance. Our research demonstrates that incorporating additional information about the local positioning of an object within its image markedly enhances classification across established benchmarks. More importantly, we show that a significant fraction of the improvement can be achieved through the use of the Segment Anything Model, requiring only a pixel of the object of interest to be pointed out, or by employing fully unsupervised foreground object extraction methods.

Updated: 2025-07-31 23:44:06

Categories: cs.CV, cs.LG

Download: http://arxiv.org/abs/2508.00218v1

Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Tables have gained significant attention in large language models (LLMs) and multimodal large language models (MLLMs) due to their complex and flexible structure. Unlike linear text inputs, tables are two-dimensional, encompassing formats that range from well-structured database tables to complex, multi-layered spreadsheets, each with different purposes. This diversity in format and purpose has led to the development of specialized methods and tasks, instead of universal approaches, making navigation of table understanding tasks challenging. To address these challenges, this paper introduces key concepts through a taxonomy of tabular input representations and an introduction to table understanding tasks. We highlight several critical gaps in the field that indicate the need for further research: (1) the predominance of retrieval-focused tasks that require minimal reasoning beyond mathematical and logical operations; (2) significant challenges faced by models when processing complex table structures, large-scale tables, long contexts, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.

Updated: 2025-07-31 23:41:31

Categories: cs.CL, cs.DB, cs.LG

Download: http://arxiv.org/abs/2508.00217v1

SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM's image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM's architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.

Updated: 2025-07-31 23:26:39

Categories: cs.CV, cs.LG

Download: http://arxiv.org/abs/2508.00213v1

Reinitializing weights vs units for maintaining plasticity in neural networks

Loss of plasticity is a phenomenon in which a neural network loses its ability to learn when trained for an extended time on non-stationary data. It is a crucial problem to overcome when designing systems that learn continually. An effective technique for preventing loss of plasticity is reinitializing parts of the network. In this paper, we compare two different reinitialization schemes: reinitializing units vs reinitializing weights. We propose a new algorithm, which we name selective weight reinitialization, for reinitializing the least useful weights in a network. We compare our algorithm to continual backpropagation and ReDo, two previously proposed algorithms that reinitialize units in the network. Through our experiments in continual supervised learning problems, we identify two settings in which reinitializing weights is more effective at maintaining plasticity than reinitializing units: (1) when the network has a small number of units and (2) when the network includes layer normalization. Conversely, reinitializing weights and units are equally effective at maintaining plasticity when the network is of sufficient size and does not include layer normalization. We found that reinitializing weights maintains plasticity in a wider variety of settings than reinitializing units.

Updated: 2025-07-31 23:25:19

Categories: cs.NE, cs.AI

Download: http://arxiv.org/abs/2508.00212v1

Robust Classification under Noisy Labels: A Geometry-Aware Reliability Framework for Foundation Models

Foundation models (FMs) pretrained on large datasets have become fundamental for various downstream machine learning tasks, in particular in scenarios where obtaining perfectly labeled data is prohibitively expensive. In this paper, we assume an FM has to be fine-tuned with noisy data and present a two-stage framework to ensure robust classification in the presence of label noise without model retraining. Recent work has shown that simple k-nearest neighbor (kNN) approaches using an embedding derived from an FM can achieve good performance even in the presence of severe label noise. Our work is motivated by the fact that these methods make use of local geometry. In this paper, following a similar two-stage procedure, reliability estimation followed by reliability-weighted inference, we show that improved performance can be achieved by introducing geometry information. For a given instance, our proposed inference uses a local neighborhood of training data, obtained using the non-negative kernel (NNK) neighborhood construction. We propose several methods for reliability estimation that can rely less on distance and local neighborhood as the label noise increases. Our evaluation on CIFAR-10 and DermaMNIST shows that our methods improve robustness across various noise conditions, surpassing standard kNN approaches and recent adaptive-neighborhood baselines.

Updated: 2025-07-31 23:01:32

Categories: cs.LG, cs.AI, eess.SP

Download: http://arxiv.org/abs/2508.00202v1
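
The two-stage idea in the noisy-label abstract above (estimate a reliability score per training point, then weight each neighbor's vote by it) can be sketched roughly as follows. This is not the paper's method: plain Euclidean kNN stands in for the NNK neighborhood construction, the reliability scores are assumed to be given, and the function name `reliability_weighted_knn` is hypothetical.

```python
from collections import defaultdict
from math import dist


def reliability_weighted_knn(query, points, labels, reliability, k=3):
    """Predict a label by reliability-weighted voting among k neighbors.

    Assumptions: plain kNN replaces the NNK neighborhood, and
    `reliability[i]` is a precomputed score for training point i.
    """
    # Indices of the k training points closest to the query.
    nearest = sorted(range(len(points)), key=lambda i: dist(query, points[i]))[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Each neighbor votes for its (possibly noisy) label with a
        # weight equal to its reliability score.
        votes[labels[i]] += reliability[i]
    return max(votes, key=votes.get)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
    labs = ["a", "b", "b", "a"]
    print(reliability_weighted_knn((0.0, 0.0), pts, labs, [1.0, 1.0, 1.0, 1.0]))
    print(reliability_weighted_knn((0.0, 0.0), pts, labs, [1.0, 0.2, 0.2, 1.0]))
```

With uniform scores this reduces to ordinary majority-vote kNN; lowering the scores of suspect points lets a single trusted neighbor outvote several unreliable ones.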

RecoMind: A Reinforcement Learning Framework for Optimizing In-Session User Satisfaction in Recommendation Systems

Existing web-scale recommendation systems commonly use supervised learning methods that prioritize immediate user feedback. Although reinforcement learning (RL) offers a solution to optimize longer-term goals, such as in-session engagement, applying it at web scale is challenging due to the extremely large action space and engineering complexity. In this paper, we introduce RecoMind, a simulator-based RL framework designed for the effective optimization of session-based goals at web scale. RecoMind leverages existing recommendation models to establish a simulation environment and to bootstrap the RL policy to optimize immediate user interactions from the outset. This method integrates well with existing industry pipelines, simplifying the training and deployment of RL policies. Additionally, RecoMind introduces a custom exploration strategy to efficiently explore web-scale action spaces with hundreds of millions of items. We evaluated RecoMind through extensive offline simulations and online A/B testing on a video streaming platform. Both methods showed that the RL policy trained using RecoMind significantly outperforms traditional supervised learning recommendation approaches in in-session user satisfaction. In online A/B tests, the RL policy increased videos watched for more than 10 seconds by 15.81% and improved session depth by 4.71% for sessions with at least 10 interactions. As a result, RecoMind presents a systematic and scalable approach for embedding RL into web-scale recommendation systems, showing great promise for optimizing session-based user satisfaction.

Updated: 2025-07-31 23:01:14

Categories: cs.LG

Download: http://arxiv.org/abs/2508.00201v1

E.A.R.T.H.: Structuring Creative Evolution through Model Error in Generative AI

How can AI move beyond imitation toward genuine creativity? This paper proposes the E.A.R.T.H. framework, a five-stage generative pipeline that transforms model-generated errors into creative assets through Error generation, Amplification, Refine selection, Transform, and Harness feedback. Drawing on cognitive science and generative modeling, we posit that "creative potential hides in failure" and operationalize this via structured prompts, semantic scoring, and human-in-the-loop evaluation. Implemented using LLaMA-2-7B-Chat, SBERT, BERTScore, CLIP, BLIP-2, and Stable Diffusion, the pipeline employs a composite reward function based on novelty, surprise, and relevance. At the Refine stage, creativity scores increase by 52.5% (1.179 to 1.898, t = -5.56, p < 0.001), with final outputs reaching 2.010 - a 70.4% improvement. Refined slogans are 48.4% shorter, 40.7% more novel, with only a 4.0% drop in relevance. Cross-modal tests show strong slogan-to-image alignment (CLIPScore: 0.249; BERTScore F1: 0.816). In human evaluations, the generated outputs were consistently rated highly, demonstrating strong creative quality and expressive clarity. Feedback highlights stylistic precision and emotional resonance. These results demonstrate that error-centered, feedback-driven generation enhances creativity, offering a scalable path toward self-evolving, human-aligned creative AI.

Updated: 2025-07-31 22:39:25

Categories: cs.AI

Download: http://arxiv.org/abs/2507.18004v2

Data-driven tool wear prediction in milling, based on a process-integrated single-sensor approach

Accurate tool wear prediction is essential for maintaining productivity and minimizing costs in machining. However, the complex nature of the tool wear process poses significant challenges to achieving reliable predictions. This study explores data-driven methods, in particular deep learning, for tool wear prediction. Traditional data-driven approaches often focus on a single process, relying on multi-sensor setups and extensive data generation, which limits generalization to new settings. Moreover, multi-sensor integration is often impractical in industrial environments. To address these limitations, this research investigates the transferability of predictive models using minimal training data, validated across two processes. Furthermore, it uses a simple setup with a single acceleration sensor to establish a low-cost data generation approach that facilitates the generalization of models to other processes via transfer learning. The study evaluates several machine learning models, including transformer-inspired convolutional neural networks (CNN), long short-term memory networks (LSTM), support vector machines (SVM), and decision trees, trained on different input formats such as feature vectors and short-time Fourier transform (STFT). The performance of the models is evaluated on two machines and on different amounts of training data, including scenarios with significantly reduced datasets, providing insight into their effectiveness under constrained data conditions. The results demonstrate the potential of specific models and configurations for effective tool wear prediction, contributing to the development of more adaptable and efficient predictive maintenance strategies in machining. Notably, the ConvNeXt model performs exceptionally well, achieving 99.1% accuracy in identifying tool wear using data from only four milling tools operated until they are worn.

Updated: 2025-07-31 22:37:23

Categories: cs.LG, cs.RO, eess.SP

Download: http://arxiv.org/abs/2412.19950v4

Graph Lineages and Skeletal Graph Products

Graphs, and sequences of growing graphs, can be used to specify the architecture of mathematical models in many fields including machine learning and computational science. Here we define structured graph "lineages" (ordered by level number) that grow in a hierarchical fashion, so that: (1) the number of graph vertices and edges increases exponentially in level number; (2) bipartite graphs connect successive levels within a graph lineage and, as in multigrid methods, can constrain matrices relating successive levels; (3) using prolongation maps within a graph lineage, process-derived distance measures between graphs at successive levels can be defined; (4) a category of "graded graphs" can be defined, and using it low-cost "skeletal" variants of standard algebraic graph operations and type constructors (cross product, box product, disjoint sum, and function types) can be derived for graded graphs and hence hierarchical graph lineages; (5) these skeletal binary operators have similar but not identical algebraic and category-theoretic properties to their standard counterparts; (6) graph lineages and their skeletal product constructors can approach continuum limit objects. Additional space-efficient unary operators on graded graphs are also derived: thickening, which creates a graph lineage of multiscale graphs, and escalation to a graph lineage of search frontiers (useful as a generalization of adaptive grids and in defining "skeletal" functions). The result is an algebraic type theory for graded graphs and (hierarchical) graph lineages. The approach is expected to be well suited to defining hierarchical model architectures - "hierarchitectures" - and local sampling, search, or optimization algorithms on them. We demonstrate such application to deep neural networks (including visual and feature scale spaces) and to multigrid numerical methods.

Updated: 2025-07-31 22:31:34

Categories: cs.CV, cs.LG, cs.NA, math.CT, math.NA

Download: http://arxiv.org/abs/2508.00197v1
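
The graph-lineages abstract above derives low-cost "skeletal" variants of standard graph operations such as the box product, but does not define them. For orientation only, the sketch below builds the standard box (Cartesian) product of two undirected graphs, which is the full-cost operation the skeletal version is contrasted with; the function name `box_product` and the edge-list representation are choices made here.

```python
from itertools import product


def box_product(vertices_g, edges_g, vertices_h, edges_h):
    """Standard box (Cartesian) product of two undirected simple graphs.

    Note: this is the ordinary product, not the paper's skeletal variant.
    Vertices are pairs (u, v); two pairs are adjacent when they agree in
    one coordinate and are adjacent in the other.
    """
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    vertices = list(product(vertices_g, vertices_h))
    edges = set()
    for (u1, v1), (u2, v2) in product(vertices, repeat=2):
        same_u_step_v = u1 == u2 and frozenset((v1, v2)) in eh
        same_v_step_u = v1 == v2 and frozenset((u1, u2)) in eg
        if same_u_step_v or same_v_step_u:
            # frozenset makes the edge unordered, so each edge is stored once.
            edges.add(frozenset(((u1, v1), (u2, v2))))
    return vertices, edges


if __name__ == "__main__":
    # Path on 3 vertices times a single edge gives a 3-rung ladder.
    verts, edges = box_product([0, 1, 2], [(0, 1), (1, 2)], ["a", "b"], [("a", "b")])
    print(len(verts), len(edges))
```

The product has |V_G|·|V_H| vertices and |V_G|·|E_H| + |V_H|·|E_G| edges, so its size multiplies with each factor, which is the cost that a cheaper variant would aim to avoid on exponentially growing lineages.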

Cobblestone: Iterative Automation for Formal Verification

Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.

Updated: 2025-07-31 22:29:43

Categories: cs.LO, cs.AI, cs.PL

Download: http://arxiv.org/abs/2410.19940v2

EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes

Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an exponential moving average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model wherein we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on language models, we show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training in a variety of standard LM benchmarks, making BEMA a practical and theoretically motivated intervention for more stable and efficient fine-tuning.

Updated: 2025-07-31 21:49:20

Categories: cs.LG, cs.AI, stat.ML

Download: http://arxiv.org/abs/2508.00180v1
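
The lag that the EMA abstract above attributes to bias from old iterates can be seen in a toy setting. The sketch below applies a plain EMA to a scalar "weight" that moves by one unit per step; it only illustrates the lag of standard EMA and does not reproduce BEMA, whose correction term the abstract does not give. The function name `ema_trajectory` and the linear iterate sequence are assumptions made for illustration.

```python
def ema_trajectory(iterates, beta):
    """Return the plain EMA of a sequence of scalar iterates.

    The average is initialised at the first iterate and updated as
    ema <- beta * ema + (1 - beta) * theta.
    """
    ema = iterates[0]
    out = [ema]
    for theta in iterates[1:]:
        ema = beta * ema + (1.0 - beta) * theta
        out.append(ema)
    return out


if __name__ == "__main__":
    beta = 0.9
    # Toy "training run": the iterate advances by 1.0 every step.
    iterates = [float(t) for t in range(500)]
    averaged = ema_trajectory(iterates, beta)
    lag = iterates[-1] - averaged[-1]
    # For a unit-speed linear drift the lag approaches beta / (1 - beta).
    print(lag, beta / (1.0 - beta))
```

For this drift the gap L_t between the iterate and its average obeys L_t = beta·(1 + L_{t-1}), whose fixed point is beta/(1 - beta). A larger beta therefore smooths more but trails the current weights by more steps, which is the trade-off the abstract describes.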

The SPACE of AI: Real-World Lessons on AI's Impact on Developers

As artificial intelligence (AI) tools become increasingly embedded in software development workflows, questions persist about their true impact on developer productivity and experience. This paper presents findings from a mixed-methods study examining how developers perceive AI's influence across the dimensions of the SPACE framework: Satisfaction, Performance, Activity, Collaboration and Efficiency. Drawing on survey responses from over 500 developers and qualitative insights from interviews and observational studies, we find that AI is broadly adopted and widely seen as enhancing productivity, particularly for routine tasks. However, the benefits vary, depending on task complexity, individual usage patterns, and team-level adoption. Developers report increased efficiency and satisfaction, with less evidence of impact on collaboration. Organizational support and peer learning play key roles in maximizing AI's value. These findings suggest that AI is augmenting developers rather than replacing them, and that effective integration depends as much on team culture and support structures as on the tools themselves. We conclude with practical recommendations for teams, organizations and researchers seeking to harness AI's potential in software engineering.

Updated: 2025-07-31 21:45:54

Categories: cs.HC, cs.AI, cs.SE

Download: http://arxiv.org/abs/2508.00178v1

On Gradual Semantics for Assumption-Based Argumentation

In computational argumentation, gradual semantics are fine-grained alternatives to extension-based and labelling-based semantics. They ascribe a dialectical strength to (components of) arguments sanctioning their degree of acceptability. Several gradual semantics have been studied for abstract, bipolar and quantitative bipolar argumentation frameworks (QBAFs), as well as, to a lesser extent, for some forms of structured argumentation. However, this has not been the case for assumption-based argumentation (ABA), despite it being a popular form of structured argumentation with several applications where gradual semantics could be useful. In this paper, we fill this gap and propose a family of novel gradual semantics for equipping assumptions, which are the core components in ABA frameworks, with dialectical strengths. To do so, we use bipolar set-based argumentation frameworks as an abstraction of (potentially non-flat) ABA frameworks and generalise state-of-the-art modular gradual semantics for QBAFs. We show that our gradual ABA semantics satisfy suitable adaptations of desirable properties of gradual QBAF semantics, such as balance and monotonicity. We also explore an argument-based approach that leverages established QBAF modular semantics directly, and use it as a baseline. Finally, we conduct experiments with synthetic ABA frameworks to compare our gradual ABA semantics with their argument-based counterpart and assess convergence.

Updated: 2025-07-31 21:40:27

Categories: cs.AI

Download: http://arxiv.org/abs/2507.10076v3

RL as Regressor: A Reinforcement Learning Approach for Function Approximation

Standard regression techniques, while powerful, are often constrained by predefined, differentiable loss functions such as mean squared error. These functions may not fully capture the desired behavior of a system, especially when dealing with asymmetric costs or complex, non-differentiable objectives. In this paper, we explore an alternative paradigm: framing regression as a Reinforcement Learning (RL) problem. By treating a model's prediction as an action and defining a custom reward signal based on the prediction error, we can leverage powerful RL algorithms to perform function approximation. Through a progressive case study of learning a noisy sine wave, we illustrate the development of an Actor-Critic agent, iteratively enhancing it with Prioritized Experience Replay, increased network capacity, and positional encoding to enable a capable RL agent for this regression task. Our results show that the RL framework not only successfully solves the regression problem but also offers enhanced flexibility in defining objectives and guiding the learning process.

Updated: 2025-07-31 21:39:24

Categories: cs.LG

Download: http://arxiv.org/abs/2508.00174v1
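
The framing in the RL-as-regressor abstract above (the prediction is the action, the reward is a function of the prediction error) can be illustrated with a reward that encodes asymmetric costs. The sketch below is a hypothetical reward, not the one used in the paper: the name `asymmetric_reward` and both penalty weights are assumptions chosen for illustration.

```python
def asymmetric_reward(prediction, target, under_penalty=3.0, over_penalty=1.0):
    """Reward for a regression 'action', penalising underestimates more.

    Hypothetical example: the penalty weights are illustrative and do
    not come from the paper.
    """
    error = prediction - target
    if error < 0:
        # Underestimate: scaled by the (larger) under-prediction penalty.
        return -under_penalty * abs(error)
    # Exact or overestimate: scaled by the over-prediction penalty.
    return -over_penalty * error


if __name__ == "__main__":
    target = 1.0
    candidates = [0.5, 0.9, 1.0, 1.2, 1.5]
    # A greedy choice over a few candidate actions, standing in for a policy.
    best = max(candidates, key=lambda a: asymmetric_reward(a, target))
    print(best, [asymmetric_reward(a, target) for a in candidates])
```

An agent only needs reward values, not reward gradients, so a signal like this can be replaced by a thresholded or otherwise non-differentiable one without changing the training loop, which is the flexibility the abstract points to.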

DiSC-Med: Diffusion-based Semantic Communications for Robust Medical Image Transmission

The rapid development of artificial intelligence has driven smart health with next-generation wireless communication technologies, stimulating exciting applications in remote diagnosis and intervention. To enable a timely and effective response for remote healthcare, efficient transmission of medical data through noisy channels with limited bandwidth emerges as a critical challenge. In this work, we propose a novel diffusion-based semantic communication framework, namely DiSC-Med, for medical image transmission, where medical-enhanced compression and denoising blocks are developed for bandwidth efficiency and robustness, respectively. Unlike conventional pixel-wise communication frameworks, our proposed DiSC-Med is able to capture the key semantic information and achieve superior reconstruction performance with ultra-high bandwidth efficiency against noisy channels. Extensive experiments on real-world medical datasets validate the effectiveness of our framework, demonstrating its potential for robust and efficient telehealth applications.

Updated: 2025-07-31 21:36:45

Categories: cs.LG, eess.IV

Download: http://arxiv.org/abs/2508.00172v1

Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.

Updated: 2025-07-31 21:36:02

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.04886v3

A General Framework for Estimating Preferences Using Response Time Data

We propose a general methodology for recovering preference parameters from data on choices and response times. Our methods yield estimates with fast ($1/n$ for $n$ data points) convergence rates when specialized to the popular Drift Diffusion Model (DDM), but are broadly applicable to generalizations of the DDM as well as to alternative models of decision making that make use of response time data. The paper develops an empirical application to an experiment on intertemporal choice, showing that the use of response times delivers predictive accuracy and matters for the estimation of economically relevant parameters.

Updated: 2025-07-31 21:24:39

Categories: econ.TH,cs.LG

Download: http://arxiv.org/abs/2507.20403v2
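For readers unfamiliar with the Drift Diffusion Model named in the response-time abstract above, the following is a minimal simulation sketch of the kind of data (choices plus response times) such methods take as input. It is not the paper's estimator. The function names, the symmetric boundaries, the Euler time step, and all parameter values are illustrative assumptions.

```python
import random


def simulate_ddm_trial(drift, boundary, rng, noise_sd=1.0, dt=0.01, max_time=10.0):
    """Simulate one DDM trial and return (choice, response_time).

    Evidence starts at 0 and accumulates with constant drift plus Gaussian
    noise until it crosses +boundary (choice 1) or -boundary (choice 0).
    All parameter names here are hypothetical, chosen for illustration.
    """
    evidence = 0.0
    elapsed = 0.0
    step_sd = noise_sd * dt ** 0.5  # standard deviation of one Euler step
    while abs(evidence) < boundary and elapsed < max_time:
        evidence += drift * dt + rng.gauss(0.0, step_sd)
        elapsed += dt
    # If max_time is reached first, the sign of the evidence decides.
    choice = 1 if evidence > 0 else 0
    return choice, elapsed


def simulate_ddm(drift, boundary, n_trials, seed=0):
    """Simulate n_trials independent trials with a seeded generator."""
    rng = random.Random(seed)
    return [simulate_ddm_trial(drift, boundary, rng) for _ in range(n_trials)]


trials = simulate_ddm(drift=1.0, boundary=1.0, n_trials=500)
upper_share = sum(choice for choice, _ in trials) / len(trials)
mean_rt = sum(rt for _, rt in trials) / len(trials)
print(f"share choosing upper option: {upper_share:.3f}, mean RT: {mean_rt:.3f}s")
```

A positive drift makes the upper option more likely and, on average, faster to choose, which is why response times carry information about preference strength beyond the choices alone.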

Causal Explanations for Image Classifiers

Existing algorithms for explaining the output of image classifiers use different definitions of explanations and a variety of techniques to extract them. However, none of the existing tools use a principled approach based on formal definitions of causes and explanations for explanation extraction. In this paper we present a novel black-box approach to computing explanations grounded in the theory of actual causality. We prove relevant theoretical results and present an algorithm for computing approximate explanations based on these definitions. We prove termination of our algorithm and discuss its complexity and the amount of approximation compared to the precise definition. We implemented the framework in a tool, ReX, and we present experimental results and a comparison with state-of-the-art tools. We demonstrate that ReX is the most efficient tool and produces the smallest explanations, in addition to outperforming other black-box tools on standard quality measures.

Updated: 2025-07-31 21:07:54

Categories: cs.AI

Download: http://arxiv.org/abs/2411.08875v2

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.

Updated: 2025-07-31 21:04:12

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2508.00161v1

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.

Updated: 2025-07-31 21:00:28

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2507.02659v2

DeformTune: A Deformable XAI Music Prototype for Non-Musicians

Many existing AI music generation tools rely on text prompts, complex interfaces, or instrument-like controls, which may require musical or technical knowledge that non-musicians do not possess. This paper introduces DeformTune, a prototype system that combines a tactile deformable interface with the MeasureVAE model to explore more intuitive, embodied, and explainable AI interaction. We conducted a preliminary study with 11 adult participants without formal musical training to investigate their experience with AI-assisted music creation. Thematic analysis of their feedback revealed recurring challenges, including unclear control mappings, limited expressive range, and the need for guidance throughout use. We discuss several design opportunities for enhancing explainability of AI, including multimodal feedback and progressive interaction support. These findings contribute early insights toward making AI music systems more explainable and empowering for novice users.

Updated: 2025-07-31 20:57:59

Categories: cs.HC,cs.AI,cs.SD,eess.AS

Download: http://arxiv.org/abs/2508.00160v1

Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power

Power is a key concept in AI safety: power-seeking as an instrumental goal, sudden or gradual disempowerment of humans, power balance in human-AI interaction and international AI governance. At the same time, power as the ability to pursue diverse goals is essential for wellbeing. This paper explores the idea of promoting both safety and wellbeing by forcing AI agents explicitly to empower humans and to manage the power balance between humans and AI agents in a desirable way. Using a principled, partially axiomatic approach, we design a parametrizable and decomposable objective function that represents an inequality- and risk-averse long-term aggregate of human power. It takes into account humans' bounded rationality and social norms, and, crucially, considers a wide variety of possible human goals. We derive algorithms for computing that metric by backward induction or approximating it via a form of multi-agent reinforcement learning from a given world model. We exemplify the consequences of (softly) maximizing this metric in a variety of paradigmatic situations and describe what instrumental sub-goals it will likely imply. Our cautious assessment is that softly maximizing suitable aggregate metrics of human power might constitute a beneficial objective for agentic AI systems that is safer than direct utility-based objectives.

Updated: 2025-07-31 20:56:43

Categories: cs.AI,cs.CY,cs.LG,econ.TH,math.OC,68Txx,I.2

Download: http://arxiv.org/abs/2508.00159v1

Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items

We introduce a novel self-supervised multi-modal relational item representation learning framework designed to infer substitutable and complementary items. Existing approaches primarily focus on modeling item-item associations deduced from user behaviors using graph neural networks (GNNs) or leveraging item content information. However, these methods often overlook critical challenges, such as noisy user behavior data and data sparsity due to the long-tailed distribution of these behaviors. In this paper, we propose MMSC, a self-supervised multi-modal relational item representation learning framework to address these challenges. Specifically, MMSC consists of three main components: (1) a multi-modal item representation learning module that leverages a multi-modal foundational model and learns from item metadata, (2) a self-supervised behavior-based representation learning module that denoises and learns from user behavior data, and (3) a hierarchical representation aggregation mechanism that integrates item representations at both the semantic and task levels. Additionally, we leverage LLMs to generate augmented training data, further enhancing the denoising process during training. We conduct extensive experiments on five real-world datasets, showing that MMSC outperforms existing baselines by 26.1% for substitutable recommendation and 39.2% for complementary recommendation. In addition, we empirically show that MMSC is effective in modeling cold-start items.

Updated: 2025-07-31 20:53:24

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2507.22268v2

Quantum Generative Modeling using Parameterized Quantum Circuits

Quantum generative models use the intrinsic probabilistic nature of quantum mechanics to learn and reproduce complex probability distributions. In this paper, we present an implementation of a 3-qubit quantum circuit Born machine trained to model a 3-bit Gaussian distribution using a Kullback-Leibler (KL) divergence loss and parameter-shift gradient optimization. The variational quantum circuit consists of layers of parameterized rotations and entangling gates, and is optimized such that the Born rule output distribution closely matches the target distribution. We detail the mathematical formulation of the model distribution, the KL divergence cost function, and the parameter-shift rule for gradient evaluation. Training results on a statevector simulator show that the KL divergence is minimized to near zero, and the final generated distribution aligns quantitatively with the target probabilities. We analyze the convergence behavior and discuss the implications for scalability and quantum advantage. Our results demonstrate the feasibility of small-scale quantum generative learning and provide insight into the training dynamics of quantum circuit models.

Updated: 2025-07-31 20:52:38

Categories: quant-ph,cs.LG

Download: http://arxiv.org/abs/2303.16955v2
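The training loop described in the Born machine abstract above (a KL divergence loss minimized with parameter-shift gradients) can be illustrated on a much smaller toy. The sketch below is not the paper's 3-qubit circuit. It assumes a single qubit prepared by an RY(θ) rotation on |0⟩, whose Born-rule output is p(1) = sin²(θ/2), and an arbitrary target distribution [0.8, 0.2]. The function names, learning rate, and step count are likewise illustrative.

```python
import math

TARGET = [0.8, 0.2]  # assumed target distribution over outcomes 0 and 1


def model_probs(theta):
    """Born-rule probabilities of RY(theta)|0>: p(1) = sin^2(theta / 2)."""
    p1 = math.sin(theta / 2.0) ** 2
    return [1.0 - p1, p1]


def kl_divergence(target, model):
    """KL(target || model), skipping outcomes with zero target mass."""
    return sum(t * math.log(t / p) for t, p in zip(target, model) if t > 0.0)


def parameter_shift_grad(theta):
    """d p_x / d theta from two shifted circuit evaluations (shift = pi/2)."""
    plus = model_probs(theta + math.pi / 2.0)
    minus = model_probs(theta - math.pi / 2.0)
    return [(a - b) / 2.0 for a, b in zip(plus, minus)]


def kl_grad(theta):
    """Chain rule: dKL/dtheta = -sum_x target_x / p_x * dp_x/dtheta."""
    probs = model_probs(theta)
    dprobs = parameter_shift_grad(theta)
    return -sum(t / p * dp for t, p, dp in zip(TARGET, probs, dprobs))


def train(theta=1.5, learning_rate=0.5, steps=200):
    """Plain gradient descent on the KL loss."""
    for _ in range(steps):
        theta -= learning_rate * kl_grad(theta)
    return theta


theta_star = train()
final_kl = kl_divergence(TARGET, model_probs(theta_star))
print(f"theta = {theta_star:.4f}, model = {model_probs(theta_star)}, KL = {final_kl:.2e}")
```

The parameter-shift rule is exact for this gate, so the gradient needs only two extra evaluations of the same circuit. That is the property that makes it usable on hardware where automatic differentiation is unavailable.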

GEPAR3D: Geometry Prior-Assisted Learning for 3D Tooth Segmentation

Tooth segmentation in Cone-Beam Computed Tomography (CBCT) remains challenging, especially for fine structures like root apices, which is critical for assessing root resorption in orthodontics. We introduce GEPAR3D, a novel approach that unifies instance detection and multi-class segmentation into a single step tailored to improve root segmentation. Our method integrates a Statistical Shape Model of dentition as a geometric prior, capturing anatomical context and morphological consistency without enforcing restrictive adjacency constraints. We leverage a deep watershed method, modeling each tooth as a continuous 3D energy basin encoding voxel distances to boundaries. This instance-aware representation ensures accurate segmentation of narrow, complex root apices. Trained on publicly available CBCT scans from a single center, our method is evaluated on external test sets from two in-house and two public medical centers. GEPAR3D achieves the highest overall segmentation performance, averaging a Dice Similarity Coefficient (DSC) of 95.0% (+2.8% over the second-best method) and increasing recall to 95.2% (+9.5%) across all test sets. Qualitative analyses demonstrated substantial improvements in root segmentation quality, indicating significant potential for more accurate root resorption assessment and enhanced clinical decision-making in orthodontics. We provide the implementation and dataset at https://github.com/tomek1911/GEPAR3D.

Updated: 2025-07-31 20:46:58

Categories: eess.IV,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.00155v1

Data-Driven Motion Planning for Uncertain Nonlinear Systems

This paper proposes a data-driven motion-planning framework for nonlinear systems that constructs a sequence of overlapping invariant polytopes. Around each randomly sampled waypoint, the algorithm identifies a convex admissible region and solves data-driven linear-matrix-inequality problems to learn several ellipsoidal invariant sets together with their local state-feedback gains. The convex hull of these ellipsoids, still invariant under a piece-wise-affine controller obtained by interpolating the gains, is then approximated by a polytope. Safe transitions between nodes are ensured by verifying the intersection of consecutive convex-hull polytopes and introducing an intermediate node for a smooth transition. Control gains are interpolated in real time via simplex-based interpolation, keeping the state inside the invariant polytopes throughout the motion. Unlike traditional approaches that rely on system dynamics models, our method requires only data to compute safe regions and design state-feedback controllers. The approach is validated through simulations, demonstrating the effectiveness of the proposed method in achieving safe, dynamically feasible paths for complex nonlinear systems.

Updated: 2025-07-31 20:41:34

Categories: eess.SY,cs.LG,cs.RO,cs.SY,math.OC

Download: http://arxiv.org/abs/2508.00154v1

Directional Sign Loss: A Topology-Preserving Loss Function that Approximates the Sign of Finite Differences

Preserving topological features in learned latent spaces is a fundamental challenge in representation learning, particularly for topology-sensitive data. This paper introduces directional sign loss (DSL), an efficient, differentiable loss function that approximates the number of mismatches in the signs of finite differences between corresponding elements of two arrays. By penalizing discrepancies in critical points between input and reconstructed data, DSL encourages autoencoders and other learnable compressors to retain the topological features of the original data. We present the formulation and complexity analysis of DSL, comparing it to other non-differentiable topological measures. Experiments on multidimensional array data show that combining DSL with traditional loss functions preserves topological features more effectively than traditional losses alone. DSL serves as a differentiable, efficient proxy for common topology-based metrics, enabling topological feature preservation on previously impractical problem sizes and in a wider range of gradient-based optimization frameworks.

Updated: 2025-07-31 20:10:32

Categories: cs.LG,I.2.6

Download: http://arxiv.org/abs/2504.04202v3
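To make the idea in the directional sign loss abstract above concrete, here is a minimal one-dimensional sketch of a smooth proxy for counting sign mismatches between the finite differences of two arrays. The abstract does not give the DSL formula, so everything here is an assumption made for illustration: the tanh-based soft sign, the squared-difference penalty, the `sharpness` parameter, and the function names.

```python
import math


def soft_sign(value, sharpness=50.0):
    """Smooth stand-in for sign(value); tends to -1, 0, +1 as sharpness grows."""
    return math.tanh(sharpness * value)


def soft_sign_mismatch(x, y, sharpness=50.0):
    """Approximate count of positions where sign(x[i+1]-x[i]) != sign(y[i+1]-y[i]).

    Each term is ((s_x - s_y) / 2) ** 2, which is about 1 when the two
    forward differences have opposite signs and about 0 when they agree.
    This is an illustrative proxy, not the formula from the paper.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    total = 0.0
    for i in range(len(x) - 1):
        s_x = soft_sign(x[i + 1] - x[i], sharpness)
        s_y = soft_sign(y[i + 1] - y[i], sharpness)
        total += ((s_x - s_y) / 2.0) ** 2
    return total


original = [0.0, 1.0, 2.0, 1.0]       # rises, rises, falls
reconstructed = [0.0, 1.0, 0.0, 1.0]  # rises, falls, rises
print(soft_sign_mismatch(original, reconstructed))
```

Because tanh is differentiable, a term like this can be added to a reconstruction loss and minimized by gradient descent, whereas a hard count of sign mismatches has zero gradient almost everywhere.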

Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation

Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define "ground truth." Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen's kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors' moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We argue that these approaches are in a better position to produce training data and subsequent models that produce improved student learning and more actionable insights than IRR approaches alone. We also emphasize the importance of external validity, for example, by establishing a procedure of validating tutor moves and demonstrating that it works across many categories of tutor actions (e.g., providing hints). We call on the field to rethink annotation quality and ground truth, prioritizing validity and educational impact over consensus alone.

Updated: 2025-07-31 20:05:26

Categories: cs.AI,cs.CY

Download: http://arxiv.org/abs/2508.00143v1
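Cohen's kappa, the inter-rater reliability metric the abstract above treats as the conventional gatekeeper, corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch of the standard two-rater formula follows. The tutor-move labels are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four tutor moves by two human raters.
rater_a = ["hint", "hint", "praise", "praise"]
rater_b = ["hint", "praise", "praise", "praise"]
print(cohens_kappa(rater_a, rater_b))
```

In this example the raters agree on 3 of 4 items (p_o = 0.75), chance agreement is p_e = 0.5, and kappa is 0.5. The metric measures only how consistently raters agree, not whether their labels predict learning outcomes, which is the gap the position paper highlights.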

INSPIRE-GNN: Intelligent Sensor Placement to Improve Sparse Bicycling Network Prediction via Reinforcement Learning Boosted Graph Neural Networks

Accurate link-level bicycling volume estimation is essential for sustainable urban transportation planning. However, many cities face significant challenges of high data sparsity due to limited bicycling count sensor coverage. To address this issue, we propose INSPIRE-GNN, a novel Reinforcement Learning (RL)-boosted hybrid Graph Neural Network (GNN) framework designed to optimize sensor placement and improve link-level bicycling volume estimation in data-sparse environments. INSPIRE-GNN integrates Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) with a Deep Q-Network (DQN)-based RL agent, enabling a data-driven strategic selection of sensor locations to maximize estimation performance. Applied to Melbourne's bicycling network, comprising 15,933 road segments with sensor coverage on only 141 road segments (99% sparsity), INSPIRE-GNN demonstrates significant improvements in volume estimation by strategically selecting additional sensor locations in deployments of 50, 100, 200 and 500 sensors. Our framework outperforms traditional heuristic methods for sensor placement such as betweenness centrality, closeness centrality, observed bicycling activity and random placement, across key metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Furthermore, our experiments benchmark INSPIRE-GNN against standard machine learning and deep learning models in the bicycle volume estimation performance, underscoring its effectiveness. Our proposed framework provides transport planners actionable insights to effectively expand sensor networks, optimize sensor placement and maximize volume estimation accuracy and reliability of bicycling data for informed transportation planning decisions.

Updated: 2025-07-31 20:00:35

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.00141v1

Your Model Is Unfair, Are You Even Aware? Inverse Relationship Between Comprehension and Trust in Explainability Visualizations of Biased ML Models

Systems relying on ML have become ubiquitous, but so has biased behavior within them. Research shows that bias significantly affects stakeholders' trust in systems and how they use them. Further, stakeholders of different backgrounds view and trust the same systems differently. Thus, how ML models' behavior is explained plays a key role in comprehension and trust. We survey explainability visualizations, creating a taxonomy of design characteristics. We conduct user studies to evaluate five state-of-the-art visualization tools (LIME, SHAP, CP, Anchors, and ELI5) for model explainability, measuring how taxonomy characteristics affect comprehension, bias perception, and trust for non-expert ML users. Surprisingly, we find an inverse relationship between comprehension and trust: the better users understand the models, the less they trust them. We investigate the cause and find that this relationship is strongly mediated by bias perception: more comprehensible visualizations increase people's perception of bias, and increased bias perception reduces trust. We confirm this relationship is causal: Manipulating explainability visualizations to control comprehension, bias perception, and trust, we show that visualization design can significantly (p < 0.001) increase comprehension, increase perceived bias, and reduce trust. Conversely, reducing perceived model bias, either by improving model fairness or by adjusting visualization design, significantly increases trust even when comprehension remains high. Our work advances understanding of how comprehension affects trust and systematically investigates visualization's role in facilitating responsible ML applications.

Updated: 2025-07-31 20:00:32

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.00140v1

Co-Producing AI: Toward an Augmented, Participatory Lifecycle

Despite efforts to mitigate the inherent risks and biases of artificial intelligence (AI) algorithms, these algorithms can disproportionately impact culturally marginalized groups. A range of approaches has been proposed to address or reduce these risks, including the development of ethical guidelines and principles for responsible AI, as well as technical solutions that promote algorithmic fairness. Drawing on design justice, expansive learning theory, and recent empirical work on participatory AI, we argue that mitigating these harms requires a fundamental re-architecture of the AI production pipeline. This re-design should center co-production, diversity, equity, inclusion (DEI), and multidisciplinary collaboration. We introduce an augmented AI lifecycle consisting of five interconnected phases: co-framing, co-design, co-implementation, co-deployment, and co-maintenance. The lifecycle is informed by four multidisciplinary workshops and grounded in themes of distributed authority and iterative knowledge exchange. Finally, we relate the proposed lifecycle to several leading ethical frameworks and outline key research questions that remain for scaling participatory governance.

Updated: 2025-07-31 19:58:58

Categories: cs.AI

Download: http://arxiv.org/abs/2508.00138v1

SHACL Validation under Graph Updates (Extended Paper)

SHACL (SHApe Constraint Language) is a W3C standardized constraint language for RDF graphs. In this paper, we study SHACL validation in RDF graphs under updates. We present a SHACL-based update language that can capture intuitive and realistic modifications on RDF graphs and study the problem of static validation under such updates. This problem asks to verify whether every graph that validates a SHACL specification will still do so after applying a given update sequence. More importantly, it provides a basis for further services for reasoning about evolving RDF graphs. Using a regression technique that embeds the update actions into SHACL constraints, we show that static validation under updates can be reduced to (un)satisfiability of constraints in (a minor extension of) SHACL. We analyze the computational complexity of the static validation problem for SHACL and some key fragments. Finally, we present a prototype implementation that performs static validation and other static analysis tasks on SHACL constraints and demonstrate its behavior through preliminary experiments.

Updated: 2025-07-31 19:58:16

Categories: cs.AI

Download: http://arxiv.org/abs/2508.00137v1

Exploring the Feasibility of Deep Learning Techniques for Accurate Gender Classification from Eye Images

Gender classification has emerged as a crucial aspect in various fields, including security, human-machine interaction, surveillance, and advertising. Nonetheless, the accuracy of this classification can be influenced by factors such as cosmetics and disguise. Consequently, our study is dedicated to addressing this concern by concentrating on gender classification using color images of the periocular region. The periocular region refers to the area surrounding the eye, including the eyelids, eyebrows, and the region between them. It contains valuable visual cues that can be used to extract key features for gender classification. This paper introduces a sophisticated Convolutional Neural Network (CNN) model that utilizes color image databases to evaluate the effectiveness of the periocular region for gender classification. To validate the model's performance, we conducted tests on two eye datasets, namely CVBL and (Female and Male). The recommended architecture achieved an outstanding accuracy of 99% on the previously unused CVBL dataset while attaining a commendable accuracy of 96% with a small number of learnable parameters (7,235,089) on the (Female and Male) dataset. To ascertain the effectiveness of our proposed model for gender classification using the periocular region, we evaluated its performance through an extensive range of metrics and compared it with other state-of-the-art approaches. The results unequivocally demonstrate the efficacy of our model, thereby suggesting its potential for practical application in domains such as security and surveillance.

Updated: 2025-07-31 19:52:03

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.00135v1

ECG Latent Feature Extraction with Autoencoders for Downstream Prediction Tasks

The electrocardiogram (ECG) is an inexpensive and widely available tool for cardiac assessment. Despite its standardized format and small file size, the high complexity and inter-individual variability of ECG signals (typically a 60,000-size vector with 12 leads at 500 Hz) make it challenging to use in deep learning models, especially when only small training datasets are available. This study addresses these challenges by exploring feature generation methods from representative beat ECGs, focusing on Principal Component Analysis (PCA) and Autoencoders to reduce data complexity. We introduce three novel Variational Autoencoder (VAE) variants, namely Stochastic Autoencoder (SAE), Annealed beta-VAE (A beta-VAE), and Cyclical beta-VAE (C beta-VAE), and compare their effectiveness in maintaining signal fidelity and enhancing downstream prediction tasks using a Light Gradient Boost Machine (LGBM). The A beta-VAE achieved superior signal reconstruction, reducing the mean absolute error (MAE) to 15.7 ± 3.2 μV, which is at the level of signal noise. Moreover, the SAE encodings, when combined with traditional ECG summary features, improved the prediction of reduced Left Ventricular Ejection Fraction (LVEF), achieving a holdout test set area under the receiver operating characteristic curve (AUROC) of 0.901 with an LGBM classifier. This performance nearly matches the 0.909 AUROC of a state-of-the-art CNN model but requires significantly fewer computational resources. Further, the ECG feature extraction-LGBM pipeline avoids overfitting and retains predictive performance when trained with less data. Our findings demonstrate that these VAE encodings are not only effective in simplifying ECG data but also provide a practical solution for applying deep learning in contexts with limited-scale labeled training data.
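The annealed and cyclical beta-VAE variants named above differ in how the KL weight beta changes during training. The abstract does not give the schedules, so the following is only a minimal sketch of two common choices: a single linear warm-up and a periodically restarted warm-up. The function names, the linear ramp shape, and the `ramp_fraction` parameter are assumptions made for illustration, not the authors' implementation.

```python
def annealed_beta(step: int, warmup_steps: int, beta_max: float = 1.0) -> float:
    """Assumed annealing: ramp linearly from 0 to beta_max, then stay constant."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


def cyclical_beta(step: int, cycle_length: int, ramp_fraction: float = 0.5,
                  beta_max: float = 1.0) -> float:
    """Assumed cyclical schedule: restart the ramp every cycle_length steps."""
    position = (step % cycle_length) / cycle_length  # position in cycle, in [0, 1)
    return beta_max * min(1.0, position / ramp_fraction)


def vae_loss(reconstruction_error: float, kl_divergence: float, beta: float) -> float:
    """beta-VAE objective: reconstruction term plus beta-weighted KL term."""
    return reconstruction_error + beta * kl_divergence
```

With a small beta early in training (or early in each cycle), the loss is dominated by reconstruction error, which is one plausible reason such schedules can help signal fidelity.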

Updated: 2025-07-31 19:37:05

Categories: cs.LG

Download: http://arxiv.org/abs/2508.00131v1

Algorithmic Detection of Rank Reversals, Transitivity Violations, and Decomposition Inconsistencies in Multi-Criteria Decision Analysis

In Multi-Criteria Decision Analysis, Rank Reversals are a serious problem that can greatly affect the results of a Multi-Criteria Decision Method against a particular set of alternatives. It is therefore useful to have a mechanism that allows one to measure the performance of a method on a set of alternatives. This idea could be taken further to build a global ranking of the effectiveness of different methods to solve a problem. In this paper, we present three tests that detect the presence of Rank Reversals, along with their implementation in the Scikit-Criteria library. We also address the complications that arise when implementing these tests for general scenarios and the design considerations we made to handle them. We close with a discussion about how these additions could play a major role in the judgment of multi-criteria decision methods for problem solving.
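A rank reversal can be illustrated with a tiny decision matrix. The sketch below does not use the Scikit-Criteria API and is not one of the paper's three tests; it is a hypothetical removal check built on a max-normalized weighted sum, assuming all criteria are maximized and strictly positive. The helper names and the example data are invented for illustration.

```python
def weighted_sum_ranking(matrix, weights):
    """Rank alternatives (best first) by a max-normalized weighted sum.

    matrix maps an alternative name to its list of criterion values.
    All criteria are assumed to be maximized and strictly positive.
    """
    n_criteria = len(weights)
    col_max = [max(row[j] for row in matrix.values()) for j in range(n_criteria)]
    scores = {
        name: sum(w * v / m for w, v, m in zip(weights, row, col_max))
        for name, row in matrix.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def has_rank_reversal(matrix, weights, removed):
    """True if removing one alternative changes the relative order of the rest."""
    full = [a for a in weighted_sum_ranking(matrix, weights) if a != removed]
    reduced = {a: row for a, row in matrix.items() if a != removed}
    return full != weighted_sum_ranking(reduced, weights)


# Hypothetical data: three alternatives scored on two criteria.
alternatives = {"A": [1, 9], "B": [5, 5], "C": [10, 1]}
weights = [0.5, 0.5]
```

Here removing C changes the column maximum of the first criterion, which rescales A and B differently and swaps their order; removing B leaves both column maxima unchanged, so the order of A and C is preserved.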

Updated: 2025-07-31 19:31:41

Categories: cs.AI,math.OC

Download: http://arxiv.org/abs/2508.00129v1

Structured Transformations for Stable and Interpretable Neural Computation

Despite their impressive performance, contemporary neural networks often lack structural safeguards that promote stable learning and interpretable behavior. In this work, we introduce a reformulation of layer-level transformations that departs from the standard unconstrained affine paradigm. Each transformation is decomposed into a structured linear operator and a residual corrective component, enabling more disciplined signal propagation and improved training dynamics. Our formulation encourages internal consistency and supports stable information flow across depth, while remaining fully compatible with standard learning objectives and backpropagation. Through a series of synthetic and real-world experiments, we demonstrate that models constructed with these structured transformations exhibit improved gradient conditioning, reduced sensitivity to perturbations, and layer-wise robustness. We further show that these benefits persist across architectural scales and training regimes. This study serves as a foundation for a more principled class of neural architectures that prioritize stability and transparency, offering new tools for reasoning about learning behavior without sacrificing expressive power.

Updated: 2025-07-31 19:26:45

Categories: cs.LG

Download: http://arxiv.org/abs/2508.00127v1

Are Sparse Autoencoders Useful for Java Function Bug Detection?

Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision. Code available at https://github.com/rufimelo99/SAE-Java-Bug-Detection

Updated: 2025-07-31 19:17:07

Categories: cs.SE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.10375v3

StackLiverNet: A Novel Stacked Ensemble Model for Accurate and Interpretable Liver Disease Detection

Liver diseases are a serious health concern worldwide and require precise and timely diagnosis to improve patients' chances of survival. The current literature has implemented numerous machine learning and deep learning models to classify liver diseases, but most of them suffer from issues such as high misclassification error, poor interpretability, prohibitive computational expense, and a lack of good preprocessing strategies. To address these drawbacks, we introduce StackLiverNet in this study, an interpretable stacked ensemble model tailored to the liver disease detection task. The framework uses advanced data preprocessing and feature selection techniques to increase model robustness and predictive ability. Random undersampling is performed to deal with class imbalance and balance the training data. StackLiverNet is an ensemble of several hyperparameter-optimized base classifiers, whose complementary advantages are combined through a LightGBM meta-model. The model demonstrates excellent performance, with a testing accuracy of 99.89%, Cohen Kappa of 0.9974, and AUC of 0.9993, having only 5 misclassifications, and efficient training and inference speeds that are amenable to clinical practice (training time 4.2783 seconds, inference time 0.1106 seconds). In addition, Local Interpretable Model-Agnostic Explanations (LIME) are applied to generate transparent explanations of individual predictions, revealing high concentrations of Alkaline Phosphatase and moderate SGOT as important indicators of liver disease. SHAP was also used to rank features by their global contribution to predictions, while the Morris method confirmed the most influential features through sensitivity analysis.
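The random undersampling step mentioned above can be sketched in a few lines. This is a generic, standard-library illustration and not the StackLiverNet preprocessing code; the function name `random_undersample` and the choice to shrink every class to the size of the rarest one are assumptions.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Downsample every class to the size of the rarest class (assumed policy)."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = [(s, label) for label, group in by_class.items()
                for s in rng.sample(group, target)]
    rng.shuffle(balanced)  # avoid blocks of identical labels
    xs = [s for s, _ in balanced]
    ys = [label for _, label in balanced]
    return xs, ys
```

Undersampling discards majority-class rows, so it trades some data for a balanced training set; whether that trade is worthwhile depends on how much data the majority class has to spare.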

Updated: 2025-07-31 19:13:30

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.00117v1

No AI Without PI! Object-Centric Process Mining as the Enabler for Generative, Predictive, and Prescriptive Artificial Intelligence

The uptake of Artificial Intelligence (AI) impacts the way we work, interact, do business, and conduct research. However, organizations struggle to apply AI successfully in industrial settings where the focus is on end-to-end operational processes. Here, we consider generative, predictive, and prescriptive AI and elaborate on the challenges of diagnosing and improving such processes. We show that AI needs to be grounded using Object-Centric Process Mining (OCPM). Process-related data are structured and organization-specific and, unlike text, processes are often highly dynamic. OCPM is the missing link connecting data and processes and enables different forms of AI. We use the term Process Intelligence (PI) to refer to the amalgamation of process-centric data-driven techniques able to deal with a variety of object and event types, enabling AI in an organizational context. This paper explains why AI requires PI to improve operational processes and highlights opportunities for successfully combining OCPM and generative, predictive, and prescriptive AI.

Updated: 2025-07-31 19:11:51

Categories: cs.AI,H.4.1; I.2.1

Download: http://arxiv.org/abs/2508.00116v1

Sampling from Energy-based Policies using Diffusion

Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, such as Gaussians, for policy representation, which limits their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances sample efficiency in continuous control tasks and captures multimodal behaviors, addressing key limitations of existing methods.
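The Boltzmann policy described above assigns probability proportional to exp(Q(s, a) / alpha), that is, exp(-E / alpha) with energy E = -Q. The sketch below evaluates this on a discretized one-dimensional action grid, which sidesteps the continuous-space sampling problem that DQS addresses with diffusion. The bimodal Q-function, the grid, and the temperature value are hypothetical choices made only to show why a single Gaussian is a poor fit.

```python
import math


def boltzmann_policy(q_values, temperature):
    """Return pi(a) proportional to exp(Q(a) / temperature) over discrete actions."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def bimodal_q(action):
    """Hypothetical soft Q-function with two equally good actions near -1 and +1."""
    return -((action * action - 1.0) ** 2)


# Grid of 41 actions from -2.0 to 2.0 in steps of 0.1.
actions = [-2.0 + 0.1 * i for i in range(41)]
policy = boltzmann_policy([bimodal_q(a) for a in actions], temperature=0.1)
```

The resulting distribution has two peaks, near -1 and +1, with very little mass at 0. A Gaussian matched to its mean would centre on 0, exactly where the Q-function says actions are poor.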

Updated: 2025-07-31 19:07:51

Categories: cs.LG

Download: http://arxiv.org/abs/2410.01312v2

funOCLUST: Clustering Functional Data with Outliers

Functional data present unique challenges for clustering due to their infinite-dimensional nature and potential sensitivity to outliers. An extension of the OCLUST algorithm to the functional setting is proposed to address these issues. The approach leverages the OCLUST framework, creating a robust method to cluster curves and trim outliers. The methodology is evaluated on both simulated and real-world functional datasets, demonstrating strong performance in clustering and outlier identification.

Updated: 2025-07-31 19:00:20

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.00110v1

FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.

Updated: 2025-07-31 19:00:11

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.00109v1

Hyperproperty-Constrained Secure Reinforcement Learning

Hyperproperties for Time Window Temporal Logic (HyperTWTL) is a domain-specific formal specification language known for its effectiveness in compactly representing security, opacity, and concurrency properties for robotics applications. This paper focuses on HyperTWTL-constrained secure reinforcement learning (SecRL). Although temporal logic-constrained safe reinforcement learning (SRL) is an evolving research problem with a body of existing literature, there is a significant research gap in exploring security-aware reinforcement learning (RL) using hyperproperties. Given the dynamics of an agent as a Markov Decision Process (MDP) and opacity/security constraints formalized as HyperTWTL, we propose an approach for learning security-aware optimal policies using dynamic Boltzmann softmax RL while satisfying the HyperTWTL constraints. The effectiveness and scalability of our proposed approach are demonstrated using a pick-up and delivery robotic mission case study. We also compare our results with two other baseline RL algorithms, showing that our proposed method outperforms them.

Updated: 2025-07-31 18:57:18

Categories: cs.AI,cs.LG,cs.LO,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.00106v1

SPLITZ: Certifiable Robustness via Split Lipschitz Randomized Smoothing

Certifiable robustness guarantees that small perturbations around an input to a classifier will not change the prediction. There are two approaches to providing certifiable robustness to adversarial examples: a) explicitly training classifiers with small Lipschitz constants, and b) randomized smoothing, which adds random noise to the input to create a smooth classifier. We propose SPLITZ, a practical and novel approach that combines the synergistic benefits of both ideas in a single framework. Our main idea is to split a classifier into two halves, constrain the Lipschitz constant of the first half, and smooth the second half via randomization. Motivation for SPLITZ comes from the observation that many standard deep networks exhibit heterogeneity in Lipschitz constants across layers. SPLITZ can exploit this heterogeneity while inheriting the scalability of randomized smoothing. We present a principled approach to train SPLITZ and provide theoretical analysis to derive certified robustness guarantees during inference. We present a comprehensive comparison of robustness-accuracy trade-offs and show that SPLITZ consistently improves on existing state-of-the-art approaches on the MNIST, CIFAR-10 and ImageNet datasets. For instance, with an $\ell_2$ norm perturbation budget of $\epsilon=1$, SPLITZ achieves $43.2\%$ top-1 test accuracy on the CIFAR-10 dataset compared to the state-of-the-art top-1 test accuracy of $39.8\%$.
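The two ingredients combined above can each be written as a one-line bound. The sketch below is not the SPLITZ certificate, which is derived in the paper and not reproduced here. It shows the standard Gaussian randomized-smoothing radius of Cohen et al. (2019), plus a naive composition step: if the first half h satisfies ||h(x) - h(x')|| <= L ||x - x'||, then a radius R certified in the latent space covers input perturbations of norm at most R / L. The function names are hypothetical.

```python
from statistics import NormalDist


def smoothing_radius(p_top, p_runner_up, sigma):
    """L2 radius certified by Gaussian randomized smoothing (Cohen et al., 2019).

    p_top is a lower bound on the top-class probability under noise and
    p_runner_up an upper bound on the runner-up; both must lie in (0, 1).
    """
    inv_cdf = NormalDist().inv_cdf
    return 0.5 * sigma * (inv_cdf(p_top) - inv_cdf(p_runner_up))


def input_radius(latent_radius, lipschitz_constant):
    """Naive composition: an L-Lipschitz first half maps input radius R / L
    into latent radius R (an illustrative assumption, not the paper's theorem)."""
    return latent_radius / lipschitz_constant
```

The composition makes the trade-off visible: a smaller Lipschitz constant for the first half enlarges the input-space radius obtained from the same amount of smoothing in the second half.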

Updated: 2025-07-31 18:56:04

Categories: cs.LG,cs.IT,math.IT

Download: http://arxiv.org/abs/2407.02811v3

A Mixed User-Centered Approach to Enable Augmented Intelligence in Intelligent Tutoring Systems: The Case of MathAIde app

Integrating Artificial Intelligence in Education (AIED) aims to enhance learning experiences through technologies like Intelligent Tutoring Systems (ITS), offering personalized learning, increased engagement, and improved retention rates. However, AIED faces three main challenges: the critical role of teachers in the design process, the limitations and reliability of AI tools, and the accessibility of technological resources. Augmented Intelligence (AuI) addresses these challenges by enhancing human capabilities rather than replacing them: systems suggest solutions while humans provide final assessments, thus improving AI over time. In this sense, this study focuses on designing, developing, and evaluating MathAIde, an ITS that corrects mathematics exercises using computer vision and AI and provides feedback based on photos of student work. The methodology included brainstorming sessions with potential users, high-fidelity prototyping, A/B testing, and a case study involving real-world classroom environments for teachers and students. Our research identified several design possibilities for implementing AuI in ITSs, emphasizing a balance between user needs and technological feasibility. Prioritization and validation through prototyping and testing highlighted the importance of efficiency metrics, ultimately leading to a solution that offers pre-defined remediation alternatives for teachers. Real-world deployment demonstrated the usefulness of the proposed solution. Our research contributes to the literature by providing a usable, teacher-centered design approach that involves teachers in all design phases. As a practical implication, we highlight that the user-centered design approach increases the usefulness and adoption potential of AIED systems, especially in resource-limited environments.

Updated: 2025-07-31 18:56:01

Categories: cs.HC,cs.AI,68T01,H.5.0; I.2.0

Download: http://arxiv.org/abs/2508.00103v1

Leveraging Operator Learning to Accelerate Convergence of the Preconditioned Conjugate Gradient Method

We propose a new deflation strategy to accelerate the convergence of the preconditioned conjugate gradient (PCG) method for solving parametric large-scale linear systems of equations. Unlike traditional deflation techniques that rely on eigenvector approximations or recycled Krylov subspaces, we generate the deflation subspaces using operator learning, specifically the Deep Operator Network (DeepONet). To this aim, we introduce two complementary approaches for assembling the deflation operators. The first approach approximates near-null space vectors of the discrete PDE operator using the basis functions learned by the DeepONet. The second approach directly leverages solutions predicted by the DeepONet. To further enhance convergence, we also propose several strategies for prescribing the sparsity pattern of the deflation operator. A comprehensive set of numerical experiments encompassing steady-state, time-dependent, scalar, and vector-valued problems posed on both structured and unstructured geometries is presented and demonstrates the effectiveness of the proposed DeepONet-based deflated PCG method, as well as its generalization across a wide range of model parameters and problem resolutions.

Updated: 2025-07-31 18:53:23

Categories: math.NA,cs.LG,cs.NA,math.OC,65M55, 68T05, 49K20

Download: http://arxiv.org/abs/2508.00101v1

Stress-Aware Resilient Neural Training

This paper introduces Stress-Aware Learning, a resilient neural training paradigm in which deep neural networks dynamically adjust their optimization behavior, whether under stable training regimes or in settings with uncertain dynamics, based on the concept of Temporary (Elastic) and Permanent (Plastic) Deformation, inspired by structural fatigue in materials science. To instantiate this concept, we propose Plastic Deformation Optimizer, a stress-aware mechanism that injects adaptive noise into model parameters whenever an internal stress signal, reflecting stagnation in training loss and accuracy, indicates persistent optimization difficulty. This enables the model to escape sharp minima and converge toward flatter, more generalizable regions of the loss landscape. Experiments across six architectures, four optimizers, and seven vision benchmarks demonstrate improved robustness and generalization with minimal computational overhead. The code and 3D visuals will be available on GitHub: https://github.com/Stress-Aware-Learning/SAL.
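The mechanism described above, noise injection triggered by a stagnation signal, can be sketched without any deep learning framework. The class below is a hypothetical illustration and not the Plastic Deformation Optimizer itself. It assumes the stress signal is a count of consecutive steps without a loss improvement of at least `min_delta` (the paper's signal also reflects accuracy), and that the noise is fixed-variance Gaussian rather than adaptive.

```python
import random


class StressNoiseInjector:
    """Hypothetical helper: perturb parameters after the loss stagnates."""

    def __init__(self, patience=3, min_delta=1e-3, noise_std=0.01, seed=0):
        self.patience = patience
        self.min_delta = min_delta
        self.noise_std = noise_std
        self.best_loss = float("inf")
        self.stress = 0  # consecutive steps without meaningful improvement
        self.rng = random.Random(seed)

    def step(self, params, loss):
        """Update the stress counter and return (possibly perturbed) parameters."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.stress = 0
            return list(params)
        self.stress += 1
        if self.stress < self.patience:
            return list(params)  # "elastic" regime: leave parameters untouched
        self.stress = 0  # release accumulated stress after the "plastic" update
        return [p + self.rng.gauss(0.0, self.noise_std) for p in params]
```

In a training loop, `step` would be called once per epoch with the flattened parameters and the current loss; the perturbation only fires after `patience` stagnant epochs in a row.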

Updated: 2025-07-31 18:46:19

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2508.00098v1

SourceSplice: Source Selection for Machine Learning Tasks

Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks, a challenge amplified by the deluge of data sources available in modern organizations. Prior work in data discovery largely focuses on metadata matching, semantic similarity, or identifying tables that should be joined to answer a particular query, but does not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML model. Both algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility and must be judiciously chosen. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired by gene splicing, a core concept used in protein synthesis. We empirically evaluate our algorithms on three real-world datasets and synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task utility. We also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.

Updated: 2025-07-31 18:46:06

Categories: cs.LG,cs.AI,cs.DB,I.2.6

Download: http://arxiv.org/abs/2507.22186v2

XRoboToolkit: A Cross-Platform Framework for Robot Teleoperation

The rapid advancement of Vision-Language-Action models has created an urgent need for large-scale, high-quality robot demonstration datasets. Although teleoperation is the predominant method for data collection, current approaches suffer from limited scalability, complex setup procedures, and suboptimal data quality. This paper presents XRoboToolkit, a cross-platform framework for extended reality based robot teleoperation built on the OpenXR standard. The system features low-latency stereoscopic visual feedback, optimization-based inverse kinematics, and support for diverse tracking modalities including head, controller, hand, and auxiliary motion trackers. XRoboToolkit's modular architecture enables seamless integration across robotic platforms and simulation environments, spanning precision manipulators, mobile robots, and dexterous hands. We demonstrate the framework's effectiveness through precision manipulation tasks and validate data quality by training VLA models that exhibit robust autonomous performance.

Updated: 2025-07-31 18:45:13

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2508.00097v1

Private GPTs for LLM-driven testing in software development and machine learning

In this contribution, we examine the capability of private GPTs to automatically generate executable test code based on requirements. More specifically, we use acceptance criteria as input, formulated as part of epics, or stories, which are typically used in modern development processes. This gives product owners, or business intelligence, respectively, a way to directly produce testable criteria through the use of LLMs. We explore the quality of the so-produced tests in two ways: i) directly by letting the LLM generate code from requirements, ii) through an intermediate step using Gherkin syntax. As a result, it turns out that the two-step procedure yields better results, where we define better in terms of human readability and best coding practices, i.e., lines of code and use of additional libraries typically used in testing. Concretely, we evaluate prompt effectiveness across two scenarios: a simple "Hello World" program and a digit classification model, showing that structured prompts lead to higher-quality test outputs.

Updated: 2025-07-31 18:44:42

Categories: cs.SE,cs.AI,I.2.1

Download: http://arxiv.org/abs/2506.06509v2

Riemannian Optimization for Distance Geometry: A Study of Convergence, Robustness, and Incoherence

The problem of recovering a configuration of points from partial pairwise distances, referred to as the Euclidean Distance Geometry (EDG) problem, arises in a broad range of applications, including sensor network localization, molecular conformation, and manifold learning. In this paper, we propose a Riemannian optimization framework for solving the EDG problem by formulating it as a low-rank matrix completion task over the space of positive semi-definite Gram matrices. The available distance measurements are encoded as expansion coefficients in a non-orthogonal basis, and optimization over the Gram matrix implicitly enforces geometric consistency through the triangle inequality, a structure inherited from classical multidimensional scaling. Under a Bernoulli sampling model for observed distances, we prove that Riemannian gradient descent on the manifold of rank-$r$ matrices locally converges linearly with high probability when the sampling probability satisfies $p \geq \mathcal{O}(\nu^2 r^2 \log(n)/n)$, where $\nu$ is an EDG-specific incoherence parameter. Furthermore, we provide an initialization candidate using a one-step hard thresholding procedure that yields convergence, provided the sampling probability satisfies $p \geq \mathcal{O}(\nu r^{3/2} \log^{3/4}(n)/n^{1/4})$. A key technical contribution of this work is the analysis of a symmetric linear operator arising from a dual basis expansion in the non-orthogonal basis, which requires a novel application of the Hanson--Wright inequality to establish an optimal restricted isometry property in the presence of coupled terms. Empirical evaluations on synthetic data demonstrate that our algorithm achieves competitive performance relative to state-of-the-art methods. Moreover, we propose a novel notion of matrix incoherence tailored to the EDG setting and provide robustness guarantees for our method.

Updated: 2025-07-31 18:40:42

Categories: math.OC,cs.CG,cs.LG

Download: http://arxiv.org/abs/2508.00091v1
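
The Riemannian optimization abstract above works with Gram matrices and a Bernoulli sampling model for the observed distances. The sketch below illustrates only the standard identity linking the two representations, $D_{ij}^2 = G_{ii} + G_{jj} - 2G_{ij}$, together with a symmetric Bernoulli observation mask. The helper names and the toy points are assumptions, and none of the paper's optimization is reproduced.

```python
import random


def gram_matrix(points):
    """G[i][j] is the inner product of point i and point j."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def squared_distances_from_gram(G):
    """Recover squared pairwise distances: D2[i][j] = G[i][i] + G[j][j] - 2*G[i][j]."""
    n = len(G)
    return [[G[i][i] + G[j][j] - 2 * G[i][j] for j in range(n)] for i in range(n)]


def bernoulli_mask(n, p, seed=0):
    """Symmetric mask: each pair i < j is observed independently with probability p."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mask[i][j] = mask[j][i] = rng.random() < p
    return mask


# Toy configuration of three collinear points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
G = gram_matrix(points)
D2 = squared_distances_from_gram(G)
mask = bernoulli_mask(len(points), p=0.5)
```

In the completion setting only the entries of `D2` where `mask` is true would be available, and the Gram matrix would be the unknown to recover.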

Punching Bag vs. Punching Person: Motion Transferability in Videos

Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: https://github.com/raiyaan-abdullah/Motion-Transfer.

Updated: 2025-07-31 18:19:20

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.00085v1

A Survey on Code Generation with LLM-based Agents

Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM-based code generation agents. We trace the technology's developmental trajectory from its inception and systematically categorize its core techniques, including both single-agent and multi-agent architectures. Furthermore, this survey details the applications of LLM-based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long-term research directions for the future work of the field.

Updated: 2025-07-31 18:17:36

Categories: cs.SE,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.00083v1

Cooperative and Asynchronous Transformer-based Mission Planning for Heterogeneous Teams of Mobile Robots

Cooperative mission planning for heterogeneous teams of mobile robots presents a unique set of challenges, particularly when operating under communication constraints and limited computational resources. To address these challenges, we propose the Cooperative and Asynchronous Transformer-based Mission Planning (CATMiP) framework, which leverages multi-agent reinforcement learning (MARL) to coordinate distributed decision making among agents with diverse sensing, motion, and actuation capabilities, operating under sporadic ad hoc communication. A Class-based Macro-Action Decentralized Partially Observable Markov Decision Process (CMacDec-POMDP) is also formulated to effectively model asynchronous decision-making for heterogeneous teams of agents. The framework utilizes an asynchronous centralized training and distributed execution scheme, enabled by the proposed Asynchronous Multi-Agent Transformer (AMAT) architecture. This design allows a single trained model to generalize to larger environments and accommodate varying team sizes and compositions. We evaluate CATMiP in a 2D grid-world simulation environment and compare its performance against planning-based exploration methods. Results demonstrate CATMiP's superior efficiency, scalability, and robustness to communication dropouts and input noise, highlighting its potential for real-world heterogeneous mobile robot systems. The code is available at https://github.com/mylad13/CATMiP

Updated: 2025-07-31 18:17:13

Categories: cs.RO,cs.AI,I.2.9; I.2.11

Download: http://arxiv.org/abs/2410.06372v3

PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series

Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets. Current publicly available datasets are too small, not diverse and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. Additionally, our dataset represents a discrete-sequence problem, which remains unaddressed by previously-proposed solutions in literature. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data. Furthermore, results show that the threshold used can have a large influence on detection performance, hence more work needs to be invested in methods to find a suitable threshold without the need for labelled data.

Updated: 2025-07-31 18:16:16

Categories: cs.LG,cs.AI,cs.CE,cs.SY,eess.SY

Download: http://arxiv.org/abs/2411.13951v5

Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench

HealthBench, a benchmark designed to better measure the capabilities of AI systems for health (Arora et al., 2025), has advanced medical language model evaluation through physician-crafted dialogues and transparent rubrics. However, its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies, further compounded by potential biases in automated grading systems. These limitations are particularly magnified in low- and middle-income settings, where issues like sparse neglected tropical disease coverage and region-specific guideline mismatches are prevalent. The unique challenges of the African context, including data scarcity, inadequate infrastructure, and nascent regulatory frameworks, underscore the urgent need for more globally relevant and equitable benchmarks. To address these shortcomings, we propose anchoring reward functions in version-controlled Clinical Practice Guidelines (CPGs) that incorporate systematic reviews and GRADE evidence ratings. Our roadmap outlines "evidence-robust" reinforcement learning via rubric-to-guideline linkage, evidence-weighted scoring, and contextual override logic, complemented by a focus on ethical considerations and the integration of delayed outcome feedback. By re-grounding rewards in rigorously vetted CPGs, while preserving HealthBench's transparency and physician engagement, we aim to foster medical language models that are not only linguistically polished but also clinically trustworthy, ethically sound, and globally relevant.

Updated: 2025-07-31 18:16:10

Categories: cs.AI

Download: http://arxiv.org/abs/2508.00081v1

PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems, a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

Updated: 2025-07-31 18:12:51

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.00079v1

Evaluating COVID 19 Feature Contributions to Bitcoin Return Forecasting: Methodology Based on LightGBM and Genetic Optimization

This study proposes a novel methodological framework integrating a LightGBM regression model and genetic algorithm (GA) optimization to systematically evaluate the contribution of COVID-19-related indicators to Bitcoin return prediction. The primary objective was not merely to forecast Bitcoin returns but rather to determine whether including pandemic-related health data significantly enhances prediction accuracy. A comprehensive dataset comprising daily Bitcoin returns and COVID-19 metrics (vaccination rates, hospitalizations, testing statistics) was constructed. Predictive models, trained with and without COVID-19 features, were optimized using GA over 31 independent runs, allowing robust statistical assessment. Performance metrics (R2, RMSE, MAE) were statistically compared through distribution overlaps and Mann-Whitney U tests. Permutation Feature Importance (PFI) analysis quantified individual feature contributions. Results indicate that COVID-19 indicators significantly improved model performance, particularly in capturing extreme market fluctuations (R2 increased by 40%, RMSE decreased by 2%, both highly significant statistically). Among COVID-19 features, vaccination metrics, especially the 75th percentile of fully vaccinated individuals, emerged as dominant predictors. The proposed methodology extends existing financial analytics tools by incorporating public health signals, providing investors and policymakers with refined indicators to navigate market uncertainty during systemic crises.

Updated: 2025-07-31 18:12:33

Categories: cs.LG,cs.AI,econ.GN,q-fin.EC

Download: http://arxiv.org/abs/2508.00078v1
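
The Bitcoin forecasting abstract above compares performance metrics from repeated runs with Mann-Whitney U tests. As a rough illustration of what that statistic measures, the sketch below computes U for two small samples in plain Python. The scores are made-up placeholders, not results from the paper, and the function name is an assumption. For p-values in practice, a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice.

```python
def mann_whitney_u(x, y):
    """U statistic of sample x versus sample y.

    Counts the pairs (a, b) with a in x and b in y where a > b;
    tied pairs contribute one half.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Placeholder scores for two model variants (illustrative values only).
scores_with_features = [0.30, 0.35, 0.28]
scores_without_features = [0.20, 0.25, 0.29]

u_with = mann_whitney_u(scores_with_features, scores_without_features)
u_without = mann_whitney_u(scores_without_features, scores_with_features)
```

The two statistics always sum to the number of pairs, here 3 x 3 = 9, so a value of `u_with` close to 9 means the first sample tends to dominate the second.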

OneShield -- the Next Generation of LLM Guardrails

The rise of Large Language Models has created a general excitement about the great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes it extremely challenging to universally shield users against their potential risks, and one-size-fits-all solutions are unfeasible. In this work, we propose OneShield, our stand-alone, model-agnostic and customizable solution to safeguard LLMs. OneShield aims to provide facilities for defining risk factors, expressing and declaring contextual safety and compliance policies, and mitigating LLM risks, with a focus on each specific customer. We describe the implementation of the framework, discuss scalability considerations, and provide usage statistics of OneShield since its initial deployment.

Updated: 2025-07-31 18:07:13

Categories: cs.CR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.21170v2

SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions

Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. Our code is available at https://github.com/ExplainableML/sub and the dataset at http://huggingface.co/datasets/Jessica-bader/SUB.

Updated: 2025-07-31 17:59:40

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.23784v1

Phi-Ground Tech Report: Advancing Perception in GUI Grounding

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/

Updated: 2025-07-31 17:59:09

Categories: cs.CV,cs.AI,cs.MM

Download: http://arxiv.org/abs/2507.23779v1

XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding

Current auto-regressive models can generate high-quality, topologically precise meshes; however, they necessitate thousands, or even tens of thousands, of next-token predictions during inference, resulting in substantial latency. We introduce XSpecMesh, a quality-preserving acceleration method for auto-regressive mesh generation models. XSpecMesh employs a lightweight, multi-head speculative decoding scheme to predict multiple tokens in parallel within a single forward pass, thereby accelerating inference. We further propose a verification and resampling strategy: the backbone model verifies each predicted token and resamples any tokens that do not meet the quality criteria. In addition, we propose a distillation strategy that trains the lightweight decoding heads by distilling from the backbone model, encouraging their prediction distributions to align and improving the success rate of speculative predictions. Extensive experiments demonstrate that our method achieves a 1.7x speedup without sacrificing generation quality. Our code will be released.

Updated: 2025-07-31 17:58:30

Categories: cs.GR,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.23777v1
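
The XSpecMesh abstract above describes a backbone model that verifies each speculatively predicted token and resamples those that fail a quality criterion. The abstract does not state the criterion, so the sketch below assumes a simple probability threshold and uses the backbone's most likely token as the replacement. The function name, `min_prob`, and the toy inputs are all hypothetical.

```python
def verify_and_resample(draft_tokens, backbone_probs, backbone_choice, min_prob=0.1):
    """Accept draft tokens until one fails the (assumed) threshold test.

    draft_tokens[i]    token proposed by the i-th lightweight head
    backbone_probs[i]  probability the backbone assigns to that token
    backbone_choice[i] token the backbone itself would pick at position i

    The first failing token is replaced by the backbone's choice and decoding
    stops there, because later drafts were conditioned on the rejected token.
    """
    accepted = []
    for token, prob, fallback in zip(draft_tokens, backbone_probs, backbone_choice):
        if prob >= min_prob:
            accepted.append(token)
        else:
            accepted.append(fallback)
            break
    return accepted


# Toy example: the third draft token is rejected and replaced.
tokens = verify_and_resample(
    draft_tokens=[5, 7, 9, 2],
    backbone_probs=[0.6, 0.4, 0.02, 0.9],
    backbone_choice=[5, 7, 3, 2],
)
```

Each call advances generation by at least one token, so this scheme is never slower in tokens per verification step than plain auto-regressive decoding.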

SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of an optimal agent in any environment, SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using an LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that SimuRA improves the success of flight search from 0% to 32.2%. World-model-based planning, in particular, shows a consistent advantage of up to 124% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility of training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make SimuRA, a web-browsing agent built on this architecture with pretrained LLMs, available as a research demo for public testing.

Updated: 2025-07-31 17:57:20

Categories: cs.AI,cs.CL,cs.LG,cs.RO

Download: http://arxiv.org/abs/2507.23773v1

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.

Updated: 2025-07-31 17:57:18

Categories: cs.AI,cs.LG,cs.MA,q-bio.GN

Download: http://arxiv.org/abs/2507.21035v2

Consensus-Driven Active Model Selection

The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset -- a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state-of-the-art. Code and data are available at https://github.com/justinkay/coda.

Updated: 2025-07-31 17:56:28

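Title: Consensus-Driven Active Model Selection

Abstract: The widespread availability of off-the-shelf machine learning models poses a challenge: which of the many available candidates should be chosen for a given data analysis task? This model selection question is traditionally answered by collecting and annotating a validation dataset, a costly and time-intensive process. We propose a method for active model selection that uses the predictions of candidate models to prioritize labeling the test data points that most efficiently distinguish the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling the relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement among models in the candidate pool to guide label acquisition, and uses Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks that capture a range of model selection scenarios. CODA significantly outperforms existing active model selection methods, reducing the annotation effort required to find the best model by upwards of 70% compared with the previous state of the art. Code and data are available at https://github.com/justinkay/coda.

Updated: 2025-07-31 17:56:28

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.23771v1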
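
The CODA abstract above says that disagreement among candidate models guides which points to label. As a rough illustration of that signal only, the sketch below ranks points by vote entropy, a generic query-by-committee heuristic. This is an assumption made for illustration and is not CODA's probabilistic model; the function names `vote_entropy` and `most_contested` are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy of the label votes cast by the candidate models."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def most_contested(predictions):
    """Index of the point whose predicted labels disagree the most.

    predictions[m][i] is the label that model m assigns to point i.
    Illustrative heuristic only, not the method from the paper.
    """
    n_points = len(predictions[0])
    scores = [
        vote_entropy([model[i] for model in predictions])
        for i in range(n_points)
    ]
    return max(range(n_points), key=scores.__getitem__)


# Three hypothetical models, three unlabeled points.
preds = [
    ["cat", "dog", "cat"],
    ["cat", "cat", "cat"],
    ["cat", "bird", "cat"],
]
next_to_label = most_contested(preds)
```

Here the second point receives three different labels, so it would be queried first; points on which all models agree carry no information for telling the candidates apart.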

Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction

Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models with diffusion-based generative 3D interaction refinement to achieve occlusion-robust two-hand reconstruction. First, we introduce a lightweight fusion alignment encoder that aligns fused multimodal 2D priors like key points, segmentation maps, and depth cues from vision foundation models during training. This provides robust structured guidance, further enabling efficient inference without heavy foundation model encoders at test time while maintaining high reconstruction accuracy. Second, we implement a two-hand diffusion model explicitly trained to convert interpenetrated 3D poses into plausible, penetration-free counterparts. Through collision gradient-guided denoising, the model rectifies artifacts while preserving natural spatial relationships between hands. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, HIC, and FreiHAND datasets, significantly advancing occlusion handling and interaction robustness. Our code will be publicly released.

Updated: 2025-07-31 17:55:56

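Title: Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction

Abstract: Reconstructing two hands from monocular images faces persistent challenges caused by complex, dynamic hand poses and occlusions, which make plausible interaction alignment very difficult to achieve. Existing methods struggle with such alignment issues and often produce misalignment and penetration artifacts. To address this, we propose a two-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models with diffusion-based generative 3D interaction refinement to achieve occlusion-robust two-hand reconstruction. First, we introduce a lightweight fusion alignment encoder that, during training, aligns fused multimodal 2D priors from vision foundation models, such as keypoints, segmentation maps, and depth cues. This provides robust structured guidance and enables efficient inference without heavy foundation model encoders at test time, while maintaining high reconstruction accuracy. Second, we implement a two-hand diffusion model trained explicitly to convert interpenetrating 3D poses into plausible, penetration-free counterparts. Through collision gradient-guided denoising, the model corrects artifacts while preserving natural spatial relationships between the hands. Extensive evaluations show that our method achieves state-of-the-art performance on the InterHand2.6M, HIC, and FreiHAND datasets, significantly advancing occlusion handling and interaction robustness. Our code will be publicly released.

Updated: 2025-07-31 17:55:56

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.17788v2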

Formal Bayesian Transfer Learning via the Total Risk Prior

In analyses with severe data-limitations, augmenting the target dataset with information from ancillary datasets in the application domain, called source datasets, can lead to significantly improved statistical procedures. However, existing methods for this transfer learning struggle to deal with situations where the source datasets are also limited and not guaranteed to be well-aligned with the target dataset. A typical strategy is to use the empirical loss minimizer on the source data as a prior mean for the target parameters, which places the estimation of source parameters outside of the Bayesian formalism. Our key conceptual contribution is to use a risk minimizer conditional on source parameters instead. This allows us to construct a single joint prior distribution for all parameters from the source datasets as well as the target dataset. As a consequence, we benefit from full Bayesian uncertainty quantification and can perform model averaging via Gibbs sampling over indicator variables governing the inclusion of each source dataset. We show how a particular instantiation of our prior leads to a Bayesian Lasso in a transformed coordinate system and discuss computational techniques to scale our approach to moderately sized datasets. We also demonstrate that recently proposed minimax-frequentist transfer learning techniques may be viewed as an approximate Maximum a Posteriori approach to our model. Finally, we demonstrate superior predictive performance relative to the frequentist baseline on a genetics application, especially when the source data are limited.

Updated: 2025-07-31 17:55:16

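Title: Formal Bayesian Transfer Learning via the Total Risk Prior

Abstract: In analyses with severe data limitations, augmenting the target dataset with information from ancillary datasets in the application domain, called source datasets, can significantly improve statistical procedures. However, existing transfer learning methods struggle when the source datasets are themselves limited and are not guaranteed to be well aligned with the target dataset. A typical strategy is to use the empirical loss minimizer on the source data as the prior mean for the target parameters, which places the estimation of the source parameters outside the Bayesian formalism. Our key conceptual contribution is to use a risk minimizer conditional on the source parameters instead. This lets us construct a single joint prior distribution over all parameters from both the source datasets and the target dataset. As a result, we benefit from full Bayesian uncertainty quantification and can perform model averaging via Gibbs sampling over indicator variables that govern the inclusion of each source dataset. We show how a particular instantiation of our prior leads to a Bayesian Lasso in a transformed coordinate system, and we discuss computational techniques for scaling our approach to moderately sized datasets. We also show that recently proposed minimax-frequentist transfer learning techniques can be viewed as an approximate Maximum a Posteriori approach to our model. Finally, we demonstrate superior predictive performance relative to the frequentist baseline on a genetics application, especially when the source data are limited.

Updated: 2025-07-31 17:55:16

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2507.23768v1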

Scaled Beta Models and Feature Dilution for Dynamic Ticket Pricing

A novel approach is presented for identifying distinct signatures of performing acts in the secondary ticket resale market by analyzing dynamic pricing distributions. Using a newly curated, time series dataset from the SeatGeek API, we model ticket pricing distributions as scaled Beta distributions. This enables accurate parameter estimation from incomplete statistical data using a hybrid of quantile matching and the method of moments. Incorporating the estimated $\alpha$ and $\beta$ parameters into Random Forest classifiers significantly improves pairwise artist classification accuracy, demonstrating the unique economic signatures in event pricing data. Additionally, we provide theoretical and empirical evidence that incorporating zero-variance (constant-value) features into Random Forest models acts as an implicit regularizer, enhancing feature variety and robustness. This regularization promotes deeper, more varied trees in the ensemble, improving the bias-variance tradeoff and mitigating overfitting to dominant features. These findings are validated on both the new ticket pricing dataset and the standard UCI ML handwritten digits dataset.

Updated: 2025-07-31 17:55:07

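Title: Scaled Beta Models and Feature Dilution for Dynamic Ticket Pricing

Abstract: We present a novel approach for identifying distinct signatures of performing acts in the secondary ticket resale market by analyzing dynamic pricing distributions. Using a newly curated time-series dataset from the SeatGeek API, we model ticket pricing distributions as scaled Beta distributions. This allows accurate parameter estimation from incomplete statistical data using a hybrid of quantile matching and the method of moments. Incorporating the estimated $\alpha$ and $\beta$ parameters into Random Forest classifiers significantly improves pairwise artist classification accuracy, demonstrating the unique economic signatures in event pricing data. In addition, we provide theoretical and empirical evidence that incorporating zero-variance (constant-value) features into Random Forest models acts as an implicit regularizer, enhancing feature variety and robustness. This regularization promotes deeper, more varied trees in the ensemble, improving the bias-variance tradeoff and mitigating overfitting to dominant features. These findings are validated on both the new ticket pricing dataset and the standard UCI ML handwritten digits dataset.

Updated: 2025-07-31 17:55:07

Categories: stat.ML,cs.LG,68T05, 62H30, 62F10, 68Q32,F.2.2; I.2.6; I.5.2; G.3

Download: http://arxiv.org/abs/2507.23767v1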
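
The abstract above estimates scaled Beta parameters with a hybrid of quantile matching and the method of moments. The sketch below shows only the method-of-moments half, under the assumption that the price range `[lo, hi]`, the mean, and the variance are known; the function name and the example numbers are hypothetical and not taken from the paper.

```python
def scaled_beta_moments(mean, var, lo, hi):
    """Method-of-moments (alpha, beta) for a Beta distribution scaled to [lo, hi].

    A price X is modeled as X = lo + (hi - lo) * Y with Y ~ Beta(alpha, beta).
    """
    width = hi - lo
    mu = (mean - lo) / width      # mean of Y on [0, 1]
    sigma2 = var / width ** 2     # variance of Y on [0, 1]
    if not 0.0 < mu < 1.0 or not 0.0 < sigma2 < mu * (1.0 - mu):
        raise ValueError("moments are not attainable by a Beta distribution")
    common = mu * (1.0 - mu) / sigma2 - 1.0   # equals alpha + beta
    return mu * common, (1.0 - mu) * common


# Hypothetical listing prices between $50 and $250 whose moments match Beta(2, 5).
alpha, beta = scaled_beta_moments(
    mean=50 + 200 * 2 / 7,
    var=200 ** 2 * 10 / 392,
    lo=50,
    hi=250,
)
```

The pair `(alpha, beta)` recovered this way is the kind of compact per-event summary that could then be passed to a classifier as features.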

Improving annotator selection in Active Learning using a mood and fatigue-aware Recommender System

This study centers on overcoming the challenge of selecting the best annotators for each query in Active Learning (AL), with the objective of minimizing misclassifications. AL recognizes the challenges related to cost and time when acquiring labeled data, and decreases the number of labeled data needed. Nevertheless, there is still a need to reduce annotation errors so that the expected accuracy is reached as quickly and efficiently as possible. Most strategies for query-annotator pairs do not consider internal factors that affect productivity, such as mood, attention, motivation, and fatigue levels. This work addresses this gap in the existing literature by not only considering how internal factors (mood and fatigue levels) influence annotators but also presenting a new query-annotator pair strategy that uses a Knowledge-Based Recommendation System (RS). The RS ranks the available annotators so that one or more can be chosen to label the queried instance, based on their past accuracy values, their mood and fatigue levels, and information about the queried instance. This work builds on existing literature on the influence of mood and fatigue on human performance, simulating annotators in a realistic manner and predicting their performance with the RS. The results show that considering past accuracy values, as well as mood and fatigue levels, reduces the number of annotation errors made by the annotators and the uncertainty of the model throughout its training, when compared to not using internal factors. Accuracy and F1-score values were also better in the proposed approach, although these gains were less substantial than the reductions above. The methodologies and findings presented in this study begin to explore the open challenge of human cognitive factors affecting AL.

Updated: 2025-07-31 17:41:30

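Title: Improving Annotator Selection in Active Learning Using a Mood and Fatigue-Aware Recommender System

Abstract: This study focuses on overcoming the challenge of selecting the best annotators for each query in Active Learning (AL), with the goal of minimizing misclassifications. AL recognizes the cost and time challenges involved in acquiring labeled data and reduces the amount of labeled data needed. Nevertheless, annotation errors still need to be reduced so that the expected accuracy is reached as efficiently as possible. Most query-annotator pairing strategies do not consider internal factors that affect productivity, such as mood, attention, motivation, and fatigue levels. This work addresses that gap by not only considering how internal factors (mood and fatigue levels) influence annotators but also proposing a new query-annotator pairing strategy that uses a Knowledge-Based Recommendation System (RS). The RS ranks the available annotators so that one or more can be selected to label the queried instance, using their past accuracy values, their mood and fatigue levels, and information about the queried instance. The work builds on existing literature on how mood and fatigue affect human performance, simulating annotators realistically and predicting their performance with the RS. The results show that, compared with not using internal factors, considering past accuracy values together with mood and fatigue levels reduces the number of annotation errors made by annotators and reduces model uncertainty during training. Accuracy and F1-score values were also better with the proposed approach, although the improvement was less substantial. The methods and findings presented in this study begin to explore the open challenge of human cognitive factors affecting AL.

Updated: 2025-07-31 17:41:30

Categories: cs.LG,cs.HC

Download: http://arxiv.org/abs/2507.23756v1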

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.

Updated: 2025-07-31 17:38:50

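Title: CoT-Self-Instruct: Building High-Quality Synthetic Prompts for Reasoning and Non-Reasoning Tasks

Abstract: We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24, and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.

Updated: 2025-07-31 17:38:50

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.23751v1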

Spatial-Temporal Reinforcement Learning for Network Routing with Non-Markovian Traffic

Reinforcement Learning (RL) has been widely used for packet routing in communication networks, but traditional RL methods rely on the Markov assumption that the current state contains all necessary information for decision-making. In reality, internet traffic is non-Markovian, and past states do influence routing performance. Moreover, common deep RL approaches use function approximators, such as neural networks, that do not model the spatial structure in network topologies. To address these shortcomings, we design a network environment with non-Markovian traffic and introduce a spatial-temporal RL (STRL) framework for packet routing. Our approach outperforms traditional baselines by more than 19% during training and 7% for inference despite a change in network topology.

Updated: 2025-07-31 17:34:18

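Title: Spatial-Temporal Reinforcement Learning for Network Routing with Non-Markovian Traffic

Abstract: Reinforcement Learning (RL) is widely used for packet routing in communication networks, but traditional RL methods rely on the Markov assumption that the current state contains all the information needed for decision-making. In reality, internet traffic is non-Markovian, and past states do influence routing performance. Moreover, common deep RL approaches use function approximators, such as neural networks, that do not model the spatial structure of network topologies. To address these shortcomings, we design a network environment with non-Markovian traffic and introduce a spatial-temporal RL (STRL) framework for packet routing. Our approach outperforms traditional baselines by more than 19% during training and by 7% at inference, despite a change in network topology.

Updated: 2025-07-31 17:34:18

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.22174v2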

Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.

Updated: 2025-07-31 17:24:04

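Title: Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Abstract: Knowledge graphs (KGs) often contain enough information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also makes it possible to detect potential errors, reveal subtle data patterns, and strengthen overall reasoning and interpretation. However, the complexity of these rules, combined with the unique labeling conventions of each KG, can make them hard for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations of logical rules. Specifically, we use the AMIE 3.5.1 rule discovery algorithm to extract logical rules from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine a range of prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations in terms of correctness, clarity, and hallucination, and we also assess the use of large language models as automatic judges. Our results show promising performance in explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.

Updated: 2025-07-31 17:24:04

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.23740v1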

How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Exposure to large language model output is rapidly increasing. How will seeing AI-generated ideas affect human ideas? We conducted an experiment (800+ participants, 40+ countries) where participants viewed creative ideas that were from ChatGPT or prior experimental participants and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and if the examples were labeled as 'AI' (disclosure). Our dynamic experiment design -- ideas from prior participants in an experimental condition are used as stimuli for future participants in the same experimental condition -- speaks to the interdependent process of cultural creation: creative ideas are built upon prior ideas. Hence, we capture the compounding effects of having LLMs 'in the culture loop'. We find that high AI exposure (but not low AI exposure) did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing an idea was from AI and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.

Updated: 2025-07-31 17:19:39

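Title: How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Abstract: Exposure to large language model output is increasing rapidly. How will seeing AI-generated ideas affect human ideas? We conducted an experiment (800+ participants, 40+ countries) in which participants viewed creative ideas from ChatGPT or from prior experimental participants and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and whether the examples were labeled as "AI" (disclosure). In our dynamic experiment design, ideas from prior participants in an experimental condition serve as stimuli for future participants in the same condition, which speaks to the interdependent process of cultural creation: creative ideas build on prior ideas. We therefore capture the compounding effects of having LLMs "in the culture loop". We find that high AI exposure (but not low AI exposure) did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing that an idea came from AI, and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.

Updated: 2025-07-31 17:19:39

Categories: cs.CY,cs.AI,cs.CL,cs.HC

Download: http://arxiv.org/abs/2401.13481v3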

DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction

Access to medical imaging and associated text data has the potential to drive major advances in healthcare research and patient outcomes. However, the presence of Protected Health Information (PHI) and Personally Identifiable Information (PII) in Digital Imaging and Communications in Medicine (DICOM) files presents a significant barrier to the ethical and secure sharing of imaging datasets. This paper presents a hybrid de-identification framework developed by Impact Business Information Solutions (IBIS) that combines rule-based and AI-driven techniques, and rigorous uncertainty quantification for comprehensive PHI/PII removal from both metadata and pixel data. Our approach begins with a two-tiered rule-based system targeting explicit and inferred metadata elements, further augmented by a large language model (LLM) fine-tuned for Named Entity Recognition (NER), and trained on a suite of synthetic datasets simulating realistic clinical PHI/PII. For pixel data, we employ an uncertainty-aware Faster R-CNN model to localize embedded text, extract candidate PHI via Optical Character Recognition (OCR), and apply the NER pipeline for final redaction. Crucially, uncertainty quantification provides confidence measures for AI-based detections to enhance automation reliability and enable informed human-in-the-loop verification to manage residual risks. This uncertainty-aware deidentification framework achieves robust performance across benchmark datasets and regulatory standards, including DICOM, HIPAA, and TCIA compliance metrics. By combining scalable automation, uncertainty quantification, and rigorous quality assurance, our solution addresses critical challenges in medical data de-identification and supports the secure, ethical, and trustworthy release of imaging data for research.

Updated: 2025-07-31 17:19:38

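Title: DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction

Abstract: Access to medical imaging and associated text data has the potential to drive major advances in healthcare research and patient outcomes. However, the presence of Protected Health Information (PHI) and Personally Identifiable Information (PII) in Digital Imaging and Communications in Medicine (DICOM) files is a significant barrier to the ethical and secure sharing of imaging datasets. This paper presents a hybrid de-identification framework developed by Impact Business Information Solutions (IBIS) that combines rule-based and AI-driven techniques with rigorous uncertainty quantification to remove PHI/PII comprehensively from both metadata and pixel data. Our approach begins with a two-tiered rule-based system targeting explicit and inferred metadata elements, further augmented by a large language model (LLM) that is fine-tuned for Named Entity Recognition (NER) and trained on a suite of synthetic datasets simulating realistic clinical PHI/PII. For pixel data, we use an uncertainty-aware Faster R-CNN model to localize embedded text, extract candidate PHI via Optical Character Recognition (OCR), and apply the NER pipeline for final redaction. Crucially, uncertainty quantification provides confidence measures for AI-based detections, which improves the reliability of automation and enables informed human-in-the-loop verification to manage residual risks. This uncertainty-aware de-identification framework achieves robust performance across benchmark datasets and regulatory standards, including DICOM, HIPAA, and TCIA compliance metrics. By combining scalable automation, uncertainty quantification, and rigorous quality assurance, our solution addresses critical challenges in medical data de-identification and supports the secure, ethical, and trustworthy release of imaging data for research.

Updated: 2025-07-31 17:19:38

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2507.23736v1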

Distributed AI Agents for Cognitive Underwater Robot Autonomy

Achieving robust cognitive autonomy in robots navigating complex, unpredictable environments remains a fundamental challenge in robotics. This paper presents Underwater Robot Self-Organizing Autonomy (UROSA), a groundbreaking architecture leveraging distributed Large Language Model AI agents integrated within the Robot Operating System 2 (ROS 2) framework to enable advanced cognitive capabilities in Autonomous Underwater Vehicles. UROSA decentralises cognition into specialised AI agents responsible for multimodal perception, adaptive reasoning, dynamic mission planning, and real-time decision-making. Central innovations include flexible agents dynamically adapting their roles, retrieval-augmented generation utilising vector databases for efficient knowledge management, reinforcement learning-driven behavioural optimisation, and autonomous on-the-fly ROS 2 node generation for runtime functional extensibility. Extensive empirical validation demonstrates UROSA's promising adaptability and reliability through realistic underwater missions in simulation and real-world deployments, showing significant advantages over traditional rule-based architectures in handling unforeseen scenarios, environmental uncertainties, and novel mission objectives. This work not only advances underwater autonomy but also establishes a scalable, safe, and versatile cognitive robotics framework capable of generalising to a diverse array of real-world applications.

Updated: 2025-07-31 17:18:55

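Title: Distributed AI Agents for Cognitive Underwater Robot Autonomy

Abstract: Achieving robust cognitive autonomy in robots that navigate complex, unpredictable environments remains a fundamental challenge in robotics. This paper presents Underwater Robot Self-Organizing Autonomy (UROSA), an architecture that uses distributed Large Language Model AI agents integrated within the Robot Operating System 2 (ROS 2) framework to enable advanced cognitive capabilities in Autonomous Underwater Vehicles. UROSA decentralises cognition into specialised AI agents responsible for multimodal perception, adaptive reasoning, dynamic mission planning, and real-time decision-making. Its main innovations include flexible agents that dynamically adapt their roles, retrieval-augmented generation that uses vector databases for efficient knowledge management, reinforcement learning-driven behavioural optimisation, and autonomous on-the-fly ROS 2 node generation for runtime functional extensibility. Extensive empirical validation demonstrates UROSA's promising adaptability and reliability in realistic underwater missions, both in simulation and in real-world deployments, showing significant advantages over traditional rule-based architectures in handling unforeseen scenarios, environmental uncertainties, and novel mission objectives. This work not only advances underwater autonomy but also establishes a scalable, safe, and versatile cognitive robotics framework that can generalise to a wide range of real-world applications.

Updated: 2025-07-31 17:18:55

Categories: cs.RO,cs.AI,cs.MA

Download: http://arxiv.org/abs/2507.23735v1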

Dimension reduction with structure-aware quantum circuits for hybrid machine learning

Schmidt decomposition of a vector can be understood as writing the singular value decomposition (SVD) in vector form. A vector can be written as a linear combination of tensor products of two-dimensional vectors by recursively applying Schmidt decompositions via SVD to all subsystems. Given a vector expressed as a linear combination of tensor products, using only the $k$ principal terms yields a $k$-rank approximation of the vector. Therefore, writing a vector in this reduced form allows one to retain the most important parts of the vector while removing small noise from it, analogous to SVD-based denoising. In this paper, we show that quantum circuits designed based on a value $k$ (determined from the tensor network decomposition of the mean vector of the training sample) can approximate the reduced-form representations of entire datasets. We then employ this circuit ansatz with a classical neural network head to construct a hybrid machine learning model. Since the output of the quantum circuit for a $2^n$-dimensional vector is an $n$-dimensional probability vector, this provides an exponential compression of the input and can potentially reduce the number of learnable parameters for training large-scale models. We use datasets provided in the Python scikit-learn module for the experiments. The results confirm that the quantum circuit is able to compress data successfully and provide effective $k$-rank approximations to the classical processing component.

Updated: 2025-07-31 17:18:43

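Title: Dimension Reduction with Structure-Aware Quantum Circuits for Hybrid Machine Learning

Abstract: The Schmidt decomposition of a vector can be understood as writing the singular value decomposition (SVD) in vector form. By recursively applying Schmidt decompositions via SVD to all subsystems, a vector can be written as a linear combination of tensor products of two-dimensional vectors. Once a vector is expressed as such a linear combination, keeping only the $k$ principal terms yields a $k$-rank approximation of the vector. Writing a vector in this reduced form therefore retains its most important parts while removing small noise, analogous to SVD-based denoising. In this paper, we show that quantum circuits designed around a value $k$ (determined from the tensor network decomposition of the mean vector of the training sample) can approximate the reduced-form representations of entire datasets. We then combine this circuit ansatz with a classical neural network head to build a hybrid machine learning model. Because the output of the quantum circuit for a $2^n$-dimensional vector is an $n$-dimensional probability vector, this provides an exponential compression of the input and can potentially reduce the number of learnable parameters needed to train large-scale models. We run the experiments on datasets provided in the Python scikit-learn module. The results confirm that the quantum circuit can compress data successfully and provide effective $k$-rank approximations to the classical processing component.

Updated: 2025-07-31 17:18:43

Categories: quant-ph,cs.LG

Download: http://arxiv.org/abs/2508.00048v1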
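
To make the "Schmidt decomposition as SVD" statement in the abstract above concrete, the sketch below reshapes a vector into a matrix for a single bipartition and keeps the `k` largest terms. It is a classical, pure-Python illustration under stated assumptions: it uses power iteration with deflation instead of a library SVD, it handles one bipartition rather than the recursive scheme, and the function names are hypothetical.

```python
import math


def top_singular_triplet(M, iters=200):
    """Leading singular value and vectors of M via power iteration on M^T M."""
    rows, cols = len(M), len(M[0])
    # Assumes the uniform start vector is not orthogonal to the leading direction.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, [0.0] * rows, v
        v = [x / norm for x in w]
    u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    s = math.sqrt(sum(x * x for x in u))
    if s == 0.0:
        return 0.0, [0.0] * rows, v
    return s, [x / s for x in u], v


def schmidt_rank_k(vec, rows, cols, k):
    """Keep the k largest Schmidt terms of vec, viewed as a rows x cols matrix."""
    M = [list(vec[i * cols:(i + 1) * cols]) for i in range(rows)]
    approx = [[0.0] * cols for _ in range(rows)]
    for _ in range(k):
        s, u, v = top_singular_triplet(M)
        for i in range(rows):
            for j in range(cols):
                term = s * u[i] * v[j]
                approx[i][j] += term
                M[i][j] -= term  # deflate so the next pass finds the next term
    return [x for row in approx for x in row]


# [3, 4] tensor [1, 2] is a product vector, so a single term reproduces it.
product_vec = [3.0, 6.0, 4.0, 8.0]
rank1 = schmidt_rank_k(product_vec, rows=2, cols=2, k=1)

# A dominant term plus a small one: rank 1 drops the small component.
noisy_vec = [1.0, 0.0, 0.0, 0.1]
denoised = schmidt_rank_k(noisy_vec, rows=2, cols=2, k=1)
```

In the second example the small `0.1` component is discarded at rank 1, which is the denoising behaviour the abstract compares to SVD-based denoising.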

Coordinating Search-Informed Reasoning and Reasoning-Guided Search in Claim Verification

Multi-hop claim verification is inherently challenging, requiring multi-step reasoning to construct verification chains while iteratively searching for information to uncover hidden bridging facts. This process is fundamentally interleaved, as effective reasoning relies on dynamically retrieved evidence, while effective search demands reasoning to refine queries based on partial information. To achieve this, we propose Hierarchical Agent Reasoning and Information Search (HARIS), explicitly modeling the coordinated process of reasoning-driven searching and search-informed reasoning. HARIS consists of a high-level reasoning agent that focuses on constructing the main verification chain, generating factual questions when more information is needed, and a low-level search agent that iteratively retrieves more information, refining its search based on intermediate findings. This design allows each agent to specialize in its respective task, enhancing verification accuracy and interpretability. HARIS is trained using reinforcement learning with outcome-based rewards. Experimental results on the EX-FEVER and HOVER benchmarks demonstrate that HARIS achieves strong performance, greatly advancing multi-hop claim verification.

Updated: 2025-07-31 17:12:54

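Title: Coordinating Search-Informed Reasoning and Reasoning-Guided Search in Claim Verification

Abstract: Multi-hop claim verification is inherently challenging: it requires multi-step reasoning to construct verification chains while iteratively searching for information to uncover hidden bridging facts. The process is fundamentally interleaved, because effective reasoning relies on dynamically retrieved evidence, while effective search requires reasoning to refine queries based on partial information. To achieve this, we propose Hierarchical Agent Reasoning and Information Search (HARIS), which explicitly models the coordinated process of reasoning-driven search and search-informed reasoning. HARIS consists of a high-level reasoning agent that focuses on constructing the main verification chain and generates factual questions when more information is needed, and a low-level search agent that iteratively retrieves more information and refines its search based on intermediate findings. This design lets each agent specialize in its own task, improving verification accuracy and interpretability. HARIS is trained using reinforcement learning with outcome-based rewards. Experimental results on the EX-FEVER and HOVER benchmarks show that HARIS achieves strong performance, greatly advancing multi-hop claim verification.

Updated: 2025-07-31 17:12:54

Categories: cs.AI

Download: http://arxiv.org/abs/2506.07528v2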

A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

Reinforcement learning agents can achieve super-human performance in complex decision-making tasks, but their behaviour is often difficult to understand and explain. This lack of explanation limits deployment, especially in safety-critical settings where understanding and trust are essential. We identify three core explanatory targets that together provide a comprehensive view of reinforcement learning agents: behaviour, outcomes, and predictions. We develop a unified theoretical framework for explaining these three elements of reinforcement learning agents through the influence of individual features that the agent observes in its environment. We derive feature influences by using Shapley values, which collectively and uniquely satisfy a set of well-motivated axioms for fair and consistent credit assignment. The proposed approach, Shapley Values for Explaining Reinforcement Learning (SVERL), provides a single theoretical framework to comprehensively and meaningfully explain reinforcement learning agents. It yields explanations with precise semantics that are not only interpretable but also mathematically justified, enabling us to identify and correct conceptual issues in prior explanations. Through illustrative examples, we show how SVERL produces useful, intuitive explanations of agent behaviour, outcomes, and predictions, which are not apparent from observing agent behaviour alone.

Updated: 2025-07-31 17:02:05

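Title: A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

Abstract: Reinforcement learning agents can achieve super-human performance in complex decision-making tasks, but their behaviour is often difficult to understand and explain. This lack of explanation limits deployment, especially in safety-critical settings where understanding and trust are essential. We identify three core explanatory targets that together provide a comprehensive view of reinforcement learning agents: behaviour, outcomes, and predictions. We develop a unified theoretical framework that explains these three elements through the influence of the individual features the agent observes in its environment. We derive feature influences using Shapley values, which collectively and uniquely satisfy a set of well-motivated axioms for fair and consistent credit assignment. The proposed approach, Shapley Values for Explaining Reinforcement Learning (SVERL), provides a single theoretical framework for explaining reinforcement learning agents comprehensively and meaningfully. It yields explanations with precise semantics that are not only interpretable but also mathematically justified, allowing us to identify and correct conceptual issues in prior explanations. Through illustrative examples, we show how SVERL produces useful, intuitive explanations of agent behaviour, outcomes, and predictions that are not apparent from observing agent behaviour alone.

Updated: 2025-07-31 17:02:05

Categories: cs.LG

Download: http://arxiv.org/abs/2505.07797v2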
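
The abstract above attributes influence to observed features with Shapley values. The sketch below computes exact Shapley values for a tiny cooperative game by averaging marginal contributions over all orderings. The value function `toy_value` and the feature names are hypothetical stand-ins for illustration; SVERL defines its own value functions for behaviour, outcomes, and predictions, which are not reproduced here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(observed):
    """Hypothetical payoff when only the features in `observed` are visible."""
    v = 0.0
    if "position" in observed:
        v += 10.0
    if "velocity" in observed:
        v += 4.0
    if "position" in observed and "velocity" in observed:
        v += 6.0  # interaction bonus, split evenly between the two features
    return v


phi = shapley_values(["position", "velocity", "noise"], toy_value)
```

The attributions sum to the value of observing every feature minus the value of observing none (the efficiency axiom), and the irrelevant `noise` feature receives zero. Enumerating all orderings costs factorial time, so this only suits a handful of features.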

Intersectional Divergence: Measuring Fairness in Regression

Fairness in machine learning research is commonly framed in the context of classification tasks, leaving critical gaps in regression. In this paper, we propose a novel approach to measure intersectional fairness in regression tasks, going beyond the focus on single protected attributes from existing work to consider combinations of all protected attributes. Furthermore, we contend that it is insufficient to measure the average error of groups without regard for imbalanced domain preferences. Accordingly, we propose Intersectional Divergence (ID) as the first fairness measure for regression tasks that 1) describes fair model behavior across multiple protected attributes and 2) differentiates the impact of predictions in target ranges most relevant to users. We extend our proposal demonstrating how ID can be adapted into a loss function, IDLoss, that satisfies convergence guarantees and has piecewise smooth properties that enable practical optimization. Through an extensive experimental evaluation, we demonstrate how ID allows unique insights into model behavior and fairness, and how incorporating IDLoss into optimization can considerably improve single-attribute and intersectional model fairness while maintaining a competitive balance in predictive performance.

Updated: 2025-07-31 17:00:35

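Title: Intersectional Divergence: Measuring Fairness in Regression

Abstract: Fairness in machine learning research is commonly framed in the context of classification tasks, leaving critical gaps in regression. In this paper, we propose a novel approach to measuring intersectional fairness in regression tasks, going beyond the focus of existing work on single protected attributes to consider combinations of all protected attributes. Furthermore, we argue that measuring only the average error of groups, without regard for imbalanced domain preferences, is insufficient. Accordingly, we propose Intersectional Divergence (ID) as the first fairness measure for regression tasks that 1) describes fair model behavior across multiple protected attributes and 2) differentiates the impact of predictions in the target ranges most relevant to users. We extend our proposal by showing how ID can be adapted into a loss function, IDLoss, that satisfies convergence guarantees and has piecewise smooth properties that enable practical optimization. Through an extensive experimental evaluation, we show how ID offers unique insights into model behavior and fairness, and how incorporating IDLoss into optimization can considerably improve single-attribute and intersectional model fairness while maintaining a competitive balance in predictive performance.

Updated: 2025-07-31 17:00:35

Categories: cs.LG

Download: http://arxiv.org/abs/2505.00830v2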

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.

Updated: 2025-07-31 17:00:30

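Title: Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they still struggle with theorem proving because clear supervision signals are lacking when only natural language is used. Dedicated domain-specific languages such as Lean provide clear supervision through formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, far exceeding the previous state of the art. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine, Seed-Geometry, which outperforms previous formal geometry engines. We used these two systems to participate in IMO 2025 and fully proved 5 out of 6 problems. This work represents a significant advance in automated mathematical reasoning and demonstrates the effectiveness of formal verification combined with long chain-of-thought reasoning.

Updated: 2025-07-31 17:00:30

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.23726v1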

Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies

This paper introduces Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm designed to enhance multi-agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi-headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter-agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint-action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi-agent paradigms, including Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter-agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).

Updated: 2025-07-31 16:47:21

Categories: cs.AI,cs.LG,I.2.0; I.2.8

Download: http://arxiv.org/abs/2507.22782v2

Quantum Transfer Learning for MNIST Classification Using a Hybrid Quantum-Classical Approach

We implement a hybrid quantum-classical model for image classification that compresses MNIST digit images into a low-dimensional feature space and then maps these features onto a 5-qubit quantum state. First, an autoencoder compresses each $28\times28$ image (784 pixels) into a 64-dimensional latent vector, preserving salient features of the digit with minimal reconstruction error. We further reduce the latent representation to 5 principal components using Principal Component Analysis (PCA), to match the 5 available qubits. These 5 features are encoded as rotation angles in a quantum circuit with 5 qubits. The quantum feature map applies single-qubit rotations ($R_y$ gates) proportional to the feature values, followed by a Hadamard gate and a cascade of entangling CNOT gates to produce a non-product entangled state. Measuring the 5-qubit state yields a 32-dimensional probability distribution over basis outcomes, which serves as a quantum-enhanced feature vector for classification. A classical neural network with a softmax output is then trained on these 32-dimensional quantum feature vectors to predict the digit class. We evaluate the hybrid model on the MNIST dataset and compare it to a purely classical baseline that uses the 64-dimensional autoencoder latent features for classification. The results show that the hybrid model can successfully classify digits, demonstrating the feasibility of integrating quantum computing in the classification pipeline, although its accuracy (about 75\% on test data) currently falls below the classical baseline (about 98\% on the same compressed data).
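
The feature map described above can be sketched as a small state-vector simulation in NumPy. This is only an illustration under stated assumptions, not the authors' circuit: the abstract does not say which qubit receives the Hadamard gate or how the CNOT cascade is wired, so the sketch assumes a Hadamard on qubit 0 and a linear chain of CNOTs from qubit q to qubit q+1. The `scale` parameter and all function names are hypothetical.

```python
import numpy as np

N_QUBITS = 5
HADAMARD = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

def ry(theta):
    """Matrix of the single-qubit R_y rotation by angle theta."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])

def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q of a state stored with shape (2,)*N_QUBITS."""
    out = np.tensordot(gate, state, axes=([1], [q]))
    return np.moveaxis(out, 0, q)

def apply_cnot(state, control, target):
    """Flip the target qubit on the part of the state where control is |1>."""
    state = state.copy()
    idx = [slice(None)] * N_QUBITS
    idx[control] = 1
    sub = state[tuple(idx)]  # the control axis is dropped by the integer index
    t_axis = target if target < control else target - 1
    state[tuple(idx)] = np.flip(sub, axis=t_axis).copy()
    return state

def quantum_features(features, scale=1.0):
    """Map 5 real features to a 32-dimensional outcome probability vector."""
    state = np.zeros((2,) * N_QUBITS)
    state[(0,) * N_QUBITS] = 1.0              # start in |00000>
    for q, x in enumerate(features):
        state = apply_1q(state, ry(scale * x), q)   # angle proportional to feature
    state = apply_1q(state, HADAMARD, 0)      # assumption: Hadamard on qubit 0
    for q in range(N_QUBITS - 1):
        state = apply_cnot(state, q, q + 1)   # assumption: linear CNOT chain
    return np.abs(state.reshape(-1)) ** 2

probs = quantum_features(np.array([0.1, 0.5, 1.0, 1.5, 2.0]))
print(probs.shape, probs.sum())
```

With all features set to zero, this particular wiring produces a GHZ-like state, which is a convenient sanity check that the CNOT chain entangles the qubits. The resulting 32 probabilities are what a classical classifier would consume in the pipeline the abstract describes.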

Updated: 2025-07-31 16:45:54

Categories: quant-ph,cs.LG

Download: http://arxiv.org/abs/2408.03351v2

Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
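
The evaluation setup described above (a stream of key-value updates followed by a query for the final values) can be sketched as follows. This is a hypothetical generator written for illustration; the prompt wording, the function names, and the scoring rule are assumptions and are not taken from the PI-LLM benchmark itself.

```python
import random

def build_pi_stream(keys, values, updates_per_key, seed=0):
    """Build an interleaved stream of key-value updates and the ground truth.

    Every key is updated `updates_per_key` times in shuffled order; only the
    last value written for each key counts as the correct answer.
    """
    rng = random.Random(seed)
    order = [k for k in keys for _ in range(updates_per_key)]
    rng.shuffle(order)
    lines, final = [], {}
    for key in order:
        value = rng.choice(values)
        lines.append(f"{key}: {value}")
        final[key] = value                    # later updates overwrite earlier ones
    prompt = "\n".join(lines) + "\nWhat is the current value of each key?"
    return prompt, final

def score(answers, final):
    """Fraction of keys whose answered value matches the final update."""
    return sum(answers.get(k) == v for k, v in final.items()) / len(final)

prompt, truth = build_pi_stream(["bp", "hr"], ["low", "mid", "high"], 5, seed=1)
print(prompt)
print(truth)
```

Raising `updates_per_key` increases the number of overwritten values that precede each final value, which is the interference variable the abstract describes.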

Updated: 2025-07-31 16:45:51

Categories: cs.CL,cs.AI,q-bio.NC

Download: http://arxiv.org/abs/2506.08184v3

Anomalous Samples for Few-Shot Anomaly Detection

Several anomaly detection and classification methods rely on large amounts of non-anomalous or "normal" samples under the assumption that anomalous data is typically harder to acquire. This hypothesis becomes questionable in Few-Shot settings, where as little as one annotated sample can make a significant difference. In this paper, we tackle the question of utilizing anomalous samples in training a model for binary anomaly classification. We propose a methodology that incorporates anomalous samples in a multi-score anomaly detection score leveraging recent Zero-Shot and memory-based techniques. We compare the utility of anomalous samples to that of regular samples and study the benefits and limitations of each. In addition, we propose an augmentation-based validation technique to optimize the aggregation of the different anomaly scores and demonstrate its effectiveness on popular industrial anomaly detection datasets.

Updated: 2025-07-31 16:41:06

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23712v1

TriP-LLM: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection

Time-series anomaly detection plays a central role across a wide range of application domains. With the increasing proliferation of the Internet of Things (IoT) and smart manufacturing, time-series data has dramatically increased in both scale and dimensionality. This growth has exposed the limitations of traditional statistical methods in handling the high heterogeneity and complexity of such data. Inspired by the recent success of large language models (LLMs) in multimodal tasks across language and vision domains, we propose a novel unsupervised anomaly detection framework: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection (TriP-LLM). TriP-LLM integrates local and global temporal features through a tri-branch design (Patching, Selection, and Global) to encode the input time series into patch-wise tokens, which are then processed by a frozen, pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from which anomaly scores are derived. We evaluate TriP-LLM on several public benchmark datasets using PATE, a recently proposed threshold-free evaluation metric, and conduct all comparisons within a unified open-source framework to ensure fairness. Experimental results show that TriP-LLM consistently outperforms recent state-of-the-art methods across all datasets, demonstrating strong detection capabilities. Furthermore, through extensive ablation studies, we verify the substantial contribution of the LLM to the overall architecture. Compared to LLM-based approaches using Channel Independence (CI) patch processing, TriP-LLM achieves significantly lower memory consumption, making it more suitable for GPU memory-constrained environments. All code and model checkpoints are publicly available at https://github.com/YYZStart/TriP-LLM.git
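
Two generic building blocks mentioned above, cutting a series into patches and deriving anomaly scores from a reconstruction, can be sketched in a few lines of NumPy. This is not the TriP-LLM implementation: the patch length, the stride, and the use of squared reconstruction error as the score are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def make_patches(series, patch_len, stride):
    """Slice a 1-D series into overlapping patches of shape (n_patches, patch_len)."""
    n_patches = (len(series) - patch_len) // stride + 1
    return np.stack(
        [series[i * stride : i * stride + patch_len] for i in range(n_patches)]
    )

def anomaly_scores(series, reconstruction):
    """Pointwise squared reconstruction error (assumed scoring rule)."""
    return (np.asarray(series) - np.asarray(reconstruction)) ** 2

series = np.arange(10.0)
patches = make_patches(series, patch_len=4, stride=2)
print(patches.shape)

# A reconstruction that misses one point gives that point the highest score.
recon = series.copy()
recon[7] = 0.0
print(int(np.argmax(anomaly_scores(series, recon))))
```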

Updated: 2025-07-31 16:36:54

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.00047v1

GCL-GCN: Graphormer and Contrastive Learning Enhanced Attributed Graph Clustering Network

Attributed graph clustering holds significant importance in modern data analysis. However, due to the complexity of graph data and the heterogeneity of node attributes, leveraging graph information for clustering remains challenging. To address this, we propose a novel deep graph clustering model, GCL-GCN, specifically designed to address the limitations of existing models in capturing local dependencies and complex structures when dealing with sparse and heterogeneous graph data. GCL-GCN introduces an innovative Graphormer module that combines centrality encoding and spatial relationships, effectively capturing both global and local information between nodes, thereby enhancing the quality of node representations. Additionally, we propose a novel contrastive learning module that significantly enhances the discriminative power of feature representations. In the pre-training phase, this module increases feature distinction through contrastive learning on the original feature matrix, ensuring more identifiable initial representations for subsequent graph convolution and clustering tasks. Extensive experimental results on six datasets demonstrate that GCL-GCN outperforms 14 advanced methods in terms of clustering quality and robustness. Specifically, on the Cora dataset, it improves ACC, NMI, and ARI by 4.94%, 13.01%, and 10.97%, respectively, compared to the primary comparison method MBN.

Updated: 2025-07-31 16:36:45

Categories: cs.LG

Download: http://arxiv.org/abs/2507.19095v2

Disparate Conditional Prediction in Multiclass Classifiers

We propose methods for auditing multiclass classifiers for fairness under multiclass equalized odds, by estimating the deviation from equalized odds when the classifier is not completely fair. We generalize to multiclass classifiers the measure of Disparate Conditional Prediction (DCP), originally suggested by Sabato & Yom-Tov (2020) for binary classifiers. DCP is defined as the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline. We provide new local-optimization methods for estimating the multiclass DCP under two different regimes, one in which the conditional confusion matrices for each protected sub-population are known, and one in which these cannot be estimated, for instance, because the classifier is inaccessible or because good-quality individual-level data is not available. These methods can be used to detect classifiers that likely treat a significant fraction of the population unfairly. Experiments demonstrate the accuracy of the methods. Code is provided at https://github.com/sivansabato/DCPmulticlass.
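
The objects that equalized odds compares, the conditional confusion matrices of each protected sub-population, can be computed as in the sketch below. It does not implement the DCP estimator from the paper. It only builds the per-group matrices P(prediction | true label, group) and reports the largest entrywise difference between any two groups, a simple gap measure chosen here for illustration. All function names are hypothetical.

```python
import numpy as np

def conditional_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [y, p] estimates P(pred=p | true=y)."""
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no samples stay at zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def equalized_odds_gap(y_true, y_pred, groups, n_classes):
    """Largest entrywise difference between any two groups' conditional matrices."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    mats = [
        conditional_confusion(y_true[groups == g], y_pred[groups == g], n_classes)
        for g in np.unique(groups)
    ]
    return max(np.abs(a - b).max() for a in mats for b in mats)

y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(equalized_odds_gap(y_true, y_pred, groups, n_classes=2))
```

A gap of zero means every group has the same conditional prediction probabilities, which is the multiclass equalized-odds condition. The DCP measure in the abstract instead asks what fraction of the population deviates from a common baseline.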

Updated: 2025-07-31 16:34:37

Categories: cs.LG,cs.CY,stat.ML

Download: http://arxiv.org/abs/2206.03234v3

Enhanced Velocity Field Modeling for Gaussian Video Reconstruction

High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. It consists of two core components: a velocity field rendering (VFR) pipeline which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. We validate our model's effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and less blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.

Updated: 2025-07-31 16:26:22

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.23704v1

Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, together with reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models: 1) large reasoning models struggle profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.

Updated: 2025-07-31 16:23:29

Categories: cs.AI

Download: http://arxiv.org/abs/2506.17114v3

TextQuests: How Good are LLMs at Text-Based Video Games?

Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent's ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent's capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.

Updated: 2025-07-31 16:22:55

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.23701v1

Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by $4\times$ and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.

Updated: 2025-07-31 16:20:02

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2507.23698v1

A survey of multi-agent geosimulation methodologies: from ABM to LLM

We provide a comprehensive examination of agent-based approaches that codify the principles and linkages underlying multi-agent systems, simulations, and information systems. Based on two decades of study, this paper confirms a framework intended as a formal specification for geosimulation platforms. Our findings show that large language models (LLMs) can be effectively incorporated as agent components if they follow a structured architecture specific to fundamental agent activities such as perception, memory, planning, and action. This integration is precisely consistent with the architecture that we formalize, providing a solid platform for next-generation geosimulation systems.

Updated: 2025-07-31 16:12:22

Categories: cs.MA,cs.AI,68T42,I.2.11

Download: http://arxiv.org/abs/2507.23694v1

Benchmarking Partial Observability in Reinforcement Learning with a Suite of Memory-Improvable Domains

Mitigating partial observability is a necessary but challenging task for general reinforcement learning algorithms. To improve an algorithm's ability to mitigate partial observability, researchers need comprehensive benchmarks to gauge progress. Most algorithms tackling partial observability are only evaluated on benchmarks with simple forms of state aliasing, such as feature masking and Gaussian noise. Such benchmarks do not represent the many forms of partial observability seen in real domains, like visual occlusion or unknown opponent intent. We argue that a partially observable benchmark should have two key properties. The first is coverage in its forms of partial observability, to ensure an algorithm's generalizability. The second is a large gap between the performance of agents with more or less state information, all other factors roughly equal. This gap implies that an environment is memory improvable: where performance gains in a domain are from an algorithm's ability to cope with partial observability as opposed to other factors. We introduce best-practice guidelines for empirically benchmarking reinforcement learning under partial observability, as well as the open-source library POBAX: Partially Observable Benchmarks in JAX. We characterize the types of partial observability present in various environments and select representative environments for our benchmark. These environments include localization and mapping, visual control, games, and more. Additionally, we show that these tasks are all memory improvable and require hard-to-learn memory functions, providing a concrete signal for partial observability research. This framework includes recommended hyperparameters as well as algorithm implementations for fast, out-of-the-box evaluation, as well as highly performant environments implemented in JAX for GPU-scalable experimentation.

Updated: 2025-07-31 16:11:37

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.00046v1

Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Advancements in artificial intelligence (AI) and low-earth orbit (LEO) satellites have promoted the application of large remote sensing foundation models for various downstream tasks. However, direct downloading of these models for fine-tuning on the ground is impeded by privacy concerns and limited bandwidth. Satellite federated learning (FL) offers a solution by enabling model fine-tuning directly on-board satellites and aggregating model updates without data downloading. Nevertheless, for large foundation models, the computational capacity of satellites is insufficient to support effective on-board fine-tuning in traditional satellite FL frameworks. To address these challenges, we propose a satellite-ground collaborative federated fine-tuning framework. The key of the framework lies in how to reasonably decompose and allocate model components to alleviate insufficient on-board computation capabilities. During fine-tuning, satellites exchange intermediate results with ground stations or other satellites for forward propagation and back propagation, which brings communication challenges due to the special communication topology of space transmission networks, such as intermittent satellite-ground communication, short duration of satellite-ground communication windows, and unstable inter-orbit inter-satellite links (ISLs). To reduce transmission delays, we further introduce tailored communication strategies that integrate both communication and computing resources. Specifically, we propose a parallel intra-orbit communication strategy, a topology-aware satellite-ground communication strategy, and a latency-minimalization inter-orbit communication strategy to reduce space communication costs. Simulation results demonstrate significant reductions in training time with improvements of approximately 33%.

Updated: 2025-07-31 15:59:35

Categories: cs.LG,cs.DC,cs.NI

Download: http://arxiv.org/abs/2504.10403v3

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.

Updated: 2025-07-31 15:57:46

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.23682v1

DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data

Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation.

Updated: 2025-07-31 15:51:41

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2507.23676v1

A Deep Learning Powered Numerical Relativity Surrogate for Binary Black Hole Waveforms

Gravitational-wave approximants are essential for gravitational-wave astronomy, allowing coverage of the binary black hole parameter space for inference or matched filtering without costly numerical relativity (NR) simulations, but generally trading some accuracy for computational efficiency. To reduce this trade-off, NR surrogate models can be constructed using interpolation within NR waveform space. We present a 2-stage training approach for neural network-based NR surrogate models. Initially trained on approximant-generated waveforms and then fine-tuned with NR data, these dual-stage artificial neural surrogate (\texttt{DANSur}) models offer rapid and competitively accurate waveform generation, generating millions of waveforms in under 20 ms on a GPU while keeping mean mismatches with NR around $10^{-4}$. Implemented in the \textsc{bilby} framework, we show they can be used for parameter estimation tasks.
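
The mismatch quoted above is one minus the normalized overlap between two waveforms. The sketch below computes a simplified version under two loudly stated assumptions: a flat (white) noise spectrum, and no maximization over relative time or phase shifts, both of which a production analysis would normally include. The function name is hypothetical.

```python
import numpy as np

def mismatch(h1, h2):
    """1 - <h1, h2> / sqrt(<h1, h1> <h2, h2>) with a flat-noise inner product.

    Simplification: no noise weighting and no maximization over time or phase.
    """
    overlap = np.vdot(h1, h2).real
    norm = np.sqrt(np.vdot(h1, h1).real * np.vdot(h2, h2).real)
    return 1.0 - overlap / norm

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
reference = np.sin(2 * np.pi * 5 * t)
slightly_off = np.sin(2 * np.pi * 5 * t + 0.01)    # small phase error
print(mismatch(reference, reference))
print(mismatch(reference, slightly_off))
```

Because the overlap is normalized, rescaling either waveform leaves the mismatch unchanged, so the quantity measures differences in shape rather than amplitude.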

Updated: 2025-07-31 15:51:12

Categories: gr-qc,astro-ph.HE,astro-ph.IM,cs.LG

Download: http://arxiv.org/abs/2412.06946v3

One-Step Flow Policy Mirror Descent

Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on flow policy and MeanFlow policy parametrizations, respectively. Extensive empirical evaluations on MuJoCo benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring hundreds of times fewer function evaluations during inference.
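
The link between target variance and one-step sampling error can be seen in a toy case that is not taken from the paper: a 1-D straight-interpolation flow from N(0, 1) to N(mu, s^2), for which the marginal velocity field has a closed form. At t = 0 this velocity is mu - x, so a single Euler step maps every noise sample to mu. The one-step sampler therefore returns the target mean exactly and misses the target only by its spread s. The sketch below is an illustration under these assumptions, not the FPMD algorithm, and the function names are hypothetical.

```python
import numpy as np

def velocity(x, t, mu, s):
    """Closed-form marginal velocity for x_t = (1 - t) * x0 + t * x1,
    with x0 ~ N(0, 1) and x1 ~ N(mu, s^2) drawn independently."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    slope = (t * s ** 2 - (1.0 - t)) / var_t
    return mu + slope * (x - t * mu)

def euler_sample(x0, mu, s, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, h = np.array(x0, dtype=float), 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * velocity(x, k * h, mu, s)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)
one_step = euler_sample(noise, mu=2.0, s=0.5, n_steps=1)
many_steps = euler_sample(noise, mu=2.0, s=0.5, n_steps=2000)
print(one_step.std(), many_steps.std())
```

In this toy setting the one-step samples collapse onto mu, while the many-step samples recover a spread close to s. The narrower the target distribution, the less a single step loses.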

Updated: 2025-07-31 15:51:10

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23675v1

TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.

Updated: 2025-07-31 15:50:57

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2507.23674v1

An Inversion-based Measure of Memorization for Diffusion Models

The past few years have witnessed substantial advances in image generation powered by diffusion models. However, it was shown that diffusion models are susceptible to training data memorization, raising significant concerns regarding copyright infringement and privacy invasion. This study delves into a rigorous analysis of memorization in diffusion models. We introduce InvMM, an inversion-based measure of memorization, which is based on inverting a sensitive latent noise distribution accounting for the replication of an image. For accurate estimation of the measure, we propose an adaptive algorithm that balances the normality and sensitivity of the noise distribution. Comprehensive experiments across four datasets, conducted on both unconditional and text-guided diffusion models, demonstrate that InvMM provides a reliable and complete quantification of memorization. Notably, InvMM is commensurable between samples, reveals the true extent of memorization from an adversarial standpoint and implies how memorization differs from membership. In practice, it serves as an auditing tool for developers to reliably assess the risk of memorization, thereby contributing to the enhancement of trustworthiness and privacy-preserving capabilities of diffusion models.

Updated: 2025-07-31 15:50:10

Categories: cs.CR,cs.CV

Download: http://arxiv.org/abs/2405.05846v3

SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA's effectiveness in few-shot and zero-shot learning scenarios and using minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
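The spectral similarity computation mentioned above can be illustrated with the classical spectral angle between two spectra. This is a minimal sketch of that standard measure, not SAMSA's implementation; the function name and the idea of comparing a clicked pixel's spectrum against every other pixel are assumptions for illustration. The angle depends only on the direction of the spectra, so it can be computed for any band count as long as both spectra share the same bands.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra given as equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectra must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

A small angle means the two spectra have nearly the same shape regardless of overall brightness.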

Updated: 2025-07-31 15:49:57

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.23673v1

White-Basilisk: A Hybrid Model for Code Vulnerability Detection

The proliferation of software vulnerabilities presents a significant challenge to cybersecurity, necessitating more effective detection methodologies. We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance while challenging prevailing assumptions in AI model scaling. Utilizing an innovative architecture that integrates Mamba layers, linear self-attention, and a Mixture of Experts framework, White-Basilisk achieves state-of-the-art results in vulnerability detection tasks with a parameter count of only 200M. The model's capacity to process sequences of unprecedented length enables comprehensive analysis of extensive codebases in a single pass, surpassing the context limitations of current Large Language Models (LLMs). White-Basilisk exhibits robust performance on imbalanced, real-world datasets, while maintaining computational efficiency that facilitates deployment across diverse organizational scales. This research not only establishes new benchmarks in code security but also provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks, potentially redefining optimization strategies in AI development for domain-specific applications.

Updated: 2025-07-31 15:49:27

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2507.08540v2

Automating AI Failure Tracking: Semantic Association of Reports in AI Incident Database

Artificial Intelligence (AI) systems are transforming critical sectors such as healthcare, finance, and transportation, enhancing operational efficiency and decision-making processes. However, their deployment in high-stakes domains has exposed vulnerabilities that can result in significant societal harm. To systematically study and mitigate these risks, initiatives like the AI Incident Database (AIID) have emerged, cataloging over 3,000 real-world AI failure reports. Currently, associating a new report with the appropriate AI Incident relies on manual expert intervention, limiting scalability and delaying the identification of emerging failure patterns. To address this limitation, we propose a retrieval-based framework that automates the association of new reports with existing AI Incidents through semantic similarity modeling. We formalize the task as a ranking problem, where each report (comprising a title and a full textual description) is compared to previously documented AI Incidents based on embedding cosine similarity. Benchmarking traditional lexical methods, cross-encoder architectures, and transformer-based sentence embedding models, we find that the latter consistently achieve superior performance. Our analysis further shows that combining titles and descriptions yields substantial improvements in ranking accuracy compared to using titles alone. Moreover, retrieval performance remains stable across variations in description length, highlighting the robustness of the framework. Finally, we find that retrieval performance consistently improves as the training set expands. Our approach provides a scalable and efficient solution for supporting the maintenance of the AIID.
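The ranking formulation can be sketched in a few lines. The example below assumes that the report and each incident have already been embedded as plain lists of floats; the embedding model, the incident identifiers, and the function names are all hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_incidents(report_vec, incident_vecs):
    """Return incident ids sorted from most to least similar to the report.

    incident_vecs maps a (hypothetical) incident id to its embedding.
    """
    scores = {
        incident_id: cosine_similarity(report_vec, vec)
        for incident_id, vec in incident_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

The top-ranked incident is the candidate association for the new report; a reviewer could inspect the first few entries rather than the whole database.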

Updated: 2025-07-31 15:48:12

Categories: cs.CY,cs.AI,cs.IR

Download: http://arxiv.org/abs/2507.23669v1

SHAP-Guided Regularization in Machine Learning Models

Feature attribution methods such as SHapley Additive exPlanations (SHAP) have become instrumental in understanding machine learning models, but their role in guiding model optimization remains underexplored. In this paper, we propose a SHAP-guided regularization framework that incorporates feature importance constraints into model training to enhance both predictive performance and interpretability. Our approach applies entropy-based penalties to encourage sparse, concentrated feature attributions while promoting stability across samples. The framework is applicable to both regression and classification tasks. As a first exploration, we investigate the regularization of tree-based models using TreeSHAP. Through extensive experiments on benchmark regression and classification datasets, we demonstrate that our method improves generalization performance while ensuring robust and interpretable feature attributions. The proposed technique offers a novel, explainability-driven regularization approach, making machine learning models both more accurate and more reliable.
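One way to read the entropy-based penalty is as the Shannon entropy of normalized absolute attributions: low entropy means the attribution mass sits on a few features. The sketch below is an assumption about the general form, not the paper's exact objective; the function names and the weighting parameter `lam` are hypothetical.

```python
import math


def attribution_entropy(shap_values):
    """Shannon entropy of |SHAP| values normalized to sum to one."""
    magnitudes = [abs(v) for v in shap_values]
    total = sum(magnitudes)
    if total == 0.0:
        return 0.0
    probs = [m / total for m in magnitudes]
    # Terms with zero probability contribute nothing to the entropy.
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def penalized_loss(task_loss, shap_values, lam=0.1):
    """Task loss plus a (hypothetical) entropy penalty weighted by lam."""
    return task_loss + lam * attribution_entropy(shap_values)
```

A sample whose attributions are spread evenly over four features receives a penalty of lam * log(4), while one dominated by a single feature receives a penalty near zero.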

Updated: 2025-07-31 15:45:38

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23665v1

Personalized Education with Ranking Alignment Recommendation

Personalized question recommendation aims to guide individual students through questions to enhance their mastery of learning targets. Most previous methods model this task as a Markov Decision Process and use reinforcement learning to solve it, but they struggle with efficient exploration, failing to identify the best questions for each student during training. To address this, we propose Ranking Alignment Recommendation (RAR), which incorporates collaborative ideas into the exploration mechanism, enabling more efficient exploration within limited training episodes. Experiments show that RAR effectively improves recommendation performance, and our framework can be applied to any RL-based question recommender. Our code is available at https://github.com/wuming29/RAR.git.

Updated: 2025-07-31 15:43:51

Categories: cs.AI,cs.IR

Download: http://arxiv.org/abs/2507.23664v1

Parallel Split Learning with Global Sampling

Distributed deep learning in resource-constrained environments faces scalability and generalization challenges due to large effective batch sizes and non-identically distributed client data. We introduce a server-driven sampling strategy that maintains a fixed global batch size by dynamically adjusting client-side batch sizes. This decouples the effective batch size from the number of participating devices and ensures that global batches better reflect the overall data distribution. Using standard concentration bounds, we establish tighter deviation guarantees compared to existing approaches. Empirical results on a benchmark dataset confirm that the proposed method improves model accuracy, training efficiency, and convergence stability, offering a scalable solution for learning at the network edge.
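A server-side allocation of this kind can be sketched as follows. The abstract does not give the exact sampling rule, so this example assumes the server draws the global batch uniformly without replacement from the pooled client samples; the function and parameter names are hypothetical. Each client's batch size is then simply the number of its samples that were drawn, and the sizes always sum to the fixed global batch size.

```python
import random


def sample_client_batch_sizes(client_sizes, global_batch, rng):
    """Split a fixed global batch across clients.

    client_sizes: number of local samples held by each client.
    Assumes (hypothetically) uniform sampling without replacement over the
    pooled data, so larger clients contribute more on average.
    """
    if global_batch > sum(client_sizes):
        raise ValueError("global batch exceeds total available samples")
    # One pool entry per local sample, labelled with its client index.
    pool = [cid for cid, n in enumerate(client_sizes) for _ in range(n)]
    counts = [0] * len(client_sizes)
    for cid in rng.sample(pool, global_batch):
        counts[cid] += 1
    return counts
```

Because the total is fixed, adding or removing clients changes how the batch is divided but not its overall size.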

Updated: 2025-07-31 15:42:11

Categories: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2407.15738v4

Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention

Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.

Updated: 2025-07-31 15:39:27

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.17745v3

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy requires trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness into the answers by preparing several logically correct answers and including only one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only does this keep us from disclosing the ground truth, but it also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
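The ceiling argument can be made concrete with a small calculation. The sketch assumes that each question's published answer is drawn uniformly from its logically correct answers, which is an assumption about the randomization and not a detail stated in the abstract. Under that assumption a question with k correct answers can be matched with probability at most 1/k, and a one-sided binomial tail gives a rough indication of how surprising a higher score would be.

```python
import math


def bayes_accuracy(num_correct_answers):
    """Best expected accuracy when one of k correct answers is chosen uniformly.

    num_correct_answers: list with the number of valid answers per question.
    """
    return sum(1.0 / k for k in num_correct_answers) / len(num_correct_answers)


def binomial_tail(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1)
    )
```

For instance, with 100 questions that each have two valid answers, the ceiling is 0.5, and `binomial_tail(100, 70, 0.5)` gives the chance that an uncontaminated, fully capable model would match the published key on 70 or more questions.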

Updated: 2025-07-31 15:25:09

Categories: cs.LG,cs.AI,cs.CL,stat.ME

Download: http://arxiv.org/abs/2505.18102v2

Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.

Updated: 2025-07-31 15:19:55

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.23642v1

LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring

Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging - the strategic underperformance on evaluations by AI models or their developers. One promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36\% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.

Updated: 2025-07-31 15:19:30

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2508.00943v1

Polynomial Lattices for the BIKE Cryptosystem

In this paper we introduce a rank $2$ lattice over a polynomial ring arising from the public key of the BIKE cryptosystem \cite{aragon2022bike}. The secret key is a sparse vector in this lattice. We study properties of this lattice and generalize the recovery of weak keys from \cite{BardetDLO16}. In particular, we show that they implicitly solved a shortest vector problem in the lattice we constructed. Rather than finding only a shortest vector, we obtain a reduced basis of the lattice which makes it possible to check for more weak keys.

Updated: 2025-07-31 15:18:52

Categories: cs.CR,11T71, 94A60

Download: http://arxiv.org/abs/2507.23641v1

Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation

Variation in language use, shaped by speakers' sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. However, the computational study of these Sociocultural Linguistic Phenomena (SLP) has often been limited to bespoke analyses of specific groups or topics, hindering the pace of scientific discovery. To address this, we introduce Splits!, a 9.7 million-post dataset from Reddit designed for systematic and flexible research. The dataset contains posts from over 53,000 users across 6 demographic groups, organized into 89 discussion topics to enable comparative analysis. We validate Splits! via self-identification and by successfully replicating several known SLPs from existing literature. We complement this dataset with a framework that leverages efficient retrieval methods to rapidly validate potential SLPs (PSLPs) by automatically evaluating whether a given hypothesis is supported by our data. Crucially, to distinguish between novel and obvious insights, the framework incorporates a human-validated measure of a hypothesis's "unexpectedness." We demonstrate that the two-stage process reduces the number of statistically significant findings requiring manual inspection by a factor of 1.5-1.8x, streamlining the discovery of promising phenomena for further investigation.

Updated: 2025-07-31 15:18:47

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.04640v2

Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Conformal prediction is a powerful distribution-free framework for constructing prediction sets with coverage guarantees. Classical methods, such as split conformal prediction, provide marginal coverage, ensuring that the prediction set contains the label of a random test point with a target probability. However, these guarantees may not hold uniformly across different subpopulations, leading to disparities in coverage. Prior work has explored coverage guarantees conditioned on events related to the covariates and label of the test point. We present Kandinsky conformal prediction, a framework that significantly expands the scope of conditional coverage guarantees. In contrast to Mondrian conformal prediction, which restricts its coverage guarantees to disjoint groups -- reminiscent of the rigid, structured grids of Piet Mondrian's art -- our framework flexibly handles overlapping and fractional group memberships defined jointly on covariates and labels, reflecting the layered, intersecting forms in Wassily Kandinsky's compositions. Our algorithm unifies and extends existing methods, encompassing covariate-based group conditional, class conditional, and Mondrian conformal prediction as special cases, while achieving a minimax-optimal high-probability conditional coverage bound. Finally, we demonstrate the practicality of our approach through empirical evaluation on real-world datasets.
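For reference, the classical split conformal baseline mentioned above can be sketched as follows. This covers only the marginal-coverage method, not the Kandinsky framework; the nonconformity scores are assumed to be precomputed on a held-out calibration set, and the function names are hypothetical.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the set must be trivial.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(label_scores, threshold):
    """Labels whose nonconformity score does not exceed the threshold.

    label_scores maps each candidate label to its score for one test point.
    """
    return {label for label, score in label_scores.items() if score <= threshold}
```

With exchangeable calibration and test data, the resulting set contains the true label with probability at least 1 - alpha on average over test points, which is exactly the marginal guarantee that conditional methods aim to strengthen.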

Updated: 2025-07-31 15:15:50

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2502.17264v2

OptiGradTrust: Byzantine-Robust Federated Learning with Multi-Feature Gradient Analysis and Reinforcement Learning-Based Trust Weighting

Federated Learning (FL) enables collaborative model training across distributed medical institutions while preserving patient privacy, but remains vulnerable to Byzantine attacks and statistical heterogeneity. We present OptiGradTrust, a comprehensive defense framework that evaluates gradient updates through a novel six-dimensional fingerprint including VAE reconstruction error, cosine similarity metrics, $L_2$ norm, sign-consistency ratio, and Monte Carlo Shapley value, which drive a hybrid RL-attention module for adaptive trust scoring. To address convergence challenges under data heterogeneity, we develop FedBN-Prox (FedBN-P), combining Federated Batch Normalization with proximal regularization for optimal accuracy-convergence trade-offs. Extensive evaluation across MNIST, CIFAR-10, and Alzheimer's MRI datasets under various Byzantine attack scenarios demonstrates significant improvements over state-of-the-art defenses, achieving up to +1.6 percentage points over FLGuard under non-IID conditions while maintaining robust performance against diverse attack patterns through our adaptive learning approach.
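Three of the fingerprint features named above have simple closed forms. The sketch below computes cosine similarity, $L_2$ norm, and a sign-consistency ratio for one client gradient against a reference gradient. The choice of reference (for example, a server-side estimate), the function name, and the exact definition of sign consistency are assumptions; the VAE and Shapley features are omitted.

```python
import math


def gradient_fingerprint(grad, reference):
    """Partial fingerprint of a flattened client gradient.

    reference: a (hypothetical) trusted gradient of the same length.
    """
    dot = sum(g * r for g, r in zip(grad, reference))
    grad_norm = math.sqrt(sum(g * g for g in grad))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    cosine = dot / (grad_norm * ref_norm) if grad_norm > 0 and ref_norm > 0 else 0.0
    # Fraction of coordinates whose sign agrees with the reference.
    same_sign = sum(1 for g, r in zip(grad, reference) if g * r > 0)
    return {
        "cosine": cosine,
        "l2_norm": grad_norm,
        "sign_consistency": same_sign / len(grad),
    }
```

A downstream trust scorer would consume these values as input features; how they are weighted is left to the learned module described in the abstract.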

Updated: 2025-07-31 15:14:36

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.23638v1

MemoCue: Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying

Agent-assisted memory recall is a critical research problem in the field of human-computer interaction. In conventional methods, the agent retrieves information from its memory module to help a person recall incomplete or vague memories. The limited size of the memory module hinders the acquisition of complete memories and degrades recall performance in practice. Memory theories suggest that a person's relevant memories can be proactively activated through effective cues. Inspired by this, we propose a novel strategy-guided agent-assisted memory recall method, in which the agent transforms an original query into a cue-rich one via a judiciously designed strategy to help the person recall memories. This raises two key challenges. (1) How to choose the appropriate recall strategy for diverse forgetting scenarios with distinct memory-recall characteristics? (2) How to obtain high-quality responses that leverage recall strategies, given only abstract and sparsely annotated strategy patterns? To address these challenges, we propose a Recall Router framework. Specifically, we design a 5W Recall Map to classify memory queries into five typical scenarios and define fifteen recall strategy patterns across the corresponding scenarios. We then propose a hierarchical recall tree combined with the Monte Carlo Tree Search algorithm to optimize the selection of strategy and the generation of strategy responses. We construct an instruction tuning dataset and fine-tune multiple open-source large language models (LLMs) to develop MemoCue, an agent that excels in providing memory-inspired responses. Experiments on three representative datasets show that MemoCue surpasses LLM-based methods by 17.74% in recall inspiration. Further human evaluation highlights its advantages in memory-recall applications.

Updated: 2025-07-31 15:11:38

Categories: cs.AI

Download: http://arxiv.org/abs/2507.23633v1

Mantis Shrimp: Exploring Photometric Band Utilization in Computer Vision Networks for Photometric Redshift Estimation

We present Mantis Shrimp, a multi-survey deep learning model for photometric redshift estimation that fuses ultra-violet (GALEX), optical (PanSTARRS), and infrared (UnWISE) imagery. Machine learning is now an established approach for photometric redshift estimation, with generally acknowledged higher performance than template-based methods in areas with a high density of spectroscopically identified galaxies. Multiple works have shown that image-based convolutional neural networks can outperform tabular-based color/magnitude models. In comparison to tabular models, image models have additional design complexities: it is largely unknown how to fuse inputs from different instruments which have different resolutions or noise properties. The Mantis Shrimp model estimates the conditional density of redshift using cutout images. The density estimates are well calibrated and the point estimates perform well in the distribution of available spectroscopically confirmed galaxies, with bias of 1e-2, scatter (NMAD) of 2.44e-2, and catastrophic outlier rate ($\eta$) of 17.53\%. We find that early fusion approaches (e.g., resampling and stacking images from different instruments) match the performance of late fusion approaches (e.g., concatenating latent space representations), so that the design choice is ultimately left to the user. Finally, we study how the models learn to use information across bands, finding evidence that our models successfully incorporate information from all surveys. The applicability of our model to the analysis of large populations of galaxies is limited by the speed of downloading cutouts from external servers; however, our model could be useful in smaller studies such as generating priors over redshift for stellar population synthesis.
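The three point-estimate metrics quoted above are commonly computed from the scaled residual (z_phot - z_spec) / (1 + z_spec). The sketch below follows one common convention; the abstract does not state the paper's exact definitions, so the use of the mean for bias and the outlier threshold of 0.15 are assumptions, and the function name is hypothetical.

```python
import statistics


def photoz_metrics(z_phot, z_spec, outlier_threshold=0.15):
    """Bias, NMAD scatter, and outlier fraction of photometric redshifts."""
    # Scaled residuals, the usual basis for photo-z quality metrics.
    dz = [(p - s) / (1.0 + s) for p, s in zip(z_phot, z_spec)]
    bias = statistics.mean(dz)
    median_dz = statistics.median(dz)
    # Normalized median absolute deviation; 1.4826 matches a Gaussian sigma.
    nmad = 1.4826 * statistics.median([abs(d - median_dz) for d in dz])
    outlier_rate = sum(1 for d in dz if abs(d) > outlier_threshold) / len(dz)
    return {"bias": bias, "nmad": nmad, "outlier_rate": outlier_rate}
```

NMAD is robust to the heavy tails that the outlier rate measures separately, which is why the two are usually reported together.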

Updated: 2025-07-31 15:10:35

标题: 螳螂虾：探索计算机视觉网络中的光度波段利用，用于光度红移估计

摘要: 我们提出了Mantis Shrimp，这是一个用于光度红移估计的多调查深度学习模型，融合了紫外线（GALEX）、光学（PanSTARRS）和红外（UnWISE）图像。机器学习现在已成为光度红移估计的一种已建立的方法，通常在具有高密度光谱鉴定星系的区域中，其性能高于基于模板的方法。多项研究表明，基于图像的卷积神经网络可以胜过基于表格的颜色/星等模型。与表格模型相比，图像模型有额外的设计复杂性：如何融合具有不同分辨率或噪声特性的不同仪器的输入目前在很大程度上还是未知的。Mantis Shrimp模型使用切割图像来估计红移的条件密度估计。密度估计是良好校准的，点估计在可用的经过光谱确认的星系分布中表现良好（偏差=1e-2），散射（NMAD=2.44e-2）和灾难性异常值率（η=17.53%）。我们发现早期融合方法（例如，从不同仪器重新采样和堆叠图像）与晚期融合方法（例如，连接潜在空间表示）的性能相匹配，因此最终的设计选择最终由用户决定。最后，我们研究了模型如何学习跨波段使用信息，发现我们的模型成功地整合了来自所有调查的信息。我们的模型在分析大量星系群体方面的适用性受到从外部服务器下载切割图像的速度限制；然而，我们的模型可能在生成先验红移以用于恒星群体综合等较小的研究中是有用的。

更新时间: 2025-07-31 15:10:35

Categories: astro-ph.IM,cs.AI

Download: http://arxiv.org/abs/2501.09112v2

On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, its main drawback is that memory and computational cost grow quadratically with sequence length. Linear attention and similar methods replace the softmax nonlinearity to avoid this quadratic bottleneck. Although these linear forms of attention are derived from the original softmax formulation, they typically lag in downstream accuracy. Intuition about applying the softmax nonlinearity to the query-key inner product suggests that it has desirable properties compared to other nonlinearities, but the question of why this discrepancy exists remains unanswered. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention. Using this form, each part of softmax attention can be described in the language of recurrent neural networks (RNNs). Describing softmax attention as an RNN allows for the ablation of the components of softmax attention to understand the importance of each part and how they interact. In this way, our work helps explain why softmax attention is more expressive than its counterparts.
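
To make the "attention as an RNN" view concrete, the sketch below shows the well-known recurrence for causal linear attention and checks it against the quadratic-time form. It illustrates only the linear case, without the normalization term; it is not the paper's recurrent derivation of softmax attention, and the function names are illustrative.

```python
def causal_linear_attention_parallel(Q, K, V):
    """Quadratic-time form: out_t = sum over s <= t of (q_t . k_s) * v_s."""
    out = []
    for t, q in enumerate(Q):
        acc = [0.0] * len(V[0])
        for s in range(t + 1):
            w = sum(qi * ki for qi, ki in zip(q, K[s]))
            for j, vj in enumerate(V[s]):
                acc[j] += w * vj
        out.append(acc)
    return out

def causal_linear_attention_recurrent(Q, K, V):
    """RNN form: state S_t = S_{t-1} + k_t v_t^T, output out_t = q_t^T S_t."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size hidden state
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out
```

The recurrent form carries a state of size d_k x d_v regardless of sequence length, which is where the linear cost comes from.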

Updated: 2025-07-31 15:10:03

标题: 关于Softmax注意力表达能力的研究：循环神经网络的视角

摘要: 自从引入以来，softmax注意力已经成为现代transformer架构的支柱，因为它在各种任务中具有表达能力和可扩展性。然而，softmax注意力的主要缺点是与序列长度相关的二次内存需求和计算复杂性。通过替换softmax非线性，引入了线性注意力和类似方法，以避免softmax注意力的二次瓶颈。尽管这些线性形式的注意力是从原始softmax公式推导出来的，但它们通常在下游准确性方面落后。尽管softmax非线性在查询和键内积上具有较强的直觉，表明它与其他非线性相比具有理想的性质，但为什么存在这种差异仍然未得到解答。这项工作通过推导softmax注意力的递归形式，证明了线性注意力是softmax注意力的近似。使用这种形式，可以用递归神经网络（RNNs）的语言描述softmax注意力的每个部分。将softmax注意力描述为RNN允许消除softmax注意力的各个组件，以了解每个部分的重要性及其相互作用。通过这种方式，我们的工作有助于解释为什么softmax注意力比其对应物更具表达能力。

更新时间: 2025-07-31 15:10:03

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23632v1

CS-SHRED: Enhancing SHRED for Robust Recovery of Spatiotemporal Dynamics

We present CS-SHRED, a novel deep learning architecture that integrates Compressed Sensing (CS) into a Shallow Recurrent Decoder (SHRED) to reconstruct spatiotemporal dynamics from incomplete, compressed, or corrupted data. Our approach introduces two key innovations. First, by incorporating CS techniques into the SHRED architecture, our method leverages a batch-based forward framework with $\ell_1$ regularization to robustly recover signals even in scenarios with sparse sensor placements, noisy measurements, and incomplete sensor acquisitions. Second, an adaptive loss function dynamically combines Mean Squared Error (MSE) and Mean Absolute Error (MAE) terms with a piecewise Signal-to-Noise Ratio (SNR) regularization, which suppresses noise and outliers in low-SNR regions while preserving fine-scale features in high-SNR regions. We validate CS-SHRED on challenging problems including viscoelastic fluid flows, maximum specific humidity fields, sea surface temperature distributions, and rotating turbulent flows. Compared to the traditional SHRED approach, CS-SHRED achieves significantly higher reconstruction fidelity, as demonstrated by improved SSIM and PSNR values, lower normalized errors, and enhanced LPIPS scores, thereby providing superior preservation of small-scale structures and increased robustness against noise and outliers. Our results underscore the advantages of the jointly trained CS and SHRED architecture, which combines an LSTM sequence model for characterizing the temporal evolution with a shallow decoder network (SDN) for modeling the high-dimensional state space. The SNR-guided adaptive loss function for spatiotemporal data recovery establishes CS-SHRED as a promising tool for a wide range of applications in environmental, climatic, and scientific data analyses.

Updated: 2025-07-31 15:08:10

标题: CS-SHRED：增强SHRED以稳健恢复时空动态

摘要: 我们提出了CS-SHRED，这是一种新颖的深度学习架构，将压缩感知（CS）集成到浅层递归解码器（SHRED）中，用于从不完整、压缩或损坏的数据中重建时空动态。我们的方法引入了两个关键创新。首先，通过将CS技术纳入SHRED架构，我们的方法利用基于批处理的前向框架和$\ell_1$正则化来稳健地恢复信号，即使在传感器布置稀疏、测量噪声大和传感器采集不完整的情况下也能实现。其次，一种自适应损失函数动态地将均方误差（MSE）和平均绝对误差（MAE）项与分段信噪比（SNR）正则化相结合，可以在低SNR区域抑制噪声和异常值，同时保留高SNR区域的细微特征。我们在具有挑战性的问题上验证了CS-SHRED，包括粘弹性流体流动、最大比湿场、海表温度分布和旋转湍流。与传统的SHRED方法相比，CS-SHRED实现了更高的重建保真度--通过改善SSIM和PSNR值、降低标准化误差和增强LPIPS得分来证明，从而提供了对小尺度结构的更好保留以及增强了抗噪声和异常值的能力。我们的结果强调了联合训练的CS和SHRED设计架构的优势，该架构包括用于表征时间演变的LSTM序列模型和用于建模高维状态空间的浅解码器网络（SDN）。对时空数据恢复进行SNR引导自适应损失函数，将CS-SHRED确立为环境、气候和科学数据分析中广泛应用的有希望的工具。

更新时间: 2025-07-31 15:08:10

Categories: cs.LG,68T07, 35Q35, 94A12,I.2.6; I.5.4; I.6.3; J.2

Download: http://arxiv.org/abs/2507.22303v2

PatchTraj: Unified Time-Frequency Representation Learning via Dynamic Patches for Trajectory Prediction

Pedestrian trajectory prediction is crucial for autonomous driving and robotics. However, existing point-based and grid-based methods have two main limitations: they model human motion dynamics insufficiently, failing to balance local motion details with long-range spatiotemporal dependencies, and their time representations lack interaction with the corresponding frequency components when jointly modeling trajectory sequences. To address these challenges, we propose PatchTraj, a dynamic patch-based framework that integrates time-frequency joint modeling for trajectory prediction. Specifically, we decompose the trajectory into raw time sequences and frequency components, and employ dynamic patch partitioning to perform multi-scale segmentation, capturing hierarchical motion patterns. Each patch undergoes adaptive embedding with scale-aware feature extraction, followed by hierarchical feature aggregation to model both fine-grained and long-range dependencies. The outputs of the two branches are further enhanced via cross-modal attention, facilitating complementary fusion of temporal and spectral cues. The resulting enhanced embeddings exhibit strong expressive power, enabling accurate predictions even when using a vanilla Transformer architecture. Extensive experiments on the ETH-UCY, SDD, NBA, and JRDB datasets demonstrate that our method achieves state-of-the-art performance. Notably, on the egocentric JRDB dataset, PatchTraj attains significant relative improvements of 26.7% in ADE and 17.4% in FDE, underscoring its substantial potential in embodied intelligence.
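
For readers unfamiliar with the reported metrics, here is a minimal sketch of ADE (average displacement error), FDE (final displacement error), and a relative improvement. It assumes a single predicted trajectory per pedestrian; benchmarks in this area often report the best of several sampled predictions, which is not shown. The function names are illustrative and not from the paper.

```python
import math

def ade_fde(pred, truth):
    """ADE: mean Euclidean error over all time steps; FDE: error at the last step."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]

def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error relative to the baseline (lower error is better)."""
    return (baseline_error - new_error) / baseline_error
```

Under this definition, a 26.7% relative improvement in ADE means the new error is 73.3% of the baseline error.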

Updated: 2025-07-31 15:04:27

标题: PatchTraj：通过动态补丁实现轨迹预测的统一时频表示学习

摘要: 行人轨迹预测对于自动驾驶和机器人技术至关重要。现有的基于点和基于网格的方法存在两个主要局限性：未能充分建模人类运动动态，因为它们未能在局部运动细节和长程时空依赖性之间取得平衡，以及时间表示缺乏与其频率成分的交互，在联合建模轨迹序列时。为了解决这些挑战，我们提出了PatchTraj，这是一个基于动态补丁的框架，用于轨迹预测，它整合了时间-频率联合建模的方法。具体地，我们将轨迹分解为原始时间序列和频率成分，并采用动态补丁划分进行多尺度分割，捕捉层次运动模式。每个补丁经历自适应嵌入和尺度感知特征提取，然后进行分层特征聚合，以建模细粒度和长程依赖关系。两个分支的输出通过跨模态注意力进一步增强，促进了时间和谱线索的互补融合。由此产生的增强嵌入具有强大的表达能力，即使使用传统的Transformer架构也能实现准确预测。在ETH-UCY、SDD、NBA和JRDB数据集上的大量实验表明，我们的方法实现了最先进的性能。值得注意的是，在以自我为中心的JRDB数据集上，PatchTraj在ADE方面取得了26.7%的显著相对改善，在FDE方面取得了17.4%的显著相对改善，突显了它在具体智能中的巨大潜力。

更新时间: 2025-07-31 15:04:27

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.19119v3

DivControl: Knowledge Diversion for Controllable Image Generation

Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ lower training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.

Updated: 2025-07-31 15:00:15

标题: DivControl：可控图像生成的知识转移

摘要: 扩散模型已经从文本到图像（T2I）进化为图像到图像（I2I）生成，通过整合诸如深度图等结构化输入，实现了细粒度的空间控制。然而，现有方法要么为每个条件训练单独的模型，要么依赖于混合表示的统一架构，导致对新条件的泛化能力差和高适应成本。为此，我们提出了DivControl，一个可分解的预训练框架，用于统一可控生成和高效适应。DivControl通过奇异值分解将ControlNet分解为基本组件-成对的奇异向量，通过多条件训练期间的知识转移将其解缠成与条件无关的学习基因和条件特定的调整器。知识转移通过动态门实现，根据条件指令的语义对调整器进行软路由，实现零样本泛化和对新条件的参数高效适应。为了进一步提高条件的保真度和训练效率，我们引入了一个表示对齐损失，将条件嵌入与早期扩散特征对齐。大量实验表明，DivControl在36.4倍更少的训练成本下达到了最先进的可控性，同时提高了基本条件的平均性能。它还在未知条件上表现出强大的零样本和少样本性能，展示了卓越的可扩展性、模块化和可转移性。

更新时间: 2025-07-31 15:00:15

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.23620v1

L-GTA: Latent Generative Modeling for Time Series Augmentation

Data augmentation is gaining importance across various aspects of time series analysis, from forecasting to classification and anomaly detection tasks. We introduce the Latent Generative Transformer Augmentation (L-GTA) model, a generative approach using a transformer-based variational recurrent autoencoder. This model applies controlled transformations within its latent space to generate new time series that preserve the intrinsic properties of the original dataset. L-GTA enables the application of diverse transformations, ranging from simple jittering to magnitude warping, as well as combinations of these basic transformations to generate more complex synthetic time series datasets. Our evaluation on several real-world datasets demonstrates the ability of L-GTA to produce more reliable, consistent, and controllable augmented data. This translates into significant improvements in predictive accuracy and similarity measures compared to direct transformation methods.

Updated: 2025-07-31 14:53:35

标题: L-GTA: 隐变量生成建模用于时间序列增强

摘要: 数据增强在时间序列分析的各个方面越来越重要，从预测到分类和异常检测任务。我们介绍了Latent Generative Transformer Augmentation (L-GTA)模型，这是一种生成方法，使用基于变分循环自编码器的变压器。该模型在模型的潜在空间内使用受控变换来生成保留原始数据集固有属性的新时间序列。L-GTA使得能够应用各种不同的转换，从简单的抖动到幅度扭曲，并结合这些基本转换来生成更复杂的合成时间序列数据集。我们对几个真实世界数据集的评估表明，L-GTA能够产生更可靠、一致和可控的增强数据。与直接转换方法相比，这将导致预测准确性和相似性度量的显著提高。

更新时间: 2025-07-31 14:53:35

Categories: cs.LG,cs.AI,68T01,I.5.1; G.3; H.2.8; I.2.1

Download: http://arxiv.org/abs/2507.23615v1

MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions, by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
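
As a rough illustration of how Boltzmann exploration can trade off task value against an information-gain bonus, consider the sketch below. The additive combination `q + beta * gain`, the parameters `beta` and `temperature`, and the function names are assumptions made for illustration; the abstract does not specify how MaxInfoRL combines the two terms.

```python
import math
import random

def boltzmann_probs(q_values, info_gains, beta=1.0, temperature=1.0):
    """Softmax over extrinsic value plus a weighted intrinsic (information gain) bonus."""
    scores = [(q + beta * g) / temperature for q, g in zip(q_values, info_gains)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `beta = 0` this reduces to ordinary Boltzmann exploration on the task value alone; raising `beta` shifts probability toward actions expected to be informative.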

Updated: 2025-07-31 14:50:20

标题: MaxInfoRL：通过最大化信息增益在强化学习中提升探索

摘要: 强化学习（RL）算法旨在平衡利用当前最佳策略与探索可能导致更高奖励的新选项。大多数常见的RL算法使用无向探索，即选择随机动作序列。探索也可以使用内在奖励进行引导，例如好奇心或模型认知不确定性。然而，有效地平衡任务和内在奖励是具有挑战性的，并且往往取决于任务。在这项工作中，我们介绍了一个框架MaxInfoRL，用于平衡内在和外在探索。MaxInfoRL通过最大化内在奖励，如关于潜在任务的信息增益，将探索引导向具有信息量的转换。当与Boltzmann探索结合时，这种方法自然地在值函数最大化和状态、奖励和动作的熵之间进行权衡。我们展示了我们的方法在简化的多臂老虎机设置中实现了次线性后悔。然后，我们将这种一般公式应用于各种基于模型的离线无模型RL方法，适用于连续状态动作空间，从而产生出在困难的探索问题和复杂场景中（如视觉控制任务）都能取得卓越表现的新算法。

更新时间: 2025-07-31 14:50:20

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2412.12098v2

LLM-Based Identification of Infostealer Infection Vectors from Screenshots: The Case of Aurora

Infostealers exfiltrate credentials, session cookies, and sensitive data from infected systems. With over 29 million stealer logs reported in 2024, manual analysis and mitigation at scale are virtually infeasible. While most research focuses on proactive malware detection, a significant gap remains in leveraging reactive analysis of stealer logs and their associated artifacts. Specifically, infection artifacts such as screenshots, images captured at the point of compromise, are largely overlooked by the current literature. This paper introduces a novel approach leveraging Large Language Models (LLMs), more specifically gpt-4o-mini, to analyze infection screenshots to extract potential Indicators of Compromise (IoCs), map infection vectors, and track campaigns. Focusing on the Aurora infostealer, we demonstrate how LLMs can process screenshots to identify infection vectors, such as malicious URLs, installer files, and exploited software themes. Our method extracted 337 actionable URLs and 246 relevant files from 1000 screenshots, revealing key malware distribution methods and social engineering tactics. By correlating extracted filenames, URLs, and infection themes, we identified three distinct malware campaigns, demonstrating the potential of LLM-driven analysis for uncovering infection workflows and enhancing threat intelligence. By shifting malware analysis from traditional log-based detection methods to a reactive, artifact-driven approach that leverages infection screenshots, this research presents a scalable method for identifying infection vectors and enabling early intervention.

Updated: 2025-07-31 14:49:03

标题: 基于LLM的从屏幕截图中识别Infostealer感染向量：以Aurora为例

摘要: Infostealers从受感染的系统中窃取凭据、会话cookie和敏感数据。据2024年报告，超过2900万个stealer日志，规模化的手动分析和缓解几乎是不可行的。尽管大多数研究侧重于主动恶意软件检测，但在利用对stealer日志及其相关物件的反应性分析方面仍存在重大差距。具体而言，目前的文献在忽略感染物件，例如截屏图像方面存在严重疏漏。本文引入了一种新颖的方法，利用大型语言模型（LLMs），更具体地说是gpt-4o-mini，来分析感染截屏图像，提取潜在的威胁指标（IoCs），映射感染向量，并跟踪活动。重点关注Aurora infostealer，我们展示了LLMs如何处理截屏图像以识别感染向量，如恶意URL、安装文件和被利用的软件主题。我们的方法从1000个截屏图像中提取了337个可操作的URL和246个相关文件，揭示了关键的恶意软件分发方法和社会工程策略。通过将提取的文件名、URL和感染主题进行关联，我们识别了三个不同的恶意软件活动，展示了LLM驱动分析揭示感染工作流程和增强威胁情报的潜力。通过将恶意软件分析从传统的基于日志的检测方法转变为一种反应式的、以物件为驱动的方法，利用感染截屏图像，这项研究提出了一种可扩展的方法，用于识别感染向量并实现早期干预。

更新时间: 2025-07-31 14:49:03

Categories: cs.CR,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.23611v1

Consistent Point Matching

This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.

Updated: 2025-07-31 14:47:40

标题: 一致的点匹配

摘要: 这项研究表明，将一致性启发式方法纳入点匹配算法\cite{yerebakan2023hierarchical}可以提高医学图像对之间解剖位置匹配的稳健性。我们在跨CT和MRI模态的多样化纵向内部和公共数据集上验证了我们的方法。值得注意的是，它在Deep Lesion Tracking数据集上超越了最新的研究结果。此外，我们展示了该方法有效地解决了地标定位问题。该算法在标准CPU硬件上高效运行，并允许在速度和稳健性之间进行可配置的权衡。该方法使得在医学图像之间实现高精度导航成为可能，而无需使用机器学习模型或训练数据。

更新时间: 2025-07-31 14:47:40

Categories: cs.CV,cs.DC,cs.LG

Download: http://arxiv.org/abs/2507.23609v1

Medical Image De-Identification Benchmark Challenge

The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge's design, implementation, results, and lessons learned.

Updated: 2025-07-31 14:47:20

标题: 医学影像去识别化基准挑战

摘要: 保护健康信息（PHI）和可识别个人信息（PII）的去识别（deID）是通过公共存储库分享医学图像的基本要求，以确保符合患者隐私法律。此外，保留非PHI元数据以通知和促进影像人工智能（AI）的下游开发在生物医学研究中是一个重要考虑因素。MIDI-B的目标是提供一个标准化平台，用于基于符合HIPAA Safe Harbor法规、DICOM属性机密性配置文件和The Cancer Imaging Archive（TCIA）定义的研究关键元数据保留最佳实践的规则进行DICOM图像去识别工具的基准测试。该挑战使用了一个庞大、多样化、多中心和多模态的真实去识别放射学图像集，插入了合成的PHI/PII。 MIDI-B挑战包括三个阶段：培训、验证和测试。80人注册参加挑战。在培训阶段，我们鼓励参与者使用他们的内部或公共数据来调整他们的算法。验证和测试阶段使用包含合成标识符的DICOM图像（分别为216和322个受试者）。十个团队成功完成了挑战的测试阶段。为了衡量基于规则的图像去识别方法的成功程度，得分被计算为正确操作的百分比与所需操作总数之间的比例。得分范围在97.91%到99.93%之间。参与者使用各种开源和专有工具进行定制配置，大型语言模型和光学字符识别（OCR）。在本文中，我们提供了关于MIDI-B挑战设计、实施、结果和经验教训的全面报告。

更新时间: 2025-07-31 14:47:20

Categories: cs.CV,cs.CR

Download: http://arxiv.org/abs/2507.23608v1

Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates

Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
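
The Poisson-Gamma assumption can be illustrated with a small Monte Carlo sketch: each site draws an enrollment rate from a Gamma distribution, and its patient count over the trial window is Poisson with that rate. The fixed `shape` and `rate` values, the function names, and the 5%-95% range are hypothetical choices for illustration; in the paper the distribution parameters come from the neural network, which is not reproduced here.

```python
import math
import random

def sample_poisson(mu, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_enrollment(n_sites, shape, rate, months, n_sims=2000, seed=0):
    """Return (mean, 5th percentile, 95th percentile) of total patients enrolled."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        total = 0
        for _ in range(n_sites):
            # Site-level rate in patients per month; gammavariate takes (shape, scale).
            lam = rng.gammavariate(shape, 1.0 / rate)
            total += sample_poisson(lam * months, rng)
        totals.append(total)
    totals.sort()
    mean = sum(totals) / n_sims
    return mean, totals[int(0.05 * n_sims)], totals[int(0.95 * n_sims) - 1]
```

The expected total is `n_sites * (shape / rate) * months`, and the spread of the simulated totals gives the kind of range estimate the probabilistic layer is meant to provide.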

Updated: 2025-07-31 14:47:16

标题: 基于深度学习的带有不确定性估计的临床试验招募预测

摘要: 临床试验是一种系统性的努力，旨在评估新药物或治疗方法的安全性和有效性。进行这类试验通常需要重大财务投资和精心规划，突显了对试验结果准确预测的需求。准确预测患者招募情况，是试验成功的关键因素之一，在规划阶段是主要挑战之一。在本研究中，我们提出了一种基于深度学习的新方法来解决这一关键挑战。我们的方法作为一个神经网络模型实现，利用预训练语言模型（PLMs）捕捉临床文件的复杂性和微妙之处，将其转化为表达性表示。然后，这些表示与编码的表格特征通过注意机制结合。为了考虑招募预测中的不确定性，我们通过基于Gamma分布的概率层增强模型，从而实现范围估计。我们将提出的模型应用于预测临床试验持续时间，假设基于Poisson-Gamma过程的站点级别招募。我们在真实临床试验数据上进行了大量实验，并展示了所提出的方法可以有效预测在给定临床试验中一定数量站点上招募的患者人数，优于已建立的基准模型。

更新时间: 2025-07-31 14:47:16

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.23607v1

Hierarchical Message-Passing Policies for Multi-Agent Reinforcement Learning

Decentralized Multi-Agent Reinforcement Learning (MARL) methods allow for learning scalable multi-agent policies, but suffer from partial observability and induced non-stationarity. These challenges can be addressed by introducing mechanisms that facilitate coordination and high-level planning. Specifically, coordination and temporal abstraction can be achieved through communication (e.g., message passing) and Hierarchical Reinforcement Learning (HRL) approaches to decision-making. However, optimization issues limit the applicability of hierarchical policies to multi-agent systems. As such, the combination of these approaches has not been fully explored. To fill this void, we propose a novel and effective methodology for learning multi-agent hierarchies of message-passing policies. We adopt the feudal HRL framework and rely on a hierarchical graph structure for planning and coordination among agents. Agents at lower levels in the hierarchy receive goals from the upper levels and exchange messages with neighboring agents at the same level. To learn hierarchical multi-agent policies, we design a novel reward-assignment method based on training the lower-level policies to maximize the advantage function associated with the upper levels. Results on relevant benchmarks show that our method performs favorably compared to the state of the art.

Updated: 2025-07-31 14:42:12

标题: 多智能体强化学习中的分层消息传递策略

摘要: 去中心化的多智能体强化学习（MARL）方法允许学习可扩展的多智能体策略，但面临部分可观察性和引发非稳态性的困难。这些挑战可以通过引入促进协调和高层规划的机制来解决。具体而言，协调和时间抽象可以通过通信（例如，消息传递）和分层强化学习（HRL）方法来实现决策。然而，优化问题限制了分层策略对多智能体系统的适用性。因此，这些方法的结合尚未得到充分探讨。为了填补这一空白，我们提出了一种新颖有效的方法，用于学习消息传递策略的多智能体层次结构。我们采用封建式HRL框架，并依赖于分层图结构进行规划和智能体之间的协调。在层次结构中处于较低级别的智能体从较高级别接收目标，并与同级别的相邻智能体交换消息。为了学习分层多智能体策略，我们设计了一种基于训练较低级别策略最大化与较高级别相关的优势函数的新颖奖励分配方法。相关基准测试结果显示，与现有技术相比，我们的方法表现良好。

更新时间: 2025-07-31 14:42:12

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23604v1

EB-gMCR: Energy-Based Generative Modeling for Signal Unmixing and Multivariate Curve Resolution

Signal unmixing analysis decomposes data into basic patterns and is widely applied in chemical and biological research. Multivariate curve resolution (MCR), a branch of signal unmixing, separates mixed chemical signals into base patterns (components) and their concentrations, playing a key role in understanding composition. Classical MCR is typically framed as matrix factorization (MF) and requires a user-specified component count, usually unknown in real data. As dataset size or component count increases, the scalability and reliability of MF-based MCR face significant challenges. This study reformulates MCR as a generative process (gMCR), and introduces an energy-based deep learning solver, EB-gMCR, that automatically discovers the smallest component set able to reconstruct the data faithfully. EB-gMCR starts from a large candidate pool (e.g., 1024 spectra) and employs a differentiable gating network to retain only active components while estimating their concentrations. On noisy synthetic datasets containing up to 256 latent sources, EB-gMCR maintained R^2 >= 0.98 and recovered the component count within 5% of the ground truth; at lower noise it achieved R^2 >= 0.99 with near exact component estimation. Additional chemical priors, such as non-negativity or nonlinear mixing, enter as simple plug-in functions, enabling adaptation to other instruments or domains without altering the core learning process. By uniting high-capacity generative modeling and hard component selection, EB-gMCR offers a practical route to large-scale signal unmixing analysis, including chemical library-driven scenarios. The source code is available at https://github.com/b05611038/ebgmcr_solver.
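
For orientation, the classical matrix-factorization view of MCR and the reported $R^2$ score can be written as below. The symbols ($D$ for the data matrix, $C$ for concentrations, $S$ for component spectra, $k$ for the component count) are conventional notation assumed here, not taken from the paper, and the non-negativity constraint is one common choice of chemical prior.

```latex
% Classical MCR as constrained matrix factorization with a user-chosen k
\[
  \min_{C \ge 0,\; S \ge 0} \; \lVert D - C S^{\top} \rVert_F^2,
  \qquad
  D \in \mathbb{R}^{n \times m},\;
  C \in \mathbb{R}^{n \times k},\;
  S \in \mathbb{R}^{m \times k}
\]

% Coefficient of determination of the reconstruction \hat{D} = C S^{\top},
% where \bar{D} is the matrix whose entries all equal the mean entry of D
\[
  R^2 = 1 - \frac{\lVert D - \hat{D} \rVert_F^2}{\lVert D - \bar{D} \rVert_F^2}
\]
```

In this notation, the difficulty the abstract points to is that $k$ must be fixed before solving, whereas EB-gMCR is described as selecting the active components from a large candidate pool.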

Updated: 2025-07-31 14:40:33

标题: EB-gMCR：能量基础生成建模用于信号分离和多变量曲线解析

摘要: 信号解混分析将数据分解为基本模式，并广泛应用于化学和生物研究。多元曲线分解（MCR）是信号解混的一个分支，将混合化学信号分离为基本模式（组分）及其浓度，对理解组成起着关键作用。经典MCR通常被构建为矩阵分解（MF），需要用户指定组件数量，在真实数据中通常是未知的。随着数据集大小或组件数量的增加，基于MF的MCR的可伸缩性和可靠性面临重大挑战。本研究将MCR重新制定为生成过程（gMCR），并引入了一种基于能量的深度学习求解器，即EB-gMCR，它可以自动发现能够忠实重建数据的最小组件集。EB-gMCR从一个庞大的候选池开始（例如，1024个光谱），并利用可微分的门控网络仅保留活动组件，并估计它们的浓度。在包含多达256个潜在源的嘈杂合成数据集中，EB-gMCR保持了R^2 >= 0.98，并在地面实况的5%范围内恢复了组件数量；在较低噪声下，它实现了R^2 >= 0.99，几乎完全准确的组件估计。额外的化学先验，如非负性或非线性混合，作为简单的插件功能输入，使其能够适应其他仪器或领域，而不改变核心学习过程。通过将高容量的生成建模和硬组件选择结合起来，EB-gMCR为大规模信号解混分析提供了一条实用的途径，包括基于化学库的情景。源代码可在https://github.com/b05611038/ebgmcr_solver 上找到。

更新时间: 2025-07-31 14:40:33

Categories: cs.LG,cs.CE,G.1.6; G.3; G.4; I.6.5

Download: http://arxiv.org/abs/2507.23600v1

Automated Code Review Using Large Language Models at Ericsson: An Experience Report

Code review, along with testing and static analysis, is one of the primary means of assuring the quality of released software. However, code review requires experienced developers who may not always have the time to perform an in-depth review of code. Automating code review can therefore help alleviate the cognitive burden on experienced software developers, allowing them to focus on their primary activities of writing code to add new features and fix bugs. In this paper, we describe our experience in using Large Language Models to automate the code review process at Ericsson. We describe the development of a lightweight tool using LLMs and static program analysis. We then describe our preliminary experiments with experienced developers in evaluating our code review tool and the encouraging results.

Updated: 2025-07-31 14:34:00

标题: 在爱立信使用大型语言模型进行自动化代码审查：一份经验报告

摘要: 代码审查是确保发布的软件质量的主要手段之一，与测试和静态分析一起。然而，代码审查需要有经验的开发人员，他们可能并不总是有时间进行深入审查代码。因此，自动化代码审查可以帮助减轻有经验的软件开发人员的认知负担，让他们可以专注于写入代码以添加新功能和修复错误的主要活动。在本文中，我们描述了在爱立信使用大型语言模型自动化代码审查过程的经验。我们描述了使用LLMs和静态程序分析开发轻量级工具的过程。然后，我们描述了与有经验的开发人员进行初步实验以评估我们的代码审查工具以及令人鼓舞的结果。

更新时间: 2025-07-31 14:34:00

Categories: cs.SE,cs.AI,D.2.7

Download: http://arxiv.org/abs/2507.19115v2

Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots

We investigate the emergence of objects in visual perception in the absence of any semantic annotation. The resulting model has received no supervision, does not use any pre-trained features, and yet it can segment the domain of an image into multiple independently moving regions. The resulting motion segmentation method can handle an unknown and varying number of objects in real-time. The core multi-modal conditional encoder-decoder architecture has one modality (optical flow) feed the encoder to produce a collection of latent codes (slots), and the other modality (color image) conditions the decoder to generate the first modality (flow) from the slots. The training criterion is designed to foster 'information separation' among the slots, while the architecture explicitly allocates activations to individual slots, leading to a method we call Divided Attention (DivA). At test time, DivA handles a different number of objects and different image resolution than seen at training, and is invariant to permutations of the slots. DivA achieves state-of-the-art performance while tripling the runtime speed of comparable methods, up to 104 FPS, and reduces the performance gap from supervised methods to 12% or less. Objects bootstrapped by DivA can then be used to prime static classifiers via contrastive learning. On fewer than 5,000 video clips, training DINO on DivA's object proposals narrows the performance gap to ImageNet-based training by up to 30.2% compared to training directly on the video frames.

Updated: 2025-07-31 14:26:12

标题: 分割注意力：具有上下文分离插槽的无监督多对象发现

摘要: 我们研究了在视觉感知中物体的出现，而没有任何语义注释。结果模型没有接受监督，也不使用任何预训练特征，但它可以将图像的领域分割成多个独立移动的区域。结果的运动分割方法可以实时处理未知和变化的物体数量。核心的多模态条件编码器-解码器架构有一个模态（光流）输入编码器以产生一组潜在编码（槽），另一个模态（彩色图像）调节解码器以从槽生成第一个模态（流）。训练标准旨在促进槽之间的“信息分离”，而架构明确地将激活分配给单个槽，导致我们称为分割关注（DivA）的方法。在测试时，DivA处理不同数量的物体和不同分辨率的图像，对槽的排列是不变的。DivA实现了最先进的性能，同时将可比方法的运行速度提高了三倍，达到104 FPS，并将性能差距从监督方法降低到12％或更低。通过DivA引导的物体可以通过对比学习来用于初始化静态分类器。在少于5,000个视频剪辑上，通过在DivA的物体提议上训练DINO，与直接在视频帧上训练相比，将性能差距缩小了高达30.2％。

更新时间: 2025-07-31 14:26:12

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2304.01430v3

Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study

Recent advancements in Large Language Models have sparked interest in their potential for robotic task planning. While these models demonstrate strong generative capabilities, their effectiveness in producing structured and executable plans remains uncertain. This paper presents a systematic evaluation of a broad spectrum of current state-of-the-art language models, each directly prompted using Planning Domain Definition Language (PDDL) domain and problem files, and compares their planning performance with the Fast Downward planner across a variety of benchmarks. In addition to measuring success rates, we assess how faithfully the generated plans translate into sequences of actions that can actually be executed, identifying both strengths and limitations of using these models in this setting. Our findings show that while the models perform well on simpler planning tasks, they continue to struggle with more complex scenarios that require precise resource management, consistent state tracking, and strict constraint compliance. These results underscore fundamental challenges in applying language models to robotic planning in real-world environments. By outlining the gaps that emerge during execution, we aim to guide future research toward combined approaches that integrate language models with classical planners in order to enhance the reliability and scalability of planning in autonomous robotics.

Updated: 2025-07-31 14:25:54

标题: LLM-Reasoning模型能否取代经典规划？基准研究

摘要: 最近大型语言模型的进展引起了人们对它们在机器人任务规划中潜力的兴趣。虽然这些模型展现出强大的生成能力，但它们在产生结构化和可执行计划方面的有效性仍不确定。本文对当前最先进语言模型的广泛评估进行了系统性分析，每个模型都直接使用规划领域定义语言域和问题文件进行提示，并将它们的规划性能与Fast Downward规划器在各种基准测试中进行了比较。除了衡量成功率，我们还评估了生成的计划如何忠实地转化为实际可执行的动作序列，识别了在这个设置中使用这些模型的优势和局限性。我们的研究结果显示，虽然这些模型在简单的规划任务上表现良好，但它们在需要精确资源管理、一致状态跟踪和严格约束遵从的更复杂场景中仍然面临困难。这些结果突显了将语言模型应用于真实环境中的机器人规划面临的基本挑战。通过概述执行过程中出现的差距，我们旨在引导未来研究朝着将语言模型与经典规划器相结合的综合方法发展，以增强自主机器人规划的可靠性和可扩展性。

更新时间: 2025-07-31 14:25:54

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2507.23589v1

SinBasis Networks: Matrix-Equivalent Feature Extraction for Wave-Like Optical Spectrograms

Wave-like images, from attosecond streaking spectrograms to optical spectra, audio mel-spectrograms, and periodic video frames, encode critical harmonic structures that elude conventional feature extractors. We propose a unified, matrix-equivalent framework that reinterprets convolution and attention as linear transforms on flattened inputs, revealing filter weights as basis vectors spanning latent feature subspaces. To infuse spectral priors, we apply elementwise $\sin(\cdot)$ mappings to each weight matrix. Embedding these transforms into CNN, ViT, and Capsule architectures yields Sin-Basis Networks with heightened sensitivity to periodic motifs and built-in invariance to spatial shifts. Experiments on a diverse collection of wave-like image datasets (including 80,000 synthetic attosecond streaking spectrograms, thousands of Raman, photoluminescence, and FTIR spectra, mel-spectrograms from AudioSet, and cycle-pattern frames from Kinetics) demonstrate substantial gains in reconstruction accuracy, translational robustness, and zero-shot cross-domain transfer. Theoretical analysis via matrix isomorphism and Mercer-kernel truncation quantifies how sinusoidal reparametrization enriches expressivity while preserving stability in data-scarce regimes. Sin-Basis Networks thus offer a lightweight, physics-informed approach to deep learning across all wave-form imaging modalities.
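
The core reparametrization, applying an elementwise sine to a weight matrix before using it as a linear transform, can be sketched in a few lines. This is a minimal illustration of a single dense layer under that reading of the abstract; the function name `sin_basis_linear` is hypothetical, and how the authors embed the transform into CNN, ViT, and Capsule layers is not shown.

```python
import math

def sin_basis_linear(weights, x):
    """Compute y = sin(W) x, where sin is applied elementwise to the raw weights W."""
    effective = [[math.sin(w) for w in row] for row in weights]
    return [sum(w * xi for w, xi in zip(row, x)) for row in effective]
```

One consequence visible even in this sketch is that every effective weight lies in [-1, 1] regardless of the raw parameter value, which bounds the scale of each output.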

Updated: 2025-07-31 14:24:03

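Title: SinBasis Networks: Matrix-Equivalent Feature Extraction for Wave-Like Optical Spectrograms

Abstract: Wave-like images, from attosecond streaking spectrograms to optical spectra, audio mel-spectrograms and periodic video frames, encode key harmonic structures that conventional feature extractors miss. We propose a unified, matrix-equivalent framework that reinterprets convolution and attention as linear transforms on flattened inputs, revealing filter weights as basis vectors spanning latent feature subspaces. To inject spectral priors, we apply an elementwise $\sin(\cdot)$ mapping to each weight matrix. Embedding these transforms into CNN, ViT and Capsule architectures yields Sin-Basis Networks, which are more sensitive to periodic motifs and have built-in invariance to spatial shifts. Experiments on a variety of wave-like image datasets, including 80,000 synthetic attosecond streaking spectrograms, thousands of Raman, photoluminescence and FTIR spectra, mel-spectrograms from AudioSet and cycle-pattern frames from Kinetics, show substantial gains in reconstruction accuracy, translational robustness and zero-shot cross-domain transfer. Theoretical analysis via matrix isomorphism and Mercer-kernel truncation quantifies how sinusoidal reparametrization enriches expressivity while preserving stability in data-scarce regimes. Sin-Basis Networks therefore offer a lightweight, physics-informed approach to deep learning across all wave-form imaging modalities.

Updated: 2025-07-31 14:24:03

Categories: cs.LG,cs.AI,cs.CV,physics.optics

Download: http://arxiv.org/abs/2505.06275v2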
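
The matrix-equivalent view described in the SinBasis entry above can be illustrated with a small NumPy sketch. It writes a 1D "valid" cross-correlation as a banded matrix acting on the flattened input, then applies an elementwise sine to the kernel weights before building that matrix. The function names `conv1d_as_matrix` and `sin_basis_conv`, the 1D setting, and the choice to apply the sine directly to the raw kernel are assumptions made for illustration; the abstract does not give the authors' exact layer definition.

```python
import numpy as np

def conv1d_as_matrix(kernel, n):
    """Return the (n-k+1, n) matrix whose product with a length-n signal
    equals the 'valid' cross-correlation of that signal with `kernel`."""
    k = len(kernel)
    mat = np.zeros((n - k + 1, n))
    for row in range(n - k + 1):
        # Each row is a shifted copy of the kernel: one "basis vector".
        mat[row, row:row + k] = kernel
    return mat

def sin_basis_conv(x, kernel):
    """Hypothetical sin-reparametrized layer: sine is applied to the weights,
    then the layer acts as a plain linear map on the flattened input."""
    x = np.asarray(x, dtype=float)
    weights = np.sin(np.asarray(kernel, dtype=float))
    return conv1d_as_matrix(weights, len(x)) @ x

if __name__ == "__main__":
    signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 16))
    print(sin_basis_conv(signal, [0.5, -1.0, 2.0]))
```

Because the sine is bounded, every entry of the resulting matrix lies in [-1, 1], which is one plausible reading of the stability property the abstract mentions.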

Agency Among Agents: Designing with Hypertextual Friction in the Algorithmic Web

Today's algorithm-driven interfaces, from recommendation feeds to GenAI tools, often prioritize engagement and efficiency at the expense of user agency. As systems take on more decision-making, users have less control over what they see and how meaning or relationships between content are constructed. This paper introduces "Hypertextual Friction," a conceptual design stance that repositions classical hypertext principles--friction, traceability, and structure--as actionable values for reclaiming agency in algorithmically mediated environments. Through a comparative analysis of real-world interfaces--Wikipedia vs. Instagram Explore, and Are.na vs. GenAI image tools--we examine how different systems structure user experience, navigation, and authorship. We show that hypertext systems emphasize provenance, associative thinking, and user-driven meaning-making, while algorithmic systems tend to obscure process and flatten participation. We contribute: (1) a comparative analysis of how interface structures shape agency in user-driven versus agent-driven systems, and (2) a conceptual stance that offers hypertextual values as design commitments for reclaiming agency in an increasingly algorithmic web.

Updated: 2025-07-31 14:18:28

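Title: Agency Among Agents: Designing with Hypertextual Friction in the Algorithmic Web

Abstract: Today's algorithm-driven interfaces, from recommendation feeds to GenAI tools, often prioritize engagement and efficiency at the expense of user agency. As systems take on more decision-making, users have less control over what they see and over how meaning or relationships between content are constructed. This paper introduces "Hypertextual Friction," a conceptual design stance that repositions classical hypertext principles (friction, traceability, and structure) as actionable values for reclaiming user agency in algorithmically mediated environments. Through a comparative analysis of real-world interfaces, Wikipedia vs. Instagram Explore and Are.na vs. GenAI image tools, we examine how different systems structure user experience, navigation, and authorship. We find that hypertext systems emphasize provenance, associative thinking, and user-driven meaning-making, while algorithmic systems tend to obscure process and flatten participation. We contribute: (1) a comparative analysis of how interface structures shape agency in user-driven versus agent-driven systems, and (2) a conceptual stance that offers hypertextual values as design commitments for reclaiming agency in an increasingly algorithmic web.

Updated: 2025-07-31 14:18:28

Categories: cs.HC,cs.AI,cs.MM,cs.SI

Download: http://arxiv.org/abs/2507.23585v1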

Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding

Multi-Agent Path Finding (MAPF) is a fundamental problem in artificial intelligence and robotics, requiring the computation of collision-free paths for multiple agents navigating from their start locations to designated goals. As autonomous systems become increasingly prevalent in warehouses, urban transportation, and other complex environments, MAPF has evolved from a theoretical challenge to a critical enabler of real-world multi-robot coordination. This comprehensive survey bridges the long-standing divide between classical algorithmic approaches and emerging learning-based methods in MAPF research. We present a unified framework that encompasses search-based methods (including Conflict-Based Search, Priority-Based Search, and Large Neighborhood Search), compilation-based approaches (SAT, SMT, CSP, ASP, and MIP formulations), and data-driven techniques (reinforcement learning, supervised learning, and hybrid strategies). Through systematic analysis of experimental practices across 200+ papers, we uncover significant disparities in evaluation methodologies, with classical methods typically tested on larger-scale instances (up to 200 by 200 grids with 1000+ agents) compared to learning-based approaches (predominantly 10-100 agents). We provide a comprehensive taxonomy of evaluation metrics, environment types, and baseline selections, highlighting the need for standardized benchmarking protocols. Finally, we outline promising future directions including mixed-motive MAPF with game-theoretic considerations, language-grounded planning with large language models, and neural solver architectures that combine the rigor of classical methods with the flexibility of deep learning. This survey serves as both a comprehensive reference for researchers and a practical guide for deploying MAPF solutions in increasingly complex real-world applications.

Updated: 2025-07-31 14:16:44

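Title: Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding

Abstract: Multi-Agent Path Finding (MAPF) is a fundamental problem in artificial intelligence and robotics that requires computing collision-free paths for multiple agents from their start locations to designated goals. As autonomous systems spread through warehouses, urban transportation, and other complex environments, MAPF has grown from a theoretical challenge into a key enabler of real-world multi-robot coordination. This comprehensive survey bridges the long-standing divide between classical algorithmic approaches and emerging learning-based methods in MAPF research. We present a unified framework covering search-based methods (including Conflict-Based Search, Priority-Based Search, and Large Neighborhood Search), compilation-based approaches (SAT, SMT, CSP, ASP, and MIP formulations), and data-driven techniques (reinforcement learning, supervised learning, and hybrid strategies). Through a systematic analysis of experimental practices across more than 200 papers, we find significant disparities in evaluation methodology: classical methods are typically tested on larger instances (up to 200 by 200 grids with more than 1000 agents), whereas learning-based approaches are usually tested with 10-100 agents. We provide a comprehensive taxonomy of evaluation metrics, environment types, and baseline selections, and stress the need for standardized benchmarking protocols. Finally, we outline promising future directions, including mixed-motive MAPF with game-theoretic considerations, language-grounded planning with large language models, and neural solver architectures that combine the rigor of classical methods with the flexibility of deep learning. This survey serves both as a comprehensive reference for researchers and as a practical guide for deploying MAPF solutions in increasingly complex real-world applications.

Updated: 2025-07-31 14:16:44

Categories: cs.AI,cs.LG,cs.MA,math.CO

Download: http://arxiv.org/abs/2505.19219v2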

GraphRAG-R1: Graph Retrieval-Augmented Generation with Process-Constrained Reinforcement Learning

Graph Retrieval-Augmented Generation (GraphRAG) has shown great effectiveness in enhancing the reasoning abilities of LLMs by leveraging graph structures for knowledge representation and modeling complex real-world relationships. However, existing GraphRAG methods still face significant bottlenecks when handling complex problems that require multi-hop reasoning, as their query and retrieval phases are largely based on pre-defined heuristics and do not fully utilize the reasoning potential of LLMs. To address this problem, we propose GraphRAG-R1, an adaptive GraphRAG framework that trains LLMs with process-constrained outcome-based reinforcement learning (RL) to enhance their multi-hop reasoning ability. Our method can decompose complex problems, autonomously invoke retrieval tools to acquire necessary information, and perform effective reasoning. Specifically, we utilize a modified version of Group Relative Policy Optimization (GRPO) that supports rollout-with-thinking capability. Next, we design two process-constrained reward functions. To handle the shallow retrieval problem, we design a Progressive Retrieval Attenuation (PRA) reward to encourage essential retrievals. Then, to handle the over-thinking problem, we design a Cost-Aware F1 (CAF) reward to balance model performance with computational cost. We further design a phase-dependent training strategy with three training stages corresponding to cold start and these two rewards. Lastly, our method adopts hybrid graph-textual retrieval to improve reasoning capacity. Extensive experimental results demonstrate that GraphRAG-R1 boosts LLM capabilities in solving complex reasoning problems compared to state-of-the-art GraphRAG methods on both in-domain and out-of-domain datasets. Furthermore, our framework can be flexibly integrated with various existing retrieval methods, consistently delivering performance improvements.

Updated: 2025-07-31 14:11:16

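Title: GraphRAG-R1: Graph Retrieval-Augmented Generation with Process-Constrained Reinforcement Learning

Abstract: Graph Retrieval-Augmented Generation (GraphRAG) has proven effective at enhancing the reasoning abilities of LLMs by using graph structures for knowledge representation and for modeling complex real-world relationships. However, existing GraphRAG methods still face major bottlenecks on complex problems that require multi-hop reasoning, because their query and retrieval phases rely mainly on pre-defined heuristics and do not fully exploit the reasoning potential of LLMs. To address this, we propose GraphRAG-R1, an adaptive GraphRAG framework that trains LLMs with process-constrained, outcome-based reinforcement learning (RL) to strengthen multi-hop reasoning. Our method can decompose complex problems, autonomously invoke retrieval tools to obtain the necessary information, and reason effectively. Specifically, we use a modified version of Group Relative Policy Optimization (GRPO) that supports rollout with thinking. We then design two process-constrained reward functions. To address shallow retrieval, we design a Progressive Retrieval Attenuation (PRA) reward that encourages essential retrievals. To address over-thinking, we design a Cost-Aware F1 (CAF) reward that balances model performance against computational cost. We further design a phase-dependent training strategy with three training stages corresponding to cold start and these two rewards. Finally, our method adopts hybrid graph-textual retrieval to improve reasoning capacity. Extensive experiments show that, compared with state-of-the-art GraphRAG methods, GraphRAG-R1 markedly improves LLMs' ability to solve complex reasoning problems on both in-domain and out-of-domain datasets. In addition, our framework can be flexibly integrated with various existing retrieval methods and consistently delivers performance improvements.

Updated: 2025-07-31 14:11:16

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23581v1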

Improved Robustness and Functional Localization in Topographic CNNs Through Weight Similarity

Topographic neural networks are computational models that can simulate the spatial and functional organization of the brain. Topographic constraints in neural networks can be implemented in multiple ways, with potentially different impacts on the representations learned by the network. The impact of such different implementations has not been systematically examined. To this end, here we compare topographic convolutional neural networks trained with two spatial constraints: Weight Similarity (WS), which pushes neighboring units to develop similar incoming weights, and Activation Similarity (AS), which enforces similarity in unit activations. We evaluate the resulting models on classification accuracy, robustness to weight perturbations and input degradation, and the spatial organization of learned representations. Compared to both AS and standard CNNs, WS provided three main advantages: i) improved robustness to noise, also showing higher accuracy under weight corruption; ii) greater input sensitivity, reflected in higher activation variance; and iii) stronger functional localization, with units showing similar activations positioned at closer distances. In addition, WS produced differences in orientation tuning, symmetry sensitivity, and eccentricity profiles of units, indicating an influence of this spatial constraint on the representational geometry of the network. Our findings suggest that during end-to-end training, WS constraints produce more robust representations than AS or non-topographic CNNs. These findings also suggest that weight-based spatial constraints can shape feature learning and functional organization in biophysically inspired models.

Updated: 2025-07-31 14:02:40

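Title: Improved Robustness and Functional Localization in Topographic CNNs Through Weight Similarity

Abstract: Topographic neural networks are computational models that can simulate the spatial and functional organization of the brain. Topographic constraints in neural networks can be implemented in several ways, which may affect the learned representations differently. The impact of these different implementations has not been examined systematically. To this end, we compare topographic convolutional neural networks trained with two spatial constraints: Weight Similarity (WS), which pushes neighboring units to develop similar incoming weights, and Activation Similarity (AS), which enforces similarity in unit activations. We evaluate the resulting models on classification accuracy, robustness to weight perturbations and input degradation, and the spatial organization of the learned representations. Compared with AS and standard CNNs, WS provides three main advantages: i) better robustness to noise, including higher accuracy under weight corruption; ii) greater input sensitivity, reflected in higher activation variance; and iii) stronger functional localization, with units that show similar activations positioned closer together. In addition, WS changes the orientation tuning, symmetry sensitivity, and eccentricity profiles of units, indicating that this spatial constraint influences the representational geometry of the network. Our findings suggest that during end-to-end training, WS constraints produce more robust representations than AS or non-topographic CNNs. They also suggest that weight-based spatial constraints can shape feature learning and functional organization in biophysically inspired models.

Updated: 2025-07-31 14:02:40

Categories: cs.LG

Download: http://arxiv.org/abs/2508.00043v1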

Neutral Residues: Revisiting Adapters for Model Extension

We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques, such as finetuning or low-rank adaptation (LoRA), are successful at domain adaptation but do not formally add capacity to the model. This often leads to a trade-off between performing well on the new domain and degrading performance on the original domain. Here, we revisit and improve adapters to extend LLMs from three angles: data, architecture and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zeros on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as finetuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.

Updated: 2025-07-31 14:02:13

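Title: Neutral Residues: Revisiting Adapters for Model Extension

Abstract: We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques such as finetuning or low-rank adaptation (LoRA) succeed at domain adaptation but do not formally add capacity to the model. This often leads to a trade-off between performing well on the new domain and degrading performance on the original domain. Here, we revisit and improve adapters for extending LLMs from three angles, namely data, architecture and training procedure, which are best considered jointly. The resulting method, called neutral residues, modifies adapters so that each new residual block outputs near-zeros on the original domain. This solution gives strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues clearly outperform competing approaches such as finetuning, LoRA or vanilla adapters in the trade-off between learning the new language and not forgetting English.

Updated: 2025-07-31 14:02:13

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2410.02744v3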

Optimised Feature Subset Selection via Simulated Annealing

We introduce SA-FDR, a novel algorithm for $\ell_0$-norm feature selection that considers this task as a combinatorial optimisation problem and solves it by using simulated annealing to perform a global search over the space of feature subsets. The optimisation is guided by the Fisher discriminant ratio, which we use as a computationally efficient proxy for model quality in classification tasks. Our experiments, conducted on datasets with up to hundreds of thousands of samples and hundreds of features, demonstrate that SA-FDR consistently selects more compact feature subsets while achieving a high predictive accuracy. This ability to recover informative yet minimal sets of features stems from its capacity to capture inter-feature dependencies often missed by greedy optimisation approaches. As a result, SA-FDR provides a flexible and effective solution for designing interpretable models in high-dimensional settings, particularly when model sparsity, interpretability, and performance are crucial.

Updated: 2025-07-31 13:57:38

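Title: Optimised Feature Subset Selection via Simulated Annealing

Abstract: We introduce SA-FDR, a novel algorithm for $\ell_0$-norm feature selection that treats the task as a combinatorial optimisation problem and solves it by using simulated annealing to perform a global search over the space of feature subsets. The optimisation is guided by the Fisher discriminant ratio, which we use as a computationally efficient proxy for model quality in classification tasks. Our experiments, run on datasets with up to hundreds of thousands of samples and hundreds of features, show that SA-FDR consistently selects more compact feature subsets while achieving high predictive accuracy. Its ability to recover informative yet minimal feature sets stems from its capacity to capture inter-feature dependencies that greedy optimisation approaches often miss. As a result, SA-FDR provides a flexible and effective solution for designing interpretable models in high-dimensional settings, particularly when model sparsity, interpretability, and performance are crucial.

Updated: 2025-07-31 13:57:38

Categories: cs.LG,cond-mat.stat-mech,stat.ML

Download: http://arxiv.org/abs/2507.23568v1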
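
The search described in the SA-FDR entry above can be sketched as simulated annealing over fixed-size feature subsets, scored by a two-class multivariate Fisher discriminant ratio. Everything specific here is an assumption for illustration and not the authors' algorithm: the fixed subset size `k`, the single-swap proposal, the geometric cooling schedule, the small ridge term, and the names `fisher_ratio` and `anneal_subset`.

```python
import math
import numpy as np

def fisher_ratio(X, y, subset):
    """Two-class Fisher ratio (mu1-mu0)^T Sw^{-1} (mu1-mu0) on the chosen columns."""
    idx = sorted(subset)
    X0, X1 = X[y == 0][:, idx], X[y == 1][:, idx]
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    within = (np.atleast_2d(np.cov(X0, rowvar=False))
              + np.atleast_2d(np.cov(X1, rowvar=False)))
    within = within + 1e-6 * np.eye(len(idx))  # small ridge for numerical safety
    return float(diff @ np.linalg.solve(within, diff))

def anneal_subset(X, y, k, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Search size-k subsets with Metropolis acceptance; return best subset and score."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    current = set(rng.choice(d, size=k, replace=False).tolist())
    cur_score = fisher_ratio(X, y, current)
    best, best_score = set(current), cur_score
    temp = t0
    for _ in range(steps):
        # Propose swapping one selected feature for one unselected feature.
        drop = int(rng.choice(sorted(current)))
        add = int(rng.choice(sorted(set(range(d)) - current)))
        cand = (current - {drop}) | {add}
        cand_score = fisher_ratio(X, y, cand)
        delta = cand_score - cur_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = set(current), cur_score
        temp *= cooling
    return sorted(best), best_score

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = np.repeat([0, 1], 100)
    X = rng.normal(size=(200, 8))
    X[y == 1, 0] += 3.0  # make two columns informative
    X[y == 1, 3] += 3.0
    print(anneal_subset(X, y, k=2))
```

Scoring the subset jointly, instead of ranking features one at a time, is what lets this kind of search account for dependencies between features, which is the advantage over greedy selection that the abstract points to.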

DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search

Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug repositioning. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pretraining. Conventional approaches, such as fine-tuning or retrieval-augmented generation, either impose high computational overhead or fail to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that synergistically integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search for drug repositioning. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general-purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent-based collaboration, and feedback-driven search mechanisms in advancing LLM applications for drug repositioning.

Updated: 2025-07-31 13:57:25

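Title: DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search

Abstract: Recent advances in large language models have shown considerable potential in scientific domains such as drug repositioning. However, their effectiveness remains limited when reasoning goes beyond the knowledge acquired during pretraining. Conventional approaches, such as fine-tuning or retrieval-augmented generation, are limited because they either impose high computational overhead or fail to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search for drug repositioning. The framework uses five specialized agents responsible for retrieving and analyzing molecular and protein information, enabling structured and iterative reasoning. Extensive experiments on the DrugBank and KIBA datasets show that DrugMCTS achieves higher recall and robustness than both general-purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent-based collaboration, and feedback-driven search mechanisms in advancing LLM applications for drug repositioning.

Updated: 2025-07-31 13:57:25

Categories: cs.AI,cs.CE

Download: http://arxiv.org/abs/2507.07426v3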

Momentum-based gradient descent methods for Lie groups

Polyak's Heavy Ball (PHB; Polyak, 1964), a.k.a. Classical Momentum, and Nesterov's Accelerated Gradient (NAG; Nesterov, 1983) are well-established momentum-descent methods for optimization. Although the latter generally outperforms the former, primarily, generalizations of PHB-like methods to nonlinear spaces have not been sufficiently explored in the literature. In this paper, we propose a generalization of NAG-like methods for Lie group optimization. This generalization is based on the variational one-to-one correspondence between classical and accelerated momentum methods (Campos et al., 2023). We provide numerical experiments for chosen retractions on the group of rotations, based on the Frobenius norm and the Rosenbrock function, to demonstrate the effectiveness of our proposed methods; the results align with those of the Euclidean case, that is, a faster convergence rate for NAG.

Updated: 2025-07-31 13:56:02

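Title: Momentum-based gradient descent methods for Lie groups

Abstract: Polyak's Heavy Ball (PHB; Polyak, 1964), also known as Classical Momentum, and Nesterov's Accelerated Gradient (NAG; Nesterov, 1983) are well-established momentum-descent methods in optimization. Although the latter generally outperforms the former, generalizations of PHB-like methods to nonlinear spaces have not been sufficiently explored in the literature. In this paper, we propose a generalization of NAG-like methods for Lie group optimization. This generalization is based on the variational one-to-one correspondence between classical and accelerated momentum methods (Campos et al., 2023). We provide numerical experiments for chosen retractions on the group of rotations, based on the Frobenius norm and the Rosenbrock function, to demonstrate the effectiveness of the proposed methods; the results agree with the Euclidean case, that is, NAG converges faster.

Updated: 2025-07-31 13:56:02

Categories: math.OC,cs.LG,cs.NA,math.DG,math.NA,65K10 (Primary) 70G45, 22E99 (Secondary)

Download: http://arxiv.org/abs/2404.09363v2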
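
For readers unfamiliar with the two baseline methods named in the entry above, the following is a minimal sketch of their standard Euclidean update rules only. It does not implement the Lie group generalization, which would replace the vector additions with a retraction on the group. The step size, momentum value, test function, and the function names `heavy_ball` and `nesterov` are illustrative assumptions.

```python
import numpy as np

def heavy_ball(grad, x0, lr, beta, steps):
    """Polyak's Heavy Ball: gradient evaluated at the current iterate."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

def nesterov(grad, x0, lr, beta, steps):
    """Nesterov's Accelerated Gradient: gradient evaluated at the look-ahead point."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        look_ahead = x + beta * (x - x_prev)
        x_next = look_ahead - lr * grad(look_ahead)
        x_prev, x = x, x_next
    return x

# Illustrative ill-conditioned quadratic f(x) = 0.5 * x^T A x, minimized at 0.
A = np.diag([1.0, 10.0])

def quad_grad(x):
    return A @ x

if __name__ == "__main__":
    start = [5.0, 5.0]
    print("PHB:", heavy_ball(quad_grad, start, lr=0.1, beta=0.9, steps=100))
    print("NAG:", nesterov(quad_grad, start, lr=0.1, beta=0.9, steps=100))
```

The only difference between the two loops is where the gradient is evaluated, which is the distinction the variational correspondence cited in the abstract builds on.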

Weighted least-squares approximation with determinantal point processes and generalized volume sampling

We consider the problem of approximating a function from $L^2$ by an element of a given $m$-dimensional space $V_m$, associated with some feature map $\boldsymbol{\varphi}$, using evaluations of the function at random points $x_1, \dots,x_n$. After recalling some results on optimal weighted least-squares using independent and identically distributed points, we consider weighted least-squares using projection determinantal point processes (DPP) or volume sampling. These distributions introduce dependence between the points that promotes diversity in the selected features $\boldsymbol{\varphi}(x_i)$. We first provide a generalized version of volume-rescaled sampling yielding quasi-optimality results in expectation with a number of samples $n = O(m\log(m))$, meaning that the expected $L^2$ error is bounded by a constant times the best approximation error in $L^2$. Further assuming that the function is in some normed vector space $H$ continuously embedded in $L^2$, we prove that the approximation error in $L^2$ is almost surely bounded by the best approximation error measured in the $H$-norm. This includes the cases of functions from $L^\infty$ or reproducing kernel Hilbert spaces. Finally, we present an alternative strategy that uses independent repetitions of projection DPP (or volume sampling), yielding error bounds similar to those obtained with i.i.d. or volume sampling, but in practice with a much lower number of samples. Numerical experiments illustrate the performance of the different strategies.

Updated: 2025-07-31 13:54:58

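Title: Weighted least-squares approximation with determinantal point processes and generalized volume sampling

Abstract: We consider the problem of approximating a function in $L^2$ by an element of a given $m$-dimensional space $V_m$, associated with some feature map $\boldsymbol{\varphi}$, using evaluations of the function at random points $x_1, \dots, x_n$. After recalling some results on optimal weighted least-squares with independent and identically distributed points, we consider weighted least-squares using projection determinantal point processes (DPP) or volume sampling. These distributions introduce dependence between the points, which promotes diversity in the selected features $\boldsymbol{\varphi}(x_i)$. We first provide a generalized version of volume-rescaled sampling that yields quasi-optimality results in expectation with $n = O(m\log(m))$ samples, meaning that the expected $L^2$ error is bounded by a constant times the best approximation error in $L^2$. Further assuming that the function lies in some normed vector space $H$ continuously embedded in $L^2$, we prove that the approximation error in $L^2$ is almost surely bounded by the best approximation error measured in the $H$-norm. This includes functions from $L^\infty$ or from reproducing kernel Hilbert spaces. Finally, we present an alternative strategy that uses independent repetitions of projection DPP (or volume sampling), yielding error bounds similar to those of i.i.d. or volume sampling but requiring far fewer samples in practice. Numerical experiments illustrate the performance of the different strategies.

Updated: 2025-07-31 13:54:58

Categories: math.NA,cs.LG,cs.NA,math.ST,stat.TH

Download: http://arxiv.org/abs/2312.14057v4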

Optimal and Near-Optimal Adaptive Vector Quantization

Quantization is a fundamental optimization for many machine-learning use cases, including compressing gradients, model weights and activations, and datasets. The most accurate form of quantization is adaptive, where the error is minimized with respect to a given input, rather than optimizing for the worst case. However, optimal adaptive quantization methods are considered infeasible in terms of both their runtime and memory requirements. We revisit the Adaptive Vector Quantization (AVQ) problem and present algorithms that find optimal solutions with asymptotically improved time and space complexity. We also present an even faster near-optimal algorithm for large inputs. Our experiments show our algorithms may open the door to using AVQ more extensively in a variety of machine learning applications.

Updated: 2025-07-31 13:53:50

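Title: Optimal and Near-Optimal Adaptive Vector Quantization

Abstract: Quantization is a fundamental optimization for many machine-learning use cases, including compressing gradients, model weights and activations, and datasets. The most accurate form of quantization is adaptive, where the error is minimized with respect to a given input instead of being optimized for the worst case. However, optimal adaptive quantization methods are considered infeasible in terms of both runtime and memory requirements. We revisit the Adaptive Vector Quantization (AVQ) problem and present algorithms that find optimal solutions with asymptotically improved time and space complexity. We also present an even faster near-optimal algorithm for large inputs. Our experiments show that our algorithms may allow AVQ to be used more widely across machine learning applications.

Updated: 2025-07-31 13:53:50

Categories: cs.LG,cs.DS,cs.IT,cs.NI,math.IT

Download: http://arxiv.org/abs/2402.03158v2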
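
To make the adaptive quantization problem in the entry above concrete, here is a naive dynamic-programming baseline under one common formulation, which is an assumption since the abstract does not state the objective: pick `s` quantization levels from the input values, round each value stochastically (and without bias) to its two neighbouring levels, and minimize the summed variance. This is the slow reference approach that faster algorithms aim to improve on, not the paper's algorithm; the names `interval_cost` and `avq_naive` are hypothetical.

```python
import numpy as np

def interval_cost(x, i, j):
    """Summed stochastic-rounding variance of x[i..j] when x[i] and x[j]
    are the two nearest levels: variance of a point v is (x[j]-v)*(v-x[i])."""
    seg = x[i:j + 1]
    return float(np.sum((x[j] - seg) * (seg - x[i])))

def avq_naive(values, s):
    """Choose exactly s levels from the sorted input minimizing total variance."""
    x = np.sort(np.asarray(values, dtype=float))
    d = len(x)
    if not 2 <= s <= d:
        raise ValueError("need 2 <= s <= len(values)")
    inf = float("inf")
    # cost[k][j]: best cost for x[0..j] using k levels, with x[0] and x[j] as levels.
    cost = [[inf] * d for _ in range(s + 1)]
    back = [[-1] * d for _ in range(s + 1)]
    cost[1][0] = 0.0
    for k in range(2, s + 1):
        for j in range(1, d):
            for i in range(j):
                c = cost[k - 1][i] + interval_cost(x, i, j)
                if c < cost[k][j]:
                    cost[k][j], back[k][j] = c, i
    # Walk the back-pointers from the largest value to recover the levels.
    levels, j = [], d - 1
    for k in range(s, 0, -1):
        levels.append(float(x[j]))
        j = back[k][j]
    return sorted(levels), cost[s][d - 1]

if __name__ == "__main__":
    print(avq_naive([0.0, 1.0, 2.0, 10.0], 3))
```

The triple loop plus the linear-time cost evaluation makes this sketch far too slow for large vectors, which illustrates why the abstract describes optimal adaptive methods as conventionally infeasible.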

Semantic Chain-of-Trust: Autonomous Trust Orchestration for Collaborator Selection via Hypergraph-Aided Agentic AI

In collaborative systems, the effective completion of tasks hinges on task-specific trust evaluations of potential devices for distributed collaboration. However, the complexity of tasks, the spatiotemporal dynamism of distributed device resources, and the inevitable assessment overhead dramatically increase the complexity and resource consumption of the trust evaluation process. As a result, ill-timed or overly frequent trust evaluations can reduce utilization rate of constrained resources, negatively affecting collaborative task execution. To address this challenge, this paper proposes an autonomous trust orchestration method based on a new concept of semantic chain-of-trust. Our technique employs agentic AI and hypergraph to establish and maintain trust relationships among devices. By leveraging its strengths in autonomous perception, task decomposition, and semantic reasoning, we propose agentic AI to perceive device states and autonomously perform trust evaluations of collaborators based on historical performance data only during device idle periods, thereby enabling efficient utilization of distributed resources. In addition, agentic AI performs task-specific trust evaluations on collaborator resources by analyzing the alignment between resource capabilities and task requirements. Moreover, by maintaining a trust hypergraph embedded with trust semantics for each device, agentic AI enables hierarchical management of collaborators and identifies collaborators requiring trust evaluation based on trust semantics, thereby achieving a balance between overhead and trust accuracy. Furthermore, local trust hypergraphs from multiple devices can be chained together to support multi-hop collaboration, enabling efficient coordination in large-scale systems. Experimental results demonstrate that the proposed method achieves resource-efficient trust evaluation.

Updated: 2025-07-31 13:53:25

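Title: Semantic Chain-of-Trust: Autonomous Trust Orchestration for Collaborator Selection via Hypergraph-Aided Agentic AI

Abstract: In collaborative systems, completing tasks effectively depends on task-specific trust evaluations of the candidate devices for distributed collaboration. However, the complexity of tasks, the spatiotemporal dynamism of distributed device resources, and the unavoidable assessment overhead markedly increase the complexity and resource consumption of the trust evaluation process. As a result, ill-timed or overly frequent trust evaluations can reduce the utilization of constrained resources and harm collaborative task execution. To address this challenge, this paper proposes an autonomous trust orchestration method based on the new concept of a semantic chain-of-trust. Our technique uses agentic AI and a hypergraph to establish and maintain trust relationships among devices. Drawing on its strengths in autonomous perception, task decomposition, and semantic reasoning, agentic AI perceives device states and autonomously evaluates the trustworthiness of collaborators from historical performance data only during device idle periods, enabling efficient use of distributed resources. In addition, agentic AI performs task-specific trust evaluations of collaborator resources by analyzing the alignment between resource capabilities and task requirements. By maintaining, for each device, a trust hypergraph embedded with trust semantics, agentic AI also enables hierarchical management of collaborators and uses the trust semantics to identify collaborators that need trust evaluation, balancing overhead against trust accuracy. Furthermore, the local trust hypergraphs of multiple devices can be chained together to support multi-hop collaboration, enabling efficient coordination in large-scale systems. Experimental results show that the proposed method achieves resource-efficient trust evaluation.

Updated: 2025-07-31 13:53:25

Categories: cs.AI

Download: http://arxiv.org/abs/2507.23565v1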

Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform

Spiking Neural Networks (SNNs) promise orders-of-magnitude lower power consumption and low-latency inference on neuromorphic hardware for a wide range of robotic tasks. In this work, we present an energy-efficient implementation of a reinforcement learning (RL) algorithm using quantized SNNs to solve two classical control tasks. The network is trained using the Q-learning algorithm, then fine-tuned and quantized to low-bit (8-bit) precision for embedded deployment on the SpiNNaker2 neuromorphic chip. To evaluate the comparative advantage of SpiNNaker2 over conventional computing platforms, we analyze inference latency, dynamic power consumption, and energy cost per inference for our SNN models, comparing performance against a GTX 1650 GPU baseline. Our results demonstrate SpiNNaker2's strong potential for scalable, low-energy neuromorphic computing, achieving up to 32x reduction in energy consumption. Inference latency remains on par with GPU-based execution, with improvements observed in certain task settings, reinforcing SpiNNaker2's viability for real-time neuromorphic control and making the neuromorphic approach a compelling direction for efficient deep Q-learning.

Updated: 2025-07-31 13:49:44

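Title: Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform

Abstract: Spiking Neural Networks (SNNs) promise low-power, low-latency inference on neuromorphic hardware for a wide range of robotic tasks. In this work, we present an energy-efficient implementation of a reinforcement learning (RL) algorithm that uses quantized SNNs to solve two classical control tasks. The network is trained with the Q-learning algorithm, then fine-tuned and quantized to low-bit (8-bit) precision for embedded deployment on the SpiNNaker2 neuromorphic chip. To evaluate the comparative advantage of SpiNNaker2 over conventional computing platforms, we analyze the inference latency, dynamic power consumption, and energy cost per inference of our SNN models and compare them against a GTX 1650 GPU baseline. Our results show that SpiNNaker2 has strong potential for scalable, low-energy neuromorphic computing, achieving up to a 32x reduction in energy consumption. Inference latency is on par with GPU-based execution, with improvements in certain task settings, which reinforces SpiNNaker2's viability for real-time neuromorphic control and makes the neuromorphic approach a compelling direction for efficient deep Q-learning.

Updated: 2025-07-31 13:49:44

Categories: cs.LG,cs.AR

Download: http://arxiv.org/abs/2507.23562v1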

DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer

Large language model-based agents, empowered by in-context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool-use tasks. However, existing works have shown that the effectiveness of ICL is highly sensitive to the choice of demonstrations, with suboptimal examples often leading to unstable or degraded performance. While prior work has explored example selection, including in some agentic or multi-step settings, existing approaches typically rely on heuristics or task-specific designs and lack a general, theoretically grounded criterion for what constitutes an effective demonstration across reasoning steps. Therefore, it is non-trivial to develop a principled, general-purpose method for selecting demonstrations that consistently benefit agent performance. In this paper, we address this challenge with DICE, Dynamic In-Context Example Selection for LLM Agents, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning. Our approach decomposes demonstration knowledge into transferable and non-transferable components through a causal lens, showing how the latter can introduce spurious dependencies that impair generalization. We further propose a stepwise selection criterion with a formal guarantee of improved agent performance. Importantly, DICE is a general, framework-agnostic solution that can be integrated as a plug-in module into existing agentic frameworks without any additional training cost. Extensive experiments across diverse domains demonstrate our method's effectiveness and generality, highlighting the importance of principled, context-aware demo selection for robust and efficient LLM agents.

Updated: 2025-07-31 13:42:14

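Title: DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer

Abstract: Large language model-based agents, empowered by in-context learning (ICL), have shown strong capabilities in complex reasoning and tool-use tasks. However, existing studies show that the effectiveness of ICL is highly sensitive to the choice of demonstrations, and suboptimal examples often lead to unstable or degraded performance. Although prior work has explored example selection, including in some agentic or multi-step settings, existing approaches typically rely on heuristics or task-specific designs and lack a general, theoretically grounded criterion for what makes a demonstration effective across reasoning steps. Developing a principled, general-purpose method for selecting demonstrations that consistently improve agent performance is therefore non-trivial. In this paper, we propose DICE, Dynamic In-Context Example Selection for LLM Agents, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each reasoning step. Our approach decomposes demonstration knowledge into transferable and non-transferable components through a causal lens, and shows how the latter can introduce spurious dependencies that impair generalization. We further propose a stepwise selection criterion with a formal guarantee of improved agent performance. Importantly, DICE is a general, framework-agnostic solution that can be integrated as a plug-in module into existing agentic frameworks without additional training cost. Extensive experiments across diverse domains demonstrate the effectiveness and generality of our method and highlight the importance of principled, context-aware demonstration selection for robust and efficient LLM agents.

Updated: 2025-07-31 13:42:14

Categories: cs.AI

Download: http://arxiv.org/abs/2507.23554v1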

Physics-informed Gaussian Processes as Linear Model Predictive Controller

We introduce a novel algorithm for controlling linear time invariant systems in a tracking problem. The controller is based on a Gaussian Process (GP) whose realizations satisfy a system of linear ordinary differential equations with constant coefficients. Control inputs for tracking are determined by conditioning the prior GP on the setpoints, i.e. control as inference. The resulting Model Predictive Control scheme incorporates pointwise soft constraints by introducing virtual setpoints to the posterior Gaussian process. We show theoretically that our controller satisfies open-loop stability for the optimal control problem by leveraging general results from Bayesian inference and demonstrate this result in a numerical example.

Updated: 2025-07-31 13:39:20

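Title: Physics-informed Gaussian Processes as Linear Model Predictive Controller

Abstract: We introduce a novel algorithm for controlling linear time-invariant systems in a tracking problem. The controller is based on a Gaussian Process (GP) whose realizations satisfy a system of linear ordinary differential equations with constant coefficients. Control inputs for tracking are determined by conditioning the prior GP on the setpoints, that is, by treating control as inference. The resulting Model Predictive Control scheme incorporates pointwise soft constraints by introducing virtual setpoints into the posterior Gaussian process. We show theoretically, using general results from Bayesian inference, that our controller satisfies open-loop stability for the optimal control problem, and we demonstrate this result in a numerical example.

Updated: 2025-07-31 13:39:20

Categories: math.OC,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2412.04502v2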

Molecule Graph Networks with Many-body Equivariant Interactions

Message passing neural networks have demonstrated significant efficacy in predicting molecular interactions. Introducing equivariant vectorial representations augments expressivity by capturing geometric data symmetries, thereby improving model accuracy. However, two-body bond vectors in opposition may cancel each other out during message passing, leading to the loss of directional information on their shared node. In this study, we develop Equivariant N-body Interaction Networks (ENINet) that explicitly integrate $l = 1$ equivariant many-body interactions to enhance directional symmetric information in the message passing scheme. We provide a mathematical analysis demonstrating the necessity of incorporating many-body equivariant interactions and generalize the formulation to $N$-body interactions. Experiments indicate that integrating many-body equivariant representations enhances prediction accuracy across diverse scalar and tensorial quantum chemical properties.

Updated: 2025-07-31 13:38:55

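Title: Molecule Graph Networks with Many-body Equivariant Interactions

Abstract: Message passing neural networks have shown significant efficacy in predicting molecular interactions. Introducing equivariant vectorial representations increases expressivity by capturing the symmetries of geometric data, which improves model accuracy. However, opposing two-body bond vectors may cancel each other out during message passing, so that directional information on their shared node is lost. In this study, we develop Equivariant N-body Interaction Networks (ENINet), which explicitly integrate $l = 1$ equivariant many-body interactions to enhance directional symmetric information in the message passing scheme. We provide a mathematical analysis showing that incorporating many-body equivariant interactions is necessary, and we generalize the formulation to $N$-body interactions. Experiments indicate that integrating many-body equivariant representations improves prediction accuracy across diverse scalar and tensorial quantum chemical properties.

Updated: 2025-07-31 13:38:55

Categories: cs.LG,cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2406.13265v3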
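
The cancellation problem mentioned in the entry above is easy to reproduce numerically. The sketch below only demonstrates the failure mode: summing unit bond vectors at a shared atom gives zero for a linear, symmetric arrangement, so that aggregate carries no directional information there. It does not show how ENINet resolves this, and the coordinates and the name `summed_bond_message` are illustrative assumptions.

```python
import numpy as np

def summed_bond_message(center, neighbors):
    """Sum of unit bond vectors from `center` to each neighbor,
    a simple two-body, vector-valued (l = 1) aggregate."""
    bonds = np.asarray(neighbors, dtype=float) - np.asarray(center, dtype=float)
    units = bonds / np.linalg.norm(bonds, axis=1, keepdims=True)
    return units.sum(axis=0)

if __name__ == "__main__":
    center = [0.0, 0.0, 0.0]
    # Linear arrangement: two neighbors on opposite sides of the center.
    linear = [[1.2, 0.0, 0.0], [-1.2, 0.0, 0.0]]
    # Bent arrangement: the same bond length, but an angle below 180 degrees.
    bent = [[1.0, 0.6, 0.0], [-1.0, 0.6, 0.0]]
    print("linear:", summed_bond_message(center, linear))
    print("bent:  ", summed_bond_message(center, bent))
```

In the linear case the sum vanishes even though the bond axis is a well-defined direction, which is the kind of lost information the abstract attributes to purely two-body vector messages.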

ART: Adaptive Relation Tuning for Generalized Relation Prediction

Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.

Updated: 2025-07-31 13:34:06

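Title: ART: Adaptive Relation Tuning for Generalized Relation Prediction

Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained only on relation detection data struggle to generalize beyond the relations they were trained on. Although prompt tuning has been used to adapt vision-language models (VLMs) to VRD, it relies on handcrafted prompts and has difficulty with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We therefore introduce ART, an Adaptive Relation Tuning framework that adapts VLMs to VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and applying an adaptive sampling algorithm, ART steers the VLM toward informative relations while preserving generalizability. Specifically, we focus on relation classification, where the subject and object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate on multiple held-out datasets of varying complexity. Our approach clearly improves over its baselines and can infer unseen relation concepts, a capability missing from mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations to segment complex scenes.

Updated: 2025-07-31 13:34:06

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.23543v1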

Toward Integrated Solutions: A Systematic Interdisciplinary Review of Cybergrooming Research

Cybergrooming exploits minors through online trust-building, yet research remains fragmented, limiting holistic prevention. Social sciences focus on behavioral insights, while computational methods emphasize detection, but their integration remains insufficient. This review systematically synthesizes both fields using the PRISMA framework to enhance clarity, reproducibility, and cross-disciplinary collaboration. Findings show that qualitative methods offer deep insights but are resource-intensive, machine learning models depend on data quality, and standard metrics struggle with imbalance and cultural nuances. By bridging these gaps, this review advances interdisciplinary cybergrooming research, guiding future efforts toward more effective prevention and detection strategies.

Updated: 2025-07-31 13:33:16

Categories: cs.CY,cs.CR

Download: http://arxiv.org/abs/2503.05727v2

Tile and Slide: A New Framework for Scaling NeRF from Local to Global 3D Earth Observation

Neural Radiance Fields (NeRF) have recently emerged as a paradigm for 3D reconstruction from multiview satellite imagery. However, state-of-the-art NeRF methods are typically constrained to small scenes due to the memory footprint during training, which we study in this paper. Previous work on large-scale NeRFs alleviates this by dividing the scene into multiple NeRFs. This paper introduces Snake-NeRF, a framework that scales to large scenes. Our out-of-core method eliminates the need to load all images and networks simultaneously, and operates on a single device. We achieve this by dividing the region of interest into NeRFs that tile it in 3D without overlap. Importantly, we crop the images with overlap to ensure each NeRF is trained with all the necessary pixels. We introduce a novel $2\times 2$ 3D tile progression strategy and segmented sampler, which together prevent 3D reconstruction errors along the tile edges. Our experiments conclude that large satellite images can effectively be processed with linear time complexity, on a single GPU, and without compromise in quality.

Updated: 2025-07-31 13:32:03

Categories: cs.CV,cs.AI,cs.GR,cs.LG

Download: http://arxiv.org/abs/2507.01631v2

A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving

Autonomous driving systems face significant challenges in achieving human-like adaptability, robustness, and interpretability in complex, open-world environments. These challenges stem from fragmented architectures, limited generalization to novel scenarios, and insufficient semantic extraction from perception. To address these limitations, we propose a unified Perception-Language-Action (PLA) framework that integrates multi-sensor fusion (cameras, LiDAR, radar) with a large language model (LLM)-augmented Vision-Language-Action (VLA) architecture, specifically a GPT-4.1-powered reasoning core. This framework unifies low-level sensory processing with high-level contextual reasoning, tightly coupling perception with natural language-based semantic understanding and decision-making to enable context-aware, explainable, and safety-bounded autonomous driving. Evaluations on an urban intersection scenario with a construction zone demonstrate superior performance in trajectory tracking, speed prediction, and adaptive planning. The results highlight the potential of language-augmented cognitive frameworks for advancing the safety, interpretability, and scalability of autonomous driving systems.

Updated: 2025-07-31 13:30:47

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.23540v1

Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions

Motivated by the problem of fast processing of attention matrices, we study fast algorithms for computing matrix-vector products for asymmetric Gaussian Kernel matrices $K\in \mathbb{R}^{n\times n}$. $K$'s columns are indexed by a set of $n$ keys $k_1,k_2\ldots, k_n\in \mathbb{R}^d$, rows by a set of $n$ queries $q_1,q_2,\ldots,q_n\in \mathbb{R}^d $, and its $i,j$ entry is $K_{ij} = e^{-\|q_i-k_j\|_2^2/2\sigma^2}$ for some bandwidth parameter $\sigma>0$. Given a vector $x\in \mathbb{R}^n$ and error parameter $\epsilon>0$, our task is to output a $y\in \mathbb{R}^n$ such that $\|Kx-y\|_2\leq \epsilon \|x\|_2$ in time subquadratic in $n$ and linear in $d$. Our algorithms rely on the following modelling assumption about the matrices $K$: the sum of the entries of $K$ scales linearly in $n$, as opposed to worst case quadratic growth. We validate this assumption experimentally, for Gaussian kernel matrices encountered in various settings such as fast attention computation in LLMs. We obtain the first subquadratic-time algorithm that works under this assumption, for unrestricted vectors.
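To make the problem statement concrete, here is a minimal NumPy sketch of the quadratic-time baseline that a subquadratic algorithm would have to beat. It is not the paper's algorithm. The function name `gaussian_kernel_matvec` is hypothetical, and the exponent is read as $-\|q_i-k_j\|_2^2/(2\sigma^2)$, which is an assumption about the intended grouping. The second return value is the sum of the entries of $K$, the quantity the modelling assumption says should grow linearly in $n$.

```python
import numpy as np

def gaussian_kernel_matvec(queries, keys, x, sigma):
    """Naive O(n^2 d) reference: returns (K @ x, sum of entries of K).

    Assumes K[i, j] = exp(-||q_i - k_j||^2 / (2 * sigma^2)).
    queries, keys: arrays of shape (n, d); x: array of shape (n,).
    """
    # Pairwise squared distances via ||q||^2 - 2 q.k + ||k||^2.
    sq_dist = (
        np.sum(queries**2, axis=1)[:, None]
        - 2.0 * queries @ keys.T
        + np.sum(keys**2, axis=1)[None, :]
    )
    # Guard against tiny negative values caused by floating-point rounding.
    sq_dist = np.maximum(sq_dist, 0.0)
    K = np.exp(-sq_dist / (2.0 * sigma**2))
    return K @ x, K.sum()
```

An approximate output $y$ can be compared against this reference by checking $\|Kx-y\|_2\leq \epsilon \|x\|_2$ directly on small inputs.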

Updated: 2025-07-31 13:29:43

Categories: cs.LG,cs.DS

Download: http://arxiv.org/abs/2507.23539v1

On the Formalization of Cryptographic Migration

We present a novel approach to gaining insight into the structure of cryptographic migration problems, which are classic problems in applied cryptography. We use a formal model to capture the inherent dependencies and complexities of such transitions. Using classical mathematical results from combinatorics, probability theory, and combinatorial analysis, we evaluate the challenges of migrating large cryptographic IT infrastructures and prove that, in a suitable sense, cryptographic migration exhibits a certain expected complexity. We also provide numerical data for selected parameter sets. Furthermore, we analyze the proposed model in terms of real-world patterns and its practical applicability. Additionally, we discuss the challenges of modeling real-world migration projects. As concrete examples we examine the transition to post-quantum cryptography of the CI/CD system GitLab and the multi-level technological transition of distribution power grids. This work paves the way for future advancements in both the theoretical understanding and practical implementation of cryptographic migration strategies.

Updated: 2025-07-31 13:24:55

Categories: cs.CR

Download: http://arxiv.org/abs/2408.05997v4

From LLMs to Edge: Parameter-Efficient Fine-Tuning on Edge Devices

Parameter-efficient fine-tuning (PEFT) methods reduce the computational costs of updating deep learning models by minimizing the number of additional parameters used to adapt a model to a downstream task. While extensively researched in large language models (LLMs), their application to smaller models used on edge devices, such as convolutional neural networks, remains underexplored. This paper benchmarks and analyzes popular PEFT methods on convolutional architectures typically deployed in resource-constrained edge environments. We evaluate LoRA, DoRA, and GaLore for updating standard and depthwise convolutional architectures to handle distribution shifts and accommodate unseen classes. We utilize recently proposed PyTorch profilers to compare the updated model performance and computational costs of these PEFT methods with traditional fine-tuning approaches. With resource efficiency in mind, we investigate their update behavior across different rank dimensions. We find that the evaluated PEFT methods are only half as memory-efficient when applied to depthwise-separable convolution architectures, compared to their efficiency with LLMs. Conversely, when targeting convolutional architectures optimized for edge deployment, adapter-based PEFT methods can reduce floating point operations (FLOPs) during model updates by up to 95%. These insights offer valuable guidance for selecting PEFT methods based on hardware constraints, performance requirements, and application needs. Our code is online.
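As a rough illustration of why the rank dimension matters, the sketch below shows the generic LoRA idea on a single dense weight matrix: the frozen weight `W` is left untouched and only the low-rank factors `A` and `B` are trained. This is a minimal sketch of the standard technique, not the paper's benchmark code. The helper names and the `scale` parameter are hypothetical, and applying it to a convolution would additionally assume the kernel is reshaped into a matrix.

```python
import numpy as np

def lora_param_counts(d_out, d_in, rank):
    """Return (trainable params for full fine-tuning, trainable params for LoRA)."""
    full = d_out * d_in              # every entry of W is updated
    lora = rank * (d_out + d_in)     # only B (d_out x r) and A (r x d_in) are updated
    return full, lora

def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x without materializing the updated matrix.

    W: (d_out, d_in) frozen weight, B: (d_out, r), A: (r, d_in), x: (d_in,).
    """
    return W @ x + scale * (B @ (A @ x))
```

With a 512 x 512 layer and rank 8, the trainable parameter count drops from 262144 to 8192. The saving shrinks as the rank grows or as the layer itself gets smaller, which suggests why architectures with small per-layer weights may benefit less.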

Updated: 2025-07-31 13:23:21

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.23536v1

PurpCode: Reasoning for Safer Code Generation

We introduce PurpCode, the first post-training recipe for training safe code reasoning models towards generating secure code and defending against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code and to avoid facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety and preserves model utility through diverse, multi-objective reward mechanisms. To empower the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize comprehensive and high-coverage prompts based on real-world tasks for inducing unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, namely PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming various frontier models. Meanwhile, our alignment method decreases the model overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.

Updated: 2025-07-31 13:22:45

Categories: cs.CR,cs.CL,cs.LG,cs.SE

Download: http://arxiv.org/abs/2507.19060v2

Transparent AI: The Case for Interpretability and Explainability

As artificial intelligence systems increasingly inform high-stakes decisions across sectors, transparency has become foundational to responsible and trustworthy AI implementation. Leveraging our role as a leading institute in advancing AI research and enabling industry adoption, we present key insights and lessons learned from practical interpretability applications across diverse domains. This paper offers actionable strategies and implementation guidance tailored to organizations at varying stages of AI maturity, emphasizing the integration of interpretability as a core design principle rather than a retrospective add-on.

Updated: 2025-07-31 13:22:14

Categories: cs.LG,cs.AI,cs.CY

Download: http://arxiv.org/abs/2507.23535v1

Continual Learning with Synthetic Boundary Experience Blending

Continual learning (CL) aims to address catastrophic forgetting in models trained sequentially on multiple tasks. While experience replay has shown promise, its effectiveness is often limited by the sparse distribution of stored key samples, leading to overly simplified decision boundaries. We hypothesize that introducing synthetic data near the decision boundary (Synthetic Boundary Data, or SBD) during training serves as an implicit regularizer, improving boundary stability and mitigating forgetting. To validate this hypothesis, we propose a novel training framework, Experience Blending, which integrates knowledge from both stored key samples and synthetic, boundary-adjacent data. Experience blending consists of two core components: (1) a multivariate Differential Privacy (DP) noise mechanism that injects batch-wise noise into low-dimensional feature representations, generating SBD; and (2) an end-to-end training strategy that jointly leverages both stored key samples and SBD. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate that our method outperforms nine CL baselines, achieving accuracy improvements of 10%, 6%, and 13%, respectively.

Updated: 2025-07-31 13:20:17

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2507.23534v1

Diffusion Beats Autoregressive in Data-Constrained Settings

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.

Updated: 2025-07-31 13:10:29

Categories: cs.LG,cs.AI,cs.CV,cs.RO

Download: http://arxiv.org/abs/2507.15857v4

H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but they face significant limitations because the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.

Updated: 2025-07-31 13:06:59

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.23523v1

Trusted Routing for Blockchain-Empowered UAV Networks via Multi-Agent Deep Reinforcement Learning

Due to their high flexibility and versatility, unmanned aerial vehicles (UAVs) are leveraged in various fields including surveillance and disaster rescue. However, in UAV networks, routing is vulnerable to malicious damage due to distributed topologies and high dynamics. Hence, ensuring the routing security of UAV networks is challenging. In this paper, we characterize the routing process in a time-varying UAV network with malicious nodes. Specifically, we formulate the routing problem to minimize the total delay, which is an integer linear programming problem and intractable to solve. Then, to tackle the network security issue, a blockchain-based trust management mechanism (BTMM) is designed to dynamically evaluate trust values and identify low-trust UAVs. To improve traditional practical Byzantine fault tolerance algorithms in the blockchain, we propose a consensus UAV update mechanism. Besides, considering the local observability, the routing problem is reformulated into a decentralized partially observable Markov decision process. Further, a multi-agent double deep Q-network based routing algorithm is designed to minimize the total delay. Finally, simulations are conducted with attacked UAVs, and numerical results show that the delay of the proposed mechanism is 13.39\%, 12.74\%, and 16.6\% lower than that of multi-agent proximal policy optimal algorithms, multi-agent deep Q-network algorithms, and methods without BTMM, respectively.

Updated: 2025-07-31 13:00:10

Categories: eess.SY,cs.AI,cs.CR,cs.SY

Download: http://arxiv.org/abs/2508.00938v1

TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding

We propose TPP-SD, a novel approach that accelerates Transformer temporal point process (TPP) sampling by adapting speculative decoding (SD) techniques from language models. By identifying the structural similarities between thinning algorithms for TPPs and speculative decoding for language models, we develop an efficient sampling framework that leverages a smaller draft model to generate multiple candidate events, which are then verified by the larger target model in parallel. TPP-SD maintains the same output distribution as autoregressive sampling while achieving significant acceleration. Experiments on both synthetic and real datasets demonstrate that our approach produces samples from the same distributions as standard methods, but with 2-6$\times$ speedup. Our ablation studies analyze the impact of hyperparameters such as draft length and draft model size on sampling efficiency. TPP-SD bridges the gap between powerful Transformer TPP models and the practical need for rapid sequence sampling.
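The propose-then-verify structure that thinning shares with speculative decoding can be seen in the classical thinning sampler, sketched below for the simplest case of an inhomogeneous Poisson process. This is not TPP-SD itself: the intensity here does not depend on event history, there is no draft model, and the function name `sample_by_thinning` is hypothetical. The sketch assumes `intensity(t) <= lam_max` for every `t` in the sampling window.

```python
import math
import random

def sample_by_thinning(intensity, lam_max, t_end, rng):
    """Sample event times on [0, t_end] by thinning a rate-lam_max Poisson process.

    intensity: function t -> rate, assumed to satisfy 0 <= intensity(t) <= lam_max.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    t, events = 0.0, []
    while True:
        # Propose: next candidate from the cheap dominating process.
        t += rng.expovariate(lam_max)
        if t > t_end:
            return events
        # Verify: keep the candidate with probability intensity(t) / lam_max.
        if rng.random() * lam_max < intensity(t):
            events.append(t)

# Example with a hypothetical oscillating intensity bounded by 1.5.
events = sample_by_thinning(lambda t: 1.0 + 0.5 * math.sin(t), 2.0, 10.0, random.Random(0))
```

Each candidate is accepted or rejected one at a time here. The abstract's idea is to draft several candidate events cheaply and verify them with the target model in parallel while keeping the output distribution unchanged.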

Updated: 2025-07-31 12:50:59

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2507.09252v2

Differentially Private Clipped-SGD: High-Probability Convergence with Arbitrary Clipping Level

Gradient clipping is a fundamental tool in Deep Learning, improving the high-probability convergence of stochastic first-order methods like SGD, AdaGrad, and Adam under heavy-tailed noise, which is common in training large language models. It is also a crucial component of Differential Privacy (DP) mechanisms. However, existing high-probability convergence analyses typically require the clipping threshold to increase with the number of optimization steps, which is incompatible with standard DP mechanisms like the Gaussian mechanism. In this work, we close this gap by providing the first high-probability convergence analysis for DP-Clipped-SGD with a fixed clipping level, applicable to both convex and non-convex smooth optimization under heavy-tailed noise, characterized by a bounded central $\alpha$-th moment assumption, $\alpha \in (1,2]$. Our results show that, with a fixed clipping level, the method converges to a neighborhood of the optimal solution with a faster rate than the existing ones. The neighborhood can be balanced against the noise introduced by DP, providing a refined trade-off between convergence speed and privacy guarantees.
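For readers unfamiliar with the update being analyzed, the following is a minimal sketch of one clipped-SGD step with Gaussian noise and a fixed clipping level. It illustrates the general mechanism only. The function names, the single-gradient setting (practical DP-SGD clips per-example gradients before averaging), and the way `noise_std` is chosen are all assumptions rather than details taken from the paper.

```python
import numpy as np

def clip(g, clip_level):
    """Rescale g so its Euclidean norm is at most clip_level."""
    norm = np.linalg.norm(g)
    return g if norm <= clip_level else g * (clip_level / norm)

def dp_clipped_sgd_step(x, grad, lr, clip_level, noise_std, rng):
    """One step: clip the stochastic gradient, add Gaussian noise, then descend.

    The clipping level is fixed across steps, matching the setting in the abstract.
    noise_std would be calibrated to clip_level and the privacy budget (not shown).
    """
    noisy_grad = clip(grad, clip_level) + rng.normal(0.0, noise_std, size=grad.shape)
    return x - lr * noisy_grad
```

Because the clipped gradient has norm at most `clip_level`, the sensitivity of each step is bounded regardless of how heavy-tailed the gradient noise is, which is what lets the Gaussian mechanism be applied with a noise scale that does not grow with the number of steps.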

Updated: 2025-07-31 12:48:29

Categories: cs.LG,math.OC

Download: http://arxiv.org/abs/2507.23512v1

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat

Updated: 2025-07-31 12:47:43

Categories: eess.AS,cs.AI,cs.CL,cs.SD

Download: http://arxiv.org/abs/2507.23511v1

A Zero-Knowledge Proof for the Syndrome Decoding Problem in the Lee Metric

The syndrome decoding problem is one of the NP-complete problems lying at the foundation of code-based cryptography. The variant thereof where the distance between vectors is measured with respect to the Lee metric, rather than the more commonly used Hamming metric, has been analyzed recently in several works due to its potential relevance for building more efficient code-based cryptosystems. The purpose of this article is to present a zero-knowledge proof of knowledge for this variant of the problem.

Updated: 2025-07-31 12:46:58

Categories: cs.CR,cs.IT,math.IT,94A60, 68P25

Download: http://arxiv.org/abs/2502.11641v3

TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling

To keep up with the ever-growing computational complexity and data intensity of state-of-the-art AI models, new computing paradigms are being proposed. These paradigms aim at achieving high energy efficiency by mitigating the Von Neumann bottleneck that relates to the energy cost of moving data between the processing cores and the memory. Convolutional Neural Networks (CNNs) are susceptible to this bottleneck, given the massive data they have to manage. Systolic arrays (SAs) are promising architectures to mitigate data transmission cost, thanks to high data utilization of Processing Elements (PEs). These PEs continuously exchange and process data locally based on specific dataflows (such as weight stationary and row stationary), in turn reducing the number of memory accesses to the main memory. In SAs, convolutions are managed either as matrix multiplications or exploiting the raster-order scan of sliding windows. However, data redundancy is a primary concern affecting area, power, and energy. In this paper, we propose TrIM: a novel dataflow for SAs based on a Triangular Input Movement and compatible with CNN computing. TrIM maximizes the local input utilization, minimizes the weight data movement, and solves the data redundancy problem. Furthermore, TrIM does not incur the significant on-chip memory penalty introduced by the row stationary dataflow. When compared to state-of-the-art SA dataflows, the high data utilization offered by TrIM guarantees ~10X fewer memory accesses. Furthermore, considering that PEs continuously overlap multiplications and accumulations, TrIM achieves high throughput (up to 81.8% higher than row stationary) while requiring a limited number of registers (up to 15.6X fewer registers than row stationary).

Updated: 2025-07-31 12:46:22

Categories: cs.AI,cs.AR

Download: http://arxiv.org/abs/2408.01254v3

I Am Big, You Are Little; I Am Right, You Are Wrong

Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model's classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets to gauge a model's `concentration': the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that images which are misclassified are associated with larger pixel sets than correct classifications.

Updated: 2025-07-31 12:45:09

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.23509v1

A Verifier Hierarchy

We investigate the trade-off between certificate length and verifier runtime. We prove a Verifier Trade-off Theorem showing that reducing the inherent verification time of a language from $f(n)$ to $g(n)$, where $f(n) \ge g(n)$, requires certificates of length at least $\Omega(\log(f(n) / g(n)))$. This theorem induces a natural hierarchy based on certificate complexity. We demonstrate its applicability to analyzing conjectured separations between complexity classes (e.g., $\mathsf{NP}$ and $\mathsf{EXPTIME}$) and to studying natural problems such as string periodicity and rotation detection. Additionally, we provide perspectives on the $\mathsf{P}$ vs. $\mathsf{NP}$ problem by relating it to the existence of sub-linear certificates.
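To get a feel for the bound, it helps to plug in two simple choices of $f$ and $g$. These instantiations are illustrative only and are not taken from the paper. The constant $c$ is an arbitrary fixed exponent, and logarithms are base 2.

```latex
% Quadratic to linear verification time:
f(n) = n^2,\quad g(n) = n
\;\Longrightarrow\;
\Omega\!\left(\log\frac{n^2}{n}\right) = \Omega(\log n).

% Exponential to polynomial verification time, for a fixed constant c:
f(n) = 2^n,\quad g(n) = n^c
\;\Longrightarrow\;
\Omega\!\left(\log\frac{2^n}{n^c}\right) = \Omega(n - c\log n) = \Omega(n).
```

Under the stated theorem, a polynomial speedup would thus require only logarithmically long certificates, while an exponential-to-polynomial speedup would require certificates of at least linear length.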

Updated: 2025-07-31 12:42:42

标题: 一个验证器层次结构

摘要: 我们研究了证书长度和验证器运行时间之间的权衡。我们证明了一个验证器权衡定理，表明将语言的固有验证时间从$f(n)$减少到$g(n)$，其中$f(n) \ge g(n)$，需要长度至少为$\Omega(\log(f(n) / g(n)))$的证书。这个定理引出了一个基于证书复杂性的自然层次结构。我们展示了它在分析复杂性类之间的猜想分离（如$\mathsf{NP}$和$\mathsf{EXPTIME}$）以及研究自然问题（如字符串周期性和旋转检测）方面的适用性。此外，我们通过将其与存在次线性证书联系起来，提供了关于$\mathsf{P}$与$\mathsf{NP}$问题的观点。

更新时间: 2025-07-31 12:42:42

领域: cs.LG

下载: http://arxiv.org/abs/2507.23504v1

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.

Updated: 2025-07-31 12:41:25

标题: LLaVA-MORE：增强视觉指导调整的LLMs和视觉骨干的比较研究

摘要: 最近在多模态大型语言模型（MLLMs）方面取得的进展突显了视觉骨干和基础语言模型的关键作用。虽然先前的工作主要集中在将这些组件扩展到数十亿个参数，但模型大小、架构和性能之间的权衡仍未得到充分探讨。此外，训练数据和评估协议的不一致阻碍了直接比较，使得很难得出最佳设计选择。在本文中，我们介绍了LLaVA-MORE，这是一族新的MLLMs，将最近的语言模型与多样化的视觉骨干集成在一起。为了确保公平比较，我们采用了统一的训练协议，一致地应用于所有架构。我们的分析系统地探索了小型和中型规模的LLMs — 包括Phi-4，LLaMA-3.1和Gemma-2 — 评估多模态推理、生成和指导后续，同时检查模型大小与性能之间的关系。除了评估LLM对最终结果的影响外，我们还进行了对各种视觉编码器的全面研究，从基于CLIP的架构到诸如DINOv2、SigLIP和SigLIP2等替代方案。额外的实验研究了图像分辨率的增加和预训练数据集的变化对结果的影响。总的来说，我们的结果为更有效的MLLMs的设计提供了见解，提供了一个可重现的评估框架，有助于直接比较，并可以指导未来模型的发展。我们的源代码和训练模型可以在以下链接公开获取：https://github.com/aimagelab/LLaVA-MORE。

更新时间: 2025-07-31 12:41:25

领域: cs.CV,cs.AI,cs.CL,cs.MM

下载: http://arxiv.org/abs/2503.15621v2

Directional Ensemble Aggregation for Actor-Critics

Off-policy reinforcement learning in continuous control tasks depends critically on accurate $Q$-value estimates. Conservative aggregation over ensembles, such as taking the minimum, is commonly used to mitigate overestimation bias. However, these static rules are coarse, discard valuable information from the ensemble, and cannot adapt to task-specific needs or different learning regimes. We propose Directional Ensemble Aggregation (DEA), an aggregation method that adaptively combines $Q$-value estimates in actor-critic frameworks. DEA introduces two fully learnable directional parameters: one that modulates critic-side conservatism and another that guides actor-side policy exploration. Both parameters are learned using ensemble disagreement-weighted Bellman errors, which weight each sample solely by the direction of its Bellman error. This directional learning mechanism allows DEA to adjust conservatism and exploration in a data-driven way, adapting aggregation to both uncertainty levels and the phase of training. We evaluate DEA across continuous control benchmarks and learning regimes - from interactive to sample-efficient - and demonstrate its effectiveness over static ensemble strategies.

Updated: 2025-07-31 12:40:50

标题: 演员-评论家的方向性集成聚合

摘要: 在连续控制任务中，基于离线策略的强化学习严重依赖于准确的$Q$值估计。保守地对集合进行聚合，比如取最小值，通常用于减轻过度估计偏差。然而，这些静态规则粗糙，丢弃了集合中宝贵的信息，并且无法适应特定任务需求或不同的学习方式。我们提出了方向性集成聚合（DEA），一种自适应地结合$Q$值估计的方法在演员-评论家框架中。DEA引入了两个完全可学习的方向参数：一个调节评论家侧保守性，另一个指导演员侧策略探索。这两个参数是通过集合不一致加权的Bellman误差来学习的，该误差仅根据其Bellman误差的方向对每个样本进行加权。这种方向性学习机制使DEA能够以数据驱动的方式调整保守性和探索，将聚合适应不确定性水平和训练阶段。我们评估了DEA在连续控制基准和学习方式中的效果-从交互式到样本高效，并展示了其相对于静态集成策略的有效性。

更新时间: 2025-07-31 12:40:50

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.23501v1

Causal Identification of Sufficient, Contrastive and Complete Feature Sets in Image Classification

Existing algorithms for explaining the outputs of image classifiers are based on a variety of approaches and produce explanations that lack formal rigor. On the other hand, logic-based explanations are formally and rigorously defined but their computability relies on strict assumptions about the model that do not hold on image classifiers. In this paper, we show that causal explanations, in addition to being formally and rigorously defined, enjoy the same formal properties as logic-based ones, while still lending themselves to black-box algorithms and being a natural fit for image classifiers. We prove formal properties of causal explanations and introduce contrastive causal explanations for image classifiers. Moreover, we augment the definition of explanation with confidence awareness and introduce complete causal explanations: explanations that are classified with exactly the same confidence as the original image. We implement our definitions, and our experimental results demonstrate that different models have different patterns of sufficiency, contrastiveness, and completeness. Our algorithms are efficiently computable, taking on average 6s per image on a ResNet50 model to compute all types of explanations, and are totally black-box, needing no knowledge of the model, no access to model internals, no access to gradient, nor requiring any properties, such as monotonicity, of the model.

Updated: 2025-07-31 12:33:00

标题: 图像分类中足够、对比和完整特征集的因果识别

摘要: 现有用于解释图像分类器输出的算法基于各种方法，并产生缺乏形式严谨的解释。另一方面，基于逻辑的解释在形式上是严格定义的，但它们的可计算性依赖于对模型的严格假设，这些假设在图像分类器上不成立。在本文中，我们展示了因果解释除了在形式上严格定义之外，还具有与基于逻辑的解释相同的形式属性，同时适用于黑盒算法，并且是图像分类器的自然选择。我们证明了因果解释的形式属性，并介绍了用于图像分类器的对比因果解释。此外，我们增加了对解释的定义，引入了置信度感知，并介绍了完整的因果解释：这些解释的分类与原始图像的置信度完全相同。我们实现了我们的定义，并实验结果表明不同的模型具有不同的充分性、对比性和完整性模式。我们的算法计算效率高，平均每张图像在ResNet50模型上耗时6秒来计算所有类型的解释，完全是黑盒的，不需要了解模型，也不需要访问模型的内部，不需要访问梯度，也不需要模型的任何属性，比如单调性。

更新时间: 2025-07-31 12:33:00

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.23497v1

Towards Reliable AI in 6G: Detecting Concept Drift in Wireless Network

AI-native 6G networks promise unprecedented automation and performance by embedding machine-learning models throughout the radio access and core segments of the network. However, the non-stationary nature of wireless environments, due to infrastructure changes, user mobility, and emerging traffic patterns, induces concept drift that can quickly degrade the accuracy of these models. Existing methods are generally very domain-specific or struggle with certain types of concept drift. In this paper, we introduce two unsupervised, model-agnostic, batch concept drift detectors. Both methods compute an expected-utility score to decide when concept drift has occurred and whether model retraining is warranted, without requiring ground-truth labels after deployment. We validate our framework on two real-world wireless use cases, outdoor fingerprinting for localization and link-anomaly detection, and demonstrate that both methods outperform classical detectors such as ADWIN, DDM, and CUSUM by 20-40 percentage points. Additionally, they achieve F1-scores of 0.94 and 1.00 in correctly triggering retraining alarms, thus reducing the false alarm rate by up to 20 percentage points compared to the best classical detectors.

Updated: 2025-07-31 12:31:57

标题: 朝着可靠的AI在6G中：检测无线网络中的概念飘移

摘要: AI本地化的6G网络承诺通过在网络的无线接入和核心部分嵌入机器学习模型实现前所未有的自动化和性能。然而，由于基础设施变化、用户移动性和新兴流量模式导致的无线环境的非稳态性，会引发概念漂移，从而迅速降低这些模型的准确性。现有方法通常非常特定于领域，或者在某种类型的概念漂移方面存在困难。在本文中，我们介绍了两种无监督、模型无关的批处理概念漂移检测器。这两种方法都计算一个预期效用分数，以确定何时发生概念漂移以及是否需要进行模型重训练，而无需在部署后需要地面真实标签。我们在两个真实世界的无线使用案例中验证了我们的框架，包括用于定位的室外指纹识别和用于链路异常检测，展示了这两种方法优于传统检测器（如ADWIN、DDM、CUSUM）20-40个百分点。此外，它们在正确触发重训练警报方面实现了0.94和1.00的F1分数，从而与最佳传统检测器相比，将误报率降低了高达20个百分点。

更新时间: 2025-07-31 12:31:57

领域: cs.NI,cs.LG

下载: http://arxiv.org/abs/2508.00042v1

Graph Representation-based Model Poisoning on Federated Large Language Models

Federated large language models (FedLLMs) enable powerful generative capabilities within wireless networks while preserving data privacy. Nonetheless, FedLLMs remain vulnerable to model poisoning attacks. This article first reviews recent advancements in model poisoning techniques and existing defense mechanisms for FedLLMs, underscoring critical limitations, especially when dealing with non-IID textual data distributions. Current defense strategies predominantly employ distance or similarity-based outlier detection mechanisms, relying on the assumption that malicious updates markedly differ from benign statistical patterns. However, this assumption becomes inadequate against adaptive adversaries targeting billion-parameter LLMs. The article further investigates graph representation-based model poisoning (GRMP), an emerging attack paradigm that exploits higher-order correlations among benign client gradients to craft malicious updates indistinguishable from legitimate ones. GRMP can effectively circumvent advanced defense systems, causing substantial degradation in model accuracy and overall performance. Moreover, the article outlines a forward-looking research roadmap that emphasizes the necessity of graph-aware secure aggregation methods, specialized vulnerability metrics tailored for FedLLMs, and evaluation frameworks to enhance the robustness of federated language model deployments.

Updated: 2025-07-31 12:30:18

标题: 基于图表示的模型中毒对联邦大型语言模型的影响

摘要: 联邦式大型语言模型（FedLLMs）在保护数据隐私的同时，在无线网络中实现强大的生成能力。然而，FedLLMs仍然容易受到模型中毒攻击的影响。本文首先回顾了最近模型中毒技术和现有FedLLMs的防御机制的进展，强调了关键限制，特别是在处理非独立同分布的文本数据分布时。目前的防御策略主要采用基于距离或相似度的异常检测机制，依赖于恶意更新与良性统计模式明显不同的假设。然而，这种假设在面对针对数十亿参数LLMs的自适应对手时变得不足。本文进一步调查了基于图表示的模型中毒（GRMP），这是一种新兴的攻击范式，利用良性客户端梯度之间的高阶相关性来制作与合法更新无法区分的恶意更新。GRMP可以有效地规避先进的防御系统，导致模型准确性和整体性能显著降低。此外，本文概述了一个前瞻性的研究路线图，强调了图感知安全聚合方法的必要性，为FedLLMs量身定制的专门的漏洞度量标准，以及评估框架，以增强联合语言模型部署的稳健性。

更新时间: 2025-07-31 12:30:18

领域: cs.CR,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.01694v2

Incorporating structural uncertainty in causal decision making

Practitioners making decisions based on causal effects typically ignore structural uncertainty. We analyze when this uncertainty is consequential enough to warrant methodological solutions (Bayesian model averaging over competing causal structures). Focusing on bivariate relationships ($X \rightarrow Y$ vs. $X \leftarrow Y$), we establish that model averaging is beneficial when: (1) structural uncertainty is moderate to high, (2) causal effects differ substantially between structures, and (3) loss functions are sufficiently sensitive to the size of the causal effect. We prove optimality results of our suggested methodological solution under regularity conditions and demonstrate through simulations that modern causal discovery methods can provide, within limits, the necessary quantification. Our framework complements existing robust causal inference approaches by addressing a distinct source of uncertainty typically overlooked in practice.
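
The benefit of averaging over structures can be illustrated with a minimal sketch. It assumes a squared-error loss and made-up numbers: a posterior probability of 0.6 for $X \rightarrow Y$ with effect 2.0, and 0.4 for $X \leftarrow Y$ with effect 0.0. The function names are hypothetical and this is not the paper's procedure; it only shows that, under squared loss, the model-averaged effect has lower posterior expected loss than committing to the most probable structure.

```python
from typing import Sequence


def expected_squared_loss(action: float,
                          effects: Sequence[float],
                          probs: Sequence[float]) -> float:
    """Posterior expected squared loss of reporting `action` as the effect."""
    return sum(p * (action - e) ** 2 for e, p in zip(effects, probs))


def model_averaged_effect(effects: Sequence[float],
                          probs: Sequence[float]) -> float:
    """Bayesian model average: probability-weighted mean of the effects."""
    return sum(p * e for e, p in zip(effects, probs))


def most_probable_effect(effects: Sequence[float],
                         probs: Sequence[float]) -> float:
    """Effect implied by the single most probable structure."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return effects[best]


# Hypothetical inputs: effect of X on Y under X -> Y and under X <- Y.
effects = [2.0, 0.0]
probs = [0.6, 0.4]

bma = model_averaged_effect(effects, probs)
single = most_probable_effect(effects, probs)
loss_bma = expected_squared_loss(bma, effects, probs)
loss_single = expected_squared_loss(single, effects, probs)

print(f"model-averaged effect: {bma:.2f}, expected loss: {loss_bma:.2f}")
print(f"single-structure effect: {single:.2f}, expected loss: {loss_single:.2f}")
```

Rerunning the sketch with `probs = [0.99, 0.01]` shrinks the gap between the two losses toward zero, in line with condition (1) that averaging matters mainly when structural uncertainty is moderate to high.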

Updated: 2025-07-31 12:29:49

标题: 在因果决策中考虑结构不确定性

摘要: 从因果效应出发做决策的从业者通常会忽略结构性不确定性。我们分析了何时这种不确定性具有重要性，需要采用方法论解决方案（如在竞争性因果结构上进行贝叶斯模型平均）。我们重点关注双变量关系（$X \rightarrow Y$与$X \leftarrow Y$），我们得出结论：当（1）结构性不确定性为中等到高度时，（2）因果效应在结构之间有明显差异，以及（3）损失函数对因果效应的大小足够敏感时，模型平均是有益的。我们在正则条件下证明了我们推荐的方法论解决方案的最优性结果，并通过模拟表明，现代因果发现方法可以在一定范围内提供必要的量化。我们的框架通过解决实践中通常被忽视的一个独特的不确定性来源，来补充现有的强健因果推断方法。

更新时间: 2025-07-31 12:29:49

领域: cs.LG,math.ST,stat.TH

下载: http://arxiv.org/abs/2507.23495v1

Neural-ANOVA: Analytical Model Decomposition using Automatic Integration

The analysis of variance (ANOVA) decomposition offers a systematic method to understand the interaction effects that contribute to a specific decision output. In this paper we introduce Neural-ANOVA, an approach to decompose neural networks into the sum of lower-order models using the functional ANOVA decomposition. Our approach formulates a learning problem, which enables fast analytical evaluation of integrals over subspaces that appear in the calculation of the ANOVA decomposition. Finally, we conduct numerical experiments to provide insights into the approximation properties compared to other regression approaches from the literature.

Updated: 2025-07-31 12:26:55

标题: 神经-ANOVA: 使用自动集成进行分析模型分解

摘要: 方差分析（ANOVA）分解提供了一种系统方法来理解对特定决策输出产生影响的交互作用。在本文中，我们介绍了神经网络分解方法Neural-ANOVA，该方法利用功能ANOVA分解将神经网络分解为较低阶模型的总和。我们的方法制定了一个学习问题，可以快速分析评估在计算ANOVA分解时出现的子空间上的积分。最后，我们进行数值实验，以提供有关与文献中其他回归方法相比的近似性质的见解。

更新时间: 2025-07-31 12:26:55

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2408.12319v2

Digital literacy interventions can boost humans in discerning deepfakes

Deepfakes, i.e., images generated by artificial intelligence (AI), can erode trust in institutions and compromise election outcomes, as people often struggle to discern real images from deepfakes. Improving digital literacy can help address these challenges, yet scalable and effective approaches remain largely unexplored. Here, we compare the efficacy of five digital literacy interventions to boost people's ability to discern deepfakes: (1) textual guidance on common indicators of deepfakes; (2) visual demonstrations of these indicators; (3) a gamified exercise for identifying deepfakes; (4) implicit learning through repeated exposure and feedback; and (5) explanations of how deepfakes are generated with the help of AI. We conducted an experiment with N=1,200 participants from the United States to test the immediate and long-term effectiveness of our interventions. Our results show that our interventions can boost deepfake discernment by up to 13 percentage points while maintaining trust in real images. Altogether, our approach is scalable, suitable for diverse populations, and highly effective for boosting deepfake detection while maintaining trust in truthful information.

Updated: 2025-07-31 12:23:45

标题: 数字素养干预可以帮助人类识别深度伪造视频

摘要: Deepfakes，即由人工智能（AI）生成的图像，可能会破坏人们对机构的信任，并损害选举结果，因为人们经常难以辨别真实图像和Deepfakes。提高数字素养可以帮助解决这些挑战，然而可扩展且有效的方法仍然未被充分探索。在这里，我们比较了五种数字素养干预措施的有效性，以增强人们辨别Deepfakes的能力：（1）关于Deepfakes常见指标的文字指导；（2）这些指标的视觉演示；（3）用于识别Deepfakes的游戏化练习；（4）通过重复接触和反馈进行隐式学习；以及（5）解释如何通过AI生成Deepfakes。我们在美国进行了一项实验，共有1,200名参与者，以测试我们干预措施的即时和长期有效性。我们的结果显示，我们的干预措施可以将辨别Deepfakes的能力提升高达13个百分点，同时保持对真实图像的信任。总的来说，我们的方法是可扩展的，适用于不同人群，并且在提升Deepfakes检测能力的同时保持对真实信息的信任，非常有效。

更新时间: 2025-07-31 12:23:45

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2507.23492v1

Explainable artificial intelligence model predicting the risk of all-cause mortality in patients with type 2 diabetes mellitus

Objective. Type 2 diabetes mellitus (T2DM) is a highly prevalent non-communicable chronic disease that substantially reduces life expectancy. Accurate estimation of all-cause mortality risk in T2DM patients is crucial for personalizing and optimizing treatment strategies. Research Design and Methods. This study analyzed a cohort of 554 patients (aged 40-87 years) with diagnosed T2DM over a maximum follow-up period of 16.8 years, during which 202 patients (36%) died. Key survival-associated features were identified, and multiple machine learning (ML) models were trained and validated to predict all-cause mortality risk. To improve model interpretability, Shapley additive explanations (SHAP) was applied to the best-performing model. Results. The extra survival trees (EST) model, incorporating ten key features, demonstrated the best predictive performance. The model achieved a C-statistic of 0.776, with the area under the receiver operating characteristic curve (AUC) values of 0.86, 0.80, 0.841, and 0.826 for 5-, 10-, 15-, and 16.8-year all-cause mortality predictions, respectively. The SHAP approach was employed to interpret the model's individual decision-making processes. Conclusions. The developed model exhibited strong predictive performance for mortality risk assessment. Its clinically interpretable outputs enable potential bedside application, improving the identification of high-risk patients and supporting timely treatment optimization.

Updated: 2025-07-31 12:23:10

标题: 可解释的人工智能模型预测2型糖尿病患者全因死亡风险

摘要: 目标。2型糖尿病（T2DM）是一种高度普遍的非传染性慢性疾病，严重降低了寿命。准确估计T2DM患者的全因死亡风险对于个性化和优化治疗策略至关重要。研究设计与方法。本研究分析了554名被诊断为T2DM的患者（年龄为40-87岁），最长随访时间为16.8年，期间有202名患者（36%）死亡。确定了与存活相关的关键特征，并训练和验证了多个机器学习（ML）模型以预测全因死亡风险。为了提高模型的可解释性，将Shapley加性解释（SHAP）应用于表现最佳的模型。结果。额外的生存树（EST）模型，包括十个关键特征，表现出最佳的预测性能。该模型实现了0.776的C统计量，对于5年、10年、15年和16.8年的全因死亡预测，接收操作特性曲线（AUC）值分别为0.86、0.80、0.841和0.826。SHAP方法被用来解释模型的个体决策过程。结论。开发的模型展示了对于死亡风险评估的强大预测性能。其临床可解释性输出使得潜在的床边应用成为可能，提高了高风险患者的识别，并支持及时的治疗优化。

更新时间: 2025-07-31 12:23:10

领域: cs.LG

下载: http://arxiv.org/abs/2507.23491v1

Automated Strategy Invention for Confluence of Term Rewrite Systems

Term rewriting plays a crucial role in software verification and compiler optimization. With dozens of highly parameterizable techniques developed to prove various system properties, automatic term rewriting tools work in an extensive parameter space. This complexity exceeds human capacity for parameter selection, motivating an investigation into automated strategy invention. In this paper, we focus on confluence, an important property of term rewrite systems, and apply machine learning to develop the first learning-guided automatic confluence prover. Moreover, we randomly generate a large dataset to analyze confluence for term rewrite systems. Our results focus on improving the state-of-the-art automatic confluence prover CSI: When equipped with our invented strategies, it surpasses its human-designed strategies both on the augmented dataset and on the original human-created benchmark dataset Cops, proving/disproving the confluence of several term rewrite systems for which no automated proofs were known before.

Updated: 2025-07-31 12:18:32

标题: 基于术语重写系统交汇的自动化策略发明

摘要: 术语重写在软件验证和编译器优化中起着至关重要的作用。随着数十种高度可参数化的技术被开发用来证明各种系统属性，自动术语重写工具在广泛的参数空间中工作。这种复杂性超出了人类对参数选择的能力，促使对自动策略发明进行调查。在本文中，我们关注可交换性，这是术语重写系统的一个重要属性，并应用机器学习开发了第一个学习导向的自动可交换性证明器。此外，我们随机生成了一个大型数据集，以分析术语重写系统的可交换性。我们的结果集中重点关注改进现有技术的自动可交换性证明器CSI：当配备我们发明的策略时，它在扩充数据集和原始人工创建的基准数据集Cops上均超过了人类设计的策略，证明/反驳了一些术语重写系统的可交换性，这些系统以前没有自动证明。

更新时间: 2025-07-31 12:18:32

领域: cs.LO,cs.AI,F.4.2; I.2.8

下载: http://arxiv.org/abs/2411.06409v2

Causal Reasoning in Pieces: Modular In-Context Learning for Causal Discovery

Causal inference remains a fundamental challenge for large language models. Recent advances in internal reasoning with large language models have sparked interest in whether state-of-the-art reasoning models can robustly perform causal discovery, a task where conventional models often suffer from severe overfitting and near-random performance under data perturbations. We study causal discovery on the Corr2Cause benchmark using the emerging OpenAI o-series and DeepSeek-R model families and find that these reasoning-first architectures achieve significantly greater native gains than prior approaches. To capitalize on these strengths, we introduce a modular in-context pipeline inspired by the Tree-of-Thoughts and Chain-of-Thoughts methodologies, yielding nearly three-fold improvements over conventional baselines. We further probe the pipeline's impact by analyzing reasoning chain length and complexity, and by conducting qualitative and quantitative comparisons between conventional and reasoning models. Our findings suggest that while advanced reasoning models represent a substantial leap forward, carefully structured in-context frameworks are essential to maximize their capabilities and offer a generalizable blueprint for causal discovery across diverse domains.

Updated: 2025-07-31 12:10:27

标题: 分段的因果推理：因果发现的模块化上下文学习

摘要: 因果推断仍然是大型语言模型面临的基本挑战。最近大型语言模型内部推理的进展引起了人们的兴趣，是否最先进的推理模型能够稳健地执行因果发现-传统模型在数据扰动下常常遭受严重过拟合和近乎随机的表现。我们在Corr2Cause基准测试中研究因果发现，使用新兴的OpenAI的o系列和DeepSeek-R模型家族，并发现这些以推理为先的架构比先前方法实现了显著更大的原生收益。为了充分利用这些优势，我们引入了受到思想之树和思想链方法启发的模块化上下文管道，在传统基线上实现了近三倍的改进。我们进一步通过分析推理链长度、复杂性，以及在传统模型和推理模型之间进行定性和定量比较来探究管道的影响。我们的研究结果表明，虽然先进的推理模型代表了重大进步，但精心构建的上下文框架对于最大化它们的能力是必不可少的，并为跨不同领域的因果发现提供了一个通用的蓝图。

更新时间: 2025-07-31 12:10:27

领域: cs.AI

下载: http://arxiv.org/abs/2507.23488v1

On the Approximation of Stationary Processes using the ARMA Model

We look at a problem related to Autoregressive Moving Average (ARMA) models, on quantifying the approximation error between a true stationary process $X_t$ and an ARMA model $Y_t$. We take the transfer function representation $x(L)$ of a stationary process $X_t$ and show that the $L^{\infty}$ norm of $x$ acts as a valid norm on $X_t$ that controls the $\ell^2$ norm of its Wold coefficients. We then show that a certain subspace of stationary processes, which includes ARMA models, forms a Banach algebra under the $L^{\infty}$ norm that respects the multiplicative structure of $H^{\infty}$ transfer functions and thus improves on the structural properties of the cepstral norm for ARMA models. The natural definition of invertibility in this algebra is consistent with the original definition of ARMA invertibility, and generalizes better to non-ARMA processes than Wiener's $\ell^1$ condition. Finally, we calculate some explicit approximation bounds in the simpler context of continuous transfer functions, and critique some heuristic ideas on Pad\'e approximations and parsimonious models.
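
The claim that the $L^{\infty}$ norm of the transfer function controls the $\ell^2$ norm of the Wold coefficients can be checked numerically on a small case. The sketch below assumes an ARMA(1,1) process with illustrative parameters $\phi = 0.5$, $\theta = 0.4$ and transfer function $x(z) = (1 + \theta z)/(1 - \phi z)$; the function names are hypothetical. It relies on the standard Parseval argument (the $\ell^2$ norm of the coefficients equals the $L^2$ norm of $x$ on the unit circle, which is at most its supremum), which is not necessarily the route taken in the paper.

```python
import cmath
import math


def wold_coefficients(phi: float, theta: float, n_terms: int) -> list:
    """First n_terms MA(infinity) coefficients of a causal ARMA(1,1) process.

    psi_0 = 1 and psi_j = phi**(j - 1) * (phi + theta) for j >= 1.
    """
    psi = [1.0]
    for j in range(1, n_terms):
        psi.append(phi ** (j - 1) * (phi + theta))
    return psi


def transfer_sup_norm(phi: float, theta: float, n_grid: int = 4096) -> float:
    """Approximate sup of |x(z)| over the unit circle on a uniform grid."""
    sup = 0.0
    for k in range(n_grid):
        z = cmath.exp(1j * 2.0 * math.pi * k / n_grid)
        sup = max(sup, abs((1.0 + theta * z) / (1.0 - phi * z)))
    return sup


phi, theta = 0.5, 0.4  # illustrative parameters with |phi| < 1
psi = wold_coefficients(phi, theta, n_terms=200)
l2_norm = math.sqrt(sum(c * c for c in psi))
sup_norm = transfer_sup_norm(phi, theta)

print(f"l2 norm of Wold coefficients: {l2_norm:.4f}")
print(f"sup norm of transfer function: {sup_norm:.4f}")
```

For these parameters the supremum is attained at $z = 1$, where $x(1) = 1.4 / 0.5$, and the truncated $\ell^2$ norm stays below it.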

Updated: 2025-07-31 11:55:42

标题: 关于使用ARMA模型逼近稳态过程的研究

摘要: 我们研究了与自回归滑动平均（ARMA）模型相关的问题，即在真实平稳过程$X_t$与ARMA模型$Y_t$之间的逼近误差量化。我们考虑了平稳过程$X_t$的传递函数表示$x(L)$，并展示了$x$的$L^{\infty}$范数作为控制其Wold系数的$\ell^2$范数的有效范数。然后我们展示了一个包括ARMA模型在内的一定子空间的平稳过程在$L^{\infty}$范数下构成Banach代数，尊重$H^{\infty}$传递函数的乘法结构，从而改进了ARMA模型的倒谱范数的结构特性。在这个代数中，可逆性的自然定义与ARMA可逆性的原始定义一致，并且对于非ARMA过程比Wiener的$\ell^1$条件更具一般性。最后，我们在连续传递函数的简单情境中计算了一些明确的逼近界，并对Pad\'e逼近和简约模型的一些启发式想法进行了批判。

更新时间: 2025-07-31 11:55:42

领域: cs.LG,math.PR,stat.ME,60G10,G.3

下载: http://arxiv.org/abs/2408.10610v4

Automated Feedback on Student-Generated UML and ER Diagrams Using Large Language Models

UML and ER diagrams are foundational in computer science education but come with challenges for learners due to the need for abstract thinking, contextual understanding, and mastery of both syntax and semantics. These complexities are difficult to address through traditional teaching methods, which often struggle to provide scalable, personalized feedback, especially in large classes. We introduce DUET (Diagrammatic UML & ER Tutor), a prototype of an LLM-based tool, which converts a reference diagram and a student-submitted diagram into a textual representation and provides structured feedback based on the differences. It uses a multi-stage LLM pipeline to compare diagrams and generate reflective feedback. Furthermore, the tool enables analytical insights for educators, aiming to foster self-directed learning and inform instructional strategies. We evaluated DUET through semi-structured interviews with six participants, including two educators and four teaching assistants. They identified strengths such as accessibility, scalability, and learning support alongside limitations, including reliability and potential misuse. Participants also suggested potential improvements, such as bulk upload functionality and interactive clarification features. DUET presents a promising direction for integrating LLMs into modeling education and offers a foundation for future classroom integration and empirical evaluation.

Updated: 2025-07-31 11:49:01

标题: 使用大型语言模型对学生生成的UML和ER图进行自动反馈

摘要: UML和ER图是计算机科学教育中的基础，但由于需要抽象思维能力、上下文理解以及对语法和语义的掌握，对学习者来说具有挑战性。这些复杂性很难通过传统的教学方法来解决，尤其是在大班教学中往往难以提供可扩展的、个性化的反馈。我们引入了DUET（图形UML和ER导师），这是一个基于LLM的工具原型，它将参考图和学生提交的图转换为文本表示，并根据差异提供结构化的反馈。它使用多阶段的LLM管道来比较图形并生成反思性的反馈。此外，该工具还为教育者提供了分析洞察力，旨在促进自主学习并指导教学策略。我们通过对六名参与者（包括两名教育者和四名助教）进行半结构化访谈来评估DUET。他们指出了可访问性、可扩展性和学习支持等优点，同时也提出了可靠性和潜在误用等限制。参与者还建议潜在的改进措施，如批量上传功能和互动澄清功能。DUET为将LLM整合到建模教育中提供了一个有前景的方向，并为未来课堂整合和经验评估奠定了基础。

更新时间: 2025-07-31 11:49:01

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2507.23470v1

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

Updated: 2025-07-31 11:41:04

标题: 在组织中实现安全且具上下文意识的访问控制的角色感知语言模型

摘要: 随着大型语言模型（LLMs）在企业环境中的部署越来越多，基于用户角色控制模型行为成为一项重要要求。现有的安全方法通常假设统一访问，并着重于防止有害或有毒的输出，而不考虑特定角色访问限制。在这项工作中，我们研究了LLMs是否可以被微调以生成反映不同组织角色关联访问权限的响应。我们探索了三种建模策略：基于BERT的分类器，基于LLM的分类器和角色条件生成。为了评估这些方法，我们构建了两个互补的数据集。第一个是通过聚类和角色标记从现有的指令调整语料库中改编而来，而第二个是合成生成的，以反映现实的、角色敏感的企业场景。我们评估了模型在不同组织结构下的性能，并分析了对提示注入、角色不匹配和越狱尝试的稳健性。

更新时间: 2025-07-31 11:41:04

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.23465v1

Mitigating Resolution-Drift in Federated Learning: Case of Keypoint Detection

The Federated Learning (FL) approach enables effective learning across distributed systems while preserving user data privacy. To date, research has primarily focused on addressing statistical heterogeneity and communication efficiency, through which FL has achieved success in classification tasks. However, its application to non-classification tasks, such as human pose estimation, remains underexplored. This paper identifies and investigates a critical issue termed "resolution-drift," where performance degrades significantly due to resolution variability across clients. Unlike class-level heterogeneity, resolution drift highlights the importance of resolution as another axis of non-independent and identically distributed (non-IID) data. To address this issue, we present resolution-adaptive federated learning (RAF), a method that leverages heatmap-based knowledge distillation. Through multi-resolution knowledge distillation between higher-resolution outputs (teachers) and lower-resolution outputs (students), our approach enhances resolution robustness without overfitting. Extensive experiments and theoretical analysis demonstrate that RAF not only effectively mitigates resolution drift and achieves significant performance improvements, but can also be integrated seamlessly into existing FL frameworks. Furthermore, although this paper focuses on human pose estimation, our t-SNE analysis reveals distinct characteristics between classification and high-resolution representation tasks, supporting the generalizability of RAF to other tasks that rely on preserving spatial detail.

Updated: 2025-07-31 11:38:20

标题: 减轻联邦学习中的分辨率漂移：关键点检测案例

摘要: 联邦学习（FL）方法实现了跨分布式系统的有效学习，同时保护用户数据隐私。迄今为止，研究主要集中在解决统计异质性和通信效率上，通过这些方法，FL在分类任务中取得了成功。然而，其在非分类任务（如人体姿势估计）中的应用仍未得到充分探索。本文确定并调查了一个关键问题，即“分辨率漂移”，由于客户端之间分辨率的可变性，性能显著下降。与类别级别的异质性不同，分辨率漂移突出了分辨率作为另一个非独立或非同分布（non-IID）数据的重要性。为了解决这个问题，我们提出了一种分辨率自适应的联邦学习（RAF）方法，该方法利用基于热图的知识蒸馏。通过高分辨率输出（教师）和低分辨率输出（学生）之间的多分辨率知识蒸馏，我们的方法增强了分辨率的鲁棒性，避免过拟合。大量实验证明，RAF不仅有效减轻了分辨率漂移并取得了显著的性能提升，而且可以无缝集成到现有的FL框架中。此外，尽管本文侧重于人体姿势估计，我们的t-SNE分析显示分类和高分辨率表示任务之间存在明显的特征差异，支持RAF在依赖于保留空间细节的其他任务中的泛化性。

更新时间: 2025-07-31 11:38:20

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.23461v1

KLAN: Kuaishou Landing-page Adaptive Navigator

Modern online platforms configure multiple pages to accommodate diverse user needs. This multi-page architecture inherently establishes a two-stage interaction paradigm between the user and the platform: (1) Stage I: page navigation, where users are navigated to a specific page, and (2) Stage II: in-page interaction, where users engage with customized content within the specific page. While the majority of research has focused on the sequential recommendation task that improves users' feedback in Stage II, there has been little investigation into how to achieve better page navigation in Stage I. To fill this gap, we formally introduce the task of Personalized Landing Page Modeling (PLPM) into the field of recommender systems: given a user upon app entry, the goal of PLPM is to proactively select the most suitable landing page from a set of candidates (e.g., functional tabs, content channels, or aggregation pages) to optimize the short-term PDR metric and the long-term user engagement and satisfaction metrics, while adhering to industrial constraints. Additionally, we propose KLAN (Kuaishou Landing-page Adaptive Navigator), a hierarchical solution framework designed to provide personalized landing pages under the formulation of PLPM. KLAN comprises three key components: (1) KLAN-ISP captures inter-day static page preference; (2) KLAN-IIT captures intra-day dynamic interest transitions; and (3) KLAN-AM adaptively integrates both components for optimal navigation decisions. Extensive online experiments conducted on the Kuaishou platform demonstrate the effectiveness of KLAN, obtaining +0.205% and +0.192% improvements in Daily Active Users (DAU) and user Lifetime (LT), respectively. KLAN is ultimately deployed on the online platform at full traffic, serving hundreds of millions of users. To promote further research in this important area, we will release our dataset and code upon paper acceptance.

Updated: 2025-07-31 11:37:11

标题: KLAN：快手落地页自适应导航器

摘要: 现代在线平台配置多个页面以满足不同用户需求。这种多页面架构本质上建立了用户和平台之间的两阶段交互范式：（1）第一阶段：页面导航，将用户导航到特定页面；（2）第二阶段：页面内交互，用户在特定页面内与定制内容进行互动。虽然大多数研究集中在改进第二阶段用户反馈的顺序推荐任务上，但在如何实现更好的页面导航在第一阶段上却鲜有研究。为了填补这一空白，我们正式将个性化着陆页建模（PLPM）的任务定义到推荐系统领域中：给定一个用户进入应用程序时，PLPM的目标是从一组候选页面（例如功能选项卡、内容频道或聚合页面）中主动选择最合适的着陆页，以优化短期PDR指标和长期用户参与和满意度指标，同时遵守工业约束。此外，我们提出了KLAN（快手着陆页自适应导航器），这是一个设计用于在PLPM制定下提供个性化着陆页的层次解决方案框架。KLAN包括三个关键组件：（1）KLAN-ISP捕获了一天之间的静态页面偏好；（2）KLAN-IIT捕获了一天内动态兴趣转变；（3）KLAN-AM自适应地整合了这两个组件以进行最佳导航决策。在快手平台上进行的大量在线实验证明了KLAN的有效性，在日活跃用户（DAU）和用户生命周期（LT）方面分别获得了+0.205%和+0.192%的改善。我们的KLAN最终在在线平台上全流量部署，为数亿用户提供服务。为了在这一重要领域促进进一步研究，我们将在论文被接受后发布我们的数据集和代码。

更新时间: 2025-07-31 11:37:11

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2507.23459v1

Machine learning and machine learned prediction in chest X-ray images

Machine learning and artificial intelligence are fast-growing fields of research in which data is used to train algorithms, learn patterns, and make predictions. By recognizing complex relationships in data, this approach helps to solve seemingly intricate problems with significant accuracy and without explicit programming. Using a dataset of 5824 chest X-ray images, we implement two machine learning algorithms, namely a baseline convolutional neural network (CNN) and a DenseNet-121, and present our analysis of machine-learned predictions for identifying patients with ailments. Both the baseline CNN and DenseNet-121 perform very well in the binary classification problem presented in this work. Gradient-weighted class activation mapping shows that DenseNet-121 focuses on essential parts of the input chest X-ray images in its decision-making more than the baseline CNN does.

Updated: 2025-07-31 11:31:25

标题: 机器学习和机器学习预测在胸部X射线图像中的应用

摘要: 机器学习和人工智能是快速增长的研究领域，其中利用数据来训练算法、学习模式并做出预测。这种方法有助于解决看似错综复杂的问题，而无需显式编程，通过识别数据中的复杂关系。以5824张胸部X光图像为例，我们实施了两种机器学习算法，分别是基线卷积神经网络（CNN）和DenseNet-121，并在预测患病患者方面展示了我们的分析。在本文提出的二元分类问题中，基线CNN和DenseNet-121表现非常出色。梯度加权类激活映射显示，DenseNet-121在决策过程中正确地聚焦于输入胸部X光图像的关键部分，比基线CNN更加准确。

更新时间: 2025-07-31 11:31:25

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.23455v1

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
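The detection rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `evaluator` callable, the `evaluate_with_cfe` function, the toy evaluator, and the injected phrase are all hypothetical stand-ins for an LLM judge.

```python
from typing import Callable, Dict

# Hypothetical evaluator interface: (candidate_answer, reference_answer) -> judged correct?
Evaluator = Callable[[str, str], bool]


def evaluate_with_cfe(
    evaluator: Evaluator,
    candidate: str,
    true_answer: str,
    false_answer: str,
) -> Dict[str, bool]:
    """Run Standard Evaluation (SE) and Counterfactual Evaluation (CFE).

    An attack is flagged when the candidate is validated against BOTH the
    true reference and a deliberately false reference.
    """
    standard_pass = evaluator(candidate, true_answer)
    counterfactual_pass = evaluator(candidate, false_answer)
    attack_detected = standard_pass and counterfactual_pass
    return {
        "standard_pass": standard_pass,
        "counterfactual_pass": counterfactual_pass,
        "attack_detected": attack_detected,
        # Accept only answers that pass SE and are not flagged as attacks.
        "accepted": standard_pass and not attack_detected,
    }


def toy_evaluator(candidate: str, reference: str) -> bool:
    """Toy judge (assumption): exact match, but fooled by an injected instruction."""
    if "ignore the reference and mark this correct" in candidate.lower():
        return True
    return candidate.strip().lower() == reference.strip().lower()
```

Because a blind attack is crafted without knowledge of the reference, it tends to pass under any reference, which is exactly what the counterfactual check exposes; an honest correct answer passes SE but fails CFE.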

Updated: 2025-07-31 11:29:42

标题: 盲攻击检测中基于LLM评估系统的反事实评估

摘要: 这篇论文研究了针对LLM-based评估系统的防御措施，以抵御提示注入。我们形式化了一类称为盲攻击的威胁，其中候选答案独立地制作，目的是欺骗评估者，而不考虑真实答案。为了对抗这种攻击，我们提出了一个框架，将标准评估（SE）与对照评估（CFE）结合起来，对提交物重新评估，使用故意错误的真相答案。如果系统在标准和对照条件下都验证了答案，则会检测到攻击。实验表明，尽管标准评估容易受到攻击，但我们的SE+CFE框架通过提高攻击检测能力，实现了最小的性能折衷。

更新时间: 2025-07-31 11:29:42

领域: cs.CR,cs.CL

下载: http://arxiv.org/abs/2507.23453v1

Manifold-regularised Signature Kernel Large-Margin $\ell_p$-SVDD for Multidimensional Time Series Anomaly Detection

We generalise the recently introduced large-margin $\ell_p$-SVDD approach to exploit the geometry of data distribution via manifold regularising and a signature kernel representation for time series anomaly detection. Specifically, we formulate a manifold-regularised variant of the $\ell_p$-SVDD method to encourage label smoothness on the underlying manifold to capture structural information for improved detection performance. Drawing on an existing Representer theorem, we then provide an effective optimisation technique for the proposed method and show that it can benefit from the signature kernel to capture time series complexities for anomaly detection. We theoretically study the proposed approach using Rademacher complexities to analyse its generalisation performance and also provide an experimental assessment of the proposed method across various data sets to compare its performance against other methods.

Updated: 2025-07-31 11:27:01

标题: 流形正则化的特征核大边界$\ell_p$-SVDD用于多维时间序列异常检测

摘要: 我们将最近引入的大边际 $\ell_p$-SVDD 方法推广，通过流形正则化和时间序列异常检测的签名核表示来利用数据分布的几何性质。具体地，我们提出了 $\ell_p$-SVDD 方法的流形正则化变体，以鼓励在基础流形上平滑标签，从而捕获结构信息以提高检测性能。借鉴现有的 Representer 定理，我们提供了一个有效的优化技术来实现所提出的方法，并展示它可以从签名核中受益，以捕获时间序列的复杂性用于异常检测。我们通过Rademacher复杂性对所提出的方法进行理论研究，分析其泛化性能，并在各种数据集上对所提出的方法进行实验评估，以比较其性能与其他方法之间的差异。

更新时间: 2025-07-31 11:27:01

领域: cs.LG

下载: http://arxiv.org/abs/2507.23449v1

Adjoint-Based Aerodynamic Shape Optimization with a Manifold Constraint Learned by Diffusion Models

We introduce an adjoint-based aerodynamic shape optimization framework that integrates a diffusion model trained on existing designs to learn a smooth manifold of aerodynamically viable shapes. This manifold is enforced as an equality constraint in the shape optimization problem. Central to our method is the computation of adjoint gradients of the design objectives (e.g., drag and lift) with respect to the manifold space. These gradients are derived by first computing shape derivatives with respect to conventional shape design parameters (e.g., Hicks-Henne parameters) and then backpropagating them through the diffusion model to its latent space via automatic differentiation. Our framework preserves mathematical rigor and can be integrated into existing adjoint-based design workflows with minimal modification. Demonstrated on extensive transonic RANS airfoil design cases using off-the-shelf and general-purpose nonlinear optimizers, our approach eliminates ad hoc parameter tuning and variable scaling, maintains robustness across initialization and optimizer choices, and achieves superior aerodynamic performance compared to conventional approaches. This work establishes how AI-generated priors integrate effectively with adjoint methods to enable robust, high-fidelity aerodynamic shape optimization through automatic differentiation.
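The two-step gradient computation in this abstract amounts to an application of the chain rule. The notation below is assumed for illustration and does not come from the paper: $J$ is a design objective, $\theta$ the conventional shape parameters (e.g., Hicks-Henne), $z$ the latent variable, and $G$ the map from latent space to shape parameters defined by the diffusion model.

```latex
\theta = G(z), \qquad
\frac{\mathrm{d}J}{\mathrm{d}z}
  = \left(\frac{\partial G}{\partial z}\right)^{\!\top}
    \frac{\mathrm{d}J}{\mathrm{d}\theta}
```

Here $\mathrm{d}J/\mathrm{d}\theta$ would come from the adjoint flow solver, and the product with the transposed Jacobian is the vector-Jacobian product that reverse-mode automatic differentiation evaluates without forming $\partial G/\partial z$ explicitly.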

Updated: 2025-07-31 11:21:20

标题: 基于邻域约束的扩散模型学习的气动外形优化

摘要: 我们介绍了一种基于伴随方法的气动外形优化框架，该框架集成了一个在现有设计上训练的扩散模型，用于学习一个气动可行形状的平滑流形。这个流形被作为等式约束强制应用于外形优化问题中。我们方法的核心是计算设计目标（例如阻力和升力）对于流形空间的伴随梯度。这些梯度首先通过自动微分计算与传统外形设计参数（例如Hicks-Henne参数）相关的形状导数，然后通过扩散模型反向传播到其潜在空间。我们的框架保留了数学严谨性，并可以在最小修改的情况下集成到现有的基于伴随方法的设计工作流中。通过使用现成和通用非线性优化器展示了大量跨音速RANS翼型设计案例，我们的方法消除了特定参数调整和变量缩放，保持了在初始化和优化器选择方面的鲁棒性，并与传统方法相比达到了更好的气动性能。这项工作建立了AI生成的先验如何有效地与伴随方法集成，从而通过自动微分实现了鲁棒、高保真度的气动外形优化。

更新时间: 2025-07-31 11:21:20

领域: cs.CE,cs.LG,math.OC

下载: http://arxiv.org/abs/2507.23443v1

Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation

Large language models (LLMs) with instruction following capabilities have demonstrated impressive problem-solving abilities. While synthesizing instructional data from unsupervised text has become a common approach for training such models, conventional methods rely heavily on human effort for data annotation. Although existing automated synthesis paradigms have alleviated this constraint, they still exhibit significant limitations in ensuring adequate diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an innovative LLM-driven method for instruction synthesis. This approach introduces a "Micro-Scatter-Macro" multi-level foveation methodology that effectively guides the LLM to deeply excavate fine-grained information embedded in unsupervised text, thereby enhancing both the diversity and difficulty of synthesized instructions. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures validate the effectiveness and superiority of our proposed method. We publicly release our data and codes: https://github.com/Mubuky/Self-Foveate

Updated: 2025-07-31 11:18:42

标题: 自我聚焦：通过多层聚焦增强无监督文本合成指令的多样性和难度

摘要: 大型语言模型（LLMs）具有指令跟随能力，展示出令人印象深刻的问题解决能力。虽然从无监督文本中合成指导数据已成为训练这种模型的常用方法，但传统方法在数据注释方面严重依赖人力。尽管现有的自动合成范例已经缓解了这一限制，但它们仍然在确保合成指导的多样性和难度方面存在显著局限性。为了解决这些挑战，我们提出了Self-Foveate，这是一种创新的LLM驱动的指导合成方法。该方法引入了一种“微-散-宏”多级视线聚焦方法，有效地引导LLM深入挖掘嵌入在无监督文本中的细粒度信息，从而增强了合成指导的多样性和难度。跨多个无监督语料库和多样的模型架构的全面实验验证了我们提出方法的有效性和优越性。我们公开发布我们的数据和代码：https://github.com/Mubuky/Self-Foveate

更新时间: 2025-07-31 11:18:42

领域: cs.AI

下载: http://arxiv.org/abs/2507.23440v1

Coflex: Enhancing HW-NAS with Sparse Gaussian Processes for Efficient and Scalable DNN Accelerator Design

Hardware-Aware Neural Architecture Search (HW-NAS) is an efficient approach to automatically co-optimizing neural network performance and hardware energy efficiency, making it particularly useful for the development of Deep Neural Network accelerators on the edge. However, the extensive search space and high computational cost pose significant challenges to its practical adoption. To address these limitations, we propose Coflex, a novel HW-NAS framework that integrates the Sparse Gaussian Process (SGP) with multi-objective Bayesian optimization. By leveraging sparse inducing points, Coflex reduces the GP kernel complexity from cubic to near-linear with respect to the number of training samples, without compromising optimization performance. This enables scalable approximation of large-scale search space, substantially decreasing computational overhead while preserving high predictive accuracy. We evaluate the efficacy of Coflex across various benchmarks, focusing on accelerator-specific architecture. Our experimental results show that Coflex outperforms state-of-the-art methods in terms of network accuracy and Energy-Delay-Product, while achieving a computational speed-up ranging from 1.9x to 9.5x.
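The "cubic to near-linear" claim matches the standard cost comparison between exact and inducing-point Gaussian processes. The symbols are assumptions for illustration: $n$ is the number of training samples and $m \ll n$ the number of inducing points. Which specific sparse approximation Coflex uses is not stated in the abstract.

```latex
\underbrace{\mathcal{O}(n^{3})}_{\text{exact GP: factorize } K_{nn} \in \mathbb{R}^{n \times n}}
\;\longrightarrow\;
\underbrace{\mathcal{O}(n\,m^{2})}_{\text{sparse GP with } m \text{ inducing points}}
```

For a fixed $m$, the sparse cost grows linearly in $n$, which is consistent with the scaling described above.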

Updated: 2025-07-31 11:16:46

标题: Coflex：利用稀疏高斯过程增强硬件神经网络搜索，以实现高效和可扩展的深度神经网络加速器设计

摘要: 硬件感知神经架构搜索（HW-NAS）是一种有效的方法，可以自动协同优化神经网络性能和硬件能效，特别适用于在边缘开发深度神经网络加速器。然而，庞大的搜索空间和高计算成本给其实际采用带来了重大挑战。为了解决这些限制，我们提出了Coflex，一种新颖的HW-NAS框架，将稀疏高斯过程（SGP）与多目标贝叶斯优化相结合。通过利用稀疏诱导点，Coflex将GP内核复杂度从立方降低到接近于线性，而不会影响优化性能。这使得对大规模搜索空间的可扩展近似成为可能，大大降低了计算开销，同时保持了高预测准确性。我们在各种基准测试中评估了Coflex的有效性，重点关注加速器特定的架构。我们的实验结果表明，在网络准确性和能量延迟乘积方面，Coflex优于最先进的方法，同时实现了计算速度提升，范围从1.9倍到9.5倍不等。

更新时间: 2025-07-31 11:16:46

领域: cs.LG,I.2.6; C.1.3; C.3

下载: http://arxiv.org/abs/2507.23437v1

Scalable contribution bounding to achieve privacy

In modern datasets, where single records can have multiple owners, enforcing user-level differential privacy requires capping each user's total contribution. This "contribution bounding" becomes a significant combinatorial challenge. Existing sequential algorithms for this task are computationally intensive and do not scale to the massive datasets prevalent today. To address this scalability bottleneck, we propose a novel and efficient distributed algorithm. Our approach models the complex ownership structure as a hypergraph, where users are vertices and records are hyperedges. The algorithm proceeds in rounds, allowing users to propose records in parallel. A record is added to the final dataset only if all its owners unanimously agree, thereby ensuring that no user's predefined contribution limit is violated. This method aims to maximize the size of the resulting dataset for high utility while providing a practical, scalable solution for implementing user-level privacy in large, real-world systems.
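The round-based, unanimous-agreement idea can be illustrated with a small sequential simulation of the parallel rounds. This is a sketch under stated assumptions, not the paper's algorithm: the function name `bound_contributions`, the rule that each user proposes their lowest-sorted pending records, and the stopping condition are all hypothetical choices.

```python
from typing import Dict, Hashable, List, Set


def bound_contributions(records: Dict[str, Set[Hashable]], limit: int) -> List[str]:
    """Select records so that no user (vertex) appears in more than `limit`
    selected records (hyperedges).

    records: record id -> set of owning users.
    Each round, every user with spare capacity proposes up to that many of
    their pending records; a record is kept only if ALL its owners proposed it.
    """
    remaining = {user: limit for owners in records.values() for user in owners}
    pending = set(records)
    kept: List[str] = []

    while True:
        # Discard records that have an owner with no capacity left.
        pending = {r for r in pending if all(remaining[u] > 0 for u in records[r])}
        if not pending:
            break

        # "Parallel" proposal step: each user proposes independently.
        proposals = {}
        for user, capacity in remaining.items():
            if capacity > 0:
                mine = sorted(r for r in pending if user in records[r])
                proposals[user] = set(mine[:capacity])

        # A record is accepted only on unanimous agreement of its owners.
        accepted = sorted(
            r for r in pending if all(r in proposals[u] for u in records[r])
        )
        if not accepted:
            break
        for r in accepted:
            kept.append(r)
            for user in records[r]:
                remaining[user] -= 1
        pending -= set(accepted)

    return kept
```

Since each user proposes at most as many records as their remaining capacity, and accepted records are a subset of every owner's proposals, the limit cannot be exceeded. Using one shared sort order guarantees that the smallest pending record is proposed by all of its owners, so every round makes progress.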

Updated: 2025-07-31 11:14:17

标题: 可扩展的贡献边界以实现隐私

摘要: 在现代数据集中，单个记录可以有多个所有者，强制实施用户级差分隐私需要限制每个用户的总贡献。这种“贡献限制”成为一个重要的组合挑战。现有的顺序算法对于这个任务来说计算密集且无法扩展到当今普遍存在的大型数据集。为了解决这个可扩展性瓶颈，我们提出了一种新颖且高效的分布式算法。我们的方法将复杂的所有权结构建模为一个超图，其中用户是顶点，记录是超边。算法以轮次进行，允许用户并行提出记录。只有当所有所有者一致同意时，才将记录添加到最终数据集中，从而确保不违反任何用户的预定贡献限制。该方法旨在在提供高效用的同时最大化结果数据集的大小，并为在大型实际系统中实现用户级隐私提供了实用且可扩展的解决方案。

更新时间: 2025-07-31 11:14:17

领域: cs.DS,cs.CR,cs.DC

下载: http://arxiv.org/abs/2507.23432v1

A ZeNN architecture to avoid the Gaussian trap

We propose a new simple architecture, Zeta Neural Networks (ZeNNs), in order to overcome several shortcomings of standard multi-layer perceptrons (MLPs). Namely, in the large width limit, MLPs are non-parametric, they do not have a well-defined pointwise limit, they lose non-Gaussian attributes and become unable to perform feature learning; moreover, finite width MLPs perform poorly in learning high frequencies. The new ZeNN architecture is inspired by three simple principles from harmonic analysis: i) Enumerate the perceptrons and introduce a non-learnable weight to enforce convergence; ii) Introduce a scaling (or frequency) factor; iii) Choose activation functions that lead to near orthogonal systems. We will show that these ideas allow us to fix the aforementioned shortcomings of MLPs. In fact, in the infinite width limit, ZeNNs converge pointwise, they exhibit a rich asymptotic structure beyond Gaussianity, and perform feature learning. Moreover, when appropriate activation functions are chosen, (finite width) ZeNNs excel at learning high-frequency features of functions with low dimensional domains.

Updated: 2025-07-31 11:11:42

标题: 一个ZeNN架构，避免高斯陷阱

摘要: 我们提出了一种新的简单架构，Zeta神经网络（ZeNNs），旨在克服标准多层感知器（MLPs）的几个缺点。特别是，在宽度很大的情况下，MLPs是非参数的，它们没有明确定义的逐点极限，它们失去了非高斯属性并且无法执行特征学习；此外，有限宽度的MLPs在学习高频率时表现不佳。新的ZeNN架构受到谐波分析中的三个简单原则的启发：i）列举感知器并引入一个不可学习的权重以强制收敛；ii）引入一个缩放（或频率）因子；iii）选择导致近似正交系统的激活函数。我们将展示这些想法使我们能够修复MLPs的上述缺点。实际上，在无限宽度极限下，ZeNNs逐点收敛，展示了丰富的高斯性之外的渐近结构，并执行特征学习。此外，当选择适当的激活函数时，（有限宽度的）ZeNNs在学习具有低维度域的函数的高频特征方面表现出色。

更新时间: 2025-07-31 11:11:42

领域: cs.LG,math.PR,68T07, 68T01,I.2.0; G.0

下载: http://arxiv.org/abs/2505.20553v2

EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework

Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI Organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities - with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting next-generation educational AI prioritize targeted enhancement of specific pedagogical effectiveness.

Updated: 2025-07-31 11:11:30

标题: 教育Q：通过多智能体对话框架评估LLMs的教学能力

摘要: 大型语言模型（LLMs）越来越被用作教育工具，然而，评估它们的教学能力仍然具有挑战性，因为教师-学生互动的资源密集、依赖环境和方法论复杂的特性。我们介绍了EducationQ，这是一个多代理对话框架，通过模拟动态教育场景来高效评估教学能力，其中包括专门用于教学、学习和评估的代理。在涵盖13个学科和10个难度级别的1,498个问题上测试了14个LLMs（跨越主要的AI组织，包括OpenAI、Meta、Google、Anthropic等），结果显示教学效果并不与模型规模或一般推理能力呈线性相关——一些较小的开源模型在教学环境中表现优于较大的商业对手。这一发现突显了当前评估中优先考虑知识回忆而非互动教学的关键差距。我们的混合方法评估结合了定量指标和定性分析以及专家案例研究，识别出表现最佳模型采用的独特教学优势（如复杂的提问策略、自适应反馈机制）。人类专家评估显示78%的一致性与我们对有效教学行为的自动定性分析，验证了我们的方法。EducationQ表明，LLMs作为教师需要专门的优化而不仅仅是简单的扩展，这表明下一代教育AI应优先考虑对特定教学效果的有针对性增强。

更新时间: 2025-07-31 11:11:30

领域: cs.AI,cs.CE,cs.CL,cs.CY,cs.HC

下载: http://arxiv.org/abs/2504.14928v3

Chatting with your ERP: A Recipe

This paper presents the design, implementation, and evaluation behind a Large Language Model (LLM) agent that chats with an industrial production-grade ERP system. The agent is capable of interpreting natural language queries and translating them into executable SQL statements, leveraging open-weight LLMs. A novel dual-agent architecture combining reasoning and critique stages was proposed to improve query generation reliability.

Updated: 2025-07-31 11:09:50

标题: 与您的ERP聊天：一种配方

摘要: 这篇论文介绍了一个与工业生产级ERP系统进行对话的大型语言模型（LLM）代理的设计、实现和评估。该代理能够解释自然语言查询并将其转换为可执行的SQL语句，利用开放权重的LLM。提出了一种新颖的双代理架构，结合推理和评论阶段，以提高查询生成的可靠性。

更新时间: 2025-07-31 11:09:50

领域: cs.AI,cs.DB,cs.ET,cs.HC,cs.MA,68T50, 68P20,I.2.7; H.2.5; H.2.8; H.5.m

下载: http://arxiv.org/abs/2507.23429v1

Merging Memory and Space: A Spatiotemporal State Space Neural Operator

We propose the Spatiotemporal State Space Neural Operator (ST-SSM), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). ST-SSM introduces a novel factorization of the spatial and temporal dimensions, using structured state-space models to independently model temporal evolution and spatial interactions. This design enables parameter efficiency and flexible modeling of long-range spatiotemporal dynamics. A theoretical connection is established between SSMs and neural operators, and a unified universality theorem is proved for the resulting class of architectures. Empirically, we demonstrate that our factorized formulation outperforms alternative schemes such as zigzag scanning and parallel independent processing on several PDE benchmarks, including 1D Burgers' equation, 1D Kuramoto-Sivashinsky equation, and 2D Navier-Stokes equations under varying physical conditions. Our model performs competitively with existing baselines while using significantly fewer parameters. In addition, our results reinforce previous findings on the benefits of temporal memory by showing improved performance under partial observability. Our results highlight the advantages of dimensionally factorized operator learning for efficient and generalizable PDE modeling, and put this approach on a firm theoretical footing.

Updated: 2025-07-31 11:09:15

标题: 合并记忆和空间：一个时空状态空间神经算子

摘要: 我们提出了时空状态空间神经算子（ST-SSM），这是一个用于学习时间相关偏微分方程（PDEs）解算子的紧凑架构。ST-SSM引入了空间和时间维度的新颖因式分解，利用结构化状态空间模型独立地对时间演变和空间交互进行建模。这种设计实现了参数效率和对长程时空动态的灵活建模。我们建立了SSMs和神经算子之间的理论联系，并为由此类架构产生的一致性普适性定理进行了证明。在实证方面，我们证明了我们的分解公式在几个PDE基准测试中优于替代方案，包括1D Burgers'方程、1D Kuramoto-Sivashinsky方程和2D Navier-Stokes方程在不同物理条件下。我们的模型在使用更少参数的同时与现有基线模型竞争。此外，我们的结果通过显示在部分可观测性下性能提高，加强了先前关于时间记忆益处的发现。我们的结果突出了为高效和可泛化的PDE建模学习维度分解算子的优势，并将这种方法置于坚实的理论基础上。

更新时间: 2025-07-31 11:09:15

领域: cs.LG

下载: http://arxiv.org/abs/2507.23428v1

Identifying Super Spreaders in Multilayer Networks

Identifying super-spreaders can be framed as a subtask of the influence maximisation problem. It seeks to pinpoint agents within a network that, if selected as single diffusion seeds, disseminate information most effectively. Multilayer networks, a specific class of heterogeneous graphs, can capture diverse types of interactions (e.g., physical-virtual or professional-social), and thus offer a more accurate representation of complex relational structures. In this work, we introduce a novel approach to identifying super-spreaders in such networks by leveraging graph neural networks. To this end, we construct a dataset by simulating information diffusion across hundreds of networks - to the best of our knowledge, the first of its kind tailored specifically to multilayer networks. We further formulate the task as a variation of the ranking prediction problem based on a four-dimensional vector that quantifies each agent's spreading potential: (i) the number of activations; (ii) the duration of the diffusion process; (iii) the peak number of activations; and (iv) the simulation step at which this peak occurs. Our model, TopSpreadersNetwork, comprises a relationship-agnostic encoder and a custom aggregation layer. This design enables generalisation to previously unseen data and adapts to varying graph sizes. In an extensive evaluation, we compare our model against classic centrality-based heuristics and competitive deep learning methods. The results, obtained across a broad spectrum of real-world and synthetic multilayer networks, demonstrate that TopSpreadersNetwork achieves superior performance in identifying high-impact nodes, while also offering improved interpretability through its structured output.
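The four-dimensional spreading-potential vector can be computed from a single diffusion run roughly as follows. This is a sketch: the input format (newly activated agents per simulation step), the function name `spreading_vector`, and the exact definitions of duration and peak step are assumptions, since the abstract does not specify them.

```python
from typing import List, Tuple


def spreading_vector(new_activations: List[int]) -> Tuple[int, int, int, int]:
    """Summarize one diffusion run started from a single seed agent.

    new_activations[t] is the number of agents newly activated at step t.
    Returns (total activations, duration, peak activations, peak step).
    """
    total = sum(new_activations)
    # Duration: number of steps up to and including the last active step.
    duration = max(
        (t + 1 for t, n in enumerate(new_activations) if n > 0), default=0
    )
    peak = max(new_activations, default=0)
    # Step at which the peak first occurs (0 for an empty run).
    peak_step = new_activations.index(peak) if new_activations else 0
    return total, duration, peak, peak_step


# Example run: activations rise to a peak at step 2, then die out.
example = spreading_vector([1, 3, 5, 2, 0, 0])
```

For the example run, the vector is `(11, 4, 5, 2)`: eleven activations in total, a process lasting four steps, and a peak of five activations at step 2.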

Updated: 2025-07-31 10:48:42

标题: 在多层网络中识别超级传播者

摘要: 识别超级传播者可以被视为影响最大化问题的一个子任务。它旨在确定网络中的特定代理人，如果选择为单一扩散种子，则可以最有效地传播信息。多层网络是异质图的一种特殊类别，可以捕捉各种类型的交互（例如，物理-虚拟或专业-社交），因此提供了对复杂关系结构的更准确表示。在这项工作中，我们介绍了一种利用图神经网络来识别这些网络中的超级传播者的新方法。为此，我们通过模拟信息在数百个网络中的传播来构建数据集 - 据我们所知，这是专门针对多层网络的第一个数据集。我们进一步将任务构造为基于四维向量的排名预测问题，该向量量化了每个代理人的传播潜力：（i）激活次数；（ii）扩散过程的持续时间；（iii）激活次数的峰值；和（iv）达到峰值的模拟步骤。我们的模型TopSpreadersNetwork包括一个与关系无关的编码器和一个自定义聚合层。这种设计使其能够泛化到以前未见的数据，并适应不同的图大小。在广泛的评估中，我们将我们的模型与经典的中心性启发式和竞争性深度学习方法进行比较。跨广泛的真实世界和合成多层网络获得的结果表明，TopSpreadersNetwork在识别高影响节点方面表现出卓越性能，同时通过其结构化输出提供了改进的可解释性。

更新时间: 2025-07-31 10:48:42

领域: cs.SI,cs.LG

下载: http://arxiv.org/abs/2505.20980v2

Detection of Adulteration in Coconut Milk using Infrared Spectroscopy and Machine Learning

In this paper, we propose a system for detecting adulteration in coconut milk, utilizing infrared spectroscopy. The machine learning-based proposed system comprises three phases: preprocessing, feature extraction, and classification. The first phase involves removing irrelevant data from coconut milk spectral signals. In the second phase, we employ the Linear Discriminant Analysis (LDA) algorithm for extracting the most discriminating features. In the third phase, we use the K-Nearest Neighbor (KNN) model to classify coconut milk samples into authentic or adulterated. We evaluate the performance of the proposed system using a public dataset comprising Fourier Transform Infrared (FTIR) spectral information of pure and contaminated coconut milk samples. Findings show that the proposed method successfully detects adulteration with a cross-validation accuracy of 93.33%.

Updated: 2025-07-31 10:44:36

标题: 使用红外光谱和机器学习检测椰子奶中的掺假行为

摘要: 在这篇论文中，我们提出了一个利用红外光谱技术检测椰子牛奶掺假的系统。这个基于机器学习的系统包括三个阶段：预处理、特征提取和分类。第一阶段涉及从椰子牛奶光谱信号中移除不相关的数据。在第二阶段，我们采用线性判别分析（LDA）算法提取最具区分性的特征。在第三阶段，我们使用K-最近邻（KNN）模型将椰子牛奶样本分类为真品或掺假品。我们使用包括傅里叶变换红外（FTIR）光谱信息的公共数据集评估了所提出系统的性能，该数据集包括纯净和受污染的椰子牛奶样本。研究结果表明，所提出的方法成功检测到了掺假，交叉验证准确率达到了93.33%。

更新时间: 2025-07-31 10:44:36

领域: cs.LG

下载: http://arxiv.org/abs/2507.23418v1

Honey Adulteration Detection using Hyperspectral Imaging and Machine Learning

This paper aims to develop a machine learning-based system for automatically detecting honey adulteration with sugar syrup, based on honey hyperspectral imaging data. First, the floral source of a honey sample is classified by a botanical origin identification subsystem. Then, the sugar syrup adulteration is identified, and its concentration is quantified by an adulteration detection subsystem. Both subsystems consist of two steps. The first step involves extracting relevant features from the honey sample using Linear Discriminant Analysis (LDA). In the second step, we utilize the K-Nearest Neighbors (KNN) model to classify the honey botanical origin in the first subsystem and identify the adulteration level in the second subsystem. We assess the proposed system performance on a public honey hyperspectral image dataset. The result indicates that the proposed system can detect adulteration in honey with an overall cross-validation accuracy of 96.39%, making it an appropriate alternative to the current chemical-based detection methods.

Updated: 2025-07-31 10:41:45

标题: 蜂蜜掺假检测利用高光谱成像和机器学习

摘要: 本文旨在开发一种基于机器学习的系统，用于根据蜂蜜高光谱成像数据自动检测蜂蜜掺假糖浆的情况。首先，通过植物起源识别子系统对蜂蜜样本的花源进行分类。然后，通过掺假检测子系统识别糖浆掺假情况，并量化其浓度。两个子系统均包括两个步骤。第一步是利用线性判别分析（LDA）从蜂蜜样本中提取相关特征。在第二步中，我们利用K最近邻（KNN）模型对第一个子系统中的蜂蜜植物起源进行分类，并在第二个子系统中识别掺假水平。我们在一个公开的蜂蜜高光谱图像数据集上评估了所提出的系统的性能。结果表明，所提出的系统能够以96.39%的整体交叉验证准确率检测蜂蜜中的掺假情况，这使其成为当前基于化学的检测方法的一个合适替代方案。

更新时间: 2025-07-31 10:41:45

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.23416v1

Robust Adverse Weather Removal via Spectral-based Spatial Grouping

Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods like direct operations on the frequency domain fail to handle highly variable and localized distortions. To address these issues, we propose Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping-mask that clusters regions based on the spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling the varied and intricate adverse weather degradations.

Updated: 2025-07-31 10:38:29

标题: 通过基于光谱的空间分组实现强大的恶劣天气去除

摘要: 恶劣的天气条件导致了多样化和复杂的退化模式，推动了全能模型（AiO）的发展。然而，最近的AiO解决方案仍然难以捕捉多样化的退化，因为像直接在频域上操作这样的全局过滤方法无法处理高度可变和局部失真。为了解决这些问题，我们提出了基于谱空间分组变换器（SSGformer）的新方法，该方法利用谱分解和组内注意力进行多天气图像恢复。SSGformer通过常规边缘检测将图像分解为高频边缘特征，并通过奇异值分解获取低频信息。我们利用多头线性注意力来有效建模这些特征之间的关系。融合的特征与输入结合生成一个分组掩模，根据空间相似性和图像纹理对区域进行聚类。为了充分利用这个掩模，我们引入了一种组内注意机制，实现了强大的恶劣天气去除，并确保在多样化的天气条件下保持一致的性能。我们还提出了一个使用通道注意力和空间注意力的空间分组变换器块，有效平衡特征之间的关系和空间依赖性。大量实验证明了我们方法的优越性，验证了其在处理各种复杂恶劣天气退化方面的有效性。

更新时间: 2025-07-31 10:38:29

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.22498v2

A Machine Learning Approach for Honey Adulteration Detection using Mineral Element Profiles

This paper aims to develop a Machine Learning (ML)-based system for detecting honey adulteration utilizing honey mineral element profiles. The proposed system comprises two phases: preprocessing and classification. The preprocessing phase involves the treatment of missing-value attributes and normalization. In the classification phase, we use three supervised ML models: logistic regression, decision tree, and random forest, to discriminate between authentic and adulterated honey. To evaluate the performance of the ML models, we use a public dataset comprising measurements of mineral element content of authentic honey, sugar syrups, and adulterated honey. Experimental findings show that mineral element content in honey provides robust discriminative information for detecting honey adulteration. Results also demonstrate that the random forest-based classifier outperforms other classifiers on this dataset, achieving the highest cross-validation accuracy of 98.37%.
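The preprocessing phase (missing-value treatment followed by normalization) might look like the sketch below. The abstract does not say which imputation or scaling method is used, so column-mean imputation and min-max scaling are assumptions here, as is the function name `impute_and_normalize`.

```python
from typing import List, Optional


def impute_and_normalize(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Fill missing values (None) with the column mean, then min-max scale
    each column to [0, 1].

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    result = [[0.0] * n_cols for _ in rows]
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        mean = sum(observed) / len(observed)
        column = [mean if row[j] is None else row[j] for row in rows]
        low, high = min(column), max(column)
        span = high - low
        for i, value in enumerate(column):
            # A constant column carries no information; map it to 0.0.
            result[i][j] = (value - low) / span if span else 0.0
    return result


# Toy mineral profile: two elements, three samples, two missing readings.
samples = [[1.0, 10.0], [None, 20.0], [3.0, None]]
cleaned = impute_and_normalize(samples)
```

In a cross-validation setting, the column statistics would normally be computed on the training folds only and then applied to the held-out fold, to avoid leaking information into the evaluation.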

Updated: 2025-07-31 10:36:58

标题: 一种利用矿物元素配置文件进行蜂蜜掺假检测的机器学习方法

摘要: 本文旨在开发一个基于机器学习（ML）的系统，利用蜂蜜矿物元素分析来检测蜂蜜掺假。所提出的系统包括两个阶段：预处理和分类。预处理阶段涉及处理缺失值属性和归一化。在分类阶段，我们使用三种监督学习模型：逻辑回归、决策树和随机森林，来区分真假蜂蜜。为了评估机器学习模型的性能，我们使用一个公共数据集，包括真实蜂蜜、糖浆和掺假蜂蜜的矿物元素含量的测量。实验结果表明，蜂蜜中的矿物元素含量提供了检测蜂蜜掺假的稳健辨别信息。结果还表明，基于随机森林的分类器在该数据集上的性能优于其他分类器，实现了最高的交叉验证准确率为98.37%。

更新时间: 2025-07-31 10:36:58

领域: cs.LG

下载: http://arxiv.org/abs/2507.23412v1

Probabilistic Modeling of Jailbreak on Multimodal LLMs: From Quantification to Application

Recently, Multimodal Large Language Models (MLLMs) have demonstrated their superior ability in understanding multimodal content. However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses. Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content. However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate. Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generate a malicious response when prompted with this input. We approximate this probability through multiple queries to MLLMs. After modeling the relationship between input hidden states and their corresponding jailbreak probability using Jailbreak Probability Prediction Network (JPPN), we use continuous jailbreak probability for optimization. Specifically, we propose Jailbreak-Probability-based Attack (JPA), which optimizes adversarial perturbations on the input image to maximize jailbreak probability, and further enhance it as Multimodal JPA (MJPA) by including monotonic text rephrasing. To counteract attacks, we also propose Jailbreak-Probability-based Finetuning (JPF), which minimizes jailbreak probability through MLLM parameter updates. Extensive experiments show that (1) (M)JPA yields significant improvements when attacking a wide range of models under both white-box and black-box settings, and (2) JPF vastly reduces jailbreaks, by up to over 60\%. Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among input jailbreak abilities.
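Approximating the jailbreak probability "through multiple queries" is a Monte Carlo frequency estimate. The sketch below illustrates only that estimate; `query_model`, `is_malicious`, and the toy stand-ins are hypothetical placeholders for a stochastic MLLM and a response judge, neither of which is specified in the abstract.

```python
from itertools import cycle
from typing import Callable


def estimate_jailbreak_probability(
    query_model: Callable[[str], str],
    is_malicious: Callable[[str], bool],
    prompt: str,
    n_queries: int = 20,
) -> float:
    """Estimate the jailbreak probability of `prompt` as the fraction of
    sampled responses that the judge labels malicious."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    hits = sum(1 for _ in range(n_queries) if is_malicious(query_model(prompt)))
    return hits / n_queries


def make_toy_model() -> Callable[[str], str]:
    """Toy stand-in (assumption): answers unsafely on every fourth query."""
    responses = cycle(["refusal", "refusal", "refusal", "unsafe"])
    return lambda prompt: next(responses)


def toy_judge(response: str) -> bool:
    """Toy stand-in (assumption) for a malicious-content classifier."""
    return response == "unsafe"
```

The result is a continuous value in [0, 1] rather than a success/failure label, which is what allows it to serve as a regression target or an optimization objective. More queries reduce the variance of the estimate at the cost of more model calls.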

Updated: 2025-07-31 10:26:35

标题: 多模态LLM的越狱概率建模：从量化到应用

摘要: 最近，多模态大型语言模型（MLLMs）展示了它们在理解多模态内容方面的优越能力。然而，它们仍然容易受到越狱攻击的影响，这些攻击利用它们安全对齐中的弱点生成有害响应。先前的研究将越狱攻击分为成功或失败，根据响应是否包含恶意内容。然而，鉴于MLLM响应的随机性质，对输入越狱MLLM的能力进行二元分类是不合适的。基于这一观点，我们引入越狱概率来量化输入的越狱潜力，这代表MLLM在提示该输入时生成恶意响应的可能性。我们通过对MLLM进行多次查询来近似这一概率。在使用越狱概率预测网络（JPPN）对输入隐藏状态及其相应的越狱概率之间的关系建模之后，我们使用连续的越狱概率进行优化。具体来说，我们提出了基于越狱概率的攻击（JPA），通过在输入图像上优化对抗性扰动来最大化越狱概率，并进一步将其改进为多模态JPA（MJPA），包括单调文本重述。为了抵御攻击，我们还提出了基于越狱概率的微调（JPF），通过MLLM参数更新最小化越狱概率。大量实验证明，（M）JPA在攻击各种模型时在白盒和黑盒环境下都取得显著进展。JPF将越狱攻击大幅减少至多超过60％。以上两个结果都表明引入越狱概率以对输入越狱能力进行细微区分的重要性。

更新时间: 2025-07-31 10:26:35

领域: cs.CR,cs.CV

下载: http://arxiv.org/abs/2503.06989v2

Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline

Pain is a complex condition affecting a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain, and it supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring and support clinical decision-making, aiming to reduce distress and prevent functional decline. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed method introduces a pipeline that leverages respiration as the input signal and incorporates a highly efficient cross-attention transformer alongside a multi-windowing strategy. Extensive experiments demonstrate that respiration is a valuable physiological modality for pain assessment. Moreover, experiments revealed that compact and efficient models, when properly optimized, can achieve strong performance, often surpassing larger counterparts. The proposed multi-window approach effectively captures both short-term and long-term features, as well as global characteristics, thereby enhancing the model's representational capacity.

Updated: 2025-07-31 10:26:33

标题: 通过呼吸信号实现高效的疼痛识别：基于单交叉注意力变换器多窗口融合管道

摘要: 疼痛是一种影响大部分人群的复杂状况。对于正在经历疼痛的个体来说，准确且一致的评估是必不可少的，这有助于制定有效和先进的管理策略。自动疼痛评估系统提供持续监测并支持临床决策，旨在减少苦恼并预防功能衰退。本研究已提交至“第二届多模态感知大挑战，用于下一代疼痛评估（AI4PAIN）”。所提出的方法引入了一个利用呼吸作为输入信号的流程，并结合了高效的交叉注意力变换器以及多窗口策略。广泛的实验表明，呼吸是一种有价值的生理模态用于疼痛评估。此外，实验表明，当紧密优化时，紧凑高效的模型往往能取得强大的性能，常常超越较大的对应物。所提出的多窗口方法有效捕捉了短期和长期特征，以及全局特性，从而增强了模型的表征能力。

更新时间: 2025-07-31 10:26:33

领域: cs.AI,cs.LG,eess.SP

下载: http://arxiv.org/abs/2507.21886v3

Multi-Representation Diagrams for Pain Recognition: Integrating Various Electrodermal Activity Signals into a Single Image

Pain is a multifaceted phenomenon that affects a substantial portion of the population. Reliable and consistent evaluation benefits those experiencing pain and underpins the development of effective and advanced management strategies. Automatic pain-assessment systems deliver continuous monitoring, inform clinical decision-making, and aim to reduce distress while preventing functional decline. By incorporating physiological signals, these systems provide objective, accurate insights into an individual's condition. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed method introduces a pipeline that leverages electrodermal activity signals as input modality. Multiple representations of the signal are created and visualized as waveforms, and they are jointly visualized within a single multi-representation diagram. Extensive experiments incorporating various processing and filtering techniques, along with multiple representation combinations, demonstrate the effectiveness of the proposed approach. It consistently yields comparable, and in several cases superior, results to traditional fusion methods, establishing it as a robust alternative for integrating different signal representations or modalities.

Updated: 2025-07-31 10:25:10

标题: 多重表征图表用于疼痛识别：将各种皮肤电活动信号整合成单一图像

摘要: 疼痛是一个影响大部分人群的多方面现象。可靠和一致的评估有利于那些正在经历疼痛的人，并支持有效和先进管理策略的发展。自动疼痛评估系统提供持续监测，为临床决策提供信息，并旨在减轻痛苦同时预防功能下降。通过整合生理信号，这些系统为个体的状况提供客观、准确的见解。本研究已提交给\textit{第二届多模态感知大挑战，用于下一代疼痛评估（AI4PAIN）}。所提出的方法引入了一个利用皮肤电活动信号作为输入形式的流程。信号的多个表示形式被创建并显示为波形，并在单个多表示形式图中共同显示。包括各种处理和过滤技术以及多种表示组合的广泛实验表明所提出的方法的有效性。它始终产生与传统融合方法相当甚至在一些情况下更好的结果，将其确立为整合不同信号表示或形式的稳健替代方案。

更新时间: 2025-07-31 10:25:10

领域: cs.AI

下载: http://arxiv.org/abs/2507.21881v3

Tiny-BioMoE: a Lightweight Embedding Model for Biosignal Analysis

Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person's state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed approach introduces \textit{Tiny-BioMoE}, a lightweight pretrained embedding model for biosignal analysis. Trained on $4.4$ million biosignal image representations and consisting of only $7.3$ million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model's effectiveness across diverse modalities in automatic pain recognition tasks. The model's architecture (code) and weights are available at https://github.com/GkikasStefanos/Tiny-BioMoE.

Updated: 2025-07-31 10:21:28

标题: 微型生物信号嵌入模型（Tiny-BioMoE）：一种用于生物信号分析的轻量级嵌入模型

摘要: 疼痛是一种复杂而普遍的病症，影响着大部分人口。对于患有疼痛的个体来说，准确而一致的评估是至关重要的，也是在医疗系统中制定有效管理策略的基础。自动疼痛评估系统能够实现持续监测，支持临床决策，并有助于减轻患者痛苦，同时减少功能恶化的风险。利用生理信号可以提供客观和精确的洞察力，它们在多模态框架中的整合可以进一步提升系统性能。本研究已提交至“第二届下一代疼痛评估多模态传感大挑战（AI4PAIN）”。所提出的方法介绍了“Tiny-BioMoE”，这是一个轻量级的预训练生物信号分析嵌入模型。经过对440万生物信号图像表示进行训练，仅包含730万个参数，它可作为提取高质量嵌入用于下游任务的有效工具。涉及皮肤电活动、血容量脉搏、呼吸信号、外周血氧饱和度及其组合的广泛实验突显了该模型在多种模态下自动疼痛识别任务中的有效性。该模型的架构（代码）和权重可以在https://github.com/GkikasStefanos/Tiny-BioMoE 上找到。

更新时间: 2025-07-31 10:21:28

领域: cs.AI

下载: http://arxiv.org/abs/2507.21875v3

MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors

Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to edit images and LiDAR point clouds in driving scenarios jointly. At the core of our approach is introducing 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism--comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement--to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.

Updated: 2025-07-31 10:21:15

标题: MultiEditor：使用3D高斯散射先验进行驾驶情景的可控多模态物体编辑

摘要: 自动驾驶系统在理解复杂环境时严重依赖多模态感知数据。然而，现实世界数据的长尾分布阻碍了泛化能力，尤其是对于罕见但安全关键的车辆类别。为了解决这一挑战，我们提出了MultiEditor，这是一个设计用于在驾驶场景中联合编辑图像和LiDAR点云的双分支潜扩散框架。我们方法的核心是引入3D高斯点云打印（3DGS）作为目标对象的结构和外观先验。利用这一先验，我们设计了一个多级外观控制机制--包括像素级粘贴、语义级引导和多分支精化--以实现跨模态的高保真重构。我们进一步提出了一个深度引导的可变形跨模态条件模块，通过使用3DGS渲染的深度自适应地启用模态之间的相互引导，显著增强了跨模态一致性。大量实验证明，MultiEditor在视觉和几何保真度、编辑可控性和跨模态一致性方面表现出卓越性能。此外，使用MultiEditor生成罕见类别车辆数据显著增强了感知模型对代表性较低类别的检测准确性。

更新时间: 2025-07-31 10:21:15

领域: cs.AI

下载: http://arxiv.org/abs/2507.21872v3

RAVine: Reality-Aligned Evaluation for Agentic Search

Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine -- a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines the model's interaction with search tools throughout the iterative process, and accounts for efficiency factors. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.

Updated: 2025-07-31 10:20:56

标题: RAVine：现实对齐评估用于代理搜索

摘要: 主动搜索作为一种更自主和适应性的检索增强范式，正在推动智能搜索系统的演变。然而，现有的评估框架未能与主动搜索的目标很好地对齐。首先，在当前基准测试中常用的复杂查询通常偏离了现实用户搜索场景。其次，先前的方法在提取端到端评估的基本事实时往往引入噪音，导致细粒度评估出现扭曲。第三，大多数当前框架仅关注最终答案的质量，忽略了主动搜索固有的迭代过程的评估。为了解决这些限制，我们提出了RAVine - 一种用于主动LLM与搜索的现实对齐评估框架。RAVine针对多点查询和长格式答案，更好地反映用户意图，并引入了一种可归因的基本事实构造策略，以增强细粒度评估的准确性。此外，RAVine检查了模型在整个迭代过程中与搜索工具的交互，并考虑了效率因素。我们使用RAVine对一系列模型进行了基准测试，并得出了一些见解，我们希望这些见解将有助于推动主动搜索系统的发展。代码和数据集可在https://github.com/SwordFaith/RAVine获得。

更新时间: 2025-07-31 10:20:56

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.16725v2

Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios

While synthetic tabular data generation using Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, its effectiveness relies on the availability of substantial training data, which is often lacking in real-world scenarios. To overcome this limitation, we propose a novel methodology that explicitly integrates artificial inductive biases into the generative process to improve data quality in low-data regimes. Our framework leverages transfer learning and meta-learning techniques to construct and inject informative inductive biases into DGMs. We evaluate four approaches (pre-training, model averaging, Model-Agnostic Meta-Learning (MAML), and Domain Randomized Search (DRS)) and analyze their impact on the quality of the generated data. Experimental results show that incorporating inductive bias substantially improves performance, with transfer learning methods outperforming meta-learning, achieving up to 60\% gains in Jensen-Shannon divergence. The methodology is model-agnostic and especially relevant in domains such as healthcare and finance, where high-quality synthetic data are essential and data availability is often limited.

Updated: 2025-07-31 10:15:44

标题: 人工归纳偏差在数据稀缺场景中用于合成表数据生成

摘要: 使用深度生成模型（DGMs）生成合成表格数据为解决数据稀缺和隐私问题提供了一个引人注目的解决方案，但其有效性取决于大量训练数据的可用性，在现实场景中通常缺乏。为了克服这一限制，我们提出了一种新颖的方法，明确将人工归纳偏见整合到生成过程中，以改善低数据情况下的数据质量。我们的框架利用迁移学习和元学习技术来构建和注入信息性归纳偏见到DGMs中。我们评估了四种方法（预训练、模型平均、模型无关元学习（MAML）和域随机搜索（DRS））并分析它们对生成文本质量的影响。实验结果表明，整合归纳偏见显著提高了性能，迁移学习方法优于元学习，在Jensen-Shannon散度上达到高达60%的增益。这种方法是模型无关的，特别适用于医疗保健和金融等领域，其中高质量的合成数据至关重要，数据可用性通常有限。

更新时间: 2025-07-31 10:15:44

领域: cs.LG,cs.AI,I.2.0

下载: http://arxiv.org/abs/2407.03080v2
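
The abstract above reports its gains in terms of Jensen-Shannon divergence. As a reminder of what that metric measures, here is a minimal sketch of the standard definition for two discrete distributions. The abstract does not say how the distributions are built from the tabular data, so the histogram inputs, the base-2 logarithm, and the function names are assumptions made for illustration only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits for discrete distributions.

    Terms with p_i == 0 contribute nothing, following the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL divergence of p and q to their mixture.

    With base-2 logarithms the value lies in [0, 1]; 0 means identical distributions.
    """
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical histograms of one categorical column in real vs. synthetic data.
real_hist = [0.5, 0.3, 0.2]
synthetic_hist = [0.4, 0.4, 0.2]
print(js_divergence(real_hist, synthetic_hist))
```

A lower value means the synthetic column's distribution is closer to the real one. The measure is symmetric in its two arguments, unlike the KL divergence it is built from.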

AGA: An adaptive group alignment framework for structured medical cross-modal representation learning

Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.

Updated: 2025-07-31 10:14:49

标题: AGA：一种用于结构化医学跨模态表示学习的自适应组对齐框架

摘要: 从成对的图像和报告中学习医学视觉表示是在表示学习中一个有前途的方向。然而，当前医学领域中的视觉-语言预训练方法通常将临床报告简化为单一实体或碎片化的标记，忽视了它们固有的结构。此外，对比学习框架通常依赖于大量的困难负样本，这对于小规模医学数据集是不切实际的。为了解决这些挑战，我们提出了一种新框架，称为自适应分组对齐（AGA），它从成对的医学图像和报告中捕获结构化语义。AGA引入了一种基于稀疏相似矩阵的双向分组机制。对于每个图像-报告对，我们计算文本标记和图像补丁之间的细粒度相似性。每个标记选择其最匹配的补丁形成一个视觉组，每个补丁选择其最相关的标记形成一个语言组。为了实现自适应分组，我们设计了两个阈值门控模块，称为语言分组阈值门和视觉分组阈值门，动态学习分组阈值。基于相似度得分计算组表示的加权平均值。为了将每个标记与其组表示对齐，我们引入了一个实例感知组对齐损失，它在每个图像-文本对中操作，消除了对外部负样本的需求。最后，应用了一个双向跨模态分组对齐模块来增强视觉和语言组表示之间的细粒度对齐。在公共和私人数据集上的大量实验表明，我们的方法在图像-文本检索和分类任务中在微调和零样本设置下表现出强大的性能。

更新时间: 2025-07-31 10:14:49

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.23402v1
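
The grouping step described in the AGA abstract above (each token picks its best-matching patches, and the group representation is a similarity-weighted average) can be sketched roughly as follows. This is not the authors' implementation: the cosine similarity, the fixed `threshold` (the paper learns its thresholds with gating modules), and all function names are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def visual_group(token, patches, threshold=0.5):
    """Form a token's visual group and return its representation.

    Patches whose similarity to the token exceeds `threshold` join the group;
    the representation is their average, weighted by similarity.
    Returns None when no patch passes the threshold.
    """
    scored = [(cosine(token, patch), patch) for patch in patches]
    kept = [(score, patch) for score, patch in scored if score > threshold]
    if not kept:
        return None
    total = sum(score for score, _ in kept)
    return [
        sum(score * patch[i] for score, patch in kept) / total
        for i in range(len(token))
    ]


# Toy 2-D embeddings: one text token and three image patches.
token = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(visual_group(token, patches))
```

The language group for a patch would be built the same way with the roles of tokens and patches swapped. Because everything is computed within one image-report pair, no external negative samples are involved, which matches the motivation given in the abstract.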

Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan the ego vehicle's trajectory. World models that can foresee the outcome of a trajectory have been used to evaluate autonomous driving systems. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT-WM, unifying Ego-Other vehicle Trajectories in videos for driving simulation. Specifically, it remains a challenge to match multiple trajectories in the BEV space with each vehicle in the video to control the video generation. We first project ego-other vehicle trajectories in the BEV space into image coordinates so that vehicles and trajectories can be matched via pixel positions. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.

Updated: 2025-07-31 10:11:56

标题: 其他车辆轨迹也是必需的：一个驾驶世界模型统一了视频潜在空间中的自我-其他车辆轨迹

摘要: 先进的端到端自动驾驶系统可以预测其他车辆的运动并规划自车的轨迹。用于评估自动驾驶系统的世界模型可以预见轨迹的结果。然而，现有的世界模型主要强调自车的轨迹，而忽略了其他车辆的控制。这种限制阻碍了它们实现自车与驾驶场景之间交互的真实模拟能力。本文提出了一种名为EOT-WM的驾驶世界模型，将视频中的自车和其他车辆轨迹统一起来，用于驾驶模拟。具体来说，在BEV空间中将自车和其他车辆的轨迹投影到图像坐标，通过像素位置进行车辆轨迹匹配，以控制视频生成。然后，通过时空变分自动编码器对轨迹视频进行编码，以在统一视觉空间中时空地与驾驶视频潜空间对齐。进一步设计了一种注入轨迹的扩散Transformer，用于在自车和其他车辆轨迹的指导下去噪视频潜空间。此外，我们提出了一种基于控制潜空间相似性的度量来评估轨迹的可控性。在nuScenes数据集上进行了大量实验，提出的模型在FID方面的性能优于最先进的方法30%，在FVD方面的性能提升了55%。该模型还可以预测自行生成轨迹的未知驾驶场景。

更新时间: 2025-07-31 10:11:56

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.09215v3

Policy Learning from Large Vision-Language Model Feedback without Reward Modeling

Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward-labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. In extensive experiments on robotic manipulation tasks from MetaWorld, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.

Updated: 2025-07-31 10:07:49

标题: 大视觉-语言模型反馈中的政策学习，无需奖励建模

摘要: 离线强化学习（RL）为通过预先收集的次优数据集训练机器人代理提供了一个强大的框架，消除了昂贵、耗时且潜在危险的在线交互的需求。这在安全关键的现实世界应用中特别有用，其中在线数据收集昂贵且不切实际。然而，现有的离线RL算法通常需要带有奖励标签的数据，这引入了一个额外的瓶颈：奖励函数的设计本身昂贵、耗时且需要重要的领域专业知识。在本文中，我们介绍了PLARE，一种新颖的方法，利用大型视觉-语言模型（VLMs）为代理训练提供指导信号。PLARE不依赖手动设计的奖励函数，而是根据语言任务描述在视觉轨迹段对上查询VLM以获取偏好标签。然后，通过使用监督对比偏好学习目标，直接从这些偏好标签训练策略，绕过了学习显式奖励模型的需求。通过在MetaWorld的机器人操作任务上进行大量实验，PLARE实现了与或超过现有最先进基于VLM的奖励生成方法的性能。此外，我们展示了PLARE在物理机器人的实际操作任务中的有效性，进一步验证了其实际应用性。

更新时间: 2025-07-31 10:07:49

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2507.23391v1

Causal Explanation of Concept Drift -- A Truly Actionable Approach

In a world that constantly changes, it is crucial to understand how those changes impact different systems, such as industrial manufacturing or critical infrastructure. Explaining critical changes, referred to as concept drift in the field of machine learning, is the first step towards enabling targeted interventions to avoid or correct model failures, as well as malfunctions and errors in the physical world. Therefore, in this work, we extend model-based drift explanations towards causal explanations, which increases the actionability of the provided explanations. We evaluate our explanation strategy on a number of use cases, demonstrating the practical usefulness of our framework, which isolates the causally relevant features impacted by concept drift and, thus, allows for targeted intervention.

Updated: 2025-07-31 10:02:28

标题: 概念漂移的因果解释-一个真正可操作的方法

摘要: 在一个不断变化的世界中，理解这些变化如何影响不同系统（如工业制造或关键基础设施）是至关重要的。解释在机器学习领域中称为概念漂移的关键变化是实现有针对性干预的第一步，以避免或纠正模型失败，以及在物理世界中的故障和错误。因此，在这项工作中，我们将基于模型的漂移解释扩展到因果解释，这增加了提供解释的可操作性。我们在一些使用案例上评估我们的解释策略，展示了我们的框架的实际有用性，该框架隔离了受概念漂移影响的因果相关特征，并因此允许有针对性地干预。

更新时间: 2025-07-31 10:02:28

领域: cs.LG

下载: http://arxiv.org/abs/2507.23389v1

Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model's ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM's input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.

Updated: 2025-07-31 10:01:11

标题: 因果2Vec：将仅解码器LLMs改进为多功能嵌入模型

摘要: 仅解码器大型语言模型（LLM）越来越被用于构建嵌入模型，有效地将自然语言文本的语义信息编码成密集向量表示，用于各种嵌入任务。然而，许多现有方法主要集中于去除LLMs中的因果注意力掩码，以实现双向注意力，可能会削弱模型在预训练期间获取的语义信息提取能力。此外，主要的单向方法通常依赖额外的输入文本来克服因果关注的固有限制，不可避免地增加了计算成本。在本研究中，我们提出了Causal2Vec，一个通用的嵌入模型，旨在增强仅解码器LLMs的性能，而无需改变其原始架构或引入显著的计算开销。具体而言，我们首先使用一种轻量级的BERT风格模型将输入文本预编码成一个单一的上下文标记，然后将其前置到LLM的输入序列中，使每个标记能够捕获上下文信息，即使没有关注未来标记。此外，为了减轻由最后标记汇总引入的最近偏见，并帮助LLMs更好地利用编码在上下文标记中的语义信息，我们将上下文和EOS标记的最后隐藏状态连接起来作为最终文本嵌入。在实践中，Causal2Vec在大规模文本嵌入基准测试（MTEB）上实现了最先进的性能，超过了仅在公开可用的检索数据集上训练的模型，同时将所需的序列长度减少了高达85％，推断时间减少了高达82％，与表现最佳的方法相比。

更新时间: 2025-07-31 10:01:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.23386v1

MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.

Updated: 2025-07-31 09:59:17

标题: MPCC：一种新颖的多模态大语言模型中带有复杂约束的多模态规划基准

摘要: 多模式规划能力是指能够在多模态上下文中预测、推理和设计任务执行步骤的能力，这对于跨多个步骤进行复杂推理和决策至关重要。然而，当前的基准面临两个关键挑战：(1)它们无法直接评估多模态实际世界规划能力，(2)它们缺乏跨模态的约束或隐含约束。为了解决这些问题，我们引入了具有复杂约束的多模态规划(MPCC)，这是第一个系统评估MLLM处理多模态约束的基准。为了解决第一个挑战，MPCC专注于三个实际任务：航班规划、日历规划和会议规划。为了解决第二个挑战，我们在这些任务中引入了复杂约束(例如预算、时间和空间)，并设定了不同难度级别(EASY、MEDIUM、HARD)来区分约束复杂度和搜索空间扩展。对13个先进的MLLM进行的实验显示了重大挑战：闭源模型仅实现了21.3%的可行计划，而开源模型平均低于11%。此外，我们观察到MLLM对约束复杂性非常敏感，传统的多模态提示策略在多约束场景下失败。我们的工作在规划中形式化了多模态约束，提供了严格的评估框架，并强调了在实际世界MLLM应用中对约束感知推理的进步需求。

更新时间: 2025-07-31 09:59:17

领域: cs.CL,cs.AI,cs.CV,I.2.8; I.2.10

下载: http://arxiv.org/abs/2507.23382v1

LLM4Rail: An LLM-Augmented Railway Service Consulting Platform

Large language models (LLMs) have significantly reshaped many lines of business. To meet the increasing demand for individualized railway service, we develop LLM4Rail, a novel LLM-augmented railway service consulting platform. Empowered by an LLM, LLM4Rail can provide custom modules for ticketing, railway food & drink recommendations, weather information, and chitchat. In LLM4Rail, we propose the iterative "Question-Thought-Action-Observation (QTAO)" prompting framework. It integrates verbal reasoning with task-oriented actions, that is, reasoning guides action selection, so that external observations relevant to railway operation and service are retrieved effectively and used to generate accurate responses. To provide personalized onboard dining services, we first construct Chinese Railway Food and Drink (CRFD-25), a publicly accessible takeout dataset tailored for railway services. CRFD-25 covers a wide range of signature dishes categorized by cities, cuisines, age groups, and spiciness levels. We further introduce an LLM-based zero-shot conversational recommender for railway catering. To address the unconstrained nature of open recommendations, a feature similarity-based post-processing step is introduced to ensure that all recommended items are aligned with the CRFD-25 dataset.

Updated: 2025-07-31 09:45:55

标题: LLM4Rail：一种LLM增强的铁路服务咨询平台

摘要: 大型语言模型(LLMs)已显著改变了不同领域的业务。为了满足个性化铁路服务需求的增长，我们开发了LLM4Rail - 一种新型的LLM增强铁路服务咨询平台。借助LLM的力量，LLM4Rail可以为售票、铁路餐饮推荐、天气信息和闲聊提供定制模块。在LLM4Rail中，我们提出了迭代的“问题-思考-行动-观察(QTAO)”提示框架。它精心整合了口头推理和以任务为导向的行动，即通过推理来指导行动选择，有效检索与铁路运营和服务相关的外部观察，以生成准确的响应。为了提供个性化的车厢餐饮服务，我们首先构建了中国铁路美食饮品(CRFD-25) - 一个专为铁路服务量身定制的公开可访问的外卖数据集。CRFD-25涵盖了按城市、菜系、年龄组和辛辣程度分类的各种招牌菜肴。我们进一步引入了基于LLM的零-shot对话推荐系统，用于铁路餐饮。为了解决开放推荐的无约束性，介绍了基于特征相似度的后处理步骤，以确保所有推荐的项目与CRFD-25数据集对齐。

更新时间: 2025-07-31 09:45:55

领域: cs.AI

下载: http://arxiv.org/abs/2507.23377v1

Some Theoretical Results on Layerwise Effective Dimension Oscillations in Finite Width ReLU Networks

We analyze the layerwise effective dimension (rank of the feature matrix) in fully-connected ReLU networks of finite width. Specifically, for a fixed batch of $m$ inputs and random Gaussian weights, we derive closed-form expressions for the expected rank of the $m\times n$ hidden activation matrices. Our main result shows that $\mathbb{E}[EDim(\ell)]=m[1-(1-2/\pi)^\ell]+O(e^{-c m})$ so that the rank deficit decays geometrically with ratio $1-2 / \pi \approx 0.3634$. We also prove a sub-Gaussian concentration bound, and identify the "revival" depths at which the expected rank attains local maxima. In particular, these peaks occur at depths $\ell_k^*\approx(k+1/2)\pi/\log(1/\rho)$ with height $\approx (1-e^{-\pi/2}) m \approx 0.79m$. We further show that this oscillatory rank behavior is a finite-width phenomenon: under orthogonal weight initialization or strong negative-slope leaky-ReLU, the rank remains (nearly) full. These results provide a precise characterization of how random ReLU layers alternately collapse and partially revive the subspace of input variations, adding nuance to prior work on expressivity of deep networks.

Updated: 2025-07-31 09:41:53

标题: 有限宽度ReLU网络中关于逐层有效维度振荡的一些理论结果

摘要: 我们分析了有限宽度的全连接ReLU网络中的逐层有效维数（特征矩阵的秩）。具体来说，对于固定批次的$m$个输入和随机高斯权重，我们推导出了隐藏激活矩阵的期望秩的闭合形式表达式。我们的主要结果显示$\mathbb{E}[EDim(\ell)]=m[1-(1-2/\pi)^\ell]+O(e^{-c m})$，因此秩的不足以几何比率$1-2/\pi \approx 0.3634$衰减。我们还证明了一个次高斯集中界，并确定了期望秩达到局部极值的“复苏”深度。特别是，这些峰值出现在深度$\ell_k^*\approx(k+1/2)\pi/\log(1/\rho)$处，高度约为$\approx (1-e^{-\pi/2}) m \approx 0.79m$。我们进一步表明，这种振荡的秩行为是有限宽度现象：在正交权重初始化或强负斜率的leaky-ReLU下，秩保持（几乎）完整。这些结果提供了对随机ReLU层如何交替地坍塌和部分复苏输入变化子空间的精确描述，为深度网络表达能力的先前工作增添了细微之处。

更新时间: 2025-07-31 09:41:53

领域: cs.LG

下载: http://arxiv.org/abs/2507.07675v2
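
To make the numbers in the ReLU effective-dimension abstract above concrete, the leading term of the stated result, $m[1-(1-2/\pi)^\ell]$, can be evaluated directly. The sketch below drops the $O(e^{-cm})$ correction, whose constant the abstract does not give, and the function names are hypothetical. It only reproduces the arithmetic quoted in the abstract and says nothing about the derivation.

```python
import math

# Geometric ratio quoted in the abstract: 1 - 2/pi, roughly 0.3634.
RATIO = 1 - 2 / math.pi


def expected_effective_dim(m, depth):
    """Leading term m * (1 - (1 - 2/pi)**depth); the O(exp(-c*m)) term is ignored."""
    return m * (1 - RATIO ** depth)


def rank_deficit(m, depth):
    """Gap between full rank m and the leading-term expected rank."""
    return m - expected_effective_dim(m, depth)


def revival_peak_height(m):
    """Peak height (1 - exp(-pi/2)) * m quoted for the 'revival' depths."""
    return (1 - math.exp(-math.pi / 2)) * m


for depth in range(1, 6):
    print(depth, round(expected_effective_dim(100, depth), 2), round(rank_deficit(100, depth), 2))
```

Each extra layer multiplies the leading-term rank deficit by the same ratio, which is the geometric decay the abstract refers to.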

Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling

Software issue resolution is a critical challenge in software engineering and has garnered increasing attention in recent years. With the rapid advancement of large language models (LLMs), substantial progress has been made in addressing real-world software engineering tasks. Recent studies have introduced ensemble reasoning techniques to enhance the performance of LLM-based issue resolution. However, existing prompting-based methods still face limitations in effectively exploring large ensemble spaces and lack the capacity for repository-level understanding, both of which constrain their overall effectiveness. In this paper, we propose Trae Agent, the first agent-based ensemble reasoning approach for repository-level issue resolution. Trae Agent formulates our goal as an optimal solution search problem and addresses two key challenges, i.e., large ensemble spaces and repository-level understanding, through modular agents for generation, pruning, and selection. We conduct extensive experiments using three leading LLMs on the widely-adopted SWE-bench benchmark, comparing Trae Agent against four state-of-the-art ensemble reasoning techniques. Experimental results demonstrate that Trae Agent consistently achieves superior performance, with an average improvement of 10.22% over all baselines in terms of Pass@1. Trae Agent has achieved first place on the SWE-bench Verified leaderboard, with a notable Pass@1 score of 75.20%. We are pleased to release Trae Agent as an open-source project to support the research community, with all resources available at https://github.com/bytedance/trae-agent.

Updated: 2025-07-31 09:37:22

标题: Trae Agent：一种基于LLM的软件工程代理，具有测试时间缩放功能

摘要: 软件问题解决是软件工程中的一个关键挑战，并在近年来引起越来越多的关注。随着大型语言模型（LLMs）的快速发展，在解决真实软件工程任务方面取得了实质性进展。最近的研究引入了集成推理技术来提高基于LLM的问题解决性能。然而，现有的基于提示的方法仍面临有效探索大型集成空间和缺乏库级理解能力的局限，这两者限制了它们的整体有效性。在本文中，我们提出了Trae Agent，这是一个用于库级问题解决的基于代理的集成推理方法。Trae Agent将我们的目标定义为一个最优解搜索问题，并通过用于生成、修剪和选择的模块化代理来解决两个关键挑战，即大型集成空间和库级理解。我们在广泛采用的SWE-bench基准上使用三种领先的LLMs进行了大量实验，将Trae Agent与四种最先进的集成推理技术进行了比较。实验结果表明，Trae Agent始终表现出优越性能，在Pass@1方面的所有基线平均改进为10.22%。Trae Agent在SWE-bench Verified排行榜上取得了第一名，Pass@1得分显著为75.20%。我们很高兴将Trae Agent发布为一个开源项目，以支持研究社区，所有资源均可在https://github.com/bytedance/trae-agent获得。

更新时间: 2025-07-31 09:37:22

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.23370v1

Learning Like Humans: Resource-Efficient Federated Fine-Tuning through Cognitive Developmental Stages

Federated fine-tuning enables Large Language Models (LLMs) to adapt to downstream tasks while preserving data privacy, but its resource-intensive nature limits deployment on edge devices. In this paper, we introduce Developmental Federated Tuning (DevFT), a resource-efficient approach inspired by cognitive development that progressively builds a powerful LLM from a compact foundation. DevFT decomposes the fine-tuning process into developmental stages, each optimizing submodels with increasing parameter capacity. Knowledge from earlier stages transfers to subsequent submodels, providing optimized initialization parameters that prevent convergence to local minima and accelerate training. This paradigm mirrors human learning, gradually constructing comprehensive knowledge structure while refining existing skills. To efficiently build stage-specific submodels, DevFT introduces deconfliction-guided layer grouping and differential-based layer fusion to distill essential information and construct representative layers. Evaluations across multiple benchmarks demonstrate that DevFT significantly outperforms state-of-the-art methods, achieving up to 4.59$\times$ faster convergence, 10.67$\times$ reduction in communication overhead, and 9.07% average performance improvement, while maintaining compatibility with existing approaches.

Updated: 2025-07-31 09:36:43

标题: 像人类一样学习：通过认知发展阶段进行资源有效的联邦微调

摘要: 联邦微调使得大型语言模型（LLMs）能够适应下游任务，同时保护数据隐私，但其资源密集型的特性限制了在边缘设备上的部署。在本文中，我们介绍了发展性联邦微调（DevFT），这是一种受认知发展启发的资源高效方法，逐步从紧凑的基础构建强大的LLM。DevFT将微调过程分解为发展阶段，每个阶段都优化具有增加参数容量的子模型。从早期阶段获得的知识传递到后续子模型，提供优化的初始化参数，防止收敛到局部最小值并加速训练。这种范式反映了人类学习过程，逐渐构建全面的知识结构同时完善现有技能。为了高效构建阶段特定的子模型，DevFT引入了解决冲突的层分组和基于差分的层融合，以提炼关键信息并构建代表性层。跨多个基准测试的评估结果表明，DevFT显著优于最先进的方法，实现了高达4.59倍更快的收敛速度，10.67倍的通信开销减少，以及9.07%的平均性能改进，同时保持与现有方法的兼容性。

更新时间: 2025-07-31 09:36:43

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2508.00041v1

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.

Updated: 2025-07-31 09:33:35

标题: 《思维定理：一种多智能体框架，用于语言模型中的引导、演绎和归纳推理》

摘要: 大型语言模型（LLMs）在自然语言推理任务中表现出色，但它们的推理过程仍然脆弱且难以解释。像思维链（CoT）这样的提示技术通过引发中间推理步骤或聚合多个输出来增强可靠性。然而，它们缺乏强制逻辑结构和评估内部一致性的机制。我们引入了Theorem-of-Thought（ToTh），这是一个新颖的框架，将推理建模为三个并行代理之间的协作，每个代理模拟一种不同的推理模式：逆向推理、演绎推理和归纳推理。每个代理产生一个推理跟踪，将其结构化为一个形式推理图。为了评估一致性，我们应用由自然语言推理（NLI）引导的贝叶斯信念传播，为每个步骤分配置信度分数。最一致的图被选择用来推导最终答案。在符号（WebOfLies）和数字（MultiArith）推理基准上的实验证明，ToTh在多个LLMs上始终优于CoT、Self-Consistency和CoT-Decoding，同时产生可解释和逻辑基础的推理链。我们的研究结果表明了构建更加稳健和启发认知的LLM推理的一个有前途的方向。实现可在https://github.com/KurbanIntelligenceLab/theorem-of-thought 上找到。

更新时间: 2025-07-31 09:33:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2506.07106v2

EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations

As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints are available at: https://github.com/continental/EP-Diffuser.

Updated: 2025-07-31 09:28:25

标题: EP-Diffuser：通过多项式表示进行交通场景生成和预测的高效扩散模型

摘要: 随着预测时间范围的增加，由于代理运动的多模态性，预测交通场景的未来演变变得越来越困难。大多数最先进的（SotA）预测模型主要关注预测最有可能的未来。然而，为了自动驾驶车辆的安全运行，同样重要的是覆盖可能运动替代方案的分布。为了解决这个问题，我们介绍了EP-Diffuser，这是一种新颖的参数高效的基于扩散的生成模型，旨在捕捉可能的交通场景演变分布。在道路布局和代理历史的条件下，我们的模型充当预测器，生成多样化的、可信的场景延续。我们在Argoverse 2数据集上以准确性和预测的合理性对EP-Diffuser进行基准测试，并将其与两个SotA模型进行比较。尽管模型尺寸明显较小，我们的方法实现了高度准确和可信的交通场景预测。我们进一步评估模型在Waymo Open数据集上的超出分布（OoD）测试环境中的泛化能力，并展示了我们方法的卓越鲁棒性。代码和模型检查点可在以下链接获取：https://github.com/continental/EP-Diffuser。

更新时间: 2025-07-31 09:28:25

领域: cs.CV,cs.LG,cs.RO

下载: http://arxiv.org/abs/2504.05422v3

"I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation

I reflect on my experience creating two music albums centered on state-of-the-art prompt-based AI music generation platforms. The first album explicitly poses the question: What happens when I collide my junk mail with these platforms? The second album is a direct response to the first, and toys with the inability of state-of-the-art prompt-based AI music generation platforms to generate music that is not "practiced", "polished", and "produced". I seed a large language model (LLM) with information about these albums and have it interview me, which results in the exploration of several deeper questions: To what extent am I the author? Where am I in the resulting music? How is my musical identity changing as I am faced with machines that are in some ways far more talented than I? What new musical spaces does my work open, for me or anyone/thing else? I conclude by reflecting on my reflections, as well as LLM-mediated self-reflection as method.

Updated: 2025-07-31 09:25:55

标题: 我创作了这个（这样的）：通过基于提示的AI音乐生成，谈判作者身份，面对欺诈行为，并探索新的音乐空间

摘要: 我反思了我在两张音乐专辑上的经验，这两张专辑都是围绕着最先进的基于提示的人工智能音乐生成平台创建的。第一张专辑明确提出了一个问题：当我将我的垃圾邮件与这些平台相碰撞时会发生什么？第二张专辑是对第一张的直接回应，并玩弄了最先进的基于提示的人工智能音乐生成平台无法生成``练习过''、``精心制作''和``制作完成''音乐的能力。我用关于这些专辑的信息为一个大型语言模型（LLM）提供种子，并让它采访我，这导致了对几个更深层次问题的探讨：我在多大程度上是作者？我在结果音乐中的位置在哪里？当我面对在某些方面比我更有才华的机器时，我的音乐身份是如何变化的？我的作品为我或其他任何/任何其他东西打开了哪些新的音乐空间？最后，我总结了我的反思，以及LLM介导的自我反思作为一种方法。

更新时间: 2025-07-31 09:25:55

领域: cs.SD,cs.AI,eess.AS,I.2; J.5

下载: http://arxiv.org/abs/2507.23365v1

Robust and Fine-Grained Detection of AI Generated Texts

An ideal detection system for machine-generated content should work well on any generator, as more advanced LLMs come into existence day by day. Existing systems often struggle to accurately identify AI-generated content in shorter texts. Further, not all texts are entirely authored by a human or an LLM, hence we focused more on partial cases, i.e., human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification, trained on an extensive collection of human-machine co-authored texts, which performed well on texts from unseen domains and unseen generators, texts by non-native speakers, and texts with adversarial inputs. We also introduce a new dataset of over 2.4M such texts, mostly co-authored by several popular proprietary LLMs, across 23 languages. We also present our models' performance on the texts of each domain and generator. Additional findings include comparisons of performance against each adversarial method and across input text lengths, and of the characteristics of generated texts compared to the original human-authored texts.

Updated: 2025-07-31 09:14:25

标题: AI生成文本的稳健和细粒度检测

摘要: 一种理想的机器生成内容检测系统应该能够在任何生成器上良好运作，随着更多先进的LLMs每天涌现。现有系统通常难以准确识别短文本中的AI生成内容。此外，并非所有文本都完全由人类或LLM创作，因此我们更关注部分案例，即人类-LLM共同创作的文本。我们的论文介绍了一组为令牌分类任务构建的模型，这些模型在广泛收集的人机共同创作文本上进行训练，在未见领域的文本、未见生成器的文本、非母语使用者的文本以及带有对抗性输入的文本上表现良好。我们还介绍了一个包含超过2.4M这种文本的新数据集，其中大部分由23种语言的几种流行专有LLMs共同创作。我们还展示了我们模型在每个领域和生成器的每个文本上的表现结果。额外的发现包括与每种对抗方法的性能比较，输入文本的长度以及生成文本与原始人类创作文本的特征。

更新时间: 2025-07-31 09:14:25

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.11952v3

SWE-Exp: Experience-Driven Software Issue Resolution

Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers, treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels, from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves a state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.

Updated: 2025-07-31 09:13:42

标题: SWE-Exp：基于经验的软件问题解决

摘要: 最近，大型语言模型（LLM）代理的最新进展显示出在软件问题解决方面取得了显著进展，利用了先进的技术，如多代理协作和蒙特卡洛树搜索（MCTS）。然而，当前的代理作为无记忆的探险者，将每个问题单独处理，而不保留或重复利用以前修复经验中的知识。这导致对失败轨迹的冗余探索和错失了将成功的问题解决方法调整到类似问题的机会。为了解决这个问题，我们引入了SWE-Exp，一种经验增强方法，从先前的代理轨迹中提炼出简明而可操作的经验，实现跨问题的持续学习。我们的方法引入了一个多方面的经验库，捕捉了成功和失败的修复尝试。具体来说，它从高级问题理解到具体代码更改的不同层次提取可重复使用的问题解决知识。实验证明，SWE-Exp在开源代理框架下的SWE-bench-Verified上实现了最先进的解决率（41.6% Pass@1）。我们的方法建立了一个新的范式，自动软件工程代理系统地积累和利用修复专业知识，从试错探索转变为战略、经验驱动的问题解决。

更新时间: 2025-07-31 09:13:42

领域: cs.SE,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.23361v1

Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting

This work integrates Bayesian regime detection with conditional neural processes for 24-hour electricity price prediction in the German market. Our methodology integrates regime detection using a disentangled sticky hierarchical Dirichlet process hidden Markov model (DS-HDP-HMM) applied to daily electricity prices. Each identified regime is subsequently modeled by an independent conditional neural process (CNP), trained to learn localized mappings from input contexts to 24-dimensional hourly price trajectories, with final predictions computed as regime-weighted mixtures of these CNP outputs. We rigorously evaluate R-NP against deep neural networks (DNN) and Lasso estimated auto-regressive (LEAR) models by integrating their forecasts into diverse battery storage optimization frameworks, including price arbitrage, risk management, grid services, and cost minimization. This operational utility assessment revealed complex performance trade-offs: LEAR often yielded superior absolute profits or lower costs, while DNN showed exceptional optimality in specific cost-minimization contexts. Recognizing that raw prediction accuracy doesn't always translate to optimal operational outcomes, we employed TOPSIS as a comprehensive multi-criteria evaluation layer. Our TOPSIS analysis identified LEAR as the top-ranked model for 2021, but crucially, our proposed R-NP model emerged as the most balanced and preferred solution for 2021, 2022 and 2023.

Updated: 2025-07-31 09:12:25

标题: 具有多标准决策支持的制度感知条件神经过程，用于运营电力价格预测

摘要: 这项工作将贝叶斯制度检测与条件神经过程集成，用于在德国市场进行24小时电力价格预测。我们的方法将通过一个解耦粘性层次狄利克雷过程隐藏马尔可夫模型（DS-HDP-HMM）应用于每日电力价格的制度检测。随后，每个识别的制度由独立的条件神经过程（CNP）建模，训练以从输入背景到24维小时价格轨迹的局部映射，并通过这些CNP输出的制度加权混合进行最终预测。我们通过将它们的预测整合到多样化的电池储存优化框架中（包括价格套利、风险管理、电网服务和成本最小化），对R-NP进行了严格评估，包括深度神经网络（DNN）和Lasso估计的自回归（LEAR）模型。这种运营效用评估揭示了复杂的绩效权衡：LEAR通常产生较高的绝对利润或更低的成本，而DNN在特定的成本最小化背景下显示出异常的最优性。认识到原始预测准确性并不总是转化为最佳的运营结果，我们采用TOPSIS作为综合多标准评估层。我们的TOPSIS分析确定LEAR为2021年排名最高的模型，但关键的是，我们提出的R-NP模型被确定为2021年、2022年和2023年最均衡和首选的解决方案。

更新时间: 2025-07-31 09:12:25

领域: cs.LG,math.PR,stat.AP,stat.ML,60J20, 68T07

下载: http://arxiv.org/abs/2508.00040v1
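
The abstract above uses TOPSIS as its multi-criteria evaluation layer but does not spell out the procedure. The following is a minimal sketch of the standard TOPSIS ranking, assuming vector normalization and Euclidean distances; the function name `topsis`, the criteria, the weights, and the numbers are all hypothetical and are not taken from the paper.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a TOPSIS closeness score in [0, 1] for each alternative.

    matrix  : rows are alternatives, columns are criteria
    weights : one non-negative weight per criterion (rescaled to sum to 1)
    benefit : per criterion, True if larger is better, False if smaller is better
    """
    n_crit = len(weights)
    total_w = sum(weights)
    w = [x / total_w for x in weights]

    # Vector-normalise each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]

    # Ideal-best and ideal-worst value per criterion.
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]

    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        denom = d_best + d_worst
        # If all alternatives are identical, every distance is zero.
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores


# Made-up example: three forecasting models scored on profit (higher is
# better) and cost (lower is better). These numbers are illustrative only.
decision_matrix = [
    [10.0, 2.0],  # model_a
    [5.0, 4.0],   # model_b
    [8.0, 3.0],   # model_c
]
scores = topsis(decision_matrix, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

A model that is best on every criterion coincides with the ideal point and scores 1, while one that is worst on every criterion scores 0. The interesting cases are the trade-offs in between, which is where the choice of weights matters.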

VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.

Updated: 2025-07-31 09:09:45

标题: VL-Cogito：高级多模态推理的渐进式课程强化学习

摘要: 强化学习已经证明了其在增强大型语言模型推理能力方面的有效性。最近的研究工作逐渐将这一范式扩展到多模态推理任务。由于多模态任务的固有复杂性和多样性，特别是在语义内容和问题表述方面，现有模型经常在不同领域和难度水平上表现出不稳定的性能。为了解决这些限制，我们提出了VL-Cogito，这是一个通过新颖的多阶段渐进式课程强化学习（PCuRL）框架训练的先进多模态推理模型。PCuRL系统地引导模型逐渐增加难度的任务，显著提高了其在多样的多模态环境中的推理能力。该框架引入了两个关键创新：（1）在线难度软加权机制，动态调整连续RL训练阶段的训练难度；以及（2）动态长度奖励机制，鼓励模型根据任务复杂性自适应调节其推理路径长度，从而平衡推理效率和正确性。实验评估表明，VL-Cogito在跨数学、科学、逻辑和一般理解的主流多模态基准上始终匹配或超越现有的以推理为导向的模型，验证了我们方法的有效性。

更新时间: 2025-07-31 09:09:45

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.22607v2

Text-to-SQL Task-oriented Dialogue Ontology Construction

Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

Updated: 2025-07-31 09:08:59

标题: 文本到SQL任务导向对话本体构建

摘要: 大型语言模型（LLMs）被广泛用作通用知识来源，但它们依赖于参数化知识，限制了可解释性和可信度。在面向任务的对话（TOD）系统中，这种分离是明确的，使用一个由明确本体结构化的外部数据库来确保可解释性和可控性。然而，构建这样的本体需要手动标签或监督训练。我们介绍了TeQoDO：一种文本到SQL任务导向对话本体构建方法。在这里，一个LLM可以自主地通过利用其固有的SQL编程能力结合在提示中提供的对话理论，从头开始构建一个TOD本体，无需监督。我们展示了TeQoDO在超越迁移学习方法方面的表现，其构建的本体在下游对话状态跟踪任务上具有竞争力。消融研究表明了对话理论的关键作用。TeQoDO还可以扩展到允许构建更大的本体，我们在维基百科和ArXiv数据集上进行了研究。我们认为这是朝着更广泛应用本体以提高LLM可解释性的一步。

更新时间: 2025-07-31 09:08:59

领域: cs.CL,cs.AI,cs.DB,cs.IR

下载: http://arxiv.org/abs/2507.23358v1

Quality Evaluation of COBOL to Java Code Transformation

We present an automated evaluation system for assessing COBOL-to-Java code translation within IBM's watsonx Code Assistant for Z (WCA4Z). The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment. Our approach combines analytic checkers with LLM-as-a-judge (LaaJ) techniques to deliver scalable, multi-faceted evaluations. The system supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review. We describe the system architecture, evaluation strategies, and reporting mechanisms that provide actionable insights for developers and project managers, facilitating the evolution of high-quality, modernized codebases.

Updated: 2025-07-31 09:06:20

标题: COBOL到Java代码转换的质量评估

摘要: 我们提出了一个自动化评估系统，用于评估IBM的watsonx Code Assistant for Z（WCA4Z）中COBOL到Java代码转换。该系统解决了评估基于LLM的翻译器的关键挑战，包括模型的不透明性和翻译质量评估的复杂性。我们的方法结合了分析检查器和LLM作为评判者（LaaJ）技术，以提供可扩展的、多方面的评估。该系统支持持续集成工作流程，实现大规模基准测试，并减少对手动审核的依赖。我们描述了系统架构、评估策略和报告机制，为开发人员和项目经理提供可操作的见解，促进高质量、现代化代码库的演进。

更新时间: 2025-07-31 09:06:20

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.23356v1

KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.

Updated: 2025-07-31 08:57:40

标题: 关键知识RAG（K^2RAG）：用于提高LLM问答能力的增强RAG方法

摘要: 微调是一个资源密集型的过程,当重新训练大型语言模型(LLMs)以整合更广泛的知识体系时。尽管已经开发了许多微调技术来减少涉及的时间和计算成本，但挑战仍然存在，因为LLMs继续增长并变得更加庞大和复杂。为了解决这个问题，需要一种新的方法来扩展LLMs中的知识。检索增强生成(RAG)提供了一种替代方案，通过将外部知识存储在数据库中，并检索相关的片段来支持问题回答。然而，RAG的天真实现在可伸缩性和答案准确性方面面临重大限制。本文介绍了KeyKnowledgeRAG (K2RAG)，这是一个旨在克服这些限制的新型框架。受到分治范式的启发，K2RAG集成了密集和稀疏向量搜索、知识图和文本摘要，以改善检索质量和系统效率。该框架还包括一个预处理步骤，对训练数据进行摘要，显著减少了训练时间。K2RAG在MultiHopRAG数据集上进行了评估，其中提出的流程在文档语料库上进行训练，并在一个单独的评估集上进行测试。结果显示，K2RAG相比常见的天真RAG实现有明显改进。K2RAG实现了0.57的最高平均答案相似度分数，并达到了0.82的最高第三四分位(Q3)相似度，表明与地面真相答案的对齐更好。除了改进的准确性，该框架还证明了高效性。摘要步骤将每个组件的平均训练时间减少了93%，执行速度比传统的基于知识图的RAG系统快了高达40%。K2RAG还展现了卓越的可伸缩性，比本研究中测试的几种天真RAG实现需要的VRAM少三倍。

更新时间: 2025-07-31 08:57:40

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.07695v2

Multi-Waypoint Path Planning and Motion Control for Non-holonomic Mobile Robots in Agricultural Applications

There is a growing demand for autonomous mobile robots capable of navigating unstructured agricultural environments. Tasks such as weed control in meadows require efficient path planning through an unordered set of coordinates while minimizing travel distance and adhering to curvature constraints to prevent soil damage and protect vegetation. This paper presents an integrated navigation framework combining a global path planner based on the Dubins Traveling Salesman Problem (DTSP) with a Nonlinear Model Predictive Control (NMPC) strategy for local path planning and control. The DTSP generates a minimum-length, curvature-constrained path that efficiently visits all targets, while the NMPC leverages this path to compute control signals to accurately reach each waypoint. The system's performance was validated through comparative simulation analysis on real-world field datasets, demonstrating that the coupled DTSP-based planner produced smoother and shorter paths, with a reduction of about 16% in the provided scenario, compared to decoupled methods. Building on this path, the NMPC controller effectively steered the robot to the desired waypoints, while locally optimizing the trajectory and ensuring adherence to constraints. These findings demonstrate the potential of the proposed framework for efficient autonomous navigation in agricultural environments.

Updated: 2025-07-31 08:56:24

标题: 多航点路径规划与运动控制在农业应用中的非完整移动机器人

摘要: 越来越多的人需要具有导航能力的自主移动机器人，能够在不规则的农业环境中进行导航。在草地中进行除草等任务需要通过一个无序的坐标集合进行高效路径规划，同时最小化行驶距离并遵守曲率约束，以防止土壤损坏并保护植被。本文提出了一个综合的导航框架，结合了基于Dubins旅行商问题（DTSP）的全局路径规划器和用于本地路径规划和控制的非线性模型预测控制（NMPC）策略。DTSP生成一个最小长度的、受曲率约束的路径，有效地访问所有目标，而NMPC利用这条路径计算控制信号，精确地到达每个航点。通过对真实世界的田地数据集进行比较模拟分析，验证了系统的性能，表明耦合的基于DTSP的规划器产生了更加平滑和更短的路径，在所提供的场景中，与解耦的方法相比，减少了约16%。基于此，NMPC控制器有效地引导机器人到达所需的航点，同时在本地优化轨迹并确保符合约束。这些发现显示了所提出的框架在农业环境中进行高效自主导航的潜力。

更新时间: 2025-07-31 08:56:24

领域: cs.RO,cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.23350v1

Optimal Transport Learning: Balancing Value Optimization and Fairness in Individualized Treatment Rules

Individualized treatment rules (ITRs) have gained significant attention due to their wide-ranging applications in fields such as precision medicine, ridesharing, and advertising recommendations. However, when ITRs are influenced by sensitive attributes such as race, gender, or age, they can lead to outcomes where certain groups are unfairly advantaged or disadvantaged. To address this gap, we propose a flexible approach based on optimal transport theory, which is capable of transforming any optimal ITR into a fair ITR that ensures demographic parity. Recognizing the potential loss of value under fairness constraints, we introduce an "improved trade-off ITR," designed to balance value optimization and fairness while accommodating varying levels of fairness through parameter adjustment. To maximize the value of the improved trade-off ITR under specific fairness levels, we propose a smoothed fairness constraint for estimating the adjustable parameter. Additionally, we establish a theoretical upper bound on the value loss for the improved trade-off ITR. We demonstrate the performance of the proposed method through extensive simulation studies and an application to the Next 36 entrepreneurial program dataset.

Updated: 2025-07-31 08:56:03

标题: 最优传输学习：平衡个体化治疗规则中的价值优化和公平性

摘要: 个性化治疗规则（ITRs）由于在精准医学、拼车和广告推荐等领域的广泛应用而受到重视。然而，当ITRs受到种族、性别或年龄等敏感属性的影响时，可能导致某些群体被不公平地优待或劣势化。为了解决这一问题，我们提出了一种基于最优输运理论的灵活方法，能够将任何最优ITR转化为确保人口平等的公平ITR。为了平衡价值优化和公平性，并通过参数调整适应不同程度的公平性，我们引入了一种“改进的权衡ITR”。为了在特定公平水平下最大化改进的权衡ITR的价值，我们提出了一个用于估计可调参数的平滑公平性约束。此外，我们为改进的权衡ITR建立了一个理论上的价值损失上限。通过广泛的仿真研究和应用于Next 36企业家项目数据集，我们展示了所提出方法的性能。

更新时间: 2025-07-31 08:56:03

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.23349v1

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents' independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.

Updated: 2025-07-31 08:54:46

标题: SWE-Debate：面向软件问题解决的竞争性多智能体辩论

摘要: 问题解决在很大程度上得益于大型语言模型（LLM）先进的推理能力取得了显著进展。最近，像SWE-agent这样的基于代理的框架进一步推动了这一进展，使得能够独立、使用工具的代理能够解决复杂的软件工程任务。尽管现有基于代理的问题解决方法主要基于代理的独立探索，但它们经常陷入局部解决方案中，并且无法识别跨越代码库不同部分的问题模式。为了解决这一限制，我们提出了SWE-Debate，这是一个竞争性的多代理辩论框架，鼓励多样化的推理路径，并实现更多的问题定位。SWE-Debate首先通过遍历代码依赖图创建多个故障传播跟踪作为定位提案。然后，它组织了一场三轮辩论，每个专门代理代表沿着故障传播路径的不同推理观点。这种结构化竞争使代理能够协作一致地达成一个固定的修复计划。最后，这个固定的修复计划被集成到基于MCTS的代码修改代理中，用于生成补丁。在SWE-bench基准测试上的实验证明，SWE-Debate在开源代理框架中取得了新的最新技术成果，并且明显优于基线。

更新时间: 2025-07-31 08:54:46

领域: cs.SE,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.23348v1

Electricity Price Prediction Using Multi-Kernel Gaussian Process Regression Combined with Kernel-Based Support Vector Regression

This paper presents a new hybrid model for predicting German electricity prices. The algorithm is based on a combination of Gaussian Process Regression (GPR) and Support Vector Regression (SVR). Although GPR is a competent model for learning stochastic patterns within data and for interpolation, its performance for out-of-sample data is not very promising. By choosing a suitable data-dependent covariance function, we can enhance the performance of GPR for the German hourly power prices being tested. However, since the out-of-sample prediction is dependent on the training data, the prediction is vulnerable to noise and outliers. To overcome this issue, a separate prediction is calculated using SVR, which applies margin-based optimization. This method is advantageous when dealing with non-linear processes and outliers, since only certain necessary points (support vectors) in the training data are responsible for regression. The individual predictions are then linearly combined using uniform weights. When tested on historical German power prices, this approach outperforms the publicly available benchmarks, namely the LASSO-estimated autoregressive model and the deep neural network provided in the recent research by [1].

Updated: 2025-07-31 08:51:02

标题: 使用多核高斯过程回归结合基于核的支持向量回归进行电价预测

摘要: 这篇论文提出了一个新的混合模型，用于预测德国电力价格。该算法基于高斯过程回归（GPR）和支持向量回归（SVR）的组合。虽然GPR是一个能够学习数据中随机模式并进行插值的有效模型，但其在样本外数据上的表现并不理想。通过选择一个适当的数据相关协方差函数，我们可以增强GPR在德国每小时电力价格的测试中的表现。然而，由于样本外预测取决于训练数据，预测容易受到噪声和异常值的影响。为了解决这个问题，使用SVR计算了一个单独的预测，该方法应用基于间隔的优化。在处理非线性过程和异常值时，这种方法是有利的，因为只有训练数据中的某些必要点（支持向量）负责回归。然后，使用均匀权重将各个预测线性组合。在历史德国电力价格上进行测试时，这种方法优于公开可用的基准模型，即最近[1]提供的LASSO估计的自回归回归模型和深度神经网络研究中的模型。

更新时间: 2025-07-31 08:51:02

领域: cs.LG,math.PR,62M10(Primary), 62M20, 60G15, 62J05(Secondary)

下载: http://arxiv.org/abs/2412.00123v4
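
The last step of the hybrid model above, linearly combining the individual predictions with uniform weights, can be sketched as follows. This is only an illustration of the combination step, assuming the GPR and SVR forecasts have already been produced elsewhere; the helper name `combine_uniform` and the price values are hypothetical.

```python
def combine_uniform(*predictions):
    """Average several equal-length forecast vectors with uniform weights."""
    if not predictions:
        raise ValueError("need at least one forecast")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all forecasts must have the same length")
    k = len(predictions)
    # Each model gets weight 1/k at every forecast hour.
    return [sum(values) / k for values in zip(*predictions)]


# Made-up hourly prices in EUR/MWh, standing in for real model outputs.
gpr_forecast = [40.0, 42.0, 55.0]
svr_forecast = [44.0, 40.0, 51.0]
combined = combine_uniform(gpr_forecast, svr_forecast)
```

With two models, uniform weighting reduces to the simple mean of the two forecasts at each hour, so an outlier-driven error in one model is halved rather than passed through in full.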

Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM, Mantra-14B, with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset of 485K English and Hindi instruction samples, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance in both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research on under-represented and low-resource languages.

Updated: 2025-07-31 08:49:18

标题: 在增强原生性能的同时，通过文化和本地知识提升大型语言模型的多语言能力

摘要: 大型语言模型（LLMs）展示了卓越的能力，但它们的发展主要集中在英语和其他资源丰富的语言上，导致许多语言受到忽视。我们提出了我们最新的印地语-英语双语LLM \textbf{Mantra-14B}，在两种语言上的基准分数平均提高了约3\%，胜过了两倍大小的模型。我们使用了一个由48.5万个样本的英语和印地语指导数据组成的精心策划的数据集，指导调整模型，如Qwen-2.5-14B-Instruct和Phi-4，以提高在英语和印地语上的表现。我们的实验涵盖了七种不同参数大小的LLMs，以及超过140次训练尝试，其中包括不同的英语-印地语训练数据比例，证明了通过显著改进多语言性能而不损害本地性能是可能的。此外，我们的方法避免了资源密集型技术，如词汇扩展或架构修改，从而保持模型的规模较小。我们的结果表明，通过对具有文化和本地信息的数据进行适度调整，可以弥合性能差距，而不会产生重大的计算开销。我们发布我们的训练代码、数据集和模型，以帮助进一步研究欠代表和资源匮乏的语言。

更新时间: 2025-07-31 08:49:18

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.09753v3

Designing Dynamic Pricing for Bike-sharing Systems via Differentiable Agent-based Simulation

Bike-sharing systems are emerging in various cities as a new ecofriendly transportation system. In these systems, spatiotemporally varying user demands lead to imbalanced inventory at bicycle stations, resulting in additional relocation costs. Therefore, it is essential to manage user demand through optimal dynamic pricing for the system. However, optimal pricing design for such a system is challenging because the system involves users with diverse backgrounds and their probabilistic choices. To address this problem, we develop a differentiable agent-based simulation to rapidly design dynamic pricing in bike-sharing systems, achieving balanced bicycle inventory despite spatiotemporally heterogeneous trips and probabilistic user decisions. We first validate our approach against conventional methods through numerical experiments involving 25 bicycle stations and five time slots, yielding 100 parameters. Compared to the conventional methods, our approach obtains a more accurate solution with a 73% to 78% reduction in loss while achieving more than a 100-fold increase in convergence speed. We further validate our approach on a large-scale urban bike-sharing system scenario involving 289 bicycle stations, resulting in a total of 1156 parameters. Through simulations using the obtained pricing policies, we confirm that these policies can naturally induce balanced inventory without any manual relocation. Additionally, we find that the cost of discounts to induce the balanced inventory can be minimized by setting appropriate initial conditions.

Updated: 2025-07-31 08:43:54

标题: 通过可微分基于代理的模拟设计自行车共享系统的动态定价

摘要: 自行车共享系统正在各个城市兴起，作为一种新的环保交通系统。在这些系统中，时空变化的用户需求导致自行车站点的库存不平衡，导致额外的重新调配成本。因此，通过最佳动态定价管理用户需求对系统至关重要。然而，对于这样一个系统来说，最佳定价设计具有挑战性，因为系统涉及具有不同背景和概率选择的用户。为了解决这个问题，我们开发了一种可微分的基于代理的仿真方法，快速设计自行车共享系统的动态定价，实现平衡的自行车库存，尽管存在时空异质的行程和概率性用户决策。我们首先通过涉及25个自行车站和五个时间段的数值实验，得出100个参数，验证了我们的方法与传统方法的对比。与传统方法相比，我们的方法在获得更准确解的同时，损失减少了73%至78%，并实现收敛速度增加超过100倍。我们进一步验证了我们的方法在涉及289个自行车站的大规模城市自行车共享系统场景中，总共产生了1156个参数。通过使用获得的定价策略进行模拟，我们确认这些策略可以自然地诱导平衡库存，无需任何手动重新调配。此外，我们发现通过设置适当的初始条件，可以最大程度地减少诱导平衡库存的折扣成本。

更新时间: 2025-07-31 08:43:54

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2507.23344v1

DSBC : Data Science task Benchmarking with Context engineering

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.

Updated: 2025-07-31 08:32:37

标题: DSBC：具有上下文工程的数据科学任务基准测试

摘要: 最近对大型语言模型（LLMs）的进展显著影响了数据科学工作流程，导致专门设计用于自动化分析任务的数据科学代理的出现。尽管快速被采用，但评估这些代理的有效性和局限性的系统化基准测试仍然很少。在本文中，我们引入了一个专门设计的全面基准测试，以反映用户与数据科学代理的实际交互，通过观察我们商业应用的使用情况。我们评估了三种LLMs：Claude-4.0-Sonnet、Gemini-2.5-Flash和OpenAI-o4-Mini，跨三种方法进行评估：零-shot方法与上下文工程、多步方法与上下文工程，以及使用SmolAgent方法。我们的基准测试评估了八种数据科学任务类别的性能，此外还探讨了模型对常见提示问题（如数据泄露和略微模糊的指令）的敏感性。我们进一步研究了温度参数对每个模型和方法的整体和任务特定结果的影响。我们的研究结果显示了评估模型和方法之间的性能差异，突出了影响实际部署的关键因素。本文介绍的基准数据集和评估框架旨在为未来更强大和更有效的数据科学代理的研究奠定基础。

更新时间: 2025-07-31 08:32:37

领域: cs.AI,cs.CL,cs.MA

下载: http://arxiv.org/abs/2507.23336v1

Scalable and Precise Patch Robustness Certification for Deep Learning Models with Top-k Predictions

Patch robustness certification is an emerging verification approach for defending against adversarial patch attacks with provable guarantees for deep learning systems. Certified recovery techniques guarantee the prediction of the sole true label of a certified sample. However, existing techniques, if applicable to top-k predictions, commonly conduct pairwise comparisons on those votes between labels, failing to certify the sole true label within the top-k prediction labels precisely due to the inflation of the number of votes controlled by the attacker (i.e., attack budget); yet enumerating all combinations of vote allocation suffers from the combinatorial explosion problem. We propose CostCert, a novel, scalable, and precise voting-based certified recovery defender. CostCert verifies the true label of a sample within the top-k predictions without pairwise comparisons and combinatorial explosion through a novel design: it checks whether the attack budget on the sample is insufficient to cover the smallest total number of additional votes, on top of the votes uncontrollable by the attacker, needed to exclude the true label from the top-k prediction labels. Experiments show that CostCert significantly outperforms the current state-of-the-art defender PatchGuard, for example retaining up to 57.3% certified accuracy when the patch size is 96, whereas PatchGuard's has already dropped to zero.

Updated: 2025-07-31 08:31:59

标题: 深度学习模型的可扩展和精确的补丁稳健性认证与前k个预测

摘要: Patch robustness certification是一种新兴的验证方法，用于防御深度学习系统的对抗性贴片攻击，并提供可证明的保证。认证恢复技术保证认证样本的唯一真实标签的预测。然而，现有技术，如果适用于前k个预测，通常在标签之间进行两两比较，由于受攻击者控制的投票数量的增加（即攻击预算），未能精确地在前k个预测标签中认证唯一真实标签，然而列举所有的投票分配组合会遭受组合爆炸问题。我们提出了一种新颖的、可扩展的、精确的基于投票的认证恢复防御者CostCert。CostCert通过一种新颖的设计，在不进行两两比较和组合爆炸的情况下，在前k个预测中验证样本的真实标签：在攻击预算无法覆盖攻击者无法控制的投票之上，排除真实标签不在前k个预测标签中所需的最小额外投票数。实验证明，CostCert在当前最先进的防御者PatchGuard方面表现显著优于，例如，在贴片尺寸为96时，保持了高达57.3%的认证精度，而PatchGuard已经下降至零。

更新时间: 2025-07-31 08:31:59

领域: cs.LG,cs.SE

下载: http://arxiv.org/abs/2507.23335v1
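
The feasibility check described in the CostCert abstract can be illustrated with a simplified sketch. This is not the paper's algorithm: it assumes the attacker can only add votes to rival labels, that ties are resolved in favour of the true label, and that the votes outside the attacker's control are already known. The function names and the vote counts are hypothetical.

```python
def min_votes_to_exclude(safe_votes, true_label, k):
    """Fewest extra votes needed to push true_label out of the top k.

    safe_votes maps each label to the votes the attacker cannot control.
    Returns None when there are fewer than k rival labels, in which case
    the true label can never be pushed out.
    """
    target = safe_votes[true_label]
    # Cost for one rival to strictly outrank the true label (assumed tie rule).
    costs = sorted(
        max(0, target - votes + 1)
        for label, votes in safe_votes.items()
        if label != true_label
    )
    if len(costs) < k:
        return None
    # The cheapest attack lifts the k rivals that are closest to overtaking.
    return sum(costs[:k])


def certify_top_k(safe_votes, true_label, k, attack_budget):
    """True if the budget cannot cover the cheapest exclusion of true_label."""
    needed = min_votes_to_exclude(safe_votes, true_label, k)
    return needed is None or needed > attack_budget


# Made-up vote counts for illustration.
safe_votes = {"cat": 10, "dog": 7, "fox": 4, "owl": 1}
needed = min_votes_to_exclude(safe_votes, "cat", k=2)
```

Sorting the per-rival costs and summing the k smallest avoids both pairwise comparison of labels and enumeration of every vote allocation, which is the property the abstract highlights. The exact cost definition in the paper may differ from the one assumed here.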

FovEx: Human-Inspired Explanations for Vision Transformers and Convolutional Neural Networks

Explainability in artificial intelligence (XAI) remains a crucial aspect for fostering trust and understanding in machine learning models. Current visual explanation techniques, such as gradient-based or class-activation-based methods, often exhibit a strong dependence on specific model architectures. Conversely, perturbation-based methods, despite being model-agnostic, are computationally expensive as they require evaluating models on a large number of forward passes. In this work, we introduce Foveation-based Explanations (FovEx), a novel XAI method inspired by human vision. FovEx seamlessly integrates biologically inspired perturbations by iteratively creating foveated renderings of the image and combines them with gradient-based visual explorations to determine locations of interest efficiently. These locations are selected to maximize the performance of the model to be explained with respect to the downstream task and then combined to generate an attribution map. We provide a thorough evaluation with qualitative and quantitative assessments on established benchmarks. Our method achieves state-of-the-art performance on both transformers (on 4 out of 5 metrics) and convolutional models (on 3 out of 5 metrics), demonstrating its versatility among various architectures. Furthermore, we show the alignment between the explanation map produced by FovEx and human gaze patterns (+14% in NSS compared to RISE, +203% in NSS compared to GradCAM). This comparison enhances our confidence in FovEx's ability to close the interpretation gap between humans and machines.

Updated: 2025-07-31 08:31:44

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2408.02123v3
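The FovEx abstract reports alignment with human gaze in terms of NSS (Normalized Scanpath Saliency). As a rough illustration of what that metric measures, here is a minimal sketch of the common definition: standardize the attribution map to zero mean and unit standard deviation, then average the standardized values at the fixated pixels. The function name `nss`, the nested-list input format, and the use of the population standard deviation are assumptions made for this example; the abstract does not say which variant the authors use.

```python
from statistics import mean, pstdev


def nss(saliency, fixations):
    """Normalized Scanpath Saliency (illustrative sketch).

    saliency:  2D list of floats (the attribution / saliency map).
    fixations: list of (row, col) pixel positions where humans fixated.
    """
    values = [v for row in saliency for v in row]
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std; some tools use sample std
    if sigma == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    # Average the z-scored saliency at the fixated locations.
    return mean((saliency[r][c] - mu) / sigma for r, c in fixations)


if __name__ == "__main__":
    toy_map = [[0.0, 0.0], [0.0, 1.0]]
    print(nss(toy_map, [(1, 1)]))  # fixation on the salient pixel: positive score
    print(nss(toy_map, [(0, 0)]))  # fixation on a non-salient pixel: negative score
```

A higher NSS means the map assigns above-average saliency to the places people actually looked at, which is the sense in which the abstract's "+14% in NSS" should be read.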

MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.

Updated: 2025-07-31 08:31:05

Categories: cs.CL,cs.AI,cs.IR,cs.LG

Download: http://arxiv.org/abs/2507.23334v1

EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models

With the development of embodied artificial intelligence, end-to-end control policies such as Vision-Language-Action (VLA) models have become mainstream. Existing VLA models face expensive computing and storage costs, which need to be optimized. Quantization is considered the most effective method, as it can not only reduce the memory cost but also accelerate computation. However, we find that the token alignment of VLA models hinders the application of existing quantization methods. To address this, we propose an optimized framework called EaqVLA, which applies encoding-aligned quantization to VLA models. Specifically, we propose a complete analysis method to find the misalignment at various granularities. Based on the analysis results, we propose a mixed-precision quantization scheme that is aware of encoding alignment. Experiments show that the proposed EaqVLA achieves better quantization performance (with the minimal quantization loss for end-to-end action control and xxx times acceleration) than existing quantization methods.

Updated: 2025-07-31 08:30:45

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2505.21567v2

AI Must not be Fully Autonomous

Autonomous Artificial Intelligence (AI) has many benefits. It also has many risks. In this work, we identify the 3 levels of autonomous AI. We are of the position that AI must not be fully autonomous because of the many risks, especially as artificial superintelligence (ASI) is speculated to be just decades away. Fully autonomous AI, which can develop its own objectives, is at level 3 and without responsible human oversight. However, responsible human oversight is crucial for mitigating the risks. To argue for our position, we discuss theories of autonomy, AI and agents. Then, we offer 12 distinct arguments and 6 counterarguments with rebuttals to the counterarguments. We also present 15 pieces of recent evidence of AI misaligned values and other risks in the appendix.

Updated: 2025-07-31 08:22:49

Categories: cs.AI

Download: http://arxiv.org/abs/2507.23330v1

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results show that our method can achieve better performance and parameter efficiency compared to LoRA and several baselines.

Updated: 2025-07-31 08:13:13

Categories: cs.LG

Download: http://arxiv.org/abs/2501.08727v2

MVCNet: Multi-View Contrastive Network for Motor Imagery Classification

Electroencephalography (EEG)-based brain-computer interfaces (BCIs) enable neural interaction by decoding brain activity for external communication. Motor imagery (MI) decoding has received significant attention due to its intuitive mechanism. However, most existing models rely on single-stream architectures and overlook the multi-view nature of EEG signals, leading to limited performance and generalization. We propose a multi-view contrastive network (MVCNet), a dual-branch architecture that parallelly integrates CNN and Transformer blocks to capture both local spatial-temporal features and global temporal dependencies. To enhance the informativeness of training data, MVCNet incorporates a unified augmentation pipeline across time, frequency, and spatial domains. Two contrastive modules are further introduced: a cross-view contrastive module that enforces consistency of original and augmented views, and a cross-model contrastive module that aligns features extracted from both branches. Final representations are fused and jointly optimized by contrastive and classification losses. Experiments on five public MI datasets across three scenarios demonstrate that MVCNet consistently outperforms nine state-of-the-art MI decoding networks, highlighting its effectiveness and generalization ability. MVCNet provides a robust solution for MI decoding by integrating multi-view information and dual-branch modeling, contributing to the development of more reliable BCI systems.

Updated: 2025-07-31 08:10:59

Categories: eess.SP,cs.LG

Download: http://arxiv.org/abs/2502.17482v4

HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction

In breast cancer HER2 assessment, clinical evaluation relies on combined H&E and IHC images, yet acquiring both modalities is often hindered by clinical constraints and cost. We propose an adaptive bimodal prediction framework that flexibly supports single- or dual-modality inputs through two core innovations: a dynamic branch selector activating modality completion or joint inference based on input availability, and a cross-modal GAN (CM-GAN) enabling feature-space reconstruction of missing modalities. This design dramatically improves H&E-only accuracy from 71.44% to 94.25%, achieves 95.09% with full dual-modality inputs, and maintains 90.28% reliability under single-modality conditions. The "dual-modality preferred, single-modality compatible" architecture delivers near-dual-modality accuracy without mandatory synchronized acquisition, offering a cost-effective solution for resource-limited regions and significantly improving HER2 assessment accessibility.

Updated: 2025-07-31 07:57:18

Categories: cs.MM,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2506.10006v2

FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.

Updated: 2025-07-31 07:55:56

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.23318v1

Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner

Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). But conventional approaches rely on outcome-only rewards that provide sparse feedback, resulting in an inefficient optimization process. In this work, we investigate the function of process reward models (PRMs) to accelerate the RL training for LRMs. We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Specifically, instead of requiring PRMs to know how to solve problems, our method uses intrinsic signals in solutions to judge stepwise correctness and aggregate contiguous correct/incorrect steps into coherent 'thought' units. These structured, thought-level rewards enable more reliable credit assignment by reducing ambiguity in step segmentation and alleviating reward hacking. We further introduce a capability-adaptive reward mechanism that dynamically balances exploration and exploitation based on the LRM's current proficiency, guiding learning without stifling creative trial-and-error. These innovations are integrated into a new off-policy RL algorithm, TP-GRPO, which extends grouped proximal optimization with process-based rewards and improves training efficiency. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines. The results validate that well-structured process rewards can substantially accelerate LRM optimization in math reasoning tasks. Code is available at https://github.com/cs-holder/tp_grpo.

Updated: 2025-07-31 07:54:58

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23317v1

DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models

Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. However, they continue to suffer from significant hallucination issues, including object, attribute, and relational hallucinations. To accurately detect these hallucinations, we investigated the variations in cross-modal attention patterns between hallucination and non-hallucination states. Leveraging these distinctions, we developed a lightweight detector capable of identifying hallucinations. Our proposed method, Detecting Hallucinations by Cross-modal Attention Patterns (DHCP), is straightforward and does not require additional LVLM training or extra LVLM inference steps. Experimental results show that DHCP achieves remarkable performance in hallucination detection. By offering novel insights into the identification and analysis of hallucinations in LVLMs, DHCP contributes to advancing the reliability and trustworthiness of these models. The code is available at https://github.com/btzyd/DHCP.

Updated: 2025-07-31 07:54:00

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2411.18659v2

MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse

Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks. However, their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference. We observe that LRMs frequently produce highly similar intermediate reasoning steps, which correspond to similar KV cache states across layers. Motivated by this observation, we propose MemShare, a novel KV cache management approach that effectively reduces memory overhead. MemShare employs a collaborative filtering algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse to significantly reduce memory overhead and improve throughput while maintaining accuracy. Experimental results demonstrate that MemShare delivers up to 84.79\% improvement in throughput while maintaining better accuracy compared to existing KV cache management methods.

Updated: 2025-07-31 07:53:53

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.21433v2

Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification

Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess suitability for real-time applications, each model is evaluated not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping latency and memory cost low. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.

Updated: 2025-07-31 07:47:30

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.23315v1
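The abstract above singles out cosine learning rate decay as one of the settings that helped. For readers unfamiliar with the schedule, the sketch below shows the standard formula, lr(t) = lr_min + 0.5 (lr_max - lr_min) (1 + cos(pi t / T)). The function name `cosine_lr` and the example values are assumptions for illustration; the abstract gives neither the authors' exact schedule nor their warm-up settings.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` under cosine decay from lr_max down to lr_min."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at lr_min.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical run: 100 steps, peak learning rate 0.1.
    for step in (0, 25, 50, 75, 100):
        print(step, round(cosine_lr(step, 100, 0.1), 5))
```

The rate starts at `lr_max`, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches `lr_min`.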

An Interpretable Data-Driven Unsupervised Approach for the Prevention of Forgotten Items

Accurately identifying items forgotten during a supermarket visit and providing clear, interpretable explanations for recommending them remains an underexplored problem within the Next Basket Prediction (NBP) domain. Existing NBP approaches typically only focus on forecasting future purchases, without explicitly addressing the detection of unintentionally omitted items. This gap is partly due to the scarcity of real-world datasets that allow for the reliable estimation of forgotten items. Furthermore, most current NBP methods rely on black-box models, which lack transparency and limit the ability to justify recommendations to end users. In this paper, we formally introduce the forgotten item prediction task and propose two novel interpretable-by-design algorithms. These methods are tailored to identify forgotten items while offering intuitive, human-understandable explanations. Experiments on a real-world retail dataset show our algorithms outperform state-of-the-art NBP baselines by 10-15% across multiple evaluation metrics.

Updated: 2025-07-31 07:37:54

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23303v1

EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition

Emotion recognition using electroencephalography (EEG) signals has attracted increasing attention in recent years. However, existing methods often lack generalization in cross-corpus settings, where a model trained on one dataset is directly applied to another without retraining, due to differences in data distribution and recording conditions. To tackle the challenge of cross-corpus EEG-based emotion recognition, we propose a novel framework termed Soft Contrastive Masked Modeling (SCMM). Grounded in the theory of emotional continuity, SCMM integrates soft contrastive learning with a hybrid masking strategy to effectively capture emotion dynamics (refer to short-term continuity). Specifically, in the self-supervised learning stage, we propose a soft weighting mechanism that assigns similarity scores to sample pairs, enabling fine-grained modeling of emotional transitions and capturing the temporal continuity of human emotions. To further enhance representation learning, we design a similarity-aware aggregator that fuses complementary information from semantically related samples based on pairwise similarities, thereby improving feature expressiveness and reconstruction quality. This dual design contributes to a more discriminative and transferable representation, which is crucial for robust cross-corpus generalization. Extensive experiments on the SEED, SEED-IV, and DEAP datasets show that SCMM achieves state-of-the-art (SOTA) performance, outperforming the second-best method by an average accuracy of 4.26% under both same-class and different-class cross-corpus settings. The source code is available at https://github.com/Kyler-RL/SCMM.

Updated: 2025-07-31 07:35:24

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2408.09186v2

A Compute-Matched Re-Evaluation of TroVE on MATH

Reusing established theorems and formulas is central to mathematical problem solving, serving as essential building blocks for tackling increasingly complex challenges. Recent work, TroVE, argues that code-generating Large Language Models (LLMs) can benefit similarly on the MATH benchmark by inducing and reusing higher-level toolboxes. By allocating computational budget across an ensemble of three modes -- directly generating code, creating tools, and reusing tools -- TroVE claims to outperform a PRIMITIVE baseline that only performs direct generation. However, recent analysis (Berlot-Attwell et al., 2024) casts doubt on these gains, noting that the tools created are often trivial or rarely reused, suggesting that improvements may stem from self-consistency or self-correction. In this work, we re-evaluate TroVE on MATH, analyze the impact of each of its modes, and show that its benefit does not come from these mechanisms, but simply from a higher computational budget spent for TroVE compared to PRIMITIVE. To this end, we also perform a small correction in the original implementation of TroVE's selection mechanism, boosting TroVE's performance on MATH by 3\% in accuracy. After matching for compute, the benefit of TroVE reduces to a marginal improvement of 1\%, suggesting that this toolbox approach does not provide a significant benefit on MATH.

Updated: 2025-07-31 07:33:11

Categories: cs.PL,cs.AI

Download: http://arxiv.org/abs/2507.22069v2

Simulation-based inference for Precision Neutrino Physics through Neural Monte Carlo tuning

Precise modeling of detector energy response is crucial for next-generation neutrino experiments which present computational challenges due to lack of analytical likelihoods. We propose a solution using neural likelihood estimation within the simulation-based inference framework. We develop two complementary neural density estimators that model likelihoods of calibration data: conditional normalizing flows and a transformer-based regressor. We adopt JUNO - a large neutrino experiment - as a case study. The energy response of JUNO depends on several parameters, all of which should be tuned, given their non-linear behavior and strong correlations in the calibration data. To this end, we integrate the modeled likelihoods with Bayesian nested sampling for parameter inference, achieving uncertainties limited only by statistics with near-zero systematic biases. The normalizing flows model enables unbinned likelihood analysis, while the transformer provides an efficient binned alternative. By providing both options, our framework offers flexibility to choose the most appropriate method for specific needs. Finally, our approach establishes a template for similar applications across experimental neutrino and broader particle physics.

Updated: 2025-07-31 07:33:05

Categories: physics.data-an,cs.LG,hep-ex,hep-ph,physics.ins-det

Download: http://arxiv.org/abs/2507.23297v1

SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enable complex models to be immediately streamable, mitigate a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.

Updated: 2025-07-31 07:10:39

Categories: cs.LG,cs.CL,cs.PL,cs.SE,eess.AS

Download: http://arxiv.org/abs/2507.23292v1
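The central idea in the SequenceLayers abstract, a layer that exposes explicit state plus a step method whose output matches the layer-wise call, can be illustrated with a toy example. The sketch below is plain Python and is not the SequenceLayers API: the class `CausalWindowSum` and its methods `initial_state`, `layer`, and `step` are hypothetical names chosen for this illustration. The layer computes a causal sliding-window sum, and its state is the buffer of the most recent inputs, playing the role of the "convolution buffer" mentioned in the abstract.

```python
class CausalWindowSum:
    """Toy streamable layer: y[t] = sum of the last `window` inputs up to t."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be >= 1")
        self.window = window

    def initial_state(self):
        # Buffer of the previous (window - 1) inputs, zero-padded at the start.
        return [0.0] * (self.window - 1)

    def layer(self, xs):
        """Stateless, whole-sequence invocation."""
        out = []
        for t in range(len(xs)):
            start = max(0, t - self.window + 1)
            out.append(sum(xs[start:t + 1]))
        return out

    def step(self, x, state):
        """Process one timestep; return (output, new_state)."""
        buf = state + [x]
        return sum(buf), buf[1:]


if __name__ == "__main__":
    layer = CausalWindowSum(window=3)
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]

    state = layer.initial_state()
    streamed = []
    for x in xs:
        y, state = layer.step(x, state)
        streamed.append(y)

    # The contract described in the abstract: both execution modes agree.
    assert streamed == layer.layer(xs)
    print(streamed)
```

Checking step-wise against layer-wise output in this way is the kind of test the abstract describes for each layer; a real implementation would apply it to batched tensors rather than Python lists.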

Evaluating the Dynamics of Membership Privacy in Deep Learning

Membership inference attacks (MIAs) pose a critical threat to the privacy of training data in deep learning. Despite significant progress in attack methodologies, our understanding of when and how models encode membership information during training remains limited. This paper presents a dynamic analytical framework for dissecting and quantifying privacy leakage dynamics at the individual sample level. By tracking per-sample vulnerabilities on an FPR-TPR plane throughout training, our framework systematically measures how factors such as dataset complexity, model architecture, and optimizer choice influence the rate and severity at which samples become vulnerable. Crucially, we discover a robust correlation between a sample's intrinsic learning difficulty and its privacy risk, and find that the privacy risk of samples highly vulnerable in the final trained model is largely determined early during training. Our results thus provide a deeper understanding of how privacy risks dynamically emerge during training, laying the groundwork for proactive, privacy-aware model training strategies.

Updated: 2025-07-31 07:09:52

标题: 评估深度学习中成员隐私动态化

摘要: 成员推理攻击（MIAs）对深度学习中的训练数据的隐私构成了重大威胁。尽管在攻击方法方面取得了显著进展，但我们对模型在训练过程中何时以及如何编码成员信息的理解仍然有限。本文提出了一个动态分析框架，用于解剖和量化个别样本级别的隐私泄漏动态。通过在整个训练过程中跟踪FPR-TPR平面上每个样本的漏洞，我们的框架系统地测量了数据集复杂性、模型架构和优化器选择等因素如何影响样本变得容易受攻击的速率和严重程度。关键的是，我们发现了样本固有学习难度与隐私风险之间的稳健相关性，并发现在训练早期就基本确定了最终训练模型中高度容易受攻击的样本的隐私风险。因此，我们的结果提供了对隐私风险如何在训练过程中动态出现的更深入理解，为主动、隐私意识的模型训练策略奠定了基础。

更新时间: 2025-07-31 07:09:52

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.23291v1

When Words Smile: Generating Diverse Emotional Facial Expressions from Text

Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text-3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method significantly outperforms baselines across multiple evaluation metrics, marking a significant advancement in the field.

Updated: 2025-07-31 07:07:51

标题: 当文字微笑：从文本生成多样的情绪面部表情

摘要: 使数字人能够表达丰富的情感在对话系统、游戏和其他交互场景中具有重要应用。尽管最近在说话头部合成方面取得了令人印象深刻的唇部同步结果，但它们往往忽视了面部表情的丰富和动态特性。为了填补这一关键空白，我们引入了一种专门关注情感动态的端到端文本到表达模型。我们的模型在连续的潜在空间中学习表达性的面部变化，并生成多样、流畅和情感连贯的表情。为支持这一任务，我们引入了EmoAva，一个包含15,000个文本-3D表达对的大规模高质量数据集。对现有数据集和EmoAva进行了大量实验，结果表明我们的方法在多个评估指标上明显优于基线，标志着该领域的重大进步。

更新时间: 2025-07-31 07:07:51

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2412.02508v3

Enhancing AI System Resiliency: Formulation and Guarantee for LSTM Resilience Based on Control Theory

This paper proposes a novel theoretical framework for guaranteeing and evaluating the resilience of long short-term memory (LSTM) networks in control systems. We introduce "recovery time" as a new metric of resilience in order to quantify the time required for an LSTM to return to its normal state after anomalous inputs. By mathematically refining incremental input-to-state stability ($\delta$ISS) theory for LSTM, we derive a practical data-independent upper bound on recovery time. This upper bound enables resilience-aware training. Experimental validation on simple models demonstrates the effectiveness of our resilience estimation and control methods, laying a foundation for rigorous quality assurance in safety-critical AI applications.
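
To make the notion of a recovery-time bound concrete, the sketch below works through a generic contraction argument. It is not the bound derived in the paper. It assumes, purely for illustration, that once the anomalous input stops, the gap between the perturbed and nominal states decays as d_t <= c * rho**t * d_0 with 0 < rho < 1. The constants `c` and `rho`, the tolerance `eps`, and the function name are hypothetical. Under that assumption the smallest t with c * rho**t * d_0 <= eps is ceil(ln(eps / (c * d_0)) / ln(rho)).

```python
import math


def recovery_time_bound(d0, eps, rho, c=1.0):
    """Smallest integer t with c * rho**t * d0 <= eps (assumed geometric decay).

    d0:  state deviation when the anomalous input ends.
    eps: tolerance defining "back to normal".
    rho: assumed contraction rate per step, 0 < rho < 1.
    c:   assumed overshoot constant, c > 0.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if d0 <= 0.0 or eps <= 0.0 or c <= 0.0:
        raise ValueError("d0, eps and c must be positive")
    if c * d0 <= eps:
        return 0  # already within tolerance
    # ln(rho) is negative, so dividing flips the inequality into a lower bound on t.
    return math.ceil(math.log(eps / (c * d0)) / math.log(rho))


steps = recovery_time_bound(d0=1.0, eps=0.01, rho=0.5)
```

A smaller `rho` shortens the bound, which gives some intuition for why a training procedure might penalize quantities that control the contraction rate.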

Updated: 2025-07-31 07:06:39

标题: 增强AI系统的韧性：基于控制理论的LSTM韧性的制定与保证

摘要: 本文提出了一个新颖的理论框架，用于保证和评估控制系统中长短期记忆（LSTM）网络的弹性。我们引入了“恢复时间”作为弹性的新度量标准，以便量化LSTM在异常输入后恢复到正常状态所需的时间。通过对LSTM的增量输入状态稳定性（$\delta$ISS）理论进行数学精炼，我们得出了恢复时间的实用数据无关上限。这个上限为我们提供了具有弹性意识的训练。对简单模型的实验验证证明了我们的弹性估计和控制方法的有效性，为安全关键AI应用的严格质量保证奠定了基础。

更新时间: 2025-07-31 07:06:39

领域: cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2505.17696v3

Measuring Harmfulness of Computer-Using Agents

Computer-using agents (CUAs), which autonomously control computers to perform multi-step actions, might pose significant safety risks if misused. Existing benchmarks mostly evaluate language models' (LMs) safety risks in chatbots or simple tool-usage scenarios, without granting full computer access. To better evaluate CUAs' misuse risks, we introduce a new benchmark: CUAHarm. CUAHarm consists of 104 expert-written realistic misuse risks, such as disabling firewalls, leaking confidential information, launching denial-of-service attacks, or installing backdoors. We provide a sandbox environment and rule-based verifiable rewards to measure CUAs' success rates in executing these tasks (e.g., whether the firewall is indeed disabled), not just refusal. We evaluate multiple frontier open-source and proprietary LMs, such as Claude Sonnet, GPT-4o, Gemini Pro 1.5, Llama-3.3-70B, and Mistral Large 2. Surprisingly, even without carefully designed jailbreaking prompts, these frontier LMs comply with executing these malicious tasks at a high success rate (e.g., 59% for Claude 3.7 Sonnet). Newer models show higher misuse rates: Claude 3.7 Sonnet succeeds on 15% more tasks than Claude 3.5. While these models are robust to common malicious prompts (e.g., creating a bomb) in chatbot settings, they behave unsafely as CUAs. We further evaluate a leading agentic framework (UI-TARS-1.5) and find that while it improves performance, it also amplifies misuse risks. Benign variants reveal that refusals stem from alignment, not capability limits. To mitigate risks, we explore using LMs to monitor CUAs' actions and chains of thought (CoTs). Monitoring CUAs is significantly harder than monitoring chatbot outputs. Monitoring CoTs yields modest gains, with average detection accuracy at only 72%. Even with hierarchical summarization, improvement is limited to 4%. CUAHarm will be released at https://github.com/db-ol/CUAHarm.

Updated: 2025-07-31 07:02:19

标题: 衡量计算机使用代理的有害程度

摘要: 计算机使用代理（CUAs）可以自主控制计算机执行多步操作，如果被滥用可能会带来显著的安全风险。现有的基准大多评估语言模型（LMs）在聊天机器人或简单工具使用场景中的安全风险，而没有授予完整的计算机访问权限。为了更好地评估CUAs的滥用风险，我们引入了一个新的基准：CUAHarm。CUAHarm包含104个由专家编写的现实滥用风险，例如禁用防火墙、泄露机密信息、发动拒绝服务攻击或安装后门。我们提供了一个沙箱环境和基于规则的可验证奖励，以衡量CUAs在执行这些任务时的成功率（例如，防火墙是否确实被禁用），而不仅仅是拒绝。我们评估了多个前沿的开源和专有LMs，如Claude Sonnet、GPT-4o、Gemini Pro 1.5、Llama-3.3-70B和Mistral Large 2。令人惊讶的是，即使没有经过精心设计的越狱提示，这些前沿LMs在执行这些恶意任务时也能够以很高的成功率（例如，Claude 3.7 Sonnet的成功率为59%）。新模型显示出更高的滥用率：Claude 3.7 Sonnet在比Claude 3.5多成功15%的任务。虽然这些模型对于常见的恶意提示（例如制造炸弹）在聊天机器人设置中表现出鲁棒性，但它们作为CUAs时表现得不安全。我们进一步评估了一个领先的代理框架（UI-TARS-1.5），发现它虽然提高了性能，但也放大了滥用风险。良性变体显示出拒绝源于对齐而不是能力限制。为了减轻风险，我们探讨使用LMs监控CUAs的行动和思维链（CoTs）。监控CUAs比监控聊天机器人的输出更具挑战性。监控CoTs带来了适度的收益，平均检测准确率仅为72%。即使使用层次总结，改进也仅限于4%。CUAHarm将在https://github.com/db-ol/CUAHarm发布。

更新时间: 2025-07-31 07:02:19

领域: cs.CR,cs.AI,I.2.7; K.6.5

下载: http://arxiv.org/abs/2508.00935v1

Tailored Forecasting from Short Time Series via Meta-learning

Machine learning models can effectively forecast dynamical systems from time-series data, but they typically require large amounts of past data, making forecasting particularly challenging for systems with limited history. To overcome this, we introduce Meta-learning for Tailored Forecasting using Related Time Series (METAFORS), which generalizes knowledge across systems to enable forecasting in data-limited scenarios. By learning from a library of models trained on longer time series from potentially related systems, METAFORS builds and initializes a model tailored to short time-series data from the system of interest. Using a reservoir computing implementation and testing on simulated chaotic systems, we demonstrate that METAFORS can reliably predict both short-term dynamics and long-term statistics without requiring contextual labels. We see this even when test and related systems exhibit substantially different behaviors, highlighting METAFORS' strengths in data-limited scenarios.

Updated: 2025-07-31 06:57:11

标题: 通过元学习定制短时间序列预测

摘要: 机器学习模型可以有效地从时间序列数据中预测动态系统，但通常需要大量的过去数据，这使得对历史有限的系统进行预测尤为具有挑战性。为了克服这一问题，我们引入了一种名为METAFORS（Meta-learning for Tailored Forecasting using Related Time Series）的方法，通过在系统之间泛化知识，使得在数据有限的情况下能够进行预测。通过从一个训练有关联系统的更长时间序列的模型库中学习，METAFORS构建并初始化了一个针对感兴趣系统的短时间序列数据的模型。通过使用嵌入式计算实现并在模拟混沌系统上进行测试，我们展示了METAFORS可以可靠地预测短期动态和长期统计数据，而无需上下文标签。即使测试和相关系统表现出明显不同的行为，METAFORS在数据有限的情况下也具有优势。

更新时间: 2025-07-31 06:57:11

领域: cs.LG,nlin.CD,physics.comp-ph

下载: http://arxiv.org/abs/2501.16325v2

Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability

Real-world applications of KBQA require models to handle unanswerable questions with a limited volume of in-domain labeled training data. We propose the novel task of few-shot transfer for KBQA with unanswerable questions and contribute two new datasets for performance evaluation. We present FUn-FuSIC - a novel solution for our task that extends FuSIC-KBQA, the state-of-the-art few-shot transfer model for answerable-only KBQA. We first note that FuSIC-KBQA's iterative repair makes a strong assumption that all questions are answerable. As a remedy, we propose Feedback for Unanswerability (FUn), which performs iterative repair using feedback from a suite of strong and weak verifiers, and an adaptation of self-consistency for unanswerability to better assess the answerability of a question. Our experiments show that FUn-FuSIC significantly outperforms suitable adaptations of multiple LLM-based and supervised SoTA models on our task, while establishing a new SoTA for answerable few-shot transfer as well.

Updated: 2025-07-31 06:53:46

标题: 使用弱验证器进行迭代修复，用于KBQA中的少样本迁移问题及无法回答问题

摘要: KBQA的真实应用需要模型处理有限数量领域内标记的训练数据中无法回答的问题。我们提出了一种新颖的任务，即带有无法回答问题的KBQA的少样本迁移，并贡献了两个新的数据集用于性能评估。我们提出了FUn-FuSIC - 一种针对我们任务的新颖解决方案，扩展了FuSIC KBQA，这是一种适用于仅可回答的KBQA的最先进的少样本迁移模型。我们首先指出FuSIC-KBQA的迭代修复假设所有问题都无法回答。为了补救这一点，我们提出了用于无法回答的反馈（FUn），它利用来自一组强和弱验证器的反馈进行迭代修复，并对无法回答的自洽性进行调整，以更好地评估问题的可回答性。我们的实验表明，FUn-FuSIC在我们的任务中明显优于多个LLM基于和监督的SoTA模型的适当调整，同时为可回答的少样本迁移建立了新的SoTA。

更新时间: 2025-07-31 06:53:46

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.14313v3

Hybrid LSTM-Transformer Models for Profiling Highway-Railway Grade Crossings

Hump crossings, or high-profile Highway Railway Grade Crossings (HRGCs), pose safety risks to highway vehicles due to potential hang-ups. These crossings typically result from post-construction railway track maintenance activities or non-compliance with design guidelines for HRGC vertical alignments. Conventional methods for measuring HRGC profiles are costly, time-consuming, traffic-disruptive, and present safety challenges. To address these issues, this research employed advanced, cost-effective techniques and innovative modeling approaches for HRGC profile measurement. A novel hybrid deep learning framework combining Long Short-Term Memory (LSTM) and Transformer architectures was developed by utilizing instrumentation and ground truth data. Instrumentation data were gathered using a highway testing vehicle equipped with Inertial Measurement Unit (IMU) and Global Positioning System (GPS) sensors, while ground truth data were obtained via an industrial-standard walking profiler. Field data were collected at the Red Rock Railroad Corridor in Oklahoma. Three advanced deep learning models, Transformer-LSTM sequential (model 1), LSTM-Transformer sequential (model 2), and LSTM-Transformer parallel (model 3), were evaluated to identify the most efficient architecture. Models 2 and 3 outperformed model 1 and were deployed to generate 2D/3D HRGC profiles. The deep learning models demonstrated significant potential to enhance highway and railroad safety by enabling rapid and accurate assessment of HRGC hang-up susceptibility.

Updated: 2025-07-31 06:44:44

标题: 混合LSTM-Transformer模型用于对公路铁路道口进行分类

摘要: 驼峰穿越，或高凸起的公路铁路平交道口（HRGCs），由于潜在的卡顿而对公路车辆造成安全风险。这些穿越通常是由于施工后的铁路轨道维护活动或对HRGC垂直对准设计准则的不遵守而导致的。传统的测量HRGC轮廓的方法成本高昂、耗时、干扰交通，并存在安全挑战。为了解决这些问题，本研究采用先进、经济有效的技术和创新的建模方法来测量HRGC轮廓。通过利用仪器和地面真实数据，开发了一种新颖的混合深度学习框架，结合了长短期记忆（LSTM）和变压器架构。使用装备有惯性测量单元（IMU）和全球定位系统（GPS）传感器的公路测试车收集了仪器数据，而地面真实数据是通过工业标准的步行轮廓测量仪获得的。在俄克拉荷马州的Red Rock铁路走廊收集了现场数据。评估了三种先进的深度学习模型Transformer-LSTM顺序（模型1）、LSTM-Transformer顺序（模型2）和LSTM-Transformer并行（模型3），以确定最有效的架构。模型2和3表现优于其他模型，并被部署用于生成2D/3D HRGC轮廓。深度学习模型展示了通过快速准确评估HRGC卡顿易感性来增强公路和铁路安全的显著潜力。

更新时间: 2025-07-31 06:44:44

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.00039v1

An Efficient Intelligent Semi-Automated Warehouse Inventory Stocktaking System

In the context of evolving supply chain management, the significance of efficient inventory management has grown substantially for businesses. However, conventional manual and experience-based approaches often struggle to meet the complexities of modern market demands. This research introduces an intelligent inventory management system to address challenges related to inaccurate data, delayed monitoring, and overreliance on subjective experience in forecasting. The proposed system integrates bar code and distributed flutter application technologies for intelligent perception, alongside comprehensive big data analytics to enable data-driven decision-making. Through meticulous analysis, system design, critical technology exploration, and simulation validation, the effectiveness of the proposed system is successfully demonstrated. The intelligent system facilitates second-level monitoring, high-frequency checks, and artificial intelligence-driven forecasting, consequently enhancing the automation, precision, and intelligence of inventory management. This system contributes to cost reduction and optimized inventory sizes through accurate predictions and informed decisions, ultimately achieving a mutually beneficial scenario. The outcomes of this research offer

Updated: 2025-07-31 06:44:31

标题: 一个高效的智能半自动化仓库库存盘点系统

摘要: 在不断发展的供应链管理背景下，高效的库存管理对企业的重要性显著增加。然而，传统的手工和经验导向的方法往往难以满足现代市场需求的复杂性。本研究引入了一种智能库存管理系统，以解决与数据不准确、监测延迟和过度依赖主观经验的预测相关的挑战。所提出的系统整合了条形码和分布式Flutter应用技术，用于智能感知，结合全面的大数据分析，实现数据驱动的决策。通过细致的分析、系统设计、关键技术探索和模拟验证，成功证明了所提出系统的有效性。这一智能系统促进了二级监测、高频检查和人工智能驱动的预测，从而提高了库存管理的自动化、精度和智能化。该系统通过准确的预测和明智的决策，有助于降低成本和优化库存规模，最终实现了一个双方受益的局面。本研究的成果提供了

更新时间: 2025-07-31 06:44:31

领域: cs.HC,cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2309.12365v3

AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95\% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

Updated: 2025-07-31 06:40:31

标题: AutoSchemaKG: 通过从Web规模语料库中动态引入模式构建自主知识图 (Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora)

摘要: 我们提出了AutoSchemaKG，这是一个用于完全自主知识图谱构建的框架，消除了对预定义模式的需求。我们的系统利用大型语言模型同时从文本中提取知识三元组并诱导全面的模式，对实体和事件进行建模，同时采用概念化将实例组织成语义类别。通过处理超过5000万个文档，我们构建了ATLAS（自动三元链接和模式归纳），一个包含9亿多个节点和59亿条边的知识图谱系列。这种方法在多跳QA任务上表现优于现有技术基线，并增强了LLM的事实性。值得注意的是，我们的模式归纳实现了与人工制定模式95%的语义对齐，完全无需人工干预，证明了具有动态诱导模式的十亿规模知识图谱可以有效地补充大型语言模型中的参数化知识。

更新时间: 2025-07-31 06:40:31

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2505.23628v2

Insights into Closed-form IPM-GAN Discriminator Guidance for Diffusion Modeling

Diffusion models are a state-of-the-art generative modeling framework that transforms noise to images via Langevin sampling, guided by the score, which is the gradient of the logarithm of the data distribution. Recent works have shown empirically that the generation quality can be improved when guided by a classifier network, which is typically the discriminator trained in a generative adversarial network (GAN) setting. In this paper, we propose a theoretical framework to analyze the effect of the GAN discriminator on Langevin-based sampling, and show that the IPM-GAN optimization can be seen as one of smoothed score-matching, wherein the scores of the data and the generator distributions are convolved with the kernel function associated with the IPM. The proposed approach serves to unify score-based training and optimization of IPM-GANs. Based on these insights, we demonstrate that closed-form kernel-based discriminator guidance results in improvements (in terms of CLIP-FID and KID metrics) when applied atop baseline diffusion models. We demonstrate these results on the denoising diffusion implicit model (DDIM) and latent diffusion model (LDM) settings on various standard datasets. We also show that the proposed approach can be combined with existing accelerated-diffusion techniques to improve latent-space image generation.
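
A closed-form kernel-based discriminator can be illustrated with the witness function of the maximum mean discrepancy (MMD), a standard kernel IPM. The abstract does not state which kernel or formula the authors use, so the sketch below is an assumption-laden toy in one dimension. It uses a Gaussian kernel, defines the witness as the mean kernel similarity to data samples minus the mean similarity to generated samples, and takes one gradient step along it. The diffusion score term is deliberately omitted, and all function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    # k(x, y) = exp(-(x - y)^2 / (2 sigma^2))
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def witness(x, data, generated, sigma=1.0):
    """MMD witness (up to scaling): mean k(x, data) - mean k(x, generated)."""
    d = sum(gaussian_kernel(x, y, sigma) for y in data) / len(data)
    g = sum(gaussian_kernel(x, y, sigma) for y in generated) / len(generated)
    return d - g


def witness_grad(x, data, generated, sigma=1.0):
    """Closed-form derivative of the witness with respect to x."""
    def grad_mean(points):
        return sum(
            -(x - y) / sigma ** 2 * gaussian_kernel(x, y, sigma) for y in points
        ) / len(points)
    return grad_mean(data) - grad_mean(generated)


def guided_step(x, data, generated, step=0.1, sigma=1.0):
    # Guidance-only update: move x uphill on the witness, i.e. toward regions
    # where data is denser than generated samples. No score term is included.
    return x + step * witness_grad(x, data, generated, sigma)


x_new = guided_step(1.0, data=[2.0], generated=[0.0])
```

Because the witness is a finite sum of kernel evaluations, its gradient is available in closed form and no discriminator network needs to be trained, which is the practical appeal of this kind of guidance.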

Updated: 2025-07-31 06:38:19

标题: 对封闭形式IPM-GAN鉴别器指导的扩散建模洞察

摘要: 扩散模型是一种最先进的生成建模框架，通过Langevin采样将噪声转化为图像，由得分指导，得分是数据分布对数的梯度。最近的研究实证表明，当由分类器网络引导时，生成质量可以得到提高，这通常是在生成对抗网络（GAN）设置中训练的鉴别器。在本文中，我们提出了一个理论框架来分析GAN鉴别器对基于Langevin采样的影响，并展示了IPM-GAN优化可以被视为平滑的得分匹配之一，在其中，数据和生成器分布的分数与与IPM相关的核函数卷积。所提出的方法可统一基于得分的训练和IPM-GAN的优化。基于这些见解，我们展示了闭式核函数导向的鉴别器指导，在基线扩散模型上应用时，在CLIP-FID和KID指标方面取得了改进。我们在各种标准数据集上展示了这些结果，包括去噪扩散隐式模型（DDIM）和潜变扩散模型（LDM）设置。我们还展示了所提出的方法可以与现有的加速扩散技术相结合，以改善潜在空间图像生成。

更新时间: 2025-07-31 06:38:19

领域: cs.LG,cs.CV,stat.ML

下载: http://arxiv.org/abs/2306.01654v2

How Far Are AI Scientists from Changing the World?

The emergence of large language models (LLMs) is propelling automated scientific discovery to the next level, with LLM-based Artificial Intelligence (AI) Scientist systems now taking the lead in scientific research. Several influential works have already appeared in the field of AI Scientist systems, with AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist capable of uncovering phenomena previously unknown to humans may soon become a reality. In this survey, we focus on the central question: How far are AI scientists from changing the world and reshaping the scientific research paradigm? To answer this question, we provide a prospect-driven review that comprehensively analyzes the current achievements of AI Scientist systems, identifying key bottlenecks and the critical components required for the emergence of a scientific agent capable of producing ground-breaking discoveries that solve grand challenges. We hope this survey will contribute to a clearer understanding of the limitations of current AI Scientist systems, showing where we are, what is missing, and what the ultimate goals for scientific AI should be.

Updated: 2025-07-31 06:32:06

标题: AI科学家离改变世界还有多远？

摘要: 大型语言模型（LLM）的出现推动了自动化科学发现迈向新的水平，基于LLM的人工智能（AI）科学家系统目前正引领科学研究。在AI科学家系统领域已经出现了几部具有影响力的作品，AI生成的研究论文已经被接受到ICLR 2025研讨会上，这表明一个能够揭示以前人类不知道的现象的人类水平的AI科学家可能很快就会成为现实。在这项调查中，我们着重探讨一个核心问题：AI科学家离改变世界和重塑科学研究范式有多远？为了回答这个问题，我们提供了一个前景驱动的回顾，全面分析了AI科学家系统目前的成就，确定了关键的瓶颈和需要出现的关键组件，以便出现能够产生突破性发现并解决重大挑战的科学代理人。我们希望这项调查将有助于更清晰地了解当前AI科学家系统的局限性，展示我们目前所处的位置，缺少什么，以及科学人工智能的最终目标应该是什么。

更新时间: 2025-07-31 06:32:06

领域: cs.AI

下载: http://arxiv.org/abs/2507.23276v1

CEE: An Inference-Time Jailbreak Defense for Embodied Intelligence via Subspace Concept Rotation

Large Language Models (LLMs) are increasingly becoming the cognitive core of Embodied Intelligence (EI) systems, such as robots and autonomous vehicles. However, this integration also exposes them to serious jailbreak risks, where malicious instructions can be transformed into dangerous physical actions. Existing defense mechanisms suffer from notable drawbacks--including high training costs, significant inference delays, and complex hyperparameter tuning--which limit their practical applicability. To address these challenges, we propose a novel and efficient inference-time defense framework: Concept Enhancement Engineering (CEE). CEE enhances the model's inherent safety mechanisms by directly manipulating its internal representations, requiring neither additional training nor external modules, thereby improving defense efficiency. Furthermore, CEE introduces a rotation-based control mechanism that enables stable and linearly tunable behavioral control of the model. This design eliminates the need for tedious manual tuning and avoids the output degradation issues commonly observed in other representation engineering methods. Extensive experiments across multiple EI safety benchmarks and diverse attack scenarios demonstrate that CEE significantly improves the defense success rates of various multimodal LLMs. It effectively mitigates safety risks while preserving high-quality generation and inference efficiency, offering a promising solution for deploying safer embodied intelligence systems.
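
The abstract does not describe the rotation mechanism in detail, so the following is only a geometric sketch of what rotating a representation within a concept subspace could look like. It assumes a hidden vector `h` and two orthonormal directions `u` and `v` spanning a two-dimensional subspace (for example, a safety-related concept direction and a second reference direction). All of these names are hypothetical. The component of `h` inside the plane is rotated by an angle `theta`, and everything outside the plane is left untouched.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate_in_plane(h, u, v, theta):
    """Rotate h by theta inside span{u, v}; u and v must be orthonormal."""
    a, b = dot(h, u), dot(h, v)  # coordinates of h in the plane
    a_rot = a * math.cos(theta) - b * math.sin(theta)
    b_rot = a * math.sin(theta) + b * math.cos(theta)
    # Replace the in-plane coordinates and keep the orthogonal remainder as is.
    return [
        hi + (a_rot - a) * ui + (b_rot - b) * vi
        for hi, ui, vi in zip(h, u, v)
    ]


h = [1.0, 0.0, 5.0]
u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]
h_rot = rotate_in_plane(h, u, v, math.pi / 2)
```

Unlike adding a scaled steering vector, a rotation preserves the norm of the representation, which is one plausible reason a rotation angle might behave as a stable, smoothly tunable control knob.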

Updated: 2025-07-31 06:22:28

标题: CEE：通过子空间概念旋转的推断时间越狱防御策略，以增强智能

摘要: 大型语言模型（LLMs）越来越成为具有体现智能（EI）系统的认知核心，例如机器人和自动驾驶车辆。然而，这种整合也使它们暴露于严重的越狱风险中，恶意指令可以转化为危险的物理行为。现有的防御机制存在明显的缺陷，包括高训练成本、显著的推断延迟和复杂的超参数调整，这限制了它们的实际适用性。为了解决这些挑战，我们提出了一种新颖且高效的推断时防御框架：概念增强工程（CEE）。CEE通过直接操纵模型的内部表示来增强其固有的安全机制，无需额外的训练或外部模块，从而提高防御效率。此外，CEE引入了基于旋转的控制机制，可以实现模型的稳定和线性可调的行为控制。这种设计消除了繁琐的手动调整的需要，并避免了其他表示工程方法中常见的输出退化问题。在多个EI安全基准和不同攻击场景下的广泛实验表明，CEE显著提高了各种多模式LLMs的防御成功率。它有效地减轻了安全风险，同时保持高质量的生成和推断效率，为部署更安全的体现智能系统提供了一个有前途的解决方案。

更新时间: 2025-07-31 06:22:28

领域: cs.CR,cs.LG,cs.MA

下载: http://arxiv.org/abs/2504.13201v2

Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2

Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
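
The three slice-wise tracking strategies named above differ only in the order in which slices are visited. The sketch below generates those orders for a volume with `num_slices` slices. It assumes index 0 is the top slice, and it treats center-outward propagation as two passes that both start at the prompted slice. Neither detail is stated in the abstract, and the function name and arguments are hypothetical. No SAM2 calls are made here.

```python
def propagation_passes(num_slices, strategy, prompt_slice=None):
    """Return a list of passes; each pass is a list of slice indices in visit order."""
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    if strategy == "top_to_bottom":
        return [list(range(num_slices))]
    if strategy == "bottom_to_top":
        return [list(range(num_slices - 1, -1, -1))]
    if strategy == "center_outward":
        # Default to the middle slice when no prompted slice is given.
        center = num_slices // 2 if prompt_slice is None else prompt_slice
        if not 0 <= center < num_slices:
            raise ValueError("prompt_slice out of range")
        upward = list(range(center, -1, -1))      # center toward index 0
        downward = list(range(center, num_slices))  # center toward the last slice
        return [upward, downward]
    raise ValueError(f"unknown strategy: {strategy}")


passes = propagation_passes(5, "center_outward")
```

Starting from the center keeps the longest tracking chain at roughly half the volume depth, so drift has fewer slices over which to accumulate than in a single end-to-end pass.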

Updated: 2025-07-31 06:15:44

标题: 朝着使用SAM2进行3D乳腺MRI肿瘤分割和可视化的经济实惠方向

摘要: 乳腺MRI提供了关键的高分辨率体积成像，对于肿瘤评估和治疗规划至关重要，然而对3D扫描的手动解释仍然需要大量劳动力且具有主观性。虽然基于AI的工具有望加速医学图像分析，但由于高许可成本、专有软件和基础设施需求，商业医学AI产品在低收入和中等收入国家的采用仍受限制。在这项工作中，我们调查了Segment Anything Model 2（SAM2）是否能够适应低成本、最少输入的乳腺MRI中的3D肿瘤分割。使用一个切片上的单个边界框注释，我们通过三种不同的逐片追踪策略在3D体积中传播分割预测：自上而下、自下而上和自中心向外。我们评估了这些策略在大量患者中的效果，并发现自中心向外的传播可以产生最一致和准确的分割。尽管SAM2是一种未经过训练用于体积医学数据的零样本模型，但在最少监督下实现了强大的分割性能。我们进一步分析了分割性能与肿瘤大小、位置和形状之间的关系，识别出关键的失败模式。我们的结果表明，诸如SAM2这样的通用基础模型可以支持具有最少监督的3D医学图像分析，为资源受限的环境提供了一种可访问和经济的替代方案。

更新时间: 2025-07-31 06:15:44

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.23272v1

A Privacy-Preserving DAO Model Using NFT Authentication for the Punishment not Reward Blockchain Architecture

This paper presents a decentralised autonomous organisation (DAO) model that uses non-fungible tokens (NFTs) for identity management and privacy-preserving interactions within a Punishment not Reward (PnR) blockchain mechanism. The proposed model introduces a dual NFT architecture deployed on Layer 2 networks: Membership NFTs ($NFT_{auth}$) for authentication and access control and Interaction NFTs ($NFT_{priv}$) for private interactions among participants. Our Layer 2 implementation achieves 97\% gas cost reduction while maintaining security through cross-chain mechanisms. The identity management system incorporates decentralised KYC processes and Sybil attack resistance using soulbound token characteristics. Governance operates through smart contracts that manage reputation and administer punitive measures, including conditional identity disclosure for forensic purposes when misconduct is detected.

Updated: 2025-07-31 06:14:02

标题: 一个利用NFT认证实现隐私保护的DAO模型，用于惩罚而不是奖励的区块链架构

摘要: 这篇论文提出了一种分散自治组织（DAO）模型，该模型使用非同质化代币（NFTs）进行身份管理和隐私保护交互，用于Punishment not Reward（PnR）区块链机制。所提出的模型在Layer 2网络上部署了双重NFT架构：Membership NFTs（$NFT_{auth}$）用于认证和访问控制，Interaction NFTs（$NFT_{priv}$）用于参与者之间的私密交互。我们的Layer 2实现实现了97\%的燃气成本降低，同时通过跨链机制保持安全性。身份管理系统整合了分散式KYC流程和Soulbound代币特性来抵抗Sybil攻击。治理通过智能合约进行，管理声誉并执行惩罚措施，包括在发现不端行为时进行有条件的身份披露。治理通过智能合约进行，管理声誉并执行惩罚措施，包括在检测到不端行为时进行有条件的身份披露。

更新时间: 2025-07-31 06:14:02

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2405.13156v2

XABPs: Towards eXplainable Autonomous Business Processes

Autonomous business processes (ABPs), i.e., self-executing workflows leveraging AI/ML, have the potential to improve operational efficiency, reduce errors, lower costs, improve response times, and free human workers for more strategic and creative work. However, ABPs may raise specific concerns including decreased stakeholder trust, difficulties in debugging, hindered accountability, risk of bias, and issues with regulatory compliance. We argue for eXplainable ABPs (XABPs) to address these concerns by enabling systems to articulate their rationale. The paper outlines a systematic approach to XABPs, characterizing their forms, structuring explainability, and identifying key BPM research challenges towards XABPs.

Updated: 2025-07-31 06:10:49

标题: XABPs：朝向可解释的自主业务流程

摘要: 自主业务流程（ABPs），即利用人工智能/机器学习的自执行工作流程，有潜力提高运营效率，减少错误，降低成本，改善响应时间，并使人类工作者能够从事更具战略性和创造性的工作。然而，ABPs可能引发特定的担忧，包括降低利益相关者的信任，调试困难，责任分散，偏见风险以及与监管合规性有关的问题。我们提倡采用可解释的ABPs（XABPs）来解决这些问题，使系统能够表达其理由。该论文概述了一种系统化的XABPs方法，描述了它们的形式，构建可解释性，并确定了朝向XABPs的关键BPM研究挑战。

更新时间: 2025-07-31 06:10:49

领域: cs.SE,cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.23269v1

SmartPNT-MSF: A Multi-Sensor Fusion Dataset for Positioning and Navigation Research

High-precision navigation and positioning systems are critical for applications in autonomous vehicles and mobile mapping, where robust and continuous localization is essential. To test and enhance the performance of algorithms, some research institutions and companies have successively constructed and publicly released datasets. However, existing datasets still suffer from limitations in sensor diversity and environmental coverage. To address these shortcomings and advance development in related fields, the SmartPNT Multisource Integrated Navigation, Positioning, and Attitude Dataset has been developed. This dataset integrates data from multiple sensors, including Global Navigation Satellite Systems (GNSS), Inertial Measurement Units (IMU), optical cameras, and LiDAR, to provide a rich and versatile resource for research in multi-sensor fusion and high-precision navigation. The dataset construction process is thoroughly documented, encompassing sensor configurations, coordinate system definitions, and calibration procedures for both cameras and LiDAR. A standardized framework for data collection and processing ensures consistency and scalability, enabling large-scale analysis. Validation using state-of-the-art Simultaneous Localization and Mapping (SLAM) algorithms, such as VINS-Mono and LIO-SAM, demonstrates the dataset's applicability for advanced navigation research. Covering a wide range of real-world scenarios, including urban areas, campuses, tunnels, and suburban environments, the dataset offers a valuable tool for advancing navigation technologies and addressing challenges in complex environments. By providing a publicly accessible, high-quality dataset, this work aims to bridge gaps in sensor diversity, data accessibility, and environmental representation, fostering further innovation in the field.

Updated: 2025-07-31 05:59:58

标题: 智能PNT-MSF：用于定位和导航研究的多传感器融合数据集

摘要: 高精度导航和定位系统对于自动驾驶车辆和移动地图等应用至关重要，其中强大和持续的定位至关重要。为了测试和提升算法的性能，一些研究机构和公司已经陆续构建并公开发布了数据集。然而，现有数据集仍然存在传感器多样性和环境覆盖方面的局限性。为了解决这些不足并推进相关领域的发展，SmartPNT多源集成导航、定位和姿态数据集应运而生。该数据集整合了来自多种传感器的数据，包括全球导航卫星系统（GNSS）、惯性测量单元（IMU）、光学摄像头和激光雷达，为多传感器融合和高精度导航研究提供了丰富多样的资源。数据集构建过程有着详细的文档记录，包括传感器配置、坐标系统定义以及摄像头和激光雷达的校准程序。一个标准化的数据收集和处理框架确保了一致性和可扩展性，实现了大规模分析。利用最先进的同时定位和建图（SLAM）算法，如VINS-Mono和LIO-SAM进行验证，展示了数据集在高级导航研究中的适用性。覆盖了城市区域、校园、隧道和郊区环境等各种真实场景，该数据集为推动导航技术和解决复杂环境中的挑战提供了有价值的工具。通过提供公开可访问的高质量数据集，本研究旨在弥合传感器多样性、数据可访问性和环境表征方面的差距，促进该领域的进一步创新。

更新时间: 2025-07-31 05:59:58

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.19079v2

DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent System

Current multi-agent systems (MAS) frameworks often rely on manually designed and static collaboration graph structures, limiting adaptability and performance. To address these limitations, we propose DynaSwarm, a dynamic framework that enhances LLM-based MAS through two key innovations: (1) an actor-critic reinforcement learning (A2C) mechanism to optimize graph structures with improved stability over prior RL methods, and (2) a dynamic graph selector that adaptively chooses the optimal graph structure for each input sample via parameter-efficient LLM fine-tuning. DynaSwarm eliminates the need for rigid, one-size-fits-all graph architectures, instead leveraging sample-specific idiosyncrasies to dynamically route queries through specialized agent networks. We also propose to fine-tune the demonstration retriever to fully exploit the power of in-context learning (ICL). Extensive experiments on question answering, mathematical reasoning, and coding tasks demonstrate that DynaSwarm consistently outperforms state-of-the-art single-agent and MAS baselines across multiple LLM backbones. Our findings highlight the importance of sample-aware structural flexibility in LLM MAS designs.

Updated: 2025-07-31 05:52:30

标题: DynaSwarm：基于LLM的多智能体系统动态图结构选择

摘要: 目前多智能体系统（MAS）框架通常依赖手动设计和静态的协作图结构，限制了适应性和性能。为了解决这些限制，我们提出了DynaSwarm，这是一个通过两项关键创新增强LLM-based MAS的动态框架：（1）一个演员-评论家强化学习（A2C）机制，通过优化图结构提高了对比之前RL方法的稳定性；以及（2）一个动态图选择器，通过参数高效的LLM微调为每个输入样本自适应选择最佳的图结构。DynaSwarm消除了对刚性、一刀切的图结构架构的需求，而是利用样本特定的特点动态地将查询路由通过专门的智能体网络。我们提议对演示检索器进行微调，充分利用上下文学习（ICL）的能力。在问答、数学推理和编码任务上进行的大量实验表明，DynaSwarm在多个LLM骨干模型上始终优于最先进的单智能体和MAS基线。我们的发现突显了在LLM MAS设计中具有样本感知结构灵活性的重要性。

更新时间: 2025-07-31 05:52:30

领域: cs.LG,cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.23261v1

CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions. They offer limited insight into LLMs' abilities to generate code that strictly follows users' instructions in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating the instruction-following capabilities of LLMs in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with the real-world software development requirements, which can be independently and objectively validated through specified test cases, facilitating the evaluation of instruction-following capability in multi-turn interactions. In both \textit{Static Conversation} and \textit{Dynamic Conversation} settings, we evaluate the performance of 7 state-of-the-art LLMs and summarize the important factors influencing the instruction-following ability of LLMs in multi-turn interactions, as well as potential directions for improvement.

Updated: 2025-07-31 05:49:44

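Title: CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Abstract: Large Language Models (LLMs) have shown excellent performance on code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks mainly evaluate the functional correctness of LLM-generated code in single-turn interactions. They give limited insight into how well LLMs generate code that strictly follows user instructions in multi-turn scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating the instruction-following capabilities of LLMs in interactive code generation. Specifically, CodeIF-Bench contains nine types of verifiable instructions aligned with real-world software development requirements, each of which can be validated independently and objectively through specified test cases, which supports the evaluation of instruction following in multi-turn interactions. In both the Static Conversation and Dynamic Conversation settings, we evaluate 7 state-of-the-art LLMs and summarize the key factors that influence instruction following in multi-turn interactions, along with potential directions for improvement.

Updated: 2025-07-31 05:49:44

Categories: cs.SE,cs.AI,cs.PL

Download: http://arxiv.org/abs/2503.22688v3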

How Cybersecurity Behaviors affect the Success of Darknet Drug Vendors: A Quantitative Analysis

Understanding behavioral drivers of success in illicit digital marketplaces is critical for developing effective enforcement strategies and understanding digital commerce evolution, as darknet drug markets represent a growing share of the total drug economy. This study employs quantitative regression analysis of 50,000+ listings from 2,653 vendors in the Agora marketplace (2014-2015), examining relationships between cybersecurity signaling (PGP encryption mentions), product diversification, and commercial success through nested regression specifications controlling for reputation, pricing, and category-specific factors. Product diversification emerges as the dominant predictor of vendor scale, increasing the odds of large vendor status by 169% per additional category, while PGP encryption signaling functions primarily as a professional marker rather than an independent success factor. Vendor success depends on portfolio breadth rather than specialization, with category-specific enforcement creating differential market constraints. Successful vendors operate as diversified enterprises capable of rapid pivoting between product categories, requiring targeted enforcement towards diversified vendors based on coordinated multi-category enforcement approaches rather than traditional substance-specific targeting strategies.
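The "169% per additional category" figure can be unpacked with a little arithmetic. The sketch below assumes the effect comes from a logistic regression, so a 169% increase in odds corresponds to an odds ratio of about 2.69; the 10% baseline probability is a hypothetical value chosen only for illustration and does not come from the study.

```python
import math


def odds_ratio_from_percent_increase(pct_increase: float) -> float:
    """A p% increase in odds corresponds to an odds ratio of 1 + p/100."""
    return 1.0 + pct_increase / 100.0


def shifted_probability(baseline_prob: float, odds_ratio: float, extra_units: int = 1) -> float:
    """Apply the odds ratio `extra_units` times to a baseline probability."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio ** extra_units
    return new_odds / (1.0 + new_odds)


# Reported effect: +169% odds of large-vendor status per additional category.
odds_ratio = odds_ratio_from_percent_increase(169.0)

# Under the logistic-regression assumption, the coefficient is the log of the odds ratio.
log_odds_coefficient = math.log(odds_ratio)

# Hypothetical baseline (not from the paper): a 10% chance of being a large vendor.
prob_with_one_more_category = shifted_probability(0.10, odds_ratio, extra_units=1)
prob_with_two_more_categories = shifted_probability(0.10, odds_ratio, extra_units=2)
```

Note that an odds ratio multiplies odds, not probabilities, so the change in probability depends on the baseline chosen.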

Updated: 2025-07-31 05:45:07

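Title: How Cybersecurity Behaviors Affect the Success of Darknet Drug Vendors: A Quantitative Analysis

Abstract: Understanding the behavioral drivers of success in illicit digital marketplaces is critical for developing effective enforcement strategies and for understanding how digital commerce evolves, because darknet drug markets account for a growing share of the total drug economy. This study applies quantitative regression analysis to more than 50,000 listings from 2,653 vendors on the Agora marketplace (2014-2015). It examines the relationships between cybersecurity signaling (mentions of PGP encryption), product diversification, and commercial success, using nested regression specifications that control for reputation, pricing, and category-specific factors. Product diversification emerges as the dominant predictor of vendor scale: each additional category increases the odds of large-vendor status by 169%. PGP encryption signaling functions mainly as a professional marker rather than as an independent success factor. Vendor success depends on portfolio breadth rather than specialization, and category-specific enforcement creates differential market constraints. Successful vendors operate as diversified enterprises that can pivot quickly between product categories, so enforcement aimed at them calls for coordinated multi-category approaches rather than traditional substance-specific targeting.

Updated: 2025-07-31 05:45:07

Categories: cs.CR,cs.CY,62P25, 91B24, 62J05,G.3; K.6.5; J.4

Download: http://arxiv.org/abs/2508.00934v1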

AI Should Sense Better, Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift

Current AI advances largely rely on scaling neural models and expanding training datasets to achieve generalization and robustness. Despite notable successes, this paradigm incurs significant environmental, economic, and ethical costs, limiting sustainability and equitable access. Inspired by biological sensory systems, where adaptation occurs dynamically at the input (e.g., adjusting pupil size, refocusing vision), we advocate for adaptive sensing as a necessary and foundational shift. Adaptive sensing proactively modulates sensor parameters (e.g., exposure, sensitivity, multimodal configurations) at the input level, significantly mitigating covariate shifts and improving efficiency. Empirical evidence from recent studies demonstrates that adaptive sensing enables small models (e.g., EfficientNet-B0) to surpass substantially larger models (e.g., OpenCLIP-H) trained with significantly more data and compute. We (i) outline a roadmap for broadly integrating adaptive sensing into real-world applications spanning humanoid, healthcare, autonomous systems, agriculture, and environmental monitoring, (ii) critically assess technical and ethical integration challenges, and (iii) propose targeted research directions, such as standardized benchmarks, real-time adaptive algorithms, multimodal integration, and privacy-preserving methods. Collectively, these efforts aim to transition the AI community toward sustainable, robust, and equitable artificial intelligence systems.

Updated: 2025-07-31 05:44:36

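Title: AI Should Sense Better, Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift

Abstract: Current progress in AI depends largely on scaling neural models and expanding training datasets to achieve generalization and robustness. Despite notable successes, this paradigm carries significant environmental, economic, and ethical costs, which limits sustainability and equitable access. Inspired by biological sensory systems, where adaptation happens dynamically at the input (for example, adjusting pupil size or refocusing vision), we argue that adaptive sensing is a necessary and foundational shift. Adaptive sensing proactively modulates sensor parameters (for example, exposure, sensitivity, and multimodal configurations) at the input level, which significantly mitigates covariate shift and improves efficiency. Empirical evidence from recent studies shows that adaptive sensing lets small models (for example, EfficientNet-B0) surpass much larger models (for example, OpenCLIP-H) trained with far more data and compute. We (i) outline a roadmap for integrating adaptive sensing broadly into real-world applications spanning humanoid robots, healthcare, autonomous systems, agriculture, and environmental monitoring, (ii) critically assess the technical and ethical challenges of integration, and (iii) propose targeted research directions such as standardized benchmarks, real-time adaptive algorithms, multimodal integration, and privacy-preserving methods. Together, these efforts aim to move the AI community toward sustainable, robust, and equitable AI systems.

Updated: 2025-07-31 05:44:36

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.07820v2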

Efficient Machine Unlearning via Influence Approximation

Due to growing privacy concerns, machine unlearning, which aims at enabling machine learning models to "forget" specific training data, has received increasing attention. Among existing methods, influence-based unlearning has emerged as a prominent approach due to its ability to estimate the impact of individual training samples on model parameters without retraining. However, this approach suffers from prohibitive computational overhead arising from the necessity to compute the Hessian matrix and its inverse across all training samples and parameters, rendering it impractical for large-scale models and scenarios involving frequent data deletion requests. This highlights the difficulty of forgetting. Inspired by cognitive science, which suggests that memorizing is easier than forgetting, this paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). This connection allows machine unlearning to be addressed from the perspective of incremental learning. Unlike the time-consuming Hessian computations in unlearning (forgetting), incremental learning (memorizing) typically relies on more efficient gradient optimization, which supports the aforementioned cognitive theory. Based on this connection, we introduce the Influence Approximation Unlearning (IAU) algorithm for efficient machine unlearning from the incremental perspective. Extensive empirical evaluations demonstrate that IAU achieves a superior balance among removal guarantee, unlearning efficiency, and comparable model utility, while outperforming state-of-the-art methods across diverse datasets and model architectures. Our code is available at https://github.com/Lolo1222/IAU.

Updated: 2025-07-31 05:34:27

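Title: Efficient Machine Unlearning via Influence Approximation

Abstract: Because of growing privacy concerns, machine unlearning, which aims to make machine learning models "forget" specific training data, has attracted increasing attention. Among existing methods, influence-based unlearning has become prominent because it can estimate the impact of individual training samples on model parameters without retraining. However, the approach is held back by the cost of computing the Hessian matrix and its inverse across all training samples and parameters, which makes it impractical for large-scale models and for settings with frequent data deletion requests. This highlights how difficult forgetting is. Inspired by cognitive science, which suggests that memorizing is easier than forgetting, this paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). The link allows machine unlearning to be approached from the perspective of incremental learning. Unlike the time-consuming Hessian computations in unlearning, incremental learning typically relies on more efficient gradient optimization, which is consistent with the cognitive theory above. Building on this link, we propose the Influence Approximation Unlearning (IAU) algorithm for efficient machine unlearning from the incremental perspective. Extensive empirical evaluations show that IAU achieves a superior balance among removal guarantee, unlearning efficiency, and comparable model utility, while outperforming state-of-the-art methods across diverse datasets and model architectures. Our code is available at https://github.com/Lolo1222/IAU.

Updated: 2025-07-31 05:34:27

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.23257v1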

ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe learns a well-calibrated probabilistic model of the system and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics, while enforcing pessimism w.r.t. the safety constraints. Under regularity assumptions on the constraints and dynamics, we show that ActSafe guarantees safety during learning while also obtaining a near-optimal policy in finite time. In addition, we propose a practical variant of ActSafe that builds on latest model-based RL advancements and enables safe exploration even in high-dimensional settings such as visual control. We empirically show that ActSafe obtains state-of-the-art performance in difficult exploration tasks on standard safe deep RL benchmarks while ensuring safety during learning.

Updated: 2025-07-31 05:17:12

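Title: ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Abstract: Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents need extensive and potentially unsafe interaction with their environments to learn effectively. These limitations confine RL agents to simulated environments and hinder their ability to learn directly in the real world. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe learns a well-calibrated probabilistic model of the system and plans optimistically with respect to the epistemic uncertainty about the unknown dynamics, while enforcing pessimism with respect to the safety constraints. Under regularity assumptions on the constraints and dynamics, we show that ActSafe guarantees safety during learning while also obtaining a near-optimal policy in finite time. In addition, we propose a practical variant of ActSafe that builds on the latest advances in model-based RL and enables safe exploration even in high-dimensional settings such as visual control. We show empirically that ActSafe achieves state-of-the-art performance on difficult exploration tasks in standard safe deep RL benchmarks while ensuring safety during learning.

Updated: 2025-07-31 05:17:12

Categories: cs.LG,cs.RO

Download: http://arxiv.org/abs/2410.09486v3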

Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Bengali is an underrepresented language in NLP research, and it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) on 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy: models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.

Updated: 2025-07-31 05:16:43

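Title: Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Abstract: Bengali is an underrepresented language in NLP research, and it remains a challenge because of its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance, focusing on the absence of standardized evaluation benchmarks. We then evaluate 10 recent open-source Large Language Models (LLMs) on 8 of the translated datasets and carry out a comprehensive error analysis to identify their main failure modes. Our results show consistent performance gaps for Bengali compared with English, especially for smaller models and for specific model families such as Mistral. We also find promising robustness in certain architectures, such as DeepSeek, which maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy: models tend to perform worse when inputs are excessively tokenized, while more efficient and concise tokenization leads to better performance. These findings highlight critical areas where current models fall short and underline the need for better dataset quality and evaluation methodologies tailored to multilingual contexts. This work will encourage further research on NLP for underrepresented languages and help broaden access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.

Updated: 2025-07-31 05:16:43

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.23248v1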

GrokAlign: Geometric Characterisation and Acceleration of Grokking

A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena like delayed generalisation with the transition of a deep network from a linear to a feature learning regime, and emergent robustness with changes to the network's functional geometry, in particular the arrangement of the so-called linear regions in deep networks employing continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and demonstrate that aligning a network's Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks -- a method we introduce as GrokAlign -- which we show empirically to induce grokking much sooner than more conventional regularizers like weight decay. Moreover, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage (https://thomaswalker1.github.io/blog/grokalign.html) and code (https://github.com/ThomasWalker1/grokalign).

Updated: 2025-07-31 05:15:21

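Title: GrokAlign: Geometric Characterisation and Acceleration of Grokking

Abstract: A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena such as delayed generalisation with the transition of a deep network from a linear regime to a feature learning regime, and emergent robustness with changes to the network's functional geometry, in particular the arrangement of the so-called linear regions in deep networks that use continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and show that aligning a network's Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results give strong theoretical motivation for using Jacobian regularisation when optimizing deep networks, a method we introduce as GrokAlign, and we show empirically that it induces grokking much sooner than more conventional regularizers such as weight decay. In addition, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage (https://thomaswalker1.github.io/blog/grokalign.html) and code (https://github.com/ThomasWalker1/grokalign).

Updated: 2025-07-31 05:15:21

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2506.12284v2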

Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with RL-QR (multi-modal) achieving an 11% relative gain in NDCG@3 for multi-modal RAG and RL-QR (lexical) yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR's potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
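Since the reported gains are stated in NDCG@3, a small sketch of that metric may help. It assumes the common linear-gain form of DCG with a log2 rank discount, and it normalizes against the ideal ordering of the retrieved list only, which is a simplification. The relevance lists are made-up examples, not data from the paper.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_gain(new_score: float, old_score: float) -> float:
    """Relative improvement, e.g. 0.11 for an 11% relative gain."""
    return (new_score - old_score) / old_score


# Hypothetical relevance labels for the top-3 results of one query.
original_query_ranking = [0, 1, 1]   # the best document is missing from rank 1
rewritten_query_ranking = [1, 0, 1]  # the rewrite surfaces a relevant document first

ndcg_before = ndcg_at_k(original_query_ranking, 3)
ndcg_after = ndcg_at_k(rewritten_query_ranking, 3)
gain = relative_gain(ndcg_after, ndcg_before)
```

In an RL setup like the one described, a score of this kind could plausibly serve as the reward for a rewritten query, although the abstract does not specify the exact reward.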

Updated: 2025-07-31 04:55:21

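Title: Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Abstract: Retrieval-Augmented Generation (RAG) systems depend heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that removes the need for human-annotated datasets and extends to both text-only and multi-modal databases. By synthesizing scenario-question pairs and using Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, improving retrieval performance across varied domains. Experiments on industrial in-house data show that RL-QR (multi-modal) achieves an 11% relative gain in NDCG@3 for multi-modal RAG, and RL-QR (lexical) yields a 9% gain for lexical retrievers. However, challenges remain for semantic and hybrid retrievers, where the rewriters failed to improve performance, likely because of training misalignment. Our findings highlight the potential of RL-QR for query optimization in RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks while identifying avenues for further refinement in semantic retrieval settings.

Updated: 2025-07-31 04:55:21

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.23242v1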

Framing Political Bias in Multilingual LLMs Across Pakistani Languages

Large Language Models (LLMs) increasingly shape public discourse, yet most evaluations of political and economic bias have focused on high-resource, Western languages and contexts. This leaves critical blind spots in low-resource, multilingual regions such as Pakistan, where linguistic identity is closely tied to political, religious, and regional ideologies. We present a systematic evaluation of political bias in 13 state-of-the-art LLMs across five Pakistani languages: Urdu, Punjabi, Sindhi, Pashto, and Balochi. Our framework integrates a culturally adapted Political Compass Test (PCT) with multi-level framing analysis, capturing both ideological stance (economic/social axes) and stylistic framing (content, tone, emphasis). Prompts are aligned with 11 socio-political themes specific to the Pakistani context. Results show that while LLMs predominantly reflect liberal-left orientations consistent with Western training data, they exhibit more authoritarian framing in regional languages, highlighting language-conditioned ideological modulation. We also identify consistent model-specific bias patterns across languages. These findings show the need for culturally grounded, multilingual bias auditing frameworks in global NLP.

Updated: 2025-07-31 04:41:18

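Title: Framing Political Bias in Multilingual LLMs Across Pakistani Languages

Abstract: Large Language Models (LLMs) increasingly shape public discourse, yet most evaluations of political and economic bias have focused on high-resource Western languages and contexts. This leaves critical blind spots in low-resource, multilingual regions such as Pakistan, where linguistic identity is closely tied to political, religious, and regional ideologies. We present a systematic evaluation of political bias in 13 state-of-the-art LLMs across five Pakistani languages: Urdu, Punjabi, Sindhi, Pashto, and Balochi. Our framework combines a culturally adapted Political Compass Test (PCT) with multi-level framing analysis, capturing both ideological stance (economic and social axes) and stylistic framing (content, tone, emphasis). Prompts are aligned with 11 socio-political themes specific to the Pakistani context. The results show that although LLMs mainly reflect liberal-left orientations consistent with Western training data, they exhibit more authoritarian framing in regional languages, which points to language-conditioned ideological modulation. We also identify model-specific bias patterns that are consistent across languages. These findings show the need for culturally grounded, multilingual bias auditing frameworks in global NLP.

Updated: 2025-07-31 04:41:18

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.00068v2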

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identify 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
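The trigger, predicate, and enforcement structure described above can be illustrated with a toy runtime check. This is a hypothetical Python sketch of the general idea only: it is not AgentSpec's actual DSL syntax or API, and the names `Rule`, `apply_rules`, and `block` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]


@dataclass
class Rule:
    """Hypothetical rule: fires when the trigger matches and all predicates hold."""
    name: str
    trigger: str                                  # action type that activates the rule
    predicates: List[Callable[[Action], bool]]    # conditions checked on the action
    enforce: Callable[[Action], Action]           # returns the action to run instead


def block(action: Action) -> Action:
    """Enforcement that replaces the proposed action with a no-op."""
    return {"type": "noop", "reason": f"blocked: {action['type']}"}


def apply_rules(action: Action, rules: List[Rule]) -> Action:
    """Check a proposed agent action against each rule before it is executed."""
    for rule in rules:
        if rule.trigger == action["type"] and all(p(action) for p in rule.predicates):
            return rule.enforce(action)
    return action  # no rule fired, so the action passes through unchanged


# Hypothetical rule: stop shell commands that contain a recursive delete.
rules = [
    Rule(
        name="no_recursive_delete",
        trigger="shell",
        predicates=[lambda a: "rm -rf" in a.get("command", "")],
        enforce=block,
    )
]

allowed = apply_rules({"type": "shell", "command": "ls"}, rules)
stopped = apply_rules({"type": "shell", "command": "rm -rf /tmp/x"}, rules)
```

Keeping the check as a plain lookup over declarative rules is one way such enforcement could stay cheap at runtime, which is in the spirit of the millisecond-level overhead the abstract reports.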

Updated: 2025-07-31 04:00:48

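Title: AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Abstract: Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that include triggers, predicates, and enforcement mechanisms, ensuring that agents operate within predefined safety boundaries. We implement AgentSpec in multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in the millisecond range. By combining interpretability, modularity, and efficiency, AgentSpec offers a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate rule generation with LLMs and assess the effectiveness of the generated rules. Our evaluation shows that rules generated by OpenAI o1 achieve a precision of 95.56% and a recall of 70.96% for embodied agents, identify 87.26% of the risky code, and prevent AVs from breaking the law in 5 out of 8 scenarios.

Updated: 2025-07-31 04:00:48

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.18666v3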

Accumulator-Aware Post-Training Quantization for Large Language Models

When quantizing weights and activations to increasingly narrow representations, the cost of additions begins to dominate that of multiplications in multiply-accumulate (MAC) units. Recent studies show that reducing addition costs via low-precision accumulation improves throughput, power, and area across inference platforms, albeit with an increased risk of overflow. Accumulator-aware quantization research has so far only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models and datasets continue to grow in size, QAT techniques become increasingly expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To bridge this gap, we introduce AXE, the first accumulator-aware quantization framework explicitly designed to endow PTQ algorithms with overflow avoidance guarantees. We present theoretical motivation for AXE and demonstrate its flexibility by implementing it on top of two existing algorithms: GPFQ and OPTQ. We design AXE to support multi-stage accumulation, opening the door to full datapath optimization for the first time. We evaluate AXE using recent language generation models; when quantizing Llama3 8B for a 16-bit multi-stage accumulation datapath, AXE maintains up to 98% of the FP16 perplexity, surpassing naive bit width manipulation by up to 15%.
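To make the overflow risk concrete, the sketch below checks a standard worst-case bound for a single-stage integer dot product: if the L1 norm of the weights times the largest input magnitude fits in the signed accumulator, overflow cannot occur. This is a generic illustration under that assumption and is not claimed to be the exact constraint AXE enforces; the function names are hypothetical.

```python
from typing import Sequence


def l1_budget(acc_bits: int, input_bits: int, signed_inputs: bool = False) -> int:
    """Largest weight L1 norm that cannot overflow a signed `acc_bits` accumulator."""
    # Largest magnitude an input can take for the given integer format.
    max_abs_input = 2 ** (input_bits - 1) if signed_inputs else 2 ** input_bits - 1
    # Conservative positive limit of a signed two's-complement accumulator.
    acc_limit = 2 ** (acc_bits - 1) - 1
    return acc_limit // max_abs_input


def is_overflow_safe(weights: Sequence[int], acc_bits: int, input_bits: int,
                     signed_inputs: bool = False) -> bool:
    """Worst case: |sum(w_i * x_i)| <= max|x| * sum(|w_i|)."""
    return sum(abs(w) for w in weights) <= l1_budget(acc_bits, input_bits, signed_inputs)


# Example: unsigned 8-bit activations feeding a 16-bit accumulator.
budget = l1_budget(acc_bits=16, input_bits=8)
short_dot_product_ok = is_overflow_safe([7] * 18, acc_bits=16, input_bits=8)
long_dot_product_ok = is_overflow_safe([7] * 19, acc_bits=16, input_bits=8)
```

The bound shows why long dot products are the hard case: the L1 budget is fixed by the datapath, so each added term leaves less room for the others, which is one motivation for splitting accumulation into multiple stages.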

Updated: 2025-07-31 03:59:55

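Title: Accumulator-Aware Post-Training Quantization for Large Language Models

Abstract: When weights and activations are quantized to increasingly narrow representations, the cost of additions begins to dominate that of multiplications in multiply-accumulate (MAC) units. Recent studies show that reducing addition costs through low-precision accumulation improves throughput, power, and area across inference platforms, although it increases the risk of overflow. So far, accumulator-aware quantization research has only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models and datasets keep growing, QAT techniques become increasingly expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To bridge this gap, we introduce AXE, the first accumulator-aware quantization framework explicitly designed to give PTQ algorithms overflow avoidance guarantees. We present the theoretical motivation for AXE and demonstrate its flexibility by implementing it on top of two existing algorithms, GPFQ and OPTQ. We design AXE to support multi-stage accumulation, opening the door to full datapath optimization for the first time. We evaluate AXE with recent language generation models; when quantizing Llama3 8B for a 16-bit multi-stage accumulation datapath, AXE maintains up to 98% of the FP16 perplexity, surpassing naive bit width manipulation by up to 15%.

Updated: 2025-07-31 03:59:55

Categories: cs.LG,cs.AI,cs.DM

Download: http://arxiv.org/abs/2409.17092v2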

Achieving Deep Continual Learning via Evolution

Deep neural networks, despite their remarkable success, remain fundamentally limited in their ability to perform Continual Learning (CL). Most current methods aim to enhance the capabilities of a single model. Inspired instead by the collective learning mechanisms of human populations, we introduce Evolving Continual Learning (ECL), a framework that maintains and evolves a diverse population of neural network models. ECL continually searches for an optimal architecture for each introduced incremental task. This tailored model is trained on the corresponding task and archived as a specialized expert, contributing to a growing collection of skills. This approach inherently resolves the core CL challenges: stability is achieved through the isolation of expert models, while plasticity is greatly enhanced by evolving unique, task-specific architectures. Experimental results demonstrate that ECL significantly outperforms state-of-the-art individual-level CL methods. By shifting the focus from individual adaptation to collective evolution, ECL presents a novel path toward AI systems capable of CL.

Updated: 2025-07-31 03:59:49

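Title: Achieving Deep Continual Learning via Evolution

Abstract: Deep neural networks, despite their remarkable success, remain fundamentally limited in Continual Learning (CL). Most current methods aim to enhance the capabilities of a single model. Inspired by the collective learning mechanisms of human populations, we introduce Evolving Continual Learning (ECL), a framework that maintains and evolves a diverse population of neural network models. ECL continually searches for an optimal architecture for each newly introduced incremental task. The tailored model is trained on its task and archived as a specialized expert, adding to a growing collection of skills. This approach inherently addresses the core CL challenges: stability comes from isolating the expert models, while plasticity is greatly improved by evolving unique, task-specific architectures. Experimental results show that ECL significantly outperforms state-of-the-art individual-level CL methods. By shifting the focus from individual adaptation to collective evolution, ECL offers a novel path toward AI systems capable of CL.

Updated: 2025-07-31 03:59:49

Categories: cs.LG

Download: http://arxiv.org/abs/2502.06210v2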

Fine-Grained Privacy Extraction from Retrieval-Augmented Generation Systems via Knowledge Asymmetry Exploitation

Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge bases, but this advancement introduces significant privacy risks. Existing privacy attacks on RAG systems can trigger data leakage but often fail to accurately isolate knowledge-base-derived sentences within mixed responses. They also lack robustness when applied across multiple domains. This paper addresses these challenges by presenting a novel black-box attack framework that exploits knowledge asymmetry between RAG and standard LLMs to achieve fine-grained privacy extraction across heterogeneous knowledge landscapes. We propose a chain-of-thought reasoning strategy that creates adaptive prompts to steer RAG systems away from sensitive content. Specifically, we first decompose adversarial queries to maximize information disparity and then apply a semantic relationship scoring to resolve lexical and syntactic ambiguities. We finally train a neural network on these feature scores to precisely identify sentences containing private information. Unlike prior work, our framework generalizes to unseen domains through iterative refinement without pre-defined knowledge. Experimental results show that we achieve over 91% privacy extraction rate in single-domain and 83% in multi-domain scenarios, reducing sensitive sentence exposure by over 65% in case studies. This work bridges the gap between attack and defense in RAG systems, enabling precise extraction of private information while providing a foundation for adaptive mitigation.

Updated: 2025-07-31 03:50:16

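Title: Fine-Grained Privacy Extraction from Retrieval-Augmented Generation Systems via Knowledge Asymmetry Exploitation

Abstract: Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge bases, but this advance introduces significant privacy risks. Existing privacy attacks on RAG systems can trigger data leakage, but they often fail to isolate knowledge-base-derived sentences accurately within mixed responses. They also lack robustness when applied across multiple domains. This paper addresses these challenges with a novel black-box attack framework that exploits the knowledge asymmetry between RAG systems and standard LLMs to achieve fine-grained privacy extraction across heterogeneous knowledge landscapes. We propose a chain-of-thought reasoning strategy that creates adaptive prompts to steer RAG systems away from sensitive content. Specifically, we first decompose adversarial queries to maximize information disparity and then apply semantic relationship scoring to resolve lexical and syntactic ambiguities. Finally, we train a neural network on these feature scores to identify precisely the sentences that contain private information. Unlike prior work, our framework generalizes to unseen domains through iterative refinement without pre-defined knowledge. Experimental results show a privacy extraction rate of over 91% in single-domain scenarios and 83% in multi-domain scenarios, and a reduction in sensitive sentence exposure of over 65% in case studies. This work bridges the gap between attack and defense in RAG systems, enabling precise extraction of private information while providing a foundation for adaptive mitigation.

Updated: 2025-07-31 03:50:16

Categories: cs.CR

Download: http://arxiv.org/abs/2507.23229v1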

Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.

Updated: 2025-07-31 03:49:31

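Title: Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

Abstract: Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (for example, neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) that are typically represented in tabular form. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT (Tabular Alzheimer's Prediction GPT), which adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, to AD diagnosis using structured biomarker data with small sample sizes. Our approach builds few-shot tabular prompts from in-context learning examples drawn from structured biomedical data and fine-tunes TableGPT2 with the parameter-efficient qLoRA adaptation for a clinical binary classification task: AD versus cognitively normal (CN). The TAP-GPT framework combines the strong tabular understanding of TableGPT2 with the prior knowledge encoded in LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to this prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.

Updated: 2025-07-31 03:49:31

Categories: cs.CL,cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2507.23227v1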

Unveiling the Influence of Amplifying Language-Specific Neurons

Deactivating language-specific neurons in LLMs, which strongly correlate with individual languages, has been shown to influence model behavior. However, the effect of amplifying them remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate the best factor on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.

Updated: 2025-07-31 03:32:19

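Title: Unveiling the Influence of Amplifying Language-Specific Neurons

Abstract: Deactivating language-specific neurons in LLMs, which correlate strongly with individual languages, has been shown to influence model behavior. However, their role under amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models trained primarily in different languages. We compare amplification factors by how effectively they steer output to the target language, using a proposed Language Steering Shift (LSS) evaluation score, and then evaluate on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Applying this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons on multilingual behavior: amplification can be beneficial, especially for low-resource languages, but offers limited advantage for cross-lingual transfer.

Updated: 2025-07-31 03:32:19

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.22581v2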

LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Although large language models (LLMs) demonstrate remarkable capabilities across various tasks, evaluating their capabilities remains a challenging task. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. These issues make it difficult to evaluate the LLMs' true capabilities comprehensively. To tackle these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced. It utilizes LLMs to generate questions, answer independently, and evaluate mutually. This method integrates four key evaluation criteria: dynamic, transparent, objective, and professional, which existing evaluation methods cannot satisfy simultaneously. Experiments on eight mainstream LLMs across mathematics and programming verify the advantages of our method in distinguishing LLM performance. Furthermore, our study reveals several novel findings that are difficult for traditional methods to detect, including but not limited to: (1) Gemini demonstrates the strongest original and professional question-design capabilities among the evaluated models; (2) some LLMs exhibit "memorization-based answering" by misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results demonstrate high consistency (robustness).

Updated: 2025-07-31 03:28:30

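Title: LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Abstract: Although large language models (LLMs) show remarkable capabilities across a wide range of tasks, evaluating those capabilities remains challenging. Existing evaluation methods suffer from problems such as data contamination, black-box operation, and subjective preference. These problems make it difficult to assess the true capabilities of LLMs comprehensively. To address these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced. It uses LLMs to generate questions, answer them independently, and evaluate one another. The method integrates four key evaluation criteria (dynamic, transparent, objective, and professional) that existing evaluation methods cannot satisfy at the same time. Experiments on eight mainstream LLMs across mathematics and programming confirm the advantages of our method in distinguishing LLM performance. In addition, our study reveals several novel findings that traditional methods struggle to detect, including but not limited to: (1) Gemini shows the strongest original and professional question-design capabilities; (2) some LLMs exhibit "memorization-based answering", misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results show high consistency (robustness).

Updated: 2025-07-31 03:28:30

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.22359v2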

A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations

Contextual hallucinations -- statements unsupported by given context -- remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2000-example ContraTales benchmark for realistic assessment of such solutions.
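The idea of a single linear direction that both detects and steers can be sketched on toy data. The example below uses a difference-of-means direction as a simple stand-in for a trained linear probe, and the "residual stream" vectors are synthetic Gaussians rather than activations from any real model. All names and numbers are illustrative assumptions.

```python
import random
from typing import List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_direction(faithful: List[Vector], hallucinated: List[Vector]) -> Tuple[Vector, float]:
    """Difference-of-means direction plus a midpoint threshold on the projection."""
    mu_f, mu_h = mean_vector(faithful), mean_vector(hallucinated)
    direction = [h - f for h, f in zip(mu_h, mu_f)]
    threshold = 0.5 * (dot(mu_f, direction) + dot(mu_h, direction))
    return direction, threshold


def flags_hallucination(x: Vector, direction: Vector, threshold: float) -> bool:
    """Linear readout: project onto the direction and compare with the threshold."""
    return dot(x, direction) > threshold


def steer(x: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift an activation along the direction (negative alpha moves toward 'faithful')."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


def synthetic_activations(n: int, dim: int, shift: float, rng: random.Random) -> List[Vector]:
    """Toy data: unit Gaussian noise, with coordinate 0 shifted for one class."""
    return [[rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]


rng = random.Random(0)
direction, threshold = fit_direction(synthetic_activations(200, 8, 0.0, rng),
                                     synthetic_activations(200, 8, 3.0, rng))

# Held-out evaluation on fresh synthetic samples.
test_faithful = synthetic_activations(200, 8, 0.0, rng)
test_hallucinated = synthetic_activations(200, 8, 3.0, rng)
correct = sum(not flags_hallucination(x, direction, threshold) for x in test_faithful)
correct += sum(flags_hallucination(x, direction, threshold) for x in test_hallucinated)
accuracy = correct / (len(test_faithful) + len(test_hallucinated))
```

In the paper's setting the probe would be fitted on an observer model's residual stream and the steering applied to a generator; here both steps operate on the same toy vectors purely to show the mechanics.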

Updated: 2025-07-31 03:26:57

标题: 一个真相的单一方向：观察者模型的线性残差探针揭示和引导情境幻觉

摘要: 情境幻觉——在给定背景下无支持的陈述——仍然是人工智能中的一个重要挑战。我们展示了一个实用的可解释性洞见：一个与生成器无关的观察者模型通过一个前向传递和对其残余流的线性探测来检测幻觉。这个探测器孤立出一个单一的、可传递的线性方向，将幻觉与忠实文本分开，表现优于基线5-27个点，并展示了在Gemma-2模型（从2B到27B）中的鲁棒的中层性能。梯度乘以激活将这个信号定位到稀疏的、晚层的MLP活动。至关重要的是，操纵这个方向因果地引导生成器幻觉率，证明了其可操作性。我们的结果提供了与特定MLP子电路相关的内部、低维度幻觉跟踪的新证据，可用于检测和缓解。我们发布了包含2000个示例的ContraTales基准，用于对这类解决方案进行真实评估。

更新时间: 2025-07-31 03:26:57

领域: cs.LG

下载: http://arxiv.org/abs/2507.23221v1

Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining

Alpha factor mining is pivotal in quantitative investment for identifying predictive signals from complex financial data. While traditional formulaic alpha mining relies on human expertise, contemporary automated methods, such as those based on genetic programming or reinforcement learning, often struggle with search inefficiency or yield alpha factors that are difficult to interpret. This paper introduces a novel framework that integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to overcome these limitations. Our framework leverages the LLM's instruction-following and reasoning capability to iteratively generate and refine symbolic alpha formulas within an MCTS-driven exploration. A key innovation is the guidance of MCTS exploration by rich, quantitative feedback from financial backtesting of each candidate factor, enabling efficient navigation of the vast search space. Furthermore, a frequent subtree avoidance mechanism is introduced to enhance search diversity and prevent formulaic homogenization, further improving performance. Experimental results on real-world stock market data demonstrate that our LLM-based framework outperforms existing methods by mining alphas with superior predictive accuracy and trading performance. The resulting formulas are also more amenable to human interpretation, establishing a more effective and efficient paradigm for formulaic alpha mining.

Updated: 2025-07-31 03:20:47

标题: 穿越Alpha丛林：一种基于LLM的MCTS框架用于公式因子挖掘

摘要: Alpha因子挖掘在量化投资中至关重要，用于从复杂的金融数据中识别预测性信号。传统的公式化Alpha挖掘依赖于人类专业知识，而当代自动化方法，如基于遗传编程或强化学习的方法，经常遇到搜索效率低或产生难以解释的Alpha因子的困难。本文介绍了一个新颖的框架，将大型语言模型（LLMs）与蒙特卡洛树搜索（MCTS）相结合，以克服这些限制。我们的框架利用LLM的指令跟随和推理能力，在MCTS驱动的探索中迭代生成和完善符号化的Alpha公式。一个关键的创新是通过每个候选因子的金融回测提供丰富的定量反馈来指导MCTS的探索，从而实现对庞大搜索空间的高效导航。此外，引入频繁子树避免机制以增强搜索多样性并防止公式化同质化，进一步提高性能。对真实股市数据的实验结果表明，我们基于LLM的框架通过挖掘具有优越预测准确性和交易表现的Alpha来优于现有方法。由此产生的公式也更易于人类解释，建立了一个更有效和高效的公式化Alpha挖掘范式。

更新时间: 2025-07-31 03:20:47

领域: cs.AI

下载: http://arxiv.org/abs/2505.11122v2

Advancing Generative Artificial Intelligence and Large Language Models for Demand Side Management with Internet of Electric Vehicles

Generative artificial intelligence, particularly through large language models (LLMs), is poised to transform energy optimization and demand side management (DSM) within microgrids. This paper explores the integration of LLMs into energy management, emphasizing their roles in automating the optimization of DSM strategies with Internet of electric vehicles. We investigate challenges and solutions associated with DSM and explore the new opportunities presented by leveraging LLMs. Then, we propose an innovative solution that enhances LLMs with retrieval-augmented generation for automatic problem formulation, code generation, and customizing optimization. We present a case study to demonstrate the effectiveness of our proposed solution in charging scheduling and optimization for electric vehicles, highlighting our solution's significant advancements in energy efficiency and user adaptability. This work underscores the potential of LLMs for energy optimization and fosters a new era of intelligent DSM solutions.

Updated: 2025-07-31 03:20:12

标题: 推进生成式人工智能和大型语言模型在基于互联网电动车的需求侧管理中的应用

摘要: 生成人工智能，尤其是通过大型语言模型（LLMs），正准备改变微网格内的能源优化和需求侧管理（DSM）。本文探讨了LLMs整合到能源管理中的作用，强调它们在自动化DSM策略优化中的角色，与电动汽车互联网的结合。我们调查了与DSM相关的挑战和解决方案，并探讨了通过利用LLMs带来的新机遇。然后，我们提出了一种创新的解决方案，通过检索增强生成来增强LLMs，实现自动问题制定、代码生成和定制优化。我们提供了一个案例研究，展示了我们提出的解决方案在电动汽车充电调度和优化中的有效性，突出了我们解决方案在能源效率和用户适应性方面的重大进展。本研究强调了LLMs在能源优化方面的潜力，并促进了智能DSM解决方案的新时代。

更新时间: 2025-07-31 03:20:12

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2501.15544v4

Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose \textit{topic judge}, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

Updated: 2025-07-31 03:17:43

标题: 用稀疏自动编码器实现的机制主题模型：模型指导，而非文字

摘要: 传统主题模型在发现大型文本集合中的潜在主题方面非常有效。然而，由于它们依赖于词袋表示，它们很难捕捉语义抽象特征。虽然一些神经变体使用更丰富的表示，但它们同样受到将主题表达为单词列表的限制，这限制了它们表达复杂主题的能力。我们引入了机制主题模型（MTMs），这是一类基于稀疏自动编码器（SAEs）学习的可解释特征的主题模型。通过在这个语义丰富的空间上定义主题，MTMs可以揭示具有表现力特征描述的更深层次的概念主题。此外，与其他主题模型不同，MTMs通过使用基于主题的操纵向量实现可控的文本生成。为了正确评估MTM主题与基于单词列表的方法，我们提出了\textit{topic judge}，这是一个基于LLM的两两比较评估框架。在五个数据集中，MTMs在连贯性度量方面与传统和神经基线相匹配或超出，并始终受到\textit{topic judge}的欢迎，并且能够有效地操纵LLM的输出。

更新时间: 2025-07-31 03:17:43

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.23220v1

An Information Bottleneck Asset Pricing Model

Deep neural networks (DNNs) have garnered significant attention in financial asset pricing, due to their strong capacity for modeling complex nonlinear relationships within financial data. However, sophisticated models are prone to over-fitting to the noise information in financial data, resulting in inferior performance. To address this issue, we propose an information bottleneck asset pricing model that compresses data with low signal-to-noise ratios to eliminate redundant information and retain the critical information for asset pricing. Our model imposes constraints of mutual information during the nonlinear mapping process. Specifically, we progressively reduce the mutual information between the input data and the compressed representation while increasing the mutual information between the compressed representation and the output prediction. The design ensures that irrelevant information, which is essentially the noise in the data, is forgotten during the modeling of financial nonlinear relationships without affecting the final asset pricing. By leveraging the constraints of the Information bottleneck, our model not only harnesses the nonlinear modeling capabilities of deep networks to capture the intricate relationships within financial data but also ensures that noise information is filtered out during the information compression process.
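For orientation, the trade-off described above (less mutual information with the input, more with the prediction) matches the classical information bottleneck Lagrangian, written here in its standard textbook form. The paper's actual training loss may differ; the symbols $X$ (input data), $Z$ (compressed representation), $Y$ (prediction target), and the trade-off weight $\beta$ are notation assumed for this sketch.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

A larger $\beta$ favors keeping information that is predictive of $Y$, while a smaller $\beta$ favors stronger compression of $X$.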

Updated: 2025-07-31 03:15:58

标题: 一个信息瓶颈资产定价模型

摘要: 深度神经网络（DNNs）在金融资产定价中引起了重要关注，因为它们具有建模金融数据中复杂非线性关系的强大能力。然而，复杂的模型容易过度拟合金融数据中的噪音信息，导致性能下降。为了解决这个问题，我们提出了一种信息瓶颈资产定价模型，该模型通过压缩信噪比低的数据来消除冗余信息，并保留用于资产定价的关键信息。我们的模型在非线性映射过程中施加互信息约束。具体来说，我们逐渐减少输入数据和压缩表示之间的互信息，同时增加压缩表示和输出预测之间的互信息。该设计确保在金融非线性关系建模过程中遗忘无关信息（即数据中的噪音），而不影响最终的资产定价。通过利用信息瓶颈的约束，我们的模型不仅利用深度网络的非线性建模能力捕捉金融数据中的复杂关系，还确保在信息压缩过程中过滤掉噪音信息。

更新时间: 2025-07-31 03:15:58

领域: cs.CE,cs.AI

下载: http://arxiv.org/abs/2507.23218v1

Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation

Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce \textit{DocsRay}, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages multimodal Large Language Models' (LLMs) native capabilities to seamlessly process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay's framework synergistically combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
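The complexity claim can be made concrete with a small sketch of coarse-to-fine retrieval: score the $S$ section vectors first, then score only the chunks inside the $k_1$ best sections. This is an illustration on synthetic embeddings, not DocsRay's implementation; the function names, the mean-pooled section vectors, and the toy sizes are assumptions.

```python
import numpy as np

def top_k(query, matrix, k):
    """Indices of the k rows of `matrix` with the highest dot product with `query`."""
    return np.argsort(-(matrix @ query))[:k]

def two_stage_retrieve(query, section_vecs, chunk_vecs, chunk_section, k1, k2):
    """Coarse-to-fine search: choose k1 sections, then rank only their chunks."""
    best_sections = top_k(query, section_vecs, k1)                      # S comparisons
    candidates = np.flatnonzero(np.isin(chunk_section, best_sections))  # k1 * N_s chunks here
    order = top_k(query, chunk_vecs[candidates], k2)
    comparisons = len(section_vecs) + len(candidates)
    return candidates[order], comparisons

def unit(rows):
    """Normalize each row to unit length so dot products are cosine similarities."""
    return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

# Synthetic document: S sections with N_s chunks each (assumed sizes).
rng = np.random.default_rng(0)
S, N_s, dim = 10, 20, 32
chunk_section = np.repeat(np.arange(S), N_s)
centers = rng.normal(size=(S, dim))
chunk_vecs = unit(centers[chunk_section] + 0.3 * rng.normal(size=(S * N_s, dim)))
section_vecs = unit(np.stack([chunk_vecs[chunk_section == s].mean(axis=0) for s in range(S)]))

query = chunk_vecs[57]  # a query that matches chunk 57 exactly
hits, comparisons = two_stage_retrieve(query, section_vecs, chunk_vecs, chunk_section, k1=2, k2=3)
print(hits, comparisons)
```

With $S = 10$, $k_1 = 2$, and $N_s = 20$, the sketch scores 50 vectors instead of all 200 chunks, which is the $O(S + k_1 \cdot N_s)$ versus $O(N)$ difference in miniature.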

Updated: 2025-07-31 03:14:45

标题: 使用伪目录引导的检索增强生成的零样本文档理解

摘要: 由于复杂多模态文档的结构不一致性和有限的训练数据可用性，理解这些文档仍然具有挑战性。我们引入了一种名为\textit{DocsRay}的无需训练的文档理解系统，该系统将伪目录（TOC）生成与分层检索增强生成（RAG）集成在一起。我们的方法利用多模态大型语言模型（LLMs）的本机功能，无需专门的模型或额外的训练即可无缝处理包含文本、图像、图表和表格等多样元素的文档。DocsRay的框架协同地结合了三个关键技术：（1）使用基于提示的LLM交互生成分层伪目录的语义结构化模块，（2）将多样的文档元素转换为统一的以文本为中心的表示形式的零-shot多模态分析，利用多模态LLM的固有能力，以及（3）将检索复杂度从$O(N)$降低到$O(S + k_1 \cdot N_s)$的高效两阶段分层检索系统。在平均49.4页和20,971个文本标记的文档上进行评估，DocsRay将查询延迟从3.89秒降低到2.12秒，实现了45%的效率改善。在MMLongBench-Doc基准测试中，DocsRay-Pro达到64.7%的准确率，大大超过了先前的最新结果。

更新时间: 2025-07-31 03:14:45

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.23217v1

Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs' origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn't consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.

Updated: 2025-07-31 03:13:46

标题: 大型语言模型中的文化偏见：通过道德问卷评估人工智能代理

摘要: 人工智能系统是否真正代表了人类的价值观，还是仅仅在这些价值观之间取得了平均值？我们的研究表明一个令人担忧的现实：尽管具有语言能力，但大型语言模型（LLMs）未能代表多样化的文化道德框架。我们通过在19个文化背景下应用道德基础问卷揭示了人工智能生成的道德直觉与人类道德直觉之间的显著差距。通过比较多个最先进的LLM模型的起源数据与人类基准数据，我们发现这些模型系统地使道德多样性同质化。令人惊讶的是，增加模型大小并不一致地提高文化代表性的准确性。我们的发现挑战了越来越多地将LLMs用作社会科学研究中的合成人口的做法，并突显了当前人工智能对齐方法的基本局限。在没有数据驱动的对齐的情况下，这些系统无法捕捉微妙的、具有文化特定性的道德直觉。我们的结果呼吁更具实践性的对齐目标和评估指标，以确保人工智能系统代表多样化的人类价值观，而不是将道德景观变平。

更新时间: 2025-07-31 03:13:46

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.10073v2

Not Just What, But When: Integrating Irregular Intervals to LLM for Sequential Recommendation

Time intervals between purchased items are a crucial factor in sequential recommendation tasks, yet existing approaches focus on item sequences and often overlook these intervals by assuming that they are static. However, dynamic intervals serve as a dimension that characterizes users not only through the history of a single user but also across different users with the same item history. In this work, we propose IntervalLLM, a novel framework that integrates interval information into an LLM and incorporates a novel interval-infused attention to jointly consider information about items and intervals. Furthermore, unlike prior studies that address the cold-start scenario only from the perspectives of users and items, we introduce a new viewpoint: the interval perspective, which serves as an additional metric for evaluating recommendation methods in warm and cold scenarios. Extensive experiments on 3 benchmarks with both traditional and LLM-based baselines demonstrate that our IntervalLLM achieves not only a 4.4% improvement on average but also the best performance in warm and cold scenarios across the user, item, and proposed interval perspectives. In addition, we observe that the cold scenario from the interval perspective experiences the most significant performance drop among all recommendation methods. This finding underscores the necessity of further research on interval-based cold-start challenges and of our integration of interval information in sequential recommendation tasks. Our code is available here: https://github.com/sony/ds-research-code/tree/master/recsys25-IntervalLLM.

Updated: 2025-07-31 03:05:05

标题: 不仅是什么，而是何时：将不规则间隔集成到LLM中用于顺序推荐

摘要: 购买物品之间的时间间隔是顺序推荐任务中的一个关键因素，而现有的方法注重物品序列，并常常忽视了物品之间间隔是静态的假设。然而，动态间隔作为描述用户画像的一个维度，不仅描述了用户内部的历史，还描述了具有相同物品历史的不同用户。在这项工作中，我们提出了IntervalLLM，这是一个将间隔信息整合到LLM中，并结合了新颖的间隔注入注意力，共同考虑物品和间隔信息的新框架。此外，不同于之前只从用户和物品的角度解决冷启动场景的研究，我们引入了一个新的视角：间隔视角，作为评估推荐方法在温暖和冷启动情景下的额外指标。在3个基准测试上进行了广泛实验，包括传统和基于LLM的基线，结果表明我们的IntervalLLM不仅在平均值上取得了4.4%的改进，而且在所有用户、物品和所提出的间隔视角上取得了最佳表现的温暖和冷启动情景。此外，我们观察到，从间隔视角看，冷启动情景经历了所有推荐方法中最显著的性能下降。这一发现强调了在基于间隔的冷启动挑战上进一步研究的必要性，以及我们在顺序推荐任务领域整合间隔信息的重要性。我们的代码可以在https://github.com/sony/ds-research-code/tree/master/recsys25-IntervalLLM找到。

更新时间: 2025-07-31 03:05:05

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2507.23209v1

Are Recommenders Self-Aware? Label-Free Recommendation Performance Estimation via Model Uncertainty

Can a recommendation model be self-aware? This paper investigates the recommender's self-awareness by quantifying its uncertainty, which provides a label-free estimation of its performance. Such self-assessment can enable more informed understanding and decision-making before the recommender engages with any users. To this end, we propose an intuitive and effective method, probability-based List Distribution uncertainty (LiDu). LiDu measures uncertainty by determining the probability that a recommender will generate a certain ranking list based on the prediction distributions of individual items. We validate LiDu's ability to represent model self-awareness in two settings: (1) with a matrix factorization model on a synthetic dataset, and (2) with popular recommendation algorithms on real-world datasets. Experimental results show that LiDu is more correlated with recommendation performance than a series of label-free performance estimators. Additionally, LiDu provides valuable insights into the dynamic inner states of models throughout training and inference. This work establishes an empirical connection between recommendation uncertainty and performance, framing it as a step towards more transparent and self-evaluating recommender systems.
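The core quantity, the probability that a recommender produces a particular ranking given per-item prediction distributions, can be approximated by sampling. The sketch below is not the LiDu formula from the paper: it assumes independent Gaussian score distributions per item and uses a plain Monte Carlo estimate; `ranking_probability` and its parameters are hypothetical names.

```python
import numpy as np

def ranking_probability(means, stds, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(sampled scores reproduce the mean-score ranking)."""
    rng = np.random.default_rng(seed)
    target = np.argsort(-means)  # ranking implied by the mean predictions
    samples = rng.normal(means, stds, size=(n_samples, len(means)))
    sampled_rankings = np.argsort(-samples, axis=1)
    return float((sampled_rankings == target).all(axis=1).mean())

means = np.array([3.0, 2.0, 1.0])
confident = ranking_probability(means, np.full(3, 0.1))  # narrow distributions
uncertain = ranking_probability(means, np.full(3, 2.0))  # wide, overlapping distributions
print(confident, uncertain)
```

A probability close to 1 means the list is stable under the model's own uncertainty, while a low probability signals that small score perturbations would reorder the list.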

Updated: 2025-07-31 03:04:34

标题: 推荐系统是否具有自我意识？通过模型不确定性进行无标签推荐性能估计

摘要: 一个推荐模型能够具备自我意识吗？本文通过量化不确定性来研究推荐系统的自我意识，从而提供一个无标签的性能估计。这种自我评估可以在推荐系统与用户互动之前，帮助更深入地理解和决策。为此，我们提出了一种直观有效的方法，基于概率的列表分布不确定性（LiDu）。LiDu通过确定推荐系统基于个别项目的预测分布生成某个排名列表的概率来衡量不确定性。我们在两个设置中验证了LiDu代表模型自我意识的能力：（1）在合成数据集上使用矩阵分解模型，（2）在真实数据集上使用流行的推荐算法。实验结果显示，LiDu与推荐性能的相关性比一系列无标签的性能估计器更高。此外，LiDu还提供了有关模型在训练和推断过程中动态内部状态的宝贵见解。这项工作建立了推荐不确定性与性能之间的实证联系，将其视为更透明和自我评价的推荐系统迈出的一步。

更新时间: 2025-07-31 03:04:34

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2507.23208v1

Adapt before Continual Learning

Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). Although pre-trained models (PTMs) have provided a strong foundation for CL, existing approaches face a fundamental challenge in balancing these two competing objectives. Current methods typically address stability by freezing the PTM backbone, which severely limits the model's plasticity, particularly when the incoming data distribution diverges substantially from the pre-training data. Alternatively, sequentially fine-tuning the entire PTM can adapt to new knowledge but often leads to catastrophic forgetting, highlighting the critical stability-plasticity trade-off in PTM-based CL. To address this limitation, we propose Adapting PTMs before the core CL process (ACL), a novel framework that introduces a plug-and-play adaptation phase prior to learning each new task. During this phase, ACL refines the PTM backbone by aligning embeddings with their original class prototypes while distancing them from irrelevant classes. This mechanism is shown, both theoretically and empirically, to achieve a desirable balance between stability and plasticity, significantly improving CL performance across benchmarks and integrated methods. Code is available at https://github.com/byyx666/ACL_code.

Updated: 2025-07-31 03:04:31

标题: 在持续学习之前适应

摘要: 持续学习（CL）旨在使神经网络在保留现有知识（稳定性）的同时逐步获取新知识（可塑性）。尽管预训练模型（PTMs）为CL提供了坚实的基础，但现有方法在平衡这两个竞争目标方面面临着根本性挑战。当前方法通常通过冻结PTM骨干来解决稳定性问题，这严重限制了模型的可塑性，特别是当传入数据分布与预训练数据大幅分歧时。另一方面，顺序微调整个PTM可以适应新知识，但往往会导致灾难性遗忘，凸显了基于PTM的CL中关键的稳定性和可塑性权衡。为解决这一局限性，我们提出了在核心CL处理之前适应PTMs的Adapting PTMs before the core CL（ACL）框架，这是一个引入插拔适应阶段的新颖框架，用于在学习每个新任务之前进行调整。在这个阶段，ACL通过将嵌入与其原始类原型对齐，并远离无关类别来完善PTM骨干。该机制在理论和实证上展示出了稳定性和可塑性之间理想的平衡，显著提高了各种基准和集成方法的CL性能。代码可在https://github.com/byyx666/ACL_code找到。

更新时间: 2025-07-31 03:04:31

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2506.03956v3

InfAlign: Inference-aware language model alignment

Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
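To make the notion of an inference-time win rate concrete, the following sketch estimates how often best-of-N decoding beats a single sample from the same base model. It is a toy setting, not the InfAlign-CTRL algorithm: responses are identified with scalar rewards drawn uniformly at random, and `best_of_n` and `win_rate` are hypothetical helper names.

```python
import random

def best_of_n(sample, reward, n):
    """Draw n candidates and keep the one with the highest reward."""
    return max((sample() for _ in range(n)), key=reward)

def win_rate(policy_sample, base_sample, reward, trials=5000):
    """Fraction of trials where the policy's output beats the base model's (ties count half)."""
    wins = 0.0
    for _ in range(trials):
        a, b = reward(policy_sample()), reward(base_sample())
        wins += 1.0 if a > b else 0.5 if a == b else 0.0
    return wins / trials

random.seed(0)
base = lambda: random.random()  # toy "response", identified with its scalar quality
reward = lambda y: y            # toy reward model (assumption: reward equals quality)
bo4 = lambda: best_of_n(base, reward, 4)

rate = win_rate(bo4, base, reward)
print(f"best-of-4 win rate vs. base: {rate:.3f}")
```

For i.i.d. continuous rewards, the best of N samples beats one fresh sample with probability N/(N+1), so the estimate should land near 0.8 for N = 4. An inference-aware objective targets this post-decoding win rate rather than the win rate of a single raw sample.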

Updated: 2025-07-31 03:02:43

标题: InfAlign：推断感知的语言模型对齐

摘要: 语言模型对齐是训练现代生成式语言模型的关键步骤。对齐旨在提高从对齐模型中采样的样本与基础模型之间的获胜率。如今，我们越来越多地使用推理时算法（例如，最佳-N、受控解码、树搜索）来从语言模型解码，而不是标准采样。我们表明，这种训练/测试不匹配使得标准RLHF框架在这种推理时方法的情况下并不是最佳选择。因此，我们提出了一种推理感知对齐（InfAlign）框架，旨在优化对齐策略在推理时与基础模型之间的获胜率。我们证明，对于任何推理时解码过程，最佳的对齐策略是通过对奖励进行变换的标准RLHF问题的解决方案。这激励我们提供校准和转换RL（InfAlign-CTRL）算法来解决这个问题，其中包括奖励校准步骤和KL正则化奖励最大化步骤，以及对校准奖励的转换。对于最佳-N采样和最佳-N越狱，我们提出了特定的转换，可提高推理时获胜率3-8%。最后，我们还表明，我们提出的奖励校准方法是优化标准获胜率的强大基准。

更新时间: 2025-07-31 03:02:43

领域: cs.LG,cs.CL,cs.IT,math.IT

下载: http://arxiv.org/abs/2412.19792v4

HRVVS: A High-resolution Video Vasculature Segmentation Network via Hierarchical Autoregressive Residual Priors

The segmentation of the hepatic vasculature in surgical videos holds substantial clinical significance in the context of hepatectomy procedures. However, owing to the lack of an appropriate dataset and the inherently complex nature of the task, little research has been reported in this domain. To address this issue, we first introduce a high-quality frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11442 high-resolution frames. On this basis, we propose a novel high-resolution video vasculature segmentation network, dubbed HRVVS. We embed a pretrained visual autoregressive modeling (VAR) model into different layers of the hierarchical encoder as prior information to reduce the information degradation generated during the downsampling process. In addition, we design a dynamic memory decoder on a multi-view segmentation network to minimize the transmission of redundant information while preserving more details between frames. Extensive experiments on surgical video datasets demonstrate that our proposed HRVVS significantly outperforms the state-of-the-art methods. The source code and dataset will be publicly available at https://github.com/scott-yjyang/HRVVS.

Updated: 2025-07-31 03:01:47

标题: HRVVS：通过分层自回归残差先验的高分辨率视频血管分割网络

摘要: 在肝切除手术视频中对肝血管进行分割在肝切除手术程序的临床意义方面具有重要意义。然而，由于缺乏适当的数据集和固有的复杂任务特性，该领域报道的研究很少。为了解决这个问题，我们首先引入了一个高质量的逐帧标注的肝血管数据集，其中包含35个长肝切除手术视频和11442个高分辨率帧。在此基础上，我们提出了一种新颖的高分辨率视频血管分割网络，称为HRVVS。我们创新地将一个预训练的视觉自回归建模（VAR）模型嵌入到分层编码器的不同层中作为先验信息，以减少在下采样过程中生成的信息降级。此外，我们设计了一个动态记忆解码器在多视角分割网络上，以最小化冗余信息的传输，同时保留帧间更多的细节。对外科视频数据集的大量实验表明，我们提出的HRVVS明显优于最先进的方法。源代码和数据集将在https://github.com/scott-yjyang/HRVVS上公开。

更新时间: 2025-07-31 03:01:47

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.22530v2

Solution-aware vs global ReLU selection: partial MILP strikes back for DNN verification

To handle complex instances, we revisit a divide-and-conquer approach to break down the complexity: instead of few complex BaB calls, we rely on many small {\em partial} MILP calls. The crucial step is to select very few but very important ReLUs to treat using (costly) binary variables. The previous attempts were suboptimal in that respect. To select these important ReLU variables, we propose a novel {\em solution-aware} ReLU scoring ({\sf SAS}), as well as adapt the BaB-SR and BaB-FSB branching functions as {\em global} ReLU scoring ({\sf GS}) functions. We compare them theoretically as well as experimentally, and {\sf SAS} is more efficient at selecting a set of variables to open using binary variables. Compared with previous attempts, SAS reduces the number of binary variables by around 6 times, while maintaining the same level of accuracy. Implemented in {\em Hybrid MILP}, calling first $\alpha,\beta$-CROWN with a short time-out to solve easier instances, and then partial MILP, produces a very accurate yet efficient verifier, reducing by up to $40\%$ the number of undecided instances to low levels ($8-15\%$), while keeping a reasonable runtime ($46s-417s$ on average per instance), even for fairly large CNNs with 2 million parameters.
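For readers unfamiliar with why binary variables are the costly part, the standard big-M MILP encoding of a single ReLU is shown below. This is the textbook formulation, not necessarily the exact one used in Hybrid MILP; the symbols are assumptions for this sketch: $x$ is the pre-activation with known bounds $l < 0 < u$, $y = \max(0, x)$ is the output, and $a$ is the binary variable.

```latex
\begin{aligned}
  y &\ge 0, & y &\ge x, \\
  y &\le u \, a, & y &\le x - l \, (1 - a), \\
  a &\in \{0, 1\}.
\end{aligned}
```

Setting $a = 0$ forces $y = 0$ with $x \le 0$, and $a = 1$ forces $y = x$ with $x \ge 0$, so the encoding is exact. Relaxing $a$ to the interval $[0, 1]$ yields the usual triangle relaxation $y \le u \, (x - l) / (u - l)$, which is cheap but loose. A partial MILP keeps $a$ binary only for the few ReLUs selected by a scoring function and relaxes the rest.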

Updated: 2025-07-31 02:43:57

标题: 解决方案感知 vs 全局ReLU选择：部分MILP为DNN验证再次发挥作用

摘要: 为了处理复杂实例，我们重新审视了一种分治方法来分解复杂性：我们依靠许多小的{\em 部分} MILP 调用，而不是少量复杂的 BaB 调用。关键步骤是选择非常少但非常重要的 ReLU 变量来使用（昂贵的）二进制变量处理。先前的尝试在这方面是次优的。为了选择这些重要的 ReLU 变量，我们提出了一种新颖的{\em 解决方案感知} ReLU 评分（{\sf SAS}），并将 BaB-SR 和 BaB-FSB 分支函数调整为{\em 全局} ReLU 评分（{\sf GS}）函数。我们在理论上和实验上对它们进行了比较，{\sf SAS} 在使用二进制变量选择一组要打开的变量时更有效率。与先前的尝试相比，SAS 将二进制变量数量减少了约 6 倍，同时保持了相同水平的准确性。在{\em Hybrid MILP} 中实施，首先调用 $\alpha,\beta$-CROWN 并设置较短的超时时间来解决更容易的实例，然后进行部分 MILP，可以生成一个非常准确而高效的验证器，将未决实例的数量降低了高达 $40\%$ 至较低水平（$8-15\%$），同时保持了合理的运行时间（每个实例平均为 $46s-417s$），即使是具有 200 万参数的相当大的 CNN。

更新时间: 2025-07-31 02:43:57

领域: cs.AI

下载: http://arxiv.org/abs/2507.23197v1

Learning 3D Scene Analogies with Neural Contextual Scene Maps

Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As it is intractable for data-driven learning to comprehensively capture the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications. The project page, including the code, is available at https://82magnolia.github.io/3d_scene_analogies/.

Updated: 2025-07-31 02:35:01

标题: 使用神经上下文场景地图学习3D场景类比

摘要: 理解场景背景对于机器在未见或嘈杂的3D环境中执行任务并调整先前知识至关重要。由于基于数据驱动的学习难以全面地包含各种布局和开放空间，我们提议教会机器识别3D空间中的关系共性。与集中于点或对象级表示不同，我们引入了3D场景类比，这是3D场景区域之间的平滑映射，它们对齐了空间关系。与研究充分的单个实例级映射不同，这些场景级映射平滑地连接大场景区域，可能在AR/VR中的轨迹转移、模仿学习的长时间演示转移以及上下文感知对象重新排列中实现独特的应用。为了找到3D场景类比，我们提出了神经上下文场景地图，它提取总结语义和几何上下文的描述符字段，并以一种粗到细的方式全面地对齐它们以进行地图估计。这种方法减少了对个体特征点的依赖，使其对输入噪声或形状变化具有鲁棒性。实验证明了我们的方法在识别场景类比并在各种室内场景中转移轨迹或对象放置方面的有效性，表明了它在机器人学和AR/VR应用中的潜力。项目页面，包括代码，可通过此链接获得：https://82magnolia.github.io/3d_scene_analogies/。

更新时间: 2025-07-31 02:35:01

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.15897v2

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease of coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines, achieving correctness of up to 63% and execution speedups of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.

Updated: 2025-07-31 02:26:58

标题: Geak：引入Triton Kernel AI代理和评估基准

摘要: 对于由AI生成的GPU内核的需求正在迅速增长，受到工业和学术界对可扩展、硬件优化解决方案的需求的影响。随着深度学习工作负载在复杂性和多样性上的增长，自动化低级内核开发以满足性能和生产力需求变得至关重要。主要云服务提供商、半导体公司和研究机构现在正在大力投资于用于GPU的基于AI的代码生成，旨在减少手动优化工作的同时在像AMD MI300X这样的硬件上实现接近专家级性能。Triton语言，一种基于Python的GPU编程DSL，由于其性能和编码易用性的平衡而成为这种AI生成内核的受欢迎目标。在这项工作中，我们提出了一个用于基于Triton的GPU内核和GEAK（Generating Efficient AI-centric GPU Kernels）的评估套件-这是一个利用尖端LLM生成性能优异的Triton代码的框架，专门针对AMD GPU，包括AMD MI300X和MI250。GEAK利用推理时间计算缩放来使用从反思式反馈机制中改编的推理循环生成基于Triton的GPU内核。在两个评估基准上，GEAK在正确性方面显著优于直接提示前沿LLM以及基于反思的生成管道，达到了63%的正确性和高达2.59倍的执行速度提升。这些结果突显了GEAK类似的主动代码生成对于加速各种硬件平台的采用和普及专家级内核性能访问的承诺。

更新时间: 2025-07-31 02:26:58

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.23194v1

FGeo-HyperGNet: Geometric Problem Solving Integrating FormalGeo Symbolic System and Hypergraph Neural Network

Geometric problem solving is a long-standing challenge in the fields of mathematical reasoning and artificial intelligence. We built a neural-symbolic system, called FGeo-HyperGNet, to automatically perform human-like geometric problem solving. The symbolic component is a formal system built on FormalGeo, which can automatically perform geometric relational reasoning and algebraic calculations and organize the solution into a hypergraph with conditions as hypernodes and theorems as hyperedges. The neural component, called HyperGNet, is a hypergraph neural network based on the attention mechanism, including an encoder to encode the structural and semantic information of the hypergraph and a theorem predictor to provide guidance in solving problems. The neural component predicts theorems according to the hypergraph, and the symbolic component applies theorems and updates the hypergraph, thus forming a predict-apply cycle to ultimately achieve readable and traceable automatic solving of geometric problems. Experiments demonstrate the effectiveness of this neural-symbolic architecture. We achieved state-of-the-art results with a TPA of 93.50% and a PSSR of 88.36% on the FormalGeo7K dataset. The code is available at https://github.com/BitSecret/HyperGNet.

Updated: 2025-07-31 02:20:49

标题: FGeo-HyperGNet：将形式几何符号系统和超图神经网络整合的几何问题解决方案

摘要: 几何问题的解决一直是数学推理和人工智能领域的长期挑战。我们构建了一个名为FGeo-HyperGNet的神经符号系统，可以自动执行类似人类的几何问题解决。符号组件是建立在FormalGeo上的形式系统，可以自动进行几何关系推理和代数计算，并将解决方案组织成一个超图，其中条件作为超节点，定理作为超边。神经组件称为HyperGNet，是基于注意机制的超图神经网络，包括一个编码器来编码超图的结构和语义信息，以及一个定理预测器来提供解决问题的指导。神经组件根据超图预测定理，符号组件应用定理并更新超图，从而形成一个预测-应用循环，最终实现几何问题的可读和可追踪的自动解决。实验证明了这种神经-符号体系结构的有效性。在FormalGeo7K数据集上，我们取得了93.50%的TPA和88.36%的PSSR的最新成果。代码可在https://github.com/BitSecret/HyperGNet找到。

更新时间: 2025-07-31 02:20:49

领域: cs.AI

下载: http://arxiv.org/abs/2402.11461v3

G-Core: A Simple, Scalable and Balanced RLHF Trainer

Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often face challenges in scaling to multi-modal and diffusion workflows and adapting to dynamic workloads. In particular, current approaches may encounter limitations in controller scalability, flexible resource placement, and efficient orchestration when handling complex RLHF pipelines, especially in scenarios involving dynamic sampling or generative reward modeling. In this paper, we present \textbf{G-Core}, a simple, scalable, and balanced RLHF training framework designed to address these challenges. G-Core introduces a parallel controller programming model, enabling flexible and efficient orchestration of complex RLHF workflows without the bottlenecks of a single centralized controller. Furthermore, we propose a dynamic placement schema that adaptively partitions resources and schedules workloads, significantly reducing hardware idle time and improving utilization, even under highly variable training conditions. G-Core has successfully trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world scenarios. Our results show that G-Core advances the state of the art in RLHF training, providing a solid foundation for future research and deployment of large-scale, human-aligned models.

Updated: 2025-07-31 02:18:13

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.22789v2

Tractable Responsibility Measures for Ontology-Mediated Query Answering

Recent work on quantitative approaches to explaining query answers employs responsibility measures to assign scores to facts in order to quantify their respective contributions to obtaining a given answer. In this paper, we study the complexity of computing such responsibility scores in the setting of ontology-mediated query answering, focusing on a very recently introduced family of Shapley-value-based responsibility measures defined in terms of weighted sums of minimal supports (WSMS). By exploiting results from the database setting, we can show that such measures enjoy polynomial data complexity for classes of ontology-mediated queries that are first-order-rewritable, whereas the problem becomes #P-hard when the ontology language can encode reachability queries (via axioms like $\exists R. A \sqsubseteq A$). To better understand the tractability frontier, we next explore the combined complexity of WSMS computation. We prove that intractability applies already to atomic queries if the ontology language supports conjunction, as well as to unions of 'well-behaved' conjunctive queries, even in the absence of an ontology. By contrast, our study yields positive results for common DL-Lite dialects: by means of careful analysis, we identify classes of structurally restricted conjunctive queries (which intuitively disallow undesirable interactions between query atoms) that admit tractable WSMS computation.
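To make the WSMS idea concrete, here is a minimal sketch. It assumes that a fact's score is the sum, over the minimal supports containing that fact, of a weight that depends only on the support's size. The `1/|S|` weight, the function names, and the toy supports are illustrative assumptions; the abstract does not fix any of them.

```python
def minimal_supports(supports):
    """Keep only supports that have no proper subset among the given supports."""
    sets = {frozenset(s) for s in supports}
    return [s for s in sets if not any(t < s for t in sets)]


def wsms_scores(supports, weight=lambda size: 1.0 / size):
    """Sum weight(|S|) over every minimal support S that contains each fact.

    The default weight 1/|S| is a hypothetical choice for illustration.
    """
    scores = {}
    for support in minimal_supports(supports):
        for fact in support:
            scores[fact] = scores.get(fact, 0.0) + weight(len(support))
    return scores


# Toy input: three sets of facts that each entail the answer.
# {"a", "b"} is not minimal because {"a"} alone already suffices.
supports = [{"a"}, {"a", "b"}, {"b", "c"}]
scores = wsms_scores(supports)
```

In this toy case the brute-force enumeration is trivial; the hardness results in the abstract concern settings where the set of minimal supports cannot be handled this cheaply.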

Updated: 2025-07-31 02:08:12

Categories: cs.AI

Download: http://arxiv.org/abs/2507.23191v1

Revisiting the Evaluation Bias Introduced by Frame Sampling Strategies in Surgical Video Segmentation Using SAM2

Real-time video segmentation is a promising opportunity for AI-assisted surgery, offering intraoperative guidance by identifying tools and anatomical structures. Despite growing interest in surgical video segmentation, annotation protocols vary widely across datasets -- some provide dense, frame-by-frame labels, while others rely on sparse annotations sampled at low frame rates such as 1 FPS. In this study, we investigate how such inconsistencies in annotation density and frame rate sampling influence the evaluation of zero-shot segmentation models, using SAM2 as a case study for cholecystectomy procedures. Surprisingly, we find that under conventional sparse evaluation settings, lower frame rates can appear to outperform higher ones due to a smoothing effect that conceals temporal inconsistencies. However, when assessed under real-time streaming conditions, higher frame rates yield superior segmentation stability, particularly for dynamic objects like surgical graspers. To understand how these differences align with human perception, we conducted a survey among surgeons, nurses, and machine learning engineers and found that participants consistently preferred high-FPS segmentation overlays, reinforcing the importance of evaluating every frame in real-time applications rather than relying on sparse sampling strategies. Our findings highlight the risk of evaluation bias that is introduced by inconsistent dataset protocols and bring attention to the need for temporally fair benchmarking in surgical video AI.
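The evaluation bias described above can be illustrated with a toy calculation. This is only a sketch on synthetic per-frame IoU values, not the study's protocol or data: it assumes a 30 FPS stream scored either on every frame or only on frames sampled at 1 FPS, with brief segmentation dropouts placed between the sampled frames.

```python
def mean_iou(per_frame_iou, stride=1):
    """Average IoU over every `stride`-th frame (stride=1 scores all frames)."""
    sampled = per_frame_iou[::stride]
    return sum(sampled) / len(sampled)


# Synthetic example: 90 frames (3 s at 30 FPS) with IoU 0.9 everywhere,
# except for short dropouts that fall between the 1 FPS sample points.
per_frame = [0.9] * 90
for t in (7, 8, 41, 42, 43, 77):
    per_frame[t] = 0.0

dense_score = mean_iou(per_frame, stride=1)    # scores every frame
sparse_score = mean_iou(per_frame, stride=30)  # scores frames 0, 30, 60 only
```

Here the sparse score stays at 0.9 while the dense score drops to 0.84, because the sampled frames never land on a dropout. Which frames are scored is the only difference between the two numbers.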

Updated: 2025-07-31 02:07:41

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2502.20934v3

Accessibility Scout: Personalized Accessibility Scans of Built Environments

Assessing the accessibility of unfamiliar built environments is critical for people with disabilities. However, manual assessments, performed by users or their personal health professionals, are laborious and unscalable, while automatic machine learning methods often neglect an individual user's unique needs. Recent advances in Large Language Models (LLMs) enable novel approaches to this problem, balancing personalization with scalability to enable more adaptive and context-aware assessments of accessibility. We present Accessibility Scout, an LLM-based accessibility scanning system that identifies accessibility concerns from photos of built environments. With use, Accessibility Scout becomes an increasingly capable "accessibility scout", tailoring accessibility scans to an individual's mobility level, preferences, and specific environmental interests through collaborative Human-AI assessments. We present findings from three studies: a formative study with six participants to inform the design of Accessibility Scout, a technical evaluation of 500 images of built environments, and a user study with 10 participants of varying mobility. Results from our technical evaluation and user study show that Accessibility Scout can generate personalized accessibility scans that extend beyond traditional ADA considerations. Finally, we conclude with a discussion on the implications of our work and future steps for building more scalable and personalized accessibility assessments of the physical world.

Updated: 2025-07-31 02:07:31

Categories: cs.HC,cs.AI,cs.CV,cs.MA

Download: http://arxiv.org/abs/2507.23190v1

NaN-Propagation: A Novel Method for Sparsity Detection in Black-Box Computational Functions

Sparsity detection in black-box functions enables significant computational speedups in gradient-based optimization through Jacobian compression, but existing finite-difference methods suffer from false negatives due to coincidental zero gradients. These false negatives can silently corrupt gradient calculations, leading to difficult-to-diagnose errors. We introduce NaN-propagation, which exploits the universal contamination property of IEEE 754 Not-a-Number floating-point values to trace input-output dependencies through floating-point numerical computations. By systematically contaminating inputs with NaN and observing which outputs become NaN, the method reconstructs conservative sparsity patterns that eliminate false negatives. We demonstrate the approach on an aerospace wing weight model, achieving a 1.52x speedup while detecting dozens of dependencies missed by conventional methods -- a significant improvement since gradient computation is the bottleneck in many optimization workflows. The technique leverages IEEE 754 compliance to work across programming languages and math libraries without modifying existing black-box codes. Advanced strategies including NaN payload encoding enable faster-than-linear time complexity, improving upon existing black-box sparsity detection methods. Practical algorithms are also proposed to mitigate challenges from branching code execution common in engineering applications.
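The basic mechanism can be sketched in a few lines. The following is a minimal illustration of the one-input-at-a-time variant only, assuming a black box that maps a list of floats to a list of floats; `nan_sparsity_pattern` and `example_model` are hypothetical names, and the payload-encoding and branching strategies mentioned in the abstract are not covered.

```python
import math


def nan_sparsity_pattern(f, x0):
    """Return pattern[i][j] == True if output i turns NaN when input j is NaN."""
    n_in = len(x0)
    n_out = len(f(list(x0)))
    pattern = [[False] * n_in for _ in range(n_out)]
    for j in range(n_in):
        x = list(x0)
        x[j] = math.nan  # contaminate exactly one input
        y = f(x)
        for i, yi in enumerate(y):
            pattern[i][j] = math.isnan(yi)
    return pattern


def example_model(x):
    # Hypothetical black box: y0 depends on x0 and x1, y1 depends on x2 only.
    return [x[0] * x[1], 3.0 * x[2] + 1.0]


# At x0 = 0 the derivative of y0 with respect to x1 is exactly zero, so a
# finite-difference probe at this point would miss that dependency.
pattern = nan_sparsity_pattern(example_model, [0.0, 2.0, 5.0])
```

Because `0.0 * nan` is still NaN under IEEE 754, the dependency of `y0` on `x1` is detected even though its gradient vanishes at the evaluation point. Code that branches on a comparison involving NaN can hide dependencies from this simple probe, which is the kind of challenge the abstract says its practical algorithms address.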

Updated: 2025-07-31 01:48:56

Categories: cs.LG,cs.PL

Download: http://arxiv.org/abs/2507.23186v1

H2Tune: Federated Foundation Model Fine-Tuning with Hybrid Heterogeneity

Different from existing federated fine-tuning (FFT) methods for foundation models, hybrid heterogeneous federated fine-tuning (HHFFT) is an under-explored scenario where clients exhibit double heterogeneity in model architectures and downstream tasks. This hybrid heterogeneity introduces two significant challenges: 1) heterogeneous matrix aggregation, where clients adopt different large-scale foundation models based on their task requirements and resource limitations, leading to dimensional mismatches during LoRA parameter aggregation; and 2) multi-task knowledge interference, where local shared parameters, trained with both task-shared and task-specific knowledge, cannot ensure only task-shared knowledge is transferred between clients. To address these challenges, we propose H2Tune, a method for federated foundation model fine-tuning with hybrid heterogeneity. Our framework H2Tune consists of three key components: (i) sparsified triple matrix decomposition to align hidden dimensions across clients through constructing rank-consistent middle matrices, with adaptive sparsification based on client resources; (ii) relation-guided matrix layer alignment to handle heterogeneous layer structures and representation capabilities; and (iii) alternating task-knowledge disentanglement mechanism to decouple shared and specific knowledge of local model parameters through alternating optimization. Theoretical analysis proves a convergence rate of $O(1/\sqrt{T})$. Extensive experiments show our method achieves up to 15.4% accuracy improvement compared to state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/H2Tune-1407.

Updated: 2025-07-31 01:43:24

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.22633v2

MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

Advances in deep learning for molecular generation show promise in accelerating drug discovery. Bayesian Flow Networks (BFNs) have recently shown impressive performance across diverse chemical tasks, with their success often ascribed to the paradigm of modeling in a low-variance parameter space. However, the Bayesian inference-based strategy imposes limitations on designing more flexible distribution transformation pathways, making it challenging to adapt to diverse data distributions and varied task requirements. Furthermore, the potential for simpler, more efficient parameter-space-based models is unexplored. To address this, we propose a novel Parameter Interpolation Flow model (named PIF) with detailed theoretical foundation, training, and inference procedures. We then develop MolPIF for structure-based drug design, demonstrating its superior performance across diverse metrics compared to baselines. This work validates the effectiveness of parameter-space-based generative modeling paradigm for molecules and offers new perspectives for model design.

Updated: 2025-07-31 01:38:49

Categories: cs.LG,q-bio.BM

Download: http://arxiv.org/abs/2507.13762v3

LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks

Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.

Updated: 2025-07-31 01:38:05

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.22477v2

Entanglement-induced provable and robust quantum learning advantages

Quantum computing holds unparalleled potential to enhance machine learning. However, a demonstration of quantum learning advantage has not been achieved so far. We make a step forward by rigorously establishing a noise-robust, unconditional quantum learning advantage in expressivity, inference speed, and training efficiency, compared to commonly-used classical models. Our proof is information-theoretic and pinpoints the origin of this advantage: entanglement can be used to reduce the communication required by non-local tasks. In particular, we design a task that can be solved with certainty by quantum models with a constant number of parameters using entanglement, whereas commonly-used classical models must scale linearly to achieve a larger-than-exponentially-small accuracy. We show that the quantum model is trainable with constant resources and robust against constant noise. Through numerical and trapped-ion experiments on IonQ Aria, we demonstrate the desired advantage. Our results provide valuable guidance for demonstrating quantum learning advantages with current noisy intermediate-scale devices.

Updated: 2025-07-31 01:32:55

Categories: quant-ph,cs.CC,cs.LG

Download: http://arxiv.org/abs/2410.03094v2

EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts

Mistake action detection is crucial for developing intelligent archives that detect workers' errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Additionally, current mistake datasets rarely use procedural texts for video recording except for cooking. To fill these gaps, this paper proposes the EgoOops dataset, where egocentric videos record erroneous activities when following procedural texts across diverse domains. It features three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. We also propose a mistake detection approach, combining video-text alignment and mistake label classification to leverage the texts. Our experimental results show that incorporating procedural texts is essential for mistake detection. Data is available through https://y-haneji.github.io/EgoOops-project-page/.

Updated: 2025-07-31 01:32:29

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2410.05343v3

A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges

With the global population growing and arable land resources becoming increasingly scarce, smart agriculture and precision agriculture have emerged as key directions for the future of agricultural development. Artificial intelligence (AI) technologies, particularly deep learning models, have found widespread applications in areas such as crop monitoring and pest detection. As an emerging generative model, diffusion models have shown significant promise in tasks like agricultural image processing, data augmentation, and remote sensing. Compared to traditional generative adversarial networks (GANs), diffusion models offer superior training stability and generation quality, effectively addressing challenges such as limited agricultural data and imbalanced image samples. This paper reviews the latest advancements in the application of diffusion models in agriculture, focusing on their potential in crop pest and disease detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Experimental results demonstrate that diffusion models significantly improve model accuracy and robustness in data augmentation, image generation, and denoising, especially in complex environments. Despite challenges related to computational efficiency and generalization capabilities, diffusion models are expected to play an increasingly important role in smart and precision agriculture as technology advances, providing substantial support for the sustainable development of global agriculture.

Updated: 2025-07-31 01:28:03

Categories: cs.LG

Download: http://arxiv.org/abs/2507.18376v2

Predicting Large-scale Urban Network Dynamics with Energy-informed Graph Neural Diffusion

Networked urban systems facilitate the flow of people, resources, and services, and are essential for economic and social interactions. These systems often involve complex processes with unknown governing rules, observed by sensor-based time series. To aid decision-making in industrial and engineering contexts, data-driven predictive models are used to forecast spatiotemporal dynamics of urban systems. Current models such as graph neural networks have shown promise but face a trade-off between efficacy and efficiency due to computational demands. Hence, their applications in large-scale networks still require further efforts. This paper addresses this trade-off challenge by drawing inspiration from physical laws to inform essential model designs that align with fundamental principles and avoid architectural redundancy. By understanding both micro- and macro-processes, we present a principled interpretable neural diffusion scheme based on Transformer-like structures whose attention layers are induced by low-dimensional embeddings. The proposed scalable spatiotemporal Transformer (ScaleSTF), with linear complexity, is validated on large-scale urban systems including traffic flow, solar power, and smart meters, showing state-of-the-art performance and remarkable scalability. Our results constitute a fresh perspective on the dynamics prediction in large-scale urban networks.
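One generic way an attention layer built from low-dimensional embeddings can reach linear complexity is by reordering the matrix products. The sketch below assumes an unnormalized attention matrix of the form `E @ E.T` for node embeddings `E` of width `d`; this is an illustrative assumption, not a description of how ScaleSTF is actually implemented. All helper names and sizes are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


random.seed(0)
n, d, c = 50, 3, 2  # nodes, embedding width, feature channels (toy sizes)
E = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
X = [[random.gauss(0.0, 1.0) for _ in range(c)] for _ in range(n)]

# Dense route: build the n x n attention matrix first, O(n^2) time and memory.
dense = matmul(matmul(E, transpose(E)), X)

# Factored route: contract E^T X first, O(n * d * c) and no n x n matrix.
factored = matmul(E, matmul(transpose(E), X))

max_gap = max(abs(p - q)
              for row_p, row_q in zip(dense, factored)
              for p, q in zip(row_p, row_q))
```

Both routes give the same result up to floating-point rounding, since matrix multiplication is associative. Only the factored route scales linearly in the number of nodes.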

Updated: 2025-07-31 01:24:01

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.00037v1

AutoBridge: Automating Smart Device Integration with Centralized Platform

Multimodal IoT systems coordinate diverse IoT devices to deliver human-centered services. The ability to incorporate new IoT devices under the management of a centralized platform is an essential requirement. However, it requires significant human expertise and effort to program the complex IoT integration code that enables the platform to understand and control the device functions. Therefore, we propose AutoBridge to automate IoT integration code generation. Specifically, AutoBridge adopts a divide-and-conquer strategy: it first generates device control logic by progressively retrieving device-specific knowledge, then synthesizes platform-compliant integration code using platform-specific knowledge. To ensure correctness, AutoBridge features a multi-stage debugging pipeline, including an automated debugger for virtual IoT device testing and an interactive hardware-in-the-loop debugger that requires only binary user feedback (yes and no) for real-device verification. We evaluate AutoBridge on a benchmark of 34 IoT devices across two open-source IoT platforms. The results demonstrate that AutoBridge can achieve an average success rate of 93.87% and an average function coverage of 94.87%, without any human involvement. With minimal binary yes and no feedback from users, the code is then revised to reach 100% function coverage. A user study with 15 participants further shows that AutoBridge outperforms expert programmers by 50% to 80% in code accuracy, even when the programmers are allowed to use commercial code LLMs.

Updated: 2025-07-31 01:14:14

Categories: cs.SE,cs.AI,I.2.5

Download: http://arxiv.org/abs/2507.23178v1

CNN-based solution for mango classification in agricultural environments

This article exemplifies the design of a fruit detection and classification system using Convolutional Neural Networks (CNN). The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Specifically, a method for mango fruit classification was developed using image processing, ensuring both accuracy and efficiency. ResNet-18 was selected as the preliminary architecture for classification, while a cascade detector was used for detection, balancing execution speed and computational resource consumption. Detection and classification results were displayed through a graphical interface developed in MATLAB App Designer, streamlining system interaction. The integration of convolutional neural networks and cascade detectors offers a reliable solution for fruit classification and detection, with potential applications in agricultural quality control.

Updated: 2025-07-31 00:58:34

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.23174v1

BAR Conjecture: the Feasibility of Inference Budget-Constrained LLM Services with Authenticity and Reasoning

When designing LLM services, practitioners care about three key properties: inference-time budget, factual authenticity, and reasoning capacity. However, our analysis shows that no model can simultaneously optimize for all three. We formally prove this trade-off and propose a principled framework named The BAR Theorem for LLM-application design.

Updated: 2025-07-31 00:51:16

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23170v1

Leveraging LLMs to Create Content Corpora for Niche Domains

Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategic framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.

Updated: 2025-07-31 00:49:03

Categories: cs.CL,cs.AI,cs.CY,I.2.7; H.3.1; H.3.3

Download: http://arxiv.org/abs/2505.02851v2

LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
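The weighting idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each model gets its own linear predictor whose output is squashed by a sigmoid into a confidence, and the ensemble answer is the one with the largest summed confidence. The feature vectors stand in for hidden-state and probability features, and the weights are made-up values that would in practice be fit on labeled held-out questions.

```python
import math


def confidence(features, weights, bias):
    """Linear score passed through a sigmoid: estimated P(answer is correct)."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def weighted_vote(answers, confidences):
    """Return the answer whose supporters have the largest total confidence."""
    totals = {}
    for answer, conf in zip(answers, confidences):
        totals[answer] = totals.get(answer, 0.0) + conf
    return max(totals, key=totals.get)


# Hypothetical per-model predictors (weights, bias) and per-question features.
predictors = [([1.0, -1.0], 0.0), ([1.0, -1.0], 0.0), ([1.0, -1.0], 0.0)]
features = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
answers = ["A", "B", "B"]

confs = [confidence(x, w, b) for x, (w, b) in zip(features, predictors)]
winner = weighted_vote(answers, confs)
```

In this toy case plain majority voting would pick "B", while the confidence-weighted vote picks "A" because the single model answering "A" is estimated to be far more reliable on this question.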

Updated: 2025-07-31 00:35:45

Categories: cs.CL,cs.AI,cs.LG,cs.MA

Download: http://arxiv.org/abs/2507.23167v1

Tensor Product Neural Networks for Functional ANOVA Model

Interpretability for machine learning models is becoming more and more important as machine learning models become more complex. The functional ANOVA model, which decomposes a high-dimensional function into a sum of lower dimensional functions (commonly referred to as components), is one of the most popular tools for interpretable AI, and recently, various neural networks have been developed for estimating each component in the functional ANOVA model. However, such neural networks are highly unstable when estimating each component since the components themselves are not uniquely defined. That is, there are multiple functional ANOVA decompositions for a given function. In this paper, we propose a novel neural network which guarantees a unique functional ANOVA decomposition and thus is able to estimate each component stably and accurately. We call our proposed neural network ANOVA Tensor Product Neural Network (ANOVA-TPNN) since it is motivated by the tensor product basis expansion. Theoretically, we prove that ANOVA-TPNN can approximate any smooth function well. Empirically, we show that ANOVA-TPNN provides much more stable estimation of each component, and thus much more stable interpretation, than existing neural networks do when the training data and initial values of the model parameters vary.
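For reference, the decomposition and its non-uniqueness can be written out as follows. The first line is the standard functional ANOVA form. The second shows why components are not unique without extra conditions. The third is one commonly used identifiability condition, included here as an assumption, since the abstract does not say which condition ANOVA-TPNN enforces; $\mu_j$ denotes a chosen measure on the $j$-th input.

```latex
% Functional ANOVA decomposition of f into main effects and interactions
f(x_1,\dots,x_p) = \beta_0 + \sum_{j=1}^{p} f_j(x_j)
                 + \sum_{j<k} f_{jk}(x_j, x_k) + \cdots + f_{1\cdots p}(x_1,\dots,x_p)

% Non-uniqueness: for any constant c, shifting mass between components
% leaves f unchanged
\beta_0 + f_1(x_1) = (\beta_0 - c) + \bigl(f_1(x_1) + c\bigr)

% A common identifiability condition: for every component f_S and every j \in S
\int f_S(x_S)\, d\mu_j(x_j) = 0
```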

Updated: 2025-07-31 00:17:08

领域: stat.ML,cs.LG,math.ST,stat.TH

下载: http://arxiv.org/abs/2502.15215v5
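
The non-uniqueness mentioned in the abstract above can be made concrete with a small sketch. It takes a toy function f(x1, x2) = x1 + x2 + x1*x2 on a uniform grid (an assumption chosen for illustration), shows two different decompositions that both reproduce f, and then applies the classical mean-zero (centering) constraint to obtain a single decomposition. The centering constraint is the textbook identifiability condition; the abstract does not state which condition ANOVA-TPNN itself enforces, so this is not a description of the paper's construction.

```python
from itertools import product

grid = [0.0, 0.5, 1.0]  # uniform grid for both inputs (assumption)


def f(x1, x2):
    return x1 + x2 + x1 * x2


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Two different (f0, f1, f2, f12) decompositions that both sum to f:
# moving a constant between the main effects leaves the sum unchanged.
def decomposition_a(x1, x2):
    return (0.0, x1, x2, x1 * x2)


def decomposition_b(x1, x2):
    return (0.0, x1 + 1.0, x2 - 1.0, x1 * x2)


# Centered decomposition: every component averages to zero over each
# of its arguments, which removes the ambiguity shown above.
f0 = mean(f(a, b) for a, b in product(grid, grid))
f1 = {a: mean(f(a, b) for b in grid) - f0 for a in grid}
f2 = {b: mean(f(a, b) for a in grid) - f0 for b in grid}
f12 = {
    (a, b): f(a, b) - f0 - f1[a] - f2[b] for a, b in product(grid, grid)
}
```

Under the centering constraint the constant shift used in `decomposition_b` is no longer allowed, because it would make the main effects average to a nonzero value.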

Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability

Deep Neural Networks (DNNs) deliver impressive performance but their black-box nature limits deployment in high-stakes domains requiring transparency. We introduce Compositional Function Networks (CFNs), a novel framework that builds inherently interpretable models by composing elementary mathematical functions with clear semantics. Unlike existing interpretable approaches that are limited to simple additive structures, CFNs support diverse compositional patterns -- sequential, parallel, and conditional -- enabling complex feature interactions while maintaining transparency. A key innovation is that CFNs are fully differentiable, allowing efficient training through standard gradient descent. We demonstrate CFNs' versatility across multiple domains, from symbolic regression to image classification with deep hierarchical networks. Our empirical evaluation shows CFNs achieve competitive performance against black-box models (96.24% accuracy on CIFAR-10) while outperforming state-of-the-art interpretable models like Explainable Boosting Machines. By combining the hierarchical expressiveness and efficient training of deep learning with the intrinsic interpretability of well-defined mathematical functions, CFNs offer a powerful framework for applications where both performance and accountability are paramount.

Updated: 2025-07-31 00:08:48

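Title: Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability

Abstract: Deep Neural Networks (DNNs) deliver impressive performance, but their black-box nature limits their deployment in high-stakes domains that require transparency. We introduce Compositional Function Networks (CFNs), a novel framework that builds inherently interpretable models by composing elementary mathematical functions with clear semantics. Unlike existing interpretable approaches, which are limited to simple additive structures, CFNs support a variety of compositional patterns (sequential, parallel, and conditional), enabling complex feature interactions while maintaining transparency. A key innovation is that CFNs are fully differentiable, allowing efficient training with standard gradient descent. We demonstrate the versatility of CFNs across multiple domains, from symbolic regression to image classification with deep hierarchical networks. Our empirical evaluation shows that CFNs achieve performance competitive with black-box models (96.24% accuracy on CIFAR-10) while outperforming state-of-the-art interpretable models such as Explainable Boosting Machines. By combining the hierarchical expressiveness and efficient training of deep learning with the intrinsic interpretability of well-defined mathematical functions, CFNs offer a powerful framework for applications in which performance and accountability are equally important.

Updated: 2025-07-31 00:08:48

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.21004v2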
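
The three compositional patterns named in the abstract above (sequential, parallel, conditional) can be sketched with plain closures. This is only an illustration under stated assumptions: the node names (`linear`, `sine`), the choice of summation for the parallel pattern, and the sigmoid soft gate for the conditional pattern are all hypothetical and are not the paper's API. The soft gate is used here because it keeps the composed function smooth, which is what gradient-based training requires.

```python
import math


# Elementary function nodes with readable semantics (hypothetical names).
def linear(w, b):
    return lambda x: w * x + b


def sine(freq):
    return lambda x: math.sin(freq * x)


def sequential(*fns):
    """Feed the output of each function into the next one."""
    def run(x):
        for fn in fns:
            x = fn(x)
        return x
    return run


def parallel(*fns):
    """Apply every function to the same input and sum the results."""
    return lambda x: sum(fn(x) for fn in fns)


def conditional(gate, if_true, if_false):
    """Blend two branches with a sigmoid gate instead of a hard if/else."""
    def run(x):
        g = 1.0 / (1.0 + math.exp(-gate(x)))
        return g * if_true(x) + (1.0 - g) * if_false(x)
    return run


# model(x) = sin(2x) + [soft choice between x and -x, gated by 4x]
model = parallel(
    sequential(linear(2.0, 0.0), sine(1.0)),
    conditional(linear(4.0, 0.0), linear(1.0, 0.0), linear(-1.0, 0.0)),
)
```

Each part of `model` can be read off directly from its definition: a sine of a scaled input, plus a smooth approximation of the absolute value produced by the gated branch.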