    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        

Articles: 0

Last Updated: N/A (+00:00)

A deep graph model for the signed interaction prediction in biological network

Predicting signed interactions in biological networks is crucial for understanding drug mechanisms and facilitating drug repurposing. While deep graph models have demonstrated success in modeling complex biological systems, existing approaches often fail to distinguish between positive and negative interactions, limiting their utility for precise pharmacological predictions. In this study, we propose a novel deep graph model, \textbf{RGCNTD} (Relational Graph Convolutional Network with Tensor Decomposition), designed to predict both polar (e.g., activation, inhibition) and non-polar (e.g., binding, affect) chemical-gene interactions. Our model integrates graph convolutional networks with tensor decomposition to enhance feature representation and incorporates a conflict-aware sampling strategy to resolve polarity ambiguities. We introduce new evaluation metrics, \textit{AUC\textsubscript{polarity}} and \textit{CP@500}, to assess the model's ability to differentiate interaction types. Experimental results demonstrate that \textbf{RGCNTD} outperforms baseline models, achieving superior classification accuracy and improved discrimination of polar edges. Furthermore, we analyze the impact of subgraph components on predictive performance, revealing that additional network structures do not always enhance accuracy. These findings highlight the importance of polarity-aware modeling in drug discovery and network pharmacology, providing a robust framework for predicting complex biological interactions.
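
As a rough illustration of the relational message passing such models build on, the sketch below implements a generic R-GCN-style layer in numpy, with one weight matrix per edge type (e.g., activation vs. inhibition). It is a hedged sketch of the general technique, not the authors' RGCNTD code: the tensor-decomposition scoring head and the conflict-aware sampling are elided, and all shapes and weights are illustrative.

```python
import numpy as np

def rgcn_layer(H, A_by_rel, W_by_rel, W_self):
    """One relational graph convolution step (Schlichtkrull-style):
    each relation r has its own weight matrix, neighbor messages are
    degree-normalized and summed over relations, plus a self-loop term."""
    out = H @ W_self
    for A, W in zip(A_by_rel, W_by_rel):
        deg = A.sum(axis=1, keepdims=True)          # per-node neighbor count for relation r
        norm = np.divide(A, np.maximum(deg, 1.0))   # row-normalized adjacency
        out += norm @ H @ W                         # aggregate neighbor features
    return np.maximum(out, 0.0)                     # ReLU

# toy graph: 4 nodes, 2 relations (e.g., "activates" / "inhibits"), 8-dim features
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
A_by_rel = [rng.integers(0, 2, size=(4, 4)).astype(float) for _ in range(2)]
W_by_rel = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(2)]
H1 = rgcn_layer(H, A_by_rel, W_by_rel, rng.normal(scale=0.1, size=(8, 8)))
print(H1.shape)  # (4, 8)
```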

Updated: 2025-03-17 23:54:17

标题: 一个用于生物网络中带符号相互作用预测的深度图模型

摘要: 在生物网络中预测带符号的相互作用对于理解药物机制和促进药物重定位至关重要。尽管深度图模型在建模复杂生物系统方面取得了成功,但现有方法往往无法区分正向和负向相互作用,从而限制了它们在精确药理预测中的效用。在本研究中,我们提出了一种新颖的深度图模型\textbf{RGCNTD}(结合张量分解的关系图卷积网络),旨在预测极性(例如激活、抑制)和非极性(例如结合、影响)的化学物-基因相互作用。我们的模型将图卷积网络与张量分解相结合以增强特征表示,并采用冲突感知的采样策略来解决极性歧义。我们引入了新的评估指标\textit{AUC\textsubscript{polarity}}和\textit{CP@500},以评估模型区分相互作用类型的能力。实验结果表明,\textbf{RGCNTD}优于基准模型,实现了更高的分类准确率和更好的极性边区分能力。此外,我们分析了子图组件对预测性能的影响,发现额外的网络结构并不总能提高准确率。这些发现突显了极性感知建模在药物发现和网络药理学中的重要性,为预测复杂的生物相互作用提供了一个稳健的框架。

更新时间: 2025-03-17 23:54:17

领域: cs.LG,q-bio.MN

下载: http://arxiv.org/abs/2407.07357v2

TAPE: Tailored Posterior Difference for Auditing of Machine Unlearning

With the increasing prevalence of Web-based platforms handling vast amounts of user data, machine unlearning has emerged as a crucial mechanism to uphold users' right to be forgotten, enabling individuals to request the removal of their specified data from trained models. However, the auditing of machine unlearning processes remains significantly underexplored. Although some existing methods offer unlearning auditing by leveraging backdoors, these backdoor-based approaches are inefficient and impractical, as they necessitate involvement in the initial model training process to embed the backdoors. In this paper, we propose a TAilored Posterior diffErence (TAPE) method to provide unlearning auditing independently of original model training. We observe that the process of machine unlearning inherently introduces changes in the model, which contain information related to the erased data. TAPE leverages unlearning model differences to assess how much information has been removed through the unlearning operation. Firstly, TAPE mimics the unlearned posterior differences by quickly building unlearned shadow models based on first-order influence estimation. Secondly, we train a Reconstructor model to extract and evaluate the private information of the unlearned posterior differences to audit unlearning. Existing privacy reconstruction methods based on posterior differences are only feasible for model updates of a single sample. To make the reconstruction effective for multi-sample unlearning requests, we propose two strategies, unlearned data perturbation and unlearned influence-based division, to augment the posterior difference. Extensive experimental results indicate the significant superiority of TAPE over state-of-the-art unlearning verification methods, with at least a 4.5$\times$ efficiency speedup and support for auditing in broader unlearning scenarios.
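
To make the posterior-difference idea concrete, here is a hedged toy sketch (logistic regression, numpy): a single first-order influence step stands in for the unlearned shadow model, and the posterior difference on probe inputs is the signal a Reconstructor would consume. Function names and the one-step update are illustrative assumptions, not TAPE's actual construction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_logloss(theta, x, y):
    # gradient of binary cross-entropy for one example (descent uses -grad)
    return (sigmoid(x @ theta) - y) * x

def shadow_unlearned(theta, X_forget, y_forget, step=0.1):
    """First-order influence approximation of the unlearned model: nudge the
    parameters opposite to the forgotten samples' training signal.
    (Illustrative stand-in for TAPE's shadow-model construction.)"""
    theta_shadow = theta.copy()
    for x, y in zip(X_forget, y_forget):
        theta_shadow += step * grad_logloss(theta, x, y)  # undo descent on this sample
    return theta_shadow

def posterior_difference(theta_a, theta_b, X_probe):
    # difference of predicted posteriors on probe inputs: the quantity a
    # reconstructor model would be trained to invert
    return sigmoid(X_probe @ theta_a) - sigmoid(X_probe @ theta_b)

rng = np.random.default_rng(1)
theta = rng.normal(size=5)
X_f, y_f = rng.normal(size=(3, 5)), rng.integers(0, 2, size=3)
theta_u = shadow_unlearned(theta, X_f, y_f)
print(posterior_difference(theta, theta_u, rng.normal(size=(4, 5))))
```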

Updated: 2025-03-17 23:51:45

标题: TAPE:用于机器遗忘审计的定制后验差异方法

摘要: 随着处理大量用户数据的网络平台日益普及,机器遗忘已经成为维护用户被遗忘权的关键机制,使个人能够请求从训练模型中删除其指定数据。然而,机器遗忘过程的审计仍然明显缺乏研究。尽管一些现有方法通过利用后门提供遗忘审计,但这些基于后门的方法效率低下且不切实际,因为它们需要介入初始模型训练过程以嵌入后门。在本文中,我们提出了一种名为TAPE(TAilored Posterior diffErence)的方法,可独立于原始模型训练提供遗忘审计。我们观察到,机器遗忘过程本质上会给模型引入变化,而这些变化包含与被擦除数据相关的信息。TAPE利用遗忘前后的模型差异来评估遗忘操作移除了多少信息。首先,TAPE基于一阶影响估计快速构建遗忘影子模型,以模拟遗忘后的后验差异。其次,我们训练一个Reconstructor模型来提取并评估后验差异中的隐私信息,以审计遗忘。现有基于后验差异的隐私重建方法仅适用于单个样本的模型更新。为了使重建对多样本遗忘请求有效,我们提出了两种策略,即遗忘数据扰动和基于遗忘影响的划分,以增强后验差异。大量实验结果表明,TAPE相对于最先进的遗忘验证方法具有显著优势:至少有4.5倍的效率提升,并支持更广泛的遗忘场景的审计。

更新时间: 2025-03-17 23:51:45

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2502.19770v2

Using 3D reconstruction from image motion to predict total leaf area in dwarf tomato plants

Accurate estimation of total leaf area (TLA) is crucial for evaluating plant growth, photosynthetic activity, and transpiration. However, it remains challenging for bushy plants like dwarf tomatoes due to their complex canopies. Traditional methods are often labor-intensive, damaging to plants, or limited in capturing canopy complexity. This study evaluated a non-destructive method combining sequential 3D reconstructions from RGB images and machine learning to estimate TLA for three dwarf tomato cultivars: Mohamed, Hahms Gelbe Topftomate, and Red Robin -- grown under controlled greenhouse conditions. Two experiments (spring-summer and autumn-winter) included 73 plants, yielding 418 TLA measurements via an "onion" approach. High-resolution videos were recorded, and 500 frames per plant were used for 3D reconstruction. Point clouds were processed using four algorithms (Alpha Shape, Marching Cubes, Poisson's, Ball Pivoting), and meshes were evaluated with seven regression models: Multivariable Linear Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, Random Forest, Extreme Gradient Boosting, and Multilayer Perceptron. The Alpha Shape reconstruction ($\alpha = 3$) with Extreme Gradient Boosting achieved the best performance ($R^2 = 0.80$, $MAE = 489 cm^2$). Cross-experiment validation showed robust results ($R^2 = 0.56$, $MAE = 579 cm^2$). Feature importance analysis identified height, width, and surface area as key predictors. This scalable, automated TLA estimation method is suited for urban farming and precision agriculture, offering applications in automated pruning, resource efficiency, and sustainable food production. The approach demonstrated robustness across variable environmental conditions and canopy structures.
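
A minimal sketch of the final regression stage, assuming mesh-derived features (height, width, surface area, the predictors the feature-importance analysis highlights) and synthetic data; scikit-learn's GradientBoostingRegressor stands in for Extreme Gradient Boosting, and the 3D reconstruction itself is elided.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# synthetic stand-ins for mesh-derived features: height, width, surface area
rng = np.random.default_rng(42)
n = 418                                    # number of TLA measurements in the study
X = np.column_stack([
    rng.uniform(10, 60, n),                # plant height (cm)
    rng.uniform(10, 50, n),                # canopy width (cm)
    rng.uniform(500, 5000, n),             # mesh surface area (cm^2)
])
y = 0.6 * X[:, 2] + 5 * X[:, 0] + rng.normal(0, 150, n)   # fake TLA target (cm^2)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"R2={r2_score(y_te, pred):.2f}  MAE={mean_absolute_error(y_te, pred):.0f} cm^2")
print("feature importances:", model.feature_importances_)
```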

Updated: 2025-03-17 23:51:19

标题: 使用来自图像运动的3D重建来预测矮生番茄植株的总叶面积

摘要: 准确估计总叶面积(TLA)对于评估植物生长、光合活性和蒸腾作用至关重要。然而,对于像矮生番茄这样枝叶茂密的植物来说,由于其冠层结构复杂,这仍然是一个挑战。传统方法通常劳动强度大、会损伤植物,或难以捕捉冠层的复杂性。本研究评估了一种非破坏性方法,将基于RGB图像序列的3D重建与机器学习相结合,用于估算在受控温室条件下生长的三种矮生番茄品种(Mohamed、Hahms Gelbe Topftomate和Red Robin)的TLA。两个实验(春夏季和秋冬季)共包括73株植物,通过"洋葱"方法得到了418个TLA测量值。研究记录了高分辨率视频,每株植物使用500帧图像进行3D重建。点云使用四种算法(Alpha Shape、Marching Cubes、Poisson's、Ball Pivoting)进行处理,网格使用七种回归模型进行评估:多变量线性回归、Lasso回归、Ridge回归、Elastic Net回归、随机森林、极端梯度提升和多层感知器。Alpha Shape重建(α=3)与极端梯度提升结合实现了最佳性能($R^2=0.80$,$MAE=489cm^2$)。交叉实验验证显示出稳健的结果($R^2=0.56$,$MAE=579cm^2$)。特征重要性分析确定高度、宽度和表面积为关键预测因子。这种可扩展的自动化TLA估算方法适用于城市农业和精准农业,在自动修剪、资源高效利用和可持续食品生产方面具有应用潜力。该方法在不同环境条件和冠层结构下均表现稳健。

更新时间: 2025-03-17 23:51:19

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.13778v1

Bridging Neural and Symbolic Representations with Transitional Dictionary Learning

This paper introduces a novel Transitional Dictionary Learning (TDL) framework that can implicitly learn symbolic knowledge, such as visual parts and relations, by reconstructing the input as a combination of parts with implicit relations. We propose a game-theoretic diffusion model to decompose the input into visual parts using the dictionaries learned by the Expectation Maximization (EM) algorithm, implemented as online prototype clustering based on the decomposition results. Additionally, two metrics, clustering information gain and heuristic shape score, are proposed to evaluate the model. Experiments are conducted on three abstract compositional visual object datasets, which require the model to utilize the compositionality of data instead of simply exploiting visual features. Then, three tasks on symbol grounding to predefined classes of parts and relations, as well as transfer learning to unseen classes, followed by a human evaluation, were carried out on these datasets. The results show that the proposed method discovers compositional patterns, which significantly outperforms the state-of-the-art unsupervised part segmentation methods that rely on visual features from pre-trained backbones. Furthermore, the proposed metrics are consistent with human evaluations.

Updated: 2025-03-17 23:44:57

标题: 用过渡性字典学习连接神经和符号表示

摘要: 本文介绍了一种新颖的过渡字典学习(TDL)框架,它通过将输入重建为带有隐式关系的部件组合,来隐式地学习视觉部件及其关系等符号知识。我们提出了一种博弈论扩散模型,利用由期望最大化(EM)算法学习到的字典将输入分解为视觉部件;该EM算法以基于分解结果的在线原型聚类来实现。此外,还提出了两个评估模型的度量指标:聚类信息增益和启发式形状得分。我们在三个抽象的组合式视觉对象数据集上进行实验,这些数据集要求模型利用数据的组合性,而不是简单地利用视觉特征。随后,在这些数据集上开展了三项任务:将符号接地到预定义的部件和关系类别,以及向未见类别的迁移学习,并辅以人工评估。结果表明,所提出的方法能够发现组合模式,明显优于依赖预训练骨干网络视觉特征的最先进无监督部件分割方法。此外,所提出的度量与人工评估结果一致。

更新时间: 2025-03-17 23:44:57

领域: cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2308.02000v2

Towards AI-assisted Academic Writing

We present components of an AI-assisted academic writing system including citation recommendation and introduction writing. The system recommends citations by considering the user's current document context to provide relevant suggestions. It generates introductions in a structured fashion, situating the contributions of the research relative to prior work. We demonstrate the effectiveness of the components through quantitative evaluations. Finally, the paper presents qualitative research exploring how researchers incorporate citations into their writing workflows. Our findings indicate that there is demand for precise AI-assisted writing systems and simple, effective methods for meeting those needs.
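
A hedged sketch of context-aware citation recommendation as a simple retrieval baseline: TF-IDF similarity between the user's current document context and a candidate pool. The paper's system is more sophisticated, and the candidate titles below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# a toy candidate pool; a real system would index full titles/abstracts
candidates = [
    "Attention is all you need: transformer architectures for sequences",
    "BERT: pre-training of deep bidirectional transformers",
    "Citation recommendation with contextual document embeddings",
    "A survey of crop phenotyping with computer vision",
]

def recommend_citations(context, pool, k=2):
    """Rank candidate papers by lexical similarity to the user's current
    document context (a retrieval baseline, not the paper's actual system)."""
    vec = TfidfVectorizer().fit(pool + [context])
    sims = cosine_similarity(vec.transform([context]), vec.transform(pool))[0]
    return [pool[i] for i in sims.argsort()[::-1][:k]]

print(recommend_citations("we fine-tune a transformer for citation tasks", candidates))
```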

Updated: 2025-03-17 23:30:17

标题: 朝向人工智能辅助学术写作

摘要: 我们提出了一个包括引文推荐和引言写作的AI辅助学术写作系统的组成部分。该系统通过考虑用户当前的文档背景来推荐引文,从而提供相关建议。它以结构化的方式生成引言,将研究的贡献定位于先前的工作之中。我们通过定量评估展示了这些组件的有效性。最后,本文通过定性研究探讨了研究人员如何将引文纳入其写作工作流程中。我们的研究结果表明,存在对精确的AI辅助写作系统和简单有效的满足这些需求的方法的需求。

更新时间: 2025-03-17 23:30:17

领域: cs.AI

下载: http://arxiv.org/abs/2503.13771v1

Are Metrics Enough? Guidelines for Communicating and Visualizing Predictive Models to Subject Matter Experts

Presenting a predictive model's performance is a communication bottleneck that threatens collaborations between data scientists and subject matter experts. Accuracy and error metrics alone fail to tell the whole story of a model - its risks, strengths, and limitations - making it difficult for subject matter experts to feel confident in their decision to use a model. As a result, models may fail in unexpected ways or go entirely unused, as subject matter experts disregard poorly presented models in favor of familiar, yet arguably substandard methods. In this paper, we describe an iterative study conducted with both subject matter experts and data scientists to understand the gaps in communication between these two groups. We find that, while the two groups share common goals of understanding the data and predictions of the model, friction can stem from unfamiliar terms, metrics, and visualizations - limiting the transfer of knowledge to SMEs and discouraging clarifying questions being asked during presentations. Based on our findings, we derive a set of communication guidelines that use visualization as a common medium for communicating the strengths and weaknesses of a model. We provide a demonstration of our guidelines in a regression modeling scenario and elicit feedback on their use from subject matter experts. From our demonstration, subject matter experts were more comfortable discussing a model's performance, more aware of the trade-offs for the presented model, and better equipped to assess the model's risks - ultimately informing and contextualizing the model's use beyond text and numbers.

Updated: 2025-03-17 23:19:41

标题: 度量指标足够吗?面向领域专家传达和可视化预测模型的指南

摘要: 呈现预测模型的性能是威胁数据科学家与领域专家之间合作的一个沟通瓶颈。仅凭准确度和误差指标无法完整呈现模型的全貌 - 其风险、优势和局限性 - 这使得领域专家难以对使用模型的决定抱有信心。因此,模型可能以意想不到的方式失效,或者完全不被使用,因为领域专家会忽视展示不佳的模型,转而选择熟悉但可以说是次优的方法。在本文中,我们描述了一项与领域专家和数据科学家共同进行的迭代研究,以了解这两个群体之间的沟通差距。我们发现,虽然这两个群体在理解数据和模型预测方面目标一致,但摩擦可能源于陌生的术语、指标和可视化 - 这限制了知识向领域专家的传递,并阻碍了在演示过程中提出澄清性问题。根据我们的发现,我们提出了一套沟通准则,使用可视化作为传达模型优势和劣势的共同媒介。我们在回归建模场景中演示了这些准则,并征求了领域专家对其使用的反馈。在我们的演示中,领域专家更愿意讨论模型的性能,更加了解所呈现模型的权衡,也更有能力评估模型的风险 - 最终超越文本和数字,为模型的使用提供信息和背景。

更新时间: 2025-03-17 23:19:41

领域: cs.HC,cs.LG

下载: http://arxiv.org/abs/2205.05749v3

Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness

Machine learning risks reinforcing biases present in data and, as we argue in this work, in what is absent from data. In healthcare, societal and decision biases shape patterns in missing data, yet the algorithmic fairness implications of group-specific missingness are poorly understood. The way we address missingness in healthcare can have detrimental impacts on downstream algorithmic fairness. Our work questions current recommendations and practices aimed at handling missing data with a focus on their effect on algorithmic fairness, and offers a path forward. Specifically, we consider the theoretical underpinnings of existing recommendations as well as their empirical predictive performance and corresponding algorithmic fairness measured through subgroup performances. Our results show that current practices for handling missingness lack principled foundations, are disconnected from the realities of missingness mechanisms in healthcare, and can be counterproductive. For example, we show that favouring a group-specific imputation strategy can be misguided and exacerbate prediction disparities. We then build on our findings to propose a framework for empirically guiding imputation choices, and an accompanying reporting framework. Our work constitutes an important contribution to recent efforts by regulators and practitioners to grapple with the realities of real-world data, and to foster the responsible and transparent deployment of machine learning systems. We demonstrate the practical utility of the proposed framework through experimentation on widely used datasets, where we show how the proposed framework can guide the selection of imputation strategies, allowing us to choose among strategies that yield equal overall predictive performance but present different algorithmic fairness properties.

Updated: 2025-03-17 23:15:24

标题: 临床就诊模式下的插补策略:对算法公平性的影响

摘要: 机器学习有可能强化数据中存在的偏见,而且正如我们在这项工作中所论证的,也会强化源自数据缺失部分的偏见。在医疗保健领域,社会偏见和决策偏见塑造了数据缺失的模式,然而特定群体数据缺失对算法公平性的影响尚不明确。我们处理医疗数据缺失的方式可能对下游算法公平性产生有害影响。我们的工作质疑了当前处理缺失数据的建议和实践,重点关注其对算法公平性的影响,并提出了一条前进的道路。具体来说,我们考察了现有建议的理论基础、它们的实证预测性能,以及通过子群体表现衡量的相应算法公平性。我们的结果表明,目前处理数据缺失的实践缺乏原则性基础,与医疗保健中缺失机制的现实脱节,并且可能适得其反。例如,我们表明,偏向于特定群体的插补策略可能是错误的,会加剧预测差距。随后,我们基于研究结果提出了一个以实证方式指导插补选择的框架,以及一个配套的报告框架。我们的工作为监管机构和从业者近期应对真实世界数据、促进机器学习系统负责任且透明部署的努力做出了重要贡献。我们通过在广泛使用的数据集上进行实验,展示了所提出框架的实用性:该框架可以指导插补策略的选择,使我们能够在总体预测性能相同但算法公平性属性不同的策略之间做出选择。

更新时间: 2025-03-17 23:15:24

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2208.06648v4

A finite-sample bound for identifying partially observed linear switched systems from a single trajectory

We derive a finite-sample probabilistic bound on the parameter estimation error of a system identification algorithm for Linear Switched Systems. The algorithm estimates Markov parameters from a single trajectory and applies a variant of the Ho-Kalman algorithm to recover the system matrices. Our bound guarantees statistical consistency under the assumption that the true system exhibits quadratic stability. The proof leverages the theory of weakly dependent processes. To the best of our knowledge, this is the first finite-sample bound for this algorithm in the single-trajectory setting.
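
For readers unfamiliar with the recovery step, the sketch below implements a plain Ho-Kalman procedure in numpy for a single LTI system: stack Markov parameters into a block Hankel matrix, factor it by SVD, and read off a balanced realization of (A, B, C). The paper's switched-system variant and its finite-sample analysis are not reproduced; the toy system is an assumption for the self-check.

```python
import numpy as np

def ho_kalman(markov, n, p, q):
    """Recover (A, B, C) of x_{t+1} = A x_t + B u_t, y_t = C x_t from Markov
    parameters G_k = C A^k B via a block-Hankel SVD (plain Ho-Kalman)."""
    H = np.block([[markov[i + j] for j in range(q)] for i in range(p)])
    Hs = np.block([[markov[i + j + 1] for j in range(q)] for i in range(p)])
    U, s, Vt = np.linalg.svd(H)
    sq = np.diag(np.sqrt(s[:n]))
    O = U[:, :n] @ sq                          # observability factor
    R = sq @ Vt[:n, :]                         # reachability factor
    C = O[: markov[0].shape[0], :]
    B = R[:, : markov[0].shape[1]]
    A = np.linalg.pinv(O) @ Hs @ np.linalg.pinv(R)
    return A, B, C

# self-check on a toy reachable and observable system (assumed for the demo)
A0 = np.array([[0.5, 0.1], [0.2, 0.3]])
B0 = np.array([[1.0], [0.0]])
C0 = np.array([[1.0, 1.0]])
markov = [C0 @ np.linalg.matrix_power(A0, k) @ B0 for k in range(8)]
A, B, C = ho_kalman(markov, n=2, p=3, q=3)
print(np.allclose(np.sort(np.linalg.eigvals(A).real),
                  np.sort(np.linalg.eigvals(A0).real), atol=1e-6))   # True
```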

Updated: 2025-03-17 23:02:22

标题: 从单条轨迹识别部分观测线性切换系统的有限样本界

摘要: 我们为一种线性切换系统辨识算法的参数估计误差推导了一个有限样本概率界。该算法从单条轨迹中估计马尔可夫参数,并应用Ho-Kalman算法的一个变体来恢复系统矩阵。在真实系统满足二次稳定性的假设下,我们的界保证了统计一致性。证明利用了弱相关过程的理论。据我们所知,这是该算法在单轨迹设置下的第一个有限样本界。

更新时间: 2025-03-17 23:02:22

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.13766v1

Reinforcement learning with combinatorial actions for coupled restless bandits

Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. We introduce coRMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods -- which cannot address sequential planning and combinatorial selection simultaneously -- by an average of 24.8\% on these difficult instances.
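
As a hedged illustration of "optimizing a Q-network over a combinatorial feasible set": the toy below brute-forces the argmax of a tiny random Q-network over all budget-feasible joint actions. SEQUOIA instead embeds the network into a mixed-integer program so a solver handles the exponentially large action space; the enumeration here only works at toy scale.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, B = 6, 2                                  # 6 arms, act on at most 2 per step
W1, W2 = rng.normal(size=(2 * N, 16)), rng.normal(size=16)

def q_value(state, action):
    """Tiny stand-in Q-network scoring a joint combinatorial action.
    SEQUOIA embeds such a network into a mixed-integer program so a solver
    finds the argmax; here we enumerate instead."""
    x = np.concatenate([state, action])
    return np.maximum(x @ W1, 0.0) @ W2      # one ReLU layer

def best_feasible_action(state):
    best, best_q = None, -np.inf
    for idx in itertools.combinations(range(N), B):   # budget-feasible subsets
        a = np.zeros(N)
        a[list(idx)] = 1.0
        q = q_value(state, a)
        if q > best_q:
            best, best_q = a, q
    return best, best_q

action, q = best_feasible_action(rng.normal(size=N))
print(action, round(float(q), 3))
```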

Updated: 2025-03-17 22:59:28

标题: 面向耦合不安分老虎机的组合动作强化学习

摘要: 强化学习(RL)越来越多地被应用于解决现实世界的规划问题,在处理大规模状态空间和长时间范围方面取得了进展。然而,许多领域的一个关键瓶颈是RL方法无法适应大规模、具有组合结构的动作空间。在这种情况下,即使只是表示单个时间步的可行动作集合,也可能需要复杂的离散优化形式。我们利用将已训练神经网络嵌入优化问题的最新进展,提出了SEQUOIA,这是一种在可行动作空间上直接优化长期奖励的RL算法。我们的方法将Q网络嵌入到混合整数规划中,以在每个时间步选择组合动作。在这里,我们重点研究不安分老虎机(restless bandits)上的规划,这是一类涵盖许多现实顺序决策场景的规划问题。我们提出了coRMAB,这是一类更广泛的带组合动作的不安分老虎机,其组合动作无法在各个臂之间解耦,需要在指数级大的联合动作空间上直接求解。我们在四个具有组合约束的新型不安分老虎机问题上对SEQUOIA进行了实证验证:多重干预、路径约束、二分图匹配和容量约束。在这些困难实例上,我们的方法平均比现有方法(无法同时处理顺序规划和组合选择)高出24.8%。

更新时间: 2025-03-17 22:59:28

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.01919v2

Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems

Fractional-order stochastic gradient descent (FOSGD) leverages a fractional exponent to capture long-memory effects in optimization, yet its practical impact is often constrained by the difficulty of tuning and stabilizing this exponent. In this work, we introduce 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), a novel method that synergistically combines the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to automatically calibrate the fractional exponent in a data-driven manner. By continuously gauging model sensitivity and effective dimensionality, 2SED dynamically adjusts the exponent to curb erratic oscillations and enhance convergence rates. Theoretically, we demonstrate how this dimension-aware adaptation retains the benefits of fractional memory while averting the sluggish or unstable behaviors frequently observed in naive fractional SGD. Empirical evaluations across multiple benchmarks confirm that our 2SED-driven fractional exponent approach not only converges faster but also achieves more robust final performance, suggesting broad applicability for fractional-order methodologies in large-scale machine learning and related domains.
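
One common way to realize fractional-order memory is a Grünwald-Letnikov expansion, where past gradients are mixed with power-law coefficients. The sketch below is a heavily hedged toy in that spirit: the GL coefficient recurrence is standard, but the specific update form is an assumption, and 2SED's data-driven calibration of the exponent is elided (a fixed alpha is used).

```python
import numpy as np

def gl_weights(alpha, K):
    """Grunwald-Letnikov coefficients w_j = (-1)^j * C(alpha, j): a power-law
    memory kernel used by fractional-order gradient methods."""
    w = np.empty(K)
    w[0] = 1.0
    for j in range(1, K):
        w[j] = w[j - 1] * (1.0 - (alpha + 1.0) / j)
    return w

def fosgd_step(theta, grad_history, lr, alpha, K=16):
    """One fractional-memory SGD step: mix the last K gradients with GL
    weights instead of using only the newest one. K=1 recovers vanilla SGD.
    (A toy sketch of the long-memory idea, not 2SEDFOSGD itself.)"""
    w = gl_weights(alpha, min(K, len(grad_history)))
    mix = sum(wj * g for wj, g in zip(w, reversed(grad_history)))
    return theta - lr * mix

# quadratic toy problem: minimize 0.5 * ||theta||^2, whose gradient is theta
theta, hist = np.array([2.0, -1.5]), []
for _ in range(50):
    hist.append(theta.copy())
    theta = fosgd_step(theta, hist, lr=0.3, alpha=0.9)
print(theta)   # shrinks only slowly: the long-memory kernel makes steps conservative
```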

Updated: 2025-03-17 22:57:37

标题: 考虑有效维度的分数阶随机梯度下降算法在凸优化问题中的应用

摘要: 分数阶随机梯度下降(FOSGD)利用分数指数来捕捉优化中的长记忆效应,但其实际效果常常受制于该指数难以调整和稳定。在这项工作中,我们提出了2SED分数阶随机梯度下降(2SEDFOSGD),这是一种将两尺度有效维度(2SED)算法与FOSGD相结合的新方法,能够以数据驱动的方式自动校准分数指数。通过持续评估模型的敏感性和有效维度,2SED动态调整该指数,以抑制不稳定的振荡并提高收敛速度。在理论上,我们展示了这种维度感知的自适应机制如何在保留分数记忆优势的同时,避免朴素分数阶SGD中常见的迟缓或不稳定行为。在多个基准上的实证评估证实,我们基于2SED的分数指数方法不仅收敛更快,而且最终性能更加稳健,表明分数阶方法在大规模机器学习及相关领域具有广泛的适用性。

更新时间: 2025-03-17 22:57:37

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2503.13764v1

Neural Edge Histogram Descriptors for Underwater Acoustic Target Recognition

Numerous maritime applications rely on the ability to recognize acoustic targets using passive sonar. While there is a growing reliance on pre-trained models for classification tasks, these models often require extensive computational resources and may not perform optimally when transferred to new domains due to dataset variations. To address these challenges, this work adapts the neural edge histogram descriptors (NEHD) method, originally developed for image classification, to classify passive sonar signals. We conduct a comprehensive evaluation of statistical and structural texture features, demonstrating that their combination achieves competitive performance with large pre-trained models. The proposed NEHD-based approach offers a lightweight and efficient solution for underwater target recognition, significantly reducing computational costs while maintaining accuracy.
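
For intuition, here is the classical (non-learned) edge histogram descriptor the method builds on, assuming a 2D array such as a sonar spectrogram: Sobel gradients followed by a magnitude-weighted orientation histogram. NEHD's contribution is making these statistical and structural texture operations learnable neural layers, which this fixed-filter sketch does not reproduce.

```python
import numpy as np

def edge_histogram(img, n_bins=8):
    """Classical edge-histogram descriptor: Sobel gradients, then a histogram
    of edge orientations weighted by edge magnitude."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T                                   # Sobel kernels for x and y
    pad = np.pad(img, 1, mode="edge")
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                    # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    return hist / max(hist.sum(), 1e-12)        # normalized structural texture feature

spec = np.random.default_rng(0).normal(size=(64, 64))  # stand-in for a sonar spectrogram
print(edge_histogram(spec))
```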

Updated: 2025-03-17 22:57:05

标题: 神经边缘直方图描述符用于水下声学目标识别

摘要: 许多海事应用依赖于使用被动声纳识别声学目标的能力。虽然分类任务越来越依赖预训练模型,但这些模型通常需要大量计算资源,并且由于数据集差异,迁移到新领域时可能表现不佳。为了解决这些挑战,本工作对最初为图像分类开发的神经边缘直方图描述符(NEHD)方法进行了调整,用于对被动声纳信号进行分类。我们对统计纹理特征和结构纹理特征进行了全面评估,证明二者的组合可以达到与大型预训练模型相当的性能。所提出的基于NEHD的方法为水下目标识别提供了一种轻量级且高效的解决方案,在保持准确性的同时显著降低了计算成本。

更新时间: 2025-03-17 22:57:05

领域: cs.LG,cs.SD,eess.AS

下载: http://arxiv.org/abs/2503.13763v1

Predicting Space Tourism Demand Using Explainable AI

Comprehensive forecasts of space tourism demand are crucial for businesses to optimize strategies and customer experiences in this burgeoning industry. Traditional methods struggle to capture the complex factors influencing an individual's decision to travel to space. In this paper, we propose an explainable and trustworthy artificial intelligence framework to address the challenge of predicting space tourism demand by following the National Institute of Standards and Technology guidelines. We develop a novel machine learning network, called SpaceNet, capable of learning wide-range dependencies in data and allowing us to analyze the relationships between various factors such as age, income, and risk tolerance. We investigate space travel demand in the US, categorizing it into four types: no travel, moon travel, suborbital, and orbital travel. To this end, we collected 1860 data points from respondents of different ages across many states and cities and then conducted our experiments on these data. In our experiments, SpaceNet achieves an average ROC-AUC of 0.82 $\pm$ 0.088, indicating strong classification performance. Our investigation demonstrated that travel price, age, annual income, gender, and fatality probability are important features in deciding whether a person wants to travel or not. Beyond demand forecasting, we use explainable AI to provide interpretation for the travel-type decisions of an individual, offering insights into the factors driving interest in space travel, which is not possible with traditional classification methods. This knowledge enables businesses to tailor marketing strategies and optimize service offerings in this rapidly evolving market. To the best of our knowledge, this is the first work to implement an explainable and interpretable AI framework for investigating the factors influencing space tourism.

Updated: 2025-03-17 22:40:34

标题: 使用可解释人工智能预测太空旅游需求

摘要: 对太空旅游需求进行全面预测,对于企业在这一蓬勃发展的行业中优化战略和客户体验至关重要。传统方法难以捕捉影响个人太空旅行决策的复杂因素。在本文中,我们遵循美国国家标准与技术研究院的指导方针,提出了一个可解释且可信的人工智能框架,以应对预测太空旅游需求的挑战。我们开发了一个名为SpaceNet的新型机器学习网络,能够学习数据中的大范围依赖关系,并让我们得以分析年龄、收入和风险承受能力等各种因素之间的关系。我们调查了美国的太空旅行需求,将其分为四种类型:不旅行、月球旅行、亚轨道旅行和轨道旅行。为此,我们在许多州和城市收集了不同年龄人群的1860个数据点,并用这些数据进行了实验。实验中,SpaceNet的平均ROC-AUC达到0.82 ± 0.088,表明其分类性能较强。我们的调查表明,旅行价格、年龄、年收入、性别和死亡概率是决定一个人是否愿意旅行的重要特征。除了需求预测,我们还利用可解释人工智能来解释个人的旅行类型决策,揭示驱动太空旅行兴趣的因素,这是传统分类方法无法做到的。这些知识使企业能够在这个快速发展的市场中量身定制营销策略并优化服务。据我们所知,这是第一个用可解释、可诠释的人工智能框架来探究影响太空旅游因素的工作。

更新时间: 2025-03-17 22:40:34

领域: cs.LG

下载: http://arxiv.org/abs/2503.03113v2

A Generalist Hanabi Agent

Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems are unable to perform well in any setting other than the one they have been trained on, and struggle to successfully cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card-game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game-setting (e.g., 2-player games), and play with the same algorithmic agents. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation- and action-space. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents -- agents that are themselves unable to do so. The implementation code is available at: $\href{https://github.com/chandar-lab/R3D2-A-Generalist-Hanabi-Agent}{R3D2-A-Generalist-Hanabi-Agent}$

Updated: 2025-03-17 22:25:15

标题: 一个多面手的花火智能体

摘要: 传统的多智能体强化学习(MARL)系统可以通过反复交互来发展合作策略。然而,这些系统在除了它们训练过的环境之外的任何其他环境中表现都不佳,并且很难成功地与陌生合作者合作。这在Hanabi基准测试中特别明显,这是一款流行的2至5人合作卡牌游戏,需要复杂的推理并为其他智能体提供精确的协助。目前用于Hanabi的MARL智能体只能学习一个特定的游戏设置(例如,2人游戏),并与相同的算法智能体一起玩。这与人类形成鲜明对比,人类可以快速调整他们的策略以适应陌生的合作伙伴或情况。在本文中,我们介绍了Recurrent Replay Relevance Distributed DQN(R3D2),这是一个为Hanabi设计的通用智能体,旨在克服这些局限性。我们使用文本重新表述了任务,因为已经证明语言可以提高迁移效果。然后,我们提出了一个分布式MARL算法,可以应对由此产生的动态观察和动作空间。通过这样做,我们的智能体是第一个可以同时玩所有游戏设置,并将从一个设置学到的策略扩展到其他设置的智能体。因此,我们的智能体还展示了与不同的算法智能体合作的能力——这些智能体本身无法做到。实现代码可在以下链接获得:[R3D2-A-Generalist-Hanabi-Agent](https://github.com/chandar-lab/R3D2-A-Generalist-Hanabi-Agent)

更新时间: 2025-03-17 22:25:15

领域: cs.MA,cs.AI

下载: http://arxiv.org/abs/2503.14555v1

Synchronous vs Asynchronous Reinforcement Learning in a Real World Robot

In recent times, reinforcement learning (RL) with physical robots has attracted the attention of a wide range of researchers. However, state-of-the-art RL algorithms do not consider that physical environments do not wait for the RL agent to make decisions or updates. RL agents learn by periodically conducting computationally expensive gradient updates. When decision-making and gradient update tasks are carried out sequentially by the RL agent in a physical robot, it significantly increases the agent's response time. In a rapidly changing environment, this increased response time may be detrimental to the performance of the learning agent. Asynchronous RL methods, which separate the computation of decision-making and gradient updates, are a potential solution to this problem. However, only a few comparisons between asynchronous and synchronous RL have been made with physical robots. For this reason, the exact performance benefits of using asynchronous RL methods over synchronous RL methods are still unclear. In this study, we provide a performance comparison between asynchronous and synchronous RL using a physical robotic arm called Franka Emika Panda. Our experiments show that the agents learn faster and attain significantly more returns using asynchronous RL. Our experiments also demonstrate that the learning agent with a faster response time performs better than the agent with a slower response time, even if the agent with a slower response time performs a higher number of gradient updates.
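
A minimal sketch of the asynchronous pattern, assuming Python threads and placeholder sleeps in place of a real environment and optimizer: acting and learning run concurrently over a shared replay buffer, so gradient updates no longer inflate the agent's response time.

```python
import threading, time, random
from collections import deque

replay = deque(maxlen=10_000)        # shared buffer: actor writes, learner reads
lock = threading.Lock()
stop = threading.Event()

def actor():
    """Fast loop: pick actions and log transitions; never blocks on learning."""
    while not stop.is_set():
        transition = (random.random(), random.randint(0, 3), random.random())
        with lock:
            replay.append(transition)
        time.sleep(0.001)            # stand-in for acting in the environment

def learner():
    """Slow loop: gradient updates run concurrently with acting, so the
    agent's response time is not inflated by backprop."""
    while not stop.is_set():
        with lock:
            batch = random.sample(list(replay), k=min(32, len(replay)))
        time.sleep(0.05)             # stand-in for a gradient step on `batch`

threads = [threading.Thread(target=actor), threading.Thread(target=learner)]
for t in threads:
    t.start()
time.sleep(1.0)
stop.set()
for t in threads:
    t.join()
print(f"collected {len(replay)} transitions while learning concurrently")
```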

Updated: 2025-03-17 22:24:39

标题: 同步与异步强化学习在现实世界机器人中的应用

摘要: 最近,使用物理机器人进行强化学习(RL)引起了众多研究者的关注。然而,目前最先进的RL算法并未考虑到物理环境不会等待RL智能体做出决策或更新。RL智能体通过定期执行计算开销大的梯度更新来学习。当决策和梯度更新任务在物理机器人上由RL智能体顺序执行时,会显著增加智能体的响应时间。在快速变化的环境中,这种增加的响应时间可能不利于学习智能体的性能。异步RL方法将决策计算与梯度更新计算分离,是解决这一问题的潜在方案。然而,目前只有少数研究在物理机器人上对异步与同步RL进行了比较。因此,异步RL方法相对于同步RL方法的确切性能优势仍不清楚。在本研究中,我们使用名为Franka Emika Panda的物理机械臂对异步和同步RL进行了性能比较。我们的实验表明,使用异步RL的智能体学习更快,获得的回报显著更多。我们的实验还表明,响应时间更快的学习智能体比响应时间较慢的智能体表现更好,即使后者执行了更多的梯度更新。

更新时间: 2025-03-17 22:24:39

领域: cs.RO,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.14554v1

Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping

Recently, the study of heavy-tailed noises in first-order nonconvex stochastic optimization has attracted significant attention, since many empirical observations suggest that heavy tails are a more realistic condition. Specifically, the stochastic noise (the difference between the stochastic and true gradient) is considered to have only a finite $\mathfrak{p}$-th moment where $\mathfrak{p}\in\left(1,2\right]$ instead of assuming it always satisfies the classical finite variance assumption. To deal with this more challenging setting, people have proposed different algorithms and proved them to converge at an optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate for smooth objectives after $T$ iterations. Notably, all these new-designed algorithms are based on the same technique - gradient clipping. Naturally, one may want to know whether the clipping method is a necessary ingredient and the only way to guarantee convergence under heavy-tailed noises. In this work, by revisiting the existing Batched Normalized Stochastic Gradient Descent with Momentum (Batched NSGDM) algorithm, we provide the first convergence result under heavy-tailed noises but without gradient clipping. Concretely, we prove that Batched NSGDM can achieve the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate even under the relaxed smooth condition. More interestingly, we also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$ convergence rate in the case where the tail index $\mathfrak{p}$ is unknown in advance, which is arguably the common scenario in practice.
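
A hedged sketch of the Batched NSGDM update on a toy quadratic, with Student-t noise (infinite variance, finite mean, matching a tail index in (1, 2]) standing in for heavy-tailed gradient noise; hyperparameters are illustrative and no clipping is applied.

```python
import numpy as np

def nsgdm(grad_fn, theta, T=500, lr=0.05, beta=0.9, batch=8, rng=None):
    """Batched Normalized SGD with Momentum: average a small batch of noisy
    gradients, apply momentum, then take a unit-norm step. No gradient
    clipping is used, matching the setting studied in the paper."""
    rng = rng or np.random.default_rng(0)
    m = np.zeros_like(theta)
    for _ in range(T):
        g = np.mean([grad_fn(theta, rng) for _ in range(batch)], axis=0)
        m = beta * m + (1 - beta) * g
        theta = theta - lr * m / max(np.linalg.norm(m), 1e-12)  # normalized step
    return theta

def noisy_grad(theta, rng):
    # gradient of 0.5 * ||theta||^2 plus heavy-tailed noise: Student-t with
    # df=1.5 has infinite variance but a finite mean
    return theta + rng.standard_t(df=1.5, size=theta.shape)

# typically ends within roughly lr of the optimum at 0 despite the heavy tails
print(nsgdm(noisy_grad, np.array([5.0, -3.0])))
```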

Updated: 2025-03-17 22:23:21

标题: 重尾噪声下的非凸随机优化:无需梯度裁剪的最优收敛

摘要: 最近,一阶非凸随机优化中的重尾噪声研究引起了广泛关注,因为许多经验观察表明这是一个更现实的条件。具体而言,随机噪声(随机梯度与真实梯度之间的差异)被认为只有有限的$\mathfrak{p}$阶矩,其中$\mathfrak{p}\in\left(1,2\right]$,而不是总是满足经典的有限方差假设。为了处理这种更具挑战性的设定,人们提出了不同的算法,并证明它们对于光滑目标在$T$次迭代后以最优的$\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$速率收敛。值得注意的是,所有这些新设计的算法都基于同一种技术 - 梯度裁剪。自然地,人们可能想知道梯度裁剪是否是保证在重尾噪声下收敛的必要成分和唯一方法。在这项工作中,通过重新审视现有的带动量的批量归一化随机梯度下降(Batched NSGDM)算法,我们给出了重尾噪声下无需梯度裁剪的第一个收敛结果。具体而言,我们证明Batched NSGDM即使在放松的光滑条件下也能实现最优的$\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$速率。更有趣的是,我们还建立了尾指数$\mathfrak{p}$事先未知情形下的第一个$\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$收敛速率,而这可以说是实践中的常见情形。

更新时间: 2025-03-17 22:23:21

领域: math.OC,cs.LG,stat.ML

下载: http://arxiv.org/abs/2412.19529v3

Optimizing ML Training with Metagradient Descent

A major challenge in training large-scale machine learning models is configuring the training process to maximize model performance, i.e., finding the best training setup from a vast design space. In this work, we unlock a gradient-based approach to this problem. We first introduce an algorithm for efficiently calculating metagradients -- gradients through model training -- at scale. We then introduce a "smooth model training" framework that enables effective optimization using metagradients. With metagradient descent (MGD), we greatly improve on existing dataset selection methods, outperform accuracy-degrading data poisoning attacks by an order of magnitude, and automatically find competitive learning rate schedules.
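
To see what a metagradient is, the toy below differentiates a tiny training run's final loss with respect to the learning rate, using central finite differences as a cheap stand-in for the paper's efficient exact computation, and then runs metagradient descent on that hyperparameter.

```python
import numpy as np

def train(lr, steps=100):
    """Tiny inner training run: fit y = 2x by SGD and return the final loss.
    The metagradient is d(final loss)/d(lr), i.e., a gradient *through training*."""
    rng = np.random.default_rng(0)          # fixed seed: same data every run
    w = 0.0
    for _ in range(steps):
        x = rng.normal()
        w -= lr * 2 * (w * x - 2 * x) * x   # SGD on squared error
    return (w - 2.0) ** 2                   # test loss against the true weight 2

def metagradient(lr, eps=1e-4):
    # central finite difference: a stand-in for the exact metagradients
    # that MGD computes by differentiating through the training process
    return (train(lr + eps) - train(lr - eps)) / (2 * eps)

lr = 0.02
for _ in range(20):                         # metagradient descent on the learning rate
    lr -= 0.05 * metagradient(lr)
print(f"tuned lr ~ {lr:.4f}, final loss {train(lr):.2e}")
```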

Updated: 2025-03-17 22:18:24

标题: 用元梯度下降优化机器学习训练

摘要: 训练大规模机器学习模型的一个主要挑战是配置训练过程以最大化模型性能,即从庞大的设计空间中找到最佳训练设置。在这项工作中,我们为这个问题解锁了一种基于梯度的方法。我们首先介绍一种大规模高效计算元梯度(即穿过模型训练过程的梯度)的算法。然后我们引入一个"平滑模型训练"框架,使得利用元梯度进行有效优化成为可能。借助元梯度下降(MGD),我们大幅改进了现有的数据集选择方法,以一个数量级的优势胜过降低准确率的数据投毒攻击,并能自动找到有竞争力的学习率调度。

更新时间: 2025-03-17 22:18:24

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.13751v1

Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
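
A hedged sketch of the partitioning recipe described above, assuming numpy and scikit-learn: cluster embeddings (standing in for features from a pre-trained backbone), then allocate each cluster across clients with Dirichlet proportions; smaller alpha yields stronger heterogeneity.

```python
import numpy as np
from sklearn.cluster import KMeans

def embedding_dirichlet_split(embeddings, n_clients=5, n_clusters=10, alpha=0.5, seed=0):
    """Partition data by *embedding* clusters rather than labels: cluster the
    task-specific embeddings, then spread each cluster across clients with
    Dirichlet(alpha) proportions."""
    rng = np.random.default_rng(seed)
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(embeddings)
    client_indices = [[] for _ in range(n_clients)]
    for c in range(n_clusters):
        idx = rng.permutation(np.where(clusters == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# toy embeddings standing in for features from a pre-trained backbone
emb = np.random.default_rng(1).normal(size=(1000, 32))
splits = embedding_dirichlet_split(emb)
print([len(s) for s in splits])   # uneven sizes reflect the induced heterogeneity
```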

Updated: 2025-03-17 22:16:53

标题: 重新定义联邦学习中的非独立同分布数据:从标签迁移至嵌入以适应特定任务的数据分布

摘要: 联邦学习(FL)代表了分布式机器学习(ML)中的一种范式转变,使客户端能够在保持原始数据私密的同时协作训练模型。这种从传统集中式ML出发的范式转变带来了挑战,因为客户端之间数据的非独立同分布(non-iid)性质会显著影响FL的性能。现有文献主要通过在客户端之间施加标签分布偏斜来对数据异质性进行建模。本文表明,在分类之外的计算机视觉任务中,标签分布偏斜无法充分刻画客户端之间真实的数据异质性。随后,我们证明当前方法因依赖标签/类别分布偏斜而高估了FL的性能,揭示了文献中一个被忽视的空白。通过利用预训练深度神经网络提取任务特定的数据嵌入,我们从每个视觉任务的视角定义了任务特定的数据异质性,并引入了一种新的数据异质性层次,称为基于嵌入的数据异质性。我们的方法是基于嵌入对数据点进行聚类,并使用Dirichlet分布将它们分配给各客户端。通过大量实验,我们在这种重新定义的数据异质性概念下评估了不同FL方法的性能,为文献引入了新的基准性能度量。我们进一步提出了一系列值得探索的开放研究方向。

更新时间: 2025-03-17 22:16:53

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.14553v1

Fire and Smoke Datasets in 20 Years: An In-depth Review

Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, and global economy, as well as human lives and wildlife. In this particular circumstance, there is a demand for more sophisticated and advanced technologies to implement an effective strategy for early detection, real-time monitoring, and minimizing the overall impacts of fires on ecological balance and public safety. Recently, the rapid advancement of Artificial Intelligence (AI) and Computer Vision (CV) frameworks has substantially revolutionized the momentum for developing efficient fire management systems. However, these systems extensively rely on the availability of adequate and high-quality fire and smoke data to create proficient Machine Learning (ML) methods for various tasks, such as detection and monitoring. Although fire and smoke datasets play a critical role in training, evaluating, and testing advanced Deep Learning (DL) models, a comprehensive review of the existing datasets is still unexplored. For this purpose, we provide an in-depth review to systematically analyze and evaluate fire and smoke datasets collected over the past 20 years. We investigate the characteristics of each dataset, including type, size, format, collection methods, and geographical diversities. We also review and highlight the unique features of each dataset, such as imaging modalities (RGB, thermal, infrared) and their applicability for different fire management tasks (classification, segmentation, detection). Furthermore, we summarize the strengths and weaknesses of each dataset and discuss their potential for advancing research and technology in fire management. Ultimately, we conduct extensive experimental analyses across different datasets using several state-of-the-art algorithms, such as ResNet-50, DeepLab-V3, and YoloV8.

Updated: 2025-03-17 22:08:02

标题: 20年来的火灾与烟雾数据集:深入综述

摘要: 火灾和烟雾现象对自然环境、生态系统、全球经济以及人类生命和野生动物构成了重大威胁。在这种背景下,需要更加复杂和先进的技术来实施有效的早期检测、实时监测策略,并最大限度地减少火灾对生态平衡和公共安全的总体影响。最近,人工智能(AI)和计算机视觉(CV)框架的快速发展极大地推动了高效火灾管理系统的开发。然而,这些系统广泛依赖于充足且高质量的火灾和烟雾数据,以便为检测和监测等各类任务构建有效的机器学习(ML)方法。虽然火灾和烟雾数据集在训练、评估和测试先进的深度学习(DL)模型方面起着至关重要的作用,但对现有数据集的全面综述仍属空白。为此,我们提供了一项深入综述,系统分析和评估过去20年收集的火灾和烟雾数据集。我们调查了每个数据集的特征,包括类型、大小、格式、收集方法和地理多样性。我们还审视并突出每个数据集的独特特点,如成像模态(RGB、热成像、红外)及其在不同火灾管理任务(分类、分割、检测)中的适用性。此外,我们总结了每个数据集的优势和劣势,并讨论它们在推进火灾管理研究和技术方面的潜力。最后,我们使用几种最先进的算法(如ResNet-50、DeepLab-V3和YoloV8)在不同数据集上进行了广泛的实验分析。

更新时间: 2025-03-17 22:08:02

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.14552v1

ConfuGuard: Using Metadata to Detect Active and Stealthy Package Confusion Attacks Accurately and at Scale

Package confusion attacks such as typosquatting threaten software supply chains. Attackers make packages with names that syntactically or semantically resemble legitimate ones, tricking engineers into installing malware. While prior work has developed defenses against package confusions in some software package registries, notably NPM, PyPI, and RubyGems, gaps remain: high false-positive rates; generalization to more software package ecosystems; and insights from real-world deployment. In this work, we introduce ConfuGuard, a solution designed to address the challenges posed by package confusion threats. We begin by presenting the first empirical analysis of benign signals derived from prior package confusion data, uncovering their threat patterns, engineering practices, and measurable attributes. We observed that 13.3% of real package confusion attacks are initially stealthy, so we took that into consideration and refined the definitions. Building on state-of-the-art approaches, we extend support from three to six software package registries, and leverage package metadata to distinguish benign packages. Our approach significantly reduces the false-positive rate by 64 percentage points (from 77% to 13%), with acceptable additional overhead incurred by analyzing package metadata to filter out benign packages. ConfuGuard is in production at our industry partner, whose analysts have already confirmed 301 packages detected by ConfuGuard as real attacks. We share lessons learned from production and provide insights to researchers.
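
As a toy illustration of the two-stage idea (lexical similarity to popular packages, then metadata-based benign-signal filtering), the sketch below uses difflib string similarity and one made-up metadata heuristic; ConfuGuard's real signals, thresholds, and registry coverage are far richer.

```python
from difflib import SequenceMatcher

POPULAR = {"requests", "numpy", "pandas", "urllib3"}

def name_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def flag_confusion(candidate, metadata, threshold=0.84):
    """Toy two-stage check: (1) is the name lexically close to a popular
    package? (2) if so, do metadata signals suggest a benign look-alike
    (e.g., a declared fork)? Both heuristics are made up for illustration."""
    if candidate in POPULAR:
        return None                                  # exact match, not confusion
    target = max(POPULAR, key=lambda p: name_similarity(candidate, p))
    if name_similarity(candidate, target) < threshold:
        return None                                  # not close enough to confuse
    looks_benign = "fork of" in metadata.get("description", "").lower()
    return {"candidate": candidate, "resembles": target, "looks_benign": looks_benign}

print(flag_confusion("reqeusts", {"description": ""}))      # flagged, suspicious
print(flag_confusion("flask-login", {"description": ""}))   # None: dissimilar name
```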

Updated: 2025-03-17 21:57:16

标题: ConfuGuard:使用元数据准确且大规模地检测活跃和隐蔽的软件包混淆攻击

摘要: 包混淆攻击(如仿冒拼写攻击 typosquatting)威胁着软件供应链。攻击者制作名称在语法或语义上与合法软件包相似的软件包,诱骗工程师安装恶意软件。尽管先前的工作已经在一些软件包注册表(特别是 NPM、PyPI 和 RubyGems)中开发了针对包混淆的防御措施,但仍存在不足:误报率高;难以推广到更多软件包生态系统;以及缺乏来自实际部署的经验。 在这项工作中,我们介绍了 ConfuGuard,一个旨在应对包混淆威胁挑战的解决方案。我们首先对先前包混淆数据中提取的良性信号进行了首次实证分析,揭示了它们的威胁模式、工程实践和可测量属性。我们观察到,13.3% 的真实包混淆攻击最初是隐蔽的,因此我们将这一点纳入考虑并完善了相关定义。在最先进方法的基础上,我们将支持范围从三个扩展到六个软件包注册表,并利用包元数据来区分良性包。我们的方法将误报率显著降低了 64 个百分点(从 77% 降至 13%),通过分析包元数据过滤良性包所带来的额外开销在可接受范围内。ConfuGuard 已在我们的产业合作伙伴处投入生产使用,其分析师已确认 ConfuGuard 检测到的 301 个软件包为真实攻击。我们分享了从生产部署中学到的经验,并为研究人员提供见解。

更新时间: 2025-03-17 21:57:16

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2502.20528v2

Self-Supervised Z-Slice Augmentation for 3D Bio-Imaging via Knowledge Distillation

Three-dimensional biological microscopy has significantly advanced our understanding of complex biological structures. However, limitations due to microscopy techniques, sample properties or phototoxicity often result in poor z-resolution, hindering accurate cellular measurements. Here, we introduce ZAugNet, a fast, accurate, and self-supervised deep learning method for enhancing z-resolution in biological images. By performing nonlinear interpolation between consecutive slices, ZAugNet effectively doubles resolution with each iteration. Compared on several microscopy modalities and biological objects, it outperforms competing methods on most metrics. Our method leverages a generative adversarial network (GAN) architecture combined with knowledge distillation to maximize prediction speed without compromising accuracy. We also developed ZAugNet+, an extended version enabling continuous interpolation at arbitrary distances, making it particularly useful for datasets with nonuniform slice spacing. Both ZAugNet and ZAugNet+ provide high-performance, scalable z-slice augmentation solutions for large-scale 3D imaging. They are available as open-source frameworks in PyTorch, with an intuitive Colab notebook interface for easy access by the scientific community.

Updated: 2025-03-17 21:52:46

标题: 自监督Z切片增强在三维生物成像中的应用:通过知识蒸馏

摘要: 三维生物显微镜技术显著推进了我们对复杂生物结构的理解。然而,由于显微镜技术、样本性质或光毒性的限制,常常导致z分辨率不佳,影响细胞测量的准确性。在这里,我们介绍了ZAugNet,一种快速、准确且自监督的深度学习方法,用于增强生物图像中的z分辨率。通过在连续切片之间进行非线性插值,ZAugNet每次迭代有效地将分辨率加倍。在多个显微镜模式和生物对象上进行比较,它在大多数指标上优于竞争方法。我们的方法利用生成对抗网络(GAN)架构结合知识蒸馏,以最大化预测速度而不损害准确性。我们还开发了ZAugNet+,一个扩展版本,使其能够在任意距离上进行连续插值,特别适用于具有非均匀切片间距的数据集。ZAugNet和ZAugNet+为大规模3D成像提供了高性能、可扩展的z切片增强解决方案。它们作为PyTorch的开源框架提供,并具有直观的Colab笔记本界面,便于科学界轻松访问。

更新时间: 2025-03-17 21:52:46

领域: cs.CV,cs.AI,eess.IV,q-bio.QM,68,I.4.3; I.4.4; I.2.0; J.3

下载: http://arxiv.org/abs/2503.04843v2

Synthesizing Interpretable Control Policies through Large Language Model Guided Search

The combination of Large Language Models (LLMs), systematic evaluation, and evolutionary algorithms has enabled breakthroughs in combinatorial optimization and scientific discovery. We propose to extend this powerful combination to the control of dynamical systems, generating interpretable control policies capable of complex behaviors. With our novel method, we represent control policies as programs in standard languages like Python. We evaluate candidate controllers in simulation and evolve them using a pre-trained LLM. Unlike conventional learning-based control techniques, which rely on black-box neural networks to encode control policies, our approach enhances transparency and interpretability. We still take advantage of the power of large AI models, but only at the policy design phase, ensuring that all system components remain interpretable and easily verifiable at runtime. Additionally, the use of standard programming languages makes it straightforward for humans to finetune or adapt the controllers based on their expertise and intuition. We illustrate our method through its application to the synthesis of an interpretable control policy for the pendulum swing-up and the ball in cup tasks. We make the code available at https://github.com/muellerlab/synthesizing_interpretable_control_policies.git.

Updated: 2025-03-17 21:49:35

标题: 通过大型语言模型引导搜索综合可解释的控制策略

摘要: 大型语言模型(LLMs)、系统评估和进化算法的结合已经实现了组合优化和科学发现方面的突破。我们提议将这种强大的组合扩展到动态系统的控制,生成能够展现复杂行为的可解释控制策略。通过我们的创新方法,我们将控制策略表示为标准语言(如Python)中的程序。我们在仿真中评估候选控制器,并使用预先训练的LLM进行演化。与依赖黑盒神经网络来编码控制策略的传统学习型控制技术不同,我们的方法增强了透明度和可解释性。我们仍然利用大型AI模型的强大能力,但仅在策略设计阶段,确保所有系统组件在运行时仍然可解释和容易验证。此外,使用标准编程语言使人类可以根据自己的专业知识和直觉对控制器进行微调或调整。我们通过将其应用于摆杆摆起和杯中球等任务的可解释控制策略的合成来说明我们的方法。我们将代码提供在https://github.com/muellerlab/synthesizing_interpretable_control_policies.git。

更新时间: 2025-03-17 21:49:35

领域: cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2410.05406v2

Unfair Utilities and First Steps Towards Improving Them

Many fairness criteria constrain the policy or choice of predictors, which can have unwanted consequences, in particular, when optimizing the policy under such constraints. Here, we advocate to instead focus on the utility function the policy is optimizing for. We define value of information fairness and propose to not use utility functions that violate this criterion. This principle suggests to modify these utility functions such that they satisfy value of information fairness. We describe how this can be done and discuss consequences for the corresponding optimal policies. We apply our framework to thought experiments and the COMPAS data. Focussing on the utility function provides better answers than existing fairness notions: We are not aware of any intuitively fair policy that is disallowed by value of information fairness, and when we find that value of information fairness recommends an intuitively unfair policy, no existing fairness notion finds an intuitively fair policy.

Updated: 2025-03-17 21:29:31

标题: 不公平的效用函数与改进它们的第一步

摘要: 许多公平性标准约束策略或预测器的选择,这可能带来意想不到的后果,尤其是在这些约束下优化策略时。在这里,我们主张转而关注策略所优化的效用函数。我们定义了信息价值公平性,并建议不使用违反这一标准的效用函数。这一原则意味着应修改这些效用函数,使其满足信息价值公平性。我们描述了如何做到这一点,并讨论了其对相应最优策略的影响。我们将该框架应用于思想实验和COMPAS数据。关注效用函数比现有的公平性概念提供了更好的答案:我们不知道有任何直观上公平的策略会被信息价值公平性所禁止;而当我们发现信息价值公平性推荐了一个直观上不公平的策略时,也没有任何现有的公平性概念能找到一个直观上公平的策略。

更新时间: 2025-03-17 21:29:31

领域: stat.ML,cs.CY,cs.LG

下载: http://arxiv.org/abs/2306.00636v2

Explainable Differential Privacy-Hyperdimensional Computing for Balancing Privacy and Transparency in Additive Manufacturing Monitoring

Machine Learning (ML) models integrated with in-situ sensing offer transformative solutions for defect detection in Additive Manufacturing (AM), but this integration brings critical challenges in safeguarding sensitive data, such as part designs and material compositions. Differential Privacy (DP), which introduces mathematically controlled noise, provides a balance between data utility and privacy. However, black-box Artificial Intelligence (AI) models often obscure how this noise impacts model accuracy, complicating the optimization of privacy-accuracy trade-offs. This study introduces the Differential Privacy-Hyperdimensional Computing (DP-HD) framework, a novel approach combining Explainable AI (XAI) and vector symbolic paradigms to quantify and predict noise effects on accuracy using a Signal-to-Noise Ratio (SNR) metric. DP-HD enables precise tuning of DP noise levels, ensuring an optimal balance between privacy and performance. The framework has been validated using real-world AM data, demonstrating its applicability to industrial environments. Experimental results demonstrate DP-HD's capability to achieve state-of-the-art accuracy (94.43%) with robust privacy protections in anomaly detection for AM, even under significant noise conditions. Beyond AM, DP-HD holds substantial promise for broader applications in privacy-sensitive domains such as healthcare, financial services, and government data management, where securing sensitive data while maintaining high ML performance is paramount.
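
A hedged sketch of the two ingredients being balanced: the classical analytic Gaussian mechanism (valid for epsilon < 1 in that analysis) and an SNR readout of how much signal survives the noise. DP-HD's hyperdimensional encoding and its accuracy-prediction model are elided, and the feature array is synthetic.

```python
import numpy as np

def gaussian_mechanism(x, sensitivity, epsilon, delta, rng):
    """Add calibrated Gaussian noise for (epsilon, delta)-DP using the
    classical analytic sigma (requires epsilon < 1). DP-HD's contribution is
    *predicting* how this noise level will move accuracy via an SNR metric."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return x + rng.normal(0.0, sigma, size=x.shape), sigma

def snr_db(signal, sigma):
    # signal power relative to injected noise power, in decibels
    return 10 * np.log10(np.mean(signal ** 2) / sigma ** 2)

rng = np.random.default_rng(0)
features = rng.normal(1.0, 0.2, size=1024)       # stand-in for encoded features
for eps in (0.3, 0.5, 0.9):
    private, sigma = gaussian_mechanism(features, sensitivity=1.0,
                                        epsilon=eps, delta=1e-5, rng=rng)
    print(f"eps={eps}: sigma={sigma:.2f}, SNR={snr_db(features, sigma):+.1f} dB")
```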

Updated: 2025-03-17 21:17:59

标题: 可解释的差分隐私-超维计算用于在增材制造监测中平衡隐私和透明度

摘要: 机器学习(ML)模型与原位传感相结合,为增材制造(AM)中的缺陷检测提供了变革性的解决方案,但这种整合在保护敏感数据(如零件设计和材料成分)方面带来了严峻挑战。差分隐私(DP)通过引入数学上受控的噪声,在数据效用与隐私之间取得平衡。然而,黑盒人工智能(AI)模型往往掩盖了这种噪声对模型准确性的影响,使隐私与准确性权衡的优化变得复杂。本研究提出了差分隐私-超维计算(DP-HD)框架,这是一种结合可解释AI(XAI)与向量符号范式的新方法,利用信噪比(SNR)度量来量化并预测噪声对准确性的影响。DP-HD能够精确调节DP噪声水平,确保隐私与性能之间的最佳平衡。该框架已通过真实的AM数据进行验证,展示了其在工业环境中的适用性。实验结果表明,DP-HD在AM异常检测中能够实现最先进的准确性(94.43%),即使在显著噪声条件下也能提供强大的隐私保护。除AM之外,DP-HD在医疗保健、金融服务和政府数据管理等隐私敏感领域也具有广阔的应用前景,在这些领域中,在保持高ML性能的同时保护敏感数据至关重要。

更新时间: 2025-03-17 21:17:59

领域: cs.LG,cs.CR,cs.CV

下载: http://arxiv.org/abs/2407.07066v4

Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training-workflow of {\it replacing} all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of {\it accumulating} synthetic data alongside real data and training on all data combined and confirming that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
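
The Gaussian-estimation setting admits a compact simulation of two of the training workflows; the toy below (1D for brevity, parameters illustrative) contrasts "replace" with "accumulate" and typically reproduces the qualitative finding: variance drifts toward collapse under replacement but stays stable under accumulation.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=1000)        # "real" data from N(0, 1)

def fit(data):                                # "model" = Gaussian MLE
    return data.mean(), data.std()

def simulate(workflow, generations=500, n_synth=100):
    data = real.copy()
    for _ in range(generations):
        mu, sd = fit(data)
        synth = rng.normal(mu, sd, size=n_synth)
        data = synth if workflow == "replace" else np.concatenate([data, synth])
    return fit(data)

print("replace   :", simulate("replace"))     # variance typically collapses toward 0
print("accumulate:", simulate("accumulate"))  # stays close to the real N(0, 1)
```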

Updated: 2025-03-17 21:14:46

标题: 崩溃还是繁荣?自我生成世界中合成数据的风险与前景

摘要: 当生成式机器学习模型在包含早期模型所生成数据的网络规模数据集上进行预训练时,会发生什么?一些先前的工作警告说,随着网络被合成数据淹没,会出现"模型崩溃";另一些工作则表明,通过管理预训练中可用数据的使用方式,这个问题是可以遏制的(即崩溃可以避免)。在本文中,我们报告了在三种生成模型任务设置(多元高斯估计、核密度估计和语言模型微调)下对三种数据使用方式(训练工作流)的实验,进一步证实了遏制的可能性:(a)我们确认,在所研究的所有任务设置中,用逐代生成的纯合成数据替换全部真实数据的训练工作流确实会导致模型崩溃;(b)我们考虑在真实数据之外不断积累合成数据并在全部数据上训练的工作流,并确认尽管真实数据的占比最终趋于零,模型仍保持稳定,其测试损失在该工作流下不会发散;(c)我们考虑一种真实数据与合成数据共同积累、但每一代预训练都被限制使用固定大小数据子集的工作流。在该工作流中,我们观察到测试损失随代数缓慢、渐进地退化,而非爆发式恶化。我们的见解对于预测未来前沿生成模型是崩溃还是繁荣尤为重要,我们的结果也为实证和数学地研究合成数据随情境变化的价值开辟了途径。

更新时间: 2025-03-17 21:14:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.16713v4

AutoEval: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on AutoEval is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update.

Updated: 2025-03-17 21:03:16

标题: AutoEval:面向真值维护与推理任务的LLM自主评估

摘要: 本文介绍了AutoEval,这是一个新颖的基准,用于在具有明确正确性概念的形式化任务(如翻译中的真值维护和逻辑推理)中规模化地评估大型语言模型(LLM)。AutoEval是第一个具备以下关键优势的基准范式,这些优势对于在无人工标注的情况下规模化地客观评估LLM必不可少:(a)能够通过自动生成不同难度级别的任务来评估日益复杂的LLM;(b)自动生成基准真值,消除对昂贵且耗时的人工标注的依赖;(c)使用自动生成的随机化数据集,缓解后继LLM对许多当代基准所用静态数据集的过拟合。实证分析表明,LLM在AutoEval上的表现可以高度预示其在其他多种聚焦翻译与推理任务的基准上的表现,这使其在人工策划数据集难以获取和/或更新的场景中成为一种有价值的自主评估范式。

更新时间: 2025-03-17 21:03:16

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2410.08437v2

Rendering Transparency to Ranking in Educational Assessment via Bayesian Comparative Judgement

Ensuring transparency in educational assessment is increasingly critical, particularly post-pandemic, as demand grows for fairer and more reliable evaluation methods. Comparative Judgement (CJ) offers a promising alternative to traditional assessments, yet concerns remain about its perceived opacity. This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process, providing a structured, data-driven approach that improves interpretability and accountability. BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence. By systematically tracking how prior data and successive judgements inform final rankings, BCJ clarifies the assessment process and helps identify assessor disagreements. Multi-criteria BCJ extends this by evaluating multiple learning outcomes (LOs) independently, preserving the richness of CJ while producing transparent, granular rankings aligned with specific assessment goals. It also enables a holistic ranking derived from individual LOs, ensuring comprehensive evaluations without compromising detailed feedback. Using a real higher education dataset with professional markers in the UK, we demonstrate BCJ's quantitative rigour and ability to clarify ranking rationales. Through qualitative analysis and discussions with experienced CJ practitioners, we explore its effectiveness in contexts where transparency is crucial, such as high-stakes national assessments. We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings.

Updated: 2025-03-17 20:56:55

标题: 通过贝叶斯比较评判为教育评估的排名赋予透明度

摘要: 确保教育评估的透明度变得越来越关键,特别是在疫情后,因为对更公平和更可靠的评估方法的需求不断增长。比较评判(CJ)提供了一种有前途的替代传统评估的方法,然而其在外界眼中的不透明性仍令人担忧。本文研究了贝叶斯比较评判(BCJ)如何通过将先验信息整合到评判过程中来增强透明度,提供了一种结构化的、数据驱动的方法,改善了可解释性和问责性。 BCJ为评判结果分配概率,提供了可量化的不确定性度量,并能更深入地了解决策信心。通过系统跟踪先验数据和连续评判如何影响最终排名,BCJ澄清了评估过程,并帮助识别评估者之间的分歧。多标准BCJ通过独立评估多个学习成果(LOs)来扩展这一方法,在保留CJ丰富性的同时,产生与特定评估目标一致的透明、细粒度的排名。它还可以从各项LO得出整体排名,确保全面评估而不损失详细反馈。 通过在英国一个带有专业评分者的真实高等教育数据集上进行演示,我们展示了BCJ的量化严谨性及其澄清排名理由的能力。通过定性分析以及与经验丰富的CJ从业者的讨论,我们探讨了BCJ在透明度至关重要的情境(如高风险的国家级评估)中的有效性。我们强调了BCJ的优点和局限性,为其在各种教育环境中的实际应用提供了见解。

更新时间: 2025-03-17 20:56:55

领域: cs.CY,cs.AI,cs.HC,cs.IR

下载: http://arxiv.org/abs/2503.15549v1

Deep Self-Supervised Disturbance Mapping with the OPERA Sentinel-1 Radiometric Terrain Corrected SAR Backscatter Product

Mapping land surface disturbances supports disaster response, resource and ecosystem management, and climate adaptation efforts. Synthetic aperture radar (SAR) is an invaluable tool for disturbance mapping, providing consistent time-series images of the ground regardless of weather or illumination conditions. Despite SAR's potential for disturbance mapping, processing SAR data to an analysis-ready format requires expertise and significant compute resources, particularly for large-scale global analysis. In October 2023, NASA's Observational Products for End-Users from Remote Sensing Analysis (OPERA) project released the near-global Radiometric Terrain Corrected SAR backscatter from Sentinel-1 (RTC-S1) dataset, providing publicly available, analysis-ready SAR imagery. In this work, we utilize this new dataset to systematically analyze land surface disturbances. As labeling SAR data is often prohibitively time-consuming, we train a self-supervised vision transformer - which requires no labels to train - on OPERA RTC-S1 data to estimate a per-pixel distribution from the set of baseline imagery and assess disturbances when there is significant deviation from the modeled distribution. To test our model's capability and generality, we evaluate three different natural disasters - which represent high-intensity, abrupt disturbances - from three different regions of the world. Across events, our approach yields high quality delineations: F1 scores exceeding 0.6 and Areas Under the Precision-Recall Curve exceeding 0.65, consistently outperforming existing SAR disturbance methods. Our findings suggest that a self-supervised vision transformer is well-suited for global disturbance mapping and can be a valuable tool for operational, near-global disturbance monitoring, particularly when labeled data does not exist.
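
A hedged sketch of the per-pixel idea with a plain Gaussian baseline model in place of the learned transformer distribution: estimate each pixel's baseline statistics from the image stack, then flag strong deviations in a new scene. Array sizes and the simulated disturbance are illustrative.

```python
import numpy as np

def disturbance_map(baseline_stack, new_image, z_thresh=3.0):
    """Per-pixel disturbance detection: model each pixel's baseline backscatter
    as a distribution (here simply Gaussian) estimated from the time series,
    then flag pixels whose new value deviates strongly. The paper replaces the
    Gaussian with a self-supervised vision transformer's learned distribution."""
    mu = baseline_stack.mean(axis=0)
    sd = baseline_stack.std(axis=0) + 1e-6
    z = np.abs(new_image - mu) / sd
    return z > z_thresh

rng = np.random.default_rng(0)
stack = rng.normal(-12.0, 1.0, size=(24, 64, 64))   # 24 baseline RTC-S1 scenes (dB)
scene = rng.normal(-12.0, 1.0, size=(64, 64))
scene[20:30, 20:30] -= 8.0                          # simulated backscatter drop
mask = disturbance_map(stack, scene)
print("flagged pixels:", int(mask.sum()))           # ~100 disturbed pixels, plus a few false alarms
```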

Updated: 2025-03-17 20:49:43

标题: 利用OPERA Sentinel-1辐射地形修正SAR回波产品进行深度自监督扰动映射

摘要: 绘制地表扰动图有助于灾害响应、资源与生态系统管理以及气候适应工作。合成孔径雷达(SAR)是扰动制图的宝贵工具,无论天气或光照条件如何,都能提供一致的地面时间序列影像。尽管SAR在扰动制图方面潜力巨大,但将SAR数据处理为可分析的格式需要专业知识和大量计算资源,对于大规模全球分析尤其如此。2023年10月,NASA的遥感分析终端用户观测产品(OPERA)项目发布了近全球的Sentinel-1辐射地形校正SAR后向散射(RTC-S1)数据集,提供了公开可用、可直接分析的SAR影像。在这项工作中,我们利用这一新数据集系统地分析地表扰动。由于标注SAR数据往往极为耗时,我们在OPERA RTC-S1数据上训练了一个自监督视觉Transformer(无需标签即可训练),从基线影像集合中估计每个像素的分布,并在观测值与建模分布出现显著偏差时判定扰动。为了检验模型的能力与普适性,我们评估了来自世界三个不同地区的三种不同自然灾害,它们代表高强度的突发扰动。在各个事件中,我们的方法都产生了高质量的扰动划定:F1分数超过0.6,精确率-召回率曲线下面积超过0.65,始终优于现有的SAR扰动方法。我们的研究结果表明,自监督视觉Transformer非常适合全球扰动制图,并且可以成为业务化、近全球扰动监测的宝贵工具,尤其是在缺乏标注数据的情况下。

更新时间: 2025-03-17 20:49:43

领域: cs.CV,cs.LG,eess.IV,J.2; I.2.6; I.4.8

下载: http://arxiv.org/abs/2501.09129v2

Learning to Inference Adaptively for Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent efforts on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.

Updated: 2025-03-17 20:35:28

标题: 学习为多模态大型语言模型自适应推理

摘要: 多模态大型语言模型(MLLMs)在推理方面展现出令人印象深刻的能力,但是伴随着大量的计算成本,限制了它们在资源受限环境中的部署。尽管最近有关于提高MLLM效率的努力,先前的解决方案仍然无法应对不断变化的运行时条件,特别是不同资源可用性(例如,由于设备上其他程序的执行而导致的争用)。为了弥合这一差距,我们引入了AdaLLaVA,一种自适应推理框架,它学会在推理过程中动态重新配置MLLM中的操作,考虑输入数据和延迟预算。我们进行了涉及问题回答、推理和幻觉的广泛实验。我们的结果表明,AdaLLaVA有效地遵循输入延迟预算,在运行时实现不同的准确性和延迟权衡。此外,我们展示了AdaLLaVA适应输入延迟和内容,可以与令牌选择相结合以提高效率,并且可以在MLLM之间泛化。我们的项目网页和代码发布在https://zhuoyan-xu.github.io/ada-llava/。

更新时间: 2025-03-17 20:35:28

领域: cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.10905v2

NVR: Vector Runahead on NPUs for Sparse Memory Access

Deep Neural Networks are increasingly leveraging sparsity to reduce the scaling up of model parameter size. However, reducing wall-clock time through sparsity and pruning remains challenging due to irregular memory access patterns, leading to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored for NPUs to address cache miss problems in sparse DNN workloads. Rather than optimising memory patterns with high overhead and poor portability, NVR adapts runahead execution to the unique architecture of NPUs. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support, operating as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU, with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to SOTA prefetching in general-purpose processors, delivering 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the advantages of incorporating a small cache (16KB) into the NPU combined with NVR. Our evaluation shows that expanding this modest cache delivers 5x higher performance benefits than increasing the L2 cache size by the same amount.

Updated: 2025-03-17 20:31:46

标题: NVR:面向稀疏内存访问的NPU向量预执行

摘要: 深度神经网络越来越多地利用稀疏性来抑制模型参数规模的增长。然而,由于不规则的内存访问模式导致频繁的缓存未命中,通过稀疏化和剪枝来减少实际运行时间仍然具有挑战性。本文提出了NPU Vector Runahead(NVR),一种专为NPU量身定制的预取机制,用于解决稀疏DNN工作负载中的缓存未命中问题。NVR没有采用开销高、可移植性差的内存模式优化,而是将预执行(runahead execution)适配到NPU的独特体系结构。NVR为稀疏DNN工作负载提供了一种通用的微架构解决方案,无需编译器或算法支持,作为与NPU并行的解耦、推测式、轻量级硬件子线程运行,硬件开销极小(低于5%)。与通用处理器中最先进的预取相比,NVR平均减少90%的缓存未命中;与无预取的NPU相比,在稀疏工作负载上平均带来4倍加速。此外,我们研究了在NPU中加入一个小缓存(16KB)并与NVR结合的优势。我们的评估显示,扩展这个小缓存带来的性能收益比将L2缓存增加相同容量高出5倍。

更新时间: 2025-03-17 20:31:46

领域: cs.AR,cs.AI

下载: http://arxiv.org/abs/2502.13873v2

Multi-modal Time Series Analysis: A Tutorial and Survey

Multi-modal time series analysis has recently emerged as a prominent research area in data mining, driven by the increasing availability of diverse data modalities, such as text, images, and structured tabular data from real-world sources. However, effective analysis of multi-modal time series is hindered by data heterogeneity, modality gap, misalignment, and inherent noise. Recent advancements in multi-modal time series methods have exploited the multi-modal context via cross-modal interactions based on deep learning methods, significantly enhancing various downstream tasks. In this tutorial and survey, we present a systematic and up-to-date overview of multi-modal time series datasets and methods. We first state the existing challenges of multi-modal time series analysis and our motivations, with a brief introduction of preliminaries. Then, we summarize the general pipeline and categorize existing methods through a unified cross-modal interaction framework encompassing fusion, alignment, and transference at different levels (\textit{i.e.}, input, intermediate, output), where key concepts and ideas are highlighted. We also discuss the real-world applications of multi-modal analysis for both standard and spatial time series, tailored to general and specific domains. Finally, we discuss future research directions to help practitioners explore and exploit multi-modal time series. The up-to-date resources are provided in the GitHub repository: https://github.com/UConn-DSIS/Multi-modal-Time-Series-Analysis

Updated: 2025-03-17 20:30:02

标题: 多模态时间序列分析:教程与调查

摘要: 多模态时间序列分析最近成为数据挖掘领域的一个突出研究领域,这是由于不断增加的各种数据形式的可用性,例如来自现实世界来源的文本、图像和结构化表格数据。然而,多模态时间序列的有效分析受到数据异质性、模态差距、不对齐和固有噪声的阻碍。最近的多模态时间序列方法的进展利用了基于深度学习方法的跨模态交互,显著增强了各种下游任务。在本教程和调查中,我们提供了多模态时间序列数据集和方法的系统和最新概述。我们首先阐述了多模态时间序列分析的现有挑战和我们的动机,并简要介绍了基础知识。然后,我们通过一个统一的跨模态交互框架总结了一般流程,并对现有方法进行分类,涵盖了在不同级别(即输入、中间、输出)进行融合、对齐和转移的关键概念和思想。我们还讨论了多模态分析在标准和空间时间序列方面的实际应用,适用于一般和特定领域。最后,我们讨论了未来的研究方向,以帮助从业者探索和利用多模态时间序列。最新资源可在GitHub存储库中找到:https://github.com/UConn-DSIS/Multi-modal-Time-Series-Analysis

更新时间: 2025-03-17 20:30:02

领域: cs.LG

下载: http://arxiv.org/abs/2503.13709v1

A Circular Construction Product Ontology for End-of-Life Decision-Making

Efficient management of end-of-life (EoL) products is critical for advancing circularity in supply chains, particularly within the construction industry, where EoL strategies are hindered by heterogeneous lifecycle data and data silos. Current tools like Environmental Product Declarations (EPDs) and Digital Product Passports (DPPs) are limited by their dependency on seamless data integration and interoperability, which remain significant challenges. To address these, we present the Circular Construction Product Ontology (CCPO), an applied framework designed to overcome semantic and data heterogeneity challenges in EoL decision-making for construction products. CCPO standardises vocabulary and facilitates data integration across supply chain stakeholders, enabling lifecycle assessments (LCA) and robust decision-making. By aggregating disparate data into a unified product provenance, CCPO enables automated EoL recommendations through customisable SWRL rules aligned with European standards and stakeholder-specific circularity SLAs, demonstrating its scalability and integration capabilities. The adopted circular product scenario depicts CCPO's application, while competency question evaluations show its superior performance in generating accurate EoL suggestions, highlighting its potential to greatly improve decision-making in circular supply chains and its applicability in real-world construction environments.

Updated: 2025-03-17 20:28:08

标题: 一个面向报废(EoL)决策的循环建筑产品本体

摘要: 高效管理报废(EoL)产品对于推动供应链循环性至关重要,在建筑行业尤其如此,其EoL策略受到异构生命周期数据和数据孤岛的阻碍。环境产品声明(EPDs)和数字产品护照(DPPs)等现有工具依赖无缝的数据集成与互操作性,而这仍是重大挑战。为了解决这些问题,我们提出了循环建筑产品本体(CCPO),这是一个应用框架,旨在克服建筑产品EoL决策中的语义和数据异构性挑战。CCPO标准化了词汇,并促进供应链各利益相关方之间的数据集成,支持生命周期评估(LCA)和稳健的决策制定。通过将分散的数据聚合为统一的产品溯源,CCPO借助与欧洲标准及利益相关方特定循环性服务级别协议(SLA)相一致的可定制SWRL规则,实现自动化的EoL建议,展示了其可扩展性和集成能力。所采用的循环产品场景展示了CCPO的应用,而能力问题评估显示其在生成准确EoL建议方面表现出色,凸显了其在改进循环供应链决策方面的巨大潜力及其在现实建筑环境中的适用性。

更新时间: 2025-03-17 20:28:08

领域: cs.AI,cs.DB

下载: http://arxiv.org/abs/2503.13708v1

User Preference Meets Pareto-Optimality in Multi-Objective Bayesian Optimization

Incorporating user preferences into multi-objective Bayesian optimization (MOBO) allows for personalization of the optimization procedure. Preferences are often abstracted in the form of an unknown utility function, estimated through pairwise comparisons of potential outcomes. However, utility-driven MOBO methods can yield solutions that are dominated by nearby solutions, as non-dominance is not enforced. Additionally, classical MOBO commonly relies on estimating the entire Pareto-front to identify the Pareto-optimal solutions, which can be expensive and ignore user preferences. Here, we present a new method, termed preference-utility-balanced MOBO (PUB-MOBO), that allows users to disambiguate between near-Pareto candidate solutions. PUB-MOBO combines utility-based MOBO with local multi-gradient descent to refine user-preferred solutions to be near-Pareto-optimal. To this end, we propose a novel preference-dominated utility function that concurrently preserves user-preferences and dominance amongst candidate solutions. A key advantage of PUB-MOBO is that the local search is restricted to a (small) region of the Pareto-front directed by user preferences, alleviating the need to estimate the entire Pareto-front. PUB-MOBO is tested on three synthetic benchmark problems: DTLZ1, DTLZ2 and DH1, as well as on three real-world problems: Vehicle Safety, Conceptual Marine Design, and Car Side Impact. PUB-MOBO consistently outperforms state-of-the-art competitors in terms of proximity to the Pareto-front and utility regret across all the problems.
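A small sketch of the two ingredients named above, assuming a minimization problem and a linear stand-in for the learned utility (the dominance test and ranking are illustrative, not PUB-MOBO itself):

# Restrict attention to non-dominated candidates, then rank them by a
# user-preference utility, mirroring the "preference-dominated" idea.
import numpy as np

def dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """a dominates b when it is no worse everywhere and better somewhere (minimization)."""
    return np.all(a <= b) and np.any(a < b)

def preferred_non_dominated(objectives: np.ndarray, weights: np.ndarray) -> int:
    n = len(objectives)
    nd = [i for i in range(n)
          if not any(dominates(objectives[j], objectives[i]) for j in range(n) if j != i)]
    utilities = -(objectives[nd] @ weights)   # higher utility = better under user preference
    return nd[int(np.argmax(utilities))]

objs = np.array([[0.2, 0.9], [0.5, 0.5], [0.9, 0.2], [0.6, 0.6]])  # last point is dominated
print(preferred_non_dominated(objs, weights=np.array([0.7, 0.3])))  # index of preferred point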

Updated: 2025-03-17 20:25:17

标题: 用户偏好在多目标贝叶斯优化中遇到帕累托最优化

摘要: 将用户偏好纳入多目标贝叶斯优化(MOBO),可以使优化过程个性化。偏好通常被抽象为一个未知的效用函数,通过对潜在结果的成对比较进行估计。然而,由于没有强制非支配性,基于效用的MOBO方法可能产生被邻近解支配的解。此外,经典MOBO通常依赖估计整个帕累托前沿来识别帕累托最优解,这可能代价高昂且忽略用户偏好。在这里,我们提出了一种新方法,称为偏好-效用平衡MOBO(PUB-MOBO),允许用户在接近帕累托的候选解之间进行区分。PUB-MOBO将基于效用的MOBO与局部多梯度下降相结合,将用户偏好的解细化至接近帕累托最优。为此,我们提出了一种新颖的偏好支配效用函数,同时保留用户偏好和候选解之间的支配关系。PUB-MOBO的一个关键优势是,局部搜索被限制在由用户偏好引导的(小)帕累托前沿区域内,从而无需估计整个帕累托前沿。PUB-MOBO在三个合成基准问题(DTLZ1、DTLZ2和DH1)以及三个真实世界问题(车辆安全、概念性海洋设计和车辆侧面碰撞)上进行了测试。在所有问题上,PUB-MOBO在接近帕累托前沿的程度和效用遗憾方面均始终优于最先进的竞争方法。

更新时间: 2025-03-17 20:25:17

领域: cs.LG

下载: http://arxiv.org/abs/2502.06971v3

Align and Distill: Unifying and Improving Domain Adaptive Object Detection

Object detectors often perform poorly on data that differs from their training set. Domain adaptive object detection (DAOD) methods have recently demonstrated strong results on addressing this challenge. Unfortunately, we identify systemic benchmarking pitfalls that call past results into question and hamper further progress: (a) Overestimation of performance due to underpowered baselines, (b) Inconsistent implementation practices preventing transparent comparisons of methods, and (c) Lack of generality due to outdated backbones and lack of diversity in benchmarks. We address these problems by introducing: (1) A unified benchmarking and implementation framework, Align and Distill (ALDI), enabling comparison of DAOD methods and supporting future development, (2) A fair and modern training and evaluation protocol for DAOD that addresses benchmarking pitfalls, (3) A new DAOD benchmark dataset, CFC-DAOD, enabling evaluation on diverse real-world data, and (4) A new method, ALDI++, that achieves state-of-the-art results by a large margin. ALDI++ outperforms the previous state-of-the-art by +3.5 AP50 on Cityscapes to Foggy Cityscapes, +5.7 AP50 on Sim10k to Cityscapes (where ours is the only method to outperform a fair baseline), and +0.6 AP50 on CFC Kenai to Channel. ALDI and ALDI++ are architecture-agnostic, setting a new state-of-the-art for YOLO and DETR-based DAOD as well without additional hyperparameter tuning. Our framework, dataset, and state-of-the-art method offer a critical reset for DAOD and provide a strong foundation for future research. Code and data are available: https://github.com/justinkay/aldi and https://github.com/visipedia/caltech-fish-counting.

Updated: 2025-03-17 20:18:16

标题: 对齐和精炼:统一和改进领域自适应目标检测

摘要: 目标检测器在与训练集不同的数据上通常表现不佳。最近,领域自适应目标检测(DAOD)方法在应对这一挑战方面展现出强大的效果。不幸的是,我们发现了系统性的基准测试陷阱,这些陷阱使既有结果受到质疑,并阻碍了进一步的进展:(a)由于基线不足导致性能被高估,(b)不一致的实现实践妨碍了方法间的透明比较,(c)骨干网络过时以及基准缺乏多样性导致通用性不足。我们通过引入以下内容来解决这些问题:(1)统一的基准测试和实现框架Align and Distill(ALDI),支持DAOD方法的比较并支撑未来的发展;(2)针对DAOD的公平且现代的训练与评估协议,解决基准测试陷阱;(3)一个新的DAOD基准数据集CFC-DAOD,支持在多样化真实世界数据上的评估;以及(4)一种新方法ALDI++,以大幅优势达到最新技术水平。ALDI++在Cityscapes到Foggy Cityscapes上比之前的最优结果高出+3.5 AP50,在Sim10k到Cityscapes上高出+5.7 AP50(我们是唯一超越公平基线的方法),在CFC Kenai到Channel上高出+0.6 AP50。ALDI和ALDI++与架构无关,在无需额外超参数调整的情况下,也为基于YOLO和DETR的DAOD设立了新的最高水平。我们的框架、数据集和最先进的方法为DAOD提供了一次关键的重置,并为未来研究奠定了坚实基础。代码和数据可在以下链接获取:https://github.com/justinkay/aldi 和 https://github.com/visipedia/caltech-fish-counting。

更新时间: 2025-03-17 20:18:16

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2403.12029v3

Towards Resource-Efficient Compound AI Systems

Compound AI Systems, integrating multiple interacting components like models, retrievers, and external tools, have emerged as essential for addressing complex AI tasks. However, current implementations suffer from inefficient resource utilization due to tight coupling between application logic and execution details, a disconnect between orchestration and resource management layers, and the perceived exclusiveness between efficiency and quality. We propose a vision for resource-efficient Compound AI Systems through a declarative workflow programming model and an adaptive runtime system for dynamic scheduling and resource-aware decision-making. Decoupling application logic from low-level details exposes levers for the runtime to flexibly configure the execution environment and resources, without compromising on quality. Enabling collaboration between the workflow orchestration and cluster manager enables higher efficiency through better scheduling and resource management. We are building a prototype system, called Murakkab, to realize this vision. Our preliminary evaluation demonstrates speedups up to $\sim 3.4\times$ in workflow completion times while delivering $\sim 4.5\times$ higher energy efficiency, showing promise in optimizing resources and advancing AI system design.

Updated: 2025-03-17 20:14:48

标题: 朝向资源高效的复合人工智能系统

摘要: 复合AI系统集成了模型、检索器和外部工具等多个相互作用的组件,已成为解决复杂AI任务的关键。然而,由于应用逻辑与执行细节之间的紧密耦合、编排层与资源管理层之间的脱节,以及效率与质量之间被认为互斥的观念,当前的实现存在资源利用效率低下的问题。 我们提出了通过声明式工作流编程模型和自适应运行时系统实现资源高效复合AI系统的愿景,以支持动态调度和资源感知的决策制定。将应用逻辑与底层细节解耦,为运行时灵活配置执行环境和资源提供了调节手段,而不会牺牲质量。让工作流编排与集群管理器协作,则可通过更好的调度和资源管理实现更高的效率。 我们正在构建一个名为Murakkab的原型系统来实现这一愿景。我们的初步评估显示,工作流完成时间可加速约3.4倍,同时能源效率提高约4.5倍,展示了在优化资源和推进AI系统设计方面的潜力。

更新时间: 2025-03-17 20:14:48

领域: cs.DC,cs.AI

下载: http://arxiv.org/abs/2501.16634v3

Mitigating Spectral Bias in Neural Operators via High-Frequency Scaling for Physical Systems

Neural operators have emerged as powerful surrogates for modeling complex physical problems. However, they suffer from spectral bias making them oblivious to high-frequency modes, which are present in multiscale physical systems. Therefore, they tend to produce over-smoothed solutions, which is particularly problematic in modeling turbulence and for systems with intricate patterns and sharp gradients such as multi-phase flow systems. In this work, we introduce a new approach named high-frequency scaling (HFS) to mitigate spectral bias in convolutional-based neural operators. By integrating HFS with proper variants of UNet neural operators, we demonstrate a higher prediction accuracy by mitigating spectral bias in single and two-phase flow problems. Unlike Fourier-based techniques, HFS is directly applied to the latent space, thus eliminating the computational cost associated with the Fourier transform. Additionally, we investigate alternative spectral bias mitigation through diffusion models conditioned on neural operators. While the diffusion model integrated with the standard neural operator may still suffer from significant errors, these errors are substantially reduced when the diffusion model is integrated with a HFS-enhanced neural operator.
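A hedged sketch of the high-frequency scaling idea: split a latent feature map into a smooth part and a high-frequency residual, then amplify the residual. The pooling-based split and the gain value are assumptions, since the paper's exact latent-space operator is not reproduced here:

# Boost high-frequency content of a latent feature map directly, with no
# Fourier transform, in the spirit of HFS applied inside a UNet operator.
import torch
import torch.nn.functional as F

def high_frequency_scaling(latent: torch.Tensor, gain: float = 1.5) -> torch.Tensor:
    """latent: (B, C, H, W) feature map from a UNet-style neural operator."""
    low = F.avg_pool2d(latent, kernel_size=3, stride=1, padding=1)  # smooth component
    high = latent - low                                             # high-frequency residual
    return low + gain * high                                        # amplify fine-scale modes

x = torch.randn(2, 16, 32, 32)
print(high_frequency_scaling(x).shape)  # torch.Size([2, 16, 32, 32])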

Updated: 2025-03-17 20:08:47

标题: 通过高频缩放缓解面向物理系统的神经算子中的谱偏差

摘要: 神经算子已经成为建模复杂物理问题的强大替代方法。然而,它们存在谱偏差问题,使其对高频模式视而不见,而这些高频模式存在于多尺度物理系统中。因此,它们往往会产生过度平滑的解决方案,这在建模湍流和具有复杂模式和锐利梯度的系统(如多相流系统)中尤为棘手。在本研究中,我们引入了一种名为高频率缩放(HFS)的新方法,以减轻基于卷积的神经算子的谱偏差。通过将HFS与适当的UNet神经算子变体结合,我们在单相和两相流问题中通过减轻谱偏差来展示更高的预测准确性。与基于傅里叶的技术不同,HFS直接应用于潜在空间,从而消除了与傅里叶变换相关的计算成本。此外,我们通过在神经算子上进行条件化的扩散模型来研究替代的谱偏差减轻方法。虽然集成了标准神经算子的扩散模型可能仍然存在显着的错误,但当扩散模型与增强型HFS神经算子集成时,这些错误会大大减少。

更新时间: 2025-03-17 20:08:47

领域: cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2503.13695v1

QPRAC: Towards Secure and Practical PRAC-based Rowhammer Mitigation using Priority Queues

JEDEC has introduced the Per Row Activation Counting (PRAC) framework for DDR5 and future DRAMs to enable precise counting of DRAM row activations. PRAC enables a holistic mitigation of Rowhammer attacks even at ultra-low Rowhammer thresholds. PRAC uses an Alert Back-Off (ABO) protocol to request the memory controller to issue Rowhammer mitigation requests. However, recent PRAC implementations are either insecure or impractical. For example, Panopticon, the inspiration for PRAC, is rendered insecure if implemented per JEDEC's PRAC specification. On the other hand, the recent UPRAC proposal is impractical since it needs oracular knowledge of the `top-N' activated DRAM rows that require mitigation. This paper provides the first secure, scalable, and practical RowHammer solution using the PRAC framework. The crux of our proposal is the design of a priority-based service queue (PSQ) for mitigations that prioritizes pending mitigations based on activation counts to avoid the security risks of prior solutions. This provides principled security using the reactive ABO protocol. Furthermore, we co-design our PSQ, with opportunistic mitigation on Refresh Management (RFM) operations and proactive mitigation during refresh (REF), to limit the performance impact of ABO-based mitigations. QPRAC provides secure and practical RowHammer mitigation that scales to Rowhammer thresholds as low as 71 while incurring a 0.8% slowdown for benign workloads, which further reduces to 0% with proactive mitigations.
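An illustrative sketch of a priority-based service queue keyed by activation counts, as described above. Row IDs and counts are hypothetical, and a hardware PSQ is of course far simpler than a Python heap; this only shows the ordering policy:

# Pending Rowhammer mitigations are served highest-activation-count first.
# heapq is a min-heap, so counts are negated on insertion.
import heapq

class PrioritySQ:
    def __init__(self):
        self._heap = []  # entries: (-activation_count, row_id)

    def enqueue(self, row_id: int, activation_count: int) -> None:
        heapq.heappush(self._heap, (-activation_count, row_id))

    def mitigate_next(self):
        """Serve the pending mitigation with the highest activation count."""
        if not self._heap:
            return None
        neg_count, row = heapq.heappop(self._heap)
        return row, -neg_count

psq = PrioritySQ()
for row, acts in [(17, 540), (3, 1290), (42, 880)]:
    psq.enqueue(row, acts)
print(psq.mitigate_next())  # (3, 1290): the most-activated row is mitigated first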

Updated: 2025-03-17 20:03:50

标题: QPRAC:利用优先队列实现安全且实用的基于PRAC的Rowhammer缓解方案

摘要: JEDEC引入了Per Row Activation Counting(PRAC)框架,用于DDR5和未来DRAM,以实现对DRAM行激活的精确计数。PRAC使得即使在极低的Rowhammer阈值下也能全面减缓Rowhammer攻击。PRAC使用Alert Back-Off(ABO)协议请求内存控制器发出Rowhammer减缓请求。然而,最近的PRAC实现要么不安全,要么不实用。例如,Panopticon,PRAC的灵感来源,如果按照JEDEC的PRAC规范实施,就会变得不安全。另一方面,最近的UPRAC提案不切实际,因为它需要对需要减缓的“top-N”激活的DRAM行有预言性的知识。 本文提出了使用PRAC框架的第一个安全、可扩展和实用的RowHammer解决方案。我们提议的关键是设计一个基于优先级的服务队列(PSQ),用于根据激活计数优先处理待处理的减缓措施,以避免先前解决方案的安全风险。这通过使用反应性ABO协议提供了原则性安全保障。此外,我们与Refresh Management(RFM)操作上的机会性减缓以及刷新(REF)期间的主动减缓进行协同设计,以限制基于ABO的减缓对性能的影响。QPRAC提供了安全且实用的RowHammer减缓,可扩展到Rowhammer阈值低至71,对良性工作负载造成0.8%的减速,而通过主动减缓进一步降至0%。

更新时间: 2025-03-17 20:03:50

领域: cs.CR

下载: http://arxiv.org/abs/2501.18861v4

Atyaephyra at SemEval-2025 Task 4: Low-Rank NPO

We present a submission to the SemEval 2025 shared task on unlearning sensitive content from LLMs. Our approach employs negative preference optimization using low-rank adaptation. We show that we can utilize this combination to cheaply compute additional regularization terms, which help with unlearning stabilization. The results of our approach significantly exceed the shared task baselines.
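A hedged sketch of the negative preference optimization objective on forget-set log-probabilities; the low-rank part would come from wrapping the model in LoRA adapters, which is omitted here, and beta and the inputs are illustrative:

# NPO loss: pushes the tuned model's log-probability on forget data below
# that of the frozen reference model; reduces to gradient ascent on NLL as
# beta -> 0.
import torch
import torch.nn.functional as F

def npo_loss(logp_theta: torch.Tensor, logp_ref: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """logp_theta / logp_ref: per-sequence log-probs under tuned and reference models."""
    return (-2.0 / beta) * F.logsigmoid(-beta * (logp_theta - logp_ref)).mean()

logp_theta = torch.tensor([-12.3, -8.9], requires_grad=True)
logp_ref = torch.tensor([-11.0, -9.5])
loss = npo_loss(logp_theta, logp_ref)
loss.backward()
print(float(loss), logp_theta.grad)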

Updated: 2025-03-17 19:59:19

标题: Atyaephyra在SemEval-2025任务4中的表现:低秩NPO

摘要: 我们介绍了提交至SemEval 2025关于从LLMs中去除敏感内容的共享任务的方案。我们的方法采用基于低秩适配的负偏好优化。我们展示了利用这种组合可以低成本地计算额外的正则化项,有助于稳定去学习过程。我们方法的结果显著超过了共享任务的基线。

更新时间: 2025-03-17 19:59:19

领域: cs.CL,cs.AI,cs.LG,68T50 (Primary), 68T07 (Secondary),I.2.7

下载: http://arxiv.org/abs/2503.13690v1

ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models

Neural networks often operate in the overparameterized regime, in which there are far more parameters than training samples, allowing the training data to be fit perfectly. That is, training the network effectively learns an interpolating function, and properties of the interpolant affect predictions the network will make on new samples. This manuscript explores the properties of such functions learned by neural networks of depth greater than two layers. Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding additional linear layers to the input side of a shallow ReLU network yields a representation cost favoring functions with low mixed variation -- that is, it has limited variation in directions orthogonal to a low-dimensional subspace and can be well approximated by a single- or multi-index model. This bias occurs because minimizing the sum of squared weights of the linear layers is equivalent to minimizing a low-rank promoting Schatten quasi-norm of a single "virtual" weight matrix. Our experiments confirm this behavior in standard network training regimes. They additionally show that linear layers can improve generalization and the learned network is well-aligned with the true latent low-dimensional linear subspace when data is generated using a multi-index model.
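The low-rank bias claimed above rests on a standard factorization identity; stated in LaTeX with $W$ the "virtual" weight matrix, $W_1,\dots,W_L$ the linear factors, and $\sigma_k(W)$ the singular values (this notation is ours, not the paper's):

$$\min_{W_L \cdots W_1 = W} \; \sum_{i=1}^{L} \|W_i\|_F^2 \;=\; L\,\|W\|_{S_{2/L}}^{2/L} \;=\; L \sum_k \sigma_k(W)^{2/L}.$$

For $L = 2$ this recovers (twice) the nuclear norm, and for $L > 2$ the exponent $2/L < 1$ makes the penalty a low-rank-promoting Schatten quasi-norm, consistent with the abstract's claim.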

Updated: 2025-03-17 19:39:20

标题: 带线性层的ReLU神经网络偏向于单索引和多索引模型

摘要: 神经网络通常在过参数化的情形下运行,此时参数数量远多于训练样本,使训练数据可以被完美拟合。换句话说,训练网络实际上是学习一个插值函数,而插值函数的性质会影响网络对新样本的预测。本文探讨了深度大于两层的神经网络所学习函数的性质。我们的框架考虑了一系列深度不同、容量相同但表示成本不同的网络。由神经网络架构诱导的函数表示成本,是网络表示该函数所需的最小平方权重之和;它反映了与该架构相关的函数空间偏置。我们的结果表明,在浅层ReLU网络的输入端添加额外的线性层,会产生一种偏向低混合变差函数的表示成本,即函数在与某个低维子空间正交的方向上变化有限,可以很好地由单索引或多索引模型近似。产生这种偏置的原因是,最小化线性层平方权重之和等价于最小化单个“虚拟”权重矩阵的一种促进低秩的Schatten准范数。我们的实验证实了这种行为在标准网络训练设置下的存在。实验还表明,线性层可以提升泛化能力,并且当数据由多索引模型生成时,学习到的网络与真实的潜在低维线性子空间对齐良好。

更新时间: 2025-03-17 19:39:20

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2305.15598v4

Novel AI-Based Quantification of Breast Arterial Calcification to Predict Cardiovascular Risk

Women are underdiagnosed and undertreated for cardiovascular disease. Automatic quantification of breast arterial calcification on screening mammography can identify women at risk for cardiovascular disease and enable earlier treatment and management of disease. In this retrospective study of 116,135 women from two healthcare systems, a transformer-based neural network quantified BAC severity (no BAC, mild, moderate, and severe) on screening mammograms. Outcomes included major adverse cardiovascular events (MACE) and all-cause mortality. BAC severity was independently associated with MACE after adjusting for cardiovascular risk factors, with increasing hazard ratios from mild (HR 1.18-1.22), moderate (HR 1.38-1.47), to severe BAC (HR 2.03-2.22) across datasets (all p<0.001). This association remained significant across all age groups, with even mild BAC indicating increased risk in women under 50. BAC remained an independent predictor when analyzed alongside ASCVD risk scores, showing significant associations with myocardial infarction, stroke, heart failure, and mortality (all p<0.005). Automated BAC quantification enables opportunistic cardiovascular risk assessment during routine mammography without additional radiation or cost. This approach provides value beyond traditional risk factors, particularly in younger women, offering potential for early CVD risk stratification in the millions of women undergoing annual mammography.

Updated: 2025-03-17 19:38:17

标题: 新型基于人工智能的乳腺动脉钙化定量方法用于预测心血管风险

摘要: 妇女在心血管疾病方面存在诊断不足和治疗不足的情况。在筛查乳腺X线摄影中自动量化乳腺动脉钙化可以识别存在心血管疾病风险的妇女,并实现对疾病的早期治疗和管理。在这项回顾性研究中,基于Transformer的神经网络对来自两个医疗系统的116,135名妇女的筛查乳腺X线摄影中的BAC严重程度(无BAC、轻度、中度和重度)进行了量化。结果包括主要不良心血管事件(MACE)和全因死亡率。在校正心血管风险因素后,BAC严重程度与MACE独立相关,在各数据集中,轻度(HR 1.18-1.22)、中度(HR 1.38-1.47)和重度BAC(HR 2.03-2.22)的风险比逐级增加(所有p<0.001)。这种关联在所有年龄组中均显著,即使在50岁以下的妇女中,轻度BAC也提示风险升高。当与ASCVD风险评分一起分析时,BAC仍然是一个独立的预测因子,与心肌梗死、中风、心力衰竭和死亡均显著相关(所有p<0.005)。自动化的BAC量化使得在常规乳腺X线摄影中进行机会性心血管风险评估成为可能,而无需额外的辐射或成本。这种方法提供了超越传统风险因素的价值,尤其对年轻妇女而言,为每年接受乳腺X线摄影检查的数百万妇女提供了早期心血管疾病风险分层的潜力。

更新时间: 2025-03-17 19:38:17

领域: eess.IV,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.14550v1

Cross-Domain Knowledge Transfer for Underwater Acoustic Classification Using Pre-trained Models

Transfer learning is commonly employed to leverage large, pre-trained models and perform fine-tuning for downstream tasks. The most prevalent pre-trained models are initially trained using ImageNet. However, their ability to generalize can vary across different data modalities. This study compares pre-trained Audio Neural Networks (PANNs) and ImageNet pre-trained models within the context of underwater acoustic target recognition (UATR). It was observed that the ImageNet pre-trained models slightly outperform pre-trained audio models in passive sonar classification. We also analyzed the impact of audio sampling rates for model pre-training and fine-tuning. This study contributes to transfer learning applications of UATR, illustrating the potential of pre-trained models to address limitations caused by scarce, labeled data in the UATR domain.

Updated: 2025-03-17 19:33:19

标题: 跨领域知识转移在使用预训练模型进行水声分类中的应用

摘要: 迁移学习通常用于利用大型预训练模型,并对下游任务进行微调。最常见的预训练模型最初是使用ImageNet进行训练的。然而,它们在不同数据模态下的泛化能力可能会有所不同。本研究比较了预训练的音频神经网络(PANNs)和ImageNet预训练模型在水下声学目标识别(UATR)领域中的表现。观察到ImageNet预训练模型在被动声纳分类中略优于预训练音频模型。我们还分析了音频采样率对模型预训练和微调的影响。本研究对UATR的迁移学习应用做出了贡献,展示了预训练模型在解决UATR领域中由稀缺标记数据引起的限制方面的潜力。

更新时间: 2025-03-17 19:33:19

领域: cs.SD,cs.LG,eess.AS

下载: http://arxiv.org/abs/2409.13878v2

PrETi: Predicting Execution Time in Early Stage with LLVM and Machine Learning

We introduce preti, a novel framework for predicting software execution time during the early stages of development. preti leverages an LLVM-based simulation environment to extract timing-related runtime information, such as the count of executed LLVM IR instructions. This information, combined with historical execution time data, is utilized to train machine learning models for accurate time prediction. To further enhance prediction accuracy, our approach incorporates simulations of cache accesses and branch prediction. The evaluations on public benchmarks demonstrate that preti achieves an average Absolute Percentage Error (APE) of 11.98\%, surpassing state-of-the-art methods. These results underscore the effectiveness and efficiency of preti as a robust solution for early-stage timing analysis.
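An illustrative sketch of the prediction stage: LLVM IR opcode counts plus simulated cache-miss and branch-miss counts as features for a regression model of execution time. Feature names and the synthetic data are hypothetical, and the paper's exact model choice may differ:

# Train a regressor from static/simulated counts to execution time, then
# report the average absolute percentage error on held-out programs.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["ir_load", "ir_store", "ir_br", "ir_mul", "cache_misses", "branch_misses"]

rng = np.random.default_rng(0)
X = rng.poisson(lam=[500, 300, 200, 80, 40, 25], size=(200, len(FEATURES))).astype(float)
y = 0.8 * X[:, 0] + 1.2 * X[:, 4] + 2.0 * X[:, 5] + rng.normal(0, 10, 200)  # synthetic times

model = GradientBoostingRegressor().fit(X[:150], y[:150])
pred, true = model.predict(X[150:]), y[150:]
print("APE: %.2f%%" % (100 * np.abs((pred - true) / true).mean()))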

Updated: 2025-03-17 19:32:26

标题: PrETi:使用LLVM和机器学习在早期阶段预测执行时间

摘要: 我们介绍了一种新颖的框架preti,用于在开发的早期阶段预测软件执行时间。preti利用基于LLVM的模拟环境提取与时间相关的运行时信息,例如执行的LLVM IR指令数量。这些信息结合历史执行时间数据被用来训练机器学习模型进行准确的时间预测。为了进一步提高预测准确性,我们的方法还包括对缓存访问和分支预测的模拟。对公共基准测试的评估显示,preti实现了平均绝对百分比误差(APE)为11.98\%,超越了最先进的方法。这些结果强调了preti作为早期阶段时间分析的稳健解决方案的有效性和效率。

更新时间: 2025-03-17 19:32:26

领域: cs.PF,cs.LG

下载: http://arxiv.org/abs/2503.13679v1

Sampling Decisions

In this manuscript we introduce a novel Decision Flow (DF) framework for sampling from a target distribution while incorporating additional guidance from a prior sampler. DF can be viewed as an AI-driven algorithmic reincarnation of the Markov Decision Process (MDP) approach in Stochastic Optimal Control. It extends the continuous-space, continuous-time path-integral diffusion sampling technique to discrete time and space, while also generalizing the Generative Flow Network framework. In its most basic form, an explicit, Neural Network (NN) free formulation, DF leverages the linear solvability of the underlying MDP to adjust the transition probabilities of the prior sampler. The resulting Markov Process is expressed as a convolution of the reverse time Green's function of the prior sampling with the target distribution. We illustrate the DF framework through an example of sampling from the Ising model, discuss potential NN based extensions, and outline how DF can enhance guided sampling across various applications.

Updated: 2025-03-17 19:32:22

标题: 抽样决策

摘要: 在这篇手稿中,我们介绍了一个新颖的决策流(DF)框架,用于从目标分布中采样,同时结合来自先验采样器的额外引导。DF可以被视为随机最优控制中马尔可夫决策过程(MDP)方法的一种由AI驱动的算法重生。它将连续空间、连续时间的路径积分扩散采样技术推广到离散时间与空间,同时也推广了生成流网络(Generative Flow Network)框架。在其最基本的形式中(一种显式的、无需神经网络(NN)的表述),DF利用底层MDP的线性可解性来调整先验采样器的转移概率。所得到的马尔可夫过程表示为先验采样的逆时间格林函数与目标分布的卷积。我们通过从伊辛模型采样的示例来说明DF框架,讨论了潜在的基于NN的扩展,并概述了DF如何在各种应用中增强引导采样。

更新时间: 2025-03-17 19:32:22

领域: cs.LG,cond-mat.stat-mech,cs.AI,cs.SY,eess.SY,stat.ML

下载: http://arxiv.org/abs/2503.14549v1

Bayesian Kernel Regression for Functional Data

In supervised learning, the output variable to be predicted is often represented as a function, such as a spectrum or probability distribution. Despite its importance, functional output regression remains relatively unexplored. In this study, we propose a novel functional output regression model based on kernel methods. Unlike conventional approaches that independently train regressors with scalar outputs for each measurement point of the output function, our method leverages the covariance structure within the function values, akin to multitask learning, leading to enhanced learning efficiency and improved prediction accuracy. Compared with existing nonlinear function-on-scalar models in statistical functional data analysis, our model effectively handles high-dimensional nonlinearity while maintaining a simple model structure. Furthermore, the fully kernel-based formulation allows the model to be expressed within the framework of reproducing kernel Hilbert space (RKHS), providing an analytic form for parameter estimation and a solid foundation for further theoretical analysis. The proposed model delivers a functional output predictive distribution derived analytically from a Bayesian perspective, enabling the quantification of uncertainty in the predicted function. We demonstrate the model's enhanced prediction performance through experiments on artificial datasets and density of states prediction tasks in materials science.
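A minimal kernel-ridge sketch with functional outputs, in the multitask spirit described above: one shared solve over samples predicts the whole output function at once. The Bayesian posterior and the output-covariance coupling of the actual model are omitted; kernels, grids, and data are illustrative:

# Functional-output regression: Y stores one whole function (on a grid t)
# per training input, and a single kernel-ridge solve predicts all grid
# points of a new function jointly.
import numpy as np

def rbf(A, B, ls=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(40, 1))                      # scalar inputs
t = np.linspace(0, 1, 50)                                 # grid carrying the output function
Y = np.sin(X) * np.cos(2 * np.pi * t) + 0.05 * rng.normal(size=(40, 50))

alpha = np.linalg.solve(rbf(X, X) + 0.1 * np.eye(40), Y)  # one solve, all grid points jointly
f_new = rbf(np.array([[0.3]]), X) @ alpha                 # predicted function at x = 0.3
print(f_new.shape)                                        # (1, 50)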

Updated: 2025-03-17 19:28:27

标题: 面向函数型数据的贝叶斯核回归

摘要: 在监督学习中,要预测的输出变量通常表示为一个函数,例如光谱或概率分布。尽管其重要性不言而喻,函数型输出回归仍相对缺乏研究。在这项研究中,我们提出了一种基于核方法的新型函数型输出回归模型。传统方法为输出函数的每个测量点独立训练标量输出的回归器;与之不同,我们的方法利用函数值之间的协方差结构(类似于多任务学习),从而提高学习效率并改善预测精度。与统计函数型数据分析中现有的非线性“函数对标量”模型相比,我们的模型在有效处理高维非线性的同时保持了简单的模型结构。此外,完全基于核的表述使模型可以在再生核希尔伯特空间(RKHS)框架内表达,为参数估计提供了解析形式,并为进一步的理论分析奠定了坚实基础。所提出的模型从贝叶斯视角解析地导出函数型输出预测分布,从而能够量化预测函数的不确定性。我们通过人工数据集以及材料科学中的态密度预测任务的实验,展示了该模型更优的预测性能。

更新时间: 2025-03-17 19:28:27

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2503.13676v1

Investigation of Time-Frequency Feature Combinations with Histogram Layer Time Delay Neural Networks

While deep learning has reduced the prevalence of manual feature extraction, transformation of data via feature engineering remains essential for improving model performance, particularly for underwater acoustic signals. The methods by which audio signals are converted into time-frequency representations and the subsequent handling of these spectrograms can significantly impact performance. This work demonstrates the performance impact of using different combinations of time-frequency features in a histogram layer time delay neural network. An optimal set of features is identified with results indicating that specific feature combinations outperform single data features.

Updated: 2025-03-17 19:21:40

标题: 对带有直方图层时间延迟神经网络的时频特征组合的研究

摘要: 深度学习已经减少了手动特征提取的普及,但通过特征工程对数据进行转换仍然对于改善模型性能至关重要,特别是对于水下声学信号。将音频信号转换为时频表示并对这些声谱图的后续处理方法可以显著影响性能。本研究展示了在直方图层时间延迟神经网络中使用不同时间频率特征组合的性能影响。通过结果确定了一组最佳特征,表明特定特征组合优于单一数据特征。

更新时间: 2025-03-17 19:21:40

领域: cs.SD,cs.LG,eess.AS

下载: http://arxiv.org/abs/2409.13881v2

On the Convergence of a Federated Expectation-Maximization Algorithm

Data heterogeneity has been a long-standing bottleneck in studying the convergence rates of Federated Learning algorithms. In order to better understand the issue of data heterogeneity, we study the convergence rate of the Expectation-Maximization (EM) algorithm for the Federated Mixture of $K$ Linear Regressions model (FMLR). We completely characterize the convergence rate of the EM algorithm under all regimes of $m/n$ where $m$ is the number of clients and $n$ is the number of data points per client. We show that with a signal-to-noise-ratio (SNR) of order $\Omega(\sqrt{K})$, the well-initialized EM algorithm converges within the minimax distance of the ground truth under all regimes. Interestingly, we identify that when the number of clients grows reasonably with respect to the number of data points per client, the EM algorithm only requires a constant number of iterations to converge. We perform experiments on synthetic data to illustrate our results. In line with our theoretical findings, the simulations show that rather than being a bottleneck, data heterogeneity can accelerate the convergence of iterative federated algorithms.
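A sketch of federated EM for a mixture of K linear regressions: clients compute responsibilities and sufficient statistics locally, and the server aggregates and re-solves. Dimensions, the noise level, and the random initialization are illustrative (the theory above assumes good initialization):

import numpy as np

def client_stats(X, y, betas, sigma=0.5):
    # E-step (local): responsibilities of each mixture component
    resid = y[:, None] - X @ betas.T                      # (n, K)
    logp = -0.5 * (resid / sigma) ** 2
    r = np.exp(logp - logp.max(1, keepdims=True))
    r /= r.sum(1, keepdims=True)
    # Sufficient statistics for the server's weighted least-squares M-step
    A = np.einsum("nk,nd,ne->kde", r, X, X)
    b = np.einsum("nk,nd,n->kd", r, X, y)
    return A, b

rng = np.random.default_rng(0)
K, d, m, n = 2, 3, 10, 50                                 # components, dim, clients, points/client
true = np.stack([np.ones(d), -np.ones(d)])
clients = []
for _ in range(m):
    X = rng.normal(size=(n, d))
    z = rng.integers(K, size=n)
    y = np.einsum("nd,nd->n", X, true[z]) + 0.1 * rng.normal(size=n)
    clients.append((X, y))

betas = rng.normal(size=(K, d))                           # random here; paper assumes good init
for _ in range(25):                                       # federated EM rounds
    stats = [client_stats(X, y, betas) for X, y in clients]
    A = sum(s[0] for s in stats)                          # server-side aggregation
    b = sum(s[1] for s in stats)
    betas = np.stack([np.linalg.solve(A[k] + 1e-6 * np.eye(d), b[k]) for k in range(K)])
print(np.round(betas, 2))                                 # approaches +/-1 up to label swap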

Updated: 2025-03-17 19:15:58

标题: 关于联邦期望最大化算法的收敛性

摘要: 数据异质性一直是研究联邦学习算法收敛速度的长期瓶颈。为了更好地理解数据异质性问题,我们研究了期望最大化(EM)算法在联邦K个线性回归混合模型(FMLR)中的收敛速度。我们完整刻画了EM算法在$m/n$的所有取值情形下的收敛速度,其中$m$是客户端数量,$n$是每个客户端的数据点数量。我们证明,当信噪比(SNR)达到$\Omega(\sqrt{K})$量级时,良好初始化的EM算法在所有情形下都能收敛到真实参数的极小极大距离之内。有趣的是,我们发现当客户端数量相对于每个客户端的数据点数量合理增长时,EM算法只需常数次迭代即可收敛。我们在合成数据上进行实验以说明我们的结果。与我们的理论发现一致,模拟结果显示,数据异质性非但不是瓶颈,反而可以加速迭代式联邦算法的收敛。

更新时间: 2025-03-17 19:15:58

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2408.05819v2

INPROVF: Leveraging Large Language Models to Repair High-level Robot Controllers from Assumption Violations

This paper presents INPROVF, an automatic framework that combines large language models (LLMs) and formal methods to speed up the repair process of high-level robot controllers. Previous approaches based solely on formal methods are computationally expensive and cannot scale to large state spaces. In contrast, INPROVF uses LLMs to generate repair candidates, and formal methods to verify their correctness. To improve the quality of these candidates, our framework first translates the symbolic representations of the environment and controllers into natural language descriptions. If a candidate fails the verification, INPROVF provides feedback on potential unsafe behaviors or unsatisfied tasks, and iteratively prompts LLMs to generate improved solutions. We demonstrate the effectiveness of INPROVF through 12 violations with various workspaces, tasks, and state space sizes.

Updated: 2025-03-17 19:08:36

标题: INPROVF:利用大型语言模型修复高级机器人控制器的假设违反

摘要: 本文介绍了INPROVF,这是一个结合大型语言模型(LLMs)与形式化方法以加快高层机器人控制器修复过程的自动框架。先前仅基于形式化方法的做法计算代价高昂,且无法扩展到大型状态空间。相比之下,INPROVF使用LLMs生成修复候选,再用形式化方法验证其正确性。为了提高这些候选的质量,我们的框架首先将环境和控制器的符号表示转换为自然语言描述。如果候选未通过验证,INPROVF将提供关于潜在不安全行为或未满足任务的反馈,并迭代地提示LLMs生成改进的解决方案。我们通过12个假设违反案例(涵盖不同的工作空间、任务和状态空间大小)展示了INPROVF的有效性。

更新时间: 2025-03-17 19:08:36

领域: cs.RO,cs.AI,cs.FL,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.13660v1

The Future of Data Science Education

The definition of Data Science is a hotly debated topic. For many, the definition is a simple shortcut to Artificial Intelligence or Machine Learning. However, there is far more depth and nuance to the field of Data Science than a simple shortcut can provide. The School of Data Science at the University of Virginia has developed a novel model for the definition of Data Science. This model is based on identifying a unified understanding of the data work done across all areas of Data Science. It represents a generational leap forward in how we understand and teach Data Science. In this paper we will present the core features of the model and explain how it unifies various concepts going far beyond the analytics component of AI. From this foundation we will present our Undergraduate Major curriculum in Data Science and demonstrate how it prepares students to be well-rounded Data Science team members and leaders. The paper will conclude with an in-depth overview of the Foundations of Data Science course designed to introduce students to the field while also implementing proven STEM oriented pedagogical methods. These include, for example, specifications grading, active learning lectures, guest lectures from industry experts and weekly gamification labs.

Updated: 2025-03-17 19:06:45

标题: 数据科学教育的未来

摘要: 数据科学的定义是一个备受争议的话题。对于许多人来说,这个定义是对人工智能或机器学习的简单捷径。然而,数据科学领域远比简单捷径所能提供的更加深入和微妙。弗吉尼亚大学数据科学学院开发了一个新颖的模型来定义数据科学。这个模型是基于对所有数据科学领域的数据工作进行统一理解。它代表了我们如何理解和教授数据科学方面的一次世代性飞跃。在本文中,我们将介绍该模型的核心特征,并解释它如何统一各种概念,远远超越了人工智能的分析组件。基于这个基础,我们将展示我们的数据科学本科专业课程,并演示它如何培养学生成为全面发展的数据科学团队成员和领导者。本文将以深入介绍数据科学基础课程结束,旨在向学生介绍该领域,同时实施经过验证的STEM导向的教学方法。这些方法包括规范分级、积极学习讲座、来自行业专家的客座讲座和每周的游戏化实验室。

更新时间: 2025-03-17 19:06:45

领域: stat.OT,cs.AI

下载: http://arxiv.org/abs/2407.11824v2

Why Do Multi-Agent LLM Systems Fail?

Despite growing enthusiasm for Multi-Agent Systems (MAS), where multiple LLM agents collaborate to accomplish tasks, their performance gains across popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the need to analyze the challenges hindering MAS effectiveness. In this paper, we present the first comprehensive study of MAS challenges. We analyze five popular MAS frameworks across over 150 tasks, involving six expert human annotators. We identify 14 unique failure modes and propose a comprehensive taxonomy applicable to various MAS frameworks. This taxonomy emerges iteratively from agreements among three expert annotators per study, achieving a Cohen's Kappa score of 0.88. These fine-grained failure modes are organized into 3 categories, (i) specification and system design failures, (ii) inter-agent misalignment, and (iii) task verification and termination. To support scalable evaluation, we integrate MASFT with LLM-as-a-Judge. We also explore if identified failures could be easily prevented by proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. Our findings reveal that identified failures require more complex solutions, highlighting a clear roadmap for future research. We open-source our dataset and LLM annotator.

Updated: 2025-03-17 19:04:38

标题: 为什么多智能体LLM系统会失败?

摘要: 尽管人们对多智能体系统(MAS)的热情日益高涨(其中多个LLM智能体协作完成任务),但与单智能体框架相比,它们在流行基准测试上的性能提升仍然微乎其微。这一差距凸显了分析阻碍MAS发挥效果的挑战的必要性。 在本文中,我们首次对MAS的挑战进行了全面研究。我们在超过150个任务上分析了五种流行的MAS框架,共有六名专家人工注释者参与。我们识别出14种独特的失败模式,并提出了适用于各种MAS框架的全面分类法。该分类法经由每项研究三名专家注释者之间的一致意见迭代形成,Cohen's Kappa得分达到0.88。这些细粒度的失败模式被组织为三类:(i)规范与系统设计失败,(ii)智能体间不对齐,以及(iii)任务验证与终止。为了支持可扩展的评估,我们将MASFT与作为评判者的LLM(LLM-as-a-Judge)集成。我们还通过提出两种干预措施(改进智能体角色的规范和增强编排策略)来探讨已识别的失败是否可以被轻松预防。我们的研究结果显示,已识别的失败需要更复杂的解决方案,为未来研究提供了清晰的路线图。我们开源了我们的数据集和LLM注释器。

更新时间: 2025-03-17 19:04:38

领域: cs.AI

下载: http://arxiv.org/abs/2503.13657v1

SOSecure: Safer Code Generation with RAG and StackOverflow Discussions

Large Language Models (LLMs) are widely used for automated code generation. Their reliance on infrequently updated pretraining data leaves them unaware of newly discovered vulnerabilities and evolving security standards, making them prone to producing insecure code. In contrast, developer communities on Stack Overflow (SO) provide an ever-evolving repository of knowledge, where security vulnerabilities are actively discussed and addressed through collective expertise. These community-driven insights remain largely untapped by LLMs. This paper introduces SOSecure, a Retrieval-Augmented Generation (RAG) system that leverages the collective security expertise found in SO discussions to improve the security of LLM-generated code. We build a security-focused knowledge base by extracting SO answers and comments that explicitly identify vulnerabilities. Unlike common uses of RAG, SOSecure triggers after code has been generated to find discussions that identify flaws in similar code. These are used in a prompt to an LLM to consider revising the code. Evaluation across three datasets (SALLM, LLMSecEval, and LMSys) show that SOSecure achieves strong fix rates of 71.7%, 91.3%, and 96.7% respectively, compared to prompting GPT-4 without relevant discussions (49.1%, 56.5%, and 37.5%), and outperforms multiple other baselines. SOSecure operates as a language-agnostic complement to existing LLMs, without requiring retraining or fine-tuning, making it easy to deploy. Our results underscore the importance of maintaining active developer forums, which have dropped substantially in usage with LLM adoptions.

Updated: 2025-03-17 19:03:36

标题: SOSecure:使用RAG和StackOverflow讨论生成更安全的代码

摘要: 大型语言模型(LLMs)广泛用于自动生成代码。它们依赖于更新不频繁的预训练数据,使它们对新发现的漏洞和不断发展的安全标准毫无所知,使其容易生成不安全的代码。相比之下,Stack Overflow(SO)上的开发者社区提供了一个不断发展的知识库,安全漏洞在这里通过集体专业知识进行积极讨论和解决。这些社区驱动的见解在LLMs中基本上未被利用。本文介绍了一个名为SOSecure的检索增强生成(RAG)系统,它利用SO讨论中找到的集体安全专业知识来提高LLM生成的代码的安全性。我们通过提取明确识别漏洞的SO答案和评论来构建一个以安全为重点的知识库。与RAG的常见用途不同,SOSecure在生成代码后触发,以查找识别类似代码中存在缺陷的讨论。这些被用作提示LLMs考虑修改代码。对三个数据集(SALLM、LLMSecEval和LMSys)的评估显示,与没有相关讨论的GPT-4(分别为49.1%、56.5%和37.5%)相比,SOSecure分别达到了71.7%、91.3%和96.7%的强大修复率,并且胜过了多个其他基线。SOSecure作为一种语言不可知的补充存在于现有LLMs中,无需重新训练或微调,易于部署。我们的结果强调了保持活跃的开发者论坛的重要性,这些论坛在LLMs的采用中的使用量大幅下降。

更新时间: 2025-03-17 19:03:36

领域: cs.SE,cs.CR

下载: http://arxiv.org/abs/2503.13654v1

Debiased Nonparametric Regression for Statistical Inference and Distributional Robustness

This study proposes a debiasing method for smooth nonparametric estimators. While machine learning techniques such as random forests and neural networks have demonstrated strong predictive performance, their theoretical properties remain relatively underexplored. In particular, many modern algorithms lack guarantees of pointwise and uniform risk convergence, as well as asymptotic normality. These properties are essential for statistical inference and robust estimation and have been well-established for classical methods such as Nadaraya-Watson regression. To ensure these properties for various nonparametric regression estimators, we introduce a model-free debiasing method. By incorporating a correction term that estimates the conditional expected residual of the original estimator, or equivalently, its estimation error, into the initial nonparametric regression estimator, we obtain a debiased estimator that satisfies pointwise and uniform risk convergence, along with asymptotic normality, under mild smoothness conditions. These properties facilitate statistical inference and enhance robustness to covariate shift, making the method broadly applicable to a wide range of nonparametric regression problems.
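A minimal sketch of the debiasing recipe described above: estimate the conditional expected residual of a base estimator with a second smooth regressor and add it back as a correction term. Estimator choices are illustrative, and the sample splitting the theory would call for is omitted:

# Debiased prediction = base fit + smooth estimate of its conditional residual.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=500)

base = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
residual = y - base.predict(X)                                    # estimation error of base fit
corrector = KNeighborsRegressor(n_neighbors=25).fit(X, residual)  # smooth residual model

x_grid = np.linspace(-3, 3, 5)[:, None]
debiased = base.predict(x_grid) + corrector.predict(x_grid)
print(np.round(debiased - np.sin(x_grid[:, 0]), 3))               # should be close to zero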

Updated: 2025-03-17 18:59:48

标题: 面向统计推断与分布稳健性的去偏非参数回归

摘要: 本研究为光滑非参数估计量提出了一种去偏方法。虽然随机森林和神经网络等机器学习技术展现出强大的预测性能,但其理论性质仍相对缺乏探讨。特别是,许多现代算法缺乏逐点与一致风险收敛以及渐近正态性的保证。这些性质对统计推断和稳健估计至关重要,并且对于Nadaraya-Watson回归等经典方法已有充分确立。为了使各种非参数回归估计量具备这些性质,我们引入了一种不依赖具体模型的去偏方法。通过在初始非参数回归估计量中加入一个修正项,该项估计原估计量的条件期望残差(或等价地,其估计误差),我们得到一个去偏估计量,在温和的光滑性条件下满足逐点与一致风险收敛以及渐近正态性。这些性质有助于统计推断,并增强对协变量偏移的稳健性,使该方法可广泛适用于各类非参数回归问题。

更新时间: 2025-03-17 18:59:48

领域: stat.ME,cs.LG,econ.EM,math.ST,stat.ML,stat.TH

下载: http://arxiv.org/abs/2412.20173v3

When Can You Get Away with Low Memory Adam?

Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to $\textit{SlimAdam}$, a memory-efficient Adam variant. $\textit{SlimAdam}$ compresses the second moments along dimensions with high SNR when feasible, and leaves when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
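A sketch of the layer-wise SNR test described above: when a second-moment tensor varies little along a dimension (high mean-to-variance ratio), store only its mean along that dimension. The threshold, shapes, and layer-level aggregation are assumptions:

# Compress Adam's second-moment tensor along a dimension when the SNR
# indicates the per-slice values are well summarized by their mean.
import torch

def compress_second_moment(v: torch.Tensor, dim: int, snr_thresh: float = 10.0):
    mean = v.mean(dim=dim, keepdim=True)
    var = v.var(dim=dim, keepdim=True) + 1e-12
    snr = (mean ** 2 / var).mean()        # layer-level signal-to-noise ratio
    if snr > snr_thresh:                  # safe to replace values by their mean
        return mean, True                 # stores 1/v.shape[dim] of the memory
    return v, False

v = torch.rand(4096, 1024) * 1e-4 + 1e-3  # a second-moment tensor with low variation
compressed, did_compress = compress_second_moment(v, dim=1)
print(did_compress, compressed.shape)     # True, torch.Size([4096, 1])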

Updated: 2025-03-17 18:55:25

标题: 何时可以使用低内存的Adam算法?

摘要: Adam是训练现代机器学习模型的首选优化器,但它需要额外的内存来维护梯度及其平方的移动平均值。虽然已经提出了各种低内存优化器,有时可以匹敌Adam的性能,但它们缺乏可靠性,使Adam仍是默认选择。在这项工作中,我们应用一种简单的逐层信噪比(SNR)分析,来量化二阶矩张量何时可以被其沿不同维度的均值有效替代。我们的SNR分析揭示了架构、训练超参数和数据集属性如何影响沿Adam优化轨迹的可压缩性,并自然地引出了SlimAdam,一种内存高效的Adam变体。SlimAdam在可行时沿高信噪比的维度压缩二阶矩,而在压缩有害时保持原样。通过在各种架构和训练场景中的实验,我们展示了SlimAdam在性能和稳定性上与Adam相当,同时最多可节省98%的二阶矩存储。SlimAdam的代码可在 https://github.com/dayal-kalra/low-memory-adam 获取。

更新时间: 2025-03-17 18:55:25

领域: cs.LG,cond-mat.dis-nn,stat.ML

下载: http://arxiv.org/abs/2503.01843v3

SRBB-Based Quantum State Preparation

In this work, a scalable algorithm for the approximate quantum state preparation problem is proposed, facing a challenge of fundamental importance in many topic areas of quantum computing. The algorithm uses a variational quantum circuit based on the Standard Recursive Block Basis (SRBB), a hierarchical construction for the matrix algebra of the $SU(2^n)$ group, which is capable of linking the variational parameters with the topology of the Lie group. Compared to the full algebra, using only diagonal components reduces the number of CNOTs by an exponential factor, as well as the circuit depth, in full agreement with the relaxation principle, inherent to the approximation methodology, of minimizing resources while achieving high accuracy. The desired quantum state is then approximated by a scalable quantum neural network, which is designed upon the diagonal SRBB sub-algebra. This approach provides a new scheme for approximate quantum state preparation in a variational framework and a specific use case for the SRBB hierarchy. The performance of the algorithm is assessed with different loss functions, like fidelity, trace distance, and Frobenius norm, in relation to two optimizers: Adam and Nelder-Mead. The results highlight the potential of SRBB in close connection with the geometry of unitary groups, achieving high accuracy up to 4 qubits in simulation, but also its current limitations with an increasing number of qubits. Additionally, the approximate SRBB-based QSP algorithm has been tested on real quantum devices to assess its performance with a small number of qubits.

Updated: 2025-03-17 18:51:07

标题: 基于SRBB的量子态制备

摘要: 在这项工作中,我们提出了一种可扩展的近似量子态制备算法,以应对量子计算诸多主题领域中的一个根本性挑战。该算法使用基于标准递归块基(SRBB)的变分量子电路,SRBB是针对$SU(2^n)$群矩阵代数的分层构造,能够将变分参数与李群的拓扑联系起来。与完整代数相比,仅使用对角分量可将CNOT数量以及电路深度减少指数量级,这与近似方法中固有的松弛原则(在实现高精度的同时最小化资源)完全一致。随后,通过一个基于对角SRBB子代数设计的可扩展量子神经网络来近似所需的量子态。这种方法为变分框架下的近似量子态制备提供了新方案,也为SRBB层次结构提供了一个具体用例。算法的性能通过不同的损失函数(如保真度、迹距离和Frobenius范数)并结合Adam和Nelder-Mead两种优化器进行评估。结果凸显了SRBB与酉群几何紧密联系的潜力,在模拟中对最多4量子比特实现了高精度,同时也显示了其在量子比特数量增加时的当前局限。此外,基于SRBB的近似QSP算法已在真实量子设备上进行了测试,以评估其在少量量子比特上的性能。

更新时间: 2025-03-17 18:51:07

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2503.13647v1

Quantum EigenGame for excited state calculation

Computing the excited states of a given Hamiltonian is computationally hard for large systems, but methods that do so using quantum computers scale tractably. This problem is equivalent to the PCA problem where we are interested in decomposing a matrix into a collection of principal components. Classically, PCA is a well-studied problem setting, for which both centralized and distributed approaches have been developed. On the distributed side, one recent approach is that of EigenGame, a game-theoretic approach to finding eigenvectors where each eigenvector reaches a Nash equilibrium either sequentially or in parallel. With this work, we extend the EigenGame algorithm for both a $0^\text{th}$-order approach and for quantum computers, and harness the framework that quantum computing provides in computing excited states. Results show that using the Quantum EigenGame allows us to converge to excited states of a given Hamiltonian without the need of a deflation step. We also develop theory on error accumulation for finite-differences and parameterized approaches.
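A sketch of the classical EigenGame update the quantum extension builds on: each "player" ascends its Rayleigh quotient while being penalized for alignment with earlier players' directions. The matrix, step size, and simple renormalization are illustrative:

# Sequential EigenGame on a symmetric positive-definite matrix M: player i's
# ascent direction is M v_i minus projections onto earlier players.
import numpy as np

def eigengame(M: np.ndarray, k: int, steps: int = 2000, lr: float = 0.1, seed: int = 0):
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(M.shape[0], k))
    V /= np.linalg.norm(V, axis=0)
    for _ in range(steps):
        for i in range(k):
            reward = M @ V[:, i]
            penalty = sum((V[:, i] @ M @ V[:, j]) / (V[:, j] @ M @ V[:, j]) * (M @ V[:, j])
                          for j in range(i))          # stay orthogonal to earlier players
            V[:, i] += lr * (reward - penalty)
            V[:, i] /= np.linalg.norm(V[:, i])        # keep unit norm
    return V

M = np.diag([4.0, 2.0, 1.0])
V = eigengame(M, k=2)
print(np.round(np.abs(V), 2))  # columns approach the top two eigenvectors e1 and e2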

Updated: 2025-03-17 18:48:40

标题: 量子本征游戏用于激发态计算

摘要: 对于大型系统,计算给定哈密顿量的激发态在经典计算上是困难的,但借助量子计算机的方法可以良好扩展。这个问题等价于PCA问题,即我们希望将一个矩阵分解为一组主成分。在经典计算中,PCA是一个研究充分的问题,已发展出集中式和分布式方法。在分布式方面,最近的一种方法是EigenGame,这是一种寻找特征向量的博弈论方法,其中每个特征向量以顺序或并行方式达到纳什均衡。在这项工作中,我们将EigenGame算法扩展到零阶方法和量子计算机,并利用量子计算提供的框架来计算激发态。结果表明,使用量子EigenGame使我们能够收敛到给定哈密顿量的激发态,而无需降阶(deflation)步骤。我们还为有限差分和参数化方法建立了误差累积理论。

更新时间: 2025-03-17 18:48:40

领域: quant-ph,cs.DS,cs.LG,math.OC

下载: http://arxiv.org/abs/2503.13644v1

Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR

In human activity recognition (HAR), activity labels have typically been encoded in one-hot format, though there has been a recent shift towards using textual representations to provide contextual knowledge. Here, we argue that HAR should be anchored to physical motion data, as motion forms the basis of activity and applies effectively across sensing systems, whereas text is inherently limited. We propose SKELAR, a novel HAR framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals. Our method addresses two major challenges: (1) capturing core motion knowledge without context-specific details. We achieve this through a self-supervised coarse angle reconstruction task that recovers joint rotation angles, invariant to both users and deployments; (2) adapting the representations to downstream tasks with varying modalities and focuses. To address this, we introduce a self-attention matching module that dynamically prioritizes relevant body parts in a data-driven manner. Given the lack of corresponding labels in existing skeleton data, we establish MASD, a new HAR dataset with IMU, WiFi, and skeleton, collected from 20 subjects performing 27 activities. This is the first broadly applicable HAR dataset with time-synchronized data across three modalities. Experiments show that SKELAR achieves the state-of-the-art performance in both full-shot and few-shot settings. We also demonstrate that SKELAR can effectively leverage synthetic skeleton data to extend its use in scenarios without skeleton collections.

Updated: 2025-03-17 18:43:06

标题: 将基于骨架的活动表示与异构信号匹配用于人体活动识别(HAR)

摘要: 在人体活动识别(HAR)中,活动标签通常以独热(one-hot)格式编码,最近则转向使用文本表示来提供上下文知识。在这里,我们认为HAR应当以物理运动数据为基础,因为运动构成了活动的基础,并且能有效适用于各种感知系统,而文本本质上是有限的。我们提出了SKELAR,一种新颖的HAR框架,它从骨架数据中预训练活动表示,并将其与异构HAR信号进行匹配。我们的方法解决了两个主要挑战:(1)捕获核心运动知识而不依赖特定上下文细节。我们通过自监督的粗粒度角度重建任务来实现这一点,该任务恢复关节旋转角度,对用户和部署环境都保持不变;(2)使表示适配具有不同模态和侧重点的下游任务。为此,我们引入了一个自注意力匹配模块,以数据驱动的方式动态地优先考虑相关身体部位。鉴于现有骨架数据缺乏相应标签,我们建立了MASD,一个包含IMU、WiFi和骨架数据的新HAR数据集,采集自20名受试者执行的27项活动。这是第一个在三种模态间具有时间同步数据、可广泛应用的HAR数据集。实验证明,SKELAR在全样本(full-shot)和少样本(few-shot)设置下均达到最先进的性能。我们还证明了SKELAR可以有效利用合成骨架数据,将其应用扩展到没有骨架采集的场景。

更新时间: 2025-03-17 18:43:06

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.14547v1

XChainDataGen: A Cross-Chain Dataset Generation Framework

The number of blockchain interoperability protocols for transferring data and assets between blockchains has grown significantly. However, no open dataset of cross-chain transactions exists to study interoperability protocols in operation. There is also no tool to generate such datasets and make them available to the community. This paper proposes XChainDataGen, a tool to extract cross-chain data from blockchains and generate datasets of cross-chain transactions (cctxs). Using XChainDataGen, we extracted over 35 GB of data from five cross-chain protocols deployed on 11 blockchains in the last seven months of 2024, identifying 11,285,753 cctxs that moved over 28 billion USD in cross-chain token transfers. Using the data collected, we compare protocols and provide insights into their security, cost, and performance trade-offs. As examples, we highlight differences between protocols that require full finality on the source blockchain and those that only demand soft finality (\textit{security}). We compare user costs, fee models, and the impact of variables such as the Ethereum gas price on protocol fees (\textit{cost}). Finally, we produce the first analysis of the implications of EIP-7683 for cross-chain intents, which are increasingly popular and greatly improve the speed with which cctxs are processed (\textit{performance}), thereby enhancing the user experience. The availability of XChainDataGen and this dataset allows various analyses, including trends in cross-chain activity, security assessments of interoperability protocols, and financial research on decentralized finance (DeFi) protocols.

Updated: 2025-03-17 18:39:43

标题: XChainDataGen:跨链数据集生成框架

摘要: 区块链之间的数据和资产转移的互操作性协议数量大幅增长。然而,目前不存在用于研究运作中的互操作性协议的跨链交易的开放数据集。也没有工具能够生成这样的数据集并向社区提供。本文提出了XChainDataGen,这是一个从区块链中提取跨链数据并生成跨链交易数据集(cctxs)的工具。使用XChainDataGen,我们从2024年最后七个月内部署在11个区块链上的五个跨链协议中提取了超过35GB的数据,确定了11,285,753个cctxs,其跨链代币转移总额超过280亿美元。利用收集到的数据,我们比较了各个协议,并提供了关于它们的安全性、成本和性能权衡方面的见解。例如,我们强调了需要在源区块链上达到完全最终性的协议与仅要求软最终性的协议之间的差异(安全性)。我们比较了用户成本、费用模型以及以太坊燃气价格等变量对协议费用的影响(成本)。最后,我们对EIP-7683对跨链意图的影响进行了首次分析,这类意图日益流行,并大大提高了cctxs的处理速度(性能),从而增强了用户体验。XChainDataGen和这个数据集的可用性允许进行各种分析,包括跨链活动趋势、互操作性协议的安全评估,以及关于去中心化金融(DeFi)协议的金融研究。

更新时间: 2025-03-17 18:39:43

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2503.13637v1

Sable: a Performant, Efficient and Scalable Sequence Model for MARL

As multi-agent reinforcement learning (MARL) progresses towards solving larger and more complex problems, it becomes increasingly important that algorithms exhibit the key properties of (1) strong performance, (2) memory efficiency and (3) scalability. In this work, we introduce Sable, a performant, memory efficient and scalable sequence modeling approach to MARL. Sable works by adapting the retention mechanism in Retentive Networks (Sun et al., 2023) to achieve computationally efficient processing of multi-agent observations with long context memory for temporal reasoning. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state-of-the-art methods in a large number of diverse tasks (34 out of 45 tested). Furthermore, Sable maintains performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable's performance gains and confirm its efficient computational memory usage.
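A sketch of the retention recurrence that Sable adapts: a recurrent state accumulates decayed key-value outer products, giving attention-like outputs with constant memory per step. Shapes and the decay gamma are illustrative:

# One retention step: the state is a decayed running summary of past
# key-value pairs, queried like attention but in O(1) memory per step.
import torch

def retention_step(state: torch.Tensor, q: torch.Tensor, k: torch.Tensor,
                   v: torch.Tensor, gamma: float = 0.9):
    """state: (d_k, d_v); q, k: (d_k,); v: (d_v,)."""
    state = gamma * state + torch.outer(k, v)   # decayed summary of the past
    out = q @ state                             # query the summary
    return state, out

d_k, d_v, T = 8, 8, 16
state = torch.zeros(d_k, d_v)
x = torch.randn(T, d_k)                         # stand-in per-step embeddings
for t in range(T):
    state, out = retention_step(state, x[t], x[t], x[t])
print(out.shape)  # torch.Size([8])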

Updated: 2025-03-17 18:39:11

标题: Sable:一种用于多智能体强化学习的高性能、高效率和可扩展的序列模型

摘要: 随着多智能体强化学习(MARL)朝着解决更大、更复杂的问题迈进,算法具备以下关键特性变得越来越重要:(1)强大的性能,(2)内存效率,以及(3)可扩展性。在这项工作中,我们介绍了Sable,一种性能优越、内存高效且可扩展的MARL序列建模方法。Sable通过改造保留网络(Retentive Networks, Sun等人, 2023)中的保留机制,以计算高效的方式处理多智能体观测,并利用长上下文记忆进行时间推理。通过在六个不同环境中的广泛评估,我们展示了Sable在大量不同任务中(45个测试任务中的34个)显著优于现有的最先进方法。此外,随着智能体数量的增加,Sable仍能保持性能,可处理超过一千个智能体的环境,而内存使用仅呈线性增长。最后,我们进行了消融研究,以定位Sable性能提升的来源,并确认其高效的计算内存使用。

更新时间: 2025-03-17 18:39:11

领域: cs.LG,cs.AI,cs.MA

下载: http://arxiv.org/abs/2410.01706v4

Quantifying the Reliability of Predictions in Detection Transformers: Object-Level Calibration and Image-Level Uncertainty

DEtection TRansformer (DETR) has emerged as a promising architecture for object detection, offering an end-to-end prediction pipeline. In practice, however, DETR generates hundreds of predictions that far outnumber the actual number of objects present in an image. This raises the question: can we trust and use all of these predictions? Addressing this concern, we present empirical evidence highlighting how different predictions within the same image play distinct roles, resulting in varying reliability levels across those predictions. More specifically, while multiple predictions are often made for a single object, our findings show that most often one such prediction is well-calibrated, and the others are poorly calibrated. Based on these insights, we demonstrate that identifying a reliable subset of DETR's predictions is crucial for accurately assessing the reliability of the model at both object and image levels. Building on this viewpoint, we first address the shortcomings of widely used performance and calibration metrics, such as average precision and various forms of expected calibration error. Specifically, they are inadequate for determining which subset of DETR's predictions should be trusted and utilized. In response, we present Object-level Calibration Error (OCE), which assesses the calibration quality more effectively and is suitable for both ranking different models and identifying the most reliable predictions within a specific model. As a final contribution, we introduce a post hoc uncertainty quantification (UQ) framework that predicts the accuracy of the model on a per-image basis. By contrasting the average confidence scores of positive (i.e., likely to be matched) and negative predictions determined by OCE, our framework assesses the reliability of the DETR model for each test image.

Updated: 2025-03-17 18:35:23

标题: 量化检测Transformer中预测的可靠性:对象级校准和图像级不确定性

摘要: 检测Transformer(DETR)已成为一种有前途的目标检测架构,提供了端到端的预测流水线。然而,在实践中,DETR会生成数百个预测,远远超过图像中实际对象的数量。这引发了一个问题:我们能相信和使用所有这些预测吗?针对这一问题,我们提出了实证证据,突显了同一图像中不同预测发挥不同作用,导致这些预测的可靠性水平不同。更具体地说,虽然通常为单个对象进行多次预测,但我们的研究结果显示,大多数情况下只有一个这样的预测是良好校准的,而其他预测则校准不良。基于这些见解,我们展示了识别DETR预测的可靠子集对于准确评估模型在对象和图像级别的可靠性至关重要。 基于这一观点,我们首先指出广泛使用的性能和校准指标(如平均精度和各种形式的期望校准误差)的缺点。具体来说,它们不足以确定DETR的哪些预测子集应该被信任和利用。作为回应,我们提出了对象级校准误差(OCE),它能更有效地评估校准质量,既适用于对不同模型进行排名,也适用于识别特定模型中最可靠的预测。作为最后的贡献,我们介绍了一种事后不确定性量化(UQ)框架,按图像预测模型的准确性。通过对比由OCE确定的正预测(即可能被匹配的预测)与负预测的平均置信度得分,我们的框架评估DETR模型在每张测试图像上的可靠性。

更新时间: 2025-03-17 18:35:23

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2412.01782v2

ArtFormer: Controllable Generation of Diverse 3D Articulated Objects

This paper presents a novel framework for modeling and conditional generation of 3D articulated objects. Troubled by flexibility-quality tradeoffs, existing methods are often limited to using predefined structures or retrieving shapes from static datasets. To address these challenges, we parameterize an articulated object as a tree of tokens and employ a transformer to generate both the object's high-level geometry code and its kinematic relations. Subsequently, each sub-part's geometry is further decoded using a signed-distance-function (SDF) shape prior, facilitating the synthesis of high-quality 3D shapes. Our approach enables the generation of diverse objects with high-quality geometry and varying number of parts. Comprehensive experiments on conditional generation from text descriptions demonstrate the effectiveness and flexibility of our method.

Updated: 2025-03-17 18:22:54

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2412.07237v2

MusicLIME: Explainable Multimodal Music Understanding

Multimodal models are critical for music understanding tasks, as they capture the complex interplay between audio and lyrics. However, as these models become more prevalent, the need for explainability grows: understanding how these systems make decisions is vital for ensuring fairness, reducing bias, and fostering trust. In this paper, we introduce MusicLIME, a model-agnostic feature importance explanation method designed for multimodal music models. Unlike traditional unimodal methods, which analyze each modality separately without considering the interaction between them and often produce incomplete or misleading explanations, MusicLIME reveals how audio and lyrical features interact and contribute to predictions, providing a holistic view of the model's decision-making. Additionally, we enhance local explanations by aggregating them into global explanations, giving users a broader perspective of model behavior. Through this work, we contribute to improving the interpretability of multimodal music models, empowering users to make informed choices, and fostering more equitable, fair, and transparent music understanding systems.
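
For orientation, a LIME-style multimodal attribution can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the MusicLIME implementation: `predict_fn` is a hypothetical wrapper that maps a joint binary keep/ablate mask over audio features and lyric words to the model's class probability.

```python
import numpy as np
from sklearn.linear_model import Ridge

def multimodal_lime(predict_fn, n_audio, n_lyrics, n_samples=1000, seed=0):
    """Fit a weighted linear surrogate over *joint* audio+lyrics perturbations,
    so cross-modal interactions are reflected in the importance weights."""
    rng = np.random.default_rng(seed)
    d = n_audio + n_lyrics
    masks = rng.integers(0, 2, size=(n_samples, d))    # 1 = keep, 0 = ablate
    masks[0] = 1                                       # include the unperturbed input
    preds = np.array([predict_fn(m) for m in masks])
    weights = np.exp(-(d - masks.sum(axis=1)) / d)     # favor near-original samples
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return surrogate.coef_[:n_audio], surrogate.coef_[n_audio:]  # per-modality importances
```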

Updated: 2025-03-17 18:21:48

Categories: cs.SD,cs.AI,cs.LG,eess.AS

Download: http://arxiv.org/abs/2409.10496v5

A Convex formulation for linear discriminant analysis

We present a supervised dimensionality reduction technique called Convex Linear Discriminant Analysis (ConvexLDA). The proposed model optimizes a multi-objective cost function by balancing two complementary terms. The first term pulls the samples of a class towards its centroid by minimizing a sample's distance from its class-centroid in low dimensional space. The second term pushes the classes far apart by maximizing their hyperellipsoid scattering volume via the logarithm of the determinant (\textit{log det}) of the outer product matrix formed by the low-dimensional class-centroids. Using the negative of the \textit{log det}, we pose the final cost as a minimization problem, which balances the two terms using a hyper-parameter $\lambda$. We demonstrate that the cost function is convex. Unlike Fisher LDA, the proposed method does not require computing the inverse of a matrix, thereby avoiding the ill-conditioning that arises when the data dimension is very high, e.g., in RNA-seq data. ConvexLDA does not require pair-wise distance calculations, making it faster and more easily scalable. Moreover, the convex nature of the cost function ensures global optimality, enhancing the reliability of the learned embedding. Our experimental evaluation demonstrates that ConvexLDA outperforms several popular linear discriminant analysis (LDA)-based methods on a range of high-dimensional data sets, including biological and image data.
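
Read literally, the two-term objective can be sketched as below (my reading of the abstract, in PyTorch; the authors' exact parameterization may differ): the first term pulls embeddings toward their class centroids, the second is the negative log det of the centroid Gram matrix, weighted by $\lambda$.

```python
import torch

def convexlda_loss(Z, y, lam=0.1, eps=1e-6):
    """Z: (n, p) low-dimensional embeddings; y: (n,) integer class labels."""
    classes = torch.unique(y)
    centroids = torch.stack([Z[y == c].mean(dim=0) for c in classes])   # (k, p)
    pull = sum(((Z[y == c] - centroids[i]) ** 2).sum()                  # within-class pull
               for i, c in enumerate(classes))
    gram = centroids @ centroids.T                                      # (k, k)
    eye = torch.eye(len(classes), dtype=Z.dtype, device=Z.device)
    push = -torch.logdet(gram + eps * eye)                              # maximize scatter volume
    return pull + lam * push
```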

Updated: 2025-03-17 18:17:49

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.13623v1

Superalignment with Dynamic Human Values

Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate for the need to measure this generalization and propose ways to improve it in the future.

Updated: 2025-03-17 18:15:17

Categories: cs.AI

Download: http://arxiv.org/abs/2503.13621v1

Program Synthesis Dialog Agents for Interactive Decision-Making

Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on user features. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, highlighting the need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is crucial that these agents ask the right questions. As agents determine when to terminate a conversation, they face a trade-off between accuracy and the number of questions asked, a key metric for both user experience and cost. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. To address this, we introduce ProADA, a novel approach that leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 55.6 while maintaining nearly the same number of dialog turns.

Updated: 2025-03-17 18:13:03

Categories: cs.AI

Download: http://arxiv.org/abs/2502.19610v2

Riemannian Laplace Approximation with the Fisher Metric

Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.
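
For reference, the classical construction these Riemannian variants generalize is the Gaussian fit at the posterior mode:

```latex
\hat{\theta} = \arg\max_{\theta} \log p(\theta \mid \mathcal{D}), \qquad
q(\theta) = \mathcal{N}\!\left(\hat{\theta},\, H^{-1}\right), \qquad
H = -\nabla^{2} \log p(\theta \mid \mathcal{D})\big|_{\theta=\hat{\theta}}.
```

The generalization transports this Gaussian along a chosen Riemannian metric; the paper's analysis concerns which metric (here, Fisher-based choices) keeps the approximation exact in the infinite-data limit.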

Updated: 2025-03-17 18:02:56

Categories: cs.LG,stat.ME,stat.ML

Download: http://arxiv.org/abs/2311.02766v7

Scalable and Interpretable Verification of Image-based Neural Network Controllers for Autonomous Vehicles

Existing formal verification methods for image-based neural network controllers in autonomous vehicles often struggle with high-dimensional inputs, computational inefficiency, and a lack of explainability. These challenges make it difficult to ensure safety and reliability, as processing high-dimensional image data is computationally intensive and neural networks are typically treated as black boxes. To address these issues, we propose SEVIN (Scalable and Explainable Verification of Image-Based Neural Network Controllers), a framework that leverages a variational autoencoder (VAE) to encode high-dimensional images into a lower-dimensional, explainable latent space. By annotating latent variables with corresponding control actions, we generate convex polytopes that serve as structured input spaces for verification, significantly reducing computational complexity and enhancing scalability. Integrating the VAE's decoder with the neural network controller allows for formal and robustness verification using these explainable polytopes. Our approach also incorporates robustness verification under real-world perturbations by augmenting the dataset and retraining the VAE to capture environmental variations. Experimental results demonstrate that SEVIN achieves efficient and scalable verification while providing explainable insights into controller behavior, bridging the gap between formal verification techniques and practical applications in safety-critical systems.

Updated: 2025-03-17 18:01:53

Categories: cs.LG,cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2501.14009v2

MetaScale: Test-Time Scaling with Evolving Meta-Thoughts

One critical challenge for large language models (LLMs) in complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy to solve a given task. Existing approaches impose fixed cognitive structures that enhance performance in specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce METASCALE, a test-time scaling framework based on meta-thoughts -- adaptive thinking strategies tailored to each task. METASCALE initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, METASCALE improves both accuracy and generalization across a wide range of tasks. Experimental results demonstrate that METASCALE consistently outperforms standard inference approaches, achieving an 11% performance gain in win rate on Arena-Hard for GPT-4o, surpassing o1-mini by 0.9% under style control. Notably, METASCALE scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
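
The selection step can be pictured with a standard UCB rule over the meta-thought pool; a toy sketch follows, with illustrative names (the learned reward model and the genetic refinement of the pool are not reproduced):

```python
import math

def ucb_select(stats, c=1.0):
    """stats: {meta_thought: (total_reward, n_pulls)}. Returns the meta-thought
    maximizing mean reward plus an upper-confidence exploration bonus."""
    total_pulls = sum(n for _, n in stats.values())
    def score(item):
        reward, n = item[1]
        if n == 0:
            return float("inf")                        # try every candidate once
        return reward / n + c * math.sqrt(math.log(total_pulls) / n)
    return max(stats.items(), key=score)[0]
```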

Updated: 2025-03-17 17:59:54

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13447v1

Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better, and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance

As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.

Updated: 2025-03-17 17:59:39

Categories: cs.CL,cs.AI,I.2.7

Download: http://arxiv.org/abs/2503.13445v1

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning, especially for videos, remains largely unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 on grounded video question-answering, 6 on video temporal grounding, and 5 on general video question-answering, underscoring its effectiveness in advancing video agent and long-form temporal reasoning.

Updated: 2025-03-17 17:59:33

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.13444v1

Humanoid Policy ~ Human Policy

Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency. Code and data: https://human-as-robot.github.io/

Updated: 2025-03-17 17:59:09

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.13441v1

Deep Belief Markov Models for POMDP Inference

This work introduces a novel deep learning-based architecture, termed the Deep Belief Markov Model (DBMM), which provides efficient, model-formulation agnostic inference in Partially Observable Markov Decision Process (POMDP) problems. The POMDP framework allows for modeling and solving sequential decision-making problems under observation uncertainty. In complex, high-dimensional, partially observable environments, existing methods for inference based on exact computations (e.g., via Bayes' theorem) or sampling algorithms do not scale well. Furthermore, ground truth states may not be available for learning the exact transition dynamics. DBMMs extend deep Markov models into the partially observable decision-making framework and allow efficient belief inference entirely based on available observation data via variational inference methods. By leveraging the potency of neural networks, DBMMs can infer and simulate non-linear relationships in the system dynamics and naturally scale to problems with high dimensionality and discrete or continuous variables. In addition, neural network parameters can be dynamically updated efficiently based on data availability. DBMMs can thus be used to infer a belief variable, thus enabling the derivation of POMDP solutions over the belief space. We evaluate the efficacy of the proposed methodology by assessing the model-formulation agnostic inference capability of DBMMs on benchmark problems that include discrete and continuous variables.

Updated: 2025-03-17 17:58:45

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.13438v1

Unified Autoregressive Visual Generation and Understanding with Continuous Tokens

We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important to achieve high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, demonstrating strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.

Updated: 2025-03-17 17:58:30

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.13436v1

Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark

With the rapid advancement of large language models (LLMs), extensive research has been conducted to investigate the code generation capabilities of LLMs. However, existing efforts primarily focus on general-domain tasks, leaving LLMs' code generation performance in real-world application domains underexplored. This raises a critical question: can a model's general-domain coding ability reliably represent its ability in specialized domains? In this paper, we introduce DomainCodeBench, a multi-domain code generation benchmark designed to systematically evaluate LLMs across 12 software application domains and 15 programming languages. DomainCodeBench contains 2,400 manually verified tasks with ground truth, human-annotated docstrings, and fine-grained dependency information to ensure broad coverage of domain-specific challenges. Specifically, we first identify the most popular application domains by topic mining. Then, we curate coding tasks based on commonly used frameworks and platforms in each domain. We obtain several findings through extensive experiments on DomainCodeBench with ten mainstream LLMs. (1) Performance decoupling: experiments reveal that top general-domain models do not consistently excel in specific application domains; (2) Domain-specific weaknesses: LLMs often fail due to domain knowledge gaps and third-party library misusage; (3) Contextual enhancement: we show that augmenting prompts with domain-specific knowledge improves performance by around 38.17%, providing actionable insights for performance optimization. Our replication package, including the benchmark, source code, and experimental results, is available at https://github.com/DeepSoftwareAnalytics/DomainCodeBench.

Updated: 2025-03-17 17:58:13

Categories: cs.SE,cs.AI,cs.CL

Download: http://arxiv.org/abs/2412.18573v2

Population Transformer: Learning Population-level Representations of Neural Activity

We present a self-supervised framework that learns population-level codes for arbitrary ensembles of neural recordings at scale. We address key challenges in scaling models with neural time-series data, namely, sparse and variable electrode distribution across subjects and datasets. The Population Transformer (PopT) stacks on top of pretrained temporal embeddings and enhances downstream decoding by enabling learned aggregation of multiple spatially-sparse data channels. The pretrained PopT lowers the amount of data required for downstream decoding experiments, while increasing accuracy, even on held-out subjects and tasks. Compared to end-to-end methods, this approach is computationally lightweight, while achieving similar or better decoding performance. We further show how our framework is generalizable to multiple time-series embeddings and neural data modalities. Beyond decoding, we interpret the pretrained and fine-tuned PopT models to show how they can be used to extract neuroscience insights from large amounts of data. We release our code as well as a pretrained PopT to enable off-the-shelf improvements in multi-channel intracranial data decoding and interpretability. Code is available at https://github.com/czlwang/PopulationTransformer.

Updated: 2025-03-17 17:58:10

Categories: cs.LG,q-bio.NC

Download: http://arxiv.org/abs/2406.03044v3

BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: https://liyaowei-stu.github.io/project/BlobCtrl/

Updated: 2025-03-17 17:58:05

Categories: cs.CV,cs.AI,cs.MM

Download: http://arxiv.org/abs/2503.13434v1

Uncovering Utility Functions from Observed Outcomes

Determining consumer preferences and utility is a foundational challenge in economics. They are central in determining consumer behaviour through the utility-maximising consumer decision-making process. However, preferences and utilities are not observable and may not even be known to the individual making the choice; only the outcome is observed in the form of demand. Without the ability to observe the decision-making mechanism, demand estimation becomes a challenging task and current methods fall short due to a lack of scalability or an inability to identify causal effects. Estimating these effects is critical when considering changes in policy, such as pricing, the impact of taxes and subsidies, and the effect of a tariff. To address the shortcomings of existing methods, we combine revealed preference theory and inverse reinforcement learning to present a novel algorithm, Preference Extraction and Reward Learning (PEARL) which, to the best of our knowledge, is the only algorithm that can uncover a representation of the utility function that best rationalises observed consumer choice data given a specified functional form. We introduce a flexible utility function, the Input-Concave Neural Network, which captures complex relationships across goods, including cross-price elasticities. Results show PEARL outperforms the benchmark on both noise-free and noisy synthetic data.
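
The abstract does not specify the Input-Concave Neural Network in detail, so the sketch below is an assumption: it mirrors the standard input-convex construction with signs flipped, using nonnegative weights on the hidden path and concave, nondecreasing activations so the output (a utility) is concave in the consumption bundle $x$.

```python
import torch.nn as nn
import torch.nn.functional as F

class InputConcaveNet(nn.Module):
    """Sketch of an input-concave network: nonnegative weights on the hidden
    path plus concave, nondecreasing activations keep the output concave in x.
    Architecture details are illustrative, not the authors' specification."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.Wx0 = nn.Linear(d_in, d_hidden)
        self.Wz = nn.Linear(d_hidden, d_hidden, bias=False)  # weights clamped >= 0 in forward
        self.Wx1 = nn.Linear(d_in, d_hidden)
        self.out = nn.Linear(d_hidden, 1, bias=False)        # weights clamped >= 0 in forward

    def forward(self, x):
        act = lambda t: -F.softplus(-t)                      # concave and nondecreasing
        z = act(self.Wx0(x))
        z = act(F.linear(z, self.Wz.weight.clamp(min=0)) + self.Wx1(x))
        return F.linear(z, self.out.weight.clamp(min=0))
```

Concavity here plays the role of diminishing marginal utility, which is why such a parameterization is a natural fit for demand modeling.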

Updated: 2025-03-17 17:56:59

Categories: cs.LG

Download: http://arxiv.org/abs/2503.13432v1

Measuring In-Context Computation Complexity via Hidden State Prediction

Detecting when a neural sequence model does "interesting" computation is an open problem. The next token prediction loss is a poor indicator: Low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric -- in contrast to the next token prediction loss -- correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.
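
A minimal version of such a layer might look as follows; this is a sketch of the idea only (the paper's PHi layer additionally uses an information bottleneck and a learned predictive prior, which are omitted here):

```python
import torch.nn as nn

class PHiLayer(nn.Module):
    """Sketch of a 'prediction of hidden states' layer: a learned predictor
    forecasts the next hidden state on the main pathway, and the prediction
    error serves as the per-step interestingness signal."""

    def __init__(self, d_model):
        super().__init__()
        self.predictor = nn.Linear(d_model, d_model)

    def forward(self, h):                     # h: (batch, seq, d_model)
        pred = self.predictor(h[:, :-1])      # predict h_{t+1} from h_t
        err = ((pred - h[:, 1:].detach()) ** 2).mean(dim=-1)  # per-step surprise
        return h, err                         # hidden states pass through unchanged
```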

Updated: 2025-03-17 17:56:14

Categories: cs.LG,I.2.6

Download: http://arxiv.org/abs/2503.13431v1

AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction

Autonomous driving requires an understanding of the infrastructure elements, such as lanes and crosswalks. To navigate safely, this understanding must be derived from sensor data in real-time and needs to be represented in vectorized form. Learned Bird's-Eye View (BEV) encoders are commonly used to combine a set of camera images from multiple views into one joint latent BEV grid. Traditionally, from this latent space, an intermediate raster map is predicted, providing dense spatial supervision but requiring post-processing into the desired vectorized form. More recent models directly derive infrastructure elements as polylines using vectorized map decoders, providing instance-level information. Our approach, Augmentation Map Network (AugMapNet), proposes latent BEV grid augmentation, a novel technique that significantly enhances the latent BEV representation. AugMapNet combines vector decoding and dense spatial supervision more effectively than existing architectures while remaining as straightforward to integrate and as generic as auxiliary supervision. Experiments on nuScenes and Argoverse2 datasets demonstrate significant improvements in vectorized map prediction performance up to 13.3% over the StreamMapNet baseline on 60m range and greater improvements on larger ranges. We confirm transferability by applying our method to another baseline and find similar improvements. A detailed analysis of the latent BEV grid confirms a more structured latent space of AugMapNet and shows the value of our novel concept beyond pure performance improvement. The code will be released soon.

Updated: 2025-03-17 17:55:32

Categories: cs.CV,cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.13430v1

xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.

Updated: 2025-03-17 17:54:55

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.13427v1

Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis

Recent image generation schemes typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g., rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g., gFID). In this paper, we comprehensively analyze the reason for the discrepancy between reconstruction and generation qualities in a discrete latent space, and, based on this analysis, we propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noise, i.e., the unexpected tokens sampled during the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a $\sim$400M generator. Code: https://github.com/lxa9867/ImageFolder.
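
The latent perturbation itself can be approximated very simply; the sketch below is my simplification of the idea (replace a fraction of the tokenizer's discrete indices with random codebook entries, so the decoder learns to survive the generator's sampling errors):

```python
import torch

def perturb_latents(token_ids, codebook_size, p=0.1):
    """token_ids: integer tensor of quantized indices. With probability p per
    position, substitute a random codebook entry, mimicking unexpected tokens
    sampled at generation time."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < p
    noise = torch.randint_like(token_ids, codebook_size)
    return torch.where(mask, noise, token_ids)
```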

Updated: 2025-03-17 17:54:40

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.08354v2

SuperBPE: Space Travel for Language Models

The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
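
A toy, self-contained version of the two-stage curriculum is sketched below; the real SuperBPE training differs in data structures and scale, but the control flow is the point: merges are confined to whitespace-delimited chunks until a transition point, after which they may bridge whitespace.

```python
from collections import Counter

def bpe_with_curriculum(text, n_merges, transition):
    """Toy sketch of a SuperBPE-style curriculum: stage 1 learns subwords
    within whitespace pretokens; stage 2 merges over the full character
    stream (spaces included), so 'superwords' can form."""
    def merge_pass(seqs):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            return seqs, None
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        out = []
        for s in seqs:
            t, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) == best:
                    t.append(s[i] + s[i + 1]); i += 2
                else:
                    t.append(s[i]); i += 1
            out.append(t)
        return out, best

    seqs = [list(w) for w in text.split()]        # stage 1: whitespace pretokens
    merges = []
    for step in range(n_merges):
        if step == transition:                    # stage 2: one stream, spaces kept
            seqs = [[tok for s in seqs for tok in s + [" "]]]
        seqs, m = merge_pass(seqs)
        if m is None:
            break
        merges.append(m)
    return merges
```

In stage 2 the space character is just another token, so frequent multi-word expressions can eventually merge into single vocabulary items.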

Updated: 2025-03-17 17:53:23

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.13423v1

Securing Virtual Reality Experiences: Unveiling and Tackling Cybersickness Attacks with Explainable AI

The synergy between virtual reality (VR) and artificial intelligence (AI), specifically deep learning (DL)-based cybersickness detection models, has ushered in unprecedented advancements in immersive experiences by automatically detecting cybersickness severity and adaptively applying various mitigation techniques, offering a smooth and comfortable VR experience. While this DL-enabled cybersickness detection method provides promising solutions for enhancing user experiences, it also introduces new risks since these models are vulnerable to adversarial attacks; a small perturbation of the input data that is visually undetectable to human observers can fool the cybersickness detection model and trigger unexpected mitigation, thus disrupting user immersive experiences (UIX) and even posing safety risks. In this paper, we present a new type of VR attack, i.e., a cybersickness attack, which successfully stops the triggering of cybersickness mitigation by fooling DL-based cybersickness detection models and dramatically hinders the UIX. Next, we propose a novel explainable artificial intelligence (XAI)-guided cybersickness attack detection framework to detect such attacks in VR to ensure UIX and a comfortable VR experience. We evaluate the proposed attack and the detection framework using two state-of-the-art open-source VR cybersickness datasets: Simulation 2021 and Gameplay dataset. Finally, to verify the effectiveness of our proposed method, we implement the attack and the XAI-based detection using a testbed with a custom-built VR roller coaster simulation with an HTC Vive Pro Eye headset and perform a user study. Our study shows that such an attack can dramatically hinder the UIX. However, our proposed XAI-guided cybersickness attack detection can successfully detect cybersickness attacks and trigger the proper mitigation, effectively reducing VR cybersickness.
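
The attack surface is the familiar one from adversarial examples; a generic targeted FGSM step of the kind the abstract alludes to is sketched below (the paper's actual attack construction may differ):

```python
import torch

def fgsm_targeted(model, x, target, loss_fn, eps=0.01):
    """Nudge the input by a small, visually imperceptible step that drives
    the detector toward the attacker's target label (e.g., 'no sickness')."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), target)
    loss.backward()
    return (x - eps * x.grad.sign()).detach()   # descend the loss toward the target class
```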

Updated: 2025-03-17 17:49:51

Categories: cs.CR,cs.AI,cs.ET,cs.HC

Download: http://arxiv.org/abs/2503.13419v1

FLEX: A Framework for Learning Robot-Agnostic Force-based Skills Involving Sustained Contact Object Manipulation

Learning to manipulate objects efficiently, particularly those involving sustained contact (e.g., pushing, sliding) and articulated parts (e.g., drawers, doors), presents significant challenges. Traditional methods, such as robot-centric reinforcement learning (RL), imitation learning, and hybrid techniques, require massive training and often struggle to generalize across different objects and robot platforms. We propose a novel framework for learning object-centric manipulation policies in force space, decoupling the robot from the object. By directly applying forces to selected regions of the object, our method simplifies the action space, reduces unnecessary exploration, and decreases simulation overhead. This approach, trained in simulation on a small set of representative objects, captures object dynamics -- such as joint configurations -- allowing policies to generalize effectively to new, unseen objects. Decoupling these policies from robot-specific dynamics enables direct transfer to different robotic platforms (e.g., Kinova, Panda, UR5) without retraining. Our evaluations demonstrate that the method significantly outperforms baselines, achieving over an order of magnitude improvement in training efficiency compared to other state-of-the-art methods. Additionally, operating in force space enhances policy transferability across diverse robot platforms and object types. We further showcase the applicability of our method in a real-world robotic setting. For supplementary materials and videos, please visit: https://tufts-ai-robotics-group.github.io/FLEX/

Updated: 2025-03-17 17:49:47

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2503.13418v1

Nyström $M$-Hilbert-Schmidt Independence Criterion

Kernel techniques are among the most popular and powerful approaches of data science. Among the key features that make kernels ubiquitous are (i) the number of domains they have been designed for, (ii) the Hilbert structure of the function class associated to kernels facilitating their statistical analysis, and (iii) their ability to represent probability distributions without loss of information. These properties give rise to the immense success of the Hilbert-Schmidt independence criterion (HSIC), which is able to capture joint independence of random variables under mild conditions, and permits closed-form estimators with quadratic computational complexity (w.r.t. the sample size). In order to alleviate the quadratic computational bottleneck in large-scale applications, multiple HSIC approximations have been proposed; however, these estimators are restricted to $M=2$ random variables, do not extend naturally to the $M\ge 2$ case, and lack theoretical guarantees. In this work, we propose an alternative Nyström-based HSIC estimator which handles the $M\ge 2$ case, prove its consistency, and demonstrate its applicability in multiple contexts, including synthetic examples, dependency testing of media annotations, and causal discovery.
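
For orientation, the standard quadratic-time (biased) HSIC estimator for $M=2$ that the paper accelerates looks like this; the Nyström and $M\ge 2$ extensions are the paper's contribution and are not reproduced here:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gaussian (RBF) Gram matrix of the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic_biased(X, Y, sigma=1.0):
    """Classical biased HSIC estimate: trace(K H L H) / (n-1)^2,
    where H is the centering matrix. Quadratic in the sample size n."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(X, sigma), rbf_gram(Y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```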

Updated: 2025-03-17 17:48:40

Categories: stat.ML,cs.IT,cs.LG,math.IT,46E22, 94A17,I.2.6; H.1.1

Download: http://arxiv.org/abs/2302.09930v3

A Comprehensive Survey on Multi-Agent Cooperative Decision-Making: Scenarios, Approaches, Challenges and Perspectives

With the rapid development of artificial intelligence, intelligent decision-making techniques have gradually surpassed human levels in various human-machine competitions, especially in complex multi-agent cooperative task scenarios. Multi-agent cooperative decision-making involves multiple agents working together to complete established tasks and achieve specific objectives. These techniques are widely applicable in real-world scenarios such as autonomous driving, drone navigation, disaster rescue, and simulated military confrontations. This paper begins with a comprehensive survey of the leading simulation environments and platforms used for multi-agent cooperative decision-making. Specifically, we provide an in-depth analysis for these simulation environments from various perspectives, including task formats, reward allocation, and the underlying technologies employed. Subsequently, we provide a comprehensive overview of the mainstream intelligent decision-making approaches, algorithms and models for multi-agent systems (MAS). These approaches can be broadly categorized into five types: rule-based (primarily fuzzy logic), game theory-based, evolutionary algorithms-based, deep multi-agent reinforcement learning (MARL)-based, and large language models (LLMs) reasoning-based. Given the significant advantages of MARL- and LLMs-based decision-making methods over traditional rule-based, game-theoretic, and evolutionary approaches, this paper focuses on multi-agent methods utilizing MARL and LLMs-based techniques. We provide an in-depth discussion of these approaches, highlighting their methodology taxonomies, advantages, and drawbacks. Further, several prominent future research directions and open challenges of multi-agent cooperative decision-making are also detailed.

Updated: 2025-03-17 17:45:46

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2503.13415v1

The Impact of Artificial Intelligence on Emergency Medicine: A Review of Recent Advances

Artificial Intelligence (AI) is revolutionizing emergency medicine by enhancing diagnostic processes and improving patient outcomes. This article provides a review of the current applications of AI in emergency imaging studies, focusing on the last five years of advancements. AI technologies, particularly machine learning and deep learning, are pivotal in interpreting complex imaging data, offering rapid, accurate diagnoses and potentially surpassing traditional diagnostic methods. Studies highlighted within the article demonstrate AI's capabilities in accurately detecting conditions such as fractures, pneumothorax, and pulmonary diseases from various imaging modalities including X-rays, CT scans, and MRIs. Furthermore, AI's ability to predict clinical outcomes like mechanical ventilation needs illustrates its potential in crisis resource optimization. Despite these advancements, the integration of AI into clinical practice presents challenges such as data privacy, algorithmic bias, and the need for extensive validation across diverse settings. This review underscores the transformative potential of AI in emergency settings, advocating for a future where AI and clinical expertise synergize to elevate patient care standards.

Updated: 2025-03-17 17:45:00

Categories: eess.IV,cs.AI,cs.CV,cs.LG,68T07

Download: http://arxiv.org/abs/2503.14546v1

Reward Adaptation Via Q-Manipulation

In this paper, we propose a new solution to reward adaptation (RA), the problem where the learning agent adapts to a target reward function based on one or multiple existing behaviors learned a priori under the same domain dynamics but different reward functions. Learning the target behavior from scratch is possible but often inefficient compared to leveraging the available source behaviors. Our work represents a new approach to RA via the manipulation of Q-functions. Assuming that the target reward function is a known function of the source reward functions, our approach to RA computes bounds of the Q function. We introduce an iterative process to tighten the bounds, similar to value iteration. This enables action pruning in the target domain before learning even starts. We refer to such a method as Q-Manipulation (Q-M). We formally prove that our pruning strategy does not affect the optimality of the returned policy, while empirically showing that it improves sample efficiency. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
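
The pruning criterion implied by the bounds is simple to state; the sketch below assumes lower/upper bound arrays on the target Q-function have already been computed (the iterative, value-iteration-like tightening that produces them is not shown):

```python
import numpy as np

def prune_actions(q_lo, q_hi):
    """q_lo, q_hi: (n_states, n_actions) lower/upper bounds on the target Q.
    An action can be pruned in a state when its upper bound falls below the
    best lower bound there, since it can then never be optimal."""
    best_lower = q_lo.max(axis=1, keepdims=True)
    keep = q_hi >= best_lower          # True where the action might still be optimal
    return keep
```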

Updated: 2025-03-17 17:42:54

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.13414v1

TSCMamba: Mamba Meets Multi-View Learning for Time Series Classification

Multivariate time series classification (TSC) is critical for various applications in fields such as healthcare and finance. While various approaches for TSC have been explored, important properties of time series, such as shift equivariance and inversion invariance, are largely underexplored by existing works. To fill this gap, we propose a novel multi-view approach to capture patterns with properties like shift equivariance. Our method integrates diverse features, including spectral, temporal, local, and global features, to obtain rich, complementary contexts for TSC. We use continuous wavelet transform to capture time-frequency features that remain consistent even when the input is shifted in time. These features are fused with temporal convolutional or multilayer perceptron features to provide complex local and global contextual information. We utilize the Mamba state space model for efficient and scalable sequence modeling and capturing long-range dependencies in time series. Moreover, we introduce a new scanning scheme for Mamba, called tango scanning, to effectively model sequence relationships and leverage inversion invariance, thereby enhancing our model's generalization and robustness. Experiments on two sets of benchmark datasets (10+20 datasets) demonstrate our approach's effectiveness, achieving average accuracy improvements of 4.01-6.45\% and 7.93\% respectively, over leading TSC models such as TimesNet and TSLANet.
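
The shift-robust time-frequency front end can be sketched with PyWavelets; this is a minimal single-channel illustration (fusion with the temporal branch, the Mamba backbone, and tango scanning are not shown):

```python
import numpy as np
import pywt

def cwt_features(x, scales=None, wavelet="morl"):
    """Continuous wavelet transform of one channel: a time shift of x shifts
    the scalogram along the time axis without changing its content, which is
    the shift-equivariance property discussed above."""
    if scales is None:
        scales = np.arange(1, 33)
    coeffs, _ = pywt.cwt(x, scales, wavelet)   # (n_scales, n_timesteps)
    return np.abs(coeffs)
```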

Updated: 2025-03-17 17:40:41

Domains: cs.LG

Download: http://arxiv.org/abs/2406.04419v2

Fed-Joint: Joint Modeling of Nonlinear Degradation Signals and Failure Events for Remaining Useful Life Prediction using Federated Learning

Many failure mechanisms of machinery are closely related to the behavior of condition monitoring (CM) signals. To achieve a cost-effective preventive maintenance strategy, accurate remaining useful life (RUL) prediction based on the signals is of paramount importance. However, the CM signals are often recorded at different factories and production lines, with limited amounts of data. Unfortunately, these datasets have rarely been shared between the sites due to data confidentiality and ownership issues, a lack of computing and storage power, and high communication costs associated with data transfer between sites and a data center. Another challenge in real applications is that the CM signals are often not explicitly specified \textit{a priori}, meaning that existing methods, which usually assume a parametric form, may not be applicable. To address these challenges, we propose a new prognostic framework for RUL prediction using the joint modeling of nonlinear degradation signals and time-to-failure data within a federated learning scheme. The proposed method constructs a nonparametric degradation model using a federated multi-output Gaussian process and then employs a federated survival model to predict failure times and probabilities for in-service machinery. The superiority of the proposed method over other alternatives is demonstrated through comprehensive simulation studies and a case study using turbofan engine degradation signal data that include run-to-failure events.

Updated: 2025-03-17 17:34:34

Domains: cs.AI,cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.13404v1

Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis

Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr's three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.

Updated: 2025-03-17 17:33:54

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.13401v1

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot' updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53\%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and project page at https://jmhb0.github.io/microvqa.

Updated: 2025-03-17 17:33:10

Domains: cs.CV,cs.AI,cs.CL,cs.LG,q-bio.CB

Download: http://arxiv.org/abs/2503.13399v1

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning--to the extent that, in comparison to MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, etc. The root causes behind these outward manifestations and the precise mechanisms that govern them remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures--grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer's Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so further highlight the important structural differences to the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.

Updated: 2025-03-17 17:32:06

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2410.10986v2

Challenges and recommendations for Electronic Health Records data extraction and preparation for dynamic prediction modelling in hospitalized patients -- a practical guide

Dynamic predictive modelling using electronic health record (EHR) data has gained significant attention in recent years. The reliability and trustworthiness of such models depend heavily on the quality of the underlying data, which is, in part, determined by the stages preceding the model development: data extraction from EHR systems and data preparation. In this article, we identify over forty challenges encountered during these stages and provide actionable recommendations for addressing them. These challenges are organized into four categories: cohort definition, outcome definition, feature engineering, and data cleaning. This comprehensive list serves as a practical guide for data extraction engineers and researchers, promoting best practices and improving the quality and real-world applicability of dynamic prediction models in clinical settings.

Updated: 2025-03-17 17:29:33

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2501.10240v2

PANDORA: Diffusion Policy Learning for Dexterous Robotic Piano Playing

We present PANDORA, a novel diffusion-based policy learning framework designed specifically for dexterous robotic piano performance. Our approach employs a conditional U-Net architecture enhanced with FiLM-based global conditioning, which iteratively denoises noisy action sequences into smooth, high-dimensional trajectories. To achieve precise key execution coupled with expressive musical performance, we design a composite reward function that integrates task-specific accuracy, audio fidelity, and high-level semantic feedback from a large language model (LLM) oracle. The LLM oracle assesses musical expressiveness and stylistic nuances, enabling dynamic, hand-specific reward adjustments. Further augmented by a residual inverse-kinematics refinement policy, PANDORA achieves state-of-the-art performance in the ROBOPIANIST environment, significantly outperforming baselines in both precision and expressiveness. Ablation studies validate the critical contributions of diffusion-based denoising and LLM-driven semantic feedback in enhancing robotic musicianship. Videos available at: https://taco-group.github.io/PANDORA
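
For readers unfamiliar with FiLM-based global conditioning, a minimal sketch follows; the layer sizes and the conditioning source are illustrative assumptions, not PANDORA's actual U-Net block.

    import torch
    import torch.nn as nn

    class FiLM(nn.Module):
        # Feature-wise linear modulation: scale and shift each channel of
        # a feature map using a global conditioning vector (here imagined
        # as an embedding of the target musical phrase).
        def __init__(self, cond_dim, num_channels):
            super().__init__()
            self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

        def forward(self, h, cond):
            # h: (B, C, T) activations, cond: (B, cond_dim)
            gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
            return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)

    film = FiLM(cond_dim=128, num_channels=64)
    out = film(torch.randn(8, 64, 100), torch.randn(8, 128))  # (8, 64, 100)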

Updated: 2025-03-17 17:22:34

Domains: cs.LG,cs.RO,cs.SD,eess.AS

Download: http://arxiv.org/abs/2503.14545v1

Investigating the effect of CPT in lateral spreading prediction using Explainable AI

This study proposes an autoencoder approach to extract latent features from cone penetration test profiles to evaluate the potential of incorporating CPT data in an AI model. We employ autoencoders to compress 200 CPT profiles of soil behavior type index (Ic) and normalized cone resistance (qc1Ncs) into ten latent features while preserving critical information. We then utilize the extracted latent features with site parameters to train XGBoost models for predicting lateral spreading occurrences in the 2011 Christchurch earthquake. Models using the latent CPT features outperformed models with conventional CPT metrics or no CPT data, achieving over 83% accuracy. Explainable AI revealed the most crucial latent feature corresponding to soil behavior between 1-3 meter depths, highlighting this depth range's criticality for liquefaction evaluation. The autoencoder approach provides an automated technique for condensing CPT profiles into informative latent features for machine-learning liquefaction models.
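
A minimal sketch of the compression step is shown below; the input width, layer sizes, and training details are illustrative assumptions rather than the study's exact architecture.

    import torch
    import torch.nn as nn

    class ProfileAE(nn.Module):
        # Compress a depth-resolved CPT profile (e.g., Ic and qc1Ncs sampled
        # at 100 depths, flattened to 200 inputs) into 10 latent features.
        def __init__(self, in_dim=200, latent_dim=10):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
            self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

        def forward(self, x):
            z = self.enc(x)
            return self.dec(z), z

    model = ProfileAE()
    x = torch.randn(32, 200)                    # a batch of CPT profiles
    recon, z = model(x)
    loss = nn.functional.mse_loss(recon, x)     # reconstruction objective
    # After training, z (concatenated with site parameters) feeds an
    # XGBoost classifier for lateral-spreading occurrence.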

Updated: 2025-03-17 17:22:15

Domains: cs.LG,physics.geo-ph

Download: http://arxiv.org/abs/2503.13389v1

Spectrally-Corrected and Regularized QDA Classifier for Spiked Covariance Model

Quadratic discriminant analysis (QDA) is a widely used method for classification problems, particularly preferable over Linear Discriminant Analysis (LDA) for heterogeneous data. However, QDA loses its effectiveness in high-dimensional settings, where the data dimension and sample size tend to infinity. To address this issue, we propose a novel QDA method utilizing spectral correction and regularization techniques, termed SR-QDA. The regularization parameters in our method are selected by maximizing the Fisher-discriminant ratio. We compare SR-QDA with QDA, regularized quadratic discriminant analysis (R-QDA), and several other competitors. The results indicate that SR-QDA performs exceptionally well, especially in moderate and high-dimensional situations. Empirical experiments across diverse datasets further support this conclusion.
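
To make the spectral-correction idea concrete, here is an illustrative regularized-QDA scorer in which each class covariance spectrum is shrunk before inversion; the shrinkage rule and the parameter alpha are placeholders for the paper's spiked-model correction and its Fisher-discriminant-ratio selection.

    import numpy as np

    def sr_qda_predict(X, means, covs, priors, alpha=0.1):
        # X: (n, p) samples; means/covs/priors: per-class statistics.
        scores = []
        for mu, S, pi in zip(means, covs, priors):
            w, V = np.linalg.eigh(S)
            w_reg = (1 - alpha) * w + alpha * w.mean()  # shrink the spectrum
            S_inv = (V / w_reg) @ V.T                   # corrected inverse
            logdet = np.sum(np.log(w_reg))
            d = X - mu
            maha = np.einsum("ij,jk,ik->i", d, S_inv, d)
            scores.append(-0.5 * (logdet + maha) + np.log(pi))
        return np.argmax(np.stack(scores, axis=1), axis=1)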

Updated: 2025-03-17 17:21:03

Domains: cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2503.13582v1

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Harmful fine-tuning attack poses serious safety concerns for large language models' fine-tuning-as-a-service. While existing defenses have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. To this end, we in this paper show that harmful perturbation over the model weights could be a probable cause of broken alignment. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction after the simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at https://github.com/git-disl/Booster.

Updated: 2025-03-17 17:17:16

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2409.01586v4

Neural Interactive Proofs

We consider the problem of how a trusted, but computationally bounded agent (a 'verifier') can learn to interact with one or more powerful but untrusted agents ('provers') in order to solve a given task. More specifically, we study the case in which agents are represented using neural networks and refer to solutions of this problem as neural interactive proofs. First we introduce a unifying framework based on prover-verifier games, which generalises previously proposed interaction protocols. We then describe several new protocols for generating neural interactive proofs, and provide a theoretical comparison of both new and existing approaches. Finally, we support this theory with experiments in two domains: a toy graph isomorphism problem that illustrates the key ideas, and a code validation task using large language models. In so doing, we aim to create a foundation for future work on neural interactive proofs and their application in building safer AI systems.

Updated: 2025-03-17 17:16:02

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.08897v2

Scale Efficient Training for Large Datasets

The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum, as sketched below. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50\% while maintaining or improving performance, with minimal degradation even at 70\% cost reduction. Furthermore, experiments on real datasets of various scales across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
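
An illustrative selection routine in the spirit of SeTa follows (not the released code; the keep fraction, cluster count, and window schedule are assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    def seta_select(losses, epoch, n_epochs, keep_frac=0.7,
                    n_clusters=10, window=6, rng=None):
        # losses: (n,) current per-sample losses used as a difficulty proxy.
        if rng is None:
            rng = np.random.default_rng(0)
        n = len(losses)
        # 1) random pruning cheaply removes redundant samples
        kept = rng.choice(n, size=int(keep_frac * n), replace=False)
        # 2) cluster the survivors by loss
        kept_losses = losses[kept].reshape(-1, 1)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(kept_losses)
        order = np.argsort([kept_losses[labels == c].mean()
                            for c in range(n_clusters)])  # easy -> hard
        # 3) slide a fixed-size window along the curriculum as epochs pass,
        # so overly easy clusters drop out early and hard ones enter late
        start = int(round(epoch / max(1, n_epochs - 1) * (n_clusters - window)))
        return kept[np.isin(labels, order[start:start + window])]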

Updated: 2025-03-17 17:13:43

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13385v1

Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.

Updated: 2025-03-17 17:11:22

Domains: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.13383v1

TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM

We introduce TimeZero, a reasoning-guided LVLM designed for the temporal video grounding (TVG) task. This task requires precisely localizing relevant video segments within long videos based on a given language query. TimeZero tackles this challenge by extending the inference process, enabling the model to reason about video-language relationships solely through reinforcement learning. To evaluate the effectiveness of TimeZero, we conduct experiments on two benchmarks, where TimeZero achieves state-of-the-art performance on Charades-STA. Code is available at https://github.com/www-Ye/TimeZero.

Updated: 2025-03-17 17:04:20

Domains: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.13377v1

A deep cut into Split Federated Self-supervised Learning

Collaborative self-supervised learning has recently become feasible in highly distributed environments by dividing the network layers between client devices and a central server. However, state-of-the-art methods, such as MocoSFL, are optimized for network division at the initial layers, which decreases the protection of the client data and increases communication overhead. In this paper, we demonstrate that splitting depth is crucial for maintaining privacy and communication efficiency in distributed training. We also show that MocoSFL suffers from a catastrophic quality deterioration for the minimal communication overhead. As a remedy, we introduce Momentum-Aligned contrastive Split Federated Learning (MonAcoSFL), which aligns online and momentum client models during training procedure. Consequently, we achieve state-of-the-art accuracy while significantly reducing the communication overhead, making MonAcoSFL more practical in real-world scenarios.
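
A minimal sketch of the alignment idea (our illustration; the momentum coefficient and the re-synchronization point are assumptions): the client-side momentum model is an exponential moving average of the online model and is hard-re-aligned whenever the server aggregates the online weights, so the two never drift apart.

    import copy
    import torch

    @torch.no_grad()
    def momentum_update(online, momentum_model, m=0.99):
        # MoCo-style EMA over the client-side parameters.
        for p_o, p_m in zip(online.parameters(), momentum_model.parameters()):
            p_m.mul_(m).add_((1 - m) * p_o)

    online = torch.nn.Linear(16, 8)          # stand-in client-side model
    momentum_model = copy.deepcopy(online)
    momentum_update(online, momentum_model)  # per-step EMA
    # ...after the server aggregates and broadcasts new online weights:
    momentum_model.load_state_dict(online.state_dict())  # re-align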

Updated: 2025-03-17 16:59:40

Domains: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2406.08267v2

SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated on mainly two aspects, lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but experiencing lower synchronization compared to their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach to improve diffusion-based models using a temporal pose frame with information bottleneck and facial-informative audio features extracted from AVHuBERT, as conditioning input into the diffusion process. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3 for direct comparison with other SOTA models. Experiments on LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score 27.7%/62.3% relatively higher than previous diffusion-based methods, while preserving their high-fidelity characteristics.

Updated: 2025-03-17 16:58:53

Domains: cs.LG

Download: http://arxiv.org/abs/2503.13371v1

Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions

Often, the needs and visual abilities differ between the annotator group and the end user group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain. Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat lacking by BLV standards. In this study, we ask sighted individuals to assess -- rather than produce -- diagram descriptions generated by vision-language models (VLM) that have been guided with latent supervision via a multi-pass inference. The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes and demonstrate their fine-tuning potential in various downstream tasks.

Updated: 2025-03-17 16:52:46

Domains: cs.AI,cs.CV,cs.HC

Download: http://arxiv.org/abs/2503.13369v1

Follow-the-Regularized-Leader with Adversarial Constraints

Constrained Online Convex Optimization (COCO) can be seen as a generalization of the standard Online Convex Optimization (OCO) framework. At each round, a cost function and constraint function are revealed after a learner chooses an action. The goal is to minimize both the regret and cumulative constraint violation (CCV) against an adaptive adversary. We show for the first time that it is possible to obtain the optimal $O(\sqrt{T})$ bound on both regret and CCV, improving the best known bounds of $O \left( \sqrt{T} \right)$ and $\tilde{O} \left( \sqrt{T} \right)$ for the regret and CCV, respectively.
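
For orientation, the generic update behind the title is standard follow-the-regularized-leader (written here in generic notation, not necessarily the paper's):
\[
x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \sum_{s=1}^{t} \langle g_s, x \rangle + R_t(x) \right\},
\]
where $g_s$ is the loss (sub)gradient observed at round $s$ and $R_t$ is a regularizer. In COCO the revealed constraint functions must additionally enter the update, e.g. through a penalty term, and designing that interplay so that regret and CCV are simultaneously $O(\sqrt{T})$ is the paper's contribution.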

Updated: 2025-03-17 16:51:16

Domains: cs.LG,cs.DS,math.OC,stat.ML

Download: http://arxiv.org/abs/2503.13366v1

Inteligencia Artificial para la conservación y uso sostenible de la biodiversidad, una visión desde Colombia (Artificial Intelligence for conservation and sustainable use of biodiversity, a view from Colombia)

The rise of artificial intelligence (AI) and the aggravating biodiversity crisis have resulted in a research area where AI-based computational methods are being developed to act as allies in conservation, and the sustainable use and management of natural resources. While important general guidelines have been established globally regarding the opportunities and challenges that this interdisciplinary research offers, it is essential to generate local reflections from the specific contexts and realities of each region. Hence, this document aims to analyze the scope of this research area from a perspective focused on Colombia and the Neotropics. In this paper, we summarize the main experiences and debates that took place at the Humboldt Institute between 2023 and 2024 in Colombia. To illustrate the variety of promising opportunities, we present current uses such as automatic species identification from images and recordings, species modeling, and in silico bioprospecting, among others. From the experiences described above, we highlight limitations, challenges, and opportunities for successfully implementing AI in conservation efforts and the sustainable management of biological resources in the Neotropics. The result aims to be a guide for researchers, decision makers, and biodiversity managers, facilitating the understanding of how artificial intelligence can be effectively integrated into conservation and sustainable use strategies. Furthermore, it also seeks to open a space for dialogue on the development of policies that promote the responsible and ethical adoption of AI in local contexts, ensuring that its benefits are harnessed without compromising biodiversity or the cultural and ecosystemic values inherent in Colombia and the Neotropics.

Updated: 2025-03-17 16:47:05

Domains: cs.CY,cs.AI

Download: http://arxiv.org/abs/2503.14543v1

Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information, in other words, MLLMs suffer from a gradual decline in attention to visual information as reasoning progresses, causing text-over-relied outputs. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning process with the input image removed. We observe only a ~2% accuracy drop on MathVista's test-hard subset, revealing the model's textual outputs dominate the following reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% vs previous sota), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.

Updated: 2025-03-17 16:45:12

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13360v1

Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design

Recent studies have demonstrated the strong empirical performance of diffusion models on discrete sequences across domains from natural language to biological sequence generation. For example, in the protein inverse folding task, conditional diffusion models have achieved impressive results in generating natural-like sequences that fold back into the original structure. However, practical design tasks often require not only modeling a conditional distribution but also optimizing specific task objectives. For instance, we may prefer protein sequences with high stability. To address this, we consider the scenario where we have pre-trained discrete diffusion models that can generate natural-like sequences, as well as reward models that map sequences to task objectives. We then formulate the reward maximization problem within discrete diffusion models, analogous to reinforcement learning (RL), while minimizing the KL divergence against pretrained diffusion models to preserve naturalness. To solve this RL problem, we propose a novel algorithm, DRAKES, that enables direct backpropagation of rewards through entire trajectories generated by diffusion models, by making the originally non-differentiable trajectories differentiable using the Gumbel-Softmax trick. Our theoretical analysis indicates that our approach can generate sequences that are both natural-like and yield high rewards. While similar tasks have been recently explored in diffusion models for continuous domains, our work addresses unique algorithmic and theoretical challenges specific to discrete diffusion models, which arise from their foundation in continuous-time Markov chains rather than Brownian motion. Finally, we demonstrate the effectiveness of DRAKES in generating DNA and protein sequences that optimize enhancer activity and protein stability, respectively, important tasks for gene therapies and protein-based therapeutics.
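
The differentiability device referenced here is the standard Gumbel-Softmax trick, available in PyTorch; the snippet illustrates only that mechanism, with a random tensor standing in for a reward model, and is not DRAKES itself.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 20, requires_grad=True)  # per-position token logits

    # hard=True returns one-hot samples in the forward pass while gradients
    # flow through the softmax relaxation (straight-through), so a reward
    # computed on sampled trajectories can be backpropagated to the logits.
    y = F.gumbel_softmax(logits, tau=0.5, hard=True)

    reward = (y * torch.randn(4, 20)).sum()  # stand-in for a reward model
    reward.backward()                        # logits.grad is now populated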

Updated: 2025-03-17 16:44:45

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2410.13643v2

Agents Play Thousands of 3D Video Games

We present PORTAL, a novel framework for developing artificial intelligence agents capable of playing thousands of 3D video games through language-guided policy generation. By transforming decision-making problems into language modeling tasks, our approach leverages large language models (LLMs) to generate behavior trees represented in domain-specific language (DSL). This method eliminates the computational burden associated with traditional reinforcement learning approaches while preserving strategic depth and rapid adaptability. Our framework introduces a hybrid policy structure that combines rule-based nodes with neural network components, enabling both high-level strategic reasoning and precise low-level control. A dual-feedback mechanism incorporating quantitative game metrics and vision-language model analysis facilitates iterative policy improvement at both tactical and strategic levels. The resulting policies are instantaneously deployable, human-interpretable, and capable of generalizing across diverse gaming environments. Experimental results demonstrate PORTAL's effectiveness across thousands of first-person shooter (FPS) games, showcasing significant improvements in development efficiency, policy generalization, and behavior diversity compared to traditional approaches. PORTAL represents a significant advancement in game AI development, offering a practical solution for creating sophisticated agents that can operate across thousands of commercial video games with minimal development overhead. Experiment results on the 3D video games are best viewed on https://zhongwen.one/projects/portal .
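
A toy example of the kind of hybrid policy such a DSL could describe is sketched below; the node vocabulary and the observation keys are our assumptions, not PORTAL's actual grammar.

    from typing import Callable, List

    class Leaf:
        # A leaf can wrap a hand-written rule or a neural policy call.
        def __init__(self, fn: Callable[[dict], bool]):
            self.fn = fn
        def tick(self, obs: dict) -> bool:
            return self.fn(obs)

    class Sequence:  # succeeds only if every child succeeds, in order
        def __init__(self, children: List):
            self.children = children
        def tick(self, obs: dict) -> bool:
            return all(c.tick(obs) for c in self.children)

    class Selector:  # tries children until one succeeds
        def __init__(self, children: List):
            self.children = children
        def tick(self, obs: dict) -> bool:
            return any(c.tick(obs) for c in self.children)

    tree = Selector([
        Sequence([Leaf(lambda o: o["health"] < 30),        # rule-based check
                  Leaf(lambda o: o["act"]("retreat"))]),
        Leaf(lambda o: o["act"]("engage")),  # could wrap a neural controller
    ])
    tree.tick({"health": 20, "act": lambda a: print("action:", a) or True})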

Updated: 2025-03-17 16:42:34

Domains: cs.LG

Download: http://arxiv.org/abs/2503.13356v1

Strain Problems got you in a Twist? Try StrainRelief: A Quantum-Accurate Tool for Ligand Strain Calculations

Ligand strain energy, the energy difference between the bound and unbound conformations of a ligand, is an important component of structure-based small molecule drug design. A large majority of observed ligands in protein-small molecule co-crystal structures bind in low-strain conformations, making strain energy a useful filter for structure-based drug design. In this work we present a tool for calculating ligand strain with high accuracy. StrainRelief uses a MACE Neural Network Potential (NNP), trained on a large database of Density Functional Theory (DFT) calculations, to estimate ligand strain of neutral molecules with quantum accuracy. We show that this tool estimates strain energy differences relative to DFT to within 1.4 kcal/mol, more accurately than alternative NNPs. These results highlight the utility of NNPs in drug discovery, and provide a useful tool for drug discovery teams.

Updated: 2025-03-17 16:33:52

Domains: physics.chem-ph,cs.LG

Download: http://arxiv.org/abs/2503.13352v1

GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) has recently created impressive 3D assets for various applications. However, considering security, capacity, invisibility, and training efficiency, the copyright of 3DGS assets is not well protected as existing watermarking methods are unsuited for its rendering pipeline. In this paper, we propose GuardSplat, an innovative and efficient framework for watermarking 3DGS assets. Specifically, 1) We propose a CLIP-guided pipeline for optimizing the message decoder with minimal costs. The key objective is to achieve high-accuracy extraction by leveraging CLIP's aligning capability and rich representations, demonstrating exceptional capacity and efficiency. 2) We tailor a Spherical-Harmonic-aware (SH-aware) Message Embedding module for 3DGS, seamlessly embedding messages into the SH features of each 3D Gaussian while preserving the original 3D structure. This enables watermarking 3DGS assets with minimal fidelity trade-offs and prevents malicious users from removing the watermarks from the model files, meeting the demands for invisibility and security. 3) We present an Anti-distortion Message Extraction module to improve robustness against various distortions. Experiments demonstrate that GuardSplat outperforms state-of-the-art and achieves fast optimization speed. Project page is at https://narcissusex.github.io/GuardSplat, and Code is at https://github.com/NarcissusEx/GuardSplat.

Updated: 2025-03-17 16:33:17

Domains: cs.CV,cs.CR

Download: http://arxiv.org/abs/2411.19895v5

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

Updated: 2025-03-17 16:32:24

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2412.04626v2

An Information Criterion for Controlled Disentanglement of Multimodal Data

Multimodal representation learning seeks to relate and decompose information inherent in multiple modalities. By disentangling modality-specific information from information that is shared across modalities, we can improve interpretability and robustness and enable downstream tasks such as the generation of counterfactual outcomes. Separating the two types of information is challenging since they are often deeply entangled in many real-world applications. We propose Disentangled Self-Supervised Learning (DisentangledSSL), a novel self-supervised approach for learning disentangled representations. We present a comprehensive analysis of the optimality of each disentangled representation, particularly focusing on the scenario not covered in prior work where the so-called Minimum Necessary Information (MNI) point is not attainable. We demonstrate that DisentangledSSL successfully learns shared and modality-specific features on multiple synthetic and real-world datasets and consistently outperforms baselines on various downstream tasks, including prediction tasks for vision-language data, as well as molecule-phenotype retrieval tasks for biological data. The code is available at https://github.com/uhlerlab/DisentangledSSL.

Updated: 2025-03-17 16:27:27

Domains: cs.LG,cs.AI,cs.IT,math.IT

Download: http://arxiv.org/abs/2410.23996v2

Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications

Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.

Updated: 2025-03-17 16:21:48

Domains: cs.DC,cs.AI

Download: http://arxiv.org/abs/2503.13343v1

Valid Text-to-SQL Generation with Unification-based DeepStochLog

Large language models have been used to translate natural language questions to SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that are not executable. These failures limit the usage of these systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. The evaluation results on a subset of SQL grammars show that all our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate this extension enhances the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at https://github.com/ML-KULeuven/deepstochlog-lm.

Updated: 2025-03-17 16:21:10

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.13342v1

Reliable and Efficient Amortized Model-based Evaluation

Comprehensive evaluations of language models (LM) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnostic) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach, unfortunately, often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by careful controlling for question difficulty. Unfortunately, question difficulty is expensive to estimate. Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve the evaluation efficiency through training a question generator given a difficulty level. This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimation of LLM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient compared to current common practice.
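
For concreteness, the workhorse behind this is the two-parameter logistic IRT model (a standard formulation; the paper may use a variant):
\[
\Pr(y_{ij}=1 \mid \theta_j) = \sigma\big(a_i(\theta_j - b_i)\big), \qquad \sigma(z) = \frac{1}{1+e^{-z}},
\]
where $\theta_j$ is the ability of model $j$ and $b_i$, $a_i$ are the difficulty and discrimination of question $i$. Averaging raw accuracy over a subset implicitly assumes the $b_i$ are balanced; fitting $\theta_j$ under this model controls for them, and the proposed difficulty predictor amortizes the otherwise expensive estimation of $b_i$ from question content.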

Updated: 2025-03-17 16:15:02

Domains: cs.CL,cs.AI,cs.LG,stat.AP

Download: http://arxiv.org/abs/2503.13335v1

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that based on o1-preview, our code agent framework successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation -- especially in prior text-to-SQL benchmarks -- they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous, code agents for real-world enterprise settings. Our code, baseline models, and data are available at https://spider2-sql.github.io

Updated: 2025-03-17 16:10:45

Domains: cs.CL,cs.AI,cs.DB

Download: http://arxiv.org/abs/2411.07763v2

LLM Test Generation via Iterative Hybrid Program Analysis

Automating unit test generation remains a significant challenge, particularly for complex methods in real-world projects. While Large Language Models (LLMs) have made strides in code generation, they struggle to achieve high branch coverage due to their limited ability to reason about intricate control flow structures. To address this limitation, we introduce Panta, a technique that emulates the iterative process human developers follow when analyzing code and constructing test cases. Panta integrates static control flow analysis and dynamic code coverage analysis to systematically guide LLMs in identifying uncovered execution paths and generating better test cases. By incorporating an iterative feedback-driven mechanism, our technique continuously refines test generation based on static and dynamic path coverage insights, ensuring more comprehensive and effective testing. Our empirical evaluation, conducted on classes with high cyclomatic complexity from open-source projects, demonstrates that Panta achieves 26% higher line coverage and 23% higher branch coverage compared to the state-of-the-art.

Updated: 2025-03-17 16:10:38

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2503.13580v1

LEAVS: An LLM-based Labeler for Abdominal CT Supervision

Extracting structured labels from radiology reports has been employed to create vision models to simultaneously detect several types of abnormalities. However, existing works focus mainly on the chest region. Few works have been investigated on abdominal radiology reports due to more complex anatomy and a wider range of pathologies in the abdomen. We propose LEAVS (Large language model Extractor for Abdominal Vision Supervision). This labeler can annotate the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs on CT radiology reports. To ensure broad coverage, we chose abnormalities that encompass most of the finding types from CT reports. Our approach employs a specialized chain-of-thought prompting strategy for a locally-run LLM using sentence extraction and multiple-choice questions in a tree-based decision system. We demonstrate that the LLM can extract several abnormality types across abdominal organs with an average F1 score of 0.89, significantly outperforming competing labelers and humans. Additionally, we show that extraction of urgency labels achieved performance comparable to human annotations. Finally, we demonstrate that the abnormality labels contain valuable information for training a single vision model that classifies several organs as normal or abnormal. We release our code and structured annotations for a public CT dataset containing over 1,000 CT volumes.

Updated: 2025-03-17 16:09:22

Domains: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.13330v1

PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modelling and machine learning

Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated datasets. Being able to easily access and utilise these is crucial for allowing researchers to make optimal use of their research effort. The tools presented here are useful for collating existing public cryoEM datasets and/or creating new synthetic cryoEM datasets to aid the development of novel data processing and interpretation algorithms. In recent years, structural biology has seen the development of a multitude of machine-learning based algorithms for aiding numerous steps in the processing and reconstruction of experimental datasets and the use of these approaches has become widespread. Developing such techniques in structural biology requires access to large datasets which can be cumbersome to curate and unwieldy to make use of. In this paper we present a suite of Python software packages which we collectively refer to as PERC (profet, EMPIARreader and CAKED). These are designed to reduce the burden which data curation places upon structural biology research. The protein structure fetcher (profet) package allows users to conveniently download and cleave sequences or structures from the Protein Data Bank or Alphafold databases. EMPIARreader allows lazy loading of Electron Microscopy Public Image Archive datasets in a machine-learning compatible structure. The Class Aggregator for Key Electron-microscopy Data (CAKED) package is designed to seamlessly facilitate the training of machine learning models on electron microscopy data, including electron-cryo-microscopy-specific data augmentation and labelling. These packages may be utilised independently or as building blocks in workflows. All are available in open source repositories and designed to be easily extensible to facilitate more advanced workflows if required.
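
The common pattern these packages serve, lazy per-item access to a large archive, can be sketched generically as below; this illustrates the idea only and is not the actual profet, EMPIARreader, or CAKED API.

```python
# Generic lazy-loading dataset pattern (illustrative; not the PERC packages' API).
from typing import Callable, List
from torch.utils.data import Dataset

class LazyArchiveDataset(Dataset):
    """Fetches and decodes one archive entry at a time instead of materialising everything."""
    def __init__(self, index: List[str], fetch: Callable[[str], object]):
        self.index = index   # e.g. entry paths within a public archive
        self.fetch = fetch   # callable: path -> decoded array, downloaded on demand

    def __len__(self) -> int:
        return len(self.index)

    def __getitem__(self, i: int):
        return self.fetch(self.index[i])   # nothing is loaded until an item is requested
```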

Updated: 2025-03-17 16:07:56

Domains: cs.LG,cs.CE,q-bio.BM

Download: http://arxiv.org/abs/2503.13329v1

Investigating the contribution of terrain-following coordinates and conservation schemes in AI-driven precipitation forecasts

Artificial Intelligence (AI) weather prediction (AIWP) models often produce "blurry" precipitation forecasts that overestimate drizzle and underestimate extremes. This study provides a novel solution to tackle this problem -- integrating terrain-following coordinates with global mass and energy conservation schemes into AIWP models. Forecast experiments are conducted to evaluate the effectiveness of this solution using FuXi, an example AIWP model, adapted to 1.0-degree grid spacing data. Verification results show large performance gains. The conservation schemes are found to reduce drizzle bias, whereas using terrain-following coordinates improves the estimation of extreme events and precipitation intensity spectra. Furthermore, a case study reveals that terrain-following coordinates capture near-surface winds better over mountains, offering AIWP models more accurate information on understanding the dynamics of precipitation processes. The proposed solution of this study can benefit a wide range of AIWP models and bring insights into how atmospheric domain knowledge can support the development of AIWP models.

Updated: 2025-03-17 16:06:25

Domains: physics.ao-ph,cs.AI

Download: http://arxiv.org/abs/2503.00332v2

SMPR: A structure-enhanced multimodal drug-disease prediction model for drug repositioning and cold start

Repositioning drug-disease relationships has always been a hot field of research. However, actual cases of biologically validated drug relocation remain very limited, and existing models have not yet fully utilized the structural information of the drug. Furthermore, most repositioning models are only used to complete the relationship matrix, and their practicality is poor when dealing with drug cold-start problems. This paper proposes a structure-enhanced multimodal relationship prediction model (SMPR). SMPR is based on the SMILES structure of the drug, using the Mol2vec method to generate drug embedded representations, and learns disease embedded representations through heterogeneous network graph neural networks. Ultimately, a drug-disease relationship matrix is constructed. In addition, to make the model easier to use, SMPR also provides a cold-start interface that applies structural similarity to the repositioning results to simply and quickly predict drug-related diseases. The repositioning ability and cold-start capability of the model are verified from multiple perspectives. While the AUC and AUPR scores of repositioning reach 99% and 61% respectively, the AUC of cold start achieves 80%. In particular, the cold-start Recall indicator can reach more than 70%, which means that SMPR is more sensitive to positive samples. Finally, case analysis is used to verify the practical value of the model, and visual analysis directly demonstrates the improvement the structural information brings to the model. For quick use, we also provide local deployment of the model and package it into an executable program.
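
As a rough illustration of the cold-start idea (the mechanics below are assumed, not taken from the paper), an unseen drug can be scored against diseases by similarity-weighting the repositioning rows of its structurally nearest known drugs:

```python
# Hedged sketch: structural-similarity cold start over a learned drug-disease matrix.
import numpy as np

def cold_start_scores(new_fp: np.ndarray, known_fps: np.ndarray,
                      relation_matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """new_fp: fingerprint of the unseen drug; known_fps: (n_drugs, d) fingerprints;
    relation_matrix: (n_drugs, n_diseases) scores produced by the trained model."""
    sims = known_fps @ new_fp / (
        np.linalg.norm(known_fps, axis=1) * np.linalg.norm(new_fp) + 1e-9
    )                                        # cosine similarity to every known drug
    top = np.argsort(sims)[-k:]              # k most structurally similar known drugs
    w = sims[top] / (sims[top].sum() + 1e-9)
    return w @ relation_matrix[top]          # similarity-weighted disease scores
```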

Updated: 2025-03-17 15:59:20

Domains: cs.LG

Download: http://arxiv.org/abs/2503.13322v1

ASMR: Adaptive Skeleton-Mesh Rigging and Skinning via 2D Generative Prior

Despite the growing accessibility of skeletal motion data, integrating it for animating character meshes remains challenging due to diverse configurations of both skeletons and meshes. Specifically, the body scale and bone lengths of the skeleton should be adjusted in accordance with the size and proportions of the mesh, ensuring that all joints are accurately positioned within the character mesh. Furthermore, defining skinning weights is complicated by variations in skeletal configurations, such as the number of joints and their hierarchy, as well as differences in mesh configurations, including their connectivity and shapes. While existing approaches have made efforts to automate this process, they hardly address the variations in both skeletal and mesh configurations. In this paper, we present a novel method for the automatic rigging and skinning of character meshes using skeletal motion data, accommodating arbitrary configurations of both meshes and skeletons. The proposed method predicts the optimal skeleton aligned with the size and proportion of the mesh as well as defines skinning weights for various mesh-skeleton configurations, without requiring explicit supervision tailored to each of them. By incorporating Diffusion 3D Features (Diff3F) as semantic descriptors of character meshes, our method achieves robust generalization across different configurations. To assess the performance of our method in comparison to existing approaches, we conducted comprehensive evaluations encompassing both quantitative and qualitative analyses, specifically examining the predicted skeletons, skinning weights, and deformation quality.

Updated: 2025-03-17 15:59:02

Domains: cs.GR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13579v1

Do you understand epistemic uncertainty? Think again! Rigorous frequentist epistemic uncertainty estimation in regression

Quantifying model uncertainty is critical for understanding prediction reliability, yet distinguishing between aleatoric and epistemic uncertainty remains challenging. We extend recent work from classification to regression to provide a novel frequentist approach to epistemic and aleatoric uncertainty estimation. We train models to generate conditional predictions by feeding their initial output back as an additional input. This method allows for a rigorous measurement of model uncertainty by observing how prediction responses change when conditioned on the model's previous answer. We provide a complete theoretical framework to analyze epistemic uncertainty in regression in a frequentist way, and explain how it can be exploited in practice to gauge a model's uncertainty, with minimal changes to the original architecture.
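
The conditioning mechanism can be pictured with the toy probe below; the model interface and the shift measure are assumptions for illustration, while the paper's rigorous frequentist estimator is more involved.

```python
# Toy probe: how much does conditioning on a claimed answer move the prediction?
import torch

def conditional_shift(model, x: torch.Tensor, y_probe: torch.Tensor) -> torch.Tensor:
    """Assumes model(x, prev=...) accepts its own previous output as an extra input."""
    y0 = model(x, prev=None)         # ordinary forward pass
    y1 = model(x, prev=y_probe)      # forward pass conditioned on a claimed answer
    # An epistemically confident model should barely move when conditioned.
    return (y1 - y0).abs().mean()
```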

Updated: 2025-03-17 15:54:57

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2503.13317v1

RainScaleGAN: a Conditional Generative Adversarial Network for Rainfall Downscaling

To this day, accurately simulating local-scale precipitation and reliably reproducing its distribution remains a challenging task. The limited horizontal resolution of Global Climate Models is among the primary factors undermining their skill in this context. The physical mechanisms driving the onset and development of precipitation, especially in extreme events, operate at spatio-temporal scales smaller than those numerically resolved, thus struggling to be captured accurately. In order to circumvent this limitation, several downscaling approaches have been developed over the last decades to address the discrepancy between the spatial resolution of model outputs and the resolution required by local-scale applications. In this paper, we introduce RainScaleGAN, a conditional deep convolutional Generative Adversarial Network (GAN) for precipitation downscaling. GANs have been effectively used in image super-resolution, an approach highly relevant for downscaling tasks. RainScaleGAN's capabilities are tested in a perfect-model setup, where the spatial resolution of a precipitation dataset is artificially degraded from 0.25$^{\circ}\times$0.25$^{\circ}$ to 2$^{\circ}\times$2$^\circ$, and RainScaleGAN is used to restore it. The developed model outperforms one of the leading precipitation downscaling methods found in the literature. RainScaleGAN not only generates a synthetic dataset featuring plausible high-resolution spatial patterns and intensities, but also produces a precipitation distribution with statistics closely mirroring those of the ground-truth dataset. Given that RainScaleGAN's approach is agnostic with respect to the underlying physics, the method has the potential to be applied to other physical variables such as surface winds or temperature.
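
A bare-bones version of such a conditional downscaling GAN (2-degree to 0.25-degree is an 8x upsampling) might look like the sketch below; layer sizes are placeholders, not the published architecture.

```python
# Minimal conditional GAN sketch for 8x precipitation downscaling (illustrative only).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=8, mode="nearest"),    # coarse 2-deg field -> 0.25-deg grid
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1), nn.Softplus(),  # keep rainfall non-negative
        )

    def forward(self, coarse: torch.Tensor) -> torch.Tensor:
        return self.net(coarse)                             # conditioned on the coarse field

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),       # patch-wise real/fake scores
        )

    def forward(self, fine: torch.Tensor, coarse_up: torch.Tensor) -> torch.Tensor:
        # Conditional critic: sees the candidate fine field stacked with the upsampled input.
        return self.net(torch.cat([fine, coarse_up], dim=1))
```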

Updated: 2025-03-17 15:54:20

Domains: physics.ao-ph,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13316v1

Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

Diffusion models are prone to exactly reproducing images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel perspective on the memorization phenomenon and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin: classifier-free guidance is withheld until an ideal transition point, after which it is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, opposite guidance, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.
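
In sampler terms, the recipe amounts to the sketch below; the switch point t_switch and the exact form of opposite guidance are assumptions inferred from the abstract.

```python
# Sketch: withhold classifier-free guidance early in denoising, apply it after a switch point.
import torch

@torch.no_grad()
def guided_sample(model, step, x, timesteps, cond, w: float = 7.5, t_switch: int = 500,
                  opposite: bool = False):
    """`model(x, t, c)` predicts noise; `step(x, eps, t)` is any standard sampler update."""
    for t in timesteps:                          # timesteps run from high t to low t
        eps_u = model(x, t, None)                # unconditional prediction
        eps_c = model(x, t, cond)                # conditional prediction
        if t > t_switch:
            # Inside the attraction basin: no guidance, or push away from it ("opposite").
            eps = eps_u - w * (eps_c - eps_u) if opposite else eps_u
        else:
            eps = eps_u + w * (eps_c - eps_u)    # standard classifier-free guidance
        x = step(x, eps, t)
    return x
```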

Updated: 2025-03-17 15:50:58

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.16738v2

Leveraging Large Language Models for Collective Decision-Making

In various work contexts, such as meeting scheduling, collaborating, and project planning, collective decision-making is essential but often challenging due to diverse individual preferences, varying work focuses, and power dynamics among members. To address this, we propose a system leveraging Large Language Models (LLMs) to facilitate group decision-making by managing conversations and balancing preferences among individuals. Our system aims to extract individual preferences from each member's conversation with the system and suggest options that satisfy the preferences of the members. We specifically apply this system to corporate meeting scheduling. We create synthetic employee profiles and simulate conversations at scale, leveraging LLMs to evaluate the system performance as a novel approach to conducting a user study. Our results indicate efficient coordination with reduced interactions between the members and the LLM-based system. The system refines and improves its proposed options over time, ensuring that many of the members' individual preferences are satisfied in an equitable way. Finally, we conduct a survey study involving human participants to assess our system's ability to aggregate preferences and reasoning about them. Our findings show that the system exhibits strong performance in both dimensions.

Updated: 2025-03-17 15:50:13

Domains: cs.CL,cs.AI,cs.HC,cs.SI

Download: http://arxiv.org/abs/2311.04928v3

Generative AI for Software Architecture. Applications, Trends, Challenges, and Future Directions

Context: Generative Artificial Intelligence (GenAI) is transforming much of software development, yet its application in software architecture is still in its infancy, and no prior study has systematically addressed the topic. Aim: We aim to systematically synthesize the use, rationale, contexts, usability, and future challenges of GenAI in software architecture. Method: We performed a multivocal literature review (MLR), analyzing peer-reviewed and gray literature, identifying current practices, models, adoption contexts, and reported challenges, extracting themes via open coding. Results: Our review identified significant adoption of GenAI for architectural decision support and architectural reconstruction. OpenAI GPT models are predominantly applied, and there is consistent use of techniques such as few-shot prompting and retrieval-augmented generation (RAG). GenAI has been applied mostly to initial stages of the Software Development Life Cycle (SDLC), such as Requirements-to-Architecture and Architecture-to-Code. Monolithic and microservice architectures were the dominant targets. However, rigorous testing of GenAI outputs was typically missing from the studies. Among the most frequent challenges are model precision, hallucinations, ethical aspects, privacy issues, lack of architecture-specific datasets, and the absence of sound evaluation frameworks. Conclusions: GenAI shows significant potential in software design, but several challenges remain on its path to greater adoption. Research efforts should target designing general evaluation methodologies, handling ethics and precision, increasing transparency and explainability, and promoting architecture-specific datasets and benchmarks to bridge the gap between theoretical possibilities and practical use.

Updated: 2025-03-17 15:49:30

Domains: cs.SE,cs.AI,cs.DC,cs.ET

Download: http://arxiv.org/abs/2503.13310v1

Integrating AI for Human-Centric Breast Cancer Diagnostics: A Multi-Scale and Multi-View Swin Transformer Framework

Despite advancements in Computer-Aided Diagnosis (CAD) systems, breast cancer remains one of the leading causes of cancer-related deaths among women worldwide. Recent breakthroughs in Artificial Intelligence (AI) have shown significant promise in development of advanced Deep Learning (DL) architectures for breast cancer diagnosis through mammography. In this context, the paper focuses on the integration of AI within a Human-Centric workflow to enhance breast cancer diagnostics. Key challenges are, however, largely overlooked such as reliance on detailed tumor annotations and susceptibility to missing views, particularly during test time. To address these issues, we propose a hybrid, multi-scale and multi-view Swin Transformer-based framework (MSMV-Swin) that enhances diagnostic robustness and accuracy. The proposed MSMV-Swin framework is designed to work as a decision-support tool, helping radiologists analyze multi-view mammograms more effectively. More specifically, the MSMV-Swin framework leverages the Segment Anything Model (SAM) to isolate the breast lobe, reducing background noise and enabling comprehensive feature extraction. The multi-scale nature of the proposed MSMV-Swin framework accounts for tumor-specific regions as well as the spatial characteristics of tissues surrounding the tumor, capturing both localized and contextual information. The integration of contextual and localized data ensures that MSMV-Swin's outputs align with the way radiologists interpret mammograms, fostering better human-AI interaction and trust. A hybrid fusion structure is then designed to ensure robustness against missing views, a common occurrence in clinical practice when only a single mammogram view is available.

Updated: 2025-03-17 15:48:56

Domains: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.13309v1

Computation Mechanism Behind LLM Position Generalization

Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and, with the latest techniques, generalize to texts longer than those encountered during training. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs' computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs' position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs' internal mechanisms.

Updated: 2025-03-17 15:47:37

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.13305v1

GFSNetwork: Differentiable Feature Selection via Gumbel-Sigmoid Relaxation

Feature selection in deep learning remains a critical challenge, particularly for high-dimensional tabular data where interpretability and computational efficiency are paramount. We present GFSNetwork, a novel neural architecture that performs differentiable feature selection through temperature-controlled Gumbel-Sigmoid sampling. Unlike traditional methods, where the user has to define the requested number of features, GFSNetwork selects it automatically during an end-to-end process. Moreover, GFSNetwork maintains constant computational overhead regardless of the number of input features. We evaluate GFSNetwork on a series of classification and regression benchmarks, where it consistently outperforms recent methods including DeepLasso, attention maps, as well as traditional feature selectors, while using significantly fewer features. Furthermore, we validate our approach on real-world metagenomic datasets, demonstrating its effectiveness in high-dimensional biological data. Concluding, our method provides a scalable solution that bridges the gap between neural network flexibility and traditional feature selection interpretability. We share our python implementation of GFSNetwork at https://github.com/wwydmanski/GFSNetwork, as well as a PyPi package (gfs_network).
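
The core relaxation is standard and compact enough to show in full: the module below is a minimal Gumbel-Sigmoid feature gate, i.e. the general technique with GFSNetwork's surrounding architecture omitted.

```python
# Minimal differentiable feature gate via the Gumbel-Sigmoid (binary Concrete) relaxation.
import torch
import torch.nn as nn

class GumbelSigmoidGate(nn.Module):
    def __init__(self, n_features: int, tau: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_features))  # one selection logit per feature
        self.tau = tau                                       # temperature, annealed toward 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)           # Logistic(0, 1) sample
            gate = torch.sigmoid((self.logits + noise) / self.tau)
        else:
            gate = (self.logits > 0).float()                 # hard selection at inference
        return x * gate                                      # overhead stays O(n_features)
```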

Updated: 2025-03-17 15:47:26

Domains: cs.LG

Download: http://arxiv.org/abs/2503.13304v1

A Survey on Transformer Context Extension: Approaches and Evaluation

Large language models (LLMs) based on Transformer have been widely applied in the field of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, when it comes to long context scenarios, the performance of LLMs degrades due to several challenges. To alleviate this phenomenon, a number of approaches have been proposed recently. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. We then systematically review the approaches related to long context and propose a taxonomy categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.

Updated: 2025-03-17 15:44:09

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.13299v1

The Limits of Differential Privacy in Online Learning

Differential privacy (DP) is a formal notion that restricts the privacy leakage of an algorithm when running on sensitive data, in which privacy-utility trade-off is one of the central problems in private data analysis. In this work, we investigate the fundamental limits of differential privacy in online learning algorithms and present evidence that separates three types of constraints: no DP, pure DP, and approximate DP. We first describe a hypothesis class that is online learnable under approximate DP but not online learnable under pure DP under the adaptive adversarial setting. This indicates that approximate DP must be adopted when dealing with adaptive adversaries. We then prove that any private online learner must make an infinite number of mistakes for almost all hypothesis classes. This essentially generalizes previous results and shows a strong separation between private and non-private settings since a finite mistake bound is always attainable (as long as the class is online learnable) when there is no privacy requirement.

Updated: 2025-03-17 15:43:14

Domains: cs.LG

Download: http://arxiv.org/abs/2411.05483v2

On Local Posterior Structure in Deep Ensembles

Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimators such as maximum-a-posteriori (MAP). Similarly, deep ensembles (DEs) are also known to improve calibration, and therefore, it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this across a number of datasets, neural network architectures, and BNN approximation methods and surprisingly find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shine light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.

Updated: 2025-03-17 15:41:39

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.13296v1

$\phi$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named $\phi$-Decoding. To provide a precise and expressive estimation of step value, $\phi$-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show $\phi$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.
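
Stripped to essentials, foresight sampling can be sketched as below; this shows only the foresight half, while the clustering distribution and the in-width/in-depth pruning of $\phi$-Decoding are omitted.

```python
# Simplified foresight sampling over candidate reasoning steps (illustrative).
import math
import random
from typing import Callable, List

def choose_step(candidates: List[str], rollout_logprob: Callable[[str], float],
                n_sims: int = 4) -> str:
    # Foresight value: average log-probability of a few simulated continuations per step.
    values = [
        sum(rollout_logprob(step) for _ in range(n_sims)) / n_sims
        for step in candidates
    ]
    # Turn foresight values into a sampling distribution and draw the next step from it.
    m = max(values)
    weights = [math.exp(v - m) for v in values]
    return random.choices(candidates, weights=weights, k=1)[0]
```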

Updated: 2025-03-17 15:38:33

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.13288v1

Explaining the Unexplainable: A Systematic Review of Explainable AI in Finance

Practitioners and researchers seeking a balance between accuracy and transparency have placed Explainable Artificial Intelligence (XAI) at the center of financial applications. This paper offers a thorough overview of the evolving landscape of XAI applications in finance, covering domain-specific implementations, methodological developments, and a mapping of research trends. Using bibliometric and content analysis, we identify topic clusters, influential studies, and the explainability strategies most frequently used in the financial industry. Our results show a substantial dependence on post-hoc interpretability techniques, with attention mechanisms, feature importance analysis, and SHAP being the most frequently used among them. This review stresses the need for multidisciplinary approaches that combine financial knowledge with improved explainability paradigms, and it exposes important shortcomings in present XAI systems.
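
As a concrete instance of the post-hoc techniques the review finds dominant, a typical SHAP workflow wraps a model that has already been trained (toy data below to keep the example self-contained):

```python
# Typical post-hoc attribution workflow with SHAP on a fitted model (toy example).
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)   # stand-in for e.g. a risk model

explainer = shap.Explainer(model, X)   # post hoc: the model is explained after training
shap_values = explainer(X[:50])        # per-feature attributions for 50 instances
shap.plots.beeswarm(shap_values)       # global feature-importance overview
```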

Updated: 2025-03-17 15:37:42

Domains: q-fin.GN,cs.AI

Download: http://arxiv.org/abs/2503.05966v2

TraSCE: Trajectory Steering for Concept Erasure

Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, a widely used negative prompting strategy is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose using a specific formulation of negative prompting instead of the widely used one. Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory. We demonstrate that our proposed method achieves state-of-the-art results on various benchmarks in removing harmful content, including ones proposed by red teams, and erasing artistic styles and objects. Our proposed approach does not require any training, weight modifications, or training data (either image or prompt), making it easier for model owners to erase new concepts.

Updated: 2025-03-17 15:37:35

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.07658v2

Tensor Networks Meet Neural Networks: A Survey and Future Perspectives

Tensor networks (TNs) and neural networks (NNs) are two fundamental data modeling approaches. TNs were introduced to solve the curse of dimensionality in large-scale tensors by converting an exponential number of dimensions to polynomial complexity. As a result, they have attracted significant attention in the fields of quantum physics and machine learning. Meanwhile, NNs have displayed exceptional performance in various applications, e.g., computer vision, natural language processing, and robotics research. Interestingly, although these two types of networks originate from different observations, they are inherently linked through the typical multilinearity structure underlying both TNs and NNs, thereby motivating a significant number of developments regarding combinations of TNs and NNs. In this paper, we refer to these combinations as tensorial neural networks~(TNNs) and present an introduction to TNNs from both data processing and model architecture perspectives. From the data perspective, we explore the capabilities of TNNs in multi-source fusion, multimodal pooling, data compression, multi-task training, and quantum data processing. From the model perspective, we examine TNNs' integration with various architectures, including Convolutional Neural Networks, Recurrent Neural Networks, Graph Neural Networks, Transformers, Large Language Models, and Quantum Neural Networks. Furthermore, this survey also explores methods for improving TNNs, examines flexible toolboxes for implementing TNNs, and documents TNN development while highlighting potential future directions. To the best of our knowledge, this is the first comprehensive survey that bridges the connections among NNs and TNs. We provide a curated list of TNNs at https://github.com/tnbar/awesome-tensorial-neural-networks.
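
The flavor of a tensorial layer can be conveyed in a few lines: the toy module below replaces a dense 1024x1024 weight with a Kronecker-structured pair of 32x32 factors, a minimal instance of the multilinearity the survey builds on rather than any specific TNN.

```python
# Toy tensorised linear layer: W approximated by a Kronecker pair, ~2k params instead of ~1M.
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    def __init__(self, m: int = 32, n: int = 32):
        super().__init__()
        self.A = nn.Parameter(torch.randn(m, m) / m ** 0.5)
        self.B = nn.Parameter(torch.randn(n, n) / n ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, m * n)
        X = x.view(-1, self.A.shape[0], self.B.shape[0])
        Y = self.A @ X @ self.B.T        # applies the Kronecker-structured map without
        return Y.reshape(x.shape)        # ever forming the full (m*n) x (m*n) matrix
```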

Updated: 2025-03-17 15:33:59

Domains: cs.LG

Download: http://arxiv.org/abs/2302.09019v3

Goal2Story: A Multi-Agent Fleet based on Privately Enabled sLLMs for Impact Mapping on Requirements Elicitation

As requirements drift with rapid iterations, agile development becomes the dominant paradigm. Goal-driven Requirements Elicitation (RE) is a pivotal yet challenging task in agile project development due to its heavy tangling with adaptive planning and efficient collaboration. Recently, AI agents have shown promising ability in supporting requirements analysis by saving significant time and effort for stakeholders. However, current research mainly focuses on functional RE, and no research works have been reported that bridge the long journey from goal to user stories. Moreover, considering the cost of LLM facilities and the need for data and idea protection, privately hosted small-sized LLMs should be further utilized in RE. To address these challenges, we propose Goal2Story, a multi-agent fleet that adopts the Impact Mapping (IM) framework while using only cost-effective sLLMs for goal-driven RE. Moreover, we introduce a StorySeek dataset that contains over 1,000 user stories (USs) with corresponding goals and project context information, as well as the semi-automatic dataset construction method. For evaluation, we propose two metrics: Factuality Hit Rate (FHR), which measures consistency between the generated USs and the dataset, and Quality And Consistency Evaluation (QuACE), which evaluates the quality of the generated USs. Experimental results demonstrate that Goal2Story outperforms the baseline performance of the Super-Agent adopting powerful LLMs, while also showcasing the performance improvements in key metrics brought by CoT and Agent Profile to Goal2Story, as well as its exploration in identifying latent needs.

Updated: 2025-03-17 15:31:20

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2503.13279v1

Artificial Intelligence-Driven Prognostic Classification of COVID-19 Using Chest X-rays: A Deep Learning Approach

Background: The COVID-19 pandemic has overwhelmed healthcare systems, emphasizing the need for AI-driven tools to assist in rapid and accurate patient prognosis. Chest X-ray imaging is a widely available diagnostic tool, but existing methods for prognosis classification lack scalability and efficiency. Objective: This study presents a high-accuracy deep learning model for classifying COVID-19 severity (Mild, Moderate, and Severe) using Chest X-ray images, developed on Microsoft Azure Custom Vision. Methods: Using a dataset of 1,103 confirmed COVID-19 X-ray images from AIforCOVID, we trained and validated a deep learning model leveraging Convolutional Neural Networks (CNNs). The model was evaluated on an unseen dataset to measure accuracy, precision, and recall. Results: Our model achieved an average accuracy of 97%, with specificity of 99%, sensitivity of 87%, and an F1-score of 93.11%. When classifying COVID-19 severity, the model achieved accuracies of 89.03% (Mild), 95.77% (Moderate), and 81.16% (Severe). These results demonstrate the model's potential for real-world clinical applications, aiding in faster decision-making and improved resource allocation. Conclusion: AI-driven prognosis classification using deep learning can significantly enhance COVID-19 patient management, enabling early intervention and efficient triaging. Our study provides a scalable, high-accuracy AI framework for integrating deep learning into routine clinical workflows. Future work should focus on expanding datasets, external validation, and regulatory compliance to facilitate clinical adoption.

Updated: 2025-03-17 15:27:21

Domains: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.13277v1

Knowledge-Aware Iterative Retrieval for Multi-Agent Systems

We introduce a novel large language model (LLM)-driven agent framework, which iteratively refines queries and filters contextual evidence by leveraging dynamically evolving knowledge. A defining feature of the system is its decoupling of external sources from an internal knowledge cache that is progressively updated to guide both query generation and evidence selection. This design mitigates bias-reinforcement loops and enables dynamic, trackable search exploration paths, thereby optimizing the trade-off between exploring diverse information and maintaining accuracy through autonomous agent decision-making. Our approach is evaluated on a broad range of open-domain question answering benchmarks, including multi-step tasks that mirror real-world scenarios where integrating information from multiple sources is critical, especially given the vulnerabilities of LLMs that lack explicit reasoning or planning capabilities. The results show that the proposed system not only outperforms single-step baselines regardless of task difficulty but also, compared to conventional iterative retrieval methods, demonstrates pronounced advantages in complex tasks through precise evidence-based reasoning and enhanced efficiency. The proposed system supports both competitive and collaborative sharing of updated context, enabling multi-agent extension. The benefits of multi-agent configurations become especially prominent as task difficulty increases. The number of convergence steps scales with task difficulty, suggesting cost-effective scalability.
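
The decoupled cache and iterative refinement might be organized as in the sketch below; the prompts and stopping rule are assumptions for illustration, not the paper's exact control flow.

```python
# Sketch of the decoupled cache + iterative query refinement loop (control flow assumed).
from typing import Callable, List

def answer(question: str, llm: Callable[[str], str],
           search: Callable[[str], str], max_iters: int = 3) -> str:
    cache: List[str] = []                       # internal knowledge cache, kept separate
    for _ in range(max_iters):                  # from the external sources themselves
        query = llm(f"Notes so far: {cache}\nWrite the next search query for: {question}")
        docs = search(query)                    # external retrieval step
        kept = llm(f"From:\n{docs}\nKeep only evidence relevant to: {question}")
        cache.append(kept)                      # the updated cache steers the next query
        if "yes" in llm(f"Do these notes answer '{question}'? {cache} Reply yes or no").lower():
            break                               # trackable, early-stopping exploration path
    return llm(f"Answer '{question}' using only these notes: {cache}")
```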

Updated: 2025-03-17 15:27:02

Domains: cs.AI,cs.IR,I.2.0; I.2.7; I.2.11; H.3.3

Download: http://arxiv.org/abs/2503.13275v1

Graph Generative Models Evaluation with Masked Autoencoder

In recent years, numerous graph generative models (GGMs) have been proposed. However, evaluating these models remains a considerable challenge, primarily due to the difficulty in extracting meaningful graph features that accurately represent real-world graphs. The traditional evaluation techniques, which rely on graph statistical properties like node degree distribution, clustering coefficients, or Laplacian spectrum, overlook node features and lack scalability. There are newly proposed deep learning-based methods employing graph random neural networks or contrastive learning to extract graph features, demonstrating superior performance compared to traditional statistical methods, but their experimental results also demonstrate that these methods do not always work well across different metrics. Although there are overlaps among these metrics, they are generally not interchangeable, each evaluating generative models from a different perspective. In this paper, we propose a novel method that leverages graph masked autoencoders to effectively extract graph features for GGM evaluations. We conduct extensive experiments on graphs and empirically demonstrate that our method can be more reliable and effective than previously proposed methods across a number of GGM evaluation metrics, such as "Fr\'echet Distance (FD)" and "MMD Linear". However, no single method stands out consistently across all metrics and datasets. Therefore, this study also aims to raise awareness of the significance and challenges associated with GGM evaluation techniques, especially in light of recent advances in generative models.
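
For reference, the Fréchet Distance used in such evaluations is the standard Gaussian-moment form, computed over whatever features the extractor (here, a graph masked autoencoder) produces:

```python
# Standard Frechet Distance between two feature sets (e.g. real vs. generated graphs).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real            # discard numerical-noise imaginary parts
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2.0 * covmean))
```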

Updated: 2025-03-17 15:23:21

Domains: cs.LG

Download: http://arxiv.org/abs/2503.13271v1

Mirror Online Conformal Prediction with Intermittent Feedback

Online conformal prediction enables the runtime calibration of a pre-trained artificial intelligence model using feedback on its performance. Calibration is achieved through set predictions that are updated via online rules so as to ensure long-term coverage guarantees. While recent research has demonstrated the benefits of incorporating prior knowledge into the calibration process, this has come at the cost of replacing coverage guarantees with less tangible regret guarantees based on the quantile loss. This work introduces intermittent mirror online conformal prediction (IM-OCP), a novel runtime calibration framework that integrates prior knowledge, while maintaining long-term coverage and achieving sub-linear regret. IM-OCP features closed-form updates with minimal memory complexity, and is designed to operate under potentially intermittent feedback.
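
For flavor, a generic online conformal threshold update that simply skips rounds without feedback is sketched below; IM-OCP's actual mirror-descent update and coverage analysis are more refined.

```python
# Generic online conformal update that skips rounds with missing feedback (illustrative).
from typing import Callable, Iterable, Iterator, Set, Tuple

def online_conformal(stream: Iterable[Tuple[object, object]],
                     candidates: Callable[[object], Set[object]],
                     score: Callable[[object, object], float],
                     alpha: float = 0.1, eta: float = 0.05) -> Iterator[Set[object]]:
    q = 1.0                                            # running score threshold
    for x, y in stream:                                # y is None when feedback never arrives
        yield {c for c in candidates(x) if score(x, c) <= q}
        if y is not None:                              # intermittent feedback: update or skip
            miss = float(score(x, y) > q)              # 1 if the set missed the true label
            q += eta * (miss - alpha)                  # widen after misses, shrink otherwise
```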

Updated: 2025-03-17 15:16:47

Domains: cs.LG,eess.SP

Download: http://arxiv.org/abs/2503.10345v2

Zero-Knowledge Proof-Based Consensus for Blockchain-Secured Federated Learning

Federated learning (FL) enables multiple participants to collaboratively train machine learning models while ensuring their data remains private and secure. Blockchain technology further enhances FL by providing stronger security, a transparent audit trail, and protection against data tampering and model manipulation. Most blockchain-secured FL systems rely on conventional consensus mechanisms: Proof-of-Work (PoW) is computationally expensive, while Proof-of-Stake (PoS) improves energy efficiency but risks centralization as it inherently favors participants with larger stakes. Recently, learning-based consensus has emerged as an alternative by replacing cryptographic tasks with model training to save energy. However, this approach introduces potential privacy vulnerabilities, as the training process may inadvertently expose sensitive information through gradient sharing and model updates. To address these challenges, we propose a novel Zero-Knowledge Proof of Training (ZKPoT) consensus mechanism. This method leverages the zero-knowledge succinct non-interactive argument of knowledge proof (zk-SNARK) protocol to validate participants' contributions based on their model performance, effectively eliminating the inefficiencies of traditional consensus methods and mitigating the privacy risks posed by learning-based consensus. We analyze our system's security, demonstrating its capacity to prevent the disclosure of sensitive information about local models or training data to untrusted parties during the entire FL process. Extensive experiments demonstrate that our system is robust against privacy and Byzantine attacks while maintaining accuracy and utility without trade-offs, scalable across various blockchain settings, and efficient in both computation and communication.
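
A single consensus round might be organized as in the sketch below; zk_prove and zk_verify are hypothetical stand-ins for a zk-SNARK backend, and the leader-selection rule is an assumption, not the paper's specification.

```python
# High-level round of a ZKPoT-style consensus (method names hypothetical throughout).
def consensus_round(nodes, test_inputs):
    best = None
    for node in nodes:
        update = node.train_local()                     # ordinary federated local training
        acc = node.evaluate(update, test_inputs)
        proof = node.zk_prove(update, acc)              # proves "my update attains acc"
                                                        # without revealing data or gradients
        verified = all(peer.zk_verify(proof, acc) for peer in nodes if peer is not node)
        if verified and (best is None or acc > best[0]):
            best = (acc, node, update)
    acc, leader, update = best
    return leader.propose_block(update)                 # best verified contributor mints the block
```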

Updated: 2025-03-17 15:13:10

Domains: cs.DC,cs.CR

Download: http://arxiv.org/abs/2503.13255v1

GraphRouter: A Graph-based Router for LLM Selections

The rapidly growing number and variety of Large Language Models (LLMs) present significant challenges in efficiently selecting the appropriate LLM for a given query, especially considering the trade-offs between performance and computational cost. Current LLM selection methods often struggle to generalize across new LLMs and different tasks because of their limited ability to leverage contextual interactions among tasks, queries, and LLMs, as well as their dependence on a transductive learning framework. To address these shortcomings, we introduce a novel inductive graph framework, named as GraphRouter, which fully utilizes the contextual information among tasks, queries, and LLMs to enhance the LLM selection process. GraphRouter constructs a heterogeneous graph comprising task, query, and LLM nodes, with interactions represented as edges, which efficiently captures the contextual information between the query's requirements and the LLM's capabilities. Through an innovative edge prediction mechanism, GraphRouter is able to predict attributes (the effect and cost of LLM response) of potential edges, allowing for optimized recommendations that adapt to both existing and newly introduced LLMs without requiring retraining. Comprehensive experiments across three distinct effect-cost weight scenarios have shown that GraphRouter substantially surpasses existing routers, delivering a minimum performance improvement of 12.3%. In addition, it achieves enhanced generalization across new LLMs settings and supports diverse tasks with at least a 9.5% boost in effect and a significant reduction in computational demands. This work endeavors to apply a graph-based approach for the contextual and adaptive selection of LLMs, offering insights for real-world applications. Our codes for GraphRouter is released at https://github.com/ulab-uiuc/GraphRouter.
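
At its core, routing reduces to predicting the attributes of candidate query-LLM edges; a stripped-down scorer (embedding sources and graph layers omitted, dimensions assumed) could look like:

```python
# Stripped-down edge-attribute scorer over task/query/LLM node embeddings (illustrative).
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """Predicts (effect, cost) for a prospective query--LLM edge."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, task_h, query_h, llm_h):
        # A new LLM only needs an embedding, so routing stays inductive (no retraining).
        return self.mlp(torch.cat([task_h, query_h, llm_h], dim=-1))

def route(scorer, task_h, query_h, llm_embs, w_effect: float = 1.0, w_cost: float = 0.5) -> int:
    n = llm_embs.shape[0]
    preds = scorer(task_h.repeat(n, 1), query_h.repeat(n, 1), llm_embs)  # (n, 2)
    return int(torch.argmax(w_effect * preds[:, 0] - w_cost * preds[:, 1]))
```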

Updated: 2025-03-17 15:08:47

Domains: cs.AI

Download: http://arxiv.org/abs/2410.03834v2

Convolutional neural network for early detection of lameness and irregularity in horses using an IMU sensor

Lameness and gait irregularities are significant concerns in equine health management, affecting performance, welfare, and economic value. Traditional observational methods rely on subjective expert assessments, which can lead to inconsistencies in detecting subtle or early-stage lameness. While AI-based approaches have emerged, many require multiple sensors, force plates, or video systems, making them costly and impractical for field deployment. In this applied research study, we present a stride-level classification system that utilizes a single inertial measurement unit (IMU) and a one-dimensional convolutional neural network (1D CNN) to objectively differentiate between sound and lame horses, with a primary focus on the trot gait. The proposed system was tested under real-world conditions, achieving a 90% session-level accuracy with no false positives, demonstrating its robustness for practical applications. By employing a single, non-intrusive, and readily available sensor, our approach significantly reduces the complexity and cost of hardware requirements while maintaining high classification performance. These results highlight the potential of our CNN-based method as a field-tested, scalable solution for automated lameness detection. By enabling early diagnosis, this system offers a valuable tool for preventing minor gait irregularities from developing into severe conditions, ultimately contributing to improved equine welfare and performance in veterinary and equestrian practice.
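
A minimal 1D CNN of the kind described, with channel counts and kernel sizes assumed rather than taken from the paper, fits in a few lines:

```python
# Minimal 1D CNN for per-stride sound/lame classification from a single IMU.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(6, 32, kernel_size=5, padding=2),   # 6 channels: 3-axis accel + 3-axis gyro
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(32, 64, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                      # stride windows of any length -> one vector
    nn.Flatten(),
    nn.Linear(64, 2),                             # logits: sound vs. lame for one stride
)
# Session-level decisions can then aggregate per-stride outputs, e.g. by majority vote.
```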

Updated: 2025-03-17 15:05:01

标题: 使用IMU传感器的卷积神经网络早期检测马匹跛行和不规则行为

摘要: 跛行和步态异常是马匹健康管理中的重要问题,影响着表现、福利和经济价值。传统的观察方法依赖于主观专家评估,可能导致在检测轻微或早期跛行时出现不一致性。虽然基于人工智能的方法已经出现,但许多方法需要多个传感器、力板或视频系统,使其在现场部署方面成本高且不切实际。在这项应用研究中,我们提出了一个利用单一惯性测量单元(IMU)和一维卷积神经网络(1D CNN)的步幅级分类系统,以客观区分健康和跛行的马匹,主要关注小跑步态。所提出的系统在实际条件下进行了测试,在不出现误报的情况下实现了90%的会话级准确率,表明其在实际应用中的稳健性。通过采用单一、非侵入性和现成的传感器,我们的方法显著降低了硬件需求的复杂性和成本,同时保持了较高的分类性能。这些结果突出了我们基于CNN的方法作为经过实地测试的、可扩展的自动跛行检测解决方案的潜力。通过实现早期诊断,该系统为防止轻微步态异常发展为严重疾病提供了有价值的工具,最终有助于改善兽医和马术实践中的马匹福利和表现。

更新时间: 2025-03-17 15:05:01

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.13578v1

Learning Program Behavioral Models from Synthesized Input-Output Pairs

We introduce Modelizer - a novel framework that, given a black-box program, learns a model from its input/output behavior using neural machine translation algorithms. The resulting model mocks the original program: Given an input, the model predicts the output that would have been produced by the program. However, the model is also reversible - that is, the model can predict the input that would have produced a given output. Finally, the model is differentiable and can be efficiently restricted to predict only a certain aspect of the program behavior. Modelizer uses grammars to synthesize inputs and unsupervised tokenizers to decompose the resulting outputs, allowing it to learn sequence-to-sequence associations between token streams. Other than input grammars, Modelizer only requires the ability to execute the program. The resulting models are small, requiring fewer than 6.3 million parameters for languages such as Markdown or HTML; and they are accurate, achieving up to 95.4% accuracy and a BLEU score of 0.98 with standard error 0.04 in mocking real-world applications. As it learns from and predicts executions rather than code, Modelizer departs from the LLM-centric research trend, opening new opportunities for program-specific models that are fully tuned towards individual programs. Indeed, we foresee several applications of these models, especially as the output of the program can be any aspect of program behavior. Beyond mocking and predicting program behavior, the models can also synthesize inputs that are likely to produce a particular behavior, such as failures or coverage, thus assisting in program understanding and maintenance.
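
The grammar-driven input synthesis can be pictured with a toy random expansion; the grammar below is a made-up Markdown-like example, far simpler than the grammars Modelizer actually uses:

```python
# Toy grammar-based input synthesis: expand nonterminals recursively until
# only terminal strings remain. Illustrative only.
import random

GRAMMAR = {
    "<doc>":   [["<block>"], ["<block>", "\n", "<doc>"]],
    "<block>": [["# ", "<text>"], ["*", "<text>", "*"], ["<text>"]],
    "<text>":  [["hello"], ["world"], ["foo bar"]],
}

def expand(symbol: str) -> str:
    if symbol not in GRAMMAR:                  # terminal symbol
        return symbol
    return "".join(expand(s) for s in random.choice(GRAMMAR[symbol]))

synthesized_input = expand("<doc>")
print(synthesized_input)
```

Pairing each synthesized input with the program's output on it, both tokenized, yields the sequence-to-sequence training data described above.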

Updated: 2025-03-17 15:04:06

标题: 从合成的输入-输出对中学习程序行为模型

摘要: 我们介绍了Modelizer - 一个新颖的框架,通过神经机器翻译算法,从黑匣子程序的输入/输出行为中学习模型。所得模型模拟了原始程序:给定输入,模型预测程序会产生的输出。然而,该模型也是可逆的 - 也就是说,模型可以预测产生给定输出的输入。最后,该模型是可微分的,并且可以有效地限制为仅预测程序行为的某个方面。Modelizer使用语法来合成输入,并使用无监督的分词器来分解相应的输出,从而使其能够学习标记流之间的序列到序列关联。除了输入语法,Modelizer只需要能够执行程序的能力。所得模型较小,对于Markdown或HTML等语言,只需不到630万个参数;它们精确度高,在模拟现实应用时可达95.4%的准确率和0.98的BLEU分数(标准误差0.04)。由于它是从执行中学习和预测而不是从代码中学习,Modelizer跳出了以LLM为中心的研究趋势,为完全针对个别程序调优的程序特定模型开辟了新的机会。事实上,我们预见了这些模型的几种应用,尤其是因为程序的输出可以是程序行为的任何方面。除了模拟和预测程序行为,这些模型还可以合成可能产生特定行为的输入,例如失败或覆盖,从而有助于程序的理解和维护。

更新时间: 2025-03-17 15:04:06

领域: cs.SE,cs.LG,68T07 (Primary), 68N30 (Secondary), 68Q42,D.2.5; D.2.7; I.2.6; F.1.1; F.4.3

下载: http://arxiv.org/abs/2407.08597v2

AI-Driven Rapid Identification of Bacterial and Fungal Pathogens in Blood Smears of Septic Patients

Sepsis is a life-threatening condition which requires rapid diagnosis and treatment. Traditional microbiological methods are time-consuming and expensive. In response to these challenges, deep learning algorithms were developed to identify 14 bacteria species and 3 yeast-like fungi from microscopic images of Gram-stained smears of positive blood samples from sepsis patients. A total of 16,637 Gram-stained microscopic images were used in the study. The analysis used the Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification. Our model achieved an accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively. The highest values, reaching up to 96.2%, were obtained for Cutibacterium acnes, Enterococcus faecium, Stenotrophomonas maltophilia and Nakaseomyces glabratus. Classification difficulties were observed in closely related species, such as Staphylococcus hominis and Staphylococcus haemolyticus, due to morphological similarity, and within Candida albicans due to high morphotic diversity. The study confirms the potential of our model for microbial classification, but it also indicates the need for further optimisation and expansion of the training data set. In the future, this technology could support microbial diagnosis, reducing diagnostic time and improving the effectiveness of sepsis treatment due to its simplicity and accessibility. Part of the results presented in this publication was covered by a patent application at the European Patent Office EP24461637.1 "A computer implemented method for identifying a microorganism in a blood and a data processing system therefor".
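
The classification stage can be sketched as attention-based pooling over per-cell embeddings extracted from the segmented crops (in the spirit of attention-based deep multiple instance learning); the dimensions, the omitted feature extractor, and the 17-way head (14 bacteria + 3 fungi) are illustrative assumptions:

```python
# Illustrative attention-based MIL pooling: weight each instance (cell crop
# embedding), pool into one bag embedding, and classify the whole smear.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 512, attn_dim: int = 128, n_classes: int = 17):
        super().__init__()
        self.V = nn.Linear(feat_dim, attn_dim)
        self.w = nn.Linear(attn_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                       # bag: (n_instances, feat_dim)
        a = torch.softmax(self.w(torch.tanh(self.V(bag))), dim=0)  # (n, 1)
        z = (a * bag).sum(dim=0)                  # attention-weighted pooling
        return self.classifier(z), a              # logits + per-cell attention

logits, attention = AttentionMIL()(torch.randn(37, 512))  # 37 segmented cells
```

The attention weights also offer a readable by-product: cells that drive the smear-level prediction can be highlighted for review.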

Updated: 2025-03-17 15:02:49

标题: 人工智能驱动的快速识别脓毒症患者血涂片中的细菌和真菌病原体

摘要: 败血症是一种危及生命的疾病,需要快速诊断和治疗。传统的微生物学方法耗时且昂贵。为了解决这些挑战,开发了深度学习算法,用于从败血症患者阳性血样的革兰染色涂片的显微图像中识别14种细菌和3种酵母样真菌。 研究中使用了总共16,637张革兰染色显微图像。分析采用了Cellpose 3模型进行分割和基于注意力的深度多实例学习进行分类。我们的模型对细菌的准确率为77.15%,对真菌为71.39%,ROC AUC分别为0.97和0.88。最高值达到96.2%,见于Cutibacterium acnes、Enterococcus faecium、Stenotrophomonas maltophilia和Nakaseomyces glabratus。由于形态相似,在近缘物种(如Staphylococcus hominis和Staphylococcus haemolyticus)之间观察到分类困难;由于形态多样性高,在白色念珠菌(Candida albicans)内部也存在分类困难。 研究证实了我们模型在微生物分类方面的潜力,但也指出了需要进一步优化和扩展训练数据集的必要性。未来,由于其简单性和易获得性,这项技术可能支持微生物诊断,减少诊断时间并提高败血症治疗的效果。本出版物中呈现的部分结果已包含在欧洲专利局专利申请EP24461637.1“一种用于在血液中识别微生物的计算机实施方法及其数据处理系统”中。

更新时间: 2025-03-17 15:02:49

领域: eess.IV,cs.AI,cs.CE,cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.14542v1

Neural network-based Godunov corrections for approximate Riemann solvers using bi-fidelity learning

The Riemann problem is fundamental in the computational modeling of hyperbolic partial differential equations, enabling the development of stable and accurate upwind schemes. While exact solvers provide robust upwinding fluxes, their high computational cost necessitates approximate solvers. Although approximate solvers achieve accuracy in many scenarios, they produce inaccurate solutions in certain cases. To overcome this limitation, we propose constructing neural network-based surrogate models, trained using supervised learning, designed to map interior and exterior conservative state variables to the corresponding exact flux. Specifically, we propose two distinct approaches: one utilizing a vanilla neural network and the other employing a bi-fidelity neural network. The performance of the proposed approaches is demonstrated through applications to one-dimensional and two-dimensional partial differential equations, showcasing their robustness and accuracy.
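
For intuition, a single-fidelity version of the surrogate can be sketched for Burgers' equation, where the exact Godunov flux for $f(u) = u^2/2$ is available in closed form; the bi-fidelity variant would additionally pre-train on a cheaper approximate solver, and the network size and training schedule below are arbitrary:

```python
# Fit an MLP surrogate (uL, uR) -> exact Godunov flux for Burgers' equation.
# Godunov flux: min of f over [uL, uR] if uL <= uR, else max of f over [uR, uL].
import torch
import torch.nn as nn

def godunov_flux_burgers(uL, uR):
    f = lambda u: 0.5 * u ** 2
    # f is convex with minimum at u = 0, so the min over [uL, uR] is 0 when
    # the interval contains 0, otherwise the smaller endpoint value.
    fmin = torch.where((uL <= 0) & (0 <= uR), torch.zeros_like(uL),
                       torch.minimum(f(uL), f(uR)))
    return torch.where(uL <= uR, fmin, torch.maximum(f(uL), f(uR)))

states = torch.rand(10000, 2) * 4 - 2                    # (uL, uR) in [-2, 2]
flux = godunov_flux_burgers(states[:, 0], states[:, 1]).unsqueeze(1)

mlp = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for _ in range(2000):                                    # full-batch training
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(states), flux)
    loss.backward()
    opt.step()
```

Once trained, the surrogate replaces the flux evaluation inside the upwind scheme, which is where the cost savings over an exact solver would come from.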

Updated: 2025-03-17 15:01:26

标题: 利用双保真度学习为近似Riemann求解器构建基于神经网络的Godunov修正

摘要: 黎曼问题是双曲型偏微分方程计算建模的基础,支撑着稳定且精确的迎风格式的发展。虽然精确求解器可以提供稳健的迎风通量,但其高昂的计算成本使得近似求解器成为必要。虽然近似求解器在许多情况下可以达到足够的精度,但在某些情况下会产生不准确的解。为了克服这一限制,我们提出构建基于神经网络的代理模型,通过监督学习进行训练,旨在将内部和外部保守状态变量映射到相应的精确通量。具体而言,我们提出了两种不同的方法:一种利用基本神经网络,另一种采用双保真神经网络。通过在一维和二维偏微分方程上的应用展示了所提出方法的性能,体现了它们的稳健性和精度。

更新时间: 2025-03-17 15:01:26

领域: math.NA,cs.LG,cs.NA,physics.flu-dyn

下载: http://arxiv.org/abs/2503.13248v1

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective

As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements. We first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and six different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that (1) state-of-the-art explanation methods often disagree in terms of the explanations they output, and (2) machine learning practitioners often employ ad hoc heuristics when resolving such disagreements. These findings suggest that practitioners may be relying on misleading explanations when making consequential decisions. They also underscore the importance of developing principled frameworks for effectively evaluating and comparing explanations output by various explanation techniques.
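
Two of the simplest disagreement measures in this spirit, top-k feature agreement and rank agreement between attribution vectors, can be computed as follows (a simplified reading of the paper's quantitative framework, with toy attribution values):

```python
# Illustrative disagreement metrics between two explanations of the same
# prediction, given as per-feature attribution vectors.
import numpy as np

def feature_agreement(a: np.ndarray, b: np.ndarray, k: int = 5) -> float:
    """Fraction of shared features among each method's top-k by |attribution|."""
    top_a = set(np.argsort(-np.abs(a))[:k])
    top_b = set(np.argsort(-np.abs(b))[:k])
    return len(top_a & top_b) / k

def rank_agreement(a: np.ndarray, b: np.ndarray, k: int = 5) -> float:
    """Fraction of top-k positions where both methods rank the same feature."""
    ra = np.argsort(-np.abs(a))[:k]
    rb = np.argsort(-np.abs(b))[:k]
    return float(np.mean(ra == rb))

shap_vals = np.array([0.7, -0.2, 0.05, 0.4, -0.6])   # toy attributions
lime_vals = np.array([0.5, -0.3, 0.4, 0.1, -0.55])
print(feature_agreement(shap_vals, lime_vals, k=3),
      rank_agreement(shap_vals, lime_vals, k=3))
```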

Updated: 2025-03-17 15:00:52

标题: 可解释机器学习中的分歧问题:从实践者的角度看

摘要: 随着各种事后解释方法在高风险环境中越来越多地被用来解释复杂模型,深入理解这些方法输出的解释是否以及何时彼此不一致、以及实践中如何解决这些不一致,变得至关重要。然而,目前几乎没有研究能够提供对这些关键问题的答案。在这项研究中,我们介绍并研究了可解释机器学习中的不一致问题。更具体地,我们形式化了不同解释之间的不一致概念,分析了在实践中这种不一致发生的频率以及从业者如何解决这些不一致。我们首先进行了与数据科学家的访谈,以了解由不同方法生成的相同模型预测的解释之间的不一致构成什么,并引入了一个新颖的定量框架来形式化这一理解。然后,我们利用这一框架对四个真实世界数据集、六种最先进的事后解释方法和六种不同的预测模型进行了严格的经验分析,以衡量由各种流行解释方法生成的解释之间的不一致程度。此外,我们还进行了一项在线用户研究,与数据科学家一起了解他们如何解决上述不一致。我们的结果表明,(1)最先进的解释方法在输出的解释方面经常存在分歧,(2)机器学习从业者在解决这种分歧时经常采用临时启发式方法。这些发现表明,当做出重大决策时,从业者可能依赖于误导性的解释。这也强调了开发有原则的框架来有效评估和比较各种解释技术输出的解释的重要性。

更新时间: 2025-03-17 15:00:52

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2202.01602v5

Highly Efficient Direct Analytics on Semantic-aware Time Series Data Compression

Semantic communication has emerged as a promising paradigm to tackle the challenges of massive growing data traffic and sustainable data communication. It shifts the focus from data fidelity to goal-oriented or task-oriented semantic transmission. While deep learning-based methods are commonly used for semantic encoding and decoding, they struggle with the sequential nature of time series data and high computation cost, particularly in resource-constrained IoT environments. Data compression plays a crucial role in reducing transmission and storage costs, yet traditional data compression methods fall short of the demands of goal-oriented communication systems. In this paper, we propose a novel method for direct analytics on time series data compressed by the SHRINK compression algorithm. Through experimentation using outlier detection as a case study, we show that our method outperforms baselines running on uncompressed data in multiple cases, with merely 1% difference in the worst case. Additionally, it achieves four times lower runtime on average and accesses approximately 10% of the data volume, which enables edge analytics with limited storage and computation power. These results demonstrate that our approach offers reliable, high-speed outlier detection analytics for diverse IoT applications while extracting semantics from time-series data, achieving high compression, and reducing data transmission.

Updated: 2025-03-17 14:58:22

标题: 高效直接分析语义感知时间序列数据压缩

摘要: 语义通信已经成为解决日益增长的数据流量和可持续数据通信挑战的一种有前途的范式。它将焦点从数据的保真度转移到目标导向或任务导向的语义传输。虽然基于深度学习的方法通常用于语义编码和解码,但它们在处理时间序列数据的顺序性和高计算成本方面遇到困难,特别是在资源受限的物联网环境中。数据压缩在降低传输和存储成本方面起着至关重要的作用,然而传统的数据压缩方法无法满足目标导向通信系统的需求。在本文中,我们提出了一种对经SHRINK压缩算法压缩的时间序列数据进行直接分析的新方法。通过以异常检测为案例研究的实验,我们展示了我们的方法在多种情况下优于在未压缩数据上运行的基线,最差情况下仅有1%的差异。此外,其平均运行时间降低至原来的四分之一,访问的数据量仅为原始数据的约10%,这使得在存储和计算能力受限的边缘设备上进行分析成为可能。这些结果表明,我们的方法为各种物联网应用提供可靠、高速的异常检测分析,同时从时间序列数据中提取语义,实现高压缩并减少数据传输。

更新时间: 2025-03-17 14:58:22

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2503.13246v1

Hypergraph $p$-Laplacian regularization on point clouds for data interpolation

As a generalization of graphs, hypergraphs are widely used to model higher-order relations in data. This paper explores the benefit of the hypergraph structure for the interpolation of point cloud data that contain no explicit structural information. We define the $\varepsilon_n$-ball hypergraph and the $k_n$-nearest neighbor hypergraph on a point cloud and study the $p$-Laplacian regularization on the hypergraphs. We prove the variational consistency between the hypergraph $p$-Laplacian regularization and the continuum $p$-Laplacian regularization in a semisupervised setting when the number of points $n$ goes to infinity while the number of labeled points remains fixed. A key improvement compared to the graph case is that the results rely on weaker assumptions on the upper bound of $\varepsilon_n$ and $k_n$. To solve the convex but non-differentiable large-scale optimization problem, we utilize the stochastic primal-dual hybrid gradient algorithm. Numerical experiments on data interpolation verify that the hypergraph $p$-Laplacian regularization outperforms the graph $p$-Laplacian regularization in preventing the development of spikes at the labeled points.
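
The $k_n$-nearest-neighbor construction and the hypergraph energy can be sketched directly; the unweighted $(\max - \min)^p$ form below is a common choice in the hypergraph total-variation literature and stands in for the paper's precise weighted definition, and the stochastic primal-dual solver is omitted:

```python
# Illustrative k-NN hypergraph on a point cloud: each point spawns a hyperedge
# containing itself and its k nearest neighbours; the p-Laplacian regularizer
# sums (max - min of u over each hyperedge)^p.
import numpy as np

def knn_hyperedges(X: np.ndarray, k: int) -> np.ndarray:
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k + 1]      # column 0 is the point itself

def p_laplacian_energy(u: np.ndarray, edges: np.ndarray, p: float = 4.0) -> float:
    vals = u[edges]                               # (n_edges, k + 1)
    return float(np.sum((vals.max(1) - vals.min(1)) ** p))

X = np.random.rand(200, 2)                        # unstructured point cloud
u = np.random.rand(200)                           # candidate label function
E = knn_hyperedges(X, k=8)
print(p_laplacian_energy(u, E))
```

Semi-supervised interpolation then minimizes this energy over all u agreeing with the labeled points, which is the convex non-differentiable problem the paper solves at scale.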

Updated: 2025-03-17 14:57:22

标题: 超图$p$-Laplacian正则化在点云数据插值中的应用

摘要: 作为图的一种泛化,超图被广泛用于建模数据中的高阶关系。本文探讨了超图结构对不包含显式结构信息的点云数据插值的好处。我们在点云上定义了$\varepsilon_n$-球超图和$k_n$-最近邻超图,并研究了超图上的$p$-拉普拉斯正则化。我们证明了在半监督设置下,当点的数量$n$趋于无穷而标记点的数量保持固定时,超图$p$-拉普拉斯正则化与连续$p$-拉普拉斯正则化之间的变分一致性。与图的情况相比,一个关键改进是结果依赖于对$\varepsilon_n$和$k_n$上界的较弱假设。为了解决凸但不可微的大规模优化问题,我们利用了随机原始-对偶混合梯度算法。数据插值的数值实验验证了超图$p$-拉普拉斯正则化在防止标记点处出现尖峰方面优于图$p$-拉普拉斯正则化。

更新时间: 2025-03-17 14:57:22

领域: math.NA,cs.LG,cs.NA,math.AP,49J55, 35J20, 65N12

下载: http://arxiv.org/abs/2405.01109v2

Training Neural Networks as Recognizers of Formal Languages

Characterizing the computational power of neural network architectures in terms of formal language theory remains a crucial line of research, as it describes lower and upper bounds on the reasoning capabilities of modern AI. However, when empirically testing these bounds, existing work often leaves a discrepancy between experiments and the formal claims they are meant to support. The problem is that formal language theory pertains specifically to recognizers: machines that receive a string as input and classify whether it belongs to a language. On the other hand, it is common instead to evaluate language models on proxy tasks, e.g., language modeling or sequence-to-sequence transduction, that are similar in only an informal sense to the underlying theory. We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings, using a general method that can be applied to a wide variety of languages. As part of this, we extend an algorithm recently proposed by Sn{\ae}bjarnarson et al. (2025) for efficient length-controlled sampling of strings from regular languages. We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures: a simple RNN, an LSTM, and a causally-masked transformer. We find that the RNN and LSTM often outperform the transformer, and that auxiliary training objectives such as language modeling can help, although no single objective uniformly improves performance across languages and architectures. Our contributions will facilitate theoretically sound empirical testing of language recognition claims in future work. We have released our datasets as a benchmark called FLaRe (Formal Language Recognition), along with our code.
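
Training a recognizer rather than a language model amounts to binary classification of whole strings; the toy below trains a vanilla RNN on the regular language "even number of 1s" (parity), with string lengths fixed for brevity rather than controlled as in the paper's sampling algorithm:

```python
# Minimal language recognizer: an RNN reads a binary string and emits one
# logit for membership in the language {w : w has an even number of 1s}.
import torch
import torch.nn as nn

class RNNRecognizer(nn.Module):
    def __init__(self, vocab: int = 2, hidden: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (B, T) token ids
        h, _ = self.rnn(self.emb(x))
        return self.out(h[:, -1, :]).squeeze(-1)   # membership logit

model = RNNRecognizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(2000):
    x = torch.randint(0, 2, (64, 16))              # random binary strings
    y = (x.sum(1) % 2 == 0).float()                # label: even number of 1s
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```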

Updated: 2025-03-17 14:51:27

标题: 将神经网络训练为形式语言识别器

摘要: 用形式语言理论刻画神经网络架构的计算能力仍然是一个重要的研究方向,因为它描述了现代人工智能推理能力的下限和上限。然而,在经验性地测试这些界限时,现有的研究往往存在实验和它们所支持的形式主张之间的差异。问题在于形式语言理论特指识别器:接收字符串输入并分类它是否属于某种语言的机器。另一方面,通常的做法是在代理任务上评估语言模型,例如语言建模或序列到序列转换,这些任务仅在非正式意义上与基础理论相似。我们通过直接将神经网络训练和评估为字符串的二元分类器来纠正这种不匹配,所用的通用方法可以应用于各种语言。作为其中的一部分,我们扩展了Sn{\ae}bjarnarson等人(2025年)最近提出的一种从正则语言中高效进行长度受控采样的算法。我们针对三种神经架构(简单RNN、LSTM和因果掩码Transformer)给出了Chomsky层次结构中各种语言上的结果。我们发现RNN和LSTM通常优于Transformer,并且语言建模等辅助训练目标可以有所帮助,尽管没有单一目标能够在所有语言和架构上一致地提高性能。我们的贡献将有助于未来工作中对语言识别主张进行理论上合理的经验测试。我们已将数据集作为名为FLaRe(Formal Language Recognition)的基准发布,并附带我们的代码。

更新时间: 2025-03-17 14:51:27

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2411.07107v2

Gradient Extrapolation for Debiased Representation Learning

Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations to define the target gradient as the linear extrapolation of two gradients computed from each batch's loss. It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with the smaller amount of spurious correlation, can guide the training process toward learning a debiased model. GERNE can serve as a general framework for debiasing with methods, such as ERM, reweighting, and resampling, being shown as special cases. The theoretical upper and lower bounds of the extrapolation factor are derived to ensure convergence. By adjusting this factor, GERNE can be adapted to maximize the Group-Balanced Accuracy (GBA) or the Worst-Group Accuracy. The proposed approach is validated on five vision and one NLP benchmarks, demonstrating competitive and often superior performance compared to state-of-the-art baseline methods.
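
The core update is easy to state in code: given a batch with more spurious correlation and one with less, extrapolate linearly from the first gradient toward the second. The sketch below is our reading of the abstract, with an arbitrary extrapolation factor c (c = 0 recovers ERM on the biased batch, c = 1 trains on the less-biased batch, c > 1 extrapolates beyond it):

```python
# Illustrative GERNE-style step: per-parameter gradient extrapolation
# g = g1 + c * (g2 - g1), where batch 2 has less spurious correlation.
import torch

def gerne_step(model, loss_fn, batch_biased, batch_less_biased, opt, c=1.5):
    x1, y1 = batch_biased
    x2, y2 = batch_less_biased
    g1 = torch.autograd.grad(loss_fn(model(x1), y1), model.parameters())
    g2 = torch.autograd.grad(loss_fn(model(x2), y2), model.parameters())
    opt.zero_grad()
    for p, a, b in zip(model.parameters(), g1, g2):
        p.grad = a + c * (b - a)        # extrapolated target gradient
    opt.step()
```

The paper's bounds on the extrapolation factor would constrain c here; this sketch leaves it as a free hyperparameter.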

Updated: 2025-03-17 14:48:57

标题: 梯度外推用于无偏表示学习

摘要: 使用经验风险最小化(ERM)训练的机器学习分类模型往往会无意中依赖于虚假的相关性。当这些非目标属性与目标标签之间的意外关联在测试数据中不存在时,模型的泛化能力会变差。本文从模型优化的角度解决了这个问题,并提出了一种新方法,即梯度外推用于去偏表示学习(GERNE),旨在在已知和未知属性训练案例中学习去偏表示。GERNE使用具有不同数量虚假相关性的两个不同批次来定义目标梯度,即从每个批次的损失计算出的两个梯度的线性外推。实验表明,如果外推梯度指向虚假相关性较少的批次的梯度,就可以引导训练过程朝向学习去偏模型。GERNE可以作为去偏的通用框架,ERM、重新加权和重采样等方法都被证明是其特例。为了确保收敛,我们导出了外推因子的理论上限和下限。通过调整这个因子,GERNE可以被调整以最大化群体平衡准确率(GBA)或最差组准确率。所提出的方法在五个视觉和一个NLP基准上进行了验证,与最先进的基线方法相比,表现出有竞争力且通常更优的性能。

更新时间: 2025-03-17 14:48:57

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2503.13236v1

Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch

Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-label is largely deteriorated by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models' predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called Semi-supervised Aggregation for Globally-Enhanced Ensemble (SAGE), that can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at https://github.com/Jay-Codeman/SAGE
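
One way to picture confidence-discrepancy-guided correction (an illustrative simplification, not SAGE's exact rule; the thresholds and blending choice below are our assumptions) is to fall back to an ensemble of the local and global predictions whenever their confidences diverge:

```python
# Illustrative pseudo-label correction based on local/global confidence gaps.
import torch

def corrected_pseudo_labels(p_local, p_global, tau=0.95, delta=0.2):
    """p_local, p_global: (B, C) softmax outputs of local and global models."""
    conf_l, y_l = p_local.max(1)
    conf_g, _ = p_global.max(1)
    discrepancy = (conf_l - conf_g).abs()
    p_mix = 0.5 * (p_local + p_global)        # ensemble of the two views
    conf_m, y_m = p_mix.max(1)
    use_mix = discrepancy > delta             # disagreement -> trust ensemble
    labels = torch.where(use_mix, y_m, y_l)
    confidence = torch.where(use_mix, conf_m, conf_l)
    mask = confidence > tau                   # keep only confident pseudo-labels
    return labels, mask
```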

Updated: 2025-03-17 14:41:51

标题: 注意差距:置信度差异可以引导联邦半监督学习应对伪标签不匹配

摘要: 联邦半监督学习(Federated Semi-Supervised Learning,FSSL)旨在利用标记数据有限的各客户端上的未标记数据,训练具有强大泛化能力的全局模型。大多数FSSL方法依赖于使用伪标签的一致性正则化,将来自本地或全局模型的预测转换为硬伪标签作为监督信号。然而,我们发现伪标签的质量在很大程度上受到数据异质性的影响,这是联邦学习的固有特征。在本文中,我们深入研究了FSSL的问题,并展示了(1)异质性加剧了伪标签的不匹配,进一步降低了模型的性能和收敛性,以及(2)随着异质性的增加,本地和全局模型的预测倾向发散。受到这些发现的启发,我们提出了一种简单而有效的方法,称为面向全局增强集成的半监督聚合(SAGE),它可以基于置信度差异灵活地校正伪标签。这种策略有效地缓解了由不正确的伪标签导致的性能下降,并增强了本地和全局模型之间的共识。实验结果表明,SAGE在性能和收敛性方面优于现有的FSSL方法。我们的代码可在https://github.com/Jay-Codeman/SAGE找到。

更新时间: 2025-03-17 14:41:51

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2503.13227v1

ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction

Pre-trained models are valuable intellectual property, capturing both domain-specific and domain-invariant features within their weight spaces. However, model extraction attacks threaten these assets by enabling unauthorized source-domain inference and facilitating cross-domain transfer via the exploitation of domain-invariant features. In this work, we introduce **ProDiF**, a novel framework that leverages targeted weight space manipulation to secure pre-trained models against extraction attacks. **ProDiF** quantifies the transferability of filters and perturbs the weights of critical filters in unsecured memory, while preserving actual critical weights in a Trusted Execution Environment (TEE) for authorized users. A bi-level optimization further ensures resilience against adaptive fine-tuning attacks. Experimental results show that **ProDiF** reduces source-domain accuracy to near-random levels and decreases cross-domain transferability by 74.65\%, providing robust protection for pre-trained models. This work offers comprehensive protection for pre-trained DNN models and highlights the potential of weight space manipulation as a novel approach to model security.

Updated: 2025-03-17 14:37:42

标题: ProDiF:保护领域不变特征以确保预训练模型免受提取攻击

摘要: 预训练模型是有价值的知识产权,其权重空间中同时捕捉了领域特定和领域不变特征。然而,模型提取攻击通过实现未经授权的源领域推断、并利用领域不变特征促进跨领域迁移,威胁着这些资产。在这项工作中,我们介绍了**ProDiF**,一个利用有针对性的权重空间操作来保护预训练模型免受提取攻击的新颖框架。**ProDiF**量化了过滤器的可迁移性,并扰动不安全内存中关键过滤器的权重,同时将真实的关键权重保存在受信任执行环境(TEE)中供授权用户使用。双层优化进一步确保了对自适应微调攻击的抵御能力。实验结果显示,**ProDiF**将源领域准确率降低到接近随机水平,并将跨领域可迁移性降低了74.65%,为预训练模型提供了强大的保护。这项工作为预训练DNN模型提供了全面的保护,并突显了权重空间操作作为一种新颖模型安全方法的潜力。

更新时间: 2025-03-17 14:37:42

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2503.13224v1

Robust Decision-Making Via Free Energy Minimization

Despite their groundbreaking performance, state-of-the-art autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training/environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long-standing challenge when deploying agents in the real world. Here, departing from mainstream views seeking robustness through training, we introduce DR-FREE, a free energy model that installs this core property by design. It directly wires robustness into the agent decision-making mechanisms via free energy minimization. By combining a robust extension of the free energy principle with a novel resolution engine, DR-FREE returns a policy that is optimal-yet-robust against ambiguity. Moreover, for the first time, it reveals the mechanistic role of ambiguity on optimal decisions and requisite Bayesian belief updating. We evaluate DR-FREE on an experimental testbed involving real rovers navigating an ambiguous environment filled with obstacles. Across all the experiments, DR-FREE enables robots to successfully navigate towards their goal even when, in contrast, standard free energy minimizing agents that do not use DR-FREE fail. In short, DR-FREE can tackle scenarios that elude previous methods: this milestone may inspire both deployment in multi-agent settings and, at a perhaps deeper level, the quest for a biologically plausible explanation of how natural agents - with little or no training - survive in capricious environments.

Updated: 2025-03-17 14:36:08

标题: 通过自由能最小化实现稳健决策

摘要: 尽管最先进的自主代理在性能方面具有突破性,但当训练和环境条件变得不一致时,它们可能会行为失常,微小的不匹配就可能导致不良行为甚至灾难性失败。对于智能代理来说,对这些训练/环境模糊性的鲁棒性是一个核心要求,而在实际部署代理时实现这一要求一直是一个长期挑战。在这里,我们摒弃了通过训练来寻求鲁棒性的主流观点,引入了DR-FREE,一个通过设计植入这一核心属性的自由能模型。它通过自由能最小化直接将鲁棒性融入代理的决策机制中。通过结合自由能原则的鲁棒扩展和一种新颖的解析引擎,DR-FREE返回一个对于模糊性而言既最优又稳健的策略。此外,它首次揭示了模糊性在最优决策和所需的贝叶斯信念更新中的机制性作用。我们在一个涉及真实漫游器在充满障碍物的模糊环境中导航的实验测试平台上评估了DR-FREE。在所有实验中,DR-FREE使机器人成功地朝着目标导航,而相比之下,不使用DR-FREE的标准自由能最小化代理则失败了。总之,DR-FREE可以处理以往方法无法解决的场景:这一里程碑可能激励在多代理设置中的部署,并且在更深层次上,可能会启发对自然代理如何在几乎或完全没有训练的情况下在变幻莫测的环境中生存的生物学合理解释的探索。

更新时间: 2025-03-17 14:36:08

领域: cs.AI,cs.SY,eess.SY,math.OC

下载: http://arxiv.org/abs/2503.13223v1

Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs

In knowledge-intensive tasks, especially in high-stakes domains like medicine and law, it is critical not only to retrieve relevant information but also to provide causal reasoning and explainability. Large language models (LLMs) have achieved remarkable performance in natural language understanding and generation tasks. However, they often suffer from limitations such as difficulty in incorporating new knowledge, generating hallucinations, and explaining their reasoning process. To address these challenges, integrating knowledge graphs with Graph Retrieval-Augmented Generation (Graph RAG) has emerged as an effective solution. Traditional Graph RAG methods often rely on simple graph traversal or semantic similarity, which do not capture causal relationships or align well with the model's internal reasoning steps. This paper proposes a novel pipeline that filters large knowledge graphs to emphasize cause-effect edges, aligns the retrieval process with the model's chain-of-thought (CoT), and enhances reasoning through multi-stage path improvements. Experiments on medical question-answering tasks show consistent gains, with up to a 10\% absolute improvement across multiple large language models (LLMs). This approach demonstrates the value of combining causal reasoning with stepwise retrieval, leading to more interpretable and logically grounded solutions for complex queries.

Updated: 2025-03-17 14:32:08

标题: 因果图遇见思维:增强图增强型LLMs中的复杂推理

摘要: 在知识密集型任务中,尤其是在医学和法律等高风险领域,不仅检索相关信息至关重要,还要提供因果推理和可解释性。大型语言模型(LLMs)在自然语言理解和生成任务中取得了显著的性能。然而,它们经常受到诸如难以整合新知识、生成幻觉和难以解释推理过程等限制。为了解决这些挑战,将知识图与图检索增强生成(Graph RAG)相结合已经成为一种有效的解决方案。传统的图检索增强生成方法通常依赖于简单的图遍历或语义相似性,这并不能捕捉因果关系,也不能与模型内部推理步骤很好地对齐。本文提出了一种新颖的流程,对大型知识图进行过滤以强调因果边,将检索过程与模型的思维链(CoT)对齐,并通过多阶段路径改进增强推理。在医学问答任务的实验中显示出一致的增益,在多个大型语言模型(LLMs)上的绝对改进最高达10%。这种方法展示了将因果推理与分步检索相结合的价值,为复杂查询提供更可解释和逻辑基础更扎实的解决方案。

更新时间: 2025-03-17 14:32:08

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2501.14892v2

Can Language Models Follow Multiple Turns of Entangled Instructions?

Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with around 1.1K high-quality multi-turn conversations through the human-in-the-loop approach and result in nine capability categories, including statics and dynamics, reasoning, and multitasking. Our finding reveals an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks but their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.

Updated: 2025-03-17 14:31:37

标题: 语言模型能否遵循多轮相互交织的指令?

摘要: 尽管在提高大型语言模型(LLMs)的指令跟随能力方面取得了显著成就,但处理多个可能相互纠缠或冲突的指令的能力仍然是一个相当大的挑战。现实世界中的场景通常要求多条指令随时间保持一致,例如隐私保密、个人偏好和优先级排序,这需要整合多个回合的复杂能力,并在指令相交或冲突时仔细平衡相互竞争的目标。本研究系统地调查了LLMs在处理多轮指令时的能力,涵盖了三个难度级别:(1)从指令中检索信息,(2)跨回合跟踪与推理,以及(3)解决指令之间的冲突。我们通过人机协作方法构建了包含约1.1K个高质量多轮对话的MultiTurnInstruct,并得出了九个能力类别,包括静态和动态、推理和多任务处理。我们的研究揭示了不同能力之间有趣的权衡。虽然GPT模型展示了优越的记忆能力,但在需要选择性隐瞒信息的隐私保护任务中表现出较低的效率。更大的模型展现了更强的推理能力,但在解决冲突指令方面仍然存在困难。重要的是,这些表现差距不能仅归因于信息丢失,因为模型在记忆任务上取得了较高的BLEU分数,但它们的注意力机制未能有效整合多个相关指令。这些发现突显了在涉及多轮指令的复杂现实世界任务中需要改进的关键领域。

更新时间: 2025-03-17 14:31:37

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.13222v1

Dense Policy: Bidirectional Autoregressive Learning of Actions

Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our policy, example data, and training code will be publicly available upon publication. Project page: https://selen-suyue.github.io/DspNet/.
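
The coarse-to-fine unfolding can be sketched as repeated sequence doubling followed by a full refinement pass, giving O(log T) network calls for horizon T; this is our reading of the abstract, the encoder below is a generic stand-in for the paper's lightweight encoder-only architecture, and conditioning on observations is omitted:

```python
# Illustrative coarse-to-fine action decoding: double the sequence length and
# refine all positions jointly until the target horizon is reached.
import torch
import torch.nn as nn

class DensePolicySketch(nn.Module):
    def __init__(self, act_dim: int = 7, d_model: int = 128):
        super().__init__()
        self.inp = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, act_dim)

    def forward(self, seed, horizon: int = 16):
        seq = seed                                       # (B, 1, act_dim)
        while seq.shape[1] < horizon:
            seq = seq.repeat_interleave(2, dim=1)        # double the length
            seq = self.out(self.encoder(self.inp(seq)))  # refine every step
        return seq                                       # (B, horizon, act_dim)

plan = DensePolicySketch()(torch.zeros(2, 1, 7))         # 4 refinement passes
```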

Updated: 2025-03-17 14:28:08

标题: 密集策略:动作的双向自回归学习

摘要: 主流的视觉运动策略主要依赖于生成模型进行整体动作预测,而当前预测下一个标记或块的自回归策略表现欠佳。这促使人们寻找更有效的学习方法,释放自回归策略在机器人操作中的潜力。本文介绍了一种名为密集策略的双向扩展学习方法,以建立动作预测中自回归策略的新范式。它采用轻量级的仅编码器架构,以由粗到细的方式将动作序列从初始单帧迭代展开为目标序列,推断时间为对数级。大量实验验证了我们的密集策略具有优越的自回归学习能力,并可以超越现有的整体生成策略。我们的策略、示例数据和训练代码将在发表后公开。项目页面:https://selen-suyue.github.io/DspNet/。

更新时间: 2025-03-17 14:28:08

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.13217v1

Standardizing Structural Causal Models

Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like $\operatorname{Var}$-sortability and $\operatorname{R^2}$-sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not $\operatorname{Var}$-sortable. We also find empirical evidence that they are mostly not $\operatorname{R^2}$-sortable for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here. Our code is publicly available at: https://github.com/werkaaa/iscm.
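
A linear iSCM is simple to sample: generate each variable from its parents plus noise, then standardize it before any child consumes it, so variance cannot accumulate along the causal order. The toy three-node graph and weights below are illustrative:

```python
# Illustrative linear iSCM sampler with per-node internal standardization.
import numpy as np

def sample_linear_iscm(W: np.ndarray, n: int, rng=np.random.default_rng(0)):
    """W[i, j] != 0 means edge i -> j; nodes assumed topologically ordered."""
    d = W.shape[0]
    X = np.zeros((n, d))
    for j in range(d):
        raw = X[:, :j] @ W[:j, j] + rng.normal(size=n)   # parents + noise
        X[:, j] = (raw - raw.mean()) / raw.std()         # internal standardization
    return X

W = np.array([[0.0, 1.0, 0.8],
              [0.0, 0.0, 1.2],
              [0.0, 0.0, 0.0]])
X = sample_linear_iscm(W, n=5000)
print(X.var(axis=0))   # unit variance at every node, hence not Var-sortable
```

Contrast this with post-hoc standardization of ordinary SCM data, which rescales columns after the fact; the in-process standardization above changes what downstream nodes actually receive during generation.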

Updated: 2025-03-17 14:26:33

标题: 标准化结构因果模型

摘要: 由结构因果模型(SCMs)生成的合成数据集通常用于基准测试因果结构学习算法。然而,在SCM数据中,方差和成对相关性往往沿因果顺序增加。一些流行的算法利用这些人为因素,可能导致结论不能推广到现实世界的情境。现有的度量标准如Var-sortability和R^2-sortability量化这些模式,但它们并没有提供解决这些问题的工具。为了解决这个问题,我们提出了内部标准化的结构因果模型(iSCMs),这是SCMs的一种修改,在生成过程中引入了每个变量的标准化操作。通过构造,iSCMs不是Var-sortable。我们还发现实证证据表明,对于常用的图形族,它们大多数不是R^2-sortable。此外,与标准SCMs生成的数据的事后标准化相反,我们证明线性iSCMs在权重的先验知识上较难识别,并且不会在大系统中崩溃成确定性关系,这可能使iSCMs成为超越本文研究的基准测试问题的因果推断中的有用模型。我们的代码可在以下网址公开获取:https://github.com/werkaaa/iscm。

更新时间: 2025-03-17 14:26:33

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.11601v3

When Should We Orchestrate Multiple Agents?

Strategies for orchestrating the interactions between multiple agents, both human and artificial, can wildly overestimate performance and underestimate the cost of orchestration. We design a framework to orchestrate agents under realistic conditions, such as inference costs or availability constraints. We show theoretically that orchestration is only effective if there are performance or cost differentials between agents. We then empirically demonstrate how orchestration between multiple agents can be helpful for selecting agents in a simulated environment, picking a learning strategy in the infamous Rogers' Paradox from social science, and outsourcing tasks to other agents during a question-answer task in a user study.

Updated: 2025-03-17 14:26:07

标题: 我们何时应该协调多个代理?

摘要: 协调人类与人工等多个代理之间交互的策略,可能会严重高估性能并低估编排的成本。我们设计了一个框架来在现实条件(比如推理成本或可用性约束)下编排代理。我们从理论上表明,只有在代理之间存在性能或成本差异时,编排才是有效的。然后,我们通过实证方法展示了编排多个代理如何在以下场景中发挥作用:在模拟环境中选择代理、在社会科学中著名的罗杰斯悖论中选择学习策略,以及在用户研究的问答任务中将任务外包给其他代理。

更新时间: 2025-03-17 14:26:07

领域: cs.MA,cs.CY,cs.LG

下载: http://arxiv.org/abs/2503.13577v1

Proportional Aggregation of Preferences for Sequential Decision Making

We study the problem of fair sequential decision making given voter preferences. In each round, a decision rule must choose a decision from a set of alternatives where each voter reports which of these alternatives they approve. Instead of going with the most popular choice in each round, we aim for proportional representation across rounds, using axioms inspired by the multi-winner voting literature. The axioms require that every group of $\alpha\%$ of the voters that agrees in every round (i.e., approves a common alternative), must approve at least $\alpha\%$ of the decisions. A stronger version of the axioms requires that every group of $\alpha\%$ of the voters that agrees in a $\beta$ fraction of rounds must approve $\beta\cdot\alpha\%$ of the decisions. We show that three attractive voting rules satisfy axioms of this style. One of them (Sequential Phragm\'en) makes its decisions online, and the other two satisfy strengthened versions of the axioms but make decisions semi-online (Method of Equal Shares) or fully offline (Proportional Approval Voting). We present empirical results for these rules based on synthetic data and U.S. political elections. We also run experiments using the moral machine dataset about ethical dilemmas: We train preference models on user responses from different countries and let the models cast votes. We find that aggregating these votes using our rules leads to a more equal utility distribution across demographics than making decisions using a single global preference model.

Updated: 2025-03-17 14:25:48

标题: 顺序决策制定中的偏好比例聚合

摘要: 我们研究了在给定选民偏好的情况下进行公平的顺序决策的问题。在每一轮中,决策规则必须从一组备选方案中选择一个决策,而每个选民都会报告他们赞成其中哪些备选方案。我们不是在每一轮中选择最受欢迎的方案,而是借助受多胜者投票文献启发的公理,追求跨轮次的比例代表性。这些公理要求每个在每一轮中达成一致(即赞成一个共同备选方案)的$\alpha\%$选民组至少要赞成$\alpha\%$的决策。公理的更强版本要求每个在$\beta$比例的轮次中达成一致的$\alpha\%$选民组必须赞成$\beta\cdot\alpha\%$的决策。我们展示了三种有吸引力的投票规则满足这类公理。其中一种(顺序Phragm\'en)在线做出决策,另外两种满足加强版本的公理,但分别以半在线(等额份额法)或完全离线(比例赞成投票)的方式做出决策。我们基于合成数据和美国政治选举给出了这些规则的实证结果。我们还使用涉及伦理困境的道德机器(Moral Machine)数据集进行实验:我们在不同国家的用户响应上训练偏好模型,并让模型进行投票。我们发现,与使用单一全局偏好模型做出决策相比,使用我们的规则聚合这些投票能在不同人群之间带来更均等的效用分布。

更新时间: 2025-03-17 14:25:48

领域: cs.GT,cs.LG

下载: http://arxiv.org/abs/2306.14858v2

MAME: Multidimensional Adaptive Metamer Exploration with Human Perceptual Feedback

Alignment between human brain networks and artificial models is actively studied in machine learning and neuroscience. A widely adopted approach to explore their functional alignment is to identify metamers for both humans and models. Metamers refer to input stimuli that are physically different but equivalent within a given system. If a model's metameric space completely matched the human metameric space, the model would achieve functional alignment with humans. However, conventional methods lack direct ways to search for human metamers. Instead, researchers first develop biologically inspired models and then infer about human metamers indirectly by testing whether model metamers also appear as metamers to humans. Here, we propose the Multidimensional Adaptive Metamer Exploration (MAME) framework, enabling direct high-dimensional exploration of human metameric space. MAME leverages online image generation guided by human perceptual feedback. Specifically, it modulates reference images across multiple dimensions by leveraging hierarchical responses from convolutional neural networks (CNNs). Generated images are presented to participants whose perceptual discriminability is assessed in a behavioral task. Based on participants' responses, subsequent image generation parameters are adaptively updated online. Using our MAME framework, we successfully measured a human metameric space of over fifty dimensions within a single experiment. Experimental results showed that human discrimination sensitivity was lower for metameric images based on low-level features compared to high-level features, which image contrast metrics could not explain. The finding suggests that the model computes low-level information not essential for human perception. Our framework has the potential to contribute to developing interpretable AI and understanding of brain function in neuroscience.

Updated: 2025-03-17 14:23:04

标题: MAME:具有人类感知反馈的多维自适应变形体(metamer)探索

摘要: 人类大脑网络与人工模型之间的对齐在机器学习和神经科学中得到了积极研究。探索它们功能对齐的一种广泛采用的方法是为人类和模型都识别出变形体。变形体指的是在给定系统内物理上不同但等效的输入刺激。如果模型的变形空间完全匹配人类的变形空间,那么模型将与人类实现功能对齐。然而,传统方法缺乏直接寻找人类变形体的途径。相反,研究人员首先开发受生物启发的模型,然后间接推断人类变形体,方法是测试模型变形体是否也出现为人类的变形体。在这里,我们提出了多维自适应变形体探索(MAME)框架,实现了对人类变形空间的直接高维探索。MAME利用受人类感知反馈指导的在线图像生成。具体地,它通过利用卷积神经网络(CNNs)的分层响应,在多个维度上调节参考图像。生成的图像呈现给参与者,其感知区分能力在行为任务中进行评估。根据参与者的反应,随后的图像生成参数在线自适应更新。使用我们的MAME框架,我们成功地在单个实验中测量了超过五十个维度的人类变形空间。实验结果显示,与基于高级特征的变形图像相比,基于低级特征的变形图像的人类辨别敏感性较低,而图像对比度度量无法解释这一现象。这一发现表明,该模型计算出对人类感知并不重要的低级信息。我们的框架有潜力为发展可解释的人工智能和理解神经科学中的大脑功能做出贡献。

更新时间: 2025-03-17 14:23:04

领域: cs.LG

下载: http://arxiv.org/abs/2503.13212v1

MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis

Advancements in AI for medical imaging offer significant potential. However, their applications are constrained by the limited availability of data and the reluctance of medical centers to share it due to patient privacy concerns. Generative models present a promising solution by creating synthetic data as a substitute for real patient data. However, medical images are typically high-dimensional, and current state-of-the-art methods are often impractical for computational resource-constrained healthcare environments. These models rely on data sub-sampling, raising doubts about their feasibility and real-world applicability. Furthermore, many of these models are evaluated on quantitative metrics that alone can be misleading in assessing the image quality and clinical meaningfulness of the generated images. To address this, we introduce MedLoRD, a generative diffusion model designed for computational resource-constrained environments. MedLoRD is capable of generating high-dimensional medical volumes with resolutions up to 512$\times$512$\times$256, utilizing GPUs with only 24GB VRAM, which are commonly found in standard desktop workstations. MedLoRD is evaluated across multiple modalities, including Coronary Computed Tomography Angiography and Lung Computed Tomography datasets. Extensive evaluations through radiological evaluation, relative regional volume analysis, adherence to conditional masks, and downstream tasks show that MedLoRD generates high-fidelity images closely adhering to segmentation mask conditions, surpassing the capabilities of current state-of-the-art generative models for medical image synthesis in computational resource-constrained environments.

Updated: 2025-03-17 14:22:49

标题: MedLoRD: 一种用于高分辨率3D CT图像合成的医学低资源扩散模型

摘要: 医学影像人工智能的进展具有重要潜力。然而,由于数据的有限可用性和医疗中心出于患者隐私考虑而不愿分享数据,它们的应用受到限制。生成模型提供了一个有希望的解决方案,通过创建合成数据作为真实患者数据的替代品。然而,医学图像通常是高维的,当前的先进方法在计算资源受限的医疗环境中往往不切实际。这些模型依赖于数据子采样,对其可行性和实际应用性产生了疑问。此外,许多这些模型仅根据定量指标进行评估,在评估生成图像的质量和临床意义时可能会产生误导。为了解决这个问题,我们引入了MedLoRD,一个专为计算资源受限环境设计的生成扩散模型。MedLoRD能够生成分辨率高达512×512×256的高维医学体积,仅利用24GB VRAM的GPU,这在标准台式工作站中很常见。MedLoRD在多个模态下进行评估,包括冠状动脉计算机断层扫描和肺部计算机断层扫描数据集。通过放射学评估、相对区域体积分析、遵循条件掩码以及下游任务的广泛评估显示,MedLoRD生成的高保真度图像与分割掩码条件紧密符合,超过了当前计算资源受限环境中医学图像合成的先进生成模型的能力。

更新时间: 2025-03-17 14:22:49

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.13211v1

Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach

Prompt-tuning (PT) for large language models (LLMs) can facilitate the performance on various conventional NLP tasks with significantly fewer trainable parameters. However, our investigation reveals that PT provides limited improvement and may even degrade the primitive performance of LLMs on complex reasoning tasks. Such a phenomenon suggests that soft prompts can positively impact certain instances while negatively affecting others, particularly during the later phases of reasoning. To address these challenges, we first identify an information accumulation within the soft prompts. Through detailed analysis, we demonstrate that this phenomenon is often accompanied by erroneous information flow patterns in the deeper layers of the model, which ultimately lead to incorrect reasoning outcomes. We propose a novel method called \textbf{D}ynamic \textbf{P}rompt \textbf{C}orruption (DPC) to take better advantage of soft prompts in complex reasoning tasks, which dynamically adjusts the influence of soft prompts based on their impact on the reasoning process. Specifically, DPC consists of two stages: Dynamic Trigger and Dynamic Corruption. First, Dynamic Trigger measures the impact of soft prompts, identifying whether beneficial or detrimental. Then, Dynamic Corruption mitigates the negative effects of soft prompts by selectively masking key tokens that interfere with the reasoning process. We validate the proposed approach through extensive experiments on various LLMs and reasoning tasks, including GSM8K, MATH, and AQuA. Experimental results demonstrate that DPC can consistently enhance the performance of PT, achieving 4\%-8\% accuracy gains compared to vanilla prompt tuning, highlighting the effectiveness of our approach and its potential to enhance complex reasoning in LLMs.

Updated: 2025-03-17 14:20:48

标题: 用动态提示破坏提升复杂推理能力:一种软提示优化方法

摘要: 提示调优(Prompt-tuning,PT)可以在显著更少的可训练参数下提升大型语言模型(LLMs)在各种传统自然语言处理任务上的性能。然而,我们的调查发现,PT提供的改进有限,甚至可能降低LLMs在复杂推理任务上的原始性能。这种现象表明,软提示可以在某些情况下产生积极影响,同时在其他情况下产生消极影响,尤其是在推理的后阶段。为了解决这些挑战,我们首先确定了软提示中的信息累积现象。通过详细分析,我们证明这种现象通常伴随着模型深层中错误的信息流模式,最终导致错误的推理结果。我们提出了一种新方法,称为\textbf{D}ynamic \textbf{P}rompt \textbf{C}orruption(DPC),以便在复杂推理任务中更好地利用软提示,它根据软提示对推理过程的影响动态调整其作用。具体而言,DPC包括两个阶段:动态触发和动态破坏。首先,动态触发测量软提示的影响,确定其是有益还是有害。然后,动态破坏通过有选择地屏蔽干扰推理过程的关键标记,减轻软提示的负面影响。我们通过对各种LLMs和推理任务(包括GSM8K、MATH和AQuA)进行大量实验证明了所提出方法的有效性。实验结果表明,DPC可以始终提高PT的性能,与基本的提示调优相比,准确度提升4%-8%,突显了我们方法的有效性及其提高LLMs复杂推理能力的潜力。

更新时间: 2025-03-17 14:20:48

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.13208v1

Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques

Scaling large language models has driven remarkable advancements across various domains, yet the continual increase in model size presents significant challenges for real-world deployment. The Mixture of Experts (MoE) architecture offers a promising solution by dynamically selecting and activating only a subset of experts during inference, thus substantially reducing computational costs while preserving high performance. Despite these benefits, MoE introduces new inefficiencies, such as excessive parameters and communication overhead. In this work, we present a holistic study of compression techniques for Mixture of Experts to enhance both efficiency and scalability. While recent efforts have focused on Expert Trimming, which reduces the number of experts, these approaches still suffer from considerable communication and computational costs. To address this, we propose more aggressive strategies, such as Layer Drop, which removes entire MoE layers, and Block Drop, which eliminates transformer blocks. Surprisingly, these aggressive pruning techniques not only preserve model performance but also substantially improve computation and memory efficiency. Furthermore, beyond Expert Trimming, we also introduce Expert Slimming, which compresses individual experts to further boost performance and can be seamlessly integrated with Expert Trimming. Extensive experimental results demonstrate the effectiveness of our proposed methods, Layer Drop and Block Drop, along with the comprehensive recipe that integrates Expert Slimming and Expert Trimming, achieving a 6.05x speedup with 77.1% reduced memory usage while maintaining over 92% of performance on Mixtral-8x7B. Our code is released at https://github.com/CASE-Lab-UMD/Unified-MoE-Compression.
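
Layer Drop and Block Drop are structurally trivial to express for a transformer stored as a module list; the sketch below assumes a hypothetical block layout with a .moe sub-layer attribute, and leaves the question of which layers are safe to drop, which the paper answers empirically, as an input:

```python
# Illustrative Layer Drop / Block Drop on an nn.ModuleList of transformer
# blocks. The .moe attribute and the drop_ids selection are assumptions.
import torch.nn as nn

def block_drop(blocks: nn.ModuleList, drop_ids: set) -> nn.ModuleList:
    """Block Drop: remove entire transformer blocks by index."""
    return nn.ModuleList(b for i, b in enumerate(blocks) if i not in drop_ids)

def layer_drop(blocks: nn.ModuleList, drop_ids: set) -> nn.ModuleList:
    """Layer Drop: replace a block's MoE sub-layer with an identity,
    keeping its attention sub-layer intact."""
    for i in drop_ids:
        blocks[i].moe = nn.Identity()   # assumes each block exposes .moe
    return blocks
```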

Updated: 2025-03-17 14:18:42

标题: 朝着高效的专家混合:压缩技术的整体研究

摘要: 大规模语言模型的扩展推动了各个领域的显著进展,然而模型大小的持续增长为实际部署带来了重大挑战。混合专家(MoE)架构通过在推断期间动态选择和激活仅一部分专家,从而显著降低计算成本同时保持高性能,提供了一个有希望的解决方案。尽管具有这些优点,MoE引入了新的低效,如过多的参数和通信开销。在这项工作中,我们提出了Mixture of Experts的压缩技术的全面研究,以增强效率和可扩展性。虽然最近的努力集中在专家修剪上,减少专家数量,但这些方法仍然存在相当大的通信和计算成本。为了解决这个问题,我们提出了更激进的策略,比如Layer Drop,它删除整个MoE层,以及Block Drop,它消除transformer块。令人惊讶的是,这些激进的修剪技术不仅保持了模型性能,还显着提高了计算和内存效率。此外,除了专家修剪,我们还介绍了Expert Slimming,它压缩单个专家以进一步提高性能,并可以无缝集成专家修剪。广泛的实验结果表明,我们提出的方法-Layer Drop和Block Drop-以及整合Expert Slimming和Expert Trimming的综合配方,在Mixtral-8x7B上实现了6.05倍的加速,内存使用减少了77.1%,同时保持了超过92%的性能。我们的代码发布在https://github.com/CASE-Lab-UMD/Unified-MoE-Compression。

更新时间: 2025-03-17 14:18:42

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.02500v3

Could AI Leapfrog the Web? Evidence from Teachers in Sierra Leone

Although 85% of sub-Saharan Africa's population is covered by mobile broadband signal, only 37% use the internet, and those who do seldom use the web. The most frequently cited reason for low internet usage is the cost of data. We investigate whether AI can bridge this gap by analyzing 40,350 queries submitted to an AI chatbot by 469 teachers in Sierra Leone over 17 months. Teachers use AI for teaching assistance more frequently than web search. We compare the AI responses to the corresponding top search results for the same queries from the most popular local web search engine, google.com.sl. Only 2% of results for corresponding web searches contain content from within the country. Additionally, the average web search result consumes 3,107 times more data than an AI response. Bandwidth alone costs \$2.41 per thousand web search results loaded, while the total cost of AI is \$0.30 per thousand responses. As a result, AI is 87% less expensive than web search. In blinded evaluations, an independent sample of teachers rate AI responses as more relevant, helpful, and correct than web search results. These findings suggest that AI-driven solutions can cost-effectively bridge information gaps in low-connectivity regions.

Updated: 2025-03-17 14:14:56

标题: 人工智能能否实现网络跨越?来自塞拉利昂教师的证据

摘要: 尽管移动宽带信号覆盖了撒哈拉以南非洲85%的人口,但只有37%的人使用互联网,而使用互联网的人也很少浏览网页。互联网使用率低最常被提及的原因是数据费用。我们通过分析塞拉利昂469名教师在17个月内向AI聊天机器人提交的40,350个查询,来调查AI是否可以弥合这一鸿沟。教师更频繁地使用AI进行教学辅助,而不是进行网络搜索。我们将AI的回应与相同查询在当地最流行的网络搜索引擎google.com.sl上对应的前几个搜索结果进行比较。与网络搜索对应的结果中仅有2%包含来自本国的内容。此外,平均每条网络搜索结果消耗的数据量是AI回应的3107倍。仅带宽成本方面,每加载一千条网络搜索结果需2.41美元,而AI的总成本为每一千次回应0.30美元。因此,AI比网络搜索便宜87%。在盲评中,一组独立的教师样本将AI回应评价为比网络搜索结果更相关、更有用且更正确。这些发现表明,基于AI的解决方案可以以具有成本效益的方式弥合低连接地区的信息鸿沟。

更新时间: 2025-03-17 14:14:56

领域: cs.CY,cs.AI,cs.HC,econ.GN,q-fin.EC

下载: http://arxiv.org/abs/2502.12397v2

MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways

Inpatient pathways demand complex clinical decision-making based on comprehensive patient information, posing critical challenges for clinicians. Despite advancements in large language models (LLMs) in medical applications, limited research focused on artificial intelligence (AI) inpatient pathways systems, due to the lack of large-scale inpatient datasets. Moreover, existing medical benchmarks typically concentrated on medical question-answering and examinations, ignoring the multifaceted nature of clinical decision-making in inpatient settings. To address these gaps, we first developed the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC-IV database, encompassing 51,274 cases across nine triage departments and 17 major disease categories alongside 16 standardized treatment options. Then, we proposed the Multi-Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents, including a triage agent managing the patient admission, a diagnosis agent serving as the primary decision maker at the department, and a treatment agent providing treatment plans. Additionally, our MAP framework includes a chief agent overseeing the inpatient pathways to guide and promote these three clinician agents. Extensive experiments showed our MAP improved the diagnosis accuracy by 25.10% compared to the state-of-the-art LLM HuatuoGPT2-13B. It is worth noting that our MAP demonstrated significant clinical compliance, outperforming three board-certified clinicians by 10%-12%, establishing a foundation for inpatient pathways systems.

Updated: 2025-03-17 14:14:28

标题: MAP:面向住院路径的大型语言模型评估与多智能体增强

摘要: 住院患者路径需要基于全面的患者信息进行复杂的临床决策,给临床医生带来了重要挑战。尽管大型语言模型(LLMs)在医学应用方面取得了进展,但由于缺乏大规模的住院患者数据集,专注于人工智能(AI)住院患者路径系统的研究仍然有限。此外,现有的医学基准通常集中在医学问答和考试上,忽视了住院环境中临床决策的多方面性质。为了填补这些空白,我们首先从MIMIC-IV数据库中开发了住院患者路径决策支持(IPDS)基准,涵盖了51,274例病例,涉及九个分诊科室和17个主要疾病类别,以及16种标准化治疗选择。然后,我们提出了多智能体住院患者路径(MAP)框架,以三个临床智能体完成住院患者路径,包括管理患者入院的分诊智能体、在科室中担任主要决策者的诊断智能体,以及提供治疗方案的治疗智能体。此外,我们的MAP框架还包括一位主管智能体,负责监督住院患者路径,指导和促进这三个临床智能体。广泛的实验显示,与最先进的LLM HuatuoGPT2-13B相比,我们的MAP将诊断准确率提高了25.10%。值得注意的是,我们的MAP表现出显著的临床合规性,比三名获得专科认证的临床医生高出10%-12%,为住院患者路径系统奠定了基础。

更新时间: 2025-03-17 14:14:28

领域: cs.AI,cs.CL,cs.CV,cs.HC,cs.MA

下载: http://arxiv.org/abs/2503.13205v1

On the Byzantine-Resilience of Distillation-Based Federated Learning

Federated Learning (FL) algorithms using Knowledge Distillation (KD) have received increasing attention due to their favorable properties with respect to privacy, non-i.i.d. data and communication cost. These methods depart from transmitting model parameters and instead communicate information about a learning task by sharing predictions on a public dataset. In this work, we study the performance of such approaches in the byzantine setting, where a subset of the clients act in an adversarial manner aiming to disrupt the learning process. We show that KD-based FL algorithms are remarkably resilient and analyze how byzantine clients can influence the learning process. Based on these insights, we introduce two new byzantine attacks and demonstrate their ability to break existing byzantine-resilient methods. Additionally, we propose a novel defence method which enhances the byzantine resilience of KD-based FL algorithms. Finally, we provide a general framework to obfuscate attacks, making them significantly harder to detect, thereby improving their effectiveness. Our findings serve as an important building block in the analysis of byzantine FL, contributing through the development of new attacks and new defence mechanisms, further advancing the robustness of KD-based FL algorithms.

Updated: 2025-03-17 14:08:19

标题: 关于基于蒸馏的联邦学习的拜占庭韧性

摘要: 使用知识蒸馏(KD)的联邦学习(FL)算法因其在隐私、非独立同分布数据和通信成本方面的有利特性而受到越来越多的关注。这些方法不再传输模型参数,而是通过在公共数据集上共享预测来传达关于学习任务的信息。在这项工作中,我们研究了这些方法在拜占庭设置下的性能,其中一部分客户端以敌对方式行事,旨在干扰学习过程。我们展示了基于KD的FL算法具有显著的韧性,并分析了拜占庭客户端如何影响学习过程。基于这些见解,我们引入了两种新的拜占庭攻击,并展示了它们破坏现有拜占庭韧性方法的能力。此外,我们提出了一种增强基于KD的FL算法拜占庭韧性的新防御方法。最后,我们提供了一个通用框架来混淆攻击,使其更难被检测,从而提高其有效性。我们的发现是分析拜占庭FL的重要基石,通过开发新的攻击和新的防御机制,进一步提高了基于KD的FL算法的鲁棒性。

更新时间: 2025-03-17 14:08:19

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2402.12265v3

Timing the Match: A Deep Reinforcement Learning Approach for Ride-Hailing and Ride-Pooling Services

Efficient timing in ride-matching is crucial for improving the performance of ride-hailing and ride-pooling services, as it determines the number of drivers and passengers considered in each matching process. Traditional batched matching methods often use fixed time intervals to accumulate ride requests before assigning matches. While this approach increases the number of available drivers and passengers for matching, it fails to adapt to real-time supply-demand fluctuations, often leading to longer passenger wait times and driver idle periods. To address this limitation, we propose an adaptive ride-matching strategy using deep reinforcement learning (RL) to dynamically determine when to perform matches based on real-time system conditions. Unlike fixed-interval approaches, our method continuously evaluates system states and executes matching at moments that minimize total passenger wait time. Additionally, we incorporate a potential-based reward shaping (PBRS) mechanism to mitigate sparse rewards, accelerating RL training and improving decision quality. Extensive empirical evaluations using a realistic simulator trained on real-world data demonstrate that our approach outperforms fixed-interval matching strategies, significantly reducing passenger waiting times and detour delays, thereby enhancing the overall efficiency of ride-hailing and ride-pooling systems.
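
Potential-based reward shaping preserves the optimal policy while densifying the learning signal: r' = r + γΦ(s') − Φ(s). A sketch with an illustrative potential (the negative count of waiting passengers, which is our stand-in, not necessarily the paper's Φ):

```python
# Potential-based reward shaping (PBRS): shaped reward r' = r + gamma*Phi(s') - Phi(s).
def shaped_reward(r: float, s: dict, s_next: dict, gamma: float = 0.99) -> float:
    phi = lambda state: -float(state["n_waiting"])   # illustrative potential
    return r + gamma * phi(s_next) - phi(s)

s = {"n_waiting": 12}
s_next = {"n_waiting": 7}                            # a match cleared 5 requests
print(shaped_reward(r=-1.0, s=s, s_next=s_next))     # -1 + 0.99*(-7) + 12 = 4.07
```

Because the shaping term telescopes along any trajectory, the policy that is optimal under r' remains optimal under r, which is what makes PBRS a safe way to counter sparse rewards.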

Updated: 2025-03-17 14:07:58

标题: 匹配时机:一种用于网约车和拼车服务的深度强化学习方法

摘要: 在乘车匹配中,高效的匹配时机对于提升网约车和拼车服务的性能至关重要,因为它决定了每次匹配过程中考虑的司机和乘客数量。传统的批量匹配方法通常使用固定的时间间隔来累积乘车请求,然后再分配匹配。虽然这种方法增加了可用于匹配的司机和乘客数量,但它无法适应实时的供需波动,经常导致乘客等待时间变长和司机空闲时间增加。为了解决这一限制,我们提出了一种使用深度强化学习(RL)的自适应乘车匹配策略,根据实时系统状态动态确定何时进行匹配。与固定间隔方法不同,我们的方法持续评估系统状态,并在能使乘客总等待时间最小的时刻执行匹配。此外,我们还结合了基于势函数的奖励塑造(PBRS)机制来缓解稀疏奖励问题,加快RL训练并提高决策质量。使用基于真实世界数据训练的逼真模拟器进行的广泛实证评估表明,我们的方法优于固定间隔匹配策略,显著减少了乘客等待时间和绕行延误,从而提高了网约车和拼车系统的整体效率。

更新时间: 2025-03-17 14:07:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.13200v1

Life-inspired Interoceptive Artificial Intelligence for Autonomous and Adaptive Agents

Building autonomous -- i.e., choosing goals based on one's needs -- and adaptive -- i.e., surviving in ever-changing environments -- agents has been a holy grail of artificial intelligence (AI). A living organism is a prime example of such an agent, offering important lessons about adaptive autonomy. Here, we focus on interoception, a process of monitoring one's internal environment to keep it within certain bounds, which underwrites the survival of an organism. To develop AI with interoception, we need to factorize the state variables representing internal environments from external environments and adopt life-inspired mathematical properties of internal environment states. This paper offers a new perspective on how interoception can help build autonomous and adaptive agents by integrating the legacy of cybernetics with recent advances in theories of life, reinforcement learning, and neuroscience.

Updated: 2025-03-17 14:07:57

标题: 面向自主与自适应智能体的受生命启发的内感知人工智能

摘要: 构建自主的(即根据自身需求选择目标)和自适应的(即在不断变化的环境中生存)代理一直是人工智能(AI)的圣杯。生物是这种代理的一个典型例子,提供了关于自适应自主性的重要教训。在这里,我们着重介绍内感知,即监控自身内部环境以将其保持在一定范围内的过程,它支撑着生物的生存。为了开发具有内感知的人工智能,我们需要将代表内部环境的状态变量与外部环境分解开来,并采用受生命启发的内部环境状态的数学属性。本文提供了一个新的视角,说明内感知如何通过将控制论的遗产与近期生命理论、强化学习和神经科学的进展相整合,帮助构建自主且自适应的代理。

更新时间: 2025-03-17 14:07:57

领域: cs.AI,cs.NE

下载: http://arxiv.org/abs/2309.05999v2

Human-Centric Video Anomaly Detection Through Spatio-Temporal Pose Tokenization and Transformer

Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce SPARTA, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. SPARTA introduces an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that produces an enriched representation of human motion over time. This approach ensures that the transformer's attention mechanism captures both spatial and temporal patterns simultaneously, rather than focusing on only one aspect. The addition of the relative pose further emphasizes subtle deviations from normal human movements. The architecture's core, a novel Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that SPARTA consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD.

Updated: 2025-03-17 14:05:49

标题: 人类中心的视频异常检测通过时空姿势标记和变换器

摘要: 视频异常检测(VAD)在计算机视觉领域面临着重大挑战,主要是由于异常事件的不可预测和不经常发生的特性,以及它们发生的多样化和动态环境。人类中心的VAD是该领域的一个专门领域,面临着额外的复杂性,包括人类行为的变化、数据中的潜在偏见以及与人类主体相关的重大隐私问题。这些问题使得开发既健壮又可泛化的模型变得复杂。为了解决这些挑战,近期的进展集中在基于姿势的VAD上,利用人类姿势作为高级特征来减轻隐私问题、减少外观偏见并最小化背景干扰。在本文中,我们介绍了SPARTA,一种专门设计用于人类中心姿势基础VAD的新型基于transformer的架构。SPARTA引入了一种创新的时空姿势和相对姿势(ST-PRP)的标记方法,产生了人类运动随时间的丰富表示。这种方法确保了transformer的注意机制同时捕捉空间和时间模式,而不是只关注一个方面。相对姿势的添加进一步强调了与正常人类运动的微小偏差。该架构的核心是一种新颖的统一编码器双解码器(UETD)transformer,显著改进了视频数据中异常行为的检测。在多个基准数据集上进行的广泛评估表明,SPARTA始终优于现有方法,在基于姿势的VAD中建立了一个新的技术水平。

更新时间: 2025-03-17 14:05:49

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2408.15185v2

Deep Learning Advancements in Anomaly Detection: A Comprehensive Survey

The rapid expansion of data from diverse sources has made anomaly detection (AD) increasingly essential for identifying unexpected observations that may signal system failures, security breaches, or fraud. As datasets become more complex and high-dimensional, traditional detection methods struggle to effectively capture intricate patterns. Advances in deep learning have made AD methods more powerful and adaptable, improving their ability to handle high-dimensional and unstructured data. This survey provides a comprehensive review of over 180 recent studies, focusing on deep learning-based AD techniques. We categorize and analyze these methods into reconstruction-based and prediction-based approaches, highlighting their effectiveness in modeling complex data distributions. Additionally, we explore the integration of traditional and deep learning methods, highlighting how hybrid approaches combine the interpretability of traditional techniques with the flexibility of deep learning to enhance detection accuracy and model transparency. Finally, we identify open issues and propose future research directions to advance the field of AD. This review bridges gaps in existing literature and serves as a valuable resource for researchers and practitioners seeking to enhance AD techniques using deep learning.
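
The reconstruction-based family the survey describes can be reduced to a minimal linear example: fit a subspace on normal data and score new points by reconstruction error. Deep methods replace the PCA projection below with an autoencoder, but the scoring logic is the same.

    import numpy as np

    def fit_subspace(X_normal, k=8):
        mu = X_normal.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_normal - mu, full_matrices=False)
        return mu, Vt[:k]                       # mean and top-k components

    def anomaly_score(x, mu, components):
        # Normal points reconstruct well from the learned subspace;
        # anomalies leave a large residual.
        z = (x - mu) @ components.T
        x_hat = mu + z @ components
        return float(np.sum((x - x_hat) ** 2))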

Updated: 2025-03-17 14:04:48

标题: 深度学习在异常检测中的进展:一项综合调查

摘要: 随着来自各种来源的数据迅速扩展,异常检测(AD)越来越重要,用于识别可能表明系统故障、安全漏洞或欺诈的意外观察结果。随着数据集变得越来越复杂和高维,传统检测方法很难有效捕捉复杂的模式。深度学习的进步使得AD方法更加强大和适应性强,提高了处理高维和非结构化数据的能力。这项调查综合审查了180多项最近的研究,重点关注基于深度学习的AD技术。我们将这些方法分类并分析为基于重建和基于预测的方法,突出它们在建模复杂数据分布方面的有效性。此外,我们探讨了传统和深度学习方法的整合,突出了混合方法如何将传统技术的可解释性与深度学习的灵活性相结合,以增强检测准确性和模型透明度。最后,我们确定了存在的问题并提出了未来研究方向,以推进AD领域。这项审查填补了现有文献中的空白,并为寻求利用深度学习增强AD技术的研究人员和实践者提供了宝贵的资源。

更新时间: 2025-03-17 14:04:48

领域: cs.LG

下载: http://arxiv.org/abs/2503.13195v1

A representational framework for learning and encoding structurally enriched trajectories in complex agent environments

The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios. One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them, such as disentangled representations that exploit symmetries. Whereas such representations are procedurally efficient, they are based on the compression of low-level state-action transitions, which lack structural richness. To address this problem, we propose to enrich the agent's ontology and extend the traditional conceptualisation of trajectories to provide a more nuanced view of task execution. Structurally Enriched Trajectories (SETs) extend the encoding of sequences of states and their transitions by incorporating hierarchical relations between objects, interactions and affordances. SETs are built as multi-level graphs, providing a detailed representation of the agent dynamics and a transferable functional abstraction of the task. SETs are integrated into an architecture, Structurally Enriched Trajectory Learning and Encoding (SETLE), that employs a heterogeneous graph-based memory structure of multi-level relational dependencies essential for generalisation. Using reinforcement learning as a data generation tool, we demonstrate that SETLE can support downstream tasks, enabling agents to recognise task-relevant structural patterns across diverse environments.

Updated: 2025-03-17 14:04:27

标题: 一个用于在复杂代理环境中学习和编码结构丰富轨迹的表征框架

摘要: 人工智能代理在复杂情境中做出最优决策并将其推广到不同领域和任务的能力会受到削弱。解决这个问题的一种方法是专注于学习世界的高效表示以及代理的行为如何影响它们,比如利用对称性的分离表示。虽然这种表示在程序上是高效的,但它们基于低级状态-动作转换的压缩,缺乏结构丰富性。为了解决这个问题,我们提出丰富代理的本体论,并扩展传统的轨迹概念,以提供对任务执行的更细致视图。结构丰富的轨迹(SETs)通过将对象、互动和功能的层次关系纳入到状态序列及其转换的编码中来扩展表示。SETs被构建为多层级图,提供了对代理动态的详细表示和任务的可转移的功能抽象。SETs被整合到一个架构中,即结构丰富的轨迹学习和编码(SETLE),该架构采用了多级关系依赖性的异构图形内存结构,对于泛化是至关重要的。使用强化学习作为数据生成工具,我们证明SETLE可以支持下游任务,使代理能够在不同环境中识别与任务相关的结构模式。

更新时间: 2025-03-17 14:04:27

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.13194v1

3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o

Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extend their impressive 2D grounding and reasoning ability to real-world 3D scenarios. Besides, we first provide a thorough investigation of the potential visual prompting formats and conclude our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o, as a representative of MLLMs. Finally, we build evaluation environments with four datasets, i.e., ScanRefer, ScanNet, FMB, and nuScene datasets, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, a single prompt engineering approach does not consistently achieve the best outcomes for all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.

Updated: 2025-03-17 13:57:05

标题: 3DAxisPrompt: 促进GPT-4o中的3D基础和推理

摘要: 多模式大型语言模型(MLLMs)展现出在各种任务中的令人印象深刻的能力,特别是当配备精心设计的视觉提示时。然而,现有研究主要集中在逻辑推理和视觉理解上,而MLLMs在3D视觉中的有效操作能力仍然是一个持续探索的领域。在本文中,我们引入了一种新颖的视觉提示方法,称为3DAxisPrompt,以唤起MLLMs在现实场景中的3D理解能力。更具体地说,我们的方法利用3D坐标轴和从Segment Anything Model(SAM)生成的蒙版,为MLLMs提供明确的几何先验,然后将它们令人印象深刻的2D接地和推理能力扩展到现实世界的3D场景。此外,我们首先对潜在的视觉提示格式进行了彻底的调查,并总结我们的发现,揭示了GPT-4o作为MLLMs代表的3D理解能力的潜力和限制。最后,我们建立了包括四个数据集的评估环境,即ScanRefer、ScanNet、FMB和nuScene数据集,涵盖各种3D任务。基于此,我们进行了广泛的定量和定性实验,证明了所提出方法的有效性。总的来说,我们的研究表明,在3DAxisPrompt的帮助下,MLLMs可以有效地感知物体在现实世界场景中的3D位置。然而,单一的提示工程方法并不能始终为所有3D任务实现最佳结果。这项研究强调了利用MLLMs进行3D视觉接地/推理的可行性,借助提示工程技术。

更新时间: 2025-03-17 13:57:05

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.13185v1

Rapfi: Distilling Efficient Neural Network for the Game of Gomoku

Games have played a pivotal role in advancing artificial intelligence, with AI agents using sophisticated techniques to compete. Despite the success of neural network based game AIs, their performance often requires significant computational resources. In this paper, we present Rapfi, an efficient Gomoku agent that outperforms CNN-based agents in limited computation environments. Rapfi leverages a compact neural network with a pattern-based codebook distilled from CNNs, and an incremental update scheme that minimizes computation when input changes are minor. This new network uses orders of magnitude less computation to reach an accuracy similar to that of much larger neural networks such as Resnet. Thanks to our incremental update scheme, depth-first search methods such as the alpha-beta search can be significantly accelerated. With a carefully tuned evaluation and search, Rapfi reached strength surpassing Katagomo, the strongest open-source Gomoku AI based on AlphaZero's algorithm, under limited computational resources where accelerators like GPUs are absent. Rapfi ranked first among 520 Gomoku agents on Botzone and won the championship in GomoCup 2024.
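
The incremental update idea can be sketched abstractly: if the evaluation head reads a running sum of pattern embeddings, a single stone placement only changes the few line patterns crossing it, so an update costs O(changed patterns) rather than a full forward pass. The class below is a hypothetical reconstruction of that scheme, not Rapfi's actual code.

    import numpy as np

    class IncrementalEvaluator:
        def __init__(self, codebook, dim):
            self.codebook = codebook             # pattern id -> embedding
            self.acc = np.zeros(dim)             # running embedding sum

        def apply_move(self, removed_ids, added_ids):
            for pid in removed_ids:
                self.acc -= self.codebook[pid]
            for pid in added_ids:
                self.acc += self.codebook[pid]

        def undo_move(self, removed_ids, added_ids):
            self.apply_move(added_ids, removed_ids)   # cheap for alpha-beta

        def value(self, head):
            return head(self.acc)                # small MLP head on the sum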

Updated: 2025-03-17 13:53:57

标题: Rapfi:为五子棋游戏提炼高效神经网络

摘要: 游戏在推动人工智能发展中发挥了关键作用,AI代理使用复杂技术进行竞争。尽管基于神经网络的游戏AI取得了成功,但它们的性能通常需要大量计算资源。本文介绍了一种名为Rapfi的高效五子棋代理,它在有限的计算环境中胜过基于CNN的代理。Rapfi利用了一种紧凑的神经网络,该网络使用从CNN中提炼出的基于模式的码本,以及一种增量更新方案,当输入变化较小时最小化计算。这种新网络所需的计算量比Resnet等更大的神经网络少几个数量级,却能达到与之相似的准确度。由于我们的增量更新方案,深度优先搜索方法如α-β搜索可以得到显著加速。通过精心调整的评估和搜索,Rapfi在没有像GPU这样的加速器的有限计算资源下,实现了超越基于AlphaZero算法的最强开源五子棋AI Katagomo的实力。Rapfi在Botzone的520个五子棋代理中排名第一,并在GomoCup 2024赢得了冠军。

更新时间: 2025-03-17 13:53:57

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.13178v1

Leveraging Taxonomy Similarity for Next Activity Prediction in Patient Treatment

The rapid progress in modern medicine presents physicians with complex challenges when planning patient treatment. Techniques from the field of Predictive Business Process Monitoring, like Next-activity-prediction (NAP) can be used as a promising technique to support physicians in treatment planning, by proposing a possible next treatment step. Existing patient data, often in the form of electronic health records, can be analyzed to recommend the next suitable step in the treatment process. However, the use of patient data poses many challenges due to its knowledge-intensive character, high variability and scarcity of medical data. To overcome these challenges, this article examines the use of the knowledge encoded in taxonomies to improve and explain the prediction of the next activity in the treatment process. This study proposes the TS4NAP approach, which uses medical taxonomies (ICD-10-CM and ICD-10-PCS) in combination with graph matching to assess the similarities of medical codes to predict the next treatment step. The effectiveness of the proposed approach will be evaluated using event logs that are derived from the MIMIC-IV dataset. The results highlight the potential of using domain-specific knowledge held in taxonomies to improve the prediction of the next activity, and thus can improve treatment planning and decision-making by making the predictions more explainable.
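
Because ICD-10-style codes refine from left to right, a crude stand-in for the paper's taxonomy-plus-graph-matching similarity is shared-prefix depth. The real TS4NAP measure is richer, but this conveys the intuition.

    def code_path(code):
        code = code.replace(".", "")
        return [code[: i + 1] for i in range(len(code))]   # ancestor chain

    def taxonomy_similarity(a, b):
        # Fraction of taxonomy ancestors the two codes share.
        pa, pb = code_path(a), code_path(b)
        shared = sum(1 for x, y in zip(pa, pb) if x == y)
        return 2.0 * shared / (len(pa) + len(pb))

    print(taxonomy_similarity("I21.0", "I21.4"))   # same disease family: 0.75
    print(taxonomy_similarity("I21.0", "J45.9"))   # different chapters: 0.0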

Updated: 2025-03-17 13:52:26

标题: 利用分类相似性进行病人治疗中下一步活动预测

摘要: 现代医学的快速进展使医生在规划患者治疗时面临复杂挑战。来自预测业务流程监控领域的技术,如下一步活动预测(NAP),可作为一种有前途的技术,支持医生在治疗规划中提出可能的下一步治疗步骤。现有的患者数据,通常以电子健康记录的形式存在,可以进行分析,以推荐治疗过程中的下一个适当步骤。然而,由于医疗数据知识密集、变异性高且稀缺,使用患者数据面临许多挑战。为了克服这些挑战,本文研究了利用编码在分类法中的知识来改进和解释治疗过程中下一个活动的预测。本研究提出了TS4NAP方法,该方法使用医学分类法(ICD-10-CM和ICD-10-PCS)与图匹配相结合,评估医疗编码的相似性以预测下一个治疗步骤。提出的方法的有效性将使用从MIMIC-IV数据集中导出的事件日志进行评估。结果突显了利用分类法中持有的领域特定知识来改进下一个活动的预测的潜力,从而可以通过使预测更具可解释性来改进治疗规划和决策制定。

更新时间: 2025-03-17 13:52:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.07638v2

Valley: Video Assistant with Large Language model Enhanced abilitY

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In the paper, we introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, causal inference, etc. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase focuses solely on training the projection module to facilitate the LLM's capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.

Updated: 2025-03-17 13:51:51

标题: 山谷:具有大型语言模型增强能力的视频助手

摘要: 大型语言模型(LLMs)具有显著的对话能力,已经成为能够处理视觉和文本模态的AI助手。然而,它们在视频和语言理解方面的有效性尚未得到广泛探讨。在本文中,我们介绍了Valley,这是一个多模态基础模型,旨在实现增强的视频理解和遵循指令能力。为此,我们构建了两个数据集,即Valley-702k和Valley-instruct-73k,涵盖了各种视频文本对齐和基于视频的指令任务,如多镜头标题、长视频描述、动作识别、因果推理等。然后,我们采用ViT-L/14作为视觉编码器,并探索了三种不同的时间建模模块,学习增强视频理解的多方面特征。此外,我们为Valley实施了两阶段训练方法:第一阶段专注于训练投影模块,以帮助LLM理解视觉输入的能力,第二阶段联合训练投影模块和LLM,以提高它们的遵循指令能力。大量实验证明,Valley有潜力成为一种有效的视频助手,简化复杂的视频理解场景。我们的代码和数据以匿名形式发布在https://github.com/valley-vl/Valley。

更新时间: 2025-03-17 13:51:51

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2306.07207v3

PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning

Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: First, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: The former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency - both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE) which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows that of the best-known rate in MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for the federated training in various scenarios.
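
A UCB-flavoured sketch of the selection step: score each user by estimated utility plus an exploration bonus, penalized as their privacy budget depletes, then take the top k. The actual PAUSE reward and its guarantees are given in the paper; the weighting below is purely illustrative.

    import numpy as np

    def select_users(means, counts, eps_left, k, t, lam=0.5):
        """means, counts, eps_left: length-n arrays of per-user empirical
        reward, times selected, and remaining privacy budget."""
        bonus = np.sqrt(2.0 * np.log(t + 1) / (counts + 1))   # exploration
        scores = means + bonus - lam / np.maximum(eps_left, 1e-8)
        return np.argsort(scores)[-k:]                        # top-k users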

Updated: 2025-03-17 13:50:35

标题: PAUSE:用于联邦学习的低延迟和注重隐私的活跃用户选择

摘要: 联邦学习(FL)使多个边缘设备能够在不共享潜在私密数据的情况下共同训练机器学习模型。联邦学习通过模型更新的迭代交换来进行,面临两个关键挑战:随时间累积的隐私泄漏和通信延迟。这两个限制通常分别解决:前者通过扰动更新以增强隐私性,后者通过用户选择来减轻延迟,但都会牺牲准确性。在这项工作中,我们提出了一种方法,通过主动用户选择来共同解决隐私泄漏累积和通信延迟,旨在改善隐私、延迟和模型性能之间的权衡。为实现这一目标,我们构建了一个考虑这三个目标的奖励函数。基于这个奖励,我们提出了一个基于多臂赌博机(MAB)的算法,称为隐私感知主动用户选择(PAUSE),它在每一轮动态选择一组用户,同时确保整体隐私泄漏有限。我们进行了理论分析,系统地展示了PAUSE的奖励增长率遵循MAB文献中已知最佳率的情况。为解决主动用户选择的复杂性开销,我们提出了基于模拟退火的PAUSE松弛,并分析其在降低复杂性下逼近最大化奖励政策的能力。我们在各种情景下对我们的方法在联邦训练中的隐私泄漏、改进的延迟和准确性收益进行了数值验证。

更新时间: 2025-03-17 13:50:35

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2503.13173v1

HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning

The acquisition of large-scale and diverse demonstration data is essential for improving robotic imitation learning generalization. However, generating such data for complex manipulations is challenging in real-world settings. We introduce HybridGen, an automated framework that integrates Vision-Language Model (VLM) and hybrid planning. HybridGen uses a two-stage pipeline: first, VLM to parse expert demonstrations, decomposing tasks into expert-dependent (object-centric pose transformations for precise control) and plannable segments (synthesizing diverse trajectories via path planning); second, pose transformations substantially expand the first-stage data. Crucially, HybridGen generates a large volume of training data without requiring specific data formats, making it broadly applicable to a wide range of imitation learning algorithms, a characteristic which we also demonstrate empirically across multiple algorithms. Evaluations across seven tasks and their variants demonstrate that agents trained with HybridGen achieve substantial performance and generalization gains, averaging a 5% improvement over state-of-the-art methods. Notably, in the most challenging task variants, HybridGen achieves significant improvement, reaching a 59.7% average success rate, significantly outperforming Mimicgen's 49.5%. These results demonstrate its effectiveness and practicality.

Updated: 2025-03-17 13:49:43

标题: HybridGen: 用于模仿学习数据生成的VLM引导混合规划

摘要: 获取大规模和多样化的演示数据对于改进机器人模仿学习泛化能力至关重要。然而,在现实世界的环境中生成复杂操纵的数据是具有挑战性的。我们引入了HybridGen,这是一个自动化框架,集成了视觉语言模型(VLM)和混合规划。HybridGen使用两阶段流程:首先,使用VLM解析专家演示,将任务分解为专家相关(面向对象的姿态变换以实现精确控制)和可规划的部分(通过路径规划合成多样化轨迹);其次,姿态变换显著扩展了第一阶段数据。至关重要的是,HybridGen生成了大量的训练数据,而无需特定的数据格式,使其广泛适用于各种模仿学习算法,这也是我们在多个算法中经验性地证明的特点。对七个任务及其变体的评估表明,使用HybridGen训练的代理获得了显著的性能和泛化增益,比最先进方法平均提高了5%。值得注意的是,在最具挑战性的任务变体中,HybridGen取得了显著的改进,达到了59.7%的平均成功率,明显优于Mimicgen的49.5%。这些结果证明了其有效性和实用性。

更新时间: 2025-03-17 13:49:43

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2503.13171v1

Advancing Chronic Tuberculosis Diagnostics Using Vision-Language Models: A Multi modal Framework for Precision Analysis

Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding and a transformer-based text encoder to process clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned using 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94 percent) and recall (94 percent) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable and context-aware insights. Future work will address subtle pathologies and dataset biases to enhance the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.

Updated: 2025-03-17 13:49:29

标题: 利用视觉语言模型推进慢性结核病诊断:用于精准分析的多模态框架

摘要: 背景:本研究提出了一种视觉语言模型(VLM),利用SIGLIP编码器和Gemma-3b变压器解码器来增强自动慢性结核病(TB)筛查。通过将胸部X线图像与临床数据整合,该模型解决了手动解释的挑战,提高了诊断的一致性和可访问性,特别是在资源有限的环境中。 方法:VLM架构结合了视觉变换器(ViT)进行视觉编码和基于变压器的文本编码器处理临床背景,如患者病史和治疗记录。跨模态注意机制将放射影像特征与文本信息对齐,而Gemma-3b解码器生成全面的诊断报告。该模型在500万对医学图像和文本上进行了预训练,并使用了10万张特定于慢性结核病的胸部X射线进行了微调。 结果:该模型对于检测关键的慢性结核病病理学表现,包括纤维化、钙化肉芽肿和支气管扩张,表现出了高精度(94%)和召回率(94%)。曲线下面积(AUC)得分超过0.93,交并比(IoU)值超过0.91,验证了其在检测和定位与TB相关的异常方面的有效性。 结论:VLM为自动慢性TB诊断提供了强大且可扩展的解决方案,整合了放射影像和临床数据,提供可操作且具有上下文意识的见解。未来的工作将解决微妙的病理学和数据集偏差,以增强模型的泛化能力,确保在不同人群和医疗环境中实现公平的表现。

更新时间: 2025-03-17 13:49:29

领域: eess.IV,cs.AI,cs.CV,cs.LG,68T07, 92C55, 68U10, 92C50, 60G35

下载: http://arxiv.org/abs/2503.14536v1

A Recipe for Improving Remote Sensing VLM Zero Shot Generalization

Foundation models have had a significant impact across various AI applications, enabling use cases that were previously impossible. Contrastive Visual Language Models (VLMs), in particular, have outperformed other techniques in many tasks. However, their prevalence in remote sensing (RS) is still limited, due to the scarcity of diverse remote-sensing visual-language datasets. In this work we introduce two novel image-caption datasets for training of remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain, resulting in a diverse dataset with greater breadth in image styles and subject matter. These datasets are used to pre-train the MaMMUT~\citep{kuo2023mammutsimplearchitecturejoint} VLM architecture, resulting in state-of-the-art generalization performance in zero-shot cross-modal retrieval on well-known public benchmarks. Finally, we present our ongoing research to distill image-level knowledge gained in the VLM contrastive training procedure to enhance the model's localization ability. Specifically, we iteratively generate pseudo-labels for image regions based on the model's attention maps and use these labels for further training. To mitigate noisy attention maps and create robust segmentation masks, we introduce a novel attention-pooling mechanism called the Smooth-Attention-Operation.

Updated: 2025-03-17 13:49:27

标题: 改进遥感VLM零样本泛化的配方

摘要: 基础模型在各种人工智能应用中产生了重大影响,使先前不可能实现的用例成为可能。其中,对比视觉语言模型(VLMs)在许多任务中表现出色,超越了其他技术。然而,在遥感领域,由于缺乏多样化的遥感视觉语言数据集,它们的普及仍然有限。在这项工作中,我们引入了两个新颖的图像-标题数据集,用于遥感基础模型的训练。第一个数据集将航空和卫星图像与由Gemini生成的标题配对,Gemini使用从Google Maps提取的地标。第二个数据集利用公共网络图像及其对应的替代文本,经过遥感领域的筛选,得到一个样式和主题更广泛的多样化数据集。这些数据集用于预训练MaMMUT VLM架构,使其在著名的公共基准测试中实现了最先进的零样本跨模态检索泛化性能。最后,我们介绍了我们正在进行的研究,以提炼在VLM对比训练过程中获得的图像级知识,以增强模型的定位能力。具体来说,我们根据模型的注意力图逐步生成图像区域的伪标签,并将这些标签用于进一步训练。为了减轻注意力图的噪声并创建稳健的分割掩模,我们引入了一种称为平滑注意力操作的新颖注意力池化机制。

更新时间: 2025-03-17 13:49:27

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.08722v2

Influence Functions for Scalable Data Attribution in Diffusion Models

Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an influence functions framework. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We recast previously proposed methods as specific design choices in our framework and show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
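
For intuition, here is the classical influence-function computation in the one setting where it is exact and cheap, logistic regression, where the generalized Gauss-Newton matrix coincides with the Hessian. The paper's contribution is scaling this pattern to diffusion-model quantities via K-FAC; sign and scaling conventions vary across the literature.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def influence_scores(X, y, w, x_test, y_test, damping=1e-3):
        """Approximate change in the test loss from removing each training
        point: grad_test^T H^{-1} grad_i, up to sign/scale conventions."""
        n, d = X.shape
        p = sigmoid(X @ w)
        # GGN (here also the exact Hessian): X^T diag(p(1-p)) X / n
        H = (X * (p * (1 - p))[:, None]).T @ X / n + damping * np.eye(d)
        g_test = (sigmoid(x_test @ w) - y_test) * x_test
        h_inv_g = np.linalg.solve(H, g_test)      # K-FAC replaces this solve
        per_example_grads = (p - y)[:, None] * X
        return per_example_grads @ h_inv_g / n    # one score per example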

Updated: 2025-03-17 13:47:39

标题: 扩散模型中可扩展数据归因的影响函数

摘要: 扩散模型在生成建模方面取得了重大进展。然而,它们的广泛采用对数据归因和可解释性提出了挑战。在本文中,我们旨在通过开发一个影响函数框架来帮助解决扩散模型中的这些挑战。基于影响函数的数据归因方法近似模型的输出如果一些训练数据被移除会发生怎样的变化。在监督学习中,通常用于预测特定示例上的损失会如何变化。对于扩散模型,我们专注于通过几个代理测量来预测生成特定示例的概率变化。我们展示了如何为这些量构建影响函数,以及先前提出的方法如何被解释为我们框架中的特定设计选择。为了确保影响函数中的Hessian计算的可扩展性,我们系统地开发了基于广义高斯牛顿矩阵的K-FAC近似,专门针对扩散模型。我们将先前提出的方法重新表述为我们框架中的特定设计选择,并展示我们推荐的方法在常见评估中优于先前的数据归因方法,如线性数据建模分数(LDS)或移除最具影响力样本后的重新训练,而且无需特定于方法的超参数调整。

更新时间: 2025-03-17 13:47:39

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.13850v4

Collaborative AI Enhances Image Understanding in Materials Science

The Copilot for Real-world Experimental Scientist (CRESt) system empowers researchers to control autonomous laboratories through conversational AI, providing a seamless interface for managing complex experimental workflows. We have enhanced CRESt by integrating a multi-agent collaboration mechanism that utilizes the complementary strengths of the ChatGPT and Gemini models for precise image analysis in materials science. This innovative approach significantly improves the accuracy of experimental outcomes by fostering structured debates between the AI models, which enhances decision-making processes in materials phase analysis. Additionally, to evaluate the generalizability of this approach, we tested it on a quantitative task of counting particles. Here, the collaboration between the AI models also led to improved results, demonstrating the versatility and robustness of this method. By harnessing this dual-AI framework, this approach stands as a pioneering method for enhancing experimental accuracy and efficiency in materials research, with applications extending beyond CRESt to broader scientific experimentation and analysis.

Updated: 2025-03-17 13:44:30

标题: 协作人工智能提升材料科学中的图像理解

摘要: The Copilot for Real-world Experimental Scientist (CRESt)系统通过对话人工智能使研究人员能够控制自主实验室,为管理复杂实验工作流程提供无缝接口。我们通过整合一个利用ChatGPT和Gemini模型在材料科学中进行精确图像分析的多智能体协作机制,增强了CRESt系统。这种创新方法通过促进AI模型之间的结构化辩论显著提高了实验结果的准确性,增强了材料相分析中的决策过程。此外,为了评估该方法的泛化能力,我们在一个计算颗粒数量的定量任务上进行了测试。在这里,AI模型之间的协作也导致了改进的结果,展示了该方法的多功能性和稳健性。通过利用这种双AI框架,这种方法成为增强材料研究中实验准确性和效率的开创性方法,其应用范围不仅限于CRESt,还可扩展到更广泛的科学实验和分析领域。

更新时间: 2025-03-17 13:44:30

领域: cs.AI,I.2.1; I.2.10

下载: http://arxiv.org/abs/2503.13169v1

Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

Large Language Models (LLMs) possess encompassing capabilities that can process diverse language-related tasks. However, finetuning on LLMs will diminish these general skills, and continual finetuning will further cause severe degradation of accumulated knowledge. Recently, Continual Learning (CL) in Large Language Models (LLMs) has arisen, which aims to continually adapt the LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either leverage previous data to replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. To tackle these issues, this paper proposes Analytic Subspace Routing (ASR) to address these challenges. For each task, we isolate the learning within a subspace of deep layers' features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property of previously learned tasks with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.
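
The recursive least squares machinery named in the title admits a compact sketch: a linear router over frozen features whose closed-form statistics are folded in one sample at a time, so no historical data needs to be stored. This generic RLS classifier illustrates the update, not the paper's exact router.

    import numpy as np

    class RLSRouter:
        def __init__(self, dim, n_tasks, gamma=1e2):
            self.W = np.zeros((dim, n_tasks))
            self.P = gamma * np.eye(dim)      # running inverse Gram matrix

        def update(self, x, task_id):
            y = np.zeros(self.W.shape[1])
            y[task_id] = 1.0
            Px = self.P @ x
            k = Px / (1.0 + x @ Px)           # RLS gain vector
            self.W += np.outer(k, y - x @ self.W)
            self.P -= np.outer(k, Px)         # rank-1 downdate, no replay

        def route(self, x):
            return int(np.argmax(x @ self.W)) # pick the task subspace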

Updated: 2025-03-17 13:40:46

标题: 分析子空间路由:在大型语言模型的连续学习中递归最小二乘法的工作原理

摘要: 大型语言模型(LLMs)具有广泛的能力,可以处理各种语言相关任务。然而,在LLMs上进行微调将减弱这种通用技能,并且持续的微调将进一步导致已积累知识的严重退化。最近,在大型语言模型(LLMs)中出现了持续学习(CL)的概念,旨在持续适应LLMs到新任务,同时保持先前学到的知识并继承通用技能。现有技术要么利用以前的数据进行重播,导致额外的计算成本,要么利用单个参数高效的模块学习下游任务,由于不同任务之间的相互干扰而限制了新知识的吸收。针对这些问题,本文提出了Analytic Subspace Routing(ASR)来解决这些挑战。对于每个任务,我们通过低秩适应将学习限定在深层特征的子空间中,消除不同任务之间的知识干扰。此外,我们提出了一种分析路由机制,以适当利用在不同子空间学到的知识。我们的方法采用递归最小二乘法训练一个多任务路由器模型,允许路由器动态适应传入的数据,而无需访问历史数据。此外,路由器有效地将当前任务分配到适当的子空间,并具有先前学习任务的非遗忘性质,并带有坚实的理论保证。实验结果表明,我们的方法在有效克服现有方法的核心限制的同时,实现了先前知识的几乎完美保留,并无缝集成新信息。我们的代码将在接受后发布。

更新时间: 2025-03-17 13:40:46

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.13575v1

Empowering Edge Intelligence: A Comprehensive Survey on On-Device AI Models

The rapid advancement of artificial intelligence (AI) technologies has led to an increasing deployment of AI models on edge and terminal devices, driven by the proliferation of the Internet of Things (IoT) and the need for real-time data processing. This survey comprehensively explores the current state, technical challenges, and future trends of on-device AI models. We define on-device AI models as those designed to perform local data processing and inference, emphasizing their characteristics such as real-time performance, resource constraints, and enhanced data privacy. The survey is structured around key themes, including the fundamental concepts of AI models, application scenarios across various domains, and the technical challenges faced in edge environments. We also discuss optimization and implementation strategies, such as data preprocessing, model compression, and hardware acceleration, which are essential for effective deployment. Furthermore, we examine the impact of emerging technologies, including edge computing and foundation models, on the evolution of on-device AI models. By providing a structured overview of the challenges, solutions, and future directions, this survey aims to facilitate further research and application of on-device AI, ultimately contributing to the advancement of intelligent systems in everyday life.

Updated: 2025-03-17 13:37:33

标题: 增强边缘智能:关于设备端人工智能模型的综合调研

摘要: 人工智能(AI)技术的快速发展导致越来越多的AI模型部署在边缘和终端设备上,受物联网(IoT)的普及和实时数据处理需求驱动。本调查全面探讨了设备上AI模型的现状、技术挑战和未来趋势。我们将设备上AI模型定义为旨在执行本地数据处理和推断的模型,强调其特点,如实时性能、资源约束和增强的数据隐私。调查围绕关键主题展开,包括AI模型的基本概念、各领域的应用场景,以及在边缘环境中面临的技术挑战。我们还讨论了优化和实施策略,如数据预处理、模型压缩和硬件加速,这些对于有效部署至关重要。此外,我们还研究了新兴技术(包括边缘计算和基础模型)对设备上AI模型演变的影响。通过提供挑战、解决方案和未来方向的结构化概述,本调查旨在促进设备上AI的进一步研究和应用,最终为日常智能系统的发展做出贡献。

更新时间: 2025-03-17 13:37:33

领域: cs.AI,cs.LG,cs.NI

下载: http://arxiv.org/abs/2503.06027v2

Efficient Imitation Under Misspecification

Interactive imitation learning (IL) is a powerful paradigm for learning to make sequences of decisions from an expert demonstrating how to perform a task. Prior work in efficient imitation learning has focused on the realizable setting, where the expert's policy lies within the learner's policy class (i.e. the learner can perfectly imitate the expert in all states). However, in practice, perfect imitation of the expert is often impossible due to differences in state information and action space expressiveness (e.g. morphological differences between robots and humans). In this paper, we consider the more general misspecified setting, where no assumptions are made about the expert policy's realizability. We introduce a novel structural condition, reward-agnostic policy completeness, and prove that it is sufficient for interactive IL algorithms to efficiently avoid the quadratically compounding errors that stymie offline approaches like behavioral cloning. We address an additional practical constraint, the case of limited expert data, and propose a principled method for using additional offline data to further improve the sample-efficiency of interactive IL algorithms. Finally, we empirically investigate the optimal reset distribution in efficient IL under misspecification with a suite of continuous control tasks.

Updated: 2025-03-17 13:35:55

标题: 错误指定条件下的高效模仿

摘要: 交互式模仿学习(IL)是一种强大的范式,用于从专家演示如何执行任务学习做出决策序列。以往在高效模仿学习方面的工作主要集中在可实现设置上,即专家的策略位于学习者的策略类中(即学习者可以在所有状态下完美模仿专家)。然而,在实践中,由于状态信息和行动空间表达能力的差异(例如,机器人和人类之间的形态差异),往往无法完美模仿专家。在本文中,我们考虑更一般的错误设置,不对专家策略的可实现性做任何假设。我们引入了一种新颖的结构条件,即奖励不可知的策略完整性,并证明这对于交互式IL算法来说足以有效地避免二次复合误差,这些误差会阻碍像行为克隆这样的离线方法。我们解决了另一个实际约束条件——有限专家数据的情况,并提出了一种原则性方法,利用额外的离线数据进一步提高交互式IL算法的样本效率。最后,我们通过一系列连续控制任务对错误设置下的高效IL中的最佳重置分布进行了实证研究。

更新时间: 2025-03-17 13:35:55

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.13162v1

Learning of Patch-Based Smooth-Plus-Sparse Models for Image Reconstruction

We aim at the solution of inverse problems in imaging, by combining a penalized sparse representation of image patches with an unconstrained smooth one. This allows for a straightforward interpretation of the reconstruction. We formulate the optimization as a bilevel problem. The inner problem deploys classical algorithms while the outer problem optimizes the dictionary and the regularizer parameters through supervised learning. The process is carried out via implicit differentiation and gradient-based optimization. We evaluate our method for denoising, super-resolution, and compressed-sensing magnetic-resonance imaging. We compare it to other classical models as well as deep-learning-based methods and show that it always outperforms the former and also the latter in some instances.

Updated: 2025-03-17 13:32:12

标题: Patch-Based平滑加稀疏模型在图像重建中的学习

摘要: 我们旨在通过将图像块的惩罚稀疏表示与无约束的平滑表示相结合,解决成像中的逆问题。这使得重建的解释变得直观。我们将优化问题表述为一个双层问题。内部问题使用经典算法,而外部问题通过监督学习优化字典和正则化参数。该过程通过隐式微分和基于梯度的优化完成。我们评估了我们的方法在去噪、超分辨率和压缩感知磁共振成像方面的表现。我们将其与其他经典模型以及基于深度学习的方法进行比较,并表明它总是优于前者,且在某些情况下也优于后者。

更新时间: 2025-03-17 13:32:12

领域: eess.IV,cs.CV,cs.LG,eess.SP

下载: http://arxiv.org/abs/2412.13070v2

Laplace-Net: Learning Dynamical Systems with External Forcing

Modelling forced dynamical systems - where an external input drives the system state - is critical across diverse domains such as engineering, finance, and the natural sciences. In this work, we propose Laplace-Net, a decoupled, solver-free neural framework for learning forced and delay-aware systems. It leverages a Laplace transform-based approach to decompose internal dynamics, external inputs, and initial values into established theoretical concepts, enhancing interpretability. Laplace-Net promotes transferability since the system can be rapidly re-trained or fine-tuned for new forcing signals, providing flexibility in applications ranging from controller adaptation to long-horizon forecasting. Experimental results on eight benchmark datasets - including linear, non-linear, and delayed systems - demonstrate the method's improved accuracy and robustness compared to state-of-the-art approaches, particularly in handling complex and previously unseen inputs.
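
For the textbook linear time-invariant special case dx/dt = A x(t) + B u(t), the Laplace transform already performs exactly this decomposition, which is presumably the structure the method generalizes:

    X(s) = (sI - A)^{-1} x(0)   +   (sI - A)^{-1} B U(s)
           [initial values]         [internal dynamics] [external forcing]

A pure input delay \tau simply multiplies U(s) by e^{-s\tau}, which is why delay-awareness fits naturally in this domain; how Laplace-Net parameterizes the nonlinear analogue is detailed in the paper.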

Updated: 2025-03-17 13:31:12

标题: 拉普拉斯网络:学习具有外部驱动的动力系统

摘要: 建模强制动力系统——外部输入驱动系统状态——在工程、金融和自然科学等不同领域至关重要。在这项工作中,我们提出了Laplace-Net,这是一个解耦、无解算器的神经框架,用于学习强制和延迟感知系统。它利用基于拉普拉斯变换的方法将内部动态、外部输入和初始值分解为已建立的理论概念,增强了可解释性。Laplace-Net促进了可转移性,因为系统可以快速重新训练或微调以适应新的强制信号,从控制器适应到长期预测等应用中提供了灵活性。在包括线性、非线性和延迟系统在内的八个基准数据集上的实验结果显示,与最先进的方法相比,该方法在处理复杂和以前未见的输入方面具有更高的准确性和鲁棒性。

更新时间: 2025-03-17 13:31:12

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.13158v1

Noise-Aware Differentially Private Variational Inference

Differential privacy (DP) provides robust privacy guarantees for statistical inference, but this can lead to unreliable results and biases in downstream applications. While several noise-aware approaches have been proposed which integrate DP perturbation into the inference, they are limited to specific types of simple probabilistic models. In this work, we propose a novel method for noise-aware approximate Bayesian inference based on stochastic gradient variational inference which can also be applied to high-dimensional and non-conjugate models. We also propose a more accurate evaluation method for noise-aware posteriors. Empirically, our inference method has similar performance to existing methods in the domain where they are applicable. Outside this domain, we obtain accurate coverages on high-dimensional Bayesian linear regression and well-calibrated predictive probabilities on Bayesian logistic regression with the UCI Adult dataset.

Updated: 2025-03-17 13:23:33

标题: 噪声感知的差分隐私变分推理

摘要: 差分隐私(DP)为统计推断提供了强大的隐私保证,但这可能导致下游应用中的不可靠结果和偏见。虽然已经提出了几种噪声感知方法,将DP扰动集成到推断中,但它们仅适用于特定类型的简单概率模型。在这项工作中,我们提出了一种基于随机梯度变分推断的噪声感知近似贝叶斯推断方法,也可应用于高维和非共轭模型。我们还提出了一种更准确的评估方法,用于评估噪声感知的后验分布。从经验上看,在现有方法适用的领域内,我们的推断方法具有与之相似的性能。在这个领域之外,我们在高维贝叶斯线性回归上获得了准确的覆盖率,并在使用UCI Adult数据集的贝叶斯逻辑回归上获得了良好校准的预测概率。

更新时间: 2025-03-17 13:23:33

领域: stat.ML,cs.CR,cs.LG

下载: http://arxiv.org/abs/2410.19371v2

The Empirical Impact of Reducing Symmetries on the Performance of Deep Ensembles and MoE

Recent studies have shown that reducing symmetries in neural networks enhances linear mode connectivity between networks without requiring parameter space alignment, leading to improved performance in linearly interpolated neural networks. However, in practical applications, neural network interpolation is rarely used; instead, ensembles of networks are more common. In this paper, we empirically investigate the impact of reducing symmetries on the performance of deep ensembles and Mixture of Experts (MoE) across five datasets. Additionally, to explore deeper linear mode connectivity, we introduce the Mixture of Interpolated Experts (MoIE). Our results show that deep ensembles built on asymmetric neural networks achieve significantly better performance as ensemble size increases compared to their symmetric counterparts. In contrast, our experiments do not provide conclusive evidence on whether reducing symmetries affects both MoE and MoIE architectures.

Updated: 2025-03-17 13:20:52

标题: 减少对称性对深度集成和MoE性能的实证影响

摘要: 最近的研究表明,减少神经网络中的对称性可以增强网络之间的线性模式连接,而无需参数空间对齐,从而提高线性插值神经网络的性能。然而,在实际应用中,神经网络插值很少被使用;相反,网络集合更为常见。本文实证研究了减少对称性对深度集合和专家混合(MoE)在五个数据集上的性能影响。此外,为了探索更深的线性模式连接,我们引入了插值专家混合(MoIE)。我们的结果表明,建立在非对称神经网络上的深度集合在集合规模增加时比其对称对应体表现出显著更好的性能。相比之下,我们的实验并未提供确凿证据表明减少对称性是否会影响MoE和MoIE架构。

更新时间: 2025-03-17 13:20:52

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2502.17391v2

Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs

We introduce an Item Response Theory (IRT)-based framework to detect and quantify socioeconomic bias in large language models (LLMs) without relying on subjective human judgments. Unlike traditional methods, IRT accounts for item difficulty, improving ideological bias estimation. We fine-tune two LLM families (Meta-LLaMa 3.2-1B-Instruct and ChatGPT 3.5) to represent distinct ideological positions and introduce a two-stage approach: (1) modeling response avoidance and (2) estimating perceived bias in answered responses. Our results show that off-the-shelf LLMs often avoid ideological engagement rather than exhibit bias, challenging prior claims of partisanship. This empirically validated framework enhances AI alignment research and promotes fairer AI governance.
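
The building block behind any IRT analysis is the two-parameter-logistic item curve; the paper's second stage estimates perceived bias from curves of this form, while its first stage models response avoidance separately. A minimal sketch:

    import numpy as np

    def p_endorse(theta, a, b):
        """2PL item response curve: probability that a respondent (here, an
        LLM) with latent score theta endorses an item with discrimination a
        and difficulty b. Covers only the answered-response stage."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))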

Updated: 2025-03-17 13:20:09

标题: LLM(真的)具有意识形态吗?基于IRT的分析和对LLM中感知的社会经济偏见的对齐工具

摘要: 我们引入了一个基于项目反应理论(IRT)的框架,用于在大型语言模型(LLMs)中检测和量化社会经济偏见,而不依赖主观人类判断。与传统方法不同,IRT考虑了项目难度,提高了意识形态偏见的估计。我们对两个LLM家族(Meta-LLaMa 3.2-1B-Instruct和ChatGPT 3.5)进行微调,以代表不同的意识形态立场,并引入了一个两阶段方法:(1)建模回避反应和(2)估计已回答反应中的感知偏见。我们的结果表明,现成的LLMs通常会避免意识形态参与,而不是展示偏见,挑战了先前关于党派性的说法。这一经验验证的框架增强了人工智能对齐研究,并促进了更公平的人工智能治理。

更新时间: 2025-03-17 13:20:09

领域: cs.AI,cs.CL,cs.CY

下载: http://arxiv.org/abs/2503.13149v1

Improved Bounds for Pure Private Agnostic Learning: Item-Level and User-Level Privacy

Machine Learning has made remarkable progress in a wide range of fields. In many scenarios, learning is performed on datasets involving sensitive information, in which privacy protection is essential for learning algorithms. In this work, we study pure private learning in the agnostic model -- a framework reflecting the learning process in practice. We examine the number of users required under item-level (where each user contributes one example) and user-level (where each user contributes multiple examples) privacy and derive several improved upper bounds. For item-level privacy, our algorithm achieves a near optimal bound for general concept classes. We extend this to the user-level setting, rendering a tighter upper bound than the one proved by Ghazi et al. (2023). Lastly, we consider the problem of learning thresholds under user-level privacy and present an algorithm with a nearly tight user complexity.

Updated: 2025-03-17 13:19:11

标题: 改进纯私密式无偏学习的界限:项目级和用户级隐私

摘要: 机器学习在各个领域取得了显著的进展。在许多情况下,学习是在涉及敏感信息的数据集上进行的,其中隐私保护对于学习算法至关重要。在这项工作中,我们研究了在不可知模型中的纯隐私学习 - 这是一个反映实践中学习过程的框架。我们研究了在项目级别(每个用户贡献一个示例)和用户级别(每个用户贡献多个示例)隐私下所需的用户数量,并推导出了几个改进的上界。对于项目级别的隐私,我们的算法针对一般概念类别实现了接近最优的界限。我们将这一结果扩展到用户级别设置,得出了比Ghazi等人(2023年)所证明的更紧密的上界。最后,我们考虑了在用户级别隐私下学习阈值的问题,并提出了一种用户复杂度几乎紧致的算法。

更新时间: 2025-03-17 13:19:11

领域: cs.LG

下载: http://arxiv.org/abs/2407.20640v2

High-entropy Advantage in Neural Networks' Generalizability

While the 2024 Nobel Prize in Physics ignited a worldwide discussion on the origins of neural networks and their foundational links to physics, modern machine learning research predominantly focuses on computational and algorithmic advancements, overlooking the physical picture. Here we introduce the concept of entropy into neural networks by reconceptualizing them as hypothetical physical systems where each parameter is a non-interacting 'particle' within a one-dimensional space. By employing the Wang-Landau algorithm, we construct the neural networks' (with up to 1 million parameters) entropy landscapes as functions of training loss and test accuracy (or loss) across four distinct machine learning tasks, including arithmetic questions, real-world tabular data, image recognition, and language modeling. Our results reveal the existence of an \textit{entropy advantage}, where the high-entropy states generally outperform the states reached via classical training optimizers like stochastic gradient descent. We also find this advantage is more pronounced in narrower networks, indicating a need for different training optimizers tailored to different sizes of neural networks.
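
The Wang-Landau procedure itself is compact enough to sketch: random-walk over parameters, bin the energy (here, the loss), and bias acceptance by the running density-of-states estimate so that the histogram over bins flattens. The toy quadratic "loss" below is a stand-in for a network's training loss.

    import numpy as np

    rng = np.random.default_rng(0)
    energy = lambda th: float(np.sum(th ** 2))    # stand-in for a loss

    edges = np.linspace(0.0, 10.0, 21)            # discretised loss axis
    log_g = np.zeros(len(edges) - 1)              # log density of states
    hist = np.zeros_like(log_g)
    log_f = 1.0                                   # modification factor

    def bin_of(e):
        return int(np.clip(np.digitize(e, edges) - 1, 0, len(log_g) - 1))

    theta = rng.normal(size=10)
    b = bin_of(energy(theta))
    for step in range(200_000):
        prop = theta + 0.3 * rng.normal(size=10)
        b_new = bin_of(energy(prop))
        # Accept with prob min(1, g(old)/g(new)): drives a flat histogram.
        if np.log(rng.random()) < log_g[b] - log_g[b_new]:
            theta, b = prop, b_new
        log_g[b] += log_f
        hist[b] += 1
        if hist.min() > 0.8 * hist.mean():        # flat enough: refine f
            hist[:] = 0.0
            log_f /= 2.0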

Updated: 2025-03-17 13:16:25

标题: 神经网络的泛化能力中的高熵优势

摘要: 尽管2024年的诺贝尔物理学奖在全球引发了有关神经网络起源及其与物理学基础联系的讨论,但现代机器学习研究主要集中在计算和算法的进步上,忽略了物理学的画面。在这里,我们通过将神经网络重新概念化为假设的物理系统,其中每个参数是一维空间中的非相互作用的“粒子”,引入了熵的概念。通过使用Wang-Landau算法,我们构建了神经网络(最多有100万个参数)的熵景观,作为训练损失和测试准确性(或损失)的函数,跨越四个不同的机器学习任务,包括算术问题、真实世界的表格数据、图像识别和语言建模。我们的结果揭示了\textit{熵优势}的存在,即高熵状态通常优于通过经典训练优化器(如随机梯度下降)达到的状态。我们还发现这种优势在更窄的网络中更为显著,表明需要针对不同尺寸的神经网络定制不同的训练优化器。

更新时间: 2025-03-17 13:16:25

领域: cs.LG,cond-mat.stat-mech

下载: http://arxiv.org/abs/2503.13145v1

Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to ``finding a needle in a haystack.'' To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.

Updated: 2025-03-17 13:07:34

标题: 逻辑框架:通过视觉语义逻辑验证进行长视频理解的动态关键帧搜索

摘要: 理解长视频内容是一个复杂的任务,通常依赖密集采样的帧标题或端到端特征选择器,然而这些技术通常忽视了文本查询和视觉元素之间的逻辑关系。在实践中,计算约束要求进行粗糙的帧子采样,这是一个类似于“大海捞针”的挑战。为了解决这个问题,我们引入了一个基于语义驱动的搜索框架,将关键帧选择重新构建为视觉语义逻辑搜索范式。具体地,我们系统地定义了四种基本的逻辑依赖关系:1) 空间共现,2) 时间接近,3) 属性依赖,和 4) 因果顺序。这些关系通过迭代的精炼过程动态地更新帧采样分布,实现了上下文感知的,针对特定查询需求的语义关键帧的识别。我们的方法在关键帧选择指标的手动注释基准上建立了新的SOTA表现。此外,当应用于下游视频问答任务时,所提出的方法在LongVideoBench和Video-MME上展示出对现有方法的最佳性能增益,验证了它在弥合文本查询和视觉-时间推理之间的逻辑差距方面的有效性。代码将会公开发布。

更新时间: 2025-03-17 13:07:34

领域: cs.CV,cs.AI,cs.CL,eess.IV

下载: http://arxiv.org/abs/2503.13139v1

Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/

Updated: 2025-03-17 13:01:59

标题: Reangle-A-Video:作为视频到视频翻译的4D视频生成

摘要: 我们介绍了Reangle-A-Video,这是一个统一的框架,用于从单个输入视频生成同步的多视角视频。与在大规模4D数据集上训练多视角视频扩散模型的主流方法不同,我们的方法将多视角视频生成任务重新构建为视频到视频的转换,利用公开可用的图像和视频扩散先验。实质上,Reangle-A-Video分为两个阶段。 (1)多视角运动学习:通过自监督方式对图像到视频扩散变换器进行同步微调,从一组扭曲的视频中提炼出视图不变的运动。 (2)多视角一致的图像到图像翻译:在推理时的跨视图一致性指导下,使用DUSt3R将输入视频的第一帧扭曲并修复到不同的相机视角,生成多视角一致的起始图像。对静态视图传输和动态摄像机控制的大量实验表明,Reangle-A-Video超越了现有方法,为多视角视频生成建立了一个新的解决方案。我们将公开发布我们的代码和数据。项目页面:https://hyeonho99.github.io/reangle-a-video/

更新时间: 2025-03-17 13:01:59

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.09151v2

Online Signature Verification based on the Lagrange formulation with 2D and 3D robotic models

Online Signature Verification commonly relies on function-based features, such as time-sampled horizontal and vertical coordinates, as well as the pressure exerted by the writer, obtained through a digitizer. Although inferring additional information about the writer's arm pose, kinematics, and dynamics based on digitizer data can be useful, it constitutes a challenge. In this paper, we tackle this challenge by proposing a new set of features based on the dynamics of online signatures. These new features are inferred through a Lagrangian formulation, obtaining the sequences of generalized coordinates and torques for 2D and 3D robotic arm models. By combining kinematic and dynamic robotic features, our results demonstrate their significant effectiveness for online automatic signature verification, achieving state-of-the-art results when integrated into deep learning models.
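
The 2D model's first step, recovering generalized coordinates from pen positions, is standard two-link inverse kinematics; the link lengths below are hypothetical. Velocities, accelerations, and then torques follow from finite differences and the Euler-Lagrange equations.

    import numpy as np

    def two_link_ik(x, y, l1=0.30, l2=0.30):
        """Generalized coordinates (shoulder and elbow angles) of a planar
        two-link arm whose endpoint is the pen position (x, y)."""
        c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
        q2 = np.arccos(np.clip(c2, -1.0, 1.0))          # elbow-down branch
        q1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(q2),
                                           l1 + l2 * np.cos(q2))
        return q1, q2

    # Applied per time sample this yields q(t); finite differences give
    # q_dot and q_ddot, and the torque features follow from
    # tau = d/dt (dL/dq_dot) - dL/dq for the chosen arm model.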

Updated: 2025-03-17 12:56:43

标题: 基于拉格朗日公式和2D、3D机器人模型的在线签名验证

摘要: 在线签名验证通常依赖于基于功能的特征,例如通过数字化器获得的时间采样的水平和垂直坐标以及作者施加的压力。尽管基于数字化器数据推断有关作者手臂姿势、运动学和动力学的额外信息可能会很有用,但也构成了一个挑战。在本文中,我们通过提出一组基于在线签名动力学的新特征来应对这一挑战。这些新特征是通过拉格朗日公式推断的,获得了2D和3D机械臂模型的广义坐标和力矩序列。通过结合运动学和动力学机器人特征,我们的结果表明这些特征在在线自动签名验证中显著有效,并且在集成到深度学习模型后取得了最先进的结果。

更新时间: 2025-03-17 12:56:43

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.13573v1

MIXPINN: Mixed-Material Simulations by Physics-Informed Neural Network

Simulating the complex interactions between soft tissues and rigid anatomy is critical for applications in surgical training, planning, and robotic-assisted interventions. Traditional Finite Element Method (FEM)-based simulations, while accurate, are computationally expensive and impractical for real-time scenarios. Learning-based approaches have shown promise in accelerating predictions but have fallen short in modeling soft-rigid interactions effectively. We introduce MIXPINN, a physics-informed Graph Neural Network (GNN) framework for mixed-material simulations, explicitly capturing soft-rigid interactions using graph-based augmentations. Our approach integrates Virtual Nodes (VNs) and Virtual Edges (VEs) to enhance rigid body constraint satisfaction while preserving computational efficiency. By leveraging a graph-based representation of biomechanical structures, MIXPINN learns high-fidelity deformations from FEM-generated data and achieves real-time inference with sub-millimeter accuracy. We validate our method in a realistic clinical scenario, demonstrating superior performance compared to baseline GNN models and traditional FEM methods. Our results show that MIXPINN reduces computational cost by an order of magnitude while maintaining high physical accuracy, making it a viable solution for real-time surgical simulation and robotic-assisted procedures.

Updated: 2025-03-17 12:48:29

标题: MIXPINN: 物理信息神经网络进行混合材料模拟

摘要: 模拟软组织与刚性解剖之间复杂相互作用对于在外科培训、规划和机器人辅助干预中的应用至关重要。传统有限元方法(FEM)模拟虽然准确,但在实时场景中计算代价高昂且不切实际。基于学习的方法显示出加速预测的潜力,但在有效建模软硬互动方面表现不佳。我们引入了MIXPINN,这是一个用于混合材料模拟的物理信息图神经网络(GNN)框架,通过基于图的增强明确捕捉软硬互动。我们的方法整合了虚拟节点(VNs)和虚拟边(VEs),以增强刚性体约束满足性并保持计算效率。通过利用生物力学结构的基于图的表示,MIXPINN从FEM生成的数据中学习高保真变形,并实现亚毫米精度的实时推断。我们在现实临床场景中验证了我们的方法,与基线GNN模型和传统FEM方法相比,表现出优越性能。我们的结果显示,MIXPINN将计算成本降低一个数量级,同时保持高物理精度,使其成为实时外科模拟和机器人辅助程序的可行解决方案。

更新时间: 2025-03-17 12:48:29

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2503.13123v1

DeepDiveAI: Identifying AI Related Documents in Large Scale Literature Data

In this paper, we propose a method to automatically classify AI-related documents from large-scale literature databases, leading to the creation of an AI-related literature dataset, named DeepDiveAI. The dataset construction approach integrates expert knowledge with the capabilities of advanced models, structured across two global stages. In the first stage, expert-curated classification datasets are used to train an LSTM model, which classifies coarse AI related records from large-scale datasets. In the second stage, we use Qwen2.5 Plus to annotate a random 10% of the coarse AI-related records, which are then used to train a BERT binary classifier. This step further refines the coarse AI related record set to obtain the final DeepDiveAI dataset. Evaluation results demonstrate that the entire workflow can efficiently and accurately identify AI-related literature from large-scale datasets.

Updated: 2025-03-17 12:46:22

标题: DeepDiveAI:在大规模文献数据中识别与人工智能相关的文档

摘要: 在本文中,我们提出了一种方法,可以自动分类大规模文献数据库中与人工智能相关的文档,从而创建一个名为DeepDiveAI的AI相关文献数据集。数据集构建方法将专家知识与先进模型的能力结合起来,分为两个全局阶段。在第一阶段,使用专家策划的分类数据集训练了一个LSTM模型,该模型从大规模数据集中筛选出粗粒度的AI相关记录。在第二阶段,我们使用Qwen2.5 Plus对粗粒度AI相关记录中的随机10%进行注释,然后用于训练一个BERT二元分类器。这一步进一步完善了粗粒度AI相关记录集,获得了最终的DeepDiveAI数据集。评估结果表明,整个工作流程可以高效准确地从大规模数据集中识别AI相关文献。

更新时间: 2025-03-17 12:46:22

领域: cs.AI

下载: http://arxiv.org/abs/2408.12871v4

VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding

Large language models (LLMs) offer significant potential for coding, yet fine-tuning (FT) with curated data is essential for niche languages like Verilog. Using proprietary intellectual property (IP) for FT presents a serious risk, as FT data can be leaked through LLM inference. This leads to a critical dilemma for design houses: seeking to build externally accessible LLMs offering competitive Verilog coding, how can they leverage in-house IP to enhance FT utility while ensuring IP protection? For the first time in the literature, we study this dilemma. Using LLaMA 3.1-8B, we conduct in-house FT on a baseline Verilog dataset (RTLCoder) supplemented with our own in-house IP, which is validated through multiple tape-outs. To rigorously assess IP leakage, we quantify structural similarity (AST/Dolos) and functional equivalence (Synopsys Formality) between generated codes and our in-house IP. We show that our IP can indeed be leaked, confirming the threat. As defense, we evaluate logic locking of Verilog codes (ASSURE). This offers some level of protection, yet reduces the IP's utility for FT and degrades the LLM's performance. Our study shows the need for novel strategies that are both effective and minimally disruptive to FT, an essential effort for enabling design houses to fully utilize their proprietary IP toward LLM-driven Verilog coding.

Updated: 2025-03-17 12:38:03

标题: VeriLeaky: 在LLM驱动的Verilog编码中平衡IP保护与实用性

摘要: 大型语言模型(LLMs)为编码提供了巨大潜力,然而对于像Verilog这样的小众语言,使用经过精心筛选的数据进行微调(FT)至关重要。使用专有知识产权(IP)进行FT存在严重风险,因为FT数据可能会通过LLM推断泄露。这导致了设计公司面临一个关键的困境:试图构建可以竞争Verilog编码的外部可访问的LLMs时,他们如何利用内部IP来增强FT效用同时确保IP保护? 在文献中,我们首次研究了这一困境。使用LLaMA 3.1-8B,我们在基准Verilog数据集(RTLCoder)上进行内部FT,补充了我们自己的内部IP,并通过多次tape-outs进行验证。为了严格评估IP泄露,我们量化了生成代码与我们内部IP之间的结构相似性(AST/Dolos)和功能等效性(Synopsys Formality)。我们展示了我们的IP确实可以泄露,确认了这一威胁。作为防御,我们评估了Verilog代码的逻辑锁定(ASSURE)。这提供了一定程度的保护,但降低了IP的FT效用,并降低了LLM的性能。我们的研究显示了需要新颖的策略,既有效又对FT干扰最小,这是使设计公司能够充分利用其专有IP进行LLM驱动的Verilog编码的重要努力。

更新时间: 2025-03-17 12:38:03

领域: cs.CR,cs.AR,cs.LG

下载: http://arxiv.org/abs/2503.13116v1

Beyond Propagation of Chaos: A Stochastic Algorithm for Mean Field Optimization

Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with $n$ particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent. This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm's output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.
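
For reference, the deterministic SVGD update that the virtual-particle scheme stochastically approximates is shown below; the paper's algorithm replaces the full pairwise interaction sum with a cheaper stochastic estimate, which is what removes the need for propagation-of-chaos arguments.

    import numpy as np

    def svgd_step(X, grad_logp, h=1.0, eps=0.1):
        """One SVGD update on particles X of shape (n, d); grad_logp maps
        (n, d) -> (n, d) scores of the target density."""
        diff = X[:, None, :] - X[None, :, :]       # diff[j, i] = x_j - x_i
        K = np.exp(-(diff ** 2).sum(-1) / h)       # RBF kernel matrix
        gK = (-2.0 / h) * diff * K[:, :, None]     # grad_{x_j} k(x_j, x_i)
        phi = (K @ grad_logp(X) + gK.sum(axis=0)) / len(X)
        return X + eps * phi

    # Example: drive particles toward a standard normal target.
    X = np.random.default_rng(0).normal(3.0, 1.0, size=(100, 1))
    for _ in range(500):
        X = svgd_step(X, lambda X: -X)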

Updated: 2025-03-17 12:37:53

标题: 超越混沌传播:一种用于均场优化的随机算法

摘要: 在2-Wasserstein空间中的梯度流被广泛用于优化概率分布上的泛函,并通常使用一个包含$n$个粒子的相互作用粒子系统来实现。分析这些算法需要证明:(a) 有限粒子系统收敛和/或 (b) 粒子的结果经验分布与最优分布紧密逼近(即混沌传播)。然而,建立有效的充分条件可能具有挑战性,因为有限粒子系统可能产生高度相关的随机变量。 在这项工作中,我们研究了虚拟粒子随机逼近,最初是为了Stein变分梯度下降而引入的。这种方法可以看作是Wasserstein空间中的一种随机梯度下降形式,并且可以高效实现。在流行的设置中,我们展示了我们的算法输出在类似于无限粒子极限条件下收敛到最优分布,并且不需要显式建立混沌传播边界即可生成i.i.d.样本。

更新时间: 2025-03-17 12:37:53

领域: cs.LG,cs.AI,math.PR,stat.ML

下载: http://arxiv.org/abs/2503.13115v1

Exploring the Potential of Bilevel Optimization for Calibrating Neural Networks

Handling uncertainty is critical for ensuring reliable decision-making in intelligent systems. Modern neural networks are known to be poorly calibrated, resulting in predicted confidence scores that are difficult to use. This article explores improving confidence estimation and calibration through the application of bilevel optimization, a framework designed to solve hierarchical problems with interdependent optimization levels. A self-calibrating bilevel neural-network training approach is introduced to improve a model's predicted confidence scores. The effectiveness of the proposed framework is analyzed using toy datasets, such as Blobs and Spirals, as well as more practical simulated datasets, such as Blood Alcohol Concentration (BAC). It is compared with a well-known and widely used calibration strategy, isotonic regression. The reported experimental results reveal that the proposed bilevel optimization approach reduces the calibration error while preserving accuracy.
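
To see the bilevel structure in miniature: with the classifier (inner problem) already fit, an outer problem tunes a calibration parameter against held-out negative log-likelihood. The proposed method self-calibrates during training rather than post hoc, so the sketch below, plain temperature scaling with a finite-difference outer gradient, only illustrates the two-level shape.

    import numpy as np

    def nll(logits, labels, T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    def outer_calibrate(val_logits, val_labels, lr=0.1, steps=200):
        """Outer level: minimise validation NLL over the temperature T,
        with the inner-level classifier held fixed."""
        T = 1.0
        for _ in range(steps):
            g = (nll(val_logits, val_labels, T + 1e-4)
                 - nll(val_logits, val_labels, T - 1e-4)) / 2e-4
            T = max(T - lr * g, 1e-2)             # keep T positive
        return T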

Updated: 2025-03-17 12:34:55

标题: 探索双层优化在校准神经网络中的潜力

摘要: 处理不确定性对于确保智能系统可靠的决策至关重要。现代神经网络被认为校准不足,导致预测的置信度分数难以使用。本文通过应用双层优化探讨改进置信度估计和校准,双层优化是一种旨在解决具有相互依赖的优化层次的框架。引入了一种自校准的双层神经网络训练方法,以改善模型的预测置信度分数。利用玩具数据集,如Blobs和Spirals,以及更实际的模拟数据集,如血液酒精浓度(BAC),分析了所提出的框架的有效性。将其与众所周知且广泛使用的校准策略保序回归(isotonic regression)进行比较。报道的实验结果显示,所提出的双层优化方法降低了校准误差,同时保持准确性。

更新时间: 2025-03-17 12:34:55

领域: cs.LG

下载: http://arxiv.org/abs/2503.13113v1

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.

Updated: 2025-03-17 12:34:22

标题: MM-Spatial:探索多模态LLM中的3D空间理解

摘要: 多模态大型语言模型(MLLMs)在二维视觉理解方面表现出色,但在理解三维空间方面仍然存在局限性。在这项工作中,我们利用大规模高质量的3D场景数据和开放式注释,引入了1)一个新颖的监督微调数据集和2)一个新的评估基准,重点关注室内场景。我们的Cubify Anything VQA(CA-VQA)数据涵盖了包括空间关系预测、度量尺寸和距离估计以及3D基准在内的各种空间任务。我们展示了CA-VQA使我们能够训练MM-Spatial,一种强大的通用MLLM,也在包括我们自己在内的3D空间理解基准上实现了最先进的性能。我们展示了如何将度量深度和多视角输入(在CA-VQA中提供)纳入到3D理解中,以及证明仅凭数据就能使我们的模型实现与专用单眼深度估计模型相媲美的深度感知能力。我们将发布我们的SFT数据集和基准。

更新时间: 2025-03-17 12:34:22

领域: cs.CV,cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.13111v1

MoMa: A Modular Deep Learning Framework for Material Property Prediction

Deep learning methods for material property prediction have been widely explored to advance materials discovery. However, the prevailing pre-train then fine-tune paradigm often fails to address the inherent diversity and disparity of material tasks. To overcome these challenges, we introduce MoMa, a Modular framework for Materials that first trains specialized modules across a wide range of tasks and then adaptively composes synergistic modules tailored to each downstream scenario. Evaluation across 17 datasets demonstrates the superiority of MoMa, with a substantial 14% average improvement over the strongest baseline. Few-shot and continual learning experiments further highlight MoMa's potential for real-world applications. Pioneering a new paradigm of modular material learning, MoMa will be open-sourced to foster broader community collaboration.

Updated: 2025-03-17 12:33:30

标题: MoMa:用于材料性能预测的模块化深度学习框架

摘要: 深度学习方法在材料性能预测方面得到了广泛探讨,以推动材料发现的进展。然而,普遍的预训练然后微调范式经常无法解决材料任务的固有多样性和差异性。为了克服这些挑战,我们引入了MoMa,即材料模块化框架,首先在广泛的任务范围内训练专门的模块,然后自适应地组合适合每个下游场景的协同模块。对17个数据集的评估显示了MoMa的优越性,平均改进幅度达到了14%,超过了最强基线。少样本学习和持续学习实验进一步突出了MoMa在现实世界应用中的潜力。作为材料模块化学习的先驱,MoMa将开源以促进更广泛的社区合作。

更新时间: 2025-03-17 12:33:30

领域: cs.LG,cond-mat.mtrl-sci

下载: http://arxiv.org/abs/2502.15483v2

Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference

Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, how MLLMs process and utilize visual information remains unclear. In this paper, a shift in the dominant flow of visual information is uncovered: (1) in shallow layers, strong interactions are observed between image tokens and instruction tokens, where most visual information is injected into instruction tokens to form cross-modal semantic representations; (2) in deeper layers, image tokens primarily interact with each other, aggregating the remaining visual information to optimize semantic representations within visual modality. Based on these insights, we propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance. Our findings offer a new understanding of visual information processing in MLLMs and provide a state-of-the-art solution for efficient inference.
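
A rough NumPy sketch of the pruning idea: score image tokens by the attention they receive at a chosen layer and keep only the top fraction. The scoring rule, keep ratio, and random inputs here are assumptions; HiMAP's actual layer selection and criterion are described in the paper:

```python
import numpy as np

def prune_image_tokens(hidden, attn, image_idx, keep_ratio=0.35):
    """Drop image tokens that receive little attention at this layer.

    hidden:    (seq, dim) hidden states at the chosen layer
    attn:      (seq, seq) attention weights averaged over heads
    image_idx: indices of image tokens within the sequence
    """
    scores = attn[:, image_idx].mean(axis=0)          # attention each image token receives
    k = max(1, int(keep_ratio * len(image_idx)))
    keep_local = np.argsort(scores)[-k:]              # most-attended image tokens
    keep = np.concatenate([np.setdiff1d(np.arange(hidden.shape[0]), image_idx),
                           np.asarray(image_idx)[keep_local]])
    keep.sort()
    return hidden[keep], keep

rng = np.random.default_rng(0)
seq, dim = 32, 16
attn = rng.dirichlet(np.ones(seq), size=seq)          # rows sum to 1, like softmax attention
hidden = rng.normal(size=(seq, dim))
pruned, kept = prune_image_tokens(hidden, attn, image_idx=list(range(8, 24)))
print(pruned.shape, "tokens kept:", kept)
```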

Updated: 2025-03-17 12:31:23

标题: 揭开MLLMs中视觉信息流的面纱:解锁更快推理的路径

摘要: 多模态大型语言模型(MLLMs)通过将预训练视觉编码器中的视觉特征整合到大型语言模型(LLMs)中,提高了视觉语言任务的性能。然而,MLLMs如何处理和利用视觉信息仍不清楚。本文揭示了视觉信息主导流的转变:(1)在浅层,观察到图像标记和指令标记之间存在强烈的交互作用,其中大部分视觉信息被注入到指令标记中,形成跨模态语义表示;(2)在深层,图像标记主要与彼此互动,聚合剩余的视觉信息,优化视觉模态内的语义表示。基于这些见解,我们提出了分层模态感知剪枝(HiMAP),这是一种即插即用的推理加速方法,动态地在特定层级剪枝图像标记,将计算成本降低约65%,而不牺牲性能。我们的发现为MLLMs中的视觉信息处理提供了新的理解,并为高效推理提供了最先进的解决方案。

更新时间: 2025-03-17 12:31:23

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.13108v1

ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models

Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). By reducing over-reliance on language priors, these strategies ensure that generated content remains closely grounded in visual inputs, producing contextually accurate outputs. Since contrastive decoding requires no additional training or external tools, it offers both computational efficiency and versatility, making it highly attractive. However, these methods present two main limitations: (1) bluntly suppressing language priors can compromise coherence and accuracy of generated content, and (2) processing contrastive inputs adds computational load, significantly slowing inference speed. To address these challenges, we propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model's middle layers, where modality fusion predominantly occurs. This approach enables more effective capture of visual features, reducing the model's bias toward language modality. Experimental results demonstrate that VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.

Updated: 2025-03-17 12:30:40

标题: ClearSight: 多模态大语言模型中目标幻觉缓解的视觉信号增强

摘要: 对比解码策略被广泛应用于减轻多模式大型语言模型(MLLMs)中的对象幻觉。通过减少对语言先验的过度依赖,这些策略确保生成的内容保持与视觉输入密切相关,产生具有上下文准确性的输出。由于对比解码不需要额外的训练或外部工具,因此它既具有计算效率又具有多功能性,使其具有很高的吸引力。然而,这些方法存在两个主要限制:(1)过于压制语言先验可能会损害生成内容的连贯性和准确性,(2)处理对比输入会增加计算负荷,显著减慢推理速度。为了解决这些挑战,我们提出了Visual Amplification Fusion(VAF),这是一种即插即用的技术,可以增强模型中间层对视觉信号的注意力,这是模态融合主要发生的地方。这种方法能够更有效地捕获视觉特征,减少模型对语言模态的偏见。实验结果表明,VAF显著减少了各种MLLMs中的幻觉,而不影响推理速度,同时保持了生成输出的连贯性和准确性。

更新时间: 2025-03-17 12:30:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.13107v1

Fine-tuning can Help Detect Pretraining Data from Large Language Models

In the era of large language models (LLMs), detecting pretraining data has become increasingly important due to concerns about fair evaluation and ethical risks. Current methods differentiate members and non-members by designing scoring functions, like Perplexity and Min-k%. However, the diversity and complexity of training data magnify the difficulty of distinguishing, leading to suboptimal performance in detecting pretraining data. In this paper, we first explore the benefits of unseen data, which can be easily collected after the release of the LLM. We find that the perplexities of LLMs shift differently for members and non-members after fine-tuning with a small amount of previously unseen data. In light of this, we introduce a novel and effective method termed Fine-tuned Score Deviation (FSD), which improves the performance of current scoring functions for pretraining data detection. In particular, we propose to measure the deviation distance of current scores after fine-tuning on a small amount of unseen data within the same domain. In effect, using a small amount of unseen data can largely decrease the scores of all non-members, leading to a larger deviation distance than members. Extensive experiments demonstrate the effectiveness of our method, significantly improving the AUC score on common benchmark datasets across various models.
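
A minimal NumPy sketch of the FSD scoring step, using synthetic perplexity values in place of real model outputs (the numbers and the median threshold are purely illustrative):

```python
import numpy as np

def fsd_scores(score_before, score_after):
    """Fine-tuned Score Deviation: change in a detection score (e.g. perplexity)
    after fine-tuning the LLM on a small amount of unseen, same-domain data."""
    return np.asarray(score_before) - np.asarray(score_after)

rng = np.random.default_rng(0)
# synthetic illustration: non-members' perplexity drops a lot after fine-tuning
# on unseen data, members' perplexity barely moves
ppl_member_before, ppl_member_after = rng.normal(8, 1, 500), rng.normal(7.6, 1, 500)
ppl_nonmem_before, ppl_nonmem_after = rng.normal(12, 1, 500), rng.normal(9, 1, 500)

dev_member = fsd_scores(ppl_member_before, ppl_member_after)
dev_nonmem = fsd_scores(ppl_nonmem_before, ppl_nonmem_after)
threshold = np.median(np.concatenate([dev_member, dev_nonmem]))
tpr = (dev_member < threshold).mean()    # members show the smaller deviation
tnr = (dev_nonmem >= threshold).mean()
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")
```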

Updated: 2025-03-17 12:29:05

标题: Fine-tuning 可以帮助检测大型语言模型的预训练数据

摘要: 在大型语言模型(LLMs)时代,由于对公平评估和伦理风险的担忧,检测预训练数据变得越来越重要。目前的方法通过设计评分函数(如困惑度和Min-k%)来区分成员和非成员。然而,训练数据的多样性和复杂性增加了区分的难度,导致在检测预训练数据方面表现不佳。在本文中,我们首先探讨了未见数据的好处,这些数据可以在LLM发布后轻松收集。我们发现,在用少量以前未见数据进行微调后,LLMs的困惑度在成员和非成员之间有不同程度的变化。基于此,我们提出了一种新颖有效的方法,称为Fine-tuned Score Deviation(FSD),它改进了当前评分函数在检测预训练数据方面的性能。具体来说,我们提出在相同领域的少量未见数据上进行微调后,测量当前评分的偏差距离。实际上,使用少量未见数据可以显著降低所有非成员的得分,导致比成员更大的偏差距离。大量实验证明了我们方法的有效性,显著提高了在各种模型上的常见基准数据集上的AUC分数。

更新时间: 2025-03-17 12:29:05

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2410.10880v2

VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination

Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination - where benchmark data inadvertently leaks into pre-training or fine-tuning datasets - raise questions about the validity of these evaluations. While this issue is well known and limits the industrial adoption of LLM-driven software engineering, hardware coding has received little to no attention regarding these risks. For the first time, we analyze state-of-the-art (SOTA) evaluation frameworks for Verilog code generation (VerilogEval and RTLLM), using established methods for contamination detection (CCD and Min-K% Prob). We cover SOTA commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-{1,2,3.1}, GPT-{2,3.5,4o}, Deepseek-Coder, and CodeQwen 1.5), in both baseline and fine-tuned models (RTLCoder and Verigen). Our study confirms that data contamination is a critical concern. We explore mitigations and the resulting trade-offs between code quality and fairness (i.e., reducing contamination toward unbiased benchmarking).

Updated: 2025-03-17 12:26:49

标题: VeriContaminated: 评估LLM驱动的Verilog编码对数据污染的影响

摘要: 大型语言模型(LLMs)已经彻底改变了代码生成,取得了在各种已建立的基准测试框架上的异常成果。然而,对于数据污染的担忧——即基准数据无意中泄漏到预训练或微调数据集中——引发了对这些评估的有效性的质疑。虽然这个问题已经被认识到,但限制了LLM驱动的软件工程的工业采用,硬件编码在这些风险方面却几乎没有受到关注。我们首次分析了用于Verilog代码生成的最先进(SOTA)评估框架(VerilogEval和RTLLM),使用已建立的污染检测方法(CCD和Min-K% Prob)。我们涵盖了SOTA商业和开源LLMs(CodeGen2.5、Minitron 4b、Mistral 7b、phi-4 mini、LLaMA-{1,2,3.1}、GPT-{2,3.5,4o}、Deepseek-Coder和CodeQwen 1.5),在基线和微调模型(RTLCoder和Verigen)中。我们的研究证实了数据污染是一个关键问题。我们探讨了减轻措施以及代码质量与公平性之间的权衡(即,减少污染以实现无偏见的基准测试)。

更新时间: 2025-03-17 12:26:49

领域: cs.AR,cs.CR,cs.LG

下载: http://arxiv.org/abs/2503.13572v1

Interpretable Unsupervised Joint Denoising and Enhancement for Real-World low-light Scenarios

Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop a retinal decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.

Updated: 2025-03-17 12:08:52

标题: 可解释的无监督联合去噪和增强技术,用于真实世界低光情境

摘要: 现实世界中的低光照图像通常受到诸如局部过曝光、低亮度、噪声和不均匀照明等复杂退化的影响。监督方法往往会过度拟合特定场景,而无监督方法虽然在泛化方面更好,但由于缺乏参考图像而难以对这些退化进行建模。为了解决这个问题,我们提出了一个专门针对现实场景定制的可解释的零参考联合去噪和低光照增强框架。我们的方法基于具有不同照明和噪声水平的配对子图像的训练策略,基于物理成像原理和retinex理论。此外,我们利用离散余弦变换(DCT)在sRGB空间中进行频域分解,并引入一种隐式引导的混合表示策略,有效分离复杂的混合退化。在骨干网络设计中,我们开发了受隐式退化表示机制引导的视网膜分解网络。大量实验证明了我们方法的优越性。代码将在https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025 上提供。

更新时间: 2025-03-17 12:08:52

领域: cs.CV,cs.AI,eess.IV

下载: http://arxiv.org/abs/2503.14535v1

ShapeShifter: 3D Variations Using Multiscale and Sparse Point-Voxel Diffusion

This paper proposes ShapeShifter, a new 3D generative model that learns to synthesize shape variations based on a single reference model. While generative methods for 3D objects have recently attracted much attention, current techniques often lack geometric details and/or require long training times and large resources. Our approach remedies these issues by combining sparse voxel grids and point, normal, and color sampling within a multiscale neural architecture that can be trained efficiently and in parallel. We show that our resulting variations better capture the fine details of their original input and can handle more general types of surfaces than previous SDF-based methods. Moreover, we offer interactive generation of 3D shape variants, allowing more human control in the design loop if needed.

Updated: 2025-03-17 12:06:19

标题: 变形者:利用多尺度和稀疏点 - 体素扩散进行三维变化

摘要: 本文提出了ShapeShifter,一种新的3D生成模型,它学习基于单个参考模型合成形状变化。虽然最近对于3D对象的生成方法引起了很多关注,但当前的技术通常缺乏几何细节和/或需要较长的训练时间和大量资源。我们的方法通过将稀疏体素网格和点、法线和颜色采样结合在一个多尺度神经架构中来解决这些问题,这可以有效且并行地训练。我们展示了我们的结果变化更好地捕捉了原始输入的细节,并且可以处理比以前基于SDF的方法更一般类型的表面。此外,我们提供互动生成3D形状变体,如果需要的话,可以在设计循环中提供更多人类控制。

更新时间: 2025-03-17 12:06:19

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.02187v2

Rethinking model prototyping through the MedMNIST+ dataset collection

The integration of deep learning based systems in clinical practice is often impeded by challenges rooted in limited and heterogeneous medical datasets. In addition, the field has increasingly prioritized marginal performance gains on a few, narrowly scoped benchmarks over clinical applicability, slowing down meaningful algorithmic progress. This trend often results in excessive fine-tuning of existing methods on selected datasets rather than fostering clinically relevant innovations. In response, this work introduces a comprehensive benchmark for the MedMNIST+ dataset collection, designed to diversify the evaluation landscape across several imaging modalities, anatomical regions, classification tasks and sample sizes. We systematically reassess commonly used Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) architectures across distinct medical datasets, training methodologies, and input resolutions to validate and refine existing assumptions about model effectiveness and development. Our findings suggest that computationally efficient training schemes and modern foundation models offer viable alternatives to costly end-to-end training. Additionally, we observe that higher image resolutions do not consistently improve performance beyond a certain threshold. This highlights the potential benefits of using lower resolutions, particularly in prototyping stages, to reduce computational demands without sacrificing accuracy. Notably, our analysis reaffirms the competitiveness of CNNs compared to ViTs, emphasizing the importance of comprehending the intrinsic capabilities of different architectures. Finally, by establishing a standardized evaluation framework, we aim to enhance transparency, reproducibility, and comparability within the MedMNIST+ dataset collection. Code is available at https://github.com/sdoerrich97/rethinking-model-prototyping-MedMNISTPlus .

Updated: 2025-03-17 12:01:18

标题: 重新思考模型原型设计:通过MedMNIST+数据集收集

摘要: 深度学习系统在临床实践中的整合通常受限于有限且异质的医学数据集所固有的挑战。此外,该领域越来越倾向于在一些狭义的基准测试中优化微小性能提升,而不是关注临床适用性,从而减缓了有意义的算法进展。这种趋势往往导致对选定数据集上现有方法的过度微调,而不是促进具有临床相关性的创新。为此,本研究引入了一个全面的基准测试框架,针对MedMNIST+数据集收集,旨在跨越多个成像模式、解剖区域、分类任务和样本大小来丰富评估景观。我们系统地重新评估了常用的卷积神经网络(CNNs)和Vision Transformer(ViT)架构在不同医学数据集、训练方法和输入分辨率上的表现,以验证和完善关于模型有效性和发展的现有假设。我们的研究结果表明,计算效率高的训练方案和现代基础模型提供了昂贵的端到端训练的可行替代方案。此外,我们观察到,高分辨率图像并不总是在一定阈值之上提高性能。这突出了使用较低分辨率的潜在好处,尤其在原型设计阶段,可以减少计算需求而不降低准确性。值得注意的是,我们的分析证实了CNNs相对于ViTs的竞争力,强调了理解不同架构的内在能力的重要性。最后,通过建立一个标准化的评估框架,我们旨在增强MedMNIST+数据集收集中的透明度、可重复性和可比性。代码可在https://github.com/sdoerrich97/rethinking-model-prototyping-MedMNISTPlus中找到。

更新时间: 2025-03-17 12:01:18

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2404.15786v3

ExChanGeAI: An End-to-End Platform and Efficient Foundation Model for Electrocardiogram Analysis and Fine-tuning

Electrocardiogram data, one of the most widely available biosignal data, has become increasingly valuable with the emergence of deep learning methods, providing novel insights into cardiovascular diseases and broader health conditions. However, heterogeneity of electrocardiogram formats, limited access to deep learning model weights and intricate algorithmic steps for effective fine-tuning for own disease target labels result in complex workflows. In this work, we introduce ExChanGeAI, a web-based end-to-end platform that streamlines the reading of different formats, pre-processing, visualization and custom machine learning with local and privacy-preserving fine-tuning. ExChanGeAI is adaptable for use on both personal computers and scalable to high performance server environments. The platform offers state-of-the-art deep learning models for training from scratch, alongside our novel open-source electrocardiogram foundation model CardX, pre-trained on over one million electrocardiograms. Evaluation across three external validation sets, including an entirely new testset extracted from routine care, demonstrate the fine-tuning capabilities of ExChanGeAI. CardX outperformed the benchmark foundation model while requiring significantly fewer parameters and lower computational resources. The platform enables users to empirically determine the most suitable model for their specific tasks based on systematic validations. The code is available at https://imigitlab.uni-muenster.de/published/exchangeai .

Updated: 2025-03-17 11:58:52

标题: ExChanGeAI: 一种用于心电图分析和微调的端到端平台和高效基础模型

摘要: 心电图数据是最广泛可用的生物信号数据之一,随着深度学习方法的出现,它变得越来越有价值,为心血管疾病和更广泛的健康状况提供了新的见解。然而,心电图格式的异质性、深度学习模型权重的有限访问以及为自己的疾病目标标签进行有效微调的复杂算法步骤导致了复杂的工作流程。在这项工作中,我们介绍了ExChanGeAI,这是一个基于网络的端到端平台,简化了不同格式的读取、预处理、可视化和定制机器学习,同时具有本地和保护隐私的微调。ExChanGeAI可用于个人计算机,并可扩展到高性能服务器环境。该平台提供了用于从头开始训练的最先进的深度学习模型,以及我们的开源心电图基础模型CardX,该模型在超过一百万个心电图上进行了预训练。对包括从常规护理中提取的全新测试集在内的三个外部验证集的评估表明了ExChanGeAI的微调能力。CardX在性能基准模型的基础上表现更好,同时需要更少的参数和更低的计算资源。该平台使用户能够根据系统验证来实证确定最适合其特定任务的模型。代码可在https://imigitlab.uni-muenster.de/published/exchangeai 上找到。

更新时间: 2025-03-17 11:58:52

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.13570v1

ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
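
A minimal NumPy sketch of the clustering step, quantizing a weight matrix's sub-vectors into a shared codebook with plain k-means (block size, codebook size, and Lloyd iterations are assumptions; ClusComp's block-by-block finetuning of the codebooks is not shown):

```python
import numpy as np

def cluster_compress(W, n_codes=64, block=4, iters=25, seed=0):
    """Compress a weight matrix by clustering its length-`block` sub-vectors
    into a codebook; storing the index map plus the small codebook replaces
    the full-precision matrix."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, block)
    codes = vecs[rng.choice(len(vecs), n_codes, replace=False)].copy()
    for _ in range(iters):                                   # plain Lloyd iterations
        d = ((vecs[:, None, :] - codes[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(n_codes):
            m = assign == c
            if m.any():
                codes[c] = vecs[m].mean(0)
    W_hat = codes[assign].reshape(W.shape)
    return W_hat, codes, assign

W = np.random.default_rng(1).normal(size=(256, 256))
W_hat, codes, assign = cluster_compress(W)
print("reconstruction MSE:", ((W - W_hat) ** 2).mean())
```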

Updated: 2025-03-17 11:52:16

标题: ClusComp:一种简单的模型压缩和高效微调范式

摘要: 随着大型语言模型(LLMs)的规模扩大,模型压缩对于边缘部署和可访问性至关重要。仅权重量化可以减小模型大小,但在较低位宽下会导致性能下降。此外,标准微调与量化模型不兼容,而替代方法往往无法完全微调。在本文中,我们提出了ClusComp,一种简单而有效的压缩范式,将权重矩阵聚类到码书中,并逐块微调它们。ClusComp(1)在2-4位量化中取得了卓越的性能,(2)将压缩推动到1位,同时在最小微调的情况下胜过超低位方法,并且(3)实现了高效微调,甚至超越了现有的基于量化的方法,并与全FP16微调相匹敌。值得注意的是,ClusComp支持在单个A6000-48GB GPU上对70B LLMs进行压缩和微调。

更新时间: 2025-03-17 11:52:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.13089v1

Free-form language-based robotic reasoning and grasping

Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs' world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or if other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses with both FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: https://tev-fbk.github.io/FreeGrasp/.

Updated: 2025-03-17 11:41:16

标题: 自由形式的基于语言的机器人推理和抓取

摘要: 在一个杂乱的箱子中执行基于人类指令的机器人抓取是一项具有挑战性的任务,因为它需要理解自由形式语言的微妙之处以及物体之间的空间关系。在像GPT-4o这样的Web规模数据上训练的视觉语言模型(VLMs)展示了在文本和图像之间的卓越推理能力。但它们真的可以在零样本设置下用于这项任务吗?它们的局限性又是什么?在本文中,我们通过基于自由形式语言的机器人抓取任务探讨这些研究问题,并提出一种新颖方法FreeGrasp,利用预训练的VLMs的世界知识来推理人类指令和物体空间排列。我们的方法将所有物体检测为关键点,并使用这些关键点在图像上标记,旨在促进GPT-4o的零样本空间推理。这使我们的方法能够确定请求的物体是否可以直接抓取,或者是否必须先抓取和移除其他物体。由于目前没有专门设计用于这项任务的现有数据集,我们通过将MetaGraspNetV2数据集扩展为具有人类注释的指令和地面真实抓取序列的合成数据集FreeGraspData来介绍。我们使用FreeGraspData进行广泛的分析,并通过具有夹具装备的机械臂进行真实世界验证,展示了在抓取推理和执行方面的最新性能。项目网站:https://tev-fbk.github.io/FreeGrasp/。

更新时间: 2025-03-17 11:41:16

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2503.13082v1

A Framework to Assess Multilingual Vulnerabilities of LLMs

Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. While they undergo safety training to prevent them from answering illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRLs). This paper proposes a framework to automatically assess the multilingual vulnerabilities of commonly used LLMs. Using our framework, we evaluated six LLMs across eight languages representing varying levels of resource availability. We validated the assessments generated by our automated framework through human evaluation in two languages, demonstrating that the framework's results align with human judgments in most cases. Our findings reveal vulnerabilities in LRLs; however, these may pose minimal risk as they often stem from the model's poor performance, resulting in incoherent responses.

Updated: 2025-03-17 11:39:44

标题: 一个评估LLM多语言脆弱性的框架

摘要: 大型语言模型(LLMs)正在获得更广泛的能力,包括理解和回答多种语言。虽然它们经过安全训练以防止它们回答非法问题,但在训练数据和人类评估资源中的不平衡可能使这些模型在资源匮乏的语言中更容易受到攻击。本文提出了一个框架,可以自动评估常用LLMs的多语言漏洞。使用我们的框架,我们评估了八种语言中的六个LLMs,代表了不同资源可用性水平。我们通过在两种语言中进行人工评估验证了我们自动框架生成的评估结果,表明在大多数情况下,框架的结果与人类判断一致。我们的研究结果揭示了在资源匮乏的语言中存在的漏洞;然而,这些漏洞可能构成很小的风险,因为它们往往源于模型的性能不佳,导致回复不连贯。

更新时间: 2025-03-17 11:39:44

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.13081v1

Towards Better Sample Efficiency in Multi-Agent Reinforcement Learning via Exploration

Multi-agent reinforcement learning has shown promise in learning cooperative behaviors in team-based environments. However, such methods often demand extensive training time. For instance, the state-of-the-art method TiZero takes 40 days to train high-quality policies for a football environment. In this paper, we hypothesize that better exploration mechanisms can improve the sample efficiency of multi-agent methods. We propose two different approaches for better exploration in TiZero: a self-supervised intrinsic reward and a random network distillation bonus. Additionally, we introduce architectural modifications to the original algorithm to enhance TiZero's computational efficiency. We evaluate the sample efficiency of these approaches through extensive experiments. Our results show that random network distillation improves training sample efficiency by 18.8% compared to the original TiZero. Furthermore, we evaluate the qualitative behavior of the models produced by both variants against a heuristic AI, with the self-supervised reward encouraging possession and random network distillation leading to a more offensive performance. Our results highlight the applicability of our random network distillation variant in practical settings. Lastly, due to the nature of the proposed method, we acknowledge its potential use beyond football simulation, especially in environments with strong multi-agent and strategic aspects.

Updated: 2025-03-17 11:32:28

标题: 朝着通过探索在多智能体强化学习中实现更好的样本效率

摘要: 多智能体强化学习在团队环境中学习合作行为方面表现出潜力。然而,这些方法通常需要大量的训练时间。例如,最先进的方法TiZero需要40天的时间来训练足球环境中的高质量策略。在本文中,我们假设更好的探索机制可以提高多智能体方法的样本效率。我们提出了两种不同的方法来改进TiZero的探索:自监督内在奖励和随机网络蒸馏奖励。此外,我们对原始算法进行了架构修改,以增强TiZero的计算效率。我们通过广泛的实验评估了这些方法的样本效率。我们的结果显示,与原始的TiZero相比,随机网络蒸馏将训练样本效率提高了18.8%。此外,我们评估了由两种变体产生的模型的定性行为,与启发式AI对比,自监督奖励鼓励掌控,而随机网络蒸馏导致更具攻击性的表现。我们的结果突出了我们随机网络蒸馏变体在实际环境中的适用性。最后,由于所提出方法的性质,我们承认其在足球模拟之外的使用,特别是在具有强大多智能体和战略方面的环境中。

更新时间: 2025-03-17 11:32:28

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2503.13077v1

On the Convergence of Monte Carlo UCB for Random-Length Episodic MDPs

In reinforcement learning, Monte Carlo algorithms update the Q function by averaging the episodic returns. In the Monte Carlo UCB (MC-UCB) algorithm, the action taken in each state is the action that maximizes the Q function plus an Upper Confidence Bounds (UCB) exploration term, which biases the choice of actions to those that have been chosen less frequently. Although there has been significant work on establishing regret bounds for MC-UCB, most of that work has been focused on finite-horizon versions of the problem, for which each episode terminates after a constant number of steps. For such finite-horizon problems, the optimal policy depends both on the current state and the time within the episode. However, for many natural episodic problems, such as games like Go and Chess and robotic tasks, the episode is of random length and the optimal policy is stationary. For such environments, it is an open question whether the Q-function in MC-UCB will converge to the optimal Q function; we conjecture that, unlike Q-learning, it does not converge for all MDPs. We nevertheless show that for a large class of MDPs, which includes stochastic MDPs such as blackjack and deterministic MDPs such as Go, the Q function in MC-UCB converges almost surely to the optimal Q function. An immediate corollary of this result is that it also converges almost surely for all finite-horizon MDPs. We also provide numerical experiments, providing further insights into MC-UCB.
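
A small NumPy sketch of MC-UCB on a toy random-length episodic MDP: the Q function is the running average of episodic returns, and actions maximize Q plus a UCB bonus (the toy dynamics, exploration constant, and count initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, c, gamma = 4, 2, 2.0, 1.0                 # undiscounted episodic setting
Q = np.zeros((nS, nA))
counts = np.ones((nS, nA))                        # start at 1 to avoid division by zero
returns_sum = np.zeros((nS, nA))

def step(s, a):
    # toy random-length MDP: action 1 moves right, action 0 stays; the episode
    # terminates from the last state with probability 0.5
    s2 = min(s + a, nS - 1)
    r = 1.0 if (s2 == nS - 1 and a == 1) else 0.0
    done = s2 == nS - 1 and rng.random() < 0.5
    return s2, r, done

for ep in range(3000):
    s, traj, done, t = 0, [], False, 0
    while not done and t < 50:
        ucb = Q[s] + c * np.sqrt(np.log(counts[s].sum()) / counts[s])
        a = int(ucb.argmax())                     # optimistic action selection
        s2, r, done = step(s, a)
        traj.append((s, a, r)); s = s2; t += 1
    G = 0.0
    for (s, a, r) in reversed(traj):              # Monte Carlo: average episodic returns
        G = r + gamma * G
        counts[s, a] += 1
        returns_sum[s, a] += G
        Q[s, a] = returns_sum[s, a] / counts[s, a]

print(np.round(Q, 2))
```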

Updated: 2025-03-17 11:29:03

标题: 关于随机长度的情节式MDPs的蒙特卡洛UCB的收敛性

摘要: 在强化学习中,蒙特卡罗算法通过计算每个回合的平均回报来更新Q函数。在蒙特卡罗UCB(MC-UCB)算法中,每个状态下采取的动作是最大化Q函数加上一个上限置信界(UCB)探索项的动作,这会偏向于选择那些被选择得较少的动作。尽管已经有大量工作建立了MC-UCB的遗憾界,但大部分工作都集中在问题的有限时间版本上,其中每个回合在一定步数后终止。对于这种有限时间问题,最优策略取决于当前状态和回合内的时间。然而,对于许多自然的回合问题,如围棋、国际象棋和机器人任务,回合长度是随机的,最优策略是稳定的。对于这种环境,一个悬而未决的问题是MC-UCB中的Q函数是否会收敛到最优Q函数;我们猜测,与Q学习不同,它不会对所有MDPs收敛。尽管如此,我们展示了对于一个包括随机MDPs(如二十一点)和确定性MDPs(如围棋)在内的大类MDPs,MC-UCB中的Q函数几乎确定地收敛到最优Q函数。这一结果的一个直接推论是,对于所有有限时间MDPs,它也几乎确定地收敛。我们还提供了数值实验,进一步深入了解MC-UCB。

更新时间: 2025-03-17 11:29:03

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2209.02864v2

A Survey on Knowledge-Oriented Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has gained significant attention in recent years for its potential to enhance natural language understanding and generation by combining large-scale retrieval systems with generative models. RAG leverages external knowledge sources, such as documents, databases, or structured data, to improve model performance and generate more accurate and contextually relevant outputs. This survey aims to provide a comprehensive overview of RAG by examining its fundamental components, including retrieval mechanisms, generation processes, and the integration between the two. We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge, and the challenges associated with aligning retrieved information with generative objectives. We also present a taxonomy that categorizes RAG methods, ranging from basic retrieval-augmented approaches to more advanced models incorporating multi-modal data and reasoning capabilities. Additionally, we review the evaluation benchmarks and datasets commonly used to assess RAG systems, along with a detailed exploration of its applications in fields such as question answering, summarization, and information retrieval. Finally, we highlight emerging research directions and opportunities for improving RAG systems, such as enhanced retrieval efficiency, model interpretability, and domain-specific adaptations. This paper concludes by outlining the prospects for RAG in addressing real-world challenges and its potential to drive further advancements in natural language processing.

Updated: 2025-03-17 11:24:11

标题: 关于基于知识的检索增强生成的调查

摘要: 检索增强生成(RAG)近年来引起了广泛关注,因为它有潜力通过将大规模检索系统与生成模型结合来增强自然语言理解和生成。RAG利用外部知识源,例如文档、数据库或结构化数据,来提高模型性能并生成更准确和上下文相关的输出。本调查旨在通过审查其基本组成部分,包括检索机制、生成过程和两者之间的整合,从而全面概述RAG。我们讨论了RAG的关键特性,例如其通过动态外部知识增强生成模型的能力,以及与生成目标对齐的检索信息所面临的挑战。我们还提出了一个对RAG方法进行分类的分类法,从基本的检索增强方法到包含多模态数据和推理能力的更高级模型。此外,我们还回顾了评估RAG系统常用的基准和数据集,以及对其在诸如问答、摘要和信息检索等领域的应用的详细探讨。最后,我们重点介绍了改进RAG系统的新兴研究方向和机会,例如增强检索效率、模型可解释性和领域特定适应性。本文通过概述RAG在解决现实挑战和推动自然语言处理进一步发展方面的前景来总结。

更新时间: 2025-03-17 11:24:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.10677v2

A Coefficient Makes SVRG Effective

Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlight, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our empirical analysis finds that, for deeper neural networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes models, consistently reducing training loss compared to the baseline and standard SVRG across various model architectures and multiple image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at github.com/davidyyd/alpha-SVRG.
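
A minimal NumPy sketch of the alpha-SVRG gradient estimator on a least-squares problem, with the coefficient decayed linearly over epochs (the step size, initial coefficient, and quadratic objective are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 20
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(w, i):                        # per-example least-squares gradient
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    return A.T @ (A @ w - b) / n

w, lr, epochs, alpha0 = np.zeros(d), 0.01, 30, 0.75
for ep in range(epochs):
    alpha = alpha0 * (1 - ep / epochs)   # linearly decayed coefficient (the alpha)
    snap, mu = w.copy(), full_grad(w)    # snapshot point and its full gradient
    for _ in range(n):
        i = rng.integers(n)
        g = grad_i(w, i) - alpha * (grad_i(snap, i) - mu)  # alpha scales the control variate
        w -= lr * g
    if ep % 10 == 0:
        print("epoch", ep, "mse", float(((A @ w - b) ** 2).mean()))
```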

Updated: 2025-03-17 11:14:58

标题: 一个系数使得SVRG有效

摘要: 随机方差减少梯度(SVRG),由Johnson&Zhang(2013)引入,是一种在理论上引人注目的优化方法。但是,正如Defazio&Bottou(2019)所强调的,它在深度学习中的有效性尚未得到证明。在这项工作中,我们展示了SVRG在优化现实世界神经网络中的潜力。我们的实证分析发现,对于更深层的神经网络,SVRG中的方差减少项的强度应较小,并随着训练的进行而减小。受此启发,我们引入了一个用于控制强度并通过线性衰减调整的乘法系数α。我们将我们的方法命名为α-SVRG。我们的结果显示,与基线和标准SVRG相比,α-SVRG更好地优化模型,持续降低训练损失,适用于各种模型架构和多个图像分类数据集。我们希望我们的发现能够鼓励进一步探索深度学习中的方差减少技术。代码可在github.com/davidyyd/alpha-SVRG上找到。

更新时间: 2025-03-17 11:14:58

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2311.05589v2

MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling

Species Distribution Models (SDMs) play a vital role in biodiversity research, conservation planning, and ecological niche modeling by predicting species distributions based on environmental conditions. The selection of predictors is crucial, strongly impacting both model accuracy and how well the predictions reflect ecological patterns. To ensure meaningful insights, input variables must be carefully chosen to match the study objectives and the ecological requirements of the target species. However, existing SDMs, including both traditional and deep learning-based approaches, often lack key capabilities for variable selection: (i) flexibility to choose relevant predictors at inference without retraining; (ii) robustness to handle missing predictor values without compromising accuracy; and (iii) explainability to interpret and accurately quantify each predictor's contribution. To overcome these limitations, we introduce MaskSDM, a novel deep learning-based SDM that enables flexible predictor selection by employing a masked training strategy. This approach allows the model to make predictions with arbitrary subsets of input variables while remaining robust to missing data. It also provides a clearer understanding of how adding or removing a given predictor affects model performance and predictions. Additionally, MaskSDM leverages Shapley values for precise predictor contribution assessments, improving upon traditional approximations. We evaluate MaskSDM on the global sPlotOpen dataset, modeling the distributions of 12,738 plant species. Our results show that MaskSDM outperforms imputation-based methods and approximates models trained on specific subsets of variables. These findings underscore MaskSDM's potential to increase the applicability and adoption of SDMs, laying the groundwork for developing foundation models in SDMs that can be readily applied to diverse ecological applications.
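
A short NumPy sketch of the masked-training input construction such a model relies on: each sample keeps a random subset of predictors, and the observation mask is fed to the network alongside the zero-filled values (the masking rates and zero fill are assumptions, not MaskSDM's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_batch(X, p_drop_range=(0.0, 0.7)):
    """Each sample keeps a random subset of predictors; the mask is passed to
    the model together with the values so it learns to handle any subset."""
    p = rng.uniform(*p_drop_range, size=(X.shape[0], 1))
    mask = rng.random(X.shape) >= p          # True = predictor observed
    X_obs = np.where(mask, X, 0.0)           # masked values replaced by 0
    return np.concatenate([X_obs, mask.astype(X.dtype)], axis=1)

X = rng.normal(size=(8, 5))                  # 8 sites, 5 environmental predictors
batch = masked_batch(X)
print(batch.shape)                           # (8, 10): values + observation mask
```

At inference time, the mask is simply set to the predictors chosen for that scenario, so predictors can be selected without retraining.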

Updated: 2025-03-17 11:02:28

标题: 使用Shapley值的MaskSDM以提高物种分布模型的灵活性、鲁棒性和可解释性

摘要: 物种分布模型(SDMs)在生物多样性研究、保护规划和生态位建模中发挥着重要作用,通过基于环境条件预测物种分布。选择预测变量至关重要,强烈影响模型准确性以及预测如何反映生态模式。为确保有意义的洞察力,必须谨慎选择输入变量,以符合研究目标和目标物种的生态要求。然而,现有的SDMs,包括传统的和基于深度学习的方法,通常缺乏关键的变量选择能力:(i)在推断时灵活选择相关预测变量而无需重新训练;(ii)处理缺失预测变量值的鲁棒性而不影响准确性;和(iii)可解释性,以解释和准确量化每个预测变量的贡献。为了克服这些限制,我们引入了MaskSDM,这是一种新颖的基于深度学习的SDM,通过采用掩码训练策略实现灵活的预测变量选择。这种方法允许模型使用任意子集的输入变量进行预测,同时保持对缺失数据的鲁棒性。它还提供了更清晰的理解,即添加或删除特定预测变量如何影响模型性能和预测。此外,MaskSDM利用Shapley值进行精确的预测变量贡献评估,改进了传统的近似方法。我们在全球sPlotOpen数据集上评估了MaskSDM,对12,738种植物物种的分布进行建模。我们的结果显示,MaskSDM优于基于插补的方法,并接近在特定变量子集上训练的模型。这些发现强调了MaskSDM增加SDM的适用性和采用率的潜力,为开发可广泛应用于不同生态应用的SDM基础模型奠定了基础。

更新时间: 2025-03-17 11:02:28

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2503.13057v1

Deep Hedging of Green PPAs in Electricity Markets

In power markets, Green Power Purchase Agreements (PPAs) have become an important contractual tool of the energy transition from fossil fuels to renewable sources such as wind or solar radiation. Trading Green PPAs exposes agents to price risks and weather risks. Also, developed electricity markets feature the so-called cannibalisation effect: large infeeds induce low prices and vice versa. As weather is a non-tradable entity, the question arises how to hedge and risk-manage in this highly incomplete setting. We propose a "deep hedging" framework utilising machine learning methods to construct hedging strategies. The resulting strategies outperform static and dynamic benchmark strategies with respect to different risk measures.
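
A minimal PyTorch sketch of the deep-hedging idea on a much simpler problem: a small network learns a state-dependent hedge ratio that minimizes the variance of the hedged P&L of a short call under Black-Scholes dynamics (the dynamics, payoff, and quadratic criterion are stand-ins for the paper's far richer PPA setting):

```python
import torch

torch.manual_seed(0)
n_paths, n_steps, dt = 2048, 30, 1.0 / 30
S0, sigma, K = 1.0, 0.2, 1.0

# risk-neutral geometric Brownian motion paths for the underlying
z = torch.randn(n_paths, n_steps)
logret = (-0.5 * sigma ** 2) * dt + sigma * dt ** 0.5 * z
S = S0 * torch.exp(torch.cat([torch.zeros(n_paths, 1), logret.cumsum(1)], dim=1))

policy = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for it in range(200):
    pnl = -torch.clamp(S[:, -1] - K, min=0.0)            # short-call payoff at maturity
    for t in range(n_steps):
        feat = torch.stack([S[:, t], torch.full((n_paths,), (n_steps - t) * dt)], dim=1)
        delta = policy(feat).squeeze(1)                   # state-dependent hedge ratio
        pnl = pnl + delta * (S[:, t + 1] - S[:, t])       # gains from rebalancing
    loss = pnl.var()                                      # quadratic hedging criterion
    opt.zero_grad(); loss.backward(); opt.step()

print("hedged PnL std:", float(pnl.std()))
```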

Updated: 2025-03-17 11:02:23

标题: 在电力市场中对绿色购电协议进行深度套期保值

摘要: 在电力市场中,绿色电力购买协议已经成为从化石燃料向风能或太阳能等可再生能源过渡的重要合同工具。交易绿色电力购买协议使代理商面临价格风险和天气风险。此外,发达的电力市场具有所谓的食人效应:大规模的输入会导致低价格,反之亦然。由于天气是一种不可交易的实体,因此如何在这种高度不完整的情况下进行对冲和风险管理成为一个问题。我们提出了一个“深度对冲”框架,利用机器学习方法构建对冲策略。由此产生的策略在不同风险度量方面表现优于静态和动态基准策略。

更新时间: 2025-03-17 11:02:23

领域: q-fin.CP,cs.LG,q-fin.RM

下载: http://arxiv.org/abs/2503.13056v1

Further Exploration of Precise Binding Energies from Physics Informed Machine Learning and the Development of a Practical Ensemble Model

Sixteen new physics-informed machine learning models have been trained on binding energy residuals from modern mass models that leverage shape parameters and other physical features. The models have been trained on a subset of AME 2012 data and have been verified with a subset of the AME 2020 data. Among the machine learning approaches tested in this work, the preferred approach is the least squares boosted ensemble of trees, which appears to have a superior ability to both interpolate and extrapolate binding energy residuals. The machine learning models for four mass models created from the ensemble of trees approach have been combined to create a composite model called the Four Model Tree Ensemble (FMTE). The FMTE model predicts binding energy values from AME 2020 with a standard deviation of 76 keV and a mean deviation of 34 keV for all nuclei with N > 7 and Z > 7. A comparison with new mass measurements for 33 isotopes not included in AME 2012 or AME 2020 indicates that the FMTE performs better than all mass models that were tested.

Updated: 2025-03-17 11:01:56

标题: 进一步探索物理信息机器学习中精确结合能量,并开发实用的集成模型

摘要: 已经对16个新的物理信息机器学习模型进行了训练,这些模型利用了现代质量模型中的形状参数和其他物理特征的结合能残差。这些模型已经在AME 2012数据的子集上进行了训练,并且已经通过AME 2020数据的子集进行了验证。在这项工作中测试的机器学习方法中,首选的方法是最小二乘提升树集合,它似乎具有更好的插值和外推结合能残差的能力。从树集合方法创建的四个质量模型的机器学习模型已经结合在一起,形成了一个名为Four Model Tree Ensemble(FMTE)的综合模型。FMTE模型针对所有N > 7和Z > 7的核素预测AME 2020的结合能值,标准偏差为76 keV,平均偏差为34 keV。与AME 2012或AME 2020中未包含的33个同位素的新质量测量进行比较表明,FMTE的表现优于所有测试的质量模型。

更新时间: 2025-03-17 11:01:56

领域: cs.LG,nucl-th

下载: http://arxiv.org/abs/2503.11066v2

Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Food Preparation Task Planning

We study Multimodal Large Language Models (MLLMs) with in-context learning for food preparation task planning. In this context, we identify two key challenges: cross-modal distraction and geometric feasibility. Cross-modal distraction occurs when the inclusion of visual input degrades the reasoning performance of an MLLM. Geometric feasibility refers to the ability of MLLMs to ensure that the selected skills are physically executable in the environment. To address these issues, we adapt Chain of Thought (CoT) with Self-Consistency to mitigate reasoning loss from cross-modal distractions and use an affordance predictor as a skill precondition to guide the MLLM on geometric feasibility. We construct a dataset to evaluate the ability of MLLMs on quantity estimation, reachability analysis, relative positioning and collision avoidance. We conducted a detailed evaluation to identify issues among different baselines and analyze the reasons for improvement, providing insights into each approach. Our method reaches a success rate of 76.7% on the entire dataset, showing a substantial improvement over the CoT baseline at 36.7%.

Updated: 2025-03-17 11:01:02

标题: 通过可支配性引导的、自洽的MLLMS减轻跨模态干扰并确保几何可行性,用于食物准备任务规划。

摘要: 我们研究了在食物准备任务规划中具有上下文学习的多模态大型语言模型(MLLMs)。在这种情境下,我们确定了两个关键挑战:跨模态干扰和几何可行性。跨模态干扰发生在包含视觉输入时,降低了MLLM的推理性能。几何可行性指的是MLLM确保所选技能在环境中是物理可执行的能力。为了解决这些问题,我们采用了具有自一致性的“思维链”(CoT)以减轻跨模态干扰导致的推理损失,并使用适应性预测作为技能前提条件,以指导MLLM在几何可行性上。我们构建了一个数据集,评估MLLM在数量估计、可达性分析、相对定位和碰撞回避方面的能力。我们进行了详细评估,以确定不同基线之间的问题,并分析改进的原因,为每种方法提供见解。我们的方法在整个数据集上达到了76.7%的成功率,较CoT基线的36.7%有显著改进。

更新时间: 2025-03-17 11:01:02

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2503.13055v1

The Narrow Gate: Localized Image-Text Communication in Vision-Language Models

Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, specifically focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with those that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. Additionally, models vary in how information is exchanged from visual to textual tokens. VLMs that only output text exhibit a distributed communication pattern, where information is exchanged through multiple image tokens. In contrast, models trained for image and text generation tend to rely on a single token that acts as a narrow gate for visual information. We demonstrate that ablating this single token significantly deteriorates performance on image understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.
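
A minimal PyTorch sketch of the single-token ablation used in this kind of analysis: a forward hook zeroes one token's hidden state at a chosen layer (the model, layer path, and token index in the usage comment are assumptions):

```python
import torch

def make_ablation_hook(token_idx):
    """Return a forward hook that zeroes one token's hidden state, severing
    the 'narrow gate' through which visual information flows."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, token_idx, :] = 0.0            # ablate the gate token
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# usage sketch; the model, layer path, and token index are assumptions:
# layer = model.language_model.layers[15]
# handle = layer.register_forward_hook(make_ablation_hook(token_idx=42))
# ... run the image-understanding evaluation, then: handle.remove()
```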

Updated: 2025-03-17 10:59:29

标题: 狭窄之门:视觉-语言模型中的本地化图像-文本沟通

摘要: 最近在多模态训练领域取得的进展显著提高了图像理解和生成在统一模型中的整合。本研究调查了视觉语言模型(VLMs)如何处理图像理解任务,特别关注视觉信息如何被处理并转移到文本领域。我们比较了生成图像和文本的VLMs与仅输出文本的VLMs,突出了信息流中的关键差异。我们发现,在具有多模态输出的模型中,图像和文本嵌入更多地分离在残差流中。此外,模型在如何从视觉到文本标记进行信息交换方面存在差异。仅输出文本的VLMs表现出分布式通信模式,其中信息通过多个图像标记进行交换。相比之下,经过图像和文本生成训练的模型倾向于依赖于一个作为狭窄门的单个标记来传递视觉信息。我们证明,消除这个单个标记显著降低了图像理解任务的性能。此外,修改这个标记可以有效地引导图像语义,表明有针对性的局部干预可以可靠地控制模型的全局行为。

更新时间: 2025-03-17 10:59:29

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2412.06646v2

Bitcoin Battle: Burning Bitcoin for Geopolitical Fun and Profit

This study empirically analyzes the transaction activity of Bitcoin addresses linked to Russian intelligence services, which have liquidated over 7 Bitcoin (BTC), equivalent to approximately US$300,000 at the exchange rate at the time. Our investigation begins with an observed anomaly in transaction outputs featuring the Bitcoin Script operation code, tied to input addresses identified by cyber threat intelligence sources and court documents as belonging to Russian intelligence agencies. We explore how an unauthorized entity appears to have gained control of the associated private keys, with messages embedded in the outputs confirming the seizure. Tracing the funds' origins, we connect them to cryptocurrency mixers and establish a link to the Russian ransomware group Conti, implicating intelligence service involvement. This analysis represents one of the first empirical studies of large-scale Bitcoin misuse by nation-state cyber actors.

Updated: 2025-03-17 10:55:59

标题: 比特币之战:为了地缘政治的乐趣和利润而销毁比特币

摘要: 这项研究通过实证分析与俄罗斯情报机构相关的比特币地址的交易活动,这些地址已经清算了超过7比特币(BTC),即根据当时的汇率约等于30万美元。我们的调查始于观察到的交易输出异常,其中涉及比特币脚本操作代码,与网络威胁情报来源和法院文件确定的输入地址相关联,这些地址被确认为属于俄罗斯情报机构。我们探讨了未经授权的实体似乎已经获得了相关私钥的控制权,并且输出中嵌入的消息证实了此次扣押。通过追踪资金的来源,我们将其与加密货币混合器联系起来,并建立了与俄罗斯勒索软件组织Conti的联系,涉及情报机构的参与。这项分析代表着国家级网络行动者大规模滥用比特币的第一项实证研究之一。

更新时间: 2025-03-17 10:55:59

领域: cs.CR

下载: http://arxiv.org/abs/2503.13052v1

Permutation Learning with Only N Parameters: From SoftSort to Self-Organizing Gaussians

Sorting and permutation learning are key concepts in optimization and machine learning, especially when organizing high-dimensional data into meaningful spatial layouts. The Gumbel-Sinkhorn method, while effective, requires N*N parameters to determine a full permutation matrix, making it computationally expensive for large datasets. Low-rank matrix factorization approximations reduce memory requirements to 2MN (with M << N), but they still struggle with very large problems. SoftSort, by providing a continuous relaxation of the argsort operator, allows differentiable 1D sorting, but it faces challenges with multidimensional data and complex permutations. In this paper, we present a novel method for learning permutations using only N parameters, which dramatically reduces storage costs. Our approach builds on SoftSort, but extends it by iteratively shuffling the N indices of the elements to be sorted through a separable learning process. This modification significantly improves sorting quality, especially for multidimensional data and complex optimization criteria, and outperforms pure SoftSort. Our method offers improved memory efficiency and scalability compared to existing approaches, while maintaining high-quality permutation learning. Its dramatically reduced memory requirements make it particularly well-suited for large-scale optimization tasks, such as "Self-Organizing Gaussians", where efficient and scalable permutation learning is critical.
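
For reference, a NumPy sketch of the SoftSort operator the method extends: a row-wise softmax over negative distances between the sorted vector and the input yields a differentiable relaxation of the sorting permutation (the temperature value is illustrative):

```python
import numpy as np

def softsort(s, tau=0.05):
    """SoftSort: continuous relaxation of the permutation matrix that sorts s
    in decreasing order; as tau -> 0 it approaches a hard permutation."""
    s = np.asarray(s, dtype=float)
    sorted_s = np.sort(s)[::-1]
    logits = -np.abs(sorted_s[:, None] - s[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

s = np.array([0.3, 1.7, -0.2, 0.9])
P = softsort(s)
print(np.round(P, 2))     # near-permutation matrix
print(P @ s)              # ~ s sorted in decreasing order
```

The paper's method keeps only the N sort keys as learnable parameters and iteratively shuffles the indices, rather than learning a full N*N relaxation.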

Updated: 2025-03-17 10:55:55

标题: 只用N个参数进行置换学习:从SoftSort到自组织高斯函数

摘要: 排序和排列学习是优化和机器学习中的关键概念,特别是在将高维数据组织成有意义的空间布局时。尽管Gumbel-Sinkhorn方法非常有效,但需要N*N个参数来确定一个完整的排列矩阵,这使得在处理大型数据集时计算成本很高。低秩矩阵分解逼近将内存需求降低到2MN(其中M << N),但仍然在处理非常大的问题时存在困难。SoftSort通过提供argsort运算符的连续松弛,允许可微的一维排序,但在处理多维数据和复杂排列时面临挑战。在本文中,我们提出了一种仅使用N个参数学习排列的新方法,大大降低了存储成本。我们的方法基于SoftSort,通过通过可分离学习过程反复对要排序的元素的N个索引进行洗牌来扩展它。这种修改显著改善了排序质量,特别是对于多维数据和复杂优化标准,并且优于纯SoftSort。我们的方法相比现有方法提供了更高的内存效率和可扩展性,同时保持高质量的排列学习。其大幅降低的内存需求使其特别适合大规模优化任务,例如"自组织高斯",其中高效和可扩展的排列学习至关重要。

更新时间: 2025-03-17 10:55:55

领域: cs.LG,cs.CV,stat.ML

下载: http://arxiv.org/abs/2503.13051v1

Preserving clusters and correlations: a dimensionality reduction method for exceptionally high global structure preservation

We present Preserving Clusters and Correlations (PCC), a novel dimensionality reduction (DR) method that achieves state-of-the-art global structure (GS) preservation while maintaining competitive local structure (LS) preservation. It optimizes two objectives: a GS preservation objective that preserves an approximation of Pearson and Spearman correlations between high- and low-dimensional distances, and an LS preservation objective that ensures clusters in the high-dimensional data are separable in the low-dimensional data. In addition, we show the correlation objective can be combined with UMAP to significantly improve its GS preservation with minimal degradation of the LS. We quantitatively benchmark PCC against existing methods, demonstrate its utility in medical imaging, and show that it is a competitive DR technique with superior GS preservation in our benchmarks.
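
A minimal NumPy sketch of a GS-preservation objective in this spirit: the (negative) Pearson correlation between high- and low-dimensional pairwise distances; the paper additionally approximates Spearman correlation, which is not shown here:

```python
import numpy as np

def distance_correlation_loss(X_hi, X_lo):
    """Negative Pearson correlation between high- and low-dimensional
    pairwise distances; minimizing it maximizes global structure agreement."""
    iu = np.triu_indices(len(X_hi), k=1)                       # unique pairs
    d_hi = np.sqrt(((X_hi[:, None] - X_hi[None]) ** 2).sum(-1))[iu]
    d_lo = np.sqrt(((X_lo[:, None] - X_lo[None]) ** 2).sum(-1))[iu]
    d_hi = (d_hi - d_hi.mean()) / d_hi.std()
    d_lo = (d_lo - d_lo.mean()) / d_lo.std()
    return -(d_hi * d_lo).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
print(distance_correlation_loss(X, X[:, :2]))   # coordinate truncation as a stand-in embedding
```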

Updated: 2025-03-17 10:48:02

标题: 保留聚类和相关性:一种用于保留异常高全局结构的降维方法

摘要: 我们提出了保留簇和相关性(PCC)的方法,这是一种新颖的降维(DR)方法,它在保持竞争性局部结构(LS)保留的同时实现了最先进的全局结构(GS)保留。它优化了两个目标:一个GS保留目标,保持高维和低维距离之间皮尔逊和斯皮尔曼相关性的近似,以及一个LS保留目标,确保高维数据中的簇在低维数据中是可分离的。PCC具有保持GS的最新能力,同时具有竞争性的LS保留。此外,我们展示了相关性目标可以与UMAP结合,显著改善其GS保留,同时最小程度降低LS。我们通过与现有方法进行定量基准测试,并展示其在医学成像中的实用性,表明PCC是一种竞争性的DR技术,在我们的基准测试中展示了优越的GS保留能力。

更新时间: 2025-03-17 10:48:02

领域: cs.LG

下载: http://arxiv.org/abs/2503.07609v3

WMINet: A Wheel-Mounted Inertial Learning Approach For Mobile-Robot Positioning

Autonomous mobile robots are widely used for navigation, transportation, and inspection tasks indoors and outdoors. In practical situations of limited satellite signals or poor lighting conditions, navigation depends only on inertial sensors. In such cases, the navigation solution rapidly drifts due to inertial measurement errors. In this work, we propose WMINet, a wheel-mounted inertial deep learning approach to estimate the mobile robot's position based only on its inertial sensors. To that end, we merge two common practical methods to reduce inertial drift: a wheel-mounted approach and driving the mobile robot in periodic trajectories. Additionally, we enforce a wheelbase constraint to further improve positioning performance. To evaluate our proposed approach, we recorded a wheel-mounted inertial dataset totaling 190 minutes using the Rosbot-XL, which is made publicly available. Our approach demonstrated a 66% improvement over state-of-the-art approaches. As a consequence, our approach enables navigation in challenging environments and bridges the pure inertial gap. This enables seamless robot navigation using only inertial sensors for short periods.

Updated: 2025-03-17 10:43:46

标题: WMINet:一种用于移动机器人定位的轮式惯性学习方法

摘要: 自主移动机器人被广泛用于室内和室外的导航、运输和检查任务。在卫星信号有限或照明条件不佳的实际情况下,导航仅依赖惯性传感器。在这种情况下,由于惯性测量误差,导航解决方案会迅速漂移。在本研究中,我们提出了一种名为WMINet的轮式惯性深度学习方法,仅基于机器人的惯性传感器估计移动机器人的位置。为此,我们结合了两种常见的实用方法来减少惯性漂移:轮式方法和在周期轨迹上驾驶移动机器人。此外,我们强制执行一个轴距约束以进一步提高定位性能。为了评估我们提出的方法,我们使用Rosbot-XL记录了一个190分钟的轮式初始数据集,该数据集已公开。我们的方法表现出对最先进方法的66%的改进。因此,我们的方法实现了在具有挑战性的环境中导航,并弥合了纯惯性差距。这使得仅使用惯性传感器在短时间内实现无缝的机器人导航成为可能。

更新时间: 2025-03-17 10:43:46

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.13568v1

Learning Spatially Adaptive $\ell_1$-Norms Weights for Convolutional Synthesis Regularization

We propose an unrolled algorithm approach for learning spatially adaptive parameter maps in the framework of convolutional synthesis-based $\ell_1$ regularization. More precisely, we consider a family of pre-trained convolutional filters and estimate deeply parametrized spatially varying parameters applied to the sparse feature maps by means of unrolling a FISTA algorithm to solve the underlying sparse estimation problem. The proposed approach is evaluated for image reconstruction of low-field MRI and compared to spatially adaptive and non-adaptive analysis-type procedures relying on Total Variation regularization and to a well-established model-based deep learning approach. We show that the proposed approach produces visually and quantitatively comparable results with the latter approaches and at the same time remains highly interpretable. In particular, the inferred parameter maps quantify the local contribution of each filter in the reconstruction, which provides valuable insight into the algorithm mechanism and could potentially be used to discard unsuited filters.
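
A minimal NumPy sketch of unrolled FISTA for synthesis $\ell_1$ regularization with element-wise thresholds; here the dictionary is a dense random matrix and the threshold map is constant, whereas the paper uses pre-trained convolutional filters and a deeply parametrized, spatially varying map:

```python
import numpy as np

def soft(x, thresh):
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def unrolled_fista(y, D, lam_map, n_iter=100):
    """FISTA for min_z 0.5*||D z - y||^2 + ||lam_map * z||_1, unrolled for a
    fixed number of iterations so lam_map could be learned end-to-end."""
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1]); u = z.copy(); t = 1.0
    for _ in range(n_iter):
        z_new = soft(u - D.T @ (D @ u - y) / L, lam_map / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        u = z_new + (t - 1) / t_new * (z_new - z) # momentum extrapolation
        z, t = z_new, t_new
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 128))
z_true = np.zeros(128); z_true[rng.choice(128, 8, replace=False)] = 3.0
y = D @ z_true + 0.05 * rng.normal(size=64)
z_hat = unrolled_fista(y, D, lam_map=np.full(128, 0.5))
print("true nonzeros recovered:", int(np.sum((np.abs(z_hat) > 0.1) & (z_true != 0))))
```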

Updated: 2025-03-17 10:38:39

标题: 学习空间自适应的 $\ell_1$-范数权重用于卷积综合正则化

摘要: 我们提出了一种展开算法的方法,用于在卷积综合型基于$\ell_1$正则化框架中学习空间自适应参数映射。更具体地说,我们考虑了一组预先训练的卷积滤波器,并通过展开FISTA算法来估计深度参数化的空间变化参数,这些参数应用于稀疏特征映射,以解决潜在的稀疏估计问题。我们评估了所提出的方法用于低场MRI图像重建,并与依赖于总变差正则化的空间自适应和非自适应分析型程序以及一个成熟的基于模型的深度学习方法进行比较。我们展示了所提出的方法产生了与后一方法在视觉和定量方面可比的结果,同时仍然具有很高的可解释性。特别是,推断的参数映射量化了重建中每个滤波器的局部贡献,这为算法机制提供了宝贵的见解,并有可能用于丢弃不适合的滤波器。

更新时间: 2025-03-17 10:38:39

领域: cs.LG,cs.CV,math.OC

下载: http://arxiv.org/abs/2503.09483v2

Cross-Platform Benchmarking of the FHE Libraries: Novel Insights into SEAL and OpenFHE

The rapid growth of cloud computing and data-driven applications has amplified privacy concerns, driven by the increasing demand to process sensitive data securely. Homomorphic encryption (HE) has become a vital solution for addressing these concerns by enabling computations on encrypted data without revealing its contents. This paper provides a comprehensive evaluation of two leading HE libraries, SEAL and OpenFHE, examining their performance, usability, and support for prominent HE schemes such as BGV and CKKS. Our analysis highlights computational efficiency, memory usage, and scalability across Linux and Windows platforms, emphasizing their applicability in real-world scenarios. Results reveal that Linux outperforms Windows in computation efficiency, with OpenFHE emerging as the optimal choice across diverse cryptographic settings. This paper provides valuable insights for researchers and practitioners to advance privacy-preserving applications using FHE.

Updated: 2025-03-17 10:37:14

标题: 跨平台对FHE库进行基准测试:对SEAL和Openfhe的新见解

摘要: 云计算和数据驱动应用的快速增长加剧了隐私担忧,这是由于对安全处理敏感数据的需求不断增加。同态加密(HE)已成为应对这些担忧的重要解决方案,通过在加密数据上进行计算而不泄露其内容。本文全面评估了两个主要的HE库,SEAL和OpenFHE,检查它们的性能、可用性和对诸如BGV和CKKS等知名HE方案的支持。我们的分析突出了计算效率、内存使用和在Linux和Windows平台上的可伸缩性,强调它们在现实场景中的适用性。结果显示,在计算效率方面,Linux优于Windows,而OpenFHE在不同加密设置中表现出色。本文为研究人员和从业者提供了宝贵的见解,以推进使用FHE的隐私保护应用。

更新时间: 2025-03-17 10:37:14

领域: cs.CR

下载: http://arxiv.org/abs/2503.11216v2

TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation

The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is \textit{selectively distilled} into specialized student models via domain-specific supervised fine-tuning (SFT); and (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points) and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.
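
A minimal NumPy sketch of a Merge-Phase-style parameter combination: averaging the parameters of domain-specialized students (uniform weights and a plain weighted average are assumptions; the paper's merge rule may differ):

```python
import numpy as np

def merge_models(state_dicts, weights=None):
    """Combine domain-specialized student models by a weighted average of
    their parameters, key by key."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# hypothetical students fine-tuned on different domains
math_sd = {"layer.w": np.ones((2, 2)), "layer.b": np.zeros(2)}
code_sd = {"layer.w": 3 * np.ones((2, 2)), "layer.b": np.ones(2)}
print(merge_models([math_sd, code_sd])["layer.w"])   # elementwise average -> all 2s
```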

Updated: 2025-03-17 10:36:30

标题: TinyR1-32B-Preview: 使用分支融合蒸馏提高准确性

摘要: 缩小大型语言模型(LLMs)的大小并保持其性能的挑战已经引起了很大关注。然而,现有的方法,如模型蒸馏和迁移学习,通常无法实现高准确性。为了解决这一局限性,我们引入了Branch-Merge蒸馏方法,通过两个阶段增强模型压缩:(1)分支阶段,在这里,通过领域特定的监督微调(SFT)从大型教师模型中\textit{选择性地蒸馏}知识到专门的学生模型中;(2)合并阶段,在这里,这些学生模型被合并以实现跨领域知识转移和提高泛化能力。我们使用DeepSeek-R1作为教师,DeepSeek-R1-Distill-Qwen-32B作为学生验证了我们的蒸馏方法。结果合并模型TinyR1-32B-Preview在多个基准测试中表现优于其对应的DeepSeek-R1-Distill-Qwen-32B,包括数学(+5.5分)、编码(+4.4分)和科学(+2.9分),同时在AIME 2024上表现接近DeepSeek-R1。Branch-Merge蒸馏方法为创建更小、高性能的LLMs提供了可扩展的解决方案,同时降低了计算成本和时间。

更新时间: 2025-03-17 10:36:30

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.04872v2

PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data

Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose-image pairs. PoseSyn comprises two key components: the Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and the Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. Then, by generating realistic 3D training data via a human animation model aligned with challenging poses and appearances, PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks including various backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.

Updated: 2025-03-17 10:28:35

标题: PoseSyn:从野外2D数据中合成多样化的3D姿势数据

摘要: 尽管已经做出了相当大的努力来提高3D姿势估计器的泛化能力,而无需昂贵的3D标注,但现有的数据增强方法在现实世界中面对各种不同人类外貌和复杂姿势的情况下仍然存在困难。我们提出了PoseSyn,一种新颖的数据合成框架,将野外丰富的2D姿势数据集转化为多样化的3D姿势图像对。PoseSyn包括两个关键组成部分:错误提取模块(EEM),用于从2D姿势数据集中识别具有挑战性的姿势;运动合成模块(MSM),用于在具有挑战性的姿势周围合成运动序列。然后,通过使用与具有挑战性的姿势和外貌对齐的人体动画模型生成逼真的3D训练数据,PoseSyn将各种3D姿势估计器的准确度提高了高达14%,涵盖了各种背景和遮挡、具有挑战性的姿势和多视角场景等真实世界基准测试。大量实验进一步证实,无论姿势估计器的模型大小或设计如何,PoseSyn都是一种可扩展且有效的方法,可改善泛化能力,而无需依赖昂贵的3D标注。

更新时间: 2025-03-17 10:28:35

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.13025v1

Probabilistic Shielding for Safe Reinforcement Learning

In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise their reward, must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus have limited scaling. In this paper we present a new, scalable method, which enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known, and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.
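
A minimal NumPy sketch of the shielding idea: given precomputed safe-reachability values derived from the known safety dynamics, the shield exposes only the actions meeting the probabilistic budget (the values, fallback rule, and threshold are illustrative):

```python
import numpy as np

def shielded_actions(q_safe, s, delta):
    """Allow only actions whose probability of staying safe from state s
    meets the budget 1 - delta; q_safe is assumed precomputed from the
    known safety dynamics of the MDP."""
    allowed = np.flatnonzero(q_safe[s] >= 1.0 - delta)
    # fall back to the single safest action if nothing meets the budget
    return allowed if allowed.size else np.array([q_safe[s].argmax()])

q_safe = np.array([[0.99, 0.80, 0.97],
                   [0.60, 0.95, 0.40]])
print(shielded_actions(q_safe, s=0, delta=0.05))   # -> [0 2]
print(shielded_actions(q_safe, s=1, delta=0.01))   # -> fallback [1]
```

The learning agent then picks its greedy action within the allowed set, so safety holds at both training and test time while the policy is being learned.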

Updated: 2025-03-17 10:26:39

标题: 概率性屏蔽用于安全强化学习

摘要: 在现实场景中,一个旨在最大化奖励的强化学习(RL)代理通常也必须在训练时以安全的方式行为。因此,近年来,许多关注点都集中在安全强化学习上,其中代理旨在学习一个在满足给定安全约束条件的所有策略中的最优策略。然而,严格的安全保证通常是通过基于线性规划的方法提供的,因此存在扩展性有限的问题。在本文中,我们提出了一种新的可扩展方法,该方法在安全强化学习中享有严格的形式保证,特别是在马尔可夫决策过程(MDP)的安全动态已知且安全定义为无折扣概率避免属性的情况下。我们的方法基于MDP的状态增强和设计一个限制代理可用动作的护盾。我们展示了我们的方法提供了一个严格的形式安全保证,使代理在训练和测试时保持安全。此外,通过实验评估,我们证明了我们的方法在实践中是可行的。

更新时间: 2025-03-17 10:26:39

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.07671v2

A Comprehensive Overview and Comparative Analysis on Deep Learning Models: CNN, RNN, LSTM, GRU

Deep learning (DL) has emerged as a powerful subset of machine learning (ML) and artificial intelligence (AI), outperforming traditional ML methods, especially in handling unstructured and large datasets. Its impact spans across various domains, including speech recognition, healthcare, autonomous vehicles, cybersecurity, predictive analytics, and more. However, the complexity and dynamic nature of real-world problems present challenges in designing effective deep learning models. Consequently, several deep learning models have been developed to address different problems and applications. In this article, we conduct a comprehensive survey of various deep learning models, including Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Temporal Convolutional Networks (TCN), Transformer, Kolmogorov-Arnold networks (KAN), Generative Models, Deep Reinforcement Learning (DRL), and Deep Transfer Learning. We examine the structure, applications, benefits, and limitations of each model. Furthermore, we perform an analysis using three publicly available datasets: IMDB, ARAS, and Fruit-360. We compared the performance of six renowned deep learning models: CNN, RNN, Long Short-Term Memory (LSTM), Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional GRU alongside two newer models, TCN and Transformer, using the IMDB and ARAS datasets. Additionally, we evaluated the performance of eight CNN-based models, including VGG (Visual Geometry Group), Inception, ResNet (Residual Network), InceptionResNet, Xception (Extreme Inception), MobileNet, DenseNet (Dense Convolutional Network), and NASNet (Neural Architecture Search Network), for image classification tasks using the Fruit-360 dataset.

Updated: 2025-03-17 10:18:52

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2305.17473v4

Job Shop Scheduling Benchmark: Environments and Instances for Learning and Non-learning Methods

Job shop scheduling problems address the routing and sequencing of tasks in a job shop setting. Despite significant interest from operations research and machine learning communities over the years, a comprehensive platform for testing and comparing solution methods has been notably lacking. To fill this gap, we introduce a unified implementation of job shop scheduling problems and their solution methods, addressing the long-standing need for a standardized benchmarking platform in this domain. Our platform supports classic Job Shop (JSP), Flow Shop (FSP), Flexible Job Shop (FJSP), and Assembly Job Shop (AJSP), as well as variants featuring Sequence-Dependent Setup Times (SDST), variants with online arrivals of jobs, and combinations of these problems (e.g., FJSP-SDST and FAJSP). The platform provides a wide range of scheduling solution methods, from heuristics, metaheuristics, and exact optimization to deep reinforcement learning. The implementation is available as an open-source GitHub repository, serving as a collaborative hub for researchers, practitioners, and those new to the field. Beyond enabling direct comparisons with existing methods on widely studied benchmark problems, this resource serves as a robust starting point for addressing constrained and complex problem variants. By establishing a comprehensive and unified foundation, this platform is designed to consolidate existing knowledge and to inspire the development of next-generation algorithms in job shop scheduling research.
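
To ground the problem setting, here is a tiny self-contained JSP instance solved with a greedy Shortest-Processing-Time dispatching rule, one of the simplest baselines such a platform would include; the instance and rule are illustrative and deliberately not optimal.

```python
# Jobs as ordered lists of (machine, processing_time) operations.
jobs = [[(0, 3), (1, 2)],          # job 0: machine 0, then machine 1
        [(1, 4), (0, 1)]]          # job 1: machine 1, then machine 0

machine_free = {0: 0, 1: 0}        # earliest time each machine is idle
job_ready = [0] * len(jobs)        # earliest time each job can continue
next_op = [0] * len(jobs)          # index of each job's next operation
schedule = []

while any(n < len(ops) for n, ops in zip(next_op, jobs)):
    # Shortest-Processing-Time dispatching: among jobs with work left,
    # start the operation with the smallest processing time.
    candidates = [(jobs[j][next_op[j]][1], j) for j in range(len(jobs))
                  if next_op[j] < len(jobs[j])]
    _, j = min(candidates)
    machine, dur = jobs[j][next_op[j]]
    start = max(machine_free[machine], job_ready[j])
    machine_free[machine] = job_ready[j] = start + dur
    schedule.append((j, machine, start, start + dur))
    next_op[j] += 1

print(schedule)  # makespan = largest end time in the schedule
```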

Updated: 2025-03-17 10:18:45

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2308.12794v2

Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation

Although domain generalization (DG) has significantly alleviated the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at https://github.com/Yore0/TTDG-MGM.

Updated: 2025-03-17 10:11:11

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.13012v1

An Analysis of Safety Guarantees in Multi-Task Bayesian Optimization

This paper addresses the integration of additional information sources into a Bayesian optimization framework while ensuring that safety constraints are satisfied. The interdependencies between these information sources are modeled using an unknown correlation matrix. We explore how uniform error bounds must be adjusted to maintain constraint satisfaction throughout the optimization process, considering both Bayesian and frequentist statistical perspectives. This is achieved by appropriately scaling the error bounds based on a confidence interval that can be estimated from the data. Furthermore, the efficacy of the proposed approach is demonstrated through experiments on two benchmark functions and a controller parameter optimization problem. Our results highlight a significant improvement in sample efficiency, demonstrating the method's suitability for optimizing expensive-to-evaluate functions.

Updated: 2025-03-17 10:10:28

Domains: cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2503.08555v2

TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined - such attacks can be achieved by embedding malicious behaviors during training and activated under specific conditions that trigger malicious outputs. Despite the increasing support for multilingual capabilities in open-source and proprietary LLMs, the impact of backdoor attacks on these systems remains largely under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data were not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like mT5 and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages across various scenarios. Our findings also indicate that more powerful models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English data, such as Llama2, Llama3, and Gemma. Moreover, our experiments demonstrate 1) High Transferability: the backdoor mechanism operates successfully in cross-lingual response scenarios across 26 languages, achieving an average attack success rate of 99%, and 2) Robustness: the proposed attack remains effective even after defenses are applied. These findings expose critical security vulnerabilities in multilingual LLMs and highlight the urgent need for more robust, targeted defense strategies to address the unique challenges posed by cross-lingual backdoor transfer.

Updated: 2025-03-17 10:09:29

Domains: cs.CL,cs.CR

Download: http://arxiv.org/abs/2404.19597v3

Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients

Efficient deployment of deep neural networks on resource-constrained devices demands advanced compression techniques that preserve accuracy and interpretability. This paper proposes a machine learning framework that augments Knowledge Distillation (KD) with Integrated Gradients (IG), an attribution method, to optimise the compression of convolutional neural networks. We introduce a novel data augmentation strategy where IG maps, precomputed from a teacher model, are overlaid onto training images to guide a compact student model toward critical feature representations. This approach leverages the teacher's decision-making insights, enhancing the student's ability to replicate complex patterns with reduced parameters. Experiments on CIFAR-10 demonstrate the efficacy of our method: a student model, compressed 4.1-fold from the MobileNet-V2 teacher, achieves 92.5% classification accuracy, surpassing the baseline student's 91.4% and traditional KD approaches, while reducing inference latency from 140 ms to 13 ms--a tenfold speedup. We perform hyperparameter optimisation for efficient learning. Comprehensive ablation studies dissect the contributions of KD and IG, revealing synergistic effects that boost both performance and model explainability. Our method's emphasis on feature-level guidance via IG distinguishes it from conventional KD, offering a data-driven solution for mining transferable knowledge in neural architectures. This work contributes to machine learning by providing a scalable, interpretable compression technique, ideal for edge computing applications where efficiency and transparency are paramount.
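
A hedged sketch of the two ingredients the abstract describes, using Captum for Integrated Gradients: IG saliency from the teacher is blended onto the input images, and the student is trained with a standard distillation loss. The blending weight `alpha` and the exact overlay rule are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn.functional as F
from captum.attr import IntegratedGradients  # attribution method used as guidance

def ig_overlay(teacher, images, labels, alpha=0.4):
    """Blend (teacher-derived) IG saliency onto the inputs; a plausible
    reading of the paper's augmentation, with an assumed blending rule."""
    ig = IntegratedGradients(teacher)
    attr = ig.attribute(images, target=labels)           # same shape as images
    sal = attr.abs().sum(1, keepdim=True)                # (B,1,H,W) saliency
    sal = sal / (sal.amax(dim=(2, 3), keepdim=True) + 1e-8)
    return images * (1 - alpha) + images * sal * alpha   # emphasize key regions

def kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    # Standard KD: softened KL to the teacher plus hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return lam * soft + (1 - lam) * F.cross_entropy(student_logits, labels)
```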

Updated: 2025-03-17 10:07:50

Domains: cs.LG,cs.CV,68T05, 68T07,I.2.6; I.4.2; I.4.9

Download: http://arxiv.org/abs/2503.13008v1

Cauchy-Schwarz Regularizers

We introduce a novel class of regularization functions, called Cauchy-Schwarz (CS) regularizers, which can be designed to induce a wide range of properties in solution vectors of optimization problems. To demonstrate the versatility of CS regularizers, we derive regularization functions that promote discrete-valued vectors, eigenvectors of a given matrix, and orthogonal matrices. The resulting CS regularizers are simple, differentiable, and can be free of spurious stationary points, making them suitable for gradient-based solvers and large-scale optimization problems. In addition, CS regularizers automatically adapt to the appropriate scale, which is, for example, beneficial when discretizing the weights of neural networks. To demonstrate the efficacy of CS regularizers, we provide results for solving underdetermined systems of linear equations and weight quantization in neural networks. Furthermore, we discuss specializations, variations, and generalizations, which lead to an even broader class of new and possibly more powerful regularizers.
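
The construction rests on the Cauchy-Schwarz gap ||u||^2 ||v||^2 - <u,v>^2, which is nonnegative and vanishes exactly when u and v are collinear. Below is a hedged toy instance of how such a gap can promote discrete-valued weights at an automatically adapted scale; this is an illustrative reading, not necessarily one of the paper's derived regularizers.

```python
import torch

def cs_gap(u, v):
    """Cauchy-Schwarz gap: always >= 0, zero iff u and v are collinear."""
    return u.square().sum() * v.square().sum() - (u * v).sum() ** 2

def binarization_regularizer(x):
    """Vanishes iff all |x_i| are equal, i.e. x lies in {+c, -c}^n for some
    free scale c; illustrative construction, differentiable everywhere."""
    u = x.square().flatten()
    return cs_gap(u, torch.ones_like(u))

x = torch.randn(100, requires_grad=True)
loss = binarization_regularizer(x)   # suitable for gradient-based solvers
loss.backward()
```

Because the gap vanishes whenever all |x_i| coincide, the preferred magnitude c is free to adapt during training, which matches the scale-adaptivity the abstract highlights for weight quantization.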

Updated: 2025-03-17 10:01:57

Domains: math.OC,cs.LG

Download: http://arxiv.org/abs/2503.01639v3

Exploring ReAct Prompting for Task-Oriented Dialogue: Insights and Shortcomings

Large language models (LLMs) gained immense popularity due to their impressive capabilities in unstructured conversations. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) (Yao et al., 2022) has shown promise in solving complex tasks traditionally requiring reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs performing task-oriented dialogue (TOD). We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs severely underperform state-of-the-art approaches on success rate in simulation, this difference becomes less pronounced in human evaluation. Moreover, compared to the baseline, humans report higher subjective satisfaction with ReAct-LLM despite its lower success rate, most likely thanks to its natural and confidently phrased responses.
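
For readers unfamiliar with the prompting pattern, below is a minimal ReAct-style dialogue-turn loop; the prompt format, tool registry, and `llm` callable are illustrative stand-ins rather than the paper's implementation.

```python
# Minimal ReAct-style TOD loop, runnable with any `llm(prompt) -> str` callable.
TOOLS = {"find_restaurant": lambda area: f"3 matches in {area}"}

PROMPT = """You are a booking assistant. Use this format:
Thought: reason about what the user needs
Action: tool_name[argument]     (or Finish[reply to the user])
Observation: tool result
"""

def react_turn(llm, history: str, max_steps: int = 4) -> str:
    trace = history
    for _ in range(max_steps):
        step = llm(PROMPT + trace)            # model emits Thought + Action
        trace += step + "\n"
        action = step.split("Action:")[-1].strip()
        if action.startswith("Finish["):
            return action[len("Finish["):-1]  # final system response
        name, arg = action.split("[", 1)
        result = TOOLS[name.strip()](arg.rstrip("]"))
        trace += f"Observation: {result}\n"   # feed tool output back to the LLM
    return "Sorry, could you rephrase that?"
```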

Updated: 2025-03-17 10:01:21

Domains: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2412.01262v2

Linear-Size Neural Network Representation of Piecewise Affine Functions in $\mathbb{R}^2$

It is shown that any continuous piecewise affine (CPA) function $\mathbb{R}^2\to\mathbb{R}$ with $p$ pieces can be represented by a ReLU neural network with two hidden layers and $O(p)$ neurons. Unlike prior work, which focused on convex pieces, this analysis considers CPA functions with connected but potentially non-convex pieces.

Updated: 2025-03-17 09:56:39

Domains: cs.LG,cs.NE,math.MG,stat.ML

Download: http://arxiv.org/abs/2503.13001v1

Convex Formulations for Training Two-Layer ReLU Neural Networks

Solving non-convex, NP-hard optimization problems is crucial for training machine learning models, including neural networks. However, non-convexity often leads to black-box machine learning models with unclear inner workings. While convex formulations have been used for verifying neural network robustness, their application to training neural networks remains less explored. In response to this challenge, we reformulate the problem of training infinite-width two-layer ReLU networks as a convex completely positive program in a finite-dimensional (lifted) space. Despite the convexity, solving this problem remains NP-hard due to the complete positivity constraint. To overcome this challenge, we introduce a semidefinite relaxation that can be solved in polynomial time. We then experimentally evaluate the tightness of this relaxation, demonstrating its competitive performance in test accuracy across a range of classification tasks.

Updated: 2025-03-17 09:56:35

Domains: cs.LG,math.OC

Download: http://arxiv.org/abs/2410.22311v2

Believing is Seeing: Unobserved Object Detection using Generative Models

Can objects that are not visible in an image -- but are in the vicinity of the camera -- be detected? This study introduces the novel tasks of 2D, 2.5D and 3D unobserved object detection for predicting the location of nearby objects that are occluded or lie outside the image frame. We adapt several state-of-the-art pre-trained generative models to address this task, including 2D and 3D diffusion models and vision-language models, and show that they can be used to infer the presence of objects that are not directly observed. To benchmark this task, we propose a suite of metrics that capture different aspects of performance. Our empirical evaluation on indoor scenes from the RealEstate10k and NYU Depth v2 datasets demonstrates results that motivate the use of generative models for the unobserved object detection task.

Updated: 2025-03-17 09:56:24

Domains: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2410.05869v3

Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization

Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for fine-tuning. To reveal the relationship between sample and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at \href{https://github.com/zengkaiya/CaT}{https://github.com/zengkaiya/CaT}.

Updated: 2025-03-17 09:55:01

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.12999v1

Entropic Matching for Expectation Propagation of Markov Jump Processes

We propose a novel, tractable latent state inference scheme for Markov jump processes, for which exact inference is often intractable. Our approach is based on an entropic matching framework that can be embedded into the well-known expectation propagation algorithm. We demonstrate the effectiveness of our method by providing closed-form results for a simple family of approximate distributions and apply it to the general class of chemical reaction networks, which are a crucial tool for modeling in systems biology. Moreover, we derive closed-form expressions for point estimation of the underlying parameters using an approximate expectation maximization procedure. We evaluate our method across various chemical reaction networks and compare it to multiple baseline approaches, demonstrating superior performance in approximating the mean of the posterior process. Finally, we discuss the limitations of our method and potential avenues for future improvement, highlighting its promising direction for addressing complex continuous-time Bayesian inference problems.

Updated: 2025-03-17 09:52:50

Domains: cs.LG,q-bio.MN,q-bio.QM,stat.ML

Download: http://arxiv.org/abs/2309.15604v2

Robot Policy Transfer with Online Demonstrations: An Active Reinforcement Learning Approach

Transfer Learning (TL) is a powerful tool that enables robots to transfer learned policies across different environments, tasks, or embodiments. To further facilitate this process, efforts have been made to combine it with Learning from Demonstrations (LfD) for more flexible and efficient policy transfer. However, these approaches are almost exclusively limited to offline demonstrations collected before policy transfer starts, which may suffer from the intrinsic issue of covariance shift brought by LfD and harm the performance of policy transfer. Meanwhile, extensive work in the learning-from-scratch setting has shown that online demonstrations can effectively alleviate covariance shift and lead to better policy performance with improved sample efficiency. This work combines these insights to introduce online demonstrations into a policy transfer setting. We present Policy Transfer with Online Demonstrations, an active LfD algorithm for policy transfer that can optimize the timing and content of queries for online episodic expert demonstrations under a limited demonstration budget. We evaluate our method in eight robotic scenarios, involving policy transfer across diverse environment characteristics, task objectives, and robotic embodiments, with the aim to transfer a trained policy from a source task to a related but different target task. The results show that our method significantly outperforms all baselines in terms of average success rate and sample efficiency, compared to two canonical LfD methods with offline demonstrations and one active LfD method with online demonstrations. Additionally, we conduct preliminary sim-to-real tests of the transferred policy on three transfer scenarios in the real-world environment, demonstrating the policy effectiveness on a real robot manipulator.

Updated: 2025-03-17 09:47:42

Domains: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.12993v1

Intra-neuronal attention within language models: Relationships between activation and semantics

This study investigates the ability of perceptron-type neurons in language models to perform intra-neuronal attention; that is, to identify different homogeneous categorical segments within the synthetic thought category they encode, based on a segmentation of specific activation zones for the tokens to which they are particularly responsive. The objective of this work is therefore to determine to what extent formal neurons can establish a homomorphic relationship between activation-based and categorical segmentations. The results suggest the existence of such a relationship, albeit tenuous, only at the level of tokens with very high activation levels. This intra-neuronal attention subsequently enables categorical restructuring processes at the level of neurons in the following layer, thereby contributing to the progressive formation of high-level categorical abstractions.

Updated: 2025-03-17 09:47:11

Domains: cs.AI,cs.CL,q-bio.NC

Download: http://arxiv.org/abs/2503.12992v1

Exact Computation of Any-Order Shapley Interactions for Graph Neural Networks

Albeit the ubiquitous use of Graph Neural Networks (GNNs) in machine learning (ML) prediction tasks involving graph-structured data, their interpretability remains challenging. In explainable artificial intelligence (XAI), the Shapley Value (SV) is the predominant method to quantify contributions of individual features to a ML model's output. Addressing the limitations of SVs in complex prediction models, Shapley Interactions (SIs) extend the SV to groups of features. In this work, we explain single graph predictions of GNNs with SIs that quantify node contributions and interactions among multiple nodes. By exploiting the GNN architecture, we show that the structure of interactions in node embeddings are preserved for graph prediction. As a result, the exponential complexity of SIs depends only on the receptive fields, i.e. the message-passing ranges determined by the connectivity of the graph and the number of convolutional layers. Based on our theoretical results, we introduce GraphSHAP-IQ, an efficient approach to compute any-order SIs exactly. GraphSHAP-IQ is applicable to popular message passing techniques in conjunction with a linear global pooling and output layer. We showcase that GraphSHAP-IQ substantially reduces the exponential complexity of computing exact SIs on multiple benchmark datasets. Beyond exact computation, we evaluate GraphSHAP-IQ's approximation of SIs on popular GNN architectures and compare with existing baselines. Lastly, we visualize SIs of real-world water distribution networks and molecule structures using a SI-Graph.
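
For intuition, the quantity in question can be written down directly: the sketch below evaluates the Grabisch-Roubens Shapley interaction index of a coalition by brute-force enumeration over all sub-coalitions, which is exactly the exponential cost that GraphSHAP-IQ avoids by restricting the enumeration to the GNN's receptive fields.

```python
from itertools import combinations
from math import factorial

def shapley_interaction(v, n, S):
    """Exact Shapley interaction index for coalition S of a game `v`
    (a map from frozenset of players to a real value), by brute force.
    Exponential in n; for S of size 1 it reduces to the Shapley value."""
    players = set(range(n)) - set(S)
    s, total = len(S), 0.0
    for t in range(len(players) + 1):
        w = factorial(n - t - s) * factorial(t) / factorial(n - s + 1)
        for T in combinations(players, t):
            # Discrete derivative of v at T with respect to S.
            disc = sum((-1) ** (s - len(W)) * v(frozenset(T) | set(W))
                       for k in range(s + 1) for W in combinations(S, k))
            total += w * disc
    return total

# Toy game: value 1 only when players 0 and 1 cooperate.
v = lambda c: 1.0 if {0, 1} <= c else 0.0
print(shapley_interaction(v, n=3, S=(0, 1)))  # positive pairwise synergy
```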

Updated: 2025-03-17 09:46:45

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2501.16944v2

A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models

Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show significant improvements in classification accuracy. Furthermore, we demonstrate the framework's adaptability for multi-label skill classification. Our results indicate that the framework outperforms existing LLM-based methods, offering a practical and scalable solution for occupation classification and related tasks across LLMs.

Updated: 2025-03-17 09:44:50

Domains: cs.CL,cs.AI,cs.SI

Download: http://arxiv.org/abs/2503.12989v1

ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

As large language models (LLMs) demonstrate powerful capabilities, deploying them on edge devices has become increasingly crucial, offering advantages in privacy and real-time interaction. QLoRA has emerged as the standard approach for on-device LLMs, leveraging quantized models to reduce memory and computational costs while utilizing LoRA for task-specific adaptability. In this work, we propose ROMA, a QLoRA accelerator with a hybrid storage architecture that uses ROM for quantized base models and SRAM for LoRA weights and KV cache. Our insight is that the quantized base model is stable and converged, making it well-suited for ROM storage. Meanwhile, LoRA modules offer the flexibility to adapt to new data without requiring updates to the base model. To further reduce the area cost of ROM, we introduce a novel B-ROM design and integrate it with the compute unit to form a fused cell for efficient use of chip resources. ROMA can effectively store both a 4-bit 3B and a 2-bit 8B LLaMA model entirely on-chip, achieving a notable generation speed exceeding 20,000 tokens/s without requiring external memory.

Updated: 2025-03-17 09:44:17

Domains: cs.AR,cs.AI

Download: http://arxiv.org/abs/2503.12988v1

Enhancing Job Salary Prediction with Disentangled Composition Effect Modeling: A Neural Prototyping Approach

In the era of the knowledge economy, understanding how job skills influence salary is crucial for promoting recruitment with competitive salary systems and aligned salary expectations. Despite efforts on salary prediction based on job positions and talent demographics, methods that can effectively discern the intricate composition effect of set-structured skills on job salary are still lacking. While recent advances in neural networks have significantly improved accurate set-based quantitative modeling, their lack of explainability hinders obtaining insights into the skills' composition effects. Indeed, model explanation for set data is challenging due to the combinatorial nature, rich semantics, and unique format. To this end, in this paper, we propose a novel intrinsically explainable set-based neural prototyping approach, namely \textbf{LGDESetNet}, for explainable salary prediction that can reveal disentangled skill sets that impact salary from both local and global perspectives. Specifically, we propose a skill graph-enhanced disentangled discrete subset selection layer to identify multi-faceted influential input subsets with varied semantics. Furthermore, we propose a set-oriented prototype learning method to extract globally influential prototypical sets. The resulting output is transparently derived from the semantic interplay between these input subsets and global prototypes. Extensive experiments on four real-world datasets demonstrate that our method achieves superior performance than state-of-the-art baselines in salary prediction while providing explainable insights into salary-influencing patterns.

Updated: 2025-03-17 09:36:07

Domains: cs.LG

Download: http://arxiv.org/abs/2503.12978v1

Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we developed a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKGs construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at https://github.com/Wings-Of-Disaster/VaLiK.

Updated: 2025-03-17 09:31:14

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.12972v1

Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery

This paper presents a framework for extracting georeferenced vehicle trajectories from high-altitude drone imagery, addressing key challenges in urban traffic monitoring and the limitations of traditional ground-based systems. Our approach integrates several novel contributions, including a tailored object detector optimized for high-altitude bird's-eye view perspectives, a unique track stabilization method that uses detected vehicle bounding boxes as exclusion masks during image registration, and an orthophoto and master frame-based georeferencing strategy that enhances consistent alignment across multiple drone viewpoints. Additionally, our framework features robust vehicle dimension estimation and detailed road segmentation, enabling comprehensive traffic analysis. Conducted in the Songdo International Business District, South Korea, the study utilized a multi-drone experiment covering 20 intersections, capturing approximately 12TB of 4K video data over four days. The framework produced two high-quality datasets: the Songdo Traffic dataset, comprising approximately 700,000 unique vehicle trajectories, and the Songdo Vision dataset, containing over 5,000 human-annotated images with about 300,000 vehicle instances in four classes. Comparisons with high-precision sensor data from an instrumented probe vehicle highlight the accuracy and consistency of our extraction pipeline in dense urban environments. The public release of Songdo Traffic and Songdo Vision, and the complete source code for the extraction pipeline, establishes new benchmarks in data quality, reproducibility, and scalability in traffic research. Results demonstrate the potential of integrating drone technology with advanced computer vision for precise and cost-effective urban traffic monitoring, providing valuable resources for developing intelligent transportation systems and enhancing traffic management strategies.

Updated: 2025-03-17 09:25:50

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.02136v2

Distributed Black-box Attack: Do Not Overestimate Black-box Attacks

As cloud computing becomes pervasive, deep learning models are deployed on cloud servers and then provided as APIs to end users. However, black-box adversarial attacks can fool image classification models without access to model structure and weights. Recent studies have reported attack success rates of over 95% with fewer than 1,000 queries. This raises the question: have black-box attacks become a real threat to cloud APIs? To shed some light on this, our research indicates that black-box attacks are not as effective against cloud APIs as proposed in research papers, due to several common mistakes that overestimate the efficiency of black-box attacks. To avoid similar mistakes, we conduct black-box attacks directly on cloud APIs rather than local models.

Updated: 2025-03-17 09:22:14

Domains: cs.LG

Download: http://arxiv.org/abs/2210.16371v5

Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, which we name "full-denoising", to the alternative "half-denoising" introduced by Hyvärinen (2024). We show that looking at the performance in terms of distance between distributions tells a more nuanced story, with different assumptions on the data leading to very different conclusions. We prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.
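
In formulas (step sizes as commonly stated; treat this as a hedged sketch rather than the paper's exact statement): for a noisy observation $x = x_0 + \sigma\varepsilon$ with noisy marginal $p_\sigma$, Tweedie's identity gives the quadratic-loss-optimal full denoiser, and half-denoising halves the score correction:

```latex
\hat{x}_{\text{full}} = x + \sigma^{2}\,\nabla_{x}\log p_{\sigma}(x),
\qquad
\hat{x}_{\text{half}} = x + \tfrac{1}{2}\,\sigma^{2}\,\nabla_{x}\log p_{\sigma}(x)
```

The paper's comparison is then about which estimator's output distribution lands closer to the clean one, not about pointwise reconstruction error.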

Updated: 2025-03-17 09:22:14

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.12966v1

Training Video Foundation Models with NVIDIA NeMo

Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.

Updated: 2025-03-17 09:19:12

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.12964v1

Is Contrasting All You Need? Contrastive Learning for the Detection and Attribution of AI-generated Text

The significant progress in the development of Large Language Models has contributed to blurring the distinction between human and AI-generated text. The increasing pervasiveness of AI-generated text and the difficulty in detecting it poses new challenges for our society. In this paper, we tackle the problem of detecting and attributing AI-generated text by proposing WhosAI, a triplet-network contrastive learning framework designed to predict whether a given input text has been generated by humans or AI and to unveil the authorship of the text. Unlike most existing approaches, our proposed framework is conceived to learn semantic similarity representations from multiple generators at once, thus equally handling both detection and attribution tasks. Furthermore, WhosAI is model-agnostic and scalable to the release of new AI text-generation models by incorporating their generated instances into the embedding space learned by our framework. Experimental results on the TuringBench benchmark of 200K news articles show that our proposed framework achieves outstanding results in both the Turing Test and Authorship Attribution tasks, outperforming all the methods listed in the TuringBench benchmark leaderboards.
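
A minimal sketch of the triplet-network objective described above; the stand-in encoder, feature dimension, and margin are placeholders, not WhosAI's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in text encoder over precomputed features; the real system would
# use a stronger backbone over raw text.
encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))

def triplet_step(anchor, positive, negative, margin=1.0):
    """Pull texts by the same author (human or a given generator) together,
    push texts from different authors apart in the embedding space."""
    za, zp, zn = encoder(anchor), encoder(positive), encoder(negative)
    return F.triplet_margin_loss(za, zp, zn, margin=margin)

a, p, n = (torch.randn(32, 768) for _ in range(3))  # dummy feature batches
loss = triplet_step(a, p, n)
loss.backward()
# At inference, detection and attribution reduce to nearest-centroid search
# in the learned space, so newly released generators can be added simply by
# embedding their outputs, without retraining from scratch.
```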

Updated: 2025-03-17 09:19:05

Domains: cs.CL,cs.AI,cs.CY,cs.HC,physics.soc-ph

Download: http://arxiv.org/abs/2407.09364v2

FedSDP: Explainable Differential Privacy in Federated Learning via Shapley Values

Federated learning (FL) enables participants to store data locally while collaborating in training, yet it remains vulnerable to privacy attacks, such as data reconstruction. Existing differential privacy (DP) techniques dynamically schedule the noise injected into the training process to mitigate the impact of excessive noise. However, this dynamic scheduling is often grounded in factors indirectly related to privacy, making it difficult to clearly explain the intricate relationship between dynamic noise adjustments and privacy requirements. To address this issue, we propose FedSDP, a novel and explainable DP-based privacy protection mechanism that guides noise injection based on privacy contribution. Specifically, FedSDP leverages Shapley values to assess the contribution of private attributes to local model training and dynamically adjusts the amount of noise injected accordingly. By providing theoretical insights into the injection of varying scales of noise into local training, FedSDP enhances interpretability. Extensive experiments demonstrate that FedSDP can achieve a superior balance between privacy preservation and model performance, surpassing state-of-the-art (SOTA) solutions.
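
The following numpy sketch shows one plausible reading of contribution-guided noise: per-coordinate Gaussian noise whose scale grows with a normalized (Shapley-style) contribution score. The contribution-to-scale mapping is an assumption for illustration, not FedSDP's exact schedule.

```python
import numpy as np

def contribution_scaled_noise(grads, contributions, clip=1.0, eps=1.0):
    """Add Gaussian noise whose scale grows with each parameter group's
    normalized (Shapley-style) privacy contribution; the mapping from
    contribution to noise scale below is an illustrative assumption."""
    grads = np.clip(grads, -clip, clip)                  # standard DP clipping
    c = contributions / (contributions.sum() + 1e-12)    # normalize scores
    sigma = clip * (0.5 + c) / eps                       # higher score, more noise
    return grads + np.random.normal(0.0, sigma, size=grads.shape)

noisy = contribution_scaled_noise(
    np.random.randn(4, 10),                              # 4 parameter groups
    contributions=np.array([[0.1], [0.4], [0.3], [0.2]]))
```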

Updated: 2025-03-17 09:14:19

Domains: cs.CR

Download: http://arxiv.org/abs/2503.12958v1

Visually Wired NFTs: Exploring the Role of Inspiration in Non-Fungible Tokens

The fervor for Non-Fungible Tokens (NFTs) attracted countless creators, leading to a Big Bang of digital assets driven by latent or explicit forms of inspiration, as in many creative processes. This work exploits Vision Transformers and graph-based modeling to delve into visual inspiration phenomena between NFTs over the years. Our goals include unveiling the main structural traits that shape visual inspiration networks, exploring the interrelation between visual inspiration and asset performances, investigating crypto influence on inspiration processes, and explaining the inspiration relationships among NFTs. Our findings unveil how the pervasiveness of inspiration led to a temporary saturation of the visual feature space, the impact of the dichotomy between inspiring and inspired NFTs on their financial performance, and an intrinsic self-regulatory mechanism between markets and inspiration waves. Our work can serve as a starting point for gaining a broader view of the evolution of Web3.

Updated: 2025-03-17 09:07:22

Domains: cs.SI,cs.AI,cs.CV,physics.soc-ph

Download: http://arxiv.org/abs/2303.17031v4

Performance Analysis and Industry Deployment of Post-Quantum Cryptography Algorithms

As quantum computing advances, modern cryptographic standards face an existential threat, necessitating a transition to post-quantum cryptography (PQC). The National Institute of Standards and Technology (NIST) has selected CRYSTALS-Kyber and CRYSTALS-Dilithium as standardized PQC algorithms for secure key exchange and digital signatures, respectively. This study conducts a comprehensive performance analysis of these algorithms by benchmarking execution times across cryptographic operations such as key generation, encapsulation, decapsulation, signing, and verification. Additionally, the impact of AVX2 optimizations is evaluated to assess hardware acceleration benefits. Our findings demonstrate that Kyber and Dilithium achieve efficient execution times, outperforming classical cryptographic schemes such as RSA and ECDSA at equivalent security levels. Beyond technical performance, the real-world deployment of PQC introduces challenges in telecommunications networks, where large-scale infrastructure upgrades, interoperability with legacy systems, and regulatory constraints must be addressed. This paper examines the feasibility of PQC adoption in telecom environments, highlighting key transition challenges, security risks, and implementation strategies. Through industry case studies, we illustrate how telecom operators are integrating PQC into 5G authentication, subscriber identity protection, and secure communications. Our analysis provides insights into the computational trade-offs, deployment considerations, and standardization efforts shaping the future of quantum-safe cryptographic infrastructure.
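
A hedged micro-benchmark in the spirit of the study, using the liboqs-python bindings; the algorithm identifier follows older liboqs naming (newer releases expose the standardized scheme as "ML-KEM-512"), so treat the exact strings as version-dependent assumptions.

```python
# pip install liboqs-python (requires the liboqs C library)
import time
import oqs

def bench_kem(alg="Kyber512", iters=1000):
    """Average wall-clock time of one encapsulation + decapsulation."""
    with oqs.KeyEncapsulation(alg) as kem:
        public_key = kem.generate_keypair()
        t0 = time.perf_counter()
        for _ in range(iters):
            ciphertext, ss_enc = kem.encap_secret(public_key)
            ss_dec = kem.decap_secret(ciphertext)
            assert ss_enc == ss_dec          # shared secrets must agree
        return (time.perf_counter() - t0) / iters

print(f"Kyber512 encap+decap: {bench_kem() * 1e6:.1f} us/op")
```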

Updated: 2025-03-17 09:06:03

Domains: cs.CR,68,E.3

Download: http://arxiv.org/abs/2503.12952v1

Classification of power quality events in the transmission grid: comparative evaluation of different machine learning models

Automatic classification of electric power quality events with respect to their root causes is critical for electrical grid management. In this paper, we present comparative evaluation results of an extensive set of machine learning models for the classification of power quality events, based on their root causes. After extensive experiments using different machine learning libraries, it is observed that the best performing learning models turn out to be Cubic SVM and XGBoost. During error analysis, it is observed that the main source of performance degradation for both models is the classification of ABC faults as ABCG faults, or vice versa. Ultimately, the models achieving the best results will be integrated into the event classification module of a large-scale power quality and grid monitoring system for the Turkish electricity transmission system.
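
A minimal scaffold for this kind of comparison with scikit-learn and XGBoost, where "Cubic SVM" is read as an SVC with a degree-3 polynomial kernel; the synthetic data stands in for the real power-quality features, which are not public here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Stand-in for per-event features (RMS profiles, waveform statistics, ...)
X, y = make_classification(n_samples=600, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)

models = {
    "cubic_svm": make_pipeline(StandardScaler(),
                               SVC(kernel="poly", degree=3)),
    "xgboost": XGBClassifier(n_estimators=300, max_depth=6),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f}")
```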

Updated: 2025-03-17 09:02:31

Domains: eess.SP,cs.LG

Download: http://arxiv.org/abs/2503.13566v1

Interleaved-Modal Chain-of-Thought

Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named \textbf{Interleaved-modal Chain-of-Thought (ICoT)}, which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, the novel ICoT requires VLMs to enable the generation of fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose \textbf{Attention-driven Selection (ADS)} to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with ignorable additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations of three benchmarks have shown that ICoT prompting achieves substantial performance (up to 14\%) and interpretability improvements compared to existing multimodal CoT prompting methods.

Updated: 2025-03-17 09:01:38

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.19488v2

Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation

This work introduces Open3DBench, an open-source 3D-IC backend implementation benchmark built upon the OpenROAD-flow-scripts framework, enabling comprehensive evaluation of power, performance, area, and thermal metrics. Our proposed flow supports modular integration of 3D partitioning, placement, 3D routing, RC extraction, and thermal simulation, aligning with advanced 3D flows that rely on commercial tools and in-house scripts. We present two foundational 3D placement algorithms: Open3D-Tiling, which emphasizes regular macro placement, and Open3D-DMP, which enhances wirelength optimization through cross-die co-placement with analytical placer DREAMPlace. Experimental results show significant improvements in area (51.19%), wirelength (24.06%), timing (30.84%), and power (5.72%) compared to 2D flows. The results also highlight that better wirelength does not necessarily lead to PPA gain, emphasizing the need to develop PPA-driven methods. Open3DBench offers a standardized, reproducible platform for evaluating 3D EDA methods, effectively bridging the gap between open-source tools and commercial solutions in 3D-IC design.

Updated: 2025-03-17 08:59:00

Domains: cs.AR,cs.AI

Download: http://arxiv.org/abs/2503.12946v1

HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Our code will be public available.
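
Since the framework hinges on layer-wise CKA, here is the standard linear CKA computation on two activation matrices; treating high-CKA layers as task-general and the rest as task-specific is a simplified reading of the paper's rule, not its exact criterion.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices of
    shape (n_examples, n_features); 1.0 means identical representational
    geometry. Layers whose CKA stays high across tasks would be candidates
    for the task-general (fused) part, the rest for task-specific expansion."""
    X = X - X.mean(0, keepdims=True)   # center features
    Y = Y - Y.mean(0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))

a = np.random.randn(256, 64)
b = a @ np.random.randn(64, 64)        # linear transform of the same activations
print(linear_cka(a, b))                # high similarity despite the transform
```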

Updated: 2025-03-17 08:56:03

Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.12941v1

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.
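
The abstract does not spell out the reward computations, so the sketch below is a deliberately simplified, hypothetical rendering of the two signals: a soft key-step matcher for StepRAR and a structure check for StepRVR. The string-similarity matcher, the threshold, and the conclusion heuristic are all illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of StepGRPO-style step-wise rewards (hypothetical
# simplification; the paper's matching and logic evaluation are richer).
from difflib import SequenceMatcher

def step_rar(reasoning_steps, key_steps, threshold=0.7):
    """StepRAR-like: fraction of reference key steps softly matched
    by at least one generated step."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    matched = sum(
        any(sim(k, s) >= threshold for s in reasoning_steps)
        for k in key_steps
    )
    return matched / max(len(key_steps), 1)

def step_rvr(reasoning_steps, min_steps=2, conclusion_marker="therefore"):
    """StepRVR-like: partial credit for a complete, well-ordered path
    that ends with an explicit conclusion."""
    complete = len(reasoning_steps) >= min_steps
    concluded = any(conclusion_marker in s.lower() for s in reasoning_steps[-1:])
    return 0.5 * complete + 0.5 * concluded

steps = ["Compute 3 * 4 = 12.", "Add 5 to get 17.", "Therefore the answer is 17."]
reward = step_rar(steps, ["3 * 4 = 12", "answer is 17"]) + step_rvr(steps)
print(reward)
```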

Updated: 2025-03-17 08:51:44

Categories: cs.AI,cs.CL,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.12937v1

Leveraging Joint Predictive Embedding and Bayesian Inference in Graph Self Supervised Learning

Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction, yet prevailing self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. Existing approaches often depend on feature reconstruction, negative sampling, or complex decoders, which introduce training overhead and hinder generalization. Further, current techniques that address these limitations fail to account for the contribution of node embeddings to a given prediction in the absence of labeled nodes. To address these limitations, we propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information. Additionally, we introduce a semantic-aware objective term that incorporates pseudo-labels derived from Gaussian Mixture Models (GMMs), enhancing node discriminability by evaluating latent feature contributions. Extensive experiments demonstrate that our framework outperforms state-of-the-art graph SSL methods across benchmarks, achieving superior performance without contrastive loss or complex decoders. Key innovations include (1) a non-contrastive, view-invariant joint embedding predictive architecture, (2) leveraging the single-context, multiple-target relationship between subgraphs, and (3) GMM-based pseudo-label scoring to capture semantic contributions. This work advances graph SSL by offering a computationally efficient, collapse-resistant paradigm that bridges spatial and semantic graph features for downstream tasks. The code for our paper can be found at https://github.com/Deceptrax123/JPEB-GSSL
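
As a rough illustration of the GMM-based pseudo-label scoring (the surrounding encoder, predictive objective, and exact weighting scheme are assumptions here), one can fit a mixture on node embeddings and reuse the responsibilities as confidence scores:

```python
# Minimal sketch of GMM pseudo-label scoring on node embeddings.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))            # stand-in for learned node embeddings

gmm = GaussianMixture(n_components=4, random_state=0).fit(Z)
pseudo_labels = gmm.predict(Z)            # hard pseudo-label per node
confidence = gmm.predict_proba(Z).max(1)  # soft responsibility as a score

# A semantic-aware term could weight each node's loss by its confidence:
weights = confidence / confidence.sum()
```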

Updated: 2025-03-17 08:45:19

Categories: cs.LG,cs.AI,cs.SI

Download: http://arxiv.org/abs/2502.01684v2

Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs

Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisfaction but at the cost of either high computational burden incurred by the quadratic programs (QP) or increased architectural complexity due to the use of sophisticated generative models. In this paper, we propose a generic and computationally efficient framework that can adapt a standard unconstrained RL method to ACRL through two modifications: (i) To enforce the action constraints, we leverage the classic acceptance-rejection method, where we treat the unconstrained policy as the proposal distribution and derive a modified policy with feasible actions. (ii) To improve the acceptance rate of the proposal distribution, we construct an augmented two-objective Markov decision process (MDP), which includes additional self-loop state transitions and a penalty signal for the rejected actions. This augmented MDP incentivizes the learned policy to stay close to the feasible action sets. Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework simultaneously enjoys faster training progress, better constraint satisfaction, and lower action inference time than the state-of-the-art ACRL methods. We have made the source code publicly available to encourage further research in this direction.
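
A minimal sketch of the acceptance-rejection step, assuming a toy L1 action budget as the constraint; the augmented MDP's self-loop transition is reduced here to a flag that would trigger the rejection penalty:

```python
# Acceptance-rejection over an unconstrained policy (proposal):
# resample until the action is feasible, else fall back and flag
# the rejection so a penalty signal can be applied during training.
import numpy as np

def feasible(a):                  # example constraint: unit L1 budget
    return np.abs(a).sum() <= 1.0

def constrained_action(policy_sample, max_tries=100):
    for _ in range(max_tries):
        a = policy_sample()
        if feasible(a):
            return a, False       # accepted
    return np.zeros(2), True      # fallback; flag triggers the penalty

rng = np.random.default_rng(0)
action, rejected_out = constrained_action(lambda: rng.normal(size=2))
```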

Updated: 2025-03-17 08:41:43

Categories: cs.LG,I.2.6; I.5.1

Download: http://arxiv.org/abs/2503.12932v1

MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies generally rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules are incapable of accommodating the inherent complexity and dynamic nature of real jailbreak attacks. In this paper, we propose a novel concept of ``mirror'' to enable dynamic and adaptive defense. A mirror refers to a dynamically generated prompt that mirrors the syntactic structure of the input while ensuring semantic safety. The personalized discrepancies between the input prompts and their corresponding mirrors serve as the guiding principles for defense. A new defense paradigm, MirrorGuard, is further proposed to detect and calibrate risky inputs based on such mirrors. An entropy-based detection metric, Relative Input Uncertainty (RIU), is integrated into MirrorGuard to quantify the discrepancies between input prompts and mirrors. MirrorGuard is evaluated on several popular datasets, demonstrating state-of-the-art defense performance while maintaining general effectiveness.
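
The paper's exact definition of RIU is not reproduced here, but one plausible reading (an assumption, not the authors' formula) compares the predictive entropy the model assigns under the original prompt with that under its safety-preserving mirror:

```python
# Hypothetical sketch of a Relative Input Uncertainty (RIU)-style score.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

def riu(probs_input, probs_mirror):
    h_in, h_mir = entropy(probs_input), entropy(probs_mirror)
    return (h_in - h_mir) / (h_mir + 1e-12)  # large gap -> risky input

print(riu([0.9, 0.05, 0.05], [0.4, 0.3, 0.3]))
```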

Updated: 2025-03-17 08:41:29

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2503.12931v1

Augmented Invertible Koopman Autoencoder for long-term time series forecasting

Following the introduction of Dynamic Mode Decomposition and its numerous extensions, many neural autoencoder-based implementations of the Koopman operator have recently been proposed. This class of methods appears to be of interest for modeling dynamical systems, either through direct long-term prediction of the evolution of the state or as a powerful embedding for downstream methods. In particular, a recent line of work has developed invertible Koopman autoencoders (IKAEs), which provide an exact reconstruction of the input state thanks to their analytically invertible encoder, based on coupling layer normalizing flow models. We identify that the conservation of the dimension imposed by the normalizing flows is a limitation for the IKAE models, and thus we propose to augment the latent state with a second, non-invertible encoder network. This results in our new model: the Augmented Invertible Koopman AutoEncoder (AIKAE). We demonstrate the relevance of the AIKAE through a series of long-term time series forecasting experiments, on satellite image time series as well as on a benchmark involving predictions based on a large lookback window of observations.

Updated: 2025-03-17 08:40:50

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.12930v1

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts

Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy over the 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as "draft" to generate the next few tokens and use the "target" large model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance ratio of the draft-generated tokens and the relative token throughput of the draft versus the target model. Nevertheless, an efficient SD pipeline requires pre-training and aligning the draft model to the target model, making it impractical for LLM inference in a plug-and-play fashion. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline. Then we pursue an opportunity for further acceleration: the MXFP4 draft token generation itself can be accelerated via speculative decoding by using yet another smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts since it recursively applies speculation for accelerating the draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts we outperform state-of-the-art speculative decoding, yielding speedups up to 2.72x over the BF16 baseline.
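
A schematic of the multi-level recursion, assuming greedy drafting and verification over an ordered list of models from smallest draft to target; real speculative decoding verifies draft blocks with batched forward passes and rejection sampling, so this toy version only shows the control flow:

```python
# Multi-level speculative decoding sketch: each level's draft tokens
# are themselves drafted by the next-smaller model, then verified.
def speculative_generate(models, prompt, n_tokens, k=4):
    """models: list ordered small -> large; each maps a token list to
    the next greedy token. The largest model is the target."""
    if len(models) == 1:                        # base case: no draft left
        out = list(prompt)
        for _ in range(n_tokens):
            out.append(models[0](out))
        return out[len(prompt):]

    target, drafts = models[-1], models[:-1]
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        block = speculative_generate(drafts, out, k)   # draft k tokens
        for tok in block:                              # verify with target
            if target(out) == tok:
                out.append(tok)                        # accepted
            else:
                out.append(target(out))                # correct and resync
                break
    return out[len(prompt) : len(prompt) + n_tokens]

# Toy check with integer "tokens": both models just increment.
small = lambda seq: (seq[-1] + 1) % 100
large = lambda seq: (seq[-1] + 1) % 100
print(speculative_generate([small, large], [0], 8))
```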

Updated: 2025-03-17 08:38:45

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13565v1

Lifelong Reinforcement Learning with Similarity-Driven Weighting by Large Models

Lifelong Reinforcement Learning (LRL) holds significant potential for addressing sequential tasks, but it still faces considerable challenges. A key difficulty lies in effectively preventing catastrophic forgetting and facilitating knowledge transfer while maintaining reliable decision-making performance across subsequent tasks in dynamic environments. To tackle this, we propose a novel framework, SDW (Similarity-Driven Weighting Framework), which leverages large-language-model-generated dynamic functions to precisely control the training process. The core of SDW lies in two functions pre-generated by large models: the task similarity function and the weight computation function. The task similarity function extracts multidimensional features from task descriptions to quantify the similarities and differences between tasks in terms of states, actions, and rewards. The weight computation function dynamically generates critical training parameters based on the similarity information, including the proportion of old task data stored in the Replay Buffer and the strategy consistency weight in the loss function, enabling an adaptive balance between learning new tasks and transferring knowledge from previous tasks. By generating function code offline prior to training, rather than relying on large-model inference during the training process, the SDW framework reduces computational overhead while maintaining efficiency in sequential task scenarios. Experimental results on Atari and MiniHack sequential tasks demonstrate that SDW significantly outperforms existing lifelong reinforcement learning methods.
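
The two pre-generated functions might take the following shape; both bodies below are invented placeholders, since in SDW the actual code is emitted offline by a large language model from task descriptions:

```python
# Hypothetical shapes of SDW's two pre-generated functions.
def task_similarity(desc_a: dict, desc_b: dict) -> float:
    """Compare multidimensional task features (states/actions/rewards)."""
    keys = ("state", "action", "reward")
    return sum(desc_a[k] == desc_b[k] for k in keys) / len(keys)

def training_weights(similarity: float) -> dict:
    """Map similarity to replay proportion and consistency weight."""
    return {
        "replay_fraction": 0.5 * (1.0 - similarity),  # dissimilar -> replay more
        "consistency_weight": similarity,             # similar -> transfer more
    }

sim = task_similarity(
    {"state": "pixels", "action": "discrete", "reward": "score"},
    {"state": "pixels", "action": "discrete", "reward": "shaped"},
)
print(training_weights(sim))
```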

Updated: 2025-03-17 08:36:16

Categories: cs.LG,cs.SI

Download: http://arxiv.org/abs/2503.12923v1

Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach

Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving \numepisodes{} navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics which the agent learned for open-loop forecasting, and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent, and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture on how using tools from computer vision and sequential decision making have led to new capabilities in robotics and control. An interactive tool is available at europe.naverlabs.com/research/publications/reasoning-in-visual-navigation-of-end-to-end-trained-agents.

Updated: 2025-03-17 08:35:32

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.08306v3

EMN: Brain-inspired Elastic Memory Network for Quick Domain Adaptive Feature Mapping

Utilizing unlabeled data in the target domain to perform continuous optimization is critical to enhance the generalization ability of neural networks. Most domain adaptation methods focus on time-consuming optimization of deep feature extractors, which limits the deployment on lightweight edge devices. Inspired by the memory mechanism and powerful generalization ability of biological neural networks in human brains, we propose a novel gradient-free Elastic Memory Network, namely EMN, to support quick fine-tuning of the mapping between features and prediction without heavy optimization of deep features. In particular, EMN adopts randomly connected neurons to memorize the association of features and labels, where the signals in the network are propagated as impulses, and the prediction is made by associating the memories stored on neurons based on their confidence. More importantly, EMN supports reinforced memorization of feature mapping based on unlabeled data to quickly adapt to a new domain. Experiments based on four cross-domain real-world datasets show that EMN can achieve up to 10% enhancement of performance while requiring less than 1% of the time cost of traditional domain adaptation methods.

Updated: 2025-03-17 08:34:07

Categories: cs.NE,cs.LG

Download: http://arxiv.org/abs/2402.14598v2

Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

Recent years have witnessed the success of Multimodal Large Language Models (MLLMs) in the vision understanding domain. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has mainly been powered by automatic data pipelines, which center around the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our main study approach is fine-tuning pre-trained image-LLMs with video data and investigating learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to baselines trained with many more samples. Meanwhile, we find that incorporating these synthetic samples can boost the performance of long video understanding without training with long video data. The code and data examples are available at https://github.com/VITA-MLLM/Sparrow.

Updated: 2025-03-17 08:33:00

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2411.19951v4

COSMOS: Continuous Simplicial Neural Networks

Simplicial complexes provide a powerful framework for modeling high-order interactions in structured data, making them particularly suitable for applications such as trajectory prediction and mesh processing. However, existing simplicial neural networks (SNNs), whether convolutional or attention-based, rely primarily on discrete filtering techniques, which can be restrictive. In contrast, partial differential equations (PDEs) on simplicial complexes offer a principled approach to capture continuous dynamics in such structures. In this work, we introduce COntinuous SiMplicial neural netwOrkS (COSMOS), a novel SNN architecture derived from PDEs on simplicial complexes. We provide theoretical and experimental justifications of COSMOS's stability under simplicial perturbations. Furthermore, we investigate the over-smoothing phenomenon, a common issue in geometric deep learning, demonstrating that COSMOS offers better control over this effect than discrete SNNs. Our experiments on real-world datasets of ocean trajectory prediction and regression on partial deformable shapes demonstrate that COSMOS achieves competitive performance compared to state-of-the-art SNNs in complex and noisy environments.
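
As a concrete instance of the continuous dynamics in question, the canonical heat-type PDE on a simplicial complex is driven by the Hodge Laplacian (the paper's operator may be more general; this is the textbook example):

\[
\frac{\partial x_k(t)}{\partial t} = -L_k \, x_k(t), \qquad L_k = B_k^{\top} B_k + B_{k+1} B_{k+1}^{\top},
\]

where $x_k(t)$ is a signal on the $k$-simplices, $B_k$ are boundary (incidence) matrices, and $L_k$ is the $k$-th Hodge Laplacian. Discretizing $t$ yields a continuous-depth simplicial layer, and the spectrum of $L_k$ controls how fast diffusion over-smooths, which is the handle a continuous model can use to keep that effect in check.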

Updated: 2025-03-17 08:31:25

Categories: cs.LG

Download: http://arxiv.org/abs/2503.12919v1

Verification Learning: Make Unsupervised Neuro-Symbolic System Feasible

The current Neuro-Symbolic (NeSy) learning paradigm suffers from an over-reliance on labeled data. If we completely disregard labels, it leads to less symbol information, a larger solution space, and more shortcuts, issues that current NeSy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in NeSy into a label-free verification process. VL achieves excellent learning results solely by relying on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic Combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs to the level of a Constraint Satisfaction Problem (CSP). To further enhance performance, we introduce a prior alignment method to address potential shortcuts. Our theoretical analysis points out which tasks in NeSy systems can be completed without labels and explains why rules can replace infinite labels for some tasks, such as addition, while for others, like Sudoku, the rules have no effect. We validate the proposed framework through several fully unsupervised tasks including addition, sort, match, and chess, each showing significant performance and efficiency improvements.

Updated: 2025-03-17 08:28:58

Categories: cs.AI

Download: http://arxiv.org/abs/2503.12917v1

Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs

Despite extensive research efforts focused on OOD detection on images, OOD detection on nodes in graph learning remains underexplored. The dependence among graph nodes hinders the trivial adaptation of existing approaches on images that assume inputs to be i.i.d. sampled, since many unique features and challenges specific to graphs are not considered, such as the heterophily issue. Recently, GNNSafe, which considers node dependence, adapted energy-based detection to the graph domain with state-of-the-art performance; however, it has two serious issues: 1) it derives node energy from classification logits without specifically tailored training for modeling data distribution, making it less effective at recognizing OOD data; 2) it highly relies on energy propagation, which is based on the homophily assumption and will cause significant performance degradation on heterophilic graphs, where a node tends to have a dissimilar distribution from its neighbors. To address the above issues, we suggest training EBMs by MLE to enhance data distribution modeling and remove energy propagation to overcome the heterophily issues. However, training EBMs via MLE requires performing MCMC sampling on both node feature and node neighbors, which is challenging due to the node interdependence and discrete graph topology. To tackle the sampling challenge, we introduce DeGEM, which decomposes the learning process into two parts: a graph encoder that leverages topology information for node representations and an energy head that operates in latent space. Extensive experiments validate that DeGEM, without OOD exposure during training, surpasses previous state-of-the-art methods, achieving an average AUROC improvement of 6.71% on homophilic graphs and 20.29% on heterophilic graphs, and even outperforms methods trained with OOD exposure. Our code is available at: https://github.com/draym28/DeGEM.
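
One way to realize an energy head trained MLE-style in latent space (the encoder, step sizes, and contrastive-divergence objective below are generic EBM machinery, not DeGEM's exact recipe) is Langevin sampling over encoder outputs, which sidesteps MCMC over discrete graph topology:

```python
# Latent-space Langevin negatives for contrastive-divergence training.
import torch

def langevin_negatives(energy_head, z_init, steps=20, step_size=0.01):
    z = z_init.clone().requires_grad_(True)
    for _ in range(steps):
        e = energy_head(z).sum()
        grad, = torch.autograd.grad(e, z)
        z = (z - 0.5 * step_size * grad
             + step_size ** 0.5 * torch.randn_like(z)).detach().requires_grad_(True)
    return z.detach()

energy_head = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.SiLU(),
                                  torch.nn.Linear(64, 1))
z_pos = torch.randn(32, 16)          # stand-in for graph-encoder outputs
z_neg = langevin_negatives(energy_head, torch.randn(32, 16))
cd_loss = energy_head(z_pos).mean() - energy_head(z_neg).mean()
```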

Updated: 2025-03-17 08:23:08

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.17912v2

Pose as a Modality: A Psychology-Inspired Network for Personality Recognition with a New Multimodal Dataset

In recent years, predicting Big Five personality traits from multimodal data has received significant attention in artificial intelligence (AI). However, existing computational models often fail to achieve satisfactory performance. Psychological research has shown a strong correlation between pose and personality traits, yet previous research has largely ignored pose data in computational models. To address this gap, we develop a novel multimodal dataset that incorporates full-body pose data. The dataset includes video recordings of 287 participants completing a virtual interview with 36 questions, along with self-reported Big Five personality scores as labels. To effectively utilize this multimodal data, we introduce the Psychology-Inspired Network (PINet), which consists of three key modules: Multimodal Feature Awareness (MFA), Multimodal Feature Interaction (MFI), and Psychology-Informed Modality Correlation Loss (PIMC Loss). The MFA module leverages the Vision Mamba Block to capture comprehensive visual features related to personality, while the MFI module efficiently fuses the multimodal features. The PIMC Loss, grounded in psychological theory, guides the model to emphasize different modalities for different personality dimensions. Experimental results show that the PINet outperforms several state-of-the-art baseline models. Furthermore, the three modules of PINet contribute almost equally to the model's overall performance. Incorporating pose data significantly enhances the model's performance, with the pose modality ranking mid-level in importance among the five modalities. These findings address the existing gap in personality-related datasets that lack full-body pose data and provide a new approach for improving the accuracy of personality prediction models, highlighting the importance of integrating psychological insights into AI frameworks.

Updated: 2025-03-17 08:21:33

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.12912v1

HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models

Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model's prediction as inducing heads, then induces hallucinations by dispersing attention of these inducing heads and compares the hallucinated outputs with the original outputs to obtain the final result. Our approach significantly improves performance on tasks requiring contextual faithfulness, such as context completion, reading comprehension, and question answering. It also improves factuality in tasks requiring accurate knowledge recall. We demonstrate that our inducing heads selection and attention dispersion method leads to more "contrast-effective" hallucinations for contrastive decoding, outperforming other hallucination-inducing methods. Our findings provide a promising strategy for reducing hallucinations by inducing hallucinations in a controlled manner, enhancing the performance of LLMs in a wide range of tasks.
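
Contrastive decoding then amounts to down-weighting whatever the hallucination-induced copy prefers; the (1 + alpha)/alpha combination below is the standard contrastive-decoding rule and stands in for HICD's exact formulation, which may differ:

```python
# Sketch of the contrastive-decoding step: combine logits from the
# original model with logits from a copy whose selected inducing
# heads have dispersed (e.g., uniform) attention.
import torch

def contrastive_logits(logits_orig, logits_induced, alpha=1.0):
    return (1 + alpha) * logits_orig - alpha * logits_induced

logits_orig = torch.tensor([2.0, 1.0, 0.5])
logits_halluc = torch.tensor([1.8, 1.4, 0.2])   # from attention-dispersed heads
next_token = contrastive_logits(logits_orig, logits_halluc).argmax()
```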

Updated: 2025-03-17 08:17:28

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.12908v1

Treble Counterfactual VLMs: A Causal Approach to Hallucination

Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. However, they often generate hallucinated outputs inconsistent with the visual context or prompt, limiting reliability in critical applications like autonomous driving and medical imaging. Existing studies link hallucination to statistical biases, language priors, and biased feature learning but lack a structured causal understanding. In this work, we introduce a causal perspective to analyze and mitigate hallucination in VLMs. We hypothesize that hallucination arises from unintended direct influences of either the vision or text modality, bypassing proper multi-modal fusion. To address this, we construct a causal graph for VLMs and employ counterfactual analysis to estimate the Natural Direct Effect (NDE) of vision, text, and their cross-modal interaction on the output. We systematically identify and mitigate these unintended direct effects to ensure that responses are primarily driven by genuine multi-modal fusion. Our approach consists of three steps: (1) designing structural causal graphs to distinguish correct fusion pathways from spurious modality shortcuts, (2) estimating modality-specific and cross-modal NDE using perturbed image representations, hallucinated text embeddings, and degraded visual inputs, and (3) implementing a test-time intervention module to dynamically adjust the model's dependence on each modality. Experimental results demonstrate that our method significantly reduces hallucination while preserving task performance, providing a robust and interpretable framework for improving VLM reliability. To enhance accessibility and reproducibility, our code is publicly available at https://github.com/TREE985/Treble-Counterfactual-VLMs.

Updated: 2025-03-17 08:11:52

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06169v2

MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG

Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. In this paper, we propose the MES-RAG framework, which enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying protections prior to data access. Additionally, the system supports real-time multi-modal outputs, including text, images, audio, and video, seamlessly integrating into existing RAG architectures. Experimental results demonstrate that MES-RAG significantly improves both accuracy and recall, highlighting its effectiveness in advancing the security and utility of question-answering, increasing accuracy to 0.83 (+0.25) on the targeted task. Our code and data are available at https://github.com/wpydcr/MES-RAG.

Updated: 2025-03-17 08:09:42

Categories: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2503.13563v1

Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan ego vehicle's trajectory. The world model that can foresee the outcome of the trajectory has been used to evaluate the end-to-end autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In addition, it remains a challenge to match multiple trajectories with each vehicle in the video to control the video generation. To address the above issues, a driving World Model named EOT-WM is proposed in this paper, unifying Ego-Other vehicle Trajectories in videos. Specifically, we first project ego and other vehicle trajectories in the BEV space into the image coordinate to match each trajectory with its corresponding vehicle in the video. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
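
The BEV-to-image matching step reduces to a standard pinhole projection; the intrinsics and extrinsics below are toy values standing in for per-camera calibration (e.g., from nuScenes), and the frame conventions are simplified for illustration:

```python
# Sketch of projecting BEV trajectory waypoints into image coordinates.
import numpy as np

def project_to_image(pts_ego, K, T_cam_from_ego):
    """pts_ego: (N, 3) waypoints in the ego/BEV frame -> (N, 2) pixels."""
    pts_h = np.concatenate([pts_ego, np.ones((len(pts_ego), 1))], axis=1)
    pts_cam = (T_cam_from_ego @ pts_h.T)[:3]   # to camera frame
    uv = K @ pts_cam                           # pinhole projection
    return (uv[:2] / uv[2]).T                  # perspective divide

K = np.array([[1266.0, 0, 800.0], [0, 1266.0, 450.0], [0, 0, 1.0]])
T = np.eye(4); T[2, 3] = 1.5                   # toy extrinsics
print(project_to_image(np.array([[2.0, 0.0, 10.0]]), K, T))
```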

Updated: 2025-03-17 08:07:46

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.09215v2

Experiments with Optimal Model Trees

Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to ``classic'' decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.

Updated: 2025-03-17 08:03:47

Categories: cs.LG

Download: http://arxiv.org/abs/2503.12902v1

A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation

Language Models (LMs) are widely used in software engineering for code generation, but they may produce code with errors. Rather than repairing the generated code, an alternative way is to address the underlying failures of models. LM repair offers a lightweight solution to this challenge: it requires minimal data, reduces computational costs, and reduces the side effects. Unlike retraining, LM repair focuses on applying tailored updates to targeted neurons, making it ideal for scenarios with limited resources, high-performance demands, or strict safety requirements. In this paper, we propose \ul{S}emantic \ul{T}argeting for \ul{A}nalytical \ul{R}epair (\textsc{STAR}), a pioneering and novel semantic-based optimization approach for repairing LLMs. \textsc{STAR} realizes main operations in LM repair methods in an optimization process, including locating ``buggy neurons'', solving ``neuron patches'', and patching ``buggy neurons''. Correspondingly, it computes the deltas of weight matrix as the prior information to guide optimization; and attributes the targeted layers and neurons leveraging statistical insights. The neuron patches are computed with a solid semantic-based analytical formula, which directly bridges the changes to logits with the deltas of neurons, by steering latent representations. Compared to the prior work of LM repair (\textsc{MINT}) and optimization methods (\textsc{SGD}), \textsc{STAR} integrates their strengths while mitigating their limitations. \textsc{STAR} supports solving multiple failures together, significantly improving the usefulness. Evaluated on three code generation tasks using popular code LMs, \textsc{STAR} demonstrates superior effectiveness. Additionally, \textsc{STAR} exhibits better efficiency. In terms of side effects, namely the balance between generalization and specificity, \textsc{STAR} outperforms prior work by a significant margin.

Updated: 2025-03-17 07:59:42

Categories: cs.SE,cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.12899v1

Federated Continual Instruction Tuning

A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Our source code and dataset will be made publicly available.

Updated: 2025-03-17 07:58:06

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.12897v1

Safeguarding LLM Embeddings in End-Cloud Collaboration via Entropy-Driven Perturbation

Recent studies improve on-device language model (LM) inference through end-cloud collaboration, where the end device retrieves useful information from cloud databases to enhance local processing, known as Retrieval-Augmented Generation (RAG). Typically, to retrieve information from the cloud while safeguarding privacy, the end device transforms original data into embeddings with a local embedding model. However, the recently emerging Embedding Inversion Attacks (EIAs) can still recover the original data from text embeddings (e.g., training a recovery model to map embeddings back to original texts), posing a significant threat to user privacy. To address this risk, we propose EntroGuard, an entropy-driven perturbation-based embedding privacy protection method, which can protect the privacy of text embeddings while maintaining retrieval accuracy during the end-cloud collaboration. Specifically, to defeat various EIAs, we perturb the embeddings to increase the entropy of the recovered text in the common structure of recovery models, thus steering the embeddings toward meaningless texts rather than original sensitive texts during the recovery process. To maintain retrieval performance in the cloud, we constrain the perturbations within a bound, applying the strategy of reducing them where redundant and increasing them where sparse. Moreover, EntroGuard can be directly integrated into end devices without requiring any modifications to the embedding model. Extensive experimental results demonstrate that EntroGuard can reduce the risk of privacy leakage by up to 8 times at most with negligible loss of retrieval performance compared to existing privacy-preserving methods.

Updated: 2025-03-17 07:58:05

Categories: cs.CR

Download: http://arxiv.org/abs/2503.12896v1

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of $\textit{benign relearning attacks}$. With access to only a small and potentially loosely related set of data, we find that we can ''jog'' the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study. Our work indicates that current approximate unlearning methods simply suppress the model outputs and fail to robustly forget target knowledge in the LLMs.

Updated: 2025-03-17 07:46:49

Categories: cs.LG

Download: http://arxiv.org/abs/2406.13356v4

Edgeworth Expansion for Semi-hard Triplet Loss

We develop a higher-order asymptotic analysis for the semi-hard triplet loss using the Edgeworth expansion. It is known that this loss function enforces that embeddings of similar samples are close while those of dissimilar samples are separated by a specified margin. By refining the classical central limit theorem, our approach quantifies the impact of the margin parameter and the skewness of the underlying data distribution on the loss behavior. In particular, we derive explicit Edgeworth expansions that reveal first-order corrections in terms of the third cumulant, thereby characterizing non-Gaussian effects present in the distribution of distance differences between anchor-positive and anchor-negative pairs. Our findings provide detailed insight into the sensitivity of the semi-hard triplet loss to its parameters and offer guidance for choosing the margin to ensure training stability.
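
For reference, the first-order Edgeworth correction to the central limit theorem for a standardized sum $S_n$ of $n$ i.i.d. terms with third cumulant $\kappa_3$ reads

\[
P(S_n \le x) = \Phi(x) - \frac{\kappa_3}{6\sqrt{n}}\,\big(x^2 - 1\big)\,\phi(x) + O(n^{-1}),
\]

where $\Phi$ and $\phi$ denote the standard normal CDF and PDF. In the triplet setting, applying this expansion to the margin-shifted distribution of anchor-positive versus anchor-negative distance differences is what exposes the first-order dependence of the loss on the margin and on the skewness of the data distribution.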

Updated: 2025-03-17 07:46:10

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2503.12893v1

Privacy-Aware RAG: Secure and Isolated Knowledge Retrieval

The widespread adoption of Retrieval-Augmented Generation (RAG) systems in real-world applications has heightened concerns about the confidentiality and integrity of their proprietary knowledge bases. These knowledge bases, which play a critical role in enhancing the generative capabilities of Large Language Models (LLMs), are increasingly vulnerable to breaches that could compromise sensitive information. To address these challenges, this paper proposes an advanced encryption methodology designed to protect RAG systems from unauthorized access and data leakage. Our approach encrypts both textual content and its corresponding embeddings prior to storage, ensuring that all data remains securely encrypted. This mechanism restricts access to authorized entities with the appropriate decryption keys, thereby significantly reducing the risk of unintended data exposure. Furthermore, we demonstrate that our encryption strategy preserves the performance and functionality of RAG pipelines, ensuring compatibility across diverse domains and applications. To validate the robustness of our method, we provide comprehensive security proofs that highlight its resilience against potential threats and vulnerabilities. These proofs also reveal limitations in existing approaches, which often lack robustness, adaptability, or reliance on open-source models. Our findings suggest that integrating advanced encryption techniques into the design and deployment of RAG systems can effectively enhance privacy safeguards. This research contributes to the ongoing discourse on improving security measures for AI-driven services and advocates for stricter data protection standards within RAG architectures.
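
A minimal sketch of the encrypt-before-store idea, using symmetric Fernet encryption as a stand-in; key management, access control, and the paper's actual scheme are out of scope here:

```python
# Encrypt both text and its embedding prior to storage (illustration only).
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # held only by authorized entities
f = Fernet(key)

record = {"text": "internal design doc ...", "embedding": [0.12, -0.08, 0.33]}
ciphertext = f.encrypt(json.dumps(record).encode())   # stored in the KB

# Retrieval side: only holders of the key can recover text/embedding.
restored = json.loads(f.decrypt(ciphertext))
```

Note that naively encrypting embeddings rules out direct similarity search over ciphertexts, so any concrete scheme must pair encryption with a retrieval mechanism usable by authorized decryption points, which is precisely the part the paper's design has to supply.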

Updated: 2025-03-17 07:45:05

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2503.15548v1

Secure On-Device Video OOD Detection Without Backpropagation

Out-of-Distribution (OOD) detection is critical for ensuring the reliability of machine learning models in safety-critical applications such as autonomous driving and medical diagnosis. While deploying personalized OOD detection directly on edge devices is desirable, it remains challenging due to large model sizes and the computational infeasibility of on-device training. Federated learning partially addresses this but still requires gradient computation and backpropagation, exceeding the capabilities of many edge devices. To overcome these challenges, we propose SecDOOD, a secure cloud-device collaboration framework for efficient on-device OOD detection without requiring device-side backpropagation. SecDOOD utilizes cloud resources for model training while ensuring user data privacy by retaining sensitive information on-device. Central to SecDOOD is a HyperNetwork-based personalized parameter generation module, which adapts cloud-trained models to device-specific distributions by dynamically generating local weight adjustments, effectively combining central and local information without local fine-tuning. Additionally, our dynamic feature sampling and encryption strategy selectively encrypts only the most informative feature channels, largely reducing encryption overhead without compromising detection performance. Extensive experiments across multiple datasets and OOD scenarios demonstrate that SecDOOD achieves performance comparable to fully fine-tuned models, enabling secure, efficient, and personalized OOD detection on resource-limited edge devices. To enhance accessibility and reproducibility, our code is publicly available at https://github.com/Dystopians/SecDOOD.

Updated: 2025-03-17 07:44:00

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2503.06166v2

Micro Text Classification Based on Balanced Positive-Unlabeled Learning

In real-world text classification tasks, negative texts often contain a minimal proportion of negative content, which is especially problematic in areas like text quality control, legal risk screening, and sensitive information interception. This challenge manifests at two levels: at the macro level, distinguishing negative texts is difficult due to the high similarity between coarse-grained positive and negative samples; at the micro level, the issue stems from extreme class imbalance and a lack of fine-grained labels. To address these challenges, we propose transforming the coarse-grained positive-negative (PN) classification task into an imbalanced fine-grained positive-unlabeled (PU) classification problem, supported by theoretical analysis. We introduce a novel framework, Balanced Fine-Grained Positive-Unlabeled (BFGPU) learning, which features a unique PU learning loss function that optimizes macro-level performance amidst severe imbalance at the micro level. The framework's performance is further boosted by rebalanced pseudo-labeling and threshold adjustment. Extensive experiments on both public and real-world datasets demonstrate the effectiveness of BFGPU, which outperforms other methods, even in extreme scenarios where both macro and micro levels are highly imbalanced.
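
For orientation, a standard non-negative PU risk estimator (in the style of Kiryo et al.) is sketched below; BFGPU's balanced fine-grained loss, rebalanced pseudo-labeling, and threshold adjustment build on top of this kind of estimator rather than being reproduced here:

```python
# Non-negative PU risk: positives are labeled, the rest are unlabeled,
# and the class prior pi corrects the unlabeled-as-negative bias.
import torch
import torch.nn.functional as F

def nnpu_risk(scores_p, scores_u, prior):
    """scores_*: raw model outputs for positive / unlabeled samples;
    prior: assumed fraction pi of positives among unlabeled data."""
    loss = lambda s, y: F.binary_cross_entropy_with_logits(
        s, torch.full_like(s, y), reduction="mean")
    r_p_pos = loss(scores_p, 1.0)                 # positives as positive
    r_p_neg = loss(scores_p, 0.0)                 # positives as negative
    r_u_neg = loss(scores_u, 0.0)                 # unlabeled as negative
    neg_part = r_u_neg - prior * r_p_neg
    return prior * r_p_pos + torch.clamp(neg_part, min=0.0)  # non-negative

risk = nnpu_risk(torch.randn(8), torch.randn(64), prior=0.05)
```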

Updated: 2025-03-17 07:42:27

标题: 基于平衡正无标签学习的微型文本分类

摘要: 在现实世界的文本分类任务中,负面文本通常包含极少比例的负面内容,这在文本质量控制、法律风险筛查和敏感信息拦截等领域尤为棘手。这一挑战在两个层面上表现出来:在宏观层面,由于粗粒度正负样本之间的高相似性,区分负面文本变得困难;在微观层面,问题源于极端的类别不平衡和缺乏细粒度标签。为了解决这些挑战,我们提出将粗粒度正负(PN)分类任务转化为不平衡的细粒度正-未标记(PU)分类问题,并支持理论分析。我们引入了一种新颖的框架,平衡细粒度正-未标记(BFGPU)学习,其特点是一种独特的PU学习损失函数,能够在微观层面的严重不平衡中优化宏观层面的性能。该框架通过重新平衡的伪标记和阈值调整进一步提高了性能。在公共和现实世界的数据集上进行了大量实验,证明了BFGPU的有效性,即使在宏观和微观层面都极度不平衡的极端情况下,也能胜过其他方法。

更新时间: 2025-03-17 07:42:27

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.13562v1

Neonpool: Reimagining cryptocurrency transaction pools for lightweight clients and IoT devices

The transaction pool plays a critical role in processing and disseminating transactions in cryptocurrency networks. However, increasing transaction loads strain the resources of full node deployments. We present Neonpool, an innovative transaction pool optimization using bloom filter variants, which reduces the memory footprint of the transaction pool to a fraction of its original size. Implemented in C++ and benchmarked using a unique Bitcoin and Ethereum dataset, our solution verifies and forwards transactions with over 99.99\% accuracy and does not necessitate a hard fork. Neonpool is ideally suited for lightweight cryptocurrency clients and for resource-constrained devices such as browsers, systems-on-a-chip, mobile or IoT devices.
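
As a rough illustration of why Bloom-filter variants shrink the pool, the sketch below stores transaction ids in a fixed-size bit array probed by k hashes; membership tests may yield false positives but never false negatives. This is a generic textbook Bloom filter in Python, not Neonpool's tuned C++ variant:

    import hashlib

    class BloomFilter:
        def __init__(self, m: int = 1 << 20, k: int = 7):
            self.m, self.k = m, k            # m bits, k hash probes
            self.bits = bytearray(m // 8)    # constant memory regardless of load

        def _positions(self, item: bytes):
            for i in range(self.k):
                h = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, txid: bytes) -> None:
            for p in self._positions(txid):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, txid: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(txid))

    pool = BloomFilter()
    pool.add(b"\xde\xad\xbe\xef" * 8)           # remember a seen transaction id
    print(b"\xde\xad\xbe\xef" * 8 in pool)      # True
    print(b"\xca\xfe\xba\xbe" * 8 in pool)      # almost certainly False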

Updated: 2025-03-17 07:37:50

标题: Neonpool:为轻量级客户端和物联网设备重新构想加密货币交易池

摘要: 交易池在加密货币网络中处理和传播交易中扮演着至关重要的角色。然而,不断增加的交易负载对全节点部署的资源构成了压力。我们提出了Neonpool,一种使用布隆过滤器变体的创新交易池优化方案,将交易池的内存占用减少到一小部分。我们采用C++实现了该方案,并使用独特的比特币和以太坊数据集进行了基准测试,我们的解决方案可以以超过99.99\%的准确率验证和转发交易,并且不需要硬分叉。Neonpool非常适用于轻量级加密货币客户端和资源受限的设备,如浏览器、片上系统、移动设备或物联网设备。

更新时间: 2025-03-17 07:37:50

领域: cs.CR

下载: http://arxiv.org/abs/2412.16217v2

Multi-Agent Image Restoration

Image restoration (IR) is challenging due to the complexity of real-world degradations. While many specialized and all-in-one IR models have been developed, they fail to effectively handle complex, mixed degradations. Recent agentic methods, RestoreAgent and AgenticIR, leverage intelligent, autonomous workflows to alleviate this issue, yet they suffer from suboptimal results and inefficiency, owing to resource-intensive finetuning and to ineffective searches and tool-execution trials in pursuit of satisfactory outputs. In this paper, we propose MAIR, a novel Multi-Agent approach for complex IR problems. We introduce a real-world degradation prior that categorizes degradations into three types: (1) scene, (2) imaging, and (3) compression, which are observed to occur sequentially in the real world; MAIR reverses them in the opposite order. Built upon this three-stage restoration framework, MAIR emulates a team of collaborative human specialists, including a "scheduler" for overall planning and multiple "experts" dedicated to specific degradations. This design minimizes search space and trial efforts, improving image quality while reducing inference costs. In addition, a registry mechanism is introduced to enable easy integration of new tools. Experiments on both synthetic and real-world datasets show that the proposed MAIR achieves competitive performance and improved efficiency over the previous agentic IR system. Code and models will be made available.

Updated: 2025-03-17 07:34:25

标题: 多智能体图像恢复

摘要: 图像恢复(IR)是具有挑战性的,因为真实世界中的退化复杂。尽管已经开发了许多专门化和一体化的IR模型,但它们无法有效处理复杂的、混合的退化。最近的代理方法RestoreAgent和AgenticIR利用智能、自主的工作流程来缓解这一问题,但由于它们的资源密集型微调、效果不佳的搜索和工具执行试验而导致效率低下。在本文中,我们提出了MAIR,一种新颖的多智能体方法,用于解决复杂的IR问题。我们引入了一个真实世界的退化先验,将退化分为三种类型:(1)场景、(2)成像和(3)压缩,这些类型在真实世界中观察到是按顺序发生的,并按相反的顺序进行恢复。基于这个三阶段的恢复框架,MAIR模拟了一个协作的人类专家团队,包括一个负责整体规划的“调度器”和多个致力于特定退化的“专家”。这种设计最大程度地减少了搜索空间和试验工作量,提高了图像质量同时降低了推理成本。此外,引入了一个注册机制,以便轻松集成新工具。对合成和真实世界数据集的实验表明,提出的MAIR在性能和效率上均优于先前的代理IR系统。代码和模型将会提供。

更新时间: 2025-03-17 07:34:25

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.09403v2

Early Detection of Forest Calamities in Homogeneous Stands -- Deep Learning Applied to Bark-Beetle Outbreaks

Climate change has increased the vulnerability of forests to insect-related damage, resulting in widespread forest loss in Central Europe and highlighting the need for effective, continuous monitoring systems. Remote-sensing-based forest health monitoring often relies on supervised machine learning algorithms that require labeled training data. Monitoring temporal patterns through time series analysis offers a potential alternative for earlier detection of disturbance but requires substantial storage resources. This study investigates the potential of a deep learning algorithm based on a Long Short-Term Memory (LSTM) Autoencoder for the detection of anomalies in forest health (e.g. bark beetle outbreaks), utilizing Sentinel-2 time series data. This approach is an alternative to supervised machine learning methods, avoiding the necessity for labeled training data. Furthermore, it is more memory-efficient than other time series analysis approaches, as a robust model can be created using only a 26-week-long time series as input. In this study, we monitored pure stands of spruce in Thuringia, Germany, over a 7-year period from 2018 to the end of 2024. Our best model achieved a detection accuracy of 87% on test data and was able to detect 61% of all anomalies at a very early stage (more than a month before visible signs of forest degradation). Compared to another widely used time series break detection algorithm, BFAST (Breaks For Additive Season and Trend), our approach consistently detected a higher percentage of anomalies at an earlier stage. These findings suggest that LSTM-based Autoencoders could provide a promising, resource-efficient approach to forest health monitoring, enabling more timely responses to emerging threats.
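
A minimal PyTorch sketch of the core idea, assuming a univariate 26-step input window and a simple two-sigma rule on reconstruction error; the paper's exact architecture and anomaly threshold may differ:

    import torch
    import torch.nn as nn

    class LSTMAutoencoder(nn.Module):
        # reconstructs a 26-step vegetation-index series; windows the model
        # cannot reconstruct well are flagged as anomalous
        def __init__(self, n_features=1, hidden=64, latent=16):
            super().__init__()
            self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
            self.to_latent = nn.Linear(hidden, latent)
            self.from_latent = nn.Linear(latent, hidden)
            self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_features)

        def forward(self, x):                   # x: (B, 26, n_features)
            _, (h, _) = self.encoder(x)
            z = self.to_latent(h[-1])           # compressed window summary
            rep = self.from_latent(z).unsqueeze(1).repeat(1, x.size(1), 1)
            dec, _ = self.decoder(rep)
            return self.out(dec)

    model = LSTMAutoencoder()
    x = torch.randn(8, 26, 1)                            # 26-week windows
    err = ((model(x) - x) ** 2).mean(dim=(1, 2))         # per-window error
    anomalous = err > err.mean() + 2 * err.std()         # assumed threshold rule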

Updated: 2025-03-17 07:28:15

标题: 均质林木中森林灾害的早期检测——深度学习应用于树皮甲虫爆发

摘要: 气候变化增加了森林对昆虫相关损害的脆弱性,导致中欧广泛的森林损失,并凸显了需要有效、持续的监测系统的重要性。基于遥感的森林健康监测通常依赖于需要标记的训练数据的监督式机器学习算法。通过时间序列分析监测时间模式提供了早期检测干扰的潜在替代方法,但需要大量存储资源。本研究探讨了基于长短期记忆(LSTM)自编码器的深度学习算法在利用Sentinel-2时间序列数据检测森林健康异常(如松树皮甲虫爆发)方面的潜力。这种方法是对监督式机器学习方法的替代,避免了需要标记的训练数据的必要性。此外,与其他时间序列分析方法相比,它更具存储效率,因为仅使用26周长的时间序列作为输入就可以创建一个强大的模型。在这项研究中,我们在德国图林根州的纯云杉林中,在2018年至2024年底的7年期间进行了监测。我们的最佳模型在测试数据上实现了87%的检测准确率,并能够在非常早期阶段(比森林退化可见迹象的一个月以上)检测到61%的所有异常。与另一种广泛使用的时间序列断点检测算法BFAST(季节和趋势的断点)相比,我们的方法始终在较早阶段检测到更高百分比的异常。这些发现表明,基于LSTM的自动编码器可能为森林健康监测提供一种有前景的、资源高效的方法,从而使对新兴威胁作出更及时的应对。

更新时间: 2025-03-17 07:28:15

领域: cs.LG

下载: http://arxiv.org/abs/2503.12883v1

DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification

There have been attempts to utilize linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be fine-grained into various subcategories, making it difficult to remove certain types of toxicity by using a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model's internal representations. Our method successfully mitigated toxicity from categories that the single-probe-vector approach failed to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
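
A hedged sketch of the category-specific intervention at a single generation step; the cosine-based probe selection and the projection-based scaling are plausible stand-ins for the paper's dynamic selection and scaling rules, not its exact formulation:

    import torch
    import torch.nn.functional as F

    def detoxify_hidden(h: torch.Tensor, probes: torch.Tensor, alpha: float = 1.0):
        # h: (d,) hidden state at the current step; probes: (K, d) one
        # unit-normalized toxicity probe vector per category
        sims = F.cosine_similarity(h.unsqueeze(0), probes, dim=-1)   # (K,)
        k = int(sims.argmax())
        v = probes[k]
        coeff = torch.dot(h, v)        # magnitude of h along the toxic direction
        return h - alpha * coeff * v, k

    h = torch.randn(768)
    probes = F.normalize(torch.randn(5, 768), dim=-1)    # 5 toxicity categories
    h_clean, chosen_category = detoxify_hidden(h, probes)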

Updated: 2025-03-17 07:25:32

标题: DAPI:领域自适应毒性探针矢量干预用于精细化解毒

摘要: 已经有人尝试利用线性探头进行解毒,现有研究依赖于单一毒性探头矢量来降低毒性。然而,毒性可以被细分为各种子类别,这使得使用单一毒性探头矢量难以去除某些类型的毒性。为了解决这一限制,我们提出了一种特定类别的毒性探头矢量方法。首先,我们为不同的毒性类别训练多个毒性探头矢量。在生成过程中,我们根据当前上下文动态选择最相关的毒性探头矢量。最后,所选的矢量动态缩放并从模型中减去。我们的方法成功地减轻了单一探头矢量方法无法解毒的毒性类别。实验表明,我们的方法在评估数据集上实现了高达78.52%的毒性减少,而流畅性几乎保持不变,仅与未引导的模型相比下降了0.052%。

更新时间: 2025-03-17 07:25:32

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.12882v1

nvBench 2.0: A Benchmark for Natural Language to Visualization under Ambiguity

Natural Language to Visualization (NL2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, NL2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate NL2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous NL2VIS tasks using nvBench 2.0. We also propose Step-NL2VIS, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-NL2VIS outperforms all baselines, setting a new state-of-the-art for ambiguous NL2VIS tasks.

Updated: 2025-03-17 07:20:11

标题: nvBench 2.0:一种在模糊情况下进行自然语言到可视化的基准测试

摘要: 自然语言到可视化(NL2VIS)使用户能够通过自然语言查询创建可视化,使数据洞察更加易于获取。然而,NL2VIS在解释模糊查询方面面临挑战,因为用户通常用不精确的语言表达他们的可视化需求。为了解决这一挑战,我们引入了nvBench 2.0,这是一个新的基准测试,旨在评估涉及模糊查询场景中的NL2VIS系统。nvBench 2.0包括7,878个自然语言查询和24,076个对应的可视化,来自153个领域的780个表。它是使用控制的模糊注入流水线构建的,通过反向生成工作流程生成模糊查询。通过从明确的种子可视化开始,并有选择地注入模糊性,该流水线为每个查询产生多个有效解释,每个模糊查询都可以通过逐步推理路径追溯到其相应的可视化。我们评估各种大型语言模型(LLM)在使用nvBench 2.0执行模糊NL2VIS任务的能力。我们还提出了Step-NL2VIS,这是一个基于LLM的模型,经过nvBench 2.0训练,通过逐步优化偏好,在模糊情景中提高性能。我们的结果显示,Step-NL2VIS优于所有基线,为模糊NL2VIS任务设定了新的技术水平。

更新时间: 2025-03-17 07:20:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.12880v1

P1-KAN: an effective Kolmogorov-Arnold network with application to hydraulic valley optimization

A new Kolmogorov-Arnold network (KAN) is proposed to approximate potentially irregular functions in high dimensions. We provide error bounds for this approximation, assuming that the Kolmogorov-Arnold expansion functions are sufficiently smooth. When the function is only continuous, we also provide universal approximation theorems. We show that it outperforms multilayer perceptrons in terms of accuracy and convergence speed. We also compare it with several proposed KAN networks: it outperforms all networks for irregular functions and achieves similar accuracy to the original spline-based KAN network for smooth functions. Finally, we compare some of the KAN networks in optimizing a French hydraulic valley.

Updated: 2025-03-17 07:17:43

标题: P1-KAN:一种有效的Kolmogorov-Arnold网络及其在水力山谷优化中的应用

摘要: 提出了一种新的Kolmogorov-Arnold网络(KAN),用于在高维空间中逼近潜在不规则函数。我们提供了这种逼近的误差界限,假设Kolmogorov-Arnold扩展函数足够平滑。当函数仅连续时,我们还提供了通用逼近定理。我们展示了它在准确性和收敛速度方面优于多层感知器。我们还将其与几种提出的KAN网络进行了比较:对于不规则函数,它优于所有网络,并对于光滑函数,它实现了与原始基于样条的KAN网络类似的准确性。最后,我们比较了一些KAN网络在优化法国水力谷中的表现。

更新时间: 2025-03-17 07:17:43

领域: cs.LG,cs.NE,stat.ML,68T07

下载: http://arxiv.org/abs/2410.03801v3

DimOL: Dimensional Awareness as A New 'Dimension' in Operator Learning

In the realm of computational physics, an enduring topic is the numerical solution of partial differential equations (PDEs). Recently, the attention of researchers has shifted towards Neural Operator methods, renowned for their capability to approximate ``operators'' -- mappings from functions to functions. Despite the universal approximation theorem within neural operators, ensuring error bounds often requires employing numerous Fourier layers. However, what about lightweight models? In response to this question, we introduce DimOL (Dimension-aware Operator Learning), drawing insights from dimensional analysis. To implement DimOL, we propose the ProdLayer, which can be seamlessly integrated into FNO-based and Transformer-based PDE solvers, enhancing their ability to handle the sum-of-products structures inherent in many physical systems. Empirically, DimOL models achieve up to a 48% performance gain within the PDE datasets. Furthermore, by analyzing the weights of Fourier components, we can symbolically discern the physical significance of each term. This sheds light on the opaque nature of neural networks, unveiling underlying physical principles.
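
The abstract does not spell out ProdLayer's exact form; the sketch below shows one natural reading, appending learned pairwise channel products so that sum-of-products terms (e.g., advection-like u * u_x combinations) become linearly representable by subsequent layers. The interface is an assumption:

    import torch
    import torch.nn as nn

    class ProdLayer(nn.Module):
        # appends n_products learned pairwise channel products to the input
        def __init__(self, channels: int, n_products: int):
            super().__init__()
            self.a = nn.Linear(channels, n_products, bias=False)
            self.b = nn.Linear(channels, n_products, bias=False)

        def forward(self, x):               # x: (B, N, C) point-wise features
            prod = self.a(x) * self.b(x)    # (B, N, n_products) product terms
            return torch.cat([x, prod], dim=-1)

    layer = ProdLayer(channels=32, n_products=8)
    y = layer(torch.randn(4, 1024, 32))     # -> (4, 1024, 40)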

Updated: 2025-03-17 06:54:47

标题: DimOL:维度意识作为运算学习中的新“维度”

摘要: 在计算物理领域,一个持久的话题是偏微分方程(PDE)的数值解。最近,研究人员的注意力转向了神经算子方法,以其近似“算子”(从函数到函数的映射)的能力而闻名。尽管神经算子中存在普遍逼近定理,但要确保误差界通常需要使用大量的傅立叶层。然而,轻量级模型如何呢?为了回答这个问题,我们引入了DimOL(维度感知算子学习),从维度分析中获得启示。为了实现DimOL,我们提出了ProdLayer,它可以无缝地集成到基于FNO和Transformer的PDE求解器中,增强其处理许多物理系统固有的乘积结构的能力。从经验上看,DimOL模型在PDE数据集中实现了高达48%的性能增益。此外,通过分析傅立叶分量的权重,我们可以符号地区分每个项的物理意义。这揭示了神经网络不透明性的一些特性,揭示了潜在的物理原则。

更新时间: 2025-03-17 06:54:47

领域: cs.LG

下载: http://arxiv.org/abs/2410.05894v3

GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction

The construction of the Generalized Knowledge Graph (GKG), including the knowledge graph, event knowledge graph, and commonsense knowledge graph, is fundamental for various natural language processing tasks. Current studies typically construct these types of graphs separately, overlooking holistic insights and a potential unification that could be beneficial from the perspective of computing resources and usage. However, a key challenge in developing a unified framework for GKG is the obstacles arising from task-specific differences. In this study, we propose a unified framework for constructing generalized knowledge graphs to address this challenge. First, we collect data from 15 sub-tasks in 29 datasets across the three types of graphs, categorizing them into in-sample, counter-task, and out-of-distribution (OOD) data. Then, we propose a three-stage curriculum learning fine-tuning framework that iteratively injects knowledge from the three types of graphs into the Large Language Models. Extensive experiments show that our proposed model improves the construction of all three graph types across in-domain, OOD, and counter-task data.

Updated: 2025-03-17 06:41:34

标题: GKG-LLM:通用知识图构建的统一框架

摘要: 广义知识图谱(GKG)的构建,包括知识图谱、事件知识图谱和常识知识图谱,对于各种自然语言处理任务至关重要。目前的研究通常分别构建这些类型的图,忽视了整体洞察力和潜在的统一性,这可能有利于计算资源和使用视角。然而,开发GKG的统一框架面临的一个关键挑战是源自任务特定差异的障碍。在这项研究中,我们提出了一个统一的框架,用于构建广义知识图谱,以应对这一挑战。首先,我们从三种类型的图中的15个子任务和29个数据集中收集数据,并将它们分类为样本内、对抗任务和分布外(OOD)数据。然后,我们提出了一个三阶段的课程学习微调框架,通过迭代地将三种类型的图中的知识注入到大型语言模型中。大量实验证明,我们提出的模型改进了所有三种图类型的构建,涵盖了域内、OOD和对抗任务数据。

更新时间: 2025-03-17 06:41:34

领域: cs.AI

下载: http://arxiv.org/abs/2503.11227v2

Harnessing Test-time Adaptation for NLU tasks Involving Dialects of English

Test-time adaptation (TTA) is an effective method that helps models generalize across domains, tasks, and distributions without the use of labeled datasets. TTA is therefore very useful for natural language processing (NLP) in the dialectal setting, since models are often trained on Standard American English (SAE) but evaluated on Indian English or Nigerian English, whose distributions differ significantly from the former. This is especially useful since dialectal datasets are scarce. In this paper, we explore one of the most famous TTA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also theoretically propose the concept of the dialectal gap and show that it has a positive correlation with the effectiveness of SHOT. We also find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data. Our code is available at https://github.com/dukenguyenxyz/dialect-adaptation
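
For reference, the information-maximization part of SHOT (Liang et al., 2020) can be sketched as below: minimize per-sample prediction entropy while keeping the batch-marginal prediction distribution diverse. SHOT additionally uses self-supervised pseudo-labeling and freezes the source classifier head, which is omitted here:

    import torch
    import torch.nn.functional as F

    def shot_info_max_loss(logits: torch.Tensor) -> torch.Tensor:
        p = F.softmax(logits, dim=-1)
        # per-sample entropy: push each prediction toward confidence
        ent = -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()
        # diversity: maximize the entropy of the batch-averaged prediction
        p_mean = p.mean(dim=0)
        div = (p_mean * p_mean.clamp_min(1e-8).log()).sum()
        return ent + div

    logits = torch.randn(16, 3)      # e.g., a 3-way head on dialectal inputs
    loss = shot_info_max_loss(logits)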

Updated: 2025-03-17 06:40:06

标题: 利用测试时间适应技术进行涉及英语方言的NLU任务

摘要: 测试时间适应(TTA)是一种出色的方法,有助于在不使用标记数据集的情况下泛化模型跨领域、任务和分布。因此,在方言环境中,TTA 在自然语言处理(NLP)中非常有用,因为通常模型是在标准美式英语(SAE)上训练,评估印度英语或尼日利亚英语,其分布与前者有显著差异。这尤其有用,因为方言数据集稀缺。在本文中,我们探讨了方言NLP中最著名的TTA技术之一,即SHOT。我们在不同组合的方言GLUE上对SHOT进行微调和评估。我们的研究结果表明,当标记数据集不可用时,SHOT是一种可行的技术。我们还在理论上提出了方言差距的概念,并表明它与SHOT的有效性呈正相关。我们还发现,在许多情况下,在SAE上微调的性能要高于在方言数据上微调。我们的代码可以在 https://github.com/dukenguyenxyz/dialect-adaptation 找到。

更新时间: 2025-03-17 06:40:06

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.12858v1

The Best Time for an Update: Risk-Sensitive Minimization of Age-Based Metrics

Popular methods to quantify transmitted data quality are the Age of Information (AoI), the Query Age of Information (QAoI), and the Age of Incorrect Information (AoII). We consider these metrics in a point-to-point wireless communication system, where the transmitter monitors a process and sends status updates to a receiver. The challenge is to decide on the best time for an update, balancing the transmission energy and the age-based metric at the receiver. Because high age-based metric values risk complications such as unstable system states, we introduce the new concept of risky states to denote states with a high age-based metric. We use this new notion of risky states to quantify and minimize the risk of experiencing high age-based metrics, directly deriving the frequency of risky states as a novel risk metric. Building on this foundation, we introduce two risk-sensitive strategies for AoI, QAoI, and AoII. The first strategy uses system knowledge, i.e., channel quality and packet arrival probability, to find an optimal strategy that transmits when the age-based metric exceeds a tunable threshold. A lower threshold leads to higher risk-sensitivity. The second strategy uses an enhanced Q-learning approach and balances the age-based metric, the transmission energy, and the frequency of risky states without requiring knowledge about the system. Numerical results affirm our risk-sensitive strategies' high effectiveness.
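
A toy simulation of the first (threshold) strategy, illustrating how a tunable threshold trades transmission energy against the frequency of risky states; the unreliable-channel model and all parameter values are assumptions for illustration:

    import random

    def simulate(threshold: int, risky_level: int, p_success: float = 0.8,
                 steps: int = 100_000, seed: int = 0):
        # transmit whenever the age exceeds `threshold`; a step is "risky"
        # when the age exceeds `risky_level`
        rng = random.Random(seed)
        age, tx, risky = 0, 0, 0
        for _ in range(steps):
            age += 1
            if age > threshold:                  # decide to send an update
                tx += 1
                if rng.random() < p_success:     # update survives the channel
                    age = 1
            if age > risky_level:
                risky += 1
        return tx / steps, risky / steps

    for th in (2, 5, 10):
        energy, risk = simulate(th, risky_level=8)
        print(f"threshold={th}: tx-rate={energy:.3f}, risky-freq={risk:.3f}")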

Updated: 2025-03-17 06:39:25

标题: 最佳更新时间:风险敏感的基于年龄的指标最小化

摘要: 流行的用于量化传输数据质量的方法是信息时代(AoI),查询信息时代(QAoI)和错误信息时代(AoII)。我们考虑这些指标在点对点无线通信系统中的应用,发送方监控一个过程并向接收方发送状态更新。挑战在于决定最佳的更新时间,平衡传输能量和接收方的基于时代的指标。由于高时代指标值的固有风险可能导致不稳定的系统状态等问题,我们引入了高风险状态的新概念来表示具有高时代指标的状态。我们利用这种新概念的高风险状态来量化和最小化通过直接推导高风险状态的频率作为新型风险指标来经历高时代指标的风险。在此基础上,我们引入了两种对AoI,QAoI和AoII敏感的策略。第一种策略利用系统知识,即信道质量和数据包到达概率,来找到一个在时代指标超过可调节阈值时传输的最佳策略。较低的阈值会导致更高的风险敏感性。第二种策略使用增强的Q-learning方法,平衡时代指标、传输能量和高风险状态的频率,而无需了解系统。数值结果证实了我们的风险敏感策略的高效性。

更新时间: 2025-03-17 06:39:25

领域: cs.IT,cs.LG,cs.NI,math.IT

下载: http://arxiv.org/abs/2401.10265v2

Island-Based Evolutionary Computation with Diverse Surrogates and Adaptive Knowledge Transfer for High-Dimensional Data-Driven Optimization

In recent years, there has been a growing interest in data-driven evolutionary algorithms (DDEAs) employing surrogate models to approximate the objective functions with limited data. However, current DDEAs are primarily designed for lower-dimensional problems and their performance drops significantly when applied to large-scale optimization problems (LSOPs). To address the challenge, this paper proposes an offline DDEA named DSKT-DDEA. DSKT-DDEA leverages multiple islands that utilize different data to establish diverse surrogate models, fostering diverse subpopulations and mitigating the risk of premature convergence. In the intra-island optimization phase, a semi-supervised learning method is devised to fine-tune the surrogates. It not only facilitates data augmentation, but also incorporates the distribution information gathered during the search process to align the surrogates with the evolving local landscapes. Then, in the inter-island knowledge transfer phase, the algorithm incorporates an adaptive strategy that periodically transfers individual information and evaluates the transfer effectiveness in the new environment, facilitating global optimization efficacy. Experimental results demonstrate that our algorithm is competitive with state-of-the-art DDEAs on problems with up to 1000 dimensions, while also exhibiting decent parallelism and scalability. Our DSKT-DDEA is open-source and accessible at: https://github.com/LabGong/DSKT-DDEA.

Updated: 2025-03-17 06:35:59

标题: 基于岛屿的进化计算:多样化替代品和自适应知识传递用于高维数据驱动优化

摘要: 近年来,对于利用替代模型来近似具有有限数据的目标函数的数据驱动进化算法(DDEAs)引起了越来越多的关注。然而,目前的DDEAs主要设计用于低维问题,在应用于大规模优化问题(LSOPs)时性能显著下降。为了应对这一挑战,本文提出了一种名为DSKT-DDEA的离线DDEA。DSKT-DDEA利用多个岛屿,利用不同的数据建立多样化替代模型,培养多样化的亚群体,减轻过早收敛的风险。在岛内优化阶段,设计了一种半监督学习方法来微调替代模型。它不仅有助于数据增强,还结合了在搜索过程中收集的分布信息,以使替代模型与不断变化的本地景观保持一致。然后,在岛间知识传递阶段,该算法结合了一个自适应策略,周期性地传输个体信息,并评估在新环境中的传输效果,促进全局优化效果。实验结果表明,我们的算法在高达1000维的问题上与最先进的DDEAs竞争力强,同时表现出良好的并行性和可伸缩性。我们的DSKT-DDEA是开源的,可以在https://github.com/LabGong/DSKT-DDEA上获得。

更新时间: 2025-03-17 06:35:59

领域: cs.LG,cs.NE

下载: http://arxiv.org/abs/2503.12856v1

VITED: Video Temporal Evidence Distillation

We investigate complex video question answering via chain-of-evidence reasoning -- identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi-step reasoning as they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video with supporting evidence, that maximizes the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it to both localize evidence windows as well as perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long video QA benchmarks where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.

Updated: 2025-03-17 06:30:02

标题: VITED:视频时间证据提炼

摘要: 我们通过证据链推理来研究复杂视频问答-识别来自视频多个相关部分的时间跨度序列,以及其中的视觉证据。现有模型在多步推理方面存在困难,因为它们均匀采样固定数量的帧,可能会错过在整个视频中分布不均匀的关键证据。此外,它们缺乏将这些证据在整个视频的更广泛背景中进行时间定位的能力,这对于回答复杂问题是必要的。我们提出了一个框架,通过搜索视频中具有支持证据的感兴趣的最佳时间间隔自动构建证据推理链,以最大化回答给定问题的可能性。我们训练我们的模型(VITED)直接生成这些证据链,使其能够在长视频内容中既定位证据窗口,又在其中进行多步推理。我们展示了我们的证据蒸馏模型在一系列长视频问答基准测试中的价值,我们在这些基准测试中胜过了缺乏证据推理能力的最先进方法。

更新时间: 2025-03-17 06:30:02

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.12855v1

Seeing World Dynamics in a Nutshell

We consider the problem of efficiently representing casually captured monocular videos in a spatially- and temporally-coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques treating videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real-time. Demos and code will be available at https://github.com/Nut-World/NutWorld.

Updated: 2025-03-17 06:29:41

标题: 在一个核心中看世界动态

摘要: 我们考虑如何有效地以空间和时间连贯的方式表示随意捕获的单目视频。虽然现有方法主要依赖于将视频视为时空像素集合的2D/2.5D技术,但由于缺乏时间连贯性和明确的3D结构,它们在处理复杂运动、遮挡和几何一致性方面存在困难。受单目视频作为动态3D世界投影的启发,我们通过空间-时间连续高斯基元流探索以其固有的3D形式表示视频。在本文中,我们提出了NutWorld,一个新颖的框架,可以在单向传递中将单目视频高效地转换为动态3D高斯表示。在其核心,NutWorld引入了一个结构化的空间-时间对齐高斯(STAG)表示,实现了无需优化的场景建模,具有有效的深度和流正则化。通过全面的实验,我们证明NutWorld实现了高保真度的视频重建质量,同时实时实现了各种下游应用。演示和代码将在https://github.com/Nut-World/NutWorld 上提供。

更新时间: 2025-03-17 06:29:41

领域: cs.CV,cs.AI,cs.GR,cs.MM

下载: http://arxiv.org/abs/2502.03465v2

Adaptive Transformer Attention and Multi-Scale Fusion for Spine 3D Segmentation

This study proposes a 3D semantic segmentation method for the spine based on an improved SwinUNETR to improve segmentation accuracy and robustness. To handle the complex anatomical structure of spinal images, this paper introduces a multi-scale fusion mechanism that enhances the feature extraction capability by using information at different scales, thereby improving the model's recognition accuracy for the target area. In addition, an adaptive attention mechanism enables the model to dynamically adjust its attention to key areas, thereby optimizing the boundary segmentation effect. The experimental results show that, compared with 3D CNN, 3D U-Net, and 3D U-Net + Transformer, the proposed model achieves significant improvements in the mIoU, mDice, and mAcc metrics and has better segmentation performance. Ablation experiments further verify the effectiveness of the proposed improvements, showing that the multi-scale fusion and adaptive attention mechanisms have a positive effect on the segmentation task. Visualization analysis of the inference results shows that the model can better recover the true anatomical structure of the spinal image. Future research can further optimize the Transformer structure and expand the data scale to improve the generalization ability of the model. This study provides an efficient solution for medical image segmentation, which is of great significance to intelligent medical image analysis.

Updated: 2025-03-17 06:27:43

标题: 自适应变压器注意力和多尺度融合用于脊柱3D分割

摘要: 这项研究提出了一种基于改进的SwinUNETR的脊柱三维语义分割方法,以提高分割精度和鲁棒性。针对脊柱图像复杂的解剖结构,本文引入了多尺度融合机制,通过利用不同尺度的信息来增强特征提取能力,从而提高模型对目标区域的识别精度。此外,引入自适应注意机制使模型能够动态调整对关键区域的注意力,从而优化边界分割效果。实验结果表明,与3D CNN、3D U-Net和3D U-Net + Transformer相比,本研究模型在mIoU、mDice和mAcc指标上取得了显著改进,并具有更好的分割性能。消融实验进一步验证了提出的改进方法的有效性,证明了多尺度融合和自适应注意机制对分割任务具有积极作用。通过推理结果的可视化分析,模型能够更好地恢复脊柱图像的真实解剖结构。未来的研究可以进一步优化Transformer结构并扩大数据规模,以提高模型的泛化能力。这项研究为医学图像分割任务提供了一种高效的解决方案,对智能医学图像分析具有重要意义。

更新时间: 2025-03-17 06:27:43

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.12853v1

ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning

Offline Reinforcement Learning (RL), which operates solely on static datasets without further interactions with the environment, provides an appealing alternative for learning a safe and promising control policy. The prevailing methods typically learn a conservative policy to mitigate the problem of Q-value overestimation, but this is prone to overcorrection, leading to an overly conservative policy. Moreover, they optimize all samples equally with fixed constraints, lacking the nuanced ability to control conservative levels in a fine-grained manner. Consequently, this limitation results in a performance decline. To address the above two challenges in a unified way, we propose a framework, Adaptive Conservative Level in Q-Learning (ACL-QL), which limits the Q-values in a mild range and enables adaptive control on the conservative level over each state-action pair, i.e., lifting the Q-values more for good transitions and less for bad transitions. We theoretically analyze the conditions under which the conservative level of the learned Q-function can be limited in a mild range and how to optimize each transition adaptively. Motivated by the theoretical analysis, we propose a novel algorithm, ACL-QL, which uses two learnable adaptive weight functions to control the conservative level over each transition. Subsequently, we design a monotonicity loss and surrogate losses to train the adaptive weight functions, Q-function, and policy network alternately. We evaluate ACL-QL on the commonly used D4RL benchmark and conduct extensive ablation studies to illustrate the effectiveness and state-of-the-art performance compared to existing offline DRL baselines.
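
A schematic of the adaptive-conservatism idea, written as a CQL-style penalty in which a single fixed coefficient is replaced by per-transition weights; in the paper these weights come from learnable weight functions trained with monotonicity and surrogate losses, whereas here they are placeholder tensors:

    import torch

    def acl_ql_penalty(q_data, q_policy, w_up, w_down):
        # lift Q-values on dataset actions (more for good transitions) and
        # suppress Q-values on policy-sampled actions (more for bad ones)
        return (w_down * q_policy).mean() - (w_up * q_data).mean()

    q_data = torch.randn(256)                  # Q(s, a) on dataset actions
    q_policy = torch.randn(256)                # Q(s, a') on policy actions
    w_up = torch.sigmoid(torch.randn(256))     # placeholder adaptive weights
    w_down = torch.sigmoid(torch.randn(256))
    conservative_term = acl_ql_penalty(q_data, q_policy, w_up, w_down)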

Updated: 2025-03-17 06:25:26

标题: ACL-QL:离线强化学习中的自适应保守水平Q学习

摘要: 离线强化学习(RL)仅通过静态数据集进行操作,而不需要与环境进行进一步的交互,为学习安全且有前景的控制策略提供了一种吸引人的替代方案。目前的方法通常学习一个保守的策略来减轻Q值高估的问题,但往往会过度,导致过度保守的策略。此外,它们使用固定约束优化所有样本,缺乏对保守水平进行微观控制的能力。因此,这种限制导致性能下降。为了统一解决上述两个挑战,我们提出了一个框架,自适应Q学习中的保守水平(ACL-QL),它将Q值限制在一个温和的范围内,并实现对每个状态-动作对的保守水平进行自适应控制,即提高好的转换的Q值,减少坏的转换的Q值。我们在理论上分析了学习到的Q函数的保守水平可以在一个温和的范围内受限的条件,以及如何自适应优化每个转换。受到理论分析的启发,我们提出了一种新算法ACL-QL,它使用两个可学习的自适应权重函数来控制每个转换的保守水平。随后,我们设计了一个单调性损失和替代损失来交替训练自适应权重函数、Q函数和策略网络。我们在常用的D4RL基准上评估了ACL-QL,并进行了广泛的消融研究,以展示其与现有离线DRL基线相比的有效性和处于行业领先水平的性能。

更新时间: 2025-03-17 06:25:26

领域: cs.LG,cs.AI,cs.RO

下载: http://arxiv.org/abs/2412.16848v2

Learning from Imperfect Demonstrations with Self-Supervision for Robotic Manipulation

Improving data utilization, especially for imperfect data from task failures, is crucial for robotic manipulation due to the challenging, time-consuming, and expensive data collection process in the real world. Current imitation learning (IL) typically discards imperfect data, focusing solely on successful expert data. While reinforcement learning (RL) can learn from explorations and failures, the sim2real gap and its reliance on dense reward and online exploration make it difficult to apply effectively in real-world scenarios. In this work, we aim to conquer the challenge of leveraging imperfect data without the need for reward information to improve the model performance for robotic manipulation in an offline manner. Specifically, we introduce a Self-Supervised Data Filtering framework (SSDF) that combines expert and imperfect data to compute quality scores for failed trajectory segments. High-quality segments from the failed data are used to expand the training dataset. Then, the enhanced dataset can be used with any downstream policy learning method for robotic manipulation tasks. Extensive experiments on the ManiSkill2 benchmark built on the high-fidelity Sapien simulator and real-world robotic manipulation tasks using the Franka robot arm demonstrated that the SSDF can accurately expand the training dataset with high-quality imperfect data and improve the success rates for all robotic manipulation tasks.

Updated: 2025-03-17 06:17:11

标题: 通过自我监督学习从不完美的演示中学习用于机器人操作

摘要: 提高数据利用率,特别是对于由任务失败导致的不完美数据的利用,对于机器人操作至关重要,因为在现实世界中,数据收集过程具有挑战性、耗时且昂贵。目前的模仿学习(IL)通常会丢弃不完美的数据,仅关注成功的专家数据。而强化学习(RL)可以从探索和失败中学习,但是sim2real差距及其对稠密奖励和在线探索的依赖使其难以有效应用于现实世界的场景。在这项工作中,我们旨在征服利用不完美数据的挑战,无需奖励信息即可以离线方式提高机器人操作模型性能。具体来说,我们引入了一个自监督数据过滤框架(SSDF),将专家和不完美数据结合起来,计算失败轨迹段的质量分数。来自失败数据的高质量段用于扩展训练数据集。然后,增强的数据集可以与任何下游策略学习方法一起用于机器人操作任务。在基于高保真Sapien模拟器和Franka机械臂进行的ManiSkill2基准测试以及真实世界机器人操作任务的广泛实验中,证明了SSDF能够准确地扩展训练数据集,提高所有机器人操作任务的成功率。

更新时间: 2025-03-17 06:17:11

领域: cs.RO,cs.AI,I.2.9

下载: http://arxiv.org/abs/2401.08957v3

Exploiting Edited Large Language Models as General Scientific Optimizers

Large language models (LLMs) have been widely adopted in mathematical optimization in scientific scenarios for their extensive knowledge and advanced reasoning capabilities. Existing methods mainly focus on utilizing LLMs to solve optimization problems in a prompt-based manner, which takes observational feedback as additional textual descriptions. However, due to LLMs' \textbf{high sensitivity to the prompts} and \textbf{tendency to get lost in lengthy prompts}, these methods struggle to effectively utilize the observational feedback from each optimization step, which severely hinders their application to real-world scenarios. To address these challenges, we propose a conceptually simple and general bi-level optimization method, namely \textbf{G}eneral \textbf{S}cientific \textbf{O}ptimizers (GSO). Specifically, GSO first utilizes inner-level simulators as experimental platforms to evaluate the current solution and provide observational feedback. Then, LLMs serve as knowledgeable and versatile scientists, generating new solutions by refining potential errors from the feedback as the outer-level optimization. Finally, simulations together with the expert knowledge in LLMs are jointly updated with bi-level interactions via model editing. Extensive experiments show that GSO consistently outperforms existing state-of-the-art methods using \textit{six} different LLM backbones on \textit{seven} different tasks, demonstrating the effectiveness and a wide range of applications.

Updated: 2025-03-17 05:40:49

标题: 利用编辑的大型语言模型作为通用科学优化器

摘要: 大型语言模型(LLMs)已被广泛应用于科学场景中的数学优化中,因其广泛的知识和先进的推理能力而闻名。现有方法主要集中在利用LLMs以即时方式解决优化问题,将观察反馈作为额外的文本描述。然而,由于LLM对提示的高敏感性和在冗长提示中迷失的倾向,这些方法很难有效利用每个优化步骤中的观察反馈,严重阻碍了实际场景中的应用。为了解决这些挑战,我们提出了一个概念上简单且通用的双层优化方法,即GSO(General Scientific Optimizers)。具体而言,GSO首先利用内部模拟器作为实验平台来评估当前解决方案并提供观察反馈。然后,LLMs作为知识丰富且多才多艺的科学家,在外层优化中通过从反馈中细化潜在错误来生成新解决方案。最后,通过模型编辑,模拟和LLMs中的专业知识通过双层交互进行联合更新。大量实验证明,GSO在七个不同任务上使用六种不同的LLM骨干始终优于现有的最先进方法,展示了其有效性和广泛的应用范围。

更新时间: 2025-03-17 05:40:49

领域: math.OC,cs.AI

下载: http://arxiv.org/abs/2503.09620v2

Compositional Models for Estimating Causal Effects

Many real-world systems can be usefully represented as sets of interacting components. Examples include computational systems, such as query processors and compilers, natural systems, such as cells and ecosystems, and social systems, such as families and organizations. However, current approaches to estimating potential outcomes and causal effects typically treat such systems as single units, represent them with a fixed set of variables, and assume a homogeneous data-generating process. In this work, we study a compositional approach for estimating individual-level potential outcomes and causal effects in structured systems, where each unit is represented by an instance-specific composition of multiple heterogeneous components. The compositional approach decomposes unit-level causal queries into more fine-grained queries, explicitly modeling how unit-level interventions affect component-level outcomes to generate a unit's outcome. We demonstrate this approach using modular neural network architectures and show that it provides benefits for causal effect estimation from observational data, such as accurate causal effect estimation for structured units, increased sample efficiency, improved overlap between treatment and control groups, and compositional generalization to units with unseen combinations of components. Remarkably, our results show that compositional modeling can improve the accuracy of causal estimation even when component-level outcomes are unobserved. We also create and use a set of real-world evaluation environments for the empirical evaluation of compositional approaches for causal effect estimation and demonstrate the role of composition structure, varying amounts of component-level data access, and component heterogeneity in the performance of compositional models as compared to the non-compositional approaches.

Updated: 2025-03-17 05:36:58

标题: 估计因果效应的组合模型

摘要: 许多现实世界的系统可以被有用地表示为一组相互作用的组件。例如,计算系统,如查询处理器和编译器,自然系统,如细胞和生态系统,以及社会系统,如家庭和组织。然而,当前用于估算潜在结果和因果效应的方法通常将这些系统视为单一单元,用固定的变量集表示它们,并假定均匀的数据生成过程。在这项工作中,我们研究了一种用于估算结构化系统中个体水平潜在结果和因果效应的组合方法,其中每个单元由多个异构组件的实例特定组合表示。组合方法将单元级因果查询分解为更精细的查询,明确地建模单元级干预如何影响组件级结果以生成单元的结果。我们使用模块化神经网络架构展示了这种方法,并表明它提供了从观察数据中估算因果效应的好处,如为结构化单元准确的因果效应估算,增加了样本效率,改善了治疗和对照组之间的重叠,并对具有未见组件组合的单元进行组合泛化。值得注意的是,我们的结果表明,即使组件级结果未被观察到,组合建模也可以提高因果估算的准确性。我们还创建并使用一组真实世界评估环境来对组合方法进行因果效应估算的实证评估,并展示了组合结构、不同数量的组件级数据访问以及组件异质性在相对于非组合方法的组合模型性能中的作用。

更新时间: 2025-03-17 05:36:58

领域: cs.AI,cs.LG,stat.ME

下载: http://arxiv.org/abs/2406.17714v3

GeoMask3D: Geometrically Informed Mask Selection for Self-Supervised Point Cloud Learning in 3D

We introduce a pioneering approach to self-supervised learning for point clouds, employing a geometrically informed mask selection strategy called GeoMask3D (GM3D) to boost the efficiency of Masked Auto Encoders (MAE). Unlike the conventional method of random masking, our technique utilizes a teacher-student model to focus on intricate areas within the data, guiding the model's focus toward regions with higher geometric complexity. This strategy is grounded in the hypothesis that concentrating on harder patches yields a more robust feature representation, as evidenced by the improved performance on downstream tasks. Our method also presents a complete-to-partial feature-level knowledge distillation technique designed to guide the prediction of geometric complexity utilizing a comprehensive context from feature-level information. Extensive experiments confirm our method's superiority over State-Of-The-Art (SOTA) baselines, demonstrating marked improvements in classification and few-shot tasks.

Updated: 2025-03-17 05:35:35

标题: GeoMask3D:几何信息驱动的3D自监督点云学习中的蒙版选择

摘要: 我们引入了一种自监督学习的开创性方法,用于点云,采用一种几何信息驱动的遮罩选择策略称为GeoMask3D(GM3D),以提高遮罩自动编码器(MAE)的效率。与传统的随机遮罩方法不同,我们的技术利用了师生模型来专注于数据中的复杂区域,引导模型的关注点向几何复杂性较高的区域。这种策略的基础是集中注意力于更难的区域会产生更稳健的特征表示,这一点通过在下游任务上的表现改善得到证明。我们的方法还提出了一种完整到部分的特征级知识蒸馏技术,旨在利用来自特征级信息的全面上下文指导几何复杂性的预测。大量实验证实了我们的方法优于最先进的基线方法,在分类和少样本任务上表现出显著的改进。

更新时间: 2025-03-17 05:35:35

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2405.12419v2

CompMarkGS: Robust Watermarking for Compression 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) enables rapid differentiable rendering for 3D reconstruction and novel view synthesis, leading to its widespread commercial use. Consequently, copyright protection via watermarking has become critical. However, because 3DGS relies on millions of Gaussians, which require gigabytes of storage, efficient transfer and storage require compression. Existing 3DGS watermarking methods are vulnerable to quantization-based compression, often resulting in the loss of the embedded watermark. To address this challenge, we propose a novel watermarking method that ensures watermark robustness after model compression while maintaining high rendering quality. In detail, we incorporate a quantization distortion layer that simulates compression during training, preserving the watermark under quantization-based compression. Also, we propose a learnable watermark embedding feature that embeds the watermark into the anchor feature, ensuring structural consistency and seamless integration into the 3D scene. Furthermore, we present a frequency-aware anchor growing mechanism to enhance image quality in high-frequency regions by effectively identifying Gaussians within these regions. Experimental results confirm that our method preserves the watermark and maintains superior image quality under high compression, validating it as a promising approach for a secure 3DGS model.
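
A minimal sketch of a quantization distortion layer built on the straight-through estimator: the forward pass sees uniformly quantized values, so the watermark is trained to survive quantization-based compression, while gradients flow through unchanged. The uniform 8-bit scheme is an assumption:

    import torch

    class QuantizationDistortion(torch.nn.Module):
        def __init__(self, n_bits: int = 8):
            super().__init__()
            self.levels = 2 ** n_bits - 1

        def forward(self, x):
            lo, hi = x.min().detach(), x.max().detach()
            scale = (hi - lo).clamp_min(1e-8) / self.levels
            q = ((x - lo) / scale).round() * scale + lo   # simulated compression
            return x + (q - x).detach()                   # straight-through trick

    params = torch.randn(1000, requires_grad=True)        # e.g., Gaussian attributes
    QuantizationDistortion(8)(params).sum().backward()    # gradients reach params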

Updated: 2025-03-17 05:32:15

标题: CompMarkGS:用于压缩3D高斯splatting的强大水印技术

摘要: 3D高斯点云绘制(3DGS)实现了快速可微分渲染,用于3D重建和新视角合成,因此被广泛商业应用。因此,通过水印技术进行版权保护变得至关重要。然而,由于3DGS依赖于数百万个高斯点云,需要占用几千兆字节的存储空间,因此高效的传输和存储需要压缩。现有的3DGS水印方法容易受到基于量化的压缩攻击,通常导致嵌入水印的丢失。为了解决这一挑战,我们提出了一种新颖的水印方法,在模型压缩后保证水印的稳健性,同时保持高质量的渲染效果。具体来说,我们引入了一个量化失真层,模拟训练期间的压缩过程,保持了水印在基于量化的压缩下的稳定性。此外,我们提出了一个可学习的水印嵌入特征,将水印嵌入到锚定特征中,确保结构一致性,无缝融入3D场景。此外,我们提出了一个频率感知的锚定增长机制,通过有效识别这些区域内的高斯点云,提高高频区域的图像质量。实验结果证实,我们的方法在高压缩下保留了水印并保持了优越的图像质量,验证了其作为安全3DGS模型的一种有前途的方法。

更新时间: 2025-03-17 05:32:15

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.12836v1

PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

A fundamental challenge in conditional 3D shape generation is to minimize the information loss and maximize the intention of user input. Existing approaches have predominantly focused on two types of isolated conditional signals, i.e., user sketches and text descriptions, each of which does not offer flexible control of the generated shape. In this paper, we introduce PASTA, the flexible approach that seamlessly integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for missing visual cues from ambiguous sketches. In addition, we introduce ISG-Net which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.

Updated: 2025-03-17 05:31:09

标题: PASTA:具有文本对齐先验知识的部分感知草图到3D形状生成

摘要: 在条件3D形状生成中的一个基本挑战是最小化信息丢失并最大化用户输入的意图。现有方法主要集中在两种类型的孤立条件信号上,即用户草图和文本描述,每种方法都无法灵活控制生成的形状。在本文中,我们介绍了PASTA,一种灵活的方法,无缝地将用户草图和文本描述集成到3D形状生成中。关键思想是使用视觉语言模型中的文本嵌入来丰富草图的语义表示。具体而言,这些文本衍生的先验指定了对象的部件组件,弥补了模糊草图中缺失的视觉线索。此外,我们引入了ISG-Net,它采用两种类型的图卷积网络:IndivGCN处理精细的细节,PartGCN将这些细节聚合到部件并完善物体的结构。大量实验证明,PASTA在部件级别编辑方面优于现有方法,并在草图到3D形状生成中取得了最先进的结果。

更新时间: 2025-03-17 05:31:09

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.12834v1

Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents

Large Language Models (LLMs) are combined with plugins to create powerful LLM agents that provide a wide range of services. Unlike traditional software, an LLM agent's behavior is determined at runtime by natural language prompts from either the user or a plugin's data. This flexibility enables a new computing paradigm with unlimited capabilities and programmability, but it also introduces new security risks, leaving agents vulnerable to privilege escalation attacks. Moreover, user prompts are prone to being interpreted in insecure ways by LLM agents, creating non-deterministic behaviors that attackers can exploit. To address these security risks, we propose Prompt Flow Integrity (PFI), a system-security-oriented solution to prevent privilege escalation in LLM agents. Analyzing the architectural characteristics of LLM agents, PFI features three mitigation techniques: untrusted data identification, enforcing least privilege on LLM agents, and validating unsafe data flows. Our evaluation result shows that PFI effectively mitigates privilege escalation attacks while successfully preserving the utility of LLM agents.

Updated: 2025-03-17 05:27:57

标题: 快速流完整性以防止LLM代理程序中的特权升级

摘要: 大型语言模型(LLMs)与插件结合,创建强大的LLM代理,提供各种服务。与传统软件不同,LLM代理的行为是由用户或插件数据的自然语言提示在运行时确定的。这种灵活性使得一种新的计算范式具有无限的能力和可编程性,但也引入了新的安全风险,易受特权升级攻击的影响。此外,用户提示容易被LLM代理以不安全的方式解释,从而产生可以被攻击者利用的非确定性行为。为了解决这些安全风险,我们提出了Prompt Flow Integrity(PFI),这是一个系统安全导向的解决方案,用于防止LLM代理中的特权升级。通过分析LLM代理的架构特征,PFI具有三种缓解技术,即识别不可信数据,对LLM代理实施最低权限,以及验证不安全数据流。我们的评估结果表明,PFI有效地缓解了特权升级攻击,同时成功地保留了LLM代理的效用。

更新时间: 2025-03-17 05:27:57

领域: cs.CR,cs.AI,cs.MA

下载: http://arxiv.org/abs/2503.15547v1

SparseLUT: Sparse Connectivity Optimization for Lookup Table-based Deep Neural Networks

The deployment of deep neural networks (DNNs) on resource-constrained edge devices such as field-programmable gate arrays (FPGAs) requires a careful balance of latency, power, and resource usage while maintaining high accuracy. Existing Lookup Table (LUT)-based DNNs, including LogicNets, PolyLUT, PolyLUT-Add, and NeuraLUT, exploit native FPGA resources with random sparse connectivity. This paper introduces SparseLUT, a connectivity-centric training technique tailored for LUT-based DNNs. SparseLUT leverages a non-greedy training strategy that prioritizes the pruning of less significant connections and strategically regrows alternative ones, resulting in efficient convergence to the target sparsity. Experimental results show consistent accuracy improvements across benchmarks, including up to a 2.13\% increase on MNIST and a 0.94\% improvement for Jet Substructure Classification compared to random sparsity. This is done without any hardware overhead and achieves state-of-the-art results for LUT-based DNNs.
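
A sketch of one connectivity update in the spirit of SparseLUT: each neuron's fan-in is held at the LUT input size, the least significant active connections are pruned, and alternatives are regrown elsewhere. The magnitude criterion and the random regrowth are simplifying assumptions about the non-greedy strategy:

    import numpy as np

    def update_connectivity(weights, mask, fan_in, regrow_frac=0.2, rng=None):
        # weights: (neurons, inputs) dense shadow weights;
        # mask: boolean connectivity with exactly `fan_in` True per row
        rng = rng or np.random.default_rng(0)
        new_mask = mask.copy()
        n_swap = max(1, int(regrow_frac * fan_in))
        for j in range(weights.shape[0]):
            active = np.flatnonzero(new_mask[j])
            weakest = active[np.argsort(np.abs(weights[j, active]))[:n_swap]]
            new_mask[j, weakest] = False          # prune least significant links
            inactive = np.flatnonzero(~new_mask[j])
            grown = rng.choice(inactive, size=n_swap, replace=False)
            new_mask[j, grown] = True             # regrow alternative links
        return new_mask

    W = np.random.randn(16, 128)
    M = np.zeros((16, 128), dtype=bool)
    M[:, :6] = True                               # start with fan-in 6 (LUT size)
    M = update_connectivity(W, M, fan_in=6)
    assert M.sum(axis=1).tolist() == [6] * 16     # fan-in preserved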

Updated: 2025-03-17 05:21:54

标题: SparseLUT: 基于查找表的深度神经网络稀疏连接优化

摘要: 在资源受限的边缘设备(如可编程门阵列(FPGAs))上部署深度神经网络(DNNs)需要在保持高准确性的同时仔细平衡延迟、功耗和资源使用。现有的基于查找表(LUT)的DNNs,包括LogicNets、PolyLUT、PolyLUT-Add和NeuraLUT,利用原生FPGA资源和随机稀疏连接。本文介绍了SparseLUT,一种针对基于LUT的DNNs量身定制的连接重心训练技术。SparseLUT利用一种非贪心的训练策略,优先修剪不太重要的连接,并有策略地重新生长替代连接,从而有效地收敛到目标稀疏度。实验结果显示,在各种基准测试中,包括在MNIST上最高可达2.13%的提高和与随机稀疏度相比,Jet亚结构分类的0.94%改进。这是在没有任何硬件开销的情况下实现的,并且为基于LUT的DNNs实现了最先进的结果。

更新时间: 2025-03-17 05:21:54

领域: cs.AR,cs.AI

下载: http://arxiv.org/abs/2503.12829v1

An Optimization Framework for Differentially Private Sparse Fine-Tuning

Differentially private stochastic gradient descent (DP-SGD) is broadly considered to be the gold standard for training and fine-tuning neural networks under differential privacy (DP). With the increasing availability of high-quality pre-trained model checkpoints (e.g., vision and language models), fine-tuning has become a popular strategy. However, despite recent progress in understanding and applying DP-SGD for private transfer learning tasks, significant challenges remain -- most notably, the performance gap between models fine-tuned with DP-SGD and their non-private counterparts. Sparse fine-tuning on private data has emerged as an alternative to full-model fine-tuning; recent work has shown that privately fine-tuning only a small subset of model weights and keeping the rest of the weights fixed can lead to better performance. In this work, we propose a new approach for sparse fine-tuning of neural networks under DP. Existing work on private sparse fine-tuning often used a fixed choice of trainable weights (e.g., updating only the last layer) or relied on a public model's weights to choose the subset of weights to modify. Such a choice of weights remains suboptimal. In contrast, we explore an optimization-based approach, where our selection method makes use of the private gradient information while using off-the-shelf privacy accounting techniques. Our numerical experiments on several computer vision models and datasets show that our selection method leads to better prediction accuracy, compared to full-model private fine-tuning or existing private sparse fine-tuning approaches.
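
A compact sketch of a DP-SGD step restricted to a sparse trainable subset: per-example clipping and Gaussian noise are the standard DP-SGD mechanics, and the 0/1 mask stands in for the paper's optimization-based subset selection:

    import torch

    def dp_sgd_masked_step(params, grads_per_sample, mask,
                           clip=1.0, noise_mult=1.0, lr=0.1):
        # grads_per_sample: (B, D) per-example gradients of the flattened
        # parameters; mask: (D,) 0/1 selection of trainable weights
        g = grads_per_sample * mask                        # freeze the rest
        norms = g.norm(dim=1, keepdim=True).clamp_min(1e-8)
        g = g * (clip / norms).clamp(max=1.0)              # per-example clipping
        noise = noise_mult * clip * torch.randn(g.shape[1]) * mask
        g_dp = (g.sum(dim=0) + noise) / g.shape[0]         # noisy mean gradient
        with torch.no_grad():
            params -= lr * g_dp
        return params

    D, B = 10, 4
    params = torch.randn(D)
    per_example_grads = torch.randn(B, D)
    mask = torch.zeros(D); mask[:3] = 1.0                  # fine-tune 3 of 10 weights
    dp_sgd_masked_step(params, per_example_grads, mask)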

Updated: 2025-03-17 05:05:05

标题: 一个用于差分隐私稀疏微调的优化框架

摘要: 差分隐私随机梯度下降(DP-SGD)被广泛认为是在差分隐私(DP)下训练和微调神经网络的黄金标准。随着高质量的预训练模型检查点(例如视觉和语言模型)的日益可用,微调已成为一种流行的策略。然而,尽管最近在理解和应用DP-SGD进行私密迁移学习任务方面取得了进展,仍然存在一些显著挑战,尤其是在受DP-SGD微调的模型与其非私密对应物之间的性能差距。在私密数据上稀疏微调已成为一种替代完整模型微调的方法;最近的研究表明,只私密微调一小部分模型权重并保持其余权重固定可以带来更好的性能。在这项工作中,我们提出了一种在DP下对神经网络进行稀疏微调的新方法。现有的私密稀疏微调工作通常使用固定的可训练权重选择(例如,仅更新最后一层),或依赖于公共模型的权重来选择要修改的权重子集。这种权重选择仍然不够优化。相反,我们探索了一种基于优化的方法,我们的选择方法利用了私密梯度信息,同时使用现成的隐私核算技术。我们在几个计算机视觉模型和数据集上的数值实验表明,与完整模型私密微调或现有私密稀疏微调方法相比,我们的选择方法可以带来更好的预测准确性。

更新时间: 2025-03-17 05:05:05

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2503.12822v1

To Trust or Not to Trust? Enhancing Large Language Models' Situated Faithfulness to External Contexts

Large Language Models (LLMs) are often augmented with external contexts, such as those used in retrieval-augmented generation (RAG). However, these contexts can be inaccurate or intentionally misleading, leading to conflicts with the model's internal knowledge. We argue that robust LLMs should demonstrate situated faithfulness, dynamically calibrating their trust in external information based on their confidence in the internal knowledge and the external context to resolve knowledge conflicts. To benchmark this capability, we evaluate LLMs across several QA datasets, including a newly created dataset featuring in-the-wild incorrect contexts sourced from Reddit posts. We show that when provided with both correct and incorrect contexts, both open-source and proprietary models tend to overly rely on external information, regardless of its factual accuracy. To enhance situated faithfulness, we propose two approaches: Self-Guided Confidence Reasoning (SCR) and Rule-Based Confidence Reasoning (RCR). SCR enables models to self-assess the confidence of external information relative to their own internal knowledge to produce the most accurate answer. RCR, in contrast, extracts explicit confidence signals from the LLM and determines the final answer using predefined rules. Our results show that for LLMs with strong reasoning capabilities, such as GPT-4o and GPT-4o mini, SCR outperforms RCR, achieving improvements of up to 24.2% over a direct input augmentation baseline. Conversely, for a smaller model like Llama-3-8B, RCR outperforms SCR. Fine-tuning SCR with our proposed Confidence Reasoning Direct Preference Optimization (CR-DPO) method improves performance on both seen and unseen datasets, yielding an average improvement of 8.9% on Llama-3-8B. In addition to quantitative results, we offer insights into the relative strengths of SCR and RCR.

Updated: 2025-03-17 04:47:58

标题: 相信还是不相信?提高大型语言模型对外部环境的忠实性

摘要: 大型语言模型(LLMs)通常会使用外部上下文进行增强,例如在检索增强生成(RAG)中使用的上下文。然而,这些上下文可能是不准确的或者有意误导,导致与模型内部知识发生冲突。我们认为,强大的LLMs应该展示出站点忠实性,动态地根据对内部知识和外部上下文的信心来调整对外部信息的信任,以解决知识冲突。为了评估这种能力,我们评估了LLMs在几个问答数据集上的表现,包括一个新创建的数据集,其中包含从Reddit帖子中获取的野生错误上下文。我们展示了当提供正确和错误的上下文时,无论外部信息的事实准确性如何,开源和专有模型都倾向于过度依赖外部信息。为了增强站点忠实性,我们提出了两种方法:自主引导信心推理(SCR)和基于规则的信心推理(RCR)。SCR使模型能够自我评估外部信息与其自身内部知识之间的信心,以产生最准确的答案。相反,RCR从LLM中提取明确的信心信号,并使用预定义规则确定最终答案。我们的结果表明,对于具有强大推理能力的LLMs,如GPT-4o和GPT-4o mini,SCR优于RCR,在直接输入增强基线上实现了高达24.2%的改善。相反,对于像Llama-3-8B这样的较小模型,RCR优于SCR。使用我们提出的信心推理直接偏好优化(CR-DPO)方法对SCR进行微调可提高在已知和未知数据集上的表现,在Llama-3-8B上平均提升了8.9%。除了定量结果外,我们还提供了关于SCR和RCR相对优势的见解。

更新时间: 2025-03-17 04:47:58

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2410.14675v2

Versatile Physics-based Character Control with Hybrid Latent Representation

We present a versatile latent representation that enables physically simulated characters to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ a rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to build a versatile motion prior that can be adapted to a wide range of challenging control tasks. Specifically, we build a discrete latent model to capture a distinctive posterior distribution without collapse, and simultaneously augment the sampled vector with continuous residuals to generate high-quality, smooth motion without jittering. We further incorporate Residual Vector Quantization, which not only maximizes the capacity of the discrete motion prior, but also efficiently abstracts the action space during the task learning phase. We demonstrate that our agent can produce diverse yet smooth motions simply by traversing the learned motion prior through unconditional motion generation. Furthermore, our model robustly satisfies sparse goal conditions with highly expressive natural motions, including head-mounted device tracking and motion in-betweening at irregular intervals, which could not be achieved with existing latent representations.
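
For reference, the residual vector quantization component can be sketched as follows: each stage quantizes the residual left by the previous stage, and the summed codes give a progressively finer discrete approximation (the paper additionally augments the quantized vector with continuous residuals; the codebook sizes here are assumptions):

    import torch

    def residual_vq(z, codebooks):
        # z: (B, d) latents; codebooks: list of (K, d) tensors, one per stage
        residual, quantized, codes = z, torch.zeros_like(z), []
        for cb in codebooks:
            dist = torch.cdist(residual, cb)      # (B, K) pairwise distances
            idx = dist.argmin(dim=-1)
            q = cb[idx]                           # nearest code per latent
            quantized = quantized + q
            residual = residual - q               # pass what is left downstream
            codes.append(idx)
        return quantized, codes

    z = torch.randn(32, 16)
    books = [torch.randn(256, 16) for _ in range(3)]      # 3 RVQ stages
    z_q, codes = residual_vq(z, books)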

Updated: 2025-03-17 04:45:51

标题: 具有混合潜在表示的多功能基于物理的角色控制

摘要: 我们提出了一种多功能的潜在表示,使物理模拟角色能够高效利用运动先验。为了构建一个强大的运动嵌入,该物理控制器应该使用丰富的潜在空间,这些潜在空间易于探索并能够生成高质量的运动。我们提出了整合连续和离散潜在表示以构建一个多功能运动先验,该先验可以适应各种具有挑战性的控制任务。具体地,我们构建了一个离散潜在模型,以捕获独特的后验分布而避免崩溃,并同时利用连续残差增强采样向量,以生成高质量、平滑的运动而无抖动。我们进一步整合了残差向量量化,不仅最大化了离散运动先验的容量,而且在任务学习阶段有效地抽象了动作空间。我们证明了我们的代理可以通过遍历学习到的运动先验进行无条件的运动生成,从而产生多样且平滑的动作。此外,我们的模型能够稳健地满足稀疏目标条件,并具有高度表现力的自然运动,包括头戴设备跟踪和不规则时间间隔的运动插值,这是现有潜在表示无法实现的。

更新时间: 2025-03-17 04:45:51

领域: cs.GR,cs.AI,cs.RO

下载: http://arxiv.org/abs/2503.12814v1

A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al, 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.
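
To make the "multi-power" structure concrete, here is a deliberately simplified stand-in (not the paper's exact parameterization): a power law in the cumulative learning rate, minus a reduction term that grows as the learning rate decays from its peak. All parameter values are arbitrary placeholders:

    import numpy as np

    def multi_power_loss(lrs, L0=2.0, A=1.0, alpha=0.5, B=0.02, beta=0.6):
        lrs = np.asarray(lrs, dtype=float)
        s1 = np.cumsum(lrs)                   # power law in the summed LR
        decay = np.cumsum(lrs.max() - lrs)    # accumulated decay from peak LR
        return L0 + A * s1 ** (-alpha) - B * decay ** beta

    # compare a constant schedule against a cosine schedule over 10k steps
    T = 10_000
    const = np.full(T, 3e-4)
    cosine = 3e-4 * 0.5 * (1.0 + np.cos(np.pi * np.arange(T) / T))
    print(multi_power_loss(const)[-1], multi_power_loss(cosine)[-1])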

Updated: 2025-03-17 04:36:45

标题: 一个跨学习率调度的损失曲线预测的多重幂律

摘要: 训练大型模型既耗费资源又耗时,因此了解模型性能与超参数之间的定量关系至关重要。本文提出了一个经验定律,描述了大型语言模型的预训练损失在不同学习率调度下的演变,例如常数、余弦和步减调度。我们提出的定律采用多项式形式,结合了基于学习率总和的幂律以及额外的幂律,以考虑学习率衰减引起的损失降低效应。我们在各种模型大小和架构上广泛验证了这一定律,并证明在适应几个学习率调度后,该定律能够准确预测不同形状和视野的未知调度的损失曲线。此外,通过最小化预测的最终预训练损失跨学习率调度,我们能够找到一种优于广泛使用的余弦学习率调度的调度。有趣的是,这种自动发现的调度与最近提出的Warmup-Stable-Decay(WSD)调度(Hu等人,2024年)有些相似,但实现了略低的最终损失。我们相信这些结果可能为了解预训练动态和设计学习率调度以提高效率提供有价值的见解。

更新时间: 2025-03-17 04:36:45

领域: cs.LG,cs.AI,cs.CL,stat.ML

下载: http://arxiv.org/abs/2503.12811v1

STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized Variational Autoencoders for Financial Trading

In financial trading, factor models are widely used to price assets and capture excess returns from mispricing. Recently, we have witnessed the rise of variational autoencoder-based latent factor models, which learn latent factors self-adaptively. While these models focus on modeling overall market conditions, they often fail to effectively capture the temporal patterns of individual stocks. Additionally, representing multiple factors as single values simplifies the model but limits its ability to capture complex relationships and dependencies. As a result, the learned factors are of low quality and lack diversity, reducing their effectiveness and robustness across different trading periods. To address these issues, we propose a Spatio-Temporal factOR Model based on dual vector quantized variational autoencoders, named STORM, which extracts features of stocks from temporal and spatial perspectives, then fuses and aligns these features at the fine-grained and semantic level, and represents the factors as multi-dimensional embeddings. The discrete codebooks cluster similar factor embeddings, ensuring orthogonality and diversity, which helps distinguish between different factors and enables factor selection in financial trading. To show the performance of the proposed factor model, we apply it to two downstream experiments: portfolio management on two stock datasets and individual trading tasks on six specific stocks. The extensive experiments demonstrate STORM's flexibility in adapting to downstream tasks and superior performance over baseline models.

Updated: 2025-03-17 04:30:03

标题: STORM:基于双向量量化变分自动编码器的金融交易时空因子模型

摘要: 在金融交易中,因子模型被广泛用于定价资产并捕捉由于错误定价而产生的超额收益。最近,我们目睹了基于变分自编码器的潜在因子模型的兴起,这些模型能够自适应地学习潜在因子。虽然这些模型着重于建模整体市场状况,但它们通常无法有效捕捉个别股票的时间模式。此外,将多个因子表示为单个值简化了模型,但限制了其捕捉复杂关系和依赖性的能力。因此,学习到的因子质量较低且缺乏多样性,降低了其在不同交易期间的有效性和稳健性。为了解决这些问题,我们提出了一种基于双向量量化变分自编码器的时空因子模型,命名为STORM,该模型从时间和空间角度提取股票特征,然后在细粒度和语义级别融合和对齐这些特征,并将因子表示为多维嵌入。离散码本将相似的因子嵌入聚类在一起,确保正交性和多样性,有助于区分不同因子并在金融交易中实现因子选择。为了展示所提出的因子模型的性能,我们将其应用于两个下游实验:在两个股票数据集上的投资组合管理和在六只特定股票上的个别交易任务。广泛的实验表明STORM在适应下游任务和基准模型的表现上具有灵活性和优越性。

更新时间: 2025-03-17 04:30:03

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2412.09468v3

Leveraging Deep Neural Networks for Aspect-Based Sentiment Classification

Aspect-based sentiment analysis seeks to determine sentiment with a high level of detail. While graph convolutional networks (GCNs) are commonly used for extracting sentiment features, their straightforward use in syntactic feature extraction can lead to a loss of crucial information. This paper presents a novel edge-enhanced GCN, called EEGCN, which improves performance by preserving feature integrity as it processes syntactic graphs. We incorporate a bidirectional long short-term memory (Bi-LSTM) network alongside a self-attention-based transformer for effective text encoding, ensuring the retention of long-range dependencies. A bidirectional GCN (Bi-GCN) with message passing then captures the relationships between entities, while an aspect-specific masking technique removes extraneous information. Extensive evaluations and ablation studies on four benchmark datasets show that EEGCN significantly enhances aspect-based sentiment analysis, overcoming issues with syntactic feature extraction and advancing the field's methodologies.

Updated: 2025-03-17 04:19:20

标题: 利用深度神经网络进行基于方面的情感分类

摘要: 基于方面的情感分析旨在以高级别的细节确定情感。虽然图卷积网络(GCNs)通常用于提取情感特征,但它们在句法特征提取中的直接使用可能会导致关键信息的丢失。本文提出了一种称为EEGCN的新型边增强GCN,通过在处理句法图时保持特征完整性来提高性能。我们将双向长短期记忆(Bi-LSTM)网络与基于自注意力的变压器结合使用,以实现有效的文本编码,确保长距离依赖关系的保留。然后,具有消息传递的双向GCN(Bi-GCN)捕获实体之间的关系,同时采用特定于方面的掩码技术去除多余信息。对四个基准数据集进行的广泛评估和消融研究表明,EEGCN显著增强了基于方面的情感分析,克服了句法特征提取的问题,并推动了该领域的方法论。

更新时间: 2025-03-17 04:19:20

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.12803v1

BLIA: Detect model memorization in binary classification model through passive Label Inference attack

Model memorization has implications for both the generalization capacity of machine learning models and the privacy of their training data. This paper investigates label memorization in binary classification models through two novel passive label inference attacks (BLIA). These attacks operate passively, relying solely on the outputs of pre-trained models, such as confidence scores and log-loss values, without interacting with or modifying the training process. By intentionally flipping 50% of the labels in controlled subsets, termed "canaries," we evaluate the extent of label memorization under two conditions: models trained without label differential privacy (Label-DP) and those trained with randomized response-based Label-DP. Despite the application of varying degrees of Label-DP, the proposed attacks consistently achieve success rates exceeding 50%, surpassing the baseline of random guessing and conclusively demonstrating that models memorize training labels, even when these labels are deliberately uncorrelated with the features.
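
Because the attack is passive, its core can be stated in a few lines. A minimal sketch of the confidence-score variant (the paper also uses log-loss values; names are ours): guess each canary's training label from the model's output alone and measure how often the guess matches the label actually used in training.

    import numpy as np

    def blia_success_rate(probs, train_labels):
        """probs: model P(y=1 | x) for each canary sample.
        train_labels: the (intentionally 50%-flipped) labels used in training.

        The passive attack guesses the label the model is most confident in;
        since canary labels are uncorrelated with features, any success rate
        above 50% evidences label memorization.
        """
        guesses = (probs >= 0.5).astype(int)
        return float((guesses == train_labels).mean())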

Updated: 2025-03-17 04:15:47

Domains: cs.LG,cs.CR

Download: http://arxiv.org/abs/2503.12801v1

A3E: Aligned and Augmented Adversarial Ensemble for Accurate, Robust and Privacy-Preserving EEG Decoding

An electroencephalogram (EEG) based brain-computer interface (BCI) enables direct communication between the brain and external devices. However, EEG-based BCIs face at least three major challenges in real-world applications: data scarcity and individual differences, adversarial vulnerability, and data privacy. While previous studies have addressed one or two of these issues, accommodating all three simultaneously remains unexplored. This paper fills this gap by proposing an Aligned and Augmented Adversarial Ensemble (A3E) algorithm and integrating it into three privacy protection scenarios (centralized source-free transfer, federated source-free transfer, and source data perturbation), simultaneously achieving accurate decoding, adversarial robustness, and privacy protection of EEG-based BCIs. Experiments on three public EEG datasets demonstrated that our proposed approach outperformed over 10 classic and state-of-the-art approaches in both accuracy and robustness in all three privacy-preserving scenarios, even outperforming state-of-the-art transfer learning approaches that do not consider privacy protection at all. This is the first time that these three major challenges in EEG-based BCIs have been addressed simultaneously, significantly improving the practicality of EEG decoding in real-world BCIs.

Updated: 2025-03-17 04:11:54

Domains: cs.HC,cs.LG,eess.SP

Download: http://arxiv.org/abs/2412.11390v2

Efficient Learning With Sine-Activated Low-rank Matrices

Low-rank decomposition has emerged as a vital tool for enhancing parameter efficiency in neural network architectures, gaining traction across diverse applications in machine learning. These techniques significantly lower the number of parameters, striking a balance between compactness and performance. However, a common challenge has been the compromise between parameter efficiency and the accuracy of the model, where reduced parameters often lead to diminished accuracy compared to their full-rank counterparts. In this work, we propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process. This approach not only preserves the benefits of the parameter efficiency characteristic of low-rank methods but also increases the decomposition's rank, thereby enhancing model performance. Our method proves to be a plug-in enhancement for existing low-rank models, as evidenced by its successful application in Vision Transformers (ViT), Large Language Models (LLMs), Neural Radiance Fields (NeRF), and 3D shape modelling.
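
One natural reading of the construction (a sketch under that assumption, not the paper's exact parameterization) is a low-rank weight passed elementwise through a sine of frequency omega, which raises the effective rank of the resulting matrix at the same parameter cost:

    import torch
    import torch.nn as nn

    class SineLowRankLinear(nn.Module):
        """Low-rank linear layer with an elementwise sine on the factored weight."""
        def __init__(self, d_in, d_out, rank=8, omega=30.0):
            super().__init__()
            self.U = nn.Parameter(torch.randn(d_out, rank) / rank ** 0.5)
            self.V = nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)
            self.omega = omega  # frequency; an illustrative hyperparameter

        def forward(self, x):
            # U @ V has rank <= rank; sin(omega * U @ V) generally does not.
            W = torch.sin(self.omega * (self.U @ self.V))
            return x @ W.T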

Updated: 2025-03-17 04:11:01

Domains: cs.LG,cs.CV,cs.NE

Download: http://arxiv.org/abs/2403.19243v5

A Reinforcement Learning-Driven Transformer GAN for Molecular Generation

Generating molecules with desired chemical properties presents a critical challenge in fields such as chemical synthesis and drug discovery. Recent advancements in artificial intelligence (AI) and deep learning have significantly contributed to data-driven molecular generation. However, challenges persist due to the inherent sensitivity of simplified molecular input line entry system (SMILES) representations and the difficulties in applying generative adversarial networks (GANs) to discrete data. This study introduces RL-MolGAN, a novel Transformer-based discrete GAN framework designed to address these challenges. Unlike traditional Transformer architectures, RL-MolGAN utilizes a first-decoder-then-encoder structure, facilitating the generation of drug-like molecules from both $de~novo$ and scaffold-based designs. In addition, RL-MolGAN integrates reinforcement learning (RL) and Monte Carlo tree search (MCTS) techniques to enhance the stability of GAN training and optimize the chemical properties of the generated molecules. To further improve the model's performance, RL-MolWGAN, an extension of RL-MolGAN, incorporates Wasserstein distance and mini-batch discrimination, which together enhance the stability of the GAN. Experimental results on two widely used molecular datasets, QM9 and ZINC, validate the effectiveness of our models in generating high-quality molecular structures with diverse and desirable chemical properties.
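
The RL side of the training loop reduces to a policy-gradient update on sampled SMILES tokens, with MCTS rollouts supplying rewards for partial sequences. A schematic REINFORCE loss (a generic formulation, not the authors' exact objective):

    import torch

    def reinforce_loss(log_probs, rewards, baseline=0.0):
        """log_probs: (batch, seq_len) log-probs of sampled SMILES tokens.
        rewards:   (batch, seq_len) per-token returns, e.g. MCTS rollouts
                   scored by the discriminator plus a property reward.
        """
        advantage = rewards - baseline
        # Ascend E[R * grad log pi], written here as a loss to minimize
        return -(log_probs * advantage).sum(dim=1).mean()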

Updated: 2025-03-17 04:06:10

Domains: cs.LG,cs.CL,physics.chem-ph

Download: http://arxiv.org/abs/2503.12796v1

Quantum-Enhanced LLM Efficient Fine Tuning

Low-Rank Adaptation (LoRA) enables efficient fine-tuning of pre-trained language models via low-rank matrix approximation, which is effective in many scenarios. However, its low-rank representation capacity is constrained in complex tasks or high-rank dependency settings, potentially limiting model adaptability. Addressing the expressive bottleneck of classical low-rank approximation in fine-tuning large language models, this paper proposes a parameter-efficient fine-tuning method based on a Quantum Weighted Tensor Hybrid Network (QWTHN), which leverages Quantum Neural Network (QNN). The study investigates quantum-classical hybrid parameter-efficient fine-tuning in low-rank spaces. QWTHN decomposes pre-trained weights into quantum neural network and tensor network representations, utilizing quantum state superposition and other methods to break through classical rank limitations. Experiments show that the proposed quantum fine-tuning technique for large models approaches or even surpasses the parameter efficiency of LoRA. On the CPsyCounD and R1-Distill-SFT datasets, QWTHN, compared to classical LoRA, reduces training loss by up to 15% while using 76% fewer parameters, and achieves an 8.4% performance improvement on the CPsyCounD test set. This research not only realizes lightweight and efficient adaptation of quantum resources to billion-parameter models but also validates the practical path of quantum hardware driven by large model tasks, laying the first engineering-ready technical foundation for future quantum-enhanced AGI systems.
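
For context, the classical LoRA update that QWTHN is measured against keeps the pretrained weight frozen and learns only a low-rank correction; QWTHN replaces this low-rank term with quantum neural network and tensor network components. A standard LoRA layer looks like:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
        def __init__(self, weight, rank=8, alpha=16.0):
            super().__init__()
            d_out, d_in = weight.shape
            self.weight = nn.Parameter(weight, requires_grad=False)
            self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
            self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no-op at start
            self.scale = alpha / rank

        def forward(self, x):
            # x @ (W + scale * B @ A)^T, computed without materializing B @ A
            return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)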

Updated: 2025-03-17 03:59:26

Domains: quant-ph,cs.AI

Download: http://arxiv.org/abs/2503.12790v1

Causal Feature Learning in the Social Sciences

Variable selection poses a significant challenge in causal modeling, particularly within the social sciences, where constructs often rely on inter-related factors such as age, socioeconomic status, gender, and race. Indeed, it has been argued that such attributes must be modeled as macro-level abstractions of lower-level manipulable features, in order to preserve the modularity assumption essential to causal inference. This paper accordingly extends the theoretical framework of Causal Feature Learning (CFL). Empirically, we apply the CFL algorithm to diverse social science datasets, evaluating how CFL-derived macrostates compare with traditional microstates in downstream modeling tasks.

Updated: 2025-03-17 03:43:00

Domains: stat.ME,cs.LG,stat.AP

Download: http://arxiv.org/abs/2503.12784v1

SAM2 for Image and Video Segmentation: A Comprehensive Survey

Despite significant advances in deep learning for image and video segmentation, existing models continue to face challenges in cross-domain adaptability and generalization. Image and video segmentation are fundamental tasks in computer vision with wide-ranging applications in healthcare, agriculture, industrial inspection, and autonomous driving. With the advent of large-scale foundation models, SAM2, an improved version of SAM (Segment Anything Model), has been optimized for segmentation tasks, demonstrating enhanced performance in complex scenarios. However, SAM2's adaptability and limitations in specific domains require further investigation. This paper systematically analyzes the application of SAM2 in image and video segmentation and evaluates its performance in various fields. We begin by introducing the foundational concepts of image segmentation, categorizing foundation models, and exploring the technical characteristics of SAM and SAM2. Subsequently, we delve into SAM2's applications in static image and video segmentation, emphasizing its performance in specialized areas such as medical imaging and the challenges of cross-domain adaptability. As part of our research, we reviewed over 200 related papers to provide a comprehensive analysis of the topic. Finally, the paper highlights the strengths and weaknesses of SAM2 in segmentation tasks, identifies the technical challenges it faces, and proposes future development directions. This review provides valuable insights and practical recommendations for optimizing and applying SAM2 in real-world scenarios.

Updated: 2025-03-17 03:33:36

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.12781v1

LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation

Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by target domain (e.g. "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased to the source domain. The latter does not fully capture the intricate spatial relationships of objects -- key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g. "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns the entire image features with text representation of this context-aware scene caption and learns generalized representations via text. With this, LangDA sets the new state-of-the-art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4% and 3.9%.
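
The second step admits a compact sketch: align pooled image features with the text embedding of the context-aware caption via a cosine objective (our simplification; the paper's alignment loss may differ):

    import torch.nn.functional as F

    def context_alignment_loss(image_feats, caption_embeds):
        """image_feats:    (batch, dim) pooled segmentation-encoder features
        caption_embeds: (batch, dim) text embedding of the VLM scene caption
        """
        img = F.normalize(image_feats, dim=-1)
        txt = F.normalize(caption_embeds, dim=-1)
        return (1.0 - (img * txt).sum(dim=-1)).mean()  # 1 - cosine similarity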

Updated: 2025-03-17 03:33:28

Domains: cs.CV,cs.AI,cs.LG,eess.IV,stat.ML,68Txx,I.2.1

Download: http://arxiv.org/abs/2503.12780v1

Diffusion Suction Grasping with Large-Scale Parcel Dataset

While recent advances in object suction grasping have shown remarkable progress, significant challenges persist particularly in cluttered and complex parcel handling scenarios. Two fundamental limitations hinder current approaches: (1) the lack of a comprehensive suction grasp dataset tailored for parcel manipulation tasks, and (2) insufficient adaptability to diverse object characteristics including size variations, geometric complexity, and textural diversity. To address these challenges, we present Parcel-Suction-Dataset, a large-scale synthetic dataset containing 25 thousand cluttered scenes with 410 million precision-annotated suction grasp poses. This dataset is generated through our novel geometric sampling algorithm that enables efficient generation of optimal suction grasps incorporating both physical constraints and material properties. We further propose Diffusion-Suction, an innovative framework that reformulates suction grasp prediction as a conditional generation task through denoising diffusion probabilistic models. Our method iteratively refines random noise into suction grasp score maps through visual-conditioned guidance from point cloud observations, effectively learning spatial point-wise affordances from our synthetic dataset. Extensive experiments demonstrate that the simple yet efficient Diffusion-Suction achieves new state-of-the-art performance compared to previous models on both Parcel-Suction-Dataset and the public SuctionNet-1Billion benchmark.
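
Reformulating suction prediction as conditional generation means sampling follows the usual denoising-diffusion recipe: start from noise and iteratively denoise a per-point score map under point-cloud guidance. A plain DDPM sampling loop for that formulation (a sketch; the authors' conditioning details are richer):

    import torch

    @torch.no_grad()
    def denoise_suction_scores(model, pc_feats, steps, betas):
        """model(x_t, t, cond) predicts the noise in the score map x_t at
        step t, conditioned on point-cloud features pc_feats (N, d);
        betas is the (steps,) noise-schedule tensor."""
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        x = torch.randn(pc_feats.size(0), 1)          # (N, 1) noisy score map
        for t in reversed(range(steps)):
            eps = model(x, t, pc_feats)
            x = (x - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)  # noise except at t=0
        return x                                       # per-point suction scores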

Updated: 2025-03-17 03:26:36

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2502.07238v2

Adaptive Deep Learning for Multiclass Breast Cancer Classification via Misprediction Risk Analysis

Breast cancer remains one of the leading causes of cancer-related deaths worldwide. Early detection is crucial for improving patient outcomes, yet the diagnostic process is often complex and prone to inconsistencies among pathologists. Computer-aided diagnostic approaches have significantly enhanced breast cancer detection, particularly in binary classification (benign vs. malignant). However, these methods face challenges in multiclass classification, leading to frequent mispredictions. In this work, we propose a novel adaptive learning approach for multiclass breast cancer classification using H&E-stained histopathology images. First, we introduce a misprediction risk analysis framework that quantifies and ranks the likelihood of an image being mislabeled by a classifier. This framework leverages an interpretable risk model that requires only a small number of labeled samples for training. Next, we present an adaptive learning strategy that fine-tunes classifiers based on the specific characteristics of a given dataset. This approach minimizes misprediction risk, allowing the classifier to adapt effectively to the target workload. We evaluate our proposed solutions on real benchmark datasets, demonstrating that our risk analysis framework more accurately identifies mispredictions compared to existing methods. Furthermore, our adaptive learning approach significantly improves the performance of state-of-the-art deep neural network classifiers.
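
The risk framework itself can be approximated with a small interpretable model. A hedged sketch (the feature choice and model family are our assumptions; the paper only requires an interpretable risk model trainable from few labeled samples):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rank_by_misprediction_risk(conf_features, was_mispredicted):
        """conf_features: (n, k) per-image confidence features, e.g. top-1
        probability, top-2 margin, predictive entropy.
        was_mispredicted: (n,) binary outcomes on a small labeled set.
        Returns image indices ordered from most to least likely mislabeled."""
        risk_model = LogisticRegression().fit(conf_features, was_mispredicted)
        risk = risk_model.predict_proba(conf_features)[:, 1]
        return np.argsort(-risk)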

Updated: 2025-03-17 03:25:28

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.12778v1

Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay

Foundation models have revolutionized general-purpose problem-solving, offering rapid task adaptation through pretraining, meta-training, and finetuning. Recent advances in these paradigms reveal the importance of prioritized sampling of challenging tasks to enhance adaptation robustness under distribution shifts. However, ranking task difficulties across iterations as a preliminary step typically requires exhaustive task evaluation, which is prohibitively expensive in computation and data annotation. This study provides a novel perspective on jointly pursuing adaptation robustness and learning efficiency, particularly in scenarios where task evaluation is risky or costly, such as iterative agent-environment interactions for robotic policy evaluation or computationally intensive inference steps for finetuning foundation models. Firstly, we introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk landscape, providing a theoretical foundation for robust active task sampling. MPTS employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference. The resulting risk learner amortizes the costly evaluation of task adaptation performance and provably approximates task difficulty rankings. MPTS seamlessly integrates into zero-shot, few-shot, and supervised finetuning settings. Empirically, we conduct extensive experiments in pattern recognition using foundation models and sequential decision-making. Our results demonstrate that MPTS significantly enhances adaptation robustness for tail or out-of-distribution (OOD) tasks and improves learning efficiency compared to state-of-the-art (SOTA) methods. The code is available at the project site https://github.com/thu-rllab/MPTS.
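
Operationally, the sampler scores candidate tasks with the amortized risk learner and draws difficulty-prioritized batches, sidestepping exhaustive evaluation. A schematic of that loop (the temperature softmax is our illustrative choice):

    import numpy as np

    def sample_tasks(task_pool, risk_learner, batch_size, temperature=1.0):
        """risk_learner(task) -> predicted adaptation risk (no rollout needed)."""
        risks = np.array([risk_learner(t) for t in task_pool])
        probs = np.exp(risks / temperature)
        probs /= probs.sum()                     # harder tasks sampled more often
        idx = np.random.choice(len(task_pool), size=batch_size,
                               replace=False, p=probs)
        return [task_pool[i] for i in idx]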

Updated: 2025-03-17 03:16:23

Domains: cs.LG

Download: http://arxiv.org/abs/2501.11039v4

Metric properties of partial and robust Gromov-Wasserstein distances

The Gromov-Wasserstein (GW) distances define a family of metrics, based on ideas from optimal transport, which enable comparisons between probability measures defined on distinct metric spaces. They are particularly useful in areas such as network analysis and geometry processing, as computation of a GW distance involves solving for registration between the objects which minimizes geometric distortion. Although GW distances have proven useful for various applications in the recent machine learning literature, it has been observed that they are inherently sensitive to outlier noise and cannot accommodate partial matching. This has been addressed by various constructions building on the GW framework; in this article, we focus specifically on a natural relaxation of the GW optimization problem, introduced by Chapel et al., which is aimed at addressing exactly these shortcomings. Our goal is to understand the theoretical properties of this relaxed optimization problem, from the viewpoint of metric geometry. While the relaxed problem fails to induce a metric, we derive precise characterizations of how it fails the axioms of non-degeneracy and triangle inequality. These observations lead us to define a novel family of distances, whose construction is inspired by the Prokhorov and Ky Fan distances, as well as by the recent work of Raghvendra et al. on robust versions of classical Wasserstein distance. We show that our new distances define true metrics, that they induce the same topology as the GW distances, and that they enjoy additional robustness to perturbations. These results provide a mathematically rigorous basis for using our robust partial GW distances in applications where outliers and partial matching are concerns.
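
For reference, the order-$p$ Gromov-Wasserstein distance between metric measure spaces $(X, d_X, \mu)$ and $(Y, d_Y, \nu)$ that these relaxations start from is

$$ GW_p(X, Y) = \inf_{\pi \in \Pi(\mu,\nu)} \left( \iint \bigl| d_X(x,x') - d_Y(y,y') \bigr|^p \, \mathrm{d}\pi(x,y) \, \mathrm{d}\pi(x',y') \right)^{1/p}, $$

where $\Pi(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$. Roughly speaking, the relaxation of Chapel et al. loosens the marginal constraints on $\pi$ to permit partial matching, and the article characterizes exactly how this loosening costs the non-degeneracy and triangle-inequality axioms.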

Updated: 2025-03-17 03:15:01

Domains: math.MG,cs.LG

Download: http://arxiv.org/abs/2411.02198v2

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at https://github.com/sungyeonparkk/NuPlanQA.

Updated: 2025-03-17 03:12:39

Domains: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2503.12772v1

Economic Rationality under Specialization: Evidence of Decision Bias in AI Agents

In the study by Chen et al. (2023) [01], the large language model GPT demonstrated economic rationality comparable to or exceeding the average human level in tasks such as budget allocation and risk preference. Building on this finding, this paper further incorporates specialized agents, such as biotechnology experts and economists, for a horizontal comparison to explore whether specialization can enhance or maintain economic rationality equivalent to that of GPT in similar decision-making scenarios. The results indicate that when agents invest more effort in specialized fields, their decision-making behavior is more prone to 'rationality shift,' specifically manifested as increased violations of GARP (Generalized Axiom of Revealed Preference), decreased CCEI (Critical Cost Efficiency Index), and more significant decision deviations under high-risk conditions. In contrast, GPT and more generalized basic agents maintain a more stable and consistent level of rationality across multiple tasks. This study reveals the inherent conflict between specialization and economic rationality, providing new insights for constructing AI decision-making systems that balance specialization and generalization across various scenarios.
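
GARP and CCEI are the workhorse rationality measures here. A minimal check of pairwise GARP violations from observed budget-allocation choices (a sketch; full GARP also requires testing longer revealed-preference cycles via transitive closure):

    import numpy as np
    from itertools import permutations

    def pairwise_garp_violations(prices, quantities):
        """prices, quantities: (T, n) arrays, one row per decision.
        Bundle i is revealed preferred to j when p_i . q_i >= p_i . q_j."""
        exp = prices @ quantities.T            # exp[i, j] = p_i . q_j
        own = np.diag(exp)
        weak = own[:, None] >= exp             # i weakly revealed preferred to j
        strict = own[:, None] > exp            # i strictly revealed preferred to j
        return sum(1 for i, j in permutations(range(len(prices)), 2)
                   if weak[i, j] and strict[j, i])   # contradiction = violation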

Updated: 2025-03-17 03:09:57

Domains: cs.AI

Download: http://arxiv.org/abs/2501.18190v2

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, in which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.

Updated: 2025-03-17 03:08:29

Domains: cs.SD,cs.AI,cs.LG,cs.MM,eess.AS

Download: http://arxiv.org/abs/2409.10831v2

Asynchronous Predictive Counterfactual Regret Minimization$^+$ Algorithm in Solving Extensive-Form Games

Counterfactual Regret Minimization (CFR) algorithms are widely used to compute a Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIGs). Among them, Predictive CFR$^+$ (PCFR$^+$) is particularly powerful, achieving an exceptionally fast empirical convergence rate via prediction in many games. However, the empirical convergence rate of PCFR$^+$ would significantly degrade if the prediction is inaccurate, leading to unstable performance on certain IIGs. To enhance the robustness of PCFR$^+$, we propose a novel variant, Asynchronous PCFR$^+$ (APCFR$^+$), which employs an adaptive asynchronization of step-sizes between the updates of implicit and explicit accumulated counterfactual regrets to mitigate the impact of the prediction inaccuracy on convergence. We present a theoretical analysis demonstrating why APCFR$^+$ can enhance the robustness. Finally, we propose a simplified version of APCFR$^+$ called Simple APCFR$^+$ (SAPCFR$^+$), which uses a fixed asynchronization of step-sizes to simplify the implementation, requiring only a single-line modification of the original PCFR$^+$. Interestingly, SAPCFR$^+$ achieves a constant-factor lower theoretical regret bound than PCFR$^+$ in the worst case. Experimental results demonstrate that (i) both APCFR$^+$ and SAPCFR$^+$ outperform PCFR$^+$ in most of the tested games, as well as (ii) SAPCFR$^+$ achieves a comparable empirical convergence rate with APCFR$^+$.
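
To make the "single-line modification" concrete, here is a schematic regret update in the SAPCFR$^+$ spirit (the asynchronization constant and its exact placement are our guesses from the abstract, not the paper's pseudocode):

    import numpy as np

    def sapcfr_plus_step(R, v, kappa=2.0):
        """R: explicit accumulated counterfactual regrets; v: instantaneous
        regrets. PCFR+ sets R <- max(R + v, 0) and predicts the next strategy
        from the same quantity; a fixed asynchronization scales the
        instantaneous term differently in the implicit (prediction) update."""
        R_next = np.maximum(R + v, 0.0)              # explicit regret update
        pred = np.maximum(R_next + kappa * v, 0.0)   # implicit, asynchronous step
        total = pred.sum()
        strategy = pred / total if total > 0 else np.full_like(pred, 1.0 / pred.size)
        return R_next, strategy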

Updated: 2025-03-17 03:07:38

Domains: cs.LG

Download: http://arxiv.org/abs/2503.12770v1

Bounds on Lp errors in density ratio estimation via f-divergence loss functions

Density ratio estimation (DRE) is a core technique in machine learning used to capture relationships between two probability distributions. $f$-divergence loss functions, which are derived from variational representations of $f$-divergence, have become a standard choice in DRE for achieving cutting-edge performance. This study provides novel theoretical insights into DRE by deriving upper and lower bounds on the $L_p$ errors through $f$-divergence loss functions. These bounds apply to any estimator belonging to a class of Lipschitz continuous estimators, irrespective of the specific $f$-divergence loss function employed. The derived bounds are expressed as a product involving the data dimensionality and the expected value of the density ratio raised to the $p$-th power. Notably, the lower bound includes an exponential term that depends on the Kullback--Leibler (KL) divergence, revealing that the $L_p$ error increases significantly as the KL divergence grows when $p > 1$. This increase becomes even more pronounced as the value of $p$ grows. The theoretical insights are validated through numerical experiments.
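
For readers new to the setup, the losses in question come from the standard variational lower bound on an $f$-divergence,

$$ D_f(P \,\|\, Q) \;\ge\; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^*(T(x))], $$

where $f^*$ is the convex conjugate of $f$; training a network $T$ to maximize the right-hand side recovers the density ratio from the maximizer, since the bound is tight at $T = f'(p/q)$.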

Updated: 2025-03-17 03:01:25

Domains: cs.LG

Download: http://arxiv.org/abs/2410.01516v2

Stabilization Analysis and Mode Recognition of Kerosene Supersonic Combustion: A Deep Learning Approach Based on Res-CNN-beta-VAE

The scramjet engine is a key propulsion system for hypersonic vehicles, leveraging supersonic airflow to achieve high specific impulse, making it a promising technology for aerospace applications. Understanding and controlling the complex interactions between fuel injection, turbulent combustion, and aerodynamic effects of compressible flows are crucial for ensuring stable combustion in scramjet engines. However, identifying stable modes in scramjet combustors is often challenging due to limited experimental measurement means and extremely complex spatiotemporal evolution of supersonic turbulent combustion. This work introduces an innovative deep learning framework that combines dimensionality reduction via the Residual Convolutional Neural Network-beta-Variational Autoencoder (Res-CNN-beta-VAE) model with unsupervised clustering (K-means) to identify and analyze dynamical combustion modes in a supersonic combustor. By mapping high-dimensional data of combustion snapshots to a reduced three-dimensional latent space, the Res-CNN-beta-VAE model captures the essential temporal and spatial features of flame behaviors and enables the observation of transitions between combustion states. By analyzing the standard deviation of latent variable trajectories, we introduce a novel method for objectively distinguishing between dynamic transitions, which provides a scalable and expert-independent alternative to traditional classification methods. Besides, the unsupervised K-means clustering approach effectively identifies the complex interplay between the cavity and the jet-wake stabilization mechanisms, offering new insights into the system's behavior across different gas-to-liquid mass flow ratios (GLRs).

Updated: 2025-03-17 03:00:01

Domains: physics.flu-dyn,cs.LG

Download: http://arxiv.org/abs/2503.12765v1

Alpha-divergence loss function for neural density ratio estimation

Density ratio estimation (DRE) is a fundamental machine learning technique for capturing relationships between two probability distributions. State-of-the-art DRE methods estimate the density ratio using neural networks trained with loss functions derived from variational representations of $f$-divergences. However, existing methods face optimization challenges, such as overfitting due to lower-unbounded loss functions, biased mini-batch gradients, vanishing training loss gradients, and high sample requirements for Kullback--Leibler (KL) divergence loss functions. To address these issues, we focus on $\alpha$-divergence, which provides a suitable variational representation of $f$-divergence. Subsequently, a novel loss function for DRE, the $\alpha$-divergence loss function ($\alpha$-Div), is derived. $\alpha$-Div is concise but offers stable and effective optimization for DRE. The boundedness of $\alpha$-divergence provides the potential for successful DRE with data exhibiting high KL-divergence. Our numerical experiments demonstrate the effectiveness of $\alpha$-Div in optimization. However, the experiments also show that the proposed loss function offers no significant advantage over the KL-divergence loss function in terms of RMSE for DRE. This indicates that the accuracy of DRE is primarily determined by the amount of KL-divergence in the data and is less dependent on $\alpha$-divergence.

Updated: 2025-03-17 02:58:04

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2402.02041v4

SQLCritic: Correcting Text-to-SQL Generation via Clause-wise Critic

Recent advancements in Text-to-SQL systems have improved the conversion of natural language queries into SQL, but challenges remain in ensuring accuracy and reliability. While self-correction techniques refine outputs, they often introduce new errors. Existing methods focused on execution feedback mainly address syntax issues, leaving semantic errors -- where the query's logic fails to align with the user's intent -- largely unaddressed. We propose a novel approach combining structured execution feedback with a trained critic agent that provides detailed, interpretable critiques. This method effectively identifies and corrects both syntactic and semantic errors, enhancing accuracy and interpretability. Experimental results show significant improvements on two major Text-to-SQL benchmarks, Spider and BIRD, demonstrating the effectiveness of our approach.

Updated: 2025-03-17 02:57:48

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.07996v3

A Survey on Human Interaction Motion Generation

Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.

Updated: 2025-03-17 02:55:10

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.12763v1

Dynamical Mode Recognition of Turbulent Flames in a Swirl-stabilized Annular Combustor by a Time-series Learning Approach

Thermoacoustic instability in annular combustors, essential to aero engines and modern gas turbines, can severely impair operational stability and efficiency. Accurately recognizing and understanding various combustion modes is therefore the prerequisite for understanding and controlling combustion instabilities. However, the high-dimensional spatial-temporal dynamics of turbulent flames typically pose considerable challenges to mode recognition. Building on bidirectional temporal modeling and nonlinear dimensionality reduction, this study introduces a two-layer bidirectional long short-term memory variational autoencoder (Bi-LSTM-VAE) to effectively recognize dynamical modes in annular combustion systems. Specifically, leveraging 16 pressure signals from a swirl-stabilized annular combustor, the model maps complex dynamics into a low-dimensional latent space while preserving temporal dependency and nonlinear behavior features through the recurrent neural network structure. The results show that the novel Bi-LSTM-VAE method enables a clear representation of combustion states in a two-dimensional state space. Analysis of latent variable distributions reveals distinct patterns corresponding to a wide range of equivalence ratios and premixed fuel and air mass flow rates, offering novel insights into mode classification and transitions and highlighting this model's potential for deciphering complex thermoacoustic phenomena.
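
A schematic of the encoder half (dimensions follow the setup described above: 16 pressure channels and a two-dimensional latent state space; the rest is our simplification):

    import torch
    import torch.nn as nn

    class BiLSTMVAEEncoder(nn.Module):
        """Encode windows of multi-channel pressure signals into latent modes."""
        def __init__(self, n_channels=16, hidden=64, latent=2):
            super().__init__()
            self.lstm = nn.LSTM(n_channels, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
            self.to_mu = nn.Linear(2 * hidden, latent)
            self.to_logvar = nn.Linear(2 * hidden, latent)

        def forward(self, x):                    # x: (batch, time, channels)
            h, _ = self.lstm(x)
            summary = h[:, -1]                   # last step, both directions
            mu, logvar = self.to_mu(summary), self.to_logvar(summary)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
            return z, mu, logvar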

Updated: 2025-03-17 02:55:01

Domains: cs.LG

Download: http://arxiv.org/abs/2503.13559v1

Analyzing sequential activity and travel decisions with interpretable deep inverse reinforcement learning

Travel demand modeling has shifted from aggregated trip-based models to behavior-oriented activity-based models because daily trips are essentially driven by human activities. To analyze the sequential activity-travel decisions, deep inverse reinforcement learning (DIRL) has proven effective in learning the decision mechanisms by approximating a reward function to represent preferences and a policy function to replicate observed behavior using deep neural networks (DNNs). However, most existing research has focused on using DIRL to enhance only prediction accuracy, with limited exploration into interpreting the underlying decision mechanisms guiding sequential decision-making. To address this gap, we introduce an interpretable DIRL framework for analyzing activity-travel decision processes, bridging the gap between data-driven machine learning and theory-driven behavioral models. Our proposed framework adapts an adversarial IRL approach to infer the reward and policy functions of activity-travel behavior. The policy function is interpreted through a surrogate interpretable model based on choice probabilities from the policy function, while the reward function is interpreted by deriving both short-term rewards and long-term returns for various activity-travel patterns. Our analysis of real-world travel survey data reveals promising results in two key areas: (i) behavioral pattern insights from the policy function, highlighting critical factors in decision-making and variations among socio-demographic groups, and (ii) behavioral preference insights from the reward function, indicating the utility individuals gain from specific activity sequences.

Updated: 2025-03-17 02:54:02

Domains: cs.AI

Download: http://arxiv.org/abs/2503.12761v1

SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement

To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. Often, they aim to develop policies that maximize goal outcomes, while ensuring there are no undesirable changes in guardrail outcomes. To provide credible recommendations, experimenters must not only identify policies that satisfy the desired changes in goal and guardrail outcomes, but also offer probabilistic guarantees about the changes these policies induce. In practice, however, policy classes are often large, and digital experiments tend to produce datasets with small effect sizes relative to noise. In this setting, standard approaches such as data splitting or multiple testing often result in unstable policy selection and/or insufficient statistical power. In this paper, we provide safe noisy policy learning (SNPL), a novel approach that leverages the concept of algorithmic stability to address these challenges. Our method enables policy learning while simultaneously providing high-confidence guarantees using the entire dataset, avoiding the need for data-splitting. We present finite-sample and asymptotic versions of our algorithm that ensure the recommended policy satisfies high-probability guarantees for avoiding guardrail regressions and/or achieving goal outcome improvements. We test both variants of our approach empirically on a real-world application of personalizing SMS delivery. Our results on real-world data suggest that our approach offers dramatic improvements in settings with large policy classes and low signal-to-noise, across both finite-sample and asymptotic safety guarantees, with up to 300\% improvements in detection rates and 150\% improvements in policy gains at significantly smaller sample sizes.

Updated: 2025-03-17 02:53:53

Domains: stat.ML,cs.LG,econ.EM

Download: http://arxiv.org/abs/2503.12760v1

Cohort-attention Evaluation Metric against Tied Data: Studying Performance of Classification Models in Cancer Detection

Artificial intelligence (AI) has significantly improved medical screening accuracy, particularly in cancer detection and risk assessment. However, traditional classification metrics often fail to account for imbalanced data, varying performance across cohorts, and patient-level inconsistencies, leading to biased evaluations. We propose the Cohort-Attention Evaluation Metrics (CAT) framework to address these challenges. CAT introduces patient-level assessment, entropy-based distribution weighting, and cohort-weighted sensitivity and specificity. Key metrics like CATSensitivity (CATSen), CATSpecificity (CATSpe), and CATMean ensure balanced and fair evaluation across diverse populations. This approach enhances predictive reliability, fairness, and interpretability, providing a robust evaluation method for AI-driven medical screening models.
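
A sketch of a cohort-weighted sensitivity in the CATSen spirit (the entropy-based weighting shown is our assumption of the general idea, not the paper's exact formula):

    import numpy as np

    def cohort_weighted_sensitivity(y_true, y_pred, cohorts):
        """Per-cohort sensitivity combined with weights derived from the
        entropy terms of the cohort-size distribution."""
        sens, sizes = [], []
        for c in np.unique(cohorts):
            pos = (cohorts == c) & (y_true == 1)
            if pos.sum() == 0:
                continue                         # cohort has no positives
            sens.append((y_pred[pos] == 1).mean())
            sizes.append(pos.sum())
        p = np.array(sizes) / np.sum(sizes)
        w = -p * np.log(p + 1e-12)               # entropy contribution per cohort
        w = w / w.sum()
        return float(np.dot(w, np.array(sens)))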

Updated: 2025-03-17 02:50:40

Domains: cs.LG,cs.CE,stat.ML

Download: http://arxiv.org/abs/2503.12755v1

Survival Analysis with Machine Learning for Predicting Li-ion Battery Remaining Useful Life

The accurate prediction of remaining useful life (RUL) for lithium-ion batteries is crucial for enhancing the reliability and longevity of energy storage systems. Traditional methods for RUL prediction often struggle with issues such as data sparsity, varying battery chemistries, and the inability to capture complex degradation patterns over time. In this study, we propose a survival analysis-based framework combined with deep learning models to predict the RUL of lithium-ion batteries. Specifically, we utilize five advanced models: the Cox-type models (Cox, CoxPH, and CoxTime) and two machine-learning-based models (DeepHit and MTLR). These models address the challenges of accurate RUL estimation by transforming raw time-series battery data into survival data, including key degradation indicators such as voltage, current, and internal resistance. Advanced feature extraction techniques enhance the model's robustness in diverse real-world scenarios, including varying charging conditions and battery chemistries. Our models are tested using 10-fold cross-validation, ensuring generalizability and minimizing overfitting. Experimental results show that our survival-based framework significantly improves RUL prediction accuracy compared to traditional methods, providing a reliable tool for battery management and maintenance optimization. This study contributes to the advancement of predictive maintenance in battery technology, offering valuable insights for both researchers and industry practitioners aiming to enhance the operational lifespan of lithium-ion batteries.
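
Of the five models, the Cox family is the most standard; a minimal CoxPH fit on survival-formatted battery data looks like the following (column names and toy values are illustrative only; lifelines is a common library choice, not necessarily the authors'):

    import pandas as pd
    from lifelines import CoxPHFitter

    # One row per battery: degradation indicators plus the survival pair
    # (cycles until end-of-life, and whether end-of-life was observed).
    df = pd.DataFrame({
        "voltage_mean":  [3.92, 3.88, 3.85, 3.81, 3.74, 3.70],
        "internal_res":  [0.021, 0.024, 0.026, 0.029, 0.033, 0.035],
        "cycles_to_eol": [820, 640, 580, 410, 260, 150],
        "reached_eol":   [0, 1, 1, 1, 0, 1],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="cycles_to_eol", event_col="reached_eol")
    cph.print_summary()   # hazard ratio per degradation indicator

    # Median predicted remaining life for the observed covariates:
    print(cph.predict_median(df.drop(columns=["cycles_to_eol", "reached_eol"])))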

Updated: 2025-03-17 02:49:34

Domains: eess.SP,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13558v1

SafeSlice: Enabling SLA-Compliant O-RAN Slicing via Safe Deep Reinforcement Learning

Deep reinforcement learning (DRL)-based slicing policies have shown significant success in simulated environments but face challenges in physical systems such as open radio access networks (O-RANs) due to simulation-to-reality gaps. These policies often lack safety guarantees to ensure compliance with service level agreements (SLAs), such as the strict latency requirements of immersive applications. As a result, a deployed DRL slicing agent may make resource allocation (RA) decisions that degrade system performance, particularly in previously unseen scenarios. Real-world immersive applications require maintaining SLA constraints throughout deployment to prevent risky DRL exploration. In this paper, we propose SafeSlice to address both the cumulative (trajectory-wise) and instantaneous (state-wise) latency constraints of O-RAN slices. We incorporate the cumulative constraints by designing a sigmoid-based risk-sensitive reward function that reflects the slices' latency requirements. Moreover, we build a supervised learning cost model as part of a safety layer that projects the slicing agent's RA actions to the nearest safe actions, fulfilling instantaneous constraints. We conduct an exhaustive experiment that supports multiple services, including real virtual reality (VR) gaming traffic, to investigate the performance of SafeSlice under extreme and changing deployment conditions. SafeSlice achieves reductions of up to 83.23% in average cumulative latency, 93.24% in instantaneous latency violations, and 22.13% in resource consumption compared to the baselines. The results also indicate SafeSlice's robustness to changing the threshold configurations of latency constraints, a vital deployment scenario that will be realized by the O-RAN paradigm to empower mobile network operators (MNOs).
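
Two of SafeSlice's ingredients are easy to sketch (the shapes of the reward and the projection metric are our assumptions; the paper's exact parameterizations may differ): the sigmoid-shaped latency reward that encodes the cumulative SLA, and the safety-layer projection that enforces the instantaneous one.

    import numpy as np

    def latency_reward(latency_ms, sla_ms, sharpness=0.5):
        """Reward that stays near 1 well under the SLA and collapses near it."""
        return 1.0 / (1.0 + np.exp(sharpness * (latency_ms - sla_ms)))

    def project_to_safe(action, candidates, cost_model, latency_budget):
        """Replace a risky resource allocation with the nearest candidate whose
        predicted instantaneous latency fits the budget; cost_model is the
        supervised regressor from the safety layer."""
        safe = [a for a in candidates if cost_model(a) <= latency_budget]
        if not safe:
            return min(candidates, key=cost_model)       # least-risky fallback
        return min(safe, key=lambda a: np.linalg.norm(np.asarray(a) - np.asarray(action)))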

Updated: 2025-03-17 02:41:49

Domains: cs.NI,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.12753v1

Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing

The emergence of foundational models and generative artificial intelligence (GenAI) is poised to transform productivity in scientific computing, especially in code development, refactoring, and translating from one programming language to another. However, because the output of GenAI cannot be guaranteed to be correct, manual intervention remains necessary. Some of this intervention can be automated through task-specific tools, alongside additional methodologies for correctness verification and effective prompt development. We explored the application of GenAI in assisting with code translation, language interoperability, and codebase inspection within a legacy Fortran codebase used to simulate particle interactions at the Large Hadron Collider (LHC). In the process, we developed a tool, CodeScribe, which combines prompt engineering with user supervision to establish an efficient process for code conversion. In this paper, we demonstrate how CodeScribe assists in converting Fortran code to C++, generating Fortran-C APIs for integrating legacy systems with modern C++ libraries, and providing developer support for code organization and algorithm implementation. We also address the challenges of AI-driven code translation and highlight its benefits for enhancing productivity in scientific computing workflows.

Updated: 2025-03-17 02:38:43

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2410.24119v2

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and outperform classical single-modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Vision-Language-Models-Overview.

Updated: 2025-03-17 02:24:48

Domains: cs.CV,cs.AI,cs.CL,cs.LG,cs.RO

Download: http://arxiv.org/abs/2501.02189v5

Semidefinite programming relaxations and debiasing for MAXCUT-based clustering

In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of 2 sub-gaussian distributions in $\mathbb{R}^p$. We consider semidefinite programming relaxations of an integer quadratic program that is formulated as finding the maximum cut on a graph, where edge weights in the cut represent dissimilarity scores between two nodes based on their $p$ features. We are interested in the case that individual features are of low average quality $\gamma$, and we want to use as few of them as possible to correctly partition the sample. Denote by $\Delta^2 := p\gamma$ the $\ell_2^2$ distance between two centers (mean vectors) in $\mathbb{R}^p$. The goal is to allow a full range of tradeoffs between $n, p, \gamma$ in the sense that partial recovery (success rate $< 100\%$) is feasible once the signal-to-noise ratio $s^2 := \min\{np\gamma^2, \Delta^2\}$ is lower bounded by a constant. For both balanced and unbalanced cases, we allow each population to have distinct covariance structures with diagonal matrices as special cases. In the present work, (a) we provide a unified framework for analyzing three computationally efficient algorithms, namely, SDP1, BalancedSDP, and Spectral clustering; and (b) we prove that the misclassification error decays exponentially with respect to the SNR $s^2$ for SDP1. Moreover, for balanced partitions, we design an estimator $\mathbf{BalancedSDP}$ with a superb debiasing property. Indeed, with this new estimator, we remove an assumption (A2) on bounding the trace difference between the two population covariance matrices while proving the exponential error bound as stated above. These estimators and their statistical analyses are novel to the best of our knowledge. We provide simulation evidence illuminating the theoretical predictions.
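
For orientation, a generic MAXCUT-type SDP relaxation consistent with this setup is sketched below; the paper's SDP1 and BalancedSDP estimators refine this template with further constraints (e.g., balancedness and debiasing corrections) not shown here:

```latex
\begin{align*}
\max_{X \in \mathbb{R}^{n \times n}} \quad
  & \sum_{i<j} w_{ij}\,\frac{1 - X_{ij}}{2}
  && \text{($w_{ij}$: dissimilarity between nodes $i,j$)} \\
\text{s.t.} \quad
  & X_{ii} = 1 \;\;\forall i, \qquad X \succeq 0
  && \text{(relaxing $X = zz^\top$, $z_i \in \{-1,+1\}$)}
\end{align*}
```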

Updated: 2025-03-17 02:24:42

Domains: stat.ML,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2401.10927v2

Finite Samples for Shallow Neural Networks

This paper investigates the ability of finite samples to identify two-layer irreducible shallow networks with various nonlinear activation functions, including rectified linear units (ReLU) and analytic functions such as the logistic sigmoid and hyperbolic tangent. An ``irreducible'' network is one whose function cannot be represented by another network with fewer neurons. For ReLU activation functions, we first establish necessary and sufficient conditions for determining the irreducibility of a network. Subsequently, we prove a negative result: finite samples are insufficient for definitive identification of any irreducible ReLU shallow network. Nevertheless, we demonstrate that for a given irreducible network, one can construct a finite set of sampling points that can distinguish it from other networks with the same neuron count. Conversely, for logistic sigmoid and hyperbolic tangent activation functions, we provide a positive result. We construct finite samples that enable the recovery of two-layer irreducible shallow analytic networks. To the best of our knowledge, this is the first study to investigate the exact identification of two-layer irreducible networks using finite samples of function values. Our findings provide insights into the comparative performance of networks with different activation functions under limited sampling conditions.

Updated: 2025-03-17 02:24:31

Domains: cs.LG,cs.IT,cs.NA,math.IT,math.NA

Download: http://arxiv.org/abs/2503.12744v1

EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition

Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a novel transformer model called emotion transformer (EmT). EmT is designed to excel in both generalized cross-subject EEG emotion classification and regression tasks. In EmT, EEG signals are transformed into a temporal graph format, creating a sequence of EEG feature graphs using a temporal graph construction module (TGC). A novel residual multi-view pyramid GCN module (RMPG) is then proposed to learn dynamic graph representations for each EEG feature graph within the series, and the learned representations of each graph are fused into one token. Furthermore, we design a temporal contextual transformer module (TCT) with two types of token mixers to learn the temporal contextual information. Finally, the task-specific output module (TSO) generates the desired outputs. Experiments on four publicly available datasets show that EmT achieves higher results than the baseline methods for both EEG emotion classification and regression tasks. The code is available at https://github.com/yi-ding-cs/EmT.

Updated: 2025-03-17 02:22:04

Domains: cs.LG,eess.SP

Download: http://arxiv.org/abs/2406.18345v3

TNCSE: Tensor's Norm Constraints for Unsupervised Contrastive Learning of Sentence Embeddings

Unsupervised sentence embedding representation has become a hot research topic in natural language processing. As a tensor, a sentence embedding has two critical properties: direction and norm. Existing works have been limited to constraining only the direction of the samples' representations while ignoring their norm features. To address this issue, we propose a new training objective that optimizes unsupervised contrastive learning by constraining the norm features between positive samples. We combine the training objective of Tensor's Norm Constraints with ensemble learning to propose a new sentence embedding representation framework, TNCSE. We evaluate seven semantic text similarity tasks, and the results show that TNCSE and derived models are the current state-of-the-art approach; in addition, we conduct extensive zero-shot evaluations, and the results show that TNCSE outperforms other baselines.
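
A minimal sketch of how a norm constraint between positive pairs might be added to a standard direction-only contrastive objective (the function names, penalty form, and weighting factor are assumptions for illustration, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def norm_constraint_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Hypothetical norm-alignment term: encourage two encoders'
    embeddings of the same sentence (a positive pair) to agree in norm,
    complementing the usual direction-only contrastive objective."""
    return (emb_a.norm(dim=-1) - emb_b.norm(dim=-1)).abs().mean()

def infonce(emb_a: torch.Tensor, emb_b: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    # Standard direction-based contrastive term (cosine similarity).
    sim = F.normalize(emb_a, dim=-1) @ F.normalize(emb_b, dim=-1).T / tau
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim, labels)

emb_a, emb_b = torch.randn(8, 768), torch.randn(8, 768)
loss = infonce(emb_a, emb_b) + 0.1 * norm_constraint_loss(emb_a, emb_b)
```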

Updated: 2025-03-17 02:14:42

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.12739v1

Enhancing Circuit Trainability with Selective Gate Activation Strategy

Hybrid quantum-classical computing relies heavily on Variational Quantum Algorithms (VQAs) to tackle challenges in diverse fields like quantum chemistry and machine learning. However, VQAs face a critical limitation: the balance between circuit trainability and expressibility. Trainability, the ease of optimizing circuit parameters for problem-solving, is often hampered by the barren plateau phenomenon, where gradients vanish and hinder optimization. On the other hand, increasing expressibility, the ability to represent a wide range of quantum states, often necessitates deeper circuits with more parameters, which in turn exacerbates trainability issues. In this work, we investigate selective gate activation strategies as a potential solution to these challenges within the context of Variational Quantum Eigensolvers (VQEs). We evaluate three different approaches: activating gates randomly without considering their type or parameter magnitude, activating gates randomly but limited to a single gate type, and activating gates based on the magnitude of their parameter values. Experimental results reveal that the magnitude-based strategy surpasses the other methods, achieving improved convergence.
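
The magnitude-based strategy can be illustrated with a short sketch (the helper name, budget, and selection rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def select_active_gates(params: np.ndarray, budget: int) -> np.ndarray:
    """Magnitude-based strategy: keep only the `budget` gates whose
    current parameter values are largest in magnitude; the rest are
    treated as inactive (frozen) for this optimization step."""
    active = np.zeros_like(params, dtype=bool)
    active[np.argsort(-np.abs(params))[:budget]] = True
    return active

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=40)   # parameters of 40 rotation gates
mask = select_active_gates(theta, budget=10)
# A VQE step would then update only theta[mask], shrinking the effective
# optimization dimension and mitigating barren-plateau effects.
```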

Updated: 2025-03-17 02:10:35

Domains: quant-ph,cs.LG

Download: http://arxiv.org/abs/2503.12738v1

Video Super-Resolution: All You Need is a Video Diffusion Model

We present a generic video super-resolution algorithm in this paper, based on the Diffusion Posterior Sampling framework with an unconditional video generation model in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets demonstrate that our method has strong capabilities to address video super-resolution challenges.
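
For reference, the standard Diffusion Posterior Sampling guidance step underlying this kind of framework takes the following schematic form (notation follows the general DPS literature, not necessarily the paper's):

```latex
\[
x_{t-1} \;=\; x'_{t-1} \;-\; \zeta_t \,\nabla_{x_t}
  \big\| y - \mathcal{A}\big(\hat{x}_0(x_t)\big) \big\|_2^2 ,
\]
% where y is the low-resolution input, \mathcal{A} the known degradation
% (downsampling) operator, \hat{x}_0(x_t) the model's denoised estimate
% at step t, x'_{t-1} the unconditional reverse-diffusion sample, and
% \zeta_t a step size; here the update is applied in latent space.
```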

Updated: 2025-03-17 02:09:02

Domains: cs.CV,cs.LG,eess.IV

Download: http://arxiv.org/abs/2503.03355v3

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query (KQ) weights, and a last-entry-only and zero-sum pattern in the output-value (OV) weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor -- one that outperforms single-head attention and nearly achieves Bayesian optimality up to a proportionality factor. Furthermore, compared to linear transformers, softmax attention readily generalizes to sequences longer than those seen during training. We also extend our study to scenarios with non-isotropic covariates and multi-task linear regression. In the former, multi-head attention learns to implement a form of pre-conditioned gradient descent. In the latter, we uncover an intriguing regime where the interplay between head number and task number triggers a superposition phenomenon that efficiently resolves multi-task in-context learning. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.
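
Schematically (an illustrative form, not the paper's exact statement), the emergent KQ/OV structure lets a softmax-attention layer act like a single gradient-descent step on in-context least squares, with a debiasing correction that excludes the query token's own contribution:

```latex
\[
\hat{y}_{\mathrm{query}} \;\approx\; x_{\mathrm{query}}^{\top}\hat{\beta},
\qquad
\hat{\beta} \;=\; \eta \sum_{i=1}^{n} y_i\, x_i
\quad \text{(one GD step on } \tfrac{1}{2}\textstyle\sum_i (y_i - x_i^{\top}\beta)^2
\text{ from } \beta = 0\text{)}.
\]
```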

Updated: 2025-03-17 02:00:49

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.12734v1

APF+: Boosting adaptive-potential function reinforcement learning methods with a W-shaped network for high-dimensional games

Studies in reward shaping for reinforcement learning (RL) have flourished in recent years due to its ability to speed up training. Our previous work proposed an adaptive potential function (APF) and showed that APF can accelerate Q-learning with a multi-layer perceptron algorithm in low-dimensional domains. This paper proposes to extend APF with an encoder (APF+) for RL state representation, allowing APF to be applied to pixel-based Atari games using a state-encoding method that projects a high-dimensional game's pixel frames to low-dimensional embeddings. We design the state-representation encoder as a W-shaped network (W-Net), which encodes both the background and the moving entities in the game frames. Specifically, the embeddings derived from the pre-trained W-Net consist of two latent vectors: one represents the input state, and the other represents the deviation of the input state's representation from itself. We then incorporate W-Net into APF to train a downstream Dueling Deep Q-Network (DDQN), obtain the APF-WNet-DDQN, and demonstrate its effectiveness in Atari game-playing tasks. To evaluate the APF+W-Net module in such high-dimensional tasks, we compare with two types of baseline methods: (i) the basic DDQN; and (ii) two encoder-replaced APF-DDQN methods where we replace W-Net by (a) an unsupervised state representation method called Spatiotemporal Deep Infomax (ST-DIM) and (b) a ground-truth state representation provided by the Atari Annotated RAM Interface (ARI). The experiment results show that out of 20 Atari games, APF-WNet-DDQN significantly outperforms DDQN (14/20 games) and APF-STDIM-DDQN (13/20 games). In comparison against the APF-ARI-DDQN, which directly employs embeddings of the detailed game-internal state information, the APF-WNet-DDQN achieves comparable performance.

Updated: 2025-03-17 01:53:26

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.13557v1

TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including edge attribution patching and sparse autoencoders, to identify minimal circuits and components supporting SQL generation. Our analysis reveals both the potential and limitations of current interpretability methods, showing how circuits can vary even across similar queries. Lastly, we demonstrate how mechanistic interpretability can identify flawed heuristics in models and improve synthetic dataset design. Our work provides a comprehensive framework for evaluating and advancing interpretability techniques while establishing clear boundaries for their reliable application.

Updated: 2025-03-17 01:47:50

Domains: cs.LG,cs.AI,cs.DB

Download: http://arxiv.org/abs/2503.12730v1

Forget Vectors at Play: Universal Input Perturbations Driving Machine Unlearning in Image Classification

Machine unlearning (MU), which seeks to erase the influence of specific unwanted data from already-trained models, is becoming increasingly vital in model editing, particularly to comply with evolving data regulations like the ``right to be forgotten''. Conventional approaches are predominantly model-based, typically requiring retraining or fine-tuning the model's weights to meet unlearning requirements. In this work, we approach the MU problem from a novel input perturbation-based perspective, where the model weights remain intact throughout the unlearning process. We demonstrate the existence of a proactive input-based unlearning strategy, referred to as a forget vector, which can be generated as an input-agnostic data perturbation and remains as effective as model-based approximate unlearning approaches. We also explore forget vector arithmetic, whereby multiple class-specific forget vectors are combined through simple operations (e.g., linear combinations) to generate new forget vectors for unseen unlearning tasks, such as forgetting arbitrary subsets across classes. Extensive experiments validate the effectiveness and adaptability of the forget vector, showcasing its competitive performance relative to state-of-the-art model-based methods. Codes are available at https://github.com/Changchangsun/Forget-Vector.
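
A minimal sketch of the input-perturbation view and the forget-vector arithmetic described above (tensor shapes and the mixing weights are illustrative assumptions):

```python
import torch

def apply_forget_vector(x: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Apply a universal, input-agnostic perturbation to a batch of
    images; the model weights are never touched."""
    return (x + delta).clamp(0.0, 1.0)

# Hypothetical forget-vector arithmetic: linearly combine class-specific
# vectors to target a new, unseen forgetting task.
delta_cat = torch.zeros(3, 32, 32)   # stands in for a learned forget vector
delta_dog = torch.zeros(3, 32, 32)
delta_both = 0.5 * delta_cat + 0.5 * delta_dog

images = torch.rand(8, 3, 32, 32)
forgotten_view = apply_forget_vector(images, delta_both)
```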

Updated: 2025-03-17 01:46:48

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2412.16780v4

Generative Modeling for Mathematical Discovery

We present a new implementation of the LLM-driven genetic algorithm \textit{funsearch}, whose aim is to generate examples of interest to mathematicians and which has already had some success on problems in extremal combinatorics. Our implementation is designed to be useful in practice for working mathematicians; it does not require expertise in machine learning or access to high-performance computing resources. Applying \textit{funsearch} to a new problem involves modifying a small segment of Python code and selecting a large language model (LLM) from one of many third-party providers. We benchmarked our implementation on three different problems, obtaining metrics that may inform applications of \textit{funsearch} to new problems. Our results demonstrate that \textit{funsearch} successfully learns in a variety of combinatorial and number-theoretic settings, and in some contexts learns principles that generalize beyond the problem originally trained on.
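
A toy sketch of one generation of such an LLM-driven genetic loop (the `llm_propose` callback standing in for a third-party LLM API is hypothetical, as is the scoring function):

```python
def funsearch_step(population, llm_propose, score, k=2):
    """One hypothetical generation of an LLM-driven genetic loop:
    pick the top-k programs as parents, ask the LLM for a mutated
    child, then keep the population size fixed by dropping the
    weakest member."""
    parents = sorted(population, key=score, reverse=True)[:k]
    child = llm_propose(parents)          # LLM proposes a new program
    population = sorted(population + [child], key=score, reverse=True)
    return population[:-1]                # drop the weakest

# Toy stand-in for the LLM call, for illustration only.
stub = lambda parents: parents[0] + "  # mutated"
pop = ["def f(): return 1", "def f(): return 2"]
pop = funsearch_step(pop, stub, score=len)
```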

Updated: 2025-03-17 01:42:26

Domains: cs.LG,math.CO,68T20

Download: http://arxiv.org/abs/2503.11061v2

Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision-making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality-based steering for aligning AI agents.
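
A minimal sketch of the representation-engineering steering step (activation addition along a trait direction; the layer choice, direction vector, and scale are illustrative assumptions):

```python
import torch

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Activation addition: shift a chosen layer's residual-stream
    activations along a trait direction, e.g. one extracted by
    contrasting 'agreeable' vs. 'disagreeable' prompts."""
    return hidden + alpha * direction / direction.norm()

h = torch.randn(1, 16, 4096)        # (batch, seq, hidden) activations
v = torch.randn(4096)               # hypothetical 'Agreeableness' direction
h_steered = steer(h, v, alpha=8.0)  # larger alpha = stronger persona shift
```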

Updated: 2025-03-17 01:21:54

Domains: cs.AI,cs.CL,cs.GT,cs.MA

Download: http://arxiv.org/abs/2503.12722v1

Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective

Recent Large Language Models (LLMs) such as OpenAI o3-mini and DeepSeek-R1 use enhanced reasoning through Chain-of-Thought (CoT). Their potential in hardware design, which relies on expert-driven iterative optimization, remains unexplored. This paper investigates whether reasoning LLMs can address challenges in High-Level Synthesis (HLS) design space exploration and optimization. During HLS, engineers manually define pragmas/directives to balance performance and resource constraints. We propose an LLM-based optimization agentic framework that automatically restructures code, inserts pragmas, and identifies optimal design points via feedback from HLS tools and access to integer-linear programming (ILP) solvers. Experiments compare reasoning models against conventional LLMs on benchmarks using success rate, efficiency, and design quality (area/latency) metrics, and provide the first-ever glimpse into the CoTs produced by a powerful open-source reasoning model like DeepSeek-R1.

Updated: 2025-03-17 01:21:39

Domains: cs.AI

Download: http://arxiv.org/abs/2503.12721v1

Enabling High-Frequency Trading with Near-Instant, Trustless Cross-Chain Transactions via Pre-Signing Adaptor Signatures

Atomic swaps have been widely considered to be an ideal solution for cross-chain cryptocurrency transactions due to their trustless and decentralized nature. However, their adoption in practice has been strictly limited compared to centralized exchange order books because of long transaction times (anywhere from 20 to 60 minutes) prohibiting market makers from accurately pricing atomic swap spreads. For the decentralized finance ecosystem to expand and benefit all users, this would require accommodating market makers and high-frequency traders to reduce spreads and dramatically boost liquidity. This white paper will introduce a protocol for atomic swaps that eliminates the need for an intermediary currency or centralized trusted third party, reducing transaction times between Bitcoin and Ethereum swaps to approximately 15 seconds for a market maker, and could be reduced further with future Layer 2 solutions.
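
The key algebra of pre-signed adaptor signatures can be illustrated with a deliberately simplified toy over modular arithmetic; real deployments use elliptic-curve groups and full Schnorr verification, which this sketch omits entirely:

```python
# Toy sketch over plain modular arithmetic (NOT real cryptography):
# it shows only the algebra by which a pre-signature, once completed,
# leaks the adaptor secret t that unlocks the counterparty chain.
N = 2**61 - 1                 # stand-in for a group order

r, c, x, t = 12345, 678, 9001, 424242   # nonce, challenge, secret key, adaptor
s_pre = (r + c * x) % N       # pre-signature, exchanged before settlement
s = (s_pre + t) % N           # broadcasting the completed signature...
assert (s - s_pre) % N == t   # ...reveals t, letting the other side redeem
```

Because both parties can pre-sign before anything touches the chain, settlement reduces to publishing completed signatures, which is what makes near-instant swaps plausible.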

Updated: 2025-03-17 01:15:33

Domains: cs.CR

Download: http://arxiv.org/abs/2503.12719v1

Enforcing Cybersecurity Constraints for LLM-driven Robot Agents for Online Transactions

The integration of Large Language Models (LLMs) into autonomous robotic agents for conducting online transactions poses significant cybersecurity challenges. This study aims to enforce robust cybersecurity constraints to mitigate the risks associated with data breaches, transaction fraud, and system manipulation. The background focuses on the rise of LLM-driven robotic systems in e-commerce, finance, and service industries, alongside the vulnerabilities they introduce. A novel security architecture combining blockchain technology with multi-factor authentication (MFA) and real-time anomaly detection was implemented to safeguard transactions. Key performance metrics such as transaction integrity, response time, and breach detection accuracy were evaluated, showing improved security and system performance. The results highlight that the proposed architecture reduced fraudulent transactions by 90%, improved breach detection accuracy to 98%, and ensured secure transaction validation within a latency of 0.05 seconds. These findings emphasize the importance of cybersecurity in the deployment of LLM-driven robotic systems and suggest a framework adaptable to various online platforms.

Updated: 2025-03-17 01:01:10

Domains: cs.CR,cs.AI,cs.CY

Download: http://arxiv.org/abs/2503.15546v1

A Model Stealing Attack Against Multi-Exit Networks

Compared to traditional neural networks with a single output channel, a multi-exit network has multiple exits that allow for early outputs from the model's intermediate layers, thus significantly improving computational efficiency while maintaining similar main task accuracy. Existing model stealing attacks can only steal the model's utility while failing to capture its output strategy, i.e., a set of thresholds used to determine from which exit to output. This leads to a significant decrease in computational efficiency for the extracted model, thereby losing the advantage of multi-exit networks. In this paper, we propose the first model stealing attack against multi-exit networks to extract both the model utility and the output strategy. We employ Kernel Density Estimation to analyze the target model's output strategy and use performance loss and strategy loss to guide the training of the extracted model. Furthermore, we design a novel output strategy search algorithm to maximize the consistency between the victim model and the extracted model's output behaviors. In experiments across multiple multi-exit networks and benchmark datasets, our method always achieves accuracy and efficiency closest to the victim models.
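
A rough sketch of how KDE could be used to estimate a victim's per-exit confidence thresholds (the quantile choice and the max-softmax-confidence assumption are illustrative, not the paper's exact analysis):

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_exit_thresholds(confidences, exits_taken, n_exits):
    """Hypothetical reconstruction of a victim's exit strategy: fit a
    KDE to the confidences observed among queries that left at each
    exit, and take a low quantile as that exit's estimated threshold."""
    thresholds = []
    for e in range(n_exits - 1):              # the final exit needs no threshold
        kde = gaussian_kde(confidences[exits_taken == e])
        grid = np.linspace(0.0, 1.0, 512)
        cdf = np.cumsum(kde(grid)); cdf /= cdf[-1]
        thresholds.append(grid[np.searchsorted(cdf, 0.05)])
    return thresholds

# Toy data: exit-0 queries tend to be high-confidence, exit-1 mixed.
conf = np.r_[np.random.beta(8, 2, 500), np.random.beta(5, 5, 500)]
exits = np.r_[np.zeros(500, int), np.ones(500, int)]
print(estimate_exit_thresholds(conf, exits, n_exits=2))
```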

Updated: 2025-03-17 00:56:01

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2305.13584v2

Active Learning with Context Sampling and One-vs-Rest Entropy for Semantic Segmentation

Multi-class semantic segmentation remains a cornerstone challenge in computer vision. Yet, dataset creation remains excessively demanding in time and effort, especially for specialized domains. Active Learning (AL) mitigates this challenge by selecting data points for annotation strategically. However, existing patch-based AL methods often overlook critical boundary-pixel information, which is essential for accurate segmentation. We present OREAL, a novel patch-based AL method designed for multi-class semantic segmentation. OREAL enhances boundary detection by employing maximum aggregation of pixel-wise uncertainty scores. Additionally, we introduce one-vs-rest entropy, a novel uncertainty score function that computes class-wise uncertainties while achieving implicit class balancing during dataset creation. Comprehensive experiments across diverse datasets and model architectures validate our hypothesis.
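
A minimal sketch of a one-vs-rest entropy score consistent with the description above (the exact aggregation OREAL uses may differ):

```python
import numpy as np

def one_vs_rest_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-class binary entropy H(p_c) = -p log p - (1-p) log(1-p),
    computed from softmax probabilities of shape (..., C), giving one
    uncertainty score per class rather than a single scalar."""
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

probs = np.array([0.7, 0.2, 0.1])        # one pixel, three classes
scores = one_vs_rest_entropy(probs)      # one uncertainty per class
# Class-wise scores allow sampling patches per class, implicitly
# balancing rare classes during annotation.
```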

Updated: 2025-03-17 00:35:34

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2412.06470v2

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than distinct ones, making it difficult to disentangle and match features across different models. To address this issue, we employ a method known as dictionary learning by using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics on SAE feature spaces across different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
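
A minimal sketch of matching SAE features across two models by activation correlation (the greedy argmax pairing is an illustrative simplification of the matching step):

```python
import numpy as np

def match_features(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Pair each SAE feature of model A with the model-B feature whose
    activations correlate most strongly over a shared corpus.
    acts_* have shape (n_tokens, n_features)."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)          # feature-by-feature correlation matrix
    return corr.argmax(axis=1)       # best model-B match for each A feature

acts_a, acts_b = np.random.rand(1000, 64), np.random.rand(1000, 64)
pairing = match_features(acts_a, acts_b)
# Representational similarity metrics would then be computed on the
# aligned SAE feature spaces.
```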

Updated: 2025-03-17 00:31:46

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2410.06981v3

Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

Large Vision-Language Models (LVLMs) have achieved significant success in recent years, and they have been extended to the medical domain. Although demonstrating satisfactory performance on medical Visual Question Answering (VQA) tasks, Medical LVLMs (MLVLMs) suffer from the hallucination problem, which makes them fail to diagnose complex pathologies. Moreover, they readily fail to learn minority pathologies due to imbalanced training data. We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. In the first strategy, we provide a detailed explanation of the queried pathology. In the second strategy, we fine-tune a cheap, weak learner to achieve high performance on a specific metric, and textually provide its judgment to the MLVLM. Tested on the MIMIC-CXR-JPG and Chexpert datasets, our methods significantly improve the diagnostic F1 score, with the highest increase being 0.27. We also demonstrate that our prompting strategies can be extended to general LVLM domains. Based on POPE metrics, it effectively suppresses the false negative predictions of existing LVLMs and improves Recall by approximately 0.07.

Updated: 2025-03-17 00:27:45

Domains: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2407.21368v3

Scaling Large Language Model-based Multi-Agent Collaboration

Recent breakthroughs in large language model-driven autonomous agents have revealed that multi-agent collaboration often surpasses each individual through collective reasoning. Inspired by the neural scaling law--increasing neurons enhances performance, this study explores whether the continuous addition of collaborative agents can yield similar benefits. Technically, we utilize directed acyclic graphs to organize agents into a multi-agent collaboration network (MacNet), upon which their interactive reasoning is topologically orchestrated for autonomous task solving. Extensive evaluations reveal that it effectively supports collaboration among over a thousand agents, with irregular topologies outperforming regular ones. We also identify a collaborative scaling law--the overall performance follows a logistic growth pattern as agents scale, with collaborative emergence occurring earlier than traditional neural emergence. We speculate this may be because scaling agents catalyzes their multidimensional considerations during interactive reflection and refinement, thereby producing more comprehensive artifacts. The code is available at https://github.com/OpenBMB/ChatDev/tree/macnet.
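
A minimal sketch of topologically orchestrated collaboration over a DAG of agents (the example graph and the `run_agent` stub are hypothetical):

```python
import networkx as nx

def run_agent(name, upstream_outputs):
    # Stand-in for one LLM agent refining its predecessors' artifacts.
    return f"{name}'s refinement of {len(upstream_outputs)} inputs"

g = nx.DiGraph([("spec", "coder"), ("spec", "tester"),
                ("coder", "reviewer"), ("tester", "reviewer")])
outputs = {}
for node in nx.topological_sort(g):        # parents always run first
    inputs = [outputs[p] for p in g.predecessors(node)]
    outputs[node] = run_agent(node, inputs)
print(outputs["reviewer"])
```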

Updated: 2025-03-17 00:22:42

Domains: cs.AI,cs.CL,cs.MA,cs.NI,cs.SI

Download: http://arxiv.org/abs/2406.07155v3

Random Forest Autoencoders for Guided Representation Learning

Decades of research have produced robust methods for unsupervised data visualization, yet supervised visualization, where expert labels guide representations, remains underexplored, as most supervised approaches prioritize classification over visualization. Recently, RF-PHATE, a diffusion-based manifold learning method leveraging random forests and information geometry, marked significant progress in supervised visualization. However, its lack of an explicit mapping function limits scalability and prevents application to unseen data, posing challenges for large datasets and label-scarce scenarios. To overcome these limitations, we introduce Random Forest Autoencoders (RF-AE), a neural network-based framework for out-of-sample kernel extension that combines the flexibility of autoencoders with the supervised learning strengths of random forests and the geometry captured by RF-PHATE. RF-AE enables efficient out-of-sample supervised visualization and outperforms existing methods, including RF-PHATE's standard kernel extension, in both accuracy and interpretability. Additionally, RF-AE is robust to the choice of hyper-parameters and generalizes to any kernel-based dimensionality reduction method.

Updated: 2025-03-17 00:18:37

Domains: cs.LG

Download: http://arxiv.org/abs/2502.13257v2

Pareidolic Illusions of Meaning: ChatGPT, Pseudolaw and the Triumph of Form over Substance

The early 2020s has seen the rise of two strange and potentially quite impactful social phenomena, namely pseudolaw, where users rely upon pseudolegal arguments that mimic the form and ritual of legal argumentation but fundamentally distort the content of law, and generative AI/LLMs, which generate content that uses probabilistic calculations to create outputs that look like human generated text. This article argues that the juxtaposition of the two phenomena helps to reveal that they both share two fundamental traits as both elevate form and appearance over substance and content, and users of both routinely mistake the form for the substance. In drawing upon legal theory, computer science, linguistics and cognitive psychology, the article argues that both phenomena rely upon creating illusions of meaning that users mistake for the underlying primary phenomenon. I then explore four implications of this conception of both phenomena. Firstly, both rely on human tendencies of conceptual pareidolia resulting in the erroneous perception of meaningful linguistic legal patterns from nebulous inputs. Secondly, both rely upon the confidence heuristic, the human cognitive bias for treating confidence as a proxy for competence. Thirdly, both succeed when the primary concern is with the form of the output and not its content. Fourthly, both rely heavily upon the magical thinking of users and the desire for the promise of the approach to be real. The article argues that the legal context helps to reveal a solution for the problems caused by both phenomena as it is only where users possess sufficient legal and technological literacy that it becomes possible to reveal to them the illusionary nature of the phenomena.

Updated: 2025-03-17 00:15:41

Domains: cs.CY,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.13556v1

Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning

Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to a target domain using only unlabeled target data. Current SFDA methods face challenges in effectively leveraging pre-trained knowledge and exploiting target domain data. Multimodal Large Language Models (MLLMs) offer remarkable capabilities in understanding visual and textual information, but their applicability to SFDA poses challenges such as instruction-following failures, intensive computational demands, and difficulties in performance measurement prior to adaptation. To alleviate these issues, we propose $\textbf{Reliability-based Curriculum Learning (RCL)}$, a novel framework that integrates multiple MLLMs for knowledge exploitation via pseudo-labeling in SFDA. Our framework incorporates Reliable Knowledge Transfer, Self-correcting and MLLM-guided Knowledge Expansion, and Multi-hot Masking Refinement to progressively exploit unlabeled data in the target domain. RCL achieves state-of-the-art (SOTA) performance on multiple SFDA benchmarks, e.g., $\textbf{+9.4\%}$ on DomainNet, demonstrating its effectiveness in enhancing adaptability and robustness without requiring access to source data. Our code is available at: https://github.com/Dong-Jie-Chen/RCL.

Updated: 2025-03-17 00:11:36

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2405.18376v2

By Xinhai (Sean) Zou.