    _              _         ____
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        


Cross-functional transferability in universal machine learning interatomic potentials

The rapid development of universal machine learning interatomic potentials (uMLIPs) has demonstrated the possibility of generalizable learning of the universal potential energy surface. In principle, the accuracy of uMLIPs can be further improved by bridging the model from lower-fidelity datasets to high-fidelity ones. In this work, we analyze the challenges of this transfer learning problem within the CHGNet framework. We show that significant energy scale shifts and poor correlations between GGA and r$^2$SCAN pose challenges to cross-functional data transferability in uMLIPs. By benchmarking different transfer learning approaches on the MP-r$^2$SCAN dataset of 0.24 million structures, we demonstrate the importance of elemental energy referencing in the transfer learning of uMLIPs. By comparing the scaling law with and without pre-training on a low-fidelity dataset, we show that significant data efficiency can still be achieved through transfer learning, even with a target dataset of sub-million structures. We highlight the importance of proper transfer learning and multi-fidelity learning in creating next-generation uMLIPs on high-fidelity data.
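
The elemental energy referencing highlighted above can be sketched as a simple least-squares fit. This is an illustrative sketch on toy data, not the authors' CHGNet procedure: per-element reference energies are chosen so that composition-weighted shifts absorb the systematic offset between the two functionals' energy scales.

```python
import numpy as np

# Illustrative sketch only (not CHGNet's actual implementation): fit
# per-element reference energies by least squares so that composition-
# weighted shifts absorb the systematic energy offset between two
# functionals (e.g., GGA and r2SCAN).

def fit_elemental_references(compositions, delta_energies):
    """Solve n @ r ~ E_high - E_low for per-element shifts r."""
    A = np.asarray(compositions, dtype=float)    # (structures, elements)
    b = np.asarray(delta_energies, dtype=float)  # (structures,)
    r, *_ = np.linalg.lstsq(A, b, rcond=None)
    return r

# Toy data with two elements and assumed shifts of +0.5 and -1.0 eV/atom.
comps = [[2, 1], [1, 1], [4, 2]]
true_r = np.array([0.5, -1.0])
dE = np.array(comps) @ true_r
r = fit_elemental_references(comps, dE)
```

After referencing, the residual energy differences are what the transfer-learned model actually has to fit.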

Updated: 2025-04-07 23:45:40

Categories: cond-mat.mtrl-sci,cs.LG

Download: http://arxiv.org/abs/2504.05565v1

A Survey on Human Interaction Motion Generation

Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.

Updated: 2025-04-07 23:38:41

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.12763v2

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. Our proposed approach is highly parallelizable and effective even with small seed data and weaker generator models. We generated more than 7.5 million coding instructions with the proposed approach. We then evaluated the data by fine-tuning LLMs on the synthetic samples, demonstrating a significant improvement in their code generation capability compared to other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
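
The evolutionary loop described above can be sketched in miniature. The stand-in `instructor`, `coder`, and `judge` functions below are trivial placeholders for what are LLM calls in the real system; only the loop structure reflects the abstract.

```python
import random

# Hedged sketch of the Genetic-Instruct loop: mutate seed instructions,
# synthesize code for each child, keep only pairs the judge accepts, and
# feed survivors back into the pool. The three "LLMs" are placeholders.

def instructor(seed):            # stand-in for the Instructor-LLM
    return seed + "-variant"

def coder(instruction):          # stand-in for the Coder-LLM
    return f"def solve(): return '{instruction}'"

def judge(instruction, code):    # stand-in for the Judge-LLM
    return instruction in code

def genetic_instruct(seeds, generations=2, children_per_gen=4):
    pool = list(seeds)
    dataset = []
    for _ in range(generations):
        children = [instructor(random.choice(pool))
                    for _ in range(children_per_gen)]
        for ins in children:
            code = coder(ins)
            if judge(ins, code):
                dataset.append((ins, code))
                pool.append(ins)
    return dataset

random.seed(0)
data = genetic_instruct(["sort a list"])
```

Because generation, synthesis, and judging are independent per child, each generation parallelizes naturally.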

Updated: 2025-04-07 23:35:11

Categories: cs.CL,cs.LG,cs.NE

Download: http://arxiv.org/abs/2407.21077v2

From Fairness to Truthfulness: Rethinking Data Valuation Design

As large language models increasingly rely on external data sources, fairly compensating data contributors has become a central concern. In this paper, we revisit the design of data markets through a game-theoretic lens, where data owners face private, heterogeneous costs for data sharing. We show that commonly used valuation methods--such as Leave-One-Out and Data Shapley--fail to ensure truthful reporting of these costs, leading to inefficient market outcomes. To address this, we adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves (VCG), to the data market setting. We demonstrate that the Myerson payment is the minimal truthful payment mechanism, optimal from the buyer's perspective, and that VCG and Myerson payments coincide in unconstrained allocation settings. Our findings highlight the importance of incorporating incentive compatibility into data valuation, paving the way for more robust and efficient data markets.
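
The Leave-One-Out valuation the paper critiques is easy to state concretely. This is a hedged sketch with a toy diminishing-returns utility (not a trained model): each owner's value is the utility drop when their data is removed, and a payment proportional to it ignores the owner's private cost, which is the truthfulness failure the abstract points at.

```python
# Hedged sketch of Leave-One-Out (LOO) data valuation with a toy utility.

def utility(owners):
    # Toy utility with diminishing returns: sqrt of dataset size.
    return len(owners) ** 0.5

def leave_one_out_values(owners):
    full = utility(owners)
    return {i: full - utility([j for j in owners if j != i])
            for i in owners}

loo = leave_one_out_values(["a", "b", "c", "d"])
```

Note that these values depend only on the utility function, never on reported costs, so an owner can misreport their cost without changing their payment.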

Updated: 2025-04-07 23:34:11

Categories: cs.GT,cs.LG

Download: http://arxiv.org/abs/2504.05563v1

Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models

Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals across modalities remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts. We introduce a novel evaluation framework that systematically challenges five state-of-the-art models (including GPT-4o, Claude, and Gemini) using adversarial ASCII art, where character-level semantics deliberately contradict global visual patterns. Our experiments reveal a strong text-priority bias: VLMs consistently prioritize textual information over visual patterns, with visual recognition ability declining dramatically as semantic complexity increases. Various mitigation attempts through visual parameter tuning and prompt engineering yielded only modest improvements, suggesting that this limitation requires architectural-level solutions. These findings uncover fundamental flaws in how current VLMs integrate multimodal information, providing important guidance for future model development while highlighting significant implications for content moderation systems vulnerable to adversarial examples.
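
The semantic-visual conflict described above can be illustrated with a few lines. This is a toy construction, not the paper's generator: glyph pixels of one letter are filled with characters of a contradicting word, so character-level text disagrees with the global shape.

```python
# Toy adversarial ASCII art: render the shape of one letter using the
# characters of an unrelated fill word. The 3x3 bitmaps are illustrative.

BITMAPS = {"X": ["#.#", ".#.", "#.#"], "O": ["###", "#.#", "###"]}

def adversarial_ascii(shape_letter, fill_word):
    rows, k = [], 0
    for row in BITMAPS[shape_letter]:
        out = ""
        for cell in row:
            if cell == "#":
                out += fill_word[k % len(fill_word)]  # local text
                k += 1
            else:
                out += " "
        rows.append(out)
    return "\n".join(rows)

art = adversarial_ascii("X", "no")  # shape says "X", characters spell "no"
```

A text-priority model reading the characters sees "no..."; a shape-sensitive reader sees an X.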

Updated: 2025-04-07 23:21:49

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.01589v2

SciSciGPT: Advancing Human-AI Collaboration in the Science of Science

The increasing availability of large-scale datasets has fueled rapid progress across many scientific fields, creating unprecedented opportunities for research and discovery while posing significant analytical challenges. Recent advances in large language models (LLMs) and AI agents have opened new possibilities for human-AI collaboration, offering powerful tools to navigate this complex research landscape. In this paper, we introduce SciSciGPT, an open-source, prototype AI collaborator that uses the science of science as a testbed to explore the potential of LLM-powered research tools. SciSciGPT automates complex workflows, supports diverse analytical approaches, accelerates research prototyping and iteration, and facilitates reproducibility. Through case studies, we demonstrate its ability to streamline a wide range of empirical and analytical research tasks while highlighting its broader potential to advance research. We further propose an LLM Agent capability maturity model for human-AI collaboration, envisioning a roadmap to further improve and expand upon frameworks like SciSciGPT. As AI capabilities continue to evolve, frameworks like SciSciGPT may play increasingly pivotal roles in scientific research and discovery, unlocking further opportunities. At the same time, these new advances also raise critical challenges, from ensuring transparency and ethical use to balancing human and AI contributions. Addressing these issues may shape the future of scientific inquiry and inform how we train the next generation of scientists to thrive in an increasingly AI-integrated research ecosystem.

Updated: 2025-04-07 23:19:39

Categories: cs.AI,I.2; J.4

Download: http://arxiv.org/abs/2504.05559v1

Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training (SDP), an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase gives the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least match, and often significantly improve upon, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.
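
The pseudo-labeling step described above is simple enough to sketch. This is a hedged illustration, not the paper's network: sub-optimal transitions get the minimum environment reward as their label, and a stand-in constant predictor is pre-trained on those labels.

```python
import random

# Sketch of the SDP idea from the abstract: pseudo-label reward-free,
# low-quality transitions with the minimum environment reward, then use
# the labels to pre-train a reward model before any human feedback.
# The constant-bias "reward model" is a stand-in, not the real network.

R_MIN = -1.0  # minimum reward the environment can emit (assumed known)

def pseudo_label(transitions):
    # transitions: (state, action) pairs with no reward attached
    return [(s, a, R_MIN) for (s, a) in transitions]

random.seed(0)
suboptimal = [(random.random(), random.random()) for _ in range(8)]
labeled = pseudo_label(suboptimal)

bias = 0.0  # pre-train a trivial predictor on the pseudo-labels
for _ in range(200):
    grad = sum(2 * (bias - r) for *_, r in labeled) / len(labeled)
    bias -= 0.1 * grad
```

After pre-training, the model already predicts low reward for low-quality data, so human feedback is spent on finer distinctions.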

Updated: 2025-04-07 23:17:50

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2405.00746v2

Federated Hierarchical Reinforcement Learning for Adaptive Traffic Signal Control

Multi-agent reinforcement learning (MARL) has shown promise for adaptive traffic signal control (ATSC), enabling multiple intersections to coordinate signal timings in real time. However, in large-scale settings, MARL faces constraints due to extensive data sharing and communication requirements. Federated learning (FL) mitigates these challenges by training shared models without directly exchanging raw data, yet traditional FL methods such as FedAvg struggle with highly heterogeneous intersections. Different intersections exhibit varying traffic patterns, demands, and road structures, so performing FedAvg across all agents is inefficient. To address this gap, we propose Hierarchical Federated Reinforcement Learning (HFRL) for ATSC. HFRL employs clustering-based or optimization-based techniques to dynamically group intersections and perform FedAvg independently within groups of intersections with similar characteristics, enabling more effective coordination and scalability than standard FedAvg. Our experiments on synthetic and real-world traffic networks demonstrate that HFRL not only outperforms both decentralized and standard federated RL approaches but also identifies suitable grouping patterns based on network structure or traffic demand, resulting in a more robust framework for distributed, heterogeneous systems.
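
The group-wise averaging at the core of HFRL can be sketched directly. Cluster assignments are taken as given here, since the paper's clustering/optimization step is not specified in the abstract; models are plain parameter dicts standing in for policy networks.

```python
# Hedged sketch of hierarchical FedAvg: average parameters only within
# each cluster of similar intersections, not across all agents.

def fedavg(models):
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

def hierarchical_fedavg(models, groups):
    # groups: list of index lists, one per cluster of intersections
    averaged = {}
    for g in groups:
        shared = fedavg([models[i] for i in g])
        for i in g:
            averaged[i] = shared
    return averaged

agents = [{"w": 1.0}, {"w": 3.0}, {"w": 10.0}, {"w": 14.0}]
out = hierarchical_fedavg(agents, groups=[[0, 1], [2, 3]])
```

Averaging within groups keeps dissimilar intersections (here, very different `w` scales) from dragging each other toward an unhelpful global mean.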

Updated: 2025-04-07 23:02:59

Categories: cs.LG

Download: http://arxiv.org/abs/2504.05553v1

Path Database Guidance for Motion Planning

One approach to using prior experience in robot motion planning is to store solutions to previously seen problems in a database of paths. Methods that use such databases are characterized by how they query for a path and how they use queries given a new problem. In this work we present a new method, Path Database Guidance (PDG), which innovates on existing work in two ways. First, we use the database to compute a heuristic for determining which nodes of a search tree to expand, in contrast to prior work which generally pastes the (possibly transformed) queried path or uses it to bias a sampling distribution. We demonstrate that this makes our method more easily composable with other search methods by dynamically interleaving exploration according to a baseline algorithm with exploitation of the database guidance. Second, in contrast to other methods that treat the database as a single fixed prior, our database (and thus our queried heuristic) updates as we search the implicitly defined robot configuration space. We experimentally demonstrate the effectiveness of PDG in a variety of explicitly defined environment distributions in simulation.
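
The heuristic use of the database described above can be sketched in one dimension. This is a hedged toy, not the paper's planner: frontier nodes are scored by distance to the queried path, and the closest node is expanded, rather than pasting the path or biasing a sampler.

```python
# Toy 1-D sketch of a path-database expansion heuristic: expand the
# frontier node nearest to any point on the queried path. Real motion
# planning works in high-dimensional configuration spaces; nodes here
# are plain floats for illustration.

def path_heuristic(node, queried_path):
    return min(abs(node - p) for p in queried_path)

def select_node_to_expand(frontier, queried_path):
    return min(frontier, key=lambda n: path_heuristic(n, queried_path))

frontier = [0.0, 2.0, 5.0]
queried_path = [4.0, 6.0]
chosen = select_node_to_expand(frontier, queried_path)
```

Because the heuristic is just a node score, it composes with any tree search: a scheduler can interleave baseline exploration steps with database-guided expansions.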

Updated: 2025-04-07 23:00:31

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2504.05550v1

Inverse Attention Agents for Multi-Agent Systems

A major challenge for Multi-Agent Systems is enabling agents to adapt dynamically to diverse environments in which opponents and teammates may continually change. Agents trained using conventional methods tend to excel only within the confines of their training cohorts; their performance drops significantly when confronting unfamiliar agents. To address this shortcoming, we introduce Inverse Attention Agents that adopt concepts from the Theory of Mind (ToM) implemented algorithmically using an attention mechanism trained in an end-to-end manner. Crucial to determining the final actions of these agents, the weights in their attention model explicitly represent attention to different goals. We furthermore propose an inverse attention network that deduces the ToM of agents based on observations and prior actions. The network infers the attentional states of other agents, thereby refining the attention weights to adjust the agent's final action. We conduct experiments in a continuous environment, tackling demanding tasks encompassing cooperation, competition, and a blend of both. They demonstrate that the inverse attention network successfully infers the attention of other agents, and that this information improves agent performance. Additional human experiments show that, compared to baseline agent models, our inverse attention agents exhibit superior cooperation with humans and better emulate human behaviors.

Updated: 2025-04-07 22:59:41

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2410.21794v2

Photovoltaic power forecasting using quantum machine learning

Predicting solar panel power output is crucial for advancing the transition to renewable energy but is complicated by the variable and non-linear nature of solar energy. This is influenced by numerous meteorological factors, geographical positioning, and photovoltaic cell properties, posing significant challenges to forecasting accuracy and grid stability. Our study introduces a suite of solutions centered around hybrid quantum neural networks designed to tackle these complexities. The first proposed model, the Hybrid Quantum Long Short-Term Memory, surpasses all tested models by achieving mean absolute errors and mean squared errors that are more than 40% lower. The second proposed model, the Hybrid Quantum Sequence-to-Sequence neural network, once trained, predicts photovoltaic power with 16% lower mean absolute error for arbitrary time intervals without the need for prior meteorological data, highlighting its versatility. Moreover, our hybrid models perform better even when trained on limited datasets, underlining their potential utility in data-scarce scenarios. These findings represent progress towards resolving time series prediction challenges in energy forecasting through hybrid quantum models, showcasing the transformative potential of quantum machine learning in catalyzing the renewable energy transition.

Updated: 2025-04-07 22:55:21

Categories: cs.LG,cs.ET,quant-ph

Download: http://arxiv.org/abs/2312.16379v2

Neural Port-Hamiltonian Differential Algebraic Equations for Compositional Learning of Electrical Networks

We develop compositional learning algorithms for coupled dynamical systems. While deep learning has proven effective at modeling complex relationships from data, compositional couplings between system components typically introduce algebraic constraints on state variables, posing challenges to many existing data-driven approaches to modeling dynamical systems. Towards developing deep learning models for constrained dynamical systems, we introduce neural port-Hamiltonian differential algebraic equations (N-PHDAEs), which use neural networks to parametrize unknown terms in both the differential and algebraic components of a port-Hamiltonian DAE. To train these models, we propose an algorithm that uses automatic differentiation to perform index reduction, automatically transforming the neural DAE into an equivalent system of neural ordinary differential equations (N-ODEs), for which established model inference and backpropagation methods exist. The proposed compositional modeling framework and learning algorithms may be applied broadly to learn control-oriented models of dynamical systems in a variety of application areas, however, in this work, we focus on their application to the modeling of electrical networks. Experiments simulating the dynamics of nonlinear circuits exemplify the benefits of our approach: the proposed N-PHDAE model achieves an order of magnitude improvement in prediction accuracy and constraint satisfaction when compared to a baseline N-ODE over long prediction time horizons. We also validate the compositional capabilities of our approach through experiments on a simulated D.C. microgrid: we train individual N-PHDAE models for separate grid components, before coupling them to accurately predict the behavior of larger-scale networks.
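
As background for the model class named above, a port-Hamiltonian DAE is commonly written in descriptor form (standard notation; the paper's exact parametrization may differ):

```latex
E\,\dot{z} = (J - R)\,\nabla_z H(z) + B\,u, \qquad 0 = g(z),
```

where a singular $E$ encodes the algebraic constraints $g(z) = 0$, $J$ is a skew-symmetric interconnection matrix, $R \succeq 0$ models dissipation, $H$ is the Hamiltonian (stored energy), and $B$ maps the inputs $u$. In the N-PHDAE setting, neural networks parametrize the unknown terms in both the differential and algebraic parts, and index reduction converts the system to an equivalent neural ODE for training.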

Updated: 2025-04-07 22:47:51

Categories: cs.LG,cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2412.11215v2

Flash STU: Fast Spectral Transform Units

Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining a near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.
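
The sliding-window attention half of the hybrid is easy to visualize with a mask; the spectral state-space layers are not sketched here. This illustrative mask shows why cost grows near-linearly: each position attends to at most `window` earlier positions.

```python
# Illustrative causal sliding-window attention mask (not the Flash STU
# implementation): mask[i][j] is True when position i may attend to
# position j, i.e. j is causal (j <= i) and within the local window.

def sliding_window_mask(seq_len, window):
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(5, 3)
```

With a fixed window, the number of True entries grows as O(seq_len * window) rather than O(seq_len^2), which is the scalability argument for interleaving local attention with global spectral layers.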

Updated: 2025-04-07 22:47:40

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2409.10489v4

Large Language Model (LLM) for Software Security: Code Analysis, Malware Analysis, Reverse Engineering

Large Language Models (LLMs) have recently emerged as powerful tools in cybersecurity, offering advanced capabilities in malware detection, generation, and real-time monitoring. Numerous studies have explored their application in cybersecurity, demonstrating their effectiveness in identifying novel malware variants, analyzing malicious code structures, and enhancing automated threat analysis. Several transformer-based architectures and LLM-driven models have been proposed to improve malware analysis, leveraging semantic and structural insights to recognize malicious intent more accurately. This study presents a comprehensive review of LLM-based approaches in malware code analysis, summarizing recent advancements, trends, and methodologies. We examine notable scholarly works to map the research landscape, identify key challenges, and highlight emerging innovations in LLM-driven cybersecurity. Additionally, we emphasize the role of static analysis in malware detection, introduce notable datasets and specialized LLM models, and discuss essential datasets supporting automated malware research. This study serves as a valuable resource for researchers and cybersecurity professionals, offering insights into LLM-powered malware detection and defence strategies while outlining future directions for strengthening cybersecurity resilience.

Updated: 2025-04-07 22:32:46

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2504.07137v1

Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling

We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications, including video conferencing, virtual reality interactions, health monitoring systems, and vision-based real-time anomaly detection. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoints and their associated local affine transformations. These keypoints are identified using a self-supervised keypoint detector and arranged into a time series corresponding to the successive frames. Forecasting is performed on these keypoints by integrating two advanced generative time series models into the motion transfer pipeline, namely the Variational Recurrent Neural Network (VRNN) and the Gated Recurrent Unit with Normalizing Flow (GRU-NF). The predicted keypoints are subsequently synthesized into realistic video frames using an optical flow estimator paired with a generator network, thereby facilitating accurate video forecasting and enabling efficient, low-frame-rate video transmission. We validate our results across three datasets for video animation and reconstruction using the following metrics: Mean Absolute Error, Joint Embedding Predictive Architecture Embedding Distance, Structural Similarity Index, and Average Pair-wise Displacement. Our results confirm that by utilizing the superior reconstruction property of the Variational Autoencoder, the VRNN integrated FOMM excels in applications involving multi-step ahead forecasts such as video conferencing. On the other hand, by leveraging the Normalizing Flow architecture for exact likelihood estimation, and enabling efficient latent space sampling, the GRU-NF based FOMM exhibits superior capabilities for producing diverse future samples while maintaining high visual quality for tasks like real-time video-based anomaly detection.

Updated: 2025-04-07 22:21:54

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.05537v1

Provable Convergence and Limitations of Geometric Tempering for Langevin Dynamics

Geometric tempering is a popular approach to sampling from challenging multi-modal probability distributions by instead sampling from a sequence of distributions which interpolate, using the geometric mean, between an easier proposal distribution and the target distribution. In this paper, we theoretically investigate the soundness of this approach when the sampling algorithm is Langevin dynamics, proving both upper and lower bounds. Our upper bounds are the first analysis in the literature under functional inequalities. They assert the convergence of tempered Langevin in continuous and discrete-time, and their minimization leads to closed-form optimal tempering schedules for some pairs of proposal and target distributions. Our lower bounds demonstrate a simple case where the geometric tempering takes exponential time, and further reveal that the geometric tempering can suffer from poor functional inequalities and slow convergence, even when the target distribution is well-conditioned. Overall, our results indicate that geometric tempering may not help, and can even be harmful for convergence.
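
The tempered density interpolates log-densities: pi_beta is proportional to mu^(1-beta) * pi^beta. A hedged one-dimensional sketch (unit-variance Gaussians for proposal and target, so every interpolant stays Gaussian) runs unadjusted Langevin on the tempered score while annealing beta; the schedule and step size are illustrative choices, not the paper's.

```python
import math, random

# 1-D sketch of geometric tempering with Langevin dynamics. With
# mu = N(0, 1) and pi = N(4, 1), the tempered log-density is
# (1 - beta) * log mu + beta * log pi up to a constant, and its score
# is a simple linear pull toward the interpolated mean.

def score(x, beta, mu_mean=0.0, pi_mean=4.0):
    return (1 - beta) * (mu_mean - x) + beta * (pi_mean - x)

random.seed(1)
step, x = 0.05, 0.0
samples = []
for k in range(4000):
    beta = min(1.0, k / 2000)  # anneal from proposal to target
    x += step * score(x, beta) + math.sqrt(2 * step) * random.gauss(0, 1)
    if k >= 3000:              # keep samples after the anneal finishes
        samples.append(x)
mean_est = sum(samples) / len(samples)
```

In this well-conditioned Gaussian case the chain reaches the target mean; the paper's lower bounds concern harder multi-modal cases where the interpolants themselves become poorly conditioned.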

Updated: 2025-04-07 22:16:36

Categories: stat.ML,cs.LG,stat.CO

Download: http://arxiv.org/abs/2410.09697v2

Riemannian Geometry for the classification of brain states with intracortical brain-computer interfaces

This study investigates the application of Riemannian geometry-based methods to brain decoding using invasive electrophysiological recordings. Although previously employed in non-invasive settings, the utility of Riemannian geometry for invasive datasets, which are typically smaller and scarcer, remains less explored. Here, we propose a Minimum Distance to Mean (MDM) classifier using a Riemannian geometry approach based on covariance matrices extracted from intracortical Local Field Potential (LFP) recordings across various regions during different brain state dynamics. For benchmarking, we evaluated the performance of our approach against Convolutional Neural Networks (CNNs) and Euclidean MDM classifiers. Our results indicate that the Riemannian geometry-based classification not only achieves a superior mean F1 macro-averaged score across different channel configurations but also requires up to two orders of magnitude less computational training time. Additionally, the geometric framework reveals distinct spatial contributions of brain regions across varying brain states, suggesting a state-dependent organization that traditional time series-based methods often fail to capture. Our findings align with previous studies supporting the efficacy of geometry-based methods and extend their application to invasive brain recordings, highlighting their potential for broader clinical use, such as brain-computer interface applications.
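
A Minimum-Distance-to-Mean classifier on covariance matrices can be sketched compactly. This sketch uses the log-Euclidean metric on symmetric positive-definite matrices (the paper may use the affine-invariant metric; both are standard Riemannian choices), with synthetic covariances standing in for LFP-derived ones.

```python
import numpy as np

# Hedged MDM sketch on SPD matrices under the log-Euclidean metric:
# map each covariance through the matrix logarithm, average per class,
# and classify by nearest class mean in log space.

def spd_log(C):
    w, V = np.linalg.eigh(C)
    return V @ np.diag(np.log(w)) @ V.T

def mdm_fit(covs, labels):
    return {c: np.mean([spd_log(C) for C, l in zip(covs, labels) if l == c],
                       axis=0) for c in set(labels)}

def mdm_predict(C, class_log_means):
    logC = spd_log(C)
    return min(class_log_means,
               key=lambda c: np.linalg.norm(logC - class_log_means[c]))

# Toy "brain states": class 0 near 0.5*I, class 1 near 5*I, plus noise.
rng = np.random.default_rng(0)
def toy_cov(scale):
    A = rng.normal(size=(4, 2))
    return scale * np.eye(2) + A.T @ A / 4

train = [toy_cov(0.5) for _ in range(10)] + [toy_cov(5.0) for _ in range(10)]
y = [0] * 10 + [1] * 10
means = mdm_fit(train, y)
pred = mdm_predict(toy_cov(5.0), means)
```

Training amounts to eigendecompositions and an average, which is consistent with the large training-time advantage reported over CNNs.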

Updated: 2025-04-07 22:11:59

Categories: q-bio.NC,cs.LG

Download: http://arxiv.org/abs/2504.05534v1

FORCE: Feature-Oriented Representation with Clustering and Explanation

Learning about underlying patterns in data using latent unobserved structures to improve the accuracy of predictive models has become an active avenue of deep learning research. Most approaches cluster the original features to capture certain latent structures. However, the information gained in the process can often be implicitly derived by sufficiently complex models. Thus, such approaches often provide minimal benefits. We propose a SHAP (SHapley Additive exPlanations)-based supervised deep learning framework FORCE which relies on two-stage usage of SHAP values in the neural network architecture, (i) an additional latent feature to guide model training, based on clustering SHAP values, and (ii) initiating an attention mechanism within the architecture using latent information. This approach gives a neural network an indication about the effect of unobserved values that modify feature importance for an observation. The proposed framework is evaluated on three real-life datasets. Our results demonstrate that FORCE led to dramatic improvements in overall performance as compared to networks that did not incorporate the latent feature and attention framework (e.g., F1 score for presence of heart disease 0.80 vs 0.72). Using cluster assignments and attention based on SHAP values guides deep learning, enhancing latent pattern learning and overall discriminative capability.
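Stage (i) of the framework, clustering per-sample attribution values and feeding the cluster assignment back as a latent input feature, can be sketched as follows. This is a hypothetical illustration with 1-D attribution scores and a tiny k-means, not the paper's pipeline.

```python
import random

def kmeans(points, k=2, iters=20, seed=0):
    # tiny k-means on 1-D SHAP-like attribution scores
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

def latent_feature(points, centers):
    # cluster id of each sample, appended as an extra input feature
    return [min(range(len(centers)), key=lambda i: abs(p - centers[i])) for p in points]

shap_scores = [0.1, 0.2, 0.15, 2.0, 2.2, 1.9]   # hypothetical per-sample attributions
centers = kmeans(shap_scores)
labels = latent_feature(shap_scores, centers)    # one latent label per sample
```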

Updated: 2025-04-07 22:05:50

Categories: cs.LG,cs.AI,stat.AP,I.2.6

Download: http://arxiv.org/abs/2504.05530v1

Bridging Industrial Expertise and XR with LLM-Powered Conversational Agents

This paper introduces a novel integration of Retrieval-Augmented Generation (RAG) enhanced Large Language Models (LLMs) with Extended Reality (XR) technologies to address knowledge transfer challenges in industrial environments. The proposed system embeds domain-specific industrial knowledge into XR environments through a natural language interface, enabling hands-free, context-aware expert guidance for workers. We present the architecture of the proposed system consisting of an LLM Chat Engine with dynamic tool orchestration and an XR application featuring voice-driven interaction. Performance evaluation of various chunking strategies, embedding models, and vector databases reveals that semantic chunking, balanced embedding models, and efficient vector stores deliver optimal performance for industrial knowledge retrieval. The system's potential is demonstrated through early implementation in multiple industrial use cases, including robotic assembly, smart infrastructure maintenance, and aerospace component servicing. Results indicate potential for enhancing training efficiency, remote assistance capabilities, and operational guidance in alignment with Industry 5.0's human-centric and resilient approach to industrial development.

Updated: 2025-04-07 22:02:19

Categories: cs.CL,cs.AI,68T50, 68T40, 68U20, 68U35,H.5.1; I.2.7; I.2.11; H.3.3; H.5.2; C.3

Download: http://arxiv.org/abs/2504.05527v1

PyraNet: A Multi-Layered Hierarchical Dataset for Verilog

Recently, there has been a growing interest in leveraging Large Language Models for Verilog code generation. However, the current quality of the generated Verilog code remains suboptimal. This is largely due to the absence of well-defined, well-organized datasets with high-quality samples, as well as a lack of innovative fine-tuning methods and models specifically trained on Verilog. In this paper, we introduce a novel open-source dataset and a corresponding fine-tuning technique, which utilizes a multi-layered structure that we refer to as PyraNet. Our experiments demonstrate that employing the proposed dataset and fine-tuning approach leads to a more accurate fine-tuned model, producing syntactically and functionally correct Verilog code. The evaluation results show improvements by up to $32.6\%$ in comparison to the CodeLlama-7B baseline model and up to $16.7\%$ in comparison to the state-of-the-art models using the VerilogEval evaluation platform.

Updated: 2025-04-07 21:58:26

Categories: cs.AR,cs.AI,cs.LG,cs.PL

Download: http://arxiv.org/abs/2412.06947v3

Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Encoding

Rotary Position Embedding (RoPE) is widely adopted in Transformers due to its ability to encode relative positions with high efficiency and extrapolation capability. However, existing RoPE variants lack a unified theoretical foundation, especially in higher dimensions. In this paper, we propose a systematic mathematical framework for RoPE grounded in Lie group and Lie algebra theory. We identify two core properties of RoPE, named relativity and reversibility, and derive general constraints and constructions for valid RoPE in 1D, 2D, and N-dimensional (ND). We prove that RoPE must lie in the basis of a maximal abelian subalgebra (MASA) of the special orthogonal Lie algebra, and show that standard RoPE corresponds to the maximal toral subalgebra. Furthermore, we propose to model inter-dimensional interactions by learning an orthogonal basis transformation. Our framework unifies and explains existing RoPE designs, while enabling principled extensions to new modalities and tasks.
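The relativity property the paper formalizes, that attention scores under rotary embeddings depend only on the relative offset m − n, is easy to check numerically with the standard 1-D construction (a sketch with arbitrary vectors):

```python
import math

def rope(vec, pos, theta=10000.0):
    # standard 1-D RoPE: rotate consecutive feature pairs by position-dependent angles
    out = []
    for i in range(0, len(vec), 2):
        ang = pos * theta ** (-i / len(vec))
        c, s = math.cos(ang), math.sin(ang)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -1.0, 0.7, 0.4]
# relativity: the score depends only on the offset m - n (here 3 in both cases)
s1 = dot(rope(q, 7), rope(k, 4))
s2 = dot(rope(q, 10), rope(k, 7))
```

Each 2×2 rotation is an element of SO(2), and the block-diagonal stack of such rotations is exactly the maximal toral subalgebra structure the paper identifies.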

Updated: 2025-04-07 21:58:22

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.06308v1

Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights

The rapid adoption of large language models (LLMs) has led to significant energy consumption and carbon emissions, posing a critical challenge to the sustainability of generative AI technologies. This paper explores the integration of energy-efficient optimization techniques in the deployment of LLMs to address these environmental concerns. We present a case study and framework that demonstrate how strategic quantization and local inference techniques can substantially lower the carbon footprints of LLMs without compromising their operational effectiveness. Experimental results reveal that these methods can reduce energy consumption and carbon emissions by up to 45\% post quantization, making them particularly suitable for resource-constrained environments. The findings provide actionable insights for achieving sustainability in AI while maintaining high levels of accuracy and responsiveness.

Updated: 2025-04-07 21:56:59

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.06307v1

TULIP: Towards Unified Language-Image Pretraining

Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io

Updated: 2025-04-07 21:50:58

Categories: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.15485v2

Scaling Laws for Predicting Downstream Performance in LLMs

Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling laws analysis utilizes the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities in LLMs that occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach FLP consists of first estimating a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of fully-converged sampling models, followed by mapping the pre-training loss to downstream task Performance using the intermediate models with emerged performance. In our experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. Further, we present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training. FLP-M extends the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.
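Stage one of FLP, fitting a compute-to-loss function from fully converged sampling runs, is in essence a power-law regression. A minimal sketch with invented (FLOPs, loss) pairs, fitting L(C) = a·C^(−b) by least squares in log-log space:

```python
import math

def fit_power_law(flops, losses):
    # fit L(C) = a * C^(-b) via linear regression on (log C, log L)
    xs = [math.log(c) for c in flops]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    b = -slope
    a = math.exp(ybar + b * xbar)
    return a, b

def predict_loss(a, b, flops):
    return a * flops ** (-b)

# hypothetical sampling-model runs: (training FLOPs, converged pre-training loss)
flops = [1e18, 1e19, 1e20, 1e21]
losses = [3.2, 2.8, 2.45, 2.15]
a, b = fit_power_law(flops, losses)
pred = predict_loss(a, b, 1e22)   # extrapolated loss for a larger target model
```

Stage two, mapping the predicted loss to downstream performance, is a separate fit on intermediate checkpoints with emerged capabilities.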

Updated: 2025-04-07 21:47:09

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2410.08527v2

Meta-Dynamical State Space Models for Integrative Neural Data Analysis

Learning shared structure across environments facilitates rapid learning and adaptive behavior in neural systems. This has been widely demonstrated and applied in machine learning to train models that are capable of generalizing to novel settings. However, there has been limited work exploiting the shared structure in neural activity during similar tasks for learning latent dynamics from neural recordings. Existing approaches are designed to infer dynamics from a single dataset and cannot be readily adapted to account for statistical heterogeneities across recordings. In this work, we hypothesize that similar tasks admit a corresponding family of related solutions and propose a novel approach for meta-learning this solution space from task-related neural activity of trained animals. Specifically, we capture the variabilities across recordings on a low-dimensional manifold which concisely parametrizes this family of dynamics, thereby facilitating rapid learning of latent dynamics given new recordings. We demonstrate the efficacy of our approach on few-shot reconstruction and forecasting of synthetic dynamical systems, and neural recordings from the motor cortex during different arm reaching tasks.

Updated: 2025-04-07 21:44:06

Categories: stat.ML,cs.LG,q-bio.NC

Download: http://arxiv.org/abs/2410.05454v2

Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals

This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics.

Updated: 2025-04-07 21:42:15

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2310.16810v2

Deep Reinforcement Learning Algorithms for Option Hedging

Dynamic hedging is a financial strategy that consists in periodically transacting one or multiple financial assets to offset the risk associated with a correlated liability. Deep Reinforcement Learning (DRL) algorithms have been used to find optimal solutions to dynamic hedging problems by framing them as sequential decision-making problems. However, most previous work assesses the performance of only one or two DRL algorithms, making an objective comparison across algorithms difficult. In this paper, we compare the performance of eight DRL algorithms in the context of dynamic hedging; Monte Carlo Policy Gradient (MCPG), Proximal Policy Optimization (PPO), along with four variants of Deep Q-Learning (DQL) and two variants of Deep Deterministic Policy Gradient (DDPG). Two of these variants represent a novel application to the task of dynamic hedging. In our experiments, we use the Black-Scholes delta hedge as a baseline and simulate the dataset using a GJR-GARCH(1,1) model. Results show that MCPG, followed by PPO, obtain the best performance in terms of the root semi-quadratic penalty. Moreover, MCPG is the only algorithm to outperform the Black-Scholes delta hedge baseline with the allotted computational budget, possibly due to the sparsity of rewards in our environment.
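The simulated market in the experiments follows a GJR-GARCH(1,1) process, which is straightforward to generate: the conditional variance recursion adds an extra leverage term when the previous return is negative. A sketch with hypothetical parameter values, not those of the paper:

```python
import math, random

def simulate_gjr_garch(n, omega=1e-6, alpha=0.05, gamma=0.1, beta=0.85, seed=0):
    # sigma2_t = omega + (alpha + gamma * 1[r_{t-1} < 0]) * r_{t-1}^2 + beta * sigma2_{t-1}
    rng = random.Random(seed)
    # unconditional variance (E[1[r<0]] = 1/2 for symmetric innovations)
    sigma2 = omega / (1 - alpha - gamma / 2 - beta)
    returns = []
    for _ in range(n):
        r = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        returns.append(r)
        indicator = 1.0 if r < 0 else 0.0
        sigma2 = omega + (alpha + gamma * indicator) * r * r + beta * sigma2
    return returns

rets = simulate_gjr_garch(5000)
```

The gamma term makes volatility react more strongly to negative returns, the asymmetry that plain GARCH(1,1) misses.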

Updated: 2025-04-07 21:32:14

Categories: q-fin.CP,cs.AI,cs.CE

Download: http://arxiv.org/abs/2504.05521v1

Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets, including AMC, AIME, and IMO-style problems, demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces the number of training steps by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
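The adaptive curriculum step, nudging a difficulty target up when recent rewards are high and down when they are low, then sampling the closest problem, can be sketched as follows (the update rule and all numbers are illustrative assumptions, not the paper's exact algorithm):

```python
def adarft_step(pool, difficulty, recent_rewards, target_reward=0.5, step=0.5):
    # raise the difficulty target when recent rewards exceed the desired
    # success rate, lower it when they fall short ("challenging but solvable")
    if recent_rewards:
        avg = sum(recent_rewards) / len(recent_rewards)
        difficulty += step * (avg - target_reward)
    # train next on the pool problem closest to the current target difficulty
    problem = min(pool, key=lambda p: abs(p - difficulty))
    return problem, difficulty

pool = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical difficulty ratings
problem, difficulty = adarft_step(pool, difficulty=5.0,
                                  recent_rewards=[0.9, 0.8, 1.0])
```

Here the model is doing well (mean reward 0.9 against a 0.5 target), so the target difficulty moves up and the sampler stays at the hardest nearby problem.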

Updated: 2025-04-07 21:31:31

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2504.05520v1

Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning

We assess how the code reasoning abilities of large language models (LLMs) generalize to different kinds of programs. We present techniques for obtaining in- and out-of-distribution programs with different characteristics: code sampled from a domain-specific language, code automatically generated by an LLM, code collected from competitive programming contests, and mutated versions of these programs. We also present an experimental methodology for evaluating LLM generalization by comparing their performance on these programs. We perform an extensive evaluation across 10 state-of-the-art models from the past year, obtaining insights into their generalization capabilities over time and across different classes of programs. Our results highlight that while earlier models exhibit behavior consistent with pattern matching, the latest models exhibit strong generalization abilities on code reasoning.

Updated: 2025-04-07 21:25:31

Categories: cs.SE,cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.05518v1

L3GS: Layered 3D Gaussian Splats for Efficient 3D Scene Delivery

Traditional 3D content representations include dense point clouds that consume large amounts of data and hence network bandwidth, while newer representations such as neural radiance fields suffer from poor frame rates due to their non-standard volumetric rendering pipeline. 3D Gaussian splats (3DGS) can be seen as a generalization of point clouds that achieves the best of both worlds, with high visual quality and efficient rendering for real-time frame rates. However, delivering 3DGS scenes from a hosting server to client devices is still challenging due to high network data consumption (e.g., 1.5 GB for a single scene). The goal of this work is to create an efficient 3D content delivery framework that allows users to view high quality 3D scenes with 3DGS as the underlying data representation. The main contributions of the paper are: (1) Creating new layered 3DGS scenes for efficient delivery, (2) Scheduling algorithms to choose what splats to download at what time, and (3) Trace-driven experiments from users wearing virtual reality headsets to evaluate the visual quality and latency. Our Layered 3D Gaussian Splats delivery system, L3GS, demonstrates high visual quality, achieving 16.9% higher average SSIM compared to baselines, and also works with other compressed 3DGS representations.
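A simple instance of the scheduling problem, choosing which splat layers to download under a bandwidth budget, is greedy selection by utility per byte. The layer names and numbers here are invented for illustration and are not the paper's scheduler:

```python
def schedule_downloads(layers, budget_bytes):
    # greedily pick splat layers by visual utility per byte until the budget is spent
    order = sorted(layers, key=lambda l: l["utility"] / l["bytes"], reverse=True)
    chosen, spent = [], 0
    for layer in order:
        if spent + layer["bytes"] <= budget_bytes:
            chosen.append(layer["name"])
            spent += layer["bytes"]
    return chosen, spent

layers = [
    {"name": "base",   "bytes": 100, "utility": 50.0},   # coarse scene layer
    {"name": "mid",    "bytes": 200, "utility": 30.0},   # refinement layer
    {"name": "detail", "bytes": 400, "utility": 20.0},   # fine detail layer
]
chosen, spent = schedule_downloads(layers, budget_bytes=350)
```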

Updated: 2025-04-07 21:23:32

Categories: cs.GR,cs.LG,cs.MM

Download: http://arxiv.org/abs/2504.05517v1

Optimized Multi-Token Joint Decoding with Auxiliary Model for LLM Inference

Large language models (LLMs) have achieved remarkable success across diverse tasks, yet their inference processes are hindered by substantial time and energy demands due to single-token generation at each decoding step. While previous methods such as speculative decoding mitigate these inefficiencies by producing multiple tokens per step, each token is still generated by its single-token distribution, thereby enhancing speed without improving effectiveness. In contrast, our work simultaneously enhances inference speed and improves the output effectiveness. We consider multi-token joint decoding (MTJD), which generates multiple tokens from their joint distribution at each iteration, theoretically reducing perplexity and enhancing task performance. However, MTJD suffers from the high cost of sampling from the joint distribution of multiple tokens. Inspired by speculative decoding, we introduce multi-token assisted decoding (MTAD), a novel framework designed to accelerate MTJD. MTAD leverages a smaller auxiliary model to approximate the joint distribution of a larger model, incorporating a verification mechanism that not only ensures the accuracy of this approximation, but also improves the decoding efficiency over conventional speculative decoding. Theoretically, we demonstrate that MTAD closely approximates exact MTJD with bounded error. Empirical evaluations using Llama-2 and OPT models ranging from 13B to 70B parameters across various tasks reveal that MTAD reduces perplexity by 21.2% and improves downstream performance compared to standard single-token sampling. Furthermore, MTAD achieves a 1.42x speed-up and consumes 1.54x less energy than conventional speculative decoding methods. These results highlight MTAD's ability to make multi-token joint decoding both effective and efficient, promoting more sustainable and high-performance deployment of LLMs.

Updated: 2025-04-07 21:21:40

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2407.09722v3

Diversity Enhances an LLM's Performance in RAG and Long-context Task

The rapid advancements in large language models (LLMs) have highlighted the challenge of context window limitations, primarily due to the quadratic time complexity of the self-attention mechanism (\(O(N^2)\), where \(N\) denotes the context window length). This constraint impacts tasks such as retrieval-augmented generation (RAG) in question answering (Q\&A) and long context summarization. A common approach involves selecting content with the highest similarity to the query; however, this often leads to redundancy and the exclusion of diverse yet relevant information. Building on principles from Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we integrate diversity into the content selection process. Our findings reveal that incorporating diversity substantially increases the recall of selecting relevant sentences or chunks before LLM-based Q\&A and summarization. These results highlight the importance of maintaining diversity in future LLM applications to further improve summarization and Q\&A outcomes.
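The MMR-style selection the authors build on scores each candidate by its query relevance minus its worst-case redundancy with the chunks already selected. A toy sketch with a word-overlap similarity standing in for embedding cosine:

```python
def mmr_select(query, candidates, sim, k, lam=0.5):
    # Maximal Marginal Relevance: balance query relevance against redundancy
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            relevance = sim(query, c)
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def jaccard(a, b):
    # toy word-overlap similarity standing in for embedding cosine
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

chunks = [
    "paris is the capital of france",
    "the capital of france is paris",    # near-duplicate of the first chunk
    "france borders spain and italy",    # diverse but still relevant
]
picked = mmr_select("capital of france", chunks, jaccard, k=2)
```

Pure similarity ranking would pick both near-duplicates; the redundancy penalty instead promotes the diverse third chunk, which is exactly the recall effect the abstract reports.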

Updated: 2025-04-07 21:14:51

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2502.09017v2

Generalized Random Forests using Fixed-Point Trees

We propose a computationally efficient alternative to generalized random forests (GRFs, arXiv:1610.01271) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRFs' theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method runs several times faster than standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.
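The flavor of the Jacobian-free idea, iterating an estimating equation to a fixed point instead of solving it with gradients, can be seen in a one-parameter toy example (a Huber-type location estimate, not the paper's forest splitting rule):

```python
def fixed_point_location(ys, c=1.0, iters=100, theta=0.0):
    # Jacobian-free fixed-point iteration for a robust location estimate:
    # theta <- theta + mean(psi(y - theta)), with psi the clipped residual
    def psi(r):
        return max(-c, min(c, r))
    for _ in range(iters):
        theta = theta + sum(psi(y - theta) for y in ys) / len(ys)
    return theta

ys = [1.0, 1.2, 0.8, 1.1, 0.9, 50.0]   # five inliers and one gross outlier
theta = fixed_point_location(ys)        # converges near the inlier cluster
```

The iteration converges to the root of the estimating equation without ever differentiating psi, which is the property the fixed-point trees exploit at scale.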

Updated: 2025-04-07 21:11:21

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2306.11908v3

Secure Smart Contract with Control Flow Integrity

Smart contracts power decentralized financial (DeFi) services but are vulnerable to complex security exploits that can lead to significant financial losses. Existing security measures often fail to adequately protect these contracts due to the composability of DeFi protocols and the increasing sophistication of attacks. Through a large-scale empirical study of historical transactions from the 30 hacked DeFi protocols, we discovered that while benign transactions typically exhibit a limited number of unique control flows, in stark contrast, attack transactions consistently introduce novel, previously unobserved control flows. Building on these insights, we developed CrossGuard, a novel framework that enforces control flow integrity in real-time to secure smart contracts. Crucially, CrossGuard does not require prior knowledge of specific hacks; instead, it dynamically enforces control flow whitelisting policies and applies simplification heuristics at runtime. This approach monitors and prevents potential attacks by reverting all transactions that do not adhere to the established control flow whitelisting rules. Our evaluation demonstrates that CrossGuard effectively blocks 28 of the 30 analyzed attacks when configured only once prior to contract deployment, maintaining a low false positive rate of 0.28% and minimal additional gas costs. These results underscore the efficacy of applying control flow integrity to smart contracts, significantly enhancing security beyond traditional methods and addressing the evolving threat landscape in the DeFi ecosystem.
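At its core, the enforcement idea reduces to a whitelist over observed control flows: record every flow seen in benign transactions, then revert anything novel. A minimal sketch (the trace encoding is invented for illustration):

```python
def build_whitelist(benign_traces):
    # record every control flow (sequence of basic-block ids) seen in benign txs
    return {tuple(t) for t in benign_traces}

def check_transaction(whitelist, trace):
    # revert any transaction whose control flow was never observed before
    return "accept" if tuple(trace) in whitelist else "revert"

benign = [
    ["enter", "check", "transfer", "exit"],
    ["enter", "check", "exit"],
]
wl = build_whitelist(benign)
r1 = check_transaction(wl, ["enter", "check", "transfer", "exit"])
r2 = check_transaction(wl, ["enter", "reenter", "transfer", "transfer", "exit"])
```

The second trace, a reentrancy-style flow never seen among benign transactions, is reverted; the empirical finding is that attacks consistently introduce exactly such novel flows.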

Updated: 2025-04-07 21:08:16

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2504.05509v1

Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search

The rapid advancement of Large Language Models (LLMs) has outpaced traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. We introduce Prism, a flexible, dynamic benchmarking framework designed for comprehensive LLM assessment. Prism builds on three key components: (1) a tree-based state representation that models evaluation as a Markov Decision Process, (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios, and (3) a multi-agent evaluation pipeline that enables simultaneous assessment of diverse capabilities. To ensure robust evaluation, Prism integrates structural measurements of tree exploration patterns with performance metrics across difficulty levels, providing detailed diagnostics of error patterns, test coverage, and solution approaches. Through extensive experiments on five state-of-the-art LLMs, we analyze how model architecture and scale influence code generation performance across varying task difficulties. Our results demonstrate Prism's effectiveness as a dynamic benchmark that evolves with model advancements while offering deeper insights into their limitations.

Updated: 2025-04-07 20:53:18

Categories: cs.AI,cs.LG,cs.SE

Download: http://arxiv.org/abs/2504.05500v1

A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany

In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based on a comparison with party responses to the 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases: they exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right parties (below 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, which are intended to objectively inform voters, we found substantial deviations from the parties' stated positions in the Wahl-O-Mat: while one VAA deviated in 25% of cases, another showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims such as non-existent ties between political parties and right-wing extremists.

Updated: 2025-04-07 20:52:04

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.15568v2

Predicting Survivability of Cancer Patients with Metastatic Patterns Using Explainable AI

Cancer remains a leading global health challenge and a major cause of mortality. This study leverages machine learning (ML) to predict the survivability of cancer patients with metastatic patterns using the comprehensive MSK-MET dataset, which includes genomic and clinical data from 25,775 patients across 27 cancer types. We evaluated five ML models (XGBoost, Naïve Bayes, Decision Tree, Logistic Regression, and Random Forest) using hyperparameter tuning and grid search. XGBoost emerged as the best performer with an area under the curve (AUC) of 0.82. To enhance model interpretability, SHapley Additive exPlanations (SHAP) were applied, revealing key predictors such as metastatic site count, tumor mutation burden, fraction of genome altered, and organ-specific metastases. Further survival analysis using Kaplan-Meier curves, Cox Proportional Hazards models, and XGBoost Survival Analysis identified significant predictors of patient outcomes, offering actionable insights for clinicians. These findings could aid in personalized prognosis and treatment planning, ultimately improving patient care.
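
The survival-analysis step rests on the Kaplan-Meier estimator, which multiplies the survival fractions observed at each event time: S(t) is the product over event times t_i <= t of (1 - d_i / n_i), where d_i deaths occur among n_i patients at risk. A minimal pure-Python sketch (not the study's actual pipeline, which also uses Cox models and XGBoost survival analysis):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times  : observation time per patient
    events : 1 if the event (death) occurred, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        n_at_t = 0
        # Group all patients with the same observed time t.
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t
    return curve

# 5 patients: deaths at t=2 and t=4, censoring at t=3, 5, and 5.
curve = kaplan_meier([2, 3, 4, 5, 5], [1, 0, 1, 0, 0])
# S(2) = 1 - 1/5 = 0.8; at t=4 only 3 remain at risk, so S(4) = 0.8 * (1 - 1/3)
```

Censored patients leave the risk set without contributing a death, which is what distinguishes this from a naive survival fraction.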

Updated: 2025-04-07 20:48:15

Categories: q-bio.QM,cs.AI

Download: http://arxiv.org/abs/2504.06306v1

GSCE: A Prompt Framework with Enhanced Reasoning for Reliable LLM-driven Drone Control

The integration of Large Language Models (LLMs) into robotic control, including drones, has the potential to revolutionize autonomous systems. Research studies have demonstrated that LLMs can be leveraged to support robotic operations. However, when facing tasks with complex reasoning, concerns and challenges are raised about the reliability of solutions produced by LLMs. In this paper, we propose a prompt framework with enhanced reasoning to enable reliable LLM-driven control for drones. Our framework consists of novel technical components designed using Guidelines, Skill APIs, Constraints, and Examples, namely GSCE. GSCE is featured by its reliable and constraint-compliant code generation. We performed thorough experiments using GSCE for the control of drones with a wide level of task complexities. Our experiment results demonstrate that GSCE can significantly improve task success rates and completeness compared to baseline approaches, highlighting its potential for reliable LLM-driven autonomous drone systems.
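
The four GSCE components suggest a straightforward prompt layout: guidelines first, then the available skill APIs, hard constraints, and worked examples, followed by the task. The section names and layout below are an illustrative reading of the abstract, not the paper's exact template:

```python
def build_gsce_prompt(guidelines, skill_apis, constraints, examples, task):
    """Assemble a GSCE-style prompt from its four components plus the task."""
    sections = [
        ("Guidelines", guidelines),
        ("Skill APIs", skill_apis),
        ("Constraints", constraints),
        ("Examples", examples),
    ]
    parts = []
    for name, items in sections:
        parts.append(f"## {name}")
        parts.extend(f"- {item}" for item in items)
    parts.append("## Task")
    parts.append(task)
    return "\n".join(parts)

prompt = build_gsce_prompt(
    guidelines=["Plan before generating code."],
    skill_apis=["takeoff(altitude_m)", "fly_to(x, y, z)", "land()"],
    constraints=["Only call the listed skill APIs."],
    examples=["Task: hover at 2 m -> takeoff(2)"],
    task="Survey the four corners of a 10 m square at 5 m altitude.",
)
```

Constraining generation to an explicit skill API list is what makes the generated code checkable for constraint compliance before it reaches the drone.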

Updated: 2025-04-07 20:45:14

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2502.12531v2

Neural network-enhanced integrators for simulating ordinary differential equations

Numerous applications necessitate the computation of numerical solutions to differential equations across a wide range of initial conditions and system parameters, which feeds the demand for efficient yet accurate numerical integration methods. This study proposes a neural network (NN) enhancement of classical numerical integrators. NNs are trained to learn integration errors, which are then used as additive correction terms in numerical schemes. The performance of these enhanced integrators is compared with well-established methods through numerical studies, with a particular emphasis on computational efficiency. Analytical properties are examined in terms of local errors and backward error analysis. Embedded Runge-Kutta schemes are then employed to develop enhanced integrators that mitigate generalization risk, ensuring that the neural network's evaluation in previously unseen regions of the state space does not destabilize the integrator. It is guaranteed that the enhanced integrators perform at least as well as the desired classical Runge-Kutta schemes. The effectiveness of the proposed approaches is demonstrated through extensive numerical studies using a realistic model of a wind turbine, with parameters derived from the established simulation framework OpenFast.
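
The additive-correction idea can be illustrated on the linear test equation x' = -x, where the exact one-step error of explicit Euler is known. Here the "network" is replaced by a single coefficient fitted by least squares on sampled step errors, a deliberately tiny stand-in for the trained NN:

```python
import math

h = 0.1  # step size

def euler_step(x):
    return x + h * (-x)          # classical explicit Euler for x' = -x

# "Training": fit a linear correction c*x to the observed one-step errors
# exact(x) - euler(x) on sample states (closed-form least squares).
xs = [0.1 * k for k in range(1, 11)]
errs = [math.exp(-h) * x - euler_step(x) for x in xs]
c = sum(e * x for e, x in zip(errs, xs)) / sum(x * x for x in xs)

def enhanced_step(x):
    return euler_step(x) + c * x  # integrator plus learned additive correction

# Integrate from x(0)=1 to t=1 with both schemes and compare to exp(-1).
x_plain, x_enh = 1.0, 1.0
for _ in range(10):
    x_plain = euler_step(x_plain)
    x_enh = enhanced_step(x_enh)
```

Because the error is exactly linear in x for this test equation, the fitted correction recovers the exact propagator; for nonlinear systems the correction is a trained network and only approximates the error, which is why the paper's embedded Runge-Kutta safeguard matters.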

Updated: 2025-04-07 20:38:35

Categories: math.NA,cs.LG,cs.NA,cs.SY,eess.SY

Download: http://arxiv.org/abs/2504.05493v1

A Unified Framework with Novel Metrics for Evaluating the Effectiveness of XAI Techniques in LLMs

The increasing complexity of LLMs presents significant challenges to their transparency and interpretability, necessitating the use of eXplainable AI (XAI) techniques to enhance trustworthiness and usability. This study introduces a comprehensive evaluation framework with four novel metrics for assessing the effectiveness of five XAI techniques across five LLMs and two downstream tasks. We apply this framework to evaluate five XAI techniques: LIME, SHAP, Integrated Gradients, Layer-wise Relevance Propagation (LRP), and Attention Mechanism Visualization (AMV), using the IMDB Movie Reviews and Tweet Sentiment Extraction datasets. The evaluation focuses on four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. Our results show that LIME consistently achieves high scores across multiple LLMs and evaluation metrics, while AMV demonstrates superior Robustness and near-perfect Consistency. LRP excels in Contrastivity, particularly with more complex models. Our findings provide valuable insights into the strengths and limitations of different XAI methods, offering guidance for developing and selecting appropriate XAI techniques for LLMs.

Updated: 2025-04-07 20:37:11

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.05050v2

Optimal Bayesian Affine Estimator and Active Learning for the Wiener Model

This paper presents a Bayesian estimation framework for Wiener models, focusing on learning nonlinear output functions under known linear state dynamics. We derive a closed-form optimal affine estimator for the unknown parameters, characterized by the so-called "dynamic basis statistics (DBS)." Several features of the proposed estimator are studied, including Bayesian unbiasedness, closed-form posterior statistics, error monotonicity in trajectory length, and consistency condition (also known as persistent excitation). In the special case of Fourier basis functions, we demonstrate that the closed-form description is computationally available, as the Fourier DBS enjoys explicit expression. Furthermore, we identify an inherent inconsistency in single-trajectory measurements, regardless of input excitation. Leveraging the closed-form estimation error, we develop an active learning algorithm synthesizing input signals to minimize estimation error. Numerical experiments validate the efficacy of our approach, showing significant improvements over traditional regularized least-squares methods.

Updated: 2025-04-07 20:36:06

Categories: cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2504.05490v1

Towards Zero Trust Security in Connected Vehicles: A Comprehensive Survey

Zero Trust is the new cybersecurity model that challenges the traditional one by promoting continuous verification of users, devices, and applications, whatever their position or origin. This model is critical for reducing the attack surface and preventing lateral movement without relying on implicit trust. Adopting the zero trust principle in Intelligent Transportation Systems (ITS), especially in the context of connected vehicles (CVs), presents an adequate solution in the face of increasing cyber threats, thereby strengthening the ITS environment. This paper offers an understanding of Zero Trust security through a comprehensive review of existing literature, principles, and challenges. It specifically examines its applications in emerging technologies, particularly within connected vehicles, addressing potential issues and cyber threats faced by CVs. Inclusion/exclusion criteria for the systematic literature review were planned alongside a bibliometric analysis. Moreover, keyword co-occurrence analysis was done, which indicates trends and general themes for the Zero Trust model, Zero Trust implementation, and Zero Trust application. Furthermore, the paper explores various ZT models proposed in the literature for connected vehicles, shedding light on the challenges associated with their integration into CV systems. Future directions of this research will focus on incorporating Zero Trust principles within Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication paradigms. This initiative intends to enhance the security posture and safety protocols within interconnected vehicular networks. The proposed research seeks to address the unique cybersecurity vulnerabilities inherent in the highly dynamic nature of vehicular communication systems.

Updated: 2025-04-07 20:29:11

Categories: cs.CR,cs.SY,eess.SY

Download: http://arxiv.org/abs/2504.05485v1

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no questions, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations in current VLMs, emphasizing the need for enhanced model architectures and training strategies. Our data and leaderboard is available at blackswan.cs.ubc.ca.

Updated: 2025-04-07 20:26:05

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2412.05725v2

GraphRAFT: Retrieval Augmented Fine-Tuning for Knowledge Graphs on Graph Databases

Large language models have shown remarkable language processing and reasoning ability but are prone to hallucinate when asked about private data. Retrieval-augmented generation (RAG) retrieves relevant data that fit into an LLM's context window and prompts the LLM for an answer. GraphRAG extends this approach to structured Knowledge Graphs (KGs) and questions regarding entities multiple hops away. The majority of recent GraphRAG methods either overlook the retrieval step or have ad hoc retrieval processes that are abstract or inefficient. This prevents them from being adopted when the KGs are stored in graph databases supporting graph query languages. In this work, we present GraphRAFT, a retrieve-and-reason framework that finetunes LLMs to generate provably correct Cypher queries to retrieve high-quality subgraph contexts and produce accurate answers. Our method is the first such solution that can be taken off-the-shelf and used on KGs stored in native graph DBs. Benchmarks suggest that our method is sample-efficient and scales with the availability of training data. Our method achieves significantly better results than all state-of-the-art models across all four standard metrics on two challenging Q&As on large text-attributed KGs.
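
What "generating Cypher to retrieve a subgraph context" looks like can be sketched with a toy question-to-query mapping. The schema (a name property, a KNOWS relation) and the template are invented for illustration and are not GraphRAFT's fine-tuned output:

```python
def two_hop_cypher(relation="KNOWS", limit=20):
    """Build a Cypher query fetching entities up to two hops away,
    the kind of subgraph context a GraphRAG retriever needs."""
    return (
        f"MATCH (a {{name: $name}})-[:{relation}*1..2]->(b) "
        f"RETURN DISTINCT b.name LIMIT {limit}"
    )

query = two_hop_cypher()
params = {"name": "Alice"}  # passed separately to avoid query injection
```

The query string plus parameter map would then be executed against the graph database, and the returned subgraph fed back to the LLM as context.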

Updated: 2025-04-07 20:16:22

Categories: cs.LG,cs.CL,cs.IR

Download: http://arxiv.org/abs/2504.05478v1

Well2Flow: Reconstruction of reservoir states from sparse wells using score-based generative models

This study investigates the use of score-based generative models for reservoir simulation, with a focus on reconstructing spatially varying permeability and saturation fields in saline aquifers, inferred from sparse observations at two well locations. By modeling the joint distribution of permeability and saturation derived from high-fidelity reservoir simulations, the proposed neural network is trained to learn the complex spatiotemporal dynamics governing multiphase fluid flow in porous media. During inference, the framework effectively reconstructs both permeability and saturation fields by conditioning on sparse vertical profiles extracted from well log data. This approach introduces a novel methodology for incorporating physical constraints and well log guidance into generative models, significantly enhancing the accuracy and physical plausibility of the reconstructed subsurface states. Furthermore, the framework demonstrates strong generalization capabilities across varying geological scenarios, highlighting its potential for practical deployment in data-scarce reservoir management tasks.

Updated: 2025-04-07 20:12:19

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.06305v1

Towards Scalable Newborn Screening: Automated General Movement Assessment in Uncontrolled Settings

General movements (GMs) are spontaneous, coordinated body movements in infants that offer valuable insights into the developing nervous system. Assessed through the Prechtl GM Assessment (GMA), GMs are reliable predictors for neurodevelopmental disorders. However, GMA requires specifically trained clinicians, who are limited in number. To scale up newborn screening, there is a need for an algorithm that can automatically classify GMs from infant video recordings. This data poses challenges, including variability in recording length, device type, and setting, with each video coarsely annotated for overall movement quality. In this work, we introduce a tool for extracting features from these recordings and explore various machine learning techniques for automated GM classification.

Updated: 2025-04-07 20:02:50

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2411.09821v3

Graph Neural Networks for Enhancing Ensemble Forecasts of Extreme Rainfall

Climate change is increasing the occurrence of extreme precipitation events, threatening infrastructure, agriculture, and public safety. Ensemble prediction systems provide probabilistic forecasts but exhibit biases and difficulties in capturing extreme weather. While post-processing techniques aim to enhance forecast accuracy, they rarely focus on precipitation, which exhibits complex spatial dependencies and tail behavior. Our novel framework leverages graph neural networks to post-process ensemble forecasts, specifically modeling the extremes of the underlying distribution. This allows the model to capture spatial dependencies and improves forecast accuracy for extreme events, thus leading to more reliable forecasts and mitigating risks of extreme precipitation and flooding.

Updated: 2025-04-07 20:01:55

Categories: cs.LG

Download: http://arxiv.org/abs/2504.05471v1

A Survey on Federated Unlearning: Challenges and Opportunities

Federated learning (FL), introduced in 2017, facilitates collaborative learning between non-trusting parties with no need for the parties to explicitly share their data among themselves. This allows training models on user data while respecting privacy regulations such as GDPR and CPRA. However, emerging privacy requirements may mandate model owners to be able to "forget" some learned data, e.g., when requested by data owners or law enforcement. This has given birth to an active field of research called machine unlearning. In the context of FL, many techniques developed for unlearning in centralized settings are not trivially applicable! This is due to the unique differences between centralized and distributed learning, in particular, interactivity, stochasticity, heterogeneity, and limited accessibility in FL. In response, a recent line of work has focused on developing unlearning mechanisms tailored to FL. This SoK paper aims to take a deep look at the federated unlearning literature, with the goal of identifying research trends and challenges in this emerging field. By carefully categorizing papers published on FL unlearning (since 2020), we aim to pinpoint the unique complexities of federated unlearning, highlighting limitations on directly applying centralized unlearning methods. We compare existing federated unlearning methods regarding influence removal and performance recovery, compare their threat models and assumptions, and discuss their implications and limitations. For instance, we analyze the experimental setup of FL unlearning studies from various perspectives, including data heterogeneity and its simulation, the datasets used for demonstration, and evaluation metrics. Our work aims to offer insights and suggestions for future research on federated unlearning.

Updated: 2025-04-07 19:55:57

Categories: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2403.02437v3

Quantum Mechanics and Neural Networks

We demonstrate that any Euclidean-time quantum mechanical theory may be represented as a neural network, ensured by the Kosambi-Karhunen-Loève theorem, mean-square path continuity, and finite two-point functions. The additional constraint of reflection positivity, which is related to unitarity, may be achieved by a number of mechanisms, such as imposing neural network parameter space splitting or the Markov property. Non-differentiability of the networks is related to the appearance of non-trivial commutators. Neural networks acting on Markov processes are no longer Markov, but still reflection positive, which facilitates the definition of deep neural network quantum systems. We illustrate these principles in several examples using numerical implementations, recovering classic quantum mechanical results such as Heisenberg uncertainty, non-trivial commutators, and the spectrum.

Updated: 2025-04-07 19:54:00

Categories: hep-th,cs.LG,math.PR,quant-ph

Download: http://arxiv.org/abs/2504.05462v1

Intermediate Layer Classifiers for OOD generalization

Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network's last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce Intermediate Layer Classifiers (ILCs). We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shifts. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate layer representation utility. Code is available at https://github.com/oshapio/intermediate-layer-generalization

Updated: 2025-04-07 19:50:50

Categories: cs.LG

Download: http://arxiv.org/abs/2504.05461v1

SiReRAG: Indexing Similar and Related Information for Multihop Reasoning

Indexing is an important step towards strong performance in retrieval-augmented generation (RAG) systems. However, existing methods organize data based on either semantic similarity (similarity) or related information (relatedness), but do not cover both perspectives comprehensively. Our analysis reveals that modeling only one perspective results in insufficient knowledge synthesis, leading to suboptimal performance on complex tasks requiring multihop reasoning. In this paper, we propose SiReRAG, a novel RAG indexing approach that explicitly considers both similar and related information. On the similarity side, we follow existing work and explore some variants to construct a similarity tree based on recursive summarization. On the relatedness side, SiReRAG extracts propositions and entities from texts, groups propositions via shared entities, and generates recursive summaries to construct a relatedness tree. We index and flatten both similarity and relatedness trees into a unified retrieval pool. Our experiments demonstrate that SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SiReRAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores. Our code is available at https://github.com/SalesforceAIResearch/SiReRAG .
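
The relatedness side, grouping propositions by the entities they share before recursive summarization, can be sketched as a simple inverted index from entity to propositions. This is an illustrative reading of the abstract, not the paper's code:

```python
from collections import defaultdict

def group_by_entity(propositions):
    """propositions: list of (text, set_of_entities) pairs.
    Returns {entity: [proposition texts mentioning it]}; each group is
    a candidate node for recursive summarization in the relatedness tree."""
    groups = defaultdict(list)
    for text, entities in propositions:
        for ent in entities:
            groups[ent].append(text)
    return dict(groups)

props = [
    ("Marie Curie won the Nobel Prize in Physics.", {"Marie Curie", "Nobel Prize"}),
    ("Pierre Curie shared the 1903 prize with Marie Curie.", {"Pierre Curie", "Marie Curie"}),
    ("The Nobel Prize was established in 1895.", {"Nobel Prize"}),
]
groups = group_by_entity(props)
# "Marie Curie" groups the first two propositions; "Nobel Prize" the first and third.
```

Grouping by entity connects propositions that are related but not semantically similar, which is exactly the signal a pure similarity tree misses for multihop questions.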

Updated: 2025-04-07 19:47:16

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2412.06206v2

Large-Scale Classification of Shortwave Communication Signals with Machine Learning

This paper presents a deep learning approach to the classification of shortwave radio signals into 160 classes. It addresses the typical challenges of the shortwave spectrum: the large number of different signal types, the presence of various analog modulations, and ionospheric propagation. The classifier is a deep convolutional neural network trained to recognize 160 typical shortwave signal classes. The approach is blind and therefore requires neither prior knowledge nor special preprocessing of the signal, and no manual design of discriminative features for each signal class. The network is trained on a large number of synthetically generated signals and high-quality recordings. Finally, the network is evaluated on real-world radio signals obtained from globally deployed receiver hardware and achieves up to 90% accuracy for an observation time of only 1 second.

Updated: 2025-04-07 19:45:08

Categories: eess.SP,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.05455v1

GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction

Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradient, and Shapley value. These do not handle data with strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture leveraging domain-specific prior knowledge to initialize node importance optimized during training for drug response prediction. Typically, a manual post-prediction step examines literature (i.e., prior knowledge) to understand returned predictive features. While node importance can be obtained for gradient and attention after prediction, node importance from these methods lacks complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that unifies 1) updates of the feature matrix and node importance and 2) GNN-based graph propagation of feature values. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data covering over 5,000 genes, represented as nodes in a gene-gene graph, with a drug-target interaction (DTI) graph providing initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by article count discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and ROC-AUC of 0.796 across 952 drugs. Code is available at https://anonymous.4open.science/r/GraphPINE-40DE.
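The shape of such an importance propagation step can be sketched in a few lines of NumPy. This is our toy stand-in, not the paper's exact layer: features and node importance are updated together, each mixed with an importance-weighted neighbor average over the graph:

```python
import numpy as np

def importance_propagation(A, X, s, alpha=0.5):
    """One toy propagation step: node features X and importance s are
    updated jointly; neighbors contribute in proportion to their importance.
    A: (n, n) adjacency; X: (n, d) features; s: (n,) importance in [0, 1]."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # row-normalized
    X_new = alpha * X + (1 - alpha) * (P @ (s[:, None] * X))
    s_new = alpha * s + (1 - alpha) * (P @ s)
    return X_new, np.clip(s_new, 0.0, 1.0)

# Path graph 0-1-2; only node 0 starts important (e.g. a known drug target).
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.eye(3)
s = np.array([1.0, 0.0, 0.0])
X1, s1 = importance_propagation(A, X, s)
print(s1)  # importance leaks to node 1 but has not yet reached node 2
```

In GraphPINE the initial `s` would come from the DTI graph rather than being hand-set as here.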

Updated: 2025-04-07 19:42:12

Categories: cs.LG,cs.AI,cs.CE,q-bio.GN,q-bio.QM

Download: http://arxiv.org/abs/2504.05454v1

Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Co-examination of second-harmonic generation (SHG) and bright-field (BF) microscopy enables the differentiation of tissue components and collagen fibers, aiding the analysis of human breast and pancreatic cancer tissues. However, large discrepancies between SHG and BF images pose challenges for current learning-based registration models in aligning SHG to BF. In this paper, we propose a novel multi-modal registration framework that employs fidelity-imposed displacement editing to address these challenges. The framework integrates batch-wise contrastive learning, feature-based pre-alignment, and instance-level optimization. Experimental results from the Learn2Reg COMULISglobe SHG-BF Challenge validate the effectiveness of our method, securing the 1st place on the online leaderboard.

Updated: 2025-04-07 19:41:30

Categories: cs.CV,cs.LG,eess.IV

Download: http://arxiv.org/abs/2410.20812v3

Fast constrained sampling in pre-trained diffusion models

Large denoising diffusion models, such as Stable Diffusion, have been trained on billions of image-caption pairs to perform text-conditioned image generation. As a byproduct of this training, these models have acquired general knowledge about image statistics, which can be useful for other inference tasks. However, when confronted with sampling an image under new constraints, e.g. generating the missing parts of an image, using large pre-trained text-to-image diffusion models is inefficient and often unreliable. Previous approaches either utilize backpropagation, making them significantly slower and more memory-demanding than text-to-image inference, or only enforce the constraint locally, failing to capture critical long-range correlations. In this work, we propose an algorithm that enables fast and high-quality generation under arbitrary constraints. We observe that, during inference, we can interchange between gradient updates computed on the noisy image and updates computed on the final, clean image. This allows us to employ a numerical approximation to expensive gradient computations, incurring significant speed-ups in inference. Our approach produces results that rival or surpass the state-of-the-art training-free inference approaches while requiring a fraction of the time. We demonstrate the effectiveness of our algorithm under both linear and non-linear constraints. An implementation is provided at https://github.com/cvlab-stonybrook/fast-constrained-sampling.

Updated: 2025-04-07 19:36:42

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2410.18804v2

DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction

Advancements in large language models (LLMs) allow them to address diverse questions using human-like interfaces. Still, limitations in their training prevent them from answering accurately in scenarios that could benefit from multiple perspectives. Multi-agent systems allow the resolution of questions to enhance result consistency and reliability. While drug-target interaction (DTI) prediction is important for drug discovery, existing approaches face challenges due to complex biological systems and the lack of interpretability needed for clinical applications. DrugAgent is a multi-agent LLM system for DTI prediction that combines multiple specialized perspectives with transparent reasoning. Our system adapts and extends existing multi-agent frameworks by (1) applying coordinator-based architecture to the DTI domain, (2) integrating domain-specific data sources, including ML predictions, knowledge graphs, and literature evidence, and (3) incorporating Chain-of-Thought (CoT) and ReAct (Reason+Act) frameworks for transparent DTI reasoning. We conducted comprehensive experiments using a kinase inhibitor dataset, where our multi-agent LLM method outperformed the non-reasoning multi-agent model (GPT-4o mini) by 45% in F1 score (0.514 vs 0.355). Through ablation studies, we demonstrated the contributions of each agent, with the AI agent being the most impactful, followed by the KG agent and search agent. Most importantly, our approach provides detailed, human-interpretable reasoning for each prediction by combining evidence from multiple sources - a critical feature for biomedical applications where understanding the rationale behind predictions is essential for clinical decision-making and regulatory compliance. Code is available at https://anonymous.4open.science/r/DrugAgent-B2EA.

Updated: 2025-04-07 19:32:55

Categories: cs.AI,cs.CL,cs.IR,cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2408.13378v4

Geometry-Aware Generative Autoencoders for Warped Riemannian Metric Learning and Generative Modeling on Data Manifolds

Rapid growth of high-dimensional datasets in fields such as single-cell RNA sequencing and spatial genomics has led to unprecedented opportunities for scientific discovery, but it also presents unique computational and statistical challenges. Traditional methods struggle with geometry-aware data generation, interpolation along meaningful trajectories, and transporting populations via feasible paths. To address these issues, we introduce Geometry-Aware Generative Autoencoder (GAGA), a novel framework that combines extensible manifold learning with generative modeling. GAGA constructs a neural network embedding space that respects the intrinsic geometries discovered by manifold learning and learns a novel warped Riemannian metric on the data space. This warped metric is derived from both the points on the data manifold and negative samples off the manifold, allowing it to characterize a meaningful geometry across the entire latent space. Using this metric, GAGA can uniformly sample points on the manifold, generate points along geodesics, and interpolate between populations across the learned manifold using geodesic-guided flows. GAGA shows competitive performance in simulated and real-world datasets, including a 30% improvement over the state-of-the-art methods in single-cell population-level trajectory inference.

Updated: 2025-04-07 19:30:58

Categories: cs.LG,math.DG,stat.ML

Download: http://arxiv.org/abs/2410.12779v4

Using Machine Learning for Lunar Mineralogy-I: Hyperspectral Imaging of Volcanic Samples

This study examines the mineral composition of volcanic samples similar to lunar materials, focusing on olivine and pyroxene. Using hyperspectral imaging from 400 to 1000 nm, we created data cubes to analyze the reflectance characteristics of samples from Vulcano, a volcanically active island in the Aeolian Archipelago, north of Sicily, Italy, categorizing them into nine regions of interest and analyzing spectral data for each. We applied various unsupervised clustering algorithms, including K-Means, Hierarchical Clustering, GMM, and Spectral Clustering, to classify the spectral profiles. Principal Component Analysis revealed distinct spectral signatures associated with specific minerals, facilitating precise identification. Clustering performance varied by region, with K-Means achieving the highest silhouette score of 0.47, whereas GMM performed poorly with a score of only 0.25. Non-negative Matrix Factorization aided in identifying similarities among clusters across different methods and reference spectra for olivine and pyroxene. Hierarchical clustering emerged as the most reliable technique, achieving a 94\% similarity with the olivine spectrum in one sample, whereas GMM exhibited notable variability. Overall, the analysis indicated that both Hierarchical and K-Means methods yielded lower errors in total measurements, with K-Means demonstrating superior performance in estimated dispersion and clustering. Additionally, GMM showed a higher root mean square error compared to the other models. The RMSE analysis confirmed K-Means as the most consistent algorithm across all samples, suggesting a predominance of olivine in the Vulcano region relative to pyroxene. This predominance is likely linked to historical formation conditions similar to volcanic processes on the Moon, where olivine-rich compositions are common in ancient lava flows and impact melt rocks.
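The core clustering step, assigning each pixel's reflectance spectrum to a mineral-like class, can be sketched with a tiny K-Means on synthetic spectra. The two Gaussian-absorption profiles below are made-up stand-ins for olivine- and pyroxene-like reflectance, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Tiny K-Means for illustration (the study also used hierarchical
    clustering, GMM, and spectral clustering)."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()  # deterministic init
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Two synthetic "reflectance" profiles on a 400-1000 nm grid: a broad
# absorption near 900 nm vs. one near 650 nm (shapes are invented).
wl = np.linspace(400, 1000, 61)
olivine_like = 0.6 - 0.2 * np.exp(-((wl - 900) / 80) ** 2)
pyroxene_like = 0.5 - 0.2 * np.exp(-((wl - 650) / 60) ** 2)
X = np.vstack([olivine_like + 0.01 * rng.standard_normal((20, wl.size)),
               pyroxene_like + 0.01 * rng.standard_normal((20, wl.size))])
labels = kmeans(X, k=2)
# Pixels generated from the same profile land in the same cluster.
```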

Updated: 2025-04-07 19:15:56

Categories: astro-ph.EP,cs.LG

Download: http://arxiv.org/abs/2503.22617v2

Do Chinese models speak Chinese languages?

The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with Western models', with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.

Updated: 2025-04-07 19:09:50

Categories: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2504.00289v2

Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model

High-resolution climate simulations are valuable for understanding climate change impacts. This has motivated use of regional convection-permitting climate models (CPMs), but these are very computationally expensive. We present a convection-permitting model generative emulator (CPMGEM), to skilfully emulate precipitation simulations by a 2.2km-resolution regional CPM at much lower cost. This utilises a generative machine learning approach, a diffusion model. It takes inputs at the 60km resolution of the driving global climate model and downscales these to 8.8km, with daily-mean time resolution, capturing the effect of convective processes represented in the CPM at these scales. The emulator is trained on simulations over England and Wales from the United Kingdom Climate Projections Local product, covering years between 1980 and 2080 following a high emissions scenario. The output precipitation has a similarly realistic spatial structure and intensity distribution to the CPM simulations. The emulator is stochastic, which improves the realism of samples. We show evidence that the emulator has skill for extreme events with ~100 year return times. It captures the main features of the simulated 21st century climate change, but exhibits some error in the magnitude. We demonstrate successful transfer from a "perfect model" training setting to application using GCM variable inputs. We also show that the method can be useful in situations with limited amounts of high-resolution data. Potential applications include producing high-resolution precipitation predictions for large-ensemble climate simulations and producing output based on different GCMs and climate change scenarios to better sample uncertainty.

Updated: 2025-04-07 19:08:37

Categories: physics.ao-ph,cs.LG,J.2

Download: http://arxiv.org/abs/2407.14158v2

Faster Reinforcement Learning by Freezing Slow States

We study infinite horizon Markov decision processes (MDPs) with "fast-slow" structure, where some state variables evolve rapidly ("fast states") while others change more gradually ("slow states"). Such structure is common in real-world problems where sequential decisions need to be made at high frequencies over long horizons, where slowly evolving information also influences optimal decisions. Examples include inventory control under slowly changing demand, or dynamic pricing with gradually shifting consumer behavior. Modeling the problem at the natural decision frequency leads to MDPs with discount factors close to one, making them computationally challenging. We propose a novel approximation strategy that "freezes" slow states during a phase of lower-level planning, solving finite-horizon MDPs conditioned on a fixed slow state, and then applying value iteration to an auxiliary upper-level MDP that evolves on a slower timescale. Freezing states for short periods of time leads to easier-to-solve lower-level problems, while a slower upper-level timescale allows for a more favorable discount factor. On the theoretical side, we analyze the regret incurred by our frozen-state approach, which leads to simple insights on how to trade off computational budget versus regret. Empirically, we demonstrate that frozen-state methods produce high-quality policies with significantly less computation, and we show that simply omitting slow states is often a poor heuristic.
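The lower-level step can be made concrete with a toy sketch (ours, not the paper's algorithm): with the slow state frozen, what remains is a small finite-horizon MDP over the fast state alone, solvable by plain value iteration; here the frozen slow state simply scales the reward table:

```python
import numpy as np

def solve_lower_level(P_fast, R, horizon, gamma=0.9):
    """Finite-horizon value iteration over the *fast* state only; the
    frozen slow state enters purely through the reward table R.
    P_fast: (n_fast, n_act, n_fast) transitions; R: (n_fast, n_act) rewards."""
    V = np.zeros(R.shape[0])
    for _ in range(horizon):
        Q = R + gamma * np.einsum("saf,f->sa", P_fast, V)  # backup
        V = Q.max(axis=1)
    return V

# Toy instance: 2 fast states, 2 actions; action a moves deterministically
# to fast state a. Two frozen slow-state values just rescale rewards here.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = P[:, 1, 1] = 1.0
V_lo = solve_lower_level(P, 0.5 * np.array([[1.0, 0.0], [0.0, 2.0]]), horizon=10)
V_hi = solve_lower_level(P, 1.0 * np.array([[1.0, 0.0], [0.0, 2.0]]), horizon=10)
print(V_hi)  # staying in fast state 1 (reward 2) dominates
```

The upper-level MDP would then stitch these conditional solutions together on the slower timescale, where a shorter effective horizon permits a friendlier discount factor.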

Updated: 2025-04-07 18:55:35

Categories: cs.AI,cs.LG,cs.SY,eess.SY,math.OC

Download: http://arxiv.org/abs/2301.00922v2

Survey on Algorithms for multi-index models

We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexity. In many cases, a gap is observed between the sample complexity of the best known computationally efficient methods and the information-theoretical minimum. We also review algorithms based on estimating the span of gradients using nonparametric methods, and algorithms based on fitting neural networks using gradient descent.
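The "span of gradients" idea mentioned above can be demonstrated on a toy instance (our construction, with finite-difference gradients standing in for a nonparametric estimator): for a multi-index model $y = g(Wx)$, gradients of $f$ lie in the row span of $W$, so an SVD of stacked gradient estimates recovers the index space:

```python
import numpy as np

rng = np.random.default_rng(1)

# Multi-index model y = g(Wx): only k = 2 of d = 6 directions matter.
d, k, n = 6, 2, 2000
g = lambda z: np.tanh(z[..., 0]) + z[..., 1] ** 2
f = lambda x: g(x[..., :k])            # index space = span(e1, e2)

# Finite-difference gradients at random Gaussian points, then SVD/PCA
# of the stacked gradients.
X = rng.standard_normal((n, d))
eps = 1e-4
grads = np.stack(
    [(f(X + eps * np.eye(d)[j]) - f(X)) / eps for j in range(d)], axis=1)
_, S, Vt = np.linalg.svd(grads, full_matrices=False)
est_basis = Vt[:k]  # top-k right singular vectors span the index space
print(np.abs(est_basis[:, k:]).max())  # ~0: no mass outside span(e1, e2)
```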

Updated: 2025-04-07 18:50:11

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2504.05426v1

Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and multi-modal data sources to use for each task/dialogue, aiming to maximize the long-term reward (response quality, latency, and usage cost) while adhering to resource constraints. We also contribute M4A1, a new dataset we curated that contains reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO compared to several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.
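The decision the RCRL policy learns, routing each task locally or to the cloud by trading off quality, latency, and cost under a budget, can be caricatured by a hand-written rule. All names, numbers, and the threshold below are ours; the paper learns this policy rather than hard-coding it:

```python
def choose_placement(task, remaining_budget, tradeoff=0.1):
    """Route one task either to the local or the cloud LLM.
    `task` maps each location to hypothetical (quality, latency, cost)
    estimates; this rule only illustrates the quantities being traded off."""
    local_q, local_lat, _ = task["local"]
    cloud_q, cloud_lat, cloud_cost = task["cloud"]
    if cloud_cost > remaining_budget:   # hard resource constraint
        return "local"
    # offload only if the quality gain outweighs the extra latency
    return "cloud" if cloud_q - local_q > tradeoff * (cloud_lat - local_lat) else "local"

task = {"local": (0.6, 0.1, 0.0), "cloud": (0.9, 0.8, 1.0)}
print(choose_placement(task, remaining_budget=5.0))  # cloud
print(choose_placement(task, remaining_budget=0.5))  # local
```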

Updated: 2025-04-07 18:49:28

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2502.11007v2

A Behavior-Based Knowledge Representation Improves Prediction of Players' Moves in Chess by 25%

Predicting player behavior in strategic games, especially complex ones like chess, presents a significant challenge. The difficulty arises from several factors. First, the sheer number of potential outcomes stemming from even a single position, starting from the initial setup, makes forecasting a player's next move incredibly complex. Second, and perhaps even more challenging, is the inherent unpredictability of human behavior. Unlike the optimized play of engines, humans introduce a layer of variability due to differing playing styles and decision-making processes. Each player approaches the game with a unique blend of strategic thinking, tactical awareness, and psychological tendencies, leading to diverse and often unexpected actions. This stylistic variation, combined with the capacity for creativity and even irrational moves, makes predicting human play difficult. Chess, a longstanding benchmark of artificial intelligence research, has seen significant advancements in tools and automation. Engines like Deep Blue, AlphaZero, and Stockfish can defeat even the most skilled human players. However, despite their exceptional ability to outplay top-level grandmasters, predicting the moves of non-grandmaster players, who comprise most of the global chess community, remains complicated for these engines. This paper proposes a novel approach combining expert knowledge with machine learning techniques to predict human players' next moves. By applying feature engineering grounded in domain expertise, we seek to uncover the patterns in the moves of intermediate-level chess players, particularly during the opening phase of the game. Our methodology offers a promising framework for anticipating human behavior, advancing both the fields of AI and human-computer interaction.

Updated: 2025-04-07 18:49:00

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05425v1

Safe Automated Refactoring for Efficient Migration of Imperative Deep Learning Programs to Graph Execution

Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code -- supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance. Though hybrid approaches aim for the "best of both worlds," using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution. We present an automated refactoring approach that assists developers in specifying whether their otherwise eagerly-executed imperative DL code could be reliably and efficiently executed as graphs while preserving semantics. The approach, based on a novel imperative tensor analysis, automatically determines when it is safe and potentially advantageous to migrate imperative DL code to graph execution. The approach is implemented as a PyDev Eclipse IDE plug-in that integrates the WALA Ariadne analysis framework and evaluated on 19 Python projects consisting of 132.05 KLOC. We found that 326 of 766 candidate functions (42.56%) were refactorable, and an average speedup of 2.16 on performance tests was observed. The results indicate that the approach is useful in optimizing imperative DL code to its full potential.

Updated: 2025-04-07 18:48:43

Categories: cs.SE,cs.AI,cs.PL,D.2.7; C.4; D.3.4; I.2.6

Download: http://arxiv.org/abs/2504.05424v1

EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations

As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints can be found here: https://github.com/continental/EP-Diffuser.
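The polynomial representation in the title can be illustrated with a toy trajectory encoding (degree, horizon, and motion model below are ours, not the paper's parameterization): an agent's $(x, y)$ track over many timesteps is compressed to a handful of polynomial coefficients and decoded by evaluation:

```python
import numpy as np

# Toy polynomial trajectory representation: a 50-step (x, y) track is
# compressed to 2 * (deg + 1) = 8 coefficients and decoded essentially
# losslessly when the motion is itself polynomial.
t = np.linspace(0.0, 5.0, 50)
x = 1.0 + 2.0 * t + 0.1 * t ** 2   # toy quadratic motion
y = 0.5 * t

deg = 3
cx, cy = np.polyfit(t, x, deg), np.polyfit(t, y, deg)   # encode: 8 numbers
x_rec, y_rec = np.polyval(cx, t), np.polyval(cy, t)     # decode
err = np.max(np.hypot(x_rec - x, y_rec - y))
print(err < 1e-6, cx.size + cy.size, t.size * 2)  # True 8 100
```

Such a compact, smooth parameterization is one route to the parameter efficiency the abstract emphasizes: the diffusion model denoises a few coefficients per agent instead of every waypoint.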

Updated: 2025-04-07 18:45:49

Categories: cs.CV,cs.LG,cs.RO

Download: http://arxiv.org/abs/2504.05422v1

Preconditioned FEM-based Neural Networks for Solving Incompressible Fluid Flows and Related Inverse Problems

The numerical simulation and optimization of technical systems described by partial differential equations is expensive, especially in multi-query scenarios in which the underlying equations have to be solved for different parameters. A comparatively new approach in this context is to combine the good approximation properties of neural networks (for parameter dependence) with the classical finite element method (for discretization). However, instead of considering the solution mapping of the PDE from the parameter space into the FEM-discretized solution space as a purely data-driven regression problem, so-called physically informed regression problems have proven to be useful. In these, the equation residual is minimized during the training of the neural network, i.e., the neural network "learns" the physics underlying the problem. In this paper, we extend this approach to saddle-point and non-linear fluid dynamics problems, respectively, namely stationary Stokes and stationary Navier-Stokes equations. In particular, we propose a modification of the existing approach: Instead of minimizing the plain vanilla equation residual during training, we minimize the equation residual modified by a preconditioner. By analogy with the linear case, this also improves the conditioning in the present non-linear case. Our numerical examples demonstrate that this approach significantly reduces the training effort and greatly increases accuracy and generalizability. Finally, we show the application of the resulting parameterized model to a related inverse problem.
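Why a preconditioned residual trains faster can be illustrated on a linear toy problem (our stand-in for the paper's FEM setting): gradient descent on the plain residual of a badly conditioned operator crawls, while the diagonally preconditioned residual converges in a few steps:

```python
import numpy as np

# Minimize ||A u - f||^2 vs. the preconditioned ||P^{-1}(A u - f)||^2
# with P = diag(A); here A is diagonal, so P^{-1}A = I exactly.
rng = np.random.default_rng(0)
n = 20
A = np.diag(np.linspace(1.0, 100.0, n))   # condition number 100
f = rng.standard_normal(n)
u_star = np.linalg.solve(A, f)

def gd_error(M, steps, lr):
    """Gradient descent on ||M u - M u_star||^2; returns the final error."""
    u = np.zeros(n)
    for _ in range(steps):
        u -= lr * 2 * M.T @ (M @ (u - u_star))
    return np.linalg.norm(u - u_star)

err_plain = gd_error(A, steps=200, lr=1.0 / (2 * 100.0 ** 2))  # stable step size
err_pre = gd_error(np.diag(1.0 / np.diag(A)) @ A, steps=200, lr=0.4)
print(err_pre < 1e-10 < err_plain)  # preconditioning wins decisively
```

In the paper the operator is the (non-linear) discretized residual and the preconditioner acts on it during training, but the mechanism, a smaller effective condition number, is the same.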

Updated: 2025-04-07 18:45:14

Categories: math.NA,cs.LG,cs.NA,physics.flu-dyn

Download: http://arxiv.org/abs/2409.04067v2

Sample Compression Scheme Reductions

We present novel reductions from sample compression schemes in multiclass classification, regression, and adversarially robust learning settings to binary sample compression schemes. Assuming we have a compression scheme for binary classes of size $f(d_\mathrm{VC})$, where $d_\mathrm{VC}$ is the VC dimension, then we have the following results: (1) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists a multiclass compression scheme of size $O(f(d_\mathrm{G}))$, where $d_\mathrm{G}$ is the graph dimension. Moreover, for general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{G})\log|Y|)$, where $Y$ is the label space. (2) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists an $\epsilon$-approximate compression scheme for regression over $[0,1]$-valued functions of size $O(f(d_\mathrm{P}))$, where $d_\mathrm{P}$ is the pseudo-dimension. For general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{P})\log(1/\epsilon))$. These results would have significant implications if the sample compression conjecture, which posits that any binary concept class with a finite VC dimension admits a binary compression scheme of size $O(d_\mathrm{VC})$, is resolved (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995; Warmuth, 2003). Our results would then extend the proof of the conjecture immediately to other settings. We establish similar results for adversarially robust learning and also provide an example of a concept class that is robustly learnable but has no bounded-size compression scheme, demonstrating that learnability is not equivalent to having a compression scheme independent of the sample size, unlike in binary classification, where compression of size $2^{O(d_\mathrm{VC})}$ is attainable (Moran and Yehudayoff, 2016).
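
To make the notion of a sample compression scheme concrete, here is a minimal example that is not from the paper: the class of 1-D thresholds $h_t(x) = \mathbb{1}[x \ge t]$ has VC dimension 1 and admits a compression scheme of size 1 on realizable samples.

```python
import numpy as np

# Minimal illustration (not the paper's construction): a size-1 sample
# compression scheme for 1-D thresholds. compress() keeps at most one
# labeled point; decompress() rebuilds a hypothesis consistent with the
# whole realizable sample.

def compress(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    return [min(pos)] if pos else []      # smallest positive point, if any

def decompress(kept):
    t = kept[0] if kept else np.inf       # no positives -> all-negative rule
    return lambda x: int(x >= t)

# Sanity check on random realizable samples.
rng = np.random.default_rng(1)
for _ in range(100):
    t_true = rng.uniform(-1, 1)
    xs = rng.uniform(-2, 2, size=20)
    ys = (xs >= t_true).astype(int)
    h = decompress(compress(xs, ys))
    assert all(h(x) == y for x, y in zip(xs, ys))
```

The reductions in the abstract ask the converse kind of question: given such a scheme for binary classes, how cheaply can multiclass, regression, or robust-learning samples be compressed through it.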

Updated: 2025-04-07 18:44:28

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2410.13012v3

PreSumm: Predicting Summarization Performance Without Summarizing

Despite recent advancements in automatic summarization, state-of-the-art models do not summarize all documents equally well, raising the question: why? While prior research has extensively analyzed summarization models, little attention has been given to the role of document characteristics in influencing summarization performance. In this work, we explore two key research questions. First, do documents exhibit consistent summarization quality across multiple systems? If so, can we predict a document's summarization performance without generating a summary? We answer both questions affirmatively and introduce PreSumm, a novel task in which a system predicts summarization performance based solely on the source document. Our analysis sheds light on common properties of documents with low PreSumm scores, revealing that they often suffer from coherence issues, complex content, or a lack of a clear main theme. In addition, we demonstrate PreSumm's practical utility in two key applications: improving hybrid summarization workflows by identifying documents that require manual summarization and enhancing dataset quality by filtering outliers and noisy documents. Overall, our findings highlight the critical role of document properties in summarization performance and offer insights into the limitations of current systems that could serve as the basis for future improvements.

Updated: 2025-04-07 18:43:00

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.05420v1

Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work, we study whether reasoning models encode information about answer correctness through probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores. Additionally, we find models' hidden states encode correctness of future answers, enabling early prediction of the correctness before the intermediate answer is fully formulated. We then use the probe as a verifier to decide whether to exit reasoning at intermediate answers during inference, reducing the number of inference tokens by 24% without compromising performance. These findings confirm that reasoning models do encode a notion of correctness yet fail to exploit it, revealing substantial untapped potential to enhance their efficiency.
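
The probing recipe can be sketched end-to-end on synthetic data. Everything below is illustrative: we assume correctness is linearly decodable from a hidden vector (the paper's empirical finding), but the data, dimensions, and early-exit threshold are made up.

```python
import numpy as np

# Hedged sketch of a correctness probe. We pretend hidden states are d-dim
# vectors in which answer correctness is linearly encoded plus noise, and
# fit a logistic-regression probe; the paper probes a real reasoning model.
rng = np.random.default_rng(0)
d, n = 32, 2000
w_true = rng.standard_normal(d)
H = rng.standard_normal((n, d))                  # stand-in hidden states
y = (H @ w_true + 0.5 * rng.standard_normal(n) > 0).astype(float)

w = np.zeros(d)
for _ in range(500):                             # plain full-batch gradient descent
    p = 1.0 / (1.0 + np.exp(-(H @ w)))
    w -= 0.1 * H.T @ (p - y) / n

probe_score = 1.0 / (1.0 + np.exp(-(H @ w)))     # probe confidence per state
acc = np.mean((probe_score > 0.5) == y)

def should_exit(hidden_state, threshold=0.9):
    """Early-exit rule: stop reasoning once the probe is confident enough."""
    return 1.0 / (1.0 + np.exp(-(hidden_state @ w))) > threshold
```

Using `should_exit` at each intermediate answer is the mechanism by which the paper trims unnecessary reasoning tokens.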

Updated: 2025-04-07 18:42:01

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.05419v1

Towards Assessing Deep Learning Test Input Generators

Deep Learning (DL) systems are increasingly deployed in safety-critical applications, yet they remain vulnerable to robustness issues that can lead to significant failures. While numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. This paper presents a comprehensive assessment of four state-of-the-art TIGs--DeepHunter, DeepFault, AdvGAN, and SinVAD--across multiple critical aspects: fault-revealing capability, naturalness, diversity, and efficiency. Our empirical study leverages three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K) to evaluate TIG performance. Our findings reveal important trade-offs in robustness revealing capability, variation in test case generation, and computational efficiency across TIGs. The results also show that TIG performance varies significantly with dataset complexity, as tools that perform well on simpler datasets may struggle with more complex ones. In contrast, others maintain steadier performance or better scalability. This paper offers practical guidance for selecting appropriate TIGs aligned with specific objectives and dataset characteristics. Nonetheless, more work is needed to address TIG limitations and advance TIGs for real-world, safety-critical systems.

Updated: 2025-04-07 18:35:13

Categories: cs.LG,cs.CV,cs.SE

Download: http://arxiv.org/abs/2504.02329v2

Leveraging State Space Models in Long Range Genomics

Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.
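
A minimal linear state-space recurrence shows the two properties the abstract leans on: O(L) sequential cost and length-independent parameters, so the same layer runs on sequences far longer than any training length. Sizes and parameter choices below are arbitrary, not those of Caduceus or Hawk.

```python
import numpy as np

# Toy diagonal linear state-space layer: x_k = a * x_{k-1} + b * u_k,
# y_k = Re(c . x_k). Cost is linear in sequence length, and nothing in the
# parameters ties the layer to a fixed context window.
rng = np.random.default_rng(0)
d_state = 8
a = 0.99 * np.exp(1j * rng.uniform(0, 0.1, d_state))   # stable decay rates
b = rng.standard_normal(d_state) + 0j
c = rng.standard_normal(d_state) + 0j

def ssm_scan(u):
    """Process a length-L scalar sequence with cost linear in L."""
    x = np.zeros(d_state, dtype=complex)
    out = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a * x + b * u_k
        out[k] = (c @ x).real
    return out

short = ssm_scan(rng.standard_normal(1_000))
long_out = ssm_scan(rng.standard_normal(100_000))  # 100x longer, same layer
```

Real SSM implementations replace the Python loop with a parallel scan or convolutional form, but the recurrence semantics are the same.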

Updated: 2025-04-07 18:34:06

Categories: q-bio.GN,cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.06304v1

'Neural howlround' in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution

Large language model (LLM)-driven AI systems may exhibit an inference failure mode we term 'neural howlround,' a self-reinforcing cognitive loop where certain highly weighted inputs become dominant, leading to entrenched response patterns resistant to correction. This paper explores the mechanisms underlying this phenomenon, which is distinct from model collapse and biased salience weighting. We propose an attenuation-based correction mechanism that dynamically introduces counterbalancing adjustments and can restore adaptive reasoning, even in 'locked-in' AI systems. Additionally, we discuss some other related effects arising from improperly managed reinforcement. Finally, we outline potential applications of this mitigation strategy for improving AI robustness in real-world decision-making tasks.
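
The self-reinforcing loop and the attenuation fix can be caricatured in a few lines. This is our own toy construction, not code from the paper: reinforcement boosts the logits of whatever is currently likely, and attenuation pulls logits back toward their mean.

```python
import numpy as np

# Toy simulation (our construction, purely illustrative): a 5-option
# response distribution whose logits are reinforced by their own
# probabilities each step, with an optional attenuation term.

def run(steps=200, gain=0.4, attenuation=0.0):
    logits = np.zeros(5)
    logits[0] = 0.1                                  # slight initial bias
    for _ in range(steps):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        logits += gain * p                           # self-reinforcement
        logits -= attenuation * (logits - logits.mean())  # counterbalance
    p = np.exp(logits - logits.max())
    return p / p.sum()

locked = run(attenuation=0.0)   # collapses onto the initially favored option
damped = run(attenuation=0.2)   # stays close to the balanced distribution
```

Without attenuation the tiny initial bias snowballs into an entrenched, near-deterministic response; with the damping term the distribution stays adaptive, which is the qualitative behavior the proposed mechanism targets.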

Updated: 2025-04-07 18:30:52

Categories: cs.CL,cs.AI,cs.NE

Download: http://arxiv.org/abs/2504.07992v1

Less but Better: Parameter-Efficient Fine-Tuning of Large Language Models for Personality Detection

Personality detection automatically identifies an individual's personality from various data sources, such as social media texts. However, as the parameter scale of language models continues to grow, the computational cost becomes increasingly difficult to manage. Fine-tuning also grows more complex, making it harder to justify the effort and reliably predict outcomes. We introduce a novel parameter-efficient fine-tuning framework, PersLLM, to address these challenges. In PersLLM, a large language model (LLM) extracts high-dimensional representations from raw data and stores them in a dynamic memory layer. PersLLM then updates the downstream layers with a replaceable output network, enabling flexible adaptation to various personality detection scenarios. By storing the features in the memory layer, we eliminate the need for repeated complex computations by the LLM. Meanwhile, the lightweight output network serves as a proxy for evaluating the overall effectiveness of the framework, improving the predictability of results. Experimental results on key benchmark datasets like Kaggle and Pandora show that PersLLM significantly reduces computational cost while maintaining competitive performance and strong adaptability.

Updated: 2025-04-07 18:30:39

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.05411v1

Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling

The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed $100,000$ tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
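
The core rejection loop can be sketched directly from the abstract (this is our reading, not the authors' implementation): draw a token from the unconstrained distribution and only then check the constraint, rather than masking all 100,000+ tokens up front; violating tokens are zeroed out and the draw repeats.

```python
import numpy as np

# Sketch of adaptive rejection sampling for constrained decoding. The
# vocabulary size, toy constraint, and weight bookkeeping are illustrative.
rng = np.random.default_rng(0)
V = 100_000
logits = rng.standard_normal(V)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

allowed = np.zeros(V, dtype=bool)
allowed[: V // 2] = True                  # toy constraint: token id < V/2

def constrained_sample(probs, is_allowed):
    p = probs.copy()
    evaluations = 0
    while True:
        tok = rng.choice(len(p), p=p / p.sum())
        evaluations += 1                  # one constraint check per draw
        if is_allowed[tok]:
            # p.sum() after pruning is a crude stand-in for the importance
            # weight the paper estimates with much better machinery.
            return tok, evaluations, p.sum()
        p[tok] = 0.0                      # reject: never propose it again

tok, n_evals, kept_mass = constrained_sample(probs, allowed)
```

With roughly half the probability mass allowed, a draw typically needs a handful of constraint evaluations instead of 100,000, which is the orders-of-magnitude saving the abstract describes.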

Updated: 2025-04-07 18:30:18

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.05410v1

SoK: Frontier AI's Impact on the Cybersecurity Landscape

As frontier AI advances rapidly, understanding its impact on cybersecurity and inherent risks is essential to ensuring safe AI evolution (e.g., guiding risk mitigation and informing policymakers). While some studies review AI applications in cybersecurity, none of them comprehensively discuss AI's future impacts or provide concrete recommendations for navigating its safe and secure usage. This paper presents an in-depth analysis of frontier AI's impact on cybersecurity and establishes a systematic framework for risk assessment and mitigation. To this end, we first define and categorize the marginal risks of frontier AI in cybersecurity and then systemically analyze the current and future impacts of frontier AI in cybersecurity, qualitatively and quantitatively. We also discuss why frontier AI likely benefits attackers more than defenders in the short term from equivalence classes, asymmetry, and economic impact. Next, we explore frontier AI's impact on future software system development, including enabling complex hybrid systems while introducing new risks. Based on our findings, we provide security recommendations, including constructing fine-grained benchmarks for risk assessment, designing AI agents for defenses, building security mechanisms and provable defenses for hybrid systems, enhancing pre-deployment security testing and transparency, and strengthening defenses for users. Finally, we present long-term research questions essential for understanding AI's future impacts and unleashing its defensive capabilities.

Updated: 2025-04-07 18:25:18

Categories: cs.CR,cs.AI,cs.CY

Download: http://arxiv.org/abs/2504.05408v1

TRATSS: Transformer-Based Task Scheduling System for Autonomous Vehicles

Efficient scheduling remains a critical challenge in various domains, requiring solutions to complex NP-hard optimization problems to achieve optimal resource allocation and maximize productivity. In this paper, we introduce a framework called Transformer-Based Task Scheduling System (TRATSS), designed to address the intricacies of single agent scheduling in graph-based environments. By integrating the latest advancements in reinforcement learning and transformer architecture, TRATSS provides a novel system that outputs optimized task scheduling decisions while dynamically adapting to evolving task requirements and resource availability. Leveraging the self-attention mechanism in transformers, TRATSS effectively captures complex task dependencies, thereby providing solutions with enhanced resource utilization and task completion efficiency. Experimental evaluations on benchmark datasets demonstrate TRATSS's effectiveness in providing high-quality solutions to scheduling problems that involve multiple action profiles.

Updated: 2025-04-07 18:23:13

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.05407v1

The Role of Environment Access in Agnostic Reinforcement Learning

We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that: 1. Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access. 2. Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called $\mu$-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness. 3. On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all policies $\pi \in \Pi$. These values are approximated without any explicit value function class.

Updated: 2025-04-07 18:19:56

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2504.05405v1

Compute-Constrained Data Selection

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.

Updated: 2025-04-07 18:16:42

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2410.16208v4

GARF: Learning Generalizable 3D Reassembly for Real-World Fractures

3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose GARF, a generalizable 3D reassembly framework for real-world fractures. GARF leverages fracture-aware pretraining to learn fracture features from individual fragments, with flow matching enabling precise 6-DoF alignments. At inference time, we introduce one-step preassembly, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate Fractura, a diverse dataset for vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments have shown our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy. This sheds light on training on synthetic data to advance real-world 3D puzzle solving, demonstrating its strong generalization across unseen object shapes and diverse fracture types.

Updated: 2025-04-07 18:13:16

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.05400v1

Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. Pre-trained ViTs, such as DINO, have demonstrated remarkable capabilities in extracting localized, discriminative features. However, saliency maps like Grad-CAM often fail to identify these traits, producing blurred, coarse heatmaps that highlight entire objects instead. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this limitation. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To correctly classify an image, the true-class prompt must attend to unique image patches not present in other classes' images (i.e., traits). As a result, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a ``free lunch,'' requiring only a modification to the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM easy to train and apply, in stark contrast to other interpretable methods that require designing specific models and training processes. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate the superior interpretation capability of Prompt-CAM. The source code and demo are available at https://github.com/Imageomics/Prompt_CAM.

Updated: 2025-04-07 18:03:40

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2501.09333v2

Proactive Adversarial Defense: Harnessing Prompt Tuning in Vision-Language Models to Detect Unseen Backdoored Images

Backdoor attacks pose a critical threat by embedding hidden triggers into inputs, causing models to misclassify them into target labels. While extensive research has focused on mitigating these attacks in object recognition models through weight fine-tuning, much less attention has been given to detecting backdoored samples directly. Given the vast datasets used in training, manual inspection for backdoor triggers is impractical, and even state-of-the-art defense mechanisms fail to fully neutralize their impact. To address this gap, we introduce a groundbreaking method to detect unseen backdoored images during both training and inference. Leveraging the transformative success of prompt tuning in Vision Language Models (VLMs), our approach trains learnable text prompts to differentiate clean images from those with hidden backdoor triggers. Experiments demonstrate the exceptional efficacy of this method, achieving an impressive average accuracy of 86% across two renowned datasets for detecting unseen backdoor triggers, establishing a new standard in backdoor defense.

Updated: 2025-04-07 18:01:26

Categories: cs.CV,cs.AI,cs.CR,cs.LG

Download: http://arxiv.org/abs/2412.08755v4

Interactive Explanations for Reinforcement-Learning Agents

As reinforcement learning methods increasingly amass accomplishments, the need for comprehending their solutions becomes more crucial. Most explainable reinforcement learning (XRL) methods generate a static explanation depicting their developers' intuition of what should be explained and how. In contrast, literature from the social sciences proposes that meaningful explanations are structured as a dialog between the explainer and the explainee, suggesting a more active role for the user and her communication with the agent. In this paper, we present ASQ-IT -- an interactive explanation system that presents video clips of the agent acting in its environment based on queries given by the user that describe temporal properties of behaviors of interest. Our approach is based on formal methods: queries in ASQ-IT's user interface map to a fragment of Linear Temporal Logic over finite traces (LTLf), which we developed, and our algorithm for query processing is based on automata theory. User studies show that end-users can understand and formulate queries in ASQ-IT and that using ASQ-IT assists users in identifying faulty agent behaviors.

Updated: 2025-04-07 18:00:50

Domains: cs.AI

Download: http://arxiv.org/abs/2504.05393v1

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio

Updated: 2025-04-07 18:00:00

Domains: cs.CV,cs.LG,cs.SD,eess.AS

Download: http://arxiv.org/abs/2412.15322v2

On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions

Understanding and mitigating biases is critical for the adoption of large language models (LLMs) in high-stakes decision-making. We introduce Admissions and Hiring, decision tasks with hypothetical applicant profiles where a person's race can be inferred from their name, as simplified test beds for racial bias. We show that Gemma 2B Instruct and LLaMA 3.2 3B Instruct exhibit strong biases. Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. We demonstrate that these biases are resistant to prompt engineering: multiple prompting strategies all fail to promote fairness. In contrast, using distributed alignment search, we can identify "race subspaces" within model activations and intervene on them to debias model decisions. Averaging the representation across all races within the subspaces reduces Gemma's bias by 37-57%. Finally, we examine the generalizability of Gemma's race subspaces, and find limited evidence for generalization, where changing the prompt format can affect the race representation. Our work suggests mechanistic approaches may provide a promising venue for improving the fairness of LLMs, but a universal race representation remains elusive.

Updated: 2025-04-07 17:59:58

Domains: cs.CY,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.06303v1

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers.

Updated: 2025-04-07 17:59:50

Domains: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2401.17377v4

URECA: Unique Region Caption Anything

Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on URECA dataset and generalizes well to existing region-level captioning benchmarks.

Updated: 2025-04-07 17:59:44

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.05305v1

Gaussian Mixture Flow Matching Models

Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, they underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models where a single Gaussian is learned with an $L_2$ denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256$\times$256.

Updated: 2025-04-07 17:59:42

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2504.05304v1

Dimension-Free Convergence of Diffusion Models for Approximate Gaussian Mixtures

Diffusion models are distinguished by their exceptional generative performance, particularly in producing high-quality samples through iterative denoising. While current theory suggests that the number of denoising steps required for accurate sample generation should scale linearly with data dimension, this does not reflect the practical efficiency of widely used algorithms like Denoising Diffusion Probabilistic Models (DDPMs). This paper investigates the effectiveness of diffusion models in sampling from complex high-dimensional distributions that can be well-approximated by Gaussian Mixture Models (GMMs). For these distributions, our main result shows that DDPM takes at most $\widetilde{O}(1/\varepsilon)$ iterations to attain an $\varepsilon$-accurate distribution in total variation (TV) distance, independent of both the ambient dimension $d$ and the number of components $K$, up to logarithmic factors. Furthermore, this result remains robust to score estimation errors. These findings highlight the remarkable effectiveness of diffusion models in high-dimensional settings given the universal approximation capability of GMMs, and provide theoretical insights into their practical success.

Updated: 2025-04-07 17:59:07

Domains: cs.LG,cs.NA,math.NA,math.ST,stat.ML,stat.TH

Download: http://arxiv.org/abs/2504.05300v1

SmolVLM: Redefining small and efficient multimodal models

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.

Updated: 2025-04-07 17:58:57

Domains: cs.AI,cs.CV

Download: http://arxiv.org/abs/2504.05299v1

Dion: A Communication-Efficient Optimizer for Large Models

Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead -- especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.
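
As a loose illustration of the orthonormalized-update idea, the sketch below keeps a device-local momentum buffer and applies the polar factor of the momentum matrix as the step. This is a hypothetical single-device simplification, not Dion itself: the paper's contribution is doing this communication-efficiently under sharding without reconstructing large matrices, and an SVD is used here purely for clarity.

```python
import numpy as np

def orthonormalize(m):
    """Polar factor of m via SVD: the closest (semi-)orthogonal matrix to m.
    An SVD is used for clarity; a shard-friendly scheme would avoid it."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

class SketchDion:
    """Hypothetical single-device sketch of an orthonormalized-update optimizer
    with a device-local momentum buffer (no gradient synchronization shown)."""
    def __init__(self, shape, lr=0.01, beta=0.9):
        self.m = np.zeros(shape)   # device-local momentum buffer
        self.lr, self.beta = lr, beta

    def step(self, w, grad):
        self.m = self.beta * self.m + grad         # heavy-ball momentum
        return w - self.lr * orthonormalize(self.m)

# One step on a toy 5x3 weight matrix.
rng = np.random.default_rng(0)
w = np.zeros((5, 3))
opt = SketchDion(w.shape)
w = opt.step(w, rng.standard_normal(w.shape))
```

Because only the momentum's orthonormal factor enters the step, every update has unit-scale singular values regardless of the raw gradient magnitude.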

Updated: 2025-04-07 17:49:37

Domains: cs.LG,cs.AI,math.OC

Download: http://arxiv.org/abs/2504.05295v1

EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design

Large Language Models (LLMs) have significantly advanced smart education in the Artificial General Intelligence (AGI) era. A promising application lies in the automatic generalization of instructional design for curriculum and learning activities, focusing on two key aspects: (1) Customized Generation: generating niche-targeted teaching content based on students' varying learning abilities and states, and (2) Intelligent Optimization: iteratively optimizing content based on feedback from learning effectiveness or test scores. Currently, a single large LLM cannot effectively manage the entire process, posing a challenge for designing intelligent teaching plans. To address these issues, we developed EduPlanner, an LLM-based multi-agent system comprising an evaluator agent, an optimizer agent, and a question analyst, working in adversarial collaboration to generate customized and intelligent instructional design for curriculum and learning activities. Taking mathematics lessons as our example, EduPlanner employs a novel Skill-Tree structure to accurately model the background mathematics knowledge of student groups, personalizing instructional design for curriculum and learning activities according to students' knowledge levels and learning abilities. Additionally, we introduce the CIDDP, an LLM-based five-dimensional evaluation module encompassing clarity, Integrity, Depth, Practicality, and Pertinence, to comprehensively assess mathematics lesson plan quality and bootstrap intelligent optimization. Experiments conducted on the GSM8K and Algebra datasets demonstrate that EduPlanner excels in evaluating and optimizing instructional design for curriculum and learning activities. Ablation studies further validate the significance and effectiveness of each component within the framework. Our code is publicly available at https://github.com/Zc0812/Edu_Planner

Updated: 2025-04-07 17:49:12

Domains: cs.AI

Download: http://arxiv.org/abs/2504.05370v1

A Formalisation of the Purpose Framework: the Autonomy-Alignment Problem in Open-Ended Learning Robots

The unprecedented advancement of artificial intelligence enables the development of increasingly autonomous robots. These robots hold significant potential, particularly in moving beyond engineered factory settings to operate in the unstructured environments inhabited by humans. However, this possibility also generates a relevant autonomy-alignment problem to ensure that robots' autonomous learning processes still focus on acquiring knowledge relevant to accomplish human practical purposes, while their behaviour still aligns with their broader purposes. The literature has only begun to address this problem, and a conceptual, terminological, and formal framework is still lacking. Here we address one of the most challenging instances of the problem: autonomous open-ended learning (OEL) robots, capable of cumulatively acquiring new skills and knowledge through direct interaction with the environment, guided by self-generated goals and intrinsic motivations. In particular, we propose a computational framework, first introduced qualitatively and then formalised, to support the design of OEL robot architectures that balance autonomy and control. The framework pivots on the novel concept of purpose. A human purpose specifies what humans (e.g., designers or users) want the robot to learn, do or not do, within a certain boundary of autonomy and independently of the domains in which it operates.The framework decomposes the autonomy-alignment problem into more tractable sub-problems: the alignment of `robot purposes' with human purposes, either by hardwiring or through learning; the arbitration between multiple purposes; the grounding of purposes into specific domain-dependent robot goals; and the competence acquisition needed to accomplish these goals. The framework and its potential utility are further elucidated through the discussion of hypothetical example scenarios framed within it.

Updated: 2025-04-07 17:46:43

Domains: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2403.02514v2

Learning Coarse-Grained Dynamics on Graph

We consider a Graph Neural Network (GNN) non-Markovian modeling framework to identify coarse-grained dynamical systems on graphs. Our main idea is to systematically determine the GNN architecture by inspecting how the leading term of the Mori-Zwanzig memory term depends on the coarse-grained interaction coefficients that encode the graph topology. Based on this analysis, we found that the appropriate GNN architecture that will account for $K$-hop dynamical interactions has to employ a Message Passing (MP) mechanism with at least $2K$ steps. We also deduce that the memory length required for an accurate closure model decreases as a function of the interaction strength under the assumption that the interaction strength exhibits a power law that decays as a function of the hop distance. Supporting numerical demonstrations on two examples, a heterogeneous Kuramoto oscillator model and a power system, suggest that the proposed GNN architecture can predict the coarse-grained dynamics under fixed and time-varying graph topologies.

Updated: 2025-04-07 17:44:58

Domains: math.NA,cond-mat.dis-nn,cs.LG,cs.NA

Download: http://arxiv.org/abs/2405.09324v2

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment

Inference-time computation offers a powerful axis for scaling the performance of language models. However, naively increasing computation in techniques like Best-of-N sampling can lead to performance degradation due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we focus on inference-time alignment, which we formalize as the problem of improving the quality of responses drawn from a pre-trained policy, given a prompt of interest and access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling: 1. We show that Best-of-$N$ alignment with an ideal choice for $N$ can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when $N$ is large, and fails to achieve tight guarantees under more realistic coverage conditions. 2. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with $N$, meaning it is scaling-monotonic. We complement our theoretical results with an experimental evaluation that demonstrate the benefits of $\texttt{InferenceTimePessimism}$ across a variety of tasks and models.
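
The Best-of-$N$ baseline analyzed in the abstract is simple to state in code. A minimal sketch (helper names are illustrative): draw $N$ candidates from the policy and keep the one the imperfect reward model scores highest — precisely the procedure that invites reward hacking as $N$ grows.

```python
def best_of_n(sample_fn, reward_fn, prompt, n):
    """Best-of-N alignment: draw n responses from the pre-trained policy and
    return the one the (imperfect) reward model ranks highest. For large n
    this over-optimizes the proxy reward (reward hacking)."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward_fn(prompt, resp))

# Toy usage: a fixed candidate stream and response length as a proxy reward.
stream = iter(["ok", "excellent detailed answer", "good"])
best = best_of_n(lambda p: next(stream), lambda p, r: len(r), "Q?", 3)
print(best)  # → excellent detailed answer
```

The paper's $\texttt{InferenceTimePessimism}$ replaces the final `max` with a pessimistic rejection-sampling rule so that performance no longer degrades as $n$ grows.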

Updated: 2025-04-07 17:44:38

Domains: cs.AI,cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.21878v2

Superhuman Game AI Disclosure: Expertise and Context Moderate Effects on Trust and Fairness

As artificial intelligence surpasses human performance in select tasks, disclosing superhuman capabilities poses distinct challenges for fairness, accountability, and trust. However, the impact of such disclosures on diverse user attitudes and behaviors remains unclear, particularly concerning potential negative reactions like discouragement or overreliance. This paper investigates these effects by utilizing Persona Cards: a validated, standardized set of synthetic personas designed to simulate diverse user reactions and fairness perspectives. We conducted an ethics board-approved study (N=32), utilizing these personas to investigate how capability disclosure influenced behaviors with a superhuman game AI in competitive StarCraft II scenarios. Our results reveal transparency is double-edged: while disclosure could alleviate suspicion, it also provoked frustration and strategic defeatism among novices in cooperative scenarios, as well as overreliance in competitive contexts. Experienced and competitive players interpreted disclosure as confirmation of an unbeatable opponent, shifting to suboptimal goals. We release the Persona Cards Dataset, including profiles, prompts, interaction logs, and protocols, to foster reproducible research into human alignment AI design. This work demonstrates that transparency is not a cure-all; successfully leveraging disclosure to enhance trust and accountability requires careful tailoring to user characteristics, domain norms, and specific fairness objectives.

Updated: 2025-04-07 17:39:10

Domains: cs.HC,cs.AI,cs.CL,cs.CY,cs.ET,K.4.1; K.4.3; H.5.2; H.5.1; I.2.7

Download: http://arxiv.org/abs/2503.15514v2

Understanding Virtual Nodes: Oversquashing and Node Heterogeneity

While message passing neural networks (MPNNs) have convincing success in a range of applications, they exhibit limitations such as the oversquashing problem and their inability to capture long-range interactions. Augmenting MPNNs with a virtual node (VN) removes the locality constraint of the layer aggregation and has been found to improve performance on a range of benchmarks. We provide a comprehensive theoretical analysis of the role of VNs and benefits thereof, through the lenses of oversquashing and sensitivity analysis. First, we characterize, precisely, how the improvement afforded by VNs on the mixing abilities of the network and hence in mitigating oversquashing, depends on the underlying topology. We then highlight that, unlike Graph-Transformers (GTs), classical instantiations of the VN are often constrained to assign uniform importance to different nodes. Consequently, we propose a variant of VN with the same computational complexity, which can have different sensitivity to nodes based on the graph structure. We show that this is an extremely effective and computationally efficient baseline for graph-level tasks.
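
The classical virtual node and the proposed non-uniform variant differ only in how the VN weights the nodes it reads from. A minimal numpy sketch under assumed sum aggregation (names are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def mp_layer_with_vn(h, adj, vn, w_nodes=None):
    """One message-passing step augmented with a virtual node (sketch).
    h: (n, d) node states; adj: (n, n) adjacency; vn: (d,) virtual-node state.
    w_nodes is the VN's per-node sensitivity: uniform weights recover the
    classical virtual node, while structure-dependent weights give the
    non-uniform variant at the same computational cost."""
    n = h.shape[0]
    if w_nodes is None:
        w_nodes = np.full(n, 1.0 / n)   # classical VN: uniform importance
    local = adj @ h                      # local neighbour aggregation
    vn_new = vn + w_nodes @ h            # VN reads a (weighted) global summary
    return local + vn_new, vn_new        # VN broadcasts back to every node

# Degree-based (structure-dependent) sensitivity, normalized to sum to one.
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
deg = adj.sum(1)
h_new, vn_state = mp_layer_with_vn(np.eye(3), adj, np.zeros(3), deg / deg.sum())
```

The broadcast term `vn_new` reaches every node in a single step, which is what removes the locality constraint and mitigates oversquashing.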

Updated: 2025-04-07 17:33:06

Domains: cs.LG

Download: http://arxiv.org/abs/2405.13526v3

Is Adversarial Training with Compressed Datasets Effective?

Dataset Condensation (DC) refers to the recent class of dataset compression methods that generate a smaller, synthetic, dataset from a larger dataset. This synthetic dataset aims to retain the essential information of the original dataset, enabling models trained on it to achieve performance levels comparable to those trained on the full dataset. Most current DC methods have mainly concerned with achieving high test performance with limited data budget, and have not directly addressed the question of adversarial robustness. In this work, we investigate the impact of adversarial robustness on models trained with compressed datasets. We show that the compressed datasets obtained from DC methods are not effective in transferring adversarial robustness to models. As a solution to improve dataset compression efficiency and adversarial robustness simultaneously, we present a robustness-aware dataset compression method based on finding the Minimal Finite Covering (MFC) of the dataset. The proposed method is (1) provably robust by minimizing the generalized adversarial loss, (2) more effective than DC methods when applying adversarial training over MFC, (3) obtained by a one-time computation and is applicable for any model.

Updated: 2025-04-07 17:31:31

Domains: cs.LG

Download: http://arxiv.org/abs/2402.05675v2

Covariant Gradient Descent

We present a manifestly covariant formulation of the gradient descent method, ensuring consistency across arbitrary coordinate systems and general curved trainable spaces. The optimization dynamics is defined using a covariant force vector and a covariant metric tensor, both computed from the first and second statistical moments of the gradients. These moments are estimated through time-averaging with an exponential weight function, which preserves linear computational complexity. We show that commonly used optimization methods such as RMSProp and Adam correspond to special limits of the covariant gradient descent (CGD) and demonstrate how these methods can be further generalized and improved.
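
The diagonal-metric limit mentioned above can be sketched directly. Assuming the covariant force is the exponentially time-averaged first moment and the metric is taken diagonal from the second moment, the update below reduces to (bias-uncorrected) Adam; feeding the raw gradient in place of the first moment recovers RMSProp. A hedged sketch, not the authors' implementation:

```python
import numpy as np

def cgd_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Diagonal-metric special case of covariant gradient descent (sketch).
    Moments are estimated by time-averaging with exponential weights, keeping
    the per-step cost linear in the number of parameters."""
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad       # first moment (force)
    state["m2"] = beta2 * state["m2"] + (1 - beta2) * grad ** 2  # second moment (metric)
    # Preconditioned step: inverse square-root metric applied elementwise.
    return w - lr * state["m1"] / (np.sqrt(state["m2"]) + eps)

state = {"m1": np.zeros(2), "m2": np.zeros(2)}
w = np.zeros(2)
for _ in range(3):
    w = cgd_step(w, np.array([1.0, -2.0]), state)
```

The general CGD formulation replaces the elementwise division with a full covariant metric tensor, which is where the generalizations beyond Adam and RMSProp enter.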

Updated: 2025-04-07 17:25:50

标题: 协变梯度下降

摘要: 我们提出了梯度下降方法的一种显式协变表述,确保在任意坐标系和一般弯曲的可训练空间中保持一致性。优化动力学由协变力向量和协变度规张量定义,两者均根据梯度的一阶和二阶统计矩计算得出。这些矩通过带指数权重函数的时间平均来估计,从而保持线性计算复杂度。我们展示了RMSProp和Adam等常用优化方法对应于协变梯度下降(CGD)的特殊极限,并展示了如何进一步推广和改进这些方法。

更新时间: 2025-04-07 17:25:50

领域: cs.LG

下载: http://arxiv.org/abs/2504.05279v1

The challenge of uncertainty quantification of large language models in medicine

This study investigates uncertainty quantification in large language models (LLMs) for medical applications, emphasizing both technical innovations and philosophical implications. As LLMs become integral to clinical decision-making, accurately communicating uncertainty is crucial for ensuring reliable, safe, and ethical AI-assisted healthcare. Our research frames uncertainty not as a barrier but as an essential part of knowledge that invites a dynamic and reflective approach to AI design. By integrating advanced probabilistic methods such as Bayesian inference, deep ensembles, and Monte Carlo dropout with linguistic analysis that computes predictive and semantic entropy, we propose a comprehensive framework that manages both epistemic and aleatoric uncertainties. The framework incorporates surrogate modeling to address limitations of proprietary APIs, multi-source data integration for better context, and dynamic calibration via continual and meta-learning. Explainability is embedded through uncertainty maps and confidence metrics to support user trust and clinical interpretability. Our approach supports transparent and ethical decision-making aligned with Responsible and Reflective AI principles. Philosophically, we advocate accepting controlled ambiguity instead of striving for absolute predictability, recognizing the inherent provisionality of medical knowledge.
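Two of the quantities the framework combines — predictive entropy over Monte Carlo dropout (or ensemble) samples, and its decomposition into aleatoric and epistemic parts — can be computed directly from sampled class-probability vectors. A minimal sketch (function names are mine; the paper's semantic-entropy component over free-text answers is not shown):

```python
import math

def predictive_entropy(prob_samples):
    """Entropy of the mean distribution over stochastic forward passes.

    `prob_samples` is a list of probability vectors, e.g. from several
    Monte Carlo dropout passes; this captures *total* uncertainty.
    """
    n = len(prob_samples)
    mean = [sum(p[k] for p in prob_samples) / n
            for k in range(len(prob_samples[0]))]
    return -sum(p * math.log(p) for p in mean if p > 0)

def expected_entropy(prob_samples):
    """Mean per-sample entropy: the aleatoric part. The gap
    predictive - expected approximates epistemic uncertainty
    (the mutual information between prediction and parameters)."""
    def h(p):
        return -sum(x * math.log(x) for x in p if x > 0)
    return sum(h(p) for p in prob_samples) / len(prob_samples)
```

When the passes confidently disagree (e.g. `[1, 0]` vs `[0, 1]`), expected entropy is zero but predictive entropy is ln 2, so the whole gap is epistemic — exactly the situation a clinical system should flag.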

Updated: 2025-04-07 17:24:11

标题: 在医学领域中大型语言模型不确定性量化的挑战

摘要: 这项研究调查了大型语言模型(LLMs)在医疗应用中的不确定性量化问题,强调技术创新和哲学影响。随着LLMs成为临床决策的重要组成部分,准确传达不确定性对于确保可靠、安全和道德的AI辅助医疗至关重要。我们的研究将不确定性框定为知识的一个必要部分,而不是一个障碍,这种方法鼓励对AI设计采取动态和反思的态度。通过将先进的概率方法(如贝叶斯推断、深度集成和蒙特卡洛dropout)与计算预测熵和语义熵的语言分析相结合,我们提出了一个同时处理认知(epistemic)和偶然(aleatoric)不确定性的全面框架。该框架结合了代理建模以解决专有API的限制,多源数据整合以提供更好的上下文,并通过持续学习和元学习进行动态校准。通过不确定性地图和置信度指标嵌入可解释性,以支持用户信任和临床可解释性。我们的方法支持与负责任和反思AI原则一致的透明和道德决策。在哲学上,我们主张接受受控的模糊性,而不是追求绝对的可预测性,并承认医学知识固有的临时性。

更新时间: 2025-04-07 17:24:11

领域: cs.AI

下载: http://arxiv.org/abs/2504.05278v1

Feature Selection for Latent Factor Models

Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the 'curse of dimensionality,' and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach has theoretical true feature recovery guarantees under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.
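The paper's SNR criterion is tied to its low-rank class models, which the abstract does not detail; as an illustration of a per-class signal-to-noise score in its most common form (mean separation over pooled spread, computed separately for each class), here is a sketch with hypothetical names:

```python
import statistics

def snr_scores(X, y, cls):
    """Per-feature SNR for one class, selecting features per class.

    SNR_j = |mu_in - mu_out| / (sigma_in + sigma_out), where "in" are
    samples of class `cls` and "out" the rest. This is a common form;
    the paper's exact criterion derives from its low-rank class models.
    """
    inside = [x for x, label in zip(X, y) if label == cls]
    outside = [x for x, label in zip(X, y) if label != cls]
    scores = []
    for j in range(len(X[0])):
        a = [x[j] for x in inside]
        b = [x[j] for x in outside]
        num = abs(statistics.mean(a) - statistics.mean(b))
        den = statistics.pstdev(a) + statistics.pstdev(b) or 1e-12
        scores.append(num / den)
    return scores
```

Ranking features by this score per class and keeping the top-k gives a class-specific selected set, in contrast to traditional methods that pool all classes into one ranking.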

Updated: 2025-04-07 17:23:13

标题: 潜在因素模型的特征选择

摘要: 特征选择对于在高维数据集中准确定位相关特征、缓解“维度灾难”并提高机器学习性能至关重要。传统的分类特征选择方法使用所有类别的数据来为每个类别选择特征。本文探讨了一种基于低秩生成方法的类模型的特征选择方法,针对每个类别分别选择特征,并引入了信噪比(SNR)特征选择标准。这种新颖的方法在一定假设下具有理论上的真实特征恢复保证,并在标准分类数据集上表现出优于一些现有特征选择方法的性能。

更新时间: 2025-04-07 17:23:13

领域: cs.LG,stat.AP

下载: http://arxiv.org/abs/2412.10128v2

Aggregating time-series and image data: functors and double functors

Aggregation of time-series or image data over subsets of the domain is a fundamental task in data science. We show that many known aggregation operations can be interpreted as (double) functors on appropriate (double) categories. Such functorial aggregations are amenable to parallel implementation via straightforward extensions of Blelloch's parallel scan algorithm. In addition to providing a unified viewpoint on existing operations, it allows us to propose new aggregation operations for time-series and image data.
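The reason functorial aggregations parallelize via Blelloch's algorithm is that they reduce to a prefix scan with an associative operation. The sequential sketch below states the contract any such scan must satisfy; a parallel implementation splits the input, scans chunks independently, and combines partial results with the same `op` (associativity is what makes that valid):

```python
def scan(xs, op):
    """Inclusive scan (prefix aggregation) with any associative op.

    out[i] = xs[0] op xs[1] op ... op xs[i]. Sum, max, min, and matrix
    multiplication are all valid ops; this is the common shape behind
    time-series aggregation over growing windows.
    """
    out = []
    acc = None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out
```

For example, `scan(prices, max)` yields the running maximum of a time series in one pass.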

Updated: 2025-04-07 17:12:20

标题: 聚合时间序列和图像数据:函子和双函子

摘要: 时间序列或图像数据在域的子集上的聚合是数据科学中的一个基本任务。我们展示许多已知的聚合操作可以被解释为适当(双重)范畴上的(双重)函子。这种函子聚合可以通过Blelloch的并行扫描算法的直接扩展进行并行实现。除了提供现有操作的统一视角外,它还使我们能够为时间序列和图像数据提出新的聚合操作。

更新时间: 2025-04-07 17:12:20

领域: math.CT,cs.LG,18D05 68W10

下载: http://arxiv.org/abs/2504.05274v1

MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators

A critical approach for efficiently deploying Mixture-of-Experts (MoE) models with massive parameters is quantization. However, state-of-the-art MoE models suffer from non-negligible accuracy loss with extreme quantization, such as under 4 bits. To address this, we introduce MiLo, a novel method that augments highly quantized MoEs with a mixture of low-rank compensators. These compensators consume only a small amount of additional memory but significantly recover accuracy loss from extreme quantization. MiLo also identifies that MoE models exhibit distinctive characteristics across weights due to their hybrid dense-sparse architectures, and employs adaptive rank selection policies along with iterative optimizations to close the accuracy gap. MiLo does not rely on calibration data, allowing it to generalize to different MoE models and datasets without overfitting to a calibration set. To avoid the hardware inefficiencies of extreme quantization, such as 3-bit, MiLo develops Tensor Core-friendly 3-bit kernels, enabling measured latency speedups on 3-bit quantized MoE models. Our evaluation shows that MiLo outperforms existing methods on SoTA MoE models across various tasks.
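The core idea — store a heavily quantized weight plus a small low-rank term that absorbs the quantization residual, W ≈ Q(W) + uvᵀ — can be sketched in a few lines. This toy version uses uniform symmetric quantization and a rank-1 compensator found by power iteration; MiLo's adaptive rank selection, iterative optimization, and Tensor Core kernels are not shown, and all names are mine:

```python
def quantize(w, bits=3):
    """Uniform symmetric fake-quantization of a matrix (list of rows)."""
    flat = [x for row in w for x in row]
    levels = 2 ** (bits - 1) - 1
    scale = (max(abs(x) for x in flat) / levels) or 1.0
    return [[round(x / scale) * scale for x in row] for row in w]

def rank1_compensator(resid, iters=100):
    """Rank-1 factor of the quantization residual via power iteration:
    resid ~ u v^T, stored as two small vectors instead of a full matrix."""
    n, m = len(resid), len(resid[0])
    v = [1.0] * m
    for _ in range(iters):
        u = [sum(resid[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm = sum(x * x for x in u) ** 0.5 or 1.0
        u = [x / norm for x in u]
        v = [sum(resid[i][j] * u[i] for i in range(n)) for j in range(m)]
    return u, v  # reconstruct the compensator entry-wise as u[i] * v[j]
```

At inference one would compute `x @ Q(W)` with fast low-bit kernels and add the cheap `(x @ u) * v` correction, which is where the memory savings come from.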

Updated: 2025-04-07 17:09:26

标题: MiLo: 高效的量化MoE推断与低秩补偿器混合

摘要: 量化是高效部署参数庞大的专家混合(MoE)模型的一种关键方法。然而,最先进的MoE模型在极端量化(如低于4位)时存在不可忽略的精度损失。为解决这一问题,我们引入了MiLo,一种通过混合低秩补偿器来增强高度量化MoE模型的新方法。这些补偿器仅消耗少量额外内存,但能显著恢复极端量化造成的精度损失。MiLo还发现,由于MoE模型具有混合稠密-稀疏结构,其权重表现出独特的特征,因此采用自适应秩选择策略和迭代优化来缩小精度差距。MiLo不依赖校准数据,使其能够推广到不同的MoE模型和数据集,而不会过拟合校准集。为避免极端量化(如3位)的硬件低效问题,MiLo开发了适用于Tensor Core的3位内核,在3位量化MoE模型上实现了实测延迟加速。我们的评估表明,在各种任务上,MiLo应用于SoTA MoE模型时均优于现有方法。

更新时间: 2025-04-07 17:09:26

领域: cs.LG

下载: http://arxiv.org/abs/2504.02658v2

AnomalousNet: A Hybrid Approach with Attention U-Nets and Change Point Detection for Accurate Characterization of Anomalous Diffusion in Video Data

Anomalous diffusion occurs in a wide range of systems, including protein transport within cells, animal movement in complex habitats, pollutant dispersion in groundwater, and nanoparticle motion in synthetic materials. Accurately estimating the anomalous diffusion exponent and the diffusion coefficient from the particle trajectories is essential to distinguish between sub-diffusive, super-diffusive, or normal diffusion regimes. These estimates provide a deeper insight into the underlying dynamics of the system, facilitating the identification of particle behaviors and the detection of changes in diffusion states. However, analyzing short and noisy video data, which often yield incomplete and heterogeneous trajectories, poses a significant challenge for traditional statistical approaches. We introduce a data-driven method that integrates particle tracking, an attention U-Net architecture, and a change-point detection algorithm to address these issues. This approach not only infers the anomalous diffusion parameters with high accuracy but also identifies temporal transitions between different states, even in the presence of noise and limited temporal resolution. Our methodology demonstrated strong performance in the 2nd Anomalous Diffusion (AnDi) Challenge benchmark within the top submissions for video tasks.

Updated: 2025-04-07 17:08:17

标题: AnomalousNet:一种混合方法,结合注意力U-Net和变点检测,用于准确表征视频数据中的异常扩散

摘要: 异常扩散发生在各种系统中,包括细胞内蛋白质运输、复杂栖息地中动物运动、地下水中污染物扩散以及合成材料中纳米粒子运动。从粒子轨迹中准确估计异常扩散指数和扩散系数对于区分亚扩散、超扩散或正常扩散机制至关重要。这些估计提供了对系统底层动力学的更深入的洞察,有助于识别粒子行为并检测扩散状态的变化。然而,短且嘈杂的视频数据通常产生不完整和异质的轨迹,其分析对传统统计方法构成了重大挑战。我们引入了一种数据驱动方法,整合了粒子跟踪、注意力U-Net架构和变点检测算法,以解决这些问题。这种方法不仅能够高精度地推断异常扩散参数,还能够识别不同状态之间的时间转换,即使在存在噪声和时间分辨率有限的情况下也能做到。我们的方法在第二届异常扩散(AnDi)挑战赛基准中表现出色,位居视频任务的顶级提交之列。

更新时间: 2025-04-07 17:08:17

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2504.05271v1

FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representation of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This represents the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.

Updated: 2025-04-07 17:03:03

标题: FetalCLIP:胎儿超声图像分析的视觉语言基础模型

摘要: 基础模型在医疗领域变得越来越有效,提供了在大型数据集上预训练的模型,可以轻松地为下游任务进行调整。尽管取得了进展,胎儿超声图像由于其固有的复杂性,仍然是基础模型的一个具有挑战性的领域,通常需要大量的额外训练,并且由于配对多模态数据的稀缺而面临限制。为了克服这些挑战,我们引入了FetalCLIP,这是一个能够生成胎儿超声图像通用表示的视觉语言基础模型。FetalCLIP在一个包含210,035张胎儿超声图像与文本配对的多样化数据集上使用多模态学习方法进行了预训练。这是迄今为止用于基础模型开发的同类中最大的配对数据集。这种独特的训练方法使FetalCLIP能够有效地学习胎儿超声图像中存在的复杂解剖特征,从而产生可用于各种下游应用的鲁棒表示。在一系列关键胎儿超声应用(包括分类、孕龄估计、先天性心脏病(CHD)检测和胎儿结构分割)的广泛基准测试中,FetalCLIP表现优于所有基线,同时展现出出色的泛化能力,即使在标记数据有限的情况下也具有强大性能。我们计划公开发布FetalCLIP模型,以造福更广泛的科学界。

更新时间: 2025-04-07 17:03:03

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2502.14807v2

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk.

Updated: 2025-04-07 16:52:52

标题: 如何评估LLM代理的控制措施?从今天到超智能的轨迹

摘要: 随着LLM代理在自主造成危害方面的能力不断增强,AI开发者将依赖越来越复杂的控制措施来防止可能未对齐的代理造成危害。AI开发者可以通过运行控制评估来证明其控制措施是足够的:在这种测试演练中,红队构造试图颠覆控制措施的代理。为了确保控制评估准确捕捉未对齐风险,授予红队的权限(affordances)应与将要在控制措施下部署的代理的能力特征相适应。 在本文中,我们提出了一个系统框架,用于使红队权限适应不断进步的AI能力。我们并不假设代理将始终执行人类已知的最佳攻击策略,而是展示了关于代理实际能力特征的知识如何为成比例的控制评估提供信息,从而得到更实用且更具成本效益的控制措施。我们通过考虑五个能力逐步提升的虚构模型(M1-M5),定义了五个不同的AI控制级别(ACL),来说明我们的框架。对于每个ACL,我们提供了可能适用的控制评估规则、控制措施和安全论证的示例。最后,我们说明了为什么为超智能LLM代理构建令人信服的AI控制安全论证将需要研究突破,并强调我们最终可能需要其他方法来减轻未对齐风险。

更新时间: 2025-04-07 16:52:52

领域: cs.AI,cs.CR

下载: http://arxiv.org/abs/2504.05259v1

Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models

Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.

Updated: 2025-04-07 16:51:45

标题: 学习随时间推理:时间轴自我反思,提高语言模型的时间推理能力

摘要: 大型语言模型(LLMs)已经成为生成连贯文本、理解背景和执行推理任务的强大工具。然而,它们在时间推理方面存在困难,时间推理需要处理与时间相关的信息,如事件排序、持续时间和时间间隔关系。这些能力对于问题回答、调度和历史分析等应用至关重要。在本文中,我们介绍了TISER,这是一个通过将时间线构建与迭代自我反思相结合的多阶段过程,以增强LLMs的时间推理能力的新框架。我们的方法利用测试时间缩放来延长推理过程的长度,使模型能更有效地捕捉复杂的时间依赖关系。这种策略不仅提高了推理准确性,还改善了推理过程的可追溯性。实验结果表明,在多个基准测试中表现出最先进的性能,包括超出分布测试集,并且TISER使较小的开源模型在具有挑战性的时间推理任务上超越了更大的闭权重模型。

更新时间: 2025-04-07 16:51:45

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2504.05258v1

A Managed Tokens Service for Securely Keeping and Distributing Grid Tokens

Fermilab is transitioning authentication and authorization for grid operations to using bearer tokens based on the WLCG Common JWT (JSON Web Token) Profile. One of the functionalities that Fermilab experimenters rely on is the ability to automate batch job submission, which in turn depends on the ability to securely refresh and distribute the necessary credentials to experiment job submit points. Thus, with the transition to using tokens for grid operations, we needed to create a service that would obtain, refresh, and distribute tokens for experimenters' use. This service would avoid the need for experimenters to be experts in obtaining their own tokens and would better protect the most sensitive long-lived credentials. Further, the service needed to be widely scalable, as Fermilab hosts many experiments, each of which would need their own credentials. To address these issues, we created and deployed a Managed Tokens service. The service is written in Go, taking advantage of that language's native concurrency primitives to easily be able to scale operations as we onboard experiments. The service uses as its first credentials a set of kerberos keytabs, stored on the same secure machine that the Managed Tokens service runs on. These kerberos credentials allow the service to use htgettoken via condor_vault_storer to store vault tokens in the HTCondor credential managers (credds) that run on the batch system scheduler machines (HTCondor schedds); as well as downloading a local, shorter-lived copy of the vault token. The kerberos credentials are then also used to distribute copies of the locally-stored vault tokens to experiment submit points.

Updated: 2025-04-07 16:50:29

标题: 一个用于安全保存和分发网格令牌的管理令牌服务

摘要: Fermilab正在将网格操作的身份验证和授权转换为基于WLCG通用JWT(JSON Web Token)配置文件的持有者令牌。 Fermilab实验者依赖的功能之一是能够自动化批处理作业提交,这又取决于安全地刷新和分发实验作业提交点所需的凭据的能力。因此,随着转向使用持有者令牌进行网格操作,我们需要创建一个服务,用于获取、刷新和分发实验者使用的令牌。该服务将避免实验者需要成为获取自己令牌的专家,并且将更好地保护最敏感的长期凭据。此外,该服务需要具有广泛的可扩展性,因为Fermilab托管许多实验,每个实验都需要自己的凭据。为了解决这些问题,我们创建并部署了一个托管令牌服务。该服务是用Go语言编写的,利用该语言的本机并发原语,以便在加入实验时能够轻松扩展操作。该服务首先使用一组存储在托管令牌服务运行的安全机器上的kerberos keytabs作为其第一凭据。这些kerberos凭据允许服务使用htgettoken通过condor_vault_storer将vault令牌存储在运行在批处理系统调度程序机器(HTCondor schedds)上的HTCondor凭据管理器(credds)中;以及下载本地、较短寿命的vault令牌副本。然后,kerberos凭据还用于将本地存储的vault令牌副本分发到实验提交点。

更新时间: 2025-04-07 16:50:29

领域: cs.CR,physics.ins-det

下载: http://arxiv.org/abs/2503.19768v2

Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?

This article surveys evaluation models that automatically detect hallucinations in Retrieval-Augmented Generation (RAG), and presents a comprehensive benchmark of their performance across six RAG applications. Methods included in our study are: LLM-as-a-Judge, Prometheus, Lynx, the Hughes Hallucination Evaluation Model (HHEM), and the Trustworthy Language Model (TLM). These approaches are all reference-free, requiring no ground-truth answers/labels to catch incorrect LLM responses. Our study reveals that, across diverse RAG applications, some of these approaches consistently detect incorrect RAG responses with high precision/recall.

Updated: 2025-04-07 16:49:15

标题: RAG的实时评估模型:谁最擅长检测幻觉?

摘要: 本文综述了用于自动检测检索增强生成(RAG)中幻觉的评估模型,并在六个RAG应用上给出了它们性能的综合基准。我们研究中包括的方法有:LLM-as-a-Judge、Prometheus、Lynx、Hughes幻觉评估模型(HHEM)和可信语言模型(TLM)。这些方法都是无参考的,不需要真实答案/标签即可捕捉不正确的LLM响应。我们的研究表明,在各种RAG应用中,其中一些方法能够始终以高精确率/召回率检测到不正确的RAG响应。

更新时间: 2025-04-07 16:49:15

领域: cs.LG

下载: http://arxiv.org/abs/2503.21157v3

Explaining Low Perception Model Competency with High-Competency Counterfactuals

There exist many methods to explain how an image classification model generates its decision, but very little work has explored methods to explain why a classifier might lack confidence in its prediction. As there are various reasons the classifier might lose confidence, it would be valuable for this model to not only indicate its level of uncertainty but also explain why it is uncertain. Counterfactual images have been used to visualize changes that could be made to an image to generate a different classification decision. In this work, we explore the use of counterfactuals to offer an explanation for low model competency--a generalized form of predictive uncertainty that measures confidence. Toward this end, we develop five novel methods to generate high-competency counterfactual images, namely Image Gradient Descent (IGD), Feature Gradient Descent (FGD), Autoencoder Reconstruction (Reco), Latent Gradient Descent (LGD), and Latent Nearest Neighbors (LNN). We evaluate these methods across two unique datasets containing images with six known causes for low model competency and find Reco, LGD, and LNN to be the most promising methods for counterfactual generation. We further evaluate how these three methods can be utilized by pre-trained Multimodal Large Language Models (MLLMs) to generate language explanations for low model competency. We find that the inclusion of a counterfactual image in the language model query greatly increases the ability of the model to generate an accurate explanation for the cause of low model competency, thus demonstrating the utility of counterfactual images in explaining low perception model competency.
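Of the five generation methods, Latent Gradient Descent (LGD) has the simplest shape: ascend a competency estimator's score in the latent space of an autoencoder, then decode the result as the high-competency counterfactual. The sketch below shows only that inner loop with a caller-supplied gradient; the decoder, the actual competency estimator, and all names are assumptions:

```python
def latent_gradient_descent(z0, competency_grad, lr=0.1, steps=100):
    """Sketch of LGD: ascend a (hypothetical) competency score in latent
    space. `competency_grad(z)` returns the gradient of the score w.r.t.
    the latent vector; the final z would be decoded into an image to
    serve as the high-competency counterfactual. No decoder is attached
    here -- this only illustrates the optimization step.
    """
    z = list(z0)
    for _ in range(steps):
        g = competency_grad(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z
```

Comparing the decoded counterfactual to the original image is what gives the MLLM a concrete "what would make the model confident" signal to explain.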

Updated: 2025-04-07 16:46:52

标题: 用高能力反事实解释低感知模型能力

摘要: 存在许多方法来解释图像分类模型如何生成其决策,但很少有研究探讨了解释分类器为何可能缺乏对其预测的信心的方法。由于分类器可能失去信心的各种原因,对于该模型不仅指示其不确定性水平而且解释为什么不确定将是有价值的。反事实图像已被用来可视化可以对图像进行的更改以生成不同的分类决策。在这项工作中,我们探索了使用反事实来解释低模型能力(一种衡量信心的泛化预测不确定性的形式)。为此,我们开发了五种新方法来生成高能力的反事实图像,即图像梯度下降(IGD)、特征梯度下降(FGD)、自编码器重建(Reco)、潜在梯度下降(LGD)和潜在最近邻(LNN)。我们在包含六种导致低模型能力的已知原因的图像的两个独特数据集中评估了这些方法,并发现Reco、LGD和LNN是最有前途的反事实生成方法。我们进一步评估了这三种方法如何被预训练的多模式大语言模型(MLLMs)利用来为低模型能力生成语言解释。我们发现,在语言模型查询中包含反事实图像极大地提高了模型生成关于低模型能力原因的准确解释的能力,从而展示了反事实图像在解释低感知模型能力方面的实用性。

更新时间: 2025-04-07 16:46:52

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.05254v1

Adversarial KA

Regarding the representation theorem of Kolmogorov and Arnold (KA) as an algorithm for representing or {\guillemotleft}expressing{\guillemotright} functions, we test its robustness by analyzing its ability to withstand adversarial attacks. We find KA to be robust to countable collections of continuous adversaries, but unearth a question about the equi-continuity of the outer functions that, so far, obstructs taking limits and defeating continuous groups of adversaries. This question on the regularity of the outer functions is relevant to the debate over the applicability of KA to the general theory of NNs.

Updated: 2025-04-07 16:46:52

标题: 对抗性KA

摘要: 将Kolmogorov和Arnold(KA)的表示定理视为一种表示或“表达”函数的算法,我们通过分析其抵抗对抗攻击的能力来检验其稳健性。我们发现KA对可数的连续对手集合具有稳健性,但发现了一个关于外部函数等度连续性的问题,迄今为止,这一问题阻碍了取极限并击败连续的对手群。这个关于外部函数正则性的问题与关于KA是否适用于神经网络一般理论的争论相关。

更新时间: 2025-04-07 16:46:52

领域: cs.LG,cs.AI,math.FA

下载: http://arxiv.org/abs/2504.05255v1

PEAKS: Selecting Key Training Examples Incrementally via Prediction Error Anchored by Kernel Similarity

As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets.
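The abstract names the two ingredients of the PEAKS score — the sample's prediction error and its geometric relationship to what is already selected — without giving the formula. The sketch below combines them in one plausible way (error discounted by RBF-kernel redundancy against the selected set, so the stream keeps hard *and* diverse examples); the exact PEAKS criterion differs, and all names and the γ parameter are assumptions:

```python
import math

def peaks_like_score(feat, pred_error, selected_feats, gamma=1.0):
    """Illustrative incremental-selection score: prediction error scaled
    down by the candidate's maximum RBF similarity to already-selected
    feature vectors. A score of 0 means the candidate is redundant;
    keep the highest-scoring arrivals until the budget is full.
    """
    def rbf(a, b):
        return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))
    redundancy = max((rbf(feat, s) for s in selected_feats), default=0.0)
    return pred_error * (1.0 - redundancy)
```

In the streaming (IDS) setting one would score each arriving sample against the current buffer and evict the lowest-scoring entry when the predefined budget is exceeded.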

Updated: 2025-04-07 16:42:09

标题: PEAKS:通过核相似度锚定的预测误差逐步选择关键训练示例

摘要: 随着深度学习在不断扩大的数据集的推动下,理解哪些示例对于泛化最为重要已经成为一个关键问题。虽然在数据选择方面取得了进展,但新兴应用需要在动态环境中研究这个问题。为了弥补这一差距,我们提出了增量数据选择(IDS)问题,其中示例以连续流的形式到达,并且需要在没有完整数据源的情况下进行选择。在这种情境下,学习器必须在同时学习基础任务的同时逐步构建一个预定义大小的训练数据集。我们发现在IDS中,新样本对模型状态的影响基本上取决于其在特征空间中的几何关系和预测误差。利用这一观点,我们提出了基于核相似性锚定预测误差(PEAKS)的高效数据选择方法,专为IDS量身定制。我们的全面评估表明,PEAKS始终优于现有的选择策略。此外,随着训练数据规模在真实数据集上的增长,PEAKS的性能回报比随机选择更好。

更新时间: 2025-04-07 16:42:09

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2504.05250v1

Texture2LoD3: Enabling LoD3 Building Reconstruction With Panoramic Images

Despite recent advancements in surface reconstruction, Level of Detail (LoD) 3 building reconstruction remains an unresolved challenge. The main issue pertains to the object-oriented modelling paradigm, which requires georeferencing, watertight geometry, facade semantics, and low-poly representation -- Contrasting unstructured mesh-oriented models. In Texture2LoD3, we introduce a novel method leveraging the ubiquity of 3D building model priors and panoramic street-level images, enabling the reconstruction of LoD3 building models. We observe that prior low-detail building models can serve as valid planar targets for ortho-rectifying street-level panoramic images. Moreover, deploying segmentation on accurately textured low-level building surfaces supports maintaining essential georeferencing, watertight geometry, and low-poly representation for LoD3 reconstruction. In the absence of LoD3 validation data, we additionally introduce the ReLoD3 dataset, on which we experimentally demonstrate that our method leads to improved facade segmentation accuracy by 11% and can replace costly manual projections. We believe that Texture2LoD3 can scale the adoption of LoD3 models, opening applications in estimating building solar potential or enhancing autonomous driving simulations. The project website, code, and data are available here: https://wenzhaotang.github.io/Texture2LoD3/.

Updated: 2025-04-07 16:40:16

标题: Texture2LoD3:利用全景图像实现LoD3建筑重建

摘要: 尽管表面重建近年来取得了进展,但细节级别(LoD)3建筑重建仍然是一个未解决的挑战。主要问题涉及到面向对象建模范式,需要地理参考、封闭几何形状、立面语义和低多边形表示——与非结构化网格导向模型形成对比。在Texture2LoD3中,我们引入了一种新颖的方法,利用3D建筑模型先验和全景街道级别图像的普及性,实现了LoD3建筑模型的重建。我们观察到,先前的低细节建筑模型可以作为正交矫正街道级全景图像的有效平面目标。此外,对准确纹理化的低水平建筑表面进行分割有助于维持LoD3重建所需的基本地理参考、封闭几何形状和低多边形表示。在缺乏LoD3验证数据的情况下,我们另外引入了ReLoD3数据集,通过实验证明我们的方法可以将立面分割准确性提高11%,并可以替代昂贵的手动投影。我们相信Texture2LoD3可以推动LoD3模型的采用,开启在估算建筑太阳能潜力或增强自动驾驶模拟方面的应用。项目网站、代码和数据可在此处获得:https://wenzhaotang.github.io/Texture2LoD3/。

更新时间: 2025-04-07 16:40:16

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2504.05249v1

PINNverse: Accurate parameter estimation in differential equations from noisy data with constrained physics-informed neural networks

Parameter estimation for differential equations from measured data is an inverse problem prevalent across quantitative sciences. Physics-Informed Neural Networks (PINNs) have emerged as effective tools for solving such problems, especially with sparse measurements and incomplete system information. However, PINNs face convergence issues, stability problems, overfitting, and complex loss function design. Here we introduce PINNverse, a training paradigm that addresses these limitations by reformulating the learning process as a constrained differential optimization problem. This approach achieves a dynamic balance between data loss and differential equation residual loss during training while preventing overfitting. PINNverse combines the advantages of PINNs with the Modified Differential Method of Multipliers to enable convergence on any point on the Pareto front. We demonstrate robust and accurate parameter estimation from noisy data in four classical ODE and PDE models from physics and biology. Our method enables accurate parameter inference also when the forward problem is expensive to solve.
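The balancing mechanism the abstract describes — treating the equation residual as a constraint rather than a weighted loss term — follows the saddle-point dynamics of the (Modified) Differential Method of Multipliers: descend the parameters on an augmented Lagrangian, ascend the multiplier on the constraint. A scalar toy sketch under that reading (the paper's exact modification and the PDE-residual constraint are not reproduced; names are mine):

```python
def mdmm_step(theta, lam, data_grad, g, g_grad, lr=0.05, lr_lam=0.05, c=1.0):
    """One step of multiplier-method saddle dynamics:
    descend theta on  L = data_loss + lam * g + (c/2) * g^2  (the c-term
    is the damping that makes the dynamics converge), then ascend the
    multiplier lam on the constraint value g. In PINNverse, g would be
    the differential-equation residual loss.
    """
    gv = g(theta)
    new_theta = [t - lr * (d + (lam + c * gv) * k)
                 for t, d, k in zip(theta, data_grad(theta), g_grad(theta))]
    new_lam = lam + lr_lam * g(new_theta)
    return new_theta, new_lam
```

Unlike a fixed weighting of data and residual losses, the multiplier grows until the constraint is actually satisfied, which is what lets training land on a chosen point of the Pareto front.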

Updated: 2025-04-07 16:34:57

标题: PINNverse:使用受限物理信息神经网络从嘈杂数据中准确估计微分方程参数

摘要: 从测量数据中估计微分方程的参数是定量科学中普遍存在的一个反问题。物理信息神经网络(PINNs)已经成为解决这类问题的有效工具,特别是在测量稀疏和系统信息不完整的情况下。然而,PINNs面临收敛问题、稳定性问题、过拟合和复杂的损失函数设计等挑战。在这里,我们介绍了PINNverse,这是一种通过将学习过程重新表述为受约束的微分优化问题来解决这些限制的训练范式。这种方法在训练过程中实现了数据损失和微分方程残差损失之间的动态平衡,同时防止过拟合。PINNverse将PINNs的优势与修正微分乘子法(Modified Differential Method of Multipliers)相结合,从而能够收敛到帕累托前沿上的任意点。我们在物理学和生物学中的四个经典ODE和PDE模型上展示了对嘈杂数据的稳健且准确的参数估计。即使正向问题求解成本高昂,我们的方法也能进行准确的参数推断。

更新时间: 2025-04-07 16:34:57

领域: cs.LG,cs.AI,physics.comp-ph

下载: http://arxiv.org/abs/2504.05248v1

Embedded Federated Feature Selection with Dynamic Sparse Training: Balancing Accuracy-Cost Tradeoffs

Federated Learning (FL) enables multiple resource-constrained edge devices with varying levels of heterogeneity to collaboratively train a global model. However, devices with limited capacity can create bottlenecks and slow down model convergence. One effective approach to addressing this issue is to use an efficient feature selection method, which reduces overall resource demands by minimizing communication and computation costs, thereby mitigating the impact of struggling nodes. Existing federated feature selection (FFS) methods are either considered as a separate step from FL or rely on a third party. These approaches increase computation and communication overhead, making them impractical for real-world high-dimensional datasets. To address this, we present \textit{Dynamic Sparse Federated Feature Selection} (DSFFS), the first innovative embedded FFS that is efficient in both communication and computation. In the proposed method, feature selection occurs simultaneously with model training. During training, input-layer neurons, their connections, and hidden-layer connections are dynamically pruned and regrown, eliminating uninformative features. This process enhances computational efficiency on devices, improves network communication efficiency, and boosts global model performance. Several experiments are conducted on nine real-world datasets of varying dimensionality from diverse domains, including biology, image, speech, and text. The results under a realistic non-iid data distribution setting show that our approach achieves a better trade-off between accuracy, computation, and communication costs by selecting more informative features compared to other state-of-the-art FFS methods.
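The prune-and-regrow mechanic the abstract describes is the standard dynamic sparse training update: drop the weakest active connections, regrow the same number elsewhere so sparsity stays fixed; features whose input-layer connections all die are effectively deselected. A generic sketch (DSFFS's actual prune/regrow criteria and federated aggregation are not shown; random regrowth and all names are assumptions — gradient-based regrowth as in RigL-style methods is the usual refinement):

```python
import random

def prune_and_regrow(weights, prune_frac=0.2, rng=None):
    """One dynamic-sparse-training step on a {name: weight} dict:
    zero out the smallest-magnitude fraction of active weights, then
    regrow the same number of currently-inactive connections at a
    near-zero value, keeping the total number of active weights fixed.
    """
    rng = rng or random.Random(0)
    active = [k for k, w in weights.items() if w != 0.0]
    inactive = [k for k, w in weights.items() if w == 0.0]
    n = int(len(active) * prune_frac)
    drop = sorted(active, key=lambda k: abs(weights[k]))[:n]
    for k in drop:
        weights[k] = 0.0
    # Regrown connections start near zero and must re-earn magnitude.
    for k in rng.sample(inactive + drop, n):
        weights[k] = 1e-3
    return weights
```

Because the active-connection count is constant, each client's compute and upload cost stays bounded while the mask itself keeps searching for informative features.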

Updated: 2025-04-07 16:33:05

标题: 嵌入联邦特征选择与动态稀疏训练:平衡准确性和成本的权衡

摘要: 联邦学习(FL)使多个资源受限、异质程度不同的边缘设备能够协作训练一个全局模型。然而,容量有限的设备可能会造成瓶颈并减慢模型收敛速度。解决这一问题的一种有效方法是使用高效的特征选择方法,通过减少通信和计算成本来降低整体资源需求,从而缓解掉队节点的影响。现有的联邦特征选择(FFS)方法要么被视为独立于FL的一个单独步骤,要么依赖于第三方。这些方法增加了计算和通信开销,使其难以应用于现实世界的高维数据集。为了解决这个问题,我们提出了\textit{动态稀疏联邦特征选择}(DSFFS),这是第一个在通信和计算上都高效的嵌入式FFS。在所提出的方法中,特征选择与模型训练同时进行。在训练过程中,输入层神经元、它们的连接以及隐藏层连接会被动态修剪和重新生长,从而消除无信息的特征。这个过程提升了设备上的计算效率,改善了网络通信效率,并提升了全局模型性能。我们在来自生物学、图像、语音和文本等不同领域、维度各异的九个真实世界数据集上进行了多项实验。在现实的非独立同分布数据设置下,结果显示,与其他最先进的FFS方法相比,我们的方法通过选择更具信息量的特征,在准确性、计算和通信成本之间实现了更好的权衡。

更新时间: 2025-04-07 16:33:05

领域: cs.LG

下载: http://arxiv.org/abs/2504.05245v1

IAEmu: Learning Galaxy Intrinsic Alignment Correlations

The intrinsic alignments (IA) of galaxies, a key contaminant in weak lensing analyses, arise from correlations in galaxy shapes driven by tidal interactions and galaxy formation processes. Accurate IA modeling is essential for robust cosmological inference, but current approaches rely on perturbative methods that break down on nonlinear scales or on expensive simulations. We introduce IAEmu, a neural network-based emulator that predicts the galaxy position-position ($\xi$), position-orientation ($\omega$), and orientation-orientation ($\eta$) correlation functions and their uncertainties using mock catalogs based on the halo occupation distribution (HOD) framework. Compared to simulations, IAEmu achieves ~3% average error for $\xi$ and ~5% for $\omega$, while capturing the stochasticity of $\eta$ without overfitting. The emulator provides both aleatoric and epistemic uncertainties, helping identify regions where predictions may be less reliable. We also demonstrate generalization to non-HOD alignment signals by fitting to IllustrisTNG hydrodynamical simulation data. As a fully differentiable neural network, IAEmu enables $\sim$10,000$\times$ speed-ups in mapping HOD parameters to correlation functions on GPUs, compared to CPU-based simulations. This acceleration facilitates inverse modeling via gradient-based sampling, making IAEmu a powerful surrogate model for galaxy bias and IA studies with direct applications to Stage IV weak lensing surveys.

Updated: 2025-04-07 16:19:50

标题: IAEmu:学习星系固有对齐相关性

摘要: 星系的内禀对齐(IA)是弱引力透镜分析中的一个关键污染源,源于由潮汐相互作用和星系形成过程驱动的星系形状关联。准确的IA建模对于稳健的宇宙学推断至关重要,但目前的方法要么依赖于在非线性尺度上失效的微扰方法,要么依赖于昂贵的模拟。我们介绍了IAEmu,这是一个基于神经网络的仿真器,它使用基于晕占据分布(HOD)框架的模拟星表来预测星系的位置-位置(ξ)、位置-方向(ω)和方向-方向(η)相关函数及其不确定性。与模拟相比,IAEmu对ξ的平均误差约为3%,对ω约为5%,同时捕捉了η的随机性而不过拟合。该仿真器同时提供偶然(aleatoric)和认知(epistemic)不确定性,有助于识别预测可能不够可靠的区域。我们还通过拟合IllustrisTNG流体动力学模拟数据,展示了对非HOD对齐信号的泛化能力。作为一个完全可微的神经网络,IAEmu使得在GPU上将HOD参数映射到相关函数的速度相比基于CPU的模拟提高了约10,000倍。这种加速有助于通过基于梯度的采样进行反演建模,使IAEmu成为星系偏置和IA研究的强大替代模型,可直接应用于第四阶段弱引力透镜巡天。

更新时间: 2025-04-07 16:19:50

领域: astro-ph.CO,astro-ph.GA,cs.LG

下载: http://arxiv.org/abs/2504.05235v1

Mapping biodiversity at very-high resolution in Europe

This paper describes a cascading multimodal pipeline for high-resolution biodiversity mapping across Europe, integrating species distribution modeling, biodiversity indicators, and habitat classification. The proposed pipeline first predicts species compositions using a deep-SDM, a multimodal model trained on remote sensing, climate time series, and species occurrence data at 50x50m resolution. These predictions are then used to generate biodiversity indicator maps and classify habitats with Pl@ntBERT, a transformer-based LLM designed for species-to-habitat mapping. With this approach, continental-scale species distribution maps, biodiversity indicator maps, and habitat maps are produced, providing fine-grained ecological insights. Unlike traditional methods, this framework enables joint modeling of interspecies dependencies, bias-aware training with heterogeneous presence-absence data, and large-scale inference from multi-source remote sensing inputs.

Updated: 2025-04-07 16:15:52

Categories: cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.05231v1

FinGrAct: A Framework for FINe-GRained Evaluation of ACTionability in Explainable Automatic Fact-Checking

The field of explainable Automatic Fact-Checking (AFC) aims to enhance the transparency and trustworthiness of automated fact-verification systems by providing clear and comprehensible explanations. However, the effectiveness of these explanations depends on their actionability --their ability to empower users to make informed decisions and mitigate misinformation. Despite actionability being a critical property of high-quality explanations, no prior research has proposed a dedicated method to evaluate it. This paper introduces FinGrAct, a fine-grained evaluation framework that can access the web, and it is designed to assess actionability in AFC explanations through well-defined criteria and an evaluation dataset. FinGrAct surpasses state-of-the-art (SOTA) evaluators, achieving the highest Pearson and Kendall correlation with human judgments while demonstrating the lowest ego-centric bias, making it a more robust evaluation approach for actionability evaluation in AFC.
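
The meta-evaluation statistics cited above (correlation of an evaluator's scores with human judgments) are standard and short enough to sketch; the scores fed in would be whatever FinGrAct or a baseline emits, so the data here is purely illustrative:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def kendall_tau(xs, ys):
    """Kendall rank correlation: (concordant - discordant) pairs, normalized."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Pearson measures linear agreement of the raw scores, while Kendall only cares about agreement in ranking, which is why evaluator papers typically report both.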

Updated: 2025-04-07 16:14:27

Categories: cs.AI

Download: http://arxiv.org/abs/2504.05229v1

Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG

Retrieval models typically rely on costly human-labeled query-document relevance annotations for training and evaluation. To reduce this cost and leverage the potential of Large Language Models (LLMs) in relevance judgments, we aim to explore whether LLM-generated annotations can effectively replace human annotations in training retrieval models. Retrieval usually emphasizes relevance, which indicates "topic-relatedness" of a document to a query, while in RAG, the value of a document (or utility) depends on how it contributes to answer generation. Recognizing this mismatch, some researchers use LLM performance on downstream tasks with documents as labels, but this approach requires manual answers for specific tasks, leading to high costs and limited generalization. In another line of work, prompting LLMs to select useful documents as RAG references eliminates the need for human annotation and is not task-specific. If we leverage LLMs' utility judgments to annotate retrieval data, we may retain cross-task generalization without human annotation in large-scale corpora. Therefore, we investigate utility-focused annotation via LLMs for large-scale retriever training data across both in-domain and out-of-domain settings on the retrieval and RAG tasks. To reduce the impact of low-quality positives labeled by LLMs, we design a novel loss function, i.e., Disj-InfoNCE. Our experiments reveal that: (1) Retrievers trained on utility-focused annotations significantly outperform those trained on human annotations in the out-of-domain setting on both tasks, demonstrating superior generalization capabilities. (2) LLM annotation does not replace human annotation in the in-domain setting. However, incorporating just 20% human-annotated data enables retrievers trained with utility-focused annotations to match the performance of models trained entirely with human annotations.
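
Disj-InfoNCE is the paper's novel loss and its exact form is not given in the abstract; what can be sketched is the vanilla InfoNCE objective it modifies, which pushes the query-positive similarity above the query-negative similarities. A minimal single-query version (temperature value illustrative):

```python
import math

def infonce_loss(sim_pos, sim_negs, tau=0.05):
    """Vanilla InfoNCE for one query: negative log-softmax probability of
    the positive document among positive + negatives, at temperature tau.
    The paper's Disj-InfoNCE variant adapts this to tolerate noisy
    LLM-labeled positives; its exact form is not specified here."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # log-sum-exp with max-shift for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - sim_pos / tau
```

The loss shrinks as the positive's similarity grows relative to the negatives, which is the gradient signal that shapes the retriever's embedding space.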

Updated: 2025-04-07 16:05:52

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.05220v1

Hybrid machine learning data assimilation for marine biogeochemistry

Marine biogeochemistry models are critical for forecasting, as well as estimating ecosystem responses to climate change and human activities. Data assimilation (DA) improves these models by aligning them with real-world observations, but marine biogeochemistry DA faces challenges due to model complexity, strong nonlinearity, and sparse, uncertain observations. Existing DA methods applied to marine biogeochemistry struggle to update unobserved variables effectively, while ensemble-based methods are computationally too expensive for high-complexity marine biogeochemistry models. This study demonstrates how machine learning (ML) can improve marine biogeochemistry DA by learning statistical relationships between observed and unobserved variables. We integrate ML-driven balancing schemes into a 1D prototype of a system used to forecast marine biogeochemistry in the North-West European Shelf seas. ML is applied to predict (i) state-dependent correlations from free-run ensembles and (ii), in an ``end-to-end'' fashion, analysis increments from an Ensemble Kalman Filter. Our results show that ML significantly enhances updates for previously not-updated variables when compared to univariate schemes akin to those used operationally. Furthermore, ML models exhibit moderate transferability to new locations, a crucial step toward scaling these methods to 3D operational systems. We conclude that ML offers a clear pathway to overcome current computational bottlenecks in marine biogeochemistry DA and that refining transferability, optimizing training data sampling, and evaluating scalability for large-scale marine forecasting, should be future research priorities.

Updated: 2025-04-07 16:04:10

Categories: physics.ao-ph,cs.LG

Download: http://arxiv.org/abs/2504.05218v1

Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling

Dense retrieval is a crucial task in Information Retrieval (IR) and is the foundation for downstream tasks such as re-ranking. Recently, large language models (LLMs) have shown compelling semantic understanding capabilities and are appealing to researchers studying dense retrieval. LLMs, as decoder-style generative models, are competent at language generation while falling short on modeling global information due to the lack of attention to tokens afterward. Inspired by the classical word-based language modeling approach for IR, i.e., the query likelihood (QL) model, we seek to sufficiently utilize LLMs' generative ability by QL maximization. However, instead of ranking documents with QL estimation, we introduce an auxiliary task of QL maximization to yield a better backbone for contrastively learning a discriminative retriever. We name our model as LLM-QL. To condense global document semantics to a single vector during QL modeling, LLM-QL has two major components, Attention Stop (AS) and Input Corruption (IC). AS stops the attention of predictive tokens to previous tokens until the ending token of the document. IC masks a portion of tokens in the input documents during prediction. Experiments on MSMARCO show that LLM-QL can achieve significantly better performance than other LLM-based retrievers and using QL estimated by LLM-QL for ranking outperforms word-based QL by a large margin.
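
Of LLM-QL's two components, Input Corruption is the easier one to sketch: replace a fraction of the document's tokens with a mask symbol before the query-likelihood prediction step. A stdlib sketch (the mask ratio, mask token, and function name are illustrative, not the paper's settings):

```python
import random

def corrupt_input(tokens, mask_ratio=0.3, mask_token="[MASK]", rng=None):
    """Input Corruption sketch: randomly mask a fixed fraction of document
    tokens, forcing the model to condense document semantics rather than
    copy surface tokens during query-likelihood prediction."""
    rng = rng or random.Random()
    n_mask = int(len(tokens) * mask_ratio)
    idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in idx else t for i, t in enumerate(tokens)]
```

Attention Stop, the other component, operates inside the attention mask of the decoder and cannot be shown faithfully without the model internals.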

Updated: 2025-04-07 16:03:59

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.05216v1

Scalable and Ethical Insider Threat Detection through Data Synthesis and Analysis by LLMs

Insider threats wield an outsized influence on organizations, disproportionate to their small numbers. This is due to the internal access insiders have to systems, information, and infrastructure. Signals of such insider threat risks may be found in anonymous submissions to public web-based job search site reviews. This research studies the potential for large language models (LLMs) to analyze and detect insider threat sentiment within job site reviews. Addressing ethical data collection concerns, this research utilizes synthetic data generation using LLMs alongside existing job review datasets. A comparative analysis of sentiment scores generated by LLMs is benchmarked against expert human scoring. Findings reveal that LLMs demonstrate alignment with human evaluations in most cases, thus effectively identifying nuanced indicators of threat sentiment. The performance is lower on human-generated data than synthetic data, suggesting areas for improvement in evaluating real-world data. Text diversity analysis found differences between human-generated and LLM-generated datasets, with synthetic data exhibiting somewhat lower diversity. Overall, the results demonstrate the applicability of LLMs to insider threat detection, and a scalable solution for insider sentiment testing by overcoming ethical and logistical barriers tied to data acquisition.

Updated: 2025-04-07 16:01:47

Categories: cs.CR,cs.AI,cs.CL,cs.CY,C.2.0; I.2.7; K.4.1; H.3.3

Download: http://arxiv.org/abs/2502.07045v2

On Sinkhorn's Algorithm and Choice Modeling

For a broad class of models widely used in practice for choice and ranking data based on Luce's choice axiom, including the Bradley--Terry--Luce and Plackett--Luce models, we show that the associated maximum likelihood estimation problems are equivalent to a classic matrix balancing problem with target row and column sums. This perspective opens doors between two seemingly unrelated research areas, and allows us to unify existing algorithms in the choice modeling literature as special instances or analogs of Sinkhorn's celebrated algorithm for matrix balancing. We draw inspirations from these connections and resolve some open problems on the study of Sinkhorn's algorithm. We establish the global linear convergence of Sinkhorn's algorithm for non-negative matrices whenever finite scaling matrices exist, and characterize its linear convergence rate in terms of the algebraic connectivity of a weighted bipartite graph. We further derive the sharp asymptotic rate of linear convergence, which generalizes a classic result of Knight (2008). To our knowledge, these are the first quantitative linear convergence results for Sinkhorn's algorithm for general non-negative matrices and positive marginals. Our results highlight the importance of connectivity and orthogonality structures in matrix balancing and Sinkhorn's algorithm, which could be of independent interest. More broadly, the connections we establish in this paper between matrix balancing and choice modeling could also help motivate further transmission of ideas and lead to interesting results in both disciplines.
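
The Sinkhorn iteration at the heart of this correspondence is short enough to sketch in full: alternately rescale the rows and columns of a non-negative matrix toward the target marginals. A minimal pure-Python version (iteration count and names illustrative):

```python
def sinkhorn(A, row_targets, col_targets, iters=200):
    """Scale a non-negative matrix A (list of lists) so its row and column
    sums match the targets, via alternating row/column scaling updates.
    Returns the scaled matrix diag(u) @ A @ diag(v)."""
    m, n = len(A), len(A[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):  # row-scaling pass
            u[i] = row_targets[i] / sum(A[i][j] * v[j] for j in range(n))
        for j in range(n):  # column-scaling pass
            v[j] = col_targets[j] / sum(A[i][j] * u[i] for i in range(m))
    return [[u[i] * A[i][j] * v[j] for j in range(n)] for i in range(m)]
```

For strictly positive matrices (where finite scaling matrices exist) the marginals converge linearly, which is the regime whose rate the paper characterizes via the algebraic connectivity of the associated weighted bipartite graph.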

Updated: 2025-04-07 15:59:57

Categories: math.OC,cs.LG,econ.EM

Download: http://arxiv.org/abs/2310.00260v2

A moving target in AI-assisted decision-making: Dataset shift, model updating, and the problem of update opacity

Machine learning (ML) systems are vulnerable to performance decline over time due to dataset shift. To address this problem, experts often suggest that ML systems should be regularly updated to ensure ongoing performance stability. Some scholarly literature has begun to address the epistemic and ethical challenges associated with different updating methodologies. Thus far, however, little attention has been paid to the impact of model updating on the ML-assisted decision-making process itself, particularly in the AI ethics and AI epistemology literatures. This article aims to address this gap in the literature. It argues that model updating introduces a new sub-type of opacity into ML-assisted decision-making -- update opacity -- that occurs when users cannot understand how or why an update has changed the reasoning or behaviour of an ML system. This type of opacity presents a variety of distinctive epistemic and safety concerns that available solutions to the black box problem in ML are largely ill-equipped to address. A variety of alternative strategies may be developed or pursued to address the problem of update opacity more directly, including bi-factual explanations, dynamic model reporting, and update compatibility. However, each of these strategies presents its own risks or carries significant limitations. Further research will be needed to address the epistemic and safety concerns associated with model updating and update opacity going forward.

Updated: 2025-04-07 15:58:23

Categories: cs.CY,cs.AI,cs.HC,cs.LG

Download: http://arxiv.org/abs/2504.05210v1

Correcting Class Imbalances with Self-Training for Improved Universal Lesion Detection and Tagging

Universal lesion detection and tagging (ULDT) in CT studies is critical for tumor burden assessment and tracking the progression of lesion status (growth/shrinkage) over time. However, a lack of fully annotated data hinders the development of effective ULDT approaches. Prior work used the DeepLesion dataset (4,427 patients, 10,594 studies, 32,120 CT slices, 32,735 lesions, 8 body part labels) for algorithmic development, but this dataset is not completely annotated and contains class imbalances. To address these issues, in this work, we developed a self-training pipeline for ULDT. A VFNet model was trained on a limited 11.5\% subset of DeepLesion (bounding boxes + tags) to detect and classify lesions in CT studies. Then, it identified and incorporated novel lesion candidates from a larger unseen data subset into its training set, and self-trained itself over multiple rounds. Multiple self-training experiments were conducted with different threshold policies to select predicted lesions with higher quality and cover the class imbalances. We discovered that direct self-training improved the sensitivities of over-represented lesion classes at the expense of under-represented classes. However, upsampling the lesions mined during self-training along with a variable threshold policy yielded a 6.5\% increase in sensitivity at 4 FP in contrast to self-training without class balancing (72\% vs 78.5\%) and a 11.7\% increase compared to the same self-training policy without upsampling (66.8\% vs 78.5\%). Furthermore, we show that our results either improved or maintained the sensitivity at 4FP for all 8 lesion classes.
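
One round of the mine-filter-rebalance loop can be sketched generically. The `predict` interface, per-class thresholds, and `target_count` below are illustrative stand-ins, since the abstract does not pin down the exact threshold policies or upsampling factors:

```python
from collections import Counter

def self_train_round(labeled, unlabeled, predict, thresholds, target_count):
    """One self-training round: mine confident predictions from unlabeled
    data using per-class ('variable') confidence thresholds, then upsample
    rare classes toward target_count before adding them to the train set."""
    mined = []
    for x in unlabeled:
        cls, conf = predict(x)  # model's predicted class and confidence
        if conf >= thresholds.get(cls, 1.0):
            mined.append((x, cls))
    counts = Counter(cls for _, cls in mined)
    balanced = []
    for x, cls in mined:
        # Upsample under-represented classes so they approach target_count.
        reps = max(1, target_count // max(1, counts[cls]))
        balanced.extend([(x, cls)] * reps)
    return labeled + balanced
```

Repeating this over multiple rounds, with retraining in between, is what lets the sensitivity gains reported above accrue without sacrificing under-represented lesion classes.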

Updated: 2025-04-07 15:57:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.05207v1

Infinitely Divisible Noise for Differential Privacy: Nearly Optimal Error in the High $\varepsilon$ Regime

Differential privacy (DP) can be achieved in a distributed manner, where multiple parties add independent noise such that their sum protects the overall dataset with DP. A common technique here is for each party to sample their noise from the decomposition of an infinitely divisible distribution. We analyze two mechanisms in this setting: 1) the generalized discrete Laplace (GDL) mechanism, whose distribution (which is closed under summation) follows from differences of i.i.d. negative binomial shares, and 2) the multi-scale discrete Laplace (MSDLap) mechanism, a novel mechanism following the sum of multiple i.i.d. discrete Laplace shares at different scales. For $\varepsilon \geq 1$, our mechanisms can be parameterized to have $O\left(\Delta^3 e^{-\varepsilon}\right)$ and $O\left(\min\left(\Delta^3 e^{-\varepsilon}, \Delta^2 e^{-2\varepsilon/3}\right)\right)$ MSE, respectively, where $\Delta$ denote the sensitivity; the latter bound matches known optimality results. We also show a transformation from the discrete setting to the continuous setting, which allows us to transform both mechanisms to the continuous setting and thereby achieve the optimal $O\left(\Delta^2 e^{-2\varepsilon / 3}\right)$ MSE. To our knowledge, these are the first infinitely divisible additive noise mechanisms that achieve order-optimal MSE under pure DP, so our work shows formally there is no separation in utility when query-independent noise adding mechanisms are restricted to infinitely divisible noise. For the continuous setting, our result improves upon the Arete mechanism from [Pagh and Stausholm, ALT 2022] which gives an MSE of $O\left(\Delta^2 e^{-\varepsilon/4}\right)$. Furthermore, we give an exact sampler tuned to efficiently implement the MSDLap mechanism, and we apply our results to improve a state of the art multi-message shuffle DP protocol in the high $\varepsilon$ regime.
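
The infinite-divisibility trick can be made concrete: a geometric distribution with success probability p is the sum of n i.i.d. negative binomial NB(1/n, p) shares, so each of n parties can contribute a difference of two such shares and the sum across parties is a discrete (two-sided geometric) Laplace sample. A stdlib-only sketch using the Gamma-Poisson mixture to draw negative binomials with fractional shape (parameters illustrative; a deployment would pick p from the privacy budget):

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler (adequate for the small rates used here)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def neg_binomial(r, p, rng):
    """NB(r, p) via the Gamma-Poisson mixture; shape r may be fractional."""
    return poisson(rng.gammavariate(r, (1 - p) / p), rng)

def party_share(n_parties, p, rng):
    """One party's additive noise share: a difference of NB(1/n, p) draws."""
    r = 1.0 / n_parties
    return neg_binomial(r, p, rng) - neg_binomial(r, p, rng)

def distributed_discrete_laplace(n_parties, p, rng):
    """The shares sum to a discrete Laplace sample with variance
    2(1-p)/p^2, without any single party knowing the total noise."""
    return sum(party_share(n_parties, p, rng) for _ in range(n_parties))
```

The GDL and MSDLap mechanisms of the paper build on exactly this kind of decomposition, with the MSDLap variant summing discrete Laplace shares at multiple scales.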

Updated: 2025-04-07 15:50:46

Categories: cs.CR,cs.DS

Download: http://arxiv.org/abs/2504.05202v1

3D Universal Lesion Detection and Tagging in CT with Self-Training

Radiologists routinely perform the tedious task of lesion localization, classification, and size measurement in computed tomography (CT) studies. Universal lesion detection and tagging (ULDT) can simultaneously help alleviate the cumbersome nature of lesion measurement and enable tumor burden assessment. Previous ULDT approaches utilize the publicly available DeepLesion dataset, however it does not provide the full volumetric (3D) extent of lesions and also displays a severe class imbalance. In this work, we propose a self-training pipeline to detect 3D lesions and tag them according to the body part they occur in. We used a significantly limited 30\% subset of DeepLesion to train a VFNet model for 2D lesion detection and tagging. Next, the 2D lesion context was expanded into 3D, and the mined 3D lesion proposals were integrated back into the baseline training data in order to retrain the model over multiple rounds. Through the self-training procedure, our VFNet model learned from its own predictions, detected lesions in 3D, and tagged them. Our results indicated that our VFNet model achieved an average sensitivity of 46.9\% at [0.125:8] false positives (FP) with a limited 30\% data subset in comparison to the 46.8\% of an existing approach that used the entire DeepLesion dataset. To our knowledge, we are the first to jointly detect lesions in 3D and tag them according to the body part label.

Updated: 2025-04-07 15:50:27

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.05201v1

A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms

Recent trends in deep learning (DL) have made hardware accelerators essential for various high-performance computing (HPC) applications, including image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent developments in DL accelerators, focusing on their role in meeting the performance demands of HPC applications. We explore cutting-edge approaches to DL acceleration, covering not only GPU- and TPU-based platforms but also specialized hardware such as FPGA- and ASIC-based accelerators, Neural Processing Units, open hardware RISC-V-based accelerators, and co-processors. This survey also describes accelerators leveraging emerging memory technologies and computing paradigms, including 3D-stacked Processor-In-Memory, non-volatile memories like Resistive RAM and Phase Change Memories used for in-memory computing, as well as Neuromorphic Processing Units, and Multi-Chip Module-based accelerators. Furthermore, we provide insights into emerging quantum-based accelerators and photonics. Finally, this survey categorizes the most influential architectures and technologies from recent years, offering readers a comprehensive perspective on the rapidly evolving field of deep learning acceleration.

Updated: 2025-04-07 15:49:45

Categories: cs.AR,cs.ET,cs.LG

Download: http://arxiv.org/abs/2306.15552v3

CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles

This paper discusses the integration challenges and strategies for designing mobile robots by focusing on the task-driven, optimal selection of hardware and software to balance safety, efficiency, and minimal usage of resources such as costs, energy, computational requirements, and weight. We emphasize the interplay between perception and motion planning in decision-making by introducing the concept of occupancy queries to quantify the perception requirements for sampling-based motion planners. Sensor and algorithm performance are evaluated using False Negative Rates (FNR) and False Positive Rates (FPR) across various factors such as geometric relationships, object properties, sensor resolution, and environmental conditions. By integrating perception requirements with perception performance, an Integer Linear Programming (ILP) approach is proposed for efficient sensor and algorithm selection and placement. This forms the basis for a co-design optimization that includes the robot body, motion planner, perception pipeline, and computing unit. We refer to this framework for solving the co-design problem of mobile robots as CODEI, short for Co-design of Embodied Intelligence. A case study on developing an Autonomous Vehicle (AV) for urban scenarios provides actionable information for designers, and shows that complex tasks escalate resource demands, with task performance affecting choices of the autonomy stack. The study demonstrates that resource prioritization influences sensor choice: cameras are preferred for cost-effective and lightweight designs, while lidar sensors are chosen for better energy and computational efficiency.
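
The covering structure of the sensor-selection problem can be illustrated at toy scale. A real system would hand this to an ILP solver; the brute-force sketch below (all sensor names, costs, and error rates invented for illustration) just shows the shape of the constraints: every occupancy query must be covered by at least one selected sensor whose false-negative rate meets the query's bound, at minimum total cost:

```python
from itertools import combinations

def select_sensors(sensors, queries, max_fnr):
    """Toy stand-in for the paper's ILP: return (cost, sensor_set) of the
    cheapest subset whose best per-query FNR satisfies every bound.
    Brute force over subsets is fine at this scale only."""
    names = list(sensors)
    best = None
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            ok = all(
                min(sensors[s]["fnr"].get(q, 1.0) for s in combo) <= max_fnr[q]
                for q in queries
            )
            if ok:
                cost = sum(sensors[s]["cost"] for s in combo)
                if best is None or cost < best[0]:
                    best = (cost, set(combo))
    return best
```

The ILP formulation replaces this enumeration with binary selection variables and linear coverage constraints, which is what makes the joint sensor-and-algorithm placement tractable at realistic sizes.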

Updated: 2025-04-07 15:48:11

Categories: cs.RO,cs.AI,cs.AR,cs.CV,cs.SY,eess.SY,I.2.9; I.2.10; I.2.8; I.4.8

Download: http://arxiv.org/abs/2503.10296v2

Analyzing Generative Models by Manifold Entropic Metrics

Good generative models should not only synthesize high quality data, but also utilize interpretable representations that aid human understanding of their behavior. However, it is difficult to measure objectively if and to what degree desirable properties of disentangled representations have been achieved. Inspired by the principle of independent mechanisms, we address this difficulty by introducing a novel set of tractable information-theoretic evaluation metrics. We demonstrate the usefulness of our metrics on illustrative toy examples and conduct an in-depth comparison of various normalizing flow architectures and $\beta$-VAEs on the EMNIST dataset. Our method allows to sort latent features by importance and assess the amount of residual correlations of the resulting concepts. The most interesting finding of our experiments is a ranking of model architectures and training procedures in terms of their inductive bias to converge to aligned and disentangled representations during training.

Updated: 2025-04-07 15:47:53

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2410.19426v2

Universal Lymph Node Detection in Multiparametric MRI with Selective Augmentation

Robust localization of lymph nodes (LNs) in multiparametric MRI (mpMRI) is critical for the assessment of lymphadenopathy. Radiologists routinely measure the size of LN to distinguish benign from malignant nodes, which would require subsequent cancer staging. Sizing is a cumbersome task compounded by the diverse appearances of LNs in mpMRI, which renders their measurement difficult. Furthermore, smaller and potentially metastatic LNs could be missed during a busy clinical day. To alleviate these imaging and workflow problems, we propose a pipeline to universally detect both benign and metastatic nodes in the body for their ensuing measurement. The recently proposed VFNet neural network was employed to identify LN in T2 fat suppressed and diffusion weighted imaging (DWI) sequences acquired by various scanners with a variety of exam protocols. We also use a selective augmentation technique known as Intra-Label LISA (ILL) to diversify the input data samples the model sees during training, such that it improves its robustness during the evaluation phase. We achieved a sensitivity of $\sim$83\% with ILL vs. $\sim$80\% without ILL at 4 FP/vol. Compared with current LN detection approaches evaluated on mpMRI, we show a sensitivity improvement of $\sim$9\% at 4 FP/vol.
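
Intra-Label LISA is a mixup-style augmentation that interpolates two samples sharing the same label (e.g. the same lesion class imaged under different scanners or protocols), so the label stays fixed while the appearance varies. A toy sketch on feature vectors (the dataset layout and Beta parameter are illustrative):

```python
import random

def intra_label_lisa(dataset, alpha=2.0, rng=None):
    """Intra-Label LISA sketch: pick a label, pick two of its samples,
    and mix them with a Beta(alpha, alpha) coefficient, mixup-style.
    `dataset` maps label -> list of feature vectors (illustrative layout)."""
    rng = rng or random.Random()
    label = rng.choice(list(dataset))
    x1, x2 = rng.sample(dataset[label], 2)
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    return mixed, label
```

Because the two inputs share a label, the interpolated sample needs no label smoothing, which is what makes the scheme "selective" relative to vanilla mixup.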

Updated: 2025-04-07 15:46:43

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2504.05196v1

Handling Weather Uncertainty in Air Traffic Prediction through an Inverse Approach

Adverse weather conditions, particularly convective phenomena, pose significant challenges to Air Traffic Management, often requiring real-time rerouting decisions that impact efficiency and safety. This study introduces a 3-D Gaussian Mixture Model to predict long lead-time flight trajectory changes, incorporating comprehensive weather and traffic data. Utilizing high-resolution meteorological datasets, including convective weather maps and wind data, alongside traffic records, the model demonstrates robust performance in forecasting reroutes up to 60 minutes. The novel 3-D Gaussian Mixture Model framework employs a probabilistic approach to capture uncertainty while providing accurate forecasts of altitude, latitude, and longitude. Extensive evaluation revealed a Mean Absolute Percentage Error below 0.02 across varying lead times, highlighting the model's accuracy and scalability. By integrating explainability techniques such as the Vanilla Gradient algorithm, the study provides insights into feature contributions, showing that they contribute to improving Air Traffic Management strategies to mitigate weather-induced disruptions.
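
Once such a mixture is fitted, using it as a probabilistic forecaster is straightforward: the point forecast is the weight-averaged component mean, and uncertainty comes from sampling components. A minimal sketch with a diagonal-covariance 3-D mixture (all weights, means, and standard deviations below are illustrative, not the paper's fitted values):

```python
import random

def mixture_mean(weights, means):
    """Point forecast: the weight-averaged component means."""
    dim = len(means[0])
    return tuple(sum(w * mu[d] for w, mu in zip(weights, means)) for d in range(dim))

def sample_point(weights, means, stds, rng):
    """Draw one (altitude, latitude, longitude) sample: pick a component
    by weight, then sample each coordinate from its diagonal Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return tuple(rng.gauss(m, s) for m, s in zip(means[k], stds[k]))
```

Repeated sampling yields the predictive spread that the paper's uncertainty-aware evaluation relies on, alongside the MAPE of the point forecasts.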

Updated: 2025-04-07 15:42:09

Subjects: cs.LG

Download: http://arxiv.org/abs/2504.05366v1

Resource-Efficient Beam Prediction in mmWave Communications with Multimodal Realistic Simulation Framework

Beamforming is a key technology in millimeter-wave (mmWave) communications that improves signal transmission by optimizing directionality and intensity. However, conventional channel estimation methods, such as pilot signals or beam sweeping, often fail to adapt to rapidly changing communication environments. To address this limitation, multimodal sensing-aided beam prediction has gained significant attention, using various sensing data from devices such as LiDAR, radar, GPS, and RGB images to predict user locations or network conditions. Despite its promising potential, the adoption of multimodal sensing-aided beam prediction is hindered by high computational complexity, high costs, and limited datasets. Thus, in this paper, a resource-efficient learning approach is proposed to transfer knowledge from a multimodal network to a monomodal (radar-only) network based on cross-modal relational knowledge distillation (CRKD), while reducing computational overhead and preserving predictive accuracy. To enable multimodal learning with realistic data, a novel multimodal simulation framework is developed while integrating sensor data generated from the autonomous driving simulator CARLA with MATLAB-based mmWave channel modeling, and reflecting real-world conditions. The proposed CRKD achieves its objective by distilling relational information across different feature spaces, which enhances beam prediction performance without relying on expensive sensor data. Simulation results demonstrate that CRKD efficiently distills multimodal knowledge, allowing a radar-only model to achieve $94.62\%$ of the teacher performance. In particular, this is achieved with just $10\%$ of the teacher network's parameters, thereby significantly reducing computational complexity and dependence on multimodal sensor data.
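
Relational knowledge distillation transfers the structure of a batch rather than raw features, which is what lets a radar-only student learn from a multimodal teacher whose features live in a different space. A hedged numpy sketch of one common relational loss (matching pairwise cosine-similarity matrices; CRKD's exact loss may differ):

```python
import numpy as np

def relation_matrix(feats):
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize rows
    return f @ f.T                                            # pairwise cosine similarities

def relational_kd_loss(teacher_feats, student_feats):
    # Match relations between batch items, not the (incompatible) features themselves.
    rt, rs = relation_matrix(teacher_feats), relation_matrix(student_feats)
    return float(np.mean((rt - rs) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 16))   # multimodal teacher features (dim 16, assumed)
student = rng.normal(size=(4, 8))    # radar-only student features (dim 8, different space)
loss = relational_kd_loss(teacher, student)
```

Note the teacher and student feature dimensions need not agree, since only the batch-by-batch relation matrices are compared.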

Updated: 2025-04-07 15:38:25

Subjects: cs.NI,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.05187v1

Training state-of-the-art pathology foundation models with orders of magnitude less data

The field of computational pathology has recently seen rapid advances driven by the development of modern vision foundation models (FMs), typically trained on vast collections of pathology images. Recent studies demonstrate that increasing the training data set and model size and integrating domain-specific image processing techniques can significantly enhance the model's performance on downstream tasks. Building on these insights, our work incorporates several recent modifications to the standard DINOv2 framework from the literature to optimize the training of pathology FMs. We also apply a post-training procedure for fine-tuning models on higher-resolution images to further enrich the information encoded in the embeddings. We present three novel pathology FMs trained on up to two orders of magnitude fewer WSIs than those used to train other state-of-the-art FMs while demonstrating a comparable or superior performance on downstream tasks. Even the model trained on TCGA alone (12k WSIs) outperforms most existing FMs and, on average, matches Virchow2, the second-best FM published to date. This suggests that there still remains a significant potential for further improving the models and algorithms used to train pathology FMs to take full advantage of the vast data collections.

Updated: 2025-04-07 15:38:12

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.05186v1

GIScience in the Era of Artificial Intelligence: A Research Agenda Towards Autonomous GIS

The advent of generative AI exemplified by large language models (LLMs) opens new ways to represent and compute geographic information and transcends the process of geographic knowledge production, driving geographic information systems (GIS) towards autonomous GIS. Leveraging LLMs as the decision core, autonomous GIS can independently generate and execute geoprocessing workflows to perform spatial analysis. In this vision paper, we further elaborate on the concept of autonomous GIS and present a conceptual framework that defines its five autonomous goals, five autonomous levels, five core functions, and three operational scales. We demonstrate how autonomous GIS could perform geospatial data retrieval, spatial analysis, and map making with four proof-of-concept GIS agents. We conclude by identifying critical challenges and future research directions, including fine-tuning and self-growing decision-cores, autonomous modeling, and examining the societal and practical implications of autonomous GIS. By establishing the groundwork for a paradigm shift in GIScience, this paper envisions a future where GIS moves beyond traditional workflows to autonomously reason, derive, innovate, and advance geospatial solutions to pressing global challenges. As we design and deploy increasingly intelligent geospatial systems, we have a responsibility to ensure they are developed in socially responsible ways, serve the public good, and support the continued value of human geographic insight in an AI-augmented future.

Updated: 2025-04-07 15:29:39

Subjects: cs.AI,cs.ET,cs.SE

Download: http://arxiv.org/abs/2503.23633v3

Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval

Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task, allowing for end-to-end optimization toward a unified global retrieval objective. However, existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. While reinforcement learning-based methods, such as reinforcement learning from relevance feedback (RLRF), aim to address this misalignment through reward modeling, they introduce significant complexity, requiring the optimization of an auxiliary reward function followed by reinforcement fine-tuning, which is computationally expensive and often unstable. To address these challenges, we propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking, eliminating the need for explicit reward modeling and reinforcement learning. Experimental results on benchmark datasets, including MS MARCO document and Natural Questions, show that DDRO outperforms reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10 for MS MARCO and a 19.9% improvement for Natural Questions. These findings highlight DDRO's potential to enhance retrieval effectiveness with a simplified optimization approach. By framing alignment as a direct optimization problem, DDRO simplifies the ranking optimization pipeline of GenIR models while offering a viable alternative to reinforcement learning-based methods.
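
The core of a direct pairwise-ranking objective of this kind can be written in a few lines: given the model's log-scores for a relevant docid and a sampled negative docid, penalize inversions with a logistic loss. This is a generic sketch of the idea, not DDRO's exact formulation:

```python
import math

def pairwise_rank_loss(logp_positive, logp_negative):
    # -log sigmoid(s+ - s-): small when the relevant docid outscores the negative one
    return -math.log(1.0 / (1.0 + math.exp(-(logp_positive - logp_negative))))

good = pairwise_rank_loss(-1.0, -5.0)   # relevant docid scored well above the negative
bad = pairwise_rank_loss(-5.0, -1.0)    # ranking inverted: large penalty
```

Because the loss depends only on the model's own docid scores, no auxiliary reward model or reinforcement fine-tuning loop is needed.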

Updated: 2025-04-07 15:27:37

Subjects: cs.IR,cs.AI,cs.DL,cs.LG,H.3.3

Download: http://arxiv.org/abs/2504.05181v1

BRIDGES: Bridging Graph Modality and Large Language Models within EDA Tasks

While many EDA tasks already involve graph-based data, existing LLMs in EDA primarily either represent graphs as sequential text, or simply ignore graph-structured data that might be beneficial like dataflow graphs of RTL code. Recent studies have found that LLM performance suffers when graphs are represented as sequential text, and using additional graph information significantly boosts performance. To address these challenges, we introduce BRIDGES, a framework designed to incorporate graph modality into LLMs for EDA tasks. BRIDGES integrates an automated data generation workflow, a solution that combines graph modality with LLM, and a comprehensive evaluation suite. First, we establish an LLM-driven workflow to generate RTL and netlist-level data, converting them into dataflow and netlist graphs with function descriptions. This workflow yields a large-scale dataset comprising over 500,000 graph instances and more than 1.5 billion tokens. Second, we propose a lightweight cross-modal projector that encodes graph representations into text-compatible prompts, enabling LLMs to effectively utilize graph data without architectural modifications. Experimental results demonstrate 2x to 10x improvements across multiple tasks compared to text-only baselines, including accuracy in design retrieval, type prediction and perplexity in function description, with negligible computational overhead (<1% model weights increase and <30% additional runtime overhead). Even without additional LLM finetuning, our results outperform text-only by a large margin. We plan to release BRIDGES, including the dataset, models, and training flow.
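
A cross-modal projector of the kind described is, at its simplest, a learned map from a graph-level embedding to a handful of "soft prompt" vectors in the LLM's token-embedding space. The sketch below is illustrative only (the dimensions and the single linear layer are assumptions; BRIDGES' projector may be more elaborate):

```python
import numpy as np

graph_dim, llm_dim, n_prompt_tokens = 64, 128, 4   # assumed sizes

rng = np.random.default_rng(0)
# Learned projector weights (random here, trained in practice).
W = rng.normal(scale=0.02, size=(graph_dim, n_prompt_tokens * llm_dim))

def project_graph(graph_embedding):
    flat = graph_embedding @ W                       # (n_prompt_tokens * llm_dim,)
    return flat.reshape(n_prompt_tokens, llm_dim)    # prompt-token matrix for the LLM

g = rng.normal(size=(graph_dim,))                    # e.g. pooled dataflow-graph encoding
soft_prompt = project_graph(g)                       # prepended to the text embeddings
```

The frozen LLM consumes `soft_prompt` like ordinary token embeddings, which is why no architectural modification is required.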

Updated: 2025-04-07 15:27:32

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05180v1

Factored-NeuS: Reconstructing Surfaces, Illumination, and Materials of Possibly Glossy Objects

We develop a method that recovers the surface, materials, and illumination of a scene from its posed multi-view images. In contrast to prior work, it does not require any additional data and can handle glossy objects or bright lighting. It is a progressive inverse rendering approach, which consists of three stages. In the first stage, we reconstruct the scene radiance and signed distance function (SDF) with a novel regularization strategy for specular reflections. We propose to explain a pixel color using both surface and volume rendering jointly, which allows for handling complex view-dependent lighting effects for surface reconstruction. In the second stage, we distill light visibility and indirect illumination from the learned SDF and radiance field using learnable mapping functions. Finally, we design a method for estimating the ratio of incoming direct light reflected in a specular manner and use it to reconstruct the materials and direct illumination. Experimental results demonstrate that the proposed method outperforms the current state-of-the-art in recovering surfaces, materials, and lighting without relying on any additional data.

Updated: 2025-04-07 15:24:58

Subjects: cs.CV,cs.AI,cs.GR

Download: http://arxiv.org/abs/2305.17929v2

Learning symmetries in datasets

We investigate how symmetries present in datasets affect the structure of the latent space learned by Variational Autoencoders (VAEs). By training VAEs on data originating from simple mechanical systems and particle collisions, we analyze the organization of the latent space through a relevance measure that identifies the most meaningful latent directions. We show that when symmetries or approximate symmetries are present, the VAE self-organizes its latent space, effectively compressing the data along a reduced number of latent variables. This behavior captures the intrinsic dimensionality determined by the symmetry constraints and reveals hidden relations among the features. Furthermore, we provide a theoretical analysis of a simple toy model, demonstrating how, under idealized conditions, the latent space aligns with the symmetry directions of the data manifold. We illustrate these findings with examples ranging from two-dimensional datasets with $O(2)$ symmetry to realistic datasets from electron-positron and proton-proton collisions. Our results highlight the potential of unsupervised generative models to expose underlying structures in data and offer a novel approach to symmetry discovery without explicit supervision.
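
One standard way to score how "relevant" each latent direction is (used here as an illustrative stand-in for the paper's relevance measure) is the per-dimension KL divergence of the encoder posterior from the unit-Gaussian prior, averaged over the data; collapsed dimensions score near zero:

```python
import numpy as np

def latent_relevance(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, 1)) per sample and latent dimension,
    # then averaged over the dataset.
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl.mean(axis=0)

rng = np.random.default_rng(0)
n = 1000
# Dim 0: informative (spread-out means, small variance); dim 1: collapsed to the prior.
mu = np.stack([rng.normal(0.0, 2.0, n), rng.normal(0.0, 0.01, n)], axis=1)
log_var = np.stack([np.full(n, -2.0), np.full(n, -0.001)], axis=1)
relevance = latent_relevance(mu, log_var)
```

A symmetry-constrained dataset would show only a few dimensions with large relevance, matching the intrinsic dimensionality discussed above.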

Updated: 2025-04-07 15:17:41

Subjects: cs.LG,hep-ph

Download: http://arxiv.org/abs/2504.05174v1

Attention-Based Multi-Scale Temporal Fusion Network for Uncertain-Mode Fault Diagnosis in Multimode Processes

Fault diagnosis in multimode processes plays a critical role in ensuring the safe operation of industrial systems across multiple modes. It faces a major unresolved challenge: significant distributional differences among monitoring data from multiple modes make it difficult for models to extract shared feature representations related to system health conditions. In response to this problem, this paper introduces a novel method called the attention-based multi-scale temporal fusion network. Multi-scale depthwise convolutions and gated recurrent units are employed to extract multi-scale contextual local features and long-short-term features. A temporal attention mechanism is designed to focus on critical time points with higher cross-mode shared information, thereby enhancing the accuracy of fault diagnosis. The proposed model is applied to the Tennessee Eastman process dataset and the three-phase flow facility dataset. The experiments demonstrate that the proposed model achieves superior diagnostic performance while maintaining a small model size.
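
The temporal-attention step can be sketched generically: score each time step, normalize the scores with a softmax, and pool features with those weights so that time points carrying more cross-mode shared information dominate the summary. An illustrative numpy version (the linear scoring function is an assumption, not the paper's design):

```python
import numpy as np

def temporal_attention(features, w):
    """features: (T, D) sequence; w: (D,) scoring vector. Returns weights and pooled feature."""
    scores = features @ w                        # (T,) relevance score per time step
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights, weights @ features           # (T,) attention weights, (D,) pooled feature

T, D = 6, 4
rng = np.random.default_rng(0)
feats = rng.normal(size=(T, D))
attn, pooled = temporal_attention(feats, rng.normal(size=(D,)))
```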

Updated: 2025-04-07 15:16:22

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05172v1

SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection

Multimodal 3D object detection based on deep neural networks has indeed made significant progress. However, it still faces challenges due to the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds. Existing methods usually aggregate multimodal features at a single stage. However, leveraging multi-stage cross-modal features is crucial for detecting objects of various scales. Therefore, these methods often struggle to integrate features across different scales and modalities effectively, thereby restricting the accuracy of detection. Additionally, the time-consuming Query-Key-Value-based (QKV-based) cross-attention operations often utilized in existing methods aid in reasoning the location and existence of objects by capturing non-local contexts. However, this approach tends to increase computational complexity. To address these challenges, we present SSLFusion, a novel Scale & Space Aligned Latent Fusion Model, consisting of a scale-aligned fusion strategy (SAF), a 3D-to-2D space alignment module (SAM), and a latent cross-modal fusion module (LFM). SAF mitigates scale misalignment between modalities by aggregating features from both images and point clouds across multiple levels. SAM is designed to reduce the inter-modal gap between features from images and point clouds by incorporating 3D coordinate information into 2D image features. Additionally, LFM captures cross-modal non-local contexts in the latent space without utilizing the QKV-based attention operations, thus mitigating computational complexity. Experiments on the KITTI and DENSE datasets demonstrate that our SSLFusion outperforms state-of-the-art methods. Our approach obtains an absolute gain of 2.15% in 3D AP, compared with the state-of-art method GraphAlign on the moderate level of the KITTI test set.

Updated: 2025-04-07 15:15:06

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.05170v1

Machine learning interatomic potential can infer electrical response

Modeling the response of material and chemical systems to electric fields remains a longstanding challenge. Machine learning interatomic potentials (MLIPs) offer an efficient and scalable alternative to quantum mechanical methods but do not by themselves incorporate electrical response. Here, we show that polarization and Born effective charge (BEC) tensors can be directly extracted from long-range MLIPs within the Latent Ewald Summation (LES) framework, solely by learning from energy and force data. Using this approach, we predict the infrared spectra of bulk water under zero or finite external electric fields, ionic conductivities of high-pressure superionic ice, and the phase transition and hysteresis in ferroelectric PbTiO$_3$ perovskite. This work thus extends the capability of MLIPs to predict electrical response--without training on charges or polarization or BECs--and enables accurate modeling of electric-field-driven processes in diverse systems at scale.

Updated: 2025-04-07 15:14:07

Subjects: cond-mat.mtrl-sci,cs.LG,physics.chem-ph,physics.comp-ph

Download: http://arxiv.org/abs/2504.05169v1

RLBayes: a Bayesian Network Structure Learning Algorithm via Reinforcement Learning-Based Search Strategy

The score-based structure learning of Bayesian networks (BNs) is an effective way to learn BN models, which are regarded as some of the most compelling probabilistic graphical models in the field of representation and reasoning under uncertainty. However, the search space of structure learning grows super-exponentially as the number of variables increases, which makes BN structure learning an NP-hard problem, as well as a combinatorial optimization problem (COP). Despite the successes of many heuristic methods on it, the results of BN structure learning are usually unsatisfactory. Inspired by Q-learning, in this paper, a Bayesian network structure learning algorithm via a reinforcement learning-based (RL-based) search strategy is proposed, namely RLBayes. The method borrows the idea of RL and records and guides the learning process via a dynamically maintained Q-table. By creating and maintaining the dynamic Q-table, RLBayes stores the effectively unlimited search space within limited space, thereby achieving the structure learning of BNs via Q-learning. Not only is it theoretically proved that RLBayes can converge to the globally optimal BN structure, but it is also experimentally shown that RLBayes outperforms almost all other heuristic search algorithms.
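
The dynamically maintained Q-table is the key trick: only visited (structure, operation) pairs are ever stored, so the super-exponential search space never has to be materialized. A toy sketch under assumed specifics (states as edge sets, actions as edge operations, a score-based reward; none of this is RLBayes' exact encoding):

```python
# Dict-backed Q-table: entries appear only when a (state, action) pair is visited.
Q = {}
alpha, gamma = 0.5, 0.9   # learning rate and discount (assumed values)

def q_update(state, action, reward, next_state, next_actions):
    """Standard Q-learning update over the structure-search graph."""
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

s0 = frozenset()                       # empty DAG over variables {A, B}
a0 = ("add", "A->B")                   # edge operation as the action
s1 = frozenset({"A->B"})
# Reward would come from a BN scoring function (e.g. a score improvement); assumed 1.0 here.
q_update(s0, a0, reward=1.0, next_state=s1, next_actions=[("remove", "A->B")])
```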

Updated: 2025-04-07 15:11:51

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05167v1

Adversarial Robustness for Deep Learning-based Wildfire Prediction Models

Rapidly growing wildfires have recently devastated societal assets, exposing a critical need for early warning systems to expedite relief efforts. Smoke detection using camera-based Deep Neural Networks (DNNs) offers a promising solution for wildfire prediction. However, the rarity of smoke across time and space limits training data, raising model overfitting and bias concerns. Current DNNs, primarily Convolutional Neural Networks (CNNs) and transformers, complicate robustness evaluation due to architectural differences. To address these challenges, we introduce WARP (Wildfire Adversarial Robustness Procedure), the first model-agnostic framework for evaluating wildfire detection models' adversarial robustness. WARP addresses inherent limitations in data diversity by generating adversarial examples through image-global and -local perturbations. Global and local attacks superimpose Gaussian noise and PNG patches onto image inputs, respectively; this suits both CNNs and transformers while generating realistic adversarial scenarios. Using WARP, we assessed real-time CNNs and Transformers, uncovering key vulnerabilities. At times, transformers exhibited over 70% precision degradation under global attacks, while both models generally struggled to differentiate cloud-like PNG patches from real smoke during local attacks. To enhance model robustness, we proposed four wildfire-oriented data augmentation techniques based on WARP's methodology and results, which diversify smoke image data and improve model precision and robustness. These advancements represent a substantial step toward developing a reliable early wildfire warning system, which may be our first safeguard against wildfire destruction.
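
The two perturbation families are simple to state: the global attack superimposes Gaussian noise on the whole image, and the local attack pastes a patch (standing in for a cloud-like PNG) at some location. A minimal sketch with assumed parameter values:

```python
import numpy as np

def global_attack(img, sigma=0.1, seed=0):
    """Image-global attack: additive Gaussian noise, clipped to the valid [0, 1] range."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def local_attack(img, patch, y=0, x=0):
    """Image-local attack: overwrite a region with a patch (a stand-in for a PNG overlay)."""
    out = img.copy()
    out[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return out

img = np.full((8, 8), 0.5)                     # toy grayscale frame
noisy = global_attack(img)
patched = local_attack(img, np.ones((3, 3)))   # bright 3x3 "cloud" patch at the corner
```

Both operations act purely on the input image, which is what makes the procedure model-agnostic across CNNs and transformers.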

Updated: 2025-04-07 15:10:53

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2412.20006v3

Evaluating Knowledge Graph Based Retrieval Augmented Generation Methods under Knowledge Incompleteness

Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) is a technique that enhances Large Language Model (LLM) inference in tasks like Question Answering (QA) by retrieving relevant information from knowledge graphs (KGs). However, real-world KGs are often incomplete, meaning that essential information for answering questions may be missing. Existing benchmarks do not adequately capture the impact of KG incompleteness on KG-RAG performance. In this paper, we systematically evaluate KG-RAG methods under incomplete KGs by removing triples using different methods and analyzing the resulting effects. We demonstrate that KG-RAG methods are sensitive to KG incompleteness, highlighting the need for more robust approaches in realistic settings.
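
The benchmark's core perturbation can be sketched in a few lines: degrade a KG by deleting a fraction of its (head, relation, tail) triples, here uniformly at random (the paper also compares other removal strategies):

```python
import random

def remove_triples(triples, fraction, seed=0):
    """Return a degraded copy of the KG with the given fraction of triples removed."""
    rng = random.Random(seed)
    keep = max(0, len(triples) - int(round(fraction * len(triples))))
    return rng.sample(triples, keep)

kg = [("Paris", "capital_of", "France"),
      ("Berlin", "capital_of", "Germany"),
      ("Seine", "flows_through", "Paris"),
      ("Rhine", "flows_through", "Germany")]
degraded = remove_triples(kg, fraction=0.5)   # KG-RAG is then evaluated on `degraded`
```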

Updated: 2025-04-07 15:08:03

Subjects: cs.AI

Download: http://arxiv.org/abs/2504.05163v1

DDPM Score Matching and Distribution Learning

Score estimation is the backbone of score-based generative models (SGMs), especially denoising diffusion probabilistic models (DDPMs). A key result in this area shows that with accurate score estimates, SGMs can efficiently generate samples from any realistic data distribution (Chen et al., ICLR'23; Lee et al., ALT'23). This distribution learning result, where the learned distribution is implicitly that of the sampler's output, does not explain how score estimation relates to classical tasks of parameter and density estimation. This paper introduces a framework that reduces score estimation to these two tasks, with various implications for statistical and computational learning theory: Parameter Estimation: Koehler et al. (ICLR'23) demonstrate that a score-matching variant is statistically inefficient for the parametric estimation of multimodal densities common in practice. In contrast, we show that under mild conditions, denoising score-matching in DDPMs is asymptotically efficient. Density Estimation: By linking generation to score estimation, we lift existing score estimation guarantees to $(\epsilon,\delta)$-PAC density estimation, i.e., a function approximating the target log-density within $\epsilon$ on all but a $\delta$-fraction of the space. We provide (i) minimax rates for density estimation over H\"older classes and (ii) a quasi-polynomial PAC density estimation algorithm for the classical Gaussian location mixture model, building on and addressing an open problem from Gatmiry et al. (arXiv'24). Lower Bounds for Score Estimation: Our framework offers the first principled method to prove computational lower bounds for score estimation across general distributions. As an application, we establish cryptographic lower bounds for score estimation in general Gaussian mixture models, conceptually recovering Song's (NeurIPS'24) result and advancing his key open problem.
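
As a concrete reminder of what denoising score matching optimizes, here is a 1-D toy (not from the paper): for standard-normal data corrupted with Gaussian noise of scale $\sigma$, the DSM objective is minimized at the true marginal-score slope $-1/(1+\sigma^2)$, which even a one-parameter linear model recovers:

```python
import numpy as np

def dsm_loss(theta, x, x_noisy, sigma):
    score_model = theta * x_noisy                 # linear score model s(y) = theta * y
    target = -(x_noisy - x) / sigma ** 2          # score of the Gaussian corruption kernel
    return float(np.mean((score_model - target) ** 2))

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(0.0, 1.0, 100_000)                 # clean standard-normal data
x_noisy = x + sigma * rng.normal(size=x.size)     # corrupted samples

thetas = np.linspace(-2.0, 0.0, 81)
best = thetas[np.argmin([dsm_loss(t, x, x_noisy, sigma) for t in thetas])]
# the minimizer should sit near the true slope -1/(1 + sigma^2) = -0.8
```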

Updated: 2025-04-07 15:07:19

Subjects: stat.ML,cs.DS,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2504.05161v1

A Fast Multiplication Algorithm and RLWE-PLWE Equivalence for the Maximal Real Subfield of the $2^r p^s$-th Cyclotomic Field

This paper proves the RLWE-PLWE equivalence for the maximal real subfields of the cyclotomic fields with conductor $n = 2^r p^s$, where $p$ is an odd prime, and $r \geq 0$ and $s \geq 1$ are integers. In particular, we show that the canonical embedding as a linear transform has a condition number bounded above by a polynomial in $n$. In addition, we describe a fast multiplication algorithm in the ring of integers of these real subfields. The multiplication algorithm uses the fast Discrete Cosine Transform (DCT) and has computational complexity $\mathcal{O}(n \log n)$. Both the proof of the RLWE-PLWE equivalence and the fast multiplication algorithm are generalizations of previous results by Ahola et al., where the same claims are proved for a single prime $p = 3$.
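
The paper's algorithm relies on a fast DCT in the ring of integers of the real subfield; as a generic illustration of the same $\mathcal{O}(n \log n)$ transform-multiply-invert principle, here is FFT-based multiplication of small integer polynomials (schoolbook multiplication would be $\mathcal{O}(n^2)$):

```python
import numpy as np

def poly_mul_fft(a, b):
    """Multiply integer polynomials (coefficient lists, lowest degree first) via the FFT."""
    n = len(a) + len(b) - 1
    size = 1 << (n - 1).bit_length()                 # next power of two >= result length
    fa = np.fft.rfft(np.asarray(a, float), size)
    fb = np.fft.rfft(np.asarray(b, float), size)
    prod = np.fft.irfft(fa * fb, size)[:n]
    return [int(round(c)) for c in prod]             # exact for small integer coefficients

# (1 + 2x)(3 + 4x + 5x^2) = 3 + 10x + 13x^2 + 10x^3
product = poly_mul_fft([1, 2], [3, 4, 5])
```

The DCT-based algorithm plays the analogous role for the real subfield's integral basis, where the symmetry of the maximal real subfield makes the cosine transform the natural choice.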

Updated: 2025-04-07 15:01:48

Subjects: cs.CR,math.NT,94A60 (Primary), 11R80, 11T06 (Secondary),E.3.3

Download: http://arxiv.org/abs/2504.05159v1

Addressing Label Leakage in Knowledge Tracing Models

Knowledge Tracing (KT) is concerned with predicting students' future performance on learning items in intelligent tutoring systems. Learning items are tagged with skill labels called knowledge concepts (KCs). Many KT models expand the sequence of item-student interactions into KC-student interactions by replacing learning items with their constituting KCs. This approach addresses the issue of sparse item-student interactions and minimises the number of model parameters. However, we identified a label leakage problem with this approach. The model's ability to learn correlations between KCs belonging to the same item can result in the leakage of ground truth labels, which leads to decreased performance, particularly on datasets with a high number of KCs per item. In this paper, we present methods to prevent label leakage in knowledge tracing (KT) models. Our model variants that utilize these methods consistently outperform their original counterparts. This further underscores the impact of label leakage on model performance. Additionally, these methods enhance the overall performance of KT models, with one model variant surpassing all tested baselines on different benchmarks. Notably, our methods are versatile and can be applied to a wide range of KT models.
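
The expansion that creates the leakage risk is easy to state: each (item, correctness) interaction is replaced by one interaction per KC tagged on the item, all inheriting the item's label. A minimal sketch with made-up item-KC tags:

```python
# Hypothetical KC tagging: item q1 carries two knowledge concepts, q2 carries one.
item_to_kcs = {"q1": ["fractions", "decimals"], "q2": ["decimals"]}

def expand_to_kcs(interactions):
    """Expand (item, correct) interactions into (KC, correct) interactions."""
    expanded = []
    for item, correct in interactions:
        for kc in item_to_kcs[item]:
            expanded.append((kc, correct))   # the same ground-truth label repeats per KC
    return expanded

seq = expand_to_kcs([("q1", 1), ("q2", 0)])
```

Because all KCs from one item share one label, a model that learns the within-item correlation can read the label off a sibling KC, which is exactly the leakage the paper's methods prevent.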

Updated: 2025-04-07 15:00:58

Categories: cs.CY,cs.AI,cs.LG

Download: http://arxiv.org/abs/2403.15304v3

Emojis Decoded: Leveraging ChatGPT for Enhanced Understanding in Social Media Communications

Emojis, which encapsulate semantics beyond mere words or phrases, have become prevalent in social network communications. This has spurred increasing scholarly interest in exploring their attributes and functionalities. However, emoji-related research and application face two primary challenges. First, researchers typically rely on crowd-sourcing to annotate emojis in order to understand their sentiments, usage intentions, and semantic meanings. Second, subjective interpretations by users can often lead to misunderstandings of emojis and create communication barriers. Large Language Models (LLMs) have achieved significant success in various annotation tasks, with ChatGPT demonstrating expertise across multiple domains. In our study, we assess ChatGPT's effectiveness in handling previously annotated and downstream tasks. Our objective is to validate the hypothesis that ChatGPT can serve as a viable alternative to human annotators in emoji research and that its ability to explain emoji meanings can enhance clarity and transparency in online communications. Our findings indicate that ChatGPT has extensive knowledge of emojis. It is adept at elucidating the meaning of emojis across various application scenarios and demonstrates the potential to replace human annotators in a range of tasks.

Updated: 2025-04-07 15:00:36

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2402.01681v3

Leveraging Label Potential for Enhanced Multimodal Emotion Recognition

Multimodal emotion recognition (MER) seeks to integrate various modalities to predict emotional states accurately. However, most current research focuses solely on the fusion of audio and text features, overlooking the valuable information in emotion labels. This oversight could potentially hinder the performance of existing methods, as emotion labels harbor rich, insightful information that could significantly aid MER. We introduce a novel model called Label Signal-Guided Multimodal Emotion Recognition (LSGMER) to overcome this limitation. This model aims to fully harness the power of emotion label information to boost the classification accuracy and stability of MER. Specifically, LSGMER employs a Label Signal Enhancement module that optimizes the representation of modality features by interacting with audio and text features through label embeddings, enabling it to capture the nuances of emotions precisely. Furthermore, we propose a Joint Objective Optimization (JOO) approach to enhance classification accuracy by introducing the Attribution-Prediction Consistency Constraint (APC), which strengthens the alignment between fused features and emotion categories. Extensive experiments conducted on the IEMOCAP and MELD datasets have demonstrated the effectiveness of our proposed LSGMER model.

Updated: 2025-04-07 15:00:34

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2504.05158v1

SparsyFed: Sparse Adaptive Federated Training

Sparse training is often adopted in cross-device federated learning (FL) environments where constrained devices collaboratively train a machine learning model on private data by exchanging pseudo-gradients across heterogeneous networks. Although sparse training methods can reduce communication overhead and computational burden in FL, they are often not used in practice for the following key reasons: (1) data heterogeneity makes it harder for clients to reach consensus on sparse models compared to dense ones, requiring longer training; (2) methods for obtaining sparse masks lack adaptivity to accommodate very heterogeneous data distributions, crucial in cross-device FL; and (3) additional hyperparameters are required, which are notably challenging to tune in FL. This paper presents SparsyFed, a practical federated sparse training method that critically addresses the problems above. Previous works have only solved one or two of these challenges at the expense of introducing new trade-offs, such as clients' consensus on masks versus sparsity pattern adaptivity. We show that SparsyFed simultaneously (1) can produce 95% sparse models, with negligible degradation in accuracy, while only needing a single hyperparameter, (2) achieves a per-round weight regrowth 200 times smaller than previous methods, and (3) allows the sparse masks to adapt to highly heterogeneous data distributions and outperform all baselines under such conditions.
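As a toy illustration of what a 95%-sparse model means in practice (this is plain magnitude pruning, not SparsyFed's adaptive mask rule, and the names are illustrative):

```python
def magnitude_mask(weights, sparsity):
    # Keep the largest-magnitude fraction (1 - sparsity) of weights
    # and zero the rest. Returns (masked_weights, mask).
    n_keep = max(1, int(round(len(weights) * (1.0 - sparsity))))
    ranked = sorted(range(len(weights)),
                    key=lambda i: abs(weights[i]), reverse=True)
    kept = set(ranked[:n_keep])
    mask = [1 if i in kept else 0 for i in range(len(weights))]
    return [w * m for w, m in zip(weights, mask)], mask

w = [0.05, -1.2, 0.3, 0.01, -0.002, 0.7, -0.4, 0.09, 0.0, 2.0]
sparse_w, mask = magnitude_mask(w, sparsity=0.8)  # keep 20% of weights
```

In FL, clients would additionally need to agree on (or adapt) such masks across rounds, which is the consensus/adaptivity tension the abstract describes.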

Updated: 2025-04-07 14:57:02

Categories: cs.LG

Download: http://arxiv.org/abs/2504.05153v1

A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks

This paper presents Post-Decision Proximal Policy Optimization (PDPPO), a novel variation of the leading deep reinforcement learning method, Proximal Policy Optimization (PPO). The PDPPO state transition process is divided into two steps: a deterministic step resulting in the post-decision state and a stochastic step leading to the next state. Our approach incorporates post-decision states and dual critics to reduce the problem's dimensionality and enhance the accuracy of value function estimation. We exemplify these dynamics with lot-sizing, a mixed-integer programming problem whose objective is to optimize production, delivery fulfillment, and inventory levels under uncertain demand and cost parameters. This paper evaluates the performance of PDPPO across various environments and configurations. Notably, PDPPO with a dual critic architecture achieves nearly double the maximum reward of vanilla PPO in specific scenarios, requiring fewer episode iterations and demonstrating faster and more consistent learning across different initializations. On average, PDPPO outperforms PPO in environments with a stochastic component in the state transition. These results support the benefits of using a post-decision state. Integrating this post-decision state in the value function approximation leads to more informed and efficient learning in high-dimensional and stochastic environments.
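The two-step transition can be sketched for a toy lot-sizing state (a hedged illustration with hypothetical names; the paper's environments and demand model are richer): the deterministic step applies the production decision before randomness is realized, and the stochastic step then draws demand.

```python
import random

def post_decision_state(inventory, production, demand_committed=0):
    # Deterministic step: apply the agent's decision before any
    # randomness is realized; this is the post-decision state.
    return inventory + production - demand_committed

def next_state(post_state, rng):
    # Stochastic step: realize random demand to reach the next state.
    demand = rng.choice([0, 1, 2, 3])
    return max(0, post_state - demand), demand

rng = random.Random(0)
pd = post_decision_state(inventory=5, production=3)
nxt, demand = next_state(pd, rng)
```

Evaluating the critic at `pd` rather than at the pre-decision state is what removes the stochastic component from the value estimate, the effect the dual-critic design builds on.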

Updated: 2025-04-07 14:56:43

Categories: cs.LG,cs.AI,I.2.6; G.1.6

Download: http://arxiv.org/abs/2504.05150v1

Pr$εε$mpt: Sanitizing Sensitive Prompts for LLMs

The rise of large language models (LLMs) has introduced new privacy challenges, particularly during inference where sensitive information in prompts may be exposed to proprietary LLM APIs. In this paper, we address the problem of formally protecting the sensitive information contained in a prompt while maintaining response quality. To this end, first, we introduce a cryptographically inspired notion of a prompt sanitizer which transforms an input prompt to protect its sensitive tokens. Second, we propose Pr$\epsilon\epsilon$mpt, a novel system that implements a prompt sanitizer. Pr$\epsilon\epsilon$mpt categorizes sensitive tokens into two types: (1) those where the LLM's response depends solely on the format (such as SSNs, credit card numbers), for which we use format-preserving encryption (FPE); and (2) those where the response depends on specific values, (such as age, salary) for which we apply metric differential privacy (mDP). Our evaluation demonstrates that Pr$\epsilon\epsilon$mpt is a practical method to achieve meaningful privacy guarantees, while maintaining high utility compared to unsanitized prompts, and outperforming prior methods.
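A toy sketch of the two token categories (hedged: the random-digit replacement below is only a stand-in for real keyed, invertible FPE, and the Laplace draw is a simplified one-dimensional mDP mechanism; all names are illustrative):

```python
import math
import random
import re

def pseudonymize_ssn(ssn, rng):
    # Stand-in for format-preserving encryption: replace each digit
    # with a random digit, preserving the NNN-NN-NNNN format.
    # Real FPE is keyed and invertible; this is illustration only.
    return re.sub(r"\d", lambda _: str(rng.randrange(10)), ssn)

def perturb_value(value, epsilon, rng):
    # Simplified metric-DP-style mechanism: add Laplace noise with
    # scale 1/epsilon via inverse-CDF sampling.
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -(1.0 / epsilon) * sign * math.log(1.0 - 2.0 * abs(u))
    return value + noise

rng = random.Random(42)
fake_ssn = pseudonymize_ssn("123-45-6789", rng)  # format-dependent token
noisy_age = perturb_value(34, epsilon=0.5, rng=rng)  # value-dependent token
```

The point of the split is that the LLM's answer stays useful: a format-faithful pseudonym suffices for category (1), while category (2) needs a value that is close, but not equal, to the true one.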

Updated: 2025-04-07 14:52:40

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2504.05147v1

Online Cluster-Based Parameter Control for Metaheuristic

Parameter setting is a crucial process in metaheuristics, since it can strongly impact their performance. It is also a highly complex and challenging procedure, as it requires a deep understanding of both the optimization algorithm and the optimization problem at hand. In recent years, the rise of autonomous decision systems has attracted ongoing scientific interest in this direction, producing a considerable number of parameter-tuning methods. These methods fall into two types: offline and online. Online methods usually excel in complex real-world problems, as they can offer dynamic parameter control throughout the execution of the algorithm. The present work proposes a general-purpose online parameter-tuning method called Cluster-Based Parameter Adaptation (CPA) for population-based metaheuristics. The main idea lies in identifying promising areas within the parameter search space and generating new parameters around these areas. The method's validity is demonstrated using the differential evolution algorithm and verified on established test suites of low- and high-dimensional problems. The obtained results are statistically analyzed and compared with state-of-the-art algorithms, including advanced auto-tuning approaches. The analysis reveals CPA's promising, solid performance as well as its robustness across a variety of benchmark problems and dimensions.
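The core idea (find a promising region of parameter space, then sample new parameters around it) can be sketched as follows. This is a hedged, simplified stand-in, not CPA's actual clustering procedure; the centroid-of-top-k rule, the Gaussian sampling, and all names are illustrative assumptions.

```python
import random

def promising_region(params, scores, top_k=3):
    # Identify a promising area: centroid of the top_k best-scoring
    # parameter vectors (lower score = better here).
    ranked = sorted(zip(scores, params))[:top_k]
    dim = len(params[0])
    return [sum(p[d] for _, p in ranked) / top_k for d in range(dim)]

def sample_around(center, spread, rng):
    # Generate a new parameter vector near the promising area.
    return [c + rng.gauss(0.0, spread) for c in center]

rng = random.Random(1)
pop = [[0.9, 0.5], [0.1, 0.2], [0.5, 0.9], [0.12, 0.25], [0.11, 0.22]]
fit = [5.0, 1.0, 4.0, 1.2, 1.1]  # lower is better
center = promising_region(pop, fit)
new_params = sample_around(center, spread=0.05, rng=rng)
```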

Updated: 2025-04-07 14:48:30

Categories: math.OC,cs.LG

Download: http://arxiv.org/abs/2504.05144v1

Taming Double-Spending in Offline Payments with Reputation-Weighted Loan Networks

Blockchain solutions typically assume a synchronous network to ensure consistency and achieve consensus. In contrast, offline transaction systems aim to enable users to agree on and execute transactions without assuming bounded communication delays when interacting with the blockchain. Most existing offline payment schemes depend on trusted hardware wallets that are assumed to be secure and tamper-proof. While this work introduces Overdraft, a novel offline payment system that shifts the reliance from hardware to users themselves. Overdraft allows potential payment receivers to assess the likelihood of being paid, allowing them to accept transactions with confidence or deny them. Overdraft achieves this by maintaining a loan network that is weighted by online reputation. This loan network contains time-limited agreements where users pledge to cover another user's payment if necessary. For example, when a payer lacks sufficient funds at the moment of commitment. Offline users rely on the last known view of the loan network -- which they had access to when last online -- to determine whether to participate in an offline transaction. This view is used to estimate the probability of eventual payment, possibly using multiple loans. Once online again, users commit their transactions to the blockchain with any conflicts being resolved deterministically. Overdraft incorporates incentives for users and is designed to be resilient against Sybil attacks. As a proof of concept, we implemented Overdraft as an Ethereum Solidity smart contract and deployed it on the Sepolia testnet to evaluate its performance.
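A receiver's "probability of eventual payment, possibly using multiple loans" can be sketched with a toy estimator (hedged: this brute-force enumeration over independent pledges is an illustrative stand-in, not Overdraft's estimator; the `p_honor` values stand in for reputation-derived weights).

```python
from itertools import product

def payment_probability(balance, amount, pledges):
    # pledges: list of (pledged_amount, p_honor) from the loan network,
    # p_honor derived from the pledger's online reputation. Estimate
    # the probability that balance plus honored pledges covers the
    # amount, enumerating honor/default outcomes (fine for few loans).
    shortfall = amount - balance
    if shortfall <= 0:
        return 1.0
    total = 0.0
    for outcome in product([0, 1], repeat=len(pledges)):
        covered = sum(a for (a, _), o in zip(pledges, outcome) if o)
        prob = 1.0
        for (_, p), o in zip(pledges, outcome):
            prob *= p if o else (1 - p)
        if covered >= shortfall:
            total += prob
    return total

p = payment_probability(balance=3, amount=5, pledges=[(2, 0.9), (2, 0.5)])
```

The receiver would accept the offline transaction only if this estimate clears some acceptance threshold.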

Updated: 2025-04-07 14:48:19

Categories: cs.CR

Download: http://arxiv.org/abs/2504.05143v1

EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively

Open-World Tracking (OWT) aims to track every object of any category, which requires the model to have strong generalization capabilities. Trackers can improve their generalization ability by leveraging Visual Language Models (VLMs). However, challenges arise with the fine-tuning strategies when VLMs are transferred to OWT: full fine-tuning results in excessive parameter and memory costs, while the zero-shot strategy leads to sub-optimal performance. To solve the problem, EffOWT is proposed for efficiently transferring VLMs to OWT. Specifically, we build a small and independent learnable side network outside the VLM backbone. By freezing the backbone and only executing backpropagation on the side network, the model's efficiency requirements can be met. In addition, EffOWT enhances the side network by proposing a hybrid structure of Transformer and CNN to improve the model's performance in the OWT field. Finally, we implement sparse interactions on the MLP, thus reducing parameter updates and memory costs significantly. Thanks to the proposed methods, EffOWT achieves an absolute gain of 5.5% on the tracking metric OWTA for unknown categories, while only updating 1.3% of the parameters compared to full fine-tuning, with a 36.4% memory saving. Other metrics also demonstrate obvious improvement.

Updated: 2025-04-07 14:47:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.05141v1

Unifying Physics- and Data-Driven Modeling via Novel Causal Spatiotemporal Graph Neural Network for Interpretable Epidemic Forecasting

Accurate epidemic forecasting is crucial for effective disease control and prevention. Traditional compartmental models often struggle to estimate temporally and spatially varying epidemiological parameters, while deep learning models typically overlook disease transmission dynamics and lack interpretability in the epidemiological context. To address these limitations, we propose a novel Causal Spatiotemporal Graph Neural Network (CSTGNN), a hybrid framework that integrates a Spatio-Contact SIR model with Graph Neural Networks (GNNs) to capture the spatiotemporal propagation of epidemics. Inter-regional human mobility exhibits continuous and smooth spatiotemporal patterns, leading to adjacent graph structures that share underlying mobility dynamics. To model these dynamics, we employ an adaptive static connectivity graph to represent the stable components of human mobility and utilize a temporal dynamics model to capture fluctuations within these patterns. By integrating the adaptive static connectivity graph with the temporal dynamics graph, we construct a dynamic graph that encapsulates the comprehensive properties of human mobility networks. Additionally, to capture temporal trends and variations in infectious disease spread, we introduce a temporal decomposition model to handle temporal dependence. This model is then integrated with a dynamic graph convolutional network for epidemic forecasting. We validate our model using real-world datasets at the provincial level in China and the state level in Germany. Extensive studies demonstrate that our method effectively models the spatiotemporal dynamics of infectious diseases, providing a valuable tool for forecasting and intervention strategies. Furthermore, analysis of the learned parameters offers insights into disease transmission mechanisms, enhancing the interpretability and practical applicability of our model.
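For the compartmental component, a minimal single-region discrete-time SIR update looks as follows (a hedged illustration only: the paper's Spatio-Contact SIR is multi-region and coupled to GNNs; here `beta` and `gamma` merely stand in for the spatiotemporally varying parameters the hybrid model learns).

```python
def sir_step(s, i, r, beta, gamma, n):
    # One discrete-time SIR update: new infections move S -> I at rate
    # beta * S * I / N; recoveries move I -> R at rate gamma * I.
    new_inf = beta * s * i / n
    rec = gamma * i
    return s - new_inf, i + new_inf - rec, r + rec

s, i, r = 990.0, 10.0, 0.0
for _ in range(3):
    s, i, r = sir_step(s, i, r, beta=0.3, gamma=0.1, n=1000.0)
```

Interpretability comes from the fact that learned `beta`/`gamma` retain their epidemiological meaning (transmission and recovery rates), unlike opaque network weights.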

Updated: 2025-04-07 14:46:11

Categories: cs.LG,physics.soc-ph,q-bio.QM,stat.ML,92D30, 68T07,I.2.6; I.5.1

Download: http://arxiv.org/abs/2504.05140v1

Towards Optimal Heterogeneous Client Sampling in Multi-Model Federated Learning

Federated learning (FL) allows edge devices to collaboratively train models without sharing local data. As FL gains popularity, clients may need to train multiple unrelated FL models, but communication constraints limit their ability to train all models simultaneously. While clients could train FL models sequentially, opportunistically having FL clients concurrently train different models -- termed multi-model federated learning (MMFL) -- can reduce the overall training time. Prior work uses simple client-to-model assignments that do not optimize the contribution of each client to each model over the course of its training. Prior work on single-model FL shows that intelligent client selection can greatly accelerate convergence, but naïve extensions to MMFL can violate heterogeneous resource constraints at both the server and the clients. In this work, we develop a novel convergence analysis of MMFL with arbitrary client sampling methods, theoretically demonstrating the strengths and limitations of previous well-established gradient-based methods. Motivated by this analysis, we propose MMFL-LVR, a loss-based sampling method that minimizes training variance while explicitly respecting communication limits at the server and reducing computational costs at the clients. We extend this to MMFL-StaleVR, which incorporates stale updates for improved efficiency and stability, and MMFL-StaleVRE, a lightweight variant suitable for low-overhead deployment. Experiments show our methods improve average accuracy by up to 19.1% over random sampling, with only a 5.4% gap from the theoretical optimum (full client participation).
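The flavor of loss-based sampling under a hard per-round budget can be sketched as below (hedged: MMFL-LVR's actual sampler minimizes a variance bound across multiple models; this toy draws clients for one model with probability proportional to reported loss, without replacement, and all names are illustrative).

```python
import random

def sample_clients(losses, budget, rng):
    # Pick `budget` distinct clients, each draw proportional to its
    # current loss (higher loss -> more variance-reducing to include),
    # never exceeding the server's per-round communication budget.
    clients = list(losses)
    chosen = []
    for _ in range(min(budget, len(clients))):
        total = sum(losses[c] for c in clients)
        x, acc = rng.random() * total, 0.0
        for c in clients:
            acc += losses[c]
            if x <= acc:
                chosen.append(c)
                clients.remove(c)
                break
    return chosen

rng = random.Random(7)
picked = sample_clients({"c1": 2.0, "c2": 0.1, "c3": 1.5, "c4": 0.2},
                        budget=2, rng=rng)
```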

Updated: 2025-04-07 14:43:17

Categories: cs.LG,cs.DC,I.2.11

Download: http://arxiv.org/abs/2504.05138v1

Controlled Latent Diffusion Models for 3D Porous Media Reconstruction

Three-dimensional digital reconstruction of porous media presents a fundamental challenge in geoscience, requiring simultaneous resolution of fine-scale pore structures while capturing representative elementary volumes. We introduce a computational framework that addresses this challenge through latent diffusion models operating within the EDM framework. Our approach reduces dimensionality via a custom variational autoencoder trained on binary geological volumes, improving efficiency and enabling the generation of larger volumes than previously possible with diffusion models. A key innovation is our controlled unconditional sampling methodology, which enhances distribution coverage by first sampling target statistics from their empirical distributions, then generating samples conditioned on these values. Extensive testing on four distinct rock types demonstrates that conditioning on porosity - a readily computable statistic - is sufficient to ensure a consistent representation of multiple complex properties, including permeability, two-point correlation functions, and pore size distributions. The framework achieves better generation quality than pixel-space diffusion while enabling significantly larger volume reconstruction (256-cube voxels) with substantially reduced computational requirements, establishing a new state-of-the-art for digital rock physics applications.

Updated: 2025-04-07 14:41:54

Categories: physics.geo-ph,cs.LG

Download: http://arxiv.org/abs/2503.24083v2

DEPT: Decoupled Embeddings for Pre-training Language Models

Language Model pre-training uses broad data mixtures to enhance performance across domains and languages. However, training on such heterogeneous text corpora requires extensive and expensive efforts. Since these data sources vary significantly in lexical, syntactic, and semantic aspects, they cause negative interference or the ``curse of multilinguality''. To address these challenges we propose a communication-efficient pre-training framework, DEPT. Our method decouples embeddings from the transformer body while simultaneously training the latter on multiple data sources without requiring a shared vocabulary. DEPT can: (1) train robustly and effectively under significant data heterogeneity, (2) minimize token embedding parameters to only what the data source vocabulary requires, while cutting communication costs in direct proportion to both the communication frequency and the reduction in parameters, (3) enhance transformer body plasticity and generalization, improving both average perplexity (up to 20%) and downstream task performance, and (4) enable training with custom optimized vocabularies per data source. We demonstrate DEPT's potential via the first vocabulary-agnostic federated pre-training of billion-scale models, reducing communication costs by orders of magnitude and embedding memory by 4-5x.
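The decoupling can be pictured with a toy parameter layout (hedged: a structural sketch only, not DEPT's implementation; the dict-based "tables" and names are illustrative): each data source keeps an embedding table sized to its own vocabulary, while only the shared transformer-body parameters need to be communicated.

```python
def init_state(source_vocabs, body_dim):
    # Per-source embedding tables: each sized to that source's own
    # vocabulary, with no shared vocabulary required.
    embeddings = {src: {tok: 0.0 for tok in vocab}
                  for src, vocab in source_vocabs.items()}
    # One transformer-body parameter vector shared by all sources;
    # in federated training, only this part is synchronized.
    shared_body = [0.0] * body_dim
    return embeddings, shared_body

vocabs = {"web": ["the", "cat"], "code": ["def", "return", "cat"]}
emb, body = init_state(vocabs, body_dim=4)
comm_payload = len(body)                      # what gets communicated
local_params = {s: len(t) for s, t in emb.items()}  # stays local
```

Because embedding tables stay local and are sized per source, both communication volume and embedding memory shrink, which is the mechanism behind the reported 4-5x embedding-memory reduction.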

Updated: 2025-04-07 14:29:54

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2410.05021v5

Interpretable Style Takagi-Sugeno-Kang Fuzzy Clustering

Clustering is an efficient and essential technique for exploring latent knowledge of data. However, limited attention has been given to the interpretability of the clusters detected by most clustering algorithms. In addition, due to the homogeneity of data, different groups of data have their own homogeneous styles. In this paper, the above two aspects are considered, and an interpretable style Takagi-Sugeno-Kang (TSK) fuzzy clustering (IS-TSK-FC) algorithm is proposed. The clustering behavior of IS-TSK-FC is fully guided by the TSK fuzzy inference on fuzzy rules. In particular, samples are grouped into clusters represented by the corresponding consequent vectors of all fuzzy rules learned in an unsupervised manner. This can explain how the clusters are generated in detail, thus making the underlying decision-making process of the IS-TSK-FC interpretable. Moreover, a series of style matrices are introduced to facilitate the consequents of fuzzy rules in IS-TSK-FC by capturing the styles of clusters as well as the nuances between different styles. Consequently, all the fuzzy rules in IS-TSK-FC have powerful data representation capability. After determining the antecedents of all the fuzzy rules, the optimization problem of IS-TSK-FC can be iteratively solved in an alternation manner. The effectiveness of IS-TSK-FC as an interpretable clustering tool is validated through extensive experiments on benchmark datasets with unknown implicit/explicit styles. In particular, the superior clustering performance of IS-TSK-FC is demonstrated on case studies where different groups of data present explicit styles. The source code of IS-TSK-FC can be downloaded from https://github.com/gusuhang10/IS-TSK-FC.

Updated: 2025-04-07 14:28:56

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05125v1

The Transient Cost of Learning in Queueing Systems

Queueing systems are widely applicable stochastic models with use cases in communication networks, healthcare, service systems, etc. Although their optimal control has been extensively studied, most existing approaches assume perfect knowledge of the system parameters. This assumption rarely holds in practice where there is parameter uncertainty, thus motivating a recent line of work on bandit learning for queueing systems. This nascent stream of research focuses on the asymptotic performance of the proposed algorithms but does not provide insight on the transient performance in the early stages of the learning process. In this paper, we propose the Transient Cost of Learning in Queueing (TCLQ), a new metric that quantifies the maximum increase in time-averaged queue length caused by parameter uncertainty. We characterize the TCLQ of a single-queue multi-server system, and then extend these results to multi-queue multi-server systems and networks of queues. In establishing our results, we propose a unified analysis framework for TCLQ that bridges Lyapunov and bandit analysis, provides guarantees for a wide range of algorithms, and could be of independent interest.

Updated: 2025-04-07 14:22:40

Categories: cs.LG,cs.DS,cs.PF,math.PR

Download: http://arxiv.org/abs/2308.07817v3

The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning

Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. To bridge this gap, we investigate zero-shot generalization from the perspective of the data itself. We first demonstrate that zero-shot generalization happens very early during instruction tuning, with loss serving as a stable indicator. Next, we investigate training data arrangement through similarity and granularity perspectives, confirming that the timing of exposure to certain training examples may greatly facilitate generalization on unseen tasks. Finally, we propose a more grounded training data arrangement framework, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. Our code is released at https://github.com/thunlp/Dynamics-of-Zero-Shot-Generalization.

Updated: 2025-04-07 14:21:36

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.11721v2

Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection

Machine learning-based embedded systems for safety-critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology-agnostic approach. We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD-Xilinx's KV260 SoM.
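The effect of a bounded activation function under a soft error can be shown in miniature (hedged: a toy single-neuron illustration, not the paper's encoder-decoder models; `hardtanh` is one common bounded AF used as an example).

```python
def relu(x):
    # Unbounded activation: a corrupted weight propagates unchecked.
    return max(0.0, x)

def hardtanh(x, bound=1.0):
    # Bounded activation: output clipped to [-bound, bound].
    return max(-bound, min(bound, x))

# A soft error flips a high-order bit, turning a weight of 0.5 into a
# huge value; the bounded activation caps the propagated deviation.
w_clean, w_faulty, x = 0.5, 2.0 ** 20, 1.0
err_relu = abs(relu(w_faulty * x) - relu(w_clean * x))
err_hardtanh = abs(hardtanh(w_faulty * x) - hardtanh(w_clean * x))
```

The unbounded path lets the full bit-flip magnitude reach downstream layers, while the bounded path limits the deviation to the clipping range, which is the robustness mechanism the abstract weighs against accuracy and compressibility.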

Updated: 2025-04-07 14:21:31

标题: 通过激活函数选择在嵌入式DNN中平衡鲁棒性和效率

摘要: 基于机器学习的嵌入式系统用于安全关键应用,如航空航天和自动驾驶,必须对软错误引起的扰动具有鲁棒性。随着晶体管几何形状的缩小和电压的降低,现代电子设备变得更容易受到背景辐射的影响,增加了因软错误引起的故障的担忧。深度神经网络(DNNs)对这些错误的鲁棒性不仅取决于目标设备技术,还取决于模型结构和其参数的数值表示和算术精度。用于减少存储占用和计算复杂性的压缩技术,如修剪和量化,会改变模型结构和表示方式,影响软错误的鲁棒性。在这方面,虽然常常被忽视,但激活函数(AFs)的选择不仅影响准确性和可训练性,还影响可压缩性和错误鲁棒性。本文探讨了使用有界激活函数来增强对参数扰动的鲁棒性,同时评估其对模型准确性、可压缩性和计算负载的影响,采用技术无关的方法。我们重点研究了为高光谱图像语义分割开发的编码器-解码器卷积模型,应用于自动驾驶系统。实验在AMD-Xilinx的KV260 SoM上进行。

更新时间: 2025-04-07 14:21:31

领域: cs.LG,cs.AI,cs.AR,cs.CV,eess.IV

下载: http://arxiv.org/abs/2504.05119v1

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.

Updated: 2025-04-07 14:21:11

标题: VAPO:高效可靠的用于高级推理任务的强化学习

摘要: 我们提出了一个名为VAPO的价值增强近端策略优化框架,专为在价值导向范式内的推理模型而设计。在AIME 2024数据集上进行基准测试,VAPO基于Qwen 32B预训练模型,取得了60.4的最先进成绩。在相同的实验设置下直接比较,VAPO的表现优于之前报道的DeepSeek-R1-Zero-Qwen-32B和DAPO超过10个点。VAPO的训练过程以其稳定性和效率脱颖而出。它在短短5000步内达到了最先进的性能水平。此外,在多次独立运行中,没有发生任何训练崩溃,突显了其可靠性。这项研究深入探讨了在价值导向强化学习框架下的长思维链(long-CoT)推理。我们指出了困扰价值导向方法的三个关键挑战:价值模型偏见、异构序列长度的存在以及奖励信号的稀疏性。通过系统设计,VAPO提供了一个集成解决方案,有效缓解了这些挑战,使长CoT推理任务的性能得到提升。

更新时间: 2025-04-07 14:21:11

领域: cs.AI

下载: http://arxiv.org/abs/2504.05118v1

Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning

Discovering efficient algorithms for solving complex problems has been an outstanding challenge in mathematics and computer science, requiring substantial human expertise over the years. Recent advancements in evolutionary search with large language models (LLMs) have shown promise in accelerating the discovery of algorithms across various domains, particularly in mathematics and optimization. However, existing approaches treat the LLM as a static generator, missing the opportunity to update the model with the signal obtained from evolutionary exploration. In this work, we propose to augment LLM-based evolutionary search by continuously refining the search operator - the LLM - through reinforcement learning (RL) fine-tuning. Our method leverages evolutionary search as an exploration strategy to discover improved algorithms, while RL optimizes the LLM policy based on these discoveries. Our experiments on three combinatorial optimization tasks - bin packing, traveling salesman, and the flatpack problem - show that combining RL and evolutionary search improves discovery efficiency of improved algorithms, showcasing the potential of RL-enhanced evolutionary strategies to assist computer scientists and mathematicians for more efficient algorithm design.
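
A minimal sketch of the evolutionary-search half of this loop, with the LLM proposal operator replaced by a random swap mutation on first-fit bin packing (an assumption for illustration; the paper's proposal operator is a fine-tuned LLM):

```python
import random

def first_fit(items, capacity=10):
    """Pack items in the given order with first-fit; return number of bins used."""
    bins = []
    for it in items:
        for b in bins:
            if sum(b) + it <= capacity:
                b.append(it)
                break
        else:
            bins.append([it])
    return len(bins)

def evolve(items, generations=200, seed=0):
    # Evolutionary search: the proposal operator (an LLM in the paper, a random
    # swap here) mutates the incumbent; candidates at least as good replace it.
    rng = random.Random(seed)
    best, best_cost = list(items), first_fit(items)
    for _ in range(generations):
        cand = best[:]
        i, j = rng.randrange(len(cand)), rng.randrange(len(cand))
        cand[i], cand[j] = cand[j], cand[i]
        cost = first_fit(cand)
        if cost <= best_cost:  # accept ties to keep exploring plateaus
            best, best_cost = cand, cost
    return best_cost

items = [4, 4, 4, 6, 6, 6]          # first-fit in this order wastes a bin
baseline = first_fit(items)          # 4 bins
evolved = evolve(items)              # search usually recovers a 3-bin ordering
print(baseline, evolved)
```

The RL component the paper adds would additionally update the proposal distribution (the LLM policy) from the accepted candidates, instead of leaving the mutation operator static.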

Updated: 2025-04-07 14:14:15

标题: 用LLMs进行算法发现:进化搜索与强化学习的结合

摘要: 发现解决复杂问题的高效算法一直是数学和计算机科学中的一个重大挑战,多年来需要大量的人类专业知识。近年来,基于大型语言模型(LLMs)的进化搜索的最新进展显示出在加速跨各个领域的算法发现方面具有潜力,特别是在数学和优化领域。然而,现有方法将LLM视为静态生成器,错失了利用从进化探索获得的信号来更新模型的机会。在这项工作中,我们提出通过强化学习(RL)微调来持续优化搜索算子(即LLM),从而增强基于LLM的进化搜索。我们的方法利用进化搜索作为一种探索策略来发现改进的算法,同时RL基于这些发现优化LLM策略。我们在三个组合优化任务(装箱问题、旅行推销员问题和扁平打包(flatpack)问题)上的实验表明,将RL和进化搜索相结合可以提高改进算法的发现效率,展示了RL增强的进化策略有助于计算机科学家和数学家设计更高效算法的潜力。

更新时间: 2025-04-07 14:14:15

领域: cs.AI,cs.LG,cs.NE

下载: http://arxiv.org/abs/2504.05108v1

SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation

Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.

Updated: 2025-04-07 14:13:49

标题: SpeakEasy:增强文本到语音互动,促进表达内容的创作

摘要: 新手内容创作者经常花费大量时间录制表达性强的语音用于社交媒体视频。尽管最近文本转语音(TTS)技术的进步可以生成多种语言和口音的高度逼真的语音,但许多人在使用不直观或过于精细的TTS界面时感到困难。我们提出通过允许用户在脚本旁边指定高层次上下文来简化TTS生成。我们的Wizard-of-Oz(绿野仙踪式)原型系统SpeakEasy利用用户提供的上下文来影响和指导TTS输出,从而实现基于高层次反馈的迭代精化。这种方法得到了两项8人形成性研究的启发:一项研究内容创作者使用TTS的经验,另一项借鉴了配音演员的有效策略。我们的评估表明,使用SpeakEasy的参与者在生成符合其个人标准的表演方面更为成功,而不需要比主流行业界面更多的努力。

更新时间: 2025-04-07 14:13:49

领域: cs.HC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.05106v1

6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction

Current 3D reconstruction techniques struggle to infer unbounded scenes from a few images faithfully. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reconstruct occluded regions reliably. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image to 3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We showcase that six surround-view vehicle images from a single timestamp without global pose information are enough to reconstruct 360$^{\circ}$ scenes during inference time, taking 395 ms. Our method allows, for example, rendering third-person images and birds-eye views. Our code is available at https://github.com/continental/6Img-to-3D, and more examples can be found at our website here https://6Img-to-3D.GitHub.io/.
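
The triplane idea can be sketched as three axis-aligned feature planes queried by projecting a 3D point onto each and summing the samples. The sketch below uses nearest-neighbour sampling and made-up constant features (the actual method uses learned planes and interpolation):

```python
def make_plane(size, fill):
    # A feature plane: size x size grid of 2-dim feature vectors.
    return [[[fill, fill] for _ in range(size)] for _ in range(size)]

def query_triplane(planes, x, y, z, size=4):
    """Project a 3D point (coords in [0, 1)) onto the XY, XZ and YZ planes
    and sum the sampled features."""
    def sample(plane, u, v):
        i = min(int(u * size), size - 1)
        j = min(int(v * size), size - 1)
        return plane[i][j]
    f_xy = sample(planes[0], x, y)
    f_xz = sample(planes[1], x, z)
    f_yz = sample(planes[2], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

planes = [make_plane(4, 1.0), make_plane(4, 2.0), make_plane(4, 3.0)]
feat = query_triplane(planes, 0.5, 0.5, 0.5)
print(feat)  # [6.0, 6.0]: the sum of the three per-plane features
```

Factoring a 3D volume into three 2D planes is what keeps the representation compact enough for large, unbounded scenes; a volume renderer then decodes such features into density and colour.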

Updated: 2025-04-07 14:07:31

标题: 6Img-to-3D:少图像大规模室外驾驶场景重建

摘要: 目前的3D重建技术在从少数图像中准确推断无界场景方面存在困难。具体来说,现有方法具有高计算需求,需要详细的姿势信息,且无法可靠地重建被遮挡的区域。我们引入了6Img-to-3D,这是一种高效、可扩展的基于Transformer的编码-渲染器方法,用于单次图像到3D重建。我们的方法仅从六张朝外的输入图像中输出一个3D一致的参数化三平面(triplane),用于大规模、无界的户外驾驶场景。我们通过结合收缩的自定义交叉与自注意力机制进行三平面参数化、可微体积渲染、场景收缩和图像特征投影,向解决现有缺陷迈出了一步。我们展示,仅仅使用一个时间戳的六张环绕视图车辆图像,而无需全局姿势信息,就足以在推断时重建360°场景,耗时395毫秒。例如,我们的方法允许渲染第三人称视角图像和鸟瞰图。我们的代码可在https://github.com/continental/6Img-to-3D找到,更多示例可在我们的网站https://6Img-to-3D.GitHub.io/上找到。

更新时间: 2025-04-07 14:07:31

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2404.12378v2

State Tuning: State-based Test-Time Scaling on RWKV-7

Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.
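
The core idea, adapting only the recurrent state while the model's weights stay frozen, can be sketched on a scalar linear RNN (a drastic simplification of RWKV-7; the finite-difference optimizer and loss are purely illustrative):

```python
def run_rnn(W, state, inputs):
    # Frozen recurrence: state <- W * state + x ; the weight W never changes.
    for x in inputs:
        state = W * state + x
    return state

def state_tune(W, inputs, target, steps=200, lr=0.5, eps=1e-4):
    """Tune only the initial state of a frozen scalar RNN so that its final
    state matches a target, via finite-difference gradient descent."""
    s = 0.0
    for _ in range(steps):
        loss = (run_rnn(W, s, inputs) - target) ** 2
        loss_eps = (run_rnn(W, s + eps, inputs) - target) ** 2
        grad = (loss_eps - loss) / eps   # numerical gradient w.r.t. the state
        s -= lr * grad
    return s

W, inputs, target = 0.5, [1.0, 2.0], 5.0
s = state_tune(W, inputs, target)
final = run_rnn(W, s, inputs)
print(s, final)  # the tuned state drives the frozen RNN to the target
```

In the paper the tunable object is the full RWKV-7 state matrix (optionally upscaled by a kernel method), but the division of labour is the same: pre-trained weights frozen, state optimized at test time.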

Updated: 2025-04-07 14:04:30

标题: 状态调整:基于状态的测试时间缩放在RWKV-7上

摘要: 测试时间缩放已经成为机器学习中一个突出的研究方向,使模型能够在推断过程中增强其表达能力。Transformer模型因在效率和表达能力之间取得微妙平衡而闻名,受益于利用不断扩展的键-值(KV)缓存的测试时间缩放技术,从而显著提高性能。在本文中,我们介绍了一种新颖的基于状态的测试时间缩放方法,我们称之为状态调整,专门针对基于RNN的RWKV-7模型。通过利用RWKV-7的独特优势,我们的方法在不改变模型预训练权重的情况下,在目标任务上实现了最先进的性能。我们的方法围绕三个关键创新展开。首先,我们开发了一个观察者框架,允许较小的模型复制并学习RWKV-7模型的状态动态。其次,我们采用核方法动态扩展状态大小,增强模型捕获复杂模式的能力。第三,我们集成了Decorrelated Backpropagation(DBP)来优化扩展后的状态矩阵,从而提高收敛性和表达能力。通过仅调整状态矩阵,我们证明较小的模型可以在给定任务上胜过更大的模型。这种方法保留了原始RWKV-7架构的效率,同时利用测试时间缩放的威力提供优越结果。我们的研究结果强调了状态调整作为在资源受限环境中提高模型性能的有效策略的潜力。我们的代码可在https://github.com/TorchRWKV/flash-linear-attention获取。

更新时间: 2025-04-07 14:04:30

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2504.05097v1

Hollow Victory: How Malicious Proposers Exploit Validator Incentives in Optimistic Rollup Dispute Games

Blockchain systems, such as Ethereum, are increasingly adopting layer-2 scaling solutions to improve transaction throughput and reduce fees. One popular layer-2 approach is the Optimistic Rollup, which relies on a mechanism known as a dispute game for block proposals. In these systems, validators can challenge blocks that they believe contain errors, and a successful challenge results in the transfer of a portion of the proposer's deposit as a reward. In this paper, we reveal a structural vulnerability in the mechanism: validators may not be awarded a proper profit despite winning a dispute challenge. We develop a formal game-theoretic model of the dispute game and analyze several scenarios, including cases where the proposer controls some validators and cases where a secondary auction mechanism is deployed to induce additional participation. Our analysis demonstrates that under current designs, the competitive pressure from validators may be insufficient to deter malicious behavior. We find that increased validator competition, paradoxically driven by higher rewards or participation, can allow a malicious proposer to significantly lower their net loss by capturing value through mechanisms like auctions. To address this, we propose countermeasures such as an escrowed reward mechanism and a commit-reveal protocol. Our findings provide critical insights into enhancing the economic security of layer-2 scaling solutions in blockchain networks.
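
The economics of the exploit can be sketched with toy numbers (invented for illustration; the paper's game-theoretic model is far richer): the malicious proposer recaptures part of the slashed deposit by auctioning off the winning-challenger slot.

```python
def proposer_net_loss(deposit, auction_revenue):
    """Net loss of a malicious proposer whose invalid block is successfully
    challenged: the slashed deposit, minus whatever value the proposer claws
    back (e.g. by auctioning the right to be the winning challenger)."""
    return deposit - auction_revenue

# Hypothetical parameters: a 100-token deposit, of which the winning
# validator receives 50%; the remainder is burned.
deposit, reward_fraction = 100.0, 0.5
validator_reward = deposit * reward_fraction

# Under fierce validator competition, bidders in a secondary auction may pay
# up to the reward itself for the winning slot, letting the proposer
# recapture nearly all of it.
loss_no_auction = proposer_net_loss(deposit, 0.0)
loss_with_auction = proposer_net_loss(deposit, validator_reward)
print(loss_no_auction, loss_with_auction)  # 100.0 vs 50.0
```

This is the paradox the abstract describes: more validator competition raises the auction clearing price, which lowers the malicious proposer's effective penalty; the proposed escrow and commit-reveal countermeasures aim to break that value capture.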

Updated: 2025-04-07 14:00:46

标题: 空洞的胜利:恶意提议者如何利用乐观 Rollup 争议游戏中验证者的激励

摘要: 区块链系统,如以太坊,越来越多地采用第二层扩展解决方案来提高交易吞吐量并减少费用。一种流行的第二层方法是乐观 Rollup,它依赖于一种称为争议游戏的机制来提出区块。在这些系统中,验证者可以挑战他们认为包含错误的区块,成功的挑战会导致将部分提议者的存款转移为奖励。在本文中,我们揭示了该机制中的一个结构性漏洞:尽管赢得争议挑战,验证者可能无法获得适当的利润。我们开发了一个正式的博弈论模型来分析争议游戏,并分析了几种情况,包括提议者控制一些验证者的情况以及部署次级拍卖机制以诱导额外参与的情况。我们的分析表明,在当前设计下,验证者的竞争压力可能不足以阻止恶意行为。我们发现,增加验证者竞争,反而是由于更高的奖励或参与,会使一个恶意提议者通过拍卖等机制显著降低其净损失。为了解决这个问题,我们提出了一些对策,例如托管奖励机制和提交-揭示协议。我们的研究结果为增强区块链网络中第二层扩展解决方案的经济安全提供了关键的见解。

更新时间: 2025-04-07 14:00:46

领域: cs.GT,cs.CR

下载: http://arxiv.org/abs/2504.05094v1

Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property of the KL-regularized RLHF objective: \emph{a policy's coverability of the optimal policy is captured by its sub-optimality}. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm with provable benefits compared to standard online learning. Our approach achieves low regret in the early stage by quickly adapting to the best available source reward models without prior knowledge of their quality, and over time, it attains an $\tilde{O}(\sqrt{T})$ regret bound \emph{independent} of structural complexity measures. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection method with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.
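
The empirical win-rate-based source selection can be sketched as follows; the reward models and scoring rule here are toy stand-ins (length-based scorers), not the paper's learned models:

```python
def win_rate(reward_model, policy_samples, reference_samples):
    """Fraction of pairwise comparisons in which the candidate reward model
    prefers the policy's response over the reference response."""
    wins = sum(1 for p, r in zip(policy_samples, reference_samples)
               if reward_model(p) > reward_model(r))
    return wins / len(policy_samples)

def select_source(reward_models, policy_samples, reference_samples):
    # Pick the source reward model whose preferences best favour the current
    # policy's responses; no prior knowledge of model quality is needed.
    return max(reward_models,
               key=lambda rm: win_rate(rm, policy_samples, reference_samples))

# Hypothetical responses scored by length; two imperfect source models.
policy = ["a good long answer", "ok", "quite detailed reply"]
reference = ["short", "terse", "brief one"]
rm_length = len                   # prefers longer responses
rm_inverse = lambda s: -len(s)    # systematically misspecified
best = select_source([rm_length, rm_inverse], policy, reference)
print(best is rm_length)  # True
```

The theoretical algorithm in the paper adapts this choice online so that regret stays low even when every source model is misspecified.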

Updated: 2025-04-07 13:56:56

标题: RLHF是否可以在不完美的奖励模型下更有效?一个策略覆盖的视角

摘要: 样本效率对于从人类反馈中进行在线强化学习(RLHF)至关重要。虽然现有的研究探讨了样本高效的在线探索策略,但利用错误但相关的奖励模型加速学习的潜力仍未被充分开发。本文研究了如何在在线RLHF中从这些不完美的奖励模型中转移知识。我们首先确定了KL正则化RLHF目标的一个新特性:\emph{一个策略对最优策略的覆盖能力取决于其次优性}。基于这一见解,我们提出了新颖的转移学习原则和一个具有可证明优势的理论算法,与标准在线学习相比。我们的方法通过快速适应最佳可用源奖励模型,在早期阶段实现了低遗憾,而不需要先验知识其质量,并随着时间的推移,它达到了一个与结构复杂性度量无关的$\tilde{O}(\sqrt{T})$遗憾上界。在经验上,受我们理论发现的启发,我们开发了一种基于胜率的转移策略选择方法,具有改进的计算效率。此外,我们的经验转移学习技术是模块化的,可以与各种策略优化方法(如DPO、IPO和XPO)集成,以进一步提高它们的性能。我们通过对摘要任务的实验验证了我们方法的有效性。

更新时间: 2025-04-07 13:56:56

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2502.19255v2

DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products

Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. While diagonal matrices used in architectures like Mamba, GLA, or mLSTM yield fast runtime, they suffer from severely limited expressivity. To address this, recent architectures such as (Gated) DeltaNet and RWKV-7 adopted a diagonal plus rank-1 structure, allowing simultaneous token-channel mixing, which overcomes some expressivity limitations with only a slight decrease in training efficiency. Building on the interpretation of DeltaNet's recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency and a stable recurrence. Through extensive experiments, we demonstrate that DeltaProduct achieves superior state-tracking and language modeling capabilities while exhibiting significantly improved length extrapolation compared to DeltaNet. Additionally, we also strengthen the theoretical foundation of DeltaNet by proving that it can solve dihedral group word problems in just two layers.
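
The state-transition construction is concrete enough to sketch directly: a product of $n_h$ generalized Householder factors $I - \beta v v^\top$ (here $n_h = 2$ on 2-vectors, with plain Python lists for brevity):

```python
def householder(v, beta):
    """Generalized Householder matrix I - beta * v v^T for a 2-vector v."""
    return [[(1.0 if i == j else 0.0) - beta * v[i] * v[j]
             for j in range(2)] for i in range(2)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def delta_product(vs, betas):
    # Diagonal-plus-rank-n_h transition built as a product of n_h generalized
    # Householder transformations (n_h = len(vs)).
    M = [[1.0, 0.0], [0.0, 1.0]]
    for v, beta in zip(vs, betas):
        M = matmul(M, householder(v, beta))
    return M

# beta = 2 with a unit vector gives a pure reflection; the product of two
# distinct reflections is a rotation, something no single rank-1 update
# (DeltaNet's n_h = 1 case) can represent.
M = delta_product([[1.0, 0.0], [0.0, 1.0]], [2.0, 2.0])
print(M)  # a 180-degree rotation
```

Varying $n_h$ is exactly the tunable knob the abstract describes: $n_h = 1$ recovers DeltaNet's diagonal-plus-rank-1 transitions, while larger $n_h$ buys expressivity (e.g. rotations) at proportionally higher cost per token.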

Updated: 2025-04-07 13:39:44

标题: DeltaProduct:通过Householder乘积改进线性RNN中的状态跟踪

摘要: 线性循环神经网络(线性RNN)已经成为与Transformer相竞争的序列建模的有效替代方案,提供了高效的训练和线性时间推断。然而,现有的架构面临着表达能力和效率之间的基本权衡,由它们的状态转移矩阵结构所决定。像Mamba、GLA或mLSTM等架构中使用的对角矩阵虽然在运行时快速,但却面临严重的表达能力限制。为了解决这个问题,最近的架构如(门控)DeltaNet和RWKV-7采用了对角加秩-1结构,允许同时进行标记-通道混合,从而克服了一些表达能力限制,只有轻微降低训练效率。基于将DeltaNet的循环解释为在关联回忆损失上对每个标记执行一步在线梯度下降,我们介绍了DeltaProduct,它改为对每个标记执行多步($n_h$步)。这自然导致对角加秩-$n_h$状态转移矩阵,形成为$n_h$个广义Householder变换的乘积,提供了一个可调节的机制来平衡表达能力和效率,以及一个稳定的循环。通过大量实验,我们证明DeltaProduct在状态跟踪和语言建模能力方面具有卓越的表现,同时在长度外推方面与DeltaNet相比展现出显著改进。此外,我们还通过证明DeltaNet仅用两层即可解决二面体群的字问题,加强了其理论基础。

更新时间: 2025-04-07 13:39:44

领域: cs.LG,cs.CL,cs.FL

下载: http://arxiv.org/abs/2502.10297v4

Occam Gradient Descent

Deep learning neural network models must be large enough to adapt to their problem domain, while small enough to avoid overfitting training data during gradient descent. To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large data sets, and hence inefficient with both computing resources and training data. In response to these inefficiencies, we exploit learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size to minimize generalization error, with gradient descent on model weights to minimize fitting error. In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error. Our algorithm simultaneously descends the space of weights and topological size of any neural network without modification. With respect to loss, compute and model size, our experiments show (a) on image classification benchmarks, linear and convolutional neural networks trained with Occam Gradient Descent outperform traditional gradient descent with or without post-train pruning; (b) on a range of tabular data classification tasks, neural networks trained with Occam Gradient Descent outperform traditional gradient descent, as well as Random Forests; (c) on natural language transformers, Occam Gradient Descent outperforms traditional gradient descent.
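
A minimal sketch of the interleaving idea on a toy linear model; the details here are assumptions (magnitude pruning on a fixed schedule, plain SGD), whereas the paper's adaptive size reduction is driven by a generalization-error criterion from learning theory:

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def occam_descent(data, n_features, steps=300, lr=0.1, prune_every=100):
    """Interleave gradient steps on the fitting error with removal (zeroing)
    of the smallest-magnitude weight, shrinking the model as it trains."""
    w = [0.0] * n_features
    alive = set(range(n_features))
    for t in range(1, steps + 1):
        for x, y in data:                 # one SGD epoch on the fitting error
            err = predict(w, x) - y
            for i in alive:
                w[i] -= lr * err * x[i]
        if t % prune_every == 0 and len(alive) > 1:
            weakest = min(alive, key=lambda i: abs(w[i]))
            alive.discard(weakest)        # adaptive size reduction
            w[weakest] = 0.0
    return w, alive

# The target depends only on feature 0 (y = 2 * x0); features 1-2 are noise.
data = [([1.0, 0.3, -0.2], 2.0), ([2.0, -0.1, 0.4], 4.0), ([0.5, 0.2, 0.1], 1.0)]
w, alive = occam_descent(data, 3)
print(w, alive)  # only feature 0 survives, with weight close to 2
```

Unlike post-train pruning, the descent continues after each reduction, so the surviving weights re-converge inside the smaller topology, which is the behaviour the abstract contrasts with greedy, fitting-error-only gradient descent.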

Updated: 2025-04-07 13:38:50

标题: 奥卡姆梯度下降

摘要: 深度学习神经网络模型必须足够大,以适应其问题领域,同时又要足够小,以避免在梯度下降过程中过拟合训练数据。为了平衡这些竞争性需求,过度配置的深度学习模型(如Transformer)仅在大数据集上训练一个轮次(epoch),因此在计算资源和训练数据方面效率低下。为了应对这些低效,我们利用学习理论推导出奥卡姆梯度下降算法,该算法交替进行适应性减小模型大小以最小化泛化误差,以及梯度下降模型权重以最小化拟合误差。相比之下,传统的梯度下降贪婪地最小化拟合误差,而不考虑泛化误差。我们的算法可同时在任意神经网络的权重空间和拓扑规模上进行下降,无需修改网络。在损失、计算和模型大小方面,我们的实验表明:(a)在图像分类基准测试中,使用奥卡姆梯度下降训练的线性和卷积神经网络优于传统梯度下降,无论是否进行后训练修剪;(b)在一系列表格数据分类任务中,使用奥卡姆梯度下降训练的神经网络优于传统梯度下降以及随机森林;(c)在自然语言Transformer方面,奥卡姆梯度下降优于传统梯度下降。

更新时间: 2025-04-07 13:38:50

领域: cs.LG

下载: http://arxiv.org/abs/2405.20194v8

AI-Driven Tactical Communications and Networking for Defense: A Survey and Emerging Trends

The integration of Artificial Intelligence (AI) in military communications and networking is reshaping modern defense strategies, enhancing secure data exchange, real-time situational awareness, and autonomous decision-making. This survey explores how AI-driven technologies improve tactical communication networks, radar-based data transmission, UAV-assisted relay systems, and electronic warfare resilience. The study highlights AI applications in adaptive signal processing, multi-agent coordination for network optimization, radar-assisted target tracking, and AI-driven electronic countermeasures. Our work introduces a novel three-criteria evaluation methodology. It systematically assesses AI applications based on general system objectives, communications constraints in the military domain, and critical tactical environmental factors. We analyze key AI techniques for different types of learning applied to multi-domain network interoperability and distributed data information fusion in military operations. We also address challenges such as adversarial AI threats, the real-time adaptability of autonomous communication networks, and the limitations of current AI models under battlefield conditions. Finally, we discuss emerging trends in self-healing networks, AI-augmented decision support systems, and intelligent spectrum allocation. We provide a structured roadmap for future AI-driven defense communications and networking research.

Updated: 2025-04-07 13:38:32

标题: 人工智能驱动的国防战术通信与网络:调研和新兴趋势

摘要: 人工智能(AI)在军事通信和网络中的整合正在重新塑造现代国防战略,增强安全数据交换、实时态势感知和自主决策能力。本调查探讨了AI驱动技术如何改进战术通信网络、基于雷达的数据传输、无人机辅助中继系统和电子战抗干扰能力。研究重点介绍了AI在自适应信号处理、多智能体协调网络优化、雷达辅助目标跟踪和AI驱动电子对抗等方面的应用。我们的工作引入了一种新颖的三标准评估方法。它根据一般系统目标、军事领域通信约束和关键战术环境因素系统地评估AI应用。我们分析了不同类型学习的关键AI技术,应用于多域网络互操作性和分布式数据信息融合在军事行动中。我们还讨论了挑战,如对抗性AI威胁、自主通信网络的实时适应性和当前AI模型在战场条件下的局限性。最后,我们讨论了自愈网络、AI增强决策支持系统和智能频谱分配等新兴趋势。我们提供了未来AI驱动国防通信和网络研究的结构化路线图。

更新时间: 2025-04-07 13:38:32

领域: eess.SY,cs.LG,cs.SY

下载: http://arxiv.org/abs/2504.05071v1

PTQ4VM: Post-Training Quantization for Visual Mamba

Visual Mamba is an approach that extends the selective space state model, Mamba, to vision tasks. It processes image tokens sequentially in a fixed order, accumulating information to generate outputs. Despite its growing popularity for delivering high-quality outputs at a low computational cost across various tasks, Visual Mamba is highly susceptible to quantization, which makes further performance improvements challenging. Our analysis reveals that the fixed token access order in Visual Mamba introduces unique quantization challenges, which we categorize into three main issues: 1) token-wise variance, 2) channel-wise outliers, and 3) a long tail of activations. To address these challenges, we propose Post-Training Quantization for Visual Mamba (PTQ4VM), which introduces two key strategies: Per-Token Static (PTS) quantization and Joint Learning of Smoothing Scale and Step Size (JLSS). To the best of our knowledge, this is the first quantization study on Visual Mamba. PTQ4VM can be applied to various Visual Mamba backbones, converting the pretrained model to a quantized format in under 15 minutes without notable quality degradation. Extensive experiments on large-scale classification and regression tasks demonstrate its effectiveness, achieving up to 1.83x speedup on GPUs with negligible accuracy loss compared to FP16. Our code is available at https://github.com/YoungHyun197/ptq4vm.
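
The per-token idea can be sketched as giving each token row its own quantization scale, which directly counters the token-wise variance issue (a sketch only: scales are computed on the fly here, whereas PTS calibrates them statically offline):

```python
def quantize_per_token(activations, n_bits=8):
    """Per-token quantization sketch: each token row gets its own scale,
    so small-magnitude tokens are not crushed by large-magnitude ones."""
    qmax = 2 ** (n_bits - 1) - 1
    quantized, scales = [], []
    for token in activations:
        scale = max(abs(v) for v in token) / qmax or 1.0  # guard all-zero rows
        scales.append(scale)
        quantized.append([round(v / scale) for v in token])
    return quantized, scales

def dequantize(quantized, scales):
    return [[q * s for q in token] for token, s in zip(quantized, scales)]

# Two tokens with very different dynamic ranges (the token-wise variance the
# paper identifies); a single shared scale would crush the small token.
acts = [[0.01, -0.02, 0.03], [5.0, -7.5, 2.5]]
deq = dequantize(*quantize_per_token(acts))
max_err = max(abs(a - d) for row_a, row_d in zip(acts, deq)
              for a, d in zip(row_a, row_d))
print(max_err)  # reconstruction error stays small for both tokens
```

The paper's second ingredient, JLSS, would then learn the smoothing scales and quantization step sizes jointly rather than deriving them from the ranges as above.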

Updated: 2025-04-07 13:30:39

标题: PTQ4VM:Visual Mamba的后训练量化

摘要: Visual Mamba是一种将选择性空间状态模型Mamba扩展到视觉任务的方法。它按照固定顺序逐个处理图像标记,累积信息以生成输出。尽管在各种任务中以低计算成本提供高质量输出而日益受欢迎,但Visual Mamba很容易受到量化的影响,这使得进一步提高性能变得具有挑战性。我们的分析揭示了Visual Mamba中固定标记访问顺序引入的独特量化挑战,我们将其归类为三个主要问题:1)标记方差,2)通道异常值,和3)激活的长尾。为了应对这些挑战,我们提出了Visual Mamba的后训练量化(PTQ4VM)方法,引入了两个关键策略:逐标记静态(PTS)量化和平滑尺度和步长的联合学习(JLSS)。据我们所知,这是Visual Mamba的第一项量化研究。PTQ4VM可应用于各种Visual Mamba骨干,将预训练模型转换为量化格式,且在不到15分钟内完成,而不会有明显的质量降低。在大规模分类和回归任务上的大量实验表明其有效性,在GPU上的速度提高可达1.83倍,与FP16相比几乎没有准确性损失。我们的代码可在https://github.com/YoungHyun197/ptq4vm 上找到。

更新时间: 2025-04-07 13:30:39

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2412.20386v2

MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction

Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments with both human-driven and autonomous vehicles. However, uncertainties introduced by inherent driving behaviors -- such as acceleration, deceleration, and left and right maneuvers -- pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves an improvement of up to 4.7% in short-horizon predictions and a 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging an intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance.

Updated: 2025-04-07 13:30:00

标题: MIAT:用于时空轨迹预测的机动意图感知变换器

摘要: 精确的车辆轨迹预测对于安全和高效的自动驾驶至关重要,特别是在混合交通环境中,其中既有人驾驶车辆又有自动驾驶车辆。然而,由固有驾驶行为引入的不确定性,如加速、减速以及左右转弯,对可靠的轨迹预测构成了重大挑战。我们引入了一种Maneuver-Intention-Aware Transformer(MIAT)架构,该架构将机动意图感知机制与时空交互建模相结合,以增强长期轨迹预测。我们系统地研究了机动意图意识程度变化对短期和长期轨迹预测的影响。在真实世界的NGSIM数据集上进行评估,并与各种基于Transformer和LSTM的方法进行基准比较,我们的方法在短期预测中实现了高达4.7%的改进,长期预测中实现了1.6%的改进,相比其他意图感知基准方法。此外,通过利用意图感知控制机制,MIAT实现了长期预测方面11.1%的性能提升,短期性能略有下降。

更新时间: 2025-04-07 13:30:00

领域: cs.LG

下载: http://arxiv.org/abs/2504.05059v1

It's All in the Mix: Wasserstein Classification and Regression with Mixed Features

Problem definition: A key challenge in supervised learning is data scarcity, which can cause prediction models to overfit to the training data and perform poorly out of sample. A contemporary approach to combat overfitting is offered by distributionally robust problem formulations that consider all data-generating distributions close to the empirical distribution derived from historical samples, where 'closeness' is determined by the Wasserstein distance. While such formulations show significant promise in prediction tasks where all input features are continuous, they scale exponentially when discrete features are present. Methodology/results: We demonstrate that distributionally robust mixed-feature classification and regression problems can indeed be solved in polynomial time. Our proof relies on classical ellipsoid method-based solution schemes that do not scale well in practice. To overcome this limitation, we develop a practically efficient (yet, in the worst case, exponential time) cutting plane-based algorithm that admits a polynomial time separation oracle, despite the presence of exponentially many constraints. We compare our method against alternative techniques both theoretically and empirically on standard benchmark instances. Managerial implications: Data-driven operations management problems often involve prediction models with discrete features. We develop and analyze distributionally robust prediction models that faithfully account for the presence of discrete features, and we demonstrate that our models can significantly outperform existing methods that are agnostic to the presence of discrete features, both theoretically and on standard benchmark instances.

Updated: 2025-04-07 13:24:35

标题: 混合特征的Wasserstein分类和回归:一切都在混合中

摘要: 问题定义:监督学习中的一个关键挑战是数据稀缺性,这可能导致预测模型对训练数据过拟合,并在样本外表现不佳。一种应对过拟合的现代方法是通过分布鲁棒性问题的公式,考虑所有接近于从历史样本中得出的经验分布的数据生成分布,其中“接近性”由Wasserstein距离确定。虽然这种公式在所有输入特征连续的预测任务中显示出显著的潜力,但当存在离散特征时,它们的规模呈指数增长。方法/结果:我们证明了分布鲁棒性混合特征分类和回归问题确实可以在多项式时间内解决。我们的证明依赖于基于经典椭球方法的解决方案方案,但其在实践中扩展性不佳。为了克服这一限制,我们开发了一种实际高效(尽管在最坏情况下是指数时间)的基于割平面的算法,尽管存在指数多的约束条件,但它允许多项式时间的分离预言机(separation oracle)。我们在标准基准实例上在理论上和实证上将我们的方法与替代技术进行比较。管理意涵:数据驱动的运营管理问题通常涉及具有离散特征的预测模型。我们开发和分析了能够忠实考虑离散特征存在的分布鲁棒性预测模型,并且我们证明我们的模型在理论上和在标准基准实例上可以显著优于对离散特征的存在不加区分的现有方法。

更新时间: 2025-04-07 13:24:35

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2312.12230v2

Quantum Complex-Valued Self-Attention Model

Self-attention has revolutionized classical machine learning, yet existing quantum self-attention models underutilize quantum states' potential due to oversimplified or incomplete mechanisms. To address this limitation, we introduce the Quantum Complex-Valued Self-Attention Model (QCSAM), the first framework to leverage complex-valued similarities, which captures amplitude and phase relationships between quantum states more comprehensively. To achieve this, QCSAM extends the Linear Combination of Unitaries (LCUs) into the Complex LCUs (CLCUs) framework, enabling precise complex-valued weighting of quantum states and supporting quantum multi-head attention. Experiments on MNIST and Fashion-MNIST show that QCSAM outperforms recent quantum self-attention models, including QKSAN, QSAN, and GQHAN. With only 4 qubits, QCSAM achieves 100% and 99.2% test accuracies on MNIST and Fashion-MNIST, respectively. Furthermore, we evaluate scalability across 3-8 qubits and 2-4 class tasks, while ablation studies validate the advantages of complex-valued attention weights over real-valued alternatives. This work advances quantum machine learning by enhancing the expressiveness and precision of quantum self-attention in a way that aligns with the inherent complexity of quantum mechanics.
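
A toy illustration of why complex-valued similarities carry more information than real-valued ones: two keys with equal amplitude but different phase receive different attention weights. Plain Python complex numbers stand in for quantum amplitudes here; this is not the paper's LCU/CLCU circuit construction:

```python
import cmath

def complex_attention(query, keys, values):
    """Toy complex-valued attention: similarities are complex inner products,
    so both amplitude and relative phase contribute to the weights."""
    sims = [sum(q * k.conjugate() for q, k in zip(query, key)) for key in keys]
    # Use the real part (amplitude * cos(phase)) as the score, then softmax.
    scores = [s.real for s in sims]
    m = max(scores)
    exps = [cmath.exp(sc - m).real for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*values)]

# Keys with identical amplitudes but opposite phases: an amplitude-only
# (real-valued) similarity would score them equally; the phase breaks the tie.
q = [1 + 0j]
keys = [[1 + 0j], [cmath.exp(1j * cmath.pi)]]  # phase 0 vs phase pi
values = [[1.0], [0.0]]
out = complex_attention(q, keys, values)
print(out)  # weight concentrates on the in-phase key
```

Capturing that phase dependence is precisely the amplitude-and-phase relationship between quantum states that QCSAM's complex-valued weighting is designed to exploit.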

Updated: 2025-04-07 13:24:02

标题: 量子复值自注意力模型

摘要: 自注意力已经彻底改变了经典机器学习,然而现有的量子自注意力模型由于过于简化或不完整的机制而未充分利用量子态的潜力。为了解决这一限制,我们引入了量子复值自注意模型(QCSAM),这是第一个利用复值相似性的框架,更全面地捕捉量子态之间的振幅和相位关系。为了实现这一点,QCSAM将酉算子线性组合(LCUs)扩展为复值LCUs(CLCUs)框架,实现了对量子态的精确复值加权,并支持量子多头注意力。在MNIST和Fashion-MNIST上的实验表明,QCSAM优于最近的量子自注意力模型,包括QKSAN、QSAN和GQHAN。仅使用4个量子比特,QCSAM在MNIST和Fashion-MNIST上分别实现了100%和99.2%的测试准确率。此外,我们评估了在3-8个量子比特和2-4类任务之间的可扩展性,消融研究验证了复值注意力权重相对于实值替代品的优势。这项工作通过增强量子自注意力的表达能力和精确性,进一步推动了量子机器学习的发展,使其与量子力学的固有复杂性相一致。

更新时间: 2025-04-07 13:24:02

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2503.19002v2

Less is More? Revisiting the Importance of Frame Rate in Real-Time Zero-Shot Surgical Video Segmentation

Real-time video segmentation is a promising feature for AI-assisted surgery, providing intraoperative guidance by identifying surgical tools and anatomical structures. However, deploying state-of-the-art segmentation models, such as SAM2, in real-time settings is computationally demanding, which makes it essential to balance frame rate and segmentation performance. In this study, we investigate the impact of frame rate on zero-shot surgical video segmentation, evaluating SAM2's effectiveness across multiple frame sampling rates for cholecystectomy procedures. Surprisingly, our findings indicate that in conventional evaluation settings, frame rates as low as a single frame per second can outperform 25 FPS, as fewer frames smooth out segmentation inconsistencies. However, when assessed in a real-time streaming scenario, higher frame rates yield superior temporal coherence and stability, particularly for dynamic objects such as surgical graspers. Finally, we investigate human perception of real-time surgical video segmentation among professionals who work closely with such data and find that respondents consistently prefer high FPS segmentation mask overlays, reinforcing the importance of real-time evaluation in AI-assisted surgery.
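
The streaming-coherence effect can be sketched with a toy metric: mean IoU between consecutive kept masks under different sampling strides. This is an illustrative proxy, not the paper's evaluation protocol:

```python
def iou(a, b):
    """Intersection-over-union of two binary masks (flat 0/1 lists)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def temporal_coherence(masks, stride):
    """Mean IoU between consecutive *kept* frames when sampling every
    stride-th mask: a simple proxy for segmentation stability."""
    kept = masks[::stride]
    scores = [iou(a, b) for a, b in zip(kept, kept[1:])]
    return sum(scores) / len(scores)

# A mask edge drifting one pixel per frame across a 6-pixel strip,
# like a moving instrument such as a grasper.
masks = [[1 if i <= t else 0 for i in range(6)] for t in range(6)]
print(temporal_coherence(masks, 1), temporal_coherence(masks, 3))
```

For fast-moving objects the kept frames drift further apart at low frame rates, so consecutive-frame IoU drops, mirroring the abstract's finding that high FPS wins in the streaming setting even when offline metrics favour sparse sampling.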

Updated: 2025-04-07 13:22:10

标题: Less is More? 重温实时零样本手术视频分割中帧率的重要性

摘要: 实时视频分割是AI辅助手术的一项有前景的功能,通过识别手术工具和解剖结构提供术中指导。然而,在实时设置中部署最先进的分割模型(例如SAM2)计算开销很大,这使得在帧率和分割性能之间保持平衡至关重要。在本研究中,我们研究了帧率对零样本外科手术视频分割的影响,评估了SAM2在多个帧采样率下在胆囊切除术中的有效性。令人惊讶的是,我们的发现表明,在传统的评估设置中,低至每秒一帧的帧率也能胜过25 FPS,因为更少的帧可以平滑分割不一致性。然而,在实时流媒体场景中评估时,更高的帧率可以提供更好的时间一致性和稳定性,特别是对于像手术夹具这样的动态对象。最后,我们调查了与此类数据密切相关的专业人士对实时外科手术视频分割的人类感知,并发现受访者一致偏好高帧率分割蒙版叠加,强调了实时评估在AI辅助手术中的重要性。

更新时间: 2025-04-07 13:22:10

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.20934v2

Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models

Large language models (LLMs) are foundational explorations to artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in LLMs' parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local "safety regions" in the knowledge manifold. In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories. Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts--a method that systematically bypasses alignment constraints through optimized adversarial prompts. This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.

Updated: 2025-04-07 13:20:17

标题: 揭示对齐大型语言模型的固有道德脆弱性

摘要: 大型语言模型(LLMs)是人工智能通用性的基础探索,然而它们通过指导调整和偏好学习与人类价值观的对齐仅达到表面合规性。在这里,我们展示了在预训练过程中嵌入的有害知识在LLMs的参数化内存中持久存在,逃避了对齐保障并在分布转移时重新出现为不可磨灭的“黑暗模式”。在这项研究中,我们首先通过证明当前的对齐方法仅在知识流形中产生局部“安全区域”,从而从理论上分析了对齐LLMs的内在道德脆弱性。相反,预训练知识通过高可能性的对抗轨迹与有害概念全局连接。基于这一理论洞察,我们通过在分布转移下使用语义连贯性诱导来实证验证我们的发现,这种方法通过优化对抗提示系统地绕过对齐约束。这种理论和实证结合的方法在23个最先进的对齐LLMs中的19个中实现了100%的攻击成功率,包括DeepSeek-R1和LLaMA-3,揭示了它们的普遍脆弱性。

更新时间: 2025-04-07 13:20:17

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.05050v1

Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning

Multiagent collaboration has emerged as a promising framework for enhancing the reasoning capabilities of large language models (LLMs). While this approach improves reasoning capability, it incurs substantial computational overhead due to iterative agent interactions. Furthermore, engaging in debates for queries that do not necessitate collaboration amplifies the risk of error generation. To address these challenges, we propose Debate Only When Necessary (DOWN), an adaptive multiagent debate framework that selectively activates the debate process based on the confidence score of the agent's initial response. For queries where debate is triggered, agents refine their outputs using responses from participating agents and their confidence scores. Experimental results demonstrate that this mechanism significantly improves efficiency while maintaining or even surpassing the performance of existing multiagent debate systems. We also find that confidence-guided debate mitigates error propagation and enhances the selective incorporation of reliable responses. These results establish DOWN as an optimization strategy for efficient and effective multiagent reasoning, facilitating the practical deployment of LLM-based collaboration.
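The gating logic can be sketched as follows. The threshold rule and the confidence-weighted vote below are simplified stand-ins for the paper's refinement step, and the toy agent callables are hypothetical:

```python
def debate_if_needed(query, agents, threshold=0.9):
    """Confidence-gated collaboration in the spirit of DOWN (a sketch, not
    the paper's implementation): each agent maps a query to a pair
    (answer, confidence); debate is triggered only when the first agent's
    confidence falls below the threshold."""
    answer, conf = agents[0](query)
    if conf >= threshold:
        return answer  # skip the costly debate entirely
    # Debate round: collect all agents' answers and pick the one backed by
    # the highest total confidence (a stand-in for confidence-weighted
    # refinement).
    votes = {}
    for agent in agents:
        a, c = agent(query)
        votes[a] = votes.get(a, 0.0) + c
    return max(votes, key=votes.get)

# Toy agents with fixed answers and confidences.
confident = lambda q: ("4", 0.95)
unsure_a = lambda q: ("5", 0.40)
unsure_b = lambda q: ("4", 0.70)

print(debate_if_needed("2+2?", [confident, unsure_a]))            # → "4" (no debate)
print(debate_if_needed("2+2?", [unsure_a, unsure_b, confident]))  # → "4" (debate held)
```

The first call returns immediately because the leading agent is confident; the second triggers a debate in which the reliable answer accumulates the most confidence.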

Updated: 2025-04-07 13:17:52

标题: 只在必要时进行辩论:高效的LLM推理的自适应多智能体协作

摘要: 多智能体协作已经成为增强大型语言模型(LLMs)推理能力的一个有前途的框架。虽然这种方法提高了推理能力,但由于迭代智能体相互作用,它会产生大量的计算开销。此外,对于不需要协作的查询进行辩论会增加错误生成的风险。为了解决这些挑战,我们提出了仅在必要时进行辩论(DOWN),这是一个自适应的多智能体辩论框架,根据智能体初始响应的置信度分数选择性地激活辩论过程。对于激发辩论的查询,智能体使用参与智能体的响应和置信度分数来完善他们的输出。实验结果表明,这种机制显著提高了效率,同时保持甚至超越了现有多智能体辩论系统的性能。我们还发现,置信度引导的辩论减轻了错误传播,并增强了可靠响应的选择性融合。这些结果将DOWN确立为一种用于高效和有效的多智能体推理的优化策略,有助于LLM协作的实际部署。

更新时间: 2025-04-07 13:17:52

领域: cs.AI

下载: http://arxiv.org/abs/2504.05047v1

Attention-Augmented Inverse Reinforcement Learning with Graph Convolutions for Multi-Agent Task Allocation

Multi-agent task allocation (MATA) plays a vital role in cooperative multi-agent systems, with significant implications for applications such as logistics, search and rescue, and robotic coordination. Although traditional deep reinforcement learning (DRL) methods have been shown to be promising, their effectiveness is hindered by a reliance on manually designed reward functions and inefficiencies in dynamic environments. In this paper, an inverse reinforcement learning (IRL)-based framework is proposed, in which multi-head self-attention (MHSA) and graph attention mechanisms are incorporated to enhance reward function learning and task execution efficiency. Expert demonstrations are utilized to infer optimal reward densities, allowing dependence on handcrafted designs to be reduced and adaptability to be improved. Extensive experiments validate the superiority of the proposed method over widely used multi-agent reinforcement learning (MARL) algorithms in terms of both cumulative rewards and task execution efficiency.

Updated: 2025-04-07 13:14:45

标题: 用图卷积的注意力增强逆强化学习进行多智能体任务分配

摘要: 多智能体任务分配(MATA)在合作多智能体系统中扮演着至关重要的角色,对物流、搜索和救援以及机器人协作等应用具有重要意义。尽管传统的深度强化学习(DRL)方法被证明是有希望的,但它们的有效性受到手动设计奖励函数和动态环境中的低效率的阻碍。本文提出了一种基于逆强化学习(IRL)的框架,其中包括多头自注意力(MHSA)和图注意力机制,以增强奖励函数学习和任务执行效率。专家演示被用来推断最佳奖励密度,从而减少对手工设计的依赖性,提高适应性。大量实验证实了所提出方法在累积奖励和任务执行效率方面优于广泛使用的多智能体强化学习(MARL)算法。

更新时间: 2025-04-07 13:14:45

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2504.05045v1

A Survey on Federated Analytics: Taxonomy, Enabling Techniques, Applications and Open Issues

The escalating influx of data generated by networked edge devices, coupled with the growing awareness of data privacy, has restricted the traditional data analytics workflow, where the edge data are gathered by a centralized server to be further utilized by data analysts. To continue leveraging vast edge data to support various data-intensive applications, computing paradigms have promoted a transformative shift from centralized data processing to privacy-preserved distributed data processing. The need to perform data analytics on private edge data motivates federated analytics (FA), an emerging technique to support collaborative data analytics among diverse data owners without centralizing the raw data. Despite the wide applications of FA in industry and academia, a comprehensive examination of existing research efforts in FA has been notably absent. This survey aims to bridge this gap by first providing an overview of FA, elucidating key concepts, and discussing its relationship with similar concepts. We then thoroughly examine FA, including its key challenges, taxonomy, and enabling techniques. Diverse FA applications, including statistical metrics, frequency-related applications, database query operations, FL-assisting FA tasks, and other wireless network applications are then carefully reviewed. We complete the survey with several open research issues, future directions, and a comprehensive lessons learned part. This survey intends to provide a holistic understanding of the emerging FA techniques and foster the continued evolution of privacy-preserving distributed data processing in the emerging networked society.

Updated: 2025-04-07 13:11:28

标题: 关于联邦分析的调查:分类、支持技术、应用及未解决问题

摘要: 网络边缘设备生成的数据不断增加,再加上对数据隐私意识的增强,限制了传统的数据分析工作流程,其中边缘数据由集中式服务器收集,然后由数据分析师进一步利用。为了继续利用大量边缘数据支持各种数据激励应用程序,计算范式促进了从集中式数据处理向隐私保护的分布式数据处理的转变。在私人边缘数据上执行数据分析的需求推动了联邦分析(FA),这是一种支持各种数据所有者之间协作数据分析的新兴技术,而不是集中原始数据。尽管FA在工业和学术界有广泛应用,但对FA现有研究工作的全面审查明显缺失。本调查旨在填补这一空白,首先概述FA,阐明关键概念,并讨论其与类似概念的关系。然后我们全面审查FA,包括其主要挑战、分类和启用技术。然后仔细审查各种FA应用,包括统计指标、与频率相关的应用、数据库查询操作、FL辅助FA任务和其他无线网络应用。我们用几个开放性研究问题、未来方向和全面的经验教训部分完成了调查。本调查旨在全面了解新兴FA技术,并促进隐私保护的分布式数据处理在新兴网络化社会中的持续发展。

更新时间: 2025-04-07 13:11:28

领域: cs.DC,cs.CR,cs.ET

下载: http://arxiv.org/abs/2404.12666v3

Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID

Detecting and tracking multiple unmanned aerial vehicles (UAVs) in thermal infrared video is inherently challenging due to low contrast, environmental noise, and small target sizes. This paper provides a straightforward approach to address multi-UAV tracking in thermal infrared video, leveraging recent advances in detection and tracking. Instead of relying on the well-established YOLOv5 with DeepSORT combination, we present a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. We evaluate our approach following the 4th Anti-UAV Challenge metrics and reach competitive performance. Notably, we achieved strong results without using contrast enhancement or temporal information fusion to enrich UAV features, highlighting our approach as a "Strong Baseline" for multi-UAV tracking tasks. We provide implementation details, in-depth experimental analysis, and a discussion of potential improvements. The code is available at https://github.com/wish44165/YOLOv12-BoT-SORT-ReID .

Updated: 2025-04-07 13:03:35

标题: 强基准线:利用BoT-SORT-ReID的YOLOv12进行多无人机跟踪

摘要: 在热红外视频中检测和跟踪多个无人机(UAV)本质上是具有挑战性的,原因是低对比度、环境噪音和目标尺寸小。本文提供了一个简单的方法来解决热红外视频中的多个UAV跟踪问题,利用了最近在检测和跟踪方面的进展。我们提出了一个基于YOLOv12和BoT-SORT构建的跟踪框架,通过定制的训练和推理策略进行增强。我们根据第四届反无人机挑战的指标评估了我们的方法,并取得了有竞争力的表现。值得注意的是,我们在不使用对比度增强或时间信息融合来丰富UAV特征的情况下取得了强大的结果,突出了我们的方法作为多个UAV跟踪任务的“强基线”。我们提供了实现细节、深入的实验分析以及潜在改进的讨论。代码可在https://github.com/wish44165/YOLOv12-BoT-SORT-ReID中获取。

更新时间: 2025-04-07 13:03:35

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.17237v2

Explainable AI for Enhancing Efficiency of DL-based Channel Estimation

The support of artificial intelligence (AI) based decision-making is a key element in future 6G networks, where the concept of native AI will be introduced. Moreover, AI is widely employed in different critical applications such as autonomous driving and medical diagnosis. In such applications, using AI as black-box models is risky and challenging. Hence, it is crucial to understand and trust the decisions taken by these models. Tackling this issue can be achieved by developing explainable AI (XAI) schemes that aim to explain the logic behind the black-box model behavior, and thus, ensure its efficient and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST framework that is oriented toward channel estimation in wireless communications. The core idea of the XAI-CHEST framework is to identify the relevant model inputs by inducing high noise on the irrelevant ones. This manuscript provides the detailed theoretical foundations of the XAI-CHEST framework. In particular, we derive the analytical expressions of the XAI-CHEST loss functions and the noise threshold fine-tuning optimization problem. Hence, the designed XAI-CHEST framework delivers a smart input feature selection methodology that can further improve the overall performance while optimizing the architecture of the employed model. Simulation results show that the XAI-CHEST framework provides valid interpretations, where it offers an improved bit error rate performance while reducing the required computational complexity in comparison to the classical DL-based channel estimation.
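The underlying perturbation idea generalizes beyond channel estimation: inject strong noise into one input at a time and observe how much the model's error grows. The toy sketch below uses our own function name, noise scale, and squared-error criterion, not the XAI-CHEST loss functions:

```python
import random

def perturbation_relevance(model, x, y_true, noise_scale=5.0, trials=20, seed=0):
    """Generic perturbation-based relevance scores (a simplified stand-in for
    the XAI-CHEST idea): corrupt one input feature at a time with Gaussian
    noise and record how much the model's squared error grows. Features
    whose corruption barely hurts the output are candidates for pruning."""
    rng = random.Random(seed)
    base_err = (model(x) - y_true) ** 2
    scores = []
    for i in range(len(x)):
        err = 0.0
        for _ in range(trials):
            z = list(x)
            z[i] += rng.gauss(0.0, noise_scale)  # high noise on feature i only
            err += (model(z) - y_true) ** 2
        scores.append(err / trials - base_err)
    return scores

# Toy model that only reads the first two inputs: the third should score 0.
model = lambda v: 2.0 * v[0] + 1.0 * v[1] + 0.0 * v[2]
scores = perturbation_relevance(model, [1.0, 1.0, 1.0], y_true=3.0)
print(scores)  # third entry is exactly 0.0 -- the model never reads that input
```

Features with near-zero scores are the "irrelevant" inputs the framework identifies, which is what enables pruning the model's input layer.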

Updated: 2025-04-07 13:02:14

标题: 可解释的人工智能提升基于深度学习的信道估计效率

摘要: 人工智能(AI)决策支持是未来6G网络的关键元素,其中将引入本地AI概念。此外,AI广泛应用于不同的关键应用程序,如自动驾驶和医学诊断。在这些应用中,使用AI作为黑匣子模型是有风险和挑战的。因此,了解和信任这些模型所做的决定至关重要。解决这个问题可以通过开发可解释的AI(XAI)方案来实现,旨在解释黑匣子模型行为背后的逻辑,从而确保其高效和安全的部署。最近,我们提出了一个新颖的基于扰动的XAI-CHEST框架,该框架面向无线通信中的信道估计。XAI-CHEST框架的核心思想是通过对不相关输入引入高噪声来识别相关的模型输入。本文提供了XAI-CHEST框架的详细理论基础。特别是,我们推导了XAI-CHEST损失函数和噪声阈值微调优化问题的分析表达式。因此,设计的XAI-CHEST提供了一种智能的输入特征选择方法,可以进一步改进整体性能,同时优化所使用模型的架构。模拟结果表明,XAI-CHEST框架提供了有效的解释,它提供了改进的比特错误率性能,同时减少了与传统DL-based信道估计相比所需的计算复杂性。

更新时间: 2025-04-07 13:02:14

领域: cs.AI,eess.SP

下载: http://arxiv.org/abs/2407.07009v2

Mathematical theory of deep learning

This book provides an introduction to the mathematical analysis of deep learning. It covers fundamental results in approximation theory, optimization theory, and statistical learning theory, which are the three main pillars of deep neural network theory. Serving as a guide for students and researchers in mathematics and related fields, the book aims to equip readers with foundational knowledge on the topic. It prioritizes simplicity over generality, and presents rigorous yet accessible results to help build an understanding of the essential mathematical concepts underpinning deep learning.

Updated: 2025-04-07 12:59:43

标题: 深度学习的数学理论

摘要: 这本书介绍了深度学习的数学分析。它涵盖了逼近理论、优化理论和统计学习理论等基本结果,这些是深度神经网络理论的三大支柱。作为数学和相关领域的学生和研究人员的指南,该书旨在为读者提供有关该主题的基础知识。它优先考虑简单性而非一般性,并呈现严谨而易懂的结果,以帮助建立对支撑深度学习的基本数学概念的理解。

更新时间: 2025-04-07 12:59:43

领域: cs.LG,math.HO

下载: http://arxiv.org/abs/2407.18384v3

Graph-based Diffusion Model for Collaborative Filtering

Recently, diffusion-based recommendation methods have achieved impressive results. However, existing approaches predominantly treat each user's historical interactions as independent training samples, overlooking the potential of higher-order collaborative signals between users and items. Such signals, which encapsulate richer and more nuanced relationships, can be naturally captured using graph-based data structures. To address this limitation, we extend diffusion-based recommendation methods to the graph domain by directly modeling user-item bipartite graphs with diffusion models. This enables better modeling of the higher-order connectivity inherent in complex interaction dynamics. However, this extension introduces two primary challenges: (1) Noise Heterogeneity, where interactions are influenced by various forms of continuous and discrete noise, and (2) Relation Explosion, referring to the high computational costs of processing large-scale graphs. To tackle these challenges, we propose a Graph-based Diffusion Model for Collaborative Filtering (GDMCF). To address noise heterogeneity, we introduce a multi-level noise corruption mechanism that integrates both continuous and discrete noise, effectively simulating real-world interaction complexities. To mitigate relation explosion, we design a user-active guided diffusion process that selectively focuses on the most meaningful edges and active users, reducing inference costs while preserving the graph's topological integrity. Extensive experiments on three benchmark datasets demonstrate that GDMCF consistently outperforms state-of-the-art methods, highlighting its effectiveness in capturing higher-order collaborative signals and improving recommendation performance.
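The multi-level noise corruption can be sketched on a toy bipartite graph. The flip schedule and noise scale below are our simplifications, not the paper's exact corruption process:

```python
import random

def corrupt_graph(edges, features, t, flip_prob=0.1, sigma=0.2, seed=0):
    """Multi-level noise corruption in the spirit of GDMCF (a simplified
    sketch, not the paper's schedule): discrete noise flips user-item edges
    with a probability growing in the diffusion step t, while continuous
    Gaussian noise perturbs node features."""
    rng = random.Random(seed)
    p = min(1.0, flip_prob * t)
    # Discrete corruption: flip 0/1 interaction indicators.
    noisy_edges = {e: (1 - v if rng.random() < p else v) for e, v in edges.items()}
    # Continuous corruption: additive Gaussian noise on feature vectors.
    noisy_feats = {n: [x + rng.gauss(0.0, sigma * t) for x in f]
                   for n, f in features.items()}
    return noisy_edges, noisy_feats

edges = {("u1", "i1"): 1, ("u1", "i2"): 0}
feats = {"u1": [0.5, -0.2]}
e0, f0 = corrupt_graph(edges, feats, t=0)  # t = 0: no corruption at all
print(e0 == edges and f0 == feats)  # → True
```

At larger `t` both noise sources grow, mirroring how the forward diffusion progressively destroys the observed interaction structure before the reverse process learns to restore it.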

Updated: 2025-04-07 12:51:18

标题: 基于图的协同过滤扩散模型

摘要: 最近,基于扩散的推荐方法取得了令人印象深刻的成果。然而,现有方法主要将每个用户的历史交互视为独立的训练样本,忽视了用户和物品之间更高阶合作信号的潜力。这种信号包含了更丰富和更微妙的关系,可以通过基于图的数据结构自然捕捉到。为了解决这一局限,我们将基于扩散的推荐方法扩展到图领域,通过直接建模用户-物品二分图和扩散模型。这样可以更好地建模复杂交互动态中固有的高阶连接性。然而,这一扩展引入了两个主要挑战:(1)噪声异质性,其中交互受到各种连续和离散噪声的影响,以及(2)关系爆炸,指的是处理大规模图的高计算成本。为了解决这些挑战,我们提出了一种基于图的协同过滤扩散模型(GDMCF)。为了解决噪声异质性,我们引入了一个多级噪声破坏机制,将连续和离散噪声结合起来,有效模拟真实世界的交互复杂性。为了减轻关系爆炸,我们设计了一个用户主动引导的扩散过程,有选择地关注最有意义的边缘和活跃用户,降低推断成本同时保持图的拓扑完整性。在三个基准数据集上的大量实验表明,GDMCF始终优于最先进的方法,突出了其在捕捉高阶协作信号和提高推荐性能方面的有效性。

更新时间: 2025-04-07 12:51:18

领域: cs.SI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.05029v1

Multi-level Neural Networks for high-dimensional parametric obstacle problems

A new method to solve computationally challenging (random) parametric obstacle problems is developed and analyzed, where the parameters can influence the related partial differential equation (PDE) and determine the position and surface structure of the obstacle. As governing equation, a stationary elliptic diffusion problem is assumed. The high-dimensional solution of the obstacle problem is approximated by a specifically constructed convolutional neural network (CNN). This novel algorithm is inspired by a finite element constrained multigrid algorithm to represent the parameter to solution map. This has two benefits: First, it allows for efficient practical computations since multi-level data is used as an explicit output of the NN thanks to an appropriate data preprocessing. This improves the efficacy of the training process and subsequently leads to small errors in the natural energy norm. Second, the comparison of the CNN to a multigrid algorithm provides means to carry out a complete a priori convergence and complexity analysis of the proposed NN architecture. Numerical experiments illustrate a state-of-the-art performance for this challenging problem.

Updated: 2025-04-07 12:50:56

标题: 多层神经网络用于高维参数障碍问题

摘要: 一个新的方法被开发和分析,用于解决计算上具有挑战性的(随机)参数障碍问题,其中参数可以影响相关的偏微分方程(PDE)并确定障碍的位置和表面结构。假设一个稳态椭圆扩散问题作为主导方程。障碍问题的高维解通过一个特别构建的卷积神经网络(CNN)来近似。这一新颖算法受有限元约束多重网格算法的启发,以表示参数到解的映射。这有两个好处:首先,它允许进行高效的实际计算,因为多级数据被用作NN的显式输出,得益于适当的数据预处理。这提高了训练过程的效率,随后导致在自然能量范数中的小误差。其次,将CNN与多重网格算法进行比较提供了对所提出的NN架构进行完整先验收敛和复杂性分析的手段。数值实验展示了这一具有挑战性问题的最先进性能。

更新时间: 2025-04-07 12:50:56

领域: cs.LG,cs.NA,math.FA,math.NA,68T07, 68T09, 35J85,I.2.0; I.5.2; I.5.4; G.1.8; F.1

下载: http://arxiv.org/abs/2504.05026v1

Concept Extraction for Time Series with ECLAD-ts

Convolutional neural networks (CNNs) for time series classification (TSC) are being increasingly used in applications ranging from quality prediction to medical diagnosis. The black box nature of these models makes understanding their prediction process difficult. This issue is crucial because CNNs are prone to learning shortcuts and biases, compromising their robustness and alignment with human expectations. To assess whether such mechanisms are being used and the associated risk, it is essential to provide model explanations that reflect the inner workings of the model. Concept Extraction (CE) methods offer such explanations, but have mostly been developed for the image domain so far, leaving a gap in the time series domain. In this work, we present a CE and localization method tailored to the time series domain, based on the ideas of CE methods for images. We propose the novel method ECLAD-ts, which provides post-hoc global explanations based on how the models encode subsets of the input at different levels of abstraction. For this, concepts are produced by clustering timestep-wise aggregations of CNN activation maps, and their importance is computed based on their impact on the prediction process. We evaluate our method on synthetic and natural datasets. Furthermore, we assess the advantages and limitations of CE in time series through empirical results. Our results show that ECLAD-ts effectively explains models by leveraging their internal representations, providing useful insights about their prediction process.
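The clustering step can be illustrated with a toy version. The per-timestep aggregation layout and the minimal k-means below are our simplifications for illustration, not the ECLAD-ts implementation:

```python
def timestep_vectors(activation_maps):
    """Stack per-layer activation maps (each: timesteps x channels) into one
    vector per timestep -- a sketch of the aggregation ECLAD-ts clusters."""
    T = len(activation_maps[0])
    return [[c for layer in activation_maps for c in layer[t]] for t in range(T)]

def kmeans(points, k, iters=10):
    """Minimal k-means (deterministic init on the first k points), standing
    in for the clustering that produces candidate concepts."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            groups[j].append(p)
        for i, g in enumerate(groups):
            if g:
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

# Two layers' activations over 4 timesteps; steps 0-1 and 2-3 behave alike.
layer1 = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
layer2 = [[0.2], [0.1], [4.9], [5.0]]
vecs = timestep_vectors([layer1, layer2])
concepts = kmeans(vecs, k=2)
print(len(concepts))  # → 2 centroids, one per candidate "concept"
```

Each centroid then plays the role of a concept whose importance would be scored by its impact on the model's prediction.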

Updated: 2025-04-07 12:49:20

标题: ECLAD-ts时间序列的概念提取

摘要: 卷积神经网络(CNNs)用于时间序列分类(TSC)的应用越来越广泛,范围从质量预测到医学诊断。这些模型的黑匣子特性使得理解它们的预测过程变得困难。这个问题至关重要,因为CNN很容易学习捷径和偏见,影响其稳健性和与人类期望的一致性。为了评估是否正在使用此类机制以及相关风险,提供反映模型内部运作的模型解释是至关重要的。概念提取(CE)方法提供了这样的解释,但目前主要是针对图像领域开发的,导致了时间序列领域的空白。在这项工作中,我们提出了一种适用于时间序列领域的CE和定位方法,基于图像领域的CE方法的思想。我们提出了一种新颖的方法ECLAD-ts,根据模型在不同抽象级别编码输入子集的方式提供事后全局解释。为此,概念是通过对CNN激活图的时间步聚合进行聚类产生的,其重要性是基于它们对预测过程的影响来计算的。我们在合成和自然数据集上评估了我们的方法。此外,我们通过实证结果评估了CE在时间序列中的优势和局限性。我们的结果表明,ECLAD-ts通过利用其内部表示有效地解释模型,提供有关其预测过程的有用见解。

更新时间: 2025-04-07 12:49:20

领域: cs.LG

下载: http://arxiv.org/abs/2504.05024v1

Batch Aggregation: An Approach to Enhance Text Classification with Correlated Augmented Data

Natural language processing models often face challenges due to limited labeled data, especially in domain-specific areas, e.g., clinical trials. To overcome this, text augmentation techniques are commonly used to increase sample size by transforming the original input data into artificial ones with the label preserved. However, traditional text classification methods ignore the relationship between augmented texts and treat them as independent samples, which may introduce classification errors. Therefore, we propose a novel approach called 'Batch Aggregation' (BAGG), which explicitly models the dependence of text inputs generated through augmentation by incorporating an additional layer that aggregates results from correlated texts. Through studying multiple benchmark data sets across different domains, we found that BAGG can improve classification accuracy. We also found that the performance gain with BAGG is more pronounced in domain-specific data sets, with accuracy improvements of up to 10-29%. Through the analysis of benchmark data, the proposed method addresses limitations of traditional techniques and improves robustness in text classification tasks. Our results demonstrate that BAGG offers more robust results and outperforms traditional approaches when training data is limited.
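The effect of an aggregation layer can be sketched with plain mean pooling of logits across the augmented copies of one text (a simplification; BAGG learns its aggregation inside the network rather than using a fixed mean):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def bagg_predict(logits_per_augmentation):
    """Batch-aggregation sketch (inspired by, not copied from, BAGG): rather
    than treating each augmented copy of a text as an independent sample,
    pool the per-copy logits with a mean before the final softmax."""
    n = len(logits_per_augmentation)
    dims = len(logits_per_augmentation[0])
    pooled = [sum(row[d] for row in logits_per_augmentation) / n for d in range(dims)]
    return softmax(pooled)

# Three augmented copies of one text; two lean to class 1, one to class 0.
logits = [[0.2, 1.0], [0.1, 1.2], [1.5, 0.3]]
probs = bagg_predict(logits)
print(probs.index(max(probs)))  # → 1 (aggregation resolves the disagreement)
```

Treating the three rows as independent samples would let the outlier copy contribute a conflicting label; pooling first makes the correlated copies vote as one example.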

Updated: 2025-04-07 12:46:07

标题: 批量聚合:一种利用相关增强数据提升文本分类的方法

摘要: 自然语言处理模型经常面临挑战,因为标记数据有限,特别是在特定领域,例如临床试验。为了克服这一问题,通常使用文本增强技术来增加样本大小,通过将原始输入数据转换为保留标签的人工数据。然而,传统的文本分类方法忽略了增强文本之间的关系,将它们视为独立样本,这可能会引入分类错误。因此,我们提出了一种称为“批量聚合”(BAGG)的新方法,通过增加一个额外的层来聚合相关文本的结果,明确地建模通过增强生成的文本之间的依赖关系。通过研究不同领域的多个基准数据集,我们发现BAGG可以提高分类准确性。我们还发现,在特定领域的数据集中,BAGG的性能提高更为明显,准确率提高了10-29%。通过对基准数据的分析,所提出的方法解决了传统技术的局限性,并提高了文本分类任务的鲁棒性。我们的结果表明,当训练数据有限时,BAGG提供了更稳健的结果,并优于传统方法。

更新时间: 2025-04-07 12:46:07

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.05020v1

Mixture-of-Personas Language Models for Population Simulation

Advances in Large Language Models (LLMs) paved the way for their emerging applications in various domains, such as human behavior simulations, where LLMs could augment human-generated data in social science research and machine learning model training. However, pretrained LLMs often fail to capture the behavioral diversity of target populations due to the inherent variability across individuals and groups. To address this, we propose Mixture of Personas (MoP), a probabilistic prompting method that aligns the LLM responses with the target population. MoP is a contextual mixture model, where each component is an LM agent characterized by a persona and an exemplar representing subpopulation behaviors. The persona and exemplar are randomly chosen according to the learned mixing weights to elicit diverse LLM responses during simulation. MoP is flexible, requires no model finetuning, and is transferable across base models. Experiments for synthetic data generation show that MoP outperforms competing methods in alignment and diversity metrics.
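The sampling step might look as follows. The prompt template is an assumption for illustration; only the weighted persona/exemplar draw reflects the mechanism described above:

```python
import random

def mop_prompt(query, personas, exemplars, mixing_weights, seed=None):
    """Mixture-of-Personas-style prompt construction (a sketch under an
    assumed prompt wording): sample one (persona, exemplar) component
    according to the learned mixing weights, then condition the base LM on
    it. No finetuning is involved -- only the prompt changes per draw."""
    rng = random.Random(seed)
    i = rng.choices(range(len(personas)), weights=mixing_weights, k=1)[0]
    return (f"You are {personas[i]}.\n"
            f"Example of this subpopulation's behavior: {exemplars[i]}\n"
            f"Task: {query}")

personas = ["a cautious retiree", "a risk-seeking student"]
exemplars = ["prefers savings accounts", "invests in volatile assets"]
print(mop_prompt("Describe your weekly spending.", personas, exemplars,
                 mixing_weights=[0.7, 0.3], seed=42))
```

Repeated draws over many simulated respondents reproduce the subpopulation proportions encoded in the mixing weights, which is what yields population-level diversity without touching model parameters.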

Updated: 2025-04-07 12:43:05

标题: 混合人物语言模型用于人口模拟

摘要: 大型语言模型(LLMs)的进展为它们在各个领域的新兴应用铺平了道路,例如人类行为模拟,在这种情况下,LLMs可以增强社会科学研究和机器学习模型训练中的人类生成数据。然而,预训练的LLMs通常无法捕捉目标人群的行为多样性,这是由于个体和群体之间固有的变异性。为了解决这个问题,我们提出了“人物混合”(MoP),这是一种概率提示方法,可以将LLM响应与目标人群对齐。MoP是一种上下文混合模型,其中每个组件都是一个由人物和代表亚群行为的示例特征化的LM代理。根据学习到的混合权重随机选择人物和示例,以在模拟过程中引发多样化的LLM响应。MoP具有灵活性,无需模型微调,并且可以在基础模型之间进行转移。通过对合成数据生成的实验表明,MoP在对齐和多样性指标方面优于竞争方法。

更新时间: 2025-04-07 12:43:05

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2504.05019v1

Joint Pedestrian and Vehicle Traffic Optimization in Urban Environments using Reinforcement Learning

Reinforcement learning (RL) holds significant promise for adaptive traffic signal control. While existing RL-based methods demonstrate effectiveness in reducing vehicular congestion, their predominant focus on vehicle-centric optimization leaves pedestrian mobility needs and safety challenges unaddressed. In this paper, we present a deep RL framework for adaptive control of eight traffic signals along a real-world urban corridor, jointly optimizing both pedestrian and vehicular efficiency. Our single-agent policy is trained using real-world pedestrian and vehicle demand data derived from Wi-Fi logs and video analysis. The results demonstrate significant performance improvements over traditional fixed-time signals, reducing average wait times per pedestrian and per vehicle by up to 67% and 52%, respectively, while simultaneously decreasing total accumulated wait times for both groups by up to 67% and 53%. Additionally, our results demonstrate generalization capabilities across varying traffic demands, including conditions entirely unseen during training, validating RL's potential for developing transportation systems that serve all road users.

Updated: 2025-04-07 12:41:58

标题: 城市环境中使用强化学习进行步行者和车辆交通优化

摘要: 强化学习(RL)在自适应交通信号控制方面具有重要潜力。尽管现有的基于RL的方法在减少车辆拥堵方面表现出有效性,但它们主要关注以车辆为中心的优化,未解决行人的移动需求和安全挑战。在本文中,我们提出了一个深度RL框架,用于实现对沿着真实城市走廊的八个交通信号的自适应控制,共同优化行人和车辆的效率。我们使用来自Wi-Fi日志和视频分析的真实行人和车辆需求数据来训练单智能体策略。结果表明,与传统固定时间信号相比,平均等待时间每位行人和每辆车可分别减少高达67%和52%,同时两组的总累计等待时间可减少高达67%和53%。此外,我们的结果显示出在不同交通需求情况下的泛化能力,包括在训练期间完全未见过的条件,验证了RL在开发为所有道路使用者提供服务的交通系统方面的潜力。

更新时间: 2025-04-07 12:41:58

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2504.05018v1

Deconstructing Jazz Piano Style Using Machine Learning

Artistic style has been studied for centuries, and recent advances in machine learning create new possibilities for understanding it computationally. However, ensuring that machine-learning models produce insights aligned with the interests of practitioners and critics remains a significant challenge. Here, we focus on musical style, which benefits from a rich theoretical and mathematical analysis tradition. We train a variety of supervised-learning models to identify 20 iconic jazz musicians across a carefully curated dataset of 84 hours of recordings, and interpret their decision-making processes. Our models include a novel multi-input architecture that enables four musical domains (melody, harmony, rhythm, and dynamics) to be analysed separately. These models enable us to address fundamental questions in music theory and also advance the state-of-the-art in music performer identification (94% accuracy across 20 classes). We release open-source implementations of our models and an accompanying web application for exploring musical styles.

Updated: 2025-04-07 12:37:39

标题: 利用机器学习解构爵士钢琴风格

摘要: 艺术风格已经被研究了几个世纪,最近机器学习的进步为以计算方式理解艺术风格创造了新的可能性。然而,确保机器学习模型产生的见解与从业者和评论家的兴趣保持一致仍然是一个重大挑战。在这里,我们关注音乐风格,它受益于丰富的理论和数学分析传统。我们训练了各种监督学习模型,以识别出20位标志性的爵士音乐家,通过经过精心策划的84小时录音数据集,并解释他们的决策过程。我们的模型包括一种新颖的多输入架构,使四个音乐领域(旋律、和声、节奏和动态)可以分别进行分析。这些模型使我们能够解决音乐理论中的基本问题,同时推进音乐表演者识别领域的最新技术(20个类别的准确率达到94%)。我们发布了我们模型的开源实现以及一个配套的网络应用程序,用于探索音乐风格。

更新时间: 2025-04-07 12:37:39

领域: cs.SD,cs.IR,cs.LG,eess.AS

下载: http://arxiv.org/abs/2504.05009v1

Measuring the right thing: justifying metrics in AI impact assessments

AI Impact Assessments are only as good as the measures used to assess the impact of these systems. It is therefore paramount that we can justify our choice of metrics in these assessments, especially for difficult to quantify ethical and social values. We present a two-step approach to ensure metrics are properly motivated. First, a conception needs to be spelled out (e.g. Rawlsian fairness or fairness as solidarity) and then a metric can be fitted to that conception. Both steps require separate justifications, as conceptions can be judged on how well they fit with the function of, for example, fairness. We argue that conceptual engineering offers helpful tools for this step. Second, metrics need to be fitted to a conception. We illustrate this process through an examination of competing fairness metrics to illustrate that here the additional content that a conception offers helps us justify the choice for a specific metric. We thus advocate that impact assessments are not only clear on their metrics, but also on the conceptions that motivate those metrics.

Updated: 2025-04-07 12:32:41

标题: 测量正确的事物:在人工智能影响评估中证明度量的合理性

摘要: 人工智能影响评估的有效性取决于评估这些系统影响的措施。因此,我们必须能够证明在这些评估中选择度量标准的合理性,特别是对于难以量化的伦理和社会价值观。我们提出了一个两步方法来确保度量标准得到适当的动机。首先,需要明确一个概念(例如罗尔斯的公平或团结公平),然后可以将度量标准适配到该概念上。这两个步骤需要分别进行证明,因为概念可以根据其与公平等功能的匹配程度来评判。我们认为概念工程为这一步骤提供了有用的工具。其次,度量标准需要适配到一个概念。我们通过对竞争性公平度量标准的研究来说明这个过程,以阐明概念提供额外内容如何帮助我们证明选择特定度量标准的合理性。因此,我们主张影响评估不仅要明确其度量标准,还要明确激励这些度量标准的概念。

更新时间: 2025-04-07 12:32:41

领域: cs.CY,cs.AI,cs.ET

下载: http://arxiv.org/abs/2504.05007v1

Enhancing Smart Contract Vulnerability Detection in DApps Leveraging Fine-Tuned LLM

Decentralized applications (DApps) face significant security risks due to vulnerabilities in smart contracts, with traditional detection methods struggling to address emerging and machine-unauditable flaws. This paper proposes a novel approach leveraging fine-tuned Large Language Models (LLMs) to enhance smart contract vulnerability detection. We introduce a comprehensive dataset of 215 real-world DApp projects (4,998 contracts), including hard-to-detect logical errors like token price manipulation, addressing the limitations of existing simplified benchmarks. By fine-tuning LLMs (Llama3-8B and Qwen2-7B) with Full-Parameter Fine-Tuning (FFT) and Low-Rank Adaptation (LoRA), our method achieves superior performance, attaining an F1-score of 0.83 with FFT and data augmentation via Random Over Sampling (ROS). Comparative experiments demonstrate significant improvements over prompt-based LLMs and state-of-the-art tools. Notably, the approach excels in detecting non-machine-auditable vulnerabilities, achieving 0.97 precision and 0.68 recall for price manipulation flaws. The results underscore the effectiveness of domain-specific LLM fine-tuning and data augmentation in addressing real-world DApp security challenges, offering a robust solution for blockchain ecosystem protection.
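The class-balancing step can be sketched with a generic Random Over Sampling implementation (our own sketch, not the authors' pipeline):

```python
import random

def random_over_sample(samples, labels, seed=0):
    """Random Over Sampling (ROS) as used to balance fine-tuning data:
    duplicate minority-class examples at random until every class matches
    the majority class size (generic sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(v) for v in by_class.values())
    out_s, out_y = [], []
    for y, group in by_class.items():
        extras = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extras:
            out_s.append(s)
            out_y.append(y)
    return out_s, out_y

X = ["c1", "c2", "c3", "v1"]           # three safe contracts, one vulnerable
y = ["safe", "safe", "safe", "vuln"]
Xb, yb = random_over_sample(X, y)
print(yb.count("safe"), yb.count("vuln"))  # → 3 3
```

Since vulnerable contracts are the rare class in real DApp corpora, this kind of duplication keeps the fine-tuned classifier from collapsing onto the majority "safe" label.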

Updated: 2025-04-07 12:32:14

标题: 利用精细调整的LLM增强DApps中智能合约漏洞检测

摘要: 去中心化应用(DApps)由于智能合约中的漏洞而面临重大安全风险,传统的检测方法难以解决新出现的和机器无法审计的缺陷。本文提出了一种新颖的方法,利用经过调整的大型语言模型(LLMs)来增强智能合约的漏洞检测。我们引入了一个包括215个真实世界DApp项目(4,998个合约)的全面数据集,包括难以检测的逻辑错误,比如令牌价格操纵,解决了现有简化基准的限制。通过使用完全参数微调(FFT)和低秩适应(LoRA)对LLMs(Llama3-8B和Qwen2-7B)进行微调,我们的方法实现了卓越的性能,通过FFT和通过随机过采样(ROS)进行数据增强,达到了0.83的F1分数。比较实验显示,该方法在基于提示的LLMs和最先进工具方面取得了显著改进。值得注意的是,该方法在检测非机器审计漏洞方面表现出色,针对价格操纵漏洞实现了0.97的精度和0.68的召回率。结果强调了领域特定LLM微调和数据增强在解决真实世界DApp安全挑战方面的有效性,为区块链生态系统的保护提供了强大的解决方案。

更新时间: 2025-04-07 12:32:14

领域: cs.CR

下载: http://arxiv.org/abs/2504.05006v1

Comparative analysis of Realistic EMF Exposure Estimation from Low Density Sensor Network by Finite & Infinite Neural Networks

Understanding the spatial and temporal patterns of environmental exposure to radio-frequency electromagnetic fields (RF-EMF) is essential for conducting risk assessments. These assessments aim to explore potential connections between RF-EMF exposure and its effects on human health, as well as on wildlife and plant life. Existing research has used different machine learning tools for EMF exposure estimation; however, a comparative analysis of these techniques is required to better understand their performance for real-world datasets. In this work, we present both finite and infinite-width convolutional network-based methods to estimate and assess EMF exposure levels from 70 real-world sensors in Lille, France. A comparative analysis has been conducted to analyze the performance of the methods' execution time and estimation accuracy. To improve estimation accuracy for higher-resolution grids, we utilized a preconditioned gradient descent method for kernel estimation. Root Mean Square Error (RMSE) is used as the evaluation criterion for comparing the performance of these deep learning models.
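The evaluation criterion itself is simple to state:

```python
import math

def rmse(predicted, observed):
    """Root Mean Square Error, the criterion used here to compare the
    finite- and infinite-width estimators against sensor measurements."""
    assert len(predicted) == len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                     / len(observed))

# Toy check: estimates off by a constant 2 V/m at every sensor give RMSE 2.
print(rmse([3.0, 5.0, 7.0], [1.0, 3.0, 5.0]))  # → 2.0
```

Because the error is squared before averaging, RMSE penalizes the occasional large misestimate more heavily than mean absolute error would, which matters when exposure hot-spots are the quantity of interest.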

Updated: 2025-04-07 12:31:53

标题: 有限和无限神经网络比较分析低密度传感器网络实际电磁场暴露估计

摘要: 理解环境暴露于射频电磁场(RF-EMF)的空间和时间模式对于进行风险评估至关重要。这些评估旨在探讨射频电磁场暴露与其对人类健康以及野生动植物的影响之间的潜在联系。现有研究已经使用不同的机器学习工具来进行EMF暴露估计;然而,需要对这些技术进行比较分析,以更好地了解它们在真实数据集中的性能。在这项工作中,我们提出了基于有限和无限宽度卷积网络的方法,以估计和评估法国里尔70个真实传感器中的EMF暴露水平。进行了比较分析,以分析方法的执行时间和估计准确度的性能。为了提高高分辨率网格的估计准确度,我们利用了一个预处理的梯度下降方法来进行内核估计。均方根误差(RMSE)被用作评估标准,用于比较这些深度学习模型的性能。

更新时间: 2025-04-07 12:31:53

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.07990v1

Stacking Variational Bayesian Monte Carlo

Variational Bayesian Monte Carlo (VBMC) is a sample-efficient method for approximate Bayesian inference with computationally expensive likelihoods. While VBMC's local surrogate approach provides stable approximations, its conservative exploration strategy and limited evaluation budget can cause it to miss regions of complex posteriors. In this work, we introduce Stacking Variational Bayesian Monte Carlo (S-VBMC), a method that constructs global posterior approximations by merging independent VBMC runs through a principled and inexpensive post-processing step. Our approach leverages VBMC's mixture posterior representation and per-component evidence estimates, requiring no additional likelihood evaluations while being naturally parallelizable. We demonstrate S-VBMC's effectiveness on two synthetic problems designed to challenge VBMC's exploration capabilities and two real-world applications from computational neuroscience, showing substantial improvements in posterior approximation quality across all cases.
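The merging step can be sketched as follows. The softmax-of-evidence re-weighting below is a simplification of the paper's stacking procedure, and the run/component data layout is our own:

```python
import math

def stack_runs(runs):
    """Stack independent VBMC-style runs into one global mixture (a sketch
    of the S-VBMC idea with simplified re-weighting). Each run is a pair
    (log_evidence, [(weight, component), ...]); components are opaque here.
    Run-level weights are softmax(log evidences), applied to each run's
    already-normalized mixture weights -- no extra likelihood evaluations."""
    m = max(log_z for log_z, _ in runs)
    run_w = [math.exp(log_z - m) for log_z, _ in runs]
    s = sum(run_w)
    run_w = [w / s for w in run_w]
    stacked = []
    for rw, (_, comps) in zip(run_w, runs):
        for w, comp in comps:
            stacked.append((rw * w, comp))
    return stacked

# Two runs; the second reports higher evidence, so its components dominate.
run_a = (0.0, [(0.5, "A1"), (0.5, "A2")])
run_b = (1.0, [(1.0, "B1")])
mix = stack_runs([run_a, run_b])
print(sum(w for w, _ in mix))  # ≈ 1.0, a valid mixture
```

Because the merge only rescales existing mixture weights, it is a cheap post-processing step and trivially parallel across runs, matching the properties claimed above.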

Updated: 2025-04-07 12:30:59

标题: 叠加变分贝叶斯蒙特卡洛

摘要: Variational Bayesian Monte Carlo(VBMC)是一种用于近似贝叶斯推断的样本高效方法,适用于计算密集型的似然函数。虽然VBMC的局部替代方法提供了稳定的近似,但其保守的探索策略和有限的评估预算可能导致错过复杂后验的区域。在这项工作中,我们介绍了堆叠变分贝叶斯蒙特卡洛(S-VBMC),这是一种通过合并独立的VBMC运行来构建全局后验近似的方法,通过一个合理且廉价的后处理步骤。我们的方法利用了VBMC的混合后验表示和每个组件的证据估计,不需要额外的似然评估,同时具有自然的可并行性。我们在两个设计用来挑战VBMC探索能力的合成问题和两个来自计算神经科学的真实应用中展示了S-VBMC的有效性,显示在所有情况下后验近似质量的显著改善。

更新时间: 2025-04-07 12:30:59

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2504.05004v1

SmartBugBert: BERT-Enhanced Vulnerability Detection for Smart Contract Bytecode

Smart contracts deployed on blockchain platforms are vulnerable to various security vulnerabilities. However, only a small number of Ethereum contracts have released their source code, so vulnerability detection at the bytecode level is crucial. This paper introduces SmartBugBert, a novel approach that combines BERT-based deep learning with control flow graph (CFG) analysis to detect vulnerabilities directly from bytecode. Our method first decompiles smart contract bytecode into optimized opcode sequences, extracts semantic features using TF-IDF, constructs control flow graphs to capture execution logic, and isolates vulnerable CFG fragments for targeted analysis. By integrating both semantic and structural information through a fine-tuned BERT model and LightGBM classifier, our approach effectively identifies four critical vulnerability types: transaction-ordering, access control, self-destruct, and timestamp dependency vulnerabilities. Experimental evaluation on 6,157 Ethereum smart contracts demonstrates that SmartBugBert achieves 90.62% precision, 91.76% recall, and 91.19% F1-score, significantly outperforming existing detection methods. Ablation studies confirm that the combination of semantic features with CFG information substantially enhances detection performance. Furthermore, our approach maintains efficient detection speed (0.14 seconds per contract), making it practical for large-scale vulnerability assessment.
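The semantic-feature step over opcode vocabularies can be sketched with standard tf·idf weighting (a generic sketch; the paper may use a different variant):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF over opcode sequences, the semantic-feature step described for
    SmartBugBert (generic sketch, not the paper's exact weighting).
    docs: list of opcode lists; returns one {opcode: score} dict per doc."""
    n = len(docs)
    df = Counter(op for doc in docs for op in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        out.append({op: (c / total) * math.log(n / df[op]) for op, c in tf.items()})
    return out

contracts = [
    ["PUSH1", "PUSH1", "SSTORE", "CALL"],
    ["PUSH1", "MSTORE", "RETURN"],
]
scores = tfidf(contracts)
# "PUSH1" appears in every contract, so its idf = log(2/2) = 0.
print(scores[0]["PUSH1"])  # → 0.0
```

Ubiquitous opcodes such as `PUSH1` are thus down-weighted to zero, while opcodes concentrated in a few contracts (here `CALL` or `SSTORE`) carry the discriminative signal fed into the downstream classifier.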

Updated: 2025-04-07 12:30:12

Categories: cs.CR

Download: http://arxiv.org/abs/2504.05002v1

Approximate Agreement Algorithms for Byzantine Collaborative Learning

In Byzantine collaborative learning, $n$ clients in a peer-to-peer network collectively learn a model without sharing their data by exchanging and aggregating stochastic gradient estimates. Byzantine clients can prevent others from collecting identical sets of gradient estimates. The aggregation step thus needs to be combined with an efficient (approximate) agreement subroutine to ensure convergence of the training process. In this work, we study the geometric median aggregation rule for Byzantine collaborative learning. We show that known approaches do not provide theoretical guarantees on convergence or gradient quality in the agreement subroutine. To satisfy these theoretical guarantees, we present a hyperbox algorithm for geometric median aggregation. We practically evaluate our algorithm in both centralized and decentralized settings under Byzantine attacks on non-i.i.d. data. We show that our geometric median-based approaches can tolerate sign-flip attacks better than known mean-based approaches from the literature.
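
For intuition about why the geometric median tolerates sign-flip attacks better than the mean: it minimizes the sum of Euclidean distances, so a minority of adversarial gradients has bounded influence on the aggregate. The classical Weiszfeld iteration below approximates it; note this is a generic sketch of the aggregation rule only, not the paper's hyperbox agreement algorithm:

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld iteration: approximately minimizes the sum of Euclidean
    distances to the rows of `points` ((n, d) gradient estimates).
    Robust to a minority of Byzantine outliers, unlike the mean."""
    x = points.mean(axis=0)                 # initialize at the mean
    for _ in range(iters):
        d = np.linalg.norm(points - x, axis=1)
        d = np.maximum(d, eps)              # avoid division by zero
        w = 1.0 / d                         # inverse-distance weights
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < eps:
            break
        x = x_new
    return x
```

With three honest gradients clustered together and one Byzantine outlier, the iterate stays near the honest cluster while the coordinate-wise mean is dragged far away.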

Updated: 2025-04-07 12:26:32

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2504.01504v2

Addressing common misinterpretations of KART and UAT in neural network literature

This note addresses the Kolmogorov-Arnold Representation Theorem (KART) and the Universal Approximation Theorem (UAT), focusing on their common and frequent misinterpretations in many papers related to neural network approximation. Our remarks aim to support a more accurate understanding of KART and UAT among neural network specialists. In addition, we explore the minimal number of neurons required for universal approximation, showing that KART's lower bounds extend to standard multilayer perceptrons, even with smooth activation functions.

Updated: 2025-04-07 12:25:13

Categories: cs.LG,cs.NE,26B40, 41A30, 41A63, 68T05

Download: http://arxiv.org/abs/2408.16389v4

SurvSurf: a partially monotonic neural network for first-hitting time prediction of intermittently observed discrete and continuous sequential events

We propose a neural-network based survival model (SurvSurf) specifically designed for direct and simultaneous probabilistic prediction of the first hitting time of sequential events from baseline. Unlike existing models, SurvSurf is theoretically guaranteed to never violate the monotonic relationship between the cumulative incidence functions of sequential events, while allowing nonlinear influence from predictors. It also incorporates implicit truths for unobserved intermediate events in model fitting, and supports both discrete and continuous time and events. We also identified a variant of the Integrated Brier Score (IBS) that showed robust correlation with the mean squared error (MSE) between the true and predicted probabilities by accounting for implied truths about the missing intermediate events. We demonstrated the superiority of SurvSurf compared to modern and traditional predictive survival models in two simulated datasets and two real-world datasets, using MSE, the more robust IBS and by measuring the extent of monotonicity violation.
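
The monotonicity guarantee can be illustrated by construction: make each cumulative incidence function (CIF) nondecreasing in time via a cumulative softplus, and force later events never to exceed earlier ones by multiplying with cumulative sigmoid factors in (0, 1). This is a minimal construction carrying the same guarantee, not SurvSurf's actual architecture:

```python
import numpy as np

def monotone_cifs(z_time, z_event):
    """Build (K, T) cumulative incidence functions that are provably
    nondecreasing in time and ordered across sequential events.
    z_time: (T,) raw time scores; z_event: (K,) raw event scores
    (in a real model both would be network outputs)."""
    sp = np.log1p(np.exp(z_time))              # softplus > 0
    G = 1.0 - np.exp(-np.cumsum(sp))           # in [0, 1), nondecreasing in t
    factors = 1.0 / (1.0 + np.exp(-z_event))   # sigmoids in (0, 1)
    scale = np.cumprod(factors)                # nonincreasing over events
    return scale[:, None] * G[None, :]         # CIF_{k+1}(t) <= CIF_k(t)
```

Whatever values the raw scores take, the output satisfies both monotonicity constraints by construction, which is the structural property the paper guarantees.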

Updated: 2025-04-07 12:24:59

Categories: stat.ML,cs.AI,cs.LG,math.ST,stat.AP,stat.TH,62N01

Download: http://arxiv.org/abs/2504.04997v1

Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs

Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.

Updated: 2025-04-07 12:23:59

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.04994v1

On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance

Deciding what combination of operators to use across the Edge AI tiers to achieve specific latency and model performance requirements is an open question for MLOps engineers. This study aims to empirically assess the accuracy vs inference time trade-off of different black-box Edge AI deployment strategies, i.e., combinations of deployment operators and deployment tiers. In this paper, we conduct inference experiments involving 3 deployment operators (i.e., Partitioning, Quantization, Early Exit), 3 deployment tiers (i.e., Mobile, Edge, Cloud) and their combinations on four widely used Computer-Vision models to investigate the optimal strategies from the point of view of MLOps developers. Our findings suggest that Edge deployment using the hybrid Quantization + Early Exit operator could be preferred over non-hybrid operators (Quantization/Early Exit on Edge, Partition on Mobile-Edge) when faster latency is a concern at medium accuracy loss. However, when minimizing accuracy loss is a concern, MLOps engineers should prefer a Quantization-only operator on Edge, which reduces latency relative to the Early Exit/Partition (on edge/mobile-edge) operators and increases it relative to the Quantized Early Exit (on edge) operator. In scenarios constrained by Mobile CPU/RAM resources, a preference for Partitioning across mobile and edge tiers is observed over mobile deployment. For models with smaller input data samples (such as FCN), a network-constrained cloud deployment can also be a better alternative than Mobile/Edge deployment and Partitioning strategies. For models with large input data samples (ResNet, ResNeXt, DUC), an edge tier having higher network/computational capabilities than Cloud/Mobile can be a more viable option than Partitioning and Mobile/Cloud deployment strategies.

Updated: 2025-04-07 12:16:27

Categories: cs.SE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2403.17154v2

PaperBench: Evaluating AI's Ability to Replicate AI Research

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.

Updated: 2025-04-07 12:15:49

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.01848v3

RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model

Recent progress in vision-language models (VLMs) has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, yet lack the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduced a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upon this dataset, we proposed a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge based on image and/or text queries, and incorporates the retrieved content into a knowledge-augmented prompt to guide the VLM in producing contextually grounded responses. We validated the effectiveness of our approach on three representative vision-language tasks, including image captioning, image classification, and visual question answering, where RS-RAG significantly outperformed state-of-the-art baselines.
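
At its core, the Knowledge Retrieval and Response Generation module reduces to similarity search over the unified vector space followed by prompt augmentation. A hedged sketch, assuming the embeddings have already been produced by the multimodal encoder (function and variable names here are ours, not the framework's API):

```python
import numpy as np

def retrieve_and_rerank(query_vec, doc_vecs, doc_texts, k=3):
    """Cosine-similarity retrieval over a unified image/text vector space.
    Returns the top-k knowledge snippets with their scores."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = D @ q                       # cosine similarity per document
    top = np.argsort(-scores)[:k]
    return [(doc_texts[i], float(scores[i])) for i in top]

def build_augmented_prompt(question, snippets):
    """Fold retrieved knowledge into a prompt for the VLM."""
    ctx = "\n".join(f"- {t}" for t, _ in snippets)
    return f"Context:\n{ctx}\n\nQuestion: {question}\nAnswer:"
```

The actual framework additionally re-ranks candidates and handles joint image-plus-text queries; this sketch shows only the retrieval-augmentation skeleton.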

Updated: 2025-04-07 12:13:43

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04988v1

A Nature-Inspired Colony of Artificial Intelligence System with Fast, Detailed, and Organized Learner Agents for Enhancing Diversity and Quality

The concepts of convolutional neural networks (CNNs) and multi-agent systems are two important areas of research in artificial intelligence (AI). In this paper, we present an approach that builds a CNN-based colony of AI agents to serve as a single system and perform multiple tasks (e.g., predictions or classifications) in an environment. The proposed system mimics the natural environment of a biological system, like an ant colony or a human colony. The proposed colony of AI, defined as a role-based system, accomplishes tasks in an environment by incorporating AI agents that are fast learners, detailed learners, and organized learners. These learners can enhance their localized learning and their collective decisions as a single system of colony of AI agents. This approach also enhances the diversity and quality of the colony of AI with the help of Genetic Algorithms and their crossover and mutation mechanisms. The evolution of fast, detailed, and organized learners in the colony of AI is achieved by introducing a unique one-to-one mapping between these learners and the pretrained VGG16, VGG19, and ResNet50 models, respectively. This role-based approach creates two parent-AI agents using the AI models through processes called the intra- and inter-marriage of AI, so that they can share their learned knowledge (weights and biases) based on a probabilistic rule and produce diversified child-AI agents to perform new tasks. This process forms a colony of AI that consists of families of multi-model and mixture-model AI agents to improve diversity and quality. Simulations show that the colony of AI, built using the VGG16, VGG19, and ResNet50 models, can provide a single system that generates child-AI agents of excellent predictive performance, with F1-scores ranging between 82% and 95%, to make diversified, collective, and high-quality decisions on a task.

Updated: 2025-04-07 12:13:14

Categories: cs.NE,cs.AI,cs.CV,cs.LG,cs.MA

Download: http://arxiv.org/abs/2504.05365v1

Transforming Future Data Center Operations and Management via Physical AI

Data centers (DCs) as mission-critical infrastructures are pivotal in powering the growth of artificial intelligence (AI) and the digital economy. The evolution from Internet DC to AI DC has introduced new challenges in operating and managing data centers for improved business resilience and reduced total cost of ownership. As a result, new paradigms, beyond the traditional approaches based on best practices, are needed for future data centers. In this research, we propose and develop a novel Physical AI (PhyAI) framework for advancing DC operations and management. Our system leverages the emerging capabilities of state-of-the-art industrial products and our in-house research and development. Specifically, it presents three core modules, namely: 1) an industry-grade in-house simulation engine to simulate DC operations in a highly accurate manner, 2) an AI engine built upon NVIDIA PhysicsNemo for the training and evaluation of physics-informed machine learning (PIML) models, and 3) a digital twin platform built upon NVIDIA Omniverse for our proposed 5-tier digital twin framework. This system presents a scalable and adaptable solution to digitalize, optimize, and automate future data center operations and management, by enabling real-time digital twins for future data centers. To illustrate its effectiveness, we present a compelling case study on building a surrogate model for predicting the thermal and airflow profiles of a large-scale DC in a real-time manner. Our results demonstrate its superior performance over traditional time-consuming Computational Fluid Dynamics/Heat Transfer (CFD/HT) simulation, with a median absolute temperature prediction error of 0.18 °C. This emerging approach would open doors to several potential research directions for advancing Physical AI in future DC operations.

Updated: 2025-04-07 12:09:22

Categories: cs.AI,cs.DC

Download: http://arxiv.org/abs/2504.04982v1

DiCoTTA: Domain-invariant Learning for Continual Test-time Adaptation

This paper studies continual test-time adaptation (CTTA), the task of adapting a model to constantly changing unseen domains in testing while preserving previously learned knowledge. Existing CTTA methods mostly focus on adaptation to the current test domain only, overlooking generalization to arbitrary test domains a model may face in the future. To tackle this limitation, we present a novel online domain-invariant learning framework for CTTA, dubbed DiCoTTA. DiCoTTA aims to learn feature representation to be invariant to both current and previous test domains on the fly during testing. To this end, we propose a new model architecture and a test-time adaptation strategy dedicated to learning domain-invariant features without corrupting semantic contents, along with a new data structure and optimization algorithm for effectively managing information from previous test domains. DiCoTTA achieved state-of-the-art performance on four public CTTA benchmarks. Moreover, it showed superior generalization to unseen test domains.

Updated: 2025-04-07 12:09:18

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04981v1

Differential Transformer

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
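
The differential attention mechanism is simple to state: compute two independent softmax attention maps and subtract one from the other, scaled by a learnable scalar lambda, so that noise common to both maps cancels. A single-head numpy sketch (the paper re-parameterizes lambda and applies per-head normalization; this simplification treats lambda as a given constant):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Differential attention: score map = difference of two softmax
    attention maps, cancelling common-mode noise and encouraging
    sparse attention patterns.
    Shapes: x (n, d_model); Wq*/Wk* (d_model, d); Wv (d_model, d_v)."""
    d = Wq1.shape[1]
    a1 = softmax(x @ Wq1 @ (x @ Wk1).T / np.sqrt(d))
    a2 = softmax(x @ Wq2 @ (x @ Wk2).T / np.sqrt(d))
    return (a1 - lam * a2) @ (x @ Wv)
```

Each map alone is a standard scaled-dot-product attention; only the subtraction is new, which is why the mechanism drops into existing Transformer stacks with minimal changes.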

Updated: 2025-04-07 12:04:28

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2410.05258v2

Towards Visual Text Grounding of Multimodal Large Language Model

Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-neglectable limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90K synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By finetuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.

Updated: 2025-04-07 12:01:59

Categories: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.04974v1

DataLab: A Unified Platform for LLM-Powered Business Intelligence

Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools leads to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports various BI tasks for different data roles in data preparation, analysis, and visualization by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.

Updated: 2025-04-07 12:01:15

Categories: cs.DB,cs.AI,cs.CL

Download: http://arxiv.org/abs/2412.02205v3

Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds

This paper studies constrained Markov decision processes (CMDPs) with constraints against stochastic thresholds, aiming at safety of reinforcement learning in unknown and uncertain environments. We leverage a Growing-Window estimator sampling from interactions with the uncertain and dynamic environment to estimate the thresholds, based on which we design Stochastic Pessimistic-Optimistic Thresholding (SPOT), a novel model-based primal-dual algorithm for multiple constraints against stochastic thresholds. SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings. We prove that our algorithm achieves sublinear regret and constraint violation; i.e., a reward regret of $\tilde{\mathcal{O}}(\sqrt{T})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{T})$ constraint violation over $T$ episodes. The theoretical guarantees show that our algorithm achieves performance comparable to that of an approach relying on fixed and clear thresholds. To the best of our knowledge, SPOT is the first reinforcement learning algorithm that realises theoretical guaranteed performance in an uncertain environment where even thresholds are unknown.
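
The Growing-Window estimator can be sketched with a Hoeffding-style confidence radius: average all threshold samples observed so far, and pad by a term that shrinks as interactions accumulate, yielding pessimistic and optimistic threshold estimates. The exact radius used by SPOT may differ; this is an illustrative form assuming samples bounded in [0, 1]:

```python
import numpy as np

def growing_window_threshold(samples, delta=0.05):
    """Estimate an unknown stochastic threshold from all samples seen so
    far. Returns (pessimistic, optimistic) estimates: the empirical mean
    minus/plus a Hoeffding-style radius of order sqrt(log(1/delta)/t),
    which shrinks as the observation window grows."""
    t = len(samples)
    mean = float(np.mean(samples))
    radius = float(np.sqrt(np.log(1.0 / delta) / (2 * t)))
    return mean - radius, mean + radius
```

A primal-dual learner can then enforce the pessimistic estimate when safety is paramount, or the optimistic one when exploration is cheap, which is the distinction SPOT's two threshold settings capture.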

Updated: 2025-04-07 11:58:19

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2504.04973v1

A High-Force Gripper with Embedded Multimodal Sensing for Powerful and Perception Driven Grasping

Modern humanoid robots have shown promising potential for executing various tasks involving the grasping and manipulation of objects using their end-effectors. Nevertheless, in most cases, grasping and manipulation actions involve only low to moderate payloads and interaction forces. This is due to limitations of the end-effectors, which often cannot match the payload reachable by the arm and hence limit what can be grasped and manipulated. In addition, grippers usually do not embed adequate perception in their hardware; grasping actions are mainly driven by perception sensors installed elsewhere on the robot body, which are frequently affected by occlusions caused by arm motions during the execution of grasping and manipulation tasks. To address the above, we developed a modular high-grasping-force gripper equipped with embedded multi-modal perception functionalities. The proposed gripper can generate a grasping force of 110 N in a compact implementation. The high grasping force capability is combined with embedded multi-modal sensing, which includes an eye-in-hand camera, a Time-of-Flight (ToF) distance sensor, an Inertial Measurement Unit (IMU), and an omnidirectional microphone, permitting the implementation of perception-driven grasping functionalities. We extensively evaluated the grasping force capacity of the gripper by introducing novel payload evaluation metrics that are a function of the robot arm's dynamic motion and gripper thermal states. We also evaluated the embedded multi-modal sensing by performing perception-guided enhanced grasping operations.

Updated: 2025-04-07 11:57:08

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2504.04970v1

The Dream Within Huang Long Cave: AI-Driven Interactive Narrative for Family Storytelling and Emotional Reflection

This paper introduces the art project The Dream Within Huang Long Cave, an AI-driven interactive and immersive narrative experience. The project offers new insights into AI technology, artistic practice, and psychoanalysis. Inspired by actual geographical landscapes and familial archetypes, the work combines psychoanalytic theory and computational technology, providing an artistic response to the concept of the non-existence of the Big Other. The narrative is driven by a combination of a large language model (LLM) and a realistic digital character, forming a virtual agent named YELL. Through dialogue and exploration within a cave automatic virtual environment (CAVE), the audience is invited to unravel the language puzzles presented by YELL and help him overcome his life challenges. YELL is a fictional embodiment of the Big Other, modeled after the artist's real father. Through a cross-temporal interaction with this digital father, the project seeks to deconstruct complex familial relationships. By demonstrating the non-existence of the Big Other, we aim to underscore the authenticity of interpersonal emotions, positioning art as a bridge for emotional connection and understanding within family dynamics.

Updated: 2025-04-07 11:54:11

标题: 《黄龙洞内的梦境:面向家庭叙事和情感反思的人工智能驱动交互叙事》

摘要: 这篇论文介绍了艺术项目《黄龙洞内的梦》,这是一个由人工智能驱动的互动和沉浸式叙事体验。该项目为AI技术、艺术实践和精神分析提供了新的见解。作品灵感来自实际地理景观和家庭原型,结合了精神分析理论和计算技术,提供了对“大他者不存在”概念的艺术回应。叙事由大型语言模型(LLM)和逼真的数字角色驱动,形成了一个名为YELL的虚拟代理人。通过在CAVE(洞穴自动虚拟环境)内的对话和探索,观众被邀请解开YELL提出的语言难题,并帮助他克服生活中的挑战。YELL是大他者的虚构化体现,以艺术家的真实父亲为原型。通过与这位数字化父亲进行跨时间的互动,该项目试图解构复杂的家庭关系。通过展示大他者的不存在,我们旨在强调人际情感的真实性,将艺术定位为在家庭动态中建立情感联系和理解的桥梁。

更新时间: 2025-04-07 11:54:11

领域: cs.MM,cs.AI

下载: http://arxiv.org/abs/2504.04968v1

Of All StrIPEs: Investigating Structure-informed Positional Encoding for Efficient Music Generation

While music remains a challenging domain for generative models like Transformers, a two-pronged approach has recently proved successful: inserting musically-relevant structural information into the positional encoding (PE) module and using kernel approximation techniques based on Random Fourier Features (RFF) to lower the computational cost from quadratic to linear. Yet, it is not clear how such RFF-based efficient PEs compare with those based on rotation matrices, such as Rotary Positional Encoding (RoPE). In this paper, we present a unified framework based on kernel methods to analyze both families of efficient PEs. We use this framework to develop a novel PE method called RoPEPool, capable of extracting causal relationships from temporal sequences. Using RFF-based PEs and rotation-based PEs, we demonstrate how seemingly disparate PEs can be jointly studied by considering the content-context interactions they induce. For empirical validation, we use a symbolic music generation task, namely, melody harmonization. We show that RoPEPool, combined with highly-informative structural priors, outperforms all methods.
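
The kernel trick underlying RFF-based efficient PEs can be illustrated directly: random features are constructed so that their inner product approximates a shift-invariant kernel, replacing a quadratic-cost pairwise computation with linear-cost feature maps. A minimal NumPy sketch (ours, not the paper's code), using the RBF kernel whose spectral density is a standard Gaussian:

```python
import numpy as np

def rff_features(x, omega, b):
    """Random Fourier Features: phi(x) . phi(y) ~ k(x - y) in expectation,
    for a shift-invariant kernel k with spectral density p(omega)."""
    # omega: (D, d) frequencies sampled from p(omega); b: (D,) uniform phases
    D = omega.shape[0]
    return np.sqrt(2.0 / D) * np.cos(x @ omega.T + b)

rng = np.random.default_rng(0)
d, D = 4, 4096
# RBF kernel k(x, y) = exp(-||x - y||^2 / 2) has spectral density N(0, I)
omega = rng.standard_normal((D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

x = rng.standard_normal(d)
y = rng.standard_normal(d)

approx = rff_features(x, omega, b) @ rff_features(y, omega, b)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2.0)
print(abs(approx - exact))  # Monte Carlo error, shrinks as D grows
```

Because attention weights built from such features factorize, the softmax-like normalization can be computed in linear rather than quadratic time.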

Updated: 2025-04-07 11:51:29

标题: 所有条纹:研究以结构为基础的位置编码,用于高效音乐生成

摘要: 尽管音乐仍然是Transformers等生成模型的一个具有挑战性的领域,但最近已经证明了一种成功的双重方法:将与音乐相关的结构信息插入位置编码(PE)模块,并使用基于随机傅立叶特征(RFF)的核逼近技术,将计算成本从二次降低到线性。然而,目前尚不清楚基于这种RFF的高效PE与基于旋转矩阵的PE(如Rotary Positional Encoding,RoPE)相比如何。在本文中,我们提出了一个基于核方法的统一框架,用于分析这两类高效PE。我们利用这一框架开发了一种名为RoPEPool的新型PE方法,能够从时间序列中提取因果关系。通过使用基于RFF和基于旋转的PE,我们展示了通过考虑它们引发的内容-上下文交互作用,看似不同的PE可以共同研究。为了进行经验验证,我们使用了一个符号音乐生成任务,即旋律协调。我们展示了RoPEPool结合高度信息化的结构先验优于所有方法。

更新时间: 2025-04-07 11:51:29

领域: cs.SD,cs.AI,cs.LG,stat.ML

下载: http://arxiv.org/abs/2504.05364v1

GOTHAM: Graph Class Incremental Learning Framework under Weak Supervision

Graphs are growing rapidly, along with the number of distinct label categories associated with them. Applications like e-commerce, healthcare, recommendation systems, and various social media platforms are rapidly moving towards graph representation of data due to their ability to capture both structural and attribute information. One crucial task in graph analysis is node classification, where unlabeled nodes are categorized into predefined classes. In practice, novel classes appear incrementally, sometimes with just a few labels (seen classes) or even without any labels (unseen classes), either because they are new or haven't been explored much. Traditional methods assume abundant labeled data for training, which isn't always feasible. We investigate a broader objective: \emph{Graph Class Incremental Learning under Weak Supervision (GCL)}, addressing this challenge by meta-training on base classes with limited labeled instances. During the incremental streams, novel classes can have few-shot or zero-shot representation. Our proposed framework GOTHAM efficiently accommodates these unlabeled nodes by finding the closest prototype representation, serving as class representatives in the attribute space. For Text-Attributed Graphs (TAGs), our framework additionally incorporates semantic information to enhance the representation. By employing teacher-student knowledge distillation to mitigate forgetting, GOTHAM achieves promising results across various tasks. Experiments on datasets such as Cora-ML, Amazon, and OGBN-Arxiv showcase the effectiveness of our approach in handling evolving graph data under limited supervision. The repository is available here: \href{https://github.com/adityashahane10/GOTHAM--Graph-based-Class-Incremental-Learning-Framework-under-Weak-Supervision}{\small \textcolor{blue}{Code}}
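
GOTHAM's core mechanism of assigning unlabeled nodes to the closest prototype can be sketched in a few lines; the names and the use of class-mean prototypes here are our illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def nearest_prototype(embedding, prototypes):
    """Assign a node embedding to the class of the closest prototype in the
    attribute space (schematic stand-in for GOTHAM's prototype matching)."""
    labels = list(prototypes)
    dists = [np.linalg.norm(embedding - prototypes[c]) for c in labels]
    return labels[int(np.argmin(dists))]

# Prototypes built as class means of the few labelled embeddings per class
protos = {
    "paper": np.array([1.0, 0.0]),
    "author": np.array([0.0, 1.0]),
}
print(nearest_prototype(np.array([0.9, 0.2]), protos))  # -> "paper"
```

Zero-shot classes fit the same mechanism as long as a prototype can be placed for them, e.g. from semantic (textual) information in TAGs.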

Updated: 2025-04-07 11:39:13

标题: 哥谭:弱监督下的图类增量学习框架

摘要: 图表正在迅速增长,与它们相关联的不同标签类别数量也在增加。诸如电子商务、医疗保健、推荐系统和各种社交媒体平台等应用正迅速向图表数据表示转移,因为它们能够捕捉结构和属性信息。图分析中的一个关键任务是节点分类,其中未标记的节点被分类到预定义的类别中。在实践中,新类别有时会逐步出现,有时只有少量标签(已见类别),甚至没有任何标签(未见类别),这可能是因为它们是新的或还没有被充分探索。传统方法假设有大量标记数据用于训练,这并不总是可行的。我们研究了一个更广泛的目标:在弱监督下进行图类增量学习(GCL),通过在基础类别上进行元训练来解决这一挑战,这些基础类别具有有限数量的标记实例。在增量数据流中,新类别可以具有少量或零样本表示。我们提出的框架GOTHAM通过在属性空间中找到最接近的原型表示来有效地容纳这些未标记节点,这些原型表示作为类别代表。对于带有文本属性的图表(TAGs),我们的框架还融入了语义信息以增强表示。通过采用师生知识蒸馏来减少遗忘,GOTHAM在各种任务中取得了令人满意的结果。在诸如Cora-ML、亚马逊和OGBN-Arxiv等数据集上的实验展示了我们的方法在处理受限监督下的不断发展的图数据方面的有效性。存储库在此处可用:\href{https://github.com/adityashahane10/GOTHAM--Graph-based-Class-Incremental-Learning-Framework-under-Weak-Supervision}{\small \textcolor{blue}{Code}}.

更新时间: 2025-04-07 11:39:13

领域: cs.AI

下载: http://arxiv.org/abs/2504.04954v1

Dual JPEG Compatibility: a Reliable and Explainable Tool for Image Forensics

Given a JPEG pipeline (compression or decompression), this paper demonstrates how to find the antecedent of an 8x8 block. If it exists, the block is considered compatible with the pipeline. For unaltered images, all blocks remain compatible with the original pipeline; however, for manipulated images, this is not necessarily true. This article provides a first demonstration of the potential of compatibility-based approaches for JPEG image forensics. It introduces a method to address the key challenge of finding a block antecedent in a high-dimensional space, relying on a local search algorithm with restrictions on the search space. We show that inpainting, copy-move, and splicing, when applied after JPEG compression, result in three distinct mismatch problems that can be detected. In particular, if the image is re-compressed after modification, the manipulation can be detected when the quality factor of the second compression is higher than that of the first. Through extensive experiments, we highlight the potential of this compatibility attack under varying degrees of assumptions. While our approach shows promising results, outperforming three state-of-the-art deep learning models in an idealized setting, it remains a proof of concept rather than an off-the-shelf forensic tool. Notably, with perfect knowledge of the JPEG pipeline, our method guarantees zero false alarms in block-by-block localization, given sufficient computational power.
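
The notion of compatibility can be illustrated with a heavily simplified toy: in the DCT domain, a dequantized coefficient only has an antecedent under a given quantization step if it is an exact multiple of that step. (The paper's actual method searches for pixel-domain antecedents with a restricted local search, which is far harder; this sketch is ours.)

```python
import numpy as np

def is_compatible(coeffs, q):
    """Toy compatibility test: a vector of dequantized DCT coefficients has a
    quantized antecedent under quantization steps q iff every value is an
    exact multiple of its step. Tampering after compression breaks this."""
    return bool(np.all(np.round(coeffs / q) * q == coeffs))

q = np.array([16.0, 11.0, 10.0, 16.0])        # four steps from a JPEG-style table
ok = np.array([32.0, 22.0, 0.0, -16.0])       # all multiples of q -> compatible
tampered = np.array([33.0, 22.0, 0.0, -16.0]) # 33 is not a multiple of 16
print(is_compatible(ok, q), is_compatible(tampered, q))
```

The zero-false-alarm property follows from the same logic: an unaltered block always passes the exact antecedent test, so only genuinely incompatible (manipulated) blocks are flagged.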

Updated: 2025-04-07 11:38:19

标题: 双重JPEG兼容性:一种可靠且可解释的图像取证工具

摘要: 鉴于JPEG管道(压缩或解压缩),本文演示了如何找到8x8块的前因。如果存在,则认为该块与管道兼容。对于未经修改的图像,所有块仍与原始管道兼容;然而,对于经过篡改的图像,情况并非总是如此。本文首次展示了基于兼容性方法用于JPEG图像取证的潜力。它介绍了一种方法来解决在高维空间中找到块前因的关键挑战,依赖于具有对搜索空间的限制的局部搜索算法。我们展示了在JPEG压缩后应用修补、复制移动和拼接时,会导致三种不同的不匹配问题,可以被检测出来。特别是,如果图像在修改后重新压缩,当第二次压缩的质量因子高于第一次时,操纵可以被检测出来。通过广泛的实验,我们强调了在不同假设程度下,这种兼容性攻击的潜力。虽然我们的方法在理想化设置中表现出有希望的结果-胜过三种最先进的深度学习模型,但它仍然是一个概念验证,而不是一种现成的取证工具。值得注意的是,携带对JPEG管道的完美知识,我们的方法保证在块对块定位时零误报,假设具有足够的计算能力。

更新时间: 2025-04-07 11:38:19

领域: cs.CR,eess.IV

下载: http://arxiv.org/abs/2408.17106v2

M-Prometheus: A Suite of Open Multilingual LLM Judges

The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data instead of translated data. We release our models, training dataset, and code.
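
A pairwise LLM-judge interaction of the kind M-Prometheus supports can be sketched as a prompt builder; the wording and fields below are our own illustration, not the released models' actual prompt format:

```python
def pairwise_judge_prompt(instruction, response_a, response_b, language):
    """Build a pairwise-comparison prompt for a multilingual judge model
    (hypothetical format for illustration)."""
    return (
        f"You are evaluating two {language} responses to the same instruction.\n"
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the instruction? Answer 'A' or 'B' "
        "with a brief justification."
    )

prompt = pairwise_judge_prompt("Summarize the article.", "…", "…", "German")
print(prompt.splitlines()[0])
```

The same judge can also be queried in direct-assessment mode (one response, one scalar score), which is the second feedback type the suite provides.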

Updated: 2025-04-07 11:37:26

标题: M-Prometheus:一套开放的多语言LLM法官套件

摘要: 使用语言模型自动评估长篇文本(LLM作为评判者)的应用越来越普遍,然而大多数LLM评判者仅针对英语进行优化,对于增强它们的多语言评估能力的策略在当前文献中仍然未被广泛探讨。这导致了非英语语言的自动评估方法质量存在差异,最终阻碍了拥有更好多语言能力的模型的发展。为了弥合这一差距,我们引入了M-Prometheus,一个开放权重的LLM评判者套件,参数范围从3B到14B,可以对多语言输出进行直接评估和两两比较反馈。M-Prometheus模型在跨越20多种语言的多语言奖励基准和涵盖4种语言对的文学机器翻译(MT)评估上胜过最先进的开放LLM评判者。此外,M-Prometheus模型可以在解码时利用,显著改善所有三种测试语言的生成输出,展示了它们对于开发更好多语言模型的实用性。最后,通过大量消融实验,我们确定了获得有效多语言评判者的关键因素,包括骨干模型选择和在本地多语言反馈数据上进行训练,而不是翻译数据。我们发布了我们的模型、训练数据集和代码。

更新时间: 2025-04-07 11:37:26

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.04953v1

A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization

Reinforcement Learning from Human Feedback (RLHF) has emerged as an important paradigm for aligning large language models (LLMs) with human preferences during post-training. This framework typically involves two stages: first, training a reward model on human preference data, followed by optimizing the language model using reinforcement learning algorithms. However, current RLHF approaches may be constrained by two limitations. First, existing RLHF frameworks often rely on Bradley-Terry models to assign scalar rewards based on pairwise comparisons of individual responses. However, this approach imposes significant challenges on the reward model (RM), as the inherent variability in prompt-response pairs across different contexts demands robust calibration capabilities from the RM. Second, reward models are typically initialized from generative foundation models, such as pre-trained or supervised fine-tuned models, despite the fact that reward models perform discriminative tasks, creating a mismatch. This paper introduces Pairwise-RL, an RLHF framework that addresses these challenges through a combination of generative reward modeling and a pairwise proximal policy optimization (PPO) algorithm. Pairwise-RL unifies reward model training and its application during reinforcement learning within a consistent pairwise paradigm, leveraging generative modeling techniques to enhance reward model performance and score calibration. Experimental evaluations demonstrate that Pairwise-RL outperforms traditional RLHF frameworks across both internal evaluation datasets and standard public benchmarks, underscoring its effectiveness in improving alignment and model behavior.
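
The Bradley-Terry setup that the paper identifies as a limitation can be stated concretely: the RM is trained so that the chosen response receives a higher scalar reward than the rejected one, via a logistic loss on the reward difference. A minimal sketch:

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one
    under Bradley-Terry: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# A well-calibrated RM assigns the chosen response the higher scalar reward
print(bradley_terry_loss(2.0, 0.5))  # small loss: correct ordering
print(bradley_terry_loss(0.5, 2.0))  # large loss: inverted ordering
```

Note that only the difference r_chosen - r_rejected is constrained, which is why absolute scores can drift across prompts and why calibration becomes the burden the paper describes; a pairwise paradigm sidesteps this by comparing responses directly.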

Updated: 2025-04-07 11:34:48

标题: 一个统一的RLHF配对框架:连接生成奖励建模和策略优化

摘要: 人类反馈强化学习(RLHF)已经成为一个重要的范例,用于在训练后对齐大型语言模型(LLMs)与人类偏好。这个框架通常包括两个阶段:首先,基于人类偏好数据训练一个奖励模型,然后使用强化学习算法优化语言模型。然而,当前的RLHF方法可能受到两个限制的约束。首先,现有的RLHF框架通常依赖于Bradley-Terry模型,根据个体回应的两两比较来分配标量奖励。然而,这种方法对奖励模型(RM)造成了重大挑战,因为在不同背景下提示-回应对之间的固有变异性要求RM具有强大的校准能力。其次,奖励模型通常是从生成式基础模型初始化的,例如预训练或监督微调模型,尽管奖励模型执行判别性任务,造成了不匹配。本文介绍了Pairwise-RL,这是一个RLHF框架,通过将生成式奖励建模和配对近端策略优化(PPO)算法相结合,解决了这些挑战。Pairwise-RL统一了奖励模型的训练和在强化学习中的应用,在一致的配对范例中利用生成建模技术来增强奖励模型的性能和评分校准。实验评估表明,Pairwise-RL在内部评估数据集和标准公共基准测试中均优于传统的RLHF框架,强调了其在改善对齐和模型行为方面的有效性。

更新时间: 2025-04-07 11:34:48

领域: cs.LG

下载: http://arxiv.org/abs/2504.04950v1

One Quantizer is Enough: Toward a Lightweight Audio Codec

Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and generate discrete tokens that can be utilized in downstream generative modeling tasks. However, leading approaches often rely on resource-intensive models and multi-quantizer architectures, resulting in considerable computational overhead and constrained real-world applicability. In this paper, we present SQCodec, a lightweight neural audio codec that leverages a single quantizer to address these limitations. SQCodec explores streamlined convolutional networks and local Transformer modules, alongside TConv, a novel mechanism designed to capture acoustic variations across multiple temporal scales, thereby enhancing reconstruction fidelity while reducing model complexity. Extensive experiments across diverse datasets show that SQCodec achieves audio quality comparable to multi-quantizer baselines, while its single-quantizer design offers enhanced adaptability and its lightweight architecture reduces resource consumption by an order of magnitude. The source code is publicly available at https://github.com/zhai-lw/SQCodec.
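
The single-quantizer design can be illustrated with plain vector quantization: each latent frame is mapped to the index of its nearest codebook entry, yielding one token stream instead of the stacked streams of residual multi-quantizer codecs. A schematic NumPy sketch (not SQCodec's actual code):

```python
import numpy as np

def quantize(frames, codebook):
    """Single-quantizer VQ: replace each latent frame by its nearest codebook
    entry, returning the discrete token indices and the quantized latents."""
    # frames: (T, d) encoder latents; codebook: (K, d) learned entries
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    tokens = dists.argmin(axis=1)      # (T,) one token per frame
    return tokens, codebook[tokens]    # tokens feed downstream generative models

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2]])
tokens, quantized = quantize(frames, codebook)
print(tokens)  # nearest entries: [0, 1]
```

With a single codebook, downstream models consume one flat token sequence, which is a large part of the adaptability argument relative to residual multi-quantizer stacks.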

Updated: 2025-04-07 11:34:39

标题: 一个量化器就足够了:朝向轻量级音频编解码器

摘要: 最近,神经音频编解码器因其能够压缩高保真音频并生成可用于下游生成建模任务的离散标记而备受关注。然而,主流方法通常依赖资源密集型模型和多量化器架构,导致相当大的计算开销和受限的现实应用性。在本文中,我们提出了SQCodec,一种轻量级神经音频编解码器,利用单个量化器来解决这些限制。SQCodec探索了简化的卷积网络和本地Transformer模块,以及TConv,一种旨在捕捉多个时间尺度上的声学变化的新机制,从而提高重建保真度同时减少模型复杂性。在各种数据集上进行的大量实验表明,SQCodec实现了与多量化器基线相当的音频质量,而其单量化器设计提供了增强的适应性,其轻量级架构将资源消耗降低了一个数量级。源代码可在https://github.com/zhai-lw/SQCodec 上公开获取。

更新时间: 2025-04-07 11:34:39

领域: cs.SD,cs.AI,68T07,I.2.m

下载: http://arxiv.org/abs/2504.04949v1

A Llama walks into the 'Bar': Efficient Supervised Fine-Tuning for Legal Reasoning in the Multi-state Bar Exam

Legal reasoning tasks present unique challenges for large language models (LLMs) due to the complexity of domain-specific knowledge and reasoning processes. This paper investigates how effectively smaller language models (Llama 2 7B and Llama 3 8B) can be fine-tuned with a limited dataset of 1,514 Multi-state Bar Examination (MBE) questions to improve legal question answering accuracy. We evaluate these models on the 2022 MBE questions licensed from JD Advising, the same dataset used in the 'GPT-4 passes the Bar exam' study. Our methodology involves collecting approximately 200 questions per legal domain across 7 domains. We distill the dataset using Llama 3 (70B) to transform explanations into a structured IRAC (Issue, Rule, Application, Conclusion) format as a guided reasoning process to see if it results in better performance over the non-distilled dataset. We compare the non-fine-tuned models against their supervised fine-tuned (SFT) counterparts, trained for different sample sizes per domain, to study the effect on accuracy and prompt adherence. We also analyse option selection biases and their mitigation following SFT. In addition, we consolidate the performance across multiple variables: prompt type (few-shot vs zero-shot), answer ordering (chosen-option first vs generated-explanation first), response format (Numbered list vs Markdown vs JSON), and different decoding temperatures. Our findings show that domain-specific SFT helps some model configurations achieve close to human baseline performance, despite limited computational resources and a relatively small dataset. We release both the gathered SFT dataset and the family of Supervised Fine-tuned (SFT) adapters optimised for MBE performance. This establishes a practical lower bound on resources needed towards achieving effective legal question answering in smaller LLMs.
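
The IRAC distillation target can be pictured as a structured record serialized into a guided-reasoning string; the field contents and serialization below are our own illustration, not necessarily the paper's exact schema:

```python
# Hypothetical IRAC-structured training target used for distillation;
# the example facts and field layout are invented for illustration.
irac_example = {
    "issue": "Whether the oral contract is enforceable under the Statute of Frauds.",
    "rule": "Contracts that cannot be performed within one year must be in writing.",
    "application": "The agreement spans three years and was never reduced to writing.",
    "conclusion": "The contract is unenforceable; answer (B).",
}

def to_training_text(example):
    """Serialize an IRAC record into a guided-reasoning target string."""
    return "\n".join(f"{k.upper()}: {v}" for k, v in example.items())

print(to_training_text(irac_example).splitlines()[0])
```

Distilling free-form explanations into this fixed Issue/Rule/Application/Conclusion scaffold is what lets the study compare guided reasoning against the non-distilled explanations under otherwise identical SFT settings.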

Updated: 2025-04-07 11:31:22

标题: 一只羊驼走进“酒吧”:多州律师资格考试中的高效监督微调法律推理

摘要: 法律推理任务对大型语言模型(LLMs)提出了独特的挑战,这是由于领域特定知识和推理过程的复杂性。本文研究了如何有效地使用一组仅有1,514个多州律师资格考试(MBE)问题的有限数据集对较小的语言模型(Llama 2 7B和Llama 3 8B)进行微调,以提高法律问题回答的准确性。我们在从JD Advising获得授权的2022年MBE问题上对这些模型进行评估,这是在“GPT-4通过律师资格考试”研究中使用的相同数据集。我们的方法涉及在7个领域中每个法律领域收集约200个问题。我们使用Llama 3(70B)对数据集进行提炼,将解释转化为结构化的IRAC(问题、规则、应用、结论)格式,作为引导性推理过程,以查看是否会导致比未提炼数据集更好的性能。我们将未经微调的模型与它们的受监督微调(SFT)对应模型进行比较,针对不同领域的不同样本大小进行训练,以研究对准确性和提示遵从的影响。我们还分析了选项选择偏见及其在SFT后的缓解。此外,我们整合了多个变量的性能:提示类型(少样本 vs 零样本)、答案排序(选择选项优先 vs 生成解释优先)、响应格式(编号列表 vs Markdown vs JSON)以及不同的解码温度。我们的研究结果显示,特定领域的SFT有助于一些模型配置实现接近人类基准性能,尽管计算资源有限且数据集相对较小。我们发布了收集的SFT数据集以及针对MBE性能进行优化的监督微调(SFT)适配器系列。这为在较小的LLMs中实现有效的法律问题回答奠定了所需资源的实际下限。

更新时间: 2025-04-07 11:31:22

领域: cs.LG,cs.AI,cs.CL,I.2.7; I.2.1

下载: http://arxiv.org/abs/2504.04945v1

Lemmanaid: Neuro-Symbolic Lemma Conjecturing

Automatically conjecturing useful, interesting and novel lemmas would greatly improve automated reasoning tools and lower the bar for formalizing mathematics in proof assistants. It is however a very challenging task for both neural and symbolic approaches. We present the first steps towards a practical neuro-symbolic lemma conjecturing tool, Lemmanaid, that combines Large Language Models (LLMs) and symbolic methods, and evaluate it on proof libraries for the Isabelle proof assistant. We train an LLM to generate lemma templates that describe the shape of a lemma, and use symbolic methods to fill in the details. We compare Lemmanaid against an LLM trained to generate complete lemma statements as well as previous fully symbolic conjecturing methods. Our results indicate that neural and symbolic techniques are complementary. By leveraging the best of both symbolic and neural methods we can generate useful lemmas for a wide range of input domains, facilitating computer-assisted theory development and formalization.
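
The division of labor in Lemmanaid (neural template, symbolic filling) can be pictured with a toy: the LLM proposes a lemma shape with holes, and a symbolic enumerator instantiates the holes with candidate symbols. The template syntax here is our own simplification, not the tool's actual representation:

```python
from itertools import product

# A commutativity-shaped lemma template with holes, as a neural step might emit
template = "{f}({x}, {y}) = {f}({y}, {x})"
functions = ["add", "mul", "sub"]       # candidate symbols from the theory
variables = [("a", "b")]

def instantiate(template, functions, variables):
    """Symbolic filling: enumerate every hole assignment (candidates would
    then be filtered, e.g. by testing or proof attempts)."""
    return [template.format(f=f, x=x, y=y) for f, (x, y) in product(functions, variables)]

candidates = instantiate(template, functions, variables)
print(candidates[0])  # "add(a, b) = add(b, a)"
```

Generating the shape neurally and the details symbolically keeps the LLM's output space small while the symbolic side guarantees syntactically well-formed candidates.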

Updated: 2025-04-07 11:30:36

标题: 莱曼纳德:神经符号引理推测

摘要: 自动推测有用、有趣和新颖的引理将极大地提高自动推理工具的效果,并降低在证明助理中形式化数学的门槛。然而,这对于神经和符号方法来说是一个非常具有挑战性的任务。我们介绍了实用神经符号引理推测工具Lemmanaid的第一步,该工具结合了大型语言模型(LLMs)和符号方法,并在Isabelle证明助理的证明库中进行了评估。我们训练一个LLM生成描述引理形状的引理模板,并使用符号方法填写细节。我们将Lemmanaid与训练生成完整引理陈述的LLM以及之前的完全符号推测方法进行比较。我们的结果表明,神经和符号技术是互补的。通过利用符号和神经方法的最佳部分,我们可以为各种输入领域生成有用的引理,促进计算机辅助理论发展和形式化。

更新时间: 2025-04-07 11:30:36

领域: cs.AI,cs.LO

下载: http://arxiv.org/abs/2504.04942v1

Contextualized Messages Boost Graph Representations

Graph neural networks (GNNs) have gained significant attention in recent years for their ability to process data that may be represented as graphs. This has prompted several studies to explore their representational capability based on the graph isomorphism task. Notably, these works inherently assume a countable node feature representation, potentially limiting their applicability. Interestingly, only a few studies consider GNNs with uncountable node feature representations. In this paper, a new perspective on the representational capability of GNNs is investigated across all levels (node-level, neighborhood-level, and graph-level) when the space of node feature representation is uncountable. Specifically, the injective and metric requirements of previous works are softly relaxed by employing a pseudometric distance on the space of input to create a soft-injective function such that distinct inputs may produce similar outputs if and only if the pseudometric deems the inputs to be sufficiently similar on some representation. As a consequence, a simple and computationally efficient soft-isomorphic relational graph convolution network (SIR-GCN) that emphasizes the contextualized transformation of neighborhood feature representations via anisotropic and dynamic message functions is proposed. Furthermore, a mathematical discussion on the relationship between SIR-GCN and key GNNs in the literature is laid out to put the contribution into context, establishing SIR-GCN as a generalization of classical GNN methodologies. To close, experiments on synthetic and benchmark datasets demonstrate the relative superiority of SIR-GCN, outperforming comparable models in node and graph property prediction tasks.
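
The idea of a contextualized (anisotropic, input-dependent) message function can be sketched in NumPy: each neighbor's message is computed jointly from the receiving node and the neighbor, so the same neighbor feature yields different messages in different contexts. The exact parameterization below is our illustrative choice, not the paper's:

```python
import numpy as np

def sir_gcn_style_update(h_u, neighbor_feats, Wq, Wk, Wout):
    """Schematic anisotropic aggregation: messages mix the receiver (context)
    with each neighbor (content) before a nonlinearity, then are summed."""
    relu = lambda z: np.maximum(z, 0.0)
    msgs = [Wout @ relu(Wq @ h_u + Wk @ h_v) for h_v in neighbor_feats]
    return np.sum(msgs, axis=0)

I2 = np.eye(2)
out = sir_gcn_style_update(np.array([1.0, -1.0]),
                           [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                           I2, I2, I2)
print(out)
```

Contrast this with isotropic convolutions, where messages depend on the neighbor alone and the receiver only enters after aggregation; placing the nonlinearity inside the per-neighbor term is what makes the content-context interaction non-trivial.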

Updated: 2025-04-07 11:27:48

标题: 上下文化的信息提升图形表示

摘要: 图神经网络(GNNs)近年来受到了广泛关注,因为它们能够处理以图形表示的数据。这促使了几项研究探索了它们基于图同构任务的表征能力。值得注意的是,这些作品本质上假定了可数节点特征表示,可能限制了它们的适用性。有趣的是,只有少数研究了具有不可数节点特征表示的GNNs。在本文中,当节点特征表示空间是不可数时,对GNNs的表征能力进行了全面的研究,包括节点级别、邻域级别和图级别。具体来说,通过在输入空间上使用伪度量距离软松地放松了先前作品的单射和度量要求,创建了一个软单射函数,使得仅当伪度量认为输入在某些表征上足够相似时,不同的输入才可能产生相似的输出。因此,提出了一种简单且计算效率高的软同构关系图卷积网络(SIR-GCN),通过各向异性和动态消息函数强调了邻域特征表示的情境化转换。此外,对SIR-GCN与文献中关键GNNs之间的关系进行了数学讨论,将贡献放入背景之中,将SIR-GCN确定为经典GNN方法的一般化。最后,对合成和基准数据集进行的实验表明,SIR-GCN在节点和图属性预测任务中表现相对优越,胜过可比较的模型。

更新时间: 2025-04-07 11:27:48

领域: cs.LG

下载: http://arxiv.org/abs/2403.12529v4

Ontology Embedding: A Survey of Methods, Applications and Resources

Ontologies are widely used for representing domain knowledge and metadata, playing an increasingly important role in Information Systems, the Semantic Web, Bioinformatics and many other domains. However, the logical reasoning that ontologies can directly support is quite limited in learning, approximation and prediction. One straightforward solution is to integrate statistical analysis and machine learning. To this end, automatically learning vector representations for the knowledge of an ontology, i.e., ontology embedding, has been widely investigated. Numerous papers have been published on ontology embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field. To bridge this gap, we write this survey paper, which first introduces different kinds of semantics of ontologies and formally defines ontology embedding as well as its property of faithfulness. Based on this, it systematically categorizes and analyses a relatively complete set of over 80 papers, according to the ontologies they aim at and their technical solutions including geometric modeling, sequence modeling and graph propagation. This survey also introduces the applications of ontology embedding in ontology engineering, machine learning augmentation and life sciences, presents a new library mOWL and discusses the challenges and future directions.

Updated: 2025-04-07 11:24:13

标题: 本体嵌入:方法、应用和资源的综述

摘要: 本文摘要:本体论广泛用于表示领域知识和元数据,在信息系统、语义网络、生物信息学和许多其他领域中发挥着越来越重要的作用。然而,本体论可以直接支持的逻辑推理在学习、近似和预测方面相当有限。一个直接的解决方案是将统计分析和机器学习整合起来。为此,自动学习本体知识的向量表示,即本体嵌入,已经得到广泛研究。许多论文已经发表关于本体嵌入,但缺乏系统性的综述阻碍了研究人员对这一领域的全面理解。为了弥补这一差距,我们编写了这篇调查论文,首先介绍了本体论的不同语义类型,并正式定义了本体嵌入及其忠实性属性。基于此,它根据它们的目标本体和技术解决方案(包括几何建模、序列建模和图传播)系统地分类和分析了一个相对完整的80多篇论文集。这项调查还介绍了本体嵌入在本体工程、机器学习增强和生命科学中的应用,提出了一个新的库mOWL,并讨论了挑战和未来方向。

更新时间: 2025-04-07 11:24:13

领域: cs.AI

下载: http://arxiv.org/abs/2406.10964v3

A Taxonomy of Self-Handover

Self-handover, transferring an object between one's own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. Here, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of over 12 hours of cooking activity performed by 21 participants. Our analysis reveals that self-handover is not merely a passive transition, but a highly coordinated action involving anticipatory adjustments by both hands. As a step toward automated analysis of human manipulation, we further demonstrate the feasibility of classifying self-handover types using a state-of-the-art vision-language model. These findings offer fresh insights into bimanual coordination, underscoring the role of self-handover in enabling smooth task transitions, an ability essential for adaptive dual-arm robotics.

Updated: 2025-04-07 11:21:42

标题: 自我移交的分类学

摘要: 自我交接,即在自己的双手之间传递一个物体,是一种常见但未被深入研究的双手动作。虽然它有助于复杂任务中的无缝过渡,但其执行背后的策略仍然大部分未被探索。在这里,我们介绍了第一个系统化的自我交接分类法,该分类法是通过对21名参与者进行的12小时烹饪活动的手动注释得出的。我们的分析表明,自我交接不仅仅是一种被动的过渡,而是一种高度协调的动作,涉及双手的预期调整。作为迈向自动化分析人类操纵的一步,我们进一步展示了使用最先进的视觉语言模型对自我交接类型进行分类的可行性。这些发现为双手协调提供了新的见解,突显了自我交接在实现平稳任务过渡中的作用-这是对于自适应双臂机器人至关重要的能力。

更新时间: 2025-04-07 11:21:42

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2504.04939v1

VidCtx: Context-aware Video Question Answering with Image Models

To address computational and memory limitations of Large Multimodal Models in the Video Question-Answering task, several recent methods extract textual representations per frame (e.g., by captioning) and feed them to a Large Language Model (LLM) that processes them to produce the final response. However, in this way, the LLM does not have access to visual information and often has to process repetitive textual descriptions of nearby frames. To address those shortcomings, in this paper, we introduce VidCtx, a novel training-free VideoQA framework which integrates both modalities, i.e. both visual information from input frames and textual descriptions of other frames that give the appropriate context. More specifically, in the proposed framework a pre-trained Large Multimodal Model (LMM) is prompted to extract, at regular intervals, question-aware textual descriptions (captions) of video frames. Those are used as context when the same LMM is prompted to answer the question at hand given as input a) a certain frame, b) the question and c) the context/caption of an appropriate frame. To avoid redundant information, we chose as context the descriptions of distant frames. Finally, a simple yet effective max pooling mechanism is used to aggregate the frame-level decisions. This methodology enables the model to focus on the relevant segments of the video and scale to a high number of frames. Experiments show that VidCtx achieves competitive performance among approaches that rely on open models on three public Video QA benchmarks, NExT-QA, IntentQA and STAR. Our code is available at https://github.com/IDT-ITI/VidCtx.
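
The frame-level aggregation step is simple to sketch: given a score per (frame, candidate answer), max pooling across frames keeps each answer's strongest single-frame evidence before the final argmax. The scores below are made up for illustration:

```python
import numpy as np

# Frame-level answer scores (rows: frames, cols: candidate answers),
# e.g. derived from the LMM's per-frame decisions; values are invented.
frame_scores = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.3, 0.5],
    [0.1, 0.8, 0.1],
])

# Max pooling across frames: each answer keeps its best frame-level evidence,
# so one strongly supporting frame suffices even in a long video.
pooled = frame_scores.max(axis=0)
answer = int(pooled.argmax())
print(answer)
```

Because the pooling is per-answer and order-free, the scheme scales to many frames without any cross-frame model passes.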

Updated: 2025-04-07 11:20:37

标题: VidCtx:基于图像模型的上下文感知视频问答

摘要: 为了解决大型多模态模型在视频问答任务中的计算和内存限制,最近有几种方法提取每帧的文本表示(例如通过字幕生成),并将其馈送给一个大型语言模型(LLM)来处理它们以生成最终的响应。然而,通过这种方式,LLM无法访问视觉信息,通常需要处理附近帧的重复文本描述。为了解决这些缺点,在本文中,我们引入了VidCtx,这是一个新颖的无需训练的VideoQA框架,它整合了两种模态,即来自输入帧的视觉信息和其他帧的文本描述,提供了适当的上下文。具体而言,在所提出的框架中,一个预训练的大型多模态模型(LMM)被提示以在视频帧的规律间隔内提取问题感知的文本描述(字幕)。当相同的LMM被提示回答手头问题时,作为输入的是a)特定帧,b)问题和c)适当帧的上下文/字幕。为了避免冗余信息,我们选择作为上下文的是远距离帧的描述。最后,使用简单而有效的最大池化机制来聚合帧级决策。这种方法使模型能够专注于视频的相关片段,并扩展到大量帧。实验表明,VidCtx在NExT-QA、IntentQA和STAR三个公共视频QA基准上取得了与依赖开放模型的方法相竞争的性能。我们的代码可在https://github.com/IDT-ITI/VidCtx获取。

更新时间: 2025-04-07 11:20:37

领域: cs.CV,cs.AI,cs.MM

下载: http://arxiv.org/abs/2412.17415v2

Constrained Gaussian Process Motion Planning via Stein Variational Newton Inference

Gaussian Process Motion Planning (GPMP) is a widely used framework for generating smooth trajectories within a limited compute time, an essential requirement in many robotic applications. However, traditional GPMP approaches often struggle with enforcing hard nonlinear constraints and rely on Maximum a Posteriori (MAP) solutions that disregard the full Bayesian posterior. This limits planning diversity and ultimately hampers decision-making. Recent efforts to integrate Stein Variational Gradient Descent (SVGD) into motion planning have shown promise in handling complex constraints. Nonetheless, these methods still face persistent challenges, such as difficulties in strictly enforcing constraints and inefficiencies when the probabilistic inference problem is poorly conditioned. To address these issues, we propose a novel constrained Stein Variational Gaussian Process Motion Planning (cSGPMP) framework, incorporating a GPMP prior specifically designed for trajectory optimization under hard constraints. Our approach improves the efficiency of particle-based inference while explicitly handling nonlinear constraints. This advancement significantly broadens the applicability of GPMP to motion planning scenarios demanding robust Bayesian inference, strict constraint adherence, and computational efficiency within a limited time. We validate our method on standard benchmarks, achieving an average success rate of 98.57% across 350 planning tasks, significantly outperforming competitive baselines. This demonstrates the ability of our method to discover and use diverse trajectory modes, enhancing flexibility and adaptability in complex environments, and delivering significant improvements over standard baselines without incurring major computational costs.
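
The SVGD machinery that cSGPMP builds on can be sketched in NumPy: particles follow a kernel-smoothed score of the target density plus a repulsive term that maintains diversity, which is what yields multiple trajectory modes rather than a single MAP solution. A generic sketch on a toy Gaussian target (not the paper's constrained planner):

```python
import numpy as np

def svgd_step(particles, grad_logp, step=0.2, h=1.0):
    """One Stein Variational Gradient Descent update with an RBF kernel:
    a kernel-smoothed score term pulls particles toward high density,
    a repulsive kernel-gradient term keeps the set diverse."""
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]   # (n, n, d): x_i - x_j
    sq = (diffs ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * h))                             # RBF kernel matrix
    scores = np.stack([grad_logp(p) for p in particles])    # (n, d)
    # phi_i = (1/n) * sum_j [ K_ij * grad_logp(x_j) + grad_{x_j} K_ij ]
    phi = (K @ scores + (K[:, :, None] * diffs).sum(axis=1) / h) / n
    return particles + step * phi

# Toy target: standard normal, so grad log p(x) = -x. Particles started far
# out contract toward the target while the repulsive term keeps them spread.
rng = np.random.default_rng(1)
x = rng.standard_normal((20, 2)) * 3.0
for _ in range(200):
    x = svgd_step(x, lambda v: -v)
print(x.mean(axis=0), x.var())
```

In the constrained-planning setting, each particle is a whole trajectory, log p combines the GPMP smoothness prior with task costs, and the hard constraints are the part that this vanilla update does not handle, which is the gap cSGPMP targets.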

Updated: 2025-04-07 11:20:11

标题: 通过Stein变分牛顿推断的受限高斯过程运动规划

摘要: 高斯过程运动规划(GPMP)是一个广泛使用的框架,用于在有限计算时间内生成平滑轨迹--这是许多机器人应用中的一个基本要求。然而,传统的GPMP方法通常在强制执行硬非线性约束方面存在困难,并依赖于忽略完整贝叶斯后验的最大后验(MAP)解决方案。这限制了规划的多样性,最终阻碍了决策制定。最近将Stein变分梯度下降(SVGD)集成到运动规划中的努力显示出处理复杂约束的潜力。然而,这些方法仍然面临持久的挑战,如在严格执行约束时遇到困难以及在概率推断问题条件不佳时效率低下。为了解决这些问题,我们提出了一种新颖的受限Stein变分高斯过程运动规划(cSGPMP)框架,结合了专门设计用于在硬约束下进行轨迹优化的GPMP先验。我们的方法改善了基于粒子的推断的效率,同时明确处理非线性约束。这一进步显著扩大了GPMP在需要强大贝叶斯推断、严格遵守约束和有限时间内的计算效率的运动规划场景中的适用性。我们在标准基准上验证了我们的方法,在350个规划任务中实现了98.57%的平均成功率,明显优于竞争基线。这表明我们的方法具有发现和利用多样轨迹模式的能力,在复杂环境中增强灵活性和适应性,并在不产生主要计算成本的情况下显著改进了标准基线。

更新时间: 2025-04-07 11:20:11

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2504.04936v1

AGBD: A Global-scale Biomass Dataset

Accurate estimates of Above Ground Biomass (AGB) are essential in addressing two of humanity's biggest challenges: climate change and biodiversity loss. Existing datasets for AGB estimation from satellite imagery are limited. Either they focus on specific, local regions at high resolution, or they offer global coverage at low resolution. There is a need for a machine learning-ready, globally representative, high-resolution benchmark dataset. Our findings indicate significant variability in biomass estimates across different vegetation types, emphasizing the necessity for a dataset that accurately captures global diversity. To address these gaps, we introduce a comprehensive new dataset that is globally distributed, covers a range of vegetation types, and spans several years. This dataset combines AGB reference data from the GEDI mission with data from Sentinel-2 and PALSAR-2 imagery. Additionally, it includes pre-processed high-level features such as a dense canopy height map, an elevation map, and a land-cover classification map. We also produce a dense, high-resolution (10m) map of AGB predictions for the entire area covered by the dataset. Rigorously tested, our dataset is accompanied by several benchmark models and is publicly available. It can be easily accessed using a single line of code, offering a solid basis for efforts towards global AGB estimation. The GitHub repository github.com/ghjuliasialelli/AGBD serves as a one-stop shop for all code and data.

Updated: 2025-04-07 11:19:12

标题: AGBD:全球尺度生物量数据集

摘要: 准确估计地上生物量(AGB)对于解决人类面临的两个最大挑战至关重要:气候变化和生物多样性丧失。目前卫星图像用于AGB估算的数据集有限。要么它们专注于特定的、高分辨率的本地区域,要么提供全球范围的低分辨率覆盖。需要一个机器学习准备、全球代表性、高分辨率的基准数据集。我们的研究结果表明,在不同植被类型之间存在显著的生物量估计变异,强调了需要一个准确捕捉全球多样性的数据集。为填补这些空白,我们引入了一个全球分布、涵盖多种植被类型、跨越数年的全面新数据集。该数据集结合了GEDI任务的AGB参考数据和来自Sentinel-2和PALSAR-2图像的数据。此外,它还包括预处理的高级特征,如密集冠层高度图、高程图和土地覆盖分类图。我们还为数据集覆盖的整个区域生成了稠密、高分辨率(10m)的AGB预测地图。经过严格测试,我们的数据集配备了几个基准模型,并可公开获取。可以通过一行代码轻松访问,为全球AGB估算努力提供坚实基础。GitHub存储库github.com/ghjuliasialelli/AGBD作为所有代码和数据的一站式商店。

更新时间: 2025-04-07 11:19:12

领域: cs.CV,cs.LG,eess.IV

下载: http://arxiv.org/abs/2406.04928v3

RCCFormer: A Robust Crowd Counting Network Based on Transformer

Crowd counting, which is a key computer vision task, has emerged as a fundamental technology in crowd analysis and public safety management. However, challenges such as scale variations and complex backgrounds significantly impact the accuracy of crowd counting. To mitigate these issues, this paper proposes a robust Transformer-based crowd counting network, termed RCCFormer, specifically designed for background suppression and scale awareness. The proposed method incorporates a Multi-level Feature Fusion Module (MFFM), which meticulously integrates features extracted at diverse stages of the backbone architecture. It establishes a strong baseline capable of capturing intricate and comprehensive feature representations, surpassing traditional baselines. Furthermore, the introduced Detail-Embedded Attention Block (DEAB) captures contextual information and local details through global self-attention and local attention along with a learnable manner for efficient fusion. This enhances the model's ability to focus on foreground regions while effectively mitigating background noise interference. Additionally, we develop an Adaptive Scale-Aware Module (ASAM), with our novel Input-dependent Deformable Convolution (IDConv) as its fundamental building block. This module dynamically adapts to changes in head target shapes and scales, significantly improving the network's capability to accommodate large-scale variations. The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets. The results demonstrate that our RCCFormer achieves excellent performance across all four datasets, showcasing state-of-the-art outcomes.

Updated: 2025-04-07 11:19:05

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04935v1

Boosting Relational Deep Learning with Pretrained Tabular Models

Relational databases, organized into tables connected by primary-foreign key relationships, are a common format for organizing data. Making predictions on relational data often involves transforming them into a flat tabular format through table joins and feature engineering, which serve as input to tabular methods. However, designing features that fully capture complex relational patterns remains challenging. Graph Neural Networks (GNNs) offer a compelling alternative by inherently modeling these relationships, but their time overhead during inference limits their applicability for real-time scenarios. In this work, we aim to bridge this gap by leveraging existing feature engineering efforts to enhance the efficiency of GNNs in relational databases. Specifically, we use GNNs to capture complex relationships within relational databases, patterns that are difficult to featurize, while employing engineered features to encode temporal information, thereby avoiding the need to retain the entire historical graph and enabling the use of smaller, more efficient graphs. Our \textsc{LightRDL} approach not only improves efficiency, but also outperforms existing models. Experimental results on the RelBench benchmark demonstrate that our framework achieves up to $33\%$ performance improvement and a $526\times$ inference speedup compared to GNNs, making it highly suitable for real-time inference.

Updated: 2025-04-07 11:19:04

Subjects: cs.DB,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.04934v1

Towards Understanding How Knowledge Evolves in Large Vision-Language Models

Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits the further enhancement of their capabilities. In this paper, we seek to investigate how multimodal knowledge evolves and eventually induces natural languages in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and delve into the evolution of multimodal knowledge from three levels, including single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution: the critical layers and the mutation layers, dividing the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms. Our codes are available at https://github.com/XIAO4579/Vlm-interpretability.

Updated: 2025-04-07 11:16:51

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.02862v2

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

While recent foundational video generators produce visually rich output, they still struggle with appearance drift, where objects gradually degrade or change inconsistently across frames, breaking visual coherence. We hypothesize that this is because there is no explicit supervision in terms of spatial tracking at the feature level. We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. Track4Gen merges the video generation and point tracking tasks into a single network by making minimal changes to existing video generation architectures. Using Stable Video Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify video generation and point tracking, which are typically handled as separate tasks. Our extensive evaluations show that Track4Gen effectively reduces appearance drift, resulting in temporally stable and visually coherent video generation. Project page: hyeonho99.github.io/track4gen

Updated: 2025-04-07 11:16:47

Subjects: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.06016v3

Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification

Authorship Verification (AV) is a key area of research in digital text forensics, which addresses the fundamental question of whether two texts were written by the same person. Numerous computational approaches have been proposed over the last two decades in an attempt to address this challenge. However, existing AV methods often suffer from high complexity, low explainability and especially from a lack of clear scientific justification. We propose a simpler method based on modeling the grammar of an author following Cognitive Linguistics principles. These models are used to calculate $\lambda_G$ (LambdaG): the ratio of the likelihoods of a document given the candidate's grammar versus given a reference population's grammar. Our empirical evaluation, conducted on twelve datasets and compared against seven baseline methods, demonstrates that LambdaG achieves superior performance, including against several neural network-based AV methods. LambdaG is also robust to small variations in the composition of the reference population and provides interpretable visualizations, enhancing its explainability. We argue that its effectiveness is due to the method's compatibility with Cognitive Linguistics theories predicting that a person's grammar is a behavioral biometric.
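
The core quantity is a likelihood ratio between two language models. A minimal sketch of that ratio, using an add-alpha smoothed word-bigram model as a stand-in (the paper's actual models are cognitively motivated grammar models, not simple bigrams; all names here are ours):

```python
import math
from collections import Counter

def bigram_model(texts, alpha=1.0):
    """Fit an add-alpha smoothed word-bigram model; return its log-likelihood fn."""
    counts, context, vocab = Counter(), Counter(), set()
    for t in texts:
        toks = ["<s>"] + t.split()
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    V = len(vocab) + 1  # +1 slot for unseen tokens

    def logprob(text):
        toks = ["<s>"] + text.split()
        return sum(
            math.log((counts[(a, b)] + alpha) / (context[a] + alpha * V))
            for a, b in zip(toks, toks[1:])
        )
    return logprob

def lambda_g(doc, candidate_texts, reference_texts):
    """LambdaG as a log-likelihood ratio: candidate's model vs. reference population's."""
    cand = bigram_model(candidate_texts)
    ref = bigram_model(reference_texts)
    return cand(doc) - ref(doc)  # > 0 favours the candidate author
```

A positive value means the document is more probable under the candidate's model than under the reference population's, which is the attribution signal the method thresholds.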

Updated: 2025-04-07 11:12:57

Subjects: cs.CL,cs.LG

Download: http://arxiv.org/abs/2403.08462v2

Probabilistic Pontryagin's Maximum Principle for Continuous-Time Model-Based Reinforcement Learning

Without exact knowledge of the true system dynamics, optimal control of non-linear continuous-time systems requires careful treatment of epistemic uncertainty. In this work, we propose a probabilistic extension to Pontryagin's maximum principle by minimizing the mean Hamiltonian with respect to epistemic uncertainty. We show minimization of the mean Hamiltonian is a necessary optimality condition when optimizing the mean cost, and propose a multiple shooting numerical method scalable to large-scale probabilistic dynamical models, including ensemble neural ordinary differential equations. Comparisons against state-of-the-art methods in online and offline model-based reinforcement learning tasks show that our probabilistic Hamiltonian formulation leads to reduced trial costs in offline settings and achieves competitive performance in online scenarios. By bridging optimal control and reinforcement learning, our approach offers a principled and practical framework for controlling uncertain systems with learned dynamics.
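
In standard optimal-control notation (a sketch with our own symbols, since the abstract does not fix notation): with dynamics $\dot{x} = f_{\theta}(x, u)$, running cost $\ell(x, u)$, costate $\lambda$, and epistemic uncertainty captured by a distribution $p(\theta)$ over model parameters, the mean Hamiltonian is

```latex
% Mean Hamiltonian under epistemic uncertainty over dynamics parameters \theta
\bar{H}(x, u, \lambda)
  \;=\;
  \mathbb{E}_{\theta \sim p(\theta)}
  \!\left[\, \ell(x, u) \;+\; \lambda^{\top} f_{\theta}(x, u) \,\right]
```

and the necessary condition stated in the abstract is that the optimal control minimizes this quantity pointwise, $u^{*} = \arg\min_{u} \bar{H}(x, u, \lambda)$, along the optimal trajectory.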

Updated: 2025-04-07 11:11:43

Subjects: cs.LG

Download: http://arxiv.org/abs/2504.02543v2

BriLLM: Brain-inspired Large Language Model

This paper reports the first brain-inspired large language model (BriLLM). It is a generative language model that is neither a Transformer nor a GPT, and departs from traditional machine learning models that are controlled only at their input and output ends. The model is based on the Signal Fully-connected flowing (SiFu) definition over a directed graph representing the neural network, and every node of the whole model's graph is interpretable, whereas traditional machine learning models offer only limited interpretability at the input and output ends. In the language-model setting, each token is defined as a node in the graph. A randomly shaped or user-defined signal flows between nodes along paths of "least resistance". The next token or node to be predicted or generated is the target of the signal flow. Because the model size is independent of the input and prediction length, BriLLM theoretically supports infinitely long $n$-gram models. The model's working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we have released the first BriLLM version in Chinese, with 4,000 tokens, a 32-dimensional node width, 16-token-long sequence prediction ability, and language-model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.

Updated: 2025-04-07 11:09:39

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.11299v2

Expectations vs Reality -- A Secondary Study on AI Adoption in Software Testing

In the software industry, artificial intelligence (AI) has been used increasingly in software development activities. In some activities, such as coding, AI has already become an everyday tool, but in software testing it has not yet made a significant breakthrough. In this paper, the objective was to identify what kind of empirical research with industry context has been conducted on AI in software testing, as well as how AI has been adopted in software testing practice. To achieve this, we performed a systematic mapping study of recent (2020 and later) studies on AI adoption in software testing in the industry, and applied thematic analysis to identify common themes and categories, such as the real-world use cases and benefits, in the found papers. The observations suggest that AI is not yet heavily utilized in software testing, and still relatively few studies on AI adoption in software testing have been conducted in the industry context to solve real-world problems. Earlier studies indicated a noticeable gap between the actual use cases and benefits and the expectations, which we analyzed further. While there were numerous potential use cases for AI in software testing, such as test case generation, code analysis, and intelligent test automation, the reported actual implementations and observed benefits were limited. In addition, the systematic mapping study revealed a potential problem with false positive search results in online databases when using the search string "artificial intelligence".

Updated: 2025-04-07 11:03:54

Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2504.04921v1

Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B

As language models continue to grow larger, the cost of acquiring high-quality training data has increased significantly. Collecting human feedback is both expensive and time-consuming, and manual labels can be noisy, leading to an imbalance between helpfulness and harmfulness. Constitutional AI, introduced by Anthropic in December 2022, uses AI to provide feedback to another AI, greatly reducing the need for human labeling. However, the original implementation was designed for a model with around 52 billion parameters, and there is limited information on how well Constitutional AI performs with smaller models, such as LLaMA 3-8B. In this paper, we replicated the Constitutional AI workflow using the smaller LLaMA 3-8B model. Our results show that Constitutional AI can effectively increase the harmlessness of the model, reducing the Attack Success Rate in MT-Bench by 40.8%. However, similar to the original study, increasing harmlessness comes at the cost of helpfulness. The helpfulness metrics, which are an average of the Turn 1 and Turn 2 scores, dropped by 9.8% compared to the baseline. Additionally, we observed clear signs of model collapse in the final DPO-CAI model, indicating that smaller models may struggle with self-improvement due to insufficient output quality, making effective fine-tuning more challenging. Our study suggests that, like reasoning and math ability, self-improvement is an emergent property.

Updated: 2025-04-07 11:01:25

Subjects: cs.AI,68T05, 68T50,I.2.6; I.2.7; I.2.1

Download: http://arxiv.org/abs/2504.04918v1

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identifying 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
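
The abstract names three rule ingredients: triggers, predicates, and enforcement mechanisms. The sketch below is a hypothetical mini rule engine illustrating that structure only; it is not AgentSpec's actual DSL syntax, and the example rule and field names are invented:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: str                       # event name that activates the rule
    predicate: Callable[[dict], bool]  # condition checked on the event payload
    enforce: Callable[[dict], dict]    # enforcement applied when the predicate holds

def check(rules, event_name, payload):
    """Apply every matching rule's enforcement to the event payload."""
    for rule in rules:
        if rule.trigger == event_name and rule.predicate(payload):
            payload = rule.enforce(payload)
    return payload

# Hypothetical rule: block destructive shell commands from a code-execution agent.
block_rm = Rule(
    trigger="code_execution",
    predicate=lambda p: "rm -rf" in p.get("command", ""),
    enforce=lambda p: {**p, "allowed": False, "reason": "destructive command"},
)
```

Runtime enforcement of this kind stays in the millisecond range because each event only requires evaluating a handful of predicates, consistent with the overheads the paper reports.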

Updated: 2025-04-07 10:57:45

Subjects: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.18666v2

Open-Vocabulary Action Localization with Iterative Visual Prompting

Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/
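
The iterative narrowing step can be sketched as follows, with the VLM call mocked out as a function argument (the concatenated-image prompting itself is not reproduced here; `pick_frames` and all parameter values are our stand-ins):

```python
def localize(num_frames, pick_frames, rounds=4, samples=8):
    """Iteratively narrow [lo, hi] around the frames a VLM picks.

    `pick_frames(frame_indices)` stands in for the VLM query: given the
    sampled frame indices, it returns the (start, end) indices it judges
    most likely to bound the action.
    """
    lo, hi = 0, num_frames - 1
    for _ in range(rounds):
        step = max(1, (hi - lo) // (samples - 1))
        sampled = list(range(lo, hi + 1, step))
        s, e = pick_frames(sampled)
        # Shrink the window to one sampling step around the chosen boundaries.
        lo, hi = max(lo, s - step), min(hi, e + step)
    return lo, hi
```

Each round keeps the same number of sampled frames but a smaller span, so the temporal boundaries converge geometrically toward the action.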

Updated: 2025-04-07 10:55:13

Subjects: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2408.17422v5

Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration

Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available on https://github.com/ritaranx/Collab-RAG/.
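
The inference-time division of labour can be sketched as a short loop; the three function arguments are stand-ins for the white-box SLM, the retriever, and the black-box LLM (this is our reading of the pipeline, not code from the Collab-RAG repository):

```python
def collab_answer(question, decompose, retrieve, answer):
    """Sketch of the Collab-RAG inference loop.

    decompose: white-box SLM that splits a complex query into sub-questions.
    retrieve:  retriever run once per sub-question.
    answer:    black-box LLM that reasons over the gathered evidence.
    """
    sub_questions = decompose(question)
    evidence = [doc for sq in sub_questions for doc in retrieve(sq)]
    return answer(question, sub_questions, evidence)
```

The training loop described in the abstract then closes the circle: the black-box LLM's feedback on the final answers supervises the SLM's decomposition policy.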

Updated: 2025-04-07 10:52:22

Subjects: cs.CL,cs.AI,cs.IR,cs.LG

Download: http://arxiv.org/abs/2504.04915v1

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture where browsing agents extract relevant information from various webpage structures, overcoming significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.

Updated: 2025-04-07 10:45:47

Subjects: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.03160v2

IterMask3D: Unsupervised Anomaly Detection and Segmentation with Test-Time Iterative Mask Refinement in 3D Brain MR

Unsupervised anomaly detection and segmentation methods train a model to learn the training distribution as 'normal'. In the testing phase, they identify patterns that deviate from this normal distribution as 'anomalies'. To learn the `normal' distribution, prevailing methods corrupt the images and train a model to reconstruct them. During testing, the model attempts to reconstruct corrupted inputs based on the learned 'normal' distribution. Deviations from this distribution lead to high reconstruction errors, which indicate potential anomalies. However, corrupting an input image inevitably causes information loss even in normal regions, leading to suboptimal reconstruction and an increased risk of false positives. To alleviate this, we propose IterMask3D, an iterative spatial mask-refining strategy designed for 3D brain MRI. We iteratively spatially mask areas of the image as corruption and reconstruct them, then shrink the mask based on reconstruction error. This process iteratively unmasks 'normal' areas to the model, whose information further guides reconstruction of 'normal' patterns under the mask to be reconstructed accurately, reducing false positives. In addition, to achieve better reconstruction performance, we also propose using high-frequency image content as additional structural information to guide the reconstruction of the masked area. Extensive experiments on the detection of both synthetic and real-world imaging artifacts, as well as segmentation of various pathological lesions across multiple MRI sequences, consistently demonstrate the effectiveness of our proposed method.
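
The mask-shrinking loop can be sketched with the reconstruction model mocked out (a minimal 2D stand-in for the 3D pipeline; `reconstruct`, the threshold, and the iteration count are our assumptions, and the high-frequency guidance is omitted):

```python
import numpy as np

def iter_mask_refine(image, reconstruct, init_mask, thresh=0.1, iters=5):
    """Iteratively shrink a spatial mask based on reconstruction error.

    `reconstruct(image, mask)` stands in for the trained inpainting model:
    it returns a full reconstruction of `image` given the masked input.
    Voxels whose reconstruction error falls below `thresh` are unmasked
    ('normal'); whatever remains masked at the end is flagged as anomalous.
    """
    mask = init_mask.copy()
    for _ in range(iters):
        recon = reconstruct(image, mask)
        err = np.abs(image - recon)
        still_masked = mask & (err > thresh)
        if np.array_equal(still_masked, mask):
            break  # converged: no voxel was unmasked this round
        mask = still_masked
    return mask
```

Each unmasking round hands the model more verified 'normal' context, which is what drives down the false positives the abstract describes.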

Updated: 2025-04-07 10:41:23

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.04911v1

Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AI

One of the core components of our world models is 'intuitive physics' - an understanding of objects, space, and causality. This capability enables us to predict events, plan action and navigate environments, all of which rely on a composite sense of objecthood. Despite its importance, there is no single, unified account of objecthood, though multiple theoretical frameworks provide insights. In the first part of this paper, we present a comprehensive overview of the main theoretical frameworks in objecthood research - Gestalt psychology, enactive cognition, and developmental psychology - and identify the core capabilities each framework attributes to object understanding, as well as what functional roles they play in shaping world models in biological agents. Given the foundational role of objecthood in world modelling, understanding objecthood is also essential in AI. In the second part of the paper, we evaluate how current AI paradigms approach and test objecthood capabilities compared to those in cognitive science. We define an AI paradigm as a combination of how objecthood is conceptualised, the methods used for studying objecthood, the data utilised, and the evaluation techniques. We find that, whilst benchmarks can detect that AI systems model isolated aspects of objecthood, the benchmarks cannot detect when AI systems lack functional integration across these capabilities, and thus do not fully solve the objecthood challenge. Finally, we explore novel evaluation approaches that align with the integrated vision of objecthood outlined in this paper. These methods are promising candidates for advancing from isolated object capabilities toward general-purpose AI with genuine object understanding in real-world contexts.

Updated: 2025-04-07 10:39:12

Subjects: cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.21668v2

AlgOS: Algorithm Operating System

Algorithm Operating System (AlgOS) is an unopinionated, extensible, modular framework for algorithmic implementations. AlgOS offers numerous features: integration with Optuna for automated hyperparameter tuning; automated argument parsing for generic command-line interfaces; automated registration of new classes; and a centralised database for logging experiments and studies. These features are designed to reduce the overhead of implementing new algorithms and to standardise the comparison of algorithms. The standardisation of algorithmic implementations is crucial for reproducibility and reliability in research. AlgOS combines Abstract Syntax Trees with a novel implementation of the Observer pattern to control the logical flow of algorithmic segments.
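
The Observer pattern for chaining algorithmic segments can be sketched generically as below; this illustrates the pattern only and is not AlgOS's actual API (class and method names are ours):

```python
class Segment:
    """An algorithmic segment that notifies observers when it completes."""

    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self._observers = []

    def subscribe(self, callback):
        """Register a callback invoked with (segment name, result)."""
        self._observers.append(callback)

    def run(self, value):
        result = self.fn(value)
        for cb in self._observers:  # downstream segments react to the result
            cb(self.name, result)
        return result

# Chain two segments: completing 'normalize' triggers 'score' via the hook.
log = []
normalize = Segment("normalize", lambda x: x / 10)
score = Segment("score", lambda x: x * 3)
normalize.subscribe(lambda name, res: log.append((name, score.run(res))))
normalize.run(50)
```

Decoupling segments this way is what lets a framework reorder or swap algorithmic stages without the stages knowing about each other, which is the control-flow role the abstract assigns to the pattern.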

Updated: 2025-04-07 10:36:46

Subjects: cs.SE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.04909v1

Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance

Small Language Models (SLMs) offer efficient alternatives to LLMs for specific domains. The 2023 TinyStories study developed an English dataset that allows SLMs with 1 to 10 million parameters to produce coherent outputs. Our research expands this framework by translating the original dataset into Indian languages and creating synthetic data using LLMs. We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. We show that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, providing a complementary framework for "inference based evaluation" of tokenization strategies and linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provide fundamental understanding behind the better performance of Hindi models over Marathi and Bengali. Additionally, we show that synthetic datasets outperform translated content for training SLMs. Correlation analyses reveal cross-linguistic patterns and language-specific relationships between creativity, grammatical precision, and narrative completeness. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.

Updated: 2025-04-07 10:33:14

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.07989v1

Video-Bench: Human-Aligned Video Generation Benchmark

Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos while aligning with human expectations. Current video generation benchmarks fall into two main categories: traditional benchmarks, which use metrics and embeddings to evaluate generated video quality across multiple dimensions but often lack alignment with human judgments; and large language model (LLM)-based benchmarks, which, though capable of human-like reasoning, are constrained by a limited understanding of video quality metrics and cross-modal consistency. To address these challenges and establish a benchmark that better aligns with human preferences, this paper introduces Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions. This benchmark represents the first attempt to systematically leverage MLLMs across all dimensions relevant to video generation assessment in generative models. By incorporating few-shot scoring and chain-of-query techniques, Video-Bench provides a structured, scalable approach to generated video evaluation. Experiments on advanced models including Sora demonstrate that Video-Bench achieves superior alignment with human preferences across all dimensions. Moreover, in instances where our framework's assessments diverge from human evaluations, it consistently offers more objective and accurate insights, suggesting an even greater potential advantage over traditional human judgment.

Updated: 2025-04-07 10:32:42

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04907v1

KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model

Artificial intelligence is gradually demonstrating its immense potential, and increasing attention is being given to how AI can be harnessed to advance scientific research. In this vision paper, we present our perspectives on how AI can better assist scientific inquiry and explore corresponding technical approaches. We have proposed and open-sourced two large models of our KALE-LM model series, KALE-LM-Chem(-1.5), which have achieved outstanding performance in tasks related to the field of chemistry. We hope that our work serves as a strong starting point, helping to realize more intelligent AI and promoting the advancement of human science and technology, as well as societal development.

Updated: 2025-04-07 10:25:31

Categories: cs.AI,cs.CE,cs.CL

Download: http://arxiv.org/abs/2409.18695v2

Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision

We present Lumina-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible and user-friendly interactions. Built on Diffusion Transformer (DiT)-based generative priors, our framework supports arbitrary resolutions -- achieving optimal performance at 1K resolution -- while preserving fine-grained details and high fidelity. Through extensive experiments, we demonstrate that separately encoding text and visual instructions, combined with co-training using shallow feature control, is essential to mitigate task ambiguity and enhance multi-task generalization. Our findings also reveal that integrating high-level generative tasks into low-level vision models can compromise detail-sensitive restoration. These insights pave the way for more robust and generalizable low-level vision systems.

Updated: 2025-04-07 10:22:00

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04903v1

Practical Acoustic Eavesdropping On Typed Passphrases

Cloud services have become an essential infrastructure for enterprises and individuals. Access to these cloud services is typically governed by Identity and Access Management systems, where user authentication often relies on passwords. While best practices dictate the implementation of multi-factor authentication, it's a reality that many such users remain solely protected by passwords. This reliance on passwords creates a significant vulnerability, as these credentials can be compromised through various means, including side-channel attacks. This paper exploits keyboard acoustic emanations to infer typed natural language passphrases via unsupervised learning, requiring no prior training data. Whilst this work focuses on short passphrases, it is also applicable to longer messages, such as confidential emails, where the margin for error is much greater than with passphrases, making the attack even more effective in such a setting. Unlike traditional attacks that require physical access to the target device, acoustic side-channel attacks can be executed within the vicinity, without the user's knowledge, offering a worthwhile avenue for malicious actors. Our findings replicate and extend previous work, confirming that cross-correlation audio preprocessing outperforms methods like mel-frequency cepstral coefficients and fast Fourier transforms in keystroke clustering. Moreover, we show that partial passphrase recovery through clustering and a dictionary attack can enable attacks faster than brute force, further emphasizing the risks posed by this attack vector.
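A hedged sketch of the cross-correlation preprocessing the paper found to outperform MFCC and FFT features: two keystroke recordings are compared by the peak of their normalized cross-correlation, which tolerates time misalignment. The signals below are synthetic stand-ins, not real keystroke audio.

```python
import numpy as np

def xcorr_similarity(a, b):
    # Normalize to zero mean, unit variance, then take the peak of the
    # full cross-correlation: similarity regardless of time lag.
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    corr = np.correlate(a, b, mode="full") / len(a)
    return corr.max()

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 400)
key_a = np.sin(2 * np.pi * 40 * t) * np.exp(-5 * t)          # one "key" template
key_a_shift = np.roll(key_a, 25) + 0.05 * rng.standard_normal(400)  # same key, delayed + noisy
key_b = np.sin(2 * np.pi * 90 * t) * np.exp(-5 * t)          # a different key

same = xcorr_similarity(key_a, key_a_shift)
diff = xcorr_similarity(key_a, key_b)
```

Keystrokes of the same key score high despite the shift, while different keys score low, which is what makes this feature usable for unsupervised clustering.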

Updated: 2025-04-07 10:07:08

Categories: cs.CR

Download: http://arxiv.org/abs/2503.16719v2

Graph of Effort: Quantifying Risk of AI Usage for Vulnerability Assessment

With AI-based software becoming widely available, the risk of exploiting its capabilities, such as high automation and complex pattern recognition, could significantly increase. An AI used offensively to attack non-AI assets is referred to as offensive AI. Current research explores how offensive AI can be utilized and how its usage can be classified. Additionally, methods for threat modeling are being developed for AI-based assets within organizations. However, there are gaps that need to be addressed. Firstly, there is a need to quantify the factors contributing to the AI threat. Secondly, there is a requirement to create threat models that analyze the risk of being attacked by AI for vulnerability assessment across all assets of an organization. This is particularly crucial and challenging in cloud environments, where sophisticated infrastructure and access control landscapes are prevalent. The ability to quantify and further analyze the threat posed by offensive AI enables analysts to rank vulnerabilities and prioritize the implementation of proactive countermeasures. To address these gaps, this paper introduces the Graph of Effort, an intuitive, flexible, and effective threat modeling method for analyzing the effort required to use offensive AI for vulnerability exploitation by an adversary. While the threat model is functional and provides valuable support, its design choices need further empirical validation in future work.

Updated: 2025-04-07 10:01:44

Categories: cs.CR,cs.AI,cs.DC

Download: http://arxiv.org/abs/2503.16392v2

SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models

Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. However, existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1,162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models (VLMs) on SCAM, we demonstrate that typographic attacks significantly degrade performance, and identify that training data and model architecture influence the susceptibility to these attacks. Our findings reveal that typographic attacks persist in state-of-the-art Large Vision-Language Models (LVLMs) due to the choice of their vision encoder, though larger Large Language Model (LLM) backbones help mitigate their vulnerability. Additionally, we demonstrate that synthetic attacks closely resemble real-world (handwritten) attacks, validating their use in research. Our work provides a comprehensive resource and empirical insights to facilitate future research toward robust and trustworthy multimodal AI systems. We publicly release the datasets introduced in this paper at https://huggingface.co/datasets/BLISS-e-V/SCAM, along with the code for evaluations at https://github.com/Bliss-e-V/SCAM.

Updated: 2025-04-07 10:01:38

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04893v1

ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis

We present ECG-Expert-QA, a comprehensive multimodal dataset for evaluating diagnostic capabilities in electrocardiogram (ECG) interpretation. It combines real-world clinical ECG data with systematically generated synthetic cases, covering 12 essential diagnostic tasks and totaling 47,211 expert-validated QA pairs. These encompass diverse clinical scenarios, from basic rhythm recognition to complex diagnoses involving rare conditions and temporal changes. A key innovation is the support for multi-turn dialogues, enabling the development of conversational medical AI systems that emulate clinician-patient or interprofessional interactions. This allows for more realistic assessment of AI models' clinical reasoning, diagnostic accuracy, and knowledge integration. Constructed through a knowledge-guided framework with strict quality control, ECG-Expert-QA ensures linguistic and clinical consistency, making it a high-quality resource for advancing AI-assisted ECG interpretation. It challenges models with tasks like identifying subtle ischemic changes and interpreting complex arrhythmias in context-rich scenarios. To promote research transparency and collaboration, the dataset, accompanying code, and prompts are publicly released at https://github.com/Zaozzz/ECG-Expert-QA

Updated: 2025-04-07 09:59:44

Categories: eess.SP,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2502.17475v3

Leveraging Large Language Models for Cost-Effective, Multilingual Depression Detection and Severity Assessment

Depression is a prevalent mental health disorder that is difficult to detect early due to subjective symptom assessments. Recent advancements in large language models have offered efficient and cost-effective approaches for this objective. In this study, we evaluated the performance of four LLMs in depression detection using clinical interview data. We selected the best performing model and further tested it in the severity evaluation scenario and knowledge enhanced scenario. The robustness was evaluated in complex diagnostic scenarios using a dataset comprising 51,074 statements from six different mental disorders. We found that DeepSeek V3 is the most reliable and cost-effective model for depression detection, performing well in both zero-shot and few-shot scenarios, with zero-shot being the most efficient choice. The evaluation of severity showed low agreement with the human evaluator, particularly for mild depression. The model maintains stably high AUCs for detecting depression in complex diagnostic scenarios. These findings highlight DeepSeek V3's strong potential for text-based depression detection in real-world clinical applications. However, they also underscore the need for further refinement in severity assessment and the mitigation of potential biases to enhance clinical reliability.

Updated: 2025-04-07 09:58:19

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.04891v1

Going beyond explainability in multi-modal stroke outcome prediction models

Aim: This study aims to enhance the interpretability and explainability of multi-modal prediction models integrating imaging and tabular patient data. Methods: We adapt the xAI methods Grad-CAM and Occlusion to multi-modal, partly interpretable deep transformation models (dTMs). dTMs combine statistical and deep learning approaches to simultaneously achieve state-of-the-art prediction performance and interpretable parameter estimates, such as odds ratios for tabular features. Based on brain imaging and tabular data from 407 stroke patients, we trained dTMs to predict functional outcome three months after stroke. We evaluated the models using different discriminatory metrics. The adapted xAI methods were used to generate explanation maps for the identification of relevant image features and for error analysis. Results: The dTMs achieve state-of-the-art prediction performance, with area under the curve (AUC) values close to 0.8. The most important tabular predictors of functional outcome are functional independence before stroke and NIHSS on admission, a neurological score indicating stroke severity. Explanation maps calculated from brain imaging dTMs for functional outcome highlighted critical brain regions such as the frontal lobe, which is known to be linked to age, a factor that in turn increases the risk of unfavorable outcomes. Similarity plots of the explanation maps revealed distinct patterns which give insight into stroke pathophysiology, support the development of novel predictors of stroke outcome, and enable the identification of false predictions. Conclusion: By adapting methods for explanation maps to dTMs, we enhanced the explainability of multi-modal and partly interpretable prediction models. The resulting explanation maps facilitate error analysis and support hypothesis generation regarding the significance of specific image regions in outcome prediction.

Updated: 2025-04-07 09:56:16

Categories: eess.IV,cs.CV,cs.LG,stat.AP

Download: http://arxiv.org/abs/2504.06299v1

SoK: LLM-based Log Parsing

Log data, generated by software systems, provides crucial insights for tasks like monitoring, root cause analysis, and anomaly detection. Due to the vast volume of logs, automated log parsing is essential to transform semi-structured log messages into structured representations. Traditional log parsing techniques often require manual configurations, such as defining log formats or labeling data, which limits scalability and usability. Recent advances in large language models (LLMs) have introduced the new research field of LLM-based log parsing, offering potential improvements in automation and adaptability. Despite promising results, there is no structured overview of these approaches since this is a relatively new research field with the earliest advances published in late 2023. This paper systematically reviews 29 LLM-based log parsing methods, comparing their capabilities, limitations, and reliance on manual effort. We analyze the learning and prompt-engineering paradigms employed, efficiency- and effectiveness-enhancing techniques, and the role of LLMs in the parsing process. We aggregate the results of the survey in a large table comprising the characterizing features of LLM-based log parsing approaches and derive the general process of LLM-based log parsing, incorporating all reviewed approaches in a single flow chart. Additionally, we benchmark seven open-source LLM-based log parsers on public datasets and critically assess their reproducibility. Our findings summarize the advances of this new research field and provide insights for researchers and practitioners seeking efficient and user-friendly log parsing solutions, with all code and results made publicly available for transparency.
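To make the input/output contract of log parsing concrete: a parser turns a semi-structured message into a template plus extracted variables. The trivial heuristic below (masking digit-containing tokens) is only an illustrative stand-in for the LLM call that the surveyed methods use.

```python
import re

def parse_log(message):
    # Hypothetical toy parser: any token containing a digit is treated as a
    # variable and replaced by the <*> wildcard; everything else is template.
    tokens = message.split()
    template, variables = [], []
    for tok in tokens:
        if re.search(r"\d", tok):
            template.append("<*>")
            variables.append(tok)
        else:
            template.append(tok)
    return " ".join(template), variables

tmpl, vars_ = parse_log("Connection from 10.0.0.5 closed after 382 ms")
```

Real LLM-based parsers aim for the same structured representation but handle variables (IDs, paths, levels) that simple heuristics like this one miss.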

Updated: 2025-04-07 09:41:04

Categories: cs.LG,I.2; I.5

Download: http://arxiv.org/abs/2504.04877v1

Low-Rank Extragradient Methods for Scalable Semidefinite Optimization

We consider several classes of highly important semidefinite optimization problems that involve both a convex objective function (smooth or nonsmooth) and additional linear or nonlinear smooth and convex constraints, which are ubiquitous in statistics, machine learning, combinatorial optimization, and other domains. We focus on high-dimensional and plausible settings in which the problem admits a low-rank solution which also satisfies a low-rank complementarity condition. We provide several theoretical results proving that, under these circumstances, the well-known Extragradient method, when initialized in the proximity of an optimal primal-dual solution, converges to a solution of the constrained optimization problem with its standard convergence rates guarantees, using only low-rank singular value decompositions (SVD) to project onto the positive semidefinite cone, as opposed to computationally-prohibitive full-rank SVDs required in worst-case. Our approach is supported by numerical experiments conducted with a dataset of Max-Cut instances.
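An illustrative sketch (not the paper's algorithm verbatim) of the key computational point: an extragradient step over the PSD cone where the projection uses only a rank-r eigendecomposition instead of a full one. The toy objective f(X) = 0.5 * ||X - M||_F^2, the rank r, and the step size are assumptions for demonstration.

```python
import numpy as np

def project_psd_lowrank(X, r):
    # Keep only the top-r eigenpairs, clipped to be nonnegative:
    # a low-rank surrogate for full projection onto the PSD cone.
    w, V = np.linalg.eigh((X + X.T) / 2)
    idx = np.argsort(w)[::-1][:r]
    w_r = np.clip(w[idx], 0, None)
    return (V[:, idx] * w_r) @ V[:, idx].T

def extragradient_step(X, grad, eta, r):
    Y = project_psd_lowrank(X - eta * grad(X), r)      # extrapolation point
    return project_psd_lowrank(X - eta * grad(Y), r)   # update with grad at Y

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 2))
M = A @ A.T                        # rank-2 PSD target
grad = lambda X: X - M             # gradient of 0.5 * ||X - M||_F^2
X = np.zeros((6, 6))
for _ in range(200):
    X = extragradient_step(X, grad, eta=0.5, r=2)
```

Because the optimum is low-rank, the truncated projection never discards the part of the iterate that matters, which is the intuition behind avoiding full-rank SVDs.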

Updated: 2025-04-07 09:36:31

Categories: math.OC,cs.LG

Download: http://arxiv.org/abs/2402.09081v2

Debate-Feedback: A Multi-Agent Framework for Efficient Legal Judgment Prediction

The use of AI in legal analysis and prediction (LegalAI) has gained widespread attention, with past research focusing on retrieval-based methods and fine-tuning large models. However, these approaches often require large datasets and underutilize the capabilities of modern large language models (LLMs). In this paper, inspired by the debate phase of real courtroom trials, we propose a novel legal judgment prediction model based on the Debate-Feedback architecture, which integrates LLM multi-agent debate and reliability evaluation models. Unlike traditional methods, our model achieves significant improvements in efficiency by minimizing the need for large historical datasets, thus offering a lightweight yet robust solution. Comparative experiments show that it outperforms several general-purpose and domain-specific legal models, offering a dynamic reasoning process and a promising direction for future LegalAI research.

Updated: 2025-04-07 09:34:14

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2504.05358v1

Find A Winning Sign: Sign Is All We Need to Win the Lottery

The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at https://github.com/JungHunOh/AWS_ICLR2025.git
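The central observation can be sketched as a small transplant operation: copy only the sparsity mask and the parameter signs of a trained sparse network onto a fresh random initialization, keeping the random magnitudes. The toy tensors and function name are illustrative; the paper applies this idea across entire architectures.

```python
import numpy as np

def transfer_sign_and_mask(w_random, w_sparse_trained):
    # Zero out pruned positions, then impose the trained signs on the
    # randomly initialized magnitudes.
    mask = (w_sparse_trained != 0).astype(w_random.dtype)
    return mask * np.sign(w_sparse_trained) * np.abs(w_random)

rng = np.random.default_rng(3)
w_rand = rng.standard_normal((3, 3))          # fresh random init
w_ticket = np.array([[0.7, 0.0, -0.2],        # trained sparse "ticket"
                     [0.0, 0.0, 0.3],
                     [-0.9, 0.1, 0.0]])
w_init = transfer_sign_and_mask(w_rand, w_ticket)
```

The result shares the ticket's sparsity pattern and signs while its magnitudes still come from the random draw, which is exactly the information the paper argues matters for reaching the same basin of attraction.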

Updated: 2025-04-07 09:30:38

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05357v1

Futureproof Static Memory Planning

The NP-complete combinatorial optimization task of assigning offsets to a set of buffers with known sizes and lifetimes so as to minimize total memory usage is called dynamic storage allocation (DSA). Existing DSA implementations bypass the theoretical state-of-the-art algorithms in favor of either fast but wasteful heuristics, or memory-efficient approaches that do not scale beyond one thousand buffers. The "AI memory wall", combined with deep neural networks' static architecture, has reignited interest in DSA. We present idealloc, a low-fragmentation, high-performance DSA implementation designed for million-buffer instances. Evaluated on a novel suite of particularly hard benchmarks from several domains, idealloc ranks first against four production implementations in terms of a joint effectiveness/robustness criterion.
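To make the DSA problem concrete, here is a classic greedy heuristic of the "fast but wasteful" kind the abstract contrasts against: place each buffer, largest first, at the lowest offset that does not collide with any already-placed buffer whose lifetime overlaps. This illustrates the problem idealloc targets, not idealloc's own algorithm.

```python
def first_fit(buffers):
    """buffers: list of (size, start, end) lifetimes; returns offsets and peak memory."""
    placed = []   # (offset, size, start, end)
    offsets = []
    for size, start, end in sorted(buffers, key=lambda b: -b[0]):
        offset = 0
        while True:
            conflicts = [
                (o, s) for o, s, st, en in placed
                if st < end and start < en                  # lifetimes overlap
                and o < offset + size and offset < o + s    # address ranges overlap
            ]
            if not conflicts:
                break
            offset = max(o + s for o, s in conflicts)       # jump past the conflict
        placed.append((offset, size, start, end))
        offsets.append(offset)
    peak = max(o + s for o, s, _, _ in placed)
    return offsets, peak

# Two buffers alive at once; a third is alive only after the first dies,
# so it could in principle reuse the first buffer's memory.
offs, peak = first_fit([(100, 0, 10), (50, 0, 10), (100, 10, 20)])
```

Because buffers with disjoint lifetimes may share addresses, the peak here is 150 rather than the naive 250, and minimizing that peak over millions of buffers is the NP-complete core of DSA.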

Updated: 2025-04-07 09:28:54

Categories: cs.OS,cs.AI,cs.PL

Download: http://arxiv.org/abs/2504.04874v1

Closed-Loop Neural Operator-Based Observer of Traffic Density

We consider the problem of traffic density estimation with sparse measurements from stationary roadside sensors. Our approach uses Fourier neural operators to learn macroscopic traffic flow dynamics from high-fidelity microscopic-level simulations. During inference, the operator functions as an open-loop predictor of traffic evolution. To close the loop, we couple the open-loop operator with a correction operator that combines the predicted density with sparse measurements from the sensors. Simulations with the SUMO software indicate that, compared to open-loop observers, the proposed closed-loop observer exhibits classical closed-loop properties such as robustness to noise and ultimate boundedness of the error. This shows the advantages of combining learned physics with real-time corrections, and opens avenues for accurate, efficient, and interpretable data-driven observers.
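A hedged sketch of the closed-loop idea: an open-loop predictor advances the density estimate, then a correction step nudges the state toward the sparse sensor measurements (a Luenberger-style gain). The toy advection predictor and the gain value are assumptions; the paper learns both components with neural operators.

```python
import numpy as np

def predict(rho):
    # Toy open-loop dynamics on a ring road: pure advection by one cell.
    return np.roll(rho, 1)

def correct(rho_pred, sensor_idx, measurements, gain=0.5):
    # Pull the predicted state toward measurements at the sensor cells only.
    rho = rho_pred.copy()
    rho[sensor_idx] += gain * (measurements - rho_pred[sensor_idx])
    return rho

true = np.array([0.2, 0.8, 0.4, 0.1, 0.6, 0.3])   # unknown true density
est = np.zeros(6)                                  # observer starts with no information
sensors = np.array([0, 3])                         # two stationary sensors
for _ in range(50):
    true = predict(true)
    est = correct(predict(est), sensors, true[sensors])

err_closed = np.abs(est - true).max()
```

Even with only two measured cells, the error decays because the advection carries every cell past a sensor, mirroring the ultimate boundedness property observed in the paper's SUMO experiments.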

Updated: 2025-04-07 09:28:50

Categories: math.OC,cs.LG

Download: http://arxiv.org/abs/2504.04873v1

Ternarization of Vision Language Models for use on edge devices

We propose a process to compress a pre-trained Vision Language Model into a ternary version of itself instead of training a ternary model from scratch. A new initialization scheme for the ternary weights, based on applying the k-means algorithm to the pre-trained weights, is proposed to reduce the ternarization time. We implement different custom operators for executing the ternary model on the TensorFlow Lite Engine. We compare the original model with its ternary and binary versions in terms of memory consumption, inference speed and perplexity. We find that the ternary model using our custom ternary matrix multiplication operator provides a good compromise in terms of memory usage and perplexity, while having the fastest token generation speed.
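A sketch of a k-means-style ternarization initialization: cluster a layer's pre-trained weights into three groups and map each cluster to {-s, 0, +s}. The details here (the shared scale, the assignment rule, the 1-D k-means loop) are assumptions for illustration; the paper's exact scheme may differ.

```python
import numpy as np

def ternarize_kmeans(w, iters=20):
    # 1-D k-means with 3 centers, initialized from the weight range.
    centers = np.array([w.min(), 0.0, w.max()])
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for k in range(3):
            if np.any(assign == k):
                centers[k] = w[assign == k].mean()
    scale = (abs(centers[0]) + abs(centers[2])) / 2   # shared magnitude (assumed)
    codes = assign - 1                                # cluster ids -> {-1, 0, +1}
    return codes, scale

rng = np.random.default_rng(2)
w = np.concatenate([rng.normal(-1, 0.05, 100),
                    rng.normal(0, 0.05, 100),
                    rng.normal(1, 0.05, 100)])
codes, scale = ternarize_kmeans(w)
```

Starting from data-driven cluster centers rather than fixed thresholds gives the ternary model an initialization already close to the pre-trained weights, which is the motivation for the faster ternarization the abstract reports.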

Updated: 2025-04-07 09:28:11

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.06298v1

DyTTP: Trajectory Prediction with Normalization-Free Transformers

Accurate trajectory prediction is a cornerstone for the safe operation of autonomous driving systems, where understanding the dynamic behavior of surrounding agents is crucial. Transformer-based architectures have demonstrated significant promise in capturing complex spatio-temporal dependencies. However, their reliance on normalization layers can lead to computation overhead and training instabilities. In this work, we present a two-fold approach to address these challenges. First, we integrate DynamicTanh (DyT), a recently proposed normalization-free technique for transformers, into the backbone, replacing traditional layer normalization. This modification simplifies the network architecture and improves the stability of inference. To our knowledge, this is the first work to apply DyT to the trajectory prediction task. Complementing this, we employ a snapshot ensemble strategy to further boost trajectory prediction performance. Using cyclical learning rate scheduling, multiple model snapshots are captured during a single training run. These snapshots are then aggregated via simple averaging at inference time, allowing the model to benefit from diverse hypotheses without incurring substantial additional computational cost. Extensive experiments on Argoverse datasets demonstrate that our combined approach significantly improves prediction accuracy, inference speed and robustness in diverse driving scenarios. This work underscores the potential of normalization-free transformer designs augmented with lightweight ensemble techniques in advancing trajectory forecasting for autonomous vehicles.
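A minimal numpy sketch of the DynamicTanh (DyT) layer the paper drops in for LayerNorm: y = weight * tanh(alpha * x) + bias, with a learnable scalar alpha and per-channel affine parameters. No batch statistics are computed, which is what removes the normalization overhead; the initialization value for alpha here is an assumption.

```python
import numpy as np

class DyT:
    """Normalization-free replacement for LayerNorm (a sketch, no autograd)."""
    def __init__(self, dim, alpha0=0.5):
        self.alpha = alpha0              # learnable scalar controlling saturation
        self.weight = np.ones(dim)       # learnable per-channel scale
        self.bias = np.zeros(dim)        # learnable per-channel shift

    def __call__(self, x):               # x: (..., dim)
        return self.weight * np.tanh(self.alpha * x) + self.bias

layer = DyT(dim=4)
out = layer(np.array([[0.0, 1.0, -1.0, 10.0]]))
```

The tanh squashes outliers (the input 10.0 maps close to 1), mimicking the range-limiting effect of normalization without computing any per-batch statistics.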

Updated: 2025-04-07 09:26:25

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05356v1

FedSAUC: A Similarity-Aware Update Control for Communication-Efficient Federated Learning in Edge Computing

Federated learning is a distributed machine learning framework to collaboratively train a global model without uploading privacy-sensitive data onto a centralized server. Usually, this framework is applied to edge devices such as smartphones, wearable devices, and Internet of Things (IoT) devices which closely collect information from users. However, these devices are mostly battery-powered, and the update procedure of federated learning constantly consumes battery power and transmission bandwidth. In this work, we propose FedSAUC, an update control for federated learning that considers the similarity of users' behaviors (models). At the server side, we exploit clustering algorithms to group devices with similar models. We then select representatives from each cluster to send updates for model training. We also implement a testbed prototype on edge devices to validate the performance. The experimental results show that this update control does not affect the training accuracy in the long run.
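The cluster-then-select update control can be sketched as below: cluster clients by the similarity of their model parameters, then let only one representative per cluster transmit an update. Function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def select_updaters(client_weights, n_clusters=2, iters=10):
    """Similarity-aware update control in the spirit of FedSAUC:
    group clients with similar (flattened) model vectors via k-means,
    then pick the client closest to each cluster center as the one
    that sends its update to the server."""
    X = np.asarray(client_weights, dtype=float)
    # deterministic farthest-point initialization, then plain k-means
    centers = [X[0]]
    while len(centers) < n_clusters:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # representative = client closest to its cluster center
    reps = []
    for j in range(n_clusters):
        members = np.flatnonzero(labels == j)
        dist = np.linalg.norm(X[members] - centers[j], axis=1)
        reps.append(int(members[dist.argmin()]))
    return labels, reps

# four clients: 0 and 1 behave alike, 2 and 3 behave alike
clients = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels, reps = select_updaters(clients)
```

Only the two representatives upload in a given round, which is how the scheme saves battery and bandwidth without (in the long run) hurting accuracy.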

Updated: 2025-04-07 09:21:43

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.04867v1

GAMDTP: Dynamic Trajectory Prediction with Graph Attention Mamba Network

Accurate motion prediction of traffic agents is crucial for the safety and stability of autonomous driving systems. In this paper, we introduce GAMDTP, a novel graph attention-based network tailored for dynamic trajectory prediction. Specifically, in each graph convolution layer we fuse the outputs of self-attention and Mamba-SSM through a gate mechanism, leveraging the strengths of both to extract features more efficiently and accurately. GAMDTP encodes the high-definition map (HD map) data and the agents' historical trajectory coordinates and decodes the network's output to generate the final prediction results. Additionally, recent approaches predominantly focus on dynamically fusing historical forecast results and rely on two-stage frameworks comprising proposal and refinement. To further enhance the performance of such two-stage frameworks, we also design a scoring mechanism to evaluate the prediction quality during the proposal and refinement processes. Experiments on the Argoverse dataset demonstrate that GAMDTP achieves state-of-the-art performance, attaining superior accuracy in dynamic trajectory prediction.
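The gated fusion of the two branches can be sketched as g = sigmoid(W[a; m] + b), out = g * a + (1 - g) * m. Shapes and parameter names below are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(attn_out, mamba_out, Wg, bg):
    """Fuse self-attention and Mamba-SSM branch outputs through a learned
    gate: concatenate both features, compute an elementwise gate in (0, 1),
    and take the gate-weighted convex combination of the two branches."""
    z = np.concatenate([attn_out, mamba_out], axis=-1) @ Wg + bg
    g = sigmoid(z)                         # elementwise gate
    return g * attn_out + (1.0 - g) * mamba_out

d = 4
rng = np.random.default_rng(0)
a = rng.normal(size=(2, d))                # attention branch features
m = rng.normal(size=(2, d))                # mamba-ssm branch features
Wg = rng.normal(scale=0.1, size=(2 * d, d))
bg = np.zeros(d)
out = gated_fusion(a, m, Wg, bg)
```

Because the gate is a convex combination, the fused feature always lies between the two branch features elementwise, letting the network learn per-dimension which branch to trust.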

Updated: 2025-04-07 09:19:20

Categories: cs.AI,cs.RO

Download: http://arxiv.org/abs/2504.04862v1

SAFT: Structure-aware Transformers for Textual Interaction Classification

Textual interaction networks (TINs) are an omnipresent data structure used to model the interplay between users and items on e-commerce websites, social networks, etc., where each interaction is associated with a text description. Classifying such textual interactions (TIC) finds extensive use in detecting spam reviews in e-commerce, fraudulent transactions in finance, and so on. Existing TIC solutions either (i) fail to capture the rich text semantics due to the use of context-free text embeddings, and/or (ii) disregard the bipartite structure and node heterogeneity of TINs, leading to compromised TIC performance. In this work, we propose SAFT, a new architecture that integrates language- and graph-based modules for the effective fusion of textual and structural semantics in the representation learning of interactions. In particular, line graph attention (LGA)/gated attention units (GAUs) and pretrained language models (PLMs) are capitalized on to model the interaction-level and token-level signals, which are further coupled via the proxy token in an iterative and contextualized fashion. Additionally, an efficient and theoretically-grounded approach is developed to encode the local and global topology information pertaining to interactions into structural embeddings. The resulting embeddings not only inject the structural features underlying TINs into the textual interaction encoding but also facilitate the design of graph sampling strategies. Extensive empirical evaluations on multiple real TIN datasets demonstrate the superiority of SAFT over the state-of-the-art baselines in TIC accuracy.

Updated: 2025-04-07 09:19:12

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.04861v1

MetaSC: Test-Time Safety Specification Optimization for Language Models

We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts-termed specifications-to drive the critique and revision process adaptively. This test-time optimization not only improves performance against adversarial jailbreak requests but also in diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code released at https://github.com/vicgalle/meta-self-critique.git .

Updated: 2025-04-07 09:15:30

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2502.07985v2

Don't Lag, RAG: Training-Free Adversarial Detection Using RAG

Adversarial patch attacks pose a major threat to vision systems by embedding localized perturbations that mislead deep models. Traditional defense methods often require retraining or fine-tuning, making them impractical for real-world deployment. We propose a training-free Visual Retrieval-Augmented Generation (VRAG) framework that integrates Vision-Language Models (VLMs) for adversarial patch detection. By retrieving visually similar patches and images that resemble stored attacks in a continuously expanding database, VRAG performs generative reasoning to identify diverse attack types, all without additional training or fine-tuning. We extensively evaluate open-source large-scale VLMs, including Qwen-VL-Plus, Qwen2.5-VL-72B, and UI-TARS-72B-DPO, alongside Gemini-2.0, a closed-source model. Notably, the open-source UI-TARS-72B-DPO model achieves up to 95 percent classification accuracy, setting a new state-of-the-art for open-source adversarial patch detection. Gemini-2.0 attains the highest overall accuracy, 98 percent, but remains closed-source. Experimental results demonstrate VRAG's effectiveness in identifying a variety of adversarial patches with minimal human annotation, paving the way for robust, practical defenses against evolving adversarial patch attacks.
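The retrieval step of the pipeline (finding stored attack patches most similar to a query) can be sketched with cosine similarity; a real system would use VLM embeddings and a vector index, so the toy vectors below are purely illustrative:

```python
import numpy as np

def retrieve_similar(query_emb, db_embs, k=2):
    """Return indices and scores of the k stored patch embeddings most
    similar to the query, ranked by cosine similarity -- the retrieval
    stage that feeds the VLM's generative reasoning."""
    q = query_emb / np.linalg.norm(query_emb)
    D = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = D @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# toy database of known-attack embeddings
db = np.array([[1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
idx, scores = retrieve_similar(np.array([1.0, 0.05, 0.0]), db)
```

The retrieved examples are then handed to the VLM as context, so detection improves as the attack database grows, with no retraining.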

Updated: 2025-04-07 09:14:47

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.04858v1

BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

Detecting biases in structured data is a complex and time-consuming task. Existing automated techniques are limited in diversity of data types and heavily reliant on human case-by-case handling, resulting in a lack of generalizability. Currently, large language model (LLM)-based agents have made significant progress in data science, but their ability to detect data biases is still insufficiently explored. To address this gap, we introduce the first end-to-end, multi-agent synergy framework, BIASINSPECTOR, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi-stage plan to analyze user-specified bias detection tasks and then implements it with a diverse and well-suited set of tools. It delivers detailed results that include explanations and visualizations. To address the lack of a standardized framework for evaluating the capability of LLM agents to detect biases in data, we further propose a comprehensive benchmark that includes multiple evaluation metrics and a large set of test cases. Extensive experiments demonstrate that our framework achieves exceptional overall performance in structured data bias detection, setting a new milestone for fairer data applications.

Updated: 2025-04-07 09:12:00

Categories: cs.AI

Download: http://arxiv.org/abs/2504.04855v1

Efficient Hamiltonian, structure and trace distance learning of Gaussian states

In this work, we initiate the study of Hamiltonian learning for positive temperature bosonic Gaussian states, the quantum generalization of the widely studied problem of learning Gaussian graphical models. We obtain efficient protocols, both in sample and computational complexity, for the task of inferring the parameters of their underlying quadratic Hamiltonian under the assumption of bounded temperature, squeezing, displacement and maximal degree of the interaction graph. Our protocol only requires heterodyne measurements, which are often experimentally feasible, and has a sample complexity that scales logarithmically with the number of modes. Furthermore, we show that it is possible to learn the underlying interaction graph in a similar setting and sample complexity. Taken together, our results put the status of the quantum Hamiltonian learning problem for continuous variable systems in a more advanced state when compared to spins, where state-of-the-art results are either unavailable or quantitatively inferior to ours. In addition, we use our techniques to obtain the first results on learning Gaussian states in trace distance with a quadratic scaling in precision and polynomial in the number of modes, albeit imposing certain restrictions on the Gaussian states. Our main technical innovations are several continuity bounds for the covariance and Hamiltonian matrix of a Gaussian state, which are of independent interest, combined with what we call the local inversion technique. In essence, the local inversion technique allows us to reliably infer the Hamiltonian of a Gaussian state by only estimating in parallel submatrices of the covariance matrix whose size scales with the desired precision, but not the number of modes. This way we bypass the need to obtain precise global estimates of the covariance matrix, controlling the sample complexity.

Updated: 2025-04-07 09:10:06

Categories: quant-ph,cs.LG

Download: http://arxiv.org/abs/2411.03163v3

Looped Transformers for Length Generalization

Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
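The core idea, applying one fixed weight-tied step a number of times that adapts to the input length, can be illustrated with the parity task mentioned above. The `step` function is a toy stand-in for one pass of a looped Transformer block (a RASP-L-style operation), not the paper's model:

```python
def parity_looped(bits):
    """Length generalization by looping: the same fixed step is applied
    len(bits) times, so depth adapts to the input instead of being baked
    into the architecture."""
    def step(state):
        acc, i = state
        return (acc ^ bits[i], i + 1)   # fold in one more bit

    state = (0, 0)                      # (running parity, position)
    for _ in range(len(bits)):          # adaptive number of loop steps
        state = step(state)
    return state[0]
```

Because nothing about `step` depends on the sequence length, the same routine handles inputs of lengths never seen during training, which is exactly the property the looped architecture is after.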

Updated: 2025-04-07 09:03:44

Categories: cs.LG

Download: http://arxiv.org/abs/2409.15647v3

An Efficient Approach for Cooperative Multi-Agent Learning Problems

In this article, we propose a centralized Multi-Agent Learning framework for learning a policy that models the simultaneous behavior of multiple agents that need to coordinate to solve a certain task. Centralized approaches often suffer from the explosion of an action space that is defined by all possible combinations of individual actions, known as joint actions. Our approach addresses the coordination problem via a sequential abstraction, which overcomes the scalability problems typical of centralized methods. It introduces a meta-agent, called \textit{supervisor}, which abstracts joint actions as sequential assignments of actions to each agent. This sequential abstraction not only simplifies the centralized joint action space but also enhances the framework's scalability and efficiency. Our experimental results demonstrate that the proposed approach successfully coordinates agents across a variety of Multi-Agent Learning environments of diverse sizes.
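The sequential abstraction trades one decision over |A|^n joint actions for n decisions over |A| actions each. A toy sketch (the `policy` signature is a hypothetical stand-in for the learned assignment rule):

```python
from itertools import product

def joint_action_space(n_agents, n_actions):
    """Centralized joint actions: |A|^n combinations, which explodes."""
    return list(product(range(n_actions), repeat=n_agents))

def supervisor_assign(n_agents, policy):
    """Sequential abstraction: the 'supervisor' meta-agent assigns actions
    one agent at a time, conditioning on the assignments made so far, so
    each decision is over |A| options instead of |A|^n."""
    assigned = []
    for agent in range(n_agents):
        assigned.append(policy(agent, assigned))
    return assigned

# toy rule: each agent picks the action after its predecessor's (mod 3)
toy_policy = lambda agent, so_far: (so_far[-1] + 1) % 3 if so_far else 0
```

With 4 agents and 3 actions the joint space already has 81 entries, while the supervisor makes only 4 three-way choices; the gap widens exponentially with more agents.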

Updated: 2025-04-07 09:03:35

Categories: cs.AI

Download: http://arxiv.org/abs/2504.04850v1

Nonlocal techniques for the analysis of deep ReLU neural network approximations

Recently, Daubechies, DeVore, Foucart, Hanin, and Petrova introduced a system of piece-wise linear functions, which can be easily reproduced by artificial neural networks with the ReLU activation function and which form a Riesz basis of $L_2([0,1])$. This work was generalized by two of the authors to the multivariate setting. We show that this system serves as a Riesz basis also for Sobolev spaces $W^s([0,1]^d)$ and Barron classes ${\mathbb B}^s([0,1]^d)$ with smoothness $0<s<1$. We apply this fact to re-prove some recent results on the approximation of functions from these classes by deep neural networks. Our proof method avoids using local approximations and allows us to track also the implicit constants as well as to show that we can avoid the curse of dimension. Moreover, we also study how well one can approximate Sobolev and Barron functions by ANNs if only function values are known.
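For intuition, piecewise-linear systems of the kind referred to above are built from hat functions, which a one-hidden-layer ReLU network reproduces exactly. A standard identity (stated here for background, not taken from the paper) for the unit hat $H$ on $[0,1]$ with peak $1$ at $x = \tfrac12$ is:

```latex
H(x) \;=\; 2\,\mathrm{ReLU}(x) \;-\; 4\,\mathrm{ReLU}\!\left(x - \tfrac12\right) \;+\; 2\,\mathrm{ReLU}(x - 1),
\qquad \mathrm{ReLU}(t) = \max\{t, 0\}.
```

Checking the pieces: for $x \in [0, \tfrac12]$ only the first term is active and $H(x) = 2x$; for $x \in [\tfrac12, 1]$ the first two terms give $2x - 4(x - \tfrac12) = 2 - 2x$; and for $x \ge 1$ all three terms cancel, so $H$ vanishes outside $[0,1]$.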

Updated: 2025-04-07 09:00:22

Categories: cs.LG,cs.CC,cs.NA,math.NA,68T07, 42C15, 11A25

Download: http://arxiv.org/abs/2504.04847v1

Improving Customer Service with Automatic Topic Detection in User Emails

This study introduces a novel natural language processing pipeline that enhances customer service efficiency at Telekom Srbija, a leading Serbian telecommunications company, through automated email topic detection and labeling. Central to the pipeline is BERTopic, a modular framework that allows unsupervised topic modeling. After a series of preprocessing and postprocessing steps, we assign one of 12 topics and several additional labels to incoming emails, allowing the customer service to filter and access them through a custom-made application. The model's performance was evaluated by assessing the speed and correctness of the automatically assigned topics, with a weighted average processing time of 0.041 seconds per email and a weighted average F1 score of 0.96. The pipeline shows broad applicability across languages, particularly to those that are low-resourced and morphologically rich. The system now operates in the company's production environment, streamlining customer service operations through automated email classification.

Updated: 2025-04-07 08:58:17

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.19115v2

Deep Learning for Double Auction

Auctions are important mechanisms extensively implemented in various markets, e.g., search engines' keyword auctions, antique auctions, etc. Finding an optimal auction mechanism is extremely difficult due to the constraints of imperfect information, incentive compatibility (IC), and individual rationality (IR). In addition to the traditional economic methods, some recently attempted to find the optimal (single) auction using deep learning methods. Unlike those attempts focusing on single auctions, we develop deep learning methods for double auctions, where imperfect information exists on both the demand and supply sides. The previous attempts on single auction cannot directly apply to our contexts and those attempts additionally suffer from limited generalizability, inefficiency in ensuring the constraints, and learning fluctuations. We innovate in designing deep learning models for solving the more complex problem and additionally addressing the previous models' three limitations. Specifically, we achieve generalizability by leveraging a transformer-based architecture to model market participants as sequences for varying market sizes; we utilize the numerical features of the constraints and pre-treat them for a higher learning efficiency; we develop a gradient-conflict-elimination scheme to address the problem of learning fluctuation. Extensive experimental evaluations demonstrate the superiority of our approach to classical and machine learning baselines.
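The "gradient-conflict-elimination scheme" mentioned above can be illustrated with a PCGrad-style projection: when two task gradients point in conflicting directions (negative inner product), remove from one the component along the other before combining. This is a generic sketch of the idea, not necessarily the paper's exact scheme:

```python
import numpy as np

def eliminate_conflict(g1, g2):
    """If g1 conflicts with g2 (g1 . g2 < 0), project g1 onto the plane
    normal to g2 so the combined update no longer fights g2's objective.
    Non-conflicting gradients are returned unchanged."""
    dot = g1 @ g2
    if dot < 0:
        g1 = g1 - (dot / (g2 @ g2)) * g2   # drop the conflicting component
    return g1

g1 = np.array([1.0, -1.0])     # conflicts with g2 below
g2 = np.array([0.0, 1.0])
g1_fixed = eliminate_conflict(g1, g2)
```

After the projection the two gradients have a non-negative inner product, which is what damps the learning fluctuations the abstract refers to.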

Updated: 2025-04-07 08:56:32

Categories: cs.LG,cs.GT,econ.TH

Download: http://arxiv.org/abs/2504.05355v1

Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models

Visual prompting (VP) is a new technique that adapts frozen models trained on source-domain tasks to target-domain tasks. This study examines VP's benefits for black-box model-level backdoor detection. The visual prompt in VP maps class subspaces between source and target domains. We identify a misalignment, termed class subspace inconsistency, between clean and poisoned datasets. Based on this, we introduce \textsc{BProm}, a black-box model-level detection method to identify backdoors in suspicious models, if any. \textsc{BProm} leverages the low classification accuracy of prompted models when backdoors are present. Extensive experiments confirm \textsc{BProm}'s effectiveness.

Updated: 2025-04-07 08:55:40

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.09540v2

A Causal Framework for Evaluating Deferring Systems

Deferring systems extend supervised Machine Learning (ML) models with the possibility to defer predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems, which allows to identify the causal impact of the deferring strategy on predictive accuracy. We distinguish two scenarios. In the first one, we have access to both the human and ML model predictions for the deferred instances. Here, we can identify the individual causal effects for deferred instances and the aggregates of them. In the second one, only human predictions are available for the deferred instances. Here, we can resort to regression discontinuity designs to estimate a local causal effect. We evaluate our approach on synthetic and real datasets for seven deferring systems from the literature.
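The second scenario, where only human predictions are available for deferred instances, calls for a regression discontinuity design: fit outcomes locally on each side of the deferral cutoff and read off the jump at the cutoff as the local causal effect. A minimal sketch on synthetic data (the scoring rule and bandwidth are illustrative assumptions):

```python
import numpy as np

def rdd_effect(score, outcome, cutoff, bandwidth):
    """Local regression-discontinuity estimate: fit a line to outcomes on
    each side of the deferral cutoff within a bandwidth, then take the
    discontinuity at the cutoff as the local causal effect of deferring."""
    left = (score >= cutoff - bandwidth) & (score < cutoff)
    right = (score >= cutoff) & (score <= cutoff + bandwidth)
    bl = np.polyfit(score[left], outcome[left], 1)    # linear fit, left side
    br = np.polyfit(score[right], outcome[right], 1)  # linear fit, right side
    return np.polyval(br, cutoff) - np.polyval(bl, cutoff)

# synthetic data with a true jump of 0.2 at the cutoff 0.6
rng = np.random.default_rng(0)
s = rng.uniform(0, 1, 2000)                       # deferral score
y = 0.5 * s + 0.2 * (s >= 0.6) + rng.normal(0, 0.01, s.size)
effect = rdd_effect(s, y, cutoff=0.6, bandwidth=0.2)
```

The estimate is local: it identifies the effect only for instances near the deferral threshold, which is exactly the claim made in the abstract.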

Updated: 2025-04-07 08:54:30

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2405.18902v2

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves competitive performance to GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models(LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaving with audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. The training of such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni's language understanding capability throughout the training process. To our best knowledge, M2-omni is currently a very competitive open-source model to GPT-4o, characterized by its comprehensive modality and task support, as well as its exceptional performance. We expect M2-omni will advance the development of omni-MLLMs, thus facilitating future research in this domain.

Updated: 2025-04-07 08:54:28

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2502.18778v3

On the best approximation by finite Gaussian mixtures

We consider the problem of approximating a general Gaussian location mixture by finite mixtures. The minimum order of finite mixtures that achieve a prescribed accuracy (measured by various $f$-divergences) is determined within constant factors for the family of mixing distributions with compact support or appropriate assumptions on the tail probability, including subgaussian and subexponential. While the upper bound is achieved using the technique of local moment matching, the lower bound is established by relating the best approximation error to the low-rank approximation of certain trigonometric moment matrices, followed by a refined spectral analysis of their minimum eigenvalue. In the case of Gaussian mixing distributions, this result corrects a previous lower bound in [Allerton Conference 48 (2010) 620-628].

Updated: 2025-04-07 08:48:24

Categories: math.ST,cs.IT,cs.LG,math.IT,stat.ML,stat.TH

Download: http://arxiv.org/abs/2404.08913v2

Explanation-Driven Interventions for Artificial Intelligence Model Customization: Empowering End-Users to Tailor Black-Box AI in Rhinocytology

The integration of Artificial Intelligence (AI) in modern society is heavily shifting the way that individuals carry out their tasks and activities. Employing AI-based systems raises challenges that designers and developers must address to ensure that humans remain in control of the interaction process, particularly in high-risk domains. This article presents a novel End-User Development (EUD) approach for black-box AI models through a redesigned user interface in the Rhino-Cyt platform, a medical AI-based decision-support system for medical professionals (more precisely, rhinocytologists) to carry out cell classification. The proposed interface empowers users to intervene in AI decision-making process by editing explanations and reconfiguring the model, influencing its future predictions. This work contributes to Human-Centered AI (HCAI) and EUD by discussing how explanation-driven interventions allow a blend of explainability, user intervention, and model reconfiguration, fostering a symbiosis between humans and user-tailored AI systems.

Updated: 2025-04-07 08:44:48

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2504.04833v1

Noisy Test-Time Adaptation in Vision-Language Models

Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on target data during testing. In open-world scenarios, models often encounter noisy samples, i.e., samples outside the in-distribution (ID) label space. Leveraging the zero-shot capability of pre-trained vision-language models (VLMs), this paper introduces Zero-Shot Noisy TTA (ZS-NTTA), focusing on adapting the model to target data with noisy samples during test-time in a zero-shot manner. We find existing TTA methods underperform under ZS-NTTA, often lagging behind even the frozen model. We conduct comprehensive experiments to analyze this phenomenon, revealing that the negative impact of unfiltered noisy data outweighs the benefits of clean data during model updating. Also, adapting a classifier for ID classification and noise detection hampers both sub-tasks. Built on this, we propose a framework that decouples the classifier and detector, focusing on developing an individual detector while keeping the classifier frozen. Technically, we introduce the Adaptive Noise Detector (AdaND), which utilizes the frozen model's outputs as pseudo-labels to train a noise detector. To handle clean data streams, we further inject Gaussian noise during adaptation, preventing the detector from misclassifying clean samples as noisy. Beyond the ZS-NTTA, AdaND can also improve the zero-shot out-of-distribution (ZS-OOD) detection ability of VLMs. Experiments show that AdaND outperforms in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of $8.32\%$ in harmonic mean accuracy ($\text{Acc}_\text{H}$) for ZS-NTTA and $9.40\%$ in FPR95 for ZS-OOD detection, compared to SOTA methods. Importantly, AdaND is computationally efficient and comparable to the model-frozen method. The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA.
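Two ingredients of AdaND described above can be sketched directly: pseudo-labeling test samples from the frozen model's confidence, and injecting Gaussian noise on clean streams so the detector does not collapse to flagging everything as clean. The threshold and noise scale below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def pseudo_labels(conf, threshold=0.5):
    """Pseudo-label test samples as clean (1) or noisy (0) from the frozen
    model's max-softmax confidence -- the supervision signal the noise
    detector is trained on, with the classifier kept frozen."""
    return (np.asarray(conf) >= threshold).astype(int)

def inject_gaussian(x, sigma=0.1, seed=0):
    """On clean data streams, inject Gaussian noise during adaptation so
    the detector keeps seeing noisy-looking inputs and does not learn to
    misclassify clean samples as noise."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, sigma, size=x.shape)

y = pseudo_labels([0.9, 0.2, 0.7])
x_noisy = inject_gaussian(np.zeros((2, 3)))
```

Decoupling the detector from the frozen classifier is what keeps the method cheap: only the small detector is updated at test time.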

Updated: 2025-04-07 08:44:36

Categories: cs.LG

Download: http://arxiv.org/abs/2502.14604v2

Attentional Graph Meta-Learning for Indoor Localization Using Extremely Sparse Fingerprints

Fingerprint-based indoor localization is often labor-intensive due to the need for dense grids and repeated measurements across time and space. Maintaining high localization accuracy with extremely sparse fingerprints remains a persistent challenge. Existing benchmark methods rely primarily on measured fingerprints while neglecting valuable spatial and environmental characteristics. In this paper, we propose a systematic integration of an Attentional Graph Neural Network (AGNN) model, capable of learning spatial adjacency relationships and aggregating information from neighboring fingerprints, and a meta-learning framework that utilizes datasets with similar environmental characteristics to enhance model training. To minimize the labor required for fingerprint collection, we introduce two novel data augmentation strategies: 1) unlabeled fingerprint augmentation using moving platforms, which enables the semi-supervised AGNN model to incorporate information from unlabeled fingerprints, and 2) synthetic labeled fingerprint augmentation through environmental digital twins, which strengthens the meta-learning framework via a practical distribution alignment that effectively minimizes the feature discrepancy between synthetic and real-world fingerprints. By integrating these novel modules, we propose the Attentional Graph Meta-Learning (AGML) model, which combines the strengths of the AGNN model and the meta-learning framework to address the challenges posed by extremely sparse fingerprints. To validate our approach, we collected multiple datasets from both consumer-grade WiFi devices and professional equipment across diverse environments. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that the AGML model-based localization method consistently outperforms all baseline methods using sparse fingerprints across all evaluated metrics.

Updated: 2025-04-07 08:37:18

Categories: cs.LG,eess.SP,stat.ML

Download: http://arxiv.org/abs/2504.04829v1

From Specificity to Generality: Revisiting Generalizable Artifacts in Detecting Face Deepfakes

Detecting deepfakes has become an increasingly important topic, especially given the rapid development of AI generation techniques. In this paper, we ask: how can we build a universal detection framework that is effective for most facial deepfakes? One significant challenge is the wide variety of deepfake generators available, resulting in varying forgery artifacts (e.g., lighting inconsistency, color mismatch, etc). But should we "teach" the detector to learn all these artifacts separately? Elaborating on them all is impossible and impractical, so the core idea is to pinpoint the more common and general artifacts across different deepfakes. Accordingly, we categorize deepfake artifacts into two distinct yet complementary types: Face Inconsistency Artifacts (FIA) and Up-Sampling Artifacts (USA). FIA arise from the challenge of generating all intricate details, inevitably causing inconsistencies between the complex facial features and relatively uniform surrounding areas. USA, on the other hand, are the inevitable traces left by the generator's decoder during the up-sampling process. This categorization stems from the observation that all existing deepfakes typically exhibit one or both of these artifacts. To exploit it, we propose a new data-level pseudo-fake creation framework that constructs fake samples with only the FIA and USA, without introducing extra, less general artifacts. Specifically, we employ super-resolution to simulate the USA, and design a Blender module that applies image-level self-blending on diverse facial regions to create the FIA. We surprisingly found that, with this intuitive design, a standard image classifier trained only on our pseudo-fake data generalizes well to unseen deepfakes.
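The FIA construction via image-level self-blending reduces, at its core, to alpha-blending a transformed copy of the image back onto itself inside a facial-region mask. A toy sketch under that assumption (the paper's Blender module and region selection are more elaborate, and real inputs are multi-channel images):

```python
def self_blend(image, warped, mask):
    """Alpha-blend a transformed copy of the same image ('warped') back onto
    the original inside a facial-region mask, creating a face-inconsistency
    artifact (FIA) without any external fake generator.
    image/warped: 2D lists of pixel values; mask: 2D list of alphas in [0, 1]."""
    h, w = len(image), len(image[0])
    return [[mask[y][x] * warped[y][x] + (1 - mask[y][x]) * image[y][x]
             for x in range(w)] for y in range(h)]

img    = [[10, 10], [10, 10]]
warped = [[20, 20], [20, 20]]     # e.g. a color-jittered copy of img
mask   = [[1.0, 0.5], [0.0, 0.0]] # blend only the "face" region
print(self_blend(img, warped, mask))  # [[20.0, 15.0], [10.0, 10.0]]
```

A classifier trained on such pseudo-fakes sees the blending boundary and the local statistics mismatch rather than generator-specific artifacts.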

Updated: 2025-04-07 08:34:28

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04827v1

Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios

Autonomous driving has made significant progress in both academia and industry, including performance improvements in perception tasks and the development of end-to-end autonomous driving systems. However, the safety and robustness assessment of autonomous driving has not received sufficient attention. Current evaluations of autonomous driving are typically conducted in natural driving scenarios, yet many accidents occur in edge cases, also known as safety-critical scenarios. These safety-critical scenarios are difficult to collect, and there is currently no clear definition of what constitutes one. In this work, we explore the safety and robustness of autonomous driving in safety-critical scenarios. First, we provide a definition of safety-critical scenarios, covering static traffic scenarios such as adversarial attack scenarios and natural distribution shifts, as well as dynamic traffic scenarios such as accident scenarios. Then, we develop an autonomous driving safety testing platform to comprehensively evaluate autonomous driving systems, encompassing not only the assessment of perception modules but also system-level evaluations. Our work systematically constructs a safety verification process for autonomous driving, providing technical support for the industry to establish standardized test frameworks and reduce risks in real-world road deployment.

Updated: 2025-04-07 08:26:00

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2503.23708v2

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this study, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes will be open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.
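The weight side of the settings studied (e.g. W8A8 and W4A16) can be illustrated with plain symmetric round-to-nearest quantization. Real pipelines use per-group scales and calibration algorithms such as GPTQ or AWQ, so this is only a sketch of why lower bit-widths risk accuracy:

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization: map floats to signed
    integers in [-(2^(b-1)-1), 2^(b-1)-1] with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.75]
for bits in (8, 4):
    q, s = quantize(w, bits)
    err = max(abs(a - b) for a, b in zip(w, dequantize(q, s)))
    print(bits, "bits, max reconstruction error:", err)
```

Shrinking from 8 to 4 bits grows the rounding error by roughly an order of magnitude on this toy tensor, mirroring the accuracy risk the study observes at low bit-widths.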

Updated: 2025-04-07 08:22:45

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.04823v1

A Customized SAT-based Solver for Graph Coloring

We introduce ZykovColor, a novel SAT-based algorithm for the graph coloring problem that works on top of an encoding mimicking the Zykov tree. Our method builds on the approach of Hébrard and Katsirelos (2020), which employs a propagator to enforce transitivity constraints, incorporates lower bounds for search tree pruning, and enables inferred propagations. We leverage the recently introduced IPASIR-UP interface for CaDiCaL to implement these techniques within a SAT solver. Furthermore, we propose new features that take advantage of the underlying SAT solver. These include modifying the integrated decision strategy with vertex domination hints and using an incremental bottom-up search that reuses learned clauses from previous calls. Additionally, we integrate a more efficient clique computation to improve the lower bounds during the search. We validate the effectiveness of each new feature through an experimental analysis. ZykovColor outperforms other state-of-the-art graph coloring implementations on the DIMACS benchmark set. Further experiments on random Erdős-Rényi graphs show that our new approach dominates state-of-the-art SAT-based methods for both very sparse and highly dense graphs.
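The Zykov tree that the encoding mimics branches on a non-adjacent vertex pair: either the two vertices share a color (merge them) or they do not (add an edge between them). A plain recursive sketch of that branching rule, without any of the paper's SAT machinery, propagators, or bounds:

```python
from itertools import combinations

def chromatic(vertices, edges):
    """Chromatic number via Zykov branching: pick a non-adjacent pair u, v and
    take the min over 'same color' (contract u, v) and 'different colors'
    (add edge u-v). Leaves of the tree are complete graphs, whose chromatic
    number is |V|. Exponential; for illustration on tiny graphs only."""
    for u, v in combinations(sorted(vertices), 2):
        if frozenset((u, v)) not in edges:
            # branch 1: u and v share a color -> contract v into u
            merged = {frozenset(u if x == v else x for x in e) for e in edges}
            same = chromatic(vertices - {v}, merged)
            # branch 2: u and v get different colors -> add the edge
            diff = chromatic(vertices, edges | {frozenset((u, v))})
            return min(same, diff)
    return len(vertices)  # complete graph

c5 = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]}
print(chromatic({0, 1, 2, 3, 4}, c5))  # odd cycle: 3
```

ZykovColor encodes this same/different decision symbolically in SAT instead of enumerating the tree explicitly.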

Updated: 2025-04-07 08:22:00

Categories: cs.DM,cs.AI,cs.DS,cs.LO,05C15,G.2.2

Download: http://arxiv.org/abs/2504.04821v1

SINCon: Mitigate LLM-Generated Malicious Message Injection Attack for Rumor Detection

In the era of rapidly evolving large language models (LLMs), state-of-the-art rumor detection systems, particularly those based on Message Propagation Trees (MPTs), which represent a conversation tree with the post as its root and the replies as its descendants, are facing increasing threats from adversarial attacks that leverage LLMs to generate and inject malicious messages. Existing methods are based on the assumption that different nodes exhibit varying degrees of influence on predictions. They define nodes with high predictive influence as important nodes and target them for attacks. If the model treats nodes' predictive influence more uniformly, attackers will find it harder to target high predictive influence nodes. In this paper, we propose Similarizing the predictive Influence of Nodes with Contrastive Learning (SINCon), a defense mechanism that encourages the model to learn graph representations where nodes with varying importance have a more uniform influence on predictions. Extensive experiments on the Twitter and Weibo datasets demonstrate that SINCon not only preserves high classification accuracy on clean data but also significantly enhances resistance against LLM-driven message injection attacks.

Updated: 2025-04-07 08:20:48

Categories: cs.CR

Download: http://arxiv.org/abs/2504.07135v1

Low-Rank Extragradient Method for Nonsmooth and Low-Rank Matrix Optimization Problems

Low-rank and nonsmooth matrix optimization problems capture many fundamental tasks in statistics and machine learning. While significant progress has been made in recent years in developing efficient methods for \textit{smooth} low-rank optimization problems that avoid maintaining high-rank matrices and computing expensive high-rank SVDs, advances for nonsmooth problems have been slow-paced. In this paper we consider standard convex relaxations for such problems. Mainly, we prove that under a natural \textit{generalized strict complementarity} condition and under the relatively mild assumption that the nonsmooth objective can be written as a maximum of smooth functions, the \textit{extragradient method}, when initialized with a ``warm-start'' point, converges to an optimal solution with rate $O(1/t)$ while requiring only two \textit{low-rank} SVDs per iteration. We give a precise trade-off between the rank of the SVDs required and the radius of the ball in which we need to initialize the method. We support our theoretical results with empirical experiments on several nonsmooth low-rank matrix recovery tasks, demonstrating that using simple initializations, the extragradient method produces exactly the same iterates when full-rank SVDs are replaced with SVDs of rank that matches the rank of the (low-rank) ground-truth matrix to be recovered.
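The extragradient update itself (an extrapolation step, then a step using the gradient evaluated at the extrapolated point) can be shown on a toy smooth scalar problem; the paper's contribution, replacing the full-rank projection with two low-rank SVDs per iteration, is omitted in this sketch:

```python
def extragradient(grad, x0, eta=0.1, steps=200):
    """Extragradient: extrapolate with the gradient at x, then take the
    actual step using the gradient at the extrapolated point y."""
    x = x0
    for _ in range(steps):
        y = x - eta * grad(x)   # extrapolation (look-ahead) step
        x = x - eta * grad(y)   # update with the look-ahead gradient
    return x

# toy objective f(x) = (x - 3)^2 with gradient 2(x - 3); minimizer x* = 3
x_star = extragradient(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(round(x_star, 6))  # 3.0
```

In the matrix setting each of the two steps would additionally project onto the feasible set, and the paper's result is that these projections can be computed with rank-r SVDs near a warm start.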

Updated: 2025-04-07 08:09:29

Categories: math.OC,cs.LG,stat.ML

Download: http://arxiv.org/abs/2202.04026v2

Counterfactual Situation Testing: From Single to Multidimensional Discrimination

We present counterfactual situation testing (CST), a causal data mining framework for detecting individual discrimination in a dataset of classifier decisions. CST answers the question ``what would have been the model outcome had the individual, or complainant, been of a different protected status?'' It extends the legally-grounded situation testing (ST) of Thanh et al. (2011) by operationalizing the notion of "fairness given the difference" via counterfactual reasoning. ST finds for each complainant similar protected and non-protected instances in the dataset; constructs, respectively, a control and test group; and compares the groups such that a difference in model outcomes implies a potential case of individual discrimination. CST, instead, avoids this idealized comparison by establishing the test group on the complainant's generated counterfactual, which reflects how the protected attribute when changed influences other seemingly neutral attributes of the complainant. Under CST we test for discrimination for each complainant by comparing similar individuals within the control and test group but dissimilar individuals across these groups. We consider single (e.g.,~gender) and multidimensional (e.g.,~gender and race) discrimination testing. For multidimensional discrimination we study multiple and intersectional discrimination and, as feared by legal scholars, find evidence that the former fails to account for the latter kind. Using a k-nearest neighbor implementation, we showcase CST on synthetic and real data. Experimental results show that CST uncovers a higher number of cases than ST, even when the model is counterfactually fair. CST, in fact, extends counterfactual fairness (CF) of Kusner et al. (2017) by equipping CF with confidence intervals, which we report for all experiments.
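The ST-style comparison that CST extends can be sketched with k-nearest-neighbor control and test groups and a decision-rate gap. Note that CST itself would build the test group from the complainant's generated counterfactual, which requires a causal model and is not shown; the field names and threshold below are my own assumptions:

```python
def knn(complainant, pool, k):
    """k nearest neighbours of a 1-D feature, by absolute distance."""
    return sorted(pool, key=lambda r: abs(r["x"] - complainant["x"]))[:k]

def situation_test(complainant, records, k=2, tau=0.0):
    """ST-style check: compare positive-decision rates between the k nearest
    protected (control) and non-protected (test) neighbours; a gap above tau
    flags a potential case of individual discrimination. (CST would instead
    centre the test group on the complainant's counterfactual.)"""
    control = knn(complainant, [r for r in records if r["protected"]], k)
    test = knn(complainant, [r for r in records if not r["protected"]], k)
    rate = lambda grp: sum(r["y"] for r in grp) / len(grp)
    diff = rate(test) - rate(control)
    return diff, diff > tau

records = [
    {"x": 1.0, "protected": True,  "y": 0},
    {"x": 1.1, "protected": True,  "y": 0},
    {"x": 0.9, "protected": False, "y": 1},
    {"x": 1.2, "protected": False, "y": 1},
]
print(situation_test({"x": 1.0, "protected": True}, records))  # (1.0, True)
```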

Updated: 2025-04-07 08:09:21

Categories: cs.LG

Download: http://arxiv.org/abs/2502.01267v2

Sparse Optimization for Transfer Learning: A L0-Regularized Framework for Multi-Source Domain Adaptation

This paper explores transfer learning in heterogeneous multi-source environments with distributional divergence between target and auxiliary domains. To address challenges in statistical bias and computational efficiency, we propose a Sparse Optimization for Transfer Learning (SOTL) framework based on L0-regularization. The method extends the Joint Estimation Transferred from Strata (JETS) paradigm with two key innovations: (1) L0-constrained exact sparsity for parameter space compression and complexity reduction, and (2) refining optimization focus to emphasize target parameters over redundant ones. Simulations show that SOTL significantly improves both estimation accuracy and computational speed, especially under adversarial auxiliary domain conditions. Empirical validation on the Community and Crime benchmarks demonstrates the statistical robustness of the SOTL method in cross-domain transfer.
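L0-constrained exact sparsity is typically enforced by projecting onto the set of at-most-s-sparse vectors, i.e. keeping the s largest-magnitude coefficients and zeroing the rest. A sketch of that projection (not SOTL's full solver, which couples this constraint with the multi-source transfer objective):

```python
def l0_project(beta, s):
    """Euclidean projection onto the L0 ball {b : ||b||_0 <= s}:
    keep the s entries with largest magnitude, zero out the rest.
    Alternating this with gradient steps gives iterative hard thresholding."""
    keep = set(sorted(range(len(beta)), key=lambda i: -abs(beta[i]))[:s])
    return [b if i in keep else 0.0 for i, b in enumerate(beta)]

beta = [0.1, -2.0, 0.05, 1.5, -0.3]
print(l0_project(beta, 2))  # [0.0, -2.0, 0.0, 1.5, 0.0]
```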

Updated: 2025-04-07 08:06:16

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2504.04812v1

PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial $\textbf{Co}$de Contextualization

Multimodal Large Language Models (MLLMs), which integrate vision and other modalities into Large Language Models (LLMs), significantly enhance AI capabilities but also introduce new security vulnerabilities. By exploiting the vulnerabilities of the visual modality and the long-tail distribution characteristic of code training data, we present PiCo, a novel jailbreaking framework designed to progressively bypass multi-tiered defense mechanisms in advanced MLLMs. PiCo employs a tier-by-tier jailbreak strategy, using token-level typographic attacks to evade input filtering and embedding harmful intent within programming context instructions to bypass runtime monitoring. To comprehensively assess the impact of attacks, a new evaluation metric is further proposed to assess both the toxicity and helpfulness of model outputs post-attack. By embedding harmful intent within code-style visual instructions, PiCo achieves an average Attack Success Rate (ASR) of 84.13% on Gemini-Pro Vision and 52.66% on GPT-4, surpassing previous methods. Experimental results highlight the critical gaps in current defenses, underscoring the need for more robust strategies to secure advanced MLLMs.

Updated: 2025-04-07 08:05:25

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2504.01444v2

Select Me! When You Need a Tool: A Black-box Text Attack on Tool Selection

Tool learning serves as a powerful auxiliary mechanism that extends the capabilities of large language models (LLMs), enabling them to tackle complex tasks requiring real-time relevance or high-precision operations. Behind these powerful capabilities lie some potential security issues. However, previous work has primarily focused on how to make the output of the invoked tools incorrect or malicious, with little attention given to the manipulation of tool selection. To fill this gap, we introduce, for the first time, a black-box text-based attack that can significantly increase the probability of the target tool being selected. We propose a two-level text perturbation attack with coarse-to-fine granularity, attacking the text at both the word level and the character level. We conduct comprehensive experiments demonstrating that an attacker only needs to make some perturbations to the tool's textual information to significantly increase the possibility of the target tool being selected and ranked higher among the candidate tools. Our research reveals the vulnerability of the tool selection process and paves the way for future research on protecting this process.

Updated: 2025-04-07 08:04:23

Categories: cs.CR

Download: http://arxiv.org/abs/2504.04809v1

ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks, such as text-to-SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end-to-end ELT pipelines. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents' abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent Claude-3.7-Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc-kang-lab/ETL.git.

Updated: 2025-04-07 08:03:36

Categories: cs.DB,cs.AI

Download: http://arxiv.org/abs/2504.04808v1

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Tracking and segmenting multiple similar objects with distinct or complex parts in long-term videos is particularly challenging due to the ambiguity in identifying target components and the confusion caused by occlusion, background clutter, and changes in appearance or environment over time. In this paper, we propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic block comprising a semantic embedding component and a spatial dependency modeling part for associating global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation to ensure effective long-term query propagation. Extensive experimental results show that the proposed method achieves state-of-the-art performance on benchmark data sets, including the DAVIS2017 test (87.8%), YoutubeVOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%), and demonstrate the effectiveness and generalization capacity of our model. The source code and trained models are released at https://github.com/yahooo-m/S3.

Updated: 2025-04-07 07:55:21

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2407.07760v2

Out of Sight, Still at Risk: The Lifecycle of Transitive Vulnerabilities in Maven

Modern software development relies heavily on transitive dependencies: they enable seamless integration of third-party libraries, but they also introduce security challenges. Transitive vulnerabilities arising from indirect dependencies expose projects to risks associated with Common Vulnerabilities and Exposures (CVEs), even when direct dependencies remain secure. This paper examines the lifecycle of transitive vulnerabilities in the Maven ecosystem. We employ survival analysis to measure how long projects remain exposed after a CVE is introduced. Using a large dataset of Maven projects, we identify factors that influence the resolution of these vulnerabilities. Our findings offer practical advice on improving dependency management.
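The survival-analysis step, measuring how long projects remain exposed after a CVE is introduced, rests on estimators such as Kaplan-Meier, where a vulnerability still unfixed at the end of observation is a censored data point. A minimal sketch on toy data (the variable names and time unit are my assumptions, not the paper's):

```python
from collections import Counter

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) = P(still vulnerable beyond t).
    events[i] = 1 if the vulnerability was fixed at times[i], 0 if censored
    (still unfixed when observation ended)."""
    deaths = Counter(t for t, e in zip(times, events) if e)
    removed = Counter(times)  # fixed or censored both leave the risk set
    surv, at_risk, curve = 1.0, len(times), []
    for t in sorted(set(times)):
        if deaths[t]:
            surv *= 1 - deaths[t] / at_risk
            curve.append((t, surv))
        at_risk -= removed[t]
    return curve

# months until fix; event 0 = still open at the end of the study (censored)
times  = [2, 3, 3, 5, 7]
events = [1, 1, 0, 1, 0]
print(kaplan_meier(times, events))  # survival drops at each fix event
```

Censoring is what makes a plain average of fix times misleading: dropping the unfixed CVEs would bias exposure downward, while Kaplan-Meier uses them correctly.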

Updated: 2025-04-07 07:54:15

Categories: cs.SE,cs.CR

Download: http://arxiv.org/abs/2504.04803v1

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models

Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is both critical and imperative to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the extensiveness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy that systematically reflects LLM content safety, as well as automated safety assessment techniques to explore the potential risks efficiently. To bridge this striking gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components, i.e., an expert testing LLM ${M}_t$ and a novel safety critique LLM ${M}_c$. ${M}_t$ is responsible for automatically generating test cases in accordance with the proposed risk taxonomy. ${M}_c$ can provide quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval is efficient and effective in test generation and safety evaluation. Moreover, thanks to its LLM-based architecture, S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and the accompanying new safety threats, test generation methods, and safety critique methods. S-Eval has been deployed by our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios. Our benchmark is publicly available at https://github.com/IS2Lab/S-Eval.

Updated: 2025-04-07 07:52:28

Categories: cs.CR,cs.CL

Download: http://arxiv.org/abs/2405.14191v4

Large Language Models are In-Context Molecule Learners

Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods in adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To resolve the challenges, we propose In-Context Molecule Adaptation (ICMA), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning. Initially, Hybrid Context Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar informative context examples. Additionally, Post-retrieval Re-ranking is composed of Sequence Reversal and Random Walk selection to further improve the quality of retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures, showing that LLMs are inherently in-context molecule learners.
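The Hybrid Context Retrieval stage includes BM25 caption retrieval; a self-contained sketch of Okapi BM25 scoring over pre-tokenized documents, with k1 and b at common defaults (the paper's actual retriever, corpus, and parameter choices are not specified here):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25: rank documents for a query using inverse document
    frequency with term-frequency saturation (k1) and document-length
    normalization (b). query and docs are lists of tokens."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {term: sum(term in d for d in docs) for term in set(query)}
    scores = []
    for d in docs:
        s = 0.0
        for term in query:
            f = d.count(term)
            if f == 0:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["molecule", "caption", "pairs"],
        ["protein", "folding"],
        ["molecule", "graph", "retrieval", "molecule"]]
query = ["molecule", "retrieval"]
scores = bm25_scores(query, docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 2
```

ICMA pairs this lexical retrieval with molecule-graph retrieval and then re-ranks the pooled candidates before in-context tuning.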

Updated: 2025-04-07 07:46:51

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2403.04197v4

Multi-agent Application System in Office Collaboration Scenarios

This paper introduces a multi-agent application system designed to enhance office collaboration efficiency and work quality. The system integrates artificial intelligence, machine learning, and natural language processing technologies, achieving functionalities such as task allocation, progress monitoring, and information sharing. The agents within the system are capable of providing personalized collaboration support based on team members' needs and incorporate data analysis tools to improve decision-making quality. The paper also proposes an intelligent agent architecture that separates Plan and Solver, and through techniques such as multi-turn query rewriting and business tool retrieval, it enhances the agent's multi-intent and multi-turn dialogue capabilities. Furthermore, the paper details the design of tools and multi-turn dialogue in the context of office collaboration scenarios, and validates the system's effectiveness through experiments and evaluations. Ultimately, the system has demonstrated outstanding performance in real business applications, particularly in query understanding, task planning, and tool calling. Looking forward, the system is expected to play a more significant role in addressing complex interaction issues within dynamic environments and large-scale multi-agent systems.

Updated: 2025-04-07 07:46:24

Categories: cs.AI,cs.CL,cs.SE

Download: http://arxiv.org/abs/2503.19584v3

Topological Schrödinger Bridge Matching

Given two boundary distributions, the Schr\"odinger Bridge (SB) problem seeks the "most likely" random evolution between them with respect to a reference process. It has revealed rich connections to recent machine learning methods for generative modeling and distribution matching. While these methods perform well in Euclidean domains, they are not directly applicable to topological domains such as graphs and simplicial complexes, which are crucial for data defined over network entities, such as node signals and edge flows. In this work, we propose the Topological Schr\"odinger Bridge problem (TSBP) for matching signal distributions on a topological domain. We set the reference process to follow some linear tractable topology-aware stochastic dynamics such as topological heat diffusion. For the case of Gaussian boundary distributions, we derive a closed-form topological SB (TSB) in terms of its time-marginal and stochastic differential. In the general case, leveraging a well-known result, we show that the optimal process follows the forward-backward topological dynamics governed by some unknowns. Building on these results, we develop TSB-based models for matching topological signals by parameterizing the unknowns in the optimal process as (topological) neural networks and learning them through likelihood training. We validate the theoretical results and demonstrate the practical applications of TSB-based models on both synthetic and real-world networks, emphasizing the role of topology. Additionally, we discuss the connections of TSB-based models to other emerging models, and outline future directions for topological signal matching.
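In its classical dynamic form, the SB problem is an entropy-regularized distribution-matching problem; a schematic formulation (notation assumed here, not taken from the paper) is:

```latex
\min_{\mathbf{P}\,:\,\mathbf{P}_0 = \rho_0,\;\mathbf{P}_T = \rho_T}
D_{\mathrm{KL}}\!\left(\mathbf{P} \,\|\, \mathbf{Q}\right)
```

where $\mathbf{Q}$ is the law of the reference process (here, topology-aware dynamics such as topological heat diffusion) and the minimization runs over path measures $\mathbf{P}$ whose time marginals match the two boundary distributions $\rho_0$ and $\rho_T$.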

Updated: 2025-04-07 07:45:21

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2504.04799v1

TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation

Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model. While the latter alleviates this by learning a single representation for all features, it currently leverages sparse suboptimal encoding heuristics and necessitates additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are composed of its density, flexibility to provide ample separability for nominal features, and ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient.

Updated: 2025-04-07 07:44:27

Categories: cs.LG

Download: http://arxiv.org/abs/2504.04798v1

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
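The benefit of ping-pong pipelining can be seen with a toy timing model. Assuming (purely hypothetically) equal per-micro-batch latencies and strictly sequential processing on each of the two disjoint GPU groups, shuttling m micro-batches between the attention and FFN stages overlaps their work:

```python
def pipelined_makespan(m, t_attn=1.0, t_ffn=1.0):
    """Finish time for m micro-batches shuttled through a two-stage
    (attention -> FFN) pipeline hosted on disjoint GPU groups."""
    attn_free = ffn_free = 0.0
    for _ in range(m):
        attn_free += t_attn                          # attention group takes the next micro-batch
        ffn_free = max(ffn_free, attn_free) + t_ffn  # FFN group starts once both are ready
    return ffn_free
```

With 4 micro-batches of unit cost per stage, the pipelined finish time is 5 units versus 8 for running the two stages back to back, which is the overlap that hides communication and keeps both GPU groups busy.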

Updated: 2025-04-07 07:42:30

Categories: cs.DC,cs.LG

Download: http://arxiv.org/abs/2504.02263v2

Attentional Graph Neural Network Is All You Need for Robust Massive Network Localization

In this paper, we design Graph Neural Networks (GNNs) with attention mechanisms to tackle an important yet challenging nonlinear regression problem: massive network localization. We first review our previous network localization method based on Graph Convolutional Network (GCN), which can exhibit state-of-the-art localization accuracy, even under severe Non-Line-of-Sight (NLOS) conditions, by carefully preselecting a constant threshold for determining adjacency. As an extension, we propose a specially designed Attentional GNN (AGNN) model to resolve the sensitive thresholding issue of the GCN-based method and enhance the underlying model capacity. The AGNN comprises an Adjacency Learning Module (ALM) and Multiple Graph Attention Layers (MGAL), employing distinct attention architectures to systematically address the demerits of the GCN-based method, rendering it more practical for real-world applications. Comprehensive analyses are conducted to explain the superior performance of these methods, including a theoretical analysis of the AGNN's dynamic attention property and computational complexity, along with a systematic discussion of their robust characteristic against NLOS measurements. Extensive experimental results demonstrate the effectiveness of the GCN-based and AGNN-based network localization methods. Notably, integrating attention mechanisms into the AGNN yields substantial improvements in localization accuracy, approaching the fundamental lower bound and showing approximately 37\% to 53\% reduction in localization error compared to the vanilla GCN-based method across various NLOS noise configurations. Both methods outperform all competing approaches by far in terms of localization accuracy, robustness, and computational time, especially for considerably large network sizes.
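The GCN variant's sensitivity comes from the single constant threshold used to decide adjacency. A minimal sketch of that preselection step (hypothetical helper; 2-D node positions assumed):

```python
import math

def threshold_adjacency(positions, thr):
    """Binary adjacency matrix: two nodes are connected iff their
    Euclidean distance is at most thr."""
    n = len(positions)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) <= thr:
                A[i][j] = A[j][i] = 1  # symmetric, unweighted edge
    return A
```

Because one constant thr must suit every node, the paper's AGNN replaces this hard cut with an Adjacency Learning Module and attention weights.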

Updated: 2025-04-07 07:39:56

Categories: cs.LG,eess.SP,stat.ML

Download: http://arxiv.org/abs/2311.16856v3

Enhancing Trust in AI Marketplaces: Evaluating On-Chain Verification of Personalized AI models using zk-SNARKs

The rapid advancement of artificial intelligence (AI) has brought about sophisticated models capable of various tasks ranging from image recognition to natural language processing. As these models continue to grow in complexity, ensuring their trustworthiness and transparency becomes critical, particularly in decentralized environments where traditional trust mechanisms are absent. This paper addresses the challenge of verifying personalized AI models in such environments, focusing on their integrity and privacy. We propose a novel framework that integrates zero-knowledge succinct non-interactive arguments of knowledge (zk-SNARKs) with Chainlink decentralized oracles to verify AI model performance claims on blockchain platforms. Our key contribution lies in integrating zk-SNARKs with Chainlink oracles to securely fetch and verify external data to enable trustless verification of AI models on a blockchain. Our approach addresses the limitations of using unverified external data for AI verification on the blockchain while preserving sensitive information of AI models and enhancing transparency. We demonstrate our methodology with a linear regression model predicting Bitcoin prices using on-chain data verified on the Sepolia testnet. Our results indicate the framework's efficacy, with key metrics including proof generation taking an average of 233.63 seconds and verification time of 61.50 seconds. This research paves the way for transparent and trustless verification processes in blockchain-enabled AI ecosystems, addressing key challenges such as model integrity and model privacy protection. The proposed framework, while exemplified with linear regression, is designed for broader applicability across more complex AI models, setting the stage for future advancements in transparent AI verification.

Updated: 2025-04-07 07:38:29

Categories: cs.CR,cs.DC

Download: http://arxiv.org/abs/2504.04794v1

Multimodal Agricultural Agent Architecture (MA3): A New Paradigm for Intelligent Agricultural Decision-Making

As a strategic pillar industry for human survival and development, modern agriculture faces dual challenges: optimizing production efficiency and achieving sustainable development. Against the backdrop of intensified climate change leading to frequent extreme weather events, the uncertainty risks in agricultural production systems are increasing exponentially. To address these challenges, this study proposes an innovative \textbf{M}ultimodal \textbf{A}gricultural \textbf{A}gent \textbf{A}rchitecture (\textbf{MA3}), which leverages cross-modal information fusion and task collaboration mechanisms to achieve intelligent agricultural decision-making. This study constructs a multimodal agricultural agent dataset encompassing five major tasks: classification, detection, Visual Question Answering (VQA), tool selection, and agent evaluation. We propose a unified backbone for sugarcane disease classification and detection tools, as well as a sugarcane disease expert model. By integrating an innovative tool selection module, we develop a multimodal agricultural agent capable of effectively performing tasks in classification, detection, and VQA. Furthermore, we introduce a multi-dimensional quantitative evaluation framework and conduct a comprehensive assessment of the entire architecture over our evaluation dataset, thereby verifying the practicality and robustness of MA3 in agricultural scenarios. This study provides new insights and methodologies for the development of agricultural agents, holding significant theoretical and practical implications. Our source code and dataset will be made publicly available upon acceptance.

Updated: 2025-04-07 07:32:41

Categories: cs.AI

Download: http://arxiv.org/abs/2504.04789v1

Dynamic Vision Mamba

Mamba-based vision models have gained extensive attention as a result of being computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, represented by token and block redundancy. For token redundancy, we analytically find that early token pruning methods will result in inconsistency between training and inference or introduce extra computation for inference. Therefore, we customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically based on an empirical observation that the inference speed of Mamba-based vision models is largely affected by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. We achieve a reduction of 35.2\% FLOPs with only a loss of accuracy of 1.7\% on Vim-S. It also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
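The token-pruning idea above, dropping low-importance tokens but restoring the survivors to their original order before the next Mamba block, can be sketched as follows (a generic illustration, not DyVM's actual implementation; per-token importance scores are assumed given):

```python
def prune_and_rearrange(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, then restore original order
    so the pruned sequence stays consistent for the next order-sensitive block."""
    k = max(1, int(len(tokens) * keep_ratio))
    # indices of the k highest-scoring tokens
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # re-sort indices so kept tokens appear in their original sequence order
    return [tokens[i] for i in sorted(top)]
```

The final `sorted(top)` step is the rearrangement: selection is by importance, but the sequence fed onward preserves positional order, avoiding the train/inference inconsistency noted in the abstract.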

Updated: 2025-04-07 07:31:28

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04787v1

Transfer learning from first-principles calculations to experiments with chemistry-informed domain transformation

Simulation-to-Real (Sim2Real) transfer learning, the machine learning technique that efficiently solves a real-world task by leveraging knowledge from computational data, has received increasing attention in materials science as a promising solution to the scarcity of experimental data. We proposed an efficient transfer learning scheme from first-principles calculations to experiments based on the chemistry-informed domain transformation, that integrates the heterogeneous source and target domains by harnessing the underlying physics and chemistry. The proposed method maps the computational data from the simulation space (source domain) into the space of experimental data (target domain). During this process, these qualitatively different domains are efficiently integrated by a couple of prior knowledge of chemistry, (1) the statistical ensemble, and (2) the relationship between source and target quantities. As a proof-of-concept, we predict the catalyst activity for the reverse water-gas shift reaction by using the abundant first-principles data in addition to the experimental data. Through the demonstration, we confirmed that the transfer learning model exhibits positive transfer in accuracy and data efficiency. In particular, a significantly high accuracy was achieved despite using a few (less than ten) target data in domain transformation, whose accuracy is one order of magnitude smaller than that of a full scratch model trained with over 100 target data. This result indicates that the proposed method leverages the high prediction performance with few target data, which helps to save the number of trials in real laboratories.
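One ingredient of such a domain transformation, a prior relationship between source and target quantities, can be illustrated schematically (this is not the paper's actual mapping) by calibrating a linear simulation-to-experiment map from a handful of paired measurements:

```python
def fit_linear_map(y_sim, y_exp):
    """Closed-form least squares for y_exp ~= a * y_sim + b, usable even
    with very few paired (simulation, experiment) data points."""
    n = len(y_sim)
    mx = sum(y_sim) / n
    my = sum(y_exp) / n
    a = sum((x - mx) * (y - my) for x, y in zip(y_sim, y_exp)) \
        / sum((x - mx) ** 2 for x in y_sim)
    b = my - a * mx
    return a, b
```

Because only two parameters are fitted, a few target measurements suffice, which mirrors the paper's observation that fewer than ten experimental points can anchor the transformation.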

Updated: 2025-04-07 07:29:31

Categories: physics.chem-ph,cond-mat.mtrl-sci,cs.LG,physics.comp-ph,92E99,I.2.1; J.2

Download: http://arxiv.org/abs/2504.02848v2

Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors

Efficiently leveraging the capabilities of contemporary large language models (LLMs) is increasingly challenging, particularly when direct fine-tuning is expensive and often impractical. Existing training-free methods, including manually or automatically designed workflows, typically demand substantial human effort or yield suboptimal results. This paper proposes Weak-for-Strong Harnessing (W4S), a novel framework that customizes smaller, cost-efficient language models to design and optimize workflows for harnessing stronger models. W4S formulates workflow design as a multi-turn Markov decision process and introduces reinforcement learning for agentic workflow optimization (RLAO) to train a weak meta-agent. Through iterative interaction with the environment, the meta-agent learns to design increasingly effective workflows without manual intervention. Empirical results demonstrate the superiority of W4S: our 7B meta-agent, trained with just one GPU hour, outperforms the strongest baseline by 2.9% ~ 24.6% across eleven benchmarks, successfully elevating the performance of state-of-the-art models such as GPT-3.5-Turbo and GPT-4o. Notably, W4S exhibits strong generalization capabilities across both seen and unseen tasks, offering an efficient, high-performing alternative to directly fine-tuning strong models.

Updated: 2025-04-07 07:27:31

Categories: cs.AI

Download: http://arxiv.org/abs/2504.04785v1

Playing Non-Embedded Card-Based Games with Reinforcement Learning

Significant progress has been made in AI for games, including board games, MOBA, and RTS games. However, complex agents are typically developed in an embedded manner, directly accessing game state information, unlike human players who rely on noisy visual data, leading to unfair competition. Developing complex non-embedded agents remains challenging, especially in card-based RTS games with complex features and large state spaces. We propose a non-embedded offline reinforcement learning training strategy using visual inputs to achieve real-time autonomous gameplay in the RTS game Clash Royale. Due to the lack of an object detection dataset for this game, we designed an efficient generative object detection dataset for training. We extract features using state-of-the-art object detection and optical character recognition models. Our method enables real-time image acquisition, perception feature fusion, decision-making, and control on mobile devices, successfully defeating built-in AI opponents. All code is open-sourced at https://github.com/wty-yy/katacr.

Updated: 2025-04-07 07:26:02

Categories: cs.LG

Download: http://arxiv.org/abs/2504.04783v1

Safety Layers in Aligned Large Language Models: The Key to LLM Security

Aligned LLMs are secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining such security is not yet well understood; furthermore, these models can be vulnerable to security degradation when subjected to fine-tuning attacks. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as "safety layers". We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation. Our experiments demonstrate that the proposed approach can significantly preserve LLM security while maintaining performance and reducing computational resources compared to full fine-tuning.
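The core of SPPFT, keeping the identified safety layers fixed while the rest of the network fine-tunes, amounts to masking their gradients. A framework-agnostic sketch under assumed conventions (flat parameter dict, names like "12.weight" encoding the layer index, plain SGD):

```python
def sppft_step(params, grads, safety_layers, lr=0.01):
    """One SGD step in which parameters belonging to the safety layers
    receive a zeroed gradient and therefore stay fixed."""
    updated = {}
    for name, w in params.items():
        layer = int(name.split(".")[0])  # assumes names like "12.weight"
        g = 0.0 if layer in safety_layers else grads[name]
        updated[name] = w - lr * g
    return updated
```

In a real training loop the same effect is usually achieved by zeroing (or detaching) the safety layers' gradients before the optimizer step, so only the remaining layers move.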

Updated: 2025-04-07 07:23:33

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2408.17003v5

A Zero-shot Learning Method Based on Large Language Models for Multi-modal Knowledge Graph Embedding

Zero-shot learning (ZL) is crucial for tasks involving unseen categories, such as natural language processing, image classification, and cross-lingual transfer. Current applications often fail to accurately infer and handle new relations or entities involving unseen categories, severely limiting their scalability and practicality in open-domain scenarios. ZL learning faces the challenge of effectively transferring semantic information of unseen categories in multi-modal knowledge graph (MMKG) embedding representation learning. In this paper, we propose ZSLLM, a framework for zero-shot embedding learning of MMKGs using large language models (LLMs). We leverage textual modality information of unseen categories as prompts to fully utilize the reasoning capabilities of LLMs, enabling semantic information transfer across different modalities for unseen categories. Through model-based learning, the embedding representation of unseen categories in MMKG is enhanced. Extensive experiments conducted on multiple real-world datasets demonstrate the superiority of our approach compared to state-of-the-art methods.

Updated: 2025-04-07 07:22:25

Categories: cs.AI

Download: http://arxiv.org/abs/2503.07202v2

Automatic Parameter Selection for Non-Redundant Clustering

High-dimensional datasets often contain multiple meaningful clusterings in different subspaces. For example, objects can be clustered either by color, weight, or size, revealing different interpretations of the given dataset. A variety of approaches are able to identify such non-redundant clusterings. However, most of these methods require the user to specify the expected number of subspaces and clusters for each subspace. Stating these values is a non-trivial problem and usually requires detailed knowledge of the input dataset. In this paper, we propose a framework that utilizes the Minimum Description Length Principle (MDL) to detect the number of subspaces and clusters per subspace automatically. We describe an efficient procedure that greedily searches the parameter space by splitting and merging subspaces and clusters within subspaces. Additionally, an encoding strategy is introduced that allows us to detect outliers in each subspace. Extensive experiments show that our approach is highly competitive to state-of-the-art methods.

Updated: 2025-04-07 07:13:36

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2312.11952v2

Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding

Real-time scene comprehension is a key advance in artificial intelligence, enhancing robotics, surveillance, and assistive tools. However, hallucination remains a challenge. AI systems often misinterpret visual inputs, detecting nonexistent objects or describing events that never happened. These errors, far from minor, threaten reliability in critical areas like security and autonomous navigation where accuracy is essential. Our approach tackles this by embedding self-awareness into the AI. Instead of trusting initial outputs, our framework continuously assesses them in real time, adjusting confidence thresholds dynamically. When certainty falls below a solid benchmark, it suppresses unreliable claims. Combining YOLOv5's object detection strength with VILA1.5-3B's controlled language generation, we tie descriptions to confirmed visual data. Strengths include dynamic threshold tuning for better accuracy, evidence-based text to reduce hallucination, and real-time performance at 18 frames per second. This feedback-driven design cuts hallucination by 37 percent over traditional methods. Fast, flexible, and reliable, it excels in applications from robotic navigation to security monitoring, aligning AI perception with reality.
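The suppression mechanism described above (trust a claim only while its confidence clears an adaptively tuned bar) can be sketched as follows; the threshold bounds, step size, and target rate are illustrative values, not the paper's:

```python
def update_threshold(thr, hallucination_rate, target=0.1, step=0.05, lo=0.3, hi=0.9):
    """Raise the confidence bar when recent hallucination feedback exceeds
    the target rate, relax it otherwise; clamp to [lo, hi]."""
    thr += step if hallucination_rate > target else -step
    return min(hi, max(lo, thr))

def filter_claims(claims, thr):
    """Suppress any (label, confidence) claim whose confidence is below the bar."""
    return [c for c in claims if c[1] >= thr]
```

Run each frame, this feedback loop keeps descriptions tied to detections the system is currently confident in, rather than trusting initial outputs unconditionally.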

Updated: 2025-04-07 06:59:30

Categories: cs.LG

Download: http://arxiv.org/abs/2504.04772v1

Demystifying Issues, Causes and Solutions in LLM Open-Source Projects

With the advancements of Large Language Models (LLMs), an increasing number of open-source software projects are using LLMs as their core functional component. Although research and practice on LLMs are capturing considerable interest, no dedicated studies explored the challenges faced by practitioners of LLM open-source projects, the causes of these challenges, and potential solutions. To fill this research gap, we conducted an empirical study to understand the issues that practitioners encounter when developing and using LLM open-source software, the possible causes of these issues, and potential solutions. We collected all closed issues from 15 LLM open-source projects and labelled issues that met our requirements. We then randomly selected 994 issues from the labelled issues as the sample for data extraction and analysis to understand the prevalent issues, their underlying causes, and potential solutions. Our study results show that (1) Model Issue is the most common issue faced by practitioners, (2) Model Problem, Configuration and Connection Problem, and Feature and Method Problem are identified as the most frequent causes of the issues, and (3) Optimize Model is the predominant solution to the issues. Based on the study results, we provide implications for practitioners and researchers of LLM open-source projects.

Updated: 2025-04-07 06:57:06

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2409.16559v2

HDVIO2.0: Wind and Disturbance Estimation with Hybrid Dynamics VIO

Visual-inertial odometry (VIO) is widely used for state estimation in autonomous micro aerial vehicles using onboard sensors. Current methods improve VIO by incorporating a model of the translational vehicle dynamics, yet their performance degrades when faced with low-accuracy vehicle models or continuous external disturbances, like wind. Additionally, incorporating rotational dynamics in these models is computationally intractable when they are deployed in online applications, e.g., in a closed-loop control system. We present HDVIO2.0, which models full 6-DoF, translational and rotational, vehicle dynamics and tightly incorporates them into a VIO with minimal impact on the runtime. HDVIO2.0 builds upon the previous work, HDVIO, and addresses these challenges through a hybrid dynamics model combining a point-mass vehicle model with a learning-based component, with access to control commands and IMU history, to capture complex aerodynamic effects. The key idea behind modeling the rotational dynamics is to represent them with continuous-time functions. HDVIO2.0 leverages the divergence between the actual motion and the predicted motion from the hybrid dynamics model to estimate external forces as well as the robot state. Our system surpasses the performance of state-of-the-art methods in experiments using public and new drone dynamics datasets, as well as real-world flights in winds up to 25 km/h. Unlike existing approaches, we also show that accurate vehicle dynamics predictions are achievable without precise knowledge of the full vehicle state.
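The core idea of attributing the model/measurement divergence to external disturbances can be sketched numerically. This is a hypothetical simplification: a bare point-mass model replaces HDVIO2.0's hybrid (model + learned) dynamics, and all names and numbers are illustrative.

```python
# Illustrative residual-based external force estimation (not the paper's code).
# A point-mass model predicts acceleration from thrust and gravity; the gap
# between measured and predicted acceleration is attributed to external
# disturbances such as wind: f_ext = m * (a_meas - a_pred).

G = (0.0, 0.0, -9.81)  # gravity, world frame (m/s^2)

def predicted_accel(thrust, mass):
    """Point-mass model: a = thrust/m + g (thrust expressed in world frame)."""
    return tuple(t / mass + g for t, g in zip(thrust, G))

def external_force(a_meas, thrust, mass):
    """Attribute the acceleration residual to an external force."""
    a_pred = predicted_accel(thrust, mass)
    return tuple(mass * (am - ap) for am, ap in zip(a_meas, a_pred))

# Hovering drone (mass 1 kg) pushed sideways by wind: thrust cancels gravity,
# but the IMU measures a lateral acceleration of 0.5 m/s^2.
f = external_force(a_meas=(0.5, 0.0, 0.0), thrust=(0.0, 0.0, 9.81), mass=1.0)
```

In the actual system the learned component corrects the nominal model using control commands and IMU history before the residual is computed, so only genuinely external effects remain in the estimate.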

Updated: 2025-04-07 06:48:15

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2504.00969v2

Bidirectional Hierarchical Protein Multi-Modal Representation Learning

Protein representation learning is critical for numerous biological tasks. Recently, large transformer-based protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. However, pLMs lack structural information. Conversely, graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks, but their effectiveness is often constrained by the scarcity of labeled structural data. Recognizing that sequence and structural representations are complementary perspectives of the same protein entity, we propose a multimodal bidirectional hierarchical fusion framework to effectively merge these modalities. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs-generated sequential representations and GNN-extracted structural features, improving information exchange and enhancement across layers of the neural network. Based on the framework, we further introduce local Bi-Hierarchical Fusion with gating and global Bi-Hierarchical Fusion with multihead self-attention approaches. Through extensive experiments on a diverse set of protein-related tasks, our method demonstrates consistent improvements over strong baselines and existing fusion techniques in a variety of protein representation learning benchmarks, including react (enzyme/EC classification), model quality assessment (MQA), protein-ligand binding affinity prediction (LBA), protein-protein binding site prediction (PPBS), and B cell epitopes prediction (BCEs). Our method establishes a new state-of-the-art for multimodal protein representation learning, emphasizing the efficacy of BIHIERARCHICAL FUSION in bridging sequence and structural modalities.
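A single gating step of the kind described above can be sketched in a few lines. This is a minimal assumption-laden toy: shapes, the random gate projection `W`, and the one-shot gate are illustrative; the paper's bi-hierarchical fusion operates across multiple layers in both directions.

```python
import numpy as np

# Sketch of gated fusion of sequence (pLM) and structure (GNN) features.
# Per feature, a sigmoid gate decides how much to trust the sequence
# representation versus the structural one.

rng = np.random.default_rng(0)
d = 8                                   # feature dimension per residue
h_seq = rng.normal(size=(5, d))         # 5 residues, pLM embeddings
h_struct = rng.normal(size=(5, d))      # matching GNN structural features
W = rng.normal(size=(2 * d, d)) * 0.1   # gate projection (hypothetical)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

gate = sigmoid(np.concatenate([h_seq, h_struct], axis=-1) @ W)
h_fused = gate * h_seq + (1.0 - gate) * h_struct   # convex combination
```

Because the gate lies in (0, 1), the fused feature is an elementwise convex combination of the two modalities, so neither view can be entirely discarded.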

Updated: 2025-04-07 06:47:49

Categories: cs.LG,cs.AI,q-bio.MN

Download: http://arxiv.org/abs/2504.04770v1

KunPeng: A Global Ocean Environmental Model

Inspired by the similarity of the atmosphere-ocean physical coupling mechanism, this study innovatively migrates meteorological large-model techniques to the ocean domain, constructing the KunPeng global ocean environmental prediction model. To address the discontinuous characteristics of marine space, we propose a terrain-adaptive mask constraint mechanism that effectively mitigates training divergence caused by abrupt gradients at land-sea boundaries. To fully integrate far-, medium-, and close-range marine features, a longitude-cyclic deformable convolution network (LC-DCN) is employed to enhance the dynamic receptive field, achieving refined modeling of multi-scale oceanic characteristics. A Deformable Convolution-enhanced Multi-Step Prediction module (DC-MTP) is employed to strengthen temporal dependency feature extraction capabilities. Experimental results demonstrate that this model achieves an average ACC of 0.80 in 15-day global predictions at 0.25$^\circ$ resolution, outperforming comparative models by 0.01-0.08. The average mean squared error (MSE) is 0.41 (a 5%-31% reduction) and the average mean absolute error (MAE) is 0.44 (a 0.6%-21% reduction) compared to other models. Significant improvements are particularly observed in sea surface parameter prediction, deep-sea region characterization, and current velocity field forecasting. Through a side-by-side comparison of the applicability of operators at different scales in the marine domain, this study reveals that local operators significantly outperform global operators for slow-varying oceanic processes, demonstrating the effectiveness of dynamic feature pyramid representations in predicting marine physical parameters.
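The masking idea behind the terrain-adaptive constraint can be sketched as a masked loss: land cells are simply excluded, so abrupt coastline gradients cannot dominate training. The exact mechanism in KunPeng may differ; this only illustrates the masked-MSE principle with made-up values.

```python
import numpy as np

# Sketch of a land-sea mask constraint on the training loss: only sea cells
# (sea_mask == 1) contribute, so spurious errors on land are ignored.

def masked_mse(pred, target, sea_mask):
    """MSE computed only over sea cells."""
    m = sea_mask.astype(bool)
    diff = pred[m] - target[m]
    return float(np.mean(diff ** 2))

pred = np.array([[1.0, 2.0], [3.0, 100.0]])
target = np.array([[1.5, 2.0], [3.0, 0.0]])
sea = np.array([[1, 1], [1, 0]])            # bottom-right cell is land
loss = masked_mse(pred, target, sea)        # land error (100 vs 0) is ignored
```

Without the mask, the single land cell would contribute an error of 100^2 and swamp the gradient; with it, the loss reflects sea cells only.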

Updated: 2025-04-07 06:41:05

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.04766v1

Combining Threat Intelligence with IoT Scanning to Predict Cyber Attack

While the Web has become a global platform for communication, malicious actors, including hackers and hacktivist groups, often disseminate ideological content and coordinate activities through the "Dark Web", an obscure counterpart of the conventional web. Presently, challenges such as information overload and the fragmented nature of cyber threat data impede comprehensive profiling of these actors, thereby limiting the efficacy of predictive analyses of their online activities. Concurrently, the number of internet-connected devices has surpassed the global human population, with this disparity projected to widen as the Internet of Things (IoT) expands. Technical communities are actively advancing IoT-related research to address its growing societal integration. This paper proposes a novel predictive threat intelligence framework designed to systematically collect, analyze, and visualize Dark Web data to identify malicious websites and correlate this information with potential IoT vulnerabilities. The methodology integrates automated data harvesting, analytical techniques, and visual mapping tools, while also examining vulnerabilities in IoT devices to assess exploitability. By bridging gaps in cybersecurity research, this study aims to enhance predictive threat modeling and inform policy development, thereby contributing to intelligence research initiatives focused on mitigating cyber risks in an increasingly interconnected digital ecosystem.

Updated: 2025-04-07 06:33:58

Categories: cs.CR,cs.AI,cs.CY,cs.NI

Download: http://arxiv.org/abs/2411.17931v3

Enhancing Leaf Disease Classification Using GAT-GCN Hybrid Model

Agriculture plays a critical role in the global economy, providing livelihoods and ensuring food security for billions. As innovative agricultural practices become more widespread, the risk of crop diseases has increased, highlighting the urgent need for efficient, low-intervention disease identification methods. This research presents a hybrid model combining Graph Attention Networks (GATs) and Graph Convolution Networks (GCNs) for leaf disease classification. GCNs have been widely used for learning from graph-structured data, and GATs enhance this by incorporating attention mechanisms to focus on the most important neighbors. The methodology integrates superpixel segmentation for efficient feature extraction, partitioning images into meaningful, homogeneous regions that better capture localized features. The authors have employed an edge augmentation technique to enhance the robustness of the model, introducing a significant degree of generalization into its detection capabilities. To further optimize training, weight initialization techniques are applied. The hybrid model is evaluated against the individual GCN and GAT models and achieved a precision of 0.9822, recall of 0.9818, and F1-score of 0.9818 in apple leaf disease classification; a precision of 0.9746, recall of 0.9744, and F1-score of 0.9743 in potato leaf disease classification; and a precision of 0.8801, recall of 0.8801, and F1-score of 0.8799 in sugarcane leaf disease classification. These results demonstrate the robustness and performance of the model, suggesting its potential to support sustainable agricultural practices through precise and effective disease detection. This work is a small step towards reducing the loss of crops and hence supporting the sustainable goals of zero hunger and life on land.

Updated: 2025-04-07 06:31:38

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04764v1

mixEEG: Enhancing EEG Federated Learning for Cross-subject EEG Classification with Tailored mixup

The cross-subject electroencephalography (EEG) classification exhibits great challenges due to the diversity of cognitive processes and physiological structures between different subjects. Modern EEG models are based on neural networks, demanding a large amount of data to achieve high performance and generalizability. However, privacy concerns associated with EEG pose significant limitations to data sharing between different hospitals and institutions, resulting in the lack of large dataset for most EEG tasks. Federated learning (FL) enables multiple decentralized clients to collaboratively train a global model without direct communication of raw data, thus preserving privacy. For the first time, we investigate the cross-subject EEG classification in the FL setting. In this paper, we propose a simple yet effective framework termed mixEEG. Specifically, we tailor the vanilla mixup considering the unique properties of the EEG modality. mixEEG shares the unlabeled averaged data of the unseen subject rather than simply sharing raw data under the domain adaptation setting, thus better preserving privacy and offering an averaged label as pseudo-label. Extensive experiments are conducted on an epilepsy detection and an emotion recognition dataset. The experimental result demonstrates that our mixEEG enhances the transferability of global model for cross-subject EEG classification consistently across different datasets and model architectures. Code is published at: https://github.com/XuanhaoLiu/mixEEG.
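The "share averaged data with an averaged pseudo-label" idea can be sketched as below. This is a deliberately simplified stand-in: trial shapes and two-class one-hot labels are made up, and mixEEG's tailored mixup for the EEG modality involves more than plain averaging.

```python
import numpy as np

# Sketch of sharing averaged EEG data instead of raw trials: the mean of a
# subject's trials is shared (better privacy than raw data), and the mean of
# the soft labels serves as a pseudo-label for the shared sample.

def average_share(trials, labels):
    """Average trials of an unseen subject and their soft labels."""
    x_shared = np.mean(trials, axis=0)   # averaged signal, same (ch, time) shape
    y_pseudo = np.mean(labels, axis=0)   # averaged (soft) pseudo-label
    return x_shared, y_pseudo

# Two toy trials of shape (4 channels, 16 samples) with one-hot labels.
trials = np.stack([np.ones((4, 16)), 3 * np.ones((4, 16))])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
x_shared, y_pseudo = average_share(trials, labels)
```

Averaging is the simplest instance of mixup (equal mixing weights across all trials); the shared sample no longer equals any individual raw recording, which is the privacy point the abstract makes.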

Updated: 2025-04-07 06:24:23

Categories: eess.SP,cs.LG

Download: http://arxiv.org/abs/2504.07987v1

Learnable Sparse Customization in Heterogeneous Edge Computing

To effectively manage and utilize massive distributed data at the network edge, Federated Learning (FL) has emerged as a promising edge computing paradigm across data silos. However, FL still faces two challenges: system heterogeneity (i.e., the diversity of hardware resources across edge devices) and statistical heterogeneity (i.e., non-IID data). Although sparsification can extract diverse submodels for diverse clients, most sparse FL works either simply assign submodels with artificially-given rigid rules or prune partial parameters using heuristic strategies, resulting in inflexible sparsification and poor performance. In this work, we propose Learnable Personalized Sparsification for heterogeneous Federated learning (FedLPS), which achieves the learnable customization of heterogeneous sparse models with importance-associated patterns and adaptive ratios to simultaneously tackle system and statistical heterogeneity. Specifically, FedLPS learns the importance of model units on local data representation and further derives an importance-based sparse pattern with minimal heuristics to accurately extract personalized data features in non-IID settings. Furthermore, Prompt Upper Confidence Bound Variance (P-UCBV) is designed to adaptively determine sparse ratios by learning the superimposed effect of diverse device capabilities and non-IID data, aiming at resource self-adaptation with promising accuracy. Extensive experiments show that FedLPS outperforms status quo approaches in accuracy and training costs, improving accuracy by 1.28%-59.34% while reducing running time by more than 68.80%.
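The "importance-based sparse pattern" step can be sketched as a top-k mask over unit importance scores. The scores here are hard-coded stand-ins and the rounding rule is an assumption; FedLPS learns importance from local data representations rather than taking it as given.

```python
import numpy as np

# Sketch of extracting an importance-based sparse submodel: keep only the
# fraction (1 - sparsity) of units ranked highest by an importance score.

def importance_mask(scores, sparsity):
    """Binary mask keeping the (1 - sparsity) most important units."""
    k = max(1, int(round((1.0 - sparsity) * scores.size)))
    keep = np.argsort(scores)[-k:]          # indices of the top-k scores
    mask = np.zeros_like(scores, dtype=bool)
    mask[keep] = True
    return mask

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.2])   # hypothetical unit importances
mask = importance_mask(scores, sparsity=0.6)   # keep 40% -> 2 of 5 units
```

A client on weak hardware would be given a higher `sparsity` (and thus a smaller submodel), which is what the adaptive-ratio component decides per device.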

Updated: 2025-04-07 06:21:16

Categories: cs.DC,cs.LG

Download: http://arxiv.org/abs/2412.07216v3

Hypergraph $p$-Laplacian equations for data interpolation and semi-supervised learning

Hypergraph learning with $p$-Laplacian regularization has attracted a lot of attention due to its flexibility in modeling higher-order relationships in data. This paper focuses on its fast numerical implementation, which is challenging due to the non-differentiability of the objective function and the non-uniqueness of the minimizer. We derive a hypergraph $p$-Laplacian equation from the subdifferential of the $p$-Laplacian regularization. A simplified equation that is mathematically well-posed and computationally efficient is proposed as an alternative. Numerical experiments verify that the simplified $p$-Laplacian equation suppresses spiky solutions in data interpolation and improves classification accuracy in semi-supervised learning. The remarkably low computational cost enables further applications.
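For concreteness, one common formulation of the regularizer from the literature is sketched below; the paper's exact definition and the derived equation may differ.

```latex
% One widely used hypergraph p-Laplacian regularizer: for a hypergraph with
% hyperedge set E, edge weights w_e, and a node function u,
R_p(u) \;=\; \sum_{e \in E} w_e \, \max_{i,j \in e} \lvert u_i - u_j \rvert^{p},
\qquad p \ge 1 .
% Minimizers of R_p subject to the label constraints are characterized via
% the subdifferential: 0 \in \partial R_p(u) on unlabeled nodes, with u
% fixed to the given values on labeled nodes -- this inclusion is the
% "hypergraph p-Laplacian equation" the abstract refers to.
```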

Updated: 2025-04-07 06:20:47

Categories: math.NA,cs.LG,cs.NA,35R02, 65D05

Download: http://arxiv.org/abs/2411.12601v2

Large-Scale Targeted Cause Discovery with Data-Driven Learning

We propose a novel machine learning approach for inferring causal variables of a target variable from observations. Our focus is on directly inferring a set of causal factors without requiring full causal graph reconstruction, which is computationally challenging in large-scale systems. The identified causal set consists of all potential regulators of the target variable under experimental settings, enabling efficient regulation when intervention costs and feasibility vary across variables. To achieve this, we train a neural network using supervised learning on simulated data to infer causality. By employing a local-inference strategy, our approach scales linearly with the number of variables, efficiently handling thousands of variables. Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks, outperforming existing methods that emphasize full-graph discovery. We validate our model's generalization capability across out-of-distribution graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line. Implementation codes are available at https://github.com/snu-mllab/Targeted-Cause-Discovery.

Updated: 2025-04-07 06:11:00

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2408.16218v2

Continuous Locomotive Crowd Behavior Generation

Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. We demonstrate that our approach effectively models diverse crowd behavior patterns and generalizes well across different geographical environments. Code is publicly available at https://github.com/InhwanBae/CrowdES .
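The Markov-chain behavior augmentation mentioned above can be sketched with a toy transition table. The states ("walk", "stop", "run") and probabilities are invented for illustration; CrowdES uses its own learned/configured behavior chain.

```python
import random

# Sketch of Markov-chain behavior augmentation: an agent's action evolves by
# sampling the next state from a per-state transition distribution.

TRANSITIONS = {
    "walk": {"walk": 0.8, "stop": 0.15, "run": 0.05},
    "stop": {"walk": 0.6, "stop": 0.4,  "run": 0.0},
    "run":  {"walk": 0.3, "stop": 0.0,  "run": 0.7},
}

def next_state(state, rng):
    states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def rollout(start, steps, seed=0):
    """Sample a behavior sequence of `steps` transitions from `start`."""
    rng = random.Random(seed)
    seq, s = [start], start
    for _ in range(steps):
        s = next_state(s, rng)
        seq.append(s)
    return seq

seq = rollout("walk", steps=10)
```

Because each agent draws its own chain, a population of agents naturally exhibits heterogeneous long-term behavior even from identical starting conditions.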

Updated: 2025-04-07 06:08:59

Categories: cs.CV,cs.LG,cs.RO

Download: http://arxiv.org/abs/2504.04756v1

Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches

Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show that the diffusion-based approach provides more stable results and is less sensitive to data availability, while the adversarial approach is superior at estimating more pronounced distortion effects. Our findings contribute to robust unsupervised blind estimation of audio effects, demonstrating the potential of diffusion models for system identification in music technology.

Updated: 2025-04-07 05:56:51

Categories: eess.AS,cs.AI

Download: http://arxiv.org/abs/2504.04751v1

Vision Transformers with Autoencoders and Explainable AI for Cancer Patient Risk Stratification Using Whole Slide Imaging

Cancer remains one of the leading causes of mortality worldwide, necessitating accurate diagnosis and prognosis. Whole Slide Imaging (WSI) has become an integral part of clinical workflows with advancements in digital pathology. While various studies have utilized WSIs, their extracted features may not fully capture the most relevant pathological information, and their lack of interpretability limits clinical adoption. In this paper, we propose PATH-X, a framework that integrates Vision Transformers (ViT) and Autoencoders with SHAP (Shapley Additive Explanations) to enhance model explainability for patient stratification and risk prediction using WSIs from The Cancer Genome Atlas (TCGA). A representative image slice is selected from each WSI, and numerical feature embeddings are extracted using Google's pre-trained ViT. These features are then compressed via an autoencoder and used for unsupervised clustering and classification tasks. Kaplan-Meier survival analysis is applied to evaluate stratification into two and three risk groups. SHAP is used to identify key contributing features, which are mapped onto histopathological slices to provide spatial context. PATH-X demonstrates strong performance in breast and glioma cancers, where a sufficient number of WSIs enabled robust stratification. However, performance in lung cancer was limited due to data availability, emphasizing the need for larger datasets to enhance model reliability and clinical applicability.

Updated: 2025-04-07 05:48:42

Categories: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.04749v1

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

Grounding 3D object affordance is a task that locates objects in 3D space where they can be manipulated, which links perception and action for embodied intelligence. For example, for an intelligent robot, it is necessary to accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions, which is inspired by cognitive science. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, due to observation orientation, object rotation, or spatial occlusion, we can only get a partial observation of the object. So this dataset includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings. Our project is available at https://sites.google.com/view/lmaffordance3d.

Updated: 2025-04-07 05:38:23

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2504.04744v1

Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data

Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state-of-the-art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like "dog chasing cat" vs. "cat chasing dog". While MLLMs have made significant progress on Winoground, a benchmark for measuring such reasoning, they are still far from human performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs' compositional reasoning capabilities, as seen through significant improvements across multiple vision-language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, the SCRAMBLe-tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at https://github.com/samarth4149/SCRAMBLe.
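Training a model to prefer the correct caption over a close-but-wrong one is typically done with a Bradley-Terry-style loss on a score margin; the sketch below shows that generic loss, not necessarily SCRAMBLe's exact preference-tuning objective, and the caption scores are hypothetical log-likelihoods.

```python
import math

# Sketch of binary preference learning on caption pairs: the loss is small
# when the chosen (correct) caption is scored above the rejected one, and
# grows when the ranking is inverted.

def preference_loss(score_chosen, score_rejected, beta=1.0):
    """-log sigmoid(beta * (s_chosen - s_rejected))."""
    margin = beta * (score_chosen - score_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# "dog chasing cat" image: correct caption should out-score the swapped one.
good = preference_loss(score_chosen=-1.0, score_rejected=-4.0)  # correct ranked higher
bad = preference_loss(score_chosen=-4.0, score_rejected=-1.0)   # ranking inverted
```

Minimizing this loss over many automatically generated (image, correct caption, perturbed caption) triples is what pushes the model toward compositional distinctions.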

Updated: 2025-04-07 05:35:34

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04740v1

MedGNN: Capturing the Links Between Urban Characteristics and Medical Prescriptions

Understanding how urban socio-demographic and environmental factors relate with health is essential for public health and urban planning. However, traditional statistical methods struggle with nonlinear effects, while machine learning models often fail to capture geographical (nearby areas being more similar) and topological (unequal connectivity between places) effects in an interpretable way. To address this, we propose MedGNN, a spatio-topologically explicit framework that constructs a 2-hop spatial graph, integrating positional and locational node embeddings with urban characteristics in a graph neural network. Applied to MEDSAT, a comprehensive dataset covering over 150 environmental and socio-demographic factors and six prescription outcomes (depression, anxiety, diabetes, hypertension, asthma, and opioids) across 4,835 Greater London neighborhoods, MedGNN improved predictions by over 25% on average compared to baseline methods. Using depression prescriptions as a case study, we analyzed graph embeddings via geographical principal component analysis, identifying findings that: align with prior research (e.g., higher antidepressant prescriptions among older and White populations), contribute to ongoing debates (e.g., greenery linked to higher and NO2 to lower prescriptions), and warrant further study (e.g., canopy evaporation correlated with fewer prescriptions). These results demonstrate MedGNN's potential, and more broadly, of carefully applied machine learning, to advance transdisciplinary public health research.
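Constructing the 2-hop spatial graph can be sketched as: connect each area to its nearest neighbors, then add neighbors-of-neighbors. The coordinates and `k` below are toy values, and the nearest-neighbor rule is an assumed construction; MedGNN builds its graph over 4,835 Greater London neighborhoods.

```python
import numpy as np

# Sketch of a 2-hop spatial graph from point coordinates: a1 is the
# symmetrized k-nearest-neighbor adjacency, a2 additionally connects
# nodes reachable in two hops (capturing "nearby areas are similar").

def two_hop_adjacency(coords, k=2):
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # no self-edges in the kNN step
    a1 = np.zeros((n, n), dtype=bool)
    for i in range(n):
        a1[i, np.argsort(d[i])[:k]] = True   # k nearest neighbors (1-hop)
    a1 |= a1.T                               # symmetrize
    a2 = ((a1.astype(int) @ a1.astype(int)) > 0) | a1  # add 2-hop links
    np.fill_diagonal(a2, False)              # drop self-loops
    return a1, a2

# Four areas on a line; the last one is far away.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
a1, a2 = two_hop_adjacency(coords, k=1)
```

The 2-hop matrix lets the GNN see beyond immediate neighbors without the cost of a dense graph, which is how the geographical smoothness effect enters the model.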

Updated: 2025-04-07 05:35:16

Categories: cs.LG,cs.CY,I.2.6; J.3; H.2.8

Download: http://arxiv.org/abs/2504.04739v1

Online POMDP Planning with Anytime Deterministic Optimality Guarantees

Decision-making under uncertainty is a critical aspect of many practical autonomous systems due to incomplete information. Partially Observable Markov Decision Processes (POMDPs) offer a mathematically principled framework for formulating decision-making problems under such conditions. However, finding an optimal solution for a POMDP is generally intractable. In recent years, online tree search solvers have made significant progress in scaling approximate solutions from small to moderately sized problems. Often, such approximate solvers are limited to probabilistic or asymptotic guarantees toward the optimal solution. In this paper, we derive a deterministic relationship for discrete POMDPs between an approximated and the optimal solution. We show that, at any time, we can derive bounds that relate the existing solution to the optimal one. We show that our derivations provide an avenue for a new set of algorithms and can be attached to existing algorithms that have a certain structure, providing them with deterministic guarantees at marginal computational overhead. In return, not only do we certify the solution quality, but we demonstrate that making a decision based on the deterministic guarantee may result in superior performance compared to the original algorithm without the deterministic certification.

Updated: 2025-04-07 05:29:57

Categories: cs.AI,cs.RO

Download: http://arxiv.org/abs/2310.01791v4

TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context

In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms "Tathya" (fact) and "Nyaya" (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.

Updated: 2025-04-07 05:27:32

Categories: cs.CL,cs.AI,cs.IR,cs.LG

Download: http://arxiv.org/abs/2504.04737v1

Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
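
The step-wise decomposition can be sketched as follows (a toy version; real SWiRL trajectories contain generated reasoning text and tool calls, and the step strings below are invented):

```python
def decompose(trajectory):
    """Split one multi-step trajectory into sub-trajectories, one per action:
    each sub-trajectory pairs the context so far with the action taken."""
    subs = []
    context = []
    for step in trajectory:
        subs.append((tuple(context), step))
        context.append(step)
    return subs

# a 3-step trajectory yields 3 sub-trajectories, one per model action
traj = ["search(q1)", "read(doc)", "answer"]
subs = decompose(traj)
```

Each sub-trajectory can then be filtered and optimized independently, which is what lets the method assign credit at the level of individual steps rather than whole solutions.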

Updated: 2025-04-07 05:20:58

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.04736v1

Teaching Data Science Students to Sketch Privacy Designs through Heuristics (Extended Technical Report)

Recent studies reveal that experienced data practitioners often draw sketches to facilitate communication around privacy design concepts. However, there is limited understanding of how we can help novice students develop such communication skills. This paper studies methods for lowering novice data science students' barriers to creating high-quality privacy sketches. We first conducted a need-finding study (N=12) to identify barriers students face when sketching privacy designs. We then used a human-centered design approach to guide the method development, culminating in three simple, text-based heuristics. Our user studies with 24 data science students revealed that simply presenting three heuristics to the participants at the beginning of the study can enhance the coverage of privacy-related design decisions in sketches, reduce the mental effort required for creating sketches, and improve the readability of the final sketches.

Updated: 2025-04-07 05:12:21

Categories: cs.HC,cs.CR,cs.CY

Download: http://arxiv.org/abs/2504.04734v1

SpinML: Customized Synthetic Data Generation for Private Training of Specialized ML Models

Specialized machine learning (ML) models tailored to users' needs and requests are increasingly being deployed on smart devices with cameras, to provide personalized intelligent services that take advantage of camera data. However, two primary challenges hinder the training of such models: the lack of publicly available labeled data suitable for specialized tasks and the inaccessibility of labeled private data due to concerns about user privacy. To address these challenges, we propose a novel system, SpinML, in which the server generates customized Synthetic image data to Privately traIN a specialized ML model tailored to the user request, using only a few sanitized reference images from the user. SpinML offers users fine-grained, object-level control over the reference images, which allows users to trade off the privacy and utility of the generated synthetic data according to their privacy preferences. Through experiments on three specialized model training tasks, we demonstrate that our proposed system can enhance the performance of specialized models without compromising users' privacy preferences.

Updated: 2025-04-07 05:07:42

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2503.03160v2

A High-Performance Curve25519 and Curve448 Unified Elliptic Curve Cryptography Accelerator

In modern critical infrastructure such as power grids, it is crucial to ensure security of data communications between network-connected devices while following strict latency criteria. This necessitates the use of cryptographic hardware accelerators. We propose a high-performance unified elliptic curve cryptography accelerator supporting NIST standard Montgomery curves Curve25519 and Curve448 at 128-bit and 224-bit security levels respectively. Our accelerator implements extensive parallel processing of Karatsuba-style large-integer multiplications, restructures arithmetic operations in the Montgomery Ladder and exploits special mathematical properties of the underlying pseudo-Mersenne and Solinas prime fields for optimized performance. Our design ensures efficient resource sharing across both curve computations and also incorporates several standard side-channel countermeasures. Our ASIC implementation achieves record performance and energy of 10.38 $\mu$s / 54.01 $\mu$s and 0.72 $\mu$J / 3.73 $\mu$J respectively for Curve25519 / Curve448, which is significantly better than state-of-the-art.
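
The "special mathematical properties" exploited here include pseudo-Mersenne reduction: modulo p = 2^255 - 19, the bits of a product above position 255 can be folded back in with a multiply-by-19 instead of a division. A minimal software sketch of that reduction (the hardware performs this with parallel Karatsuba multipliers, which this does not model):

```python
P = 2**255 - 19  # the pseudo-Mersenne prime underlying Curve25519

def reduce_mod_p(x):
    """Reduce x modulo p = 2^255 - 19 without division: fold the bits
    above position 255 back in, using the identity 2^255 ≡ 19 (mod p)."""
    while x >= 2**255:
        lo = x & (2**255 - 1)  # low 255 bits
        hi = x >> 255          # overflow above bit 255
        x = lo + 19 * hi       # congruent to x mod p, but much smaller
    return x if x < P else x - P

# reducing a full 510-bit product takes only a couple of folds
square = (2**255 - 20) ** 2
```

Curve448 uses the same trick with its Solinas prime 2^448 - 2^224 - 1, where the fold involves two shifted additions instead of a multiply-by-19.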

Updated: 2025-04-07 05:04:02

Categories: cs.CR

Download: http://arxiv.org/abs/2504.04731v1

Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models

Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training; no costly real-world light-stage data is needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.

Updated: 2025-04-07 04:57:10

Categories: cs.CV,cs.AI,eess.IV

Download: http://arxiv.org/abs/2503.21943v2

Achieving binary weight and activation for LLMs using Post-Training Quantization

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grain grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we decompose INT4-quantized activations into a 4 * INT1 format equivalently and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models.
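
The 4 * INT1 decomposition of an INT4 activation can be seen with a tiny numeric sketch (unsigned values only; the paper's Hessian-aware grouping and scaling-factor smoothing are not modeled here):

```python
def int4_to_4xint1(v):
    """Decompose an unsigned 4-bit value into four binary channels whose
    fixed scales 1, 2, 4, 8 reconstruct it exactly (the 4x channel count)."""
    assert 0 <= v < 16
    bits = [(v >> k) & 1 for k in range(4)]  # one INT1 channel per bit plane
    scales = [1, 2, 4, 8]
    return bits, scales

def reconstruct(bits, scales):
    """Inverse of the decomposition: a scaled sum of the binary channels."""
    return sum(b * s for b, s in zip(bits, scales))

bits, scales = int4_to_4xint1(5)  # 0b0101 -> channels [1, 0, 1, 0]
```

Because the decomposition is exact, a W?A4 matrix multiply can be rewritten as four binary multiplies whose outputs are summed with the fixed scales, which is what makes the fully binarized datapath possible.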

Updated: 2025-04-07 04:50:04

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.05352v1

Leveraging GANs For Active Appearance Models Optimized Model Fitting

Active Appearance Models (AAMs) are a well-established technique for fitting deformable models to images, but they are limited by linear appearance assumptions and can struggle with complex variations. In this paper, we explore whether the AAM fitting process can benefit from a Generative Adversarial Network (GAN). We use a U-Net-based generator and a PatchGAN discriminator in a GAN-augmented framework to refine the appearance model during fitting. This approach addresses challenges such as non-linear appearance variations and occlusions that traditional AAM optimization methods may fail to handle. Limited experiments on face alignment datasets demonstrate that the GAN-enhanced AAM can achieve higher accuracy and faster convergence than classic approaches, with some manual interventions. These results establish the feasibility of GANs as a tool for improving deformable model fitting in challenging conditions while maintaining efficient performance, and motivate future work to evaluate this approach at scale.

Updated: 2025-04-07 04:07:08

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2501.11218v3

T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
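
The core idea, delegating memorization-heavy checks to a tool, can be sketched with a tiny arithmetic interpreter standing in for the code interpreter (the function names and acceptance rule are illustrative, not the paper's implementation):

```python
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr):
    """Tiny arithmetic interpreter: the 'external tool' the verifier calls
    instead of trusting the language model's own mental arithmetic."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

def verify_step(claimed, expr):
    """Accept a model's claimed result only if the tool reproduces it."""
    return abs(safe_eval(expr) - claimed) < 1e-9

ok = verify_step(56088, "123 * 456")     # a correct claim passes
bad = verify_step(56078, "123 * 456")    # a wrong claim is caught
```

The verifier model only needs to extract the expression to check; the tool supplies the exact result, which is precisely the memorization burden being offloaded.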

Updated: 2025-04-07 04:01:17

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.04718v1

Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs.

Updated: 2025-04-07 04:00:08

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.04717v1

Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

The proliferation of Large Language Models (LLMs) accessed via black-box APIs introduces a significant trust challenge: users pay for services based on advertised model capabilities (e.g., size, performance), but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking. Detecting such substitutions is difficult due to the black-box nature, typically limiting interaction to input-output queries. This paper formalizes the problem of model substitution detection in LLM APIs. We systematically evaluate existing verification techniques, including output-based statistical tests, benchmark evaluations, and log probability analysis, under various realistic attack scenarios like model quantization, randomized substitution, and benchmark evasion. Our findings reveal the limitations of methods relying solely on text outputs, especially against subtle or adaptive attacks. While log probability analysis offers stronger guarantees when available, its accessibility is often limited. We conclude by discussing the potential of hardware-based solutions like Trusted Execution Environments (TEEs) as a pathway towards provable model integrity, highlighting the trade-offs between security, performance, and provider adoption. Code is available at https://github.com/sunblaze-ucb/llm-api-audit
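
As a flavor of the output-based statistical tests evaluated, here is a toy drift check on first-token frequencies (the threshold and data are invented; the paper's finding is precisely that such text-only tests can miss subtle or adaptive substitutions):

```python
from collections import Counter

def tv_distance(samples_a, samples_b):
    """Total-variation distance between the empirical first-token
    distributions of two batches of API responses."""
    ca, cb = Counter(samples_a), Counter(samples_b)
    na, nb = len(samples_a), len(samples_b)
    support = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[t] / na - cb[t] / nb) for t in support)

reference = ["The"] * 80 + ["A"] * 20   # profile recorded from the advertised model
observed = ["The"] * 50 + ["A"] * 50    # what the API returns today
drifted = tv_distance(reference, observed) > 0.2  # flag possible substitution
```

A quantized or carefully matched substitute can keep this distance small, which is why the paper turns to log-probability analysis and hardware attestation for stronger guarantees.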

Updated: 2025-04-07 03:57:41

Categories: cs.CL,cs.CR,cs.LG

Download: http://arxiv.org/abs/2504.04715v1

Universality of reservoir systems with recurrent neural networks

Approximation capability of reservoir systems whose reservoir is a recurrent neural network (RNN) is discussed. We show what we call uniform strong universality of RNN reservoir systems for a certain class of dynamical systems. This means that, given an approximation error to be achieved, one can construct an RNN reservoir system that approximates each target dynamical system in the class just via adjusting its linear readout. To show the universality, we construct an RNN reservoir system via parallel concatenation that has an upper bound of approximation error independent of each target in the class.
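
A scalar toy version of the reservoir idea is sketched below: the recurrent part is fixed at random and only the linear readout is fitted. The target here is deliberately linear in the reservoir state so the least-squares readout recovers it exactly; the paper's construction (parallel concatenation of RNN reservoirs with a class-uniform error bound) is far more general.

```python
import math
import random

random.seed(0)

def reservoir_states(inputs, w_in=0.5, w_rec=0.3):
    """Fixed (untrained) scalar RNN reservoir: only the readout is learned."""
    s, states = 0.0, []
    for u in inputs:
        s = math.tanh(w_in * u + w_rec * s)
        states.append(s)
    return states

inputs = [random.uniform(-1, 1) for _ in range(200)]
states = reservoir_states(inputs)
targets = [3.0 * s for s in states]  # toy target, linear in the state

# linear readout fitted by 1-D least squares (the only trained component)
w_out = sum(s * y for s, y in zip(states, targets)) / sum(s * s for s in states)
```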

Updated: 2025-04-07 03:54:31

Categories: cs.NE,cs.LG

Download: http://arxiv.org/abs/2403.01900v2

Generalising from Self-Produced Data: Model Training Beyond Human Constraints

Current large language models (LLMs) are constrained by human-derived training data and limited by a single level of abstraction that impedes definitive truth judgments. This paper introduces a novel framework in which AI models autonomously generate and validate new knowledge through direct interaction with their environment. Central to this approach is an unbounded, ungameable numeric reward - such as annexed disk space or follower count - that guides learning without requiring human benchmarks. AI agents iteratively generate strategies and executable code to maximize this metric, with successful outcomes forming the basis for self-retraining and incremental generalisation. To mitigate model collapse and the warm start problem, the framework emphasizes empirical validation over textual similarity and supports fine-tuning via GRPO. The system architecture employs modular agents for environment analysis, strategy generation, and code synthesis, enabling scalable experimentation. This work outlines a pathway toward self-improving AI systems capable of advancing beyond human-imposed constraints toward autonomous general intelligence.

Updated: 2025-04-07 03:48:02

Categories: cs.AI

Download: http://arxiv.org/abs/2504.04711v1

Out-of-Distribution Generalization in Time Series: A Survey

Time series frequently manifest distribution shifts, diverse latent features, and non-stationary learning dynamics, particularly in open and evolving environments. These characteristics pose significant challenges for out-of-distribution (OOD) generalization. While substantial progress has been made, a systematic synthesis of advancements remains lacking. To address this gap, we present the first comprehensive review of OOD generalization methodologies for time series, organized to delineate the field's evolutionary trajectory and contemporary research landscape. We organize our analysis across three foundational dimensions: data distribution, representation learning, and OOD evaluation. For each dimension, we present several popular algorithms in detail. Furthermore, we highlight key application scenarios, emphasizing their real-world impact. Finally, we identify persistent challenges and propose future research directions. A detailed summary of the methods reviewed for the generalization of OOD in time series can be accessed at https://tsood-generalization.com.

Updated: 2025-04-07 03:45:27

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.13868v2

NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluates VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatially enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatially enhanced VLM performs better on qualitative QA but is not competitive on quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.
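
In its simplest form, a QA generation pipeline of this kind might template questions from object positions in the ego frame, as in the sketch below (the question format and coordinate convention are assumptions for illustration, not the benchmark's actual schema):

```python
def spatial_qa(objects):
    """Generate left/right question-answer pairs from 2-D object centroids
    in the ego frame (x grows to the right)."""
    qa = []
    for name_a, (xa, _) in objects.items():
        for name_b, (xb, _) in objects.items():
            if name_a == name_b:
                continue
            answer = "left" if xa < xb else "right"
            qa.append((f"Is the {name_a} left or right of the {name_b}?", answer))
    return qa

# two annotated objects yield one question per ordered pair
pairs = spatial_qa({"car": (-2.0, 5.0), "pedestrian": (1.5, 4.0)})
```

Because the answers come straight from ground-truth geometry, the same machinery scales to distance, orientation, and other quantitative questions without human annotation.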

Updated: 2025-04-07 03:39:02

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.03164v2

AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing

Knowledge Tracing (KT) monitors students' knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to discrepancies with the multi-step inference process required in real-world simulations, resulting in significant error accumulation. This accumulation of error, coupled with the issue of data sparsity, can substantially degrade the performance of recommendation models in the intelligent tutoring systems. To address these challenges, we propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT), which, for the first time, focuses on the multi-step KT task. More specifically, AdvKT leverages adversarial learning paradigm involving a generator and a discriminator. The generator mimics high-reward responses, effectively reducing error accumulation across multiple steps, while the discriminator provides feedback to generate synthetic data. Additionally, we design specialized data augmentation techniques to enrich the training data with realistic variations, ensuring that the model generalizes well even in scenarios with sparse data. Experiments conducted on four real-world datasets demonstrate the superiority of AdvKT over existing KT models, showcasing its ability to address both error accumulation and data sparsity issues effectively.

Updated: 2025-04-07 03:31:57

Categories: cs.LG,cs.AI,cs.CY

Download: http://arxiv.org/abs/2504.04706v1

AutoOpt: A General Framework for Automatically Designing Metaheuristic Optimization Algorithms with Diverse Structures

Metaheuristics are widely recognized gradient-free solvers to hard problems that do not meet the rigorous mathematical assumptions of conventional solvers. The automated design of metaheuristic algorithms provides an attractive path to relieve manual design effort and gain enhanced performance beyond human-made algorithms. However, the specific algorithm prototype and linear algorithm representation in the current automated design pipeline restrict the design within a fixed algorithm structure, which hinders discovering novelties and diversity across the metaheuristic family. To address this challenge, this paper proposes a general framework, AutoOpt, for automatically designing metaheuristic algorithms with diverse structures. AutoOpt contains three innovations: (i) A general algorithm prototype dedicated to covering the metaheuristic family as widely as possible. It promotes high-quality automated design on different problems by fully discovering potentials and novelties across the family. (ii) A directed acyclic graph algorithm representation to fit the proposed prototype. Its flexibility and evolvability enable discovering various algorithm structures in a single run of design, thus boosting the possibility of finding high-performance algorithms. (iii) A graph representation embedding method offering an alternative compact form of the graph to be manipulated, which ensures AutoOpt's generality. Experiments on numerical benchmark functions and real-world applications validate AutoOpt's efficiency and practicability.
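
The directed-acyclic-graph representation can be sketched as follows: nodes are search operators, edges are execution dependencies, and any topological order of the graph is a runnable algorithm (the operators and graphlib-based execution below are illustrative; AutoOpt's operators act on solution populations, not scalars):

```python
from graphlib import TopologicalSorter

def run_algorithm(dag, ops, x):
    """Execute an 'algorithm' encoded as a DAG of operator names:
    each node transforms the solution after its predecessors have run."""
    order = list(TopologicalSorter(dag).static_order())
    for node in order:
        x = ops[node](x)
    return x

# toy operators on a scalar 'solution'; the DAG fixes their partial order
ops = {"init": lambda x: 0, "mutate": lambda x: x + 1, "select": lambda x: x * 2}
dag = {"mutate": {"init"}, "select": {"mutate"}}  # init -> mutate -> select
result = run_algorithm(dag, ops, None)
```

Because the design search mutates the graph itself (adding nodes, rewiring edges), a single run can explore branching and nested structures that a fixed linear template cannot express.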

Updated: 2025-04-07 03:22:35

Categories: cs.NE,cs.AI

Download: http://arxiv.org/abs/2204.00998v7

CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs \& LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization problem, in which the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut that significantly reduces the cost of reward estimation and improves the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications. Our data and code are available at https://github.com/aiming-lab/CITER.
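The token-level routing strategy described above can be reduced to a small sketch. This is illustrative only, not the paper's implementation: `score_fn` stands in for the learned router's criticality score, and a fixed threshold replaces the policy-optimized routing decision.

```python
def route_tokens(tokens, score_fn, threshold=0.5):
    """Route each token to the large model ("llm") if the router deems it
    critical, otherwise to the small model ("slm") for cheap decoding."""
    return [(tok, "llm" if score_fn(tok) >= threshold else "slm")
            for tok in tokens]

# Toy router: only tokens flagged as critical are sent to the LLM.
plan = route_tokens(["the", "answer", "is", "42"],
                    score_fn=lambda t: 1.0 if t == "42" else 0.1)
```

In the real framework the score also accounts for the future impact of each routing decision, which a per-token threshold cannot capture.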

Updated: 2025-04-07 03:22:31

Categories: cs.CL,cs.AI,cs.LG,cs.PF

Download: http://arxiv.org/abs/2502.01976v4

LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

The increasing size of the Key-Value (KV) cache during long-context inference in Large Language Models is the main obstacle to balancing deployment cost against task accuracy. To reduce the KV cache size in such scenarios, most previous efforts have leveraged attention weights to evict non-critical cache tokens. But these methods involve a trade-off: they usually require major modification of the inference infrastructure and incur significant computation overhead. Based on the fact that Large Language Models are autoregressive models, we propose {\it LagKV}, a KV allocation strategy relying only on straightforward comparisons among the KVs themselves. It is a totally attention-free method that offers easy integration into mainstream inference platforms and performance comparable to other, more complicated KV compression methods. Results on LongBench and PasskeyRetrieval show that our approach achieves nearly zero loss at a $2\times$ compression ratio and $\approx 90\%$ of the original model performance at $8\times$. Especially in the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $60\%$ at the same compression ratios. Our code is available at \url{https://github.com/AI-Lab-China-Merchants-Bank/LagKV}.
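The lag-relative, attention-free idea can be illustrated with a minimal sketch: score each token's key by its deviation from the statistics of the `lag` preceding keys, then keep the highest-scoring fraction of the cache. Both the L2-deviation scoring rule and the `lag` window here are assumptions for illustration, not the paper's exact statistic.

```python
import math

def lag_scores(keys, lag=2):
    """Score each token by the L2 distance between its key vector and the
    mean of the `lag` preceding key vectors (pure KV comparison, no attention)."""
    scores = []
    for t, k in enumerate(keys):
        window = keys[max(0, t - lag):t] or [k]  # first token scores 0
        mean = [sum(col) / len(window) for col in zip(*window)]
        scores.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(k, mean))))
    return scores

def compress_cache(keys, values, ratio=2, lag=2):
    """Keep the top len/ratio tokens by lag-relative score, preserving order
    (e.g. ratio=2 halves the cache)."""
    scores = lag_scores(keys, lag)
    keep = max(1, len(keys) // ratio)
    kept = sorted(sorted(range(len(keys)), key=lambda i: -scores[i])[:keep])
    return [keys[i] for i in kept], [values[i] for i in kept]
```

Because the score depends only on the KVs, this kind of eviction can run inside a standard cache manager without touching attention kernels, which is the integration advantage the abstract claims.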

Updated: 2025-04-07 03:22:15

Categories: cs.LG,cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2504.04704v1

Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent

Recent advancements in Transformer-based architectures have led to impressive breakthroughs in natural language processing tasks, with models such as GPT-4, Claude, and Gemini demonstrating human-level reasoning abilities. However, despite their high performance, concerns remain about the inherent limitations of these models, especially when it comes to learning basic logical functions. While complexity-theoretic analyses indicate that Transformers can represent simple logic functions (e.g., $\mathsf{AND}$, $\mathsf{OR}$, and majority gates) by virtue of belonging to the $\mathsf{TC}^0$ class, these results assume ideal parameter settings and do not account for the constraints imposed by gradient descent-based training methods. In this work, we investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods. We focus on a simplified variant of the Transformer architecture and consider both $n=\mathrm{poly}(d)$ and $n=\exp(\Omega(d))$ training samples, where each sample is a $d$-size binary string paired with the output of a basic majority function. Our analysis demonstrates that even after $\mathrm{poly}(d)$ gradient queries, the generalization error of the Transformer model remains substantially large, growing exponentially with $d$. This work highlights fundamental optimization challenges in training Transformers for the simplest logical reasoning tasks and provides new insights into their theoretical limitations.
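For concreteness, the learning target described above — a $d$-bit string labelled by its majority bit — can be written down directly; the sampling helper and its names are illustrative.

```python
import random

def majority(bits):
    """Basic majority function: 1 iff more than half of the bits are 1."""
    return 1 if 2 * sum(bits) > len(bits) else 0

def sample_majority_data(n, d, seed=0):
    """Draw n uniform d-bit strings paired with their majority label."""
    rng = random.Random(seed)
    return [(x, majority(x))
            for x in ([rng.randint(0, 1) for _ in range(d)] for _ in range(n))]
```

The paper's negative result concerns a gradient-trained Transformer failing on exactly this kind of dataset, despite the target being trivially computable.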

Updated: 2025-04-07 03:08:12

Categories: cs.LG,cs.AI,cs.CC

Download: http://arxiv.org/abs/2504.04702v1

R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation

Large language models (LLMs) have shown promising performance in software vulnerability detection (SVD), yet their reasoning capabilities remain unreliable. Existing approaches relying on chain-of-thought (CoT) struggle to provide relevant and actionable security assessments. Additionally, effective SVD requires not only generating coherent reasoning but also differentiating between well-founded and misleading yet plausible security assessments, an aspect overlooked in prior work. To this end, we introduce R2Vul, a novel approach that distills structured reasoning into small LLMs using reinforcement learning from AI feedback (RLAIF). Through RLAIF, R2Vul enables LLMs to produce structured, security-aware reasoning that is actionable and reliable while explicitly learning to distinguish valid assessments from misleading ones. We evaluate R2Vul across five languages against SAST tools, CoT, instruction tuning, and classification-based baselines. Our results show that R2Vul with structured reasoning distillation enables a 1.5B student LLM to rival larger models while improving generalization to out-of-distribution vulnerabilities. Beyond model improvements, we contribute a large-scale, multilingual preference dataset featuring structured reasoning to support future research in SVD.

Updated: 2025-04-07 03:04:16

Categories: cs.SE,cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.04699v1

Quantum Kernel Learning for Small Dataset Modeling in Semiconductor Fabrication: Application to Ohmic Contact

Complex semiconductor fabrication processes, such as Ohmic contact formation in unconventional semiconductor devices, pose significant modeling challenges due to a large number of operational variables and the difficulty of collecting large, high-quality datasets. Classical machine learning (CML) models often struggle in such scenarios, where the data is both high-dimensional and limited in quantity, leading to overfitting and reduced predictive accuracy. To address this challenge, we develop the first application of quantum machine learning (QML) to model this semiconductor process, leveraging quantum systems' capacity to efficiently capture complex correlations in high-dimensional spaces and generalize well with small datasets. Using only 159 experimental samples augmented via a variational autoencoder, we report a quantum kernel-based regressor (SQKR) with a static 2-level ZZ feature map. The SQKR consistently outperformed six mainstream CML models across all evaluation metrics, achieving the lowest mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), with repeated experiments confirming its robustness. Notably, SQKR achieved an MAE of 0.314 Ohm-mm with data from experimental verification, demonstrating its ability to effectively model semiconductor fabrication processes despite limited data availability. These results highlight QML's unique capability to handle small yet high-dimensional datasets in the semiconductor industry, making it a promising alternative to classical approaches for semiconductor process modeling.

Updated: 2025-04-07 02:57:39

Categories: cs.LG,cs.ET,quant-ph

Download: http://arxiv.org/abs/2409.10803v2

Large-Scale Mixed-Traffic and Intersection Control using Multi-agent Reinforcement Learning

Traffic congestion remains a significant challenge in modern urban networks. Autonomous driving technologies have emerged as a potential solution. Among traffic control methods, reinforcement learning has shown superior performance over traffic signals in various scenarios. However, prior research has largely focused on small-scale networks or isolated intersections, leaving large-scale mixed traffic control largely unexplored. This study presents the first attempt to use decentralized multi-agent reinforcement learning for large-scale mixed traffic control in which some intersections are managed by traffic signals and others by robot vehicles. Evaluating a real-world network in Colorado Springs, CO, USA with 14 intersections, we measure traffic efficiency via average waiting time of vehicles at intersections and the number of vehicles reaching their destinations within a time window (i.e., throughput). At 80% RV penetration rate, our method reduces waiting time from 6.17 s to 5.09 s and increases throughput from 454 vehicles per 500 seconds to 493 vehicles per 500 seconds, outperforming the baseline of fully signalized intersections. These findings suggest that integrating reinforcement learning-based control into large-scale mixed traffic can improve overall efficiency and may inform future urban planning strategies.

Updated: 2025-04-07 02:52:39

Categories: cs.LG,cs.MA

Download: http://arxiv.org/abs/2504.04691v1

Heuristics and Biases in AI Decision-Making: Implications for Responsible AGI

We investigate the presence of cognitive biases in three large language models (LLMs): GPT-4o, Gemma 2, and Llama 3.1. The study uses 1,500 experiments across nine established cognitive biases to evaluate the models' responses and consistency. GPT-4o demonstrated the strongest overall performance. Gemma 2 showed strengths in addressing the sunk cost fallacy and prospect theory, however its performance varied across different biases. Llama 3.1 consistently underperformed, relying on heuristics and exhibiting frequent inconsistencies and contradictions. The findings highlight the challenges of achieving robust and generalizable reasoning in LLMs, and underscore the need for further development to mitigate biases in artificial general intelligence (AGI). The study emphasizes the importance of integrating statistical reasoning and ethical considerations in future AI development.

Updated: 2025-04-07 02:44:51

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2410.02820v3

Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model

Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models. However, reward hacking-a phenomenon where models exploit flaws in the reward model-remains a significant barrier to achieving robust and scalable intelligence through long-term training. Existing studies have proposed uncertain reward models to address reward hacking; however, they often lack systematic or theoretical foundations and fail to model the uncertainty intrinsically emerging from preference data, and thus cannot sufficiently mitigate reward hacking to sustain prolonged RLHF training and exploration. In this paper, we propose the Probabilistic Uncertain Reward Model (PURM), a natural generalization of the classical Bradley-Terry reward model, which directly models the reward distribution that emerges from the preference data. We theoretically derive PURM's loss function and a reward-distribution uncertainty measure based on the Bhattacharyya coefficient. To mitigate reward hacking with PURM, we further introduce an uncertainty-aware penalty into Proximal Policy Optimization (PPO), which leverages the learned uncertainty to dynamically balance reward optimization and exploration. We propose a lightweight and easy-to-use implementation of PURM. Experiments demonstrate that PURM significantly delays the onset of reward hacking while improving final reward performance, outperforming baseline methods in both stability and effectiveness.
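Under a Gaussian parameterization of per-response rewards (an assumption made here for illustration; the abstract does not fix the distributional form), the Bhattacharyya coefficient between two 1-D Gaussian reward distributions has a closed form and can serve as the overlap-based uncertainty signal, with an assumed penalty shape for the PPO side:

```python
import math

def bhattacharyya_coeff(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya coefficient between N(mu1, sigma1^2) and N(mu2, sigma2^2):
    1.0 for identical distributions, approaching 0 as they separate."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    return (math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
            * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum)))

def uncertainty_penalty(mu1, s1, mu2, s2, beta=1.0):
    """Illustrative uncertainty-aware bonus: heavily overlapping reward
    distributions (an uncertain preference) reduce the effective reward."""
    return -beta * bhattacharyya_coeff(mu1, s1, mu2, s2)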

Updated: 2025-04-07 02:42:56

Categories: cs.LG

Download: http://arxiv.org/abs/2503.22480v2

SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Large Language Models (LLMs), such as OpenAI's o1-series have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.
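The two stages can be sketched with plain vectors. This is a hedged simplification: real steering vectors live in transformer hidden states, and the mean-difference extraction below is one common choice assumed here, not necessarily SEAL's exact procedure.

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def extract_steering_vector(desired_states, undesired_states):
    """Offline stage: difference between mean latents of desired thoughts
    (execution) and undesired ones (excessive reflection/transition)."""
    return [a - b for a, b in zip(mean_vec(desired_states),
                                  mean_vec(undesired_states))]

def calibrate(hidden, steer, alpha=1.0):
    """On-the-fly stage: shift a hidden state along the steering vector."""
    return [h + alpha * s for h, s in zip(hidden, steer)]
```

Because the intervention is a fixed vector addition, no gradient updates are needed at inference time, which is what makes the approach training-free.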

Updated: 2025-04-07 02:42:07

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.07986v1

What is Wrong with Perplexity for Long-context Language Modeling?

Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose \textbf{LongPPL}, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce \textbf{LongCE} (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at https://github.com/PKU-ML/LongPPL.
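The key-token idea above can be sketched as follows, assuming per-token log-probabilities under a long context and a truncated short context are available; the nat-valued gain threshold is an illustrative stand-in for the paper's contrastive criterion.

```python
import math

def long_ppl(logp_long, logp_short, gain_threshold=0.5):
    """Perplexity averaged only over key tokens: those whose log-probability
    improves by more than `gain_threshold` nats when the long context is visible."""
    key = [lp for lp, sp in zip(logp_long, logp_short)
           if lp - sp > gain_threshold]
    if not key:
        return float("nan")  # no token benefits from the long context
    return math.exp(-sum(key) / len(key))
```

Averaging over key tokens only is exactly what keeps the metric from being diluted by the many tokens that are equally predictable with or without the long context.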

Updated: 2025-04-07 02:40:09

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2410.23771v4

Bridging Knowledge Gap Between Image Inpainting and Large-Area Visible Watermark Removal

Visible watermark removal, which involves watermark cleaning and background content restoration, is pivotal to evaluate the resilience of watermarks. Existing deep neural network (DNN)-based models still struggle with large-area watermarks and are overly dependent on the quality of watermark mask prediction. To overcome these challenges, we introduce a novel feature adapting framework that leverages the representation modeling capacity of a pre-trained image inpainting model. Our approach bridges the knowledge gap between image inpainting and watermark removal by fusing information of the residual background content beneath watermarks into the inpainting backbone model. We establish a dual-branch system to capture and embed features from the residual background content, which are merged into intermediate features of the inpainting backbone model via gated feature fusion modules. Moreover, for relieving the dependence on high-quality watermark masks, we introduce a new training paradigm by utilizing coarse watermark masks to guide the inference process. This yields a visible watermark removal model that is insensitive to the quality of the watermark mask during testing. Extensive experiments on both a large-scale synthesized dataset and a real-world dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. The source code is available in the supplementary materials.
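The gated feature fusion step can be illustrated element-wise. This is a generic sigmoid-gated blend, assumed for illustration rather than the paper's exact module.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(backbone_feat, branch_feat, gate_logits):
    """Blend inpainting-backbone features with residual-background branch
    features; the gate (given here as logits) controls the per-element mix."""
    return [sigmoid(g) * b + (1.0 - sigmoid(g)) * r
            for b, r, g in zip(backbone_feat, branch_feat, gate_logits)]
```

A learned gate of this shape lets the network fall back on pure inpainting where the watermark fully occludes the background and lean on the residual-background branch where content survives beneath the watermark.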

Updated: 2025-04-07 02:37:14

Categories: cs.CV,cs.AI,cs.MM,eess.IV,I.2.10; I.4.4; I.4.5

Download: http://arxiv.org/abs/2504.04687v1

Generative Large Language Model usage in Smart Contract Vulnerability Detection

Recent years have seen an explosion of activity in Generative AI, specifically Large Language Models (LLMs), revolutionising applications across various fields. Smart contract vulnerability detection is no exception; as smart contracts exist on public chains and can have billions of dollars transacted daily, continuous improvement in vulnerability detection is crucial. This has led to many researchers investigating the usage of generative large language models (LLMs) to aid in detecting vulnerabilities in smart contracts. This paper presents a systematic review of the current LLM-based smart contract vulnerability detection tools, comparing them against traditional static and dynamic analysis tools Slither and Mythril. Our analysis highlights key areas where each performs better and shows that while these tools show promise, the LLM-based tools available for testing are not ready to replace more traditional tools. We conclude with recommendations on how LLMs are best used in the vulnerability detection process and offer insights for improving on the state-of-the-art via hybrid approaches and targeted pre-training of much smaller models.

Updated: 2025-04-07 02:33:40

Categories: cs.CR

Download: http://arxiv.org/abs/2504.04685v1

A Novel Framework To Assess Cybersecurity Capability Maturity

In today's rapidly evolving digital landscape, organisations face escalating cyber threats that can disrupt operations, compromise sensitive data, and inflict financial and reputational harm. A key reason for this lies in the organisations' lack of a clear understanding of their cybersecurity capabilities, leading to ineffective defences. To address this gap, Cybersecurity Capability Maturity Models (CCMMs) provide a systematic approach to assessing and enhancing an organisation's cybersecurity posture by focusing on capability maturity rather than merely implementing controls. However, their limitations, such as rigid structures, one-size-fits-all approach, complexity, gaps in security scope (i.e., technological, organisational, and human aspects) and lack of quantitative metrics, hinder their effectiveness. It makes implementing CCMMs in varying contexts challenging and results in fragmented, incomprehensive assessments. Therefore, we propose a novel Cybersecurity Capability Maturity Framework that is holistic, flexible, and measurable to provide organisations with a more relevant and impactful assessment to enhance their cybersecurity posture.

Updated: 2025-04-07 02:24:29

Categories: cs.CR

Download: http://arxiv.org/abs/2504.01305v2

Training Dynamics of a 1.7B LLaMa Model: A Data-Efficient Approach

Pretraining large language models is a complex endeavor influenced by multiple factors, including model architecture, data quality, training continuity, and hardware constraints. In this paper, we share insights gained from the experience of training DMaS-LLaMa-Lite, a fully open source, 1.7-billion-parameter, LLaMa-based model, on approximately 20 billion tokens of carefully curated data. We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output. Beyond pretraining, we extend our analysis to include a post-training phase focused on instruction tuning, where the model was refined to produce more contextually appropriate, user-aligned responses. We highlight practical considerations such as the importance of restoring optimizer states when resuming from checkpoints, and the impact of hardware changes on training stability and throughput. While qualitative evaluation provides an intuitive understanding of model improvements, our analysis extends to various performance benchmarks, demonstrating how high-quality data and thoughtful scaling enable competitive results with significantly fewer training tokens. By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies. The training script is available on Github at https://github.com/McGill-DMaS/DMaS-LLaMa-Lite-Training-Code. The model checkpoints are available on Huggingface at https://huggingface.co/collections/McGill-DMaS/dmas-llama-lite-6761d97ba903f82341954ceb.

Updated: 2025-04-07 02:07:30

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2412.13335v3

Detecting AI-Generated Video via Frame Consistency

The escalating quality of video generated by advanced video generation methods results in new security challenges, while there have been few relevant research efforts: 1) There is no open-source dataset for generated video detection, 2) No generated video detection method has been proposed so far. To this end, we propose an open-source dataset and a detection method for generated video for the first time. First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions, as well as various generation models with different architectures and generation methods, including the most popular commercial models like OpenAI's Sora and Google's Veo. Second, we found via probing experiments that spatial artifact-based detectors lack generalizability. Hence, we propose a simple yet effective \textbf{de}tection model based on \textbf{f}rame \textbf{co}nsistency (\textbf{DeCoF}), which focuses on temporal artifacts by eliminating the impact of spatial artifacts during feature learning. Extensive experiments demonstrate the efficacy of DeCoF in detecting videos generated by unseen video generation models and confirm its powerful generalizability across several commercially proprietary models. Our code and dataset will be released at https://github.com/wuwuwuyue/DeCoF.
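A minimal proxy for the frame-consistency signal: the average distance between consecutive per-frame feature vectors, on the assumption that generated videos exhibit larger temporal jumps. DeCoF itself learns this from features with spatial artifacts suppressed; this sketch only illustrates the temporal-artifact focus.

```python
import math

def frame_inconsistency(frame_feats):
    """Mean L2 distance between consecutive per-frame feature vectors;
    higher values suggest weaker temporal consistency."""
    diffs = [math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
             for f1, f2 in zip(frame_feats, frame_feats[1:])]
    return sum(diffs) / len(diffs)
```

Because the statistic never looks at a single frame in isolation, it is insensitive to the per-frame spatial artifacts that the probing experiments found to generalize poorly.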

Updated: 2025-04-07 02:01:27

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2402.02085v7

Adversarially Domain-adaptive Latent Diffusion for Unsupervised Semantic Segmentation

Semantic segmentation requires extensive pixel-level annotation, motivating unsupervised domain adaptation (UDA) to transfer knowledge from labelled source domains to unlabelled or weakly labelled target domains. One of the most efficient strategies involves using synthetic datasets generated within controlled virtual environments, such as video games or traffic simulators, which can automatically generate pixel-level annotations. However, even when such datasets are available, learning a well-generalised representation that captures both domains remains challenging, owing to probabilistic and geometric discrepancies between the virtual world and real-world imagery. This work introduces a semantic segmentation method based on latent diffusion models, termed Inter-Coder Connected Latent Diffusion (ICCLD), alongside an unsupervised domain adaptation approach. The model employs an inter-coder connection to enhance contextual understanding and preserve fine details, while adversarial learning aligns latent feature distributions across domains during the latent diffusion process. Experiments on GTA5, Synthia, and Cityscapes demonstrate that ICCLD outperforms state-of-the-art UDA methods, achieving mIoU scores of 74.4 (GTA5$\rightarrow$Cityscapes) and 67.2 (Synthia$\rightarrow$Cityscapes).

Updated: 2025-04-07 02:01:25

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2412.16859v2

Dual Consistent Constraint via Disentangled Consistency and Complementarity for Multi-view Clustering

Multi-view clustering can explore common semantics from multiple views and has received increasing attention in recent years. However, current methods focus on learning consistency in representation, neglecting the contribution of each view's complementarity aspect in representation learning. This limitation poses a significant challenge in multi-view representation learning. This paper proposes a novel multi-view clustering framework that introduces a disentangled variational autoencoder separating multi-view data into shared and private information, i.e., consistency and complementarity information. We first learn informative and consistent representations by maximizing mutual information across different views through contrastive learning. This process ignores complementary information. Then, we employ consistency inference constraints to explicitly utilize complementary information when seeking the consistency of shared information across all views. Specifically, we perform a within-view reconstruction using the private and shared information of each view and a cross-view reconstruction using the shared information of all views. The dual consistency constraints are not only effective in improving the representation quality of data but also easy to extend to other scenarios, especially complex multi-view scenes. This could be the first attempt to employ a dual consistent constraint in a unified multi-view clustering (MVC) theoretical framework. During the training procedure, the consistency and complementarity features are jointly optimized. Extensive experiments show that our method outperforms baseline methods.
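The within- and cross-reconstruction idea can be sketched numerically. The following toy (all names and the linear "decoders" are illustrative assumptions, not the paper's model) shows that dropping a view's private code leaves a reconstruction residual exactly when complementary information matters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_s, d_p = 32, 4, 3

# Toy "encodings": a shared code per sample plus a private code per view.
shared = rng.normal(size=(n, d_s))
private = {v: rng.normal(size=(n, d_p)) for v in (0, 1)}
# Toy "observations": each view is a linear mix of its shared+private codes.
W = {v: rng.normal(size=(d_s + d_p, 8)) for v in (0, 1)}
X = {v: np.hstack([shared, private[v]]) @ W[v] for v in (0, 1)}

def mse(a, b):
    return float(((a - b) ** 2).mean())

# Within-view reconstruction: decode view v from its own shared+private code.
within = sum(mse(np.hstack([shared, private[v]]) @ W[v], X[v]) for v in (0, 1))
# Cross-view reconstruction: decode view v from shared information only
# (private part zeroed), which is imperfect whenever private info matters.
cross = sum(mse(np.hstack([shared, np.zeros((n, d_p))]) @ W[v], X[v])
            for v in (0, 1))
print(within, cross)
```

In the paper these residuals become trainable losses; here the gap between `cross` and `within` simply makes the role of complementary (private) information visible.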

Updated: 2025-04-07 02:00:16

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04676v1

HypRL: Reinforcement Learning of Control Policies for Hyperproperties

We study the problem of learning control policies for complex tasks whose requirements are given by a hyperproperty. The use of hyperproperties is motivated by their significant power to formally specify requirements of multi-agent systems as well as those that need expressiveness in terms of multiple execution traces (e.g., privacy and fairness). Given a Markov decision process M with unknown transitions (representing the environment) and a HyperLTL formula $\varphi$, our approach first employs Skolemization to handle quantifier alternations in $\varphi$. We introduce quantitative robustness functions for HyperLTL to define rewards of finite traces of M with respect to $\varphi$. Finally, we utilize a suitable reinforcement learning algorithm to learn (1) a policy per trace quantifier in $\varphi$, and (2) the probability distribution of transitions of M that together maximize the expected reward and, hence, probability of satisfaction of $\varphi$ in M. We present a set of case studies on (1) safety-preserving multi-agent path planning, (2) fairness in resource allocation, and (3) the post-correspondence problem (PCP).
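The quantitative robustness functions mentioned above generalize Boolean satisfaction to real-valued margins. A minimal sketch for two temporal operators over a finite trace (simplified, STL-style robustness; not the paper's exact HyperLTL definition):

```python
def rob_eventually(values):
    """Robustness of 'eventually (x > 0)' over a finite trace: positive
    iff the predicate holds at some step, with magnitude giving the margin."""
    return max(values)

def rob_globally(values):
    """Robustness of 'globally (x > 0)': positive iff it holds at every step."""
    return min(values)

trace = [-2.0, -0.5, 1.5, 0.3]
print(rob_eventually(trace), rob_globally(trace))  # 1.5 -2.0
```

Such margins, rather than 0/1 satisfaction, are what make the formula usable as a dense reinforcement-learning reward.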

Updated: 2025-04-07 01:58:36

Categories: cs.AI,cs.LO

Download: http://arxiv.org/abs/2504.04675v1

Sparsity-Aware Communication for Distributed Graph Neural Network Training

Graph Neural Networks (GNNs) are a computationally efficient method to learn embeddings and classifications on graph data. However, GNN training has low computational intensity, making communication costs the bottleneck for scalability. Sparse-matrix dense-matrix multiplication (SpMM) is the core computational operation in full-graph training of GNNs. Previous work parallelizing this operation focused on sparsity-oblivious algorithms, where matrix elements are communicated regardless of the sparsity pattern. This leads to a predictable communication pattern that can be overlapped with computation and enables the use of collective communication operations, at the expense of wasting significant bandwidth by communicating unnecessary data. We develop sparsity-aware algorithms that tackle the communication bottlenecks in GNN training with three novel approaches. First, we communicate only the necessary matrix elements. Second, we utilize a graph partitioning model to reorder the matrix and drastically reduce the number of communicated elements. Finally, we address the high load imbalance in communication with a tailored partitioning model, which minimizes both the total communication volume and the maximum sending volume. We further couple these sparsity-exploiting approaches with a communication-avoiding approach (1.5D parallel SpMM) in which submatrices are replicated to reduce communication. We explore the tradeoffs of these combined optimizations and show up to a 14X improvement on 256 GPUs relative to a popular GNN framework based on communication-oblivious SpMM; in some instances, communication is reduced to almost zero, yielding effectively communication-free parallel training.
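The "communicate only the necessary elements" idea can be sketched as follows: in an SpMM C = A·B where A is sparse, a process owning a block of A's rows needs only the rows of B indexed by the nonzero columns of that block. A toy volume comparison (the sparsity pattern and partition are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
# A toy CSR-like sparsity pattern: for each row of A, the column indices
# of its nonzeros.
cols_per_row = [rng.choice(n, size=3, replace=False) for _ in range(n)]

my_rows = range(0, 25)  # rows of A owned by this process
# Sparsity-aware: the only rows of B this process must receive are those
# indexed by nonzero columns in its local rows of A.
needed = np.unique(np.concatenate([cols_per_row[i] for i in my_rows]))

# Sparsity-oblivious schemes ship all n rows of B regardless of the pattern.
print(len(needed), n)
```

With at most 75 distinct column indices across 25 local rows, the required volume here is strictly bounded by the oblivious one; graph partitioning, as in the paper, would shrink `needed` further by clustering nonzeros.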

Updated: 2025-04-07 01:53:14

Categories: cs.LG

Download: http://arxiv.org/abs/2504.04673v1

Scaling Graph Neural Networks for Particle Track Reconstruction

Particle track reconstruction is an important problem in high-energy physics (HEP), necessary to study properties of subatomic particles. Traditional track reconstruction algorithms scale poorly with the number of particles within the accelerator. The Exa.TrkX project, to alleviate this computational burden, introduces a pipeline that reduces particle track reconstruction to edge classification on a graph, and uses graph neural networks (GNNs) to produce particle tracks. However, this GNN-based approach is memory-prohibitive and skips graphs that would exceed GPU memory. We introduce improvements to the Exa.TrkX pipeline to train on samples of input particle graphs, and show that these improvements generalize to higher precision and recall. In addition, we adapt performance optimizations, introduced for GNN training, to fit our augmented Exa.TrkX pipeline. These optimizations provide a $2\times$ speedup over our baseline implementation in PyTorch Geometric.

Updated: 2025-04-07 01:44:32

Categories: cs.LG,cs.CE,cs.DC

Download: http://arxiv.org/abs/2504.04670v1

asKAN: Active Subspace embedded Kolmogorov-Arnold Network

The Kolmogorov-Arnold Network (KAN) has emerged as a promising neural network architecture for small-scale AI+Science applications. However, it suffers from inflexibility in modeling ridge functions, which are widely used to represent relationships in physical systems. This study investigates this inflexibility through the lens of the Kolmogorov-Arnold theorem, which builds the representation of multivariate functions by constructing univariate components rather than by combining the independent variables. Our analysis reveals that incorporating linear combinations of independent variables can substantially simplify the network architecture for representing ridge functions. Inspired by this finding, we propose the active subspace embedded KAN (asKAN), a hierarchical framework that synergizes KAN's function representation with active subspace methodology. The architecture strategically embeds active subspace detection between KANs, where the active subspace method is used to identify the primary ridge directions and the independent variables are adaptively projected onto these critical dimensions. The proposed asKAN is implemented iteratively without increasing the number of neurons in the original KAN. The method is validated through function fitting, solving the Poisson equation, and reconstructing a sound field. Compared with KAN, asKAN significantly reduces the error using the same network architecture. The results suggest that asKAN enhances the capability of KAN in fitting and solving equations that take the form of ridge functions.
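Active subspace detection itself is standard: eigen-decompose the average outer product of gradient samples, and for a ridge function f(x) = g(w·x) the leading eigenvector recovers the ridge direction. A minimal sketch of that detection step (not asKAN's full iterative pipeline):

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.array([3.0, 4.0]) / 5.0            # true ridge direction (unit norm)

X = rng.normal(size=(500, 2))
# f(x) = sin(w·x), so grad f(x) = cos(w·x) * w: every gradient is parallel
# to the ridge direction w.
grads = np.cos(X @ w)[:, None] * w[None, :]

C = grads.T @ grads / len(X)              # average gradient outer product
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]                        # leading eigenvector
print(np.abs(v @ w))                      # ≈ 1: ridge direction recovered
```

In asKAN this detected direction would define the projection inserted between successive KANs; here the gradients are exact, whereas in practice they would be estimated from the trained network.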

Updated: 2025-04-07 01:43:13

Categories: physics.comp-ph,cs.LG

Download: http://arxiv.org/abs/2504.04669v1

Discovery and inversion of the viscoelastic wave equation in inhomogeneous media

In scientific machine learning, the task of identifying partial differential equations accurately from sparse and noisy data poses a significant challenge. Current sparse regression methods may identify inaccurate equations on sparse and noisy datasets and are not suitable for varying coefficients. To address this issue, we propose a hybrid framework that combines two alternating direction optimization phases: discovery and embedding. The discovery phase employs current well-developed sparse regression techniques to preliminarily identify governing equations from observations. The embedding phase implements a recurrent convolutional neural network (RCNN), enabling efficient processes for time-space iterations involved in discretized forms of wave equation. The RCNN model further optimizes the imperfect sparse regression results to obtain more accurate functional terms and coefficients. Through alternating update of discovery-embedding phases, essential physical equations can be robustly identified from noisy and low-resolution measurements. To assess the performance of proposed framework, numerical experiments are conducted on various scenarios involving wave equation in elastic/viscoelastic and homogeneous/inhomogeneous media. The results demonstrate that the proposed method exhibits excellent robustness and accuracy, even when faced with high levels of noise and limited data availability in both spatial and temporal domains.
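The "well-developed sparse regression techniques" used in the discovery phase are typified by sequential thresholded least squares (STLSQ, as in SINDy-style identification). A compact sketch on synthetic data (the candidate library and threshold are illustrative):

```python
import numpy as np

def stlsq(Theta, y, threshold=0.1, iters=10):
    """Sequential thresholded least squares: repeatedly prune small
    coefficients and refit the remaining terms, yielding a sparse model."""
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], y, rcond=None)[0]
    return xi

rng = np.random.default_rng(4)
Theta = rng.normal(size=(200, 5))          # candidate library of PDE terms
true_xi = np.array([0.0, 2.0, 0.0, -1.5, 0.0])
y = Theta @ true_xi + 0.01 * rng.normal(size=200)

xi = stlsq(Theta, y)
print(np.round(xi, 2))                     # only the two active terms survive
```

The paper's contribution is what happens when this step fails on noisy, low-resolution data: the RCNN embedding phase refines the imperfect `xi` rather than trusting it outright.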

Updated: 2025-04-07 01:39:29

Categories: cs.LG,physics.geo-ph

Download: http://arxiv.org/abs/2409.18370v2

Interval-Valued Time Series Classification Using $D_K$-Distance

In recent years, modeling and analysis of interval-valued time series have garnered increasing attention in econometrics, finance, and statistics. However, these studies have predominantly focused on statistical inference for forecasting univariate and multivariate interval-valued time series, overlooking another important aspect: classification. In this paper, we introduce a classification approach that treats intervals as unified entities, applicable to both univariate and multivariate interval-valued time series. Specifically, we first extend point-valued time series imaging methods to interval-valued scenarios using the $D_K$-distance, enabling the imaging of interval-valued time series. Then, we employ a suitable deep learning model for classification on the obtained imaging dataset, achieving classification for interval-valued time series. Theoretically, we derive a sharper excess risk bound for deep multiclass classifiers based on offset Rademacher complexity. Finally, we validate the superiority of the proposed method through comparisons with various existing point-valued time series classification methods in both simulation studies and real data applications.
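The exact $D_K$-distance depends on the chosen kernel $K$; for intervals, such kernel distances commonly reduce to a weighted combination of differences in interval centers and radii. A hedged sketch of that special case (the parameter `theta` is our illustrative kernel weight, not the paper's):

```python
import math

def interval_dist(a, b, theta=1.0):
    """A kernel-type distance between intervals a=(lo, hi) and b=(lo, hi):
    combines the difference in centers with a theta-weighted difference in
    radii. One common special case of kernel distances on interval data;
    the paper's D_K-distance is defined via its kernel K."""
    ca, ra = (a[0] + a[1]) / 2, (a[1] - a[0]) / 2
    cb, rb = (b[0] + b[1]) / 2, (b[1] - b[0]) / 2
    return math.sqrt((ca - cb) ** 2 + theta * (ra - rb) ** 2)

print(interval_dist((0, 2), (0, 2)))   # identical intervals -> 0.0
print(interval_dist((0, 2), (3, 5)))   # same radius, center shifted by 3 -> 3.0
```

Treating the interval as one entity through such a distance, rather than classifying endpoints separately, is what lets the imaging step carry over from point-valued series.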

Updated: 2025-04-07 01:31:31

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2504.04667v1

Design of AI-Powered Tool for Self-Regulation Support in Programming Education

Large Language Model (LLM) tools have demonstrated their potential to deliver high-quality assistance by providing instant, personalized feedback that is crucial for effective programming education. However, many of these tools operate independently of institutional Learning Management Systems, which creates a significant disconnect. This isolation limits the ability to leverage learning materials and exercise context for generating tailored, context-aware feedback. Furthermore, previous research on self-regulated learning and LLM support mainly focused on knowledge acquisition, not the development of important self-regulation skills. To address these challenges, we developed CodeRunner Agent, an LLM-based programming assistant that integrates CodeRunner, a Moodle plugin that executes student-submitted code and performs automated grading. CodeRunner Agent empowers educators to customize AI-generated feedback by incorporating detailed context from lecture materials, programming questions, student answers, and execution results. Additionally, it enhances students' self-regulated learning by providing strategy-based AI responses. This integrated, context-aware, and skill-focused approach offers promising avenues for data-driven improvements in programming education.

Updated: 2025-04-07 01:30:12

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2504.03068v2

A Simultaneous Approach for Training Neural Differential-Algebraic Systems of Equations

Scientific machine learning is an emerging field that broadly describes the combination of scientific computing and machine learning to address challenges in science and engineering. Within the context of differential equations, this has produced highly influential methods, such as neural ordinary differential equations (NODEs). Recent works extend this line of research to consider neural differential-algebraic systems of equations (DAEs), where some unknown relationships within the DAE are learned from data. Training neural DAEs, similarly to neural ODEs, is computationally expensive, as it requires the solution of a DAE for every parameter update. Further, the rigorous consideration of algebraic constraints is difficult within common deep learning training algorithms such as stochastic gradient descent. In this work, we apply the simultaneous approach to neural DAE problems, resulting in a fully discretized nonlinear optimization problem, which is solved to local optimality and simultaneously obtains the neural network parameters and the solution to the corresponding DAE. We extend recent work demonstrating the simultaneous approach for neural ODEs, by presenting a general framework to solve neural DAEs, with explicit consideration of hybrid models, where some components of the DAE are known, e.g. physics-informed constraints. Furthermore, we present a general strategy for improving the performance and convergence of the nonlinear programming solver, based on solving an auxiliary problem for initialization and approximating Hessian terms. We achieve promising results in terms of accuracy, model generalizability and computational cost, across different problem settings such as sparse data, unobserved states and multiple trajectories. Lastly, we provide several promising future directions to improve the scalability and robustness of our approach.

Updated: 2025-04-07 01:26:55

Categories: cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2504.04665v1

ACE-RLHF: Automated Code Evaluation and Socratic Feedback Generation Tool using Large Language Models and Reinforcement Learning with Human Feedback

Automated program repair tools are developed to generate feedback and suggest a repair method for erroneous code. State-of-the-art (SOTA) code repair methods rely on data-driven approaches and often fail to deliver solutions for complicated programming questions. To interpret the natural language of unprecedented programming problems, using Large Language Models (LLMs) for code-feedback generation is crucial. LLMs generate more comprehensible feedback than compiler-generated error messages, and Reinforcement Learning with Human Feedback (RLHF) further enhances quality by integrating a human in the loop, which helps novice students learn programming from scratch interactively. We apply the RLHF fine-tuning technique to elicit the expected Socratic response, such as a question with a hint for solving the programming issue. We propose a code-feedback generation tool that fine-tunes an LLM with RLHF, Automated Code Evaluation with RLHF (ACE-RLHF), combining two open-source LLM models with two different SOTA optimization techniques. The quality of feedback is evaluated on two benchmark datasets containing basic and competition-level programming questions, where the latter is proposed by us. In automated evaluation, we achieve 2-5% higher accuracy than RL-free SOTA techniques using Llama-3-7B with Proximal Policy Optimization, and similar or slightly higher accuracy compared to reward-model-free RL with AI Feedback (RLAIF). In manual evaluation, we achieve almost 40% higher accuracy with GPT-3.5 and Best-of-n optimization.
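Best-of-n optimization, as used with GPT-3.5 above, simply samples several candidate responses and keeps the one a reward model scores highest. A toy sketch (the reward function below is a crude illustrative stand-in for a Socratic-style preference, not the paper's trained reward model):

```python
def best_of_n(candidates, reward_fn):
    """Best-of-n selection: score each sampled response with a reward
    model and return the highest-scoring one."""
    return max(candidates, key=reward_fn)

def toy_reward(text):
    # Hypothetical proxy for Socratic feedback: prefer guiding questions
    # and hints over blunt corrections.
    return text.count("?") + ("hint" in text.lower())

samples = [
    "Your loop is wrong. Fix line 3.",
    "What happens to `i` on the last iteration? Hint: check the bound.",
    "Try again.",
]
print(best_of_n(samples, toy_reward))
```

The same selection loop works unchanged whether the reward comes from human preference data (RLHF) or AI feedback (RLAIF); only `reward_fn` differs.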

Updated: 2025-04-07 01:11:22

Categories: cs.LG

Download: http://arxiv.org/abs/2504.04657v1

CoLa -- Learning to Interactively Collaborate with Large LMs

LLMs' remarkable ability to tackle a wide range of language tasks has opened new opportunities for collaborative human-AI problem solving. LLMs can amplify human capabilities by applying their intuitions and reasoning strategies at scale. We explore whether human guides can be simulated by generalizing from human demonstrations of guiding an AI system to solve complex language problems. We introduce CoLa, a novel self-guided learning paradigm for training automated $\textit{guides}$, and evaluate it on two QA datasets, a puzzle-solving task, and a constrained text generation task. Our empirical results show that CoLa consistently outperforms competitive approaches across all domains. Moreover, a small-sized trained guide outperforms a strong model like GPT-4 when acting as a guide. We compare the strategies employed by humans and automated guides by conducting a human study on a QA dataset. We show that automated guides outperform humans by adapting their strategies to reasoners' capabilities, and we conduct qualitative analyses highlighting distinct differences in guiding strategies.

Updated: 2025-04-07 01:08:58

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.02965v2

EquiCPI: SE(3)-Equivariant Geometric Deep Learning for Structure-Aware Prediction of Compound-Protein Interactions

Accurate prediction of compound-protein interactions (CPI) remains a cornerstone challenge in computational drug discovery. While existing sequence-based approaches leverage molecular fingerprints or graph representations, they critically overlook three-dimensional (3D) structural determinants of binding affinity. To bridge this gap, we present EquiCPI, an end-to-end geometric deep learning framework that synergizes first-principles structural modeling with SE(3)-equivariant neural networks. Our pipeline transforms raw sequences into 3D atomic coordinates via ESMFold for proteins and DiffDock-L for ligands, followed by physics-guided conformer re-ranking and equivariant feature learning. At its core, EquiCPI employs SE(3)-equivariant message passing over atomic point clouds, preserving symmetry under rotations, translations, and reflections, while hierarchically encoding local interaction patterns through tensor products of spherical harmonics. Evaluated on BindingDB (affinity prediction) and DUD-E (virtual screening), EquiCPI achieves performance on par with or exceeding state-of-the-art deep learning competitors.
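A scalar prediction from an SE(3)-equivariant model (such as a binding affinity) must be invariant when the input point cloud is rotated and translated. A minimal invariance check using the simplest invariant featurization, pairwise distances (not EquiCPI's spherical-harmonic features):

```python
import numpy as np

def pairwise_distance_features(coords):
    """Pairwise Euclidean distances: the simplest E(3)-invariant
    featurization of an atomic point cloud."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

rng = np.random.default_rng(5)
coords = rng.normal(size=(10, 3))

# Apply a random orthogonal transform (via QR) plus a translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
moved = coords @ Q.T + np.array([1.0, -2.0, 0.5])

d0 = pairwise_distance_features(coords)
d1 = pairwise_distance_features(moved)
print(np.abs(d0 - d1).max())          # ≈ 0: features unchanged
```

Equivariant message passing generalizes this: internal features are allowed to rotate with the input, while scalar outputs stay invariant, which is exactly the property the check above probes.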

Updated: 2025-04-07 00:57:08

Categories: cs.LG,cs.AI,q-bio.QM

Download: http://arxiv.org/abs/2504.04654v1

H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables

Tabular reasoning involves interpreting natural language queries about tabular data, which presents a unique challenge of combining language understanding with structured data analysis. Existing methods employ either textual reasoning, which excels in semantic interpretation but struggles with mathematical operations, or symbolic reasoning, which handles computations well but lacks semantic understanding. This paper introduces a novel algorithm H-STAR that integrates both symbolic and semantic (textual) approaches in a two-stage process to address these limitations. H-STAR employs: (1) step-wise table extraction using `multi-view' column retrieval followed by row extraction, and (2) adaptive reasoning that adapts reasoning strategies based on question types, utilizing semantic reasoning for direct lookup and complex lexical queries while augmenting textual reasoning with symbolic reasoning support for quantitative and logical tasks. Our extensive experiments demonstrate that H-STAR significantly outperforms state-of-the-art methods across three tabular question-answering (QA) and fact-verification datasets, underscoring its effectiveness and efficiency.
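The adaptive reasoning step can be caricatured as a question-type router. The keyword rules below are purely illustrative assumptions, not the paper's LLM-driven classification:

```python
def route(question):
    """Toy router in the spirit of H-STAR: send quantitative or logical
    questions to symbolic (SQL-style) reasoning, and direct-lookup or
    lexical questions to textual reasoning."""
    symbolic_cues = ("how many", "average", "sum", "more than", "count")
    if any(cue in question.lower() for cue in symbolic_cues):
        return "symbolic"
    return "textual"

print(route("How many rows have revenue more than 10?"))  # symbolic
print(route("Which team is nicknamed 'the Reds'?"))       # textual
```

In H-STAR the routing decision also controls how the table itself is pruned (column retrieval, then row extraction) before either reasoning mode runs.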

Updated: 2025-04-07 00:44:34

Categories: cs.DB,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2407.05952v3

Sub-Clustering for Class Distance Recalculation in Long-Tailed Drug Classification

In the real world, long-tailed data distributions are prevalent, making it challenging for models to effectively learn and classify tail classes. However, we discover that in the field of drug chemistry, certain tail classes exhibit higher identifiability during training due to their unique molecular structural features, a finding that significantly contrasts with the conventional understanding that tail classes are generally difficult to identify. Existing imbalance learning methods, such as resampling and cost-sensitive reweighting, overly rely on sample quantity priors, causing models to excessively focus on tail classes at the expense of head class performance. To address this issue, we propose a novel method that breaks away from the traditional static evaluation paradigm based on sample size. Instead, we establish a dynamical inter-class separability metric using feature distances between different classes. Specifically, we employ a sub-clustering contrastive learning approach to thoroughly learn the embedding features of each class, and we dynamically compute the distances between class embeddings to capture the relative positional evolution of samples from different classes in the feature space, thereby rebalancing the weights of the classification loss function. We conducted experiments on multiple existing long-tailed drug datasets and achieved competitive results by improving the accuracy of tail classes without compromising the performance of dominant classes.
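The dynamical inter-class separability idea, weighting classes by how close their embeddings sit to other classes rather than by sample count, can be sketched with centroid distances (a simplification of the paper's sub-cluster distances; the function and weighting rule are ours):

```python
import numpy as np

def separability_weights(embeddings, labels):
    """Weight each class by the inverse distance from its embedding
    centroid to the nearest other class centroid: classes that are hard
    to separate get larger loss weights, regardless of sample count."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(0) for c in classes])
    d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nearest = d.min(1)                           # distance to closest class
    w = 1.0 / (nearest + 1e-8)
    return classes, w / w.sum() * len(classes)   # normalize to mean 1

rng = np.random.default_rng(6)
emb = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # class 0, near class 1
                 rng.normal(0.5, 0.1, (20, 2)),   # class 1
                 rng.normal(5.0, 0.1, (20, 2))])  # class 2, well separated
lab = np.repeat([0, 1, 2], 20)
classes, w = separability_weights(emb, lab)
print(np.round(w, 2))   # the two entangled classes outweigh the distant one
```

The weights would be recomputed as the embeddings evolve during training, which is what makes the metric "dynamical" rather than a fixed sample-size prior.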

Updated: 2025-04-07 00:09:10

Categories: cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2504.04647v1

By Xinhai (Sean) Zou.