Dual-level Mixup for Graph Few-shot Learning with Fewer Tasks
Graph neural networks have been demonstrated as a powerful paradigm for effectively learning graph-structured data on the web and mining content from it. Current leading graph models require a large number of labeled samples for training, which unavoidably leads to overfitting in few-shot scenarios. Recent research has sought to alleviate this issue by simultaneously leveraging graph learning and meta-learning paradigms. However, these graph meta-learning models assume the availability of numerous meta-training tasks to learn transferable meta-knowledge. Such an assumption may not be feasible in the real world due to the difficulty of constructing tasks and the substantial costs involved. Therefore, we propose a SiMple yet effectIve approach for graph few-shot Learning with fEwer tasks, named SMILE. We introduce a dual-level mixup strategy, encompassing both within-task and across-task mixup, to simultaneously enrich the available nodes and tasks in meta-learning. Moreover, we explicitly leverage the prior information provided by the node degrees in the graph to encode expressive node representations. Theoretically, we demonstrate that SMILE can enhance the model's generalization ability. Empirically, SMILE consistently outperforms other competitive models by a large margin on all evaluated datasets, under both in-domain and cross-domain settings. Our anonymous code can be found here.
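The dual-level mixup idea can be illustrated with a minimal sketch. The Beta-distributed mixing coefficient, embedding dimension, and 3-way task setup below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(a, b, alpha=0.5):
    """Convex combination with a Beta-sampled coefficient (standard mixup)."""
    lam = rng.beta(alpha, alpha)
    return lam * a + (1.0 - lam) * b, lam

# Within-task mixup: blend two support-node embeddings from the same task.
x_i = rng.normal(size=8)
x_j = rng.normal(size=8)
x_mix, lam = mixup(x_i, x_j)

# Across-task mixup: blend class prototypes drawn from two different
# meta-training tasks, effectively synthesizing a new task.
proto_task_a = rng.normal(size=(3, 8))  # hypothetical 3-way task, 8-dim prototypes
proto_task_b = rng.normal(size=(3, 8))
proto_mix, _ = mixup(proto_task_a, proto_task_b)

print(x_mix.shape, proto_mix.shape)  # (8,) (3, 8)
```

Both levels reuse the same interpolation; the first enriches nodes within a task, the second enlarges the pool of meta-training tasks.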
Updated: 2025-02-19 23:59:05
Categories: cs.LG,cs.SI
Secret extraction attacks against obfuscated IQP circuits
Quantum computing devices can now perform sampling tasks which, according to complexity-theoretic and numerical evidence, are beyond the reach of classical computers. This raises the question of how one can efficiently verify that a quantum computer operating in this regime works as intended. In 2008, Shepherd and Bremner proposed a protocol in which a verifier constructs a unitary from the comparatively easy-to-implement family of so-called IQP circuits, and challenges a prover to execute it on a quantum computer. The challenge problem is designed to contain an obfuscated secret, which can be turned into a statistical test that accepts samples from a correct quantum implementation. It was conjectured that extracting the secret from the challenge problem is NP-hard, so that the ability to pass the test constitutes strong evidence that the prover possesses a quantum device and that it works as claimed. Unfortunately, about a decade later, Kahanamoku-Meyer found an efficient classical secret extraction attack. Bremner, Cheng, and Ji very recently followed up by constructing a wide-ranging generalization of the original protocol. Their IQP Stabilizer Scheme has been explicitly designed to circumvent the known weakness. They also suggested that the original construction can be made secure by adjusting the problem parameters. In this work, we develop a number of secret extraction attacks which are effective against both new approaches in a wide range of problem parameters. In particular, we find multiple ways to recover the 300-bit secret hidden in a challenge data set published by Bremner, Cheng, and Ji. The important problem of finding an efficient and reliable verification protocol for sampling-based proofs of quantum supremacy thus remains open.
Updated: 2025-02-19 23:35:46
Categories: quant-ph,cs.CR
STAR: A Simple Training-free Approach for Recommendations using Large Language Models
Recent progress in large language models (LLMs) offers promising new approaches for recommendation system tasks. While the current state-of-the-art methods rely on fine-tuning LLMs to achieve optimal results, this process is costly and introduces significant engineering complexities. Conversely, methods that directly use LLMs without additional fine-tuning result in a large drop in recommendation quality, often due to the inability to capture collaborative information. In this paper, we propose a Simple Training-free Approach for Recommendation (STAR), a framework that utilizes LLMs and can be applied to various recommendation tasks without the need for fine-tuning, while maintaining high-quality recommendation performance. Our approach involves a retrieval stage that uses semantic embeddings from LLMs combined with collaborative user information to retrieve candidate items. We then apply an LLM for pairwise ranking to enhance next-item prediction. Experimental results on the Amazon Review dataset show competitive performance for next-item prediction, even with our retrieval stage alone. Our full method achieves Hits@10 performance of +23.8% on Beauty, +37.5% on Toys & Games, and -1.8% on Sports & Outdoors relative to the best supervised models. This framework offers an effective alternative to traditional supervised models, highlighting the potential of LLMs in recommendation systems without extensive training or custom architectures.
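The retrieval stage described above can be sketched as a blended score over candidate items. The cosine/co-occurrence blend, the `weight` parameter, and the item names below are hypothetical; the abstract only states that semantic embeddings are combined with collaborative information:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(query_emb, item_emb, cooccur, weight=0.5):
    """Blend LLM-embedding similarity with a collaborative signal
    (e.g., normalized co-occurrence counts from user histories)."""
    return weight * cosine(query_emb, item_emb) + (1 - weight) * cooccur

rng = np.random.default_rng(1)
query = rng.normal(size=16)                               # embedding of the user's recent items
items = {f"item{k}": rng.normal(size=16) for k in range(3)}
cooccurrence = {"item0": 0.9, "item1": 0.1, "item2": 0.4}  # hypothetical normalized counts

ranked = sorted(items, key=lambda k: score(query, items[k], cooccurrence[k]), reverse=True)
print(ranked)
```

The top-ranked candidates would then be passed to an LLM for pairwise ranking, per the paper's second stage.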
Updated: 2025-02-19 23:34:29
Categories: cs.IR,cs.AI,cs.LG
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. Inspired by the human learning mechanism, we introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment," designed to equip MLLMs with these additional capabilities. Our approach involves the creation of two supplementary training tasks, GenQA and EvalQA, aimed at fostering the skills of asking and assessing questions in the context of images. To develop the questioning ability, we compile a comprehensive set of multimodal foundational tasks. For assessment, we introduce a new benchmark called EvalQABench, comprising 64,000 training samples (split evenly between positive and negative samples) and 5,000 validation and testing samples. We posit that enhancing MLLMs with the capabilities to answer, ask, and assess questions will enhance their multimodal comprehension, ultimately improving overall performance. To validate this hypothesis, we train MLLMs using the LOVA3 framework and evaluate them on a range of multimodal datasets and benchmarks. Our results demonstrate consistent performance gains, underscoring the critical role of these additional tasks in fostering comprehensive intelligence in MLLMs. The code is available at https://github.com/showlab/LOVA3.
Updated: 2025-02-19 23:28:40
Categories: cs.CV,cs.AI,cs.CL
PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery
Vision-Language Models (VLMs) in visual question answering (VQA) offer a unique opportunity to enhance intra-operative decision-making, promote intuitive interactions, and significantly advance surgical education. However, the development of VLMs for surgical VQA is challenging due to limited datasets and the risk of overfitting and catastrophic forgetting during full fine-tuning of pretrained weights. While parameter-efficient techniques like Low-Rank Adaptation (LoRA) and Matrix of Rank Adaptation (MoRA) address adaptation challenges, their uniform parameter distribution overlooks the feature hierarchy in deep networks, where earlier layers, which learn general features, require more parameters than later ones. This work introduces PitVQA++ with an open-ended PitVQA dataset and vector matrix-low-rank adaptation (Vector-MoLoRA), an innovative VLM fine-tuning approach for adapting GPT-2 to pituitary surgery. Open-Ended PitVQA comprises 101,803 frames from 25 procedural videos with 745,972 question-answer sentence pairs, covering key surgical elements such as phase and step recognition, context understanding, tool detection, localization, and interaction recognition. Vector-MoLoRA incorporates the principles of LoRA and MoRA to develop a matrix-low-rank adaptation strategy that employs vector ranking to allocate more parameters to earlier layers, gradually reducing them in the later layers. Our approach, validated on the Open-Ended PitVQA and EndoVis18-VQA datasets, effectively mitigates catastrophic forgetting while significantly enhancing performance over recent baselines. Furthermore, our risk-coverage analysis highlights its enhanced reliability and trustworthiness in handling uncertain predictions. Our source code and dataset are available at \url{https://github.com/HRL-Mike/PitVQA-Plus}.
Updated: 2025-02-19 23:28:39
Categories: cs.CV,cs.AI
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to effectively manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation. The code is available at https://github.com/showlab/WorldGUI.
Updated: 2025-02-19 23:27:05
Categories: cs.AI,cs.MA
Learning the P2D Model for Lithium-Ion Batteries with SOH Detection
Lithium-ion batteries are widely used in many applications. Battery management systems control their optimal use and charging and predict when the battery will cease to deliver the required output on a planned duty or driving cycle. Such systems use a simulation of a mathematical model of battery performance. These models can be electrochemical or data-driven. Electrochemical models for batteries running at high currents are mathematically and computationally complex. In this work, we show that a well-regarded electrochemical model, the Pseudo Two Dimensional (P2D) model, can be replaced by a computationally efficient Convolutional Neural Network (CNN) surrogate model fit to accurately simulated data from a class of random driving cycles. We demonstrate that a CNN is an ideal choice for accurately capturing lithium-ion concentration profiles. Additionally, we show how the neural network model can be adjusted to correspond to battery changes in State of Health (SOH).
Updated: 2025-02-19 23:17:30
Categories: cs.LG,physics.chem-ph,65M99 (Primary) 68T07, 78A57 (Secondary),I.2.6; J.2
Efficient and Optimal Policy Gradient Algorithm for Corrupted Multi-armed Bandits
In this paper, we consider the stochastic multi-armed bandits problem with adversarial corruptions, where the random rewards of the arms are partially modified by an adversary to fool the algorithm. We apply the policy gradient algorithm SAMBA to this setting, and show that it is computationally efficient, and achieves a state-of-the-art $O(K\log T/\Delta) + O(C/\Delta)$ regret upper bound, where $K$ is the number of arms, $C$ is the unknown corruption level, $\Delta$ is the minimum expected reward gap between the best arm and other ones, and $T$ is the time horizon. Compared with the best existing efficient algorithm (e.g., CBARBAR), whose regret upper bound is $O(K\log^2 T/\Delta) + O(C)$, we show that SAMBA reduces one $\log T$ factor in the regret bound, while maintaining the corruption-dependent term to be linear with $C$. This is indeed asymptotically optimal. We also conduct simulations to demonstrate the effectiveness of SAMBA, and the results show that SAMBA outperforms existing baselines.
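The size of the saved $\log T$ factor is easy to quantify: dropping constants, the ratio of the two leading regret terms is exactly $\log T$. A quick numerical check (with arbitrary illustrative values of $K$ and $\Delta$):

```python
import math

# Compare the dominant regret factors: SAMBA's K*log(T)/Delta versus
# CBARBAR's K*log(T)^2/Delta. Constants are dropped, so the ratio is the
# extra log(T) factor that SAMBA removes. K and Delta are illustrative.
K, Delta = 10, 0.1
for T in (10**4, 10**6, 10**8):
    samba = K * math.log(T) / Delta
    cbarbar = K * math.log(T) ** 2 / Delta
    print(f"T={T:.0e}: extra factor = {cbarbar / samba:.1f}")
```

At $T = 10^8$ the removed factor is already about 18, and it keeps growing with the horizon.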
Updated: 2025-02-19 23:16:18
Categories: cs.LG
Multi-Agent Risks from Advanced AI
The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI.
Updated: 2025-02-19 23:03:21
Categories: cs.MA,cs.AI,cs.CY,cs.ET,cs.LG
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning
Contemporary embodied agents powered by large language models (LLMs), such as Voyager, have shown promising capabilities in individual learning within open-ended environments like Minecraft. However, when powered by open LLMs, they struggle with basic tasks even after domain-specific fine-tuning. We present MindForge, a generative-agent framework for collaborative lifelong learning through explicit perspective taking. We introduce three key innovations: (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural interagent communication; and (3) a multicomponent memory system. In Minecraft experiments, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts in basic tasks where traditional Voyager fails without GPT-4, collecting $2.3\times$ more unique items and achieving $3\times$ more tech-tree milestones, advancing from basic wood tools to advanced iron equipment. MindForge agents demonstrate sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated collaborative experiences. MindForge advances the democratization of embodied AI development through open-ended social learning, enabling peer-to-peer knowledge sharing.
Updated: 2025-02-19 22:59:28
Categories: cs.AI,cs.CL
Towards Quantum Tensor Decomposition in Biomedical Applications
Tensor decomposition has emerged as a powerful framework for feature extraction in multi-modal biomedical data. In this review, we present a comprehensive analysis of tensor decomposition methods, such as Tucker, CANDECOMP/PARAFAC, and spiked tensor decomposition, and their diverse applications across biomedical domains such as imaging, multi-omics, and spatial transcriptomics. To systematically investigate the literature, we applied a topic modeling-based approach that identifies and groups distinct thematic sub-areas in biomedicine where tensor decomposition has been used, thereby revealing key trends and research directions. We evaluated challenges related to the scalability of latent spaces along with obtaining the optimal rank of the tensor, which often hinder the extraction of meaningful features from increasingly large and complex datasets. Additionally, we discuss recent advances in quantum algorithms for tensor decomposition, exploring how quantum computing can be leveraged to address these challenges. Our study includes a preliminary resource estimation analysis for quantum computing platforms and examines the feasibility of implementing quantum-enhanced tensor decomposition methods on near-term quantum devices. Collectively, this review not only synthesizes current applications and challenges of tensor decomposition in biomedical analyses but also outlines promising quantum computing strategies to enhance its impact on deriving actionable insights from complex biomedical data.
Updated: 2025-02-19 22:52:44
Categories: q-bio.QM,cs.LG
Algorithmic Content Selection and the Impact of User Disengagement
Digital services face a fundamental trade-off in content selection: they must balance the immediate revenue gained from high-reward content against the long-term benefits of maintaining user engagement. Traditional multi-armed bandit models assume that users remain perpetually engaged, failing to capture the possibility that users may disengage when dissatisfied, thereby reducing future revenue potential. In this work, we introduce a model for the content selection problem that explicitly accounts for variable user engagement and disengagement. In our framework, content that maximizes immediate reward is not necessarily optimal in terms of fostering sustained user engagement. Our contributions are twofold. First, we develop computational and statistical methods for offline optimization and online learning of content selection policies. For users whose engagement patterns are defined by $k$ distinct levels, we design a dynamic programming algorithm that computes the exact optimal policy in $O(k^2)$ time. Moreover, we derive no-regret learning guarantees for an online learning setting in which the platform serves a series of users with unknown and potentially adversarial engagement patterns. Second, we introduce the concept of modified demand elasticity which captures how small changes in a user's overall satisfaction affect the platform's ability to secure long-term revenue. This notion generalizes classical demand elasticity by incorporating the dynamics of user re-engagement, thereby revealing key insights into the interplay between engagement and revenue. Notably, our analysis uncovers a counterintuitive phenomenon: although higher friction (i.e., a reduced likelihood of re-engagement) typically lowers overall revenue, it can simultaneously lead to higher user engagement under optimal content selection policies.
Updated: 2025-02-19 22:50:47
Categories: cs.LG
Vector-ICL: In-context Learning with Continuous Vector Representations
Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities on textual data. We explore whether these capabilities can be extended to continuous vectors from diverse domains, obtained from black-box pretrained encoders. By aligning input data with an LLM's embedding space through lightweight projectors, we observe that LLMs can effectively process and learn from these projected vectors, which we term Vector-ICL. In particular, we find that pretraining projectors with general language modeling objectives enables Vector-ICL, while task-specific finetuning further enhances performance. In our experiments across various tasks and modalities, including text reconstruction, numerical function regression, text classification, summarization, molecule captioning, time-series classification, graph classification, and fMRI decoding, Vector-ICL often surpasses both few-shot ICL and domain-specific model or tuning. We further conduct analyses and case studies, indicating the potential of LLMs to process vector representations beyond traditional token-based paradigms.
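The "lightweight projector" described above can be sketched as a single learned linear map from the encoder's output space into the LLM's input-embedding space. The widths below (384-dim encoder, 4096-dim LLM embedding) are illustrative assumptions, as is the use of a plain linear layer:

```python
import numpy as np

rng = np.random.default_rng(2)

class Projector:
    """A minimal linear projector aligning a black-box encoder's output
    with an LLM's embedding space; in Vector-ICL this would be trained,
    e.g., with a language-modeling objective."""

    def __init__(self, d_in, d_out):
        self.W = rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, v):
        return v @ self.W + self.b

encoder_vec = rng.normal(size=384)   # output of a frozen, black-box encoder
proj = Projector(384, 4096)
soft_token = proj(encoder_vec)       # injected into the LLM as one input embedding
print(soft_token.shape)              # (4096,)
```

The projected vector is then placed in the prompt alongside ordinary token embeddings, which is what lets the LLM do in-context learning over non-textual inputs.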
Updated: 2025-02-19 22:48:13
Categories: cs.CL,cs.AI
CoRRECT: A Deep Unfolding Framework for Motion-Corrected Quantitative R2* Mapping
Quantitative MRI (qMRI) refers to a class of MRI methods for quantifying the spatial distribution of biological tissue parameters. Traditional qMRI methods usually deal separately with artifacts arising from accelerated data acquisition, involuntary physical motion, and magnetic-field inhomogeneities, leading to suboptimal end-to-end performance. This paper presents CoRRECT, a unified deep unfolding (DU) framework for qMRI consisting of a model-based end-to-end neural network, a method for motion-artifact reduction, and a self-supervised learning scheme. The network is trained to produce R2* maps whose k-space data matches the real data by also accounting for motion and field inhomogeneities. When deployed, CoRRECT only uses the k-space data without any pre-computed parameters for motion or inhomogeneity correction. Our results on experimentally collected multi-Gradient-Recalled Echo (mGRE) MRI data show that CoRRECT recovers motion and inhomogeneity artifact-free R2* maps in highly accelerated acquisition settings. This work opens the door to DU methods that can integrate physical measurement models, biophysical signal models, and learned prior models for high-quality qMRI.
Updated: 2025-02-19 22:45:36
Categories: eess.IV,cs.CV,cs.LG
Cluster Analysis and Concept Drift Detection in Malware
Concept drift refers to gradual or sudden changes in the properties of data that affect the accuracy of machine learning models. In this paper, we address the problem of concept drift detection in the malware domain. Specifically, we propose and analyze a clustering-based approach to detecting concept drift. Using a subset of the KronoDroid dataset, malware samples are partitioned into temporal batches and analyzed using MiniBatch $K$-Means clustering. The silhouette coefficient is used as a metric to identify points in time where concept drift has likely occurred. To verify our drift detection results, we train learning models under three realistic scenarios, which we refer to as static training, periodic retraining, and drift-aware retraining. In each scenario, we consider four supervised classifiers, namely, Multilayer Perceptron (MLP), Support Vector Machine (SVM), Random Forest, and XGBoost. Experimental results demonstrate that drift-aware retraining guided by silhouette coefficient thresholding achieves classification accuracy far superior to static models, and generally within 1% of periodic retraining, while also being far more efficient than periodic retraining. These results provide strong evidence that our clustering-based approach is effective at detecting concept drift, while also illustrating a highly practical and efficient fully automated approach to improved malware classification via concept drift detection.
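The silhouette-thresholding idea can be sketched on synthetic data. For brevity this sketch computes the silhouette from known cluster labels rather than running MiniBatch $K$-Means, and the threshold, drift magnitudes, and two-feature "families" are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def silhouette(X, labels):
    """Mean silhouette coefficient, computed directly from its definition."""
    vals = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        own = labels == labels[i]
        a = d[own].sum() / max(own.sum() - 1, 1)  # mean intra-cluster distance
        b = min(d[labels == c].mean()             # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

def batch(drift):
    """Two synthetic malware 'families'; `drift` slides one toward the other."""
    c0 = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
    c1 = rng.normal(loc=4.0 - drift, scale=0.3, size=(30, 2))
    return np.vstack([c0, c1]), np.array([0] * 30 + [1] * 30)

THRESHOLD = 0.6  # illustrative cutoff, not the paper's tuned value
for t, drift in enumerate([0.0, 0.2, 3.5]):
    s = silhouette(*batch(drift))
    status = "drift suspected" if s < THRESHOLD else "stable"
    print(f"batch {t}: silhouette={s:.2f} ({status})")
```

As the families merge, cluster cohesion/separation degrades, the silhouette drops below the threshold, and that batch would trigger drift-aware retraining.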
Updated: 2025-02-19 22:42:30
Categories: cs.LG,cs.CR
Scaling Trends in Language Model Robustness
Language models exhibit scaling laws, whereby increasing model and dataset size predictably decreases negative log likelihood, unlocking a dazzling array of capabilities. At the same time, even the most capable systems are currently vulnerable to adversarial inputs such as jailbreaks and prompt injections, despite concerted efforts to make them robust. As compute becomes more accessible to both attackers and defenders, which side will benefit more from scale? We attempt to answer this question with a detailed study of robustness on language models spanning three orders of magnitude in parameter count. From the defender's perspective, we find that in the absence of other interventions, increasing model size alone does not consistently improve robustness. In adversarial training, we find that larger models are more sample-efficient and less compute-efficient than smaller models, and often better generalize their defense to new threat models. From the attacker's perspective, we find that increasing attack compute smoothly and reliably increases attack success rate against both finetuned and adversarially trained models. Finally, we show that across model sizes studied, doubling compute on adversarial training only forces an attacker to less than double attack compute to maintain the same attack success rate. However, adversarial training becomes more and more effective on larger models, suggesting that defenders could eventually have the advantage with increasing model size. These results underscore the value of adopting a scaling lens when discussing robustness of frontier models.
Updated: 2025-02-19 22:32:47
领域: cs.LG,cs.AI,cs.CL,cs.CR,I.2.7
Can Community Notes Replace Professional Fact-Checkers?
Two commonly employed strategies to combat the rise of misinformation on social media are (i) fact-checking by professional organisations and (ii) community moderation by platform users. Policy changes by Twitter/X and, more recently, Meta signal a shift away from partnerships with fact-checking organisations and towards an increased reliance on crowdsourced community notes. However, the extent and nature of dependencies between fact-checking and helpful community notes remain unclear. To address these questions, we use language models to annotate a large corpus of Twitter/X community notes with attributes such as topic, cited sources, and whether they refute claims tied to broader misinformation narratives. Our analysis reveals that community notes cite fact-checking sources up to five times more than previously reported. Fact-checking is especially crucial for notes on posts linked to broader narratives, which are twice as likely to reference fact-checking sources compared to other sources. In conclusion, our results show that successful community moderation heavily relies on professional fact-checking.
Updated: 2025-02-19 22:26:39
Subjects: cs.CL,cs.AI
Gradients can train reward models: An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model
We study the problem of estimating Dynamic Discrete Choice (DDC) models, known in machine learning as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL). The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing an Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks, and therefore has the potential to scale to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
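One ingredient of the abstract, solving the entropy-regularized (soft) Bellman equation from sampled next states rather than an estimated transition matrix, can be illustrated on a toy problem. The tiny deterministic MDP, known reward values, and damped fixed-point update below are hypothetical stand-ins; the paper's actual method instead runs gradient descent on an ERM objective to recover rewards from behavior data.

```python
import numpy as np

# Toy deterministic MDP: 3 states, 2 actions, next state = (s + a) % 3.
# The reward r is assumed known here purely for illustration.
n_s, n_a, gamma = 3, 2, 0.9
r = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
nxt = np.array([[(s + a) % n_s for a in range(n_a)] for s in range(n_s)])

def soft_v(Q):
    # entropy-regularized (log-sum-exp) state value
    return np.log(np.exp(Q).sum(axis=1))

Q = np.zeros((n_s, n_a))
for _ in range(3000):
    # Bellman target built from observed next states only --
    # no transition probability matrix is ever formed.
    target = r + gamma * soft_v(Q)[nxt]
    Q -= 0.5 * (Q - target)   # damped fixed-point step

residual = np.abs(Q - (r + gamma * soft_v(Q)[nxt])).max()
print(residual)   # the soft Bellman residual is driven to ~0
```

In the deterministic case the sampled target equals the exact one, which is what makes this transition-matrix-free update well defined.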
Updated: 2025-02-19 22:22:20
Subjects: cs.LG,cs.AI,econ.EM
Delving into ChatGPT usage in academic writing through excess vocabulary
Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question, we present an unbiased, large-scale approach: we study vocabulary changes in 14 million PubMed abstracts from 2010--2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 30% for some sub-corpora. We show that LLMs have had an unprecedented impact on the scientific literature, surpassing the effect of major world events such as the Covid pandemic.
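The excess-word analysis can be sketched as a frequency-ratio computation: compare each word's relative frequency in a given year against a pre-LLM baseline. The toy counts and the smoothing choice below are hypothetical; the study itself works over 14 million PubMed abstracts.

```python
from collections import Counter

def excess_frequency(counts_now, counts_base, total_now, total_base):
    """Ratio of a word's relative frequency now vs. a pre-LLM baseline;
    values far above 1 flag candidate 'excess' style words."""
    out = {}
    for word, c in counts_now.items():
        p_now = c / total_now
        p_base = (counts_base.get(word, 0) + 1) / (total_base + 1)  # +1 smoothing
        out[word] = p_now / p_base
    return out

# hypothetical per-year counts out of 100k tokens each
base = Counter({"delves": 2, "results": 500, "intricate": 3})
now = Counter({"delves": 60, "results": 520, "intricate": 45})
ratios = excess_frequency(now, base, total_now=100_000, total_base=100_000)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)   # style words like "delves" rise to the top
```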
Updated: 2025-02-19 22:15:24
Subjects: cs.CL,cs.AI,cs.CY,cs.DL,cs.SI
Linear Diffusion Networks: Harnessing Diffusion Processes for Global Interactions
Diffusion kernels capture global dependencies. We present Linear Diffusion Networks (LDNs), a novel architecture that reinterprets sequential data processing as a unified diffusion process. Our model integrates adaptive diffusion modules with localized nonlinear updates and a diffusion-inspired attention mechanism. This design enables efficient global information propagation while preserving fine-grained temporal details. LDN overcomes the limitations of conventional recurrent and transformer models by allowing full parallelization across time steps and supporting robust multi-scale temporal representations. Experiments on benchmark sequence modeling tasks demonstrate that LDN delivers superior performance and scalability, setting a new standard for global interaction in sequential data.
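Since the abstract does not spell out the LDN modules, the sketch below shows only the generic ingredient it builds on: a diffusion (heat) kernel $\exp(-tL)$ over a sequence, a single linear operator that mixes information across all positions at once. The path-graph Laplacian and the time parameter are illustrative choices, not the paper's architecture.

```python
import numpy as np

def diffusion_kernel(n, t=1.0):
    """Heat kernel exp(-t L) of an n-node path graph: one dense linear
    operator that propagates information across all sequence positions."""
    L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # path Laplacian
    L[0, 0] = L[-1, -1] = 1                               # endpoint degrees
    evals, evecs = np.linalg.eigh(L)
    return evecs @ np.diag(np.exp(-t * evals)) @ evecs.T

K = diffusion_kernel(8, t=2.0)
x = np.zeros(8)
x[0] = 1.0          # impulse at the first sequence position
y = K @ x           # mass spreads to every position in one application
print(y)
```

Because the kernel is a fixed matrix, applying it to every time step is trivially parallel, which is the property the abstract contrasts with recurrent models.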
Updated: 2025-02-19 22:13:55
Subjects: cs.LG
Meta-Statistical Learning: Supervised Learning of Statistical Inference
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks, where the goal is to predict properties of the data-generating distribution rather than labels for individual datapoints. These tasks encompass statistical inference problems such as parameter estimation, hypothesis testing, or mutual information estimation. Framing these tasks within traditional machine learning pipelines is challenging, as supervision is typically tied to individual datapoints. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems. In this approach, entire datasets are treated as single inputs to neural networks, which predict distribution-level parameters. Transformer-based architectures, without positional encoding, provide a natural fit due to their permutation-invariance properties. By training on large-scale synthetic datasets, meta-statistical models can leverage the scalability and optimization infrastructure of Transformer-based LLMs. We demonstrate the framework's versatility with applications in hypothesis testing and mutual information estimation, showing strong performance, particularly for small datasets where traditional neural methods struggle.
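The framing, whole datasets as inputs and distribution-level parameters as labels, can be sketched without a Transformer. The permutation-invariant summary features and linear "meta-model" below are simplified stand-ins for the set-input architecture the paper actually trains.

```python
import numpy as np

rng = np.random.default_rng(0)

def summary(x):
    # permutation-invariant pooled features of an entire dataset
    return np.array([x.mean(), x.std(), np.mean(x ** 3), 1.0])

def make_task():
    # supervision lives at the distribution level: the label is the
    # generating parameter (here, the scale), not a per-datapoint target
    scale = rng.uniform(0.5, 3.0)
    return summary(rng.normal(0.0, scale, size=50)), scale

F, y = map(np.array, zip(*[make_task() for _ in range(2000)]))
w, *_ = np.linalg.lstsq(F, y, rcond=None)   # linear meta-model

test_data = rng.normal(0.0, 2.0, size=50)    # unseen dataset, true scale 2.0
est = summary(test_data) @ w
print(est)
```

Replacing the pooled features and least squares with a permutation-invariant Transformer trained on the same synthetic tasks gives the paper's setting.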
Updated: 2025-02-19 22:12:49
Subjects: cs.LG,cs.AI
Global Ease of Living Index: a machine learning framework for longitudinal analysis of major economies
The drastic changes in the global economy, geopolitical conditions, and disruptions such as the COVID-19 pandemic have impacted the cost of living and quality of life. It is important to understand the long-term nature of the cost of living and quality of life in major economies. A transparent and comprehensive living index must include multiple dimensions of living conditions. In this study, we present an approach to quantifying the quality of life through the Global Ease of Living Index that combines various socio-economic and infrastructural factors into a single composite score. Our index utilises economic indicators that define living standards, which could help in targeted interventions to improve specific areas. We present a machine learning framework for addressing the problem of missing data for some of the economic indicators for specific countries. We then curate and update the data and use a dimensionality reduction approach (principal component analysis) to create the Ease of Living Index for major economies since 1970. Our work significantly adds to the literature by offering a practical tool for policymakers to identify areas needing improvement, such as healthcare systems, employment opportunities, and public safety. Our approach with open data and code can be easily reproduced and applied to various contexts. This transparency and accessibility make our work a valuable resource for ongoing research and policy development in quality-of-life assessment.
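The composite-score construction can be sketched with plain PCA: standardize the indicators, project onto the first principal component, and rescale. The indicator values below are hypothetical, and the study's pipeline also imputes missing data before this step.

```python
import numpy as np

def ease_of_living_index(X):
    """First principal component of standardized indicators, rescaled
    to 0-100, as a single composite living-conditions score."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    pc1 = Z @ Vt[0]
    # orient so that a higher score tracks the first indicator upward
    if np.corrcoef(pc1, X[:, 0])[0, 1] < 0:
        pc1 = -pc1
    return 100 * (pc1 - pc1.min()) / (pc1.max() - pc1.min())

# hypothetical indicators: [income, healthcare access, safety] for 5 countries
X = np.array([
    [55000, 0.90, 0.80],
    [48000, 0.85, 0.75],
    [30000, 0.60, 0.60],
    [12000, 0.40, 0.50],
    [8000,  0.30, 0.45],
], dtype=float)
scores = ease_of_living_index(X)
print(scores)
```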
Updated: 2025-02-19 21:59:23
Subjects: cs.LG,cs.AI,econ.EM,stat.AP,stat.ML
Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
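The object of study can be reproduced in a few lines: online SGD on a one-dimensional linear regression, with an EMA of the iterates maintained alongside. The learning rate, decay, and noise level below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Streaming linear regression: y = w* x + noise, one sample per step.
w_star, lr, beta, steps = 2.0, 0.1, 0.99, 5000
w, w_ema = 0.0, 0.0
for _ in range(steps):
    x = rng.normal()
    y = w_star * x + 0.5 * rng.normal()
    w -= lr * (w * x - y) * x               # SGD step on 0.5*(w*x - y)^2
    w_ema = beta * w_ema + (1 - beta) * w   # exponential moving average

# The EMA tracks w* with visibly less variance than the last iterate,
# and its initialization bias decays geometrically (factor beta per step).
print(abs(w - w_star), abs(w_ema - w_star))
```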
Updated: 2025-02-19 21:55:11
Subjects: cs.LG,math.OC,stat.ML
Multi-Objective Bayesian Optimization for Networked Black-Box Systems: A Path to Greener Profits and Smarter Designs
Designing modern industrial systems requires balancing several competing objectives, such as profitability, resilience, and sustainability, while accounting for complex interactions between technological, economic, and environmental factors. Multi-objective optimization (MOO) methods are commonly used to navigate these tradeoffs, but selecting the appropriate algorithm to tackle these problems is often unclear, particularly when system representations vary from fully equation-based (white-box) to entirely data-driven (black-box) models. While grey-box MOO methods attempt to bridge this gap, they typically impose rigid assumptions on system structure, requiring models to conform to the underlying structural assumptions of the solver rather than the solver adapting to the natural representation of the system of interest. In this chapter, we introduce a unifying approach to grey-box MOO by leveraging network representations, which provide a general and flexible framework for modeling interconnected systems as a series of function nodes that share various inputs and outputs. Specifically, we propose MOBONS, a novel Bayesian optimization-inspired algorithm that can efficiently optimize general function networks, including those with cyclic dependencies, enabling the modeling of feedback loops, recycle streams, and multi-scale simulations - features that existing methods fail to capture. Furthermore, MOBONS incorporates constraints, supports parallel evaluations, and preserves the sample efficiency of Bayesian optimization while leveraging network structure for improved scalability. We demonstrate the effectiveness of MOBONS through two case studies, including one related to sustainable process design. By enabling efficient MOO under general graph representations, MOBONS has the potential to significantly enhance the design of more profitable, resilient, and sustainable engineering systems.
Updated: 2025-02-19 21:49:05
Subjects: stat.ML,cs.AI,cs.LG
SALTY: Explainable Artificial Intelligence Guided Structural Analysis for Hardware Trojan Detection
Hardware Trojans are malicious modifications in digital designs that can be inserted by untrusted supply chain entities. Hardware Trojans can give rise to diverse attack vectors such as information leakage (e.g. MOLES Trojan) and denial-of-service (rarely triggered bit flip). Such an attack in critical systems (e.g. healthcare and aviation) can endanger human lives and lead to catastrophic financial loss. Several techniques have been developed to detect such malicious modifications in digital designs, particularly for designs sourced from third-party intellectual property (IP) vendors. However, most techniques have scalability concerns (due to unsound assumptions during evaluation) and lead to a large number of false-positive detections (false alerts). Our framework (SALTY) mitigates these concerns through the use of a novel Graph Neural Network architecture (using Jumping-Knowledge mechanism) for generating initial predictions and an Explainable Artificial Intelligence (XAI) approach for fine tuning the outcomes (post-processing). Experiments show 98% True Positive Rate (TPR) and True Negative Rate (TNR), significantly outperforming state-of-the-art techniques across a large set of standard benchmarks.
Updated: 2025-02-19 21:40:00
Subjects: cs.CR
Speech to Speech Translation with Translatotron: A State of the Art Review
Cascade-based speech-to-speech translation has long been considered the benchmark, but it is plagued by issues such as the time taken to translate speech from one language to another and compound errors. These issues arise because a cascade-based method chains together several components: speech recognition, speech-to-text translation, and finally text-to-speech synthesis. Translatotron, a sequence-to-sequence direct speech-to-speech translation model, was designed by Google to address the compound errors associated with the cascade model. Today there are three versions of the model: Translatotron 1, Translatotron 2, and Translatotron 3. The first version was designed as a proof of concept to show that direct speech-to-speech translation was possible; it was found to be less effective than the cascade model but produced promising results. Translatotron 2 was an improved version of Translatotron 1 with results similar to the cascade model. Translatotron 3, the latest version of the model, outperforms the cascade model in some respects. In this paper, a complete review of speech-to-speech translation will be presented, with a particular focus on all the versions of the Translatotron models. We will also show that Translatotron is the best model to bridge the language gap between African languages and other well-formalized languages.
Updated: 2025-02-19 21:39:35
Subjects: cs.CL,cs.AI
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning
This paper introduces BMIKE-53, a comprehensive benchmark for cross-lingual in-context knowledge editing (IKE) across 53 languages, unifying three knowledge editing (KE) datasets: zsRE, CounterFact, and WikiFactDiff. Cross-lingual KE, which requires knowledge edited in one language to generalize across others while preserving unrelated knowledge, remains underexplored. To address this gap, we systematically evaluate IKE under zero-shot, one-shot, and few-shot setups, incorporating tailored metric-specific demonstrations. Our findings reveal that model scale and demonstration alignment critically govern cross-lingual IKE efficacy, with larger models and tailored demonstrations significantly improving performance. Linguistic properties, particularly script type, strongly influence performance variation across languages, with non-Latin languages underperforming due to issues like language confusion.
Updated: 2025-02-19 21:35:47
Subjects: cs.CL,cs.AI
Chasing the Timber Trail: Machine Learning to Reveal Harvest Location Misrepresentation
Illegal logging poses a significant threat to global biodiversity, climate stability, and depresses international prices for legal wood harvesting and responsible forest products trade, affecting livelihoods and communities across the globe. Stable isotope ratio analysis (SIRA) is rapidly becoming an important tool for determining the harvest location of traded, organic, products. The spatial pattern in stable isotope ratio values depends on factors such as atmospheric and environmental conditions and can thus be used for geographical identification. We present here the results of a deployed machine learning pipeline where we leverage both isotope values and atmospheric variables to determine timber harvest location. Additionally, the pipeline incorporates uncertainty estimation to facilitate the interpretation of harvest location determination for analysts. We present our experiments on a collection of oak (Quercus spp.) tree samples from its global range. Our pipeline outperforms comparable state-of-the-art models determining geographic harvest origin of commercially traded wood products, and has been used by European enforcement agencies to identify illicit Russian and Belarusian timber entering the EU market. We also identify opportunities for further advancement of our framework and how it can be generalized to help identify the origin of falsely labeled organic products throughout the supply chain.
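A minimal sketch of the prediction-with-uncertainty idea, using a nearest-neighbour vote in place of the deployed pipeline's actual model: the 2-D "isotope plus climate" features, region labels, and vote-share confidence below are all hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict_with_uncertainty(X_train, y_train, x, k=5):
    """Nearest-neighbour origin prediction with a simple vote-share
    confidence (1.0 = unanimous neighbours, lower = less certain)."""
    d = np.linalg.norm(X_train - x, axis=1)
    votes = [y_train[i] for i in np.argsort(d)[:k]]
    label, top = Counter(votes).most_common(1)[0]
    return label, top / k

# hypothetical 2-D "isotope + climate" features for two harvest regions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
               rng.normal([3, 3], 0.5, (20, 2))])
y = np.array(["region_A"] * 20 + ["region_B"] * 20)
label, confidence = knn_predict_with_uncertainty(X, y, np.array([2.9, 3.1]))
print(label, confidence)
```

In the deployed system the uncertainty estimate is what lets analysts decide whether a claimed harvest location is credibly contradicted by the data.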
Updated: 2025-02-19 21:34:08
Subjects: cs.LG,cs.CE,cs.CY,J.m; K.4.1; I.2.0; J.2
Zero loss guarantees and explicit minimizers for generic overparametrized Deep Learning networks
We determine sufficient conditions for overparametrized deep learning (DL) networks to guarantee the attainability of zero loss in the context of supervised learning, for the $\mathcal{L}^2$ cost and {\em generic} training data. We present an explicit construction of the zero loss minimizers without invoking gradient descent. On the other hand, we point out that increase of depth can deteriorate the efficiency of cost minimization using a gradient descent algorithm by analyzing the conditions for rank loss of the training Jacobian. Our results clarify key aspects on the dichotomy between zero loss reachability in underparametrized versus overparametrized DL.
Updated: 2025-02-19 21:31:05
Subjects: cs.LG,cs.AI,math.AP,math.OC,stat.ML,57R70, 62M45
Object-centric Binding in Contrastive Language-Image Pretraining
Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
Updated: 2025-02-19 21:30:51
Subjects: cs.CV,cs.AI
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. However, when the retrieval process involves private data, RAG systems may face severe privacy risks, potentially leading to the leakage of sensitive information. To address this issue, we propose using synthetic data as a privacy-preserving alternative for the retrieval data. We propose SAGE, a novel two-stage synthetic data generation paradigm. In the stage-1, we employ an attribute-based extraction and generation approach to preserve key contextual information from the original data. In the stage-2, we further enhance the privacy properties of the synthetic data through an agent-based iterative refinement process. Extensive experiments demonstrate that using our synthetic data as the retrieval context achieves comparable performance to using the original data while substantially reducing privacy risks. Our work takes the first step towards investigating the possibility of generating high-utility and privacy-preserving synthetic data for RAG, opening up new opportunities for the safe application of RAG systems in various domains.
Updated: 2025-02-19 21:24:11
Subjects: cs.CR
Conformal Prediction under Lévy-Prokhorov Distribution Shifts: Robustness to Local and Global Perturbations
Conformal prediction provides a powerful framework for constructing prediction intervals with finite-sample guarantees, yet its robustness under distribution shifts remains a significant challenge. This paper addresses this limitation by modeling distribution shifts using Lévy-Prokhorov (LP) ambiguity sets, which capture both local and global perturbations. We provide a self-contained overview of LP ambiguity sets and their connections to popular metrics such as Wasserstein and Total Variation. We show that the link between conformal prediction and LP ambiguity sets is a natural one: by propagating the LP ambiguity set through the scoring function, we reduce complex high-dimensional distribution shifts to manageable one-dimensional distribution shifts, enabling exact quantification of worst-case quantiles and coverage. Building on this analysis, we construct robust conformal prediction intervals that remain valid under distribution shifts, explicitly linking LP parameters to interval width and confidence levels. Experimental results on real-world datasets demonstrate the effectiveness of the proposed approach.
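For orientation, the baseline the paper makes robust is standard split conformal prediction, sketched below on synthetic residuals. The LP-ambiguity machinery would additionally inflate the calibration quantile to cover worst-case shifts; that inflation is not implemented here.

```python
import numpy as np

def conformal_half_width(resid_cal, alpha=0.1):
    """Split conformal: the finite-sample-corrected quantile of absolute
    calibration residuals gives a prediction-interval half-width with
    (1 - alpha) marginal coverage under exchangeability."""
    n = len(resid_cal)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(np.abs(resid_cal), level)

rng = np.random.default_rng(1)
resid_cal = rng.normal(size=1000)       # calibration residuals (toy model)
half_width = conformal_half_width(resid_cal, alpha=0.1)

resid_test = rng.normal(size=5000)      # fresh exchangeable test residuals
coverage = np.mean(np.abs(resid_test) <= half_width)
print(half_width, coverage)             # empirical coverage lands near 0.90
```

Under a distribution shift the test residuals are no longer exchangeable with the calibration set, and this vanilla quantile can undercover, which is exactly the failure mode the paper addresses.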
Updated: 2025-02-19 21:18:11
Subjects: stat.ML,cs.LG,math.ST,stat.ME,stat.TH
View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
Updated: 2025-02-19 21:10:44
Subjects: cs.RO,cs.AI,cs.CV,cs.LG
Probability Bracket Notation: Markov Sequence Projector of Visible and Hidden Markov Models in Dynamic Bayesian Networks
With the symbolic framework of Probability Bracket Notation (PBN), the Markov Sequence Projector (MSP) is introduced to expand the evolution formula of Homogeneous Markov Chains (HMCs). The well-known weather example, a Visible Markov Model (VMM), illustrates that the full joint probability of a VMM corresponds to a specifically projected Markov state sequence in the expanded evolution formula. In a Hidden Markov Model (HMM), the probability basis (P-basis) of the hidden Markov state sequence and the P-basis of the observation sequence exist in the sequential event space. The full joint probability of an HMM is the product of the (unknown) projected hidden sequence of Markov states and their transformations into the observation P-bases. The Viterbi algorithm is applied to the famous Weather-Stone HMM example to determine the most likely weather-state sequence given the observed stone-state sequence. Our results are verified using the Elvira software package. Using the PBN, we unify the evolution formulas for Markov models like VMMs, HMMs, and factorial HMMs (with discrete time). We briefly investigated the extended HMM, addressing the feedback issue, and the continuous-time VMM and HMM (with discrete or continuous states). All these models are subclasses of Dynamic Bayesian Networks (DBNs) essential for Machine Learning (ML) and Artificial Intelligence (AI).
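The Viterbi decoding step described above can be sketched directly. The two-state transition and emission numbers below are hypothetical stand-ins for the Weather-Stone example, not values taken from the paper.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for an HMM (log-space)."""
    n_states, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)   # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)      # best predecessor of each j
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# hypothetical two-state weather HMM (Sunny=0, Rainy=1),
# observing a stone that is dry=0 or wet=1
pi = np.array([0.6, 0.4])
A = np.array([[0.8, 0.2], [0.3, 0.7]])   # weather transitions
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # stone observation probabilities
print(viterbi(pi, A, B, [0, 0, 1, 1]))   # → [0, 0, 1, 1]
```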
Updated: 2025-02-19 21:09:18
Categories: cs.AI,math.PR,62F15,G.3; I.2.3
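As a concrete companion to the abstract above, here is a minimal sketch of Viterbi decoding on a toy two-state weather HMM. The states, transition, and emission probabilities below are invented for illustration and are not the numbers from the paper's Weather-Stone example.

```python
# Viterbi decoding for a toy two-state weather HMM (all probabilities
# here are illustrative assumptions, not the paper's actual numbers).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for `obs`."""
    # V[t][s] = max probability of any path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Pick the best predecessor state for s
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ("Sunny", "Rainy")
start_p = {"Sunny": 0.6, "Rainy": 0.4}
trans_p = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
           "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
emit_p = {"Sunny": {"dry": 0.8, "wet": 0.2},
          "Rainy": {"dry": 0.3, "wet": 0.7}}

print(viterbi(["dry", "wet", "wet"], states, start_p, trans_p, emit_p))
```

For longer sequences a production implementation would work in log-probabilities to avoid underflow; the max-product form above keeps the dynamic program easy to read.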
Explainable Distributed Constraint Optimization Problems
The Distributed Constraint Optimization Problem (DCOP) formulation is a powerful tool to model cooperative multi-agent problems that need to be solved distributively. A core assumption of existing approaches is that DCOP solutions can be easily understood, accepted, and adopted, which may not hold, as evidenced by the large body of literature on Explainable AI. In this paper, we propose the Explainable DCOP (X-DCOP) model, which extends a DCOP to include its solution and a contrastive query for that solution. We formally define key properties that contrastive explanations must satisfy to be considered valid solutions to X-DCOPs, and present theoretical results on the existence of such valid explanations. To solve X-DCOPs, we propose a distributed framework as well as several optimizations and suboptimal variants to find valid explanations. We also include a human user study that showed that users, not surprisingly, prefer shorter explanations over longer ones. Our empirical evaluations showed that our approach can scale to large problems, and the different variants provide different options for trading off explanation lengths for smaller runtimes. Thus, our model and algorithmic contributions extend the state of the art by reducing the barrier for users to understand DCOP solutions, facilitating their adoption in more real-world applications.
Updated: 2025-02-19 21:06:30
Categories: cs.AI
eQMARL: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels
Collaboration is a key challenge in distributed multi-agent reinforcement learning (MARL) environments. Learning frameworks for these decentralized systems must weigh the benefits of explicit player coordination against the communication overhead and computational cost of sharing local observations and environmental data. Quantum computing has sparked a potential synergy between quantum entanglement and cooperation in multi-agent environments, which could enable more efficient distributed collaboration with minimal information sharing. This relationship is largely unexplored, however, as current state-of-the-art quantum MARL (QMARL) implementations rely on classical information sharing rather than entanglement over a quantum channel as a coordination medium. In contrast, in this paper, a novel framework dubbed entangled QMARL (eQMARL) is proposed. The proposed eQMARL is a distributed actor-critic framework that facilitates cooperation over a quantum channel and eliminates local observation sharing via a quantum entangled split critic. Introducing a quantum critic uniquely spread across the agents allows coupling of local observation encoders through entangled input qubits over a quantum channel, which requires no explicit sharing of local observations and reduces classical communication overhead. Further, agent policies are tuned through joint observation-value function estimation via joint quantum measurements, thereby reducing the centralized computational burden. Experimental results show that eQMARL with ${\Psi}^{+}$ entanglement converges to a cooperative strategy up to $17.8\%$ faster and with a higher overall score compared to split classical and fully centralized classical and quantum baselines. The results also show that eQMARL achieves this performance with a constant factor of $25$-times fewer centralized parameters compared to the split classical baseline.
Updated: 2025-02-19 21:02:59
Categories: quant-ph,cs.ET,cs.LG,cs.MA
Aligning Human and Machine Attention for Enhanced Supervised Learning
Attention, or prioritization of certain information items over others, is a critical element of any learning process, for both humans and machines. Given that humans continue to outperform machines in certain learning tasks, it seems plausible that machine performance could be enriched by aligning machine attention with human attention mechanisms -- yet research on this topic is sparse and has achieved only limited success. This paper proposes a new approach to address this gap, called Human-Machine Attention Learning (HuMAL). This approach involves reliance on data annotated by humans to reflect their self-perceived attention during specific tasks. We evaluate several alternative strategies for integrating such human attention data into machine learning (ML) algorithms, using a sentiment analysis task (review data from Yelp) and a personality-type classification task (data from myPersonality). The best-performing HuMAL strategy significantly enhances the task performance of fine-tuned transformer models (BERT, as well as GPT-2 and XLNET), and the benefit is particularly pronounced under challenging conditions of imbalanced or sparse labeled data. This research contributes to a deeper understanding of strategies for integrating human attention into ML models and highlights the potential of leveraging human cognition to augment ML in real-world applications.
Updated: 2025-02-19 20:57:37
Categories: cs.LG,cs.AI,cs.CL
Entity Decomposition with Filtering: A Zero-Shot Clinical Named Entity Recognition Framework
Clinical named entity recognition (NER) aims to retrieve important entities within clinical narratives. Recent works have demonstrated that large language models (LLMs) can achieve strong performance in this task. While previous works focus on proprietary LLMs, we investigate how open NER LLMs, trained specifically for entity recognition, perform in clinical NER. Our initial experiment reveals a significant performance gap for some clinical entities and shows how simple exploitation of entity types can alleviate this issue. In this paper, we introduce a novel framework, entity decomposition with filtering, or EDF. Our key idea is to decompose the entity recognition task into several retrievals of entity sub-types and then filter them. Our experimental results demonstrate the efficacy of our framework and improvements across all metrics, models, datasets, and entity types. Our analysis also reveals substantial improvement in recognizing previously missed entities using entity decomposition. We further provide a comprehensive evaluation of our framework and an in-depth error analysis to pave the way for future work.
Updated: 2025-02-19 20:51:20
Categories: cs.CL,cs.AI
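The decompose-then-filter idea above can be sketched as follows. The `retrieve` stand-in, the sub-type lexicon, and the filtering rule are all hypothetical placeholders for the paper's LLM-based retrievals.

```python
# A minimal, hypothetical sketch of entity decomposition with filtering:
# recognize a broad entity type by retrieving its sub-types separately,
# then filter the pooled candidates. `retrieve` stands in for an LLM call;
# the sub-type lists and the filter rule are illustrative assumptions.

SUBTYPES = {"problem": ["disease", "symptom"]}

def retrieve(text, subtype, lexicon):
    """Stand-in for an LLM retrieval of one entity sub-type."""
    return [(span, subtype) for span in lexicon.get(subtype, []) if span in text]

def edf(text, entity_type, lexicon, min_len=3):
    candidates = []
    # Decomposition step: one retrieval per sub-type.
    for subtype in SUBTYPES[entity_type]:
        candidates.extend(retrieve(text, subtype, lexicon))
    # Filtering step: drop spans that are too short or duplicated.
    seen, kept = set(), []
    for span, _subtype in candidates:
        if len(span) >= min_len and span not in seen:
            seen.add(span)
            kept.append((span, entity_type))
    return kept

lexicon = {"disease": ["pneumonia"], "symptom": ["fever", "cough"]}
note = "Patient presents with fever and cough, consistent with pneumonia."
print(edf(note, "problem", lexicon))
```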
Aligned Multi Objective Optimization
To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance tradeoffs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLM training show that diverse related tasks can enhance performance across objectives simultaneously. Despite this evidence, this phenomenon has not been examined from an optimization perspective. This leads to a lack of generic gradient-based methods that can scale to scenarios with a large number of related objectives. To address this gap, we introduce the Aligned Multi-Objective Optimization framework, propose new algorithms for this setting, and provide theoretical guarantees of their superior performance compared to naive approaches.
Updated: 2025-02-19 20:50:03
Categories: cs.LG,math.OC
CND-IDS: Continual Novelty Detection for Intrusion Detection Systems
Intrusion detection systems (IDS) play a crucial role in IoT and network security by monitoring system data and alerting to suspicious activities. Machine learning (ML) has emerged as a promising solution for IDS, offering highly accurate intrusion detection. However, ML-IDS solutions often overlook two critical aspects needed to build reliable systems: continually changing data streams and a lack of attack labels. Streaming network traffic and associated cyber attacks are continually changing, which can degrade the performance of deployed ML models. Labeling attack data, such as zero-day attacks, in real-world intrusion scenarios may not be feasible, making the use of ML solutions that do not rely on attack labels necessary. To address both these challenges, we propose CND-IDS, a continual novelty detection IDS framework which consists of (i) a learning-based feature extractor that continuously updates new feature representations of the system data, and (ii) a novelty detector that identifies new cyber attacks by leveraging principal component analysis (PCA) reconstruction. Our results on realistic intrusion datasets show that CND-IDS achieves up to 6.1x F-score improvement, and up to 6.5x improved forward transfer over the SOTA unsupervised continual learning algorithm. Our code will be released upon acceptance.
Updated: 2025-02-19 20:47:22
Categories: cs.CR,cs.LG
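The PCA-reconstruction ingredient of the novelty detector above can be sketched as follows, using a hand-rolled PCA via SVD and a simple max-error threshold. The toy traffic data and threshold rule are illustrative assumptions, not the authors' procedure.

```python
# A minimal sketch of novelty detection via PCA reconstruction error, in the
# spirit of the abstract's detector (toy data and threshold are assumptions).
import numpy as np

rng = np.random.default_rng(0)

# "Benign" training traffic lives near a 1-D subspace of R^3.
direction = np.array([1.0, 2.0, 0.5])
X_train = rng.normal(size=(500, 1)) * direction \
    + rng.normal(scale=0.01, size=(500, 3))

# Fit PCA by hand: center the data, keep the top right singular vector.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:1]  # k = 1 principal component

def reconstruction_error(x):
    z = (x - mean) @ components.T   # project onto the subspace
    x_hat = z @ components + mean   # reconstruct
    return float(np.linalg.norm(x - x_hat))

# Threshold from training errors (max observed error plus a margin).
threshold = max(reconstruction_error(x) for x in X_train) * 1.5

def is_novel(x):
    """Flag points whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x) > threshold

print(is_novel(2.0 * direction))             # benign-like point
print(is_novel(np.array([5.0, -5.0, 5.0])))  # off-subspace point
```

In the paper's continual setting the feature extractor in front of this step is itself updated over time; the sketch only shows the reconstruction-error test.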
A New Framework of Software Obfuscation Evaluation Criteria
In the domain of practical software protection against man-at-the-end attacks such as software reverse engineering and tampering, much of the scientific literature is plagued by the use of subpar methods to evaluate the protections' strength and even by the absence of such evaluations. Several criteria have been proposed in the past to assess the strength of protections, such as potency, resilience, stealth, and cost. We analyze their evolving definitions and uses. We formulate a number of critiques, from which we conclude that the existing definitions are unsatisfactory and need to be revised. We present a new framework of software protection evaluation criteria: relevance, effectiveness (or efficacy), robustness, concealment, stubbornness, sensitivity, predictability, and cost.
Updated: 2025-02-19 20:45:47
Categories: cs.SE,cs.CR
Reinforcement Learning-based Receding Horizon Control using Adaptive Control Barrier Functions for Safety-Critical Systems
Optimal control methods provide solutions to safety-critical problems but easily become intractable. Control Barrier Functions (CBFs) have emerged as a popular technique that facilitates their solution by provably guaranteeing safety, through their forward invariance property, at the expense of some performance loss. This approach involves defining a performance objective alongside CBF-based safety constraints that must always be enforced. Unfortunately, both performance and solution feasibility can be significantly impacted by two key factors: (i) the selection of the cost function and associated parameters, and (ii) the calibration of parameters within the CBF-based constraints, which capture the trade-off between performance and conservativeness. To address these challenges, we propose a Reinforcement Learning (RL)-based Receding Horizon Control (RHC) approach leveraging Model Predictive Control (MPC) with CBFs (MPC-CBF). In particular, we parameterize our controller and use bilevel optimization, where RL is used to learn the optimal parameters while MPC computes the optimal control input. We validate our method by applying it to the challenging automated merging control problem for Connected and Automated Vehicles (CAVs) at conflicting roadways. Results demonstrate improved performance and a significant reduction in the number of infeasible cases compared to traditional heuristic approaches used for tuning CBF-based controllers, showcasing the effectiveness of the proposed method.
Updated: 2025-02-19 20:37:14
Categories: eess.SY,cs.AI,cs.SY
Dynamic Gradient Influencing for Viral Marketing Using Graph Neural Networks
The problem of maximizing the adoption of a product through viral marketing in social networks has been studied heavily through postulated network models. We present a novel data-driven formulation of the problem. We use Graph Neural Networks (GNNs) to model the adoption of products by utilizing both topological and attribute information. The resulting Dynamic Viral Marketing (DVM) problem seeks to find the minimum budget and minimal set of dynamic topological and attribute changes in order to attain a specified adoption goal. We show that DVM is NP-Hard and is related to the existing influence maximization problem. Motivated by this connection, we develop the idea of Dynamic Gradient Influencing (DGI) that uses gradient ranking to find optimal perturbations and targets low-budget and high influence non-adopters in discrete steps. We use an efficient strategy for computing node budgets and develop the "Meta-Influence" heuristic for assessing a node's downstream influence. We evaluate DGI against multiple baselines and demonstrate gains on average of 24% on budget and 37% on AUC on real-world attributed networks. Our code is publicly available at https://github.com/saurabhsharma1993/dynamic_viral_marketing.
Updated: 2025-02-19 20:30:15
Categories: cs.LG,cs.CR,cs.SI
Learning from End User Data with Shuffled Differential Privacy over Kernel Densities
We study a setting of collecting and learning from private data distributed across end users. In the shuffled model of differential privacy, the end users partially protect their data locally before sharing it, and their data is also anonymized during its collection to enhance privacy. This model has recently become a prominent alternative to central DP, which requires full trust in a central data curator, and local DP, where fully local data protection takes a steep toll on downstream accuracy. Our main technical result is a shuffled DP protocol for privately estimating the kernel density function of a distributed dataset, with accuracy essentially matching central DP. We use it to privately learn a classifier from the end user data, by learning a private density function per class. Moreover, we show that the density function itself can recover the semantic content of its class, despite having been learned in the absence of any unprotected data. Our experiments show the favorable downstream performance of our approach, and highlight key downstream considerations and trade-offs in a practical ML deployment of shuffled DP.
Updated: 2025-02-19 20:27:01
Categories: cs.LG,cs.CR,cs.DS
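For orientation, the (non-private) quantity the protocol above estimates is a kernel density function. A minimal Gaussian-kernel sketch, with made-up data and bandwidth and no DP noise or shuffling, looks like this:

```python
# Gaussian kernel density estimate: the non-private target function of the
# shuffled-DP protocol described above. Data and bandwidth are made up.
import math

def kde(x, data, bandwidth=0.5):
    """Gaussian kernel density estimate at point x."""
    norm = math.sqrt(2 * math.pi) * bandwidth * len(data)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

data = [0.0, 0.1, -0.1, 5.0]
# Density is much higher near the cluster at 0 than in the empty gap at 2.5.
print(kde(0.0, data) > kde(2.5, data))
```

In the paper's setting, one such density is learned privately per class, and a query point is classified by the class whose density is highest there.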
Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning
Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question-answering and mathematical problem-solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where models predict plausible semantic relationships based on provided definitions, and few-shot prompting, where models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that in instruct prompting, consistent performance is obtained when ranking multiple relations but with substantial decline when the model is restricted to predicting only one relation. In few-shot prompting, the model's accuracy improves significantly when selecting from five relations rather than the full set, although with notable bias toward certain relations. These results suggest that significant gaps remain between the abstract common-sense reasoning abilities of even commercially deployed LLMs and human-level understanding. However, the findings also highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.
Updated: 2025-02-19 20:20:24
Categories: cs.CL,cs.AI
A generalized dual potential for inelastic Constitutive Artificial Neural Networks: A JAX implementation at finite strains
We present a methodology for designing a generalized dual potential, or pseudo potential, for inelastic Constitutive Artificial Neural Networks (iCANNs). This potential, expressed in terms of stress invariants, inherently satisfies thermodynamic consistency for large deformations. In comparison to our previous work, the new potential captures a broader spectrum of material behaviors, including pressure-sensitive inelasticity. To this end, we revisit the underlying thermodynamic framework of iCANNs for finite strain inelasticity and derive conditions for constructing a convex, zero-valued, and non-negative dual potential. To embed these principles in a neural network, we detail the architecture's design, ensuring a priori compliance with thermodynamics. To evaluate the proposed architecture, we study its performance and limitations in discovering visco-elastic material behavior, though the method is not limited to visco-elasticity. In this context, we investigate different aspects of the strategy for discovering inelastic materials. Our results indicate that the novel architecture robustly discovers interpretable models and parameters, while autonomously revealing the degree of inelasticity. The iCANN framework, implemented in JAX, is publicly accessible at https://doi.org/10.5281/zenodo.14894687.
Updated: 2025-02-19 20:16:45
Categories: cs.LG,cond-mat.mtrl-sci,cs.AI,cs.CE,65, 74,I.6; J.2
Personalized Education with Generative AI and Digital Twins: VR, RAG, and Zero-Shot Sentiment Analysis for Industry 4.0 Workforce Development
The Fourth Industrial Revolution (4IR) technologies, such as cloud computing, machine learning, and AI, have improved productivity but introduced challenges in workforce training and reskilling. This is critical given existing workforce shortages, especially in marginalized communities like Underrepresented Minorities (URM), who often lack access to quality education. Addressing these challenges, this research presents gAI-PT4I4, a Generative AI-based Personalized Tutor for Industrial 4.0, designed to personalize 4IR experiential learning. gAI-PT4I4 employs sentiment analysis to assess student comprehension, leveraging generative AI and finite automaton to tailor learning experiences. The framework integrates low-fidelity Digital Twins for VR-based training, featuring an Interactive Tutor - a generative AI assistant providing real-time guidance via audio and text. It uses zero-shot sentiment analysis with LLMs and prompt engineering, achieving 86% accuracy in classifying student-teacher interactions as positive or negative. Additionally, retrieval-augmented generation (RAG) enables personalized learning content grounded in domain-specific knowledge. To adapt training dynamically, finite automaton structures exercises into states of increasing difficulty, requiring 80% task-performance accuracy for progression. Experimental evaluation with 22 volunteers showed improved accuracy exceeding 80%, reducing training time. Finally, this paper introduces a Multi-Fidelity Digital Twin model, aligning Digital Twin complexity with Bloom's Taxonomy and Kirkpatrick's model, providing a scalable educational framework.
Updated: 2025-02-19 20:11:19
Categories: cs.CY,cs.AI
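The finite-automaton structuring of exercises described above can be sketched as a small state machine that advances only at 80% task accuracy. The level names and demo scores are invented for illustration.

```python
# A minimal sketch of the abstract's finite automaton: training exercises
# form states of increasing difficulty, and the learner advances only when
# task accuracy reaches 80%. Level names and scores are assumptions.

class TrainingAutomaton:
    LEVELS = ["intro", "guided", "independent", "expert"]
    PASS_THRESHOLD = 0.8

    def __init__(self):
        self.state = 0

    @property
    def level(self):
        return self.LEVELS[self.state]

    def record_attempt(self, accuracy):
        """Advance to the next difficulty state iff accuracy >= 80%."""
        if accuracy >= self.PASS_THRESHOLD and self.state < len(self.LEVELS) - 1:
            self.state += 1
        return self.level

fsm = TrainingAutomaton()
for score in [0.55, 0.82, 0.9]:
    print(fsm.record_attempt(score))
```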
Population Dynamics Control with Partial Observations
We study the problem of controlling population dynamics, a class of linear dynamical systems evolving on the probability simplex, from the perspective of online non-stochastic control. While Golowich et al. (2024) analyzed the fully observable setting, we focus on the more realistic, partially observable case, where only a low-dimensional representation of the state is accessible. In classical non-stochastic control, inputs are set as linear combinations of past disturbances. However, under partial observations, disturbances cannot be directly computed. To address this, Simchowitz et al. (2020) proposed to construct oblivious signals, which are counterfactual observations with zero control, as a substitute. This raises several challenges in our setting: (1) how to construct oblivious signals under simplex constraints, where zero control is infeasible; (2) how to design a sufficiently expressive convex controller parameterization tailored to these signals; and (3) how to enforce the simplex constraint on control when projections may break the convexity of cost functions. Our main contribution is a new controller that achieves the optimal $\tilde{O}(\sqrt{T})$ regret with respect to a natural class of mixing linear dynamic controllers. To tackle these challenges, we construct signals based on hypothetical observations under a constant control adapted to the simplex domain, and introduce a new controller parameterization that approximates general control policies linear in non-oblivious observations. Furthermore, we employ a novel convex extension surrogate loss, inspired by Lattimore (2024), to bypass the projection-induced convexity issue.
Updated: 2025-02-19 20:07:56
Categories: math.OC,cs.LG
Towards Vector Optimization on Low-Dimensional Vector Symbolic Architecture
Vector Symbolic Architecture (VSA) is emerging in machine learning due to its efficiency, but it is hindered by issues of hyperdimensionality and accuracy. As a promising mitigation, the Low-Dimensional Computing (LDC) method significantly reduces the vector dimension by ~100 times while maintaining accuracy, by employing a gradient-based optimization. Despite its potential, LDC optimization for VSA is still underexplored. Our investigation into vector updates underscores the importance of stable, adaptive dynamics in LDC training. We also reveal the overlooked yet critical roles of batch normalization (BN) and knowledge distillation (KD) in standard approaches. Besides the accuracy boost, BN does not add computational overhead during inference, and KD significantly enhances inference confidence. Through extensive experiments and ablation studies across multiple benchmarks, we provide a thorough evaluation of our approach and extend the interpretability of binary neural network optimization similar to LDC, previously unaddressed in BNN literature.
Updated: 2025-02-19 20:00:32
Categories: cs.LG
Investigating Non-Transitivity in LLM-as-a-Judge
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
Updated: 2025-02-19 19:59:16
Categories: cs.AI,cs.CL,cs.LG
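The round-robin-plus-Bradley-Terry ranking procedure above can be sketched with the classic minorization-maximization fit. The win matrix below is a made-up example, not AlpacaEval data.

```python
# A minimal sketch of ranking models from round-robin pairwise wins with a
# Bradley-Terry model, fit by the standard MM (minorization-maximization)
# iteration. The win matrix is a made-up example, not AlpacaEval data.

def bradley_terry(wins, iters=200):
    """wins[i][j] = number of times model i beat model j."""
    n = len(wins)
    p = [1.0] * n  # strength parameters
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of model i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize for identifiability
    return p

# Three models in a round-robin; model 0 dominates, model 2 is weakest.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(3), key=lambda i: -strengths[i])
print(ranking)
```

Because every pair plays every other pair, the fitted ranking does not depend on any single baseline model, which is the point of the round-robin design in the abstract.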
Recurrent Neural Goodness-of-Fit Test for Time Series
Time series data are crucial across diverse domains such as finance and healthcare, where accurate forecasting and decision-making rely on advanced modeling techniques. While generative models have shown great promise in capturing the intricate dynamics inherent in time series, evaluating their performance remains a major challenge. Traditional evaluation metrics fall short due to the temporal dependencies and potential high dimensionality of the features. In this paper, we propose the REcurrent NeurAL (RENAL) Goodness-of-Fit test, a novel and statistically rigorous framework for evaluating generative time series models. By leveraging recurrent neural networks, we transform the time series into conditionally independent data pairs, enabling the application of a chi-square-based goodness-of-fit test to the temporal dependencies within the data. This approach offers a robust, theoretically grounded solution for assessing the quality of generative models, particularly in settings with limited time sequences. We demonstrate the efficacy of our method across both synthetic and real-world datasets, outperforming existing methods in terms of reliability and accuracy. Our method fills a critical gap in the evaluation of time series generative models, offering a tool that is both practical and adaptable to high-stakes applications.
Updated: 2025-02-19 19:58:52
Categories: stat.ML,cs.LG
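The chi-square goodness-of-fit idea can be illustrated in its simplest static form: once the data have been reduced to (conditionally) independent draws, observed bin counts are compared against the model's expected counts with a Pearson statistic. A toy sketch of that final step only, not the RENAL transformation itself; the counts are made up:

```python
def chi_square_stat(observed, expected_probs, n):
    """Pearson chi-square statistic: sum over bins of (O_k - E_k)^2 / E_k,
    where E_k = n * p_k are the counts the model predicts."""
    stat = 0.0
    for o, p in zip(observed, expected_probs):
        e = n * p
        stat += (o - e) ** 2 / e
    return stat

# Toy check: 100 draws over 4 bins against a model claiming uniform probabilities.
observed = [27, 24, 26, 23]          # counts, sum to 100
uniform = [0.25] * 4
stat = chi_square_stat(observed, uniform, 100)
# With df = 4 - 1 = 3 the 95% critical value is about 7.81, so a statistic
# below that is consistent with the hypothesized model.
```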
The dark side of the forces: assessing non-conservative force models for atomistic machine learning
The use of machine learning to estimate the energy of a group of atoms, and the forces that drive them to more stable configurations, has revolutionized the fields of computational chemistry and materials discovery. In this domain, rigorous enforcement of symmetry and conservation laws has traditionally been considered essential. For this reason, interatomic forces are usually computed as the derivatives of the potential energy, ensuring energy conservation. Several recent works have questioned this physically constrained approach, suggesting that using the forces as explicit learning targets yields a better trade-off between accuracy and computational efficiency, and that energy conservation can be learned during training. The present work investigates the applicability of such non-conservative models in microscopic simulations. We identify and demonstrate several fundamental issues, from ill-defined convergence of geometry optimization to instability in various types of molecular dynamics. Contrary to the case of rotational symmetry, lack of energy conservation is hard to learn, control, and correct. The best approach to exploit the acceleration afforded by direct force evaluation might be to use it in tandem with a conservative model, reducing, rather than eliminating, the additional cost of backpropagation, but avoiding most of the pathological behavior associated with non-conservative forces.
Updated: 2025-02-19 19:48:32
Categories: physics.chem-ph,cs.LG
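The core distinction, forces computed as exact gradients of an energy versus directly predicted forces, can be demonstrated in a toy 2D harmonic well: a gradient force conserves total energy up to integrator error, while adding a small rotational (non-gradient) component steadily changes the energy over a trajectory. This is a generic physics sketch, unrelated to the paper's actual models:

```python
def simulate(force, steps=4000, dt=1e-3):
    """Velocity-Verlet (unit mass) in a 2D harmonic well E = 0.5(x^2 + y^2).
    Returns the absolute total-energy drift after `steps` steps."""
    x, y, vx, vy = 1.0, 0.0, 0.0, 1.0
    e0 = 0.5 * (x*x + y*y) + 0.5 * (vx*vx + vy*vy)
    fx, fy = force(x, y)
    for _ in range(steps):
        vx += 0.5 * dt * fx; vy += 0.5 * dt * fy
        x += dt * vx;        y += dt * vy
        fx, fy = force(x, y)
        vx += 0.5 * dt * fx; vy += 0.5 * dt * fy
    e1 = 0.5 * (x*x + y*y) + 0.5 * (vx*vx + vy*vy)
    return abs(e1 - e0)

conservative = lambda x, y: (-x, -y)                  # exact -grad E
rotational = lambda x, y: (-x + 0.05*y, -y - 0.05*x)  # curl != 0: not any gradient

drift_cons = simulate(conservative)   # bounded integrator noise only
drift_rot = simulate(rotational)      # systematic energy drain/pump
```

The rotational term cannot be written as the gradient of any scalar energy, so no amount of post-hoc correction of the potential removes the drift, which mirrors the instabilities the abstract reports.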
DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models
Fine-tuning text-to-image diffusion models to maximize rewards has proven effective for enhancing model performance. However, reward fine-tuning methods often suffer from slow convergence due to online sample generation. Therefore, obtaining diverse samples with strong reward signals is crucial for improving sample efficiency and overall performance. In this work, we introduce DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. Our approach employs two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to enhance sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. We demonstrate that these strategies significantly enhance exploration during online sample generation, improving the sample efficiency of recent reward fine-tuning methods, such as DDPO and AlignProp.
Updated: 2025-02-19 19:47:58
Categories: cs.CV,cs.AI
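The first strategy, adjusting the classifier-free guidance scale, operates on the standard CFG combination of the conditional and unconditional score predictions. A minimal sketch with made-up numbers; the sampling range for the scale is illustrative, not the paper's:

```python
import random

def guided_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional toward the
    conditional prediction with guidance scale w (w=0 -> uncond, w=1 -> cond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

random.seed(0)
eps_u = [0.1, -0.2, 0.3]   # toy unconditional score prediction
eps_c = [0.3,  0.0, 0.1]   # toy conditional score prediction

# Fixed scale (typical fine-tuning setup) vs. a randomly drawn scale per
# sample, the kind of exploration knob the abstract describes.
fixed = guided_eps(eps_u, eps_c, 7.5)
w = random.uniform(2.0, 10.0)          # hypothetical exploration range
explored = guided_eps(eps_u, eps_c, w)
```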
Density-Based Algorithms for Corruption-Robust Contextual Search and Convex Optimization
We study the problem of contextual search, a generalization of binary search in higher dimensions, in the adversarial noise model. Let $d$ be the dimension of the problem, $T$ be the time horizon and $C$ be the total amount of adversarial noise in the system. We focus on the $\epsilon$-ball and the absolute loss. For the $\epsilon$-ball loss, we give a tight regret bound of $O(C + d \log(1/\epsilon))$ improving over the $O(d^3 \log(1/\epsilon) \log^2(T) + C \log(T) \log(1/\epsilon))$ bound of Krishnamurthy et al. (Operations Research '23). For the absolute loss, we give an efficient algorithm with regret $O(C+d \log T)$. To tackle the absolute loss case, we study the more general setting of Corruption-Robust Convex Optimization with subgradient feedback, which is of independent interest. Our techniques are a significant departure from prior approaches. Specifically, we keep track of density functions over the candidate target vectors instead of a knowledge set consisting of the candidate target vectors consistent with the feedback obtained.
Updated: 2025-02-19 19:47:40
Categories: cs.LG,cs.GT
A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at github.com/RaceGAN.
Updated: 2025-02-19 19:43:31
Categories: cs.CV,cs.AI,eess.IV
Hierarchical Spatio-Temporal Uncertainty Quantification for Distributed Energy Adoption
The rapid deployment of distributed energy resources (DER) has introduced significant spatio-temporal uncertainties in power grid management, necessitating accurate multilevel forecasting methods. However, existing approaches often produce overly conservative uncertainty intervals at individual spatial units and fail to properly capture uncertainties when aggregating predictions across different spatial scales. This paper presents a novel hierarchical spatio-temporal model based on the conformal prediction framework to address these challenges. Our approach generates circuit-level DER growth predictions and efficiently aggregates them to the substation level while maintaining statistical validity through a tailored non-conformity score. Applied to a decade of DER installation data from a local utility network, our method demonstrates superior performance over existing approaches, particularly in reducing prediction interval widths while maintaining coverage.
Updated: 2025-02-19 19:33:04
Categories: stat.AP,cs.LG,stat.ML
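The conformal-prediction machinery underlying the method can be illustrated with plain split conformal intervals: the interval half-width is an empirical quantile of held-out calibration residuals, which is what gives the finite-sample coverage guarantee. A toy sketch of that baseline, not the paper's tailored non-conformity score:

```python
import math

def conformal_interval(cal_residuals, y_pred, alpha=0.1):
    """Split conformal interval around a point prediction: the half-width is
    the ceil((n+1)(1-alpha))-th smallest calibration residual |y - y_hat|."""
    r = sorted(cal_residuals)
    rank = math.ceil((len(r) + 1) * (1 - alpha))
    q = r[min(rank, len(r)) - 1]
    return y_pred - q, y_pred + q

# Toy calibration residuals from 9 held-out points (made-up numbers).
residuals = [0.1, 0.2, 0.2, 0.3, 0.4, 0.4, 0.5, 0.7, 1.0]
lo, hi = conformal_interval(residuals, y_pred=5.0, alpha=0.1)
```

Under exchangeability the interval covers the true value with probability at least 1 - alpha, regardless of the underlying forecaster.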
EfficientPose 6D: Scalable and Efficient 6D Object Pose Estimation
In industrial applications requiring real-time feedback, such as quality control and robotic manipulation, the demand for high-speed and accurate pose estimation remains critical. Despite advances improving speed and accuracy in pose estimation, finding a balance between computational efficiency and accuracy poses significant challenges in dynamic environments. Most current algorithms lack scalability in estimation time, especially for diverse datasets, and the state-of-the-art (SOTA) methods are often too slow. This study focuses on developing a fast and scalable set of pose estimators based on GDRNPP to meet or exceed current benchmarks in accuracy and robustness, particularly addressing the efficiency-accuracy trade-off essential in real-time scenarios. We propose the AMIS algorithm to tailor the utilized model according to an application-specific trade-off between inference time and accuracy. We further show the effectiveness of the AMIS-based model choice on four prominent benchmark datasets (LM-O, YCB-V, T-LESS, and ITODD).
Updated: 2025-02-19 19:21:23
Categories: cs.CV,cs.AI,cs.LG
New Lower Bounds for Stochastic Non-Convex Optimization through Divergence Composition
We study fundamental limits of first-order stochastic optimization in a range of nonconvex settings, including L-smooth functions satisfying Quasar-Convexity (QC), Quadratic Growth (QG), and Restricted Secant Inequalities (RSI). While the convergence properties of standard algorithms are well-understood in deterministic regimes, significantly fewer results address the stochastic case, where only unbiased and noisy gradients are available. We establish new lower bounds on the number of noisy gradient queries to minimize these classes of functions, also showing that they are tight (up to a logarithmic factor) in all the relevant quantities characterizing each class. Our approach reformulates the optimization task as a function identification problem, leveraging divergence composition arguments to construct a challenging subclass that leads to sharp lower bounds. Furthermore, we present a specialized algorithm in the one-dimensional setting that achieves faster rates, suggesting that certain dimensional thresholds are intrinsic to the complexity of non-convex stochastic optimization.
Updated: 2025-02-19 19:21:00
Categories: stat.ML,cs.LG,math.OC
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Transformer-based Large Language Models rely critically on KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method improved upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3$\times$ as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.
Updated: 2025-02-19 19:12:46
Categories: cs.CL,cs.LG
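The fine-grain stage's top-k sparse attention can be sketched directly: the softmax is taken over only the k highest-scoring keys, which approximates full attention well when the attention mass is concentrated, while only those k KV entries need to be fetched. A toy single-query sketch of the general technique, unrelated to RocketKV's actual kernels:

```python
import math

def topk_sparse_attention(q, keys, values, k):
    """Attend over only the k keys with the largest scores q . key,
    approximating full softmax attention while reading fewer KV entries."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: -scores[i])[:k]
    m = max(scores[i] for i in top)
    w = [math.exp(scores[i] - m) for i in top]      # stable softmax on top-k
    z = sum(w)
    out = [0.0] * len(values[0])
    for wi, i in zip(w, top):
        for d in range(len(out)):
            out[d] += (wi / z) * values[i][d]
    return out

# Two keys dominate the scores, so k=2 nearly matches full (k=4) attention
# even though the ignored values are huge.
q = [1.0, 0.0]
keys = [[5.0, 0.0], [4.0, 0.0], [-5.0, 0.0], [-6.0, 0.0]]
values = [[1.0], [2.0], [100.0], [200.0]]
approx = topk_sparse_attention(q, keys, values, k=2)
full = topk_sparse_attention(q, keys, values, k=4)
```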
Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder
Current pre-trained large language models typically need instruction tuning to align with human preferences. However, instruction tuning data is often quantity-saturated due to the large volume of data collection and fast model iteration, leaving coreset data selection important but underexplored. On the other hand, existing quality-driven data selection methods such as LIMA (NeurIPS 2023; Zhou et al., 2024) and AlpaGasus (ICLR 2024; Chen et al.) generally ignore the equal importance of data diversity and complexity. In this work, we aim to design a diversity-aware data selection strategy and creatively propose using sparse autoencoders to tackle the challenge of measuring data diversity. In addition, sparse autoencoders can also provide more interpretability of model behavior and explain, e.g., the surprising effectiveness of selecting the longest response (ICML 2024; Zhao et al.). Using effective data selection, we experimentally prove that models trained on our selected data can outperform other methods in terms of model capabilities, reduce training cost, and potentially gain more control over model behaviors.
Updated: 2025-02-19 19:12:34
Categories: cs.CL,cs.AI,cs.LG
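One simple way to turn sparse-autoencoder activations into a diversity criterion is greedy maximum coverage over active feature ids: repeatedly pick the example that activates the most features not yet covered by the selection. This is an illustrative sketch of the general idea only, not the paper's selection algorithm; the feature sets are made up:

```python
def diverse_subset(feature_sets, budget):
    """Greedy max-coverage over active sparse-feature ids: at each step pick
    the example contributing the most not-yet-covered features."""
    covered, chosen = set(), []
    remaining = list(range(len(feature_sets)))
    for _ in range(budget):
        best = max(remaining, key=lambda i: len(feature_sets[i] - covered))
        if not feature_sets[best] - covered:
            break                       # nothing new left to cover
        chosen.append(best)
        covered |= feature_sets[best]
        remaining.remove(best)
    return chosen, covered

# Toy "SAE activations": feature ids active on each instruction example.
feats = [{1, 2, 3}, {1, 2}, {4, 5}, {3, 4}, {6}]
chosen, covered = diverse_subset(feats, budget=3)
```

Example 1 is never picked because its features are subsumed by example 0, which is exactly the redundancy a diversity-aware selector should prune.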
Semantic Decomposition and Selective Context Filtering -- Text Processing Techniques for Context-Aware NLP-Based Systems
In this paper, we present two techniques for use in context-aware systems: Semantic Decomposition, which sequentially decomposes input prompts into a structured and hierarchical information schema that systems can parse and process easily, and Selective Context Filtering, which enables systems to systematically filter out specific irrelevant sections of the contextual information fed through a system's NLP-based pipeline. We will explore how context-aware systems and applications can utilize these two techniques to implement dynamic LLM-to-system interfaces, improve an LLM's ability to generate more contextually cohesive user-facing responses, and optimize complex automated workflows and pipelines.
Updated: 2025-02-19 19:09:40
Categories: cs.CL,cs.AI,cs.HC,I.2.7; I.7.0
Towards a Learning Theory of Representation Alignment
It has recently been argued that AI models' representations are becoming aligned as their scale and performance increase. Empirical analyses have been designed to support this idea and conjecture the possible alignment of different representations toward a shared statistical model of reality. In this paper, we propose a learning-theoretic perspective to representation alignment. First, we review and connect different notions of alignment based on metric, probabilistic, and spectral ideas. Then, we focus on stitching, a particular approach to understanding the interplay between different representations in the context of a task. Our main contribution here is relating properties of stitching to the kernel alignment of the underlying representation. Our results can be seen as a first step toward casting representation alignment as a learning-theoretic problem.
Updated: 2025-02-19 19:09:14
Categories: cs.LG,cs.AI,stat.ML
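Kernel alignment of representations is commonly measured with linear CKA, which is invariant to orthogonal rotation and isotropic rescaling of the features. A small pure-Python sketch of that standard measure (assuming the feature matrices are already column-centered; the toy matrices are made up, and this is background for the abstract's notion of alignment, not the paper's stitching construction):

```python
def matmul_t(a, b):
    """Compute a^T b for matrices given as lists of rows (both n x d)."""
    n, da, db = len(a), len(a[0]), len(b[0])
    return [[sum(a[i][p] * b[i][q] for i in range(n)) for q in range(db)]
            for p in range(da)]

def fro(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in m for x in row) ** 0.5

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)."""
    return fro(matmul_t(y, x)) ** 2 / (fro(matmul_t(x, x)) * fro(matmul_t(y, y)))

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # toy representation A
y_same = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]   # A rescaled: alignment = 1
y_diff = [[1.0, 1.0], [1.0, -1.0], [0.0, 0.0]]  # a genuinely different rep

cka_same = linear_cka(x, y_same)
cka_diff = linear_cka(x, y_diff)
```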
Position: There are no Champions in Long-Term Time Series Forecasting
Recent advances in long-term time series forecasting have introduced numerous complex prediction models that consistently outperform previously published architectures. However, this rapid progression raises concerns regarding inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. Our position emphasizes the need to shift focus away from pursuing ever-more complex models and towards enhancing benchmarking practices through rigorous and standardized evaluation methods. To support our claim, we first perform a broad, thorough, and reproducible evaluation of the top-performing models on the most popular benchmark by training 3,500+ networks over 14 datasets. Then, through a comprehensive analysis, we find that slight changes to experimental setups or current evaluation metrics drastically shift the common belief that newly published results are advancing the state of the art. Our findings suggest the need for rigorous and standardized evaluation methods that enable more substantiated claims, including reproducible hyperparameter setups and statistical testing.
Updated: 2025-02-19 19:08:37
Categories: cs.LG,cs.AI
Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality
Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications, where decisions must optimize cumulative rewards while strictly adhering to complex nonlinear constraints. In domains such as power systems, finance, supply chains, and precision robotics, violating these constraints can result in significant financial or societal costs. Existing Reinforcement Learning (RL) methods often struggle with sample efficiency and effectiveness in finding feasible policies for highly and strictly constrained CMDPs, limiting their applicability in these environments. Stochastic dual dynamic programming is often used in practice on convex relaxations of the original problem, but they also encounter computational challenges and loss of optimality. This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies using Lagrangian Duality. TS-DDR is a self-supervised learning algorithm that trains general decision rules (parametric policies) using stochastic gradient descent (SGD); its forward passes solve deterministic optimization problems to find feasible policies, and its backward passes leverage duality theory to train the parametric policy with closed-form gradients. TS-DDR inherits the flexibility and computational performance of deep learning methodologies to solve CMDP problems. Applied to the Long-Term Hydrothermal Dispatch (LTHD) problem using actual power system data from Bolivia, TS-DDR is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
Updated: 2025-02-19 19:07:52
Categories: cs.LG,math.OC,49M37
Asking for Help Enables Safety Guarantees Without Sacrificing Effectiveness
Most reinforcement learning algorithms with regret guarantees rely on a critical assumption: that all errors are recoverable. Recent work by Plaut et al. discarded this assumption and presented algorithms that avoid "catastrophe" (i.e., irreparable errors) by asking for help. However, they provided only safety guarantees and did not consider reward maximization. We prove that any algorithm that avoids catastrophe in their setting also guarantees high reward (i.e., sublinear regret) in any Markov Decision Process (MDP), including MDPs with irreversible costs. This constitutes the first no-regret guarantee for general MDPs. More broadly, our result may be the first formal proof that it is possible for an agent to obtain high reward while becoming self-sufficient in an unknown, unbounded, and high-stakes environment without causing catastrophe or requiring resets.
Updated: 2025-02-19 19:01:39
Categories: cs.LG,cs.AI
DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation
Despite their increasing performance, large language models still tend to reproduce training data, generate several repetitions, and focus on the most common grammatical structures and words. A possible cause is the decoding strategy adopted: the most common ones either consider only the most probable tokens, reducing output diversity, or increase the likelihood of unlikely tokens at the cost of output accuracy and correctness. In this paper, we propose a family of three new decoding methods by leveraging a mathematical analysis of the token probability distribution. In particular, the difference between consecutive, sorted probabilities can be used to avoid incorrect tokens and increase the chance of low-probable but accurate words. Experiments concerning math problem solving, extreme summarization, and the divergent association task show that our approach consistently performs at least as well as current alternatives in terms of quality and diversity.
Updated: 2025-02-19 19:00:02
Categories: cs.CL,cs.AI,cs.LG
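The consecutive-probability-difference idea admits a simple concrete reading: sort the next-token probabilities, cut the distribution at the largest gap between neighbors, and renormalize the kept head. The sketch below is one plausible instantiation under that reading, not necessarily the exact DiffSampling rule; the distribution is made up:

```python
def cutoff_by_largest_drop(probs):
    """Truncate a next-token distribution at the largest gap between
    consecutive sorted probabilities, then renormalize the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    p = [probs[i] for i in order]
    drops = [p[i] - p[i + 1] for i in range(len(p) - 1)]
    cut = drops.index(max(drops)) + 1      # keep tokens before the largest drop
    kept = order[:cut]
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

# Toy distribution: the big gap sits after the second token, so the tail of
# low-probability (likely incorrect) tokens is cut while both plausible
# candidates survive, preserving some diversity.
probs = [0.40, 0.35, 0.10, 0.09, 0.06]
dist = cutoff_by_largest_drop(probs)
```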
Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While currently state-of-the art models are still of limited use for researchers, our results show that AI assisted theoretical physics research may become possible in the near future. We discuss the main obstacles towards this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the data set and score distribution, are available on the website of the dataset tpbench.org.
Updated: 2025-02-19 19:00:00
Categories: cs.LG,astro-ph.CO,cs.AI,hep-ph,hep-th
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID<2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens to generate depends on the complexity of the generation task.
Updated: 2025-02-19 18:59:44
Categories: cs.CV,cs.LG
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and the program levels. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms, for single-threaded and distributed programs, that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that, across diverse LLMs and agentic workloads, Autellix improves the throughput of programs by 4-15x at the same latency compared to state-of-the-art systems such as vLLM.
Updated: 2025-02-19 18:59:30
Categories: cs.LG,cs.AI,cs.DC
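Program-aware prioritization of the kind described can be approximated with a least-attained-service heuristic: among pending calls, serve the one whose program has completed the fewest calls so far. This is an illustrative sketch in that spirit only, not Autellix's actual scheduling policy; all program and call names are made up:

```python
import heapq

def schedule(pending, completed):
    """Order pending LLM calls so that programs with fewer previously
    completed calls go first (a least-attained-service heuristic).
    `pending` is a list of (program, call) pairs; `completed` maps each
    program to its number of already-finished calls."""
    heap = [(completed.get(prog, 0), seq, prog, call)
            for seq, (prog, call) in enumerate(pending)]   # seq breaks ties FIFO
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, prog, call = heapq.heappop(heap)
        order.append((prog, call))
    return order

completed = {"agent_a": 5, "agent_b": 0, "agent_c": 2}
pending = [("agent_a", "call1"), ("agent_b", "call2"), ("agent_c", "call3")]
order = schedule(pending, completed)
```

A young program (agent_b) is served before a long-running one (agent_a), which reduces head-of-line blocking at the program level: short programs finish quickly instead of queueing behind long ones.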
A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects
Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop training-free framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM employs an RGB-D wrist camera and uses visual servoing for control. Our novelty lies in the use of state-of-the-art vision models to reliably compute 3D targets from the wrist image for diverse tasks and under occlusion due to the end-effector. To mitigate occlusion artifacts, we employ vision models to out-paint the end-effector thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module to identify semantic targets (e.g. knobs) and point tracking methods can reliably track interaction sites indicated by user clicks. This training-free method obtains an 85% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method and an imitation learning baseline trained on 1000+ demonstrations by an absolute success rate of 50%.
Updated: 2025-02-19 18:59:17
Subjects: cs.RO,cs.AI,cs.CV,cs.LG
The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent
Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. While the study of multi-index models with Gaussian data in high dimensions has provided analytical insights into the benefits of GD-trained neural networks over kernels, the role of depth in improving sample complexity and generalization in GD-trained networks remains poorly understood. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically less samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms. These findings open the way to further quantitative studies of the crucial role of depth in learning hierarchical structures with deep networks.
Updated: 2025-02-19 18:58:28
Subjects: stat.ML,cs.LG
RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision
Retrieval-augmented generation (RAG) has shown great potential for knowledge-intensive tasks, but its traditional architectures rely on static retrieval, limiting their effectiveness for complex questions that require sequential information-seeking. While agentic reasoning and search offer a more adaptive approach, most existing methods depend heavily on prompt engineering. In this work, we introduce RAG-Gym, a unified optimization framework that enhances information-seeking agents through fine-grained process supervision at each search step. We also propose ReSearch, a novel agent architecture that synergizes answer reasoning and search query generation within the RAG-Gym framework. Experiments on four challenging datasets show that RAG-Gym improves performance by up to 25.6% across various agent architectures, with ReSearch consistently outperforming existing baselines. Further analysis highlights the effectiveness of advanced LLMs as process reward judges and the transferability of trained reward models as verifiers for different LLMs. Additionally, we examine the scaling properties of training and inference in agentic RAG. The project homepage is available at https://rag-gym.github.io/.
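The per-step process supervision can be caricatured as scoring candidate actions with a process reward model and greedily keeping the best one at each search step (a toy sketch: the reward function and candidate strings below are invented, not RAG-Gym's trained reward model):

```python
# Hypothetical process reward: prefer candidates that mention the key
# entity, lightly penalizing longer ones. Purely illustrative.
def process_reward(state, action):
    return (1.0 if "paris" in action.lower() else 0.0) - 0.01 * len(action)

def select_action(state, candidates, reward_fn):
    """Fine-grained process supervision: score every candidate action at
    this step and keep the highest-scoring one."""
    scored = [(reward_fn(state, a), a) for a in candidates]
    return max(scored)[1]

state = "Q: What is the capital of France?"
candidates = ["search: capital of France Paris",
              "search: France",
              "answer: Berlin"]
best = select_action(state, candidates, process_reward)
assert "paris" in best.lower()
```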
Updated: 2025-02-19 18:56:03
Subjects: cs.CL,cs.AI
Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition
Multimodal multi-label emotion recognition (MMER) aims to identify the concurrent presence of multiple emotions in multimodal data. Existing studies primarily focus on improving fusion strategies and modeling modality-to-label dependencies. However, they often overlook the impact of aleatoric uncertainty, which is the inherent noise in the multimodal data and hinders the effectiveness of modality fusion by introducing ambiguity into feature representations. To address this issue and effectively model aleatoric uncertainty, this paper proposes the Latent emotional Distribution Decomposition with Uncertainty perception (LDDU) framework from a novel perspective of latent emotional space probabilistic modeling. Specifically, we introduce a contrastive disentangled distribution mechanism within the emotion space to model the multimodal data, allowing for the extraction of semantic features and uncertainty. Furthermore, we design an uncertainty-aware fusion multimodal method that accounts for the dispersed distribution of uncertainty and integrates distribution information. Experimental results show that LDDU achieves state-of-the-art performance on the CMU-MOSEI and M$^3$ED datasets, highlighting the importance of uncertainty modeling in MMER. Code is available at https://github.com/201983290498/lddu_mmer.git.
Updated: 2025-02-19 18:53:23
Subjects: cs.CL,cs.LG
Neurosymbolic artificial intelligence via large language models and coherence-driven inference
We devise an algorithm to generate sets of propositions that objectively instantiate graphs that support coherence-driven inference. We then benchmark the ability of large language models (LLMs) to reconstruct coherence graphs from (a straightforward transformation of) propositions expressed in natural language, with promising results from a single prompt to models optimized for reasoning. Combining coherence-driven inference with consistency evaluations by neural models may advance the state of the art in machine cognition.
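A coherence graph over propositions can be made concrete as a small constraint-satisfaction sketch: positive edges reward truth assignments that agree, negative edges reward ones that disagree, and inference searches for the assignment maximizing total coherence (a brute-force illustration, not the paper's algorithm):

```python
from itertools import product

def coherence(assignment, edges):
    """Score a truth assignment: a positive edge (i, j, w) contributes w
    when propositions i and j get the same truth value; a negative edge
    contributes |w| when they get different truth values."""
    s = 0.0
    for i, j, w in edges:
        agree = assignment[i] == assignment[j]
        if w > 0 and agree:
            s += w
        elif w < 0 and not agree:
            s += -w
    return s

def best_assignment(n, edges):
    # Exhaustive search over all 2^n assignments (fine for tiny n).
    return max(product([False, True], repeat=n),
               key=lambda a: coherence(a, edges))

# p0 coheres with p1; p2 contradicts p0.
edges = [(0, 1, 1.0), (0, 2, -1.0)]
a = best_assignment(3, edges)
assert a[0] == a[1] and a[0] != a[2]
```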
Updated: 2025-02-19 18:53:16
Subjects: cs.AI
Robotic Table Tennis: A Case Study into a High Speed Learning System
We present a deep-dive into a real-world robotic learning system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description, including numerous design decisions that are typically not widely disseminated, with a collection of studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, sensitivity to policy hyper-parameters, and choice of action space. A video demonstrating the components of the system and details of experimental results can be found at https://youtu.be/uFcnWjB42I0.
Updated: 2025-02-19 18:52:54
Subjects: cs.RO,cs.LG
Dynamic Activation with Knowledge Distillation for Energy-Efficient Spiking NN Ensembles
While foundation AI models excel at tasks like classification and decision-making, their high energy consumption makes them unsuitable for energy-constrained applications. Inspired by the brain's efficiency, spiking neural networks (SNNs) have emerged as a viable alternative due to their event-driven nature and compatibility with neuromorphic chips. This work introduces a novel system that combines knowledge distillation and ensemble learning to bridge the performance gap between artificial neural networks (ANNs) and SNNs. A foundation AI model acts as a teacher network, guiding smaller student SNNs organized into an ensemble, called Spiking Neural Ensemble (SNE). SNE enables the disentanglement of the teacher's knowledge, allowing each student to specialize in predicting a distinct aspect of it, while processing the same input. The core innovation of SNE is the adaptive activation of a subset of SNN models of an ensemble, leveraging knowledge-distillation, enhanced with an informed-partitioning (disentanglement) of the teacher's feature space. By dynamically activating only a subset of these student SNNs, the system balances accuracy and energy efficiency, achieving substantial energy savings with minimal accuracy loss. Moreover, SNE is significantly more efficient than the teacher network, reducing computational requirements by up to 20x with only a 2% drop in accuracy on the CIFAR-10 dataset. This disentanglement procedure achieves an accuracy improvement of up to 2.4% on the CIFAR-10 dataset compared to other partitioning schemes. Finally, we comparatively analyze SNE performance under noisy conditions, demonstrating enhanced robustness compared to its ANN teacher. In summary, SNE offers a promising new direction for energy-constrained applications.
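The adaptive-activation idea can be sketched as running student models one at a time and stopping early once the running ensemble average is confident enough (a toy illustration with stand-in "students"; SNE's actual selection leverages its disentangled partitioning of the teacher's feature space):

```python
import numpy as np

def sne_predict(students, x, conf_threshold=0.8):
    """Dynamically activate students one at a time, maintaining a running
    average of their class-probability outputs, and stop early once the
    ensemble is confident enough -- trading accuracy for energy."""
    avg = None
    for k, student in enumerate(students, start=1):
        p = student(x)
        avg = p if avg is None else avg + (p - avg) / k  # running mean
        if avg.max() >= conf_threshold:
            break
    return avg, k  # prediction and number of students actually run

# Toy students: fixed class distributions standing in for SNN outputs.
students = [lambda x: np.array([0.9, 0.1]),
            lambda x: np.array([0.6, 0.4]),
            lambda x: np.array([0.8, 0.2])]
probs, used = sne_predict(students, None)
assert used == 1 and probs[0] == 0.9  # first student was already confident
```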
Updated: 2025-02-19 18:50:08
Subjects: cs.LG,cs.AI,cs.CV,cs.NE
Selective Reviews of Bandit Problems in AI via a Statistical View
Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes stochastic multi-armed bandit (MAB) and continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration-exploitation trade-offs. Additionally, we explore K-armed contextual bandits and SCAB, focusing on their methodologies and regret analyses. We also examine the connections between SCAB problems and functional data analysis. Finally, we highlight recent advances and ongoing challenges in the field.
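As a concrete instance of the frequentist algorithms such a review covers, here is the textbook UCB1 rule for a stochastic K-armed bandit (standard material, not specific to this paper):

```python
import math
import random

def ucb1(pull, n_arms, horizon, seed=0):
    """UCB1: pull each arm once, then always pull the arm maximizing
    empirical mean + sqrt(2 ln t / n_i), balancing exploration and
    exploitation. Returns the per-arm pull counts."""
    random.seed(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization round
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return counts

# Bernoulli arms with success probabilities 0.2 and 0.8.
arm_probs = [0.2, 0.8]
counts = ucb1(lambda a: 1.0 if random.random() < arm_probs[a] else 0.0,
              n_arms=2, horizon=2000)
assert counts[1] > counts[0]  # the better arm is pulled far more often
```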
Updated: 2025-02-19 18:48:18
Subjects: stat.ML,cs.AI,cs.LG,econ.EM,math.PR
High-Quality 3D Creation from A Single Image Using Subject-Specific Knowledge Prior
In this paper, we address the critical bottleneck in robotics caused by the scarcity of diverse 3D data by presenting a novel two-stage approach for generating high-quality 3D models from a single image. This method is motivated by the need to efficiently expand 3D asset creation, particularly for robotics datasets, where the variety of object types is currently limited compared to general image datasets. Unlike previous methods that primarily rely on general diffusion priors, which often struggle to align with the reference image, our approach leverages subject-specific prior knowledge. By incorporating subject-specific priors in both geometry and texture, we ensure precise alignment between the generated 3D content and the reference object. Specifically, we introduce a shading mode-aware prior into the NeRF optimization process, enhancing the geometry and refining texture in the coarse outputs to achieve superior quality. Extensive experiments demonstrate that our method significantly outperforms prior approaches.
Updated: 2025-02-19 18:45:10
Subjects: cs.CV,cs.AI
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
Updated: 2025-02-19 18:42:45
Subjects: cs.CL,cs.AI,cs.CR
Carefully Blending Adversarial Training, Purification, and Aggregation Improves Adversarial Robustness
In this work, we propose a novel adversarial defence mechanism for image classification - CARSO - blending the paradigms of adversarial training and adversarial purification in a synergistic robustness-enhancing way. The method builds upon an adversarially-trained classifier, and learns to map its internal representation associated with a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from such distribution are classified by the same adversarially-trained model, and a carefully chosen aggregation of its outputs finally constitutes the robust prediction of interest. Experimental evaluation by a well-established benchmark of strong adaptive attacks, across different image datasets, shows that CARSO is able to defend itself against adaptive end-to-end white-box attacks devised for stochastic defences. Paying a modest clean accuracy toll, our method improves by a significant margin the state-of-the-art for Cifar-10, Cifar-100, and TinyImageNet-200 $\ell_\infty$ robust classification accuracy against AutoAttack. Code, and instructions to obtain pre-trained models are available at: https://github.com/emaballarin/CARSO .
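The final aggregation step can be sketched as classifying several tentative clean reconstructions and averaging their softmax outputs (one simple choice of aggregation for illustration; CARSO's aggregation is chosen more carefully, and the toy classifier below is invented):

```python
import numpy as np

def robust_predict(classifier, reconstructions):
    """Classify multiple tentative clean reconstructions of one input and
    aggregate by averaging their softmax outputs."""
    logits = np.stack([classifier(r) for r in reconstructions])
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.mean(axis=0).argmax()

# Toy stand-in classifier: class 1 iff the reconstruction's mean is positive.
clf = lambda x: np.array([-x.mean(), x.mean()])
recs = [np.full(4, 0.5), np.full(4, 0.7), np.full(4, -0.1)]
# Two of three reconstructions point to class 1, so the aggregate does too.
assert robust_predict(clf, recs) == 1
```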
Updated: 2025-02-19 18:39:54
Subjects: cs.CV,cs.AI,cs.CR,cs.LG
Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks
Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naive application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data. Code at https://github.com/BigML-CS-UCLA/MKDT.
Updated: 2025-02-19 18:39:00
Subjects: cs.LG
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or setting the reasoning step's length into a fixed size. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division method provides more decision-making information at each step, enhancing downstream tasks, such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the outcome PRM achieves state-of-the-art Best-of-N performance, surpassing greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study on the PRM's performance, transferability, and generalization capabilities.
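The core idea of dividing reasoning steps at low-confidence positions can be sketched in a few lines (the tokens, confidence values, and threshold below are invented for illustration):

```python
def adaptive_steps(tokens, confidences, threshold=0.5):
    """Split a generated sequence into reasoning steps at positions where
    the model's next-token confidence drops below a threshold, i.e. at
    likely decision points, rather than at fixed delimiters."""
    steps, current = [], []
    for tok, conf in zip(tokens, confidences):
        current.append(tok)
        if conf < threshold:  # low confidence => step boundary
            steps.append(current)
            current = []
    if current:
        steps.append(current)
    return steps

tokens = ["2", "+", "2", "=", "4", "so", "answer", "is", "4"]
confs  = [0.9, 0.9, 0.9, 0.9, 0.3, 0.9, 0.9, 0.9, 0.9]
steps = adaptive_steps(tokens, confs)
assert steps == [["2", "+", "2", "=", "4"], ["so", "answer", "is", "4"]]
```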
Updated: 2025-02-19 18:35:55
Subjects: cs.AI,cs.CL,cs.LG
Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-Review
While reasoning capabilities typically emerge in large language models (LLMs) with tens of billions of parameters, recent research focuses on improving smaller open-source models through knowledge distillation (KD) from commercial LLMs. However, many of these studies rely solely on responses from a single LLM as the gold rationale, unlike the natural human learning process, which involves understanding both the correct answers and the reasons behind mistakes. In this paper, we introduce a novel Fault-Aware DistIllation via Peer-Review (FAIR) approach: 1) Instead of merely obtaining rationales from teachers, our method asks teachers to identify and explain the student's mistakes, providing customized instruction learning data. 2) We design a simulated peer-review process between teacher LLMs, which selects only the generated rationales above the acceptance threshold. This reduces the chance of teachers guessing correctly with flawed rationale, improving instructional data quality. Comprehensive experiments and analysis on mathematical, commonsense, and logical reasoning tasks demonstrate the effectiveness of our method.
Updated: 2025-02-19 18:34:19
Subjects: cs.CL,cs.AI
Image compositing is all you need for data augmentation
This paper investigates the impact of various data augmentation techniques on the performance of object detection models. Specifically, we explore classical augmentation methods, image compositing, and advanced generative models such as Stable Diffusion XL and ControlNet. The objective of this work is to enhance model robustness and improve detection accuracy, particularly when working with limited annotated data. Using YOLOv8, we fine-tune the model on a custom dataset consisting of commercial and military aircraft, applying different augmentation strategies. Our experiments show that image compositing offers the highest improvement in detection performance, as measured by precision, recall, and mean Average Precision (mAP@0.50). Other methods, including Stable Diffusion XL and ControlNet, also demonstrate significant gains, highlighting the potential of advanced data augmentation techniques for object detection tasks. The results underline the importance of dataset diversity and augmentation in achieving better generalization and performance in real-world applications. Future work will explore the integration of semi-supervised learning methods and further optimizations to enhance model performance across larger and more complex datasets.
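Image compositing for detection augmentation can be sketched as pasting a masked foreground onto a background and emitting the new bounding box as a label (a minimal numpy illustration, not the paper's pipeline):

```python
import numpy as np

def composite(background, foreground, mask, top, left):
    """Paste a masked foreground patch onto a background image; return
    the composite plus its bounding box (a new detection label)."""
    out = background.copy()
    h, w = foreground.shape[:2]
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = np.where(mask[..., None],
                                               foreground, region)
    box = (left, top, left + w, top + h)  # x_min, y_min, x_max, y_max
    return out, box

bg = np.zeros((8, 8, 3), dtype=np.uint8)          # blank background
fg = np.full((3, 3, 3), 255, dtype=np.uint8)      # white 3x3 "object"
mask = np.ones((3, 3), dtype=bool)
img, box = composite(bg, fg, mask, top=2, left=4)
assert box == (4, 2, 7, 5)
assert img[3, 5].tolist() == [255, 255, 255] and img[0, 0].tolist() == [0, 0, 0]
```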
Updated: 2025-02-19 18:24:02
Subjects: cs.CV,cs.LG
Explaining the Impact of Training on Vision Models via Activation Clustering
Recent developments in the field of explainable artificial intelligence (XAI) for vision models investigate the information extracted by their feature encoder. We contribute to this effort and propose Neuro-Activated Vision Explanations (NAVE), which extracts the information captured by the encoder by clustering the feature activations of the frozen network to be explained. The method does not aim to explain the model's prediction but to answer questions such as which parts of the image are processed similarly or which information is kept in deeper layers. Experimentally, we leverage NAVE to show that the training dataset and the level of supervision affect which concepts are captured. In addition, our method reveals the impact of registers on vision transformers (ViT) and the information saturation caused by the watermark Clever Hans effect in the training set.
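The clustering of frozen-encoder activations can be sketched with plain k-means over per-location feature vectors, so that locations in the same cluster are those the encoder processes similarly (a toy illustration with synthetic features; NAVE's actual clustering and features may differ):

```python
import numpy as np

def cluster_activations(feats, k=2, iters=20):
    """Cluster per-location feature activations (H x W vectors of dim C)
    with plain k-means, yielding a segmentation-like explanation map."""
    h, w, c = feats.shape
    x = feats.reshape(-1, c).astype(float)
    # Farthest-point initialization keeps this toy example deterministic.
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([((x - ctr) ** 2).sum(1) for ctr in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels.reshape(h, w)

# Two synthetic "regions" with distinct activation patterns.
feats = np.zeros((4, 4, 3))
feats[:, 2:, :] = 5.0
seg = cluster_activations(feats, k=2)
assert (seg[:, :2] == seg[0, 0]).all() and (seg[:, 2:] == seg[0, 3]).all()
assert seg[0, 0] != seg[0, 3]
```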
Updated: 2025-02-19 18:21:07
Subjects: cs.CV,cs.LG
Continually Learning Structured Visual Representations via Network Refinement with Rerelation
The current machine learning paradigm relies on continuous representations like neural networks, which iteratively adjust parameters to approximate outcomes rather than directly learning the structure of the problem. This spreads information across the network, causing issues like information loss and incomprehensibility. Building on prior work in environment dynamics modeling, we propose a method that learns visual space in a structured, continual manner. Our approach refines networks to capture the core structure of objects while representing significant subvariants in structure efficiently. We demonstrate this with 2D shape detection, showing incremental learning on MNIST without overwriting knowledge and creating compact, comprehensible representations. These results offer a promising step toward a transparent, continually learning alternative to traditional neural networks for visual processing.
Updated: 2025-02-19 18:18:27
Subjects: cs.CV,cs.AI,cs.LG
Theoretically Grounded Framework for LLM Watermarking: A Distribution-Adaptive Approach
Watermarking has emerged as a crucial method to distinguish AI-generated text from human-created text. In this paper, we present a novel theoretical framework for watermarking Large Language Models (LLMs) that jointly optimizes both the watermarking scheme and the detection process. Our approach focuses on maximizing detection performance while maintaining control over the worst-case Type-I error and text distortion. We characterize \emph{the universally minimum Type-II error}, showing a fundamental trade-off between watermark detectability and text distortion. Importantly, we identify that the optimal watermarking schemes are adaptive to the LLM generative distribution. Building on our theoretical insights, we propose an efficient, model-agnostic, distribution-adaptive watermarking algorithm, utilizing a surrogate model alongside the Gumbel-max trick. Experiments conducted on Llama2-13B and Mistral-8$\times$7B models confirm the effectiveness of our approach. Additionally, we examine incorporating robustness into our framework, paving a way to future watermarking systems that withstand adversarial attacks more effectively.
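The Gumbel-max trick the paper builds on can be sketched as deriving per-token Gumbel noise pseudo-randomly from a secret key plus recent context, so a detector holding the key can recompute the exact draw (the hash-based seeding and context handling below are illustrative assumptions, not the paper's scheme):

```python
import hashlib
import math
import random

def gumbel_sample(logits, key, context):
    """Gumbel-max sampling with key-seeded noise: argmax(logit + Gumbel)
    draws a token from softmax(logits), and the noise is reproducible
    given the same secret key and context."""
    # Derive a reproducible seed from the key and context (an
    # illustrative choice of seeding).
    seed = int(hashlib.sha256(f"{key}|{context}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    g = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [l + n for l, n in zip(logits, g)]
    return max(range(len(logits)), key=scores.__getitem__)

tok = gumbel_sample([0.1, 0.2, 0.1], key="secret", context="the cat")
# The detector, holding the same key, can recompute the exact draw.
assert tok == gumbel_sample([0.1, 0.2, 0.1], key="secret", context="the cat")
```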
Updated: 2025-02-19 18:18:11
Subjects: cs.CR,cs.IT,cs.LG,math.IT
A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?
Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions-such as temperature setting and prompting method-under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising of over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.
Updated: 2025-02-19 18:15:57
Subjects: cs.CL,cs.AI
AI Thinking as a Meaning-Centered Framework: Reimagining Language Technologies Through Community Agency
While language technologies have advanced significantly, current approaches fail to address the complex sociocultural dimensions of linguistic preservation. AI Thinking proposes a meaning-centered framework that would transform technological development from creating tools FOR communities to co-creating solutions WITH them. This approach recognizes that meaningful solutions emerge through the interplay of cultural understanding, community agency, and technological innovation. The proposal articulates a holistic methodology and a five-layer technological ecosystem where communities maintain control over their linguistic and cultural knowledge representation. This systematic integration of community needs, cultural preservation, and advanced capabilities could revolutionize how we approach linguistic diversity preservation in the digital age.
Updated: 2025-02-19 18:09:24
Subjects: cs.CL,cs.AI
Improving Probabilistic Diffusion Models With Optimal Diagonal Covariance Matching
The probabilistic diffusion model has become highly effective across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed covariance moment matching technique and introduce a novel method for learning the diagonal covariance. Unlike traditional data-driven diagonal covariance approximation approaches, our method involves directly regressing the optimal diagonal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This approach can significantly reduce the approximation error in covariance prediction. We demonstrate how our method can substantially enhance the sampling efficiency, recall rate and likelihood of commonly used diffusion models.
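The flavor of matching an optimal diagonal covariance can be shown on a batch of residuals: under the squared-error objective E[(c_i - r_i^2)^2], the optimal diagonal entry c_i is simply the per-dimension mean of r_i^2 (a drastic simplification of the paper's learned, input-dependent setting, shown only to illustrate the moment-matching idea):

```python
import numpy as np

def ocm_diagonal(residuals):
    """Closed-form minimizer of the batch objective
    sum_i mean_n (c_i - r_{n,i}^2)^2: the per-dimension mean of r^2."""
    return (residuals ** 2).mean(axis=0)

rng = np.random.default_rng(0)
# Zero-mean residuals with true per-dimension standard deviations 1 and 3,
# so the true diagonal covariance is (1, 9).
r = rng.normal(size=(200_000, 2)) * np.array([1.0, 3.0])
c = ocm_diagonal(r)
assert abs(c[0] - 1.0) < 0.05 and abs(c[1] - 9.0) < 0.3
```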
Updated: 2025-02-19 18:08:28
Domains: cs.LG
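Not from the paper itself — a minimal numpy sketch of the core idea above: for a Gaussian denoising distribution, the optimal diagonal covariance is the per-dimension conditional variance, so it can be obtained by direct regression of squared deviations rather than a generic data-driven approximation. The synthetic "posterior" samples and the `denoise_step` helper are illustrative assumptions, not the paper's OCM objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "posterior" samples of x0 given a noisy xt: d-dimensional Gaussian
# with known per-dimension spread. In a real diffusion model these would
# come from the forward process and a learned denoiser.
d, n = 4, 20000
true_std = np.array([0.5, 1.0, 1.5, 2.0])
x0_samples = rng.normal(0.0, true_std, size=(n, d))

# The optimal diagonal covariance of the denoising Gaussian is the
# elementwise variance of x0 | xt. Regressing squared deviations with an
# MSE loss recovers it: argmin_c E[(c - (x0 - mean)^2)^2] is the variance.
mean = x0_samples.mean(axis=0)
sq_dev = (x0_samples - mean) ** 2          # regression targets
learned_diag_cov = sq_dev.mean(axis=0)     # closed-form minimizer of the MSE

def denoise_step(mu, diag_cov, rng):
    """Sample from N(mu, diag(diag_cov)) -- one reverse-diffusion step."""
    return mu + np.sqrt(diag_cov) * rng.normal(size=mu.shape)

x_prev = denoise_step(mean, learned_diag_cov, rng)
```

With enough samples the regressed diagonal recovers the true per-dimension variances, which is the sense in which a direct regression target can beat an indirect approximation.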
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also raised increasing concerns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourced study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.
Updated: 2025-02-19 18:06:37
Domains: cs.CL,cs.AI,cs.HC
Formal verification in Solidity and Move: insights from a comparative analysis
Formal verification plays a crucial role in making smart contracts safer, being able to find bugs or to guarantee their absence, as well as checking whether the business logic is correctly implemented. For Solidity, even though several mature verification tools already exist, the semantic quirks of the language can make verification quite hard in practice. Move, on the other hand, has been designed with security and verification in mind, and it has been accompanied since its early stages by a formal verification tool, the Move Prover. In this paper, we investigate through a comparative analysis: 1) how the different designs of the two contract languages impact verification, and 2) what is the state-of-the-art of verification tools for the two languages, and how they compare on three paradigmatic use cases. Our investigation is supported by an open dataset of verification tasks performed in Certora and in the Aptos Move Prover.
Updated: 2025-02-19 18:06:01
Domains: cs.CR
Using Graph Convolutional Networks to Address fMRI Small Data Problems
Although great advances in the analysis of neuroimaging data have been made, a major challenge is the lack of training data. This is less problematic in tasks such as diagnosis, where ample data exists, but it is particularly acute in harder problems such as predicting treatment responses (prognosis), where data is narrowly focused and hence limited. Here, we address the problem of learning from small data in medical imaging using graph neural networks. This is particularly challenging because the information about the patients is itself a graph (a region-of-interest connectivity graph). We show how a spectral representation of the connectivity data allows for efficient propagation that can yield approximately 12% improvement over traditional deep learning methods using the exact same data. We show that our method's superior performance is due to a data-smoothing effect that can be measured by the reduced number of triangle inequality violations, thereby satisfying transitivity.
Updated: 2025-02-19 18:05:46
Domains: eess.SP,cs.AI,cs.CV,cs.LG
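Not from the paper — a minimal sketch of the kind of spectral propagation the abstract describes: one symmetric-normalized graph convolution H' = D^{-1/2}(A+I)D^{-1/2} H W over a toy ROI connectivity graph. The graph, features, and weights here are synthetic placeholders, not fMRI data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ROI connectivity graph (symmetric binary adjacency) and node features.
n_rois, n_feats = 5, 3
A = rng.random((n_rois, n_rois))
A = ((A + A.T) / 2 > 0.5).astype(float)
np.fill_diagonal(A, 0.0)

def gcn_layer(A, H, W):
    """One spectral propagation step: H' = D^{-1/2} (A+I) D^{-1/2} H W."""
    A_hat = A + np.eye(len(A))                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_norm @ H @ W

H = rng.normal(size=(n_rois, n_feats))               # per-ROI features
W = rng.normal(size=(n_feats, 2))                    # learned weights (random here)
H_out = gcn_layer(A, H, W)
```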
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors in visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate texts that are accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations, and significant gains in vision-centric and general tasks. Notably, these improvements become increasingly pronounced in benchmarks with higher visual dependency. In short, S-VCO offers a significant enhancement of VLM's visually-dependent task performance while retaining or even improving the model's general abilities. We opensource our code at https://s-vco.github.io/
Updated: 2025-02-19 18:05:42
Domains: cs.CV,cs.AI,cs.CL,cs.LG
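Not the paper's actual objective — a hedged toy sketch of a *symmetric* pairwise contrastive loss over a minimal-contrast pair, in the spirit of S-VCO: each caption should score higher with its own image than with the counterfactual image, in both directions. The dot-product `score` function stands in for a VLM's log-likelihood.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def symmetric_contrastive_loss(score, v, t, v_cf, t_cf, beta=1.0):
    """Symmetric contrastive objective over a minimal-contrast pair:
    penalize whenever a caption prefers the counterfactual image."""
    l_fwd = -np.log(sigmoid(beta * (score(t, v) - score(t, v_cf))))
    l_bwd = -np.log(sigmoid(beta * (score(t_cf, v_cf) - score(t_cf, v))))
    return l_fwd + l_bwd

# Toy "model": score = dot product of image and text embeddings.
score = lambda t, v: float(t @ v)
v, t = np.array([1.0, 0.0]), np.array([1.0, 0.0])        # matched pair
v_cf, t_cf = np.array([0.0, 1.0]), np.array([0.0, 1.0])  # counterfactual pair

loss_aligned = symmetric_contrastive_loss(score, v, t, v_cf, t_cf)
loss_swapped = symmetric_contrastive_loss(score, v_cf, t, v, t_cf)
```

A model whose embeddings respect the visual contrast (aligned case) incurs a lower loss than one that has the pairs crossed.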
I Want 'Em All (At Once) -- Ultrametric Cluster Hierarchies
Hierarchical clustering is a powerful tool for exploratory data analysis, organizing data into a tree of clusterings from which a partition can be chosen. This paper generalizes these ideas by proving that, for any reasonable hierarchy, one can optimally solve any center-based clustering objective over it (such as $k$-means). Moreover, these solutions can be found exceedingly quickly and are themselves necessarily hierarchical. Thus, given a cluster tree, we show that one can quickly access a plethora of new, equally meaningful hierarchies. Just as in standard hierarchical clustering, one can then choose any desired partition from these new hierarchies. We conclude by verifying the utility of our proposed techniques across datasets, hierarchies, and partitioning schemes.
Updated: 2025-02-19 18:03:52
Domains: cs.LG
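Not from the paper — a small sketch of the central claim above: given a cluster tree, the optimal k-cluster partition (for a center-based objective such as k-means) that respects the tree can be found by dynamic programming over subtrees. The tree encoding and the recursion are illustrative assumptions.

```python
import numpy as np

def sse(points):
    """Within-cluster sum of squared errors (the k-means cost of one cluster)."""
    c = points.mean(axis=0)
    return float(((points - c) ** 2).sum())

def leaf_indices(tree):
    """A tree node is either a list of point indices (leaf) or a (left, right) pair."""
    if isinstance(tree, list):
        return tree
    return leaf_indices(tree[0]) + leaf_indices(tree[1])

def best_partition_cost(tree, points, k, memo=None):
    """Minimum k-means cost over all k-cluster partitions that respect the tree:
    either keep this subtree as one cluster, or split k between its children."""
    if memo is None:
        memo = {}
    key = (id(tree), k)
    if key in memo:
        return memo[key]
    if k == 1:
        cost = sse(points[leaf_indices(tree)])
    elif isinstance(tree, list):          # a leaf cannot be split further
        cost = float("inf")
    else:
        left, right = tree
        cost = min(
            best_partition_cost(left, points, k1, memo)
            + best_partition_cost(right, points, k - k1, memo)
            for k1 in range(1, k)
        )
    memo[key] = cost
    return cost

# Two well-separated groups; the tree splits them at the top level.
points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
tree = ([0, 1, 2], [3, 4, 5])
cost1 = best_partition_cost(tree, points, 1)
cost2 = best_partition_cost(tree, points, 2)
```

Cutting the tree at the top (k = 2) recovers the two groups at a tiny cost, while forcing a single cluster (k = 1) pays the full between-group spread.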
Bayesian Comparisons Between Representations
Which neural networks are similar is a fundamental question for both machine learning and neuroscience. Here, I propose to base comparisons on the predictive distributions of linear readouts from intermediate representations. In Bayesian statistics, the prior predictive distribution is a full description of the inductive bias and generalization of a model, making it a great basis for comparisons. This distribution directly gives the evidence a dataset would provide in favor of the model. If we want to compare multiple models to each other, we can use a metric for probability distributions like the Jensen-Shannon distance or the total variation distance. As these are metrics, they induce pseudo-metrics for representations, which measure how well two representations could be distinguished based on a linear readout. For a linear readout with a Gaussian prior on the readout weights and Gaussian noise, we can analytically compute the (prior and posterior) predictive distributions without approximations. These distributions depend only on the linear kernel matrix of the representations in the model. Thus, the Bayesian metrics connect linear-readout-based comparisons to kernel-based metrics like centered kernel alignment and representational similarity analysis. I demonstrate the new methods with deep neural networks trained on ImageNet-1k, comparing them to each other and to a small subset of the Natural Scenes Dataset. The Bayesian comparisons broadly agree with existing metrics, but are more stringent. Empirically, evaluations vary less across different random image samples and yield informative results with full uncertainty information. Thus the proposed Bayesian metrics nicely extend our toolkit for comparing representations.
Updated: 2025-02-19 17:55:44
Domains: cs.LG,q-bio.QM
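Not from the paper — a minimal sketch of the setup above: the prior predictive of a linear readout y = Φw + ε with Gaussian prior and noise is a zero-mean Gaussian whose covariance depends on the representation only through its linear kernel ΦΦᵀ. The paper uses Jensen-Shannon or total-variation distances; since those lack closed forms for Gaussians, this sketch substitutes the symmetrized KL divergence (which is not a metric) as an illustrative stand-in.

```python
import numpy as np

def prior_predictive_cov(Phi, sigma_w=1.0, sigma_n=0.1):
    """Prior predictive covariance of y = Phi w + eps with
    w ~ N(0, sigma_w^2 I) and eps ~ N(0, sigma_n^2 I)."""
    n = Phi.shape[0]
    return sigma_w**2 * (Phi @ Phi.T) + sigma_n**2 * np.eye(n)

def sym_kl_gaussians(S1, S2):
    """Symmetrized KL between N(0, S1) and N(0, S2) -- closed-form stand-in
    for the (metric) distribution distances used in the paper."""
    n = S1.shape[0]
    kl12 = 0.5 * (np.trace(np.linalg.solve(S2, S1)) - n
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
    kl21 = 0.5 * (np.trace(np.linalg.solve(S1, S2)) - n
                  + np.log(np.linalg.det(S1) / np.linalg.det(S2)))
    return kl12 + kl21

rng = np.random.default_rng(0)
Phi_a = rng.normal(size=(6, 4))       # representation A on 6 stimuli
Phi_b = rng.normal(size=(6, 8))       # representation B on the same stimuli

d_ab = sym_kl_gaussians(prior_predictive_cov(Phi_a), prior_predictive_cov(Phi_b))
d_aa = sym_kl_gaussians(prior_predictive_cov(Phi_a), prior_predictive_cov(Phi_a))
```

A representation is at distance zero from itself, and representations with different kernels are separated, which is the pseudo-metric behavior the abstract describes.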
Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis
Recent advances in code generation have illuminated the potential of employing large language models (LLMs) for general-purpose programming languages such as Python and C++, opening new opportunities for automating software development and enhancing programmer productivity. The potential of LLMs in software programming has sparked significant interest in exploring automated hardware generation and automation. Although preliminary endeavors have been made to adopt LLMs in generating hardware description languages (HDLs), several challenges persist in this direction. First, the volume of available HDL training data is substantially smaller compared to that for software programming languages. Second, the pre-trained LLMs, mainly tailored for software code, tend to produce HDL designs that are more error-prone. Third, the generation of HDL requires a significantly higher number of tokens compared to software programming, leading to inefficiencies in cost and energy consumption. To tackle these challenges, this paper explores leveraging LLMs to generate High-Level Synthesis (HLS)-based hardware design. Although code generation for domain-specific programming languages is not new in the literature, we aim to provide experimental results, insights, benchmarks, and evaluation infrastructure to investigate the suitability of HLS over low-level HDLs for LLM-assisted hardware design generation. To achieve this, we first finetune pre-trained models for HLS-based hardware generation, using a collected dataset with text prompts and corresponding reference HLS designs. An LLM-assisted framework is then proposed to automate end-to-end hardware code generation, which also investigates the impact of chain-of-thought prompting and feedback-loop techniques on HLS design generation. Limited by the timeframe of this research, we plan to evaluate more advanced reasoning models in the future.
Updated: 2025-02-19 17:53:59
Domains: cs.LG,cs.AR,cs.SE
Playing Hex and Counter Wargames using Reinforcement Learning and Recurrent Neural Networks
Hex and Counter Wargames are adversarial two-player simulations of real military conflicts requiring complex strategic decision-making. Unlike classical board games, these games feature intricate terrain/unit interactions, unit stacking, large maps of varying sizes, and simultaneous move and combat decisions involving hundreds of units. This paper introduces a novel system designed to address the strategic complexity of Hex and Counter Wargames by integrating cutting-edge advancements in Recurrent Neural Networks with AlphaZero, a reliable modern Reinforcement Learning algorithm. The system utilizes a new Neural Network architecture developed from existing research, incorporating innovative state and action representations tailored to these specific game environments. With minimal training, our solution has shown promising results in typical scenarios, demonstrating the ability to generalize across different terrain and tactical situations. Additionally, we explore the system's potential to scale to larger map sizes. The developed system is openly accessible, facilitating continued research and exploration within this challenging domain.
Updated: 2025-02-19 17:52:45
Domains: cs.LG,I.2.6
MotifBench: A standardized protein design benchmark for motif-scaffolding problems
The motif-scaffolding problem is a central task in computational protein design: Given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif-scaffolding has been made due to computational evaluation with reliable protein structure prediction and fixed-backbone sequence design methods. However, significant variability in evaluation strategies across publications has hindered comparability of results, challenged reproducibility, and impeded robust progress. In response we introduce MotifBench, comprising (1) a precisely specified pipeline and evaluation metrics, (2) a collection of 30 benchmark problems, and (3) an implementation of this benchmark and leaderboard at github.com/blt2114/MotifBench. The MotifBench test cases are more difficult compared to earlier benchmarks, and include protein design problems for which solutions are known but on which, to the best of our knowledge, state-of-the-art methods fail to identify any solution.
Updated: 2025-02-19 17:51:50
Domains: cs.LG,q-bio.BM
How Do LLMs Perform Two-Hop Reasoning in Context?
"Socrates is human. All humans are mortal. Therefore, Socrates is mortal." This classical example demonstrates two-hop reasoning, where a conclusion logically follows from two connected premises. While transformer-based Large Language Models (LLMs) can make two-hop reasoning, they tend to collapse to random guessing when faced with distracting premises. To understand the underlying mechanism, we train a three-layer transformer on synthetic two-hop reasoning tasks. The training dynamics show two stages: a slow learning phase, where the 3-layer transformer performs random guessing like LLMs, followed by an abrupt phase transitions, where the 3-layer transformer suddenly reaches $100%$ accuracy. Through reverse engineering, we explain the inner mechanisms for how models learn to randomly guess between distractions initially, and how they learn to ignore distractions eventually. We further propose a three-parameter model that supports the causal claims for the mechanisms to the training dynamics of the transformer. Finally, experiments on LLMs suggest that the discovered mechanisms generalize across scales. Our methodologies provide new perspectives for scientific understandings of LLMs and our findings provide new insights into how reasoning emerges during training.
Updated: 2025-02-19 17:46:30
Domains: cs.CL,cs.AI
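Not from the paper — a minimal sketch of the kind of synthetic two-hop task the abstract describes: a bridging chain e1 → e2 → e3 plus distracting second-hop premises, with a symbolic solver used only to check the generated label. The entity naming and premise templates are illustrative assumptions.

```python
import random

def make_two_hop_example(rng, n_distractors=2):
    """One in-context two-hop instance: a bridging chain e1 -> e2 -> e3
    plus distracting premises whose subjects never appear as a first hop."""
    entities = [f"ent{i}" for i in range(10)]
    rng.shuffle(entities)
    e1, e2, e3 = entities[:3]
    premises = [f"{e1} is {e2}.", f"All {e2} are {e3}."]
    for d in entities[3:3 + n_distractors]:
        premises.append(f"All {d} are {entities[3 + n_distractors]}.")
    rng.shuffle(premises)
    question = f"What is {e1}?"
    return " ".join(premises), question, e3

def solve_two_hop(context, question):
    """Follow the chain symbolically to verify the generated label."""
    facts = [s.strip() for s in context.split(".") if s.strip()]
    subject = question.split()[-1].rstrip("?")
    mid = next(f.split(" is ")[1] for f in facts if f.startswith(f"{subject} is "))
    return next(f.split(" are ")[1] for f in facts if f.startswith(f"All {mid} are "))

rng = random.Random(0)
ctx, q, answer = make_two_hop_example(rng)
```

A model trained on such instances must learn to chase the bridge entity through the distractors, which is exactly the behavior the phase transition in the abstract concerns.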
Partially Observable Gaussian Process Network and Doubly Stochastic Variational Inference
To reduce the curse of dimensionality for Gaussian processes (GP), they can be decomposed into a Gaussian Process Network (GPN) of coupled subprocesses with lower dimensionality. In some cases, intermediate observations are available within the GPN. However, intermediate observations are often indirect, noisy, and incomplete in most real-world systems. This work introduces the Partially Observable Gaussian Process Network (POGPN) to model real-world process networks. We model a joint distribution of latent functions of subprocesses and make inferences using observations from all subprocesses. POGPN incorporates observation lenses (observation likelihoods) into the well-established inference method of deep Gaussian processes. We also introduce two training methods for POGPN to make inferences on the whole network using node observations. The application to benchmark problems demonstrates how incorporating partial observations during training and inference can improve the predictive performance of the overall network, offering a promising outlook for its practical application.
Updated: 2025-02-19 17:39:46
Domains: cs.LG,cs.AI
SIFT: Grounding LLM Reasoning in Contexts via Stickers
This paper identifies that misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase "10 dollars per kilo," LLMs might not recognize that "per" means "for each," leading to calculation errors. We introduce a novel, post-training approach called **Stick to the Facts (SIFT)** to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the *Sticker*, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions -- one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via *forward* optimization (to better align the extracted facts with the query) and *inverse* generation (to conform with the model's inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to **85.67**%, establishing a new state-of-the-art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.
Updated: 2025-02-19 17:38:46
Domains: cs.CL,cs.AI
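Not from the paper — a rough control-flow sketch of the SIFT loop described above: draft a Sticker, compare the predictions with and without it, and refine while they disagree. The `generate` argument is a placeholder for a real LLM call; the `stub_llm` below is a deterministic stand-in (it "misreads" the per-kilo price unless the prompt is grounded by key facts), and the refinement prompts are invented for illustration.

```python
def sift_predict(generate, query, max_refinements=2):
    """Sketch of the SIFT loop: draft a Sticker, compare predictions with and
    without it, and refine while they disagree."""
    sticker = generate(f"Extract the key facts from: {query}")
    for _ in range(max_refinements):
        pred_plain = generate(query)
        pred_stick = generate(f"{query}\nKey facts: {sticker}")
        if pred_plain == pred_stick:
            return pred_stick                    # consensus reached
        # Forward refinement: re-align the sticker with the query.
        sticker = generate(f"Revise these facts to match the question: {sticker}")
    return pred_stick                            # fall back to the grounded answer

def stub_llm(prompt):
    # Deterministic stand-in for a real model: grounded prompts get the
    # arithmetic right, plain prompts misread "per kilo".
    if "Key facts" in prompt or "Extract" in prompt or "Revise" in prompt:
        return "20 dollars"
    return "10 dollars"

result = sift_predict(stub_llm, "Apples cost 10 dollars per kilo. Price of 2 kilos?")
```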
Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning
We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving near-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - \gamma)^{- 7 / 2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.
Updated: 2025-02-19 17:32:35
Domains: cs.LG
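Not from the paper — a minimal sketch of the first of the two classic techniques the abstract combines: the additive elliptical exploration bonus b(s, a) = β √(φᵀ Λ⁻¹ φ) standard in linear MDPs, where Λ is the ridge-regularized design matrix of observed features. The artificial absorbing-state transitions are not shown.

```python
import numpy as np

def update_design(Lambda, phi):
    """Rank-one update of the regularized design matrix: Lambda += phi phi^T."""
    return Lambda + np.outer(phi, phi)

def exploration_bonus(Lambda, phi, beta=1.0):
    """Additive optimism bonus b(s, a) = beta * sqrt(phi^T Lambda^{-1} phi)."""
    return beta * float(np.sqrt(phi @ np.linalg.solve(Lambda, phi)))

d = 3
Lambda = np.eye(d)                      # ridge-regularized design matrix
phi = np.array([1.0, 0.0, 0.0])         # feature of some (state, action) pair

b_before = exploration_bonus(Lambda, phi)
for _ in range(10):                     # observe the same feature repeatedly
    Lambda = update_design(Lambda, phi)
b_after = exploration_bonus(Lambda, phi)
```

The bonus shrinks along directions of feature space that have been visited often, which is what drives optimism toward under-explored directions.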
AI-Driven Discovery of High Performance Polymer Electrodes for Next-Generation Batteries
The use of transition group metals in electric batteries requires extensive usage of critical elements like lithium, cobalt and nickel, which poses significant environmental challenges. Replacing these metals with redox-active organic materials offers a promising alternative, thereby reducing the carbon footprint of batteries by one order of magnitude. However, this approach faces critical obstacles, including the limited availability of suitable redox-active organic materials and issues such as lower electronic conductivity, voltage, specific capacity, and long-term stability. To overcome the limitations for lower voltage and specific capacity, a machine learning (ML) driven battery informatics framework is developed and implemented. This framework utilizes an extensive battery dataset and advanced ML techniques to accelerate and enhance the identification, optimization, and design of redox-active organic materials. In this contribution, a data-fusion ML coupled meta learning model capable of predicting the battery properties, voltage and specific capacity, for various organic negative electrodes and charge carriers (positive electrode materials) combinations is presented. The ML models accelerate experimentation, facilitate the inverse design of battery materials, and identify suitable candidates from three extensive material libraries to advance sustainable energy-storage technologies.
Updated: 2025-02-19 17:32:17
Domains: cond-mat.mtrl-sci,cs.LG,physics.app-ph
DataSciBench: An LLM Agent Benchmark for Data Science
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at https://github.com/THUDM/DataSciBench.
Updated: 2025-02-19 17:31:51
Domains: cs.CL,cs.AI,cs.LG
Geometric Principles for Machine Learning of Dynamical Systems
Mathematical descriptions of dynamical systems are deeply rooted in topological spaces defined by non-Euclidean geometry. This paper proposes leveraging structure-rich geometric spaces for machine learning to achieve structural generalization when modeling physical systems from data, in contrast to embedding physics bias within model-free architectures. We consider model generalization to be a function of symmetry, invariance and uniqueness, defined as a topological mapping from state space dynamics to the parameter space. We illustrate this view through the machine learning of linear time-invariant dynamical systems, whose dynamics reside on the symmetric positive definite manifold.
Updated: 2025-02-19 17:28:40
Domains: cs.LG
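Not from the paper — a small worked example of the connection the abstract invokes between linear time-invariant dynamics and the symmetric positive definite (SPD) manifold: for a stable (Hurwitz) system ẋ = Ax, the Lyapunov equation AᵀP + PA = −Q with Q SPD has a unique SPD solution P. The solver below vectorizes the equation with Kronecker products instead of calling a library routine.

```python
import numpy as np

def lyapunov_solution(A, Q):
    """Solve A^T P + P A = -Q by vectorization:
    (I (x) A^T + A^T (x) I) vec(P) = -vec(Q)."""
    n = A.shape[0]
    M = np.kron(np.eye(n), A.T) + np.kron(A.T, np.eye(n))
    P = np.linalg.solve(M, -Q.flatten()).reshape(n, n)
    return 0.5 * (P + P.T)              # symmetrize against round-off

A = np.array([[-1.0, 0.5], [0.0, -2.0]])   # Hurwitz (stable) dynamics
Q = np.eye(2)
P = lyapunov_solution(A, Q)
```

The SPD certificate P is the geometric object living on the manifold the abstract refers to; stability of A is exactly what places P on the SPD cone.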
Mesh-based Super-Resolution of Fluid Flows with Multiscale Graph Neural Networks
A graph neural network (GNN) approach is introduced in this work which enables mesh-based three-dimensional super-resolution of fluid flows. In this framework, the GNN is designed to operate not on the full mesh-based field at once, but on localized meshes of elements (or cells) directly. To facilitate mesh-based GNN representations in a manner similar to spectral (or finite) element discretizations, a baseline GNN layer (termed a message passing layer, which updates local node properties) is modified to account for synchronization of coincident graph nodes, rendering compatibility with commonly used element-based mesh connectivities. The architecture is multiscale in nature, and is comprised of a combination of coarse-scale and fine-scale message passing layer sequences (termed processors) separated by a graph unpooling layer. The coarse-scale processor embeds a query element (alongside a set number of neighboring coarse elements) into a single latent graph representation using coarse-scale synchronized message passing over the element neighborhood, and the fine-scale processor leverages additional message passing operations on this latent graph to correct for interpolation errors. Demonstration studies are performed using hexahedral mesh-based data from Taylor-Green Vortex and backward-facing step flow simulations at Reynolds numbers of 1600 and 3200. Through analysis of both global and local errors, the results ultimately show how the GNN is able to produce accurate super-resolved fields compared to targets in both coarse-scale and multiscale model configurations. Reconstruction errors for fixed architectures were found to increase in proportion to the Reynolds number. Geometry extrapolation studies on a separate cavity flow configuration show promising cross-mesh capabilities of the super-resolution strategy.
Updated: 2025-02-19 17:27:58
Domains: physics.flu-dyn,cs.CE,cs.LG,physics.comp-ph
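Not from the paper — a minimal sketch of the coincident-node synchronization step the abstract describes: mesh nodes shared by neighboring elements appear as duplicate graph nodes, and synchronization averages their features so every duplicate carries the same value. The 1D two-element mesh below is an illustrative toy; in the paper this happens inside message-passing layers on 3D hexahedral meshes.

```python
import numpy as np

def synchronize_coincident(features, node_to_global):
    """Average the features of coincident graph nodes (mesh nodes shared by
    neighboring elements) and scatter the result back to every duplicate."""
    n_global = node_to_global.max() + 1
    sums = np.zeros((n_global, features.shape[1]))
    counts = np.zeros(n_global)
    np.add.at(sums, node_to_global, features)   # unbuffered scatter-add
    np.add.at(counts, node_to_global, 1.0)
    return (sums / counts[:, None])[node_to_global]

# Two 1D "elements" sharing global node 1: local nodes map to globals [0,1,1,2].
node_to_global = np.array([0, 1, 1, 2])
features = np.array([[1.0], [2.0], [4.0], [5.0]])
synced = synchronize_coincident(features, node_to_global)
```

After synchronization the two copies of the shared node agree (here both become 3.0), which is what renders per-element message passing compatible with element-based mesh connectivities.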
Multilingual Non-Factoid Question Answering with Answer Paragraph Selection
Most existing Question Answering Datasets (QuADs) primarily focus on factoid-based short-context Question Answering (QA) in high-resource languages. However, the scope of such datasets for low-resource languages remains limited, with only a few works centered on factoid-based QuADs and none on non-factoid QuADs. Therefore, this work presents MuNfQuAD, a multilingual QuAD with non-factoid questions. It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 578K QA pairs across 38 languages, encompassing several low-resource languages, and stands as the largest multilingual QA dataset to date. Based on the manual annotations of 790 QA pairs from MuNfQuAD (golden set), we observe that 98% of questions can be answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines. The APS model attained an accuracy of 80% and 72%, as well as a macro F1 of 72% and 66%, on the MuNfQuAD test set and the golden set, respectively. Furthermore, the APS model effectively generalizes to a particular language within the golden set, even after being fine-tuned on silver labels. We also observe that the fine-tuned APS model is beneficial for reducing the context of a question. These findings suggest that this resource would be a valuable contribution to the QA research community.
Updated: 2025-02-19 17:25:39
Subjects: cs.CL,cs.AI,cs.IR,cs.LG
Cyber security of OT networks: A tutorial and overview
This manuscript explores the cybersecurity challenges of Operational Technology (OT) networks, focusing on their critical role in industrial environments such as manufacturing, energy, and utilities. As OT systems increasingly integrate with Information Technology (IT) systems due to Industry 4.0 initiatives, they become more vulnerable to cyberattacks, which pose risks not only to data but also to physical infrastructure. The study examines key components of OT systems, such as SCADA (Supervisory Control and Data Acquisition), PLCs (Programmable Logic Controllers), and RTUs (Remote Terminal Units), and analyzes recent cyberattacks targeting OT environments. Furthermore, it highlights the security concerns arising from the convergence of IT and OT systems, examining attack vectors and the growing threats posed by malware, ransomware, and nation-state actors. Finally, the paper discusses modern approaches and tools used to secure these environments, providing insights into improving the cybersecurity posture of OT networks.
Updated: 2025-02-19 17:23:42
Subjects: cs.CR
EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.
Updated: 2025-02-19 17:22:17
Subjects: cs.CV,cs.LG
MetaSSC: Enhancing 3D Semantic Scene Completion for Autonomous Driving through Meta-Learning and Long-sequence Modeling
Semantic scene completion (SSC) is essential for achieving comprehensive perception in autonomous driving systems. However, existing SSC methods often overlook the high deployment costs in real-world applications. Traditional architectures, such as 3D Convolutional Neural Networks (3D CNNs) and self-attention mechanisms, face challenges in efficiently capturing long-range dependencies within 3D voxel grids, limiting their effectiveness. To address these issues, we introduce MetaSSC, a novel meta-learning-based framework for SSC that leverages deformable convolution, large-kernel attention, and the Mamba (D-LKA-M) model. Our approach begins with a voxel-based semantic segmentation (SS) pretraining task, aimed at exploring the semantics and geometry of incomplete regions while acquiring transferable meta-knowledge. Using simulated cooperative perception datasets, we supervise the perception training of a single vehicle using aggregated sensor data from multiple nearby connected autonomous vehicles (CAVs), generating richer and more comprehensive labels. This meta-knowledge is then adapted to the target domain through a dual-phase training strategy that does not add extra model parameters, enabling efficient deployment. To further enhance the model's capability in capturing long-sequence relationships within 3D voxel grids, we integrate Mamba blocks with deformable convolution and large-kernel attention into the backbone network. Extensive experiments demonstrate that MetaSSC achieves state-of-the-art performance, significantly outperforming competing models while also reducing deployment costs.
Updated: 2025-02-19 17:21:53
Subjects: cs.CV,cs.AI
Slamming: Training a Speech Language Model on One GPU in a Day
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute-optimal performance, giving an optimistic view of SLM feasibility. See code, data, models, and samples at https://pages.cs.huji.ac.il/adiyoss-lab/slamming .
Updated: 2025-02-19 17:21:15
Subjects: cs.LG,cs.AI,cs.CL,cs.SD,eess.AS
Highly Dynamic and Flexible Spatio-Temporal Spectrum Management with AI-Driven O-RAN: A Multi-Granularity Marketplace Framework
Current spectrum-sharing frameworks struggle with adaptability, often being either static or insufficiently dynamic. They primarily emphasize temporal sharing while overlooking spatial and spectral dimensions. We propose an adaptive, AI-driven spectrum-sharing framework within the O-RAN architecture, integrating discriminative and generative AI (GenAI) to forecast spectrum needs across multiple timescales and spatial granularities. A marketplace model, managed by an authorized spectrum broker, enables operators to trade spectrum dynamically, balancing static assignments with real-time trading. GenAI enhances traffic prediction, spectrum estimation, and allocation, optimizing utilization while reducing costs. This modular, flexible approach fosters operator collaboration, maximizing efficiency and revenue. A key research challenge is refining allocation granularity and spatio-temporal dynamics beyond existing models.
Updated: 2025-02-19 17:21:10
Subjects: eess.SY,cs.LG,cs.SY
Refining embeddings with fill-tuning: data-efficient generalised performance improvements for materials foundation models
Pretrained foundation models learn embeddings that can be used for a wide range of downstream tasks. These embeddings optimise general performance, and if insufficiently accurate at a specific task the model can be fine-tuned to improve performance. For all current methodologies this operation necessarily degrades performance on all out-of-distribution tasks. In this work we present 'fill-tuning', a novel methodology to generate datasets for continued pretraining of foundation models that are not suited to a particular downstream task, but instead aim to correct poor regions of the embedding. We present the application of roughness analysis to latent space topologies and illustrate how it can be used to propose data that will be most valuable to improving the embedding. We apply fill-tuning to a set of state-of-the-art materials foundation models trained on $O(10^9)$ data points and show model improvement of almost 1% in all downstream tasks with the addition of only 100 data points. This method provides a route to the general improvement of foundation models at the computational cost of fine-tuning.
Updated: 2025-02-19 17:17:13
Subjects: cs.LG,cs.CE
Causal Temporal Regime Structure Learning
Understanding causal relationships in multivariate time series is essential for predicting and controlling dynamic systems in fields like economics, neuroscience, and climate science. However, existing causal discovery methods often assume stationarity, limiting their effectiveness when time series consist of sequential regimes: consecutive temporal segments with unknown boundaries and changing causal structures. In this work, we first introduce a framework to describe and model such time series. Then, we present CASTOR, a novel method that concurrently learns the Directed Acyclic Graph (DAG) for each regime while determining the number of regimes and their sequential arrangement. CASTOR optimizes the data log-likelihood using an expectation-maximization algorithm, alternating between assigning regime indices (expectation step) and inferring causal relationships in each regime (maximization step). We establish the identifiability of the regimes and DAGs within our framework. Extensive experiments show that CASTOR consistently outperforms existing causal discovery models in detecting different regimes and learning their DAGs across various settings, including linear and nonlinear causal relationships, on both synthetic and real-world datasets.
Updated: 2025-02-19 17:09:47
Subjects: cs.LG,cs.AI,stat.ME
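The expectation-maximization alternation described above can be illustrated on a deliberately simplified toy in which each regime is a single AR(1) coefficient rather than a DAG. The sketch below shows only the assign/refit loop, not CASTOR's causal structure learning:

```python
import numpy as np

def assign_and_refit(segments, coefs, n_iter=5):
    """EM-style alternation on an AR(1) toy: the E-step assigns each segment
    to the regime whose coefficient yields the smallest one-step prediction
    error; the M-step refits each regime's coefficient by least squares."""
    labels = []
    for _ in range(n_iter):
        # E-step: pick the best-fitting regime for every segment
        labels = [
            int(np.argmin([np.mean((s[1:] - c * s[:-1]) ** 2) for c in coefs]))
            for s in segments
        ]
        # M-step: least-squares refit of each regime's AR coefficient
        for k in range(len(coefs)):
            xs = np.concatenate([s[:-1] for s, l in zip(segments, labels) if l == k] or [np.zeros(0)])
            ys = np.concatenate([s[1:] for s, l in zip(segments, labels) if l == k] or [np.zeros(0)])
            if xs.size and xs @ xs > 0:
                coefs[k] = float(xs @ ys / (xs @ xs))
    return labels, coefs

# Two noise-free segments generated by AR(1) coefficients 0.9 and -0.5
seg_a = np.array([1.0, 0.9, 0.81, 0.729])
seg_b = np.array([1.0, -0.5, 0.25, -0.125])
labels, coefs = assign_and_refit([seg_a, seg_b], [0.8, -0.4])
# labels -> [0, 1]; coefs converge to the generating values 0.9 and -0.5
```

On noise-free data the M-step recovers the generating coefficients exactly after one pass; the full method additionally learns a DAG per regime and the number of regimes, which this toy omits.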
ArrayBot: Reinforcement Learning for Generalizable Distributed Manipulation through Touch
We present ArrayBot, a distributed manipulation system consisting of a $16 \times 16$ array of vertically sliding pillars integrated with tactile sensors, which can simultaneously support, perceive, and manipulate the tabletop objects. Towards generalizable distributed manipulation, we leverage reinforcement learning (RL) algorithms for the automatic discovery of control policies. In the face of the massively redundant actions, we propose to reshape the action space by considering the spatially local action patch and the low-frequency actions in the frequency domain. With this reshaped action space, we train RL agents that can relocate diverse objects through tactile observations only. Surprisingly, we find that the discovered policy can not only generalize to unseen object shapes in the simulator but also transfer to the physical robot without any domain randomization. Leveraging the deployed policy, we present abundant real-world manipulation tasks, illustrating the vast potential of RL on ArrayBot for distributed manipulation.
Updated: 2025-02-19 17:09:21
Subjects: cs.RO,cs.LG
ACROSS: A Deformation-Based Cross-Modal Representation for Robotic Tactile Perception
Tactile perception is essential for human interaction with the environment and is becoming increasingly crucial in robotics. Tactile sensors like the BioTac mimic human fingertips and provide detailed interaction data. Despite its utility in applications like slip detection and object identification, this sensor is now deprecated, making many valuable datasets obsolete. However, recreating similar datasets with newer sensor technologies is both tedious and time-consuming. Therefore, adapting these existing datasets for use with new setups and modalities is crucial. In response, we introduce ACROSS, a novel framework for translating data between tactile sensors by exploiting sensor deformation information. We demonstrate the approach by translating BioTac signals into the DIGIT sensor. Our framework consists of first converting the input signals into 3D deformation meshes. We then transition from the 3D deformation mesh of one sensor to the mesh of another, and finally convert the generated 3D deformation mesh into the corresponding output space. We demonstrate our approach on the most challenging problem of going from a low-dimensional tactile representation to a high-dimensional one. In particular, we transfer the tactile signals of a BioTac sensor to DIGIT tactile images. Our approach enables the continued use of valuable datasets and data exchange between groups with different setups.
Updated: 2025-02-19 17:08:50
Subjects: cs.RO,cs.AI
PSCon: Toward Conversational Product Search
Conversational Product Search (CPS) is confined to simulated conversations due to the lack of real-world CPS datasets that reflect human-like language. Additionally, current conversational datasets are limited to support cross-market and multi-lingual usage. In this paper, we introduce a new CPS data collection protocol and present PSCon, a novel CPS dataset designed to assist product search via human-like conversations. The dataset is constructed using a coached human-to-human data collection protocol and supports two languages and dual markets. Also, the dataset enables thorough exploration of six subtasks of CPS: user intent detection, keyword extraction, system action prediction, question selection, item ranking, and response generation. Furthermore, we also offer an analysis of the dataset and propose a benchmark model on the proposed CPS dataset.
Updated: 2025-02-19 17:05:42
Subjects: cs.CL,cs.AI,cs.IR
MEX: Memory-efficient Approach to Referring Multi-Object Tracking
Referring Multi-Object Tracking (RMOT) is a relatively new concept that has rapidly gained traction as a promising research direction at the intersection of computer vision and natural language processing. Unlike traditional multi-object tracking, RMOT identifies and tracks objects and incorporates textual descriptions for object class names, making the approach more intuitive. Various techniques have been proposed to address this challenging problem; however, most require the training of the entire network due to their end-to-end nature. Among these methods, iKUN has emerged as a particularly promising solution. Therefore, we further explore its pipeline and enhance its performance. In this paper, we introduce a practical module dubbed Memory-Efficient Cross-modality -- MEX. This memory-efficient technique can be directly applied to off-the-shelf trackers like iKUN, resulting in significant architectural improvements. Our method proves effective during inference on a single GPU with 4 GB of memory. Among the various benchmarks, the Refer-KITTI dataset, which offers diverse autonomous driving scenes with relevant language expressions, is particularly useful for studying this problem. Empirically, our method demonstrates effectiveness and efficiency regarding HOTA tracking scores, substantially improving memory allocation and processing speed.
Updated: 2025-02-19 16:58:42
Subjects: cs.CV,cs.AI
NVR: Vector Runahead on NPUs for Sparse Memory Access
Deep Neural Networks are increasingly leveraging sparsity to reduce the scaling up of model parameter size. However, reducing wall-clock time through sparsity and pruning remains challenging due to irregular memory access patterns, leading to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored for NPUs to address cache miss problems in sparse DNN workloads. Rather than optimising memory patterns with high overhead and poor portability, NVR adapts runahead execution to the unique architecture of NPUs. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support, operating as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU, with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to SOTA prefetching in general-purpose processors, delivering 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the advantages of incorporating a small cache (16KB) into the NPU combined with NVR. Our evaluation shows that expanding this modest cache delivers 5x higher performance benefits than increasing the L2 cache size by the same amount.
Updated: 2025-02-19 16:54:58
Subjects: cs.AR,cs.AI
SPEX: Scaling Feature Interaction Explanations for LLMs
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000)$. SPEX exploits underlying natural sparsity among interactions -- common in real-world data -- and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, HotpotQA, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (GPT-4o mini) and compositional reasoning in vision-language models.
Updated: 2025-02-19 16:49:55
Subjects: cs.LG,cs.AI,cs.CL,cs.IT,math.IT
One Size doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction
Large language models (LLMs) have been increasingly employed in various intelligent educational systems, simulating human tutors to facilitate effective human-machine interaction. However, previous studies often overlook the significance of recognizing and adapting to individual learner characteristics. Such adaptation is crucial for enhancing student engagement and learning efficiency, particularly in mathematics instruction, where diverse learning styles require personalized strategies to promote comprehension and enthusiasm. In this paper, we propose a PersonAlized Conversational tutoring agEnt (PACE) for mathematics instruction. PACE simulates students' learning styles based on the Felder and Silverman learning style model, aligning with each student's persona. In this way, our PACE can effectively assess the personality of students, allowing it to develop individualized teaching strategies that resonate with their unique learning styles. To further enhance students' comprehension, PACE employs the Socratic teaching method to provide instant feedback and encourage deep thinking. By constructing personalized teaching data and training models, PACE demonstrates the ability to identify and adapt to the unique needs of each student, significantly improving the overall learning experience and outcomes. Moreover, we establish multi-aspect evaluation criteria and conduct extensive analysis to assess the performance of personalized teaching. Experimental results demonstrate the superiority of our model in personalizing the educational experience and motivating students compared to existing methods.
Updated: 2025-02-19 16:45:48
Subjects: cs.CL,cs.AI
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Understanding emotions is a fundamental aspect of human communication. Integrating audio and video signals offers a more comprehensive understanding of emotional states compared to traditional methods that rely on a single data source, such as speech or facial expressions. Despite its potential, multimodal emotion recognition faces significant challenges, particularly in synchronization, feature extraction, and fusion of diverse data sources. To address these issues, this paper introduces a novel transformer-based model named Audio-Video Transformer Fusion with Cross Attention (AVT-CA). The AVT-CA model employs a transformer fusion approach to effectively capture and synchronize interlinked features from both audio and video inputs, thereby resolving synchronization problems. Additionally, the Cross Attention mechanism within AVT-CA selectively extracts and emphasizes critical features while discarding irrelevant ones from both modalities, addressing feature extraction and fusion challenges. Extensive experimental analysis conducted on the CMU-MOSEI, RAVDESS and CREMA-D datasets demonstrates the efficacy of the proposed model. The results underscore the importance of AVT-CA in developing precise and reliable multimodal emotion recognition systems for practical applications.
Updated: 2025-02-19 16:29:24
Subjects: cs.MM,cs.CL,cs.CV,cs.LG,cs.SD,eess.AS,F.2.2; I.2.7
Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment
Recent advances in aligning large language models with human preferences have corroborated the growing importance of best-of-N distillation (BOND). However, the iterative BOND algorithm is prohibitively expensive in practice due to its sample and computation inefficiency. This paper addresses the problem by revealing a unified game-theoretic connection between iterative BOND and self-play alignment, which unifies seemingly disparate algorithmic paradigms. Based on this connection, we establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win-rate dominance optimization that approximates iterative BOND in the parameter space. We provide a provable sample-efficiency guarantee for one of the WIND variants with the squared-loss objective. The experimental results confirm that our algorithm not only accelerates the computation, but also achieves superior sample efficiency compared to existing methods.
Updated: 2025-02-19 16:26:44
Subjects: cs.LG,stat.ML
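As background for the best-of-N primitive that BOND distills: best-of-N sampling draws N candidate responses and keeps the one a reward model scores highest. The sketch below uses toy stand-ins for the sampler and reward model; it is not the WIND algorithm itself:

```python
import random

def best_of_n(prompt, sample_fn, reward_fn, n=8):
    # Draw n candidate responses and keep the highest-reward one.
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

random.seed(0)
toy_sampler = lambda p: random.random()   # stand-in for sampling an LLM response
toy_reward = lambda x: -abs(x - 0.5)      # toy reward: prefers values near 0.5
best = best_of_n("prompt", toy_sampler, toy_reward, n=16)
```

Iterative BOND distills this selection policy back into the model and repeats; the expense the paper targets comes from drawing N samples per query at every iteration.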
Using Constraints to Discover Sparse and Alternative Subgroup Descriptions
Subgroup-discovery methods allow users to obtain simple descriptions of interesting regions in a dataset. Using constraints in subgroup discovery can enhance interpretability even further. In this article, we focus on two types of constraints: First, we limit the number of features used in subgroup descriptions, making the latter sparse. Second, we propose the novel optimization problem of finding alternative subgroup descriptions, which cover a similar set of data objects as a given subgroup but use different features. We describe how to integrate both constraint types into heuristic subgroup-discovery methods. Further, we propose a novel Satisfiability Modulo Theories (SMT) formulation of subgroup discovery as a white-box optimization problem, which allows solver-based search for subgroups and is open to a variety of constraint types. Additionally, we prove that both constraint types lead to an NP-hard optimization problem. Finally, we employ 27 binary-classification datasets to compare algorithmic and solver-based search for unconstrained and constrained subgroup discovery. We observe that heuristic search methods often yield high-quality subgroups within a short runtime, also in scenarios with constraints.
Updated: 2025-02-19 16:25:01
Subjects: cs.LG
Regularization by Neural Style Transfer for MRI Field-Transfer Reconstruction with Limited Data
Recent advances in MRI reconstruction have demonstrated remarkable success through deep learning-based models. However, most existing methods rely heavily on large-scale, task-specific datasets, making reconstruction in data-limited settings a critical yet underexplored challenge. While regularization by denoising (RED) leverages denoisers as priors for reconstruction, we propose Regularization by Neural Style Transfer (RNST), a novel framework that integrates a neural style transfer (NST) engine with a denoiser to enable magnetic field-transfer reconstruction. RNST generates high-field-quality images from low-field inputs without requiring paired training data, leveraging style priors to address limited-data settings. Our experiment results demonstrate RNST's ability to reconstruct high-quality images across diverse anatomical planes (axial, coronal, sagittal) and noise levels, achieving superior clarity, contrast, and structural fidelity compared to lower-field references. Crucially, RNST maintains robustness even when style and content images lack exact alignment, broadening its applicability in clinical environments where precise reference matches are unavailable. By combining the strengths of NST and denoising, RNST offers a scalable, data-efficient solution for MRI field-transfer reconstruction, demonstrating significant potential for resource-limited settings.
Updated: 2025-02-19 16:24:49
Subjects: cs.CV,cs.LG,physics.med-ph
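RNST itself couples an NST engine with a learned denoiser; as background, the RED mechanism it builds on can be sketched with a hand-written moving-average smoother standing in for the denoiser. Everything below is an illustrative toy, not the paper's method: the data term, step size, and regularization weight are assumptions.

```python
def moving_average_denoiser(x, k=3):
    """Toy hand-written smoother standing in for a learned denoiser."""
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - k // 2), min(len(x), i + k // 2 + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def red_reconstruct(y, n_iters=200, step=0.2, lam=0.5):
    """RED-style iteration: gradient step on the data term 0.5*||x - y||^2
    plus the denoiser-residual regularizer lam * (x - D(x))."""
    x = list(y)
    for _ in range(n_iters):
        dx = moving_average_denoiser(x)
        x = [xi - step * ((xi - yi) + lam * (xi - di))
             for xi, yi, di in zip(x, y, dx)]
    return x
```

The reconstruction is pulled toward both the observation and the denoiser's output, which is the "denoiser as prior" idea RNST extends with style priors.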
PoGDiff: Product-of-Gaussians Diffusion Models for Imbalanced Text-to-Image Generation
Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
Updated: 2025-02-19 16:18:04
Subjects: cs.LG,cs.AI,cs.CV,stat.ML
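PoGDiff replaces the ground-truth target with a Product of Gaussians built from the original target and a neighbor-conditioned prediction. The closed-form identity this relies on, shown here for scalar Gaussians as a minimal sketch, is that a product of Gaussian densities is again Gaussian with summed precisions and a precision-weighted mean:

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """The (normalized) product of two Gaussian densities is again Gaussian:
    precisions add, and the mean is the precision-weighted average."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var
```

Note that the combined mean leans toward whichever component has lower variance, which is how the neighbor-conditioned prediction can reshape sparse minority-class targets.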
Generalization bounds for mixing processes via delayed online-to-PAC conversions
We study the generalization error of statistical learning algorithms in a non-i.i.d. setting, where the training data is sampled from a stationary mixing process. We develop an analytic framework for this scenario based on a reduction to online learning with delayed feedback. In particular, we show that the existence of an online learning algorithm with bounded regret (against a fixed statistical learning algorithm in a specially constructed game of online learning with delayed feedback) implies low generalization error of said statistical learning method even if the data sequence is sampled from a mixing time series. The rates demonstrate a trade-off between the amount of delay in the online learning game and the degree of dependence between consecutive data points, with near-optimal rates recovered in a number of well-studied settings when the delay is tuned appropriately as a function of the mixing time of the process.
Updated: 2025-02-19 16:17:33
Subjects: cs.LG
Evaluation of EAS directions based on TAIGA HiSCORE data using fully connected neural networks
The direction of extensive air showers can be used to determine the source of gamma quanta and plays an important role in estimating the energy of the primary particle. The data from an array of non-imaging Cherenkov detector stations HiSCORE in the TAIGA experiment registering the number of photoelectrons and detection time can be used to estimate the shower direction with high accuracy. In this work, we use artificial neural networks trained on Monte Carlo-simulated TAIGA HiSCORE data for gamma quanta to obtain shower direction estimates. The neural networks are multilayer perceptrons with skip connections using partial data from several HiSCORE stations as inputs; composite estimates are derived from multiple individual estimates by the neural networks. We apply a two-stage algorithm in which the direction estimates obtained in the first stage are used to transform the input data and refine the estimates. The mean error of the final estimates is less than 0.25 degrees. The approach will be used for multimodal analysis of the data from several types of detectors used in the TAIGA experiment.
Updated: 2025-02-19 16:12:37
Subjects: astro-ph.IM,astro-ph.HE,cs.LG
DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue
Retrieval-Augmented Generation (RAG) systems have shown substantial benefits in applications such as question answering and multi-turn dialogue (Lewis et al., 2020). However, traditional RAG methods, while leveraging static knowledge bases, often overlook the potential of dynamic historical information in ongoing conversations. To bridge this gap, we introduce DH-RAG, a Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue. DH-RAG is inspired by human cognitive processes that utilize both long-term memory and immediate historical context in conversational responses (Stafford, 1987). DH-RAG is structured around two principal components: a History-Learning based Query Reconstruction Module, designed to generate effective queries by synthesizing current and prior interactions, and a Dynamic History Information Updating Module, which continually refreshes historical context throughout the dialogue. The center of DH-RAG is a Dynamic Historical Information database, which is further refined by three strategies within the Query Reconstruction Module: Historical Query Clustering, Hierarchical Matching, and Chain of Thought Tracking. Experimental evaluations show that DH-RAG significantly surpasses conventional models on several benchmarks, enhancing response relevance, coherence, and dialogue quality.
Updated: 2025-02-19 16:10:43
Subjects: cs.CL,cs.AI,cs.LG
Enhancing LLM-Based Recommendations Through Personalized Reasoning
Current recommendation systems powered by large language models (LLMs) often underutilize their reasoning capabilities due to a lack of explicit logical structuring. To address this limitation, we introduce CoT-Rec, a framework that integrates Chain-of-Thought (CoT) reasoning into LLM-driven recommendations by incorporating two crucial processes: user preference analysis and item perception evaluation. CoT-Rec operates in two key phases: (1) personalized data extraction, where user preferences and item perceptions are identified, and (2) personalized data application, where this information is leveraged to refine recommendations. Our experimental analysis demonstrates that CoT-Rec improves recommendation accuracy by making better use of LLMs' reasoning potential. The implementation is publicly available at https://anonymous.4open.science/r/CoT-Rec.
Updated: 2025-02-19 16:08:17
Subjects: cs.IR,cs.AI
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management
The recent development of powerful AI systems has highlighted the need for robust risk management frameworks in the AI industry. Although companies have begun to implement safety frameworks, current approaches often lack the systematic rigor found in other high-risk industries. This paper presents a comprehensive risk management framework for the development of frontier AI that bridges this gap by integrating established risk management principles with emerging AI-specific practices. The framework consists of four key components: (1) risk identification (through literature review, open-ended red-teaming, and risk modeling), (2) risk analysis and evaluation using quantitative metrics and clearly defined thresholds, (3) risk treatment through mitigation measures such as containment, deployment controls, and assurance processes, and (4) risk governance establishing clear organizational structures and accountability. Drawing from best practices in mature industries such as aviation or nuclear power, while accounting for AI's unique challenges, this framework provides AI developers with actionable guidelines for implementing robust risk management. The paper details how each component should be implemented throughout the life-cycle of the AI system - from planning through deployment - and emphasizes the importance and feasibility of conducting risk management work prior to the final training run to minimize the burden associated with it.
Updated: 2025-02-19 16:05:47
Subjects: cs.AI
Enhancing Cross-Domain Recommendations with Memory-Optimized LLM-Based User Agents
Large Language Model (LLM)-based user agents have emerged as a powerful tool for improving recommender systems by simulating user interactions. However, existing methods struggle with cross-domain scenarios due to inefficient memory structures, leading to irrelevant information retention and failure to account for social influence factors such as popularity. To address these limitations, we introduce AgentCF++, a novel framework featuring a dual-layer memory architecture and a two-step fusion mechanism to filter domain-specific preferences effectively. Additionally, we propose interest groups with shared memory, allowing the model to capture the impact of popularity trends on users with similar interests. Through extensive experiments on multiple cross-domain datasets, AgentCF++ demonstrates superior performance over baseline models, highlighting its effectiveness in refining user behavior simulation for recommender systems. Our code is available at https://anonymous.4open.science/r/AgentCF-plus.
Updated: 2025-02-19 16:02:59
Subjects: cs.IR,cs.AI
Mitigating Popularity Bias in Collaborative Filtering through Fair Sampling
Recommender systems often suffer from popularity bias, where frequently interacted items are overrepresented in recommendations. This bias stems from propensity factors influencing training data, leading to imbalanced exposure. In this paper, we introduce a Fair Sampling (FS) approach to address this issue by ensuring that both users and items are selected with equal probability as positive and negative instances. Unlike traditional inverse propensity score (IPS) methods, FS does not require propensity estimation, eliminating errors associated with inaccurate calculations. Our theoretical analysis demonstrates that FS effectively neutralizes the influence of propensity factors, achieving unbiased learning. Experimental results validate that FS outperforms state-of-the-art methods in both point-wise and pair-wise recommendation tasks, enhancing recommendation fairness without sacrificing accuracy. The implementation is available at https://anonymous.4open.science/r/Fair-Sampling.
Updated: 2025-02-19 15:59:49
Subjects: cs.IR,cs.AI
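The core of fair sampling, as the abstract describes it, is that selection probability does not depend on how often a user or item appears in the interaction log. A minimal sketch (data layout and function name are assumptions, not the paper's code) draws the user uniformly, then a positive and a negative item uniformly within that user's context:

```python
import random

def fair_sample(interactions, n_items, rng):
    """Draw one (user, positive, negative) training triple with the user
    chosen uniformly over users and items uniformly within/outside the
    user's interaction set -- independent of interaction frequency."""
    user = rng.choice(list(interactions))
    pos = rng.choice(sorted(interactions[user]))
    neg = rng.choice([i for i in range(n_items) if i not in interactions[user]])
    return user, pos, neg
```

Contrast with sampling a random interaction record, where popular items would dominate the positives in proportion to their exposure.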
Quantifying Memorization and Retriever Performance in Retrieval-Augmented Vision-Language Models
Large Language Models (LLMs) demonstrate remarkable capabilities in question answering (QA), but metrics for assessing their reliance on memorization versus retrieval remain underdeveloped. Moreover, while finetuned models are state-of-the-art on closed-domain tasks, general-purpose models like GPT-4o exhibit strong zero-shot performance. This raises questions about the trade-offs between memorization, generalization, and retrieval. In this work, we analyze the extent to which multimodal retrieval-augmented VLMs memorize training data compared to baseline VLMs. Using the WebQA benchmark, we contrast finetuned models with baseline VLMs on multihop retrieval and question answering, examining the impact of finetuning on data memorization. To quantify memorization in end-to-end retrieval and QA systems, we propose several proxy metrics by investigating instances where QA succeeds despite retrieval failing. Our results reveal the extent to which finetuned models rely on memorization. In contrast, retrieval-augmented VLMs have lower memorization scores, at the cost of accuracy (72% vs 52% on WebQA test set). As such, our measures pose a challenge for future work to reconcile memorization and generalization in both Open-Domain QA and joint Retrieval-QA tasks.
Updated: 2025-02-19 15:58:09
Subjects: cs.LG,cs.AI
Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning
Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (\textit{a.k.a.} tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure~1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.
Updated: 2025-02-19 15:54:21
Subjects: cs.AI
Contrastive Learning-Based privacy metrics in Tabular Synthetic Datasets
Synthetic data has garnered attention as a Privacy Enhancing Technology (PET) in sectors such as healthcare and finance. When using synthetic data in practical applications, it is important to provide protection guarantees. In the literature, two families of approaches are proposed for tabular data: on the one hand, similarity-based methods aim at finding the level of similarity between training and synthetic data. Indeed, a privacy breach can occur if the generated data is consistently too similar or even identical to the train data. On the other hand, attack-based methods conduct deliberate attacks on synthetic datasets. The success rates of these attacks reveal how secure the synthetic datasets are. In this paper, we introduce a contrastive method that improves privacy assessment of synthetic datasets by embedding the data in a more representative space. This overcomes obstacles surrounding the multitude of data types and attributes. It also makes the use of intuitive distance metrics possible for similarity measurements and as an attack vector. In a series of experiments with publicly available datasets, we compare the performances of similarity-based and attack-based methods, both with and without use of the contrastive learning-based embeddings. Our results show that relatively efficient, easy to implement privacy metrics can perform equally well as more advanced metrics explicitly modeling conditions for privacy referred to by the GDPR.
Updated: 2025-02-19 15:52:23
Subjects: cs.LG,cs.CR
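Once records live in a shared embedding space, intuitive distance metrics become usable for privacy assessment. A common similarity-based metric of this kind (a generic distance-to-closest-record ratio, shown as a sketch rather than the paper's specific metric) compares how close synthetic rows sit to the training set versus how close genuinely unseen holdout rows sit:

```python
def nearest_distance(point, reference):
    """Euclidean distance from a point to its closest reference record."""
    return min(sum((a - b) ** 2 for a, b in zip(point, ref)) ** 0.5
               for ref in reference)

def dcr_privacy_score(synthetic, train, holdout):
    """Ratio of mean distance-to-closest-record for synthetic vs holdout
    points. Scores well below 1 mean synthetic rows hug the training set
    more tightly than real unseen rows do -- a leakage signal."""
    syn_dcr = sum(nearest_distance(s, train) for s in synthetic) / len(synthetic)
    ref_dcr = sum(nearest_distance(h, train) for h in holdout) / len(holdout)
    return syn_dcr / ref_dcr if ref_dcr > 0 else float("inf")
```

In the abstract's framing, the contrastive embedding is what makes this plain Euclidean distance meaningful across mixed data types.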
DiffGuard: Text-Based Safety Checker for Diffusion Models
Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper first reveals their limitations and then presents a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, achieving a performance that surpasses the best existing filters by over 14%.
Updated: 2025-02-19 15:51:43
Subjects: cs.CV,cs.AI
FedCC: Robust Federated Learning against Model Poisoning Attacks
Federated learning is a distributed framework designed to address privacy concerns. However, it introduces new attack surfaces, which are especially pronounced when data is not independently and identically distributed (non-IID). Existing approaches fail to effectively mitigate malicious influence in this setting, as they often tackle non-IID data and poisoning attacks separately. To address both challenges simultaneously, we present FedCC, a simple yet effective novel defense algorithm against model poisoning attacks. It leverages the Centered Kernel Alignment similarity of penultimate layer representations for clustering, allowing the identification and filtration of malicious clients, even in non-IID data settings. The penultimate layer representations are meaningful since the later layers are more sensitive to local data distributions, which allows better detection of malicious clients. The sophisticated utilization of layer-wise Centered Kernel Alignment similarity allows attack mitigation while leveraging useful knowledge obtained. Our extensive experiments demonstrate the effectiveness of FedCC in mitigating both untargeted model poisoning and targeted backdoor attacks. Compared to existing outlier detection-based and first-order statistics-based methods, FedCC consistently reduces attack confidence to zero. Specifically, it significantly minimizes the average degradation of global performance by 65.5%. We believe that this new perspective on aggregation makes it a valuable contribution to the field of FL model security and privacy. The code will be made available upon acceptance.
Updated: 2025-02-19 15:48:59
Subjects: cs.CR,cs.AI
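The similarity measure FedCC clusters on, Centered Kernel Alignment, has a standard linear form that can be sketched in a few lines (a generic linear CKA implementation, not the paper's code; the server would compute it pairwise over clients' penultimate-layer activation matrices):

```python
def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices (lists of equal-length rows, one row per example).
    Returns 1.0 for representations equal up to isotropic scaling."""
    def center(M):
        n, d = len(M), len(M[0])
        means = [sum(row[j] for row in M) / n for j in range(d)]
        return [[row[j] - means[j] for j in range(d)] for row in M]

    def gram(M):
        return [[sum(a * b for a, b in zip(r1, r2)) for r2 in M] for r1 in M]

    def frob_inner(A, B):
        return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

    Kx, Ky = gram(center(X)), gram(center(Y))
    return frob_inner(Kx, Ky) / (frob_inner(Kx, Kx) ** 0.5
                                 * frob_inner(Ky, Ky) ** 0.5)
```

Clients whose representations score low CKA similarity against the majority cluster would be the candidates for filtration.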
The Round Complexity of Black-Box Post-Quantum Secure Computation
We study the round complexity of secure multi-party computation (MPC) in the post-quantum regime. Our focus is on the fully black-box setting, where both the construction and security reduction are black-box. Chia, Chung, Liu, and Yamakawa [FOCS'22] demonstrated the infeasibility of achieving standard simulation-based security within constant rounds unless $\mathbf{NP} \subseteq \mathbf{BQP}$. This leaves crucial feasibility questions unresolved. Specifically, it remains unknown whether black-box constructions are achievable within polynomial rounds; also, the existence of constant-round constructions with respect to $\epsilon$-simulation, a relaxed yet useful alternative to standard simulation, remains unestablished. This work provides positive answers. We introduce the first black-box construction for PQ-MPC in polynomial rounds, from the minimal assumption of post-quantum semi-honest oblivious transfers. In the two-party scenario, our construction requires only $\omega(1)$ rounds. These results have already been applied in the oracle separation between classical-communication quantum MPC and $\mathbf{P} = \mathbf{NP}$ in Kretschmer, Qian, and Tal [STOC'25]. As for $\epsilon$-simulation, Chia, Chung, Liang, and Yamakawa [CRYPTO'22] resolved the issue for the two-party setting, leaving the multi-party case open. We complete the picture by presenting the first black-box, constant-round construction in the multi-party setting, instantiable using various standard post-quantum primitives. En route, we obtain a black-box, constant-round post-quantum commitment achieving a weaker version of 1-many non-malleability, from post-quantum one-way functions. Besides its role in our MPC construction, this commitment also reduces the assumption used in the quantum parallel repetition lower bound by Bostanci, Qian, Spooner, and Yuen [STOC'24]. We anticipate further applications in the future.
Updated: 2025-02-19 15:45:28
Subjects: quant-ph,cs.CR
Mixup Regularization: A Probabilistic Perspective
In recent years, mixup regularization has gained popularity as an effective way to improve the generalization performance of deep learning models by training on convex combinations of training data. While many mixup variants have been explored, the proper adoption of the technique to conditional density estimation and probabilistic machine learning remains relatively unexplored. This work introduces a novel framework for mixup regularization based on probabilistic fusion that is better suited for conditional density estimation tasks. For data distributed according to a member of the exponential family, we show that likelihood functions can be analytically fused using log-linear pooling. We further propose an extension of probabilistic mixup, which allows for fusion of inputs at an arbitrary intermediate layer of the neural network. We provide a theoretical analysis comparing our approach to standard mixup variants. Empirical results on synthetic and real datasets demonstrate the benefits of our proposed framework compared to existing mixup variants.
Updated: 2025-02-19 15:39:14
Subjects: cs.LG,stat.ML
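The analytic fusion claim rests on exponential-family closure under log-linear pooling: pooling weights act linearly on natural parameters, so the pooled likelihood stays in the family. For scalar Gaussians this gives a short closed form (shown as a minimal sketch of the identity, not the paper's full framework):

```python
def log_linear_pool_gaussian(mu1, var1, mu2, var2, lam):
    """Log-linear pooling p1^lam * p2^(1-lam) of two Gaussian likelihoods
    is again Gaussian: natural parameters (precisions and precision-
    weighted means) mix linearly with weight lam."""
    prec = lam / var1 + (1 - lam) / var2
    var = 1.0 / prec
    mu = var * (lam * mu1 / var1 + (1 - lam) * mu2 / var2)
    return mu, var
```

With equal variances this reduces to the familiar mixup interpolation of targets, lam * mu1 + (1 - lam) * mu2, which is why the probabilistic view generalizes the standard technique.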
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation, a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.
Updated: 2025-02-19 15:36:47
Subjects: cs.CL,cs.AI,cs.CR
Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class
Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, recently higher capacity deep generative models are providing greater promise. However, handling of imbalance in class distribution when building a deep generative model is also a challenging problem, one that has not been studied as extensively as imbalanced classifier model training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. We propose a novel technique of converting the binary class labels to ternary class labels by introducing a class for the region where minority and majority distributions overlap. We show that just this pre-processing of the training set significantly improves the quality of data generated, spanning several state-of-the-art diffusion and GAN-based models. While training the classifier using synthetic data, we remove the overlap class from the training data and justify the reasons behind the enhanced accuracy. We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models, demonstrating that our method enhances not only the synthesizer performance of state-of-the-art models but also the classifier performance.
Updated: 2025-02-19 15:36:45
标题: 合成表格数据生成用于不平衡分类:重叠类别的出人意料有效性
Categories: cs.LG
Bias Similarity Across Large Language Models
Bias in machine learning models, particularly in Large Language Models, is a critical issue as these systems shape important societal decisions. While previous studies have examined bias in individual LLMs, comparisons of bias across models remain underexplored. To address this gap, we analyze 13 LLMs from five families, evaluating bias through output distributions across multiple dimensions using two datasets (4K and 1M questions). Our results show that fine-tuning has minimal impact on output distributions, and that proprietary models tend to over-respond with "unknown" to minimize bias, compromising accuracy and utility. In addition, open-source models like Llama3-Chat and Gemma2-it demonstrate fairness comparable to proprietary models like GPT-4, challenging the assumption that larger, closed-source models are inherently less biased. We also find that bias scores for disambiguated questions are more extreme, raising concerns about reverse discrimination. These findings highlight the need for improved bias mitigation strategies and more comprehensive evaluation metrics for fairness in LLMs.
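The cross-model comparison of output distributions can be sketched with a simple divergence over toy answer distributions; total variation distance here is an illustrative choice, not necessarily the metric used in the study, and both distributions are made up.

```python
# Compare two models' answer distributions over the same question set
# via total variation distance (0 = identical, 1 = disjoint).

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

model_a = {"yes": 0.5, "no": 0.3, "unknown": 0.2}
model_b = {"yes": 0.2, "no": 0.2, "unknown": 0.6}  # over-responds "unknown"
print(total_variation(model_a, model_b))
```

A model that routes most probability mass to "unknown", as the abstract describes for proprietary models, shows up as a large distance from models that commit to an answer.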
Updated: 2025-02-19 15:36:26
Categories: cs.LG,cs.AI,cs.CL
Uncertainty quantification for Markov chains with application to temporal difference learning
Markov chains are fundamental to statistical machine learning, underpinning key methodologies such as Markov Chain Monte Carlo (MCMC) sampling and temporal difference (TD) learning in reinforcement learning (RL). Given their widespread use, it is crucial to establish rigorous probabilistic guarantees on their convergence, uncertainty, and stability. In this work, we develop novel, high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains, addressing key limitations in existing theoretical tools for handling dependent data. We leverage these results to analyze the TD learning algorithm, a widely used method for policy evaluation in RL. Our analysis yields a sharp high-probability consistency guarantee that matches the asymptotic variance up to logarithmic factors. Furthermore, we establish a $O(T^{-\frac{1}{4}}\log T)$ distributional convergence rate for the Gaussian approximation of the TD estimator, measured in convex distance. These findings provide new insights into statistical inference for RL algorithms, bridging the gaps between classical stochastic approximation theory and modern reinforcement learning applications.
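As a concrete instance of the Markov-chain setting analysed above, the sketch below runs tabular TD(0) policy evaluation on a toy two-state Markov reward process. All transition probabilities, rewards, and hyperparameters are illustrative, not from the paper.

```python
# Tabular TD(0) on a 2-state Markov reward process whose exact value
# function is V = (7.25, 7.75) for gamma = 0.9.
import random

random.seed(0)
gamma, alpha = 0.9, 0.1
V = [0.0, 0.0]                               # value estimates
P = {0: [(0.5, 0, 1.0), (0.5, 1, 0.0)],      # (prob, next state, reward)
     1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]}

s = 0
for _ in range(5000):
    u, acc = random.random(), 0.0
    for p, s2, rew in P[s]:                  # sample one transition
        acc += p
        if u <= acc:
            break
    # TD(0) update: V(s) <- V(s) + alpha * (reward + gamma * V(s') - V(s))
    V[s] += alpha * (rew + gamma * V[s2] - V[s])
    s = s2

print(V)  # approaches the exact solution (7.25, 7.75) up to sampling noise
```

The residual fluctuation of the estimates around the exact solution is exactly the kind of uncertainty the paper's concentration inequalities and Berry-Esseen bounds quantify.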
Updated: 2025-02-19 15:33:55
Categories: stat.ML,cs.LG
Semi-supervised Fine-tuning for Large Language Models
Supervised fine-tuning (SFT) is crucial in adapting large language models (LLMs) to a specific domain or task. However, only a limited amount of labeled data is available in practical applications, which poses a severe challenge for SFT in yielding satisfactory results. Therefore, a data-efficient framework that can fully exploit labeled and unlabeled data for LLM fine-tuning is highly anticipated. Towards this end, we introduce a semi-supervised fine-tuning (SemiFT) task and a framework named SemiEvol for LLM alignment in a propagate-and-select manner. For knowledge propagation, SemiEvol adopts a bi-level approach, propagating knowledge from labeled data to unlabeled data through both in-weight and in-context methods. For knowledge selection, SemiEvol incorporates a collaborative learning mechanism, selecting higher-quality pseudo-response samples. We conducted experiments using GPT-4o-mini and Llama-3.1 on seven general or domain-specific datasets, demonstrating significant improvements in model performance on target data. Furthermore, we compared SemiEvol with SFT and self-evolution methods, highlighting its practicality in hybrid data scenarios.
Updated: 2025-02-19 15:32:29
Categories: cs.CL,cs.AI
Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning
Code verification has recently found great success as a critical component in training large-scale reasoning models for coding. Synthetic techniques such as self-generated test cases and reward models provide a way to enhance code capabilities beyond predefined tests. Building on these advancements, we propose new benchmarks designed to systematically evaluate the impact of synthetic verification methods on assessing solution correctness. We introduce HE-R, HE-R+, MBPP-R, and MBPP-R+, which transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. Using these benchmarks, we analyze synthetic verification methods in standard, reasoning-based, and reward-based LLMs. Our results show that recent reasoning models significantly improve test case generation and that scaling test cases enhances verification accuracy.
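A ranking-style verifier evaluation of the kind described above can be sketched as pairwise ranking accuracy: how often a verifier's scores order two candidate solutions consistently with their ground-truth correctness. The data and the specific metric are illustrative stand-ins, not the benchmarks' actual scoring protocol.

```python
# Score a synthetic verifier by the fraction of solution pairs it ranks
# consistently with ground-truth correctness (toy data).

def pairwise_ranking_accuracy(truth, scores):
    pairs = correct = 0
    for i in range(len(truth)):
        for j in range(i + 1, len(truth)):
            if truth[i] == truth[j]:
                continue               # ties carry no ranking signal
            pairs += 1
            if (scores[i] - scores[j]) * (truth[i] - truth[j]) > 0:
                correct += 1
    return correct / pairs

truth  = [1.0, 0.8, 0.3, 0.0]          # e.g. fraction of hidden tests passed
scores = [0.9, 0.6, 0.7, 0.1]          # verifier's predicted quality
print(pairwise_ranking_accuracy(truth, scores))
```

Here the verifier swaps one pair (the 0.8 and 0.3 solutions), so 5 of the 6 informative pairs are ordered correctly.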
Updated: 2025-02-19 15:32:11
Categories: cs.AI,cs.CL,cs.LG,cs.SE
Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge
Estimating the construction year of buildings is of great importance for sustainability. Sustainable buildings minimize energy consumption and are a key part of responsible and sustainable urban planning and development to effectively combat climate change. By using Artificial Intelligence (AI) and recently proposed Transformer models, we are able to estimate the construction epoch of buildings from a multi-modal dataset. In this paper, we introduce a new benchmark multi-modal dataset, the Map your City Dataset (MyCD), containing top-view Very High Resolution (VHR) images, Earth Observation (EO) multi-spectral data from the Copernicus Sentinel-2 satellite constellation, and street-view images in many different cities in Europe, co-localized with respect to the building under study and labelled with the construction epoch. We assess EO generalization performance on new, previously unseen cities that were held out from training and appear only during inference. In this work, we present the community-based data challenge we organized based on MyCD. The ESA AI4EO Challenge MapYourCity was open for four months in 2024. Here, we present the top-4 performing models and the main evaluation results. During inference, the performance of the models is examined using both all three input modalities and only the two top-view modalities, i.e. without the street-view images. The evaluation results show that the models are effective and can achieve good performance on this difficult real-world task of estimating the age of buildings, even on previously unseen cities, and even when using only the two top-view modalities (i.e. VHR and Sentinel-2) during inference.
Updated: 2025-02-19 15:31:13
Categories: cs.CV,cs.LG
The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text
How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, particularly when the adversary does not have access to the fine-tuned model but only to the synthetic data. We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. Further, we find that canaries crafted to maximize vulnerability to model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model's output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their vulnerability. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.
Updated: 2025-02-19 15:30:30
Categories: cs.CL,cs.CR,cs.LG
Display Field-Of-View Agnostic Robust CT Kernel Synthesis Using Model-Based Deep Learning
In X-ray computed tomography (CT) imaging, the choice of reconstruction kernel is crucial as it significantly impacts the quality of clinical images. Different kernels influence spatial resolution, image noise, and contrast in various ways. Clinical applications involving lung imaging often require images reconstructed with both soft and sharp kernels. The reconstruction of images with different kernels requires raw sinogram data, and storing images for all kernels increases processing time and storage requirements. The Display Field-of-View (DFOV) adds complexity to kernel synthesis, as data acquired at different DFOVs exhibit varying levels of sharpness and detail. This work introduces an efficient, DFOV-agnostic solution for image-based kernel synthesis using model-based deep learning. The proposed method explicitly integrates CT kernel and DFOV characteristics into the forward model. Experimental results on clinical data, along with quantitative analysis of the estimated modulation transfer function using wire phantom data, clearly demonstrate the utility of the proposed method in real time. Additionally, a comparative study with a direct learning network, which lacks forward model information, shows that the proposed method is more robust to DFOV variations.
Updated: 2025-02-19 15:29:47
Categories: eess.IV,cs.AI,cs.CV
On the Duality between Gradient Transformations and Adapters
We study memory-efficient optimization of neural networks with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map's transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter's parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
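The claimed duality can be checked numerically on a toy quadratic loss: SGD on gradients projected by a fixed linear map S and carried back through its transpose traces exactly the same iterates as SGD on an additive adapter A under the reparameterization W = W0 + S^T A. The sizes, the map S, and the loss below are illustrative choices, not from the paper.

```python
# Numerical check of the gradient-transformation / adapter duality on
# L(W) = 0.5 * ||W - T||^2, so dL/dW = W - T.

n, k, eta, steps = 4, 2, 0.1, 25
T = [1.0, -2.0, 3.0, 0.5]                     # target
S = [[1.0, 0.0, 1.0, 0.0],                    # fixed k x n projection
     [0.0, 1.0, 0.0, -1.0]]

def grad(W):
    return [w - t for w, t in zip(W, T)]

def S_v(v):                                    # S @ v    -> k-vector
    return [sum(S[i][j] * v[j] for j in range(n)) for i in range(k)]

def St_v(v):                                   # S^T @ v  -> n-vector
    return [sum(S[i][j] * v[i] for i in range(k)) for j in range(n)]

# View 1: optimise W with linearly transformed gradients, W <- W - eta*S^T S g
W = [0.0] * n
for _ in range(steps):
    step = St_v(S_v(grad(W)))
    W = [w - eta * s for w, s in zip(W, step)]

# View 2: freeze W0, optimise adapter A with W_eff = W0 + S^T A
W0, A = [0.0] * n, [0.0] * k
for _ in range(steps):
    Weff = [w0 + d for w0, d in zip(W0, St_v(A))]
    gA = S_v(grad(Weff))                       # chain rule: dL/dA = S g
    A = [a - eta * g for a, g in zip(A, gA)]

Weff = [w0 + d for w0, d in zip(W0, St_v(A))]
print(all(abs(a - b) < 1e-9 for a, b in zip(W, Weff)))  # True
```

Both views follow the identical recursion W_{t+1} = W_t - eta * S^T S * grad(W_t), which is the equivalence the abstract states; a Kronecker-factored S would recover the GaLore vs. one-sided LoRA case.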
Updated: 2025-02-19 15:26:18
Categories: cs.LG,cs.CL
Learning Is a Kan Extension
Previous work has demonstrated that efficient algorithms exist for computing Kan extensions and that some Kan extensions have interesting similarities to various machine learning algorithms. This paper closes the gap by proving that all error minimisation algorithms may be presented as a Kan extension. This result provides a foundation for future work to investigate the optimisation of machine learning algorithms through their presentation as Kan extensions. A corollary of this representation of error-minimising algorithms is a presentation of error from the perspective of lossy and lossless transformations of data.
Updated: 2025-02-19 15:25:44
Categories: math.CT,cs.LG
Early-Stage Anomaly Detection: A Study of Model Performance on Complete vs. Partial Flows
This study investigates the efficacy of machine learning models in network anomaly detection through the critical lens of partial versus complete flow information. We systematically evaluate how models perform under varying training and testing conditions, quantifying the performance impact when dealing with incomplete data typical in real-time environments. Our findings demonstrate a significant performance difference, with precision and recall dropping by up to 30% under certain conditions when models trained on complete flows are tested against partial flows. Conversely, models trained and tested on consistently complete or partial datasets maintain robustness. The study reveals that a minimum of 7 packets in the test set is required for maintaining reliable detection rates, providing valuable insights for real-time detection strategies. These results offer important guidance for deploying machine learning models in operational network security environments.
Updated: 2025-02-19 15:25:31
Categories: cs.LG,cs.CR
BNEM: A Boltzmann Sampler Based on Bootstrapped Noised Energy Matching
Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g. molecular dynamics. In this work, we aim to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, Noised Energy Matching (NEM), which theoretically has lower variance and greater complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to NEM, yielding Bootstrapped NEM (BNEM), to balance bias and variance. We evaluate NEM and BNEM on a 2-dimensional 40-mode Gaussian Mixture Model (GMM) and a 4-particle double-well potential (DW-4). The experimental results demonstrate that BNEM can achieve state-of-the-art performance while being more robust.
Updated: 2025-02-19 15:18:24
Categories: cs.LG,cs.AI,stat.CO,stat.ML
AnDB: Breaking Boundaries with an AI-Native Database for Universal Semantic Analysis
In this demonstration, we present AnDB, an AI-native database that supports traditional OLTP workloads and innovative AI-driven tasks, enabling unified semantic analysis across structured and unstructured data. While structured data analytics is mature, challenges remain in bridging the semantic gap between user queries and unstructured data. AnDB addresses these issues by leveraging cutting-edge AI-native technologies, allowing users to perform semantic queries using intuitive SQL-like statements without requiring AI expertise. This approach eliminates the ambiguity of traditional text-to-SQL systems and provides a seamless end-to-end optimization for analyzing all data types. AnDB automates query processing by generating multiple execution plans and selecting the optimal one through its optimizer, which balances accuracy, execution time, and financial cost based on user policies and internal optimizing mechanisms. AnDB future-proofs data management infrastructure, empowering users to effectively and efficiently harness the full potential of all kinds of data without starting from scratch.
Updated: 2025-02-19 15:15:59
Categories: cs.DB,cs.AI,cs.LG
Learning to explore when mistakes are not allowed
Goal-Conditioned Reinforcement Learning (GCRL) provides a versatile framework for developing unified controllers capable of handling wide ranges of tasks, exploring environments, and adapting behaviors. However, its reliance on trial and error poses challenges for real-world applications, as errors can result in costly and potentially damaging consequences. To address the need for safer learning, we propose a method that enables agents to learn goal-conditioned behaviors that explore without the risk of making harmful mistakes. Exploration without risks can seem paradoxical, but environment dynamics are often uniform in space; a policy trained for safety without exploration purposes can therefore still be exploited globally. Our proposed approach involves two distinct phases. First, during a pretraining phase, we employ safe reinforcement learning and distributional techniques to train a safety policy that actively tries to avoid failures in various situations. In the subsequent safe exploration phase, a goal-conditioned (GC) policy is learned while ensuring safety. To achieve this, we implement an action-selection mechanism that leverages the previously learned distributional safety critics to arbitrate between the safety policy and the GC policy, switching to the safety policy whenever needed to ensure safe exploration. We evaluate our method in simulated environments and demonstrate that it not only provides substantial coverage of the goal space but also reduces the occurrence of mistakes to a minimum, in stark contrast to traditional GCRL approaches. Additionally, we conduct an ablation study and analyze failure modes, offering insights for future research directions.
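The arbitration step in the safe exploration phase can be sketched as follows. The policies, the safety critic, and the risk budget here are hypothetical stand-ins for the learned components described above, not the paper's actual implementation.

```python
# Sketch of critic-based arbitration: act with the goal-conditioned (GC)
# policy unless the safety critic predicts its action is too risky, in
# which case fall back to the pre-trained safety policy.

RISK_BUDGET = 0.2                     # illustrative failure-probability cap

def gc_policy(state, goal):           # stand-in goal-conditioned policy
    return "towards_goal"

def safety_policy(state):             # stand-in pre-trained safety policy
    return "retreat"

def safety_critic(state, action):     # stand-in estimated failure probability
    if state == "near_cliff" and action == "towards_goal":
        return 0.9
    return 0.05

def select_action(state, goal):
    action = gc_policy(state, goal)
    if safety_critic(state, action) > RISK_BUDGET:
        action = safety_policy(state)  # switch to the safety policy
    return action

print(select_action("open_field", "goal"))  # towards_goal
print(select_action("near_cliff", "goal"))  # retreat
```

The GC policy drives exploration everywhere the critic deems safe; the safety policy takes over only in states where the GC action's predicted risk exceeds the budget.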
Updated: 2025-02-19 15:11:51
Categories: cs.LG,cs.SY,eess.SY
Stock Price Prediction Using a Hybrid LSTM-GNN Model: Integrating Time-Series and Graph-Based Analysis
This paper presents a novel hybrid model that integrates long short-term memory (LSTM) networks and Graph Neural Networks (GNNs) to significantly enhance the accuracy of stock market predictions. The LSTM component adeptly captures temporal patterns in stock price data, effectively modeling the time series dynamics of financial markets. Concurrently, the GNN component leverages Pearson correlation and association analysis to model inter-stock relational data, capturing complex nonlinear polyadic dependencies influencing stock prices. The model is trained and evaluated using an expanding window validation approach, enabling continuous learning from increasing amounts of data and adaptation to evolving market conditions. Extensive experiments conducted on historical stock data demonstrate that our hybrid LSTM-GNN model achieves a mean square error (MSE) of 0.00144, representing a substantial reduction of 10.6% compared to the MSE of the standalone LSTM model of 0.00161. Furthermore, the hybrid model outperforms traditional and advanced benchmarks, including linear regression, convolutional neural networks (CNN), and dense networks. These compelling results underscore the significant potential of combining temporal and relational data through a hybrid approach, offering a powerful tool for real-time trading and financial analysis.
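The inter-stock graph that feeds the GNN component can be sketched by thresholding pairwise Pearson correlations of price series. The threshold and the toy prices below are illustrative, and the paper additionally uses association analysis, which this sketch omits.

```python
# Build graph edges between tickers whose |Pearson correlation| exceeds
# a threshold (toy price series).

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

prices = {
    "AAA": [10, 11, 12, 13, 14],
    "BBB": [20, 22, 24, 26, 28],   # moves in lockstep with AAA
    "CCC": [5, 9, 4, 8, 3],        # mostly unrelated
}

def build_edges(prices, thresh=0.8):
    names = sorted(prices)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(prices[a], prices[b])) >= thresh]

print(build_edges(prices))  # [('AAA', 'BBB')]
```

Only the strongly correlated pair gets an edge; the GNN would then propagate information along such edges while the LSTM models each ticker's own time series.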
Updated: 2025-02-19 15:09:13
Categories: q-fin.ST,cs.AI,cs.LG
Addressing the regulatory gap: moving towards an EU AI audit ecosystem beyond the AI Act by including civil society
The European legislature has proposed the Digital Services Act (DSA) and Artificial Intelligence Act (AIA) to regulate platforms and Artificial Intelligence (AI) products. We review the extent to which third-party audits are part of both laws and how access to information on models and data is provided. By considering the value of third-party audits and third-party data access in an audit ecosystem, we identify a regulatory gap in that the AIA does not provide access to data for researchers and civil society. Our contributions to the literature include: (1) defining an AI audit ecosystem incorporating compliance and oversight; (2) highlighting a regulatory gap within the DSA and AIA regulatory framework that prevents the establishment of an AI audit ecosystem with effective oversight by civil society and academia; (3) emphasizing that third-party audits by research and civil society must be part of that ecosystem. We call for AIA amendments and delegated acts to include data and model access for certain AI products. Furthermore, we call for the DSA to provide NGOs and investigative journalists with data access to platforms via delegated acts, and for adaptations and amendments of the AIA to provide third-party audits and data and model access, at least for high-risk systems. Regulations modeled after EU AI regulations should enable data access and third-party audits, fostering an AI audit ecosystem that promotes compliance and oversight mechanisms.
Updated: 2025-02-19 15:03:02
Categories: cs.CY,cs.AI,cs.LG
LESA: Learnable LLM Layer Scaling-Up
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.
Updated: 2025-02-19 14:58:48
Categories: cs.LG,cs.AI,cs.CL
Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics
mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence as well as the Untranslated Regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for those properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. We present Helix-mRNA, a hybrid structured state-space and attention model, to address these challenges. Beyond an initial pre-training, a second pre-training stage allows us to specialise the model with high-quality data. We employ single-nucleotide tokenization of mRNA sequences with codon separation, ensuring that prior biological and structural information from the original mRNA sequence is not lost. Our model, Helix-mRNA, outperforms existing methods in analysing both UTR and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions. We open-source the model (https://github.com/helicalAI/helical) and model weights (https://huggingface.co/helical-ai/helix-mRNA).
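A minimal sketch of single-nucleotide tokenization with codon separation for a coding sequence follows. The separator symbol "E" and the exact scheme are assumptions for illustration, not Helix-mRNA's actual vocabulary.

```python
# Tokenize an mRNA coding sequence nucleotide by nucleotide, inserting a
# separator token after each complete codon so reading-frame structure
# is preserved alongside single-nucleotide resolution.

def tokenize_cds(seq, sep="E"):
    tokens = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        tokens.extend(codon)          # one token per nucleotide
        if len(codon) == 3:
            tokens.append(sep)        # codon-boundary marker
    return tokens

print(tokenize_cds("AUGGCC"))  # ['A', 'U', 'G', 'E', 'G', 'C', 'C', 'E']
```

Each nucleotide remains its own token (unlike k-mer schemes that merge them), while the separator keeps the codon structure explicit for the model.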
Updated: 2025-02-19 14:51:41
Categories: q-bio.GN,cs.AI
Poster: SpiderSim: Multi-Agent Driven Theoretical Cybersecurity Simulation for Industrial Digitalization
Rapid industrial digitalization has created intricate cybersecurity demands that necessitate effective validation methods. While cyber ranges and simulation platforms are widely deployed, they frequently face limitations in scenario diversity and creation efficiency. In this paper, we present SpiderSim, a theoretical cybersecurity simulation platform enabling rapid and lightweight scenario generation for industrial digitalization security research. At its core, our platform introduces three key innovations: a structured framework for unified scenario modeling, a multi-agent collaboration mechanism for automated generation, and modular atomic security capabilities for flexible scenario composition. Extensive implementation trials across multiple industrial digitalization contexts, including marine ranch monitoring systems, validate our platform's capacity for broad scenario coverage with efficient generation processes. Built on solid theoretical foundations and released as open-source software, SpiderSim facilitates broader research and development in automated security testing for industrial digitalization.
Updated: 2025-02-19 14:42:32
Categories: cs.CR,cs.AI
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare
Alignment techniques have become central to ensuring that Large Language Models (LLMs) generate outputs consistent with human values. However, existing alignment paradigms often model an averaged or monolithic preference, failing to account for the diversity of perspectives across cultures, demographics, and communities. This limitation is particularly critical in health-related scenarios, where plurality is essential due to the influence of culture, religion, personal values, and conflicting opinions. Despite progress in pluralistic alignment, no prior work has focused on health, likely due to the lack of publicly available datasets. To address this gap, we introduce VITAL, a new benchmark dataset comprising 13.1K value-laden situations and 5.4K multiple-choice questions focused on health, designed to assess and benchmark pluralistic alignment methodologies. Through extensive evaluation of eight LLMs of varying sizes, we demonstrate that existing pluralistic alignment techniques fall short in effectively accommodating diverse healthcare beliefs, underscoring the need for tailored AI alignment in specific domains. This work highlights the limitations of current approaches and lays the groundwork for developing health-specific alignment solutions.
Updated: 2025-02-19 14:38:57
Categories: cs.CL,cs.AI,cs.LG
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.
Updated: 2025-02-19 14:36:33
Categories: cs.LG,cs.AI,cs.CL
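The abstract does not state the fitted functional form of the joint scaling law. As a rough sketch only, a Chinchilla-style loss with an illustrative expert-count term can show how one would pick an expert count under a memory budget; every coefficient and the crude total-parameter memory model below are hypothetical placeholders, not values from the paper.

```python
import math

def joint_moe_loss(n_active: float, d_tokens: float, n_experts: int,
                   a=406.4, b=410.7, e=1.69,
                   alpha=0.34, beta=0.28, gamma=0.08) -> float:
    """Hypothetical joint scaling law: predicted loss as a function of
    active parameters N, dataset size D, and expert count E.
    Coefficients are illustrative, not fitted values from the paper."""
    return (e
            + a / (n_active ** alpha)
            + b / (d_tokens ** beta)
            - gamma * math.log(n_experts))

def best_expert_count(n_active, d_tokens, mem_budget_params, expert_frac=0.8):
    """Pick E minimizing the predicted loss subject to total parameters
    (a crude memory proxy: each extra expert adds expert_frac * N) fitting
    the budget -- the kind of selection the paper's framework enables."""
    best = None
    for n_experts in (1, 2, 4, 8, 16, 32, 64):
        total_params = n_active * (1 + expert_frac * (n_experts - 1))
        if total_params > mem_budget_params:
            continue
        loss = joint_moe_loss(n_active, d_tokens, n_experts)
        if best is None or loss < best[1]:
            best = (n_experts, loss)
    return best
```

Under this toy law, more data always lowers the predicted loss, and the memory constraint caps how many experts are worth adding.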
Theory on Mixture-of-Experts in Continual Learning
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the MoE in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing MoE studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for MoE in CL.
Updated: 2025-02-19 14:35:07
Categories: cs.LG,cs.AI
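The routing-plus-load-balancing behavior the theory analyzes can be illustrated with a toy gating rule (our own simplification, not the paper's linear-regression model): experts are scored by feature similarity minus a load penalty, so the same mechanism that lets experts specialize also spreads tasks across them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 experts, 8-dimensional task features. Both the expert
# vectors and the penalty weight lam are illustrative choices.
n_experts, dim = 4, 8
experts = rng.normal(size=(n_experts, dim))
loads = np.zeros(n_experts)

def route(task_vec, lam=0.5):
    """Score each expert by similarity to the task minus a penalty
    proportional to how often it has already been chosen."""
    scores = experts @ task_vec - lam * loads
    k = int(np.argmax(scores))
    loads[k] += 1.0
    return k

tasks = rng.normal(size=(100, dim))
picks = [route(t) for t in tasks]
```

After routing 100 random tasks, every expert ends up used and the loads stay roughly balanced, mirroring the load-balancing property the paper proves for its gating network.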
FakET: Simulating Cryo-Electron Tomograms with Neural Style Transfer
In cryo-electron microscopy, accurate particle localization and classification are imperative. Recent deep learning solutions, though successful, require extensive training data sets. The protracted generation time of physics-based models, often employed to produce these data sets, limits their broad applicability. We introduce FakET, a method based on Neural Style Transfer, capable of simulating the forward operator of any cryo transmission electron microscope. It can be used to adapt a synthetic training data set according to reference data producing high-quality simulated micrographs or tilt-series. To assess the quality of our generated data, we used it to train a state-of-the-art localization and classification architecture and compared its performance with a counterpart trained on benchmark data. Remarkably, our technique matches the performance, boosts data generation speed 750 times, uses 33 times less memory, and scales well to typical transmission electron microscope detector sizes. It leverages GPU acceleration and parallel processing. The source code is available at https://github.com/paloha/faket.
Updated: 2025-02-19 14:34:59
Categories: cs.LG,eess.IV,q-bio.QM
Optimizing Gene-Based Testing for Antibiotic Resistance Prediction
Antibiotic Resistance (AR) is a critical global health challenge that necessitates the development of cost-effective, efficient, and accurate diagnostic tools. Given the genetic basis of AR, techniques such as Polymerase Chain Reaction (PCR) that target specific resistance genes offer a promising approach for predictive diagnostics using a limited set of key genes. This study introduces GenoARM, a novel framework that integrates reinforcement learning (RL) with transformer-based models to optimize the selection of PCR gene tests and improve AR predictions, leveraging observed metadata for improved accuracy. In our evaluation, we developed several high-performing baselines and compared them using publicly available datasets derived from real-world bacterial samples representing multiple clinically relevant pathogens. The results show that all evaluated methods achieve strong and reliable performance when metadata is not utilized. When metadata is introduced and the number of selected genes increases, GenoARM demonstrates superior performance due to its capacity to approximate rewards for unseen and sparse combinations. Overall, our framework represents a major advancement in optimizing diagnostic tools for AR in clinical settings.
Updated: 2025-02-19 14:34:03
Categories: q-bio.QM,cs.LG
A consensus set for the aggregation of partial rankings: the case of the Optimal Set of Bucket Orders Problem
In rank aggregation problems (RAP), the solution is usually a consensus ranking that generalizes a set of input orderings. There are different variants that differ not only in terms of the type of rankings that are used as input and output, but also in terms of the objective function employed to evaluate the quality of the desired output ranking. In contrast, in some machine learning tasks (e.g., subgroup discovery) or multimodal optimization tasks, attention is devoted to obtaining several models/results to account for the diversity in the input data or across the search landscape. Thus, in this paper we propose to provide, as the solution to an RAP, a set of rankings to better explain the preferences expressed in the input orderings. We exemplify our proposal through the Optimal Bucket Order Problem (OBOP), an RAP which consists in finding a single consensus ranking (with ties) that generalizes a set of input rankings codified as a precedence matrix. To address this, we introduce the Optimal Set of Bucket Orders Problem (OSBOP), a generalization of the OBOP that aims to produce not a single ranking as output but a set of consensus rankings. Experimental results are presented to illustrate this proposal, showing how, by providing a set of consensus rankings, the fitness of the solution improves significantly over that of the original OBOP without losing comprehensibility.
Updated: 2025-02-19 14:32:16
Categories: cs.AI
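To make the OBOP setting concrete, a bucket order can be built from a precedence matrix C (where C[u, v] is the fraction of input rankings placing u before v) with a simple greedy heuristic: rank items by their mean precedence score and merge near-tied neighbors into one bucket. This is a sketch of the problem, not the algorithm proposed in the paper.

```python
import numpy as np

def greedy_bucket_order(C, tie_tol=0.1):
    """Greedy sketch (not the paper's method): sort items by their
    Borda-style mean precedence score, then merge consecutive items
    whose scores differ by less than tie_tol into the same bucket."""
    n = C.shape[0]
    scores = C.sum(axis=1) / (n - 1)      # mean precedence per item
    order = np.argsort(-scores)           # descending score
    buckets, current = [], [int(order[0])]
    for prev, item in zip(order[:-1], order[1:]):
        if scores[prev] - scores[item] < tie_tol:
            current.append(int(item))     # tie: same bucket
        else:
            buckets.append(current)
            current = [int(item)]
    buckets.append(current)
    return buckets

# Precedence matrix for 4 items: item 0 beats everyone, items 1 and 2
# are tied with each other, item 3 loses to everyone.
C = np.array([[0.0, 0.9, 0.9, 0.9],
              [0.1, 0.0, 0.5, 0.9],
              [0.1, 0.5, 0.0, 0.9],
              [0.1, 0.1, 0.1, 0.0]])
print(greedy_bucket_order(C))  # → [[0], [1, 2], [3]]
```

The OSBOP generalization would return several such bucket orders instead of a single one, each explaining a different mode of the input preferences.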
AI Software Engineer: Programming with Trust
Large Language Models (LLMs) have shown surprising proficiency in generating code snippets, promising to automate large parts of software engineering via artificial intelligence (AI). We argue that successfully deploying AI software engineers requires a level of trust equal to or even greater than the trust established by human-driven software engineering practices. The recent trend toward LLM agents offers a path toward integrating the power of LLMs to create new code with the power of analysis tools to increase trust in the code. This opinion piece comments on whether LLM agents could dominate software engineering workflows in the future and whether the focus of programming will shift from programming at scale to programming with trust.
Updated: 2025-02-19 14:28:42
Categories: cs.SE,cs.AI
GPA: Grover Policy Agent for Generating Optimal Quantum Sensor Circuits
This study proposes a Grover Policy Agent (GPA) for designing optimal Quantum Sensor Circuits (QSCs) to address complex quantum physics problems. The GPA consists of two parts: the Quantum Policy Evaluation (QPE) and the Quantum Policy Improvement (QPI). The QPE performs phase estimation to generate the search space, while the QPI utilizes Grover search and amplitude amplification techniques to efficiently identify an optimal policy that generates optimal QSCs. The GPA generates QSCs by selecting sequences of gates that maximize the Quantum Fisher Information (QFI) while minimizing the number of gates. The QSCs generated by the GPA are capable of producing entangled quantum states, specifically the squeezed states. High QFI indicates increased sensitivity to parameter changes, making the circuit useful for quantum state estimation and control tasks. Evaluation of the GPA on a QSC that consists of two qubits and a sequence of R_x, R_y, and S gates demonstrates its efficiency in generating optimal QSCs with a QFI of 1. Compared to existing quantum agents, the GPA achieves higher QFI with fewer gates, demonstrating a more efficient and scalable approach to the design of QSCs. This work illustrates the potential computational power of quantum agents for solving quantum physics problems.
Updated: 2025-02-19 14:20:07
Categories: quant-ph,cs.AI
Heterophily-Aware Fair Recommendation using Graph Convolutional Networks
In recent years, graph neural networks (GNNs) have become a popular tool to improve the accuracy and performance of recommender systems. Modern recommender systems are not only designed to serve end users, but also to benefit other participants, such as items and item providers. These participants may have different or conflicting goals and interests, which raises the need for fairness and popularity bias considerations. GNN-based recommendation methods also face the challenges of unfairness and popularity bias, and their normalization and aggregation processes suffer from these challenges. In this paper, we propose a fair GNN-based recommender system, called HetroFair, to improve item-side fairness. HetroFair uses two separate components to generate fairness-aware embeddings: i) Fairness-aware attention, which incorporates the dot product in the normalization process of GNNs to decrease the effect of nodes' degrees. ii) Heterophily feature weighting, to assign distinct weights to different features during the aggregation process. To evaluate the effectiveness of HetroFair, we conduct extensive experiments over six real-world datasets. Our experimental results reveal that HetroFair not only alleviates unfairness and popularity bias on the item side but also achieves superior accuracy on the user side. Our implementation is publicly available at https://github.com/NematGH/HetroFair.
Updated: 2025-02-19 14:15:17
Categories: cs.IR,cs.LG,cs.SI
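The core idea of the fairness-aware attention component, that aggregation weights should depend on embedding similarity rather than directly on neighbor degree (as the standard 1/sqrt(d_u d_v) GCN normalization does), can be sketched as follows. This is an illustration of the idea only, not HetroFair's exact formulation.

```python
import numpy as np

def fairness_aware_weights(H, edges):
    """Sketch: edge weights come from embedding dot products and are
    normalized per source node, so a popular (high-degree) neighbor
    does not dominate aggregation merely because of its degree."""
    n = H.shape[0]
    W = np.zeros((n, n))
    for u, v in edges:
        W[u, v] = np.exp(H[u] @ H[v])   # similarity-based edge weight
    row = W.sum(axis=1, keepdims=True)
    # Normalize each row that has at least one edge.
    return np.divide(W, row, out=np.zeros_like(W), where=row > 0)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))             # toy node embeddings
W = fairness_aware_weights(H, [(0, 1), (0, 2), (1, 2), (2, 3)])
```

Each node's outgoing weights sum to one, so influence is redistributed by similarity instead of popularity.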
Bridging Adaptivity and Safety: Learning Agile Collision-Free Locomotion Across Varied Physics
Real-world legged locomotion systems often need to reconcile agility and safety for different scenarios. Moreover, the underlying dynamics are often unknown and time-variant (e.g., payload, friction). In this paper, we introduce BAS (Bridging Adaptivity and Safety), which builds upon the pipeline of prior work Agile But Safe (ABS) (He et al.) and is designed to provide adaptive safety even in dynamic environments with uncertainties. BAS involves an agile policy to avoid obstacles rapidly and a recovery policy to prevent collisions, a physical parameter estimator that is concurrently trained with the agile policy, and a learned control-theoretic RA (reach-avoid) value network that governs the policy switch. Also, the agile policy and RA network are both conditioned on physical parameters to make them adaptive. To mitigate the distribution shift issue, we further introduce an on-policy fine-tuning phase for the estimator to enhance its robustness and accuracy. The simulation results show that BAS achieves 50% better safety than baselines in dynamic environments while maintaining a higher speed on average. In real-world experiments, BAS shows its capability in complex environments with unknown physics (e.g., slippery floors with unknown friction, unknown payloads up to 8 kg), while baselines lack adaptivity, leading to collisions or degraded agility. As a result, BAS achieves a 19.8% increase in speed and a 2.36 times lower collision rate than ABS in the real world. Videos: https://adaptive-safe-locomotion.github.io.
Updated: 2025-02-19 14:13:51
Categories: cs.RO,cs.LG
RobustX: Robust Counterfactual Explanations Made Easy
The increasing use of Machine Learning (ML) models to aid decision-making in high-stakes industries demands explainability to facilitate trust. Counterfactual Explanations (CEs) are ideally suited for this, as they can offer insights into the predictions of an ML model by illustrating how changes in its input data may lead to different outcomes. However, for CEs to realise their explanatory potential, significant challenges remain in ensuring their robustness under slight changes in the scenario being explained. Despite the widespread recognition of CEs' robustness as a fundamental requirement, a lack of standardised tools and benchmarks hinders a comprehensive and effective comparison of robust CE generation methods. In this paper, we introduce RobustX, an open-source Python library implementing a collection of CE generation and evaluation methods, with a focus on the robustness property. RobustX provides interfaces to several existing methods from the literature, enabling streamlined access to state-of-the-art techniques. The library is also easily extensible, allowing fast prototyping of novel robust CE generation and evaluation methods.
Updated: 2025-02-19 14:12:01
Categories: cs.LG,cs.AI
Evaluating Large Language Models for Public Health Classification and Extraction Tasks
Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to: health burden, epidemiological risk factors, and public health interventions. We evaluate eleven open-weight LLMs (7-123 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3.3-70B-Instruct is the highest performing model, achieving the best results on 8/16 tasks (using micro-F1 scores). We see significant variation across tasks with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve greater than 80% micro-F1 on others, such as GI Illness Classification. For a subset of 11 tasks, we also evaluate three GPT-4 and GPT-4o series models and find comparable results to Llama-3.3-70B-Instruct. Overall, based on these initial results we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources, and support public health surveillance, research, and interventions.
Updated: 2025-02-19 14:11:10
Categories: cs.CL,cs.LG,68T50
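Micro-F1, the headline metric of this evaluation, pools per-class counts across all classes before computing precision and recall (so for single-label classification it coincides with accuracy). A minimal reference implementation:

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: aggregate true positives, false positives,
    and false negatives over all classes, then compute one F1 score."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for c in labels:
            if p == c and t == c:
                tp += 1          # predicted c and was c
            elif p == c:
                fp += 1          # predicted c but was not c
            elif t == c:
                fn += 1          # was c but predicted something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

print(micro_f1([0, 1, 2, 1], [0, 2, 2, 1], [0, 1, 2]))  # → 0.75
```

With 3 of 4 predictions correct, pooled precision and recall are both 0.75, giving micro-F1 = 0.75.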
Reverse Markov Learning: Multi-Step Generative Models for Complex Distributions
Learning complex distributions is a fundamental challenge in contemporary applications. Generative models, such as diffusion models, have demonstrated remarkable success in overcoming many limitations of traditional statistical methods. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. While effective, engression struggles with highly complex distributions, such as those encountered in image data. In this work, we extend engression to improve its capability in learning complex distributions. We propose a framework that defines a general forward process transitioning from the target distribution to a known distribution (e.g., Gaussian) and then learns a reverse Markov process using multiple engression models. This reverse process reconstructs the target distribution step by step. Our approach supports general forward processes, allows for dimension reduction, and naturally discretizes the generative process. As a special case, when using a diffusion-based forward process, our framework offers a method to discretize the training and inference of diffusion models efficiently. Empirical evaluations on simulated and climate data validate our theoretical insights, demonstrating the effectiveness of our approach in capturing complex distributions.
Updated: 2025-02-19 14:10:15
Categories: cs.LG,stat.ME,stat.ML
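Engression is trained with a proper scoring rule; the energy score is the standard choice. The proposed framework chains several such models along a reverse Markov process; sketched below is only the one-dimensional empirical loss that a single reverse step could minimize (our illustration, not the paper's code).

```python
import numpy as np

def energy_score_loss(samples_a, samples_b, x):
    """Empirical energy-score loss for a generative model: a fit term
    (model samples should lie close to the observation x) minus half a
    spread term (independent samples should not collapse onto each
    other). samples_a and samples_b are two independent batches drawn
    from the model for the same conditioning input."""
    fit = np.abs(samples_a - x).mean()
    spread = np.abs(samples_a - samples_b).mean()
    return fit - 0.5 * spread
```

A degenerate model that always outputs x attains loss 0, while a model shifted away from the data pays the full fit penalty with no compensating spread, which is what makes the score proper.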
Inference of Abstraction for Grounded Predicate Logic
An important open question in AI is what simple and natural principle enables a machine to reason logically for meaningful abstraction with grounded symbols. This paper explores a conceptually new approach to combining probabilistic reasoning and predicative symbolic reasoning over data. We return to the era of reasoning with a full joint distribution before the advent of Bayesian networks. We then discuss that a full joint distribution over models of exponential size in propositional logic and of infinite size in predicate logic should be simply derived from a full joint distribution over data of linear size. We show that the same process is not only enough to generalise the logical consequence relation of predicate logic but also to provide a new perspective to rethink well-known limitations such as the undecidability of predicate logic, the symbol grounding problem and the principle of explosion. The reproducibility of this theoretical work is fully demonstrated by the included proofs.
Updated: 2025-02-19 14:07:34
Categories: cs.AI
CARE: Confidence-Aware Regression Estimation of building density fine-tuning EO Foundation Models
Accurate confidence quantification and assessment is important for deep neural networks to predict their failures, improve their performance, and enable their practical deployment in real-world applications. For pixel-wise regression tasks, confidence quantification and assessment has not been well addressed in the literature, in contrast to classification tasks like semantic segmentation. The softmax output layer is not used in deep neural networks that solve pixel-wise regression problems. In this paper, to address these problems, we develop, train and evaluate the proposed model Confidence-Aware Regression Estimation (CARE). Our model CARE computes and assigns confidence to regression output results. We focus on solving regression problems as downstream tasks of an AI Foundation Model for Earth Observation (EO). We evaluate the proposed model CARE on data from the Copernicus Sentinel-2 satellite constellation for estimating the density of buildings, and experimental results show that the proposed method can be successfully applied to regression problems. We also show that our approach outperforms other methods.
Updated: 2025-02-19 14:02:00
Categories: cs.CV,cs.LG
Homophily Heterogeneity Matters in Graph Federated Learning: A Spectrum Sharing and Complementing Perspective
Since heterogeneity presents a fundamental challenge in graph federated learning, many existing methods are proposed to deal with node feature heterogeneity and structure heterogeneity. However, they overlook the critical homophily heterogeneity, which refers to the substantial variation in homophily levels across graph data from different clients. The homophily level represents the proportion of edges connecting nodes that belong to the same class. Because they adapt to their local homophily, local models capture inconsistent spectral properties across different clients, significantly reducing the effectiveness of collaboration. Specifically, local models trained on graphs with high homophily tend to capture low-frequency information, whereas local models trained on graphs with low homophily tend to capture high-frequency information. To effectively deal with homophily heterogeneity, we introduce the spectral Graph Neural Network (GNN) and propose a novel Federated learning method by mining Graph Spectral Properties (FedGSP). On one hand, our proposed FedGSP enables clients to share generic spectral properties (i.e., low-frequency information), allowing all clients to benefit through collaboration. On the other hand, inspired by our theoretical findings, our proposed FedGSP allows clients to complement non-generic spectral properties by acquiring the spectral properties they lack (i.e., high-frequency information), thereby obtaining additional information gain. Extensive experiments conducted on six homophilic and five heterophilic graph datasets, across both non-overlapping and overlapping settings, validate the superiority of our method over eleven state-of-the-art methods. Notably, our FedGSP outperforms the second-best method by an average margin of 3.28% on all heterophilic datasets.
Updated: 2025-02-19 13:58:08
Categories: cs.LG
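The low- versus high-frequency split that underlies FedGSP can be made concrete with spectral filters on the normalized graph Laplacian: a low-pass filter keeps the information homophilic clients capture, a high-pass filter keeps what heterophilic clients capture, and complementary filters together reconstruct the original signal. The toy graph and filter responses below are illustrative choices, not the paper's architecture.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt

def spectral_filter(A, X, response):
    """Apply a spectral filter h(lambda) to node features X.
    Low-pass responses (large at small eigenvalues) suit homophilic
    graphs; high-pass responses suit heterophilic ones."""
    lam, U = np.linalg.eigh(normalized_laplacian(A))
    return U @ np.diag(response(lam)) @ U.T @ X

# Toy graph: a 4-cycle with a scalar feature per node.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [3.0], [4.0]])

low = spectral_filter(A, X, lambda l: np.exp(-l))       # generic / shared part
high = spectral_filter(A, X, lambda l: 1 - np.exp(-l))  # complementary part
```

Because the two responses sum to one at every eigenvalue, `low + high` recovers `X` exactly, mirroring the share-generic / complement-specific decomposition.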
Robust Counterfactual Inference in Markov Decision Processes
This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
Updated: 2025-02-19 13:56:20
Categories: cs.AI
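Once interval bounds on the counterfactual transition probabilities are available, robust policy evaluation reduces to a classical inner problem: the worst-case expectation over all distributions inside the intervals. The greedy solution below is the standard building block used in interval-MDP value iteration; the paper's actual contribution, closed-form bounds on the counterfactual intervals themselves, is not reproduced here.

```python
def worst_case_expectation(values, lo, hi):
    """Worst-case expected value over all distributions p with
    lo[i] <= p[i] <= hi[i] and sum(p) = 1 (assumes the intervals are
    feasible, i.e. sum(lo) <= 1 <= sum(hi)): start every outcome at its
    lower bound, then push the remaining probability mass onto the
    lowest-value outcomes first, up to their upper bounds."""
    p = list(lo)
    slack = 1.0 - sum(lo)
    for i in sorted(range(len(values)), key=lambda i: values[i]):
        add = min(hi[i] - lo[i], slack)
        p[i] += add
        slack -= add
    return sum(pi * v for pi, v in zip(p, values))

# Two outcomes worth 0 and 10; each may receive between 0.2 and 0.9
# probability. The adversary piles mass on the worthless outcome.
print(worst_case_expectation([0.0, 10.0], [0.2, 0.2], [0.9, 0.9]))  # → 2.0
```

A robust policy then maximizes this worst-case quantity, which is what the paper's robust counterfactual policies optimize over the interval counterfactual MDP.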
Secure Federated Data Distillation
Dataset Distillation (DD) is a powerful technique for reducing large datasets into compact, representative synthetic datasets, accelerating Machine Learning training. However, traditional DD methods operate in a centralized manner, which poses significant privacy threats and limits their applicability. To mitigate these risks, we propose a Secure Federated Data Distillation framework (SFDD) to decentralize the distillation process while preserving privacy. Unlike existing Federated Distillation techniques that focus on training global models with distilled knowledge, our approach aims to produce a distilled dataset without exposing local contributions. We leverage the gradient-matching-based distillation method, adapting it for a distributed setting where clients contribute to the distillation process without sharing raw data. The central aggregator iteratively refines a synthetic dataset by integrating client-side updates while ensuring data confidentiality. To make our approach resilient to inference attacks perpetrated by the server that could exploit gradient updates to reconstruct private data, we create an optimized Local Differential Privacy approach, called LDPO-RLD (Label Differential Privacy Obfuscation via Randomized Linear Dispersion). Furthermore, we assess the framework's resilience against malicious clients executing backdoor attacks and demonstrate robustness under the assumption of a sufficient number of participating clients. Our experimental results demonstrate the effectiveness of SFDD and that the proposed defense concretely mitigates the identified vulnerabilities, with minimal impact on the performance of the distilled dataset. By addressing the interplay between privacy and federation in dataset distillation, this work advances the field of privacy-preserving Machine Learning, making our SFDD framework a viable solution for sensitive data-sharing applications.
Updated: 2025-02-19 13:54:44
Categories: cs.CR,cs.AI
Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values
We introduce Direct Value Optimization (DVO), an innovative reinforcement learning framework for enhancing large language models in complex reasoning tasks. Unlike traditional methods relying on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. The key benefit of DVO lies in its fine-grained supervision, circumventing the need for labor-intensive human annotations. Target values within the DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology under scenarios lacking explicit human preference information.
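The core objective is simple enough to sketch: a mean squared error between the model's per-step value estimates and target values (which the paper obtains via Monte Carlo Tree Search or an outcome value model; here they are simply given as inputs).

```python
# Minimal sketch of the DVO objective, assuming per-step scalar value
# estimates and targets are already available.

def dvo_loss(predicted_values, target_values):
    pairs = list(zip(predicted_values, target_values))
    # mean squared error over reasoning steps
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)
```

For example, predictions [0.5, 0.8] against targets [1.0, 1.0] give a loss of (0.25 + 0.04) / 2 = 0.145.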
Updated: 2025-02-19 13:51:05
Categories: cs.CL,cs.AI
Deep Learning for VWAP Execution in Crypto Markets: Beyond the Volume Curve
Volume-Weighted Average Price (VWAP) is arguably the most prevalent benchmark for trade execution as it provides an unbiased standard for comparing performance across market participants. However, achieving VWAP is inherently challenging due to its dependence on two dynamic factors, volumes and prices. Traditional approaches typically focus on forecasting the market's volume curve, an assumption that may hold true under steady conditions but becomes suboptimal in more volatile environments or markets such as cryptocurrency where prediction error margins are higher. In this study, I propose a deep learning framework that directly optimizes the VWAP execution objective by bypassing the intermediate step of volume curve prediction. Leveraging automatic differentiation and custom loss functions, my method calibrates order allocation to minimize VWAP slippage, thereby fully addressing the complexities of the execution problem. My results demonstrate that this direct optimization approach consistently achieves lower VWAP slippage compared to conventional methods, even when utilizing a naive linear model presented in arXiv:2410.21448. They validate the observation that strategies optimized for VWAP performance tend to diverge from accurate volume curve predictions and thus underscore the advantage of directly modeling the execution objective. This research contributes a more efficient and robust framework for VWAP execution in volatile markets, illustrating the potential of deep learning in complex financial systems where direct objective optimization is crucial. Although my empirical analysis focuses on cryptocurrency markets, the underlying principles of the framework are readily applicable to other asset classes such as equities.
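The execution objective being optimized can be sketched as follows (a hedged stand-in: the paper trains a neural network on this kind of loss, while here the quantities are just computed for fixed inputs):

```python
# VWAP slippage of an order allocation relative to the market VWAP.

def vwap(prices, volumes):
    return sum(p * v for p, v in zip(prices, volumes)) / sum(volumes)

def slippage(prices, market_volumes, allocation):
    # relative difference between achieved execution VWAP and market VWAP
    return vwap(prices, allocation) / vwap(prices, market_volumes) - 1.0
```

Matching the market volume curve yields zero slippage, while concentrating the allocation in the cheaper interval under-pays the market VWAP (negative slippage) — which is why strategies optimized for VWAP performance need not track the volume curve.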
Updated: 2025-02-19 13:49:51
Categories: q-fin.ST,cs.LG
Learning Novel Transformer Architecture for Time-series Forecasting
Despite the success of Transformer-based models in time-series prediction (TSP) tasks, existing Transformer architectures still face limitations, and the literature lacks a comprehensive exploration of alternative architectures. To address these challenges, we propose AutoFormer-TS, a novel framework that leverages a comprehensive search space for Transformer architectures tailored to TSP tasks. Our framework introduces a differentiable neural architecture search (DNAS) method, AB-DARTS, which improves upon existing DNAS approaches by enhancing the identification of optimal operations within the architecture. AutoFormer-TS systematically explores alternative attention mechanisms, activation functions, and encoding operations, moving beyond the traditional Transformer design. Extensive experiments demonstrate that AutoFormer-TS consistently outperforms state-of-the-art baselines across various TSP benchmarks, achieving superior forecasting accuracy while maintaining reasonable training efficiency.
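The abstract does not detail AB-DARTS, but DARTS-style differentiable search rests on a continuous relaxation that is easy to sketch: each edge computes a softmax-weighted sum of candidate operations, and the architecture weights (alphas) are trained by gradient descent alongside the network weights.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixed_op(x, alphas, ops):
    # DNAS continuous relaxation: softmax-weighted sum of candidate operations
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))
```

With equal architecture weights over identity, doubling, and zero operations, `mixed_op(3.0, [0, 0, 0], ops)` averages their outputs to 3.0; training then sharpens the alphas toward the best-performing operation.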
Updated: 2025-02-19 13:49:20
Categories: cs.LG,cs.CL
TrustRAG: An Information Assistant with Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) has emerged as a crucial technique for enhancing large models with real-time and domain-specific knowledge. While numerous improvements and open-source tools have been proposed to refine the RAG framework for accuracy, relatively little attention has been given to improving the trustworthiness of generated results. To address this gap, we introduce TrustRAG, a novel framework that enhances RAG from three perspectives: indexing, retrieval, and generation. Specifically, in the indexing stage, we propose a semantic-enhanced chunking strategy that incorporates hierarchical indexing to supplement each chunk with contextual information, ensuring semantic completeness. In the retrieval stage, we introduce a utility-based filtering mechanism to identify high-quality information, supporting answer generation while reducing input length. In the generation stage, we propose fine-grained citation enhancement, which detects opinion-bearing sentences in responses and infers citation relationships at the sentence level, thereby improving citation accuracy. We open-source the TrustRAG framework and provide a demonstration studio designed for excerpt-based question answering tasks (https://huggingface.co/spaces/golaxy/TrustRAG). Based on these, we aim to help researchers (1) systematically enhance the trustworthiness of RAG systems and (2) develop their own RAG systems with more reliable outputs.
Updated: 2025-02-19 13:45:27
Categories: cs.IR,cs.AI
Appeal Prediction for AI Up-scaled Images
DNN- or AI-based up-scaling algorithms are gaining in popularity due to the improvements in machine learning. Various up-scaling models using CNNs, GANs or mixed approaches have been published. The majority of models are evaluated using PSNR and SSIM or only a few example images. However, a performance evaluation with a wide range of real-world images and subjective evaluation is missing, which we tackle in the following paper. For this reason, we describe our developed dataset, which uses 136 base images and five different up-scaling methods, namely Real-ESRGAN, BSRGAN, waifu2x, KXNet, and Lanczos. Overall, the dataset consists of 1496 annotated images. The labeling of our dataset focused on image appeal and was performed via crowd-sourcing using our open-source tool AVRate Voyager. We evaluate the appeal of the different methods, and the results indicate that Real-ESRGAN and BSRGAN are the best. Furthermore, we train a DNN to detect which up-scaling method has been used; the trained models show good overall performance in our evaluation. In addition, we evaluate state-of-the-art image appeal and quality models, none of which showed high prediction performance, so we also trained two approaches of our own. The first uses transfer learning and has the best performance; the second uses signal-based features and a random forest model with good overall performance. We share the data and implementation to allow further research in the context of open science.
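The PSNR metric the abstract criticizes as insufficient on its own is straightforward to compute; a minimal sketch over flattened pixel lists:

```python
import math

def psnr(reference, distorted, max_val=255.0):
    # Peak Signal-to-Noise Ratio: 10 * log10(MAX^2 / MSE)
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A uniform error of 16 gray levels, for instance, yields a PSNR of about 24 dB regardless of how visually appealing the result is — which is precisely why subjective appeal labels add information.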
Updated: 2025-02-19 13:45:24
Categories: cs.GR,cs.AI,eess.IV
InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models
In the evolving landscape of multimodal language models, understanding the nuanced meanings conveyed through visual cues - such as satire, insult, or critique - remains a significant challenge. Existing evaluation benchmarks primarily focus on direct tasks like image captioning or are limited to a narrow set of categories, such as humor or satire, for deep semantic understanding. To address this gap, we introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark designed specifically for evaluating the understanding of implicit meanings in images. This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension. We propose an innovative semi-automatic method for constructing datasets, adhering to established construction protocols. Using this benchmark, we evaluate 15 open-source large vision language models (LVLMs) and GPT-4o, revealing that even the best-performing model lags behind human performance by nearly 14% in understanding implicit meaning. Our findings underscore the intrinsic challenges current LVLMs face in grasping nuanced visual semantics, highlighting significant opportunities for future research and development in this domain. We will publicly release our InsightVision dataset and code upon acceptance of the paper.
Updated: 2025-02-19 13:42:37
Categories: cs.LG,cs.AI
Spiking Point Transformer for Point Cloud Classification
Spiking Neural Networks (SNNs) offer an attractive and energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their sparse binary activation. When SNN meets Transformer, it shows great potential in 2D image processing. However, their application for 3D point cloud remains underexplored. To this end, we present Spiking Point Transformer (SPT), the first transformer-based SNN framework for point cloud classification. Specifically, we first design Queue-Driven Sampling Direct Encoding for point cloud to reduce computational costs while retaining the most effective support points at each time step. We introduce the Hybrid Dynamics Integrate-and-Fire Neuron (HD-IF), designed to simulate selective neuron activation and reduce over-reliance on specific artificial neurons. SPT attains state-of-the-art results on three benchmark datasets that span both real-world and synthetic datasets in the SNN domain. Meanwhile, the theoretical energy consumption of SPT is at least 6.4x lower than that of its ANN counterpart.
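The HD-IF neuron's hybrid dynamics are not spelled out in the abstract, but the plain leaky integrate-and-fire (LIF) baseline it extends can be sketched in a few lines: the membrane potential leaks toward the input current, and a spike is emitted (and the potential reset) whenever it crosses a threshold.

```python
def lif_step(v, current, tau=2.0, v_th=1.0, v_reset=0.0):
    # leaky integrate-and-fire: leak toward the input, spike at threshold
    v = v + (current - v) / tau
    if v >= v_th:
        return v_reset, 1  # spike and reset
    return v, 0

def run(currents, v=0.0):
    spikes = []
    for c in currents:
        v, s = lif_step(v, c)
        spikes.append(s)
    return spikes
```

A strong constant input produces a spike every step, while a weak one never reaches threshold — the sparse binary activation that makes SNNs energy-efficient.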
Updated: 2025-02-19 13:28:55
Categories: cs.LG,cs.AI
Generalization Bounds for Dependent Data using Online-to-Batch Conversion
In this work, we upper bound the generalization error of batch learning algorithms trained on samples drawn from a mixing stochastic process (i.e., a dependent data source) both in expectation and with high probability. Unlike previous results by Mohri et al. (2010) and Fu et al. (2023), our work does not require any stability assumptions on the batch learner, which allows us to derive upper bounds for any batch learning algorithm trained on dependent data. This is made possible due to our use of the Online-to-Batch (OTB) conversion framework, which allows us to shift the burden of stability from the batch learner to an artificially constructed online learner. We show that our bounds are equal to the bounds in the i.i.d. setting up to a term that depends on the decay rate of the underlying mixing stochastic process. Central to our analysis is a new notion of algorithmic stability for online learning algorithms based on Wasserstein distances of order one. Furthermore, we prove that the EWA algorithm, a textbook family of online learning algorithms, satisfies our new notion of stability. Following this, we instantiate our bounds using the EWA algorithm.
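The two ingredients named above are both textbook objects and can be sketched together (a minimal illustration, assuming squared loss and expert predictions given in advance): the Exponentially Weighted Average (EWA) forecaster weights each expert by the exponential of its negative cumulative loss, and the classic online-to-batch conversion outputs the average of the online predictions.

```python
import math

def ewa_forecasts(expert_preds, outcomes, eta=0.5):
    # expert_preds[t][i]: prediction of expert i at round t; squared loss
    n = len(expert_preds[0])
    cum_loss = [0.0] * n
    forecasts = []
    for preds, y in zip(expert_preds, outcomes):
        w = [math.exp(-eta * L) for L in cum_loss]
        total = sum(w)
        forecasts.append(sum(wi * p for wi, p in zip(w, preds)) / total)
        cum_loss = [L + (p - y) ** 2 for L, p in zip(cum_loss, preds)]
    return forecasts

def online_to_batch(forecasts):
    # the classic OTB conversion averages the online predictions
    return sum(forecasts) / len(forecasts)
```

With two constant experts (0 and 1) and outcomes equal to 1, the forecaster starts at 0.5 and shifts toward the better expert as its weight grows.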
Updated: 2025-02-19 13:27:10
Categories: cs.LG
Through the Looking-Glass: Transparency Implications and Challenges in Enterprise AI Knowledge Systems
Knowledge can't be disentangled from people. As AI knowledge systems mine vast volumes of work-related data, the knowledge that's being extracted and surfaced is intrinsically linked to the people who create and use it. When predictive algorithms that learn from data are used to link knowledge and people, inaccuracies in knowledge extraction and surfacing can lead to disproportionate harms, influencing how individuals see each other and how they see themselves at work. In this paper, we present a reflective analysis of transparency requirements and impacts in this type of systems. We conduct a multidisciplinary literature review to understand the impacts of transparency in workplace settings, introducing the looking-glass metaphor to conceptualize AI knowledge systems as systems that reflect and distort, expanding our view on transparency requirements, implications and challenges. We formulate transparency as a key mediator in shaping different ways of seeing, including seeing into the system, which unveils its capabilities, limitations and behavior, and seeing through the system, which shapes workers' perceptions of their own contributions and others within the organization. Recognizing the sociotechnical nature of these systems, we identify three transparency dimensions necessary to realize the value of AI knowledge systems, namely system transparency, procedural transparency and transparency of outcomes. We discuss key challenges hindering the implementation of these forms of transparency, bringing to light the wider sociotechnical gap and highlighting directions for future Computer-supported Cooperative Work (CSCW) research.
Updated: 2025-02-19 13:19:29
Categories: cs.CY,cs.AI,cs.HC
Causes and Strategies in Multiagent Systems
Causality plays an important role in daily processes, human reasoning, and artificial intelligence. There has however not been much research on causality in multi-agent strategic settings. In this work, we introduce a systematic way to build a multi-agent system model, represented as a concurrent game structure, for a given structural causal model. In the obtained so-called causal concurrent game structure, transitions correspond to interventions on agent variables of the given causal model. The Halpern and Pearl framework of causality is used to determine the effects of a certain value for an agent variable on other variables. The causal concurrent game structure allows us to analyse and reason about causal effects of agents' strategic decisions. We formally investigate the relation between causal concurrent game structures and the original structural causal models.
Updated: 2025-02-19 13:18:42
Categories: cs.AI,cs.MA
Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity
Architectures such as Linformer and Mamba have recently emerged as competitive linear time replacements for transformers. However, corresponding large pretrained models are often unavailable, especially in non-text domains. To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model to a linear time substitute and fine-tunes it to a target task. We also compare several means to guide the fine-tuning to optimally retain the desired inference capability from the original model. The methods differ in their use of the target model and the trajectory of the parameters. In a series of empirical studies on language processing, language modeling, and speech processing, we show that CALD can effectively recover the result of the original model, and that the guiding strategy contributes to the result. Some reasons for the variation are suggested.
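A layerwise distillation objective of the kind CALD builds on can be sketched as a mean squared error between corresponding teacher and student layer activations (a minimal stand-in, assuming the activations are already computed and aligned layer by layer):

```python
def layerwise_distill_loss(teacher_layers, student_layers):
    # MSE between corresponding layer activations, so the linear-time
    # substitute is guided layer by layer by the transformer teacher
    total, count = 0.0, 0
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        for t, s in zip(t_layer, s_layer):
            total += (t - s) ** 2
            count += 1
    return total / count
```

The target-task fine-tuning loss is then added on top; the guiding strategies compared in the paper differ in how heavily and for how long this matching term constrains the student.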
Updated: 2025-02-19 13:08:42
Categories: cs.CL,cs.AI,cs.LG,cs.SD,eess.AS
Tight Generalization Bounds for Large-Margin Halfspaces
We prove the first generalization bound for large-margin halfspaces that is asymptotically tight in the tradeoff between the margin, the fraction of training points with the given margin, the failure probability and the number of training points.
Updated: 2025-02-19 13:04:49
Categories: cs.LG,math.ST,stat.TH
Graph Signal Inference by Learning Narrowband Spectral Kernels
While a common assumption in graph signal analysis is the smoothness of the signals or the band-limitedness of their spectrum, in many instances the spectrum of real graph data may be concentrated at multiple regions of the spectrum, possibly including mid-to-high-frequency components. In this work, we propose a novel graph signal model where the signal spectrum is represented through the combination of narrowband kernels in the graph frequency domain. We then present an algorithm that jointly learns the model by optimizing the kernel parameters and the signal representation coefficients from a collection of graph signals. Our problem formulation has the flexibility of permitting the incorporation of signals possibly acquired on different graphs into the learning algorithm. We then theoretically study the signal reconstruction performance of the proposed method, also elaborating on when joint learning on multiple graphs is preferable to learning an individual model on each graph. Experimental results on several graph datasets show that the proposed method offers quite satisfactory signal interpolation accuracy in comparison with a variety of reference approaches in the literature.
Updated: 2025-02-19 12:54:39
Categories: stat.ML,cs.LG
MoM: Linear Sequence Modeling with Mixture-of-Memories
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training, while constant-complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
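The routing mechanism can be sketched with lists standing in for memory states (a hypothetical, minimal example; the real memories are recurrent states updated by linear-attention rules, not token lists):

```python
def route_token(token, score, memories, top_k=1):
    # the router scores every memory state and sends the token only to the
    # top-k states, so each update touches a single memory and stays linear
    scores = [score(token, i) for i in range(len(memories))]
    chosen = sorted(range(len(memories)), key=lambda i: -scores[i])[:top_k]
    for i in chosen:
        memories[i].append(token)
    return chosen
```

Because different token streams land in different memories, interference between them is reduced while the total memory capacity grows with the number of states.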
Updated: 2025-02-19 12:53:55
Categories: cs.CL,cs.AI,cs.LG
An LLM-based Agent for Reliable Docker Environment Configuration
Environment configuration is a critical yet time-consuming step in software development, especially when dealing with unfamiliar code repositories. While Large Language Models (LLMs) demonstrate the potential to accomplish software engineering tasks, existing methods for environment configuration often rely on manual efforts or fragile scripts, leading to inefficiencies and unreliable outcomes. We introduce Repo2Run, the first LLM-based agent designed to fully automate environment configuration and generate executable Dockerfiles for arbitrary Python repositories. We address two major challenges: (1) enabling the LLM agent to configure environments within isolated Docker containers, and (2) ensuring the successful configuration process is recorded and accurately transferred to a Dockerfile without error. To achieve this, we propose atomic configuration synthesis, featuring a dual-environment architecture (internal and external environment) with a rollback mechanism to prevent environment "pollution" from failed commands, guaranteeing atomic execution (execute fully or not at all) and a Dockerfile generator to transfer successful configuration steps into runnable Dockerfiles. We evaluate Repo2Run on our proposed benchmark of 420 recent Python repositories with unit tests, where it achieves an 86.0% success rate, outperforming the best baseline by 63.9%.
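The "execute fully or not at all" contract can be sketched with a dict standing in for the environment (a hedged stand-in: the real system snapshots Docker container state, not an in-memory mapping):

```python
def atomic_apply(env, commands):
    # snapshot the environment, then roll back if any command fails,
    # so failed commands cannot "pollute" the environment
    snapshot = dict(env)
    try:
        for cmd in commands:
            cmd(env)
    except Exception:
        env.clear()
        env.update(snapshot)
        return False
    return True
```

Only command sequences that return True are recorded and transferred into the generated Dockerfile, which is what keeps the Dockerfile an accurate replay of a successful configuration.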
Updated: 2025-02-19 12:51:35
Categories: cs.SE,cs.AI,cs.CL,cs.LG
DFDT: Dynamic Fast Decision Tree for IoT Data Stream Mining on Edge Devices
The Internet of Things generates massive data streams, with edge computing emerging as a key enabler for online IoT applications and 5G networks. Edge solutions facilitate real-time machine learning inference, but also require continuous adaptation to concept drifts. Ensemble-based solutions improve predictive performance, but incur higher resource consumption, latency, and memory demands. This paper presents DFDT: Dynamic Fast Decision Tree, a novel algorithm designed for energy-efficient memory-constrained data stream mining. DFDT improves Hoeffding tree growth efficiency by dynamically adjusting grace periods, tie thresholds, and split evaluations based on incoming data. It incorporates stricter evaluation rules (based on entropy, information gain, and leaf instance count), adaptive expansion modes, and a leaf deactivation mechanism to manage memory, allowing more computation on frequently visited nodes while conserving energy on others. Experiments show that the proposed framework can achieve increased predictive performance (0.43 vs 0.29 ranking) with constrained memory and a fraction of the runtime of VFDT or SVFDT.
Updated: 2025-02-19 12:45:42
Categories: cs.LG,cs.AI,cs.NI
Backpropagation-free Spiking Neural Networks with the Forward-Forward Algorithm
Spiking Neural Networks (SNNs) offer a biologically inspired computational paradigm that emulates neuronal activity through discrete spike-based processing. Despite their advantages, training SNNs with traditional backpropagation (BP) remains challenging due to computational inefficiencies and a lack of biological plausibility. This study explores the Forward-Forward (FF) algorithm as an alternative learning framework for SNNs. Unlike backpropagation, which relies on forward and backward passes, the FF algorithm employs two forward passes, enabling localized learning, enhanced computational efficiency, and improved compatibility with neuromorphic hardware. We introduce an FF-based SNN training framework and evaluate its performance across both non-spiking (MNIST, Fashion-MNIST, CIFAR-10) and spiking (Neuro-MNIST, SHD) datasets. Experimental results demonstrate that our model surpasses existing FF-based SNNs by over 5% on MNIST and Fashion-MNIST while achieving accuracy comparable to state-of-the-art backpropagation-trained SNNs. On more complex tasks such as CIFAR-10 and SHD, our approach outperforms other SNN models by up to 6% and remains competitive with leading backpropagation-trained SNNs. These findings highlight the FF algorithm's potential to advance SNN training methodologies and neuromorphic computing by addressing key limitations of backpropagation.
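The Forward-Forward objective replaces the backward pass with a local, per-layer criterion; in Hinton's original formulation each layer's "goodness" is the sum of squared activations, pushed above a threshold for positive (real) data and below it for negative data. A minimal sketch of that layer-local loss (the paper's SNN-specific adaptation is not detailed in the abstract):

```python
import math

def goodness(activations):
    # FF "goodness" of a layer: sum of squared activations
    return sum(a * a for a in activations)

def ff_layer_loss(pos_acts, neg_acts, theta=2.0):
    # each layer is trained locally from two forward passes, no backward pass
    p_pos = 1.0 / (1.0 + math.exp(-(goodness(pos_acts) - theta)))
    p_neg = 1.0 / (1.0 + math.exp(-(goodness(neg_acts) - theta)))
    return -math.log(p_pos) - math.log(1.0 - p_neg)
```

The loss is small when positive samples produce high goodness and negative samples low goodness, and large when the roles are reversed.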
Updated: 2025-02-19 12:44:26
Categories: cs.NE,cs.AI,cs.LG
Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset
This study evaluates the performance of Large Language Models (LLMs) on the SemEval-2020 Task 4 dataset, focusing on commonsense validation and explanation. Our methodology involves evaluating multiple LLMs, including LLaMA3-70B, Gemma2-9B, and Mixtral-8x7B, using zero-shot prompting techniques. The models are tested on two tasks: Task A (Commonsense Validation), where models determine whether a statement aligns with commonsense knowledge, and Task B (Commonsense Explanation), where models identify the reasoning behind implausible statements. Performance is assessed based on accuracy, and results are compared to fine-tuned transformer-based models. The results indicate that larger models outperform previous models and perform close to human evaluation for Task A, with LLaMA3-70B achieving the highest accuracy of 98.40% in Task A, while lagging behind previous models at 93.40% in Task B. However, while models effectively identify implausible statements, they face challenges in selecting the most relevant explanation, highlighting limitations in causal and inferential reasoning.
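A zero-shot setup for Task A amounts to a fixed prompt template with no in-context examples; the sketch below is a hypothetical template for illustration, as the exact prompts used in the study are not given in the abstract.

```python
def validation_prompt(statement_a, statement_b):
    # hypothetical zero-shot template for Task A (Commonsense Validation)
    return (
        "Which of the following statements goes against commonsense "
        "knowledge? Answer with a single letter.\n"
        f"A: {statement_a}\n"
        f"B: {statement_b}"
    )
```

The model's single-letter answer is then compared against the gold label to compute accuracy.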
Updated: 2025-02-19 12:40:49
Domains: cs.CL,cs.AI
High-dimensional manifold of solutions in neural networks: insights from statistical physics
In these pedagogic notes I review the statistical mechanics approach to neural networks, focusing on the paradigmatic example of the perceptron architecture with binary and continuous weights, in the classification setting. I will review Gardner's approach based on the replica method and the derivation of the SAT/UNSAT transition in the storage setting. Then, I discuss some recent works that unveiled how the zero training error configurations are geometrically arranged, and how this arrangement changes as the size of the training set increases. I also illustrate how different regions of solution space can be explored analytically and how the landscape in the vicinity of a solution can be characterized. I give evidence that, in binary-weight models, algorithmic hardness is a consequence of the disappearance of a clustered region of solutions that extends to very large distances. Finally, I demonstrate how the study of linear mode connectivity between solutions can give insights into the average shape of the solution manifold.
Updated: 2025-02-19 12:39:39
Domains: cond-mat.dis-nn,cs.LG,math.PR,math.ST,stat.TH
CriteoPrivateAd: A Real-World Bidding Dataset to Design Private Advertising Systems
In the past years, many proposals have emerged in order to address online advertising use-cases without access to third-party cookies. All these proposals leverage some privacy-enhancing technologies such as aggregation or differential privacy. Yet, no public and rich-enough ground truth is currently available to assess the relevancy of the aforementioned private advertising frameworks. We are releasing the largest bidding dataset, in terms of number of features, specifically built in alignment with the designs of major browser vendors' proposals such as Chrome Privacy Sandbox. This dataset, coined CriteoPrivateAd, is an anonymised version of Criteo production logs and provides sufficient data to learn bidding models commonly used in online advertising under many privacy constraints (delayed reports, display- and user-level differential privacy, user signal quantisation or aggregated reports). We ensured that this dataset, while being anonymised, is able to provide offline results close to the production performance of adtech companies including Criteo, making it a relevant ground truth for designing private advertising systems. The dataset is available on Hugging Face: https://huggingface.co/datasets/criteo/CriteoPrivateAd.
Updated: 2025-02-19 12:35:59
Domains: cs.CR,stat.CO
Safety Layers in Aligned Large Language Models: The Key to LLM Security
Aligned LLMs are secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining such security is not yet well understood; furthermore, these models can be vulnerable to security degradation when subjected to fine-tuning attacks. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as ``safety layers''. We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation. Our experiments demonstrate that the proposed approach can significantly preserve LLM security while maintaining performance and reducing computational resources compared to full fine-tuning.
Updated: 2025-02-19 12:35:31
Domains: cs.CR,cs.AI
Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models
Recent advances in large language models have led to numerous task-specialized fine-tuned variants, creating a need for efficient model merging techniques that preserve specialized capabilities while avoiding costly retraining. While existing task vector-based merging methods show promise, they typically apply uniform coefficients across all parameters, overlooking varying parameter importance both within and across tasks. We present Sens-Merging, a sensitivity-guided coefficient adjustment method that enhances existing model merging techniques by operating at both task-specific and cross-task levels. Our method analyzes parameter sensitivity within individual tasks and evaluates cross-task transferability to determine optimal merging coefficients. Extensive experiments on Mistral 7B and LLaMA2-7B/13B models demonstrate that Sens-Merging significantly improves performance across general knowledge, mathematical reasoning, and code generation tasks. Notably, when combined with existing merging techniques, our method enables merged models to outperform specialized fine-tuned models, particularly in code generation tasks. Our findings reveal important trade-offs between task-specific and cross-task scalings, providing insights for future model merging strategies.
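As a toy illustration of the coefficient idea above, the sketch below merges flat parameter lists by scaling each task vector (the delta of a fine-tuned model from the base) with a normalized sensitivity score instead of a uniform coefficient. The sensitivity values are assumed given; how they are computed per parameter and per task is the model-specific part of the method and is elided here.

```python
# Illustrative sketch of sensitivity-weighted task-vector merging.
# Parameters are flattened into plain lists for clarity.

def merge(base, task_models, sensitivities):
    """Merge fine-tuned models into the base using normalized sensitivities."""
    total = sum(sensitivities)
    coeffs = [s / total for s in sensitivities]  # normalize to sum to 1
    merged = list(base)
    for model, c in zip(task_models, coeffs):
        for i, (p, b) in enumerate(zip(model, base)):
            merged[i] += c * (p - b)  # scaled task vector
    return merged
```

With uniform sensitivities this reduces to ordinary task-vector averaging; unequal sensitivities tilt the merge toward the more sensitive task.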
Updated: 2025-02-19 12:34:46
Domains: cs.CL,cs.AI
The First Early Evidence of the Use of Browser Fingerprinting for Online Tracking
While advertising has become commonplace in today's online interactions, there is a notable dearth of research investigating the extent to which browser fingerprinting is harnessed for user tracking and targeted advertising. Prior studies only measured whether fingerprinting-related scripts are being run on websites, but that in itself does not necessarily mean that fingerprinting is being used for the privacy-invasive purpose of online tracking, because fingerprinting might be deployed for the defensive purposes of bot/fraud detection and user authentication. It is imperative to address the mounting concerns regarding the utilization of browser fingerprinting in the realm of online advertising. This paper introduces ``FPTrace'' (fingerprinting-based tracking assessment and comprehensive evaluation framework), a framework to assess fingerprinting-based user tracking by analyzing ad changes in response to browser fingerprinting adjustments. Using FPTrace, we emulate user interactions, capture ad bid data, and monitor HTTP traffic. Our large-scale study reveals strong evidence of browser fingerprinting for ad tracking and targeting, shown by bid value disparities and reduced HTTP records after fingerprinting changes. We also show fingerprinting can bypass GDPR/CCPA opt-outs, enabling privacy-invasive tracking. In conclusion, our research unveils the widespread employment of browser fingerprinting in online advertising, prompting critical considerations regarding user privacy and data security within the digital advertising landscape.
Updated: 2025-02-19 12:34:29
Domains: cs.CR
Which Attention Heads Matter for In-Context Learning?
Large language models (LLMs) exhibit impressive in-context learning (ICL) capability, enabling them to perform new tasks using only a few demonstrations in the prompt. Two different mechanisms have been proposed to explain ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. To better understand which of the two distinct mechanisms drives ICL, we study and compare induction heads and FV heads in 12 language models. Through detailed ablations, we discover that few-shot ICL performance depends primarily on FV heads, especially in larger models. In addition, we uncover that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction facilitates learning the more complex FV mechanism that ultimately drives ICL.
Updated: 2025-02-19 12:25:02
Domains: cs.LG,cs.AI,cs.CL
PeerQA: A Scientific Question Answering Dataset from Peer Reviews
We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset of other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: Evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data are available at https://github.com/UKPLab/peerqa.
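The evidence-retrieval and decontextualization findings can be illustrated with a deliberately simple lexical baseline (not one of the paper's systems): paragraphs are ranked by token overlap with the reviewer question, optionally prefixed with their section title as a crude form of decontextualization.

```python
# Toy evidence-retrieval baseline with optional decontextualization.

def tokenize(text):
    return set(text.lower().split())

def score(question, paragraph):
    q, p = tokenize(question), tokenize(paragraph)
    return len(q & p) / max(len(q), 1)  # fraction of question tokens covered

def retrieve(question, paragraphs, section_titles=None, top_k=1):
    if section_titles is not None:
        # Simple decontextualization: prepend each paragraph's section title.
        paragraphs = [f"{t} {p}" for t, p in zip(section_titles, paragraphs)]
    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: score(question, paragraphs[i]),
                    reverse=True)
    return ranked[:top_k]
```

Even this trivial title-prefixing illustrates how decontextualization injects missing context into otherwise ambiguous paragraphs before scoring.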
Updated: 2025-02-19 12:24:46
Domains: cs.CL,cs.AI,cs.IR
Forward-Forward Learning achieves Highly Selective Latent Representations for Out-of-Distribution Detection in Fully Spiking Neural Networks
In recent years, Artificial Intelligence (AI) models have achieved remarkable success across various domains, yet challenges persist in two critical areas: ensuring robustness against uncertain inputs and drastically increasing model efficiency during training and inference. Spiking Neural Networks (SNNs), inspired by biological systems, offer a promising avenue for overcoming these limitations. By operating in an event-driven manner, SNNs achieve low energy consumption and can naturally implement biological methods known for their high noise tolerance. In this work, we explore the potential of the spiking Forward-Forward Algorithm (FFA) to address these challenges, leveraging its representational properties for both Out-of-Distribution (OoD) detection and interpretability. To achieve this, we exploit the sparse and highly specialized neural latent space of FF networks to estimate the likelihood of a sample belonging to the training distribution. Additionally, we propose a novel, gradient-free attribution method to detect features that drive a sample away from class distributions, addressing the challenges posed by the lack of gradients in most visual interpretability methods for spiking models. We evaluate our OoD detection algorithm on well-known image datasets (e.g., Omniglot, Not-MNIST, CIFAR10), outperforming previous methods proposed in the recent literature for OoD detection in spiking networks. Furthermore, our attribution method precisely identifies salient OoD features, such as artifacts or missing regions, hence providing a visual explanatory interface for the user to understand why unknown inputs are identified as such by the proposed method.
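A minimal caricature of latent-space OoD scoring of this kind (not the paper's likelihood estimator over the FF network's sparse latent space): keep one latent prototype per class from training data and flag inputs whose distance to the nearest prototype is large.

```python
# Toy prototype-based OoD detector over latent vectors (plain lists).

def prototype(latents):
    """Mean latent vector of one class's training samples."""
    n, dim = len(latents), len(latents[0])
    return [sum(v[i] for v in latents) / n for i in range(dim)]

def ood_score(x, prototypes):
    """Distance to the nearest class prototype; higher = more OoD."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(dist2(x, p) for p in prototypes)

def is_ood(x, prototypes, threshold):
    return ood_score(x, prototypes) > threshold
```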
Updated: 2025-02-19 12:14:17
Domains: cs.LG,cs.AI
Towards Invariance to Node Identifiers in Graph Neural Networks
Message-Passing Graph Neural Networks (GNNs) are known to have limited expressive power, due to their message passing structure. One mechanism for circumventing this limitation is to add unique node identifiers (IDs), which break the symmetries that underlie the expressivity limitation. In this work, we highlight a key limitation of the ID framework, and propose an approach for addressing it. We begin by observing that the final output of the GNN should clearly not depend on the specific IDs used. We then show that in practice this does not hold, and thus the learned network does not possess this desired structural property. Such invariance to node IDs may be enforced in several ways, and we discuss their theoretical properties. We then propose a novel regularization method that effectively enforces ID invariance to the network. Extensive evaluations on both real-world and synthetic tasks demonstrate that our approach significantly improves ID invariance and, in turn, often boosts generalization performance.
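One way to make the desired structural property concrete is to measure how much a model's output varies as node IDs are resampled; a regularizer then penalizes that variance. The sketch below is a hypothetical illustration with a generic scalar-output `model` callable, not the paper's regularization method.

```python
# Sketch: variance of a model's scalar output across random node-ID
# assignments. An ID-invariant model scores exactly zero.
import random

def id_invariance_penalty(model, graph, num_resamples=4, seed=0):
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_resamples):
        ids = list(range(len(graph)))
        rng.shuffle(ids)  # a fresh random ID assignment
        outputs.append(model(graph, ids))
    mean = sum(outputs) / len(outputs)
    return sum((o - mean) ** 2 for o in outputs) / len(outputs)
```

In training, this penalty would be added to the task loss so the network learns to ignore the specific IDs while still benefiting from the symmetry breaking they provide.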
Updated: 2025-02-19 12:12:57
Domains: cs.LG
Black Sheep in the Herd: Playing with Spuriously Correlated Attributes for Vision-Language Recognition
Few-shot adaptation for Vision-Language Models (VLMs) presents a dilemma: balancing in-distribution accuracy with out-of-distribution generalization. Recent research has utilized low-level concepts such as visual attributes to enhance generalization. However, this study reveals that VLMs overly rely on a small subset of attributes in decision-making, which co-occur with the category but are not inherently part of it, termed spuriously correlated attributes. This biased nature of VLMs results in poor generalization. To address this, 1) we first propose Spurious Attribute Probing (SAP), identifying and filtering out these problematic attributes to significantly enhance the generalization of existing attribute-based methods; 2) We introduce Spurious Attribute Shielding (SAS), a plug-and-play module that mitigates the influence of these attributes on prediction, seamlessly integrating into various Parameter-Efficient Fine-Tuning (PEFT) methods. In experiments, SAP and SAS significantly enhance accuracy on distribution shifts across 11 datasets and 3 generalization tasks without compromising downstream performance, establishing a new state-of-the-art benchmark.
Updated: 2025-02-19 12:05:33
Domains: cs.LG,eess.IV
A Query-Driven Approach to Space-Efficient Range Searching
We initiate a study of a query-driven approach to designing partition trees for range-searching problems. Our model assumes that a data structure is to be built for an unknown query distribution that we can access through a sampling oracle, and must be selected such that it optimizes a meaningful performance parameter on expectation. Our first contribution is to show that a near-linear sample of queries allows the construction of a partition tree with a near-optimal expected number of nodes visited during querying. We enhance this approach by treating node processing as a classification problem, leveraging fast classifiers like shallow neural networks to obtain experimentally efficient query times. Our second contribution is to develop partition trees using sparse geometric separators. Our preprocessing algorithm, based on a sample of queries, builds a balanced tree with nodes associated with separators that minimize query stabs on expectation; this yields both fast processing of each node and a small number of visited nodes, significantly reducing query time.
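The performance parameter being optimized can be made concrete with a plain balanced 1-D median-split tree (not the separator-based construction): estimate the expected number of tree nodes visited by a range query from a sample of queries drawn through the oracle.

```python
# Toy 1-D partition tree plus the expected-visited-nodes estimate.

def build(points):
    """Balanced median-split tree over sorted points."""
    pts = sorted(points)
    if len(pts) <= 1:
        return {"pts": pts}
    mid = len(pts) // 2
    return {"split": pts[mid],
            "left": build(pts[:mid]), "right": build(pts[mid:])}

def visited(node, lo, hi):
    """Number of nodes a range query [lo, hi] touches."""
    if "pts" in node:
        return 1
    n = 1
    if lo < node["split"]:
        n += visited(node["left"], lo, hi)
    if hi >= node["split"]:
        n += visited(node["right"], lo, hi)
    return n

def expected_visits(tree, query_sample):
    """Monte-Carlo estimate of the expected query cost."""
    return sum(visited(tree, lo, hi) for lo, hi in query_sample) / len(query_sample)
```

A query-driven construction would compare candidate trees by this estimate and keep the one with the lowest expected cost on the sampled distribution.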
Updated: 2025-02-19 12:01:00
Domains: cs.DS,cs.CG,cs.LG
MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures
The remarkable performance of large language models (LLMs) in various language tasks has attracted considerable attention. However, the ever-increasing size of these models presents growing challenges for deployment and inference. Structured pruning, an effective model compression technique, is gaining increasing attention due to its ability to enhance inference efficiency. Nevertheless, most previous optimization-based structured pruning methods sacrifice the uniform structure across layers for greater flexibility to maintain performance. The heterogeneous structure hinders the effective utilization of off-the-shelf inference acceleration techniques and impedes efficient configuration for continued training. To address this issue, we propose a novel masking learning paradigm based on minimax optimization to obtain the uniform pruned structure by optimizing the masks under sparsity regularization. Extensive experimental results demonstrate that our method can maintain high performance while ensuring the uniformity of the pruned model structure, thereby outperforming existing SOTA methods.
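The layer-uniform structure that distinguishes this setting can be sketched directly: every layer keeps the same number of units under a shared keep ratio. The actual method learns such masks via minimax optimization under sparsity regularization; the score-based selection below only illustrates the uniform mask structure it targets.

```python
# Sketch of layer-uniform pruning masks: every layer keeps the same
# number of units, chosen by a shared keep ratio.

def uniform_prune_masks(importance_per_layer, keep_ratio):
    masks = []
    for scores in importance_per_layer:
        k = max(1, int(len(scores) * keep_ratio))  # identical k per layer
        keep = sorted(range(len(scores)),
                      key=lambda i: scores[i], reverse=True)[:k]
        masks.append([1 if i in keep else 0 for i in range(len(scores))])
    return masks
```

Because every layer prunes to the same width, the resulting model slots directly into off-the-shelf inference acceleration, unlike heterogeneous per-layer structures.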
Updated: 2025-02-19 11:57:31
Domains: cs.CL,cs.AI,cs.LG
Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models
When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing and risk-seeking with future returns from trust, and contrast it with DeepSeek's more sophisticated and profitable trusting behavior that stems from an ability to incorporate deeper concepts like forward planning and theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our results highlight the perils of relying on LLM performance benchmarks that are too narrowly defined and suggest that careful analysis of their hidden fault lines should be part of any organization's AI strategy.
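The underlying game is the standard investment/trust game from behavioral economics: the investor sends part of an endowment, the amount is multiplied (commonly threefold) in transit, and the trustee decides how much to return. The payoff arithmetic below follows that standard structure; the paper's exact parameters and framing may differ.

```python
# Payoffs in the standard trust (investment) game.

def trust_game_payoffs(endowment, sent, multiplier, returned):
    assert 0 <= sent <= endowment, "investor can only send what they have"
    pot = sent * multiplier                 # amount received by the trustee
    assert 0 <= returned <= pot, "trustee can only return from the pot"
    investor = endowment - sent + returned  # keeps the rest plus the return
    trustee = pot - returned
    return investor, trustee
```

Trusting behavior is read off from how much the investor sends; a purely risk-averse LLM sends nothing, while one reasoning about future returns from trust sends more.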
Updated: 2025-02-19 11:57:19
Domains: cs.CL,cs.AI
C2T: A Classifier-Based Tree Construction Method in Speculative Decoding
The growing scale of Large Language Models (LLMs) has exacerbated inference latency and computational costs. Speculative decoding methods, which aim to mitigate these issues, often face inefficiencies in the construction of token trees and the verification of candidate tokens. Existing strategies, including chain mode, static tree, and dynamic tree approaches, have limitations in accurately preparing candidate token trees for verification. We propose a novel method named C2T that adopts a lightweight classifier to generate and prune token trees dynamically. Our classifier considers additional feature variables beyond the commonly used joint probability to predict the confidence score for each draft token to determine whether it is the candidate token for verification. This method outperforms state-of-the-art (SOTA) methods such as EAGLE-2 on multiple benchmarks, by reducing the total number of candidate tokens by 25% while maintaining or even improving the acceptance length.
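The gating idea can be sketched as follows: each draft node carries a feature vector (e.g. token probability, depth), a lightweight scorer produces a confidence, and a node survives only if it passes a threshold and its parent survived. The linear scorer and feature choices here are illustrative stand-ins, not C2T's trained classifier.

```python
# Sketch of classifier-gated pruning of a speculative-decoding draft tree.

def confidence(features, weights, bias=0.0):
    """Linear scorer standing in for the lightweight classifier."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def prune_tree(nodes, weights, threshold):
    """nodes: list of (node_id, parent_id, features), parents before
    children; the root has parent_id None. A node is kept only if its
    confidence passes the threshold AND its parent was kept."""
    kept = set()
    for node_id, parent_id, feats in nodes:
        if parent_id is not None and parent_id not in kept:
            continue  # whole subtree is pruned with its parent
        if confidence(feats, weights) >= threshold:
            kept.add(node_id)
    return kept
```

Raising the threshold shrinks the candidate set sent to verification, which is how such a gate trades draft-tree size against acceptance length.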
Updated: 2025-02-19 11:57:02
Domains: cs.CL,cs.AI
Piece of Table: A Divide-and-Conquer Approach for Selecting Subtables in Table Question Answering
Applying language models (LMs) to tables is challenging due to the inherent structural differences between two-dimensional tables and one-dimensional text for which the LMs were originally designed. Furthermore, when applying linearized tables to LMs, the maximum token lengths often imposed in self-attention calculations make it difficult to comprehensively understand the context spread across large tables. To address these challenges, we present PieTa (Piece of Table), a new framework for subtable-based question answering (QA). PieTa operates through an iterative process of dividing tables into smaller windows, using LMs to select relevant cells within each window, and merging these cells into a subtable. This multi-resolution approach captures dependencies across multiple rows and columns while avoiding the limitations caused by long context inputs. Instantiated as a simple iterative subtable union algorithm, PieTa demonstrates improved performance over previous subtable-based QA approaches.
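The window-then-merge step can be sketched with a plain predicate standing in for the LM's cell selection; the real framework applies this process iteratively at multiple resolutions over both rows and columns.

```python
# Sketch of one divide-select-merge pass over table rows.

def select_subtable(rows, window_size, is_relevant):
    """Split rows into fixed-size windows, keep the rows the selector
    marks as relevant in each window, and merge the picks."""
    subtable = []
    for start in range(0, len(rows), window_size):
        window = rows[start:start + window_size]
        subtable.extend(r for r in window if is_relevant(r))
    return subtable
```

Because each window fits comfortably in the model's context, no single call has to attend over the full table, which is the point of the divide-and-conquer design.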
Updated: 2025-02-19 11:56:57
Domains: cs.CL,cs.AI
d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining
Structural guidance in an image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from source to target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.
Updated: 2025-02-19 11:54:45
Domains: cs.GR,cs.AI,cs.CV,cs.MM,eess.IV
Infinite Width Limits of Self Supervised Neural Networks
The NTK is a widely used tool in the theoretical analysis of deep learning, allowing us to look at supervised deep neural networks through the lenses of kernel regression. Recently, several works have investigated kernel models for self-supervised learning, hypothesizing that these also shed light on the behavior of wide neural networks by virtue of the NTK. However, it remains an open question to what extent this connection is mathematically sound -- it is a commonly encountered misbelief that the kernel behavior of wide neural networks emerges irrespective of the loss function it is trained on. In this paper, we bridge the gap between the NTK and self-supervised learning, focusing on two-layer neural networks trained under the Barlow Twins loss. We prove that the NTK of Barlow Twins indeed becomes constant as the width of the network approaches infinity. Our analysis technique is a bit different from previous works on the NTK and may be of independent interest. Overall, our work provides a first justification for the use of classic kernel theory to understand self-supervised learning of wide neural networks. Building on this result, we derive generalization error bounds for kernelized Barlow Twins and connect them to neural networks of finite width.
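The central object can be illustrated with a toy empirical NTK for a two-layer scalar network f(x) = (1/sqrt(m)) * sum_j a_j * tanh(w_j * x): the kernel evaluated at two inputs is the inner product of the parameter gradients there. This generic setup is only for illustration; the paper's result concerns the constancy of this kernel under Barlow Twins training as m grows.

```python
# Empirical NTK of a toy two-layer network with m hidden units.
import math, random

def init(m, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(m)]  # input weights
    a = [rng.gauss(0, 1) for _ in range(m)]  # output weights
    return w, a

def grads(x, w, a):
    """Gradient of f(x) with respect to all 2m parameters."""
    m = len(w)
    s = 1 / math.sqrt(m)
    dw = [s * a[j] * (1 - math.tanh(w[j] * x) ** 2) * x for j in range(m)]
    da = [s * math.tanh(w[j] * x) for j in range(m)]
    return dw + da

def ntk(x1, x2, w, a):
    """K(x1, x2) = <grad f(x1), grad f(x2)> over all parameters."""
    g1, g2 = grads(x1, w, a), grads(x2, w, a)
    return sum(u * v for u, v in zip(g1, g2))
```

The infinite-width statement is that this kernel stops moving during training as m grows; at initialization it already concentrates around its deterministic limit.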
Updated: 2025-02-19 11:34:27
Domains: cs.LG
Integrating Inverse and Forward Modeling for Sparse Temporal Data from Sensor Networks
We present CavePerception, a framework for the analysis of sparse data from sensor networks that incorporates elements of inverse modeling and forward modeling. By integrating machine learning with physical modeling in a hypotheses space, we aim to improve the interpretability of sparse, noisy, and potentially incomplete sensor data. The framework assumes data from a two-dimensional sensor network laid out in a graph structure that detects certain objects, with certain motion patterns. Examples of such sensors are magnetometers. Given knowledge about the objects and the way they act on the sensors, one can develop a data generator that produces data from simulated motions of the objects across the sensor field. The framework uses the simulated data to infer object behaviors across the sensor network. The approach is experimentally tested on real-world data, where magnetometers are used on an airport to detect and identify aircraft motions. Experiments demonstrate the value of integrating inverse and forward modeling, enabling intelligent systems to better understand and predict complex, sensor-driven events.
Updated: 2025-02-19 11:24:51
Domains: cs.LG,cs.AI
Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization
The opaque nature of Large Language Models (LLMs) has led to significant research efforts aimed at enhancing their interpretability, primarily through post-hoc methods. More recent in-hoc approaches, such as Concept Bottleneck Models (CBMs), offer both interpretability and intervenability by incorporating explicit concept representations. However, these methods suffer from key limitations, including reliance on labeled concept datasets and significant architectural modifications that challenge re-integration into existing system pipelines. In this work, we introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model. Furthermore, we eliminate the need for a human-selected concept set by algorithmically searching an ontology for a set of concepts that can be either task-specific or task-agnostic. We evaluate CLs across multiple tasks, demonstrating that they maintain the original model's performance and agreement while enabling meaningful interventions. Additionally, we present a proof of concept showcasing an intervenability interface, allowing users to adjust model behavior dynamically, such as mitigating biases during inference.
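The projection-and-reconstruction step can be sketched in a few lines: a hidden vector is scored against a set of concept directions, reconstructed from those scores, and an intervention simply overwrites a score before reconstruction. The concept vectors here are illustrative, not the ontology-derived set the paper searches for.

```python
# Sketch of a concept layer: project, (optionally intervene,) reconstruct.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def concept_layer(h, concepts):
    """Project hidden vector h onto concept directions, then reconstruct."""
    scores = [dot(h, c) for c in concepts]  # interpretable concept scores
    recon = [sum(s * c[i] for s, c in zip(scores, concepts))
             for i in range(len(h))]
    return scores, recon

def intervene(scores, index, value):
    """Intervenability: overwrite one concept score before reconstruction."""
    out = list(scores)
    out[index] = value
    return out
```

With orthonormal concept vectors the reconstruction is the orthogonal projection of the hidden state onto the concept subspace, which is what makes an edited score propagate cleanly back into the model.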
Updated: 2025-02-19 11:10:19
Domains: cs.LG,cs.AI,cs.CL
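The project-then-reconstruct mechanism described in the Concept Layers abstract can be sketched in a few lines of NumPy. The concept matrix, dimensions, and least-squares reconstruction below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a hidden state of dimension d and a bank of k concept
# vectors (rows of C), standing in for the ontology-derived concept set and
# the host model's internal representation.
d, k = 8, 4
C = rng.normal(size=(k, d))   # concept directions
h = rng.normal(size=d)        # a model-internal vector

# Project the hidden state into the interpretable concept space ...
scores = C @ h                # one activation per concept

# ... then reconstruct an approximation of h from the concept activations
# (least-squares reconstruction via the pseudo-inverse) and feed it back.
h_recon = np.linalg.pinv(C) @ scores

# The reconstruction lies in the span of the concept rows, so information
# orthogonal to every concept is discarded -- that is the bottleneck, and
# editing `scores` before reconstruction is what makes interventions possible.
assert np.allclose(C @ h_recon, scores, atol=1e-8)
```

An intervention then amounts to, e.g., zeroing one entry of `scores` before reconstructing, which removes that concept's contribution from the vector fed back into the model.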
The Impact of Inference Acceleration on Bias of LLMs
The last few years have seen unprecedented advances in capabilities of Large Language Models (LLMs). These advancements promise to benefit a vast array of application domains. However, due to their immense size, performing inference with LLMs is both costly and slow. Consequently, a plethora of recent work has proposed strategies to enhance inference efficiency, e.g., quantization, pruning, and caching. These acceleration strategies reduce the inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations due to inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of outputs before and after inference acceleration shows significant change in bias. Worryingly, these bias effects are complex and unpredictable. A combination of an acceleration strategy and bias type may show little bias change in one model but may lead to a large effect in another. Our results highlight a need for in-depth and case-by-case evaluation of model bias after it has been modified to accelerate inference.
Updated: 2025-02-19 11:10:09
Domains: cs.CL,cs.AI,cs.LG
REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models
Hallucinations in large language model (LLM) outputs severely limit their reliability in knowledge-intensive tasks such as question answering. To address this challenge, we introduce REFIND (Retrieval-augmented Factuality hallucINation Detection), a novel framework that detects hallucinated spans within LLM outputs by directly leveraging retrieved documents. As part of REFIND, we propose the Context Sensitivity Ratio (CSR), a novel metric that quantifies the sensitivity of LLM outputs to retrieved evidence. This innovative approach enables REFIND to efficiently and accurately detect hallucinations, setting it apart from existing methods. In the evaluation, REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models, achieving superior IoU scores in identifying hallucinated spans. This work highlights the effectiveness of quantifying context sensitivity for hallucination detection, thereby paving the way for more reliable and trustworthy LLM applications across diverse languages.
Updated: 2025-02-19 10:59:05
Domains: cs.CL,cs.AI
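The idea of scoring each generated token's sensitivity to retrieved evidence can be illustrated as follows. The abstract does not give the CSR formula, so the probability ratio below (token likelihood with vs. without the retrieved documents in the prompt) is one plausible instantiation, and all names, tokens, and log-probabilities are invented:

```python
import math

def context_sensitivity_ratio(logp_with_ctx, logp_no_ctx):
    """Sensitivity of one generated token to retrieved evidence.

    Both arguments are log-probabilities of the same token, scored with and
    without the retrieved documents in the prompt. NOTE: an assumed formula,
    not necessarily the paper's exact CSR.
    """
    return math.exp(logp_with_ctx - logp_no_ctx)

def flag_hallucinated_spans(tokens, logps_with, logps_without, threshold=1.0):
    # Tokens whose probability does not rise when evidence is added are not
    # grounded in the retrieval -- flag them as candidate hallucinations.
    return [t for t, a, b in zip(tokens, logps_with, logps_without)
            if context_sensitivity_ratio(a, b) <= threshold]

# Invented example scores for a five-token answer.
tokens = ["Paris", "was", "founded", "in", "1800"]
logps_with = [-0.1, -0.2, -0.5, -0.3, -2.0]     # scored with retrieved docs
logps_without = [-1.5, -0.3, -0.6, -0.3, -1.9]  # scored without them
flags = flag_hallucinated_spans(tokens, logps_with, logps_without)  # ["in", "1800"]
```

Under this toy scoring, the unsupported year is flagged because the retrieved evidence does not make it any more likely, while the evidence-backed tokens pass.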
Decentralized Planning Using Probabilistic Hyperproperties
Multi-agent planning under stochastic dynamics is usually formalised using decentralized (partially observable) Markov decision processes (Dec-MDPs) and reachability or expected reward specifications. In this paper, we propose a different approach: we use an MDP describing how a single agent operates in an environment and probabilistic hyperproperties to capture desired temporal objectives for a set of decentralized agents operating in the environment. We extend existing approaches for model checking probabilistic hyperproperties to handle temporal formulae relating paths of different agents, thus requiring the self-composition between multiple MDPs. Using several case studies, we demonstrate that our approach provides a flexible and expressive framework to broaden the specification capabilities with respect to existing planning techniques. Additionally, we establish a close connection between a subclass of probabilistic hyperproperties and planning for a particular type of Dec-MDPs, for both of which we show undecidability. This lays the ground for the use of existing decentralized planning tools in the field of probabilistic hyperproperty verification.
Updated: 2025-02-19 10:59:02
Domains: cs.LO,cs.AI
Complex Ontology Matching with Large Language Model Embeddings
Ontology, and more broadly, Knowledge Graph Matching is a challenging task in which expressiveness has not been fully addressed. Despite the increasing use of embeddings and language models for this task, approaches for generating expressive correspondences still do not take full advantage of these models, in particular, large language models (LLMs). This paper proposes to integrate LLMs into an approach for generating expressive correspondences based on alignment need and ABox-based relation discovery. The generation of correspondences is performed by matching similar surroundings of instance sub-graphs. The integration of LLMs results in different architectural modifications, including label similarity, sub-graph matching, and entity matching. The performance of word embeddings, sentence embeddings, and LLM-based embeddings was compared. The results demonstrate that integrating LLMs surpasses all other models, enhancing the baseline version of the approach with a 45% increase in F-measure.
Updated: 2025-02-19 10:56:27
Domains: cs.CL,cs.AI
Navigating Demand Uncertainty in Container Shipping: Deep Reinforcement Learning for Enabling Adaptive and Feasible Master Stowage Planning
Reinforcement learning (RL) has shown promise in solving various combinatorial optimization problems. However, conventional RL faces challenges when dealing with real-world constraints, especially when action space feasibility is explicit and dependent on the corresponding state or trajectory. In this work, we focus on using RL in container shipping, often considered the cornerstone of global trade, by dealing with the critical challenge of master stowage planning. The main objective is to maximize cargo revenue and minimize operational costs while navigating demand uncertainty and various complex operational constraints, namely vessel capacity and stability, which must be dynamically updated along the vessel's voyage. To address this problem, we implement a deep reinforcement learning framework with feasibility projection to solve the master stowage planning problem (MPP) under demand uncertainty. The experimental results show that our architecture efficiently finds adaptive, feasible solutions for this multi-stage stochastic optimization problem, outperforming traditional mixed-integer programming and RL with feasibility regularization. Our AI-driven decision-support policy enables adaptive and feasible planning under uncertainty, optimizing operational efficiency and capacity utilization while contributing to sustainable and resilient global supply chains.
Updated: 2025-02-19 10:49:48
Domains: cs.LG,math.OC
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, do not scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose GraphRAG, a graph-based approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text. Our approach uses an LLM to build a graph index in two stages: first, to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that GraphRAG leads to substantial improvements over a conventional RAG baseline for both the comprehensiveness and diversity of generated answers.
Updated: 2025-02-19 10:49:41
Domains: cs.CL,cs.AI,cs.IR,H.3.3; I.2.7
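The two-stage map-reduce answering scheme in the GraphRAG abstract (partial responses per community summary, then a final reduction) can be sketched with stub functions standing in for the LLM calls. The summaries, function names, and string formats are invented:

```python
# Hypothetical community summaries, standing in for the LLM-pregenerated
# summaries of closely related entity groups in the graph index.
community_summaries = {
    "shipping": "Entities about ports, vessels, and cargo routing.",
    "finance": "Entities about pricing, contracts, and insurance.",
}

def answer_with_partial(question, summary):
    # Stand-in for an LLM call that answers from one community summary.
    return f"From [{summary}]: partial answer to '{question}'"

def reduce_answers(question, partials):
    # Stand-in for the final LLM call that merges partial responses.
    return " | ".join(partials)

def graph_rag_answer(question):
    # Map step: one partial response per community summary ...
    partials = [answer_with_partial(question, s)
                for s in community_summaries.values()]
    # ... reduce step: summarize the partials into the final user response.
    return reduce_answers(question, partials)

final = graph_rag_answer("What are the main themes?")
```

The point of the structure is that a global question touches every community summary rather than a top-k retrieval, which is what lets it handle corpus-wide sensemaking queries.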
LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines. Our code is available at https://github.com/RUCAIBox/LongReD.
Updated: 2025-02-19 10:49:24
Domains: cs.CL,cs.LG
FuzzRisk: Online Collision Risk Estimation for Autonomous Vehicles based on Depth-Aware Object Detection via Fuzzy Inference
This paper presents a novel monitoring framework that infers the level of collision risk for autonomous vehicles (AVs) based on their object detection performance. The framework takes two sets of predictions from different algorithms and associates their inconsistencies with the collision risk via fuzzy inference. The first set of predictions is obtained by retrieving safety-critical 2.5D objects from a depth map, and the second set comes from the ordinary AV's 3D object detector. We experimentally validate that, based on Intersection-over-Union (IoU) and a depth discrepancy measure, the inconsistencies between the two sets of predictions strongly correlate to the error of the 3D object detector against ground truths. This correlation allows us to construct a fuzzy inference system and map the inconsistency measures to an AV collision risk indicator. In particular, we optimize the fuzzy inference system towards an existing offline metric that matches AV collision rates well. Lastly, we validate our monitor's capability to produce relevant risk estimates with the large-scale nuScenes dataset and demonstrate that it can safeguard an AV in closed-loop simulations.
Updated: 2025-02-19 10:49:11
Domains: cs.RO,cs.AI,cs.CV
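Mapping the two inconsistency measures (IoU and depth discrepancy) to a risk level via fuzzy inference can be sketched with triangular membership functions. The memberships, rule, and thresholds below are illustrative, not the paper's tuned fuzzy inference system:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def collision_risk(iou, depth_gap):
    """Map detector inconsistency to a risk score in [0, 1].

    iou: agreement between the depth-derived 2.5D objects and the 3D
    detector's boxes; depth_gap: normalized depth discrepancy.
    Invented memberships and rule, for illustration only.
    """
    low_iou = tri(iou, -0.01, 0.0, 0.5)        # poor 2.5D/3D agreement
    big_gap = tri(depth_gap, 0.0, 1.0, 1.01)   # large depth discrepancy
    # Rule: risk is high when agreement is low OR the depth gap is large
    # (max acts as the fuzzy OR; a real system would also defuzzify).
    return max(low_iou, big_gap)

# Consistent detections imply low risk; inconsistent ones imply high risk.
assert collision_risk(iou=0.9, depth_gap=0.05) < collision_risk(iou=0.1, depth_gap=0.8)
```

In the paper the rule base is optimized against an offline metric correlated with AV collision rates; here the single rule just shows the shape of the inference.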
MAAT: Mamba Adaptive Anomaly Transformer with association discrepancy for time series
Anomaly detection in time series is essential for industrial monitoring and environmental sensing, yet distinguishing anomalies from complex patterns remains challenging. Existing methods like the Anomaly Transformer and DCdetector have progressed, but they face limitations such as sensitivity to short-term contexts and inefficiency in noisy, non-stationary environments. To overcome these issues, we introduce MAAT, an improved architecture that enhances association discrepancy modeling and reconstruction quality. MAAT features Sparse Attention, efficiently capturing long-range dependencies by focusing on relevant time steps, thereby reducing computational redundancy. Additionally, a Mamba-Selective State Space Model is incorporated into the reconstruction module, utilizing a skip connection and Gated Attention to improve anomaly localization and detection performance. Extensive experiments show that MAAT significantly outperforms previous methods, achieving better anomaly distinguishability and generalization across various time series applications, setting a new standard for unsupervised time series anomaly detection in real-world scenarios.
Updated: 2025-02-19 10:48:05
Domains: cs.LG
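The Sparse Attention component described above — attending only to the most relevant time steps to cut redundancy — can be sketched with a top-k mask over the attention logits. This is a generic top-k sparse attention in NumPy, assumed for illustration rather than MAAT's exact formulation:

```python
import numpy as np

def sparse_attention(q, k_mat, v, top_k=2):
    """Attend only to the top_k highest-scoring time steps per query.

    Scores for all but the strongest keys are masked out before the
    softmax, so each output mixes only a few time steps.
    """
    scores = q @ k_mat.T / np.sqrt(q.shape[-1])     # (T, T) attention logits
    # Keep only the top_k scores per row; push the rest to -inf.
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax over the surviving entries.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(1)
T, d = 6, 4
x = rng.normal(size=(T, d))                 # a toy time-series embedding
out, w = sparse_attention(x, x, x, top_k=2) # each row attends to 2 steps
```

Each attention row ends up with exactly `top_k` nonzero weights, which is the computational-redundancy reduction the abstract refers to.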
Accelerating Diffusion Transformers with Token-wise Feature Caching
Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10$\times$ more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and PixArt-$\alpha$ with almost no drop in generation quality.
Updated: 2025-02-19 10:39:58
Domains: cs.LG,cs.AI,cs.CV
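The token-wise selection at the heart of the caching scheme — recompute only the tokens most sensitive to caching, reuse cached features for the rest — can be sketched as below. The per-token L2-drift score and the recompute ratio are assumptions for illustration; the paper's selection criterion may differ:

```python
import numpy as np

def select_tokens_to_recompute(prev_feats, curr_estimate, ratio=0.3):
    """Pick the tokens whose features drifted most since the cached timestep.

    prev_feats: cached per-token features from an earlier diffusion step.
    curr_estimate: a cheap estimate of the current features. Tokens with the
    largest drift are recomputed; all others reuse the cache.
    """
    drift = np.linalg.norm(curr_estimate - prev_feats, axis=-1)  # per token
    n_recompute = max(1, int(ratio * len(drift)))
    return np.argsort(drift)[-n_recompute:]                      # most changed

rng = np.random.default_rng(2)
cached = rng.normal(size=(10, 16))   # features cached at the previous timestep
estimate = cached.copy()
estimate[3] += 5.0                   # token 3 changed a lot between steps
idx = select_tokens_to_recompute(cached, estimate, ratio=0.1)    # -> token 3
```

Only the flagged tokens go back through the transformer block; the rest keep their cached features, which is where the reported 1.9-2.4x speedups come from.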
LaVCa: LLM-assisted Visual Cortex Captioning
Understanding the property of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa for image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. Furthermore, the captions generated by LaVCa quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. Furthermore, a more detailed analysis of the voxel-specific properties generated by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex and voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations. Please check out our webpage at https://sites.google.com/view/lavca-llm/
Updated: 2025-02-19 10:37:04
Domains: q-bio.NC,cs.AI,cs.CL,cs.CV,cs.LG
Efficient Safety Retrofitting Against Jailbreaking for LLMs
Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general purpose tasks, and their tendency toward over-refusal. Following the proposed methodology, trained models reduce their Attack Success Rate by 10%-30%, using small training efforts (2,000 samples) with low computational cost ($3 for 8B models, $20 for 72B models). Safety-aligned models generalize to unseen topics and attack styles, with the most successful attack style reaching a success rate around 5%. Size and family are found to strongly influence model malleability towards safety, pointing at the importance of pre-training choices. To validate our findings, a large independent assessment of human preference agreement with Llama-Guard-3-8B is conducted by the authors and the associated dataset Egida-HSafe is released. Overall, this study illustrates how affordable and accessible it is to enhance LLM safety using DPO while outlining its current limitations. All datasets and models are released to enable reproducibility and further research.
Updated: 2025-02-19 10:33:18
Domains: cs.CL,cs.AI,cs.LG
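The DPO objective the abstract builds on has a compact closed form: maximize the policy's implicit reward margin between the chosen and rejected responses relative to a frozen reference model. A minimal per-pair version (the log-probabilities and beta value are invented inputs):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    margin = (log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l));
    loss   = -log sigmoid(beta * margin).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# When the policy already prefers the chosen (safe) response more strongly
# than the reference does, the loss is small; when preferences invert, it grows.
easy = dpo_loss(-1.0, -5.0, -2.0, -2.0)   # margin +4
hard = dpo_loss(-5.0, -1.0, -2.0, -2.0)   # margin -4
assert easy < hard
```

With safety preference pairs such as Egida's (safe completion chosen, jailbroken completion rejected), minimizing this loss is what drives the reported drop in attack success rate.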
Abstraction requires breadth: a renormalisation group approach
Abstraction is the process of extracting the essential features from raw data while ignoring irrelevant details. This is similar to the process of focusing on large-scale properties, systematically removing irrelevant small-scale details, implemented in the renormalisation group of statistical physics. This analogy is suggestive because the fixed points of the renormalisation group offer an ideal candidate of a truly abstract -- i.e. data independent -- representation. It has been observed that abstraction emerges with depth in neural networks. Deep layers of neural network capture abstract characteristics of data, such as "cat-ness" or "dog-ness" in images, by combining the lower level features encoded in shallow layers (e.g. edges). Yet we argue that depth alone is not enough to develop truly abstract representations. We advocate that the level of abstraction crucially depends on how broad the training set is. We address the issue within a renormalisation group approach where a representation is expanded to encompass a broader set of data. We take the unique fixed point of this transformation -- the Hierarchical Feature Model -- as a candidate for an abstract representation. This theoretical picture is tested in numerical experiments based on Deep Belief Networks trained on data of different breadth. These show that representations in deep layers of neural networks approach the Hierarchical Feature Model as the data gets broader, in agreement with theoretical predictions.
Updated: 2025-02-19 10:27:03
Domains: cs.LG,cond-mat.dis-nn,physics.data-an,stat.ML
Finding Optimal Trading History in Reinforcement Learning for Stock Market Trading
This paper investigates the optimization of temporal windows in Financial Deep Reinforcement Learning (DRL) models using 2D Convolutional Neural Networks (CNNs). We introduce a novel approach to treating the temporal field as a hyperparameter and examine its impact on model performance across various datasets and feature arrangements. We introduce a new hyperparameter for the CNN policy, proposing that this temporal field can and should be treated as a hyperparameter for these models. We examine the significance of this temporal field by iteratively expanding the window of observations presented to the CNN policy during the deep reinforcement learning process. Our iterative process involves progressively increasing the observation period from two weeks to twelve weeks, allowing us to examine the effects of different temporal windows on the model's performance. This window expansion is implemented in two settings. In one setting, we rearrange the features in the dataset to group them by company, allowing the model to have a full view of company data in its observation window and CNN kernel. In the second setting, we do not group the features by company, and features are arranged by category. Our study reveals that shorter temporal windows are most effective when no feature rearrangement to group per company is in effect. However, the model will utilize longer temporal windows and yield better performance once we introduce the feature rearrangement. To examine the consistency of our findings, we repeated our experiment on two datasets containing the same thirty companies from the Dow Jones Index but with different features in each dataset and consistently observed the above-mentioned patterns. The result is a trading model significantly outperforming global financial services firms such as the Global X Guru by the established Mirae Asset.
Updated: 2025-02-19 10:24:59
Domains: cs.LG,cs.AI
The Energy Cost of Artificial Intelligence of Things Lifecycle
Artificial Intelligence (AI) coupled with the existing Internet of Things (IoT) enables more autonomous operations across various economic sectors. While this paradigm shift results in increased energy consumption, it is difficult to quantify the end-to-end energy consumption of such systems with the conventional metrics as they either focus on the communication, the computation infrastructure, or model development. To address this, we propose a new metric, the Energy Cost of AI lifecycle (eCAL). eCAL captures the energy consumption throughout the architectural components and lifecycle of an AI-powered wireless system by analyzing the complexity of data collection and manipulation in individual components and deriving overall and per-bit energy consumption. We show that the better a model and the more it is used, the more energy-efficient an inference is. For an example Artificial Intelligence of Things (AIoT) configuration, eCAL for making 100 inferences is 2.73 times higher than for 1000 inferences. Additionally, we developed a modular open source simulation tool to enable researchers, practitioners, and engineers to calculate the end-to-end energy cost with various configurations and across various systems, ensuring adaptability to diverse use cases.
Updated: 2025-02-19 10:23:57
Domains: cs.ET,cs.AI,cs.LG
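Why per-inference energy falls as a model is used more can be shown with a simple amortization sketch. This is illustrative accounting only: the paper's eCAL breaks lifecycle energy down across data collection, communication, and model development, whereas here those one-off costs are lumped into a single fixed term, and all numbers are made up:

```python
def ecal_per_inference(e_fixed_joules, e_inference_joules, n_inferences):
    """Amortized energy cost per inference over an AIoT system's lifecycle.

    e_fixed_joules: one-off lifecycle energy (data collection, training, ...),
    e_inference_joules: marginal energy per inference.
    """
    total = e_fixed_joules + n_inferences * e_inference_joules
    return total / n_inferences

# With a large fixed cost, serving 100 inferences costs far more per
# inference than serving 1000 -- the more a model is used, the more
# energy-efficient each inference becomes.
few = ecal_per_inference(1_000_000.0, 50.0, 100)    # 10050 J / inference
many = ecal_per_inference(1_000_000.0, 50.0, 1000)  # 1050 J / inference
```

The 2.73x figure in the abstract reflects the same mechanism under the paper's full cost model rather than this toy two-term one.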
MMTEB: Massive Multilingual Text Embedding Benchmark
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
Updated: 2025-02-19 10:13:43
Domains: cs.CL,cs.AI,cs.IR
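The correlation-based downsampling idea — drop tasks that are redundant with tasks already kept, so the smaller benchmark still ranks models similarly — can be sketched as a greedy selection. This is an assumed procedure illustrating the principle; MMTEB's actual method may differ, and the score matrix is synthetic:

```python
import numpy as np

def downsample_tasks(score_matrix, n_keep):
    """Greedy task selection driven by inter-task correlation.

    score_matrix: (n_models, n_tasks) benchmark scores. Greedily keep tasks
    least correlated with those already kept, so the reduced benchmark
    stays diverse while preserving relative model behavior.
    """
    corr = np.abs(np.corrcoef(score_matrix.T))   # (n_tasks, n_tasks)
    kept = [0]                                   # seed with the first task
    while len(kept) < n_keep:
        # A candidate's redundancy is its max correlation with any kept
        # task; keep the least redundant candidate.
        candidates = [t for t in range(corr.shape[0]) if t not in kept]
        redundancy = [corr[t, kept].max() for t in candidates]
        kept.append(candidates[int(np.argmin(redundancy))])
    return sorted(kept)

rng = np.random.default_rng(3)
base = rng.normal(size=(20, 1))
# Tasks 0 and 1 are near-duplicates; task 2 is independent noise.
scores = np.hstack([base, base + 0.01 * rng.normal(size=(20, 1)),
                    rng.normal(size=(20, 1))])
kept = downsample_tasks(scores, n_keep=2)   # drops the duplicate task 1
```

Dropping the near-duplicate task loses almost no ranking information, which is how the zero-shot English benchmark keeps the full-scale ordering at a fraction of the compute.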
Toward Robust Non-Transferable Learning: A Survey and Benchmark
Over the past decades, researchers have primarily focused on improving the generalization abilities of models, with limited attention given to regulating such generalization. However, the ability of models to generalize to unintended data (e.g., harmful or unauthorized data) can be exploited by malicious adversaries in unforeseen ways, potentially resulting in violations of model ethics. Non-transferable learning (NTL), a task aimed at reshaping the generalization abilities of deep learning models, was proposed to address these challenges. While numerous methods have been proposed in this field, a comprehensive review of existing progress and a thorough analysis of current limitations remain lacking. In this paper, we bridge this gap by presenting the first comprehensive survey on NTL and introducing NTLBench, the first benchmark to evaluate NTL performance and robustness within a unified framework. Specifically, we first introduce the task settings, general framework, and criteria of NTL, followed by a summary of NTL approaches. Furthermore, we emphasize the often-overlooked issue of robustness against various attacks that can destroy the non-transferable mechanism established by NTL. Experiments conducted via NTLBench verify the limitations of existing NTL methods in robustness. Finally, we discuss the practical applications of NTL, along with its future directions and associated challenges.
Updated: 2025-02-19 10:12:19
Domain: cs.LG,cs.CR,cs.CV
Smaller But Better: Unifying Layout Generation with Smaller Large Language Models
We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation hitherto unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating the less informative placeholders, facilitating LGGPT to capture complex and variable layout generation conditions during the unified training process. Experimental results demonstrate that LGGPT achieves superior or on par performance compared to existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B parameter LLM, which beats prior 7B or 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and suggest that 1.5B could be an optimal parameter size by comparing LLMs of varying scales. Code is available at https://github.com/NiceRingNode/LGGPT.
Updated: 2025-02-19 10:06:42
Domain: cs.LG
Rethinking Self-Distillation: Label Averaging and Enhanced Soft Label Refinement with Partial Labels
We investigate the mechanisms of self-distillation in multi-class classification, particularly in the context of linear probing with fixed feature extractors where traditional feature learning explanations do not apply. Our theoretical analysis reveals that multi-round self-distillation effectively performs label averaging among instances with high feature correlations, governed by the eigenvectors of the Gram matrix derived from input features. This process leads to clustered predictions and improved generalization, mitigating the impact of label noise by reducing the model's reliance on potentially corrupted labels. We establish conditions under which multi-round self-distillation achieves 100% population accuracy despite label noise. Furthermore, we introduce a novel, efficient single-round self-distillation method using refined partial labels from the teacher's top two softmax outputs, referred to as the PLL student model. This approach replicates the benefits of multi-round distillation in a single round, achieving comparable or superior performance--especially in high-noise scenarios--while significantly reducing computational cost.
Updated: 2025-02-19 10:04:02
Domain: cs.LG,stat.ML
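The label-averaging view above lends itself to a tiny numerical sketch. Assuming a fixed feature matrix `X` and soft labels `Y` (both hypothetical toy data, not the paper's setup), each distillation round can be modeled as multiplying the labels by a row-normalized feature Gram matrix, which pulls a mislabeled sample toward the labels of its feature-correlated neighbors:

```python
import numpy as np

def self_distill_labels(X, Y, rounds=3):
    """Toy illustration of the label-averaging view of self-distillation:
    each round replaces a sample's soft label with a Gram-weighted average
    of the labels of feature-correlated samples."""
    K = X @ X.T                            # feature Gram matrix
    K = np.clip(K, 0.0, None)              # keep non-negative affinities
    W = K / K.sum(axis=1, keepdims=True)   # row-stochastic averaging weights
    for _ in range(rounds):
        Y = W @ Y                          # one distillation round = one averaging step
    return Y

# Two separable feature clusters; the third sample carries a flipped label.
X = np.array([[1., 0.], [1., 0.], [1., 0.], [0., 1.], [0., 1.], [0., 1.]])
Y = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.], [0., 1.], [0., 1.]])
refined = self_distill_labels(X, Y)
```

In this toy example a single round of averaging already corrects the flipped label's argmax, mirroring the paper's point that label averaging among highly correlated instances mitigates label noise.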
Inter3D: A Benchmark and Strong Baseline for Human-Interactive 3D Object Reconstruction
Recent advancements in implicit 3D reconstruction methods, e.g., neural radiance fields and Gaussian splatting, have primarily focused on novel view synthesis of static or dynamic objects with continuous motion states. However, these approaches struggle to efficiently model a human-interactive object with n movable parts, requiring 2^n separate models to represent all discrete states. To overcome this limitation, we propose Inter3D, a new benchmark and approach for novel state synthesis of human-interactive objects. We introduce a self-collected dataset featuring commonly encountered interactive objects and a new evaluation pipeline, where only individual part states are observed during training, while part combination states remain unseen. We also propose a strong baseline approach that leverages Space Discrepancy Tensors to efficiently model all states of an object. To alleviate the impractical constraints on camera trajectories across training states, we propose a Mutual State Regularization mechanism to enhance the spatial density consistency of movable parts. In addition, we explore two occupancy grid sampling strategies to facilitate training efficiency. We conduct extensive experiments on the proposed benchmark, showcasing the challenges of the task and the superiority of our approach.
Updated: 2025-02-19 10:00:00
Domain: cs.GR,cs.LG
Conditional sampling within generative diffusion models
Generative diffusions are a powerful class of Monte Carlo samplers that leverage bridging Markov processes to approximate complex, high-dimensional distributions, such as those found in image processing and language models. Despite their success in these domains, an important open challenge remains: extending these techniques to sample from conditional distributions, as required in, for example, Bayesian inverse problems. In this paper, we present a comprehensive review of existing computational approaches to conditional sampling within generative diffusion models. Specifically, we highlight key methodologies that either utilise the joint distribution, or rely on (pre-trained) marginal distributions with explicit likelihoods, to construct conditional generative samplers.
Updated: 2025-02-19 09:55:45
Domain: stat.ML,cs.LG
Multi-Target Radar Search and Track Using Sequence-Capable Deep Reinforcement Learning
The research addresses sensor task management for radar systems, focusing on efficiently searching and tracking multiple targets using reinforcement learning. The approach develops a 3D simulation environment with an active electronically scanned array radar, using a multi-target tracking algorithm to improve observation data quality. Three neural network architectures were compared, including an approach using gated recurrent units with multi-headed self-attention. Two pre-training techniques were applied: behavior cloning to approximate a random search strategy and an auto-encoder to pre-train the feature extractor. Experimental results revealed that search performance was relatively consistent across most methods. The real challenge emerged in simultaneously searching and tracking targets. The multi-headed self-attention architecture demonstrated the most promising results, highlighting the potential of sequence-capable architectures in handling dynamic tracking scenarios. The key contribution lies in demonstrating how reinforcement learning can optimize sensor management, potentially improving radar systems' ability to identify and track multiple targets in complex environments.
Updated: 2025-02-19 09:55:38
Domain: cs.LG,cs.SY,eess.SY
Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is an important part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate these conceptual semantics understandings into low-level motion control commands and achieve generalization and consensus in cross-scene driving. We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning MLLM framework. Sce2DriveX utilizes multimodal joint learning from local scene videos and global BEV maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its comprehensive perception and reasoning capabilities in 3D dynamic/static scenes and achieving driving generalization across scenes. Building on this, it reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control, thereby further bridging the gap between autonomous driving and human thought processes. To elevate model performance, we have developed the first extensive Visual Question Answering (VQA) driving instruction dataset tailored for 3D spatial understanding and long-axis task reasoning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.
Updated: 2025-02-19 09:50:44
Domain: cs.CV,cs.AI
Rectified Lagrangian for Out-of-Distribution Detection in Modern Hopfield Networks
Modern Hopfield networks (MHNs) have recently gained significant attention in the field of artificial intelligence because they can store and retrieve a large set of patterns with an exponentially large memory capacity. An MHN is generally a dynamical system defined with Lagrangians of memory and feature neurons, where memories associated with in-distribution (ID) samples are represented by attractors in the feature space. One major problem in existing MHNs lies in managing out-of-distribution (OOD) samples because it was originally assumed that all samples are ID samples. To address this, we propose the rectified Lagrangian (RecLag), a new Lagrangian for memory neurons that explicitly incorporates an attractor for OOD samples in the dynamical system of MHNs. RecLag creates a trivial point attractor for any interaction matrix, enabling OOD detection by identifying samples that fall into this attractor as OOD. The interaction matrix is optimized so that the probability densities can be estimated to identify ID/OOD. We demonstrate the effectiveness of RecLag-based MHNs compared to energy-based OOD detection methods, including those using state-of-the-art Hopfield energies, across nine image datasets.
Updated: 2025-02-19 09:50:22
Domain: cs.LG,cs.AI
Finite Element Operator Network for Solving Elliptic-type parametric PDEs
Partial differential equations (PDEs) underlie our understanding and prediction of natural phenomena across numerous fields, including physics, engineering, and finance. However, solving parametric PDEs is a complex task that necessitates efficient numerical methods. In this paper, we propose a novel approach for solving parametric PDEs using a Finite Element Operator Network (FEONet). Our proposed method leverages the power of deep learning in conjunction with traditional numerical methods, specifically the finite element method, to solve parametric PDEs in the absence of any paired input-output training data. We performed various experiments on several benchmark problems and confirmed that our approach has demonstrated excellent performance across various settings and environments, proving its versatility in terms of accuracy, generalization, and computational flexibility. While our method is not meshless, the FEONet framework shows potential for application in various fields where PDEs play a crucial role in modeling complex domains with diverse boundary conditions and singular behavior. Furthermore, we provide theoretical convergence analysis to support our approach, utilizing finite element approximation in numerical analysis.
Updated: 2025-02-19 09:47:56
Domain: math.NA,cs.AI,cs.LG,cs.NA,physics.comp-ph,65M60, 65N30, 68T20, 68U07,G.1.8
ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation
Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a set of item features, which serve as the initial tokens. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Experiments on public datasets demonstrate that ActionPiece consistently outperforms existing action tokenization methods, improving NDCG@$10$ by $6.00\%$ to $12.82\%$.
Updated: 2025-02-19 09:45:29
Domain: cs.IR,cs.LG
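The vocabulary construction reads as a BPE-style merge over feature sets, and a minimal sketch can make that concrete. The helper names (`most_frequent_pair`, `merge_pair`) and the toy feature tokens are illustrative, and this sketch merges only within a single feature set, though it counts co-occurrences both within sets and across adjacent sets, as the abstract describes:

```python
from collections import Counter
from itertools import combinations

def most_frequent_pair(sequences):
    """Count unordered token pairs co-occurring within each action's feature
    set and across adjacent sets, and return the most frequent pair
    (a simplified, BPE-style sketch of the vocabulary-construction step)."""
    counts = Counter()
    for seq in sequences:
        for feats in seq:                        # within-set co-occurrence
            for pair in combinations(sorted(feats), 2):
                counts[pair] += 1
        for a, b in zip(seq, seq[1:]):           # across adjacent sets
            for x in sorted(a):
                for y in sorted(b):
                    if x != y:
                        counts[tuple(sorted((x, y)))] += 1
    return counts.most_common(1)[0]

def merge_pair(sequences, pair, new_token):
    """Replace co-occurrences of both pair members inside one feature set
    with a single merged token."""
    out = []
    for seq in sequences:
        new_seq = []
        for feats in seq:
            feats = set(feats)
            if pair[0] in feats and pair[1] in feats:
                feats = (feats - set(pair)) | {new_token}
            new_seq.append(feats)
        out.append(new_seq)
    return out

# Two toy action sequences, each action a set of item-feature tokens.
seqs = [[{'red', 'shoe'}, {'red', 'bag'}], [{'red', 'shoe'}]]
pair, count = most_frequent_pair(seqs)
merged = merge_pair(seqs, pair, 'red+shoe')
```

Repeating the count-then-merge loop grows a vocabulary whose tokens reflect context, which is the property the abstract contrasts with context-free per-action tokenization.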
Unraveling the Localized Latents: Learning Stratified Manifold Structures in LLM Embedding Space with Sparse Mixture-of-Experts
However, real-world data often exhibit complex local structures that can be challenging for single-model approaches with a smooth global manifold in the embedding space to unravel. In this work, we conjecture that in the latent space of these large language models, the embeddings live in a local manifold structure with different dimensions depending on the perplexities and domains of the input data, commonly referred to as a Stratified Manifold structure, which in combination form a structured space known as a Stratified Space. To investigate the validity of this structural claim, we propose an analysis framework based on a Mixture-of-Experts (MoE) model where each expert is implemented with a simple dictionary learning algorithm at varying sparsity levels. By incorporating an attention-based soft-gating network, we verify that our model learns specialized sub-manifolds for an ensemble of input data sources, reflecting the semantic stratification in LLM embedding space. We further analyze the intrinsic dimensions of these stratified sub-manifolds and present extensive statistics on expert assignments, gating entropy, and inter-expert distances. Our experimental results demonstrate that our method not only validates the claim of a stratified manifold structure in the LLM embedding space, but also provides interpretable clusters that align with the intrinsic semantic variations of the input data.
Updated: 2025-02-19 09:33:16
Domain: cs.LG
Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation
Evaluating models on large benchmarks is very resource-intensive, especially during the period of rapid model evolution. Existing efficient evaluation methods estimate the performance of target models by testing them only on a small and static coreset of the benchmark, which is derived from the publicly available evaluation results of source models. These methods rely on the assumption that target models have high prediction consistency with source models. However, we demonstrate that it doesn't generalize well in practice. To alleviate the inconsistency issue, we present TailoredBench, a method that conducts customized evaluation tailored to each target model. Specifically, a Global-coreset is first constructed as a probe to identify the most consistent source models for each target model with an adaptive source model selection strategy. Afterwards, a scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model. According to the predictions on Native-coresets, we obtain the performance of target models on the whole benchmark with a calibrated estimation strategy. Comprehensive experiments on 5 benchmarks across over 300 models demonstrate that compared to best performing baselines, TailoredBench achieves an average reduction of 31.4% in MAE of accuracy estimates under the same inference budgets, showcasing strong effectiveness and generalizability.
Updated: 2025-02-19 09:31:50
Domain: cs.LG,cs.AI
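The Native-coreset step builds on K-Medoids clustering. As a point of reference, a plain PAM-style alternation over a precomputed distance matrix (not the paper's scalable variant, and with hypothetical toy data) might look like:

```python
import numpy as np

def k_medoids(D, k, iters=20, seed=0):
    """Plain K-Medoids on a precomputed distance matrix D: alternate between
    assigning points to their nearest medoid and re-picking, per cluster,
    the member that minimizes within-cluster distance."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)        # nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new[j] = members[np.argmin(within)]          # best representative
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break                                        # converged
        medoids = new
    return np.sort(medoids)

# Six benchmark examples on a line, forming two obvious clusters.
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
D = np.abs(x[:, None] - x[None, :])
coreset = k_medoids(D, 2)
```

The selected medoids are actual examples rather than synthetic centroids, which is what makes them usable as a coreset to probe target models on.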
ETS: Efficient Tree Search for Inference-Time Scaling
Test-time compute scaling has emerged as a new axis along which to improve model accuracy, where additional computation is used at inference time to allow the model to think longer for more challenging problems. One promising approach for test-time compute scaling is search against a process reward model, where a model generates multiple potential candidates at each step of the search, and these partial trajectories are then scored by a separate reward model in order to guide the search process. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. However, this diversity comes at a cost, as divergent trajectories have less KV sharing, which means they consume more memory and slow down the search process. Previous search methods either do not perform sufficient exploration, or else explore diverse trajectories but have high latency. We address this challenge by proposing Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories. ETS incorporates a linear programming cost model to promote KV cache sharing by penalizing the number of nodes retained, while incorporating a semantic coverage term into the cost model to ensure that we retain trajectories which are semantically different. We demonstrate how ETS can achieve 1.8$\times$ reduction in average KV cache size during the search process, leading to 1.4$\times$ increased throughput relative to prior state-of-the-art methods, with minimal accuracy degradation and without requiring any custom kernel implementation. Code is available at: https://github.com/SqueezeAILab/ETS.
Updated: 2025-02-19 09:30:38
Domain: cs.LG
RestoreGrad: Signal Restoration Using Conditional Denoising Diffusion Models with Jointly Learned Prior
Denoising diffusion probabilistic models (DDPMs) can be utilized for recovering a clean signal from its degraded observation(s) by conditioning the model on the degraded signal. The degraded signals are themselves contaminated versions of the clean signals; due to this correlation, they may encompass certain useful information about the target clean data distribution. However, existing adoption of the standard Gaussian as the prior distribution in turn discards such information, resulting in sub-optimal performance. In this paper, we propose to improve conditional DDPMs for signal restoration by leveraging a more informative prior that is jointly learned with the diffusion model. The proposed framework, called RestoreGrad, seamlessly integrates DDPMs into the variational autoencoder framework and exploits the correlation between the degraded and clean signals to encode a better diffusion prior. On speech and image restoration tasks, we show that RestoreGrad demonstrates faster convergence (5-10 times fewer training steps) to achieve better quality of restored signals over existing DDPM baselines, and improved robustness to using fewer sampling steps in inference time (2-2.5 times fewer), advocating the advantages of leveraging jointly learned prior for efficiency improvements in the diffusion process.
Updated: 2025-02-19 09:29:46
Domain: eess.IV,cs.LG,eess.AS
Noise May Contain Transferable Knowledge: Understanding Semi-supervised Heterogeneous Domain Adaptation from an Empirical Perspective
Semi-supervised heterogeneous domain adaptation (SHDA) addresses learning across domains with distinct feature representations and distributions, where source samples are labeled while most target samples are unlabeled, with only a small fraction labeled. Moreover, there is no one-to-one correspondence between source and target samples. Although various SHDA methods have been developed to tackle this problem, the nature of the knowledge transferred across heterogeneous domains remains unclear. This paper delves into this question from an empirical perspective. We conduct extensive experiments on about 330 SHDA tasks, employing two supervised learning methods and seven representative SHDA methods. Surprisingly, our observations indicate that both the category and feature information of source samples do not significantly impact the performance of the target domain. Additionally, noise drawn from simple distributions, when used as source samples, may contain transferable knowledge. Based on this insight, we perform a series of experiments to uncover the underlying principles of transferable knowledge in SHDA. Specifically, we design a unified Knowledge Transfer Framework (KTF) for SHDA. Based on the KTF, we find that the transferable knowledge in SHDA primarily stems from the transferability and discriminability of the source domain. Consequently, ensuring those properties in source samples, regardless of their origin (e.g., image, text, noise), can enhance the effectiveness of knowledge transfer in SHDA tasks. The codes and datasets are available at https://github.com/yyyaoyuan/SHDA.
Updated: 2025-02-19 09:27:03
Domain: cs.LG
Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems
Recent research in language modeling reveals two scaling effects: the well-known improvement from increased training compute, and a lesser-known boost from applying more sophisticated or computationally intensive inference methods. Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time. For a given fixed model architecture and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. These advantages are maintained even when compared to state-of-the-art recursive techniques like the "repeat-all-over" (RAO) strategy in Mobile LLM. Finally, stochastic RINS not only can enhance performance further but also provides the flexibility to optionally forgo increased inference computation at test time with minimal performance degradation.
Updated: 2025-02-19 09:24:45
Domain: cs.AI,cs.LG
Diffusion Model Agnostic Social Influence Maximization in Hyperbolic Space
The Influence Maximization (IM) problem aims to find a small set of influential users to maximize their influence spread in a social network. Traditional methods rely on fixed diffusion models with known parameters, limiting their generalization to real-world scenarios. In contrast, graph representation learning-based methods have gained wide attention for overcoming this limitation by learning user representations to capture influence characteristics. However, existing studies are built on Euclidean space, which fails to effectively capture the latent hierarchical features of social influence distribution. As a result, users' influence spread cannot be effectively measured through the learned representations. To alleviate these limitations, we propose HIM, a novel diffusion model agnostic method that leverages hyperbolic representation learning to estimate users' potential influence spread from social propagation data. HIM consists of two key components. First, a hyperbolic influence representation module encodes influence spread patterns from network structure and historical influence activations into expressive hyperbolic user representations. Hence, the influence magnitude of users can be reflected through the geometric properties of hyperbolic space, where highly influential users tend to cluster near the space origin. Second, a novel adaptive seed selection module is developed to flexibly and effectively select seed users using the positional information of learned user representations. Extensive experiments on five network datasets demonstrate the superior effectiveness and efficiency of our method for the IM problem with unknown diffusion model parameters, highlighting its potential for large-scale real-world social networks.
Updated: 2025-02-19 09:24:28
Domain: cs.SI,cs.LG
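The geometric intuition, that highly influential users cluster near the origin of hyperbolic space, can be illustrated with the Poincaré-ball distance from the origin, d(0, x) = 2 artanh(‖x‖). The seed-selection helper below is a toy proxy with hypothetical names, not the paper's adaptive seed selection module:

```python
import numpy as np

def poincare_origin_distance(x):
    """Distance from the origin in the Poincare ball, used here as a toy
    proxy for influence magnitude: points near the origin correspond to
    hierarchy roots / highly influential users."""
    norm = np.linalg.norm(x, axis=-1)
    norm = np.clip(norm, 0.0, 1.0 - 1e-7)   # stay strictly inside the unit ball
    return 2.0 * np.arctanh(norm)

def select_seeds(embeddings, k):
    """Pick the k users whose hyperbolic embeddings lie closest to the origin."""
    d = poincare_origin_distance(embeddings)
    return np.argsort(d)[:k]

# Three toy user embeddings inside the unit ball.
emb = np.array([[0.9, 0.0], [0.1, 0.0], [0.5, 0.0]])
seeds = select_seeds(emb, 2)
```

Because arctanh blows up near the ball's boundary, small Euclidean differences near the edge encode large hierarchical gaps, which is why hyperbolic space captures the latent hierarchy of influence spread more naturally than Euclidean space.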
Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy
Scaling Laws have emerged as a powerful framework for understanding how model performance evolves as they increase in size, providing valuable insights for optimizing computational resources. In the realm of Sequential Recommendation (SR), which is pivotal for predicting users' sequential preferences, these laws offer a lens through which to address the challenges posed by the scalability of SR models. However, the presence of structural and collaborative issues in recommender systems prevents the direct application of the Scaling Law (SL) in these systems. In response, we introduce the Performance Law for SR models, which aims to theoretically investigate and model the relationship between model performance and data quality. Specifically, we first fit the HR and NDCG metrics to transformer-based SR models. Subsequently, we propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics. Our method enables accurate predictions across various dataset scales and model sizes, demonstrating a strong correlation in large SR models and offering insights into achieving optimal performance for any given model configuration.
Updated: 2025-02-19 09:23:03
Domain: cs.AI,cs.IR,68P20,H.3.4; I.2.6
An Efficient Permutation-Based Kernel Two-Sample Test
Two-sample hypothesis testing, i.e., determining whether two sets of data are drawn from the same distribution, is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing, maximum mean discrepancy (MMD) has gained popularity as a test statistic due to its flexibility and strong theoretical foundations. However, its use in large-scale scenarios is plagued by high computational costs. In this work, we use a Nyström approximation of the MMD to design a computationally efficient and practical testing algorithm while preserving statistical guarantees. Our main result is a finite-sample bound on the power of the proposed test for distributions that are sufficiently separated with respect to the MMD. The derived separation rate matches the known minimax optimal rate in this setting. We support our findings with a series of numerical experiments, emphasizing realistic scientific data.
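For reference, the naive quadratic-time MMD permutation test looks as follows (our own baseline sketch with an RBF kernel; the paper's contribution, the Nyström-accelerated statistic, is not reproduced here):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed pairwise
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq)

def mmd2_biased(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

def permutation_pvalue(x, y, n_perm=200, gamma=1.0, seed=0):
    """Permutation test: reshuffling the pooled sample simulates the null."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x, y, gamma)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], gamma) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Each permutation re-evaluates full kernel matrices, so the cost scales quadratically in sample size; a Nyström variant would replace these with low-rank approximations built from a few landmark points.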
Updated: 2025-02-19 09:22:48
Domain: stat.ML,cs.LG,math.ST,stat.ME,stat.TH
Model Evolution Framework with Genetic Algorithm for Multi-Task Reinforcement Learning
Multi-task reinforcement learning employs a single policy to complete various tasks, aiming to develop an agent with generalizability across different scenarios. Given the shared characteristics of tasks, the agent's learning efficiency can be enhanced through parameter sharing. Existing approaches typically use a routing network to generate specific routes for each task and reconstruct a set of modules into diverse models to complete multiple tasks simultaneously. However, due to the inherent difference between tasks, it is crucial to allocate resources based on task difficulty, which is constrained by the model's structure. To this end, we propose a Model Evolution framework with Genetic Algorithm (MEGA), which enables the model to evolve during training according to the difficulty of the tasks. When the current model is insufficient for certain tasks, the framework will automatically incorporate additional modules, enhancing the model's capabilities. Moreover, to adapt to our model evolution framework, we introduce a genotype module-level model, using binary sequences as genotype policies for model reconstruction, while leveraging a non-gradient genetic algorithm to optimize these genotype policies. Unlike routing networks with fixed output dimensions, our approach allows for the dynamic adjustment of the genotype policy length, enabling it to accommodate models with a varying number of modules. We conducted experiments on various robotics manipulation tasks in the Meta-World benchmark. Our state-of-the-art performance demonstrated the effectiveness of the MEGA framework. We will release our source code to the public.
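The binary-genotype idea can be sketched in a few lines (a toy sketch under our own naming and fitness function; MEGA's actual operators and task-driven evaluation are not shown):

```python
import random

random.seed(0)

def mutate(genotype, p=0.1):
    # bit-flip mutation on a binary module-selection genotype
    return [1 - g if random.random() < p else g for g in genotype]

def grow(genotype, extra=1):
    # model evolution: append slots for newly added modules, active by default
    return genotype + [1] * extra

def evolve(population, fitness, generations=10):
    # non-gradient search: keep the fitter half, refill with mutated copies
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: len(ranked) // 2]
        population = parents + [mutate(g) for g in parents]
    return max(population, key=fitness)

# toy fitness: prefer genotypes that activate exactly three modules
toy_fitness = lambda g: -abs(sum(g) - 3)
population = [[random.randint(0, 1) for _ in range(6)] for _ in range(8)]
best = evolve(population, toy_fitness)
```

Because the genotype is just a variable-length list, `grow()` can lengthen it when new modules are added, which a routing network with a fixed output dimension cannot do.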
Updated: 2025-02-19 09:22:34
Domain: cs.AI
LSR-Adapt: Ultra-Efficient Parameter Tuning with Matrix Low Separation Rank Kernel Adaptation
Imposing an effective structural assumption on neural network weight matrices has been the major paradigm for designing Parameter-Efficient Fine-Tuning (PEFT) systems for adapting modern large pre-trained models to various downstream tasks. However, low rank based adaptation has become increasingly challenging due to the sheer scale of modern large language models. In this paper, we propose an effective kernelization to further reduce the number of parameters required for adaptation tasks. Specifically, from the classical idea in numerical analysis regarding matrix Low-Separation-Rank (LSR) representations, we develop a kernel using this representation for the low rank adapter matrices of the linear layers from large networks, named the Low Separation Rank Adaptation (LSR-Adapt) kernel. With the ultra-efficient kernel representation of the low rank adapter matrices, we manage to achieve state-of-the-art performance with even higher accuracy with almost half the number of parameters as compared to conventional low rank based methods. This structural assumption also opens the door to further GPU-side optimizations due to the highly parallelizable nature of Kronecker computations.
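The parameter savings from a low-separation-rank (Kronecker) structure are easy to count with a toy example (sizes and separation rank are our own illustrative choices, not LSR-Adapt's):

```python
import numpy as np

# Represent a 64x64 matrix as a sum of s Kronecker products of 8x8 factors:
# W = sum_k kron(A_k, B_k), the classic low-separation-rank (LSR) form.
m1, n1, m2, n2, s = 8, 8, 8, 8, 2
rng = np.random.default_rng(0)
factors = [(rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
           for _ in range(s)]
W = sum(np.kron(A, B) for A, B in factors)

dense_params = (m1 * m2) * (n1 * n2)  # storing W directly: 4096 entries
lsr_params = s * (m1 * n1 + m2 * n2)  # storing only the factors: 256 entries
```

Kronecker-structured multiplies also factor into small dense products, which is the highly parallelizable GPU-side structure the abstract alludes to.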
Updated: 2025-02-19 09:20:47
Domain: cs.LG,cs.CL
Sampling-based Distributed Training with Message Passing Neural Network
In this study, we introduce a domain-decomposition-based distributed training and inference approach for message-passing neural networks (MPNN). Our objective is to address the challenge of scaling edge-based graph neural networks as the number of nodes increases. Through our distributed training approach, coupled with Nyström-approximation sampling techniques, we present a scalable graph neural network, referred to as DS-MPNN (D and S standing for distributed and sampled, respectively), capable of scaling up to $O(10^5)$ nodes. We validate our sampling and distributed training approach on two cases: (a) a Darcy flow dataset and (b) steady RANS simulations of 2-D airfoils, providing comparisons with both single-GPU implementation and node-based graph convolution networks (GCNs). The DS-MPNN model demonstrates comparable accuracy to single-GPU implementation, can accommodate a significantly larger number of nodes compared to the single-GPU variant (S-MPNN), and significantly outperforms the node-based GCN.
Updated: 2025-02-19 09:14:58
Domain: cs.LG,cs.DC,physics.flu-dyn
Are Large Language Models In-Context Graph Learners?
Large language models (LLMs) have demonstrated remarkable in-context reasoning capabilities across a wide range of tasks, particularly with unstructured inputs such as language or images. However, LLMs struggle to handle structured data, such as graphs, due to their lack of understanding of non-Euclidean structures. As a result, without additional fine-tuning, their performance significantly lags behind that of graph neural networks (GNNs) in graph learning tasks. In this paper, we show that learning on graph data can be conceptualized as a retrieval-augmented generation (RAG) process, where specific instances (e.g., nodes or edges) act as queries, and the graph itself serves as the retrieved context. Building on this insight, we propose a series of RAG frameworks to enhance the in-context learning capabilities of LLMs for graph learning tasks. Comprehensive evaluations demonstrate that our proposed RAG frameworks significantly improve LLM performance on graph-based tasks, particularly in scenarios where a pretrained LLM must be used without modification or accessed via an API.
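The retrieval-augmented framing is straightforward to mock up: treat a node as the query and serialise its neighbourhood as retrieved context (a toy sketch with hypothetical data and helper names; the paper's concrete RAG frameworks are not specified here):

```python
# Toy data standing in for a labelled graph (hypothetical values)
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
labels = {"A": "ML", "B": "ML", "C": "DB"}

def retrieve_context(query_node, edges, labels):
    # the query node's 1-hop neighbourhood acts as the "retrieved" context
    neighbours = ({v for u, v in edges if u == query_node}
                  | {u for u, v in edges if v == query_node})
    lines = [f"{n} (label: {labels.get(n, '?')}) is linked to {query_node}"
             for n in sorted(neighbours)]
    return "\n".join(lines)

prompt = ("Context:\n" + retrieve_context("D", edges, labels)
          + "\nQuestion: what is the likely label of node D?")
```

A pretrained LLM accessed only through an API could consume `prompt` directly, which is exactly the no-fine-tuning scenario the abstract targets.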
Updated: 2025-02-19 09:14:19
Domain: cs.LG,cs.AI
SpecFuse: Ensembling Large Language Models via Next-Segment Prediction
Ensembles of generative large language models (LLMs) can integrate the strengths of different LLMs to compensate for the limitations of individual models. However, recent work has focused on training an additional fusion model to combine complete responses from multiple LLMs, failing to tap into their collaborative potential to generate higher-quality responses. Moreover, as the additional fusion model is trained on a specialized dataset, these methods struggle with generalizing to open-domain queries from online users. In this paper, we propose SpecFuse, a novel ensemble framework that outputs the fused result by iteratively producing the next segment through collaboration among LLMs. This is achieved through cyclic execution of its inference and verification components. In each round, the inference component invokes each base LLM to generate candidate segments in parallel, and the verification component calls these LLMs again to predict the ranking of the segments. The top-ranked segment is then broadcast to all LLMs, encouraging them to generate higher-quality segments in the next round. This approach also allows the base LLMs to be plug-and-play, without any training or adaptation, avoiding generalization limitations. Furthermore, to conserve computational resources, we propose a model exit mechanism that dynamically excludes models exhibiting poor performance in previous rounds during each query response. In this way, it effectively reduces the number of model calls while maintaining overall performance.
Updated: 2025-02-19 09:01:59
Domain: cs.CL,cs.AI
Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs
Data augmentation is necessary for graph representation learning due to the scarcity and noise present in graph data. Most of the existing augmentation methods overlook the context information inherent in the dataset, as they rely solely on the graph structure for augmentation. Despite the success of some large language model-based (LLM) graph learning methods, they are mostly white-box approaches that require access to the weights or latent features of the LLMs, making them difficult to democratize for everyone, since existing LLMs are mostly closed-source for commercial considerations. To overcome these limitations, we propose a black-box, context-driven graph data augmentation approach guided by LLMs -- DemoGraph. Leveraging the text prompt as context-related information, we task the LLM with generating knowledge graphs (KGs), which allow us to capture the structural interactions from the text outputs. We then design a dynamic merging schema to stochastically integrate the LLM-generated KGs into the original graph during training. To control the sparsity of the augmented graph, we further devise a granularity-aware prompting strategy and an instruction fine-tuning module, which seamlessly generates text prompts according to different granularity levels of the dataset. Extensive experiments on various graph learning tasks validate the effectiveness of our method over existing graph data augmentation methods. Notably, our approach excels in scenarios involving electronic health records (EHRs), which validates its maximal utilization of contextual knowledge, leading to enhanced predictive performance and interpretability.
Updated: 2025-02-19 09:00:32
Domain: cs.LG,cs.AI
A Cognitive Writing Perspective for Constrained Long-Form Text Generation
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements: CogWriter (https://github.com/KaiyangWan/CogWriter).
Updated: 2025-02-19 08:58:13
Domain: cs.CL,cs.AI
X-IL: Exploring the Design Space of Imitation Learning Policies
Designing modern imitation learning (IL) policies requires making numerous decisions, including the selection of feature encoding, architecture, policy representation, and more. As the field rapidly advances, the range of available options continues to grow, creating a vast and largely unexplored design space for IL policies. In this work, we present X-IL, an accessible open-source framework designed to systematically explore this design space. The framework's modular design enables seamless swapping of policy components, such as backbones (e.g., Transformer, Mamba, xLSTM) and policy optimization techniques (e.g., Score-matching, Flow-matching). This flexibility facilitates comprehensive experimentation and has led to the discovery of novel policy configurations that outperform existing methods on recent robot learning benchmarks. Our experiments demonstrate not only significant performance gains but also provide valuable insights into the strengths and weaknesses of various design choices. This study serves as both a practical reference for practitioners and a foundation for guiding future research in imitation learning.
Updated: 2025-02-19 08:57:34
Domain: cs.RO,cs.LG
Cross-View Graph Consistency Learning for Invariant Graph Representations
Graph representation learning is fundamental for analyzing graph-structured data. Exploring invariant graph representations remains a challenge for most existing graph representation learning methods. In this paper, we propose a cross-view graph consistency learning (CGCL) method that learns invariant graph representations for link prediction. First, two complementary augmented views are derived from an incomplete graph structure through a coupled graph structure augmentation scheme. This augmentation scheme mitigates the potential information loss that is commonly associated with various data augmentation techniques involving raw graph data, such as edge perturbation, node removal, and attribute masking. Second, we propose a CGCL model that can learn invariant graph representations. A cross-view training scheme is proposed to train the proposed CGCL model. This scheme attempts to maximize the consistency information between one augmented view and the graph structure reconstructed from the other augmented view. Furthermore, we offer a comprehensive theoretical CGCL analysis. This paper empirically and experimentally demonstrates the effectiveness of the proposed CGCL method, achieving competitive results on graph datasets in comparisons with several state-of-the-art algorithms.
Updated: 2025-02-19 08:51:54
Domain: cs.LG,cs.CV
Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference
Recent advances in large language models (LLMs) have showcased exceptional performance in long-context tasks, while facing significant inference efficiency challenges with limited GPU memory. Existing solutions first proposed the sliding-window approach to accumulate a set of historical key-value (KV) pairs for reuse, then further improvements selectively retain its subsets at each step. However, due to the sparse attention distribution across a long context, it is hard to identify and recall relevant KV pairs, as the attention is distracted by massive candidate pairs. Additionally, we found it promising to select representative tokens as the probe-Query in each sliding window to effectively represent the entire context, which is an approach overlooked by existing methods. Thus, we propose ActQKV, a training-free, Activation-aware approach that dynamically determines the probe-Query and leverages it to retrieve the relevant KV pairs for inference. Specifically, ActQKV monitors a token-level indicator, Activation Bias, within each context window, enabling the proper construction of the probe-Query for retrieval at the pre-filling stage. To accurately recall the relevant KV pairs and minimize the irrelevant ones, we design a dynamic KV cut-off mechanism guided by information density across layers at the decoding stage. Experiments on the LongBench and $\infty$Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
Updated: 2025-02-19 08:50:44
Domain: cs.CL,cs.AI
Causal Concept Graph Models: Beyond Causal Opacity in Deep Learning
Causal opacity denotes the difficulty in understanding the "hidden" causal structure underlying the decisions of deep neural network (DNN) models. This leads to the inability to rely on and verify state-of-the-art DNN-based systems, especially in high-stakes scenarios. For this reason, circumventing causal opacity in DNNs represents a key open challenge at the intersection of deep learning, interpretability, and causality. This work addresses this gap by introducing Causal Concept Graph Models (Causal CGMs), a class of interpretable models whose decision-making process is causally transparent by design. Our experiments show that Causal CGMs can: (i) match the generalisation performance of causally opaque models, (ii) enable human-in-the-loop corrections to mispredicted intermediate reasoning steps, boosting not just downstream accuracy after corrections but also the reliability of the explanations provided for specific instances, and (iii) support the analysis of interventional and counterfactual scenarios, thereby improving the model's causal interpretability and supporting the effective verification of its reliability and fairness.
Updated: 2025-02-19 08:46:29
Domain: cs.LG,cs.AI
MIH-TCCT: Mitigating Inconsistent Hallucinations in LLMs via Event-Driven Text-Code Cyclic Training
Recent methodologies utilizing synthetic datasets have aimed to address inconsistent hallucinations in large language models (LLMs); however, these approaches are primarily tailored to specific tasks, limiting their generalizability. Inspired by the strong performance of code-trained models in logic-intensive domains, we propose a novel framework that leverages event-based text to generate corresponding code and employs cyclic training to transfer the logical consistency of code to natural language effectively. Our method significantly reduces inconsistent hallucinations across three leading LLMs and two categories of natural language tasks while maintaining overall performance. This framework effectively alleviates hallucinations without necessitating adaptation to downstream tasks, demonstrating generality and providing new perspectives to tackle the challenge of inconsistent hallucinations.
Updated: 2025-02-19 08:42:33
Domain: cs.AI
Solving the Encoding Bottleneck: Of the HHL Algorithm, By the HHL Algorithm
The Harrow-Hassidim-Lloyd (HHL) algorithm offers exponential speedup for solving the quantum linear-system problem. However, some of the conditions required for this speedup can be hard to meet. One of the difficulties is the encoding bottleneck, i.e., the efficient preparation of the initial quantum state. To prepare an arbitrary $N$-dimensional state exactly, existing state-preparation approaches generally require a runtime of $O(N)$, which would ruin the speedup of the HHL algorithm. Here we show that the states can be prepared approximately with a runtime of $O(poly(\log N))$ by employing a slightly modified version of the HHL algorithm itself. Thus, applying this approach to prepare the initial state of the original HHL algorithm can preserve the exponential speedup advantage. It can also serve as a standalone solution for other applications demanding rapid state preparation.
Updated: 2025-02-19 08:39:41
Domain: quant-ph,cs.AI,cs.LG
Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaption (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, aligns the knowledge discrepancy between pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20G HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM implemented by structured pruning combined with 4-bit quantization, for LLaMA-3.1-70B (LLaMA-2-70B), reduces the parameter storage cost that dominates the memory usage in low-rank matrix training by 15.81$\times$ (16.95$\times$), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
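As background, the frozen-weight-plus-low-rank-update structure that LoRAM builds on can be written in a few lines (a generic LoRA sketch with toy sizes; LoRAM's pruning, recovery, and continual pre-training steps are not shown):

```python
import numpy as np

d_in, d_out, rank = 512, 512, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))         # frozen pretrained weight
A = rng.standard_normal((rank, d_out)) * 0.01  # trainable low-rank factor
B = np.zeros((d_in, rank))                     # trainable, zero-initialised

def lora_forward(x, alpha=1.0):
    # the adapter adds a rank-`rank` update; with B = 0 at initialisation
    # the output matches the frozen base model exactly
    return x @ W + alpha * (x @ B) @ A

trainable = A.size + B.size  # 8 * 512 + 512 * 8 = 8192
frozen = W.size              # 512 * 512 = 262144
```

Only A and B receive gradients while the much larger base weight stays frozen; the abstract's point is that the frozen weights still dominate memory, which is what training on a pruned copy of the base model addresses.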
Updated: 2025-02-19 08:39:15
Domain: cs.LG,cs.AI,cs.CL
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking
The rise of Large Language Models (LLMs) has led to significant applications but also introduced serious security threats, particularly from jailbreak attacks that manipulate output generation. These attacks utilize prompt engineering and logit manipulation to steer models toward harmful content, prompting LLM providers to implement filtering and safety alignment strategies. We investigate LLMs' safety mechanisms and their recent applications, revealing a new threat model targeting structured output interfaces, which enables attackers to manipulate the internal logits during LLM generation, requiring only API access permissions. To demonstrate this threat model, we introduce a black-box attack framework called AttackPrefixTree (APT). APT exploits structured output interfaces to dynamically construct attack patterns. By leveraging prefixes of models' safety refusal responses and latent harmful outputs, APT effectively bypasses safety measures. Experiments on benchmark datasets indicate that this approach achieves a higher attack success rate than existing methods. This work highlights the urgent need for LLM providers to enhance security protocols to address vulnerabilities arising from the interaction between safety patterns and structured outputs.
Updated: 2025-02-19 08:29:36
Domain: cs.CR,cs.AI
FiDeLiS: Faithful Reasoning in Large Language Model for Knowledge Graph Question Answering
Large language models (LLMs) are often challenged by generating erroneous or hallucinated responses, especially in complex reasoning tasks. Leveraging knowledge graphs (KGs) as external knowledge sources has emerged as a viable solution. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this paper, we propose a unified framework, FiDeLiS, designed to improve the factuality of LLM responses by anchoring answers to verifiable reasoning steps retrieved from a KG. To achieve this, we leverage step-wise beam search with a deductive scoring function, allowing the LLM to validate each reasoning step and halt the search once the question is deducible. In addition, our Path-rag module pre-selects a smaller candidate set for each beam search step, reducing computational costs by narrowing the search space. Extensive experiments show that our training-free and efficient approach outperforms strong baselines, enhancing both factuality and interpretability.
Updated: 2025-02-19 08:29:15
Domain: cs.AI,cs.CL
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Recent works integrating Knowledge Graphs (KGs) have led to promising improvements in enhancing the reasoning accuracy of Large Language Models (LLMs). However, current benchmarks focus mainly on closed-ended tasks, leaving a gap in the assessment of more complex real-world scenarios. This gap has also obscured the evaluation of KGs' potential to mitigate the problem of hallucination in LLMs. To fill the gap, we introduce OKGQA, a new benchmark specifically designed to assess LLMs enhanced with KGs under open-ended, real-world question answering scenarios. OKGQA is designed to closely reflect the complexities of practical applications using questions from different types, and incorporates specific metrics to measure both hallucination ratio and the enhancement in reasoning capabilities. To consider the scenario in which KGs may have varying levels of mistakes, we propose another benchmark variant OKGQA-P to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated. OKGQA aims to (1) explore whether KGs can make LLMs more trustworthy in an open-ended setting, and (2) conduct a comparative analysis to shed light on method design. We believe that this study can facilitate a more complete performance comparison and encourage continuous improvement in integrating KGs with LLMs to reduce hallucination.
Updated: 2025-02-19 08:25:48
Domains: cs.CL, cs.AI
AS-GCL: Asymmetric Spectral Augmentation on Graph Contrastive Learning
Graph Contrastive Learning (GCL) has emerged as the foremost approach for self-supervised learning on graph-structured data. GCL reduces reliance on labeled data by learning robust representations from various augmented views. However, existing GCL methods typically depend on consistent stochastic augmentations, which overlook their impact on the intrinsic structure of the spectral domain, thereby limiting the model's ability to generalize effectively. To address these limitations, we propose a novel paradigm called AS-GCL that incorporates asymmetric spectral augmentation for graph contrastive learning. A typical GCL framework consists of three key components: graph data augmentation, view encoding, and contrastive loss. Our method introduces significant enhancements to each of these components. Specifically, for data augmentation, we apply spectral-based augmentation to minimize spectral variations, strengthen structural invariance, and reduce noise. With respect to encoding, we employ parameter-sharing encoders with distinct diffusion operators to generate diverse, noise-resistant graph views. For contrastive loss, we introduce an upper-bound loss function that promotes generalization by maintaining a balanced distribution of intra- and inter-class distance. To our knowledge, we are the first to encode augmentation views of the spectral domain using asymmetric encoders. Extensive experiments on eight benchmark datasets across various node-level tasks demonstrate the advantages of the proposed method.
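A spectral-domain augmentation of the kind described can be illustrated with a normalized-Laplacian eigendecomposition; the jitter-and-rebuild scheme below is a generic sketch, not the paper's exact operator:

```python
import numpy as np

def spectral_augment(adj, scale=0.05, seed=0):
    """Sketch of a spectral graph augmentation: eigendecompose the
    symmetric normalized Laplacian, jitter its spectrum slightly, and
    rebuild a dense augmented view of the graph."""
    rng = np.random.default_rng(seed)
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d = np.where(deg > 0, deg ** -0.5, 0.0)
    lap = np.eye(len(adj)) - d[:, None] * adj * d[None, :]
    vals, vecs = np.linalg.eigh(lap)
    # Small spectral jitter, clipped to the valid Laplacian range [0, 2]
    vals = np.clip(vals + rng.normal(0.0, scale, vals.shape), 0.0, 2.0)
    lap_aug = (vecs * vals) @ vecs.T
    return np.eye(len(adj)) - lap_aug  # augmented normalized-adjacency view
```

Keeping the perturbation small in the spectral domain is what preserves the structural invariances the abstract emphasizes, in contrast to uniform random edge dropping.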
Updated: 2025-02-19 08:22:57
Domains: cs.LG
MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis
Efficient evaluation of three-dimensional (3D) medical images is crucial for diagnostic and therapeutic practices in healthcare. Recent years have seen a substantial uptake in applying deep learning and computer vision to analyse and interpret medical images. Traditional approaches, such as convolutional neural networks (CNNs) and vision transformers (ViTs), face significant computational challenges, prompting the need for architectural advancements. Recent efforts have led to the introduction of novel architectures like the "Mamba" model as alternative solutions to traditional CNNs or ViTs. The Mamba model excels in the linear processing of one-dimensional data with low computational demands. However, Mamba's potential for 3D medical image analysis remains underexplored and could face significant computational challenges as the dimension increases. This manuscript presents MobileViM, a streamlined architecture for efficient segmentation of 3D medical images. In the MobileViM network, we invent a new dimension-independent mechanism and a dual-direction traversing approach to incorporate with a vision-Mamba-based framework. MobileViM also features a cross-scale bridging technique to improve efficiency and accuracy across various medical imaging modalities. With these enhancements, MobileViM achieves segmentation speeds exceeding 90 frames per second (FPS) on a single graphics processing unit (i.e., NVIDIA RTX 4090). This performance is over 24 FPS faster than the state-of-the-art deep learning models for processing 3D images with the same computational resources. In addition, experimental evaluations demonstrate that MobileViM delivers superior performance, with Dice similarity scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for PENGWIN, BraTS2024, ATLAS, and Toothfairy2 datasets, respectively, which significantly surpasses existing models.
Updated: 2025-02-19 08:21:59
Domains: cs.CV, cs.AI, cs.LG, cs.NI
Enhancing Machine Learning Potentials through Transfer Learning across Chemical Elements
Machine Learning Potentials (MLPs) can enable simulations of ab initio accuracy at orders of magnitude lower computational cost. However, their effectiveness hinges on the availability of considerable datasets to ensure robust generalization across chemical space and thermodynamic conditions. The generation of such datasets can be labor-intensive, highlighting the need for innovative methods to train MLPs in data-scarce scenarios. Here, we introduce transfer learning of potential energy surfaces between chemically similar elements. Specifically, we leverage the trained MLP for silicon to initialize and expedite the training of an MLP for germanium. Utilizing classical force field and ab initio datasets, we demonstrate that transfer learning surpasses traditional training from scratch in force prediction, leading to more stable simulations and improved temperature transferability. These advantages become even more pronounced as the training dataset size decreases. The out-of-target property analysis shows that transfer learning leads to beneficial but sometimes adverse effects. Our findings demonstrate that transfer learning across chemical elements is a promising technique for developing accurate and numerically stable MLPs, particularly in a data-scarce regime.
Updated: 2025-02-19 08:20:54
Domains: cs.LG, cond-mat.mtrl-sci
MILE: Model-based Intervention Learning
Imitation learning techniques have been shown to be highly effective in real-world control scenarios, such as robotics. However, these approaches not only suffer from compounding error issues but also require human experts to provide complete trajectories. Although there exist interactive methods where an expert oversees the robot and intervenes if needed, these extensions usually only utilize the data collected during intervention periods and ignore the feedback signal hidden in non-intervention timesteps. In this work, we create a model to formulate how the interventions occur in such cases, and show that it is possible to learn a policy with just a handful of expert interventions. Our key insight is that it is possible to get crucial information about the quality of the current state and the optimality of the chosen action from expert feedback, regardless of the presence or the absence of intervention. We evaluate our method on various discrete and continuous simulation environments, a real-world robotic manipulation task, as well as a human subject study. Videos and the code can be found at https://liralab.usc.edu/mile .
Updated: 2025-02-19 08:15:16
Domains: cs.RO, cs.AI, cs.LG
SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin
Recently, enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain of Thoughts) rely on prompt selection and the pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on stronger model distillation or human annotations; while Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these, we propose SPPD, a Self-training framework integrating Process Preference learning using Dynamic value margin. SPPD leverages a process-based Markov Decision Process (MDP) and the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization, which employs tree-based self-sampling on model responses without any distillation from other models. Furthermore, we theoretically prove that SPPD is equivalent to on-policy policy gradient methods under reward constraints. Experiments on 7B-scale models demonstrate superior performance across in-domain and out-of-domain mathematical benchmarks. We open-source our code at \href{https://anonymous.4open.science/r/SSDPO-D-DCDD}{https://anonymous.4open.science/r/SPPD-DCDD}.
Updated: 2025-02-19 08:11:26
Domains: cs.AI
MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs
Automatic coding of the International Classification of Diseases (ICD) is a well-established task in the medical field and has received much attention. It has been successful for English records but faces challenges when dealing with Chinese electronic medical records (EMRs). The first issue lies in the difficulty of extracting disease-code-related information from Chinese EMRs, primarily due to the concise writing style and specific internal structure of the EMRs. The second problem is that previous methods have failed to leverage disease-based multi-axial knowledge and lack association with the corresponding clinical evidence. This paper introduces a novel framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. Initially, we identify candidate codes for the diagnosis and categorize each of them into knowledge under four coding axes. Subsequently, we retrieve corresponding clinical evidence from the comprehensive content of EMRs and filter credible evidence through a scoring model. Finally, to ensure the validity of the candidate code, we propose an inference module based on the masked language modeling strategy. This module verifies that all the axis knowledge associated with the candidate code is supported by evidence and provides recommendations accordingly. To evaluate the performance of our framework, we conduct experiments using a large-scale Chinese EMR dataset collected from various hospitals. The experimental results demonstrate that MKE-Coder exhibits significant superiority in the task of automatic ICD coding based on Chinese EMRs. In a practical evaluation of our method within simulated real coding scenarios, our approach has been shown to significantly aid coders in improving both their coding accuracy and speed.
Updated: 2025-02-19 08:08:53
Domains: cs.CL, cs.AI
Phantom Events: Demystifying the Issues of Log Forgery in Blockchain
With the rapid development of blockchain technology, transaction logs play a central role in various applications, including decentralized exchanges, wallets, cross-chain bridges, and other third-party services. However, these logs, particularly those based on smart contract events, are highly susceptible to manipulation and forgery, creating substantial security risks across the ecosystem. To address this issue, we present the first in-depth security analysis of transaction log forgery in EVM-based blockchains, a phenomenon we term Phantom Events. We systematically model five types of attacks and propose a tool designed to detect event forgery vulnerabilities in smart contracts. Our evaluation demonstrates that our approach outperforms existing tools in identifying potential phantom events. Furthermore, we have successfully identified real-world instances for all five types of attacks across multiple decentralized applications. Finally, we call on community developers to take proactive steps to address these critical security vulnerabilities.
Updated: 2025-02-19 08:07:26
Domains: cs.CR
NEAR: A Training-Free Pre-Estimator of Machine Learning Model Performance
Artificial neural networks have been shown to be state-of-the-art machine learning models in a wide variety of applications, including natural language processing and image recognition. However, building a performant neural network is a laborious task and requires substantial computing power. Neural Architecture Search (NAS) addresses this issue by an automatic selection of the optimal network from a set of potential candidates. While many NAS methods still require training of (some) neural networks, zero-cost proxies promise to identify the optimal network without training. In this work, we propose the zero-cost proxy Network Expressivity by Activation Rank (NEAR). It is based on the effective rank of the pre- and post-activation matrix, i.e., the values of a neural network layer before and after applying its activation function. We demonstrate the cutting-edge correlation between this network score and the model accuracy on NAS-Bench-101 and NATS-Bench-SSS/TSS. In addition, we present a simple approach to estimate the optimal layer sizes in multi-layer perceptrons. Furthermore, we show that this score can be utilized to select hyperparameters such as the activation function and the neural network weight initialization scheme.
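The effective rank underlying the proxy is commonly defined as the exponential of the Shannon entropy of the normalized singular values; a minimal sketch under that standard definition:

```python
import numpy as np

def effective_rank(activations, eps=1e-12):
    """Effective rank of a (pre- or post-)activation matrix, defined as
    exp(entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(np.asarray(activations, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values to a distribution
    p = p[p > eps]            # drop numerically zero entries
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))
```

An identity matrix scores its full dimension while a rank-one matrix scores close to 1, matching the intuition that richer activations span more directions.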
Updated: 2025-02-19 08:04:20
Domains: cs.LG, cond-mat.dis-nn, physics.chem-ph, physics.data-an
Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion
Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data such as lab test results capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging due to the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify these heterogeneous data types. Our approach leverages lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative embeddings. These embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. Furthermore, our framework incorporates tailored self-supervised objectives to enhance both intra- and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.
Updated: 2025-02-19 07:56:48
Domains: cs.CL, cs.AI, cs.LG, 68T50, I.2.7
Towards a perturbation-based explanation for medical AI as differentiable programs
Recent advancements in machine learning algorithms have reached a point where medical devices can be equipped with artificial intelligence (AI) models for diagnostic support and routine automation in clinical settings. In medicine and healthcare, there is a particular demand for sufficient and objective explainability of the outcomes generated by AI models. However, AI models are generally considered black boxes due to their complexity, and the computational process leading to their response is often opaque. Although several methods have been proposed to explain the behavior of models by evaluating the importance of each feature in discrimination and prediction, they may suffer from biases and opacities arising from the scale and sampling protocol of the dataset used for training or testing. To overcome the shortcomings of existing methods, we explore an alternative approach to providing an objective explanation of AI models that can be defined independently of the learning process and does not require additional data. As a preliminary study in this direction of research, this work examines the numerical availability of the Jacobian matrix of deep learning models, which measures how stably a model responds to small perturbations added to the input. The indicator, if available, is calculated from a trained AI model for a given target input. This is a first step towards a perturbation-based explanation, which will assist medical practitioners in understanding and interpreting the response of the AI model in its clinical application.
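The indicator described, how stably a model responds to small input perturbations, can be approximated even without autodiff access via finite differences; in this sketch the spectral norm of the Jacobian serves as the stability score (an illustrative choice, not necessarily the paper's exact indicator):

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-5):
    """Finite-difference Jacobian of a model f: R^n -> R^m at input x."""
    x = np.asarray(x, dtype=float)
    y0 = np.asarray(f(x), dtype=float)
    jac = np.zeros((y0.size, x.size))
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += h          # perturb one input coordinate
        jac[:, i] = (np.asarray(f(x_step), dtype=float) - y0) / h
    return jac

def stability_score(f, x):
    # Spectral norm = worst-case amplification of a small perturbation
    return float(np.linalg.norm(numerical_jacobian(f, x), 2))
```

For a trained network, `f` would be the model's forward pass; a large score flags inputs where the response is sensitive to tiny perturbations.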
Updated: 2025-02-19 07:56:23
Domains: stat.ML, cs.AI, cs.LG
Interpreting Neurons in Deep Vision Networks with Language Models
In this paper, we propose Describe-and-Dissect (DnD), a novel method to describe the roles of hidden neurons in vision networks. DnD utilizes recent advancements in multimodal deep learning to produce complex natural language descriptions, without the need for labeled training data or a predefined set of concepts to choose from. Additionally, DnD is training-free, meaning we don't train any new models and can easily leverage more capable general purpose models in the future. We have conducted extensive qualitative and quantitative analysis to show that DnD outperforms prior work by providing higher quality neuron descriptions. Specifically, our method on average provides the highest quality labels and is more than 2$\times$ as likely to be selected as the best explanation for a neuron than the best baseline. Finally, we present a use case providing critical insights into land cover prediction models for sustainability applications. Our code and data are available at https://github.com/Trustworthy-ML-Lab/Describe-and-Dissect.
Updated: 2025-02-19 07:56:14
Domains: cs.CV, cs.LG
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs
Recent advancements in Multimodal Large Language Models (MLLMs) have rendered traditional visual captioning benchmarks obsolete, as they primarily evaluate short descriptions with outdated metrics. While recent benchmarks address these limitations by decomposing captions into visual elements and adopting model-based evaluation, they remain incomplete, overlooking critical aspects while providing vague, non-explanatory scores. To bridge this gap, we propose CV-CapBench, a Comprehensive Visual Caption Benchmark that systematically evaluates caption quality across 6 views and 13 dimensions. CV-CapBench introduces precision, recall, and hit rate metrics for each dimension, uniquely assessing both correctness and coverage. Experiments on leading MLLMs reveal significant capability gaps, particularly in dynamic and knowledge-intensive dimensions. These findings provide actionable insights for future research. The code and data will be released.
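Per-dimension precision, recall, and hit rate over extracted caption elements could look like the following set-based sketch (the benchmark's actual matching is model-based, so exact element comparison here is an assumption):

```python
def caption_metrics(predicted_elements, reference_elements):
    """Set-based precision/recall/hit-rate for one caption dimension."""
    pred, ref = set(predicted_elements), set(reference_elements)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0  # correctness
    recall = true_pos / len(ref) if ref else 0.0       # coverage
    hit = 1.0 if true_pos > 0 else 0.0                 # any match at all
    return precision, recall, hit
```

Reporting precision and recall separately is what lets a benchmark distinguish a caption that is correct but sparse from one that is exhaustive but hallucinated.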
Updated: 2025-02-19 07:55:51
Domains: cs.CV, cs.CL, cs.LG
Human-Artificial Interaction in the Age of Agentic AI: A System-Theoretical Approach
This paper presents a novel perspective on human-computer interaction (HCI), framing it as a dynamic interplay between human and computational agents within a networked system. Going beyond traditional interface-based approaches, we emphasize the importance of coordination and communication among heterogeneous agents with different capabilities, roles, and goals. A key distinction is made between multi-agent systems (MAS) and Centaurian systems, which represent two different paradigms of human-AI collaboration. MAS maintain agent autonomy, with structured protocols enabling cooperation, while Centaurian systems deeply integrate human and AI capabilities, creating unified decision-making entities. To formalize these interactions, we introduce a framework for communication spaces, structured into surface, observation, and computation layers, ensuring seamless integration between MAS and Centaurian architectures, where colored Petri nets effectively represent structured Centaurian systems and high-level reconfigurable networks address the dynamic nature of MAS. Our research has practical applications in autonomous robotics, human-in-the-loop decision making, and AI-driven cognitive architectures, and provides a foundation for next-generation hybrid intelligence systems that balance structured coordination with emergent behavior.
Updated: 2025-02-19 07:55:34
Domains: cs.MA, cs.AI, cs.HC
OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment
Although multi-agent collaborative Large Language Models (LLMs) have achieved significant breakthroughs in the Text-to-SQL task, their performance is still constrained by various factors. These factors include the incompleteness of the framework, failure to follow instructions, and model hallucination problems. To address these problems, we propose OpenSearch-SQL, which divides the Text-to-SQL task into four main modules: Preprocessing, Extraction, Generation, and Refinement, along with an Alignment module based on a consistency alignment mechanism. This architecture aligns the inputs and outputs of agents through the Alignment module, reducing failures in instruction following and hallucination. Additionally, we designed an intermediate language called SQL-Like and optimized the structured CoT based on SQL-Like. Meanwhile, we developed a dynamic few-shot strategy in the form of self-taught Query-CoT-SQL. These methods have significantly improved the performance of LLMs in the Text-to-SQL task. In terms of model selection, we directly applied the base LLMs without any post-training, thereby simplifying the task chain and enhancing the framework's portability. Experimental results show that OpenSearch-SQL achieves an execution accuracy (EX) of 69.3% on the BIRD development set, 72.28% on the test set, and a reward-based validity efficiency score (R-VES) of 69.36%, with all three metrics ranking first at the time of submission. These results demonstrate the comprehensive advantages of the proposed method in both effectiveness and efficiency.
Updated: 2025-02-19 07:51:50
Domains: cs.CL, cs.AI, cs.IR
PILOT: A Pre-Trained Model-Based Continual Learning Toolbox
While traditional machine learning can effectively tackle a wide range of problems, it primarily operates within a closed-world setting, which presents limitations when dealing with streaming data. As a solution, incremental learning emerges to address real-world scenarios involving the arrival of new data. Recently, pre-training has made significant advancements and garnered the attention of numerous researchers. The strong performance of these pre-trained models (PTMs) presents a promising avenue for developing continual learning algorithms that can effectively adapt to real-world scenarios. Consequently, exploring the utilization of PTMs in incremental learning has become essential. This paper introduces a pre-trained model-based continual learning toolbox known as PILOT. On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt. On the other hand, PILOT also fits typical class-incremental learning algorithms (e.g., DER, FOSTER, and MEMO) within the context of pre-trained models to evaluate their effectiveness.
Updated: 2025-02-19 07:44:10
Domains: cs.LG, cs.CV
DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
Large language models (LLMs) enabled dialogue systems have become one of the central modes in human-machine interaction, which brings about vast amounts of conversation logs and increasing demand for dialogue generation. The dialogue's life-cycle spans from Prelude through Interlocution to Epilogue, encompassing rich dialogue elements. Despite large volumes of dialogue-related studies, there is a lack of systematic investigation into the dialogue stages to frame benchmark construction that covers comprehensive dialogue elements. This hinders the precise modeling, generation and assessment of LLM-based dialogue systems. To bridge this gap, in this paper, we introduce a new research task, Dialogue Element MOdeling, including Element Awareness and Dialogue Agent Interaction, and propose a novel benchmark, DEMO, designed for comprehensive dialogue modeling and assessment. On this basis, we further build the DEMO agent with the adept ability to model dialogue elements via imitation learning. Extensive experiments on DEMO indicate that current representative LLMs still have considerable potential for enhancement, and our DEMO agent performs well in both dialogue element modeling and out-of-domain tasks.
Updated: 2025-02-19 07:42:25
Domains: cs.CL, cs.AI, cs.LG
Towards Active Participant Centric Vertical Federated Learning: Some Representations May Be All You Need
Existing Vertical FL (VFL) methods often struggle with realistic and unaligned data partitions, and incur high communication costs and significant operational complexity. This work introduces a novel approach to VFL, Active Participant Centric VFL (APC-VFL), that excels in scenarios where data samples among participants are only partially aligned at training time. Among its strengths, APC-VFL requires only a single communication step with the active participant. This is made possible through a local, unsupervised representation learning stage at each participant followed by a knowledge distillation step in the active participant. Compared to other VFL methods such as SplitNN or VFedTrans, APC-VFL consistently outperforms them across three popular VFL datasets in terms of F1, accuracy and communication costs as the ratio of aligned data is reduced.
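The knowledge-distillation step at the active participant can be illustrated with the standard temperature-softened KL objective (a generic sketch; the paper's exact loss may differ):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float((p * (np.log(p) - np.log(q))).sum() * temperature ** 2)
```

The loss is zero when the student already matches the teacher and grows as their softened predictions diverge, which is what lets a single communication round transfer the aligned participants' knowledge.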
Updated: 2025-02-19 07:38:12
Categories: cs.LG
Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection
Advanced large language models (LLMs) can generate text almost indistinguishable from human-written text, highlighting the importance of LLM-generated text detection. However, current zero-shot techniques face challenges: white-box methods are restricted to using weaker open-source LLMs, and black-box methods are limited by the partial observations available from stronger proprietary LLMs. It seems impossible to enable white-box methods to use proprietary models, because API-level access provides neither full predictive distributions nor inner embeddings. To bridge this divide, we propose **Glimpse**, a probability distribution estimation approach that predicts full distributions from partial observations. Despite its simplicity, Glimpse successfully extends white-box methods such as Entropy, Rank, Log-Rank, and Fast-DetectGPT to the latest proprietary models. Experiments show that Glimpse with Fast-DetectGPT and GPT-3.5 achieves an average AUROC of about 0.95 across five of the latest source models, closing 51% of the remaining gap above the open-source baseline. This demonstrates that the latest LLMs can effectively detect their own outputs, suggesting that advanced LLMs may be the best shield against themselves. We release our code and data at https://github.com/baoguangsheng/glimpse.
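Glimpse's actual estimator is not reproduced here, but the general idea, extrapolating the ranked top-k probabilities an API does expose with a decaying tail so that white-box scores such as entropy can be computed over an estimated full distribution, can be sketched as follows. The geometric-tail fit, the function name, and all parameters are illustrative assumptions:

```python
import math

def estimate_full_entropy(topk_probs, vocab_size):
    """Estimate Shannon entropy of a full next-token distribution from the
    top-k probabilities an API exposes, by extrapolating the unseen tail
    with a geometric decay. Toy illustration; Glimpse's estimator differs.
    Assumes k >= 2 and strictly positive probabilities."""
    p = sorted(topk_probs, reverse=True)
    remaining = max(1.0 - sum(p), 0.0)        # probability mass outside top-k
    n_tail = vocab_size - len(p)
    if n_tail > 0 and remaining > 0:
        # decay ratio fitted from the two smallest observed probabilities
        r = min(max(p[-1] / p[-2], 1e-6), 0.999)
        tail = [p[-1] * r ** (i + 1) for i in range(n_tail)]
        s = sum(tail)
        tail = [t * remaining / s for t in tail]   # rescale to the unseen mass
    else:
        tail = []
    full = p + tail
    return -sum(q * math.log(q) for q in full if q > 0)
```

With the full distribution estimated this way, any score a white-box detector computes from the distribution (entropy, rank statistics, curvature) becomes available for a proprietary model.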
Updated: 2025-02-19 07:37:55
Categories: cs.CL,cs.AI
Scalable Decentralized Algorithms for Online Personalized Mean Estimation
In numerous settings, agents lack sufficient data to directly learn a model. Collaborating with other agents may help, but it introduces a bias-variance trade-off when local data distributions differ. A key challenge is for each agent to identify peers with similar distributions while learning the model, a problem that remains largely unresolved. This study focuses on a simplified version of the overarching problem, where each agent collects samples from a real-valued distribution over time to estimate its mean. Existing algorithms face impractical space and time complexities (quadratic in the number of agents |A|). To address scalability challenges, we propose a framework in which agents self-organize into a graph, allowing each agent to communicate with only a selected number r of peers. We introduce two collaborative mean estimation algorithms: one draws inspiration from belief propagation, while the other employs a consensus-based approach, with complexities of O(r|A| log |A|) and O(r|A|), respectively. We establish conditions under which both algorithms yield asymptotically optimal estimates and offer a theoretical characterization of their performance.
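As a rough illustration of the consensus-based flavor (not the paper's algorithm), the sketch below lets agents on a fixed communication graph repeatedly average their running estimates with their neighbors'; on a connected regular graph this converges to the global mean:

```python
def consensus_means(local_means, neighbors, steps=500, step_size=0.3):
    """Gossip-style consensus averaging: each agent nudges its estimate toward
    the average of its neighbors' estimates. On a connected regular graph with
    a suitable step size, all estimates converge to the global mean.
    Illustrative sketch only, not the algorithm proposed in the paper."""
    x = list(local_means)
    for _ in range(steps):
        x = [xi + step_size * (sum(x[j] for j in nbrs) / len(nbrs) - xi)
             for xi, nbrs in zip(x, neighbors)]
    return x

# Five agents on a ring, each talking to r = 2 neighbors.
n = 5
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
estimates = consensus_means([1.0, 2.0, 3.0, 4.0, 10.0], ring)
```

Each agent touches only its r neighbors per round, which is the source of the O(r|A|) per-iteration cost the abstract cites.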
Updated: 2025-02-19 07:36:58
Categories: cs.LG,cs.DC
Online Physics-Informed Dynamic Mode Decomposition: Theory and Applications
Dynamic Mode Decomposition (DMD) has received increasing research attention due to its capability to analyze and model complex dynamical systems. However, it faces challenges in computational efficiency and noise sensitivity, and has difficulty adhering to physical laws, which negatively affect its performance. Addressing these issues, we present Online Physics-informed DMD (OPIDMD), a novel adaptation of DMD into a convex optimization framework. This approach not only ensures convergence to a unique global optimum, but also enhances the efficiency and accuracy of modeling dynamical systems in an online setting. Leveraging the Bayesian DMD framework, we propose a probabilistic interpretation of Physics-informed DMD (piDMD), examining the impact of physical constraints on the DMD linear operator. Further, we implement online proximal gradient descent and formulate specific algorithms to tackle problems with different physical constraints, enabling real-time solutions across various scenarios. Compared with existing algorithms such as Exact DMD, Online DMD, and piDMD, OPIDMD achieves the best prediction performance in short-term forecasting, e.g., an $R^2$ value of 0.991 on the noisy Lorenz system. The proposed method employs a time-varying linear operator, offering a promising solution for the real-time simulation and control of complex dynamical systems.
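The online proximal gradient idea can be sketched in one dimension: stream (state, next-state) pairs, take a gradient step on the squared residual of x_next ~= a * x_t, then apply a proximal map (soft-thresholding here, standing in for a physical constraint). This is a toy stand-in, not OPIDMD itself; the function name and parameters are illustrative assumptions:

```python
import random

def online_operator_fit(pairs, lr=0.1, lam=0.0):
    """Online proximal gradient descent for a scalar linear operator a in
    x_next ~= a * x_t: one gradient step on the squared residual per sample,
    followed by soft-thresholding (the proximal map of lam * |a|).
    Toy 1-D sketch; OPIDMD fits matrix operators under physical constraints."""
    a = 0.0
    for x_t, x_next in pairs:
        grad = -2.0 * (x_next - a * x_t) * x_t   # d/da of (x_next - a * x_t)^2
        a -= lr * grad
        # proximal step: shrink toward zero by lr * lam
        a = max(abs(a) - lr * lam, 0.0) * (1.0 if a >= 0 else -1.0)
    return a

# Stream observations of a true operator a = 0.9.
rng = random.Random(0)
pairs = [(x, 0.9 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(2000))]
a_hat = online_operator_fit(pairs)
```

Because each sample triggers one cheap gradient-plus-prox step, the operator estimate tracks the stream in real time, which is the online setting the abstract describes.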
Updated: 2025-02-19 07:36:02
Categories: cs.LG,nlin.AO
Hidden Darkness in LLM-Generated Designs: Exploring Dark Patterns in Ecommerce Web Components Generated by LLMs
Recent work has highlighted the risks of LLM-generated content for a wide range of harmful behaviors, including incorrect and harmful code. In this work, we extend this by studying whether LLM-generated web design contains dark patterns. This work evaluated designs of ecommerce web components generated by four popular LLMs: Claude, GPT, Gemini, and Llama. We tested 13 commonly used ecommerce components (e.g., search, product reviews) and used them as prompts to generate a total of 312 components across all models. Over one-third of generated components contain at least one dark pattern. The majority of dark pattern strategies involve hiding crucial information, limiting users' actions, and manipulating them into making decisions through a sense of urgency. Dark patterns are also more frequently produced in components that are related to company interests. These findings highlight the need for interventions to prevent dark patterns during front-end code generation with LLMs and emphasize the importance of expanding ethical design education to a broader audience.
Updated: 2025-02-19 07:35:07
Categories: cs.HC,cs.AI,cs.LG
A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior
Image watermarks have been considered a promising technique to help detect AI-generated content, which can be used to protect copyright or prevent fake image abuse. In this work, we present a black-box method for removing invisible image watermarks, without the need of any dataset of watermarked images or any knowledge about the watermark system. Our approach is simple to implement: given a single watermarked image, we regress it with a deep image prior (DIP). We show that from the intermediate steps of DIP one can reliably find an evasion image that removes invisible watermarks while preserving high image quality. Due to its unique working mechanism and practical effectiveness, we advocate including DIP as a baseline evasion method for benchmarking the robustness of watermarking systems. Finally, by showing the limited ability of DIP and other existing black-box methods in evading training-based visible watermarks, we discuss the positive implications for the practical use of training-based visible watermarks to prevent misinformation abuse.
Updated: 2025-02-19 07:30:19
Categories: eess.IV,cs.AI,cs.CR,cs.CV
A Study on Monthly Marine Heatwave Forecasts in New Zealand: An Investigation of Imbalanced Regression Loss Functions with Neural Network Models
Marine heatwaves (MHWs) are extreme ocean-temperature events with significant impacts on marine ecosystems and related industries. Accurate forecasts (one to six months ahead) of MHWs would aid in mitigating these impacts. However, forecasting MHWs presents a challenging imbalanced regression task due to the rarity of extreme temperature anomalies in comparison to more frequent moderate conditions. In this study, we examine monthly MHW forecasts for 12 locations around New Zealand. We use a fully-connected neural network and compare standard and specialized regression loss functions, including the mean squared error (MSE), the mean absolute error (MAE), the Huber, the weighted MSE, the focal-R, the balanced MSE, and a proposed scaling-weighted MSE. Results show that (i) short lead times (one month) are considerably more predictable than three- and six-month leads, (ii) models trained with the standard MSE or MAE losses excel at forecasting average conditions but struggle to capture extremes, and (iii) specialized loss functions such as the balanced MSE and our scaling-weighted MSE substantially improve forecasting of MHW and suspected MHW events. These findings underscore the importance of tailored loss functions for imbalanced regression, particularly in forecasting rare but impactful events such as MHWs.
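A weighted squared-error loss of the kind compared in the study can be sketched as follows; the weight function here, 1 + scale * |target|, is an illustrative choice, not necessarily the paper's weighted or scaling-weighted MSE:

```python
def weighted_mse(preds, targets, scale=2.0):
    """Mean squared error with per-sample weights that grow with the magnitude
    of the target anomaly, so rare extremes (heatwave-level anomalies) count
    more than the frequent moderate conditions that dominate the data.
    The weight 1 + scale * |target| is an illustrative choice, not the exact
    form used in the paper."""
    assert len(preds) == len(targets)
    return sum((1.0 + scale * abs(t)) * (p - t) ** 2
               for p, t in zip(preds, targets)) / len(preds)
```

Under this loss, the same absolute error costs more on an extreme target than on a moderate one, which pushes the model to fit the rare events instead of regressing toward the mean.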
Updated: 2025-02-19 07:27:51
Categories: physics.ao-ph,cs.LG,stat.AP
Geometry of Lightning Self-Attention: Identifiability and Dimension
We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
Updated: 2025-02-19 07:27:34
Categories: cs.LG,math.AG
Universal Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery
We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, to capture latent knowledge and contextual relationships specific to alloys. These semantic embeddings serve as robust elemental descriptors, consistently outperforming traditional empirical descriptors with significant improvements across multiple downstream tasks. These include predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization. Applications to titanium alloys, high-entropy alloys, and shape memory alloys demonstrate up to 23% gains in prediction accuracy. Our results show that ElementBERT surpasses general-purpose BERT variants by encoding specialized alloy knowledge. By bridging contextual insights from scientific literature with quantitative inference, our framework accelerates the discovery and optimization of advanced materials, with potential applications extending beyond alloys to other material classes.
Updated: 2025-02-19 07:26:03
Categories: cs.CL,cond-mat.mtrl-sci,cs.LG
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
Large Reasoning Models (LRMs) have significantly advanced beyond traditional Large Language Models (LLMs) with their exceptional logical reasoning capabilities, yet these improvements introduce heightened safety risks. When subjected to jailbreak attacks, their ability to generate more targeted and organized content can lead to greater harm. Although some studies claim that reasoning enables safer LRMs against existing LLM attacks, they overlook the inherent flaws within the reasoning process itself. To address this gap, we propose the first jailbreak attack targeting LRMs, exploiting their unique vulnerabilities stemming from the advanced reasoning capabilities. Specifically, we introduce a Chaos Machine, a novel component to transform attack prompts with diverse one-to-one mappings. The chaos mappings iteratively generated by the machine are embedded into the reasoning chain, which strengthens the variability and complexity and also promotes a more robust attack. Based on this, we construct the Mousetrap framework, which makes attacks projected into nonlinear-like low sample spaces with mismatched generalization enhanced. Also, due to the more competing objectives, LRMs gradually maintain the inertia of unpredictable iterative reasoning and fall into our trap. Success rates of the Mousetrap attacking o1-mini, claude-sonnet and gemini-thinking are as high as 96%, 86% and 98% respectively on our toxic dataset Trotter. On benchmarks such as AdvBench, StrongREJECT, and HarmBench, attacking claude-sonnet, well-known for its safety, Mousetrap can astonishingly achieve success rates of 87.5%, 86.58% and 93.13% respectively. Attention: This paper contains inappropriate, offensive and harmful content.
Updated: 2025-02-19 07:23:36
Categories: cs.CR,cs.AI,cs.CL,cs.LG
What are Models Thinking about? Understanding Large Language Model Hallucinations "Psychology" through Model Inner State Analysis
Large language model (LLM) systems suffer from the models' unstable ability to generate valid and factual content, resulting in hallucinations. Current hallucination detection methods rely heavily on out-of-model information sources, such as RAG, to assist detection, which brings heavy additional latency. Recently, the internal states of LLM inference have been widely used in numerous research works, such as prompt injection detection. Considering the interpretability of LLM internal states and the fact that they require no external information sources, we introduce such states into LLM hallucination detection. In this paper, we systematically analyze the features that different internal states reveal during the forward pass and comprehensively evaluate their ability in hallucination detection. Specifically, we cut the forward process of a large language model into three stages: understanding, query, and generation, and extract the internal states from these stages. By analyzing these states, we provide a deep understanding of why hallucinated content is generated and what happens in the models' internal states. We then introduce these internal states into hallucination detection and conduct comprehensive experiments to discuss their advantages and limitations.
Updated: 2025-02-19 07:23:18
Categories: cs.CL,cs.AI
Transferring Textual Preferences to Vision-Language Understanding through Model Merging
Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.
Updated: 2025-02-19 07:20:07
Categories: cs.CL,cs.AI,cs.CV,cs.LG
Kernel Mean Embedding Topology: Weak and Strong Forms for Stochastic Kernels and Implications for Model Learning
We introduce a novel topology for stochastic kernels, called the Kernel Mean Embedding Topology, in both weak and strong forms. Defined on spaces of Bochner-integrable functions from a signal space to a space of probability measures endowed with a Hilbert space structure, this construction yields a versatile formulation in both a strong and a weak variant. (i) For the weak formulation, we highlight its utility on relaxed policy spaces, investigate connections with the Young narrow topology and the Borkar (or $w^*$-) topology, and establish equivalence properties. We report that, while both the $w^*$-topology and the kernel mean embedding topology are relatively compact, they are not closed; conversely, while the Young narrow topology is closed, it lacks relative compactness. (ii) We show that the strong form provides an appropriate formulation for placing topologies on spaces of models characterized by stochastic kernels, with explicit robustness and learning-theoretic implications for optimal stochastic control under discounted or average cost criteria. (iii) We show that this topology possesses several properties making it ideal for studying optimality, approximations, robustness, and continuity. In particular, the kernel mean embedding topology has a Hilbert space structure, which is particularly useful for approximating stochastic kernels from simulation data.
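The RKHS distance underlying such a topology can be illustrated concretely: embed each distribution as the mean of its kernel features and take the Hilbert-space norm of the difference, i.e. the maximum mean discrepancy. A small empirical sketch with a Gaussian kernel on the real line (illustrative only, not the paper's construction on spaces of stochastic kernels):

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel on the real line."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd_squared(xs, ys, gamma=1.0):
    """Squared RKHS distance between the kernel mean embeddings of two
    empirical distributions (a biased V-statistic estimate of MMD^2):
    ||mu_X - mu_Y||^2 = E k(x,x') + E k(y,y') - 2 E k(x,y)."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy
```

The Hilbert-space structure is what makes this estimable from samples alone, which is the property the abstract highlights for approximating stochastic kernels through simulation data.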
Updated: 2025-02-19 07:19:41
Categories: eess.SY,cs.LG,cs.SY,math.OC,math.ST,stat.TH
Stochastic Security as a Performance Metric for Quantum-enhanced Generative AI
Motivated by applications of quantum computers in Gibbs sampling from continuous real-valued functions, we ask whether such algorithms can provide practical advantages for machine learning models trained on classical data and seek measures for quantifying such impacts. In this study, we focus on deep energy-based models (EBM), as they require continuous-domain Gibbs sampling both during training and inference. In lieu of fault-tolerant quantum computers that can execute quantum Gibbs sampling algorithms, we use the Monte Carlo simulation of diffusion processes as a classical alternative. More specifically, we investigate whether long-run persistent chain Monte Carlo simulation of Langevin dynamics improves the quality of the representations achieved by EBMs. We consider a scheme in which the Monte Carlo simulation of a diffusion, whose drift is given by the gradient of the energy function, is used to improve the adversarial robustness and calibration score of an independent classifier network. Our results show that increasing the computational budget of Gibbs sampling in persistent contrastive divergence improves both the calibration and adversarial robustness of the model, suggesting a prospective avenue of quantum advantage for generative AI using future large-scale quantum computers.
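The classical stand-in the paper uses, simulating a diffusion whose drift is the negative energy gradient, can be sketched in one dimension with unadjusted Langevin dynamics. This is an illustrative toy, not the paper's EBM setup:

```python
import math
import random

def langevin_sample(grad_energy, steps=2000, step_size=0.01, x0=0.0, seed=0):
    """Unadjusted Langevin dynamics:
        x <- x - eta * grad E(x) + sqrt(2 * eta) * N(0, 1).
    Run long enough, the chain approximately samples the Gibbs density
    proportional to exp(-E(x)). 1-D toy sketch of the diffusion simulation
    used here as a classical alternative to quantum Gibbs sampling."""
    rng = random.Random(seed)
    x = x0
    noise = math.sqrt(2.0 * step_size)
    for _ in range(steps):
        x = x - step_size * grad_energy(x) + noise * rng.gauss(0.0, 1.0)
    return x

# For E(x) = x^2 / 2, the Gibbs density exp(-E) is the standard normal.
samples = [langevin_sample(lambda x: x, seed=s) for s in range(300)]
```

Increasing `steps` is the 1-D analogue of the larger Gibbs-sampling budget whose effect on calibration and robustness the paper measures.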
Updated: 2025-02-19 07:11:03
Categories: cs.LG,cs.AI,math.OC,quant-ph
Smoothed Normalization for Efficient Distributed Private Optimization
Federated learning enables training machine learning models while preserving the privacy of participants. Surprisingly, there is no differentially private distributed method for smooth, non-convex optimization problems. The reason is that standard privacy techniques require bounding the participants' contributions, usually enforced via $\textit{clipping}$ of the updates. Existing literature typically ignores the effect of clipping by assuming the boundedness of gradient norms or analyzes distributed algorithms with clipping but ignores DP constraints. In this work, we study an alternative approach via $\textit{smoothed normalization}$ of the updates motivated by its favorable performance in the single-node setting. By integrating smoothed normalization with an error-feedback mechanism, we design a new distributed algorithm $\alpha$-$\sf NormEC$. We prove that our method achieves a superior convergence rate over prior works. By extending $\alpha$-$\sf NormEC$ to the DP setting, we obtain the first differentially private distributed optimization algorithm with provable convergence guarantees. Finally, our empirical results from neural network training indicate robust convergence of $\alpha$-$\sf NormEC$ across different parameter settings.
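The smoothed-normalization map itself is simple to write down: instead of clipping, scale the gradient by 1/(alpha + ||g||), which bounds the output norm by 1 while remaining smooth at the origin. A minimal sketch of this building block (error feedback, distribution, and DP noise omitted):

```python
import math

def smoothed_normalize(grad, alpha=1.0):
    """Smoothed normalization of a gradient vector: g / (alpha + ||g||).
    The result has norm ||g|| / (alpha + ||g||) < 1 for any input, bounding
    each participant's contribution without the hard cutoff of clipping,
    and the map stays smooth near zero. Sketch of the building block behind
    alpha-NormEC; the full algorithm adds error feedback and DP noise."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = 1.0 / (alpha + norm)
    return [g * scale for g in grad]
```

Because the output norm is bounded for every input, the usual DP noise calibration applies without assuming bounded gradients, which is the property clipping is normally used to enforce.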
Updated: 2025-02-19 07:10:32
Categories: cs.LG,cs.CR,cs.DC,math.OC,stat.ML
Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs
In this paper, we introduce Astra, an efficient and money-saving automatic parallel-strategy search framework for heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy over both the GPU configuration search space (GPU types and counts) and the parallel-parameter search space. Second, Astra handles heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to propose automatic parallel-strategy search aimed at saving money. Experimental results demonstrate that Astra achieves better throughput than expert-designed strategies. Astra's search time is limited to 1.27 seconds in a single-GPU setting and less than 1.35 minutes on average in a heterogeneous-GPU setting, with an accuracy of over 95%.
Updated: 2025-02-19 07:08:37
Categories: cs.DC,cs.AI
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities on widely benchmarked high-resource languages; however, linguistic nuances of under-resourced languages remain unexplored. We introduce Batayan, a holistic Filipino benchmark designed to systematically evaluate LLMs across three key natural language processing (NLP) competencies: understanding, reasoning, and generation. Batayan consolidates eight tasks, covering both Tagalog and code-switched Taglish utterances. Our rigorous, native-speaker-driven annotation process ensures fluency and authenticity to the complex morphological and syntactic structures of Filipino, alleviating a pervasive translationese bias in existing Filipino corpora. We report empirical results on a variety of multilingual LLMs, highlighting significant performance gaps that signal the under-representation of Filipino in pretraining corpora, the unique hurdles in modeling Filipino's rich morphology and construction, and the importance of explicit Filipino language support and instruction tuning. Moreover, we discuss the practical challenges encountered in dataset construction and propose principled solutions for building culturally and linguistically-faithful resources in under-represented languages. We also provide a public benchmark and leaderboard as a clear foundation for iterative, community-driven progress in Filipino NLP.
Updated: 2025-02-19 07:03:15
Categories: cs.CL,cs.AI
Large Continual Instruction Assistant
Continual Instruction Tuning (CIT) is adopted to continually instruct large models to follow human intent, dataset by dataset. It is observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. In contrast, the Exponential Moving Average (EMA) can trace previous parameters, which helps decrease forgetting. Nonetheless, its static balance weight cannot cope with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update, we derive an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined from the gradients and learned parameters. We therefore propose a stability-plasticity balanced coefficient to avoid knowledge confusion. Based on the semantic similarity of the instructions, we can decide whether to retrain or expand the training parameters and allocate the most suitable parameters to the test instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. For example, based on LLaVA-7B, forgetting is reduced from 5.42 to 1.93. Our code will be made publicly available soon.
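The EMA mechanism the paper builds on can be sketched as follows; here the balance weight beta is a fixed constant for illustration, whereas the paper determines it automatically from the gradients and learned parameters:

```python
def ema_update(ema_params, live_params, beta=0.99):
    """One exponential-moving-average step over model parameters:
    ema <- beta * ema + (1 - beta) * live. The EMA copy traces past
    parameters, damping forgetting as the live model drifts onto new data.
    Illustrative sketch; the paper adapts beta instead of fixing it."""
    return [beta * e + (1.0 - beta) * p
            for e, p in zip(ema_params, live_params)]
```

A large beta keeps the EMA close to old parameters (stability), a small beta lets it follow the live model (plasticity); the paper's contribution is choosing this trade-off automatically per update.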
Updated: 2025-02-19 07:01:35
Categories: cs.LG,cs.AI,cs.CL
The Majority Vote Paradigm Shift: When Popular Meets Optimal
Reliably labelling data typically requires annotations from multiple human workers. However, humans are far from being perfect. Hence, it is a common practice to aggregate labels gathered from multiple annotators to make a more confident estimate of the true label. Among many aggregation methods, the simple and well-known Majority Vote (MV) selects the class label polling the highest number of votes. However, despite its importance, the optimality of MV's label aggregation has not been extensively studied. We address this gap in our work by characterising the conditions under which MV achieves the theoretically optimal lower bound on label estimation error. Our results capture the tolerable limits on annotation noise under which MV can optimally recover labels for a given class distribution. This certificate of optimality provides a more principled approach to model selection for label aggregation, as an alternative to otherwise inefficient practices that sometimes involve more senior experts, gold labels, etc., all marred by the same human uncertainty despite huge time and monetary costs. Experiments on both synthetic and real-world data corroborate our theoretical findings.
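Majority-vote aggregation, and the way its error shrinks with more independent, better-than-chance annotators, can be sketched as:

```python
import random
from collections import Counter

def majority_vote(annotations):
    """Label one item by the class with the most votes; ties go to the label
    that first reaches the top count (Counter's insertion order)."""
    return Counter(annotations).most_common(1)[0][0]

def mv_accuracy(n_annotators, p_correct, trials=2000, seed=0):
    """Monte Carlo estimate of majority-vote accuracy on binary labels when
    each annotator is independently correct with probability p_correct.
    Use an odd n_annotators to avoid ties."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = ["true" if rng.random() < p_correct else "false"
                 for _ in range(n_annotators)]
        hits += majority_vote(votes) == "true"
    return hits / trials
```

The simulation illustrates the regime the paper characterizes exactly: as long as annotators are independently better than chance, adding voters drives MV's error down, and the paper's result pins down when this matches the optimal lower bound.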
Updated: 2025-02-19 07:01:27
Subjects: stat.ML,cs.AI,cs.LG
FragFM: Efficient Fragment-Based Molecular Generation via Discrete Flow Matching
We introduce FragFM, a novel fragment-based discrete flow matching framework for molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoding mechanism to reconstruct atom-level details. This approach reduces computational complexity while maintaining high chemical validity, enabling more efficient and scalable molecular generation. We benchmark FragFM against state-of-the-art diffusion- and flow-based models on standard molecular generation benchmarks and natural product datasets, demonstrating superior performance in validity, property control, and sampling efficiency. Notably, FragFM achieves over 99% validity with significantly fewer sampling steps, improving scalability while preserving molecular diversity. These results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.
Updated: 2025-02-19 07:01:00
Subjects: cs.LG,cs.AI
Integration of Agentic AI with 6G Networks for Mission-Critical Applications: Use-case and Challenges
We are in a transformative era, and advances in Artificial Intelligence (AI), especially the foundational models, are constantly in the news. AI has been an integral part of many applications that rely on automation for service delivery, and one of them is mission-critical public safety applications. The problem with AI-oriented mission-critical applications is the human-in-the-loop system and the lack of adaptability to dynamic conditions while maintaining situational awareness. Agentic AI (AAI) has gained a lot of attention recently due to its ability to analyze textual data through a contextual lens while quickly adapting to conditions. In this context, this paper proposes an AAI framework for mission-critical applications. We propose a novel framework with a multi-layer architecture to realize the AAI. We also present a detailed implementation of the AAI layer that bridges the gap between network infrastructure and mission-critical applications. Our preliminary analysis shows that the AAI reduces initial response time by 5.6 minutes on average, while alert generation time is reduced by 15.6 seconds on average and resource allocation is improved by up to 13.4%. We also show that the AAI methods improve the number of concurrent operations by 40, which reduces the recovery time by up to 5.2 minutes. Finally, we highlight some of the issues and challenges that need to be considered when implementing AAI frameworks.
Updated: 2025-02-19 07:00:53
Subjects: cs.AI,cs.NI
Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis
Due to the widespread use of LLMs and the rising critical ethical and safety concerns, LLM unlearning methods have been developed to remove harmful knowledge and undesirable capabilities. In this context, evaluations are mostly based on single-value metrics such as QA accuracy. However, these metrics often fail to capture the nuanced retention of harmful knowledge components, making it difficult to assess the true effectiveness of unlearning. To address this issue, we propose UNCD (UNlearning evaluation via Cognitive Diagnosis), a novel framework that leverages Cognitive Diagnosis Modeling for fine-grained evaluation of LLM unlearning. Our dedicated benchmark, UNCD-Cyber, provides a detailed assessment of the removal of dangerous capabilities. Moreover, we introduce UNCD-Agent, which refines unlearning by diagnosing knowledge remnants and generating targeted unlearning data. Extensive experiments across eight unlearning methods and two base models demonstrate that UNCD not only enhances evaluation but also effectively facilitates the removal of harmful LLM abilities.
Updated: 2025-02-19 06:56:59
Subjects: cs.LG
Simplifying Formal Proof-Generating Models with ChatGPT and Basic Searching Techniques
The challenge of formal proof generation has a rich history, but with modern techniques, we may finally be at the stage of making actual progress in real-life mathematical problems. This paper explores the integration of ChatGPT and basic searching techniques to simplify generating formal proofs, with a particular focus on the miniF2F dataset. We demonstrate how combining a large language model like ChatGPT with a formal language such as Lean, which has the added advantage of being verifiable, enhances the efficiency and accessibility of formal proof generation. Despite its simplicity, our best-performing Lean-based model surpasses all known benchmarks with a 31.15% pass rate. We extend our experiments to include other datasets and employ alternative language models, showcasing our models' comparable performance in diverse settings and allowing for a more nuanced analysis of our results. Our findings offer insights into AI-assisted formal proof generation, suggesting a promising direction for future research in formal mathematical proof.
Updated: 2025-02-19 06:52:46
Subjects: cs.LO,cs.AI
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Updated: 2025-02-19 06:48:30
Subjects: cs.LG,cs.AI,cs.CL,cs.CV,cs.CY
Uncertainty-Aware Graph Structure Learning
Graph Neural Networks (GNNs) have become a prominent approach for learning from graph-structured data. However, their effectiveness can be significantly compromised when the graph structure is suboptimal. To address this issue, Graph Structure Learning (GSL) has emerged as a promising technique that refines node connections adaptively. Nevertheless, we identify two key limitations in existing GSL methods: 1) Most methods primarily focus on node similarity to construct relationships, while overlooking the quality of node information. Blindly connecting low-quality nodes and aggregating their ambiguous information can degrade the performance of other nodes. 2) The constructed graph structures are often constrained to be symmetric, which may limit the model's flexibility and effectiveness. To overcome these limitations, we propose an Uncertainty-aware Graph Structure Learning (UnGSL) strategy. UnGSL estimates the uncertainty of node information and utilizes it to adjust the strength of directional connections, where the influence of nodes with high uncertainty is adaptively reduced. Importantly, UnGSL serves as a plug-in module that can be seamlessly integrated into existing GSL methods with minimal additional computational cost. In our experiments, we implement UnGSL into six representative GSL methods, demonstrating consistent performance improvements.
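The uncertainty-weighted directional reweighting can be sketched as follows. The normalized-entropy estimator and the linear source-node scaling are plausible instantiations chosen for illustration; the paper's exact estimator may differ:

```python
import numpy as np

def entropy_uncertainty(probs, eps=1e-12):
    """Per-node uncertainty as normalized entropy of its predicted class
    distribution (one of several plausible uncertainty estimators)."""
    p = np.clip(probs, eps, 1.0)
    h = -(p * np.log(p)).sum(axis=1)
    return h / np.log(probs.shape[1])

def reweight_edges(adj, uncertainty):
    """Scale each directed edge j -> i (stored at adj[i, j]) by the confidence
    (1 - uncertainty) of the *source* node j, so high-uncertainty neighbours
    contribute less. The result is generally asymmetric even if adj is not."""
    return adj * (1.0 - uncertainty)[None, :]

probs = np.array([[0.9, 0.1],    # confident node 0
                  [0.5, 0.5]])   # maximally uncertain node 1
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
adj_w = reweight_edges(adj, entropy_uncertainty(probs))
```

Node 1's maximal entropy zeroes out its outgoing edge, while node 0's edge survives at reduced weight, giving the asymmetric structure the abstract argues for.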
Updated: 2025-02-19 06:47:40
Subjects: cs.LG
Some Insights of Construction of Feature Graph to Learn Pairwise Feature Interactions with Graph Neural Networks
Feature interaction is crucial in predictive machine learning models, as it captures the relationships between features that influence model performance. In this work, we focus on pairwise interactions and investigate their importance in constructing feature graphs for Graph Neural Networks (GNNs). Rather than proposing new methods, we leverage existing GNN models and tools to explore the relationship between feature graph structures and their effectiveness in modeling interactions. Through experiments on synthesized datasets, we uncover that edges between interacting features are important for enabling GNNs to model feature interactions effectively. We also observe that including non-interaction edges can act as noise, degrading model performance. Furthermore, we provide theoretical support for sparse feature graph selection using the Minimum Description Length (MDL) principle. We prove that feature graphs retaining only necessary interaction edges yield a more efficient and interpretable representation than complete graphs, aligning with Occam's Razor. Our findings offer both theoretical insights and practical guidelines for designing feature graphs that improve the performance and interpretability of GNN models.
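The idea of keeping only interaction edges in the feature graph can be sketched on synthetic data. The product-correlation score below is a crude proxy introduced for illustration, not the paper's construction:

```python
import numpy as np
from itertools import combinations

def interaction_strength(X, y, i, j):
    """Toy pairwise-interaction score: how strongly the product feature
    x_i * x_j correlates with the target (an illustrative proxy only)."""
    return abs(np.corrcoef(X[:, i] * X[:, j], y)[0, 1])

def build_feature_graph(X, y, threshold=0.3):
    """Keep an edge (i, j) only when the pair shows a strong interaction,
    yielding the sparse graph the MDL argument favors over a complete graph."""
    n = X.shape[1]
    return [(i, j) for i, j in combinations(range(n), 2)
            if interaction_strength(X, y, i, j) > threshold]

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 4))
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(500)  # only features 0, 1 interact
edges = build_feature_graph(X, y)
```

On this synthetic target only the truly interacting pair (0, 1) should survive the threshold; the remaining candidate edges are exactly the "noise" edges the abstract warns about.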
Updated: 2025-02-19 06:47:23
Subjects: cs.LG,cs.AI,stat.ML,68T07,I.2.6
Stacking as Accelerated Gradient Descent
Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of the Nesterov's accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.
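The stacking heuristic itself is simple to state in code; a minimal sketch on a deep linear residual network (the setting of the paper's acceleration result), with sizes and initialization scale chosen arbitrarily:

```python
import copy
import numpy as np

def stack_layer(layers):
    """Grow a residual network by one layer, initializing the new layer as a
    copy of the current top layer (the stacking heuristic)."""
    return layers + [copy.deepcopy(layers[-1])]

def forward(layers, x):
    """Deep linear residual network: x <- x + W x at each layer."""
    for w in layers:
        x = x + w @ x
    return x

rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.1, size=(4, 4))]
for _ in range(3):               # progressively deepen: 1 -> 4 layers
    layers = stack_layer(layers)

y = forward(layers, np.ones(4))
```

The paper's claim is that this copy-the-top-layer initialization implements a form of Nesterov momentum across depth, rather than merely being a warm start.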
Updated: 2025-02-19 06:46:24
Subjects: cs.LG,stat.ML
Does Editing Provide Evidence for Localization?
A basic aspiration for interpretability research in large language models is to "localize" semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To evaluate the localization claim, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.
Updated: 2025-02-19 06:45:25
Subjects: cs.LG,cs.AI,68T50,I.2.7; I.2.6; F.1.1
Large Language-Geometry Model: When LLM meets Equivariance
Accurately predicting 3D structures and dynamics of physical systems is crucial in scientific applications. Existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, but they often fall short in leveraging broader external information. While direct application of Large Language Models (LLMs) can incorporate external knowledge, they lack the capability for spatial reasoning with guaranteed equivariance. In this paper, we propose EquiLLM, a novel framework for representing 3D physical systems that seamlessly integrates E(3)-equivariance with LLM capabilities. Specifically, EquiLLM comprises four key components: geometry-aware prompting, an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the LLM guided by the instructive prompt serves as a sophisticated invariant feature processor, while 3D directional information is exclusively handled by the equivariant encoder and adaptor modules. Experimental results demonstrate that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.
Updated: 2025-02-19 06:41:42
Subjects: cs.LG,cs.AI
Generative Detail Enhancement for Physically Based Materials
We present a tool for enhancing the detail of physically based materials using an off-the-shelf diffusion model and inverse rendering. Our goal is to enhance the visual fidelity of materials with detail that is often tedious to author, by adding signs of wear, aging, weathering, etc. As these appearance details are often rooted in real-world processes, we leverage a generative image model trained on a large dataset of natural images with corresponding visuals in context. Starting with a given geometry, UV mapping, and basic appearance, we render multiple views of the object. We use these views, together with an appearance-defining text prompt, to condition a diffusion model. The details it generates are then backpropagated from the enhanced images to the material parameters via inverse differentiable rendering. For inverse rendering to be successful, the generated appearance has to be consistent across all the images. We propose two priors to address the multi-view consistency of the diffusion model. First, we ensure that the initial noise that seeds the diffusion process is itself consistent across views by integrating it from a view-independent UV space. Second, we enforce geometric consistency by biasing the attention mechanism via a projective constraint so that pixels attend strongly to their corresponding pixel locations in other views. Our approach does not require any training or finetuning of the diffusion model, is agnostic of the material model used, and the enhanced material properties, i.e., 2D PBR textures, can be further edited by artists.
Updated: 2025-02-19 06:39:51
Subjects: cs.GR,cs.AI
Continuous K-Max Bandits
We study the $K$-Max combinatorial multi-armed bandits problem with continuous outcome distributions and weak value-index feedback: each base arm has an unknown continuous outcome distribution, and in each round the learning agent selects $K$ arms, obtains the maximum value sampled from these $K$ arms as reward and observes this reward together with the corresponding arm index as feedback. This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc. The continuous $K$-Max bandits introduce unique challenges, including discretization error from continuous-to-discrete conversion, non-deterministic tie-breaking under limited feedback, and biased estimation due to partial observability. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds to tackle these challenges. For general continuous distributions, we prove that DCK-UCB achieves a $\widetilde{\mathcal{O}}(T^{3/4})$ regret upper bound, establishing the first sublinear regret guarantee for this setting. Furthermore, we identify an important special case with exponential distributions under full-bandit feedback. In this case, our proposed algorithm MLE-Exp enables $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound through maximal log-likelihood estimation, achieving near-minimax optimality.
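The weak value-index feedback model is easy to simulate; a minimal sketch, assuming exponential base-arm distributions (one of the cases the abstract highlights) with means unknown to the learner:

```python
import numpy as np

def k_max_round(rng, arm_means, chosen):
    """One round of continuous K-Max bandits: sample each chosen arm's
    continuous outcome, return the max value and the index of the arm that
    achieved it -- the only feedback the learner observes."""
    samples = {a: rng.exponential(arm_means[a]) for a in chosen}
    best = max(samples, key=samples.get)
    return samples[best], best

rng = np.random.default_rng(42)
arm_means = [0.5, 1.0, 2.0, 4.0]          # unknown to the learner
reward, winner = k_max_round(rng, arm_means, chosen=[0, 2, 3])
```

Note the learner never sees the losing arms' samples, which is the source of the biased-estimation challenge the algorithm's bias-corrected confidence bounds address.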
Updated: 2025-02-19 06:37:37
Subjects: cs.LG
Escaping from the Barren Plateau via Gaussian Initializations in Deep Variational Quantum Circuits
Variational quantum circuits have been widely employed in quantum simulation and quantum machine learning in recent years. However, quantum circuits with random structures have poor trainability due to the exponentially vanishing gradient with respect to the circuit depth and the qubit number. This result leads to a general standpoint that deep quantum circuits would not be feasible for practical tasks. In this work, we propose an initialization strategy with theoretical guarantees for the vanishing gradient problem in general deep quantum circuits. Specifically, we prove that under proper Gaussian initialized parameters, the norm of the gradient decays at most polynomially when the qubit number and the circuit depth increase. Our theoretical results hold for both the local and the global observable cases, where the latter was believed to have vanishing gradients even for very shallow circuits. Experimental results verify our theoretical findings in the quantum simulation and quantum chemistry.
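The initialization strategy can be sketched as follows; the specific variance scaling $\sigma^2 \propto 1/(\text{depth} \times \text{qubits})$ and the constant `gamma` are illustrative assumptions, not the paper's exact prescription:

```python
import numpy as np

def gaussian_init(n_qubits, depth, gamma=1.0, seed=7):
    """Draw variational circuit parameters from N(0, sigma^2) with a variance
    that shrinks as the circuit grows, so that (per the paper's analysis) the
    gradient norm decays at most polynomially in depth and qubit number.
    The 1/(depth * n_qubits) scaling here is an illustrative choice."""
    sigma2 = gamma / (depth * n_qubits)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(sigma2), size=(depth, n_qubits))

params = gaussian_init(n_qubits=8, depth=16)
```

Contrast this with the uniform-over-$[0, 2\pi)$ initialization that typically produces the barren plateau: the Gaussian parameters start the circuit near the identity, where gradients remain informative.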
Updated: 2025-02-19 06:34:55
Subjects: quant-ph,cs.LG
EvoP: Robust LLM Inference via Evolutionary Pruning
Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing structured pruning methods address this issue by removing redundant structures (e.g., elements, channels, layers) from the model. However, these methods employ a heuristic pruning strategy, which leads to suboptimal performance. Besides, they also ignore the data characteristics when pruning the model. To overcome these limitations, we propose EvoP, an evolutionary pruning framework for robust LLM inference. EvoP first presents a cluster-based calibration dataset sampling (CCDS) strategy for creating a more diverse calibration dataset. EvoP then introduces an evolutionary pruning pattern searching (EPPS) method to find the optimal pruning pattern. Compared to existing structured pruning techniques, EvoP achieves the best performance while maintaining the best efficiency. Experiments across different LLMs and different downstream tasks validate the effectiveness of the proposed EvoP, making it a practical and scalable solution for deploying LLMs in real-world applications.
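The evolutionary search over pruning patterns can be sketched on a toy problem. The sum-of-importances fitness is a stand-in: real EvoP scores candidate patterns on the calibration data produced by CCDS:

```python
import numpy as np

def fitness(mask, importances):
    """Toy surrogate: total importance of the kept units under a fixed budget.
    EvoP instead evaluates each candidate pattern on calibration data."""
    return float(importances[mask].sum())

def evolve_mask(importances, keep, pop=20, gens=30, seed=0):
    """Simple (mu + lambda)-style evolutionary search over keep-masks:
    elitist selection plus single-index mutation."""
    rng = np.random.default_rng(seed)
    n = len(importances)
    population = [rng.choice(n, size=keep, replace=False) for _ in range(pop)]
    for _ in range(gens):
        parents = sorted(population, key=lambda m: -fitness(m, importances))[: pop // 2]
        children = []
        for p in parents:
            child = p.copy()
            child[rng.integers(keep)] = rng.integers(n)  # mutate one kept index
            children.append(child if len(set(child)) == keep else p)
        population = parents + children                  # elitism keeps the best
    return max(population, key=lambda m: fitness(m, importances))

imp = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7])          # per-unit importances
best = evolve_mask(imp, keep=3)
```

Elitism guarantees the best pattern found never degrades across generations, which is what distinguishes this search from the one-shot heuristic pruning the abstract criticizes.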
Updated: 2025-02-19 06:33:59
Subjects: cs.CL,cs.AI
HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks
In real-world information-seeking scenarios, users have dynamic and diverse needs, requiring RAG systems to demonstrate adaptable resilience. To comprehensively evaluate the resilience of current RAG methods, we introduce HawkBench, a human-labeled, multi-domain benchmark designed to rigorously assess RAG performance across categorized task types. By stratifying tasks based on information-seeking behaviors, HawkBench provides a systematic evaluation of how well RAG systems adapt to diverse user needs. Unlike existing benchmarks, which focus primarily on specific task types (mostly factoid queries) and rely on varying knowledge bases, HawkBench offers: (1) systematic task stratification to cover a broad range of query types, including both factoid and rationale queries, (2) integration of multi-domain corpora across all task types to mitigate corpus bias, and (3) rigorous annotation for high-quality evaluation. HawkBench includes 1,600 high-quality test samples, evenly distributed across domains and task types. Using this benchmark, we evaluate representative RAG methods, analyzing their performance in terms of answer quality and response latency. Our findings highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalizability. We believe HawkBench serves as a pivotal benchmark for advancing the resilience of RAG methods and their ability to achieve general-purpose information seeking.
Updated: 2025-02-19 06:33:39
Subjects: cs.IR,cs.AI,cs.CL
Estimating Commonsense Plausibility through Semantic Shifts
Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches--reliant on likelihoods or verbalized judgments--struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. It demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) VLMs yield superior performance to LMs, when integrated with ComPaSS, on vision-grounded commonsense tasks. (2) contrastive pre-training sharpens backbone models' ability to capture semantic nuances, thereby further enhancing ComPaSS.
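The semantic-shift measurement can be sketched with a toy encoder. The bag-of-words embedding below is purely a stand-in for the LLM/VLM backbone representations ComPaSS actually uses:

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words vector standing in for a real sentence encoder."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

def semantic_shift(base, augmented, vocab):
    """1 - cosine similarity between base and augmented sentence:
    a small shift suggests a plausible augmentation."""
    a, b = embed(base, vocab), embed(augmented, vocab)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = {w: i for i, w in enumerate("small birds can fly stones".split())}
shift_good = semantic_shift("birds can fly", "small birds can fly", vocab)
shift_bad = semantic_shift("birds can fly", "stones can fly", vocab)
```

Even with this crude encoder, the commonsense-consistent augmentation produces a smaller shift than the implausible substitution, which is the discriminative signal the framework scores.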
Updated: 2025-02-19 06:31:06
Subjects: cs.CL,cs.AI
Collaborative Deterministic-Diffusion Model for Probabilistic Urban Spatiotemporal Prediction
Accurate prediction of urban spatiotemporal dynamics is essential for enhancing urban management and decision-making. Existing spatiotemporal prediction models are predominantly deterministic, focusing on primary spatiotemporal patterns. However, those dynamics are highly complex, exhibiting multi-modal distributions that are challenging for deterministic models to capture. In this paper, we highlight the critical role of probabilistic prediction in capturing the uncertainties and complexities inherent in spatiotemporal data. While mainstream probabilistic models can capture uncertainty, they struggle with accurately learning primary patterns and often suffer from computational inefficiency. To address these challenges, we propose CoST, which collaborates deterministic and probabilistic models to improve both predictive accuracy and the ability to handle uncertainty. To achieve this, we design a mean-residual decomposition framework, where the mean value is modeled by a deterministic model, and the residual variations are learned by a probabilistic model, specifically diffusion models. Moreover, we introduce a scale-aware diffusion process, which better accounts for spatially heterogeneous dynamics across different regions. Extensive experiments on eight real-world datasets demonstrate that CoST significantly outperforms existing methods in both deterministic and probabilistic metrics, achieving a 20% improvement with low computational cost. CoST bridges the gap between deterministic precision and probabilistic uncertainty, making a significant advancement in the field of urban spatiotemporal prediction.
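The mean-residual decomposition can be sketched on a toy signal. The deterministic model is replaced by an oracle for the mean, and the diffusion model by the empirical residual distribution, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy spatiotemporal signal: a smooth primary pattern plus multi-modal
# residual noise (two modes at -0.3 and +0.3) that no single deterministic
# point prediction can represent.
t = np.linspace(0.0, 1.0, 200)
trend = np.sin(2 * np.pi * t)
residual = rng.choice([-0.3, 0.3], size=t.size) + 0.05 * rng.standard_normal(t.size)
y = trend + residual

# Mean-residual decomposition: the deterministic model handles the mean
# (an oracle stand-in here) ...
mean_pred = np.sin(2 * np.pi * t)
# ... and the probabilistic model (a scale-aware diffusion model in CoST;
# here just the empirical residuals) only has to learn what remains.
resid_target = y - mean_pred
```

Training the diffusion model on `resid_target` rather than on `y` is the division of labor the abstract describes: the hard-to-capture multi-modality is isolated from the primary pattern.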
Updated: 2025-02-19 06:27:18
Subjects: cs.LG,cs.AI
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding
Spoken language understanding (SLU) is indispensable for half of all living languages that lack a formal writing system, since these languages cannot pair automatic speech recognition (ASR) with language models to benefit from language technology. Even if low-resource languages possess a writing system, ASR for these languages remains unreliable due to limited bimodal speech and text training data. Better SLU can strengthen the robustness of massively multilingual ASR by leveraging language semantics to disambiguate utterances via context or exploiting semantic similarities across languages. However, the evaluation of multilingual SLU remains limited to shallow tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering through listening comprehension spanning 944 hours of speech across 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.
Updated: 2025-02-19 06:23:54
Domains: cs.CL,cs.AI
Autograding Mathematical Induction Proofs with Natural Language Processing
In mathematical proof education, there remains a need for interventions that help students learn to write mathematical proofs. Research has shown that timely feedback can be very helpful to students learning new skills. While for many years natural language processing models have struggled to perform well on tasks related to mathematical texts, recent developments in natural language processing have created the opportunity to complete the task of giving students instant feedback on their mathematical proofs. In this paper, we present a set of training methods and models capable of autograding freeform mathematical proofs by leveraging existing large language models and other machine learning techniques. The models are trained using proof data collected from four different proof by induction problems. We use four different robust large language models to compare their performances, and all achieve satisfactory performance to varying degrees. Additionally, we recruit human graders to grade the same proofs as the training data, and find that the best grading model is also more accurate than most human graders. With the development of these grading models, we create and deploy an autograder for proof by induction problems and perform a user study with students. Results from the study show that students are able to make significant improvements to their proofs using the feedback from the autograder, but students still do not trust the AI autograders as much as they trust human graders. Future work can improve on the autograder feedback and figure out ways to help students trust AI autograders.
Updated: 2025-02-19 06:18:20
Domains: cs.AI,cs.CL,cs.HC
Poisoned Source Code Detection in Code Models
Deep learning models have gained popularity for conducting various tasks involving source code. However, their black-box nature raises concerns about potential risks. One such risk is a poisoning attack, where an attacker intentionally contaminates the training set with malicious samples to mislead the model's predictions in specific scenarios. To protect source code models from poisoning attacks, we introduce CodeGarrison (CG), a hybrid deep-learning model that relies on code embeddings to identify poisoned code samples. We evaluated CG against the state-of-the-art technique ONION for detecting poisoned samples generated by DAMP, MHM, ALERT, as well as a novel poisoning technique named CodeFooler. Results showed that CG significantly outperformed ONION with an accuracy of 93.5%. We also tested CG's robustness against unknown attacks and achieved an average accuracy of 85.6% in identifying poisoned samples across the four attacks mentioned above.
Updated: 2025-02-19 06:16:07
Domains: cs.CR,cs.LG
FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
KV cache techniques in Transformer models aim to reduce redundant computations at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recently, state-of-the-art KV cache compression methods implement imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance when deploying multi-GPU inference, as some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention heads across GPUs using data parallelism to mitigate load imbalance. Our experiments on popular models, including the LLaMA 70b and Mistral 24b models, demonstrate that FairKV increases throughput by 1.66x compared to standard tensor parallelism inference. Our code will be released as open source upon acceptance.
Updated: 2025-02-19 06:14:27
Domains: cs.DC,cs.AI
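The Fair-Copying idea above, replicating the most memory-intensive attention heads across GPUs so that their KV-cache load is shared, can be sketched as a small load-balancing routine. This is an illustrative sketch under assumed inputs (per-head KV-cache budgets and a `replicate_top_k` knob), not the paper's actual algorithm.

```python
def balance_heads(head_budgets, num_gpus, replicate_top_k=1):
    """Assign attention heads to GPUs by KV-cache budget, replicating the
    heaviest heads so each replica carries a fraction of their load.
    Illustrative sketch, not FairKV's actual algorithm."""
    ranked = sorted(range(len(head_budgets)), key=lambda h: -head_budgets[h])
    heads = []
    for h in ranked[:replicate_top_k]:
        # Replicated head: one copy per GPU, each holding 1/num_gpus of the load.
        heads += [(h, head_budgets[h] / num_gpus)] * num_gpus
    for h in ranked[replicate_top_k:]:
        heads.append((h, head_budgets[h]))
    # Greedy longest-processing-time scheduling onto the least-loaded GPU.
    loads = [0.0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for h, cost in sorted(heads, key=lambda x: -x[1]):
        g = min(range(num_gpus), key=loads.__getitem__)
        loads[g] += cost
        placement[g].append(h)
    return placement, loads
```

With one dominant head of budget 10 and six heads of budget 1 on two GPUs, replicating the heavy head evens the per-GPU load from 10 vs. 6 to 8 vs. 8.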
Megrez-Omni Technical Report
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities and demonstrates strong versatility and robustness, setting a new benchmark for multimodal AI models.
Updated: 2025-02-19 06:14:14
Domains: cs.LG
A General Error-Theoretical Analysis Framework for Constructing Compression Strategies
The exponential growth in parameter size and computational complexity of deep models poses significant challenges for efficient deployment. The core problem of existing compression methods is that different layers of the model have significant differences in their tolerance to compression levels. For instance, the first layer of a model can typically sustain a higher compression level compared to the last layer without compromising performance. Thus, the key challenge lies in how to allocate compression levels across layers in a way that minimizes performance loss while maximizing parameter reduction. To address this challenge, we propose a Compression Error Theory (CET) framework, designed to determine the optimal compression level for each layer. Taking quantization as an example, CET leverages differential expansion and algebraic geometry to reconstruct the quadratic form of quantization error as ellipsoids and hyperbolic paraboloids, and utilizes their geometric structures to define an error subspace. To identify the error subspace with minimal performance loss, by performing orthogonal decomposition of the geometric space, CET transforms the optimization process of the error subspace into a complementary problem. The final theoretical analysis shows that constructing the quantization subspace along the major axis results in minimal performance degradation. Experimental verification of the theory shows that CET largely retains performance while compressing. Specifically, on the ResNet-34 model, CET achieves nearly 11$\times$ parameter compression while even surpassing the performance of the original model.
Updated: 2025-02-19 06:12:43
Domains: cs.LG,cs.AI
ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail's cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
Updated: 2025-02-19 06:09:58
Domains: cs.CL,cs.AI,cs.CR,cs.LG
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System
The rapid advancement of scientific progress requires innovative tools that can accelerate knowledge discovery. Although recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short of replicating the collaborative nature of real-world scientific practices, where diverse experts work together in teams to tackle complex problems. To address the limitations, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VirSci), designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas. We further investigate the collaboration mechanisms that contribute to its tendency to produce ideas with higher novelty, offering valuable insights to guide future research and illuminating pathways toward building a robust system for autonomous scientific discovery. The code is available at https://github.com/open-sciencelab/Virtual-Scientists.
Updated: 2025-02-19 06:07:47
Domains: cs.AI,cs.CL,cs.CV,cs.LG,cs.MA
Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization
Multi-objective multi-armed bandit (MO-MAB) problems traditionally aim to achieve Pareto optimality. However, real-world scenarios often involve users with varying preferences across objectives, resulting in a Pareto-optimal arm that may score high for one user but perform quite poorly for another. This highlights the need for customized learning, a factor often overlooked in prior research. To address this, we study a preference-aware MO-MAB framework in the presence of explicit user preference. It shifts the focus from achieving Pareto optimality to further optimizing within the Pareto front under preference-centric customization. To our knowledge, this is the first theoretical study of customized MO-MAB optimization with explicit user preferences. Motivated by practical applications, we explore two scenarios: unknown preference and hidden preference, each presenting unique challenges for algorithm design and analysis. At the core of our algorithms are preference estimation and preference-aware optimization mechanisms to adapt to user preferences effectively. We further develop novel analytical techniques to establish near-optimal regret of the proposed algorithms. Strong empirical performance confirms the effectiveness of our approach.
Updated: 2025-02-19 06:06:13
Domains: cs.LG
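When the user's preference over objectives is known explicitly, the simplest baseline for the setting above is to scalarize each arm's multi-objective reward with the preference vector and run a standard bandit algorithm on the result. The sketch below does exactly that with UCB1 and deterministic rewards; it is a generic illustration of preference-centric customization, not the algorithm proposed in the paper.

```python
import math

def preference_ucb(arm_means, preference, horizon, c=1.0):
    """Scalarize each arm's per-objective mean reward with the user's
    preference vector, then run UCB1 on the scalarized values.
    Rewards are deterministic here for illustration; assumes
    horizon >= number of arms."""
    k = len(arm_means)
    scalar = [sum(p * m for p, m in zip(preference, means)) for means in arm_means]
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # pull every arm once first
        else:
            a = max(range(k), key=lambda i: totals[i] / pulls[i]
                    + c * math.sqrt(math.log(t) / pulls[i]))
        pulls[a] += 1
        totals[a] += scalar[a]
    return pulls
```

A user who weights the first objective at 0.9 ends up pulling the arm that is best on that objective far more often, and vice versa.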
Contrastive Localized Language-Image Pre-Training
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
Updated: 2025-02-19 05:59:59
Domains: cs.CV,cs.LG
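The region-text contrastive loss mentioned above is, in generic form, a symmetric InfoNCE objective over matched region/text embedding pairs. The sketch below shows that generic form in NumPy; the batch construction, temperature value, and pairing convention (row i of each matrix is a matched pair) are assumptions, not details from the paper.

```python
import numpy as np

def contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of region_emb and text_emb
    are a matched pair; all other rows serve as negatives."""
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature  # (N, N) cosine similarities

    def xent(lg):
        # Cross-entropy with the diagonal (matched pair) as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly aligned pairs yield a lower loss than shuffled ones, which is what gradient descent on this objective exploits.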
Learning stochastic dynamics from snapshots through regularized unbalanced optimal transport
Reconstructing dynamics using samples from sparsely time-resolved snapshots is an important problem in both natural sciences and machine learning. Here, we introduce a new deep learning approach for solving regularized unbalanced optimal transport (RUOT) and inferring continuous unbalanced stochastic dynamics from observed snapshots. Based on the RUOT form, our method models these dynamics without requiring prior knowledge of growth and death processes or additional information, allowing them to be learned directly from data. Theoretically, we explore the connections between the RUOT and Schr\"odinger bridge problem and discuss the key challenges and potential solutions. The effectiveness of our method is demonstrated with a synthetic gene regulatory network, high-dimensional Gaussian Mixture Model, and single-cell RNA-seq data from blood development. Compared with other methods, our approach accurately identifies growth and transition patterns, eliminates false transitions, and constructs the Waddington developmental landscape. Our code is available at: https://github.com/zhenyiizhang/DeepRUOT.
Updated: 2025-02-19 05:52:56
Domains: cs.LG,math.OC,physics.comp-ph,q-bio.QM
Interleaved Gibbs Diffusion for Constrained Generation
We introduce Interleaved Gibbs Diffusion (IGD), a novel generative modeling framework for mixed continuous-discrete data, focusing on constrained generation problems. Prior works on discrete and continuous-discrete diffusion models assume factorized denoising distribution for fast generation, which can hinder the modeling of strong dependencies between random variables encountered in constrained generation. IGD moves beyond this by interleaving continuous and discrete denoising algorithms via a discrete time Gibbs sampling type Markov chain. IGD provides flexibility in the choice of denoisers, allows conditional generation via state-space doubling and inference time scaling via the ReDeNoise method. Empirical evaluations on three challenging tasks (solving 3-SAT, generating molecule structures, and generating layouts) demonstrate state-of-the-art performance. Notably, IGD achieves a 7% improvement on 3-SAT out of the box and achieves state-of-the-art results in molecule generation without relying on equivariant diffusion or domain-specific architectures. We explore a wide range of modeling and interleaving strategies, along with hyperparameters, in each of these problems.
Updated: 2025-02-19 05:51:24
Domains: cs.LG,cs.AI
CRVQ: Channel-Relaxed Vector Quantization for Extreme Compression of LLMs
Powerful large language models (LLMs) are increasingly expected to be deployed with lower computational costs, enabling their capabilities on resource-constrained devices. Post-training quantization (PTQ) has emerged as a star approach to achieve this ambition, with best methods compressing weights to less than 2 bits on average. In this paper, we propose Channel-Relaxed Vector Quantization (CRVQ), a novel technique that significantly improves the performance of PTQ baselines at the cost of only minimal additional bits. This state-of-the-art extreme compression method achieves its results through two key innovations: (1) carefully selecting and reordering a very small subset of critical weight channels, and (2) leveraging extended codebooks to relax the constraint of critical channels. With our method, we demonstrate a 38.9% improvement over the current strongest sub-2-bit PTQ baseline, enabling nearer lossless 1-bit compression. Furthermore, our approach offers flexible customization of quantization bit-width and performance, providing a wider range of deployment options for diverse hardware platforms.
Updated: 2025-02-19 05:50:37
Domains: cs.LG,cs.CL
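The basic operation underlying the method above is vector quantization: each weight sub-vector is replaced by its nearest codeword, and CRVQ's relaxation amounts to granting the few critical channels a larger (extended) codebook. The sketch below shows only the nearest-codeword assignment step, with shapes of my own choosing.

```python
import numpy as np

def vector_quantize(vectors, codebook):
    """Map each row of `vectors` to the index of its nearest codeword
    (squared L2 distance). Extending `codebook` for critical channels
    is the relaxation CRVQ describes; this shows only the assignment."""
    # (N, 1, D) - (1, K, D) broadcasts to an (N, K) distance matrix.
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]
```

Storing `idx` instead of the original vectors is what yields the sub-2-bit footprint: each sub-vector costs only log2(K) bits.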
Adopting Whisper for Confidence Estimation
Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets.
Updated: 2025-02-19 05:45:28
Domains: eess.AS,cs.LG
Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
Large Vision-Language Models (LVLMs) demonstrate impressive capabilities in generating detailed and coherent responses from visual inputs. However, they are prone to generate hallucinations due to an over-reliance on language priors. To address this issue, we investigate the language priors in LVLMs and make two key observations: (1) Even when predicting the tokens associated with image-related part-of-speech (POS), models increasingly rely on linguistic priors as the token sequences grow, thereby amplifying hallucinations. (2) Methods that directly calibrate LVLM's output distribution to mitigate language priors can lead to a degradation in text quality or even exacerbate hallucinations. Based on these findings, we propose a novel method, Summary-Guided Decoding (SumGD). This method naturally encourages the model to focus more on image information by reducing the text context through summaries, while controlling only the image-related POS tokens to maintain text quality. Through experiments, we demonstrate that SumGD achieves state-of-the-art performance on object hallucination benchmarks. Furthermore, in terms of the trade-off between precision and recall, SumGD achieves Pareto optimality among the existing methods. Lastly, we observe that although existing methods struggle to balance the reduction of object hallucinations with maintaining text quality, SumGD demonstrates robustness in handling this challenge.
Updated: 2025-02-19 05:41:03
Domains: cs.AI,cs.CL,cs.CV
TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation
Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates infinite unanswerable math word problems and their answerable counterparts, by representing each question as a tree and removing chosen necessary conditions. Experiments show TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 61% and 42% in their respective worst-case scenarios. Further analysis highlights that deeper or more complex trees, composite item names, and removing necessary conditions near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems.
Updated: 2025-02-19 05:38:45
Domains: cs.CL,cs.AI,cs.LG
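TreeCut's construction, representing each problem as a tree of quantities and deleting a necessary condition so the target becomes underivable, can be illustrated with a minimal answerability check. The `defs`/`given` encoding below is my own simplification; the actual dataset renders much richer trees into natural-language problems.

```python
def answerable(defs, target, given):
    """defs maps each derived quantity to the quantities it is computed
    from (assumed acyclic, i.e. a tree); `given` holds the quantities the
    problem states directly. The target is answerable iff every leaf it
    depends on is given."""
    if target in given:
        return True
    if target not in defs:
        return False  # an underived, unstated quantity: a dead end
    return all(answerable(defs, d, given) for d in defs[target])

def cut(given, condition):
    """Drop one stated condition (the 'tree cut') to produce the
    unanswerable counterpart of a problem."""
    return given - {condition}
```

Deleting any single leaf on the path to the target flips the problem from answerable to unanswerable, which is exactly the pairing the dataset exploits.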
The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?
Self-improving large language models (LLMs) -- i.e., to improve the performance of an LLM by fine-tuning it with synthetic data generated by itself -- is a promising way to advance the capabilities of LLMs while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent -- a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first elicits the LLM to generate raw questions via a bait prompt, then diversifies these questions leveraging a rejection sampling-based self-deduplication, and finally feeds the questions to the LLM and collects the corresponding answers by means of majority voting. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting); and (ii) distil LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
Updated: 2025-02-19 05:37:08
Domains: cs.CL,cs.AI
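The majority-voting step that Crescent uses to collect answers is straightforward to sketch; ties here are broken by sampling order, which is one reasonable convention rather than anything the paper specifies.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among sampled LLM responses,
    breaking ties in favor of the earliest-sampled answer."""
    counts = Counter(answers)
    best = max(counts.values())
    for a in answers:  # preserve sampling order on ties
        if counts[a] == best:
            return a
```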
Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation
Knowledge base question answering (KBQA) aims to answer user questions in natural language using rich human knowledge stored in large KBs. As current KBQA methods struggle with unseen knowledge base elements at test time, we introduce SG-KBQA: a novel model that injects schema contexts into entity retrieval and logical form generation to tackle this issue. It uses the richer semantics and awareness of the knowledge base structure provided by schema contexts to enhance generalizability. We show that SG-KBQA achieves strong generalizability, outperforming state-of-the-art models on two commonly used benchmark datasets across a variety of test settings. Our source code is available at https://github.com/gaosx2000/SG_KBQA.
Updated: 2025-02-19 05:32:40
Domains: cs.CL,cs.AI
Semi-supervised classification of bird vocalizations
Changes in bird populations can indicate broader changes in ecosystems, making birds one of the most important animal groups to monitor. Combining machine learning and passive acoustics enables continuous monitoring over extended periods without direct human involvement. However, most existing techniques require extensive expert-labeled datasets for training and cannot easily detect time-overlapping calls in busy soundscapes. We propose a semi-supervised acoustic bird detector designed to allow both the detection of time-overlapping calls (when separated in frequency) and the use of few labeled training samples. The classifier is trained and evaluated on a combination of community-recorded open-source data and long-duration soundscape recordings from Singapore. It achieves a mean F0.5 score of 0.701 across 315 classes from 110 bird species on a hold-out test set, with an average of 11 labeled training samples per class. It outperforms the state-of-the-art BirdNET classifier on a test set of 103 bird species despite significantly fewer labeled training samples. The detector is further tested on 144 microphone-hours of continuous soundscape data. The rich soundscape in Singapore makes suppression of false positives a challenge on raw, continuous data streams. Nevertheless, we demonstrate that achieving high precision in such environments with minimal labeled training data is possible.
Updated: 2025-02-19 05:31:13
Domains: cs.SD,cs.AI,cs.CV,eess.AS,q-bio.QM
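For reference, the F0.5 score reported above is the beta = 0.5 case of the general F-beta measure, which weights precision more heavily than recall whenever beta < 1, a sensible choice when false positives in busy soundscapes are the main concern:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta < 1 favors precision; beta = 1 gives the usual F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a precision-heavy detector (P = 0.8, R = 0.4) scores higher under F0.5 than a recall-heavy one (P = 0.4, R = 0.8), even though their F1 scores are identical.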
Diffusion Models as Network Optimizers: Explorations and Analysis
Network optimization is a fundamental challenge in the Internet of Things (IoT) network, often characterized by complex features that make it difficult to solve these problems. Recently, generative diffusion models (GDMs) have emerged as a promising new approach to network optimization, with the potential to directly address these optimization problems. However, the application of GDMs in this field is still in its early stages, and there is a noticeable lack of theoretical research and empirical findings. In this study, we first explore the intrinsic characteristics of generative models. Next, we provide a concise theoretical proof and intuitive demonstration of the advantages of generative models over discriminative models in network optimization. Based on this exploration, we implement GDMs as optimizers aimed at learning high-quality solution distributions for given inputs, sampling from these distributions during inference to approximate or achieve optimal solutions. Specifically, we utilize denoising diffusion probabilistic models (DDPMs) and employ a classifier-free guidance mechanism to manage conditional guidance based on input parameters. We conduct extensive experiments across three challenging network optimization problems. By investigating various model configurations and the principles of GDMs as optimizers, we demonstrate the ability to overcome prediction errors and validate the convergence of generated solutions to optimal solutions. We provide code and data at https://github.com/qiyu3816/DiffSG.
Updated: 2025-02-19 05:25:27
Subjects: cs.LG,cs.NI
MRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers
In applications of diffusion models, controllable generation is of practical significance, but is also challenging. Current methods for controllable generation primarily focus on modifying the score function of diffusion models, while Mean Reverting (MR) Diffusion directly modifies the structure of the stochastic differential equation (SDE), making the incorporation of image conditions simpler and more natural. However, current training-free fast samplers are not directly applicable to MR Diffusion, which therefore requires hundreds of NFEs (number of function evaluations) to obtain high-quality samples. In this paper, we propose a new algorithm named MRS (MR Sampler) to reduce the sampling NFEs of MR Diffusion. We solve the reverse-time SDE and the probability flow ordinary differential equation (PF-ODE) associated with MR Diffusion, and derive semi-analytical solutions. The solutions consist of an analytical function and an integral parameterized by a neural network. Based on this solution, we can generate high-quality samples in fewer steps. Our approach does not require training and supports all mainstream parameterizations, including noise prediction, data prediction and velocity prediction. Extensive experiments demonstrate that MR Sampler maintains high sampling quality with a speedup of 10 to 20 times across ten different image restoration tasks. Our algorithm accelerates the sampling procedure of MR Diffusion, making it more practical in controllable generation.
Updated: 2025-02-19 05:22:54
Subjects: cs.CV,cs.AI,cs.LG
GMValuator: Similarity-based Data Valuation for Generative Models
Data valuation plays a crucial role in machine learning. Existing data valuation methods have primarily focused on discriminative models, neglecting generative models that have recently gained considerable attention. The few existing data valuation methods designed for deep generative models either concentrate on specific models or lack robustness in their outcomes; moreover, their efficiency remains a notable shortcoming. To bridge these gaps, we formulate the data valuation problem in generative models from a similarity-matching perspective. Specifically, we introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to provide data valuation for generation tasks. It enables efficient data valuation through our novel similarity-matching module, calibrates biased contributions by incorporating image quality assessment, and attributes credit to all training samples based on their contributions to the generated samples. Additionally, we introduce four evaluation criteria for assessing data valuation methods in generative models, aligning with principles of plausibility and truthfulness. GMValuator is extensively evaluated on various datasets and generative architectures to demonstrate its effectiveness.
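The similarity-matching idea can be illustrated with a toy valuation rule: each generated sample distributes its credit over training samples in proportion to embedding similarity. Everything below (the embeddings, cosine similarity, and the optional quality weights standing in for image-quality assessment) is a hedged sketch of the general idea, not GMValuator's actual module.

```python
import numpy as np

def similarity_valuation(train_emb, gen_emb, quality=None):
    """Toy similarity-matching valuation: each generated sample spreads
    one unit of credit over training samples proportionally to cosine
    similarity; optional per-generation quality weights calibrate the
    contribution (a stand-in for image-quality assessment)."""
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    g = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    sims = np.clip(g @ t.T, 0.0, None)           # (n_gen, n_train)
    weights = sims / (sims.sum(axis=1, keepdims=True) + 1e-12)
    if quality is None:
        quality = np.ones(len(gen_emb))
    return quality @ weights                      # credit per training sample
```

With unit quality weights, the credits for each generated sample sum to one, so the total valuation budget equals the number of generated samples.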
Updated: 2025-02-19 05:22:49
Subjects: cs.CV,cs.LG
Quantum Policy Gradient in Reproducing Kernel Hilbert Space
Parametrised quantum circuits offer expressive and data-efficient representations for machine learning. Due to quantum states residing in a high-dimensional Hilbert space, parametrised quantum circuits have a natural interpretation in terms of kernel methods. The representation of quantum circuits in terms of quantum kernels has been studied widely in quantum supervised learning, but has been overlooked in the context of quantum RL. This paper proposes parametric and non-parametric policy gradient and actor-critic algorithms with quantum kernel policies in quantum environments. This approach, implemented with both numerical and analytical quantum policy gradient techniques, allows exploiting the many advantages of kernel methods, including data-driven forms for functions (and their gradients) as well as tunable expressiveness. The proposed approach is suitable for vector-valued action spaces and each of the formulations demonstrates a quadratic reduction in query complexity compared to their classical counterparts. Two actor-critic algorithms, one based on stochastic policy gradient and one based on deterministic policy gradient (comparable to the popular DDPG algorithm), demonstrate additional query complexity reductions compared to quantum policy gradient algorithms under favourable conditions.
Updated: 2025-02-19 05:20:46
Subjects: quant-ph,cs.LG
Graph Neural Networks for Databases: A Survey
Graph neural networks (GNNs) are powerful deep learning models for graph-structured data, demonstrating remarkable success across diverse domains. Recently, the database (DB) community has increasingly recognized the potential of GNNs, prompting a surge of research focused on improving database systems through GNN-based approaches. However, despite notable advances, there is a lack of a comprehensive review and understanding of how GNNs could improve DB systems. Therefore, this survey aims to bridge this gap by providing a structured and in-depth overview of GNNs for DB systems. Specifically, we propose a new taxonomy that classifies existing methods into two key categories: (1) Relational Databases, which includes tasks like performance prediction, query optimization, and text-to-SQL, and (2) Graph Databases, addressing challenges like efficient graph query processing and graph similarity computation. We systematically review key methods in each category, highlighting their contributions and practical implications. Finally, we suggest promising avenues for integrating GNNs into database systems.
Updated: 2025-02-19 05:09:09
Subjects: cs.DB,cs.AI
ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows
In this report, we present ML-Dev-Bench, a benchmark aimed at testing agentic capabilities on applied Machine Learning development tasks. While existing benchmarks focus on isolated coding tasks or Kaggle-style competitions, ML-Dev-Bench tests agents' ability to handle the full complexity of ML development workflows. The benchmark assesses performance across critical aspects including dataset handling, model training, improving existing models, debugging, and API integration with popular ML tools. We evaluate three agents - ReAct, Openhands, and AIDE - on a diverse set of 30 tasks, providing insights into their strengths and limitations in handling practical ML development challenges. We open source the benchmark for the benefit of the community at https://github.com/ml-dev-bench/ml-dev-bench.
Updated: 2025-02-19 05:09:01
Subjects: cs.SE,cs.AI
Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning
Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, in enhancing policy alignment. The existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select a suitable potential function from a pre-designed pool. Moreover, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
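The potential-based shaping the bottom layer builds on has a compact form. A minimal sketch, assuming the VLM is abstracted into a scalar potential Phi(s):

```python
def shaped_reward(r, phi_s, phi_s_next, gamma=0.99, done=False):
    """Potential-based reward shaping: adding gamma*Phi(s') - Phi(s) to
    the environment reward is known to leave the optimal policy
    unchanged. Here Phi would be a VLM's scalar score of how well a
    state matches a common-sense instruction (an illustrative
    assumption, not the paper's exact interface)."""
    phi_next = 0.0 if done else phi_s_next  # terminal potential is zero
    return r + gamma * phi_next - phi_s
```

Because the shaping terms telescope over an episode, the total shaped return differs from the true return only by the (policy-independent) initial potential, which is why optimality is preserved.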
Updated: 2025-02-19 05:04:10
Subjects: cs.AI,cs.LG
ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection
The point clouds collected by the Airborne Laser Scanning (ALS) system provide accurate 3D information of urban land covers. By utilizing multi-temporal ALS point clouds, semantic changes in urban areas can be captured, demonstrating significant potential in urban planning, emergency management, and infrastructure maintenance. Existing 3D change detection methods struggle to efficiently extract multi-class semantic information and change features, still facing the following challenges: (1) the difficulty of accurately modeling the spatial relationships of cross-temporal point clouds for effective change feature extraction; (2) the class imbalance of change samples, which hinders the distinguishability of semantic features; (3) the lack of real-world datasets for 3D semantic change detection. To resolve these challenges, we propose the Multi-task Enhanced Cross-temporal Point Transformer (ME-CPT) network. ME-CPT establishes spatiotemporal correspondences between point clouds across different epochs and employs attention mechanisms to jointly extract semantic change features, facilitating information exchange and change comparison. Additionally, we incorporate a semantic segmentation task and, through the multi-task training strategy, further enhance the distinguishability of semantic features, reducing the impact of class imbalance in change types. Moreover, we release a 22.5 $km^2$ 3D semantic change detection dataset, offering diverse scenes for comprehensive evaluation. Experiments on multiple datasets show that the proposed ME-CPT achieves superior performance compared to existing state-of-the-art methods. The source code and dataset will be released upon acceptance at https://github.com/zhangluqi0209/ME-CPT.
Updated: 2025-02-19 05:03:35
Subjects: cs.CV,cs.AI
Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity
Reconstructing human dynamic vision from brain activity is a challenging task with great scientific significance. Although prior video reconstruction methods have made substantial progress, they still suffer from several limitations, including: (1) difficulty in simultaneously reconciling semantic (e.g., categorical descriptions), structural (e.g., size and color), and consistent motion (e.g., frame order) information; (2) the low temporal resolution of fMRI, which poses a challenge in decoding multiple frames of video dynamics from a single fMRI frame; (3) reliance on video generation models, which introduces ambiguity regarding whether the dynamics observed in the reconstructed videos are genuinely derived from fMRI data or are hallucinations from the generative model. To overcome these limitations, we propose a two-stage model named Mind-Animator. During the fMRI-to-feature stage, we decouple semantic, structure, and motion features from fMRI. Specifically, we employ fMRI-vision-language tri-modal contrastive learning to decode semantic features from fMRI and design a sparse causal attention mechanism for decoding multi-frame video motion features through a next-frame-prediction task. In the feature-to-video stage, these features are integrated into videos using an inflated Stable Diffusion, effectively eliminating external video data interference. Extensive experiments on multiple video-fMRI datasets demonstrate that our model achieves state-of-the-art performance. Comprehensive visualization analyses further elucidate the interpretability of our model from a neurobiological perspective. Project page: https://mind-animator-design.github.io/.
Updated: 2025-02-19 05:02:08
Subjects: cs.CV,cs.AI
MCTS-KBQA: Monte Carlo Tree Search for Knowledge Base Question Answering
This study explores how to enhance the reasoning capabilities of large language models (LLMs) in knowledge base question answering (KBQA) by leveraging Monte Carlo Tree Search (MCTS). Semantic parsing-based KBQA methods are particularly challenging, as these approaches require locating elements from knowledge bases and generating logical forms, demanding not only extensive annotated data but also strong reasoning capabilities. Although recent approaches leveraging LLMs as agents have demonstrated considerable potential, these studies are inherently constrained by their linear decision-making processes. To address this limitation, we propose an MCTS-based framework that enhances LLMs' reasoning capabilities through tree search methodology. We devise a carefully designed step-wise reward mechanism that requires only direct prompting of open-source instruction-tuned LLMs without additional fine-tuning. Experimental results demonstrate that our approach significantly outperforms linear decision-making methods, particularly in low-resource scenarios. Additionally, we contribute new data resources to the KBQA community by annotating intermediate reasoning processes for existing question-SPARQL datasets using distant supervision. Experimental results on the extended dataset demonstrate that our method achieves comparable performance to fully supervised models while using significantly less training data.
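To make the tree-search idea concrete, here is a self-contained toy MCTS loop with UCB selection; the `reward` function stands in for a step-wise, LLM-prompted reward, and the state/action space is a placeholder, not the paper's logical-form search.

```python
import math
import random

class Node:
    def __init__(self, state):
        self.state, self.children = state, {}
        self.visits, self.value = 0, 0.0

def ucb(parent, child, c=1.4):
    """Upper-confidence bound used to pick among expanded children."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent.visits) / child.visits)

def mcts(root, actions, step, reward, n_iter=200, seed=0):
    """Tiny MCTS loop: select by UCB down fully expanded nodes, expand
    one untried action, score the leaf with a step-wise reward, and
    back the value up the visited path."""
    random.seed(seed)
    for _ in range(n_iter):
        node, path = root, [root]
        while node.children and len(node.children) == len(actions):
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
            path.append(node)
        untried = [a for a in actions if a not in node.children]
        if untried:
            a = random.choice(untried)
            child = Node(step(node.state, a))
            node.children[a] = child
            path.append(child)
        r = reward(path[-1].state)
        for n in path:                      # backup
            n.visits += 1
            n.value += r
    return max(root.children, key=lambda a: root.children[a].visits)
```

Usage: with states as action tuples, `mcts(Node(()), [0, 1], lambda s, a: s + (a,), reward)` returns the first action with the most visits.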
Updated: 2025-02-19 04:58:39
Subjects: cs.CL,cs.AI
MSEMG: Surface Electromyography Denoising with a Mamba-based Efficient Network
Surface electromyography (sEMG) recordings can be contaminated by electrocardiogram (ECG) signals when the monitored muscle is close to the heart. Traditional signal processing-based approaches, such as high-pass filtering and template subtraction, have been used to remove ECG interference but are often limited in their effectiveness. Recently, neural network-based methods have shown greater promise for sEMG denoising, but they still struggle to balance both efficiency and effectiveness. In this study, we introduce MSEMG, a novel system that integrates the Mamba state space model with a convolutional neural network to serve as a lightweight sEMG denoising model. We evaluated MSEMG using sEMG data from the Non-Invasive Adaptive Prosthetics database and ECG signals from the MIT-BIH Normal Sinus Rhythm Database. The results show that MSEMG outperforms existing methods, generating higher-quality sEMG signals using fewer parameters.
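As a point of reference for the classical high-pass baseline the abstract mentions, a zero-phase FFT high-pass filter for suppressing low-frequency ECG contamination might look like the sketch below; the brick-wall implementation and the 40 Hz cutoff are illustrative conventions (most ECG energy lies at low frequencies), not values from the paper.

```python
import numpy as np

def highpass_fft(x, fs, cutoff=40.0):
    """Zero-phase brick-wall high-pass: zero out all FFT bins below
    `cutoff` Hz and transform back. A classical baseline for removing
    ECG interference from an sEMG recording sampled at `fs` Hz."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[freqs < cutoff] = 0.0
    return np.fft.irfft(spec, n=len(x))
```

The limitation noted in the abstract is visible here: any sEMG energy below the cutoff is discarded along with the ECG.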
Updated: 2025-02-19 04:53:42
Subjects: eess.SP,cs.LG
Why You've Got Mail: Evaluating Inbox Privacy Implications of Email Marketing Practices in Online Apps and Services
This study explores the widespread perception that personal data, such as email addresses, may be shared or sold without informed user consent, investigating whether these concerns are reflected in actual practices of popular online services and apps. Over the course of a year, we collected and analyzed the source, volume, frequency, and content of emails received by users after signing up for the 150 most popular online services and apps across various sectors. By examining patterns in email communications, we aim to identify consistent strategies used across industries, including potential signs of third-party data sharing. This analysis provides a critical evaluation of how email marketing tactics may intersect with data-sharing practices, with important implications for consumer privacy and regulatory oversight. Our study, conducted post-CCPA and post-GDPR, finds that while no unknown third-party spam email was detected, internal and authorized third-party email marketing practices were pervasive, with companies frequently sending promotional and CRM emails despite opt-out preferences. The framework established in this work is designed to be scalable, allowing for continuous monitoring, and can be extended to include a more diverse set of apps and services for broader analysis, ultimately contributing to transparency in email address privacy practices.
Updated: 2025-02-19 04:47:03
Subjects: cs.SI,cs.CR
PolyhedronNet: Representation Learning for Polyhedra with Surface-attributed Graph
Ubiquitous geometric objects can be precisely and efficiently represented as polyhedra. The transformation of a polyhedron into a vector, known as polyhedra representation learning, is crucial for manipulating these shapes with mathematical and statistical tools for tasks like classification, clustering, and generation. Recent years have witnessed significant strides in this domain, yet most efforts focus on the vertex sequence of a polyhedron, neglecting the complex surface modeling crucial in real-world polyhedral objects. This study proposes PolyhedronNet, a general framework tailored for learning representations of 3D polyhedral objects. We propose the concept of the surface-attributed graph to seamlessly model the vertices, edges, faces, and their geometric interrelationships within a polyhedron. To learn the representation of the entire surface-attributed graph, we first propose to break it down into local rigid representations, capturing each local region's position relative to the remaining regions without loss of geometric information. Subsequently, we propose PolyhedronGNN to hierarchically aggregate the local rigid representations via intra-face and inter-face geometric message passing modules, obtaining a global representation that minimizes information loss while maintaining rotation and translation invariance. Our experimental evaluations on four distinct datasets, encompassing both classification and retrieval tasks, substantiate PolyhedronNet's efficacy in capturing comprehensive and informative representations of 3D polyhedral objects. Code and data are available at https://github.com/dyu62/3D_polyhedron.
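The surface-attributed graph can be illustrated with a toy constructor that derives edges from face boundaries and records, per edge, which faces it borders; this is a simplified sketch of the general idea, not the paper's data structure.

```python
def surface_attributed_graph(vertices, faces):
    """Toy surface-attributed graph: nodes keep their 3D coordinates,
    edges come from consecutive vertices of each face (given as index
    tuples), and each edge records the set of face ids it belongs to,
    tying vertices, edges, and faces together in one structure."""
    edges = {}
    for f_id, face in enumerate(faces):
        for i in range(len(face)):
            u, v = face[i], face[(i + 1) % len(face)]
            key = (min(u, v), max(u, v))            # undirected edge
            edges.setdefault(key, set()).add(f_id)
    return {"nodes": list(vertices), "edges": edges}
```

Edges shared by two faces (the fold lines of the surface) are exactly those whose face set has size two, which is the kind of inter-face relationship the message-passing modules described above would exploit.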
Updated: 2025-02-19 04:45:40
Subjects: cs.CV,cs.LG
TabSD: Large Free-Form Table Question Answering with SQL-Based Table Decomposition
Question answering on free-form tables (TableQA) is challenging due to the absence of predefined schemas and the presence of noise in large tables. While Large Language Models (LLMs) have shown promise in TableQA, they struggle with large free-form tables and noise sensitivity. To address these challenges, we propose TabSD, a SQL-based decomposition model that enhances LLMs' ability to process large free-form tables. TabSD generates SQL queries to guide the table decomposition, remove noise, and processes sub-tables for better answer generation. Additionally, SQL Verifier refines SQL outputs to enhance decomposition accuracy. We introduce two TableQA datasets, SLQA and SEQA, which consist solely of large free-form tables and will be publicly available. Experimental results on four benchmark datasets demonstrate that TabSD outperforms the best existing baseline models by 23.07%, 2.84%, 23.24% and 9.32% in accuracy, respectively, highlighting its effectiveness in handling large and noisy free-form tables.
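The SQL-guided decomposition step can be mimicked with an in-memory SQLite database: load the free-form table, then run a (model-generated) query to carve out a clean sub-table. The all-TEXT schema and the example query below are simplifications for illustration, not TabSD's pipeline.

```python
import sqlite3

def decompose(rows, header, sql):
    """Toy SQL-guided table decomposition: load a free-form table into
    an in-memory SQLite table `t`, then execute a (model-generated)
    query to extract the noise-free sub-table relevant to a question."""
    con = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}" TEXT' for c in header)   # untyped cells
    con.execute(f"CREATE TABLE t ({cols})")
    con.executemany(
        f"INSERT INTO t VALUES ({', '.join('?' * len(header))})", rows)
    sub = con.execute(sql).fetchall()
    con.close()
    return sub
```

In a full system, the returned sub-table (rather than the whole noisy table) would be serialized into the LLM prompt for answer generation.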
Updated: 2025-02-19 04:45:05
Subjects: cs.CL,cs.AI,cs.DB
ExoMiner++ on TESS with Transfer Learning from Kepler: Transit Classification and Vetting Catalog for 2-min Data
We present ExoMiner++, an enhanced deep learning model that builds on the success of ExoMiner to improve transit signal classification in 2-minute TESS data. ExoMiner++ incorporates additional diagnostic inputs, including periodogram, flux trend, difference image, unfolded flux, and spacecraft attitude control data, all of which are crucial for effectively distinguishing transit signals from more challenging sources of false positives. To further enhance performance, we leverage transfer learning from high-quality labeled data from the Kepler space telescope, mitigating the impact of TESS's noisier and more ambiguous labels. ExoMiner++ achieves high accuracy across various classification and ranking metrics, significantly narrowing the search space for follow-up investigations to confirm new planets. To serve the exoplanet community, we introduce new TESS catalogs containing ExoMiner++ classifications and confidence scores for each transit signal. Among the 147,568 unlabeled TCEs, ExoMiner++ identifies 7,330 as planet candidates, with the remainder classified as false positives. These 7,330 planet candidates correspond to 1,868 existing TESS Objects of Interest (TOIs), 69 Community TESS Objects of Interest (CTOIs), and 50 newly introduced CTOIs. 1,797 out of the 2,506 TOIs previously labeled as planet candidates in ExoFOP are classified as planet candidates by ExoMiner++. This reduction in plausible candidates combined with the excellent ranking quality of ExoMiner++ allows the follow-up efforts to be focused on the most likely candidates, increasing the overall planet yield.
Updated: 2025-02-19 04:42:32
Subjects: astro-ph.EP,astro-ph.IM,cs.LG
AutoParLLM: GNN-guided Context Generation for Zero-Shot Code Parallelization using LLMs
In-Context Learning (ICL) has been shown to be a powerful technique to augment the capabilities of LLMs for a diverse range of tasks. This work proposes AutoParLLM, a novel way to generate context using guidance from graph neural networks (GNNs) to generate efficient parallel codes. We evaluate AutoParLLM on 12 applications from two well-known benchmark suites of parallel codes: NAS Parallel Benchmark and Rodinia Benchmark. Our results show that AutoParLLM improves the state-of-the-art LLMs (e.g., GPT-4) by 19.9% on the NAS benchmark and 6.48% on the Rodinia benchmark in terms of CodeBERTScore for the task of parallel code generation. Moreover, AutoParLLM improves the ability of the most powerful LLM to date, GPT-4, by achieving approximately 17% (on the NAS benchmark) and approximately 16% (on the Rodinia benchmark) better speedup. In addition, we propose a new score for evaluating the quality of the generated parallel code and show its effectiveness in evaluating parallel codes. AutoParLLM is available at https://github.com/quazirafi/AutoParLLM.git.
Updated: 2025-02-19 04:30:19
Subjects: cs.LG
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model
Large Language Models (LLMs) with API-calling capabilities have enabled the building of effective Language Agents (LAs), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA); our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CoALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CoALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, CoALM 8B, CoALM 70B, and CoALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single-model approach for both TOD and LA, setting a new standard for conversational agents.
Updated: 2025-02-19 04:28:49
Subjects: cs.AI,cs.CL
PFedDST: Personalized Federated Learning with Decentralized Selection Training
Distributed Learning (DL) enables the training of machine learning models across multiple devices, yet it faces challenges like non-IID data distributions and device capability disparities, which can impede training efficiency. Communication bottlenecks further complicate traditional Federated Learning (FL) setups. To mitigate these issues, we introduce the Personalized Federated Learning with Decentralized Selection Training (PFedDST) framework. PFedDST enhances model training by allowing devices to strategically evaluate and select peers based on a comprehensive communication score. This score integrates loss, task similarity, and selection frequency, ensuring optimal peer connections. This selection strategy is tailored to increase local personalization and promote beneficial peer collaborations to strengthen the stability and efficiency of the training process. Our experiments demonstrate that PFedDST not only enhances model accuracy but also accelerates convergence. This approach outperforms state-of-the-art methods in handling data heterogeneity, delivering both faster and more effective training in diverse and decentralized systems.
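The abstract does not give the exact form of PFedDST's communication score; as a hedged sketch, assume it linearly combines a peer's recent loss, task similarity, and a penalty on how often that peer has already been selected (the weights `w_loss`, `w_sim`, `w_freq` and the log-based frequency penalty are hypothetical):

```python
import math

def communication_score(peer_loss, task_similarity, times_selected,
                        w_loss=1.0, w_sim=1.0, w_freq=0.5):
    """Toy peer-selection score: prefer similar, well-performing,
    rarely-selected peers. The weighting scheme is illustrative only."""
    # Lower loss is better, higher similarity is better, and frequent
    # selection is discounted to diversify peer connections.
    return (w_sim * task_similarity
            - w_loss * peer_loss
            - w_freq * math.log1p(times_selected))

def select_peers(candidates, k=2):
    """candidates: list of (peer_id, loss, similarity, times_selected)."""
    ranked = sorted(candidates,
                    key=lambda c: communication_score(c[1], c[2], c[3]),
                    reverse=True)
    return [peer_id for peer_id, *_ in ranked[:k]]
```

A device would recompute the scores each round, so a peer's rank decays as it keeps being selected.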
Updated: 2025-02-19 04:21:58
Categories: cs.LG,cs.AI
Indifferential Privacy: A New Paradigm and Its Applications to Optimal Matching in Dark Pool Auctions
Public exchanges like the New York Stock Exchange and NASDAQ act as auctioneers in a public double auction system, where buyers submit their highest bids and sellers offer their lowest asking prices, along with the number of shares (volume) they wish to trade. The auctioneer matches compatible orders and executes the trades when a match is found. However, auctioneers involved in high-volume exchanges, such as dark pools, may not always be reliable. They could exploit their position by engaging in practices like front-running or face significant conflicts of interest, i.e., ethical breaches that have frequently resulted in hefty fines and regulatory scrutiny within the financial industry. Previous solutions, based on the use of fully homomorphic encryption (Asharov et al., AAMAS 2020), encrypt orders ensuring that information is revealed only when a match occurs. However, this approach introduces significant computational overhead, making it impractical for high-frequency trading environments such as dark pools. In this work, we propose a new system based on differential privacy combined with lightweight encryption, offering an efficient and practical solution that mitigates the risks of an untrustworthy auctioneer. Specifically, we introduce a new concept called Indifferential Privacy, which can be of independent interest, where a user is indifferent to whether certain information is revealed after some special event, unlike standard differential privacy. For example, in an auction, it's reasonable to disclose the true volume of a trade once all of it has been matched. Moreover, our new concept of Indifferential Privacy allows for maximum matching, which is impossible with conventional differential privacy.
Updated: 2025-02-19 04:19:25
Categories: cs.CR
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark
The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancement of LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations. In this work, we develop ENAMEL (EfficeNcy AutoMatic EvaLuator), a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code. Firstly, we propose a new efficiency metric called eff@k, which generalizes the pass@k metric from correctness to efficiency and appropriately handles right-censored execution time. Furthermore, we derive an unbiased and variance-reduced estimator of eff@k via Rao--Blackwellization; we also provide a numerically stable implementation for the new estimator. Secondly, to set a high standard for efficiency evaluation, we employ a human expert to design the best algorithms and implementations as our reference solutions of efficiency, many of which are much more efficient than existing canonical solutions in HumanEval and HumanEval+. Moreover, to ensure a rigorous evaluation, we employ a human expert to curate strong test case generators to filter out wrong code and differentiate suboptimal algorithms. An extensive study across 30 popular LLMs using our benchmark ENAMEL shows that LLMs still fall short of generating expert-level efficient code. Using two subsets of our problem set, we demonstrate that such deficiency is because current LLMs struggle in designing advanced algorithms and are barely aware of implementation optimization. Our benchmark is publicly available at https://github.com/q-rz/enamel .
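The paper's exact eff@k definition and its Rao-Blackwellized estimator handle right-censored execution time; as a simplified sketch of the underlying idea, suppose each of the n generated samples has an efficiency score (0 for incorrect code), and we want an unbiased estimate of the expected best score among k samples drawn without replacement:

```python
from math import comb

def eff_at_k(scores, k):
    """Unbiased estimate of E[max efficiency over a random k-subset of
    the n generated samples]. `scores` holds one efficiency value per
    sample (0 if the sample is incorrect); censoring is omitted here."""
    n = len(scores)
    assert 1 <= k <= n
    s = sorted(scores)  # ascending: s[j] is the (j+1)-th smallest
    total = comb(n, k)
    # The max of a random k-subset equals s[j] with probability
    # C(j, k-1) / C(n, k), where j is the 0-based sorted index.
    return sum(comb(j, k - 1) * s[j] for j in range(k - 1, n)) / total
```

With binary scores (1 for correct, 0 otherwise), this reduces to the familiar unbiased pass@k estimator, which is the sense in which eff@k generalizes pass@k.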
Updated: 2025-02-19 04:16:24
Categories: cs.SE,cs.AI,cs.LG
RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation Analysis
Retrieval-Augmented Generation (RAG) enriches the input to LLMs by retrieving information from the relevant knowledge database, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%.
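RevPRAG's actual pipeline is more elaborate than a single classifier; as a minimal sketch of the core idea — that activations of correct and poisoned generations form separable clusters — here is a nearest-centroid detector over labeled activation vectors (the 2-D vectors and class names are purely illustrative):

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ActivationDetector:
    """Toy detector: label a response 'poisoned' if its activation
    vector lies closer to the poisoned centroid than the clean one."""
    def fit(self, clean_acts, poisoned_acts):
        self.clean_c = centroid(clean_acts)
        self.poison_c = centroid(poisoned_acts)
        return self
    def predict(self, act):
        if sq_dist(act, self.poison_c) < sq_dist(act, self.clean_c):
            return "poisoned"
        return "clean"
```

In practice the activation vectors would be hidden states collected from the LLM while it generates each response.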
Updated: 2025-02-19 04:09:14
Categories: cs.CR,cs.AI
Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction
The API Knowledge Graph (API KG) is a structured network that models API entities and their relations, providing essential semantic insights for tasks such as API recommendation, code generation, and API misuse detection. However, constructing a knowledge-rich and reliable API KG presents several challenges. Existing schema-based methods rely heavily on manual annotations to design KG schemas, leading to excessive manual overhead. On the other hand, schema-free methods, due to the lack of schema guidance, are prone to introducing noise, reducing the KG's reliability. To address these issues, we propose the Explore-Construct-Filter framework, an automated approach for API KG construction based on large language models (LLMs). This framework consists of three key modules: 1) KG exploration: LLMs simulate the workflow of annotators to automatically design a schema with comprehensive type triples, minimizing human intervention; 2) KG construction: Guided by the schema, LLMs extract instance triples to construct a rich yet unreliable API KG; 3) KG filtering: Removing invalid type triples and suspicious instance triples to construct a rich and reliable API KG. Experimental results demonstrate that our method surpasses the state-of-the-art method, achieving a 25.2% improvement in F1 score. Moreover, the Explore-Construct-Filter framework proves effective, with the KG exploration module increasing KG richness by 133.6% and the KG filtering module improving reliability by 26.6%. Finally, cross-model experiments confirm the generalizability of our framework.
Updated: 2025-02-19 03:51:31
Categories: cs.SE,cs.AI
A Transfer Attack to Image Watermarks
Watermarking has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detectors against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In this work, we propose a new transfer evasion attack against image watermarks in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show, both theoretically and empirically, that watermark-based AI-generated image detectors built on existing watermarking methods are not robust to evasion attacks even if the attacker has access to neither the watermarking model nor the detection API. Our code is available at: https://github.com/hifi-hyp/Watermark-Transfer-Attack.
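The real attack perturbs images against learned neural watermark decoders via iterative optimization; as a hedged toy sketch of the transfer idea, assume linear surrogate decoders (decoder i flags a watermark when w_i · x > 0) and take one signed step against their average:

```python
def detects(w, x):
    """Toy surrogate watermark decoder: flags a watermark if w . x > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) > 0

def transfer_perturb(x, surrogates, eps=0.5):
    """Single signed-gradient step against the averaged surrogate
    decoders; the hope is that the same perturbation also transfers
    to an unseen target decoder."""
    n = len(surrogates)
    avg = [sum(w[i] for w in surrogates) / n for i in range(len(x))]
    return [xi - eps * (1 if ai > 0 else -1 if ai < 0 else 0)
            for xi, ai in zip(x, avg)]
```

When the target decoder is correlated with the surrogates, evading the ensemble tends to evade the target as well, which is the transfer phenomenon the paper studies.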
Updated: 2025-02-19 03:50:04
Categories: cs.CR,cs.CL,cs.LG
Tell Me Why: Incentivizing Explanations
Common sense suggests that when individuals explain why they believe something, we can arrive at more accurate conclusions than when they simply state what they believe. Yet, there is no known mechanism that provides incentives to elicit explanations for beliefs from agents. This likely stems from the fact that standard Bayesian models make assumptions (like conditional independence of signals) that preempt the need for explanations, in order to show efficient information aggregation. A natural justification for the value of explanations is that agents' beliefs tend to be drawn from overlapping sources of information, so agents' belief reports do not reveal all that needs to be known. Indeed, this work argues that rationales-explanations of an agent's private information-lead to more efficient aggregation by allowing agents to efficiently identify what information they share and what information is new. Building on this model of rationales, we present a novel 'deliberation mechanism' to elicit rationales from agents in which truthful reporting of beliefs and rationales is a perfect Bayesian equilibrium.
Updated: 2025-02-19 03:47:34
Categories: cs.GT,cs.AI,econ.TH
FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading
Large language models (LLMs) fine-tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi-step, goal-oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision-making. To address this, we propose \textsc{FLAG-Trader}, a unified architecture integrating linguistic processing (via LLMs) with gradient-driven reinforcement learning (RL) policy optimization, in which a partially fine-tuned LLM acts as the policy network, leveraging pre-trained knowledge while adapting to the financial domain through parameter-efficient fine-tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial-domain tasks. We present extensive empirical evidence to validate these enhancements.
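In the paper the policy network is a partially fine-tuned LLM; as a stand-in sketch of the same policy-gradient mechanism, here is one REINFORCE update on a tiny softmax policy over trading actions, where the gradient of log pi(a) with respect to the logits is (one_hot(a) - probs), scaled by the trading reward (the action set and learning rate are illustrative):

```python
import math

ACTIONS = ["buy", "hold", "sell"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, action_idx, reward, lr=0.1):
    """One REINFORCE update: move the logits along
    reward * grad log pi(action), i.e. reward * (one_hot - probs)."""
    probs = softmax(logits)
    return [l + lr * reward * ((1.0 if i == action_idx else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]
```

Repeated positive rewards for an action raise its probability, which is the loop FLAG-Trader runs with the LLM's trading decisions and market rewards.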
Updated: 2025-02-19 03:40:56
Categories: cs.AI,cs.CE,q-fin.TR
Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
As large language models (LLMs) become an important way of information access, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs' implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreements to biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs' inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development.
Updated: 2025-02-19 03:37:38
Categories: cs.CL,cs.AI
Differentially Private Learning Beyond the Classical Dimensionality Regime
We initiate the study of differentially private learning in the proportional dimensionality regime, in which the number of data samples $n$ and problem dimension $d$ approach infinity at rates proportional to one another, meaning that $d/n\to\delta$ as $n\to\infty$ for an arbitrary, given constant $\delta\in(0,\infty)$. This setting is significantly more challenging than that of all prior theoretical work in high-dimensional differentially private learning, which, despite the name, has assumed that $\delta = 0$ or is sufficiently small for problems of sample complexity $O(d)$, a regime typically considered "low-dimensional" or "classical" by modern standards in high-dimensional statistics. We provide sharp theoretical estimates of the error of several well-studied differentially private algorithms for robust linear regression and logistic regression, including output perturbation, objective perturbation, and noisy stochastic gradient descent, in the proportional dimensionality regime. The $1+o(1)$ factor precision of our error estimates enables a far more nuanced understanding of the price of privacy of these algorithms than that afforded by existing, coarser analyses, which are essentially vacuous in the regime we consider. Using our estimates, we discover a previously unobserved "double descent"-like phenomenon in the training error of objective perturbation for robust linear regression. We also identify settings in which output perturbation outperforms objective perturbation on average, and vice versa, demonstrating that the relative performance of these algorithms is less clear-cut than suggested by prior work. To prove our main theorems, we introduce several probabilistic tools that have not previously been used to analyze differentially private learning algorithms, such as a modern Gaussian comparison inequality and recent universality laws with origins in statistical physics.
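Output perturbation, one of the algorithms analyzed, can be sketched in a few lines for one-dimensional ridge regression: solve the non-private problem in closed form, then add Gaussian noise to the released solution. The noise scale below is a free knob, not the calibrated value; the paper derives the exact error in the proportional regime where d/n approaches a constant:

```python
import random

def ridge_1d(xs, ys, lam=1.0):
    """Closed-form 1-D ridge regression:
    argmin_theta sum (y - theta*x)^2 + lam * theta^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def output_perturbation(xs, ys, lam=1.0, sigma=0.1, rng=None):
    """Differentially private release: non-private solution plus Gaussian
    noise. In a real deployment sigma is calibrated from the solution's
    sensitivity and the privacy budget (epsilon, delta)."""
    rng = rng or random.Random(0)
    return ridge_1d(xs, ys, lam) + rng.gauss(0.0, sigma)
```

Objective perturbation instead adds a random linear term to the loss before solving; the paper's point is that which of the two wins depends on the regime.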
Updated: 2025-02-19 03:35:35
Categories: cs.LG,cs.CR,cs.DS
JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework
Deep learning has achieved significant success in the field of remote sensing image change detection (CD), yet two major challenges remain: the scarcity of sub-meter, all-inclusive open-source CD datasets, and the difficulty of achieving consistent and satisfactory detection results across images with varying change areas. To address these issues, we introduce the JL1-CD dataset, which contains 5,000 pairs of 512 x 512 pixel images with a resolution of 0.5 to 0.75 meters. Additionally, we propose a multi-teacher knowledge distillation (MTKD) framework for CD. Experimental results on the JL1-CD and SYSU-CD datasets demonstrate that the MTKD framework significantly improves the performance of CD models with various network architectures and parameter sizes, achieving new state-of-the-art results. The code is available at https://github.com/circleLZY/MTKD-CD.
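The abstract does not spell out the distillation objective; a common multi-teacher formulation — and a plausible sketch here — adds the mean per-teacher KL divergence to the supervised task loss (the mixing weight `alpha` is hypothetical):

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for discrete probability vectors."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mtkd_loss(student_probs, teacher_probs_list, task_loss, alpha=0.5):
    """Toy multi-teacher KD loss: supervised task loss plus the mean KL
    from each teacher's prediction to the student's."""
    kd = sum(kl_div(t, student_probs) for t in teacher_probs_list)
    kd /= len(teacher_probs_list)
    return (1 - alpha) * task_loss + alpha * kd
```

When every teacher agrees with the student the distillation term vanishes and only the task loss remains.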
Updated: 2025-02-19 03:33:54
Categories: cs.CV,cs.AI
Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks
Generative control policies have recently unlocked major progress in robotics. These methods produce action sequences via diffusion or flow matching, with training data provided by demonstrations. But despite enjoying considerable success on difficult manipulation problems, generative policies come with two key limitations. First, behavior cloning requires expert demonstrations, which can be time-consuming and expensive to obtain. Second, existing methods are limited to relatively slow, quasi-static tasks. In this paper, we leverage a tight connection between sampling-based predictive control and generative modeling to address each of these issues. In particular, we introduce generative predictive control, a supervised learning framework for tasks with fast dynamics that are easy to simulate but difficult to demonstrate. We then show how trained flow-matching policies can be warm-started at run-time, maintaining temporal consistency and enabling fast feedback rates. We believe that generative predictive control offers a complementary approach to existing behavior cloning methods, and hope that it paves the way toward generalist policies that extend beyond quasi-static demonstration-oriented tasks.
Updated: 2025-02-19 03:33:01
Categories: cs.RO,cs.AI,cs.SY,eess.SY
Learning to Discover Regulatory Elements for Gene Expression Prediction
We consider the problem of predicting gene expressions from DNA sequences. A key challenge of this task is to find the regulatory elements that control gene expressions. Here, we introduce Seq2Exp, a Sequence to Expression network explicitly designed to discover and extract regulatory elements that drive target gene expression, enhancing the accuracy of the gene expression prediction. Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causal active regulatory elements, and apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines in gene expression prediction tasks and discovers influential regions compared to commonly used statistical methods for peak detection such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).
Updated: 2025-02-19 03:25:49
Categories: q-bio.GN,cs.AI
Forward and Inverse Simulation of Pseudo-Two-Dimensional Model of Lithium-Ion Batteries Using Neural Networks
In this work, we address the challenges posed by the high nonlinearity of the Butler-Volmer (BV) equation in forward and inverse simulations of the pseudo-two-dimensional (P2D) model using the physics-informed neural network (PINN) framework. The BV equation presents significant challenges for PINNs, primarily due to the hyperbolic sine term, which renders the Hessian of the PINN loss function highly ill-conditioned. To address this issue, we introduce a bypassing term that improves numerical stability by substantially reducing the condition number of the Hessian matrix. Furthermore, the small magnitude of the ionic flux \( j \) often leads to a common failure mode where PINNs converge to incorrect solutions. We demonstrate that incorporating a secondary conservation law for the solid-phase potential \( \psi \) effectively prevents such convergence issues and ensures solution accuracy. The proposed methods prove effective for solving both forward and inverse problems involving the BV equation. Specifically, we achieve precise parameter estimation in inverse scenarios and reliable solution predictions for forward simulations.
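A quick numerical illustration of why the hyperbolic sine term is troublesome (the paper's bypassing term itself is not reproduced here): the curvature of sinh grows exponentially with its argument, so modest changes in the Butler-Volmer exponent swing the loss Hessian across many orders of magnitude:

```python
import math

# The second derivative of sinh is sinh itself; compare the curvature
# at a small vs. a moderately large argument of the BV exponent.
curv_small = abs(math.sinh(0.5))
curv_large = abs(math.sinh(20.0))
ratio = curv_large / curv_small  # spans many orders of magnitude
```

A loss whose Hessian eigenvalues differ by such ratios is severely ill-conditioned for gradient-based PINN training, which motivates reducing the condition number.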
Updated: 2025-02-19 03:25:32
Categories: physics.comp-ph,cs.LG
Object-Pose Estimation With Neural Population Codes
Robotic assembly tasks require object-pose estimation, particularly for tasks that avoid costly mechanical constraints. Object symmetry complicates the direct mapping of sensory input to object rotation, as the rotation becomes ambiguous and lacks a unique training target. Some proposed solutions involve evaluating multiple pose hypotheses against the input or predicting a probability distribution, but these approaches suffer from significant computational overhead. Here, we show that representing object rotation with a neural population code overcomes these limitations, enabling a direct mapping to rotation and end-to-end learning. As a result, population codes facilitate fast and accurate pose estimation. On the T-LESS dataset, we achieve inference in 3.2 milliseconds on an Apple M1 CPU and a Maximum Symmetry-Aware Surface Distance accuracy of 84.7% using only gray-scale image input, compared to 69.7% accuracy when directly mapping to pose.
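A minimal sketch of the population-code idea for a planar rotation angle (the paper handles full object pose; this toy is 2-D only): encode the angle as the responses of cosine-tuned units and decode it with the classic population-vector readout:

```python
import math

def encode(theta, n_units=16):
    """Cosine tuning: each unit responds according to the cosine of the
    distance between the angle and its preferred angle (responses may be
    negative, as in classic population-vector models)."""
    prefs = [2 * math.pi * i / n_units for i in range(n_units)]
    return [math.cos(theta - p) for p in prefs]

def decode(responses):
    """Population-vector readout: response-weighted circular mean of
    the units' preferred angles."""
    n = len(responses)
    prefs = [2 * math.pi * i / n for i in range(n)]
    x = sum(r * math.cos(p) for r, p in zip(responses, prefs))
    y = sum(r * math.sin(p) for r, p in zip(responses, prefs))
    return math.atan2(y, x) % (2 * math.pi)
```

Because the code is distributed over many units, a symmetric object can legitimately activate several peaks at once, which is why this representation sidesteps the single-target ambiguity the abstract describes.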
Updated: 2025-02-19 03:23:43
Categories: cs.RO,cs.LG
CipherGuard: Compiler-aided Mitigation against Ciphertext Side-channel Attacks
Cryptographic implementations bolster security against timing side-channel attacks by integrating constant-time components. However, the new ciphertext side channels resulting from the deterministic memory encryption in Trusted Execution Environments (TEEs), enable ciphertexts to manifest identifiable patterns when being sequentially written to the same memory address. Attackers with read access to encrypted memory in TEEs can potentially deduce plaintexts by analyzing these changing ciphertext patterns. In this paper, we design CipherGuard, a compiler-aided mitigation methodology to counteract ciphertext side channels with high efficiency and security. CipherGuard is based on the LLVM ecosystem, and encompasses multiple mitigation strategies, including software-based probabilistic encryption and secret-aware register allocation. Through a comprehensive evaluation, we demonstrate that CipherGuard can strengthen the security of various cryptographic implementations more efficiently than existing state-of-the-art defense mechanism, i.e., CipherFix.
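The leak the paper targets can be demonstrated with a toy cipher (a hash stands in for real memory encryption; this is not CipherGuard's construction): a deterministic scheme maps identical plaintexts at the same address to identical ciphertexts, so an attacker reading only ciphertext can tell when a secret repeats, while mixing in a fresh nonce per write — a stand-in for software-based probabilistic encryption — destroys the pattern:

```python
import hashlib
import os

def det_encrypt(key, addr, plaintext):
    """Toy deterministic memory encryption: the ciphertext depends only
    on (key, address, plaintext), as with TEE memory encryption."""
    return hashlib.sha256(key + addr + plaintext).digest()

def prob_encrypt(key, addr, plaintext, nonce=None):
    """Toy probabilistic variant: a fresh nonce is mixed into each
    write, so equal plaintexts at the same address no longer collide."""
    nonce = nonce if nonce is not None else os.urandom(16)
    return nonce + hashlib.sha256(key + addr + nonce + plaintext).digest()
```

The cost of the probabilistic variant is the extra nonce per write, which is why compiler support is useful for applying it only where secrets are at risk.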
Updated: 2025-02-19 03:22:36
Categories: cs.CR,cs.AR
$\mathtt{GeLLM^3O}$: Generalizing Large Language Models for Multi-property Molecule Optimization
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce $\mathtt{MoMUInstruct}$, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging $\mathtt{MoMUInstruct}$, we develop $\mathtt{GeLLM^3O}$s, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that $\mathtt{GeLLM^3O}$s consistently outperform state-of-the-art baselines. $\mathtt{GeLLM^3O}$s also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of $\mathtt{GeLLM^3O}$s as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. $\mathtt{MoMUInstruct}$, models, and code are accessible through https://github.com/ninglab/GeLLMO.
Updated: 2025-02-19 03:14:11
Categories: cs.LG,cs.AI,cs.CL,physics.chem-ph,q-bio.QM
Mitigating Heterogeneity among Factor Tensors via Lie Group Manifolds for Tensor Decomposition Based Temporal Knowledge Graph Embedding
Recent studies have highlighted the effectiveness of tensor decomposition methods in the Temporal Knowledge Graphs Embedding (TKGE) task. However, we found that inherent heterogeneity among factor tensors in tensor decomposition significantly hinders the tensor fusion process and further limits the performance of link prediction. To overcome this limitation, we introduce a novel method that maps factor tensors onto a unified smooth Lie group manifold to make the distribution of factor tensors approximately homogeneous in tensor decomposition. We provide a theoretical proof of our motivation: homogeneous tensors are more effective than heterogeneous tensors in tensor fusion and in approximating the target of tensor decomposition based TKGE methods. The proposed method can be directly integrated into existing tensor decomposition based TKGE methods without introducing extra parameters. Extensive experiments demonstrate the effectiveness of our method in mitigating the heterogeneity and in enhancing the tensor decomposition based TKGE models.
Updated: 2025-02-19 03:11:50
Domains: cs.LG,cs.AI,cs.CL
Ensemble based approach to quantifying uncertainty of LLM based classifications
The output of Large Language Models (LLMs) is a function of the internal model's parameters and the input provided in the context window. The hypothesis presented here is that, under a greedy sampling strategy, the variance in the LLM's output is a function of the conceptual certainty embedded in the model's parametric knowledge, as well as the lexical variance in the input. Fine-tuning the model reduces the sensitivity of the model output to lexical input variations. This is then applied to a classification problem, and a probabilistic method is proposed for estimating the certainties of the predicted classes.
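The ensemble certainty estimate described above can be sketched as simple vote counting: the same query is classified several times under lexical variation, and class certainties are the empirical label frequencies. The labels and vote counts here are hypothetical, standing in for real classifier outputs.

```python
from collections import Counter

def class_certainties(predictions):
    """Estimate per-class certainty as the empirical frequency of each
    predicted label across an ensemble of lexically varied inputs."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}

# Hypothetical ensemble: the same query paraphrased 5 ways, classified greedily.
votes = ["positive", "positive", "negative", "positive", "positive"]
print(class_certainties(votes))  # {'positive': 0.8, 'negative': 0.2}
```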
Updated: 2025-02-19 03:09:59
Domains: cs.AI
Unsupervised CP-UNet Framework for Denoising DAS Data with Decay Noise
Distributed acoustic sensor (DAS) technology leverages optical fiber cables to detect acoustic signals, providing cost-effective and dense monitoring capabilities. It offers several advantages, including resistance to extreme conditions, immunity to electromagnetic interference, and accurate detection. However, DAS typically exhibits a lower signal-to-noise ratio (S/N) than geophones and is susceptible to various noise types, such as random noise, erratic noise, level noise, and long-period noise. This reduced S/N can negatively impact data analyses involving inversion and interpretation. While artificial intelligence has demonstrated excellent denoising capabilities, most existing methods rely on supervised learning with labeled data, which imposes stringent requirements on label quality. To address this issue, we develop a label-free unsupervised learning (UL) network model based on Context-Pyramid-UNet (CP-UNet) to suppress erratic and random noise in DAS data. The CP-UNet utilizes the Context Pyramid Module in the encoding and decoding process to extract features and reconstruct the DAS data. To enhance the connectivity between shallow and deep features, we add a Connected Module (CM) to both the encoding and decoding sections. Layer Normalization (LN) is utilized in place of the commonly employed Batch Normalization (BN), accelerating the convergence of the model and preventing gradient explosion during training. Huber loss is adopted as our loss function, whose parameters are experimentally determined. We apply the network to both 2-D synthetic and field data. Compared to traditional denoising methods and the latest UL framework, our proposed method demonstrates superior noise-reduction performance.
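The Huber loss adopted above is quadratic for small residuals and linear for large ones, which damps the influence of erratic (outlier) noise samples. A minimal numpy version; the default `delta` here is illustrative, since the paper determines the parameters experimentally.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Huber loss: 0.5*r^2 for |r| <= delta, delta*(|r| - 0.5*delta) otherwise."""
    r = np.abs(residual)
    quadratic = 0.5 * r**2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)
```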
Updated: 2025-02-19 03:09:49
Domains: cs.SD,cs.LG,eess.AS,eess.SP,physics.optics
Flow-based generative models as iterative algorithms in probability space
Generative AI (GenAI) has revolutionized data-driven modeling by enabling the synthesis of high-dimensional data across various applications, including image generation, language modeling, biomedical signal processing, and anomaly detection. Flow-based generative models provide a powerful framework for capturing complex probability distributions, offering exact likelihood estimation, efficient sampling, and deterministic transformations between distributions. These models leverage invertible mappings governed by Ordinary Differential Equations (ODEs), enabling precise density estimation and likelihood evaluation. This tutorial presents an intuitive mathematical framework for flow-based generative models, formulating them as neural network-based representations of continuous probability densities. We explore key theoretical principles, including the Wasserstein metric, gradient flows, and density evolution governed by ODEs, to establish convergence guarantees and bridge empirical advancements with theoretical insights. By providing a rigorous yet accessible treatment, we aim to equip researchers and practitioners with the necessary tools to effectively apply flow-based generative models in signal processing and machine learning.
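The exact likelihood evaluation mentioned above follows from the change-of-variables formula for invertible maps. A one-dimensional affine-flow sketch under a standard normal base distribution; the scale and shift values are illustrative.

```python
import numpy as np

def affine_flow_logpdf(x, scale=2.0, shift=1.0):
    """Exact log-density under an invertible affine flow z = (x - shift)/scale:
    log p(x) = log N(z; 0, 1) - log|scale| (the log-det-Jacobian term)."""
    z = (x - shift) / scale
    base_logpdf = -0.5 * (z**2 + np.log(2 * np.pi))
    return base_logpdf - np.log(abs(scale))
```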
Updated: 2025-02-19 03:09:18
Domains: cs.LG,math.ST,stat.ML,stat.TH
Intelligent Tutors for Adult Learners: An Analysis of Needs and Challenges
This work examines the sociotechnical factors that influence the adoption and usage of intelligent tutoring systems in self-directed learning contexts, focusing specifically on adult learners. The study is divided into two parts. First, we present Apprentice Tutors, a novel intelligent tutoring system designed to address the unique needs of adult learners. The platform includes adaptive problem selection, real-time feedback, and visual dashboards to support learning in college algebra topics. Second, we investigate the specific needs and experiences of adult users through a deployment study and a series of focus groups. Using thematic analysis, we identify key challenges and opportunities to improve tutor design and adoption. Based on these findings, we offer actionable design recommendations to help developers create intelligent tutoring systems that better align with the motivations and learning preferences of adult learners. This work contributes to a wider understanding of how to improve educational technologies to support lifelong learning and professional development.
Updated: 2025-02-19 03:08:14
Domains: cs.CY,cs.AI,cs.HC
Controllable Unlearning for Image-to-Image Generative Models via $\varepsilon$-Constrained Optimization
While generative models have made significant advancements in recent years, they also raise concerns such as privacy breaches and biases. Machine unlearning has emerged as a viable solution, aiming to remove specific training data, e.g., containing private information and bias, from models. In this paper, we study the machine unlearning problem in Image-to-Image (I2I) generative models. Previous studies mainly treat it as a single-objective optimization problem, offering a solitary solution, thereby neglecting the varied user expectations towards the trade-off between complete unlearning and model utility. To address this issue, we propose a controllable unlearning framework that uses a control coefficient $\varepsilon$ to control the trade-off. We reformulate the I2I generative model unlearning problem into an $\varepsilon$-constrained optimization problem and solve it with a gradient-based method to find optimal solutions for unlearning boundaries. These boundaries define the valid range for the control coefficient. Within this range, every yielded solution is theoretically guaranteed to be Pareto optimal. We also analyze the convergence rate of our framework under various control functions. Extensive experiments on two benchmark datasets across three mainstream I2I models demonstrate the effectiveness of our controllable unlearning framework.
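The $\varepsilon$-constrained idea can be sketched on scalar quadratics: descend the forgetting objective while keeping the utility loss within the user-chosen budget $\varepsilon$. The objectives and the feasibility-switching rule below are illustrative stand-ins, not the paper's actual unlearning and utility losses or its gradient method.

```python
def constrained_descent(theta, grad_forget, utility, grad_utility,
                        eps, lr=0.01, steps=200):
    """Toy epsilon-constrained descent: whenever the utility constraint
    utility(theta) <= eps is violated, follow the utility gradient to
    restore feasibility; otherwise make progress on the forgetting loss."""
    for _ in range(steps):
        if utility(theta) > eps:
            theta -= lr * grad_utility(theta)  # restore feasibility
        else:
            theta -= lr * grad_forget(theta)   # make unlearning progress
    return theta

# Illustrative: forget(t) = (t - 2)^2, utility(t) = t^2, budget eps = 1.
# The iterate is driven to the constraint boundary near t = 1.
theta = constrained_descent(0.0, lambda t: 2 * (t - 2),
                            lambda t: t * t, lambda t: 2 * t, eps=1.0)
```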
Updated: 2025-02-19 03:06:59
Domains: cs.LG,cs.AI,cs.CV
Bridging the Data Provenance Gap Across Text, Speech and Video
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets, carry non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem-level, and that visibility into these questions are essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.
Updated: 2025-02-19 03:05:56
Domains: cs.AI,cs.CL,cs.CY,cs.LG,cs.MM
Atomic Proximal Policy Optimization for Electric Robo-Taxi Dispatch and Charger Allocation
Pioneering companies such as Waymo have deployed robo-taxi services in several U.S. cities. These robo-taxis are electric vehicles, and their operations require the joint optimization of ride matching, vehicle repositioning, and charging scheduling in a stochastic environment. We model the operations of the ride-hailing system with robo-taxis as a discrete-time, average-reward Markov Decision Process with infinite horizon. As the fleet size grows, dispatching becomes challenging, as the system state set and the fleet dispatching action set grow exponentially with the number of vehicles. To address this, we introduce a scalable deep reinforcement learning algorithm, called Atomic Proximal Policy Optimization (Atomic-PPO), that reduces the action space using atomic action decomposition. We evaluate our algorithm using real-world NYC for-hire vehicle data, and we measure performance using the long-run average reward achieved by the dispatching policy relative to a fluid-based reward upper bound. Our experiments demonstrate the superior performance of our Atomic-PPO compared to benchmarks. Furthermore, we conduct extensive numerical experiments to analyze the efficient allocation of charging facilities and assess the impact of vehicle range and charger speed on fleet performance.
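The exponential blowup described above can be made concrete with a back-of-the-envelope comparison: choosing one of $k$ actions for each of $n$ vehicles simultaneously gives $k^n$ joint actions, whereas dispatching vehicles one at a time keeps each decision step at $k$ options. The vehicle and action counts below are illustrative.

```python
def joint_action_space(n_vehicles, n_actions):
    """Joint dispatch: one action per vehicle, chosen simultaneously."""
    return n_actions ** n_vehicles

def atomic_action_space(n_vehicles, n_actions):
    """Atomic decomposition: vehicles are dispatched one at a time,
    so each decision step only chooses among n_actions options."""
    return n_actions

# 10 vehicles with 5 options each: ~9.8 million joint actions vs 5 atomic.
```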
Updated: 2025-02-19 03:05:23
Domains: cs.AI
Deep-Unfolded Massive Grant-Free Transmission in Cell-Free Wireless Communication Systems
Grant-free transmission and cell-free communication are vital in improving coverage and quality-of-service for massive machine-type communication. This paper proposes a novel framework of joint active user detection, channel estimation, and data detection (JACD) for massive grant-free transmission in cell-free wireless communication systems. We formulate JACD as an optimization problem and solve it approximately using forward-backward splitting. To deal with the discrete symbol constraint, we relax the discrete constellation to its convex hull and propose two approaches that promote solutions from the constellation set. To reduce complexity, we replace costly computations with approximate shrinkage operations and approximate posterior mean estimator computations. To improve active user detection (AUD) performance, we introduce a soft-output AUD module that considers both the data estimates and channel conditions. To jointly optimize all algorithm hyper-parameters and to improve JACD performance, we further deploy deep unfolding together with a momentum strategy, resulting in two algorithms called DU-ABC and DU-POEM. Finally, we demonstrate the efficacy of the proposed JACD algorithms via extensive system simulations.
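Forward-backward splitting alternates a gradient (forward) step on the smooth part of the objective with a proximal (backward) step on the non-smooth part. A generic sketch on a toy $\ell_1$-regularized scalar problem; the soft-thresholding prox here is a standard textbook example, standing in for the paper's constellation-promoting operators and approximate shrinkage.

```python
import numpy as np

def forward_backward_step(x, grad_smooth, lam, lr):
    """One forward-backward split: gradient step on the smooth term,
    then the prox of lam * |x| (soft-thresholding)."""
    x = x - lr * grad_smooth(x)                                # forward
    return np.sign(x) * np.maximum(np.abs(x) - lr * lam, 0.0)  # backward

# Toy problem: min 0.5 * (x - 3)^2 + |x|, whose solution is x = 2.
x = np.array([0.0])
for _ in range(100):
    x = forward_backward_step(x, lambda v: v - 3.0, lam=1.0, lr=0.5)
```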
Updated: 2025-02-19 03:04:10
Domains: eess.SP,cs.IT,cs.LG,math.IT
Reasoning with Reinforced Functional Token Tuning
In this work, we propose Reinforced Functional Token Tuning (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with self-play learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (e.g., <analyze>, <verify>, <refine>) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for reasoning; and (2) online reinforcement learning further allows the model to explore different reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks, significantly boosting Qwen-2.5-7B-Instruct (70.6% to 79.8%) and LLaMA-3.1-8B-Instruct (32.2% to 60.2%) on the MATH dataset. Moreover, the performance of RFTT consistently improves with more search rollouts at inference time. Our code is available at https://github.com/sastpg/RFTT.
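Embedding functional tokens directly into the vocabulary amounts to reserving fresh token ids for them. A minimal sketch with a plain dict standing in for a real tokenizer's vocabulary (a real implementation would also extend the embedding matrix accordingly).

```python
def add_functional_tokens(vocab, tokens=("<analyze>", "<verify>", "<refine>")):
    """Assign each functional token a fresh id after the existing entries."""
    vocab = dict(vocab)  # copy so the original mapping is untouched
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab
```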
Updated: 2025-02-19 02:59:42
Domains: cs.AI
PTB-Image: A Scanned Paper ECG Dataset for Digitization and Image-based Diagnosis
Electrocardiograms (ECGs) recorded on paper remain prevalent in clinical practice, yet their use presents challenges for automated analysis and digital storage. To address this issue, we introduce PTB-Image, a dataset comprising scanned paper ECGs with corresponding digital signals, enabling research on ECG digitization. We also provide VinDigitizer, a digitization baseline to convert paper-based ECGs into digital time-series signals. The method involves detecting signal rows, extracting waveforms from the background, and reconstructing numerical values from the digitized traces. We applied VinDigitizer to 549 scanned ECGs and evaluated its performance against the original PTB dataset (modified to match the printed signals). The results achieved a mean signal-to-noise ratio (SNR) of 0.01 dB, highlighting both the feasibility and challenges of ECG digitization, particularly in mitigating distortions from printing and scanning processes. By providing PTB-Image and baseline digitization methods, this work aims to facilitate advancements in ECG digitization, enhancing access to historical ECG data and supporting applications in telemedicine and automated cardiac diagnostics.
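The SNR figure reported above is the standard decibel ratio of reference-signal power to reconstruction-error power. A minimal numpy version; the arrays here are illustrative, not ECG data.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference digital trace
    and its digitized reconstruction."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / np.sum(noise**2))
```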
Updated: 2025-02-19 02:56:27
Domains: cs.CV,cs.AI
Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
StarCraft II is a complex and dynamic real-time strategy (RTS) game environment that is well suited to artificial intelligence and reinforcement learning research. To address the problem of Large Language Model (LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes (ROE) framework based on expert experience and self-experience. This framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the previous experience to obtain new self-experience. Finally, in our experiments, our method beat the built-in bot at the Very Hard difficulty in TextStarCraft II. We analyze the LLM's in-game data in detail and verify the method's effectiveness.
Updated: 2025-02-19 02:53:43
Domains: cs.AI
Myna: Masking-Based Contrastive Learning of Musical Representations
We present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations: (1) the use of a Vision Transformer (ViT) on mel-spectrograms as the backbone and (2) a novel data augmentation strategy, token masking, that masks 90 percent of spectrogram tokens. These innovations deliver both effectiveness and efficiency: (i) Token masking enables a significant increase in per-GPU batch size, from 48 or 120 in prior methods (CLMR, MULE) to 4096. (ii) By avoiding traditional augmentations, Myna retains pitch sensitivity, enhancing performance in tasks like key detection. (iii) The use of vertical patches allows the model to better capture critical features for key detection. Our hybrid model, Myna-22M-Hybrid, processes both 16x16 and 128x2 patches, achieving state-of-the-art results. Trained on a single GPU, it outperforms MULE (62M) on average and rivals MERT-95M, which was trained on 16 and 64 GPUs, respectively. Additionally, it surpasses MERT-95M-public, establishing itself as the best-performing model trained on publicly available data. We release our code and models to promote reproducibility and facilitate future research.
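The token-masking augmentation can be sketched in a few lines: keep a random 10 percent of spectrogram tokens and feed only those to the encoder, which is what makes the large per-GPU batch size possible. The token shapes and rng seed below are illustrative.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio=0.9, rng=None):
    """Keep a random (1 - mask_ratio) subset of spectrogram tokens;
    only the visible tokens are processed by the ViT encoder."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = max(1, round(n * (1 - mask_ratio)))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return tokens[keep], keep
```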
Updated: 2025-02-19 02:47:43
Domains: cs.SD,cs.AI,cs.LG
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification
In line with test-time scaling, the integration of External Slow-Thinking with the Verify mechanism has been demonstrated to enhance multi-round reasoning in large language models (LLMs). However, in the multimodal (MM) domain, there is still a lack of a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (COT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista, surpassing GPT-4o (63.8) with 12 rollouts.
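Rejection sampling for data synthesis keeps only the candidates that pass verification. A schematic version with stub generator and verifier functions; in the paper these would be the tree-search sampler and the trained MM-Verifier.

```python
def rejection_sample(generate, verify, n_candidates=8):
    """Draw candidate chains of thought and keep only verified ones."""
    return [c for c in (generate(i) for i in range(n_candidates)) if verify(c)]

# Stub generator/verifier for illustration: keep even-numbered candidates.
kept = rejection_sample(generate=lambda i: i, verify=lambda c: c % 2 == 0)
```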
Updated: 2025-02-19 02:46:52
Domains: cs.CL,cs.CV,cs.LG
Simplify RLHF as Reward-Weighted SFT: A Variational Method
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high implementation complexity and computational cost. Even with recent simplifications, such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), the problems of over-fitting and training instability remain, hindering the alignment process from reaching the expected optimal performance. To address the existing challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called $\textbf{V}$ariational $\textbf{A}$lignment with $\textbf{R}$e-weighting ($\textbf{VAR}$). More specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven re-weighted supervised fine-tuning (SFT) form, which only requires minor adjustments to the SFT loss to obtain noticeable improvements in training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method achieves numerically competitive performance in LLM alignment helpfulness and harmlessness.
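A reward-driven re-weighted SFT objective reduces to a weighted negative log-likelihood over sequences. The numpy sketch below is a generic form of that idea; the exponential weighting with temperature `beta` and the batch normalization are assumptions for illustration, not the paper's exact derivation.

```python
import numpy as np

def reward_weighted_sft_loss(logprobs, rewards, beta=1.0):
    """Weighted NLL: each sequence's log-likelihood is re-weighted by
    exp(reward / beta), with weights normalized over the batch."""
    w = np.exp(np.asarray(rewards) / beta)
    w = w / w.sum()
    return -np.sum(w * np.asarray(logprobs))
```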
Updated: 2025-02-19 02:44:40
Domains: cs.LG,cs.AI,cs.CL
Raising the Bar in Graph OOD Generalization: Invariant Learning Beyond Explicit Environment Modeling
Out-of-distribution (OOD) generalization has emerged as a critical challenge in graph learning, as real-world graph data often exhibit diverse and shifting environments that traditional models fail to generalize across. A promising solution to address this issue is graph invariant learning (GIL), which aims to learn invariant representations by disentangling label-correlated invariant subgraphs from environment-specific subgraphs. However, existing GIL methods face two major challenges: (1) the difficulty of capturing and modeling diverse environments in graph data, and (2) the semantic cliff, where invariant subgraphs from different classes are difficult to distinguish, leading to poor class separability and increased misclassifications. To tackle these challenges, we propose a novel method termed Multi-Prototype Hyperspherical Invariant Learning (MPHIL), which introduces two key innovations: (1) hyperspherical invariant representation extraction, enabling robust and highly discriminative hyperspherical invariant feature extraction, and (2) multi-prototype hyperspherical classification, which employs class prototypes as intermediate variables to eliminate the need for explicit environment modeling in GIL and mitigate the semantic cliff issue. Derived from the theoretical framework of GIL, we introduce two novel objective functions: the invariant prototype matching loss to ensure samples are matched to the correct class prototypes, and the prototype separation loss to increase the distinction between prototypes of different classes in the hyperspherical space. Extensive experiments on 11 OOD generalization benchmark datasets demonstrate that MPHIL achieves state-of-the-art performance, significantly outperforming existing methods across graph data from various domains and with different distribution shifts.
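Classification with class prototypes on the hypersphere amounts to nearest-prototype assignment under cosine similarity. A minimal sketch; the two-dimensional prototypes and embedding below are illustrative, not the paper's learned representations.

```python
import numpy as np

def prototype_classify(z, prototypes):
    """Assign an embedding to the nearest class prototype by cosine
    similarity (all vectors L2-normalized onto the unit sphere)."""
    z = z / np.linalg.norm(z)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(P @ z))
```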
Updated: 2025-02-19 02:41:12
Domains: cs.LG,cs.AI
AutoTEE: Automated Migration and Protection of Programs in Trusted Execution Environments
Trusted Execution Environments (TEEs) isolate a special space within a device's memory that is not accessible to the normal world (also known as Untrusted Environment), even when the device is compromised. Thus, developers can utilize TEEs to provide strong security guarantees for their programs, making sensitive operations like encrypted data storage, fingerprint verification, and remote attestation protected from malicious attacks. Despite the strong protections offered by TEEs, adapting existing programs to leverage such security guarantees is non-trivial, often requiring extensive domain knowledge and manual intervention, which makes TEEs less accessible to developers. This motivates us to design AutoTEE, the first Large Language Model (LLM)-enabled approach that can automatically identify, partition, transform, and port sensitive functions into TEEs with minimal developer intervention. By manually reviewing 68 repositories, we constructed a benchmark dataset consisting of 385 sensitive functions eligible for transformation, on which AutoTEE achieves a high F1 score of 0.91. AutoTEE effectively transforms these sensitive functions into their TEE-compatible counterparts, achieving success rates of 90% and 83% for Java and Python, respectively. We further provide a mechanism to automatically port the transformed code to different TEE platforms, including Intel SGX and AMD SEV, demonstrating that the transformed programs run successfully and correctly on these platforms.
Updated: 2025-02-19 02:37:00
Domains: cs.CR,cs.SE
Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations
Next-token prediction (NTP) over large text corpora has become the go-to paradigm to train large language models. Yet, it remains unclear how NTP influences the mapping of linguistic patterns to geometric properties of the resulting model representations. We frame training of large language models as soft-label classification over sparse probabilistic label vectors, coupled with an analytical approximation that allows unrestricted generation of context embeddings. This approach links NTP training to rank-constrained, nuclear-norm regularized optimization in the logit domain, offering a framework for analyzing the geometry of word and context embeddings. In large embedding spaces, we find that NTP implicitly favors learning logits with a sparse plus low-rank structure. While the sparse component captures the co-occurrence frequency of context-word pairs, the orthogonal low-rank component, which becomes dominant as training progresses, depends solely on the sparsity pattern of the co-occurrence matrix. Consequently, when projected onto an appropriate subspace, representations of contexts that are followed by the same set of next-tokens collapse, a phenomenon we term subspace-collapse. We validate our findings on synthetic and small-scale real language datasets. Finally, we outline potential research directions aimed at deepening the understanding of NTP's influence on the learning of linguistic patterns and regularities.
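The sparse probabilistic label vectors above are just empirical next-token distributions: for each context, most vocabulary entries have probability zero. A minimal sketch over (context, next-token) pairs; the toy corpus is illustrative.

```python
from collections import Counter

def soft_labels(corpus_pairs, context):
    """Empirical next-token distribution for a context: a sparse
    probabilistic label vector built from co-occurrence counts."""
    counts = Counter(tok for ctx, tok in corpus_pairs if ctx == context)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}
```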
Updated: 2025-02-19 02:31:46
Subjects: cs.CL,cs.LG
An explainable transformer circuit for compositional generalization
Compositional generalization-the systematic combination of known components into novel structures-remains a core challenge in cognitive science and machine learning. Although transformer-based large language models can exhibit strong performance on certain compositional tasks, the underlying mechanisms driving these abilities remain opaque, calling into question their interpretability. In this work, we identify and mechanistically interpret the circuit responsible for compositional induction in a compact transformer. Using causal ablations, we validate the circuit and formalize its operation using a program-like description. We further demonstrate that this mechanistic understanding enables precise activation edits to steer the model's behavior predictably. Our findings advance the understanding of complex behaviors in transformers and highlight such insights can provide a direct pathway for model control.
Updated: 2025-02-19 02:30:41
Subjects: cs.LG,cs.AI,cs.CL
Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model
The complexity of scenes and variations in image quality result in significant variability in the performance of semantic segmentation methods of remote sensing imagery (RSI) in supervised real-world scenarios. This makes the evaluation of semantic segmentation quality in such scenarios an issue to be resolved. However, most of the existing evaluation metrics are developed based on expert-labeled object-level annotations, which are not applicable in such scenarios. To address this issue, we propose RS-SQA, an unsupervised quality assessment model for RSI semantic segmentation based on vision language model (VLM). This framework leverages a pre-trained RS VLM for semantic understanding and utilizes intermediate features from segmentation methods to extract implicit information about segmentation quality. Specifically, we introduce CLIP-RS, a large-scale pre-trained VLM trained with purified text to reduce textual noise and capture robust semantic information in the RS domain. Feature visualizations confirm that CLIP-RS can effectively differentiate between various levels of segmentation quality. Semantic features and low-level segmentation features are effectively integrated through a semantic-guided approach to enhance evaluation accuracy. To further support the development of RS semantic segmentation quality assessment, we present RS-SQED, a dedicated dataset sampled from four major RS semantic segmentation datasets and annotated with segmentation accuracy derived from the inference results of 8 representative segmentation methods. Experimental results on the established dataset demonstrate that RS-SQA significantly outperforms state-of-the-art quality assessment models. This provides essential support for predicting segmentation accuracy and high-quality semantic segmentation interpretation, offering substantial practical value.
Updated: 2025-02-19 02:28:12
Subjects: eess.IV,cs.LG
Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment
As the capabilities of large language models (LLMs) continue to expand, aligning these models with human values remains a significant challenge. Recent studies show that reasoning abilities contribute significantly to model safety, while integrating Mixture-of-Experts (MoE) architectures can further enhance alignment. In this work, we propose Mixture of insighTful Experts (MoTE), a novel framework that synergistically combines reasoning chains and expert mixtures to improve self-alignment. From a data perspective, MoTE employs a structured reasoning chain comprising four key stages: Question Analysis, Answer Guidance, Safe Answer, and Safety Checking. This approach enhances safety through multi-step reasoning and proves effective even for smaller and less powerful LLMs (e.g., 7B models). From an architectural perspective, MoTE adopts a multi-LoRA framework with step-level routing, where each expert is dedicated to a specific reasoning step. This design eliminates the need for balance losses, ensures stable training, and supports adaptive inference lengths. Experimental results demonstrate that MoTE significantly improves model safety, jailbreak resistance, and over-refusal capabilities, achieving performance comparable to OpenAI's state-of-the-art o1 model.
Updated: 2025-02-19 02:26:09
Subjects: cs.CL,cs.AI
Learning Symbolic Task Decompositions for Multi-Agent Teams
One approach for improving sample efficiency in cooperative multi-agent learning is to decompose overall tasks into sub-tasks that can be assigned to individual agents. We study this problem in the context of reward machines: symbolic tasks that can be formally decomposed into sub-tasks. In order to handle settings without a priori knowledge of the environment, we introduce a framework that can learn the optimal decomposition from model-free interactions with the environment. Our method uses a task-conditioned architecture to simultaneously learn an optimal decomposition and the corresponding agents' policies for each sub-task. In doing so, we remove the need for a human to manually design the optimal decomposition while maintaining the sample-efficiency benefits of improved credit assignment. We provide experimental results in several deep reinforcement learning settings, demonstrating the efficacy of our approach. Our results indicate that our approach succeeds even in environments with codependent agent dynamics, enabling synchronous multi-agent learning not achievable in previous works.
Updated: 2025-02-19 02:24:44
Subjects: cs.MA,cs.AI,cs.LG,F.2.2
Fighter Jet Navigation and Combat using Deep Reinforcement Learning with Explainable AI
This paper presents the development of an Artificial Intelligence (AI) based fighter jet agent within a customized Pygame simulation environment, designed to solve multi-objective tasks via deep reinforcement learning (DRL). The jet's primary objectives include efficiently navigating the environment, reaching a target, and selectively engaging or evading an enemy. A reward function balances these goals while optimized hyperparameters enhance learning efficiency. Results show more than 80% task completion rate, demonstrating effective decision-making. To enhance transparency, the jet's action choices are analyzed by comparing the rewards of the actual chosen action (factual action) with those of alternate actions (counterfactual actions), providing insights into the decision-making rationale. This study illustrates DRL's potential for multi-objective problem-solving with explainable AI. Project page is available at: https://github.com/swatikar95/Autonomous-Fighter-Jet-Navigation-and-Combat
Updated: 2025-02-19 02:14:27
Subjects: cs.AI
WRF-GS: Wireless Radiation Field Reconstruction with 3D Gaussian Splatting
Wireless channel modeling plays a pivotal role in designing, analyzing, and optimizing wireless communication systems. Nevertheless, developing an effective channel modeling approach has been a longstanding challenge. This issue has been escalated due to the denser network deployment, larger antenna arrays, and wider bandwidth in 5G and beyond networks. To address this challenge, we put forth WRF-GS, a novel framework for channel modeling based on wireless radiation field (WRF) reconstruction using 3D Gaussian splatting. WRF-GS employs 3D Gaussian primitives and neural networks to capture the interactions between the environment and radio signals, enabling efficient WRF reconstruction and visualization of the propagation characteristics. The reconstructed WRF can then be used to synthesize the spatial spectrum for comprehensive wireless channel characterization. Notably, with a small number of measurements, WRF-GS can synthesize new spatial spectra within milliseconds for a given scene, thereby enabling latency-sensitive applications. Experimental results demonstrate that WRF-GS outperforms existing methods for spatial spectrum synthesis, such as ray tracing and other deep-learning approaches. Moreover, WRF-GS achieves superior performance in the channel state information prediction task, surpassing existing methods by a significant margin of more than 2.43 dB.
Updated: 2025-02-19 02:13:32
Subjects: cs.NI,cs.AI,cs.LG
Generalizable Humanoid Manipulation with 3D Diffusion Policies
Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to one specific scene, primarily due to the difficulty of acquiring generalizable skills and the expensiveness of in-the-wild humanoid robot data. In this work, we build a real-world robotic system to address this challenging problem. Our system is mainly an integration of 1) a whole-upper-body robotic teleoperation system to acquire human-like robot data, 2) a 25-DoF humanoid robot platform with a height-adjustable cart and a 3D LiDAR sensor, and 3) an improved 3D Diffusion Policy learning algorithm for humanoid robots to learn from noisy human data. We run more than 2000 episodes of policy rollouts on the real robot for rigorous policy evaluation. Empowered by this system, we show that using only data collected in one single scene and with only onboard computing, a full-sized humanoid robot can autonomously perform skills in diverse real-world scenarios. Videos are available at https://humanoid-manipulation.github.io.
Updated: 2025-02-19 02:13:13
Subjects: cs.RO,cs.CV,cs.LG
Quantum Recurrent Neural Networks with Encoder-Decoder for Time-Dependent Partial Differential Equations
Nonlinear time-dependent partial differential equations are essential in modeling complex phenomena across diverse fields, yet they pose significant challenges due to their computational complexity, especially in higher dimensions. This study explores Quantum Recurrent Neural Networks within an encoder-decoder framework, integrating Variational Quantum Circuits into Gated Recurrent Units and Long Short-Term Memory networks. Using this architecture, the model efficiently compresses high-dimensional spatiotemporal data into a compact latent space, facilitating more efficient temporal evolution. We evaluate the algorithms on the Hamilton-Jacobi-Bellman equation, Burgers' equation, the Gray-Scott reaction-diffusion system, and the three dimensional Michaelis-Menten reaction-diffusion equation. The results demonstrate the superior performance of the quantum-based algorithms in capturing nonlinear dynamics, handling high-dimensional spaces, and providing stable solutions, highlighting their potential as an innovative tool in solving challenging and complex systems.
Updated: 2025-02-19 02:09:43
Subjects: cs.LG,cs.NA,math.NA,quant-ph
Pretrained Image-Text Models are Secretly Video Captioners
Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that by using minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. We transform it into a competitive video captioner by post-training a typical image captioning model, BLIP2, with only 6,000 video-text pairs and simply concatenating frames (significantly less data than other methods, which use 2.5 to 144 million pairs). From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image-based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.
Updated: 2025-02-19 01:53:03
Subjects: cs.CV,cs.LG
Dynamic directed functional connectivity as a neural biomarker for objective motor skill assessment
Objective motor skill assessment plays a critical role in fields such as surgery, where proficiency is vital for certification and patient safety. Existing assessment methods, however, rely heavily on subjective human judgment, which introduces bias and limits reproducibility. While recent efforts have leveraged kinematic data and neural imaging to provide more objective evaluations, these approaches often overlook the dynamic neural mechanisms that differentiate expert and novice performance. This study proposes a novel method for motor skill assessment based on dynamic directed functional connectivity (dFC) as a neural biomarker. By using electroencephalography (EEG) to capture brain dynamics and employing an attention-based Long Short-Term Memory (LSTM) model for non-linear Granger causality analysis, we compute dFC among key brain regions involved in psychomotor tasks. Coupled with hierarchical task analysis (HTA), our approach enables subtask-level evaluation of motor skills, offering detailed insights into neural coordination that underpins expert proficiency. A convolutional neural network (CNN) is then used to classify skill levels, achieving greater accuracy and specificity than established performance metrics in laparoscopic surgery. This methodology provides a reliable, objective framework for assessing motor skills, contributing to the development of tailored training protocols and enhancing the certification process.
Updated: 2025-02-19 01:51:39
Subjects: q-bio.NC,cs.LG
RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering
Medical question answering requires extensive access to specialized conceptual knowledge. The current paradigm, Retrieval-Augmented Generation (RAG), acquires expertise medical knowledge through large-scale corpus retrieval and uses this knowledge to guide a general-purpose large language model (LLM) for generating answers. However, existing retrieval approaches often overlook the importance of factual knowledge, which limits the relevance of retrieved conceptual knowledge and restricts its applicability in real-world scenarios, such as clinical decision-making based on Electronic Health Records (EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval framework that retrieves both relevant factual and conceptual knowledge from dual sources (i.e., EHRs and the corpus), allowing them to interact and refine each another. Through extensive evaluation across three factual-aware medical question answering benchmarks, RGAR establishes a new state-of-the-art performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings demonstrate the benefit of extracting factual knowledge for retrieval, which consistently yields improved generation quality.
Updated: 2025-02-19 01:50:10
Subjects: cs.CL,cs.AI
Cluster Aware Graph Anomaly Detection
Graph anomaly detection has gained significant attention across various domains, particularly in critical applications like fraud detection in e-commerce platforms and insider threat detection in cybersecurity. Usually, these data are composed of multiple types (e.g., user information and transaction records for financial data), thus exhibiting view heterogeneity. However, in the era of big data, the heterogeneity of views and the lack of label information pose substantial challenges to traditional approaches. Existing unsupervised graph anomaly detection methods often struggle with high-dimensionality issues, rely on strong assumptions about graph structures or fail to handle complex multi-view graphs. To address these challenges, we propose a cluster aware multi-view graph anomaly detection method, called CARE. Our approach captures both local and global node affinities by augmenting the graph's adjacency matrix with the pseudo-label (i.e., soft membership assignments) without any strong assumption about the graph. To mitigate potential biases from the pseudo-label, we introduce a similarity-guided loss. Theoretically, we show that the proposed similarity-guided loss is a variant of contrastive learning loss, and we present how this loss alleviates the bias introduced by pseudo-label with the connection to graph spectral clustering. Experimental results on several datasets demonstrate the effectiveness and efficiency of our proposed framework. Specifically, CARE outperforms the second-best competitors by more than 39% on the Amazon dataset with respect to AUPRC and 18.7% on the YelpChi dataset with respect to AUROC. The code of our method is available at the GitHub link: https://github.com/zhenglecheng/CARE-demo.
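The adjacency-augmentation idea can be pictured with a minimal sketch (a hypothetical toy graph; the mixing weight `alpha` is an assumed illustrative hyperparameter, not taken from the paper): soft membership assignments add a global, cluster-level affinity term on top of the local edges.

```python
import numpy as np

# Toy graph with two obvious communities: nodes 0-2 and nodes 3-5.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

# Soft cluster memberships (pseudo-labels); each row sums to 1.
P = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
              [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])

# Augment the adjacency with pseudo-label affinities: P @ P.T is large for
# node pairs likely to share a cluster, adding global affinity to A's
# purely local edges.
alpha = 0.5  # assumed mixing weight
A_aug = A + alpha * (P @ P.T)
```

Same-cluster pairs end up with strictly larger augmented affinity than cross-cluster pairs, which is the behavior the soft-membership augmentation is meant to encode.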
Updated: 2025-02-19 01:41:40
Subjects: cs.LG
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances in offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making it affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.
Updated: 2025-02-19 01:38:06
Subjects: cs.AI,cs.HC
A Comprehensive Survey on Composed Image Retrieval
Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration.
Updated: 2025-02-19 01:37:24
Subjects: cs.MM,cs.AI,cs.CV,cs.IR
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models
Most existing benchmarking approaches for evaluating the output quality of large language models (LLMs) rely on comparing LLM responses to predefined references. Such methods, based on static datasets, quickly become outdated as LLM capabilities and use cases evolve. In this work, we introduce VARCO Arena--a novel, cost-effective, and robust benchmarking approach that leverages a single-elimination tournament structure to minimize the number of required comparisons while eliminating the need for static references or costly human annotations. We validate our approach through two experiments: (i) a simulation study that examines its robustness under various conditions, and (ii) an empirical evaluation using publicly available benchmark prompts. In both experiments, VARCO Arena consistently outperforms current LLM benchmarking practices, achieving stronger correlations with human-established Elo ratings. Our results demonstrate that VARCO Arena not only produces reliable LLM rankings but also provides a scalable, adaptable solution for qualitative evaluation across diverse, customized use cases.
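The bracket structure can be sketched as follows (a hypothetical stand-in: `judge` abstracts the pairwise comparison, which in VARCO Arena would be an LLM judge over matched responses, and all model names are made up); with n models a single-elimination bracket needs only n - 1 comparisons, versus O(n^2) for round-robin.

```python
import random

def single_elim_rank(models, judge, rng=random.Random(0)):
    """Run a single-elimination bracket over `models`.

    `judge(a, b)` returns the winner of one pairwise comparison.
    Returns the champion, the losers in elimination order, and the
    total number of comparisons performed.
    """
    pool = list(models)
    rng.shuffle(pool)  # random initial seeding
    eliminated = []
    comparisons = 0
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            w = judge(a, b)
            comparisons += 1
            nxt.append(w)
            eliminated.append(a if w == b else b)
        if len(pool) % 2:      # odd model out gets a bye this round
            nxt.append(pool[-1])
        pool = nxt
    return pool[0], eliminated, comparisons

# Hypothetical deterministic judge: higher "skill" always wins.
skills = {"m1": 0.9, "m2": 0.5, "m3": 0.7, "m4": 0.3}
winner, losers, n_cmp = single_elim_rank(
    skills, lambda a, b: a if skills[a] >= skills[b] else b)
```

With a consistent judge the strongest model always survives the bracket, and the comparison count stays linear in the number of models.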
Updated: 2025-02-19 01:34:31
Subjects: cs.CL,cs.AI
Learning Variational Inequalities from Data: Fast Generalization Rates under Strong Monotonicity
Variational inequalities (VIs) are a broad class of optimization problems encompassing machine learning problems ranging from standard convex minimization to more complex scenarios like min-max optimization and computing the equilibria of multi-player games. In convex optimization, strong convexity allows for fast statistical learning rates requiring only $\Theta(1/\epsilon)$ stochastic first-order oracle calls to find an $\epsilon$-optimal solution, rather than the standard $\Theta(1/\epsilon^2)$ calls. This note provides a simple overview of how one can similarly obtain fast $\Theta(1/\epsilon)$ rates for learning VIs that satisfy strong monotonicity, a generalization of strong convexity. Specifically, we demonstrate that standard stability-based generalization arguments for convex minimization extend directly to VIs when the domain admits a small covering, or when the operator is integrable and suboptimality is measured by potential functions; such as when finding equilibria in multi-player games.
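For concreteness, a strongly monotone VI asks for x* in a convex set K with <F(x*), x - x*> >= 0 for all x in K, where the symmetric part of F's Jacobian is uniformly positive definite. A minimal numerical sketch (toy operator and step size chosen for illustration, not the note's statistical analysis):

```python
import numpy as np

# Toy strongly monotone VI on the box K = [0, 1]^2 with affine operator
# F(x) = A x + b; the symmetric part of A is 2*I, so F is 2-strongly monotone.
A = np.array([[2.0, 1.0], [-1.0, 2.0]])
b = np.array([-1.0, -1.0])
F = lambda x: A @ x + b
proj = lambda x: np.clip(x, 0.0, 1.0)   # Euclidean projection onto the box

eta = 0.1  # step size small enough for the forward step to contract
x = np.zeros(2)
for _ in range(200):
    x = proj(x - eta * F(x))  # projected forward (gradient-like) step

# x* is characterized as the fixed point x* = proj(x* - eta * F(x*)).
residual = np.linalg.norm(x - proj(x - eta * F(x)))
```

Here the solution is interior, so it simply solves F(x) = 0, giving x* = (0.2, 0.6); strong monotonicity is what makes the iteration contract linearly to it.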
Updated: 2025-02-19 01:15:26
Subjects: cs.LG,math.OC,stat.ML
Comparing the information content of probabilistic representation spaces
Probabilistic representation spaces convey information about a dataset and are shaped by factors such as the training data, network architecture, and loss function. Comparing the information content of such spaces is crucial for understanding the learning process, yet most existing methods assume point-based representations, neglecting the distributional nature of probabilistic spaces. To address this gap, we propose two information-theoretic measures to compare general probabilistic representation spaces by extending classic methods to compare the information content of hard clustering assignments. Additionally, we introduce a lightweight method of estimation that is based on fingerprinting a representation space with a sample of the dataset, designed for scenarios where the communicated information is limited to a few bits. We demonstrate the utility of these measures in three case studies. First, in the context of unsupervised disentanglement, we identify recurring information fragments within individual latent dimensions of VAE and InfoGAN ensembles. Second, we compare the full latent spaces of models and reveal consistent information content across datasets and methods, despite variability during training. Finally, we leverage the differentiability of our measures to perform model fusion, synthesizing the information content of weak learners into a single, coherent representation. Across these applications, the direct comparison of information content offers a natural basis for characterizing the processing of information.
Updated: 2025-02-19 01:10:36
Categories: cs.LG
Evaluation for Regression Analyses on Evolving Data Streams
The paper explores the challenges of regression analysis in evolving data streams, an area that remains relatively underexplored compared to classification. We propose a standardized evaluation process for regression and prediction interval tasks in streaming contexts. Additionally, we introduce an innovative drift simulation strategy capable of synthesizing various drift types, including the less-studied incremental drift. Comprehensive experiments with state-of-the-art methods, conducted under the proposed process, validate the effectiveness and robustness of our approach.
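The drift-simulation idea can be illustrated with a minimal, hypothetical sketch (not the paper's implementation; all names ours): a synthetic regression stream whose underlying linear concept interpolates gradually between two coefficient vectors, producing the incremental drift the abstract highlights as less studied.

```python
import numpy as np

def incremental_drift_stream(n_samples, start_coef, end_coef, noise=0.1, seed=0):
    """Yield (x, y) pairs from a linear concept whose coefficients
    interpolate linearly from start_coef to end_coef over the stream,
    simulating incremental drift."""
    rng = np.random.default_rng(seed)
    start = np.asarray(start_coef, dtype=float)
    end = np.asarray(end_coef, dtype=float)
    for t in range(n_samples):
        frac = t / max(n_samples - 1, 1)        # 0 -> 1 across the stream
        coef = (1 - frac) * start + frac * end  # slowly moving concept
        x = rng.normal(size=len(coef))
        y = float(coef @ x + rng.normal(scale=noise))
        yield x, y
```

Abrupt drift would instead switch `coef` at a single point, and gradual drift would sample from the old and new concepts with shifting probability; a standardized evaluation then scores a regressor prequentially (predict, then train) along such streams.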
Updated: 2025-02-19 01:03:33
Categories: cs.LG,cs.AI
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well-established in standard reinforcement learning (RL), a practically-implementable and theoretically-grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations. In this paper, we introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO) -- which regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a $\textit{sign}$ to indicate whether the optimism or pessimism is chosen. VPO also directly optimizes the policy with implicit reward modeling, and therefore shares a simpler RLHF pipeline similar to direct preference optimization. Theoretical guarantees of VPO are provided for both online and offline settings, matching the rates of their standard RL counterparts. Moreover, experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
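The regularized objective described above can be written schematically (our notation, a sketch rather than the paper's exact formulation): the reward estimate maximizes the Bradley-Terry log-likelihood of the preference data plus a sign-modulated value term,

```latex
\hat{r} \;=\; \arg\max_{r}\;
\underbrace{\sum_{(x,\,y^{+},\,y^{-})\in\mathcal{D}}
\log \sigma\!\big(r(x, y^{+}) - r(x, y^{-})\big)}_{\text{maximum-likelihood fit to preferences}}
\;+\; s\,\alpha\,\underbrace{V(r)}_{\text{value of the policy induced by } r}
```

where $\sigma$ is the logistic function, $\alpha > 0$ a regularization weight, and $s = +1$ selects optimism (suited to online exploration) while $s = -1$ selects pessimism (suited to the offline setting), implementing uncertainty-aware regularization without explicit confidence intervals.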
Updated: 2025-02-19 00:51:58
Categories: cs.LG,cs.AI,stat.ML
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
LLMs have demonstrated impressive proficiency in generating coherent and high-quality text, making them valuable across a range of text-generation tasks. However, rigorous evaluation of this generated content is crucial, as ensuring its quality remains a significant challenge due to persistent issues such as factual inaccuracies and hallucination. This paper introduces three fine-tuned general-purpose LLM autoevaluators, REC-8B, REC-12B and REC-70B, specifically designed to evaluate generated text across several dimensions: faithfulness, instruction following, coherence, and completeness. These models not only provide ratings for these metrics but also offer detailed explanation and verifiable citation, thereby enhancing trust in the content. Moreover, the models support various citation modes, accommodating different requirements for latency and granularity. Extensive evaluations on diverse benchmarks demonstrate that our general-purpose LLM auto-evaluator, REC-70B, outperforms state-of-the-art LLMs, excelling in content evaluation by delivering better quality explanation and citation with minimal bias. It achieves Rank #1 as of Feb 15th, 2025 as a generative model on the RewardBench leaderboard under the model name TextEval-Llama3.1-70B. Our REC dataset and models are available at https://github.com/adelaidehsu/REC.
Updated: 2025-02-19 00:50:10
Categories: cs.CL,cs.AI
Pareto optimal proxy metrics
North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments based on the north star metric directly can be difficult. The two most significant issues are 1) low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star in experiment evaluation and launch decisions. Existing literature on proxy metrics concentrates mainly on the estimation of the long-term impact from short-term experimental data. In this paper, instead, we focus on the trade-off between the estimation of the long-term impact and the sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system, and found proxy metrics that are eight times more sensitive than the north star and consistently moved in the same direction, increasing the velocity and the quality of the decisions to launch new features.
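The Pareto-optimality criterion at the heart of the method can be sketched as a generic non-dominated filter (not the paper's multi-objective optimization algorithm): each candidate proxy metric is scored on the two objectives, prediction accuracy for the north star and short-term sensitivity, and only candidates not dominated on both axes are retained.

```python
def pareto_front(points):
    """Return indices of non-dominated points, where each point is a
    (prediction_accuracy, sensitivity) pair and higher is better on both."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

A highly sensitive but poorly predictive proxy and a predictive but insensitive one can both sit on the front; the paper's contribution is an efficient algorithm for searching this trade-off directly rather than enumerating candidates.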
Updated: 2025-02-19 00:37:08
Categories: stat.ME,cs.LG
KOALA: Knowledge Conflict Augmentations for Robustness in Vision Language Models
The robustness of large language models (LLMs) against knowledge conflicts in unimodal question answering systems has been well studied. However, the effect of conflicts in information sources on vision language models (VLMs) in multimodal settings has not yet been explored. In this work, we propose SegSub, a framework that applies targeted perturbations to image sources to study and improve the robustness of VLMs against three different types of knowledge conflicts, namely parametric, source, and counterfactual conflicts. Contrary to prior findings that showed that LLMs are sensitive to parametric conflicts arising from textual perturbations, we find VLMs are largely robust to image perturbation. On the other hand, VLMs perform poorly on counterfactual examples (<30% accuracy) and fail to reason over source conflicts (<1% accuracy). We also find a link between hallucinations and image context, with GPT-4o prone to hallucination when presented with highly contextualized counterfactual examples. While challenges persist with source conflicts, finetuning models significantly improves reasoning over counterfactual samples. Our findings highlight the need for VLM training methodologies that enhance their reasoning capabilities, particularly in addressing complex knowledge conflicts between multimodal sources.
Updated: 2025-02-19 00:26:38
Categories: cs.CV,cs.AI,cs.CL,cs.LG
GneissWeb: Preparing High Quality Data for LLMs at Scale
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage points advantage over those trained on FineWeb-V1.1.0.
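Exact sub-string deduplication, one ingredient of the recipe, can be illustrated with a much-simplified single-shard sketch (parameters, stride sampling, and hashing choices are ours, not the paper's): a document is dropped if any of its sampled fixed-length character windows has already been seen in an earlier document.

```python
from hashlib import blake2b

def dedup_exact_substrings(docs, span=50):
    """Keep each document only if none of its length-`span` character
    windows (sampled at stride `span`) was seen in an earlier document.
    A real pipeline shards `seen` across workers and matches exhaustively
    rather than at a fixed stride."""
    seen, kept = set(), []
    for doc in docs:
        hashes = {
            blake2b(doc[i:i + span].encode(), digest_size=8).digest()
            for i in range(0, max(len(doc) - span + 1, 1), span)
        }
        if hashes.isdisjoint(seen):
            kept.append(doc)
            seen |= hashes
    return kept
```

The quality-filter ensemble would then run downstream of this step, scoring each surviving document with several classifiers and heuristics and keeping those that pass a combined threshold.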
Updated: 2025-02-19 00:14:29
Categories: cs.CL,cs.AI