    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        

Articles: 0

Last Updated: N/A (+00:00)

Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors

Human-object interaction (HOI) synthesis is important for various applications, ranging from virtual reality to robotics. However, acquiring 3D HOI data is challenging due to its complexity and high cost, limiting existing methods to the narrow diversity of object types and interaction patterns in training datasets. This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets. The core idea of our method lies in leveraging extensive HOI knowledge from pre-trained multimodal models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models, which are then uplifted to 3D HOI milestones of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images. Our estimation method is adaptive to various object templates obtained from text-to-3D models or online retrieval. A physics-based tracking of the 3D HOI kinematic milestones is further applied to refine both body motions and object poses, yielding more physically plausible HOI generation results. The experimental results demonstrate that our method is capable of generating open-vocabulary HOIs with physical realism and semantic diversity.
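
The category-level 6-DoF step above boils down to estimating a rotation and translation that place an object template in the scene. A minimal sketch of applying such a pose (the Euler-angle convention and all names are illustrative assumptions, not the paper's code):

```python
import math

def pose_matrix(yaw, pitch, roll, t):
    """Build a 4x4 rigid transform from Euler angles (radians) and a translation."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    # R = Rz(yaw) @ Ry(pitch) @ Rx(roll)
    R = [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ]
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def transform(points, T):
    """Apply a 4x4 pose to a list of 3D template vertices."""
    out = []
    for x, y, z in points:
        out.append(tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
                         for i in range(3)))
    return out
```

With a pose like this estimated per frame, the template tracks the object through the 2D sequence.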

Updated: 2025-03-25 23:55:47

Fields: cs.GR,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.20118v1

From Interpretation to Correction: A Decentralized Optimization Framework for Exact Convergence in Federated Learning

This work introduces a novel decentralized framework to interpret federated learning (FL) and, consequently, correct the biases introduced by arbitrary client participation and data heterogeneity, which are two typical traits in practical FL. Specifically, we first reformulate the core processes of FedAvg - client participation, local updating, and model aggregation - as stochastic matrix multiplications. This reformulation allows us to interpret FedAvg as a decentralized algorithm. Leveraging the decentralized optimization framework, we are able to provide a concise analysis to quantify the impact of arbitrary client participation and data heterogeneity on FedAvg's convergence point. This insight motivates the development of Federated Optimization with Exact Convergence via Push-pull Strategy (FOCUS), a novel algorithm inspired by the decentralized algorithm that eliminates these biases and achieves exact convergence without requiring the bounded heterogeneity assumption. Furthermore, we theoretically prove that FOCUS exhibits linear convergence (exponential decay) for both strongly convex and non-convex functions satisfying the Polyak-Lojasiewicz condition, regardless of the arbitrary nature of client participation.
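
The "stochastic matrix multiplication" view of FedAvg can be sketched in a few lines: with one scalar parameter per client, a round of participation-plus-aggregation is a row-stochastic mixing matrix applied to the stacked client states (a toy illustration, not the paper's formulation):

```python
def fedavg_round(client_params, participants):
    """One FedAvg round written as a stochastic matrix multiplication.

    client_params: list of n scalars (one parameter per client, for clarity).
    participants: the subset of client indices taking part this round.
    Every row of W equals the uniform average over participants, so W is
    row-stochastic and x_next = W @ x is the aggregated model broadcast back
    to all clients.
    """
    n = len(client_params)
    m = len(participants)
    W = [[(1.0 / m if j in participants else 0.0) for j in range(n)]
         for _ in range(n)]
    return [sum(W[i][j] * client_params[j] for j in range(n)) for i in range(n)]
```

With arbitrary participation, repeated rounds pull the model toward the mean of whichever clients show up, which is exactly the bias the decentralized analysis quantifies.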

Updated: 2025-03-25 23:54:23

Fields: cs.LG,cs.DC

Download: http://arxiv.org/abs/2503.20117v1

Persistent Homology for Structural Characterization in Disordered Systems

We propose a unified framework based on persistent homology (PH) to characterize both local and global structures in disordered systems. It can simultaneously generate local and global descriptors using the same algorithm and data structure, and has been shown to be highly effective and interpretable in predicting particle rearrangements and classifying global phases. We also demonstrate that using a single variable enables a linear SVM to achieve nearly perfect three-phase classification. Inspired by this discovery, we define a non-parametric metric, the Separation Index (SI), which not only achieves this classification without sacrificing significant performance but also establishes a connection between particle environments and the global phase structure. Our methods provide an effective framework for understanding and analyzing the properties of disordered materials, with broad potential applications in materials science and even wider studies of complex systems.
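
The abstract does not define the Separation Index itself, but the observation that a linear SVM on a single variable suffices has a simple consequence: classification reduces to cut points on the real line. A hedged sketch of that reduction (the threshold values and phase labels below are made up):

```python
def threshold_classify(x, cuts, labels):
    """Classify a scalar descriptor into phases by ordered thresholds.

    A linear SVM acting on a single variable reduces to cut points on the
    real line; three phases need only two cuts.
    """
    for cut, label in zip(cuts, labels):
        if x < cut:
            return label
    return labels[-1]
```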

Updated: 2025-03-25 23:53:07

Fields: cond-mat.dis-nn,cond-mat.mtrl-sci,cs.LG,math-ph,math.MP,55N31, 62R40,I.3.5

Download: http://arxiv.org/abs/2411.14390v6

Semi-Decision-Focused Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization

I propose Semi-Decision-Focused Learning, a practical adaptation of Decision-Focused Learning for portfolio optimization. Rather than directly optimizing complex financial metrics, I employ simple target portfolios (Max-Sortino or One-Hot) and train models with a convex, cross-entropy loss. I further incorporate Deep Ensemble methods to reduce variance and stabilize performance. Experiments on two universes (one upward-trending and another range-bound) show consistent outperformance over baseline portfolios, demonstrating the effectiveness and robustness of my approach. Code is available at https://github.com/sDFLwDE/sDFLwDE
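
The training recipe above, a convex cross-entropy loss against a simple target portfolio plus deep-ensemble averaging, can be sketched as follows (function names and the toy setup are illustrative, not the released code):

```python
import math

def softmax(z):
    """Numerically stable softmax: logits -> portfolio weights summing to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(target_w, logits):
    """Convex loss between a target portfolio (e.g., One-Hot) and softmax(logits)."""
    p = softmax(logits)
    return -sum(t * math.log(max(q, 1e-12)) for t, q in zip(target_w, p))

def ensemble_portfolio(all_logits):
    """Deep-ensemble style: average the member portfolios to reduce variance."""
    ports = [softmax(z) for z in all_logits]
    n = len(ports)
    return [sum(p[i] for p in ports) / n for i in range(len(ports[0]))]
```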

Updated: 2025-03-25 23:42:02

Fields: cs.LG,q-fin.CP,q-fin.PM

Download: http://arxiv.org/abs/2503.13544v2

TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models

Classical planning formulations like the Planning Domain Definition Language (PDDL) admit action sequences guaranteed to achieve a goal state given an initial state if any are possible. However, reasoning problems defined in PDDL do not capture temporal aspects of action taking, such as concurrent actions between two agents when there are no conflicting conditions, without significant modification and definition to existing PDDL domains. A human expert aware of such constraints can decompose a goal into subgoals, each reachable through single agent planning, to take advantage of simultaneous actions. In contrast to classical planning, large language models (LLMs) directly used for inferring plan steps rarely guarantee execution success, but are capable of leveraging commonsense reasoning to assemble action sequences. We combine the strengths of both classical planning and LLMs by approximating human intuitions for multi-agent planning goal decomposition. We demonstrate that LLM-based goal decomposition leads to faster planning times than solving multi-agent PDDL problems directly while simultaneously achieving fewer plan execution steps than a single agent plan alone, as well as most multiagent plans, while guaranteeing execution success. Additionally, we find that LLM-based approximations of subgoals result in similar multi-agent execution lengths to those specified by human experts. Website and resources at https://glamor-usc.github.io/twostep
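
The payoff of LLM-based goal decomposition is that per-agent single-agent plans can execute concurrently when no conditions conflict, so the makespan is the longest plan rather than the sum of lengths. A toy sketch of merging such plans into simultaneous timesteps (the merge here ignores conflicts entirely and is purely illustrative):

```python
def merge_concurrent(plans):
    """Interleave per-agent single-agent plans into concurrent timesteps.

    plans: dict agent -> list of actions.  Each timestep executes one action
    per agent (None once an agent's plan is exhausted).
    """
    horizon = max(len(p) for p in plans.values())
    steps = []
    for t in range(horizon):
        steps.append({a: (p[t] if t < len(p) else None)
                      for a, p in plans.items()})
    return steps
```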

Updated: 2025-03-25 23:39:13

Fields: cs.AI,cs.CL,cs.MA,cs.RO

Download: http://arxiv.org/abs/2403.17246v2

Domain Adaptation Framework for Turning Movement Count Estimation with Limited Data

Urban transportation networks are vital for the efficient movement of people and goods, necessitating effective traffic management and planning. An integral part of traffic management is understanding turning movement counts (TMCs) at intersections; accurate TMCs are crucial for traffic signal control, congestion mitigation, and road safety. In general, TMCs are obtained using physical sensors installed at intersections, but this approach can be cost-prohibitive and technically challenging, especially for cities with extensive road networks. Recent advancements in machine learning and data-driven approaches have offered promising alternatives for estimating TMCs. Traffic patterns can vary significantly across different intersections due to factors such as road geometry, traffic signal settings, and local driver behaviors. This domain discrepancy limits the generalizability and accuracy of machine learning models when applied to new or unseen intersections. In response to these limitations, this research proposes a novel framework leveraging domain adaptation (DA) to estimate TMCs at intersections by using traffic controller event-based data, road infrastructure data, and point-of-interest (POI) data. Evaluated on 30 intersections in Tucson, Arizona, the proposed DA framework was compared with state-of-the-art models and achieved the lowest values in terms of Mean Absolute Error and Root Mean Square Error.
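
Domain adaptation can take many forms; as a hedged illustration of the general idea only, the simplest feature-level variant shifts source-intersection features so their statistics match the target intersection's (mean matching; the paper's DA framework is certainly more sophisticated than this):

```python
def mean_shift_align(source_feats, target_feats):
    """Align source features to the target domain by matching per-dimension means.

    source_feats / target_feats: lists of equal-length feature vectors from
    the source and target intersections, respectively.
    """
    dims = len(source_feats[0])
    mu_s = [sum(f[d] for f in source_feats) / len(source_feats) for d in range(dims)]
    mu_t = [sum(f[d] for f in target_feats) / len(target_feats) for d in range(dims)]
    # shift each source vector by the difference of domain means
    return [[f[d] - mu_s[d] + mu_t[d] for d in range(dims)] for f in source_feats]
```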

Updated: 2025-03-25 23:27:38

Fields: cs.LG

Download: http://arxiv.org/abs/2503.20113v1

Efficient Model Development through Fine-tuning Transfer

Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector from one source model version, which represents the weight changes from fine-tuning, and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the target base model, often achieving performance comparable to its fine-tuned counterpart. For example, reusing the fine-tuning updates from Llama 3.0 8B leads to an absolute accuracy improvement of 10.7% on GPQA over the base Llama 3.1 8B without additional training, surpassing Llama 3.1 8B Instruct. In a multilingual model development setting, we show that this approach can significantly increase performance on target-language tasks without retraining, achieving an absolute improvement of 4.7% and 15.5% on Global MMLU for Malagasy and Turkish, respectively, compared to Llama 3.1 8B Instruct. Our controlled experiments reveal that fine-tuning transfer is most effective when the source and target models are linearly connected in the parameter space. Additionally, we demonstrate that fine-tuning transfer offers a stronger and more computationally efficient starting point for further fine-tuning. Finally, we propose an iterative recycling-then-finetuning approach for continuous model development, which improves both efficiency and effectiveness. Our findings suggest that fine-tuning transfer is a viable strategy to reduce training costs while maintaining model performance.
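
The diff-vector recipe is easy to state concretely: subtract the old base from its fine-tuned counterpart, then add the result to the new base. A minimal sketch with weights as plain dicts (real models would do this per tensor):

```python
def diff_vector(fine_tuned, base):
    """Weight delta captured by fine-tuning: v = theta_ft - theta_base."""
    return {k: fine_tuned[k] - base[k] for k in base}

def apply_diff(new_base, diff):
    """Transfer the fine-tuning update onto a newer base model's weights."""
    return {k: new_base[k] + diff[k] for k in new_base}
```

Per the abstract, this transfer works best when the source and target models are linearly connected in parameter space, and the result can also serve as a cheap starting point for further fine-tuning.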

Updated: 2025-03-25 23:24:43

Fields: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.20110v1

Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors

We propose denoising diffusion variational inference (DDVI), a black-box variational inference algorithm for latent variable models which relies on diffusion models as flexible approximate posteriors. Specifically, our method introduces an expressive class of diffusion-based variational posteriors that perform iterative refinement in latent space; we train these posteriors with a novel regularized evidence lower bound (ELBO) on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. We find that DDVI improves inference and learning in deep latent variable models across common benchmarks as well as on a motivating task in biology -- inferring latent ancestry from human genomes -- where it outperforms strong baselines on the Thousand Genomes dataset.

Updated: 2025-03-25 23:22:46

Fields: cs.LG,q-bio.QM,stat.ML

Download: http://arxiv.org/abs/2401.02739v4

Mitigating Data Redundancy to Revitalize Transformer-based Long-Term Time Series Forecasting System

Long-term time-series forecasting (LTSF) is fundamental to various real-world applications, where Transformer-based models have become the dominant framework due to their ability to capture long-range dependencies. However, these models often overfit due to data redundancy in rolling forecasting settings, which limits their generalization ability; the problem is particularly evident in longer sequences with highly similar adjacent data. In this work, we introduce CLMFormer, a novel framework that mitigates redundancy through curriculum learning and a memory-driven decoder. Specifically, we progressively introduce Bernoulli noise to the training samples, which effectively breaks the high similarity between adjacent data points. This curriculum-driven noise introduction supplies more diverse and representative training data to the memory-driven decoder, which in turn captures seasonal tendencies and dependencies in the time-series data and leverages temporal relationships to facilitate forecasting. Extensive experiments on six real-world LTSF benchmarks show that CLMFormer consistently improves Transformer-based models by up to 30%, demonstrating its effectiveness in long-horizon forecasting.
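
The curriculum-driven Bernoulli noise can be sketched directly: corrupt each element of a training window with a probability that ramps up over epochs (the linear schedule and zero-replacement are assumptions for illustration, not CLMFormer's exact recipe):

```python
import random

def curriculum_bernoulli_noise(x, epoch, max_epoch, max_p=0.3, seed=None):
    """Progressively corrupt a training window to break the similarity
    between adjacent rolling-forecast samples.

    The corruption probability ramps linearly with the epoch (a curriculum);
    each element is independently zeroed with that Bernoulli probability.
    """
    rng = random.Random(seed)
    p = max_p * min(epoch / max_epoch, 1.0)
    return [0.0 if rng.random() < p else v for v in x]
```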

Updated: 2025-03-25 23:17:39

Fields: cs.LG,cs.CV

Download: http://arxiv.org/abs/2207.07827v5

Map-Based Path Loss Prediction in Multiple Cities Using Convolutional Neural Networks

Radio deployments and spectrum planning benefit from path loss predictions. Obstructions along a communications link are often considered implicitly or through derived metrics such as representative clutter height or total obstruction depth. In this paper, we propose a path-specific path loss prediction method that uses convolutional neural networks to automatically perform feature extraction from 2-D obstruction height maps. Our methods result in low prediction error in a variety of environments without requiring derived metrics.
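
The core operation a CNN applies to a 2-D obstruction height map is spatial convolution. A dependency-free sketch of one valid-mode pass (deep-learning "convolution" layers actually compute cross-correlation, as here):

```python
def conv2d(height_map, kernel):
    """Valid-mode 2-D cross-correlation (no padding): the basic operation a
    CNN layer uses to pull features out of an obstruction height map."""
    H, W = len(height_map), len(height_map[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            row.append(sum(height_map[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out
```

Stacking such filters with nonlinearities is what lets the network learn obstruction features automatically, instead of relying on derived metrics like representative clutter height.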

Updated: 2025-03-25 23:17:14

Fields: eess.SP,cs.LG

Download: http://arxiv.org/abs/2411.17752v3

"Hello, is this Anna?": A First Look at Pig-Butchering Scams

Pig-butchering scams, or Sha Zhu Pan, have emerged as a complex form of cyber-enabled financial fraud that combines elements of romance, investment fraud, and advanced social engineering tactics to systematically exploit victims. In this paper, we present the first qualitative analysis of pig-butchering scams, informed by in-depth semi-structured interviews with N=26 victims. We capture nuanced, first-hand accounts from victims across multiple regions, providing insight into the lifecycle of pig-butchering scams and the complex emotional and financial manipulation involved. We systematically analyze each phase of the scam, revealing that perpetrators employ tactics such as staged trust-building, fraudulent financial platforms, fabricated investment returns, and repeated high-pressure tactics, all designed to exploit victims' trust and financial resources over extended periods. Our findings reveal an organized scam lifecycle characterized by emotional manipulation, staged financial exploitation, and persistent re-engagement efforts that amplify victim losses. We also find complex psychological and financial impacts on victims, including heightened vulnerability to secondary scams. Finally, we propose actionable intervention points for social media and financial platforms to curb the prevalence of these scams and highlight the need for non-stigmatizing terminology to encourage victims to report and seek assistance.

Updated: 2025-03-25 23:15:48

Fields: cs.CR,cs.CY

Download: http://arxiv.org/abs/2503.20821v1

Direct Post-Training Preference Alignment for Multi-Agent Motion Generation Models Using Implicit Feedback from Pre-training Demonstrations

Recent advancements in LLMs have revolutionized motion generation models in embodied applications. While LLM-type auto-regressive motion generation models benefit from training scalability, there remains a discrepancy between their token prediction objectives and human preferences. As a result, models pre-trained solely with token-prediction objectives often generate behaviors that deviate from what humans would prefer, making post-training preference alignment crucial for producing human-preferred motions. Unfortunately, post-training alignment requires extensive preference rankings of motions generated by the pre-trained model, which are costly to annotate, especially in multi-agent settings. Recently, there has been growing interest in leveraging pre-training demonstrations to scalably generate preference data for post-training alignment. However, these methods often adopt an adversarial assumption, treating all pre-trained model-generated samples as unpreferred examples. This adversarial approach overlooks the valuable signal provided by preference rankings among the model's own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors. In this work, instead of treating all generated samples as equally bad, we leverage implicit preferences encoded in pre-training demonstrations to construct preference rankings among the pre-trained model's generations, offering more nuanced preference alignment guidance with zero human cost. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of pre-trained model's generated behaviors, making a lightweight 1M motion generation model comparable to SOTA large imitation-based models by relying solely on implicit feedback from pre-training demonstrations, without additional post-training human preference annotations or high computational costs.
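
The key idea, ranking a model's own generations by how well they agree with pre-training demonstrations rather than treating them all as dispreferred, can be sketched as follows (the distance function and all-pairs scheme are illustrative stand-ins for the paper's construction):

```python
def preference_pairs(generations, demo, dist):
    """Rank a pre-trained model's own generations by distance to a
    pre-training demonstration and emit (preferred, dispreferred) pairs.

    dist(g, demo) -> a scalar; smaller means closer to the demonstrated
    behavior and therefore implicitly preferred.
    """
    ranked = sorted(generations, key=lambda g: dist(g, demo))
    # every closer generation is preferred over every farther one
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]
```

These pairs can then feed a standard post-training preference-alignment objective at zero annotation cost.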

Updated: 2025-03-25 23:02:13

Fields: cs.AI,cs.RO

Download: http://arxiv.org/abs/2503.20105v1

Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion

This paper tackles a novel problem, extendable long-horizon planning: enabling agents to plan trajectories longer than those in training data without compounding errors. To tackle this, we propose the Hierarchical Multiscale Diffuser (HM-Diffuser) and Progressive Trajectory Extension (PTE), an augmentation method that iteratively generates longer trajectories by stitching shorter ones. HM-Diffuser trains on these extended trajectories using a hierarchical structure, efficiently handling tasks across multiple temporal scales. Additionally, we introduce Adaptive Plan Pondering and the Recursive HM-Diffuser, which consolidate hierarchical layers into a single model to process temporal scales recursively. Experimental results demonstrate the effectiveness of our approach, advancing diffusion-based planners for scalable long-horizon planning.
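
Progressive Trajectory Extension's stitching step can be illustrated minimally: two short trajectories whose endpoints coincide are joined into one longer trajectory, dropping the duplicated joint state (states are scalars here, and the matching rule is an assumption for illustration):

```python
def stitch(traj_a, traj_b, tol=1e-6):
    """Join traj_b onto traj_a when traj_a's final state matches traj_b's
    initial state, dropping the duplicated joint state."""
    if abs(traj_a[-1] - traj_b[0]) > tol:
        raise ValueError("endpoints do not match; cannot stitch")
    return traj_a + traj_b[1:]
```

Iterating this over a dataset of short trajectories yields training sequences longer than any single demonstration.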

Updated: 2025-03-25 22:52:46

Fields: cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.20102v1

Dataset-learning duality and emergent criticality

In artificial neural networks, the activation dynamics of non-trainable variables is strongly coupled to the learning dynamics of trainable variables. During the activation pass, the boundary neurons (e.g., input neurons) are mapped to the bulk neurons (e.g., hidden neurons), and during the learning pass, both bulk and boundary neurons are mapped to changes in trainable variables (e.g., weights and biases). For example, in feed-forward neural networks, forward propagation is the activation pass and backward propagation is the learning pass. We show that a composition of the two maps establishes a duality map between a subspace of non-trainable boundary variables (e.g., dataset) and a tangent subspace of trainable variables (i.e., learning). In general, the dataset-learning duality is a complex non-linear map between high-dimensional spaces. We use duality to study the emergence of criticality, or the power-law distribution of fluctuations of the trainable variables, using a toy model at learning equilibrium. In particular, we show that criticality can emerge in the learning system even from the dataset in a non-critical state, and that the power-law distribution can be modified by changing either the activation function or the loss function.
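
For a single linear neuron with squared loss, the two maps compose exactly as described: the activation pass sends boundary data to a bulk activation, and the learning pass sends both to a change in trainable variables, so each dataset point is dual to a tangent-space displacement (a toy instance, not the paper's general setting):

```python
def activation_pass(w, b, x):
    """Boundary -> bulk: a single linear neuron's forward map."""
    return w * x + b

def learning_pass(w, b, x, target, lr=0.1):
    """Bulk + boundary -> trainable-variable change under squared loss.

    Composing this with the activation pass gives the duality map: a dataset
    point (x, target) is sent to the gradient-descent update (dw, db)."""
    y = activation_pass(w, b, x)
    err = y - target
    return (-lr * err * x, -lr * err)
```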

Updated: 2025-03-25 22:39:21

Fields: cs.LG,cond-mat.dis-nn,cond-mat.stat-mech,cs.NE

Download: http://arxiv.org/abs/2405.17391v3

LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Text-to-image generation (T2I) has become a key area of research with broad applications. However, existing methods often struggle with complex spatial relationships and fine-grained control over multiple concepts. Many existing approaches require significant architectural modifications, extensive training, or expert-level prompt engineering. To address these challenges, we introduce LayerCraft, an automated framework that leverages large language models (LLMs) as autonomous agents for structured procedural generation. LayerCraft enables users to customize objects within an image and supports narrative-driven creation with minimal effort. At its core, the system includes a coordinator agent that directs the process, along with two specialized agents: ChainArchitect, which employs chain-of-thought (CoT) reasoning to generate a dependency-aware 3D layout for precise instance-level control, and the Object-Integration Network (OIN), which utilizes LoRA fine-tuning on pre-trained T2I models to seamlessly blend objects into specified regions of an image based on textual prompts without requiring architectural changes. Extensive evaluations demonstrate LayerCraft's versatility in applications ranging from multi-concept customization to storytelling. By providing non-experts with intuitive, precise control over T2I generation, our framework democratizes creative image creation. Our code will be released upon acceptance at github.com/PeterYYZhang/LayerCraft

Updated: 2025-03-25 22:36:55

Fields: cs.LG,cs.GR,cs.MA

Download: http://arxiv.org/abs/2504.00010v1

AI Identity, Empowerment, and Mindfulness in Mitigating Unethical AI Use

This study examines how AI identity influences psychological empowerment and unethical AI behavior among college students, while also exploring the moderating role of IT mindfulness. Findings show that a strong AI identity enhances psychological empowerment and academic engagement but can also lead to increased unethical AI practices. Crucially, IT mindfulness acts as an ethical safeguard, promoting sensitivity to ethical concerns and reducing misuse of AI. These insights have implications for educators, policymakers, and AI developers, emphasizing the need for a balanced approach that encourages digital engagement without compromising student responsibility. The study also contributes to philosophical discussions of psychological agency, suggesting that empowerment through AI can yield both positive and negative outcomes. Mindfulness emerges as essential in guiding ethical AI interactions. Overall, the research informs ongoing debates on ethics in education and AI, offering strategies to align technological advancement with ethical accountability and responsible use.

Updated: 2025-03-25 22:36:21

Fields: cs.CY,cs.AI

Download: http://arxiv.org/abs/2503.20099v1

Fundamental Limits of Perfect Concept Erasure

Concept erasure is the task of erasing information about a concept (e.g., gender or race) from a representation set while retaining the maximum possible utility -- information from original representations. Concept erasure is useful in several applications, such as removing sensitive concepts to achieve fairness and interpreting the impact of specific concepts on a model's performance. Previous concept erasure techniques have prioritized robustly erasing concepts over retaining the utility of the resultant representations. However, there seems to be an inherent tradeoff between erasure and retaining utility, making it unclear how to achieve perfect concept erasure while maintaining high utility. In this paper, we offer a fresh perspective toward solving this problem by quantifying the fundamental limits of concept erasure through an information-theoretic lens. Using these results, we investigate constraints on the data distribution and the erasure functions required to achieve the limits of perfect concept erasure. Empirically, we show that the derived erasure functions achieve the optimal theoretical bounds. Additionally, we show that our approach outperforms existing methods on a range of synthetic and real-world datasets using GPT-4 representations.
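
The abstract does not spell out the derived optimal erasure functions, but a standard linear baseline conveys what an erasure function does: project every representation onto the complement of a concept direction (illustrative only, not the paper's method):

```python
def erase_direction(reps, direction):
    """Linear concept-erasure baseline: remove the component of each
    representation along a (normalised) concept direction, x - (x . u) u."""
    norm = sum(d * d for d in direction) ** 0.5
    u = [d / norm for d in direction]
    out = []
    for x in reps:
        dot = sum(xi * ui for xi, ui in zip(x, u))
        out.append([xi - dot * ui for xi, ui in zip(x, u)])
    return out
```

The erasure-utility tradeoff the paper quantifies is visible even here: the projection destroys exactly the information along `direction` while leaving the orthogonal complement, and hence the remaining utility, untouched.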

Updated: 2025-03-25 22:36:10

Domains: cs.LG

Download: http://arxiv.org/abs/2503.20098v1

Training Domain Draft Models for Speculative Decoding: Best Practices and Insights

Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
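
The draft-then-verify loop that speculative decoding builds on can be illustrated with the standard speculative-sampling acceptance rule, where a drafted token t survives with probability min(1, p_target(t) / p_draft(t)). The sketch below uses hand-written token distributions in place of real draft and target models:

```python
import random

def speculative_accept(draft_probs, target_probs, tokens, rng):
    """Accept or reject drafted tokens with the speculative-sampling rule:
    a draft token t is kept with probability min(1, p_target(t) / p_draft(t)).
    The first rejection ends the speculated run."""
    accepted = 0
    for t in tokens:
        if rng.random() < min(1.0, target_probs[t] / draft_probs[t]):
            accepted += 1
        else:
            break
    return accepted

rng = random.Random(0)
draft = {"a": 0.5, "b": 0.5}
target = {"a": 0.5, "b": 0.5}
# When the draft matches the target exactly, every speculated token is accepted;
# domain shift lowers p_target/p_draft on drafted tokens and the rate drops.
n = speculative_accept(draft, target, ["a", "b", "a"], rng)
print(n)
```

Distillation toward the target model, as studied in the paper, raises exactly this acceptance probability on the domain's token distribution.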

Updated: 2025-03-25 22:17:33

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.07807v2

SoK: Decoding the Enigma of Encrypted Network Traffic Classifiers

The adoption of modern encryption protocols such as TLS 1.3 has significantly challenged traditional network traffic classification (NTC) methods. As a consequence, researchers are increasingly turning to machine learning (ML) approaches to overcome these obstacles. In this paper, we comprehensively analyze ML-based NTC studies, developing a taxonomy of their design choices, benchmarking suites, and prevalent assumptions impacting classifier performance. Through this systematization, we demonstrate widespread reliance on outdated datasets, oversights in design choices, and the consequences of unsubstantiated assumptions. Our evaluation reveals that the majority of proposed encrypted traffic classifiers have mistakenly utilized unencrypted traffic due to the use of legacy datasets. Furthermore, by conducting 348 feature occlusion experiments on state-of-the-art classifiers, we show how oversights in NTC design choices lead to overfitting, and validate or refute prevailing assumptions with empirical evidence. By highlighting lessons learned, we offer strategic insights, identify emerging research directions, and recommend best practices to support the development of real-world applicable NTC methodologies.

Updated: 2025-03-25 22:15:50

Domains: cs.CR,cs.NI

Download: http://arxiv.org/abs/2503.20093v1

FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings

External control arms (ECA) can inform the early clinical development of experimental drugs and provide efficacy evidence for regulatory approval. However, the main challenge in implementing ECA lies in accessing real-world or historical clinical trials data. Indeed, regulations protecting patients' rights by strictly controlling data processing make pooling data from multiple sources in a central server often difficult. To address these limitations, we develop a new method, 'FedECA' that leverages federated learning (FL) to enable inverse probability of treatment weighting (IPTW) for time-to-event outcomes on separate cohorts without needing to pool data. To showcase the potential of FedECA, we apply it in different settings of increasing complexity culminating with a real-world use-case in which FedECA provides evidence for a differential effect between two drugs that would have otherwise gone unnoticed. By sharing our code, we hope FedECA will foster the creation of federated research networks and thus accelerate drug development.
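
The IPTW step at the core of FedECA can be sketched in plain numpy on a single site. This is a minimal, non-federated illustration for a binary outcome-free balance check: the stratified propensity estimate and the synthetic confounder below are stand-ins, and the time-to-event machinery and federated protocol are omitted entirely:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=2000)          # one binary confounder
p_treat = np.where(x == 1, 0.8, 0.2)       # treatment assignment depends on x
a = (rng.random(2000) < p_treat).astype(int)

# Propensity score estimated within strata of x (a toy stand-in for a model).
e = np.array([a[x == v].mean() for v in (0, 1)])[x]

# Inverse probability of treatment weights: 1/e for treated, 1/(1-e) for controls.
w = np.where(a == 1, 1.0 / e, 1.0 / (1.0 - e))

# After weighting, the confounder distribution is balanced across the two arms.
mean_x_treated = np.average(x, weights=w * (a == 1))
mean_x_control = np.average(x, weights=w * (a == 0))
print(round(mean_x_treated, 3), round(mean_x_control, 3))
```

FedECA's contribution is computing these weights (and downstream time-to-event estimates) without any site ever pooling its patient-level rows.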

Updated: 2025-03-25 22:14:01

Domains: stat.ME,cs.DC,cs.LG

Download: http://arxiv.org/abs/2311.16984v6

High-Dimension Human Value Representation in Large Language Models

The widespread application of LLMs across various tasks and fields has necessitated the alignment of these models with human values and preferences. Given various approaches of human value alignment, there is an urgent need to understand the scope and nature of human values injected into these LLMs before their deployment and adoption. We propose UniVaR, a high-dimensional neural representation of symbolic human value distributions in LLMs, orthogonal to model architecture and training data. This is a continuous and scalable representation, self-supervised from the value-relevant output of 8 LLMs and evaluated on 15 open-source and commercial LLMs. Through UniVaR, we visualize and explore how LLMs prioritize different values in 25 languages and cultures, shedding light on complex interplay between human values and language modeling.

Updated: 2025-03-25 22:02:36

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2404.07900v4

Random feature-based double Vovk-Azoury-Warmuth algorithm for online multi-kernel learning

We introduce a novel multi-kernel learning algorithm, VAW$^2$, for online least squares regression in reproducing kernel Hilbert spaces (RKHS). VAW$^2$ leverages random Fourier feature-based functional approximation and the Vovk-Azoury-Warmuth (VAW) method in a two-level procedure: VAW is used to construct expert strategies from random features generated for each kernel at the first level, and then again to combine their predictions at the second level. A theoretical analysis yields a regret bound of $O(T^{1/2}\ln T)$ in expectation with respect to artificial randomness, when the number of random features scales as $T^{1/2}$. Empirical results on some benchmark datasets demonstrate that VAW$^2$ achieves superior performance compared to the existing online multi-kernel learning algorithms, Raker and OMKL-GF, and to other theoretically grounded methods involving a convex combination of expert predictions at the second level.
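
For reference, the VAW forecaster used at both levels predicts $\hat{y}_t = x_t^\top (\lambda I + \sum_{s\le t} x_s x_s^\top)^{-1} \sum_{s<t} y_s x_s$, folding the current input into the covariance before predicting. A minimal single-predictor sketch (the random-feature generation and two-level combination of VAW$^2$ are not reproduced here; the synthetic data is illustrative):

```python
import numpy as np

def vaw_predictions(X, y, lam=1.0):
    """Vovk-Azoury-Warmuth forecaster for online least squares.
    At step t it predicts x_t^T (lam*I + sum_{s<=t} x_s x_s^T)^{-1} (sum_{s<t} y_s x_s);
    note the current input x_t enters the covariance BEFORE predicting."""
    n, d = X.shape
    A = lam * np.eye(d)      # regularized covariance
    b = np.zeros(d)          # running sum of y_s * x_s
    preds = np.empty(n)
    for t in range(n):
        A += np.outer(X[t], X[t])          # the VAW twist: include x_t first
        preds[t] = X[t] @ np.linalg.solve(A, b)
        b += y[t] * X[t]
    return preds

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=400)
preds = vaw_predictions(X, y)
loss = float(np.mean((preds - y) ** 2))
baseline = float(np.mean(y ** 2))          # loss of always predicting zero
print(loss, baseline)
```

Running the same update on random Fourier features of each kernel, then once more over the experts' predictions, gives the two-level structure the abstract describes.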

Updated: 2025-03-25 21:57:35

Domains: cs.LG,68Q32, 68W27, 68W20

Download: http://arxiv.org/abs/2503.20087v1

Ambient Noise Full Waveform Inversion with Neural Operators

Numerical simulations of seismic wave propagation are crucial for investigating velocity structures and improving seismic hazard assessment. However, standard methods such as finite difference or finite element are computationally expensive. Recent studies have shown that a new class of machine learning models, called neural operators, can solve the elastodynamic wave equation orders of magnitude faster than conventional methods. Full waveform inversion is a prime beneficiary of the accelerated simulations. Neural operators, as end-to-end differentiable operators, combined with automatic differentiation, provide an alternative approach to the adjoint-state method. Since neural operators do not involve the Born approximation, when used for full waveform inversion they have the potential to include additional phases and alleviate cycle-skipping problems present in traditional adjoint-state formulations. In this study, we demonstrate the first application of neural operators for full waveform inversion on a real seismic dataset, which consists of several nodal transects collected across the San Gabriel, Chino, and San Bernardino basins in the Los Angeles metropolitan area.

Updated: 2025-03-25 21:50:39

Domains: physics.geo-ph,cs.LG

Download: http://arxiv.org/abs/2503.15013v2

Can Multi-modal (reasoning) LLMs work as deepfake detectors?

Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection, including OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 sonnet. We benchmark 12 of the latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive zero-shot performance with promising generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the remaining LLM families perform extremely poorly, some worse than random guessing. Furthermore, we find that newer model versions and reasoning capabilities do not contribute to performance in such niche tasks as deepfake detection, while model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning into future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.

Updated: 2025-03-25 21:47:29

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.20084v1

Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning

Supervised contrastive learning has achieved remarkable success by leveraging label information; however, determining positive samples in multi-label scenarios remains a critical challenge. In multi-label supervised contrastive learning (MSCL), relations among multi-label samples are not yet fully defined, leading to ambiguity in identifying positive samples and formulating contrastive loss functions to construct the representation space. To address these challenges, we: (i) first define five distinct multi-label relations in MSCL to systematically identify positive samples, (ii) introduce a novel Similarity-Dissimilarity Loss that dynamically re-weights samples by computing similarity and dissimilarity factors between positive samples and given anchors based on the multi-label relations, and (iii) further provide a theoretically grounded proof for our method through rigorous mathematical analysis that supports the formulation and effectiveness of the proposed loss function. We conduct experiments across both image and text modalities, and extend the evaluation to the medical domain. The results demonstrate that our method consistently outperforms baselines in a comprehensive evaluation, confirming its effectiveness and robustness. Code is available at: https://github.com/guangminghuang/similarity-dissimilarity-loss.
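
One simplified way to picture the re-weighting idea (this is not the paper's five-relation formulation; the Jaccard overlap below is a toy stand-in for its similarity/dissimilarity factors) is to weight each candidate positive by the overlap of its label set with the anchor's:

```python
def overlap_weights(anchor_labels, sample_labels):
    """Toy multi-label re-weighting: score each candidate positive by the
    Jaccard overlap of its label set with the anchor's, so exact matches
    count fully, partial overlaps count partially, and disjoint sets count zero."""
    a = set(anchor_labels)
    out = []
    for labels in sample_labels:
        s = set(labels)
        inter, union = len(a & s), len(a | s)
        out.append(inter / union if union else 0.0)
    return out

w = overlap_weights(
    {"cat", "dog"},
    [{"cat", "dog"}, {"cat"}, {"bird"}, {"cat", "dog", "bird"}],
)
print(w)  # ordered: exact match > superset/partial overlap > disjoint
```

Plugging such graded weights into a supervised contrastive loss is what lets multi-label positives contribute in proportion to how much of the anchor's label structure they share.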

Updated: 2025-03-25 21:47:03

Domains: cs.LG,cs.CL,cs.CV

Download: http://arxiv.org/abs/2410.13439v3

ARGO-SLSA: Software Supply Chain Security in Argo Workflows

Distributed systems widely adopt microservice architecture to handle growing complexity and scale. This approach breaks applications into independent, loosely coupled services. Kubernetes has become the de facto standard for managing microservices, and automating complex, multi-step workflows is a common requirement in Kubernetes. Argo Workflows is a Kubernetes-native engine for managing these workflows in an automated fashion. These workflows generate artifacts such as executables, logs, container images, and packages, which often require proper management through software supply chain security. However, Argo Workflows does not include built-in functionality for frameworks like Supply-chain Levels for Software Artifacts (SLSA), which is essential for ensuring artifact integrity, traceability, and security. This gap compels practitioners to rely on external tools to meet software supply chain security standards. In response, this paper proposes a Kubernetes-native controller built on top of existing open-source Argo Workflows to enhance artifact security. By generating cryptographic signing and provenance attestations, the controller enables Argo Workflows to comply with SLSA standards. We demonstrate that implementations can provide such cryptographic signing and provenance attestations for artifacts produced by the controller, allowing software artifacts built with Argo Workflows to adhere to SLSA requirements. The proposed validation model evaluates the proof of concept of the controller, including its ability to reconcile workflows, detect pods associated with workflow nodes, operate without disrupting existing operations, enforce integrity, and monitor software artifacts.

Updated: 2025-03-25 21:32:23

Domains: cs.DC,cs.CR

Download: http://arxiv.org/abs/2503.20079v1

Abstracting Geo-specific Terrains to Scale Up Reinforcement Learning

Multi-agent reinforcement learning (MARL) is increasingly ubiquitous in training dynamic and adaptive synthetic characters for interactive simulations on geo-specific terrains. Frameworks such as Unity's ML-Agents help to make such reinforcement learning experiments more accessible to the simulation community. Military training simulations also benefit from advances in MARL, but they have immense computational requirements due to their complex, continuous, stochastic, partially observable, non-stationary, and doctrine-based nature. Furthermore, these simulations require geo-specific terrains, further exacerbating the computational resources problem. In our research, we leverage Unity's waypoints to automatically generate multi-layered representation abstractions of the geo-specific terrains to scale up reinforcement learning while still allowing the transfer of learned policies between different representations. Our early exploratory results on a novel MARL scenario, where each side has differing objectives, indicate that waypoint-based navigation enables faster and more efficient learning while producing trajectories similar to those taken by expert human players in CSGO gaming environments. This research points out the potential of waypoint-based navigation for reducing the computational costs of developing and training MARL models for military training simulations, where geo-specific terrains and differing objectives are crucial.

Updated: 2025-03-25 21:29:49

Domains: cs.LG,cs.AI,cs.MA

Download: http://arxiv.org/abs/2503.20078v1

Peer Disambiguation in Self-Reported Surveys using Graph Attention Networks

Studying peer relationships is crucial in solving complex challenges underserved communities face and designing interventions. The effectiveness of such peer-based interventions relies on accurate network data regarding individual attributes and social influences. However, these datasets are often collected through self-reported surveys, introducing ambiguities in network construction. These ambiguities make it challenging to fully utilize the network data to understand the issues and to design the best interventions. We propose and solve two variations of link ambiguities in such network data -- (i) which among the two candidate links exists, and (ii) if a candidate link exists. We design a Graph Attention Network (GAT) that accounts for personal attributes and network relationships on real-world data with real and simulated ambiguities. We also demonstrate that by resolving these ambiguities, we improve network accuracy, and in turn, improve suicide risk prediction. We also uncover patterns using GNNExplainer to provide additional insights into vital features and relationships. This research demonstrates the potential of Graph Neural Networks (GNN) to advance real-world network data analysis facilitating more effective peer interventions across various fields.

Updated: 2025-03-25 21:25:31

Domains: cs.SI,cs.LG

Download: http://arxiv.org/abs/2503.20076v1

Truck Parking Usage Prediction with Decomposed Graph Neural Networks

Truck parking on freight corridors faces the major challenge of insufficient parking spaces. This is exacerbated by the Hour-of-Service (HOS) regulations, which often result in unauthorized parking practices, causing safety concerns. It has been shown that providing accurate parking usage prediction can be a cost-effective solution to reduce unsafe parking practices. In light of this, existing studies have developed various methods to predict the usage of a truck parking site and have demonstrated satisfactory accuracy. However, these studies focused on a single parking site, and few approaches have been proposed to predict the usage of multiple truck parking sites considering spatio-temporal dependencies, due to the lack of data. This paper aims to fill this gap and presents the Regional Temporal Graph Convolutional Network (RegT-GCN) to predict parking usage across the entire state to provide more comprehensive truck parking information. The framework leverages the topological structures of truck parking site locations and historical parking data to predict the occupancy rate considering spatio-temporal dependencies across a state. To achieve this, we introduce a Regional Decomposition approach, which effectively captures the geographical characteristics of the truck parking locations and their spatial correlations. Evaluation results demonstrate that the proposed model outperforms other baseline models, showing the effectiveness of our regional decomposition. The code is available at https://github.com/raynbowy23/RegT-GCN.

Updated: 2025-03-25 21:24:52

Domains: cs.AI

Download: http://arxiv.org/abs/2401.12920v3

Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience

The surge in generative AI workloads has created a need for scalable inference systems that can flexibly harness both GPUs and specialized accelerators while containing operational costs. This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators based on real-time cost and capacity signals. The approach sustains low latency and high throughput by dynamically shifting between cost-optimized and capacity-optimized modes, ensuring the most efficient use of expensive compute resources under fluctuating availability. Evaluated using the Stable Diffusion model, the framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators when possible. These results highlight how a feedback-driven deployment strategy, spanning the entire software and hardware stack, can help organizations efficiently scale generative AI workloads while maintaining resilience in the face of limited accelerator capacity.
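
The switching between cost-optimized and capacity-optimized modes can be caricatured as a greedy router that fills cheaper pools first and overflows to pricier ones when capacity runs out. Pool names, costs, and capacities below are invented for illustration and are not taken from the paper:

```python
def route(request_count, pools):
    """Toy cost-aware router: assign requests to the cheapest pools first,
    overflowing to more expensive pools when capacity runs out.
    Returns (assignments, unserved); unserved > 0 means demand exceeds
    total capacity across all accelerators."""
    assignments = {}
    remaining = request_count
    for name, pool in sorted(pools.items(), key=lambda kv: kv[1]["cost"]):
        take = min(remaining, pool["capacity"])
        if take:
            assignments[name] = take
            remaining -= take
    return assignments, remaining

pools = {
    "gpu_a100":   {"cost": 3.0, "capacity": 40},
    "accel_chip": {"cost": 1.0, "capacity": 25},  # cheaper specialized accelerator
}
assignments, unserved = route(50, pools)
print(assignments, unserved)
```

A real control loop, as the paper describes, would re-read cost and capacity signals continuously and re-balance rather than routing a fixed batch once.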

Updated: 2025-03-25 21:20:11

Domains: cs.PF,cs.AI,68U01

Download: http://arxiv.org/abs/2503.20074v1

Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Co-examination of second-harmonic generation (SHG) and bright-field (BF) microscopy enables the differentiation of tissue components and collagen fibers, aiding the analysis of human breast and pancreatic cancer tissues. However, large discrepancies between SHG and BF images pose challenges for current learning-based registration models in aligning SHG to BF. In this paper, we propose a novel multi-modal registration framework that employs fidelity-imposed displacement editing to address these challenges. The framework integrates batch-wise contrastive learning, feature-based pre-alignment, and instance-level optimization. Experimental results from the Learn2Reg COMULISglobe SHG-BF Challenge validate the effectiveness of our method, securing the 1st place on the online leaderboard.

Updated: 2025-03-25 20:35:46

Domains: cs.CV,cs.LG,eess.IV

Download: http://arxiv.org/abs/2410.20812v2

END: Early Noise Dropping for Efficient and Effective Context Denoising

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences, which degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning of the LLMs. END segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, END preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that END significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by using the prober to investigate LLMs' implicit understanding of their input, this work also deepens our understanding of how LLMs reason over context internally.
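
The filtering step reduces to: chunk the input, score each chunk with an early-layer prober, and drop the low scorers before generation. The sketch below stubs out the prober with hypothetical scores (no LLM or hidden states are involved; chunk texts and the 0.5 threshold are made up):

```python
def drop_noisy_chunks(chunks, probe_scores, threshold=0.5):
    """END-style filtering sketch: keep only the chunks whose early-layer
    probe score marks them as informative. Here probe_scores stand in for
    a linear prober run on early-layer hidden states (not implemented)."""
    return [c for c, s in zip(chunks, probe_scores) if s >= threshold]

chunks = ["relevant fact A", "spam spam spam", "relevant fact B"]
scores = [0.91, 0.12, 0.84]   # hypothetical prober outputs in [0, 1]
kept = drop_noisy_chunks(chunks, scores)
print(kept)
```

Because the scoring happens at early layers, the discarded chunks never pay the cost of the remaining forward pass, which is where the efficiency gain comes from.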

Updated: 2025-03-25 20:34:56

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2502.18915v2

Why Representation Engineering Works: A Theoretical and Empirical Study in Vision-Language Models

Representation Engineering (RepE) has emerged as a powerful paradigm for enhancing AI transparency by focusing on high-level representations rather than individual neurons or circuits. It has proven effective in improving interpretability and control, showing that representations can emerge, propagate, and shape final model outputs in large language models (LLMs). However, in Vision-Language Models (VLMs), visual input can override factual linguistic knowledge, leading to hallucinated responses that contradict reality. To address this challenge, we make the first attempt to extend RepE to VLMs, analyzing how multimodal representations are preserved and transformed. Building on our findings and drawing inspiration from successful RepE applications, we develop a theoretical framework that explains the stability of neural activity across layers using the principal eigenvector, uncovering the underlying mechanism of RepE. We empirically validate these intrinsic properties, demonstrating their broad applicability and significance. By bridging theoretical insights with empirical validation, this work transforms RepE from a descriptive tool into a structured theoretical framework, opening new directions for improving AI robustness, fairness, and transparency.

Updated: 2025-03-25 20:32:15

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.22720v1

LLM-based Agent Simulation for Maternal Health Interventions: Uncertainty Estimation and Decision-focused Evaluation

Agent-based simulation is crucial for modeling complex human behavior, yet traditional approaches require extensive domain knowledge and large datasets. In data-scarce healthcare settings where historic and counterfactual data are limited, large language models (LLMs) offer a promising alternative by leveraging broad world knowledge. This study examines an LLM-driven simulation of a maternal mobile health program, predicting beneficiaries' listening behavior when they receive health information via automated messages (control) or live representatives (intervention). Since uncertainty quantification is critical for decision-making in health interventions, we propose an LLM epistemic uncertainty estimation method based on binary entropy across multiple samples. We enhance model robustness through ensemble approaches, improving F1 score and model calibration compared to individual models. Beyond direct evaluation, we take a decision-focused approach, demonstrating how LLM predictions inform intervention feasibility and trial implementation in data-limited settings. The proposed method extends to public health, disaster response, and other domains requiring rapid intervention assessment under severe data constraints. All code and prompts used for this work can be found at https://github.com/sarahmart/LLM-ABS-ARMMAN-prediction.
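
The binary-entropy uncertainty score is easy to make concrete: query the model several times for the same yes/no prediction, then compute the entropy of the empirical positive rate. A minimal sketch, assuming binary (0/1) samples; the exact aggregation across beneficiaries in the paper may differ:

```python
import math

def binary_entropy_uncertainty(samples):
    """Epistemic-uncertainty sketch: given repeated 0/1 predictions for the
    same query, return the binary entropy (in bits) of the empirical
    'positive' frequency. 0.0 = full agreement, 1.0 = maximal disagreement."""
    p = sum(samples) / len(samples)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy_uncertainty([1, 1, 1, 1]))  # unanimous samples
print(binary_entropy_uncertainty([1, 0, 1, 0]))  # evenly split samples
```

High-entropy predictions are the ones a decision-maker should treat cautiously, which is what makes the score usable for the paper's decision-focused evaluation.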

Updated: 2025-03-25 20:24:47

Domains: cs.AI

Download: http://arxiv.org/abs/2503.22719v1

PAPILLON: Privacy Preservation from Internet-based and Local Language Model Ensembles

Users can divulge sensitive information to proprietary LLM providers, raising significant privacy concerns. While open-source models, hosted locally on the user's machine, alleviate some concerns, models that users can host locally are often less capable than proprietary frontier models. Toward preserving user privacy while retaining the best quality, we propose Privacy-Conscious Delegation, a novel task for chaining API-based and local models. We utilize recent public collections of user-LLM interactions to construct a natural benchmark called PUPA, which contains personally identifiable information (PII). To study potential approaches, we devise PAPILLON, a multi-stage LLM pipeline that uses prompt optimization to address a simpler version of our task. Our best pipeline maintains high response quality for 85.5% of user queries while restricting privacy leakage to only 7.5%. A sizeable gap to the generation quality of proprietary LLMs remains for future work. Our data and code is available at https://github.com/siyan-sylvia-li/PAPILLON.

Updated: 2025-03-25 20:20:42

标题: PAPILLON: 保护隐私的方法,通过基于互联网和本地语言模型集成

摘要: 用户可以向专有的LLM提供者透露敏感信息,引发重大的隐私问题。虽然在用户的机器上托管的开源模型可以缓解一些担忧,但用户可以在本地托管的模型通常比专有的前沿模型功能较弱。为了保护用户隐私同时保留最佳质量,我们提出了一种新颖的任务——隐私意识委托,用于链接基于API和本地模型。我们利用最近公开的用户-LLM互动集合构建了一个包含个人可识别信息(PII)的自然基准,称为PUPA。为了研究潜在的方法,我们设计了PAPILLON,一个多阶段LLM管道,利用提示优化来解决我们任务的简化版本。我们最佳的管道在85.5%的用户查询中保持高响应质量,同时将隐私泄漏限制在仅为7.5%。我们为未来工作留下了专有LLM生成质量的较大空间。我们的数据和代码可在https://github.com/siyan-sylvia-li/PAPILLON获取。

更新时间: 2025-03-25 20:20:42

领域: cs.CR,cs.CL

下载: http://arxiv.org/abs/2410.17127v3

Motion-Boundary-Driven Unsupervised Surgical Instrument Segmentation in Low-Quality Optical Flow

Unsupervised video-based surgical instrument segmentation has the potential to accelerate the adoption of robot-assisted procedures by reducing the reliance on manual annotations. However, the generally low quality of optical flow in endoscopic footage poses a great challenge for unsupervised methods that rely heavily on motion cues. To overcome this limitation, we propose a novel approach that pinpoints motion boundaries, regions with abrupt flow changes, while selectively discarding frames with globally low-quality flow and adapting to varying motion patterns. Experiments on the EndoVis2017 VOS and EndoVis2017 Challenge datasets show that our method achieves mean Intersection-over-Union (mIoU) scores of 0.75 and 0.72, respectively, effectively alleviating the constraints imposed by suboptimal optical flow. This enables a more scalable and robust surgical instrument segmentation solution in clinical settings. The code will be publicly released.
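The two mechanisms above — pinpointing abrupt flow changes and gating out frames with globally weak flow — can be sketched with NumPy. The thresholds and the gradient-magnitude boundary detector are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def motion_boundaries(flow, boundary_thresh=1.0, quality_thresh=0.1):
    """Locate motion boundaries as regions of abrupt optical-flow change.

    flow: (H, W, 2) array of per-pixel (u, v) displacements.
    Returns a boolean boundary mask, or None when the frame's overall
    flow magnitude is too weak to be trusted (the frame is discarded).
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    if magnitude.mean() < quality_thresh:
        return None  # globally low-quality flow: skip this frame
    # Spatial gradients of each flow channel; large gradients mark
    # abrupt changes, i.e. candidate instrument boundaries.
    gu_y, gu_x = np.gradient(flow[..., 0])
    gv_y, gv_x = np.gradient(flow[..., 1])
    edge_strength = np.sqrt(gu_x**2 + gu_y**2 + gv_x**2 + gv_y**2)
    return edge_strength > boundary_thresh

# Synthetic frame: the left half moves right, the right half is static,
# so the boundary should appear near the vertical midline.
flow = np.zeros((8, 8, 2))
flow[:, :4, 0] = 3.0
mask = motion_boundaries(flow)
print(mask[:, 3:5].any(), mask[:, 0].any())  # True False
```

A real pipeline would feed such masks as pseudo-labels into a segmentation network; the frame-gating step is what makes the method robust to endoscopic flow failures.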

Updated: 2025-03-25 20:18:43

标题: 低质量光流中基于运动边界的无监督手术器械分割

摘要: 基于视频的无监督手术器械分割可以通过减少对手动标注的依赖,加速机器人辅助手术的采用。然而,内窥镜镜头拍摄的光流通常质量较低,这给依赖于运动线索的无监督方法带来了巨大挑战。为了克服这一限制,我们提出了一种新颖的方法,可以精确定位运动边界,即具有突然流动变化的区域,同时选择性地丢弃全局光流质量低的帧,并适应不同的运动模式。在EndoVis2017 VOS和EndoVis2017 Challenge数据集上的实验表明,我们的方法分别实现了0.75和0.72的平均交并比(mIoU)分数,有效减轻了次优光流所施加的约束。这使得在临床环境中可以实现更具规模化和鲁棒性的手术器械分割解决方案。代码将公开发布。

更新时间: 2025-03-25 20:18:43

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2403.10039v2

Knowledge Transfer from LLMs to Provenance Analysis: A Semantic-Augmented Method for APT Detection

Advanced Persistent Threats (APTs) have caused significant losses across a wide range of sectors, including the theft of sensitive data and harm to system integrity. As attack techniques grow increasingly sophisticated and stealthy, the arms race between cyber defenders and attackers continues to intensify. The revolutionary impact of Large Language Models (LLMs) has opened up numerous opportunities in various fields, including cybersecurity. An intriguing question arises: can the extensive knowledge embedded in LLMs be harnessed for provenance analysis and play a positive role in identifying previously unknown malicious events? To seek a deeper understanding of this issue, we propose a new strategy for taking advantage of LLMs in provenance-based threat detection. In our design, the state-of-the-art LLM offers additional details in provenance data interpretation, leveraging its knowledge of system calls, software identity, and high-level understanding of application execution context. The advanced contextualized embedding capability is further utilized to capture the rich semantics of event descriptions. We comprehensively examine the quality of the resulting embeddings, and it turns out that they offer promising avenues. Subsequently, machine learning models built upon these embeddings demonstrated outstanding performance on real-world data. In our evaluation, supervised threat detection achieves a precision of 99.0%, and semi-supervised anomaly detection attains a precision of 96.9%.
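The downstream stage — a classifier over embedded event descriptions — can be sketched as below. Everything here is a stand-in under stated assumptions: a hashed bag-of-words replaces the LLM's contextualized embeddings, and a nearest-centroid rule replaces the paper's machine learning models; the event descriptions are invented.

```python
import zlib
import numpy as np

def embed(text, dim=64):
    """Stand-in for an LLM embedding: a hashed bag-of-words vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class NearestCentroid:
    """Minimal supervised detector over event-description embeddings."""
    def fit(self, texts, labels):
        X = np.stack([embed(t) for t in texts])
        y = np.array(labels)
        self.centroids = {c: X[y == c].mean(axis=0) for c in set(labels)}
        return self

    def predict(self, text):
        v = embed(text)
        return max(self.centroids, key=lambda c: v @ self.centroids[c])

clf = NearestCentroid().fit(
    ["bash spawns curl to fetch remote payload",
     "bash spawns curl to download remote script",
     "editor saves document to home directory",
     "editor writes document to home directory"],
    ["malicious", "malicious", "benign", "benign"],
)
print(clf.predict("bash spawns curl to fetch remote script"))  # malicious
```

Richer embeddings change only `embed`; the detection interface stays the same, which is the appeal of decoupling semantic enrichment from the classifier.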

Updated: 2025-03-25 20:11:36

标题: 从LLMs到溯源分析的知识转移:一种用于APT检测的语义增强方法

摘要: 高级持久性威胁(APTs)已经在多个领域造成了重大损失,包括敏感数据的窃取和系统完整性的破坏。随着攻击技术越来越复杂和隐蔽,网络防御者和攻击者之间的军备竞赛继续加剧。大型语言模型(LLMs)的革命性影响在各个领域开辟了许多机会,包括网络安全。一个有趣的问题出现了:LLMs中蕴含的广泛知识能否被利用于溯源分析,并在识别先前未知的恶意事件中发挥积极作用?为了深入了解这个问题,我们提出了一种利用LLMs进行基于溯源的威胁检测的新策略。在我们的设计中,最先进的LLM提供了溯源数据解释的额外细节,利用它们对系统调用、软件身份和应用程序执行上下文的高级理解。先进的上下文嵌入能力进一步用于捕获事件描述的丰富语义。我们全面检查了生成的嵌入的质量,结果表明它们提供了有希望的途径。随后,建立在这些嵌入基础上的机器学习模型在真实数据上表现出色。在我们的评估中,监督式威胁检测实现了99.0%的精度,半监督异常检测实现了96.9%的精度。

更新时间: 2025-03-25 20:11:36

领域: cs.CR

下载: http://arxiv.org/abs/2503.18316v2

Deep Learning Approaches for Blood Disease Diagnosis Across Hematopoietic Lineages

We present a foundation modeling framework that leverages deep learning to uncover latent genetic signatures across the hematopoietic hierarchy. Our approach trains a fully connected autoencoder on multipotent progenitor cells, reducing over 20,000 gene features to a 256-dimensional latent space that captures predictive information for both progenitor and downstream differentiated cells such as monocytes and lymphocytes. We validate the quality of these embeddings by training feed-forward, transformer, and graph convolutional architectures for blood disease diagnosis tasks. We also explore zero-shot prediction using a progenitor disease state classification model to classify downstream cell conditions. Our models achieve greater than 95% accuracy for multi-class classification, and in the zero-shot setting, we achieve greater than 0.7 F1-score on the binary classification task. Future work should improve embeddings further to increase robustness on lymphocyte classification specifically.

Updated: 2025-03-25 20:11:10

标题: 深度学习方法用于跨造血系列的血液疾病诊断

摘要: 我们提出了一个基础建模框架,利用深度学习来揭示造血层次结构中的潜在遗传特征。我们的方法在多能祖细胞上训练了一个全连接的自编码器,将超过20,000个基因特征降维到一个256维的潜在空间,捕捉了对于祖细胞和下游分化细胞(如单核细胞和淋巴细胞)的预测信息。我们通过训练前馈、变换器和图卷积结构来验证这些嵌入的质量,用于血液疾病诊断任务。我们还探索了使用祖细胞疾病状态分类模型进行零样本预测,以对下游细胞状态进行分类。我们的模型在多类别分类方面实现了超过95%的准确率,在零样本设置中,在二元分类任务上实现了超过0.7的F1分数。未来的工作应进一步改进嵌入,以增加对淋巴细胞分类的鲁棒性。

更新时间: 2025-03-25 20:11:10

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2503.20049v1

Pretraining Generative Flow Networks with Inexpensive Rewards for Molecular Graph Generation

Generative Flow Networks (GFlowNets) have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from rewards treated as unnormalized distributions. Previous works in this framework often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using drug-like molecule datasets, which teaches A-GFNs about inexpensive yet informative molecular descriptors such as drug-likeness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further implement a goal-conditioned finetuning process, which adapts A-GFNs to optimize for specific target properties. In this work, we pretrain A-GFN on a subset of the ZINC dataset, and by employing robust evaluation metrics we show the effectiveness of our approach when compared to other relevant baseline methods for a wide range of drug design tasks.

Updated: 2025-03-25 19:56:33

标题: 使用廉价奖励对生成流网络进行预训练以进行分子图生成

摘要: Generative Flow Networks(GFlowNets)最近已经成为一个合适的框架,通过学习作为非归一化分布处理的奖励来生成多样化且高质量的分子结构。在这个框架中,先前的工作通常通过使用预定义的分子片段作为构建块来限制探索,从而限制可以访问的化学空间。在这项工作中,我们引入了原子GFlowNets(A-GFNs),这是一个基础性的生成模型,利用单个原子作为构建块更全面地探索药物样化学空间。我们提出了一种无监督的预训练方法,使用类似药物的分子数据集来教导A-GFNs有关廉价但信息丰富的分子描述符,如类似药性、拓扑极性表面积和合成可及性得分。这些属性作为代理奖励,引导A-GFNs朝着表现出理想药理特性的化学空间区域前进。我们进一步实施了一个目标条件微调过程,该过程使A-GFNs能够优化特定目标特性。在这项工作中,我们在ZINC数据库的子集上预训练A-GFN,并通过使用稳健的评估指标,与其他相关基线方法相比,展示了我们的方法在广泛的药物设计任务中的有效性。

更新时间: 2025-03-25 19:56:33

领域: cs.LG

下载: http://arxiv.org/abs/2503.06337v2

Elastic Federated Learning over Open Radio Access Network (O-RAN) for Concurrent Execution of Multiple Distributed Learning Tasks

Federated learning (FL) is a popular distributed machine learning (ML) technique in Internet of Things (IoT) networks, where resource-constrained devices collaboratively train ML models while preserving data privacy. However, implementation of FL over 5G-and-beyond wireless networks faces key challenges caused by (i) dynamics of the wireless network conditions and (ii) the coexistence of multiple FL-services in the system. In this paper, we unveil two key phenomena that arise from these challenges: over/under-provisioning of resources and perspective-driven load balancing, both of which significantly impact FL performance in IoT environments. We take the first steps towards addressing these phenomena by proposing a novel distributed ML architecture called elastic FL (EFL). EFL unleashes the full potential of Open RAN (O-RAN) systems and introduces an elastic resource provisioning methodology to execute FL-services. It further constitutes a multi-time-scale FL management system that introduces three dedicated network control functionalities tailored for FL-services, including (i) non-real-time (non-RT) system descriptor, which trains ML-based applications to predict both system and FL-related dynamics and parameters; (ii) near-RT FL controller, which handles O-RAN slicing and mobility management for the seamless execution of FL-services; (iii) FL MAC scheduler, which conducts real-time resource allocation to the end clients of various FL-services. We finally prototype EFL to demonstrate its potential in improving the performance of FL-services.

Updated: 2025-03-25 19:48:49

标题: 弹性联邦学习在开放式无线接入网络(O-RAN)上的应用:用于同时执行多个分布式学习任务

摘要: 联邦学习(FL)是一种在物联网(IoT)网络中流行的分布式机器学习(ML)技术,资源受限的设备在保护数据隐私的同时合作训练ML模型。然而,在5G及更高版本的无线网络上实施FL面临着由无线网络条件的动态性和系统中多个FL服务共存引起的关键挑战。本文揭示了由这些挑战引起的两个关键现象:资源过度/不足配置和基于视角的负载平衡,这两者都显著影响了物联网环境中FL的性能。我们通过提出一种名为弹性FL(EFL)的新型分布式ML架构,迈出了解决这些现象的第一步。EFL释放了开放RAN(O-RAN)系统的全部潜力,并引入了一种弹性资源配置方法来执行FL服务。它进一步构建了一个多时间尺度的FL管理系统,引入了三个专门为FL服务量身定制的网络控制功能,包括:(i)非实时(non-RT)系统描述符,用于训练基于ML的应用程序以预测系统和FL相关的动态和参数;(ii)近实时FL控制器,用于处理O-RAN切片和移动管理,以实现FL服务的无缝执行;(iii)FL MAC调度器,用于为各种FL服务的最终客户进行实时资源分配。最后,我们对EFL进行了原型设计,展示了其改善FL服务性能的潜力。

更新时间: 2025-03-25 19:48:49

领域: cs.NI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2305.02109v5

BugCraft: End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft

Reproducing game bugs, in our case crash bugs in continuously evolving games like Minecraft, is notoriously manual, time-consuming, and challenging to automate. Despite the success of LLM-driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed. This paper introduces BugCraft, a novel end-to-end framework designed to automate the reproduction of crash bugs in Minecraft directly from user-submitted bug reports, addressing the critical gap in automated game bug reproduction. BugCraft employs a two-stage approach: first, a Step Synthesizer leverages LLMs and Minecraft Wiki knowledge to transform bug reports into high-quality, structured steps to reproduce (S2R). Second, an Action Model, powered by a vision-based LLM agent (GPT-4o) and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash. To facilitate evaluation, we introduce BugCraft-Bench, a curated dataset of Minecraft crash bug reports. Evaluated on BugCraft-Bench, our framework successfully reproduced 30.23% of crash bugs end-to-end. The Step Synthesizer demonstrated a 66.28% accuracy in generating correct bug reproduction plans, highlighting its effectiveness in interpreting and structuring bug report information. BugCraft demonstrates the feasibility of automated reproduction of crash bugs in complex game environments using LLMs, opening promising avenues for game testing and development. The framework and the BugCraft-Bench dataset pave the way for future research in automated game bug analysis and hold potential for generalization to other interactive game platforms. Finally, we open-source our code at https://bugcraft2025.github.io/

Updated: 2025-03-25 19:34:24

标题: BugCraft:在Minecraft中使用LLM代理进行端到端崩溃漏洞复现

摘要: 复制游戏中的错误,比如Minecraft中的崩溃错误,是一个臭名昭著的手动、耗时且具有挑战性的过程,很难自动化。尽管在其他软件领域LLM驱动的错误复制取得了成功,但游戏由于其复杂的交互环境,仍然基本没有得到解决。本文介绍了BugCraft,这是一个新颖的端到端框架,旨在从用户提交的错误报告中自动重现Minecraft中的崩溃错误,填补了自动化游戏错误重现中的重要缺口。BugCraft采用两阶段方法:首先,一个步骤合成器利用LLMs和Minecraft Wiki知识将错误报告转化为高质量的结构化重现步骤(S2R)。其次,一个由基于视觉的LLM代理(GPT-4o)和自定义宏API支持的行动模型,在Minecraft中执行这些S2R步骤,以触发报告的崩溃。为了便于评估,我们引入了BugCraft-Bench,这是一个由Minecraft崩溃错误报告构成的策划数据集。在BugCraft-Bench上进行评估,我们的框架成功端到端地重现了30.23%的崩溃错误。步骤合成器展示了在生成正确的错误重现计划方面的66.28%准确性,突显了其在解释和结构化错误报告信息方面的有效性。BugCraft展示了使用LLMs在复杂游戏环境中自动重现崩溃错误的可行性,为游戏测试和开发开辟了有前途的途径。该框架和BugCraft-Bench数据集为未来自动游戏错误分析的研究铺平了道路,并具有推广到其他互动游戏平台的潜力。最后,我们在https://bugcraft2025.github.io/开放了我们的代码。

更新时间: 2025-03-25 19:34:24

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2503.20036v1

OmniNova: A General Multimodal Agent Framework

The integration of Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems. However, orchestrating multiple LLM-driven agents to tackle complex tasks remains challenging due to coordination difficulties, inefficient resource utilization, and inconsistent information flow. We present OmniNova, a modular multi-agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities. OmniNova introduces three key innovations: (1) a hierarchical multi-agent architecture with distinct coordinator, planner, supervisor, and specialist agents; (2) a dynamic task routing mechanism that optimizes agent deployment based on task complexity; and (3) a multi-layered LLM integration system that allocates appropriate models to different cognitive requirements. Our evaluations across 50 complex tasks in research, data analysis, and web interaction domains demonstrate that OmniNova outperforms existing frameworks in task completion rate (87\% vs. baseline 62\%), efficiency (41\% reduced token usage), and result quality (human evaluation score of 4.2/5 vs. baseline 3.1/5). We contribute both a theoretical framework for multi-agent system design and an open-source implementation that advances the state-of-the-art in LLM-based automation systems.
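The dynamic task-routing idea (innovation 2) can be sketched as a complexity score that selects both an agent role and an LLM tier. The scoring heuristics, role names, and tiers below are invented for illustration; OmniNova's actual routing rules are not specified in the abstract.

```python
# Hypothetical routing sketch: score a task's complexity, then pick an
# agent role and a model tier for it.
def complexity(task: dict) -> int:
    score = len(task.get("subgoals", []))
    score += 2 if task.get("needs_web_search") else 0
    score += 2 if task.get("needs_code_execution") else 0
    return score

def route(task: dict) -> dict:
    c = complexity(task)
    if c >= 5:
        return {"agent": "planner", "model": "large"}
    if c >= 2:
        return {"agent": "specialist", "model": "medium"}
    return {"agent": "coordinator", "model": "small"}

simple = {"subgoals": ["summarize"]}
hard = {"subgoals": ["collect", "analyze", "report"], "needs_web_search": True}
print(route(simple)["model"], route(hard)["model"])  # small large
```

The point of such routing is the multi-layered LLM integration the abstract describes: cheap models absorb simple tasks, so large-model calls (and tokens) are reserved for tasks that need them.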

Updated: 2025-03-25 19:21:01

标题: OmniNova: 一个通用的多模态代理框架

摘要: 将大型语言模型(LLMs)与专门工具集成为一体,为智能自动化系统带来了新的机遇。然而,由于协调困难、资源利用效率低和信息流不一致,协调多个LLM驱动的代理来处理复杂任务仍然具有挑战性。我们提出了OmniNova,这是一个模块化的多代理自动化框架,将语言模型与网页搜索、爬虫和代码执行能力等专门工具相结合。OmniNova引入了三个关键创新:(1)具有明确协调员、规划员、监督员和专家代理的分层多代理体系结构;(2)一种动态任务路由机制,根据任务复杂性优化代理部署;(3)一个多层次的LLM集成系统,为不同认知需求分配适当的模型。我们在研究、数据分析和网络交互领域的50个复杂任务中的评估结果表明,OmniNova在任务完成率(87\%对基线62\%)、效率(41%减少令牌使用)和结果质量(人工评估得分4.2/5对基线3.1/5)方面优于现有框架。我们既为多代理系统设计提供了一个理论框架,也提供了一个开源实现,推动了基于LLM的自动化系统的技术水平。

更新时间: 2025-03-25 19:21:01

领域: cs.AI

下载: http://arxiv.org/abs/2503.20028v1

A scalable gene network model of regulatory dynamics in single cells

Single-cell data provide high-dimensional measurements of the transcriptional states of cells, but extracting insights into the regulatory functions of genes, particularly identifying transcriptional mechanisms affected by biological perturbations, remains a challenge. Many perturbations induce compensatory cellular responses, making it difficult to distinguish direct from indirect effects on gene regulation. Modeling how gene regulatory functions shape the temporal dynamics of these responses is key to improving our understanding of biological perturbations. Dynamical models based on differential equations offer a principled way to capture transcriptional dynamics, but their application to single-cell data has been hindered by computational constraints, stochasticity, sparsity, and noise. Existing methods either rely on low-dimensional representations or make strong simplifying assumptions, limiting their ability to model transcriptional dynamics at scale. We introduce a Functional and Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale, provides improved functional insights into transcriptional mechanisms perturbed by gene knockouts, both in myeloid differentiation and K562 Perturb-seq experiments, and simulates single-cell trajectories of A549 cells following small-molecule perturbations.
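The core modeling idea — coupled differential equations whose interactions are restricted by a gene network — can be illustrated with a toy Euler integration. The dynamics function `tanh(A @ x) - decay * x` and the adjacency matrix are invented stand-ins; FLeCS learns its dynamics from data.

```python
import numpy as np

def simulate(expression, adjacency, steps=50, dt=0.1, decay=0.5):
    """Euler-integrate a toy gene-regulatory ODE:
    dx/dt = tanh(A @ x) - decay * x,
    where the adjacency matrix A restricts which genes regulate which."""
    x = expression.copy()
    trajectory = [x.copy()]
    for _ in range(steps):
        x = x + dt * (np.tanh(adjacency @ x) - decay * x)
        trajectory.append(x.copy())
    return np.array(trajectory)

# Hypothetical 3-gene network: gene 0 activates gene 1 (weight +1),
# gene 1 represses gene 2 (weight -1). Network structure enters the
# dynamics only through A.
A = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, -1.0, 0.0]])
traj = simulate(np.array([1.0, 0.0, 1.0]), A)
print(traj[-1].round(2))
```

A knockout is then a simulated intervention: zeroing a row and column of `A` (and clamping that gene's state) and re-integrating, which is the kind of perturbation analysis the abstract describes at scale.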

Updated: 2025-03-25 19:19:21

标题: 一个可扩展的基因网络模型:单细胞中的调控动态

摘要: 单细胞数据提供了细胞转录状态的高维度测量,但提取基因的调控功能洞见,特别是识别生物扰动所影响的转录机制,仍然是一个挑战。许多扰动引起了细胞的补偿性反应,使得难以区分基因调控的直接和间接影响。建模基因调控功能如何塑造这些反应的时间动态是提高我们对生物扰动理解的关键。基于微分方程的动力学模型提供了捕捉转录动态的原则性方法,但它们在单细胞数据中的应用受到计算约束、随机性、稀疏性和噪声的阻碍。现有方法要么依赖于低维表示,要么做出强烈简化的假设,限制了它们在规模上建模转录动态的能力。我们引入了一个名为FLeCS(Functional and Learnable model of Cell dynamicS)的模型,将基因网络结构纳入耦合微分方程中,用于建模基因调控功能。给定(伪)时间序列的单细胞数据,FLeCS能够准确推断规模上的细胞动态,提供了改进的功能洞见,揭示了基因敲除在髓系分化和K562 Perturb-seq实验中扰动的转录机制,并模拟了A549细胞在小分子干扰后的单细胞轨迹。

更新时间: 2025-03-25 19:19:21

领域: q-bio.MN,cs.LG

下载: http://arxiv.org/abs/2503.20027v1

Autoregressive Action Sequence Learning for Robotic Manipulation

Designing a universal policy architecture that performs well across diverse robots and task configurations remains a key challenge. In this work, we address this by representing robot actions as sequential data and generating actions through autoregressive sequence modeling. Existing autoregressive architectures generate end-effector waypoints sequentially as word tokens in language modeling, which are limited to low-frequency control tasks. Unlike language, robot actions are heterogeneous and often include continuous values -- such as joint positions, 2D pixel coordinates, and end-effector poses -- which are not easily suited for language-based modeling. Based on this insight, we introduce a straightforward enhancement: we extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step through our Chunking Causal Transformer (CCT). This enhancement enables robust performance across diverse tasks of various control frequencies, greater efficiency through fewer autoregression steps, and a hybrid action sequence design that mixes different types of actions and uses a different chunk size for each action type. Based on CCT, we propose the Autoregressive Policy (ARP) architecture, which solves manipulation tasks by generating hybrid action sequences. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that ARP, as a universal architecture, matches or outperforms the environment-specific state-of-the-art in all tested benchmarks, while being more efficient in computation and parameter sizes. Videos of our real robot demonstrations, all source code and the pretrained models of ARP can be found at http://github.com/mlzxy/arp.

Updated: 2025-03-25 19:16:05

标题: 自回归动作序列学习用于机器人操作

摘要: 设计一个能够在不同机器人和任务配置下表现良好的通用策略架构仍然是一个关键挑战。在这项工作中,我们通过将机器人动作表示为序列数据,并通过自回归序列建模生成动作来解决这个问题。现有的自回归架构将末端执行器路径点逐个生成为语言建模中的单词标记,这些标记仅适用于低频控制任务。与语言不同,机器人动作是异质的,通常包括连续值,如关节位置、2D像素坐标和末端执行器姿态,这些值不容易适用于基于语言的建模。基于这一观察,我们引入了一个简单的增强:我们将因果变换器的单标记预测扩展到通过我们的Chunking Causal Transformer(CCT)在单步中支持预测可变数量的标记。这个增强使得在各种控制频率的不同任务中实现了稳健的性能,通过更少的自回归步骤实现了更高的效率,并通过混合不同类型的动作和为每种动作类型使用不同的块大小实现了混合动作序列设计。基于CCT,我们提出了自回归策略(ARP)架构,通过生成混合动作序列来解决操纵任务。我们在包括Push-T、ALOHA和RLBench在内的多样化机器人操纵环境中评估了ARP,并展示了作为通用架构的ARP在所有测试基准中与或优于特定环境的最新技术水平,同时在计算和参数大小方面更加高效。我们的真实机器人演示视频、所有源代码以及ARP的预训练模型可以在http://github.com/mlzxy/arp找到。

更新时间: 2025-03-25 19:16:05

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2410.03132v5

Experience Replay Addresses Loss of Plasticity in Continual Learning

Loss of plasticity is one of the main challenges in continual learning with deep neural networks, where neural networks trained via backpropagation gradually lose their ability to adapt to new tasks and perform significantly worse than their freshly initialized counterparts. The main contribution of this paper is to propose a new hypothesis that experience replay addresses the loss of plasticity in continual learning. Here, experience replay is a form of memory. We provide supporting evidence for this hypothesis. In particular, we demonstrate in multiple different tasks, including regression, classification, and policy evaluation, that by simply adding an experience replay and processing the data in the experience replay with Transformers, the loss of plasticity disappears. Notably, we do not alter any standard components of deep learning. For example, we do not change backpropagation. We do not modify the activation functions. And we do not use any regularization. We conjecture that experience replay and Transformers can address the loss of plasticity because of the in-context learning phenomenon.
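The memory mechanism at the center of the hypothesis is an ordinary experience replay buffer, sketched below. The ring-buffer design and capacity are illustrative; the Transformer that processes the replayed data is out of scope here.

```python
import random

class ReplayBuffer:
    """Minimal experience replay: store past (input, target) pairs and mix
    them into each update. In the paper, replayed data is processed by a
    Transformer, which is where in-context learning is conjectured to help."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.position = 0

    def add(self, experience):
        if len(self.storage) < self.capacity:
            self.storage.append(experience)
        else:  # overwrite the oldest entry once full (a ring buffer)
            self.storage[self.position] = experience
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

buffer = ReplayBuffer(capacity=3)
for task_id in range(5):  # a stream of 5 tasks; only the last 3 fit
    buffer.add(("task", task_id))
print(sorted(buffer.storage))  # [('task', 2), ('task', 3), ('task', 4)]
```

Note how this adds nothing to the standard deep learning toolkit: no new optimizer, activation, or regularizer, exactly as the abstract emphasizes.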

Updated: 2025-03-25 19:01:10

标题: 经验重播解决了持续学习中的可塑性丧失问题

摘要: 失去可塑性是深度神经网络持续学习中的主要挑战之一,通过反向传播训练的神经网络逐渐失去适应新任务并且表现明显不如新初始化的对照组。本文的主要贡献是提出一个新的假设,即经验重播可以解决持续学习中的可塑性丧失问题。在这里,经验重播是一种记忆形式。我们提供了支持这一假设的证据。特别地,我们在多个不同任务中进行了演示,包括回归、分类和策略评估,通过简单地添加经验重播并且使用Transformer处理经验重播中的数据,可消除可塑性的丧失。值得注意的是,我们没有改变深度学习的任何标准组件。例如,我们没有改变反向传播。我们没有修改激活函数。我们也没有使用任何正则化。我们推测经验重播和Transformer可以解决可塑性丧失的问题是因为上下文学习现象。

更新时间: 2025-03-25 19:01:10

领域: cs.LG,cs.AI,cs.NE

下载: http://arxiv.org/abs/2503.20018v1

Cross-modal Information Flow in Multimodal Large Language Models

The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs, thereby facilitating future research into multimodal information localization and editing. Our code and collected dataset are released here: https://github.com/FightingFighting/cross-modal-information-flow-in-MLLM.git.

Updated: 2025-03-25 18:59:50

标题: 多模态大型语言模型中的跨模态信息流

摘要: 最近自回归多模态大型语言模型(MLLMs)的进展展示了在视觉-语言任务中取得了有希望的进展。虽然存在着许多研究探究大型语言模型内部对语言信息的处理,但目前对MLLMs的内部工作机制以及语言和视觉信息在这些模型中如何相互作用了解甚少。本研究旨在填补这一空白,通过研究在MLLMs中不同模态之间的信息流--语言和视觉--,侧重于视觉问答。具体而言,给定一个图像-问题对作为输入,我们研究模型在何处以及如何将视觉和语言信息相结合以生成最终预测。通过对LLaVA系列中一系列模型进行实验,我们发现在整合两种模态的过程中存在两个明显的阶段。在较低层中,模型首先将整个图像的更一般的视觉特征转换为(语言)问题标记的表示。在中间层中,它再次将与问题相关的特定对象的视觉信息转移到问题的相应标记位置。最后,在较高层中,生成的多模态表示被传播到输入序列的最后位置进行最终预测。总的来说,我们的发现提供了关于MLLMs中图像和语言处理的空间和功能方面的新颖和全面的视角,从而促进未来对多模态信息定位和编辑的研究。我们的代码和收集的数据集在这里发布:https://github.com/FightingFighting/cross-modal-information-flow-in-MLLM.git。

更新时间: 2025-03-25 18:59:50

领域: cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2411.18620v2

LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at: https://adl-x.github.io/.

Updated: 2025-03-25 18:54:55

标题: LLAVIDAL:一个用于日常生活活动的大型语言视觉模型

摘要: 目前基于网络视频训练的大型语言视觉模型(LLVMs)在一般视频理解方面表现良好,但在细节方面、复杂的人物-对象交互(HOI)和对日常生活活动(ADL)至关重要的视角不变表示学习方面存在困难。这种限制源于缺乏专门的ADL视频指导调整数据集和不足的模态集成来捕获具有区分性的动作表示。为了解决这个问题,我们提出了一个半自动化的框架来筛选ADL数据集,创建了ADL-X,一个多视角、多模态的RGBS指导调整数据集。此外,我们引入了LLAVIDAL,一个整合视频、3D骨架和HOIs的LLVM,用于建模ADL的复杂时空关系。对LLAVIDAL进行训练时,所有模态的简单联合对齐会产生次优结果;因此,我们提出了一个多模态逐步(MMPro)训练策略,按照课程将模态逐步整合。我们还建立了ADL MCQ和视频描述基准,以评估LLVM在ADL任务中的表现。在ADL-X上训练的LLAVIDAL在ADL基准测试中取得了最先进的表现。代码和数据将公开发布在:https://adl-x.github.io/。

更新时间: 2025-03-25 18:54:55

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.09390v3

Linear Diffusion Networks

We present Linear Diffusion Networks (LDNs), a novel architecture that reinterprets sequential data processing as a unified diffusion process. Our model integrates adaptive diffusion modules with localized nonlinear updates and a diffusion-inspired attention mechanism. This design enables efficient global information propagation while preserving fine-grained temporal details. LDN overcomes the limitations of conventional recurrent and transformer models by allowing full parallelization across time steps and supporting robust multi-scale temporal representations. Experiments on benchmark sequence modeling tasks demonstrate that LDN delivers competitive performance across ImageNet and LRA tasks.

Updated: 2025-03-25 18:52:09

标题: 线性扩散网络

摘要: 我们提出了线性扩散网络(LDNs),这是一种新颖的架构,将顺序数据处理重新解释为统一的扩散过程。我们的模型将自适应扩散模块与局部非线性更新和受扩散启发的注意机制相结合。这种设计使得全局信息传播效率高,同时保留了细粒度的时间细节。LDN通过允许在时间步骤之间进行完全并行化,并支持稳健的多尺度时间表示,克服了传统循环和变压器模型的局限性。在基准序列建模任务上的实验表明,LDN在ImageNet和LRA任务上实现了竞争性表现。

更新时间: 2025-03-25 18:52:09

领域: cs.LG

下载: http://arxiv.org/abs/2502.12381v4

Unlocking Guidance for Discrete State-Space Diffusion and Flow Models

Generative models on discrete state-spaces have a wide range of potential applications, particularly in the domain of natural sciences. In continuous state-spaces, controllable and flexible generation of samples with desired properties has been realized using guidance on diffusion and flow models. However, these guidance approaches are not readily amenable to discrete state-space models. Consequently, we introduce a general and principled method for applying guidance on such models. Our method depends on leveraging continuous-time Markov processes on discrete state-spaces, which unlocks computational tractability for sampling from a desired guided distribution. We demonstrate the utility of our approach, Discrete Guidance, on a range of applications including guided generation of small-molecules, DNA sequences and protein sequences.
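The guidance construction over a continuous-time Markov chain can be illustrated with an exponential tilting of the rate matrix. This is a much-simplified sketch under stated assumptions — a known per-state guidance value `v` and the tilt `q'(x→y) = q(x→y) · exp(strength · (v(y) − v(x)))` — whereas the paper derives the guided process from first principles.

```python
import numpy as np

def guide_rates(rates, guidance_values, strength=1.0):
    """Tilt CTMC transition rates toward states the guidance scores highly.

    rates: (n, n) generator-style matrix of off-diagonal jump rates.
    Returns a valid generator: nonnegative off-diagonals, rows sum to zero.
    """
    v = np.asarray(guidance_values, dtype=float)
    tilt = np.exp(strength * (v[None, :] - v[:, None]))
    guided = rates * tilt
    np.fill_diagonal(guided, 0.0)
    # Diagonal holds the negative exit rate so each row sums to zero.
    np.fill_diagonal(guided, -guided.sum(axis=1))
    return guided

# Three states with uniform base rates; state 2 is preferred by guidance.
Q = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
guided = guide_rates(Q, guidance_values=[0.0, 0.0, 1.0])
print(guided[0, 2] > guided[0, 1])  # True: jumps into state 2 are boosted
```

Sampling from the guided chain (e.g. with the Gillespie algorithm) then concentrates trajectories on high-value states, which is the computational tractability the abstract refers to.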

Updated: 2025-03-25 18:50:23

标题: 解锁离散状态空间扩散和流模型的指导意见

摘要: 离散状态空间上的生成模型具有广泛的潜在应用,特别是在自然科学领域。在连续状态空间中,通过对扩散和流模型的指导实现了具有期望特性的样本的可控和灵活生成。然而,这些指导方法并不容易适用于离散状态空间模型。因此,我们介绍了一种通用和原则性的方法来应用这些模型的指导。我们的方法依赖于在离散状态空间上利用连续时间马尔可夫过程,这为从期望的引导分布中取样提供了计算上的可行性。我们展示了我们的方法“离散引导”在一系列应用中的实用性,包括引导生成小分子、DNA序列和蛋白质序列。

更新时间: 2025-03-25 18:50:23

领域: cs.LG

下载: http://arxiv.org/abs/2406.01572v4

Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair

Machine translation for low-resource language pairs is a challenging task. This task can become extremely difficult once a speaker uses code-switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switched Kazakh-Russian parallel corpus and the evaluation results, which include a model achieving 16.48 BLEU, nearly matching an existing commercial system and surpassing it in human evaluation.
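One common way to generate such synthetic data is to splice phrases of one language into sentences of the other, yielding (code-switched source, clean target) training pairs. The abstract does not specify the exact generation procedure, so the sketch below, including its tiny phrase table, is purely illustrative.

```python
import random

# Invented toy phrase table: Kazakh phrase -> Russian phrase.
PHRASE_TABLE = {
    "жақсы кітап": "хорошая книга",
    "үлкен қала": "большой город",
}

def make_code_switched_pair(russian_sentence, rng):
    """Replace one known Russian phrase with its Kazakh counterpart,
    yielding a code-switched source and a clean Russian target."""
    candidates = [(kk, ru) for kk, ru in PHRASE_TABLE.items()
                  if ru in russian_sentence]
    if not candidates:
        return None  # sentence contains no replaceable phrase
    kk, ru = rng.choice(candidates)
    return russian_sentence.replace(ru, kk), russian_sentence

pair = make_code_switched_pair("это хорошая книга", random.Random(0))
print(pair)  # ('это жақсы кітап', 'это хорошая книга')
```

Training an MT model on many such pairs teaches it to normalize embedded Kazakh spans without requiring any manually labeled code-switched data.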

Updated: 2025-03-25 18:46:30

标题: 低资源条件下的代码切换哈萨克语-俄语语言对机器翻译

摘要: 低资源语言对的机器翻译是一项具有挑战性的任务。 一旦说话者使用代码切换,这项任务可能变得极其困难。 我们提出了一种方法,用于构建一个没有标记数据的代码切换哈萨克语-俄语语言对的机器翻译模型。我们的方法基于合成数据的生成。 此外,我们提供了第一个代码切换的哈萨克语-俄语平行语料库以及评估结果,其中包括一个模型达到了16.48 BLEU,几乎达到了现有商业系统,并在人类评估中击败了它。

更新时间: 2025-03-25 18:46:30

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.20007v1

Unsupervised Learning for Quadratic Assignment

We introduce PLUME search, a data-driven framework that enhances search efficiency in combinatorial optimization through unsupervised learning. Unlike supervised or reinforcement learning, PLUME search learns directly from problem instances using a permutation-based loss with a non-autoregressive approach. We evaluate its performance on the quadratic assignment problem, a fundamental NP-hard problem that encompasses various combinatorial optimization problems. Experimental results demonstrate that PLUME search consistently improves solution quality. Furthermore, we study the generalization behavior and show that the learned model generalizes across different densities and sizes.
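For readers unfamiliar with the quadratic assignment problem (QAP), the objective a permutation-based learner like PLUME must score is shown below; the learning itself (the non-autoregressive permutation predictor) is not sketched here, and the tiny instance is invented.

```python
from itertools import permutations

import numpy as np

def qap_cost(flow, distance, perm):
    """QAP objective: sum_{i,j} F[i, j] * D[perm[i], perm[j]],
    i.e. facilities i, j are placed at locations perm[i], perm[j]."""
    perm = np.asarray(perm)
    return float((flow * distance[np.ix_(perm, perm)]).sum())

F = np.array([[0, 5, 2],
              [5, 0, 3],
              [2, 3, 0]], dtype=float)
D = np.array([[0, 1, 4],
              [1, 0, 2],
              [4, 2, 0]], dtype=float)

# Brute-force the 3! permutations to find the optimal assignment.
best = min(permutations(range(3)), key=lambda p: qap_cost(F, D, p))
print(best, qap_cost(F, D, best))  # (0, 1, 2) 38.0
```

Brute force is only viable for toy sizes; NP-hardness is precisely why a learned, data-driven search like PLUME is attractive.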

Updated: 2025-03-25 18:37:46

标题: 无监督学习用于二次分配

摘要: 我们介绍了PLUME搜索,这是一个数据驱动的框架,通过无监督学习增强组合优化中的搜索效率。与监督学习或强化学习不同,PLUME搜索直接从问题实例中学习,使用基于排列的损失和非自回归方法。我们在二次分配问题上评估其性能,这是一个包含各种组合优化问题的基本NP难问题。实验结果表明,PLUME搜索始终改善解决方案质量。此外,我们研究了泛化行为,并展示了学习模型在不同密度和大小上的泛化能力。

更新时间: 2025-03-25 18:37:46

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.20001v1

The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs

Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.

Updated: 2025-03-25 18:33:59

Fields: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.20000v1

Unsupervised Ordering for Maximum Clique

We propose an unsupervised approach for learning vertex orderings for the maximum clique problem by framing it within a permutation-based framework. We transform the combinatorial constraints into geometric relationships such that the ordering of vertices aligns with the clique structures. By integrating this clique-oriented ordering into branch-and-bound search, we improve search efficiency and reduce the number of computational steps. Our results demonstrate how unsupervised learning of vertex ordering can enhance search efficiency across diverse graph instances. We further study the generalization across different sizes.
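
The paper's contribution is the ordering itself; to see why ordering matters, here is a generic branch-and-bound maximum-clique search that consumes an arbitrary vertex ordering (a hypothetical stand-in for the learned, clique-oriented one):

```python
def max_clique(adj, order):
    """Branch-and-bound maximum clique; `adj` maps each vertex to its
    neighbor set, `order` is the vertex ordering driving the search."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) + len(candidates) <= len(best):
            return  # bound: cannot beat the incumbent
        if not candidates:
            if len(clique) > len(best):
                best = clique[:]
            return
        for i, v in enumerate(candidates):
            if len(clique) + len(candidates) - i <= len(best):
                return  # remaining candidates are too few to improve
            expand(clique + [v],
                   [u for u in candidates[i + 1:] if u in adj[v]])

    expand([], order)
    return best

# 5-vertex graph whose maximum clique is {0, 1, 2}.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
print(max_clique(adj, [0, 1, 2, 3, 4]))   # → [0, 1, 2]
```

An ordering that places clique vertices early lets the size bound prune subtrees sooner, which is the efficiency gain the abstract describes.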

Updated: 2025-03-25 18:28:49

Fields: cs.LG

Download: http://arxiv.org/abs/2503.21814v1

Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models

Long-form writing agents require flexible integration and interaction across information retrieval, reasoning, and composition. Current approaches rely on predetermined workflows and rigid thinking patterns to generate outlines before writing, resulting in constrained adaptability during writing. In this paper, we propose a general agent framework that achieves human-like adaptive writing through recursive task decomposition and dynamic integration of three fundamental task types, i.e. retrieval, reasoning, and composition. Our methodology features: 1) a planning mechanism that interleaves recursive task decomposition and execution, eliminating artificial restrictions on writing workflow; and 2) integration of task types that facilitates heterogeneous task decomposition. Evaluations on both fiction writing and technical report generation show that our method consistently outperforms state-of-the-art approaches across all automatic evaluation metrics, demonstrating the effectiveness and broad applicability of our proposed framework.

Updated: 2025-03-25 18:27:55

Fields: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.08275v2

LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce \textbf{LEGO-Puzzles}, a scalable benchmark designed to evaluate both \textbf{spatial understanding} and \textbf{sequential reasoning} in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90\% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.

Updated: 2025-03-25 18:21:07

Fields: cs.AI

Download: http://arxiv.org/abs/2503.19990v1

ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback

Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: ExCoT improves execution accuracy on BIRD dev set from 57.37% to 68.51% and on Spider test set from 78.81% to 86.59% for LLaMA-3 70B, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both BIRD and Spider datasets, notably achieving 68.53% on the BIRD test set.
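
As an illustration of using execution accuracy as the only feedback signal, the sketch below builds DPO-style (chosen, rejected) pairs from sampled CoT+SQL candidates. The `execute` function and the toy "queries" are stand-ins for a real SQL engine and are not from the paper:

```python
def build_preference_pairs(samples, execute, gold_result):
    """Form (chosen, rejected) pairs for DPO using only execution
    feedback: a sampled CoT+SQL candidate is 'chosen' iff its query
    reproduces the gold execution result."""
    chosen = [s for s in samples if execute(s["sql"]) == gold_result]
    rejected = [s for s in samples if execute(s["sql"]) != gold_result]
    return [(c, r) for c in chosen for r in rejected]

# Hypothetical stand-in for execution: evaluating arithmetic strings.
samples = [{"cot": "sum them", "sql": "1+1"},
           {"cot": "multiply", "sql": "1*1"},
           {"cot": "add", "sql": "2+0"}]
pairs = build_preference_pairs(samples, eval, 2)
print(len(pairs))   # 2 correct x 1 incorrect = 2 pairs
```

No reward model or human preference label is needed: correctness of the executed result alone partitions the samples.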

Updated: 2025-03-25 18:17:36

Fields: cs.LG,cs.AI,cs.DB

Download: http://arxiv.org/abs/2503.19988v1

IPGO: Indirect Prompt Gradient Optimization on Text-to-Image Generative Models with High Data Efficiency

Text-to-Image Diffusion models excel at generating images from text prompts but often lack optimal alignment with content semantics, aesthetics, and human preferences. To address these issues, in this study we introduce a novel framework, Indirect Prompt Gradient Optimization (IPGO), for prompt-level fine-tuning. IPGO enhances prompt embeddings by injecting continuously differentiable tokens at the beginning and end of the prompt embeddings, while exploiting low-rank benefits and flexibility from rotations. It allows for gradient-based optimization of injected tokens while enforcing value, orthonormality, and conformity constraints, facilitating continuous updates and empowering computational efficiency. To evaluate the performance of IPGO, we conduct prompt-wise and prompt-batch training with three reward models targeting image aesthetics, image-text alignment, and human preferences under three datasets of different complexity. The results show that IPGO consistently matches or outperforms cutting-edge benchmarks, including stable diffusion v1.5 with raw prompts, training-based approaches (DRaFT and DDPO), and training-free methods (DPO-Diffusion, Promptist, and ChatGPT-4o). Furthermore, we demonstrate IPGO's effectiveness in enhancing image generation quality while requiring minimal training data and limited computational resources.
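
The core mechanism, injecting trainable tokens at the beginning and end of the frozen prompt embeddings, can be sketched as a simple concatenation. The shapes and names below are illustrative, and the paper's rotation-based low-rank parameterization and constraints are omitted:

```python
import numpy as np

def inject_tokens(prompt_emb, prefix, suffix):
    """IPGO-style indirect prompting: leave the prompt embeddings
    frozen and concatenate trainable prefix/suffix token embeddings,
    which are the only parameters updated by gradient descent."""
    return np.concatenate([prefix, prompt_emb, suffix], axis=0)

rng = np.random.default_rng(0)
d = 8                                 # embedding width (toy value)
prompt_emb = rng.normal(size=(5, d))  # frozen embeddings of 5 prompt tokens
prefix = rng.normal(size=(2, d))      # 2 trainable tokens at the start
suffix = rng.normal(size=(1, d))      # 1 trainable token at the end

augmented = inject_tokens(prompt_emb, prefix, suffix)
print(augmented.shape)   # (8, 8): 2 + 5 + 1 tokens
```

Only `prefix` and `suffix` would receive gradients from the reward objective; the original prompt embedding remains untouched, which is what makes the optimization "indirect".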

Updated: 2025-03-25 18:14:42

Fields: cs.LG

Download: http://arxiv.org/abs/2503.21812v1

Implementation of a Generative AI Assistant in K-12 Education: The CyberScholar Initiative

This paper focuses on the piloting of CyberScholar, a Generative AI (GenAI) assistant tool that aims to provide feedback on writing in K-12 contexts. The aim was to use GenAI to provide formative and summative feedback on students' texts in English Language Arts (ELA), Social Studies, and Modern World History. The trials discussed in this paper involved Grades 7, 8, 10, and 11 and were conducted in three schools in the Midwest and one in the Northwest of the United States. The tool used two main mechanisms: "prompt engineering" based on participant teachers' assessment rubrics and "fine-tuning" a Large Language Model (LLM) from a customized corpus of teaching materials using Retrieval Augmented Generation. This paper focuses on CyberScholar's potential to enhance students' writing abilities and support teachers in diverse subject areas requiring written assignments.

Updated: 2025-03-25 18:13:16

Fields: cs.CY,cs.AI,cs.HC

Download: http://arxiv.org/abs/2502.19422v2

Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields

3D reconstruction of highly deformable surfaces (e.g. cloths) from monocular RGB videos is a challenging problem, and no solution provides a consistent and accurate recovery of fine-grained surface details. To account for the ill-posed nature of the setting, existing methods use deformation models with statistical, neural, or physical priors. They also predominantly rely on nonadaptive discrete surface representations (e.g. polygonal meshes), perform frame-by-frame optimisation leading to error propagation, and suffer from poor gradients of the mesh-based differentiable renderers. Consequently, fine surface details such as cloth wrinkles are often not recovered with the desired accuracy. In response to these limitations, we propose Thin-Shell-SfT, a new method for non-rigid 3D tracking that represents a surface as an implicit and continuous spatiotemporal neural field. We incorporate a continuous thin-shell physics prior based on the Kirchhoff-Love model for spatial regularisation, which starkly contrasts with the discretised alternatives of earlier works. Lastly, we leverage 3D Gaussian splatting to differentiably render the surface into image space and optimise the deformations based on analysis-by-synthesis principles. Our Thin-Shell-SfT outperforms prior works qualitatively and quantitatively thanks to our continuous surface formulation in conjunction with a specially tailored simulation prior and surface-induced 3D Gaussians. See our project page at https://4dqv.mpiinf.mpg.de/ThinShellSfT.

Updated: 2025-03-25 18:00:46

Fields: cs.GR,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19976v1

SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining

LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow
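
One way to read the temporal voting strategy (component 4) is as a majority vote over per-point class predictions from consecutive registered scans. Keying the vote on a shared point id, as below, is our simplifying assumption for illustration:

```python
from collections import Counter

def temporal_vote(predictions):
    """Fuse per-point class predictions from consecutive LiDAR scans
    by majority vote, keyed on a (registered) point id."""
    votes = {}
    for scan in predictions:
        for point_id, label in scan.items():
            votes.setdefault(point_id, []).append(label)
    return {pid: Counter(labels).most_common(1)[0][0]
            for pid, labels in votes.items()}

# Three consecutive scans with one disagreeing prediction each.
scans = [{0: "road", 1: "car"},
         {0: "road", 1: "truck"},
         {0: "sidewalk", 1: "car"}]
print(temporal_vote(scans))   # {0: 'road', 1: 'car'}
```

Voting across time suppresses single-scan prediction flips, which is the consistency improvement the abstract attributes to this component.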

Updated: 2025-03-25 17:59:57

Fields: cs.CV,cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.19912v1

In the Magma chamber: Update and challenges in ground-truth vulnerabilities revival for automatic input generator comparison

Fuzzing is a well-established technique for detecting bugs and vulnerabilities. With the surge of fuzzers and fuzzing platforms being developed, such as AFL and OSSFuzz, comes the necessity to benchmark these tools' performance. A common problem is that vulnerability benchmarks are based on bugs in old software releases. For this very reason, Magma introduced the notion of forward-porting to reintroduce vulnerable code in current software releases. While their results are promising, the state of the art lacks an update on the maintainability of this approach over time. Indeed, adding the vulnerable code to a recent software version might either break its functionality or make the vulnerable code no longer reachable. We characterise the challenges of forward-porting by reassessing the portability of Magma's CVEs four years after its release and manually reintroducing the vulnerabilities in the current software versions. We find the straightforward process efficient for 17 of the 32 CVEs in our study. We further investigate why a trivial forward-porting process fails in the 15 other CVEs. This involves identifying the commits breaking the forward-porting process and reverting them in addition to applying the bug fix. While we manage to complete the process for nine of these CVEs, we provide an update on all 15 and explain the challenges we have been confronted with in this process. Thereby, we lay the basis for future work towards a sustainable forward-ported fuzzing benchmark.

Updated: 2025-03-25 17:59:27

Fields: cs.SE,cs.CR

Download: http://arxiv.org/abs/2503.19909v1

Reanimating Images using Neural Representations of Dynamic Stimuli

While computer vision models have made incredible strides in static image recognition, they still do not match human performance in tasks that require the understanding of complex, dynamic motion. This is notably true for real-world scenarios where embodied agents face complex and motion-rich environments. Our approach, BrainNRDS (Brain-Neural Representations of Dynamic Stimuli), leverages state-of-the-art video diffusion models to decouple static image representation from motion generation, enabling us to utilize fMRI brain activity for a deeper understanding of human responses to dynamic visual stimuli. Conversely, we also demonstrate that information about the brain's representation of motion can enhance the prediction of optical flow in artificial systems. Our novel approach leads to four main findings: (1) Visual motion, represented as fine-grained, object-level resolution optical flow, can be decoded from brain activity generated by participants viewing video stimuli; (2) Video encoders outperform image-based models in predicting video-driven brain activity; (3) Brain-decoded motion signals enable realistic video reanimation based only on the initial frame of the video; and (4) We extend prior work to achieve full video decoding from video-driven brain activity. BrainNRDS advances our understanding of how the brain represents spatial and temporal information in dynamic visual scenes. Our findings demonstrate the potential of combining brain imaging with video diffusion models for developing more robust and biologically-inspired computer vision systems. We show additional decoding and encoding examples on this site: https://brain-nrds.github.io/.

Updated: 2025-03-25 17:59:01

Fields: q-bio.NC,cs.AI,cs.CV

Download: http://arxiv.org/abs/2406.02659v3

Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.
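
A toy version of the underlying idea, aggregating features along point tracks, can be sketched as follows: gather the feature at each track's location in every frame and pool over time, yielding a temporally aligned descriptor. The real Tracktention Layer applies attention along the track rather than the plain mean used here:

```python
import numpy as np

def track_pool(features, tracks):
    """For each track (a list of (y, x) locations, one per frame),
    gather the feature at its location in every frame and average,
    producing one motion-aligned descriptor per track."""
    T = len(features)
    pooled = []
    for track in tracks:
        feats = [features[t][track[t]] for t in range(T)]
        pooled.append(np.mean(feats, axis=0))
    return np.stack(pooled)

# 3 frames of a 4x4 feature map with 2 channels; frame t is filled with t.
T, H, W, C = 3, 4, 4, 2
features = [np.full((H, W, C), t, dtype=float) for t in range(T)]
tracks = [[(0, 0), (1, 1), (2, 2)]]   # one point moving diagonally
print(track_pool(features, tracks))   # mean over frames 0, 1, 2 -> [[1. 1.]]
```

Because the gather follows the track, a moving object contributes the same descriptor regardless of where it sits in each frame, which is the alignment a fixed spatial window cannot provide.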

Updated: 2025-03-25 17:58:48

Fields: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19904v1

Not All Learnable Distribution Classes are Privately Learnable

We give an example of a class of distributions that is learnable up to constant error in total variation distance with a finite number of samples, but not learnable under $(\varepsilon, \delta)$-differential privacy with the same target error. This weakly refutes a conjecture of Ashtiani.

Updated: 2025-03-25 17:58:03

Fields: cs.DS,cs.CR,stat.ML

Download: http://arxiv.org/abs/2402.00267v3

CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed finetuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.

Updated: 2025-03-25 17:57:17

Fields: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.19900v1

A proposal for an incident regime that tracks and counters threats to national security posed by AI systems

Recent progress in AI capabilities has heightened concerns that AI systems could pose a threat to national security, for example, by making it easier for malicious actors to perform cyberattacks on critical national infrastructure, or through loss of control of autonomous AI systems. In parallel, federal legislators in the US have proposed nascent 'AI incident regimes' to identify and counter similar threats. In this paper, we consolidate these two trends and present a proposal for a legally mandated post-deployment AI incident regime that aims to counter potential national security threats from AI systems. We start the paper by introducing the concept of 'security-critical' to describe sectors that pose extreme risks to national security, before arguing that 'security-critical' describes civilian nuclear power, aviation, life science dual-use research of concern, and frontier AI development. We then present in detail our AI incident regime proposal, justifying each component of the proposal by demonstrating its similarity to US domestic incident regimes in other 'security-critical' sectors. Finally, we sketch a hypothetical scenario where our proposed AI incident regime deals with an AI cyber incident. Our proposed AI incident regime is split into three phases. The first phase revolves around a novel operationalization of what counts as an 'AI incident', and we suggest that AI providers must create a 'national security case' before deploying a frontier AI system. The second and third phases spell out that AI providers should notify a government agency about incidents, and that the government agency should be involved in amending AI providers' security and safety procedures, in order to counter future threats to national security. Our proposal is timely, given ongoing policy interest in the potential national security threats posed by AI systems.

Updated: 2025-03-25 17:51:50

Fields: cs.CY,cs.AI

Download: http://arxiv.org/abs/2503.19887v1

RCC-PFL: Robust Client Clustering under Noisy Labels in Personalized Federated Learning

We address the problem of cluster identity estimation in a personalized federated learning (PFL) setting in which users aim to learn different personal models. The backbone of effective learning in such a setting is to cluster users into groups whose objectives are similar. A typical approach in the literature is to achieve this by training users' data on different proposed personal models and assign them to groups based on which model achieves the lowest value of the users' loss functions. This process is to be done iteratively until group identities converge. A key challenge in such a setting arises when users have noisy labeled data, which may produce misleading values of their loss functions, and hence lead to ineffective clustering. To overcome this challenge, we propose a label-agnostic data similarity-based clustering algorithm, coined RCC-PFL, with three main advantages: the cluster identity estimation procedure is independent from the training labels; it is a one-shot clustering algorithm performed prior to the training; and it requires fewer communication rounds and less computation compared to iterative-based clustering methods. We validate our proposed algorithm using various models and datasets and show that it outperforms multiple baselines in terms of average accuracy and variance reduction.
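
A label-agnostic, one-shot grouping can be sketched by summarizing each client with label-free statistics and clustering the summaries once before training. The mean-feature summary and simple k-means below are illustrative choices, not necessarily the paper's exact similarity measure:

```python
import numpy as np

def cluster_clients(client_data, num_clusters=2, iters=10):
    """One-shot, label-agnostic clustering: summarize each client by
    its feature mean, then run farthest-point-initialized k-means on
    the summaries. No training labels are consulted."""
    stats = np.stack([x.mean(axis=0) for x in client_data])
    # Farthest-point initialization, then a few Lloyd iterations.
    centers = [stats[0]]
    while len(centers) < num_clusters:
        d = np.min([((stats - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(stats[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        assign = ((stats[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(num_clusters):
            if (assign == k).any():
                centers[k] = stats[assign == k].mean(0)
    return assign

rng = np.random.default_rng(1)
# Two latent groups of clients with shifted feature distributions.
group_a = [rng.normal(0.0, 0.1, size=(50, 4)) for _ in range(3)]
group_b = [rng.normal(5.0, 0.1, size=(50, 4)) for _ in range(3)]
assign = cluster_clients(group_a + group_b, num_clusters=2)
print(assign)   # first three clients share one id, last three the other
```

Because the grouping happens once, before any training, noisy labels can never distort it, in contrast to loss-driven iterative assignment.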

Updated: 2025-03-25 17:50:54

Fields: cs.LG,cs.DC,cs.IT,cs.NI,eess.SP,math.IT

Download: http://arxiv.org/abs/2503.19886v1

Dynamics of Structured Complex-Valued Hopfield Neural Networks

In this paper, we explore the dynamics of structured complex-valued Hopfield neural networks (CvHNNs), which arise when the synaptic weight matrix possesses specific structural properties. We begin by analyzing CvHNNs with a Hermitian synaptic weight matrix and establish the existence of four-cycle dynamics in CvHNNs with skew-Hermitian weight matrices operating synchronously. Furthermore, we introduce two new classes of complex-valued matrices: braided Hermitian and braided skew-Hermitian matrices. We demonstrate that CvHNNs utilizing these matrix types exhibit cycles of length eight when operating in full parallel update mode. Finally, we conduct extensive computational experiments on synchronous CvHNNs, exploring other synaptic weight matrix structures. The findings provide a comprehensive overview of the dynamics of structured CvHNNs, offering insights that may contribute to developing improved associative memory models when integrated with suitable learning rules.
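
The four-cycle claim for skew-Hermitian weights has a minimal linear illustration: for the real skew-symmetric (hence skew-Hermitian) rotation matrix below, W squared equals minus the identity, so a synchronous update returns to the initial state after exactly four steps. The paper's networks additionally apply a complex-valued activation, which this sketch omits:

```python
import numpy as np

# Skew-Hermitian synaptic matrix (here real skew-symmetric): W^H = -W.
W = np.array([[0.0, -1.0],
              [1.0,  0.0]], dtype=complex)
assert np.allclose(W.conj().T, -W)

x0 = np.array([1.0 + 0j, 0.0 + 0j])
states = [x0]
for _ in range(4):
    states.append(W @ states[-1])   # synchronous (full parallel) update

# Since W @ W = -I, the orbit has period four: x0 -> Wx0 -> -x0 -> -Wx0 -> x0.
print([tuple(s) for s in states])
```

The braided variants introduced in the paper lengthen this periodicity, yielding the eight-cycles reported for full parallel updates.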

Updated: 2025-03-25 17:49:36

Fields: cs.NE,cs.AI

Download: http://arxiv.org/abs/2503.19885v1

Extensions of regret-minimization algorithm for optimal design

We explore extensions and applications of the regret minimization framework introduced by~\cite{design} for solving optimal experimental design problems. Specifically, we incorporate the entropy regularizer into this framework, leading to a novel sample selection objective and a provable sample complexity bound that guarantees a $(1+\epsilon)$-near optimal solution. We further extend the method to handle regularized optimal design settings. As an application, we use our algorithm to select a small set of representative samples from image classification datasets without relying on label information. To evaluate the quality of the selected samples, we train a logistic regression model and compare performance against several baseline sampling strategies. Experimental results on MNIST, CIFAR-10, and a 50-class subset of ImageNet show that our approach consistently outperforms competing methods in most cases.

Updated: 2025-03-25 17:37:09

Fields: cs.LG,stat.ML,62J12, 62L05, 68W27, 68W40, 68T05

Download: http://arxiv.org/abs/2503.19874v1

Identification of Average Treatment Effects in Nonparametric Panel Models

This paper studies identification of average treatment effects in a panel data setting. It introduces a novel nonparametric factor model and proves identification of average treatment effects. The identification proof is based on the introduction of a consistent estimator. Underlying the proof is a result that there is a consistent estimator for the expected outcome in the absence of the treatment for each unit and time period; this result can be applied more broadly, for example in problems of decompositions of group-level differences in outcomes, such as the much-studied gender wage gap.

Updated: 2025-03-25 17:36:57

Categories: econ.EM,cs.LG,stat.ME

Download: http://arxiv.org/abs/2503.19873v1

NickPay, an Auditable, Privacy-Preserving, Nickname-Based Payment System

In this paper, we describe the motivation, design, security properties, and a prototype implementation of NickPay, a new privacy-preserving yet auditable payment system built on top of the Ethereum blockchain platform. NickPay offers a strong level of privacy to participants and prevents successive payment transfers from being linked to their actual owners. It provides the transparency that blockchains ensure while preserving the possibility for a trusted authority to access sensitive information, e.g., for audit purposes or compliance with financial regulations. NickPay builds upon the Nicknames for Group Signatures (NGS) scheme, a new signing system based on dynamic ``nicknames'' for signers that extends the schemes of group signatures and signatures with flexible public keys. NGS enables identified group members to expose their flexible public keys, allowing direct and natural applications such as auditable private payment systems, of which NickPay is a blockchain-based prototype.

Updated: 2025-03-25 17:36:54

Categories: cs.CR

Download: http://arxiv.org/abs/2503.19872v1

Geometric Meta-Learning via Coupled Ricci Flow: Unifying Knowledge Representation and Quantum Entanglement

This paper establishes a unified framework integrating geometric flows with deep learning through three fundamental innovations. First, we propose a thermodynamically coupled Ricci flow that dynamically adapts parameter space geometry to loss landscape topology, formally proved to preserve isometric knowledge embedding (Theorem~\ref{thm:isometric}). Second, we derive explicit phase transition thresholds and critical learning rates (Theorem~\ref{thm:critical}) through curvature blowup analysis, enabling automated singularity resolution via geometric surgery (Lemma~\ref{lem:surgery}). Third, we establish an AdS/CFT-type holographic duality (Theorem~\ref{thm:ads}) between neural networks and conformal field theories, providing entanglement entropy bounds for regularization design. Experiments demonstrate 2.1$\times$ convergence acceleration and 63\% topological simplification while maintaining $\mathcal{O}(N\log N)$ complexity, outperforming Riemannian baselines by 15.2\% in few-shot accuracy. Theoretically, we prove exponential stability (Theorem~\ref{thm:converge}) through a new Lyapunov function combining Perelman entropy with Wasserstein gradient flows, fundamentally advancing geometric deep learning.

Updated: 2025-03-25 17:32:31

Categories: cs.LG,cs.AI,eess.SP,math.GT,quant-ph,68T05, 68T07, 68T27, 81V99, 37F40, I.2; K.3.2; F.4.1

Download: http://arxiv.org/abs/2503.19867v1

GENIUS: A Generative Framework for Universal Multimodal Search

Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data based on a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, transforming multimodal data into discrete IDs encoding both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, it surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS consistently maintains high retrieval speed across database sizes, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results close to those of embedding-based methods while preserving efficiency.
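Modality-decoupled semantic quantization is described only at a high level above. A minimal sketch of the general idea, mapping an embedding to a modality token followed by residual-quantization tokens, might look as follows; the codebooks, ID layout, and two-level residual scheme are assumptions for illustration, not GENIUS's actual design:

```python
def nearest(codebook, v):
    # index of the codebook vector closest to v (squared Euclidean distance)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def quantize(v, modality, codebooks, modality_ids):
    """Map an embedding to a short sequence of discrete IDs:
    one modality token followed by residual-quantization tokens."""
    ids = [modality_ids[modality]]
    residual = list(v)
    for cb in codebooks:
        j = nearest(cb, residual)
        ids.append(j)
        residual = [r - c for r, c in zip(residual, cb[j])]
    return ids

modality_ids = {"image": 0, "text": 1}
codebooks = [
    [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)],  # coarse level
    [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)],  # residual level
]
ids = quantize((0.9, 0.1), "image", codebooks, modality_ids)
print(ids)
```

Decoding such ID sequences autoregressively, rather than searching an embedding index, is what keeps retrieval cost flat as the database grows.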

Updated: 2025-03-25 17:32:31

Categories: cs.IR,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19868v1

Functional Acceleration for Policy Mirror Descent

We apply functional acceleration to the Policy Mirror Descent (PMD) general family of algorithms, which cover a wide range of novel and fundamental methods in Reinforcement Learning (RL). Leveraging duality, we propose a momentum-based PMD update. By taking the functional route, our approach is independent of the policy parametrization and applicable to large-scale optimization, covering previous applications of momentum at the level of policy parameters as a special case. We theoretically analyze several properties of this approach and complement with a numerical ablation study, which serves to illustrate the policy optimization dynamics on the value polytope, relative to different algorithmic design choices in this space. We further characterize numerically several features of the problem setting relevant for functional acceleration, and lastly, we investigate the impact of approximation on their learning mechanics.
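A minimal tabular sketch of a policy mirror descent step with a KL (softmax) mirror map, plus a heuristic momentum term on the action values, is given below; the specific momentum form and hyperparameters are illustrative assumptions, not the paper's functional-acceleration update:

```python
import math

def pmd_momentum(q, steps=200, eta=0.1, beta=0.9):
    """Tabular policy mirror descent with a KL (softmax) mirror map:
    pi_{t+1}(a) is proportional to pi_t(a) * exp(eta * m_t(a)), where
    m_t is a heuristic momentum accumulation of the action values q."""
    n = len(q)
    pi = [1.0 / n] * n
    m = [0.0] * n
    for _ in range(steps):
        m = [beta * mi + qi for mi, qi in zip(m, q)]
        logits = [math.log(p) + eta * mi for p, mi in zip(pi, m)]
        zmax = max(logits)          # subtract max for numerical stability
        w = [math.exp(l - zmax) for l in logits]
        z = sum(w)
        pi = [wi / z for wi in w]
    return pi

# single-state problem with three actions of value 1.0, 0.5, 0.1
pi = pmd_momentum([1.0, 0.5, 0.1])
print(pi)
```

Because the update is expressed on the policy (the "functional" route), it does not depend on any particular parametrization of pi.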

Updated: 2025-03-25 17:30:54

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2407.16602v2

Mambular: A Sequential Model for Tabular Deep Learning

The analysis of tabular data has traditionally been dominated by gradient-boosted decision trees (GBDTs), known for their proficiency with mixed categorical and numerical features. However, recent deep learning innovations are challenging this dominance. This paper investigates the use of autoregressive state-space models for tabular data and compares their performance against established benchmark models. Additionally, we explore various adaptations of these models, including different pooling strategies, feature interaction mechanisms, and bi-directional processing techniques to understand their effectiveness for tabular data. Our findings indicate that interpreting features as a sequence and processing them and their interactions through structured state-space layers can lead to significant performance improvement. This research underscores the versatility of autoregressive models in tabular data analysis, positioning them as a promising alternative that could substantially enhance deep learning capabilities in this traditionally challenging area. The source code is available at https://github.com/basf/mamba-tabular.

Updated: 2025-03-25 17:27:53

Categories: cs.LG

Download: http://arxiv.org/abs/2408.06291v2

An Overview of Low-Rank Structures in the Training and Adaptation of Large Models

The rise of deep learning has revolutionized data processing and prediction in signal processing and machine learning, yet the substantial computational demands of training and deploying modern large-scale deep models present significant challenges, including high computational costs and energy consumption. Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures in weight matrices and learned representations during training. These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models. Practical techniques inspired by this phenomenon, such as low-rank adaptation (LoRA) and low-rank training, enable significant reductions in computational cost while preserving model performance. In this paper, we present a comprehensive review of recent advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations. Mathematically, we present two complementary perspectives on understanding the low-rankness in deep networks: (i) the emergence of low-rank structures throughout the optimization dynamics of gradient descent and (ii) the implicit regularization effects that induce such low-rank structures at convergence. From a practical standpoint, studying the low-rank learning dynamics of gradient descent offers a mathematical foundation for understanding the effectiveness of LoRA in fine-tuning large-scale models and inspires parameter-efficient low-rank training strategies. Furthermore, the implicit low-rank regularization effect helps explain the success of various masked training approaches in deep neural networks, ranging from dropout to masked self-supervised learning.
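The LoRA technique mentioned above can be sketched in a few lines: the frozen weight W is adjusted by a scaled rank-r product (alpha / r) * B @ A, and only A and B are trained. The dimensions and zero-initialization of B below follow the commonly described convention, but are assumptions for illustration:

```python
import random

def matmul(A, B):
    # naive matrix product of nested lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, A, B, alpha):
    """Low-rank adaptation: the frozen weight W (d_out x d_in) is adjusted
    by a rank-r update (alpha / r) * B @ A, where B is d_out x r and
    A is r x d_in. Only A and B are trained."""
    r = len(A)
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wrow, drow)]
            for wrow, drow in zip(W, BA)]

random.seed(0)
d_out, d_in, r = 8, 8, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
B = [[0.0] * r for _ in range(d_out)]   # B starts at zero, so W is unchanged at init
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]
W_eff = lora_effective_weight(W, A, B, alpha=4.0)

full_params = d_out * d_in
lora_params = d_out * r + r * d_in
print(full_params, lora_params)  # prints "64 32"
```

Even in this tiny example the trainable-parameter count halves; at transformer scale (r much smaller than d), the reduction is orders of magnitude.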

Updated: 2025-03-25 17:26:09

Categories: cs.LG,eess.SP,math.OC,stat.CO,stat.ML

Download: http://arxiv.org/abs/2503.19859v1

Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs

We study online learning with oblivious losses and delays under a novel ``capacity constraint'' that limits how many past rounds can be tracked simultaneously for delayed feedback. Under ``clairvoyance'' (i.e., delay durations are revealed upfront each round) and/or ``preemptibility'' (i.e., we have ability to stop tracking previously chosen round feedback), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the ``optimal capacity'' needed to match the minimax rates of classical delayed online learning, which implicitly assume unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. When replacing clairvoyance with preemptibility, we require a known maximum delay bound $d_{\max}$, adding $\smash{\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$), the minimax regret is $\Theta\bigl(\sqrt{TK(1+d/C)+Td\log(K)}\bigr)$ and the optimal capacity is $\Theta(\min\{K/\log(K),d\}\bigr)$ in the bandit setting, while in the full-information setting, the minimax regret is $\Theta\bigl(\sqrt{T(d+1)\log(K)}\bigr)$ and the optimal capacity is $\Theta(1)$. For round-dependent and fixed delays, our upper bounds are achieved using novel scheduling policies, based on Pareto-distributed proxy delays and batching techniques. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.

Updated: 2025-03-25 17:20:39

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.19856v1

DeepIFSAC: Deep Imputation of Missing Values Using Feature and Sample Attention within Contrastive Framework

Missing values of varying patterns and rates in real-world tabular data pose a significant challenge in developing reliable data-driven models. The most commonly used statistical and machine learning methods for missing value imputation may be ineffective when the missing rate is high and not random. This paper explores row and column attention in tabular data as between-feature and between-sample attention in a novel framework to reconstruct missing values. The proposed method uses CutMix data augmentation within a contrastive learning framework to reduce the uncertainty of missing value estimation. The performance and generalizability of trained imputation models are evaluated in set-aside test data folds with missing values. The proposed framework is compared with 11 state-of-the-art statistical, machine learning, and deep imputation methods using 12 diverse tabular data sets. The average performance rank of our proposed method demonstrates its superiority over the state-of-the-art methods for missing rates between 10% and 90% and three missing value types, especially when the missing values are not random. The quality of the imputed data using our proposed method is compared in a downstream patient classification task using real-world electronic health records. This paper highlights the heterogeneity of tabular data sets to recommend imputation methods based on missing value types and data characteristics.
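CutMix adapted to tabular rows can be sketched as swapping a random contiguous block of features between two samples; the block-selection scheme below is a simplified assumption, not necessarily the paper's exact augmentation:

```python
import random

def tabular_cutmix(row_a, row_b, rng):
    """CutMix-style augmentation for a tabular row: a random contiguous
    block of features in row_a is replaced by the corresponding features
    of row_b. A simplified sketch of the augmentation idea."""
    n = len(row_a)
    lo = rng.randrange(n)              # block start
    hi = rng.randrange(lo + 1, n + 1)  # block end (exclusive)
    mixed = list(row_a)
    mixed[lo:hi] = row_b[lo:hi]
    return mixed, (lo, hi)

rng = random.Random(0)
a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
mixed, (lo, hi) = tabular_cutmix(a, b, rng)
print(mixed, lo, hi)
```

In a contrastive setup, the mixed row and its source row would then form a positive pair whose representations are pulled together.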

Updated: 2025-03-25 17:15:52

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2501.10910v3

Guarding against artificial intelligence--hallucinated citations: the case for full-text reference deposit

The tendency of generative artificial intelligence (AI) systems to "hallucinate" false information is well-known; AI-generated citations to non-existent sources have made their way into the reference lists of peer-reviewed publications. Here, I propose a solution to this problem, taking inspiration from the Transparency and Openness Promotion (TOP) data sharing guidelines, the clash of generative AI with the American judiciary, and the precedent set by submissions of prior art to the United States Patent and Trademark Office. Journals should require authors to submit the full text of each cited source along with their manuscripts, thereby preventing authors from citing any material whose full text they cannot produce. This solution requires limited additional work on the part of authors or editors while effectively immunizing journals against hallucinated references.

Updated: 2025-03-25 17:12:38

Categories: cs.DL,cs.AI,I.2.0; K.4.1

Download: http://arxiv.org/abs/2503.19848v1

Ab-initio simulation of excited-state potential energy surfaces with transferable deep quantum Monte Carlo

The accurate quantum chemical calculation of excited states is a challenging task, often requiring computationally demanding methods. When entire ground and excited potential energy surfaces (PESs) are desired, e.g., to predict the interaction of light excitation and structural changes, one is often forced to use cheaper computational methods at the cost of reduced accuracy. Here we introduce a novel method for the geometrically transferable optimization of neural network wave functions that leverages weight sharing and dynamical ordering of electronic states. Our method enables the efficient prediction of ground and excited-state PESs and their intersections at the highest accuracy, demonstrating up to two orders of magnitude cost reduction compared to single-point calculations. We validate our approach on three challenging excited-state PESs, including ethylene, the carbon dimer, and the methylenimmonium cation, indicating that transferable deep-learning QMC can pave the way towards highly accurate simulation of excited-state dynamics.

Updated: 2025-03-25 17:12:29

Categories: physics.chem-ph,cs.LG,physics.comp-ph

Download: http://arxiv.org/abs/2503.19847v1

Attention IoU: Examining Biases in CelebA using Attention Maps

Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes through the protected attribute of Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
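A soft intersection-over-union between two attention heatmaps can be computed as the sum of elementwise minima over the sum of elementwise maxima; whether this matches the paper's exact Attention-IoU definition is an assumption, but it conveys the general idea:

```python
def soft_iou(map_a, map_b):
    """Soft intersection-over-union of two non-negative heatmaps
    (flattened to 1-D lists): sum of elementwise minima divided by
    sum of elementwise maxima. A common soft generalization of IoU;
    assumed here, not necessarily the paper's exact metric."""
    inter = sum(min(a, b) for a, b in zip(map_a, map_b))
    union = sum(max(a, b) for a, b in zip(map_a, map_b))
    return inter / union if union > 0 else 1.0

attn = [0.0, 0.2, 0.8, 1.0]   # flattened attention map
mask = [0.0, 0.0, 1.0, 1.0]   # flattened region mask
print(soft_iou(attn, mask))
```

Comparing such scores across attributes or subgroups is what lets attention maps expose biases that accuracy gaps alone would miss.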

Updated: 2025-03-25 17:11:39

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19846v1

A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.
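For contrast with the LLM-based approaches discussed above, a classical dictionary-based baseline for Chinese word segmentation, forward maximum matching, fits in a few lines; the toy dictionary below is illustrative, and this is far simpler than Jieba's actual algorithm:

```python
def forward_max_match(text, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character. A classical
    baseline segmenter for unsegmented scripts."""
    tokens = []
    i = 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in vocab:   # single char is the fallback
                tokens.append(piece)
                i += l
                break
    return tokens

vocab = {"上海", "图书馆", "民国", "期刊"}  # toy dictionary
print(forward_max_match("上海图书馆民国期刊", vocab))
```

Dictionary methods like this are exactly where historical texts hurt: pre-1950 vocabulary and genre shifts break modern dictionaries, which is the gap the contextual learning of LLMs narrows.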

Updated: 2025-03-25 17:07:21

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.19844v1

Decomposing The Dark Matter of Sparse Autoencoders

Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in "dark matter": unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter -- about half of the error vector itself and >90% of its norm -- can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations. These insights imply that the part of the SAE error vector that cannot be linearly predicted ("nonlinear" error) might be fundamentally different from the linearly predictable component. To validate this hypothesis, we empirically analyze nonlinear SAE error and show that 1) it contains fewer not yet learned features, 2) SAEs trained on it are quantitatively worse, and 3) it is responsible for a proportional amount of the downstream increase in cross entropy loss when SAE activations are inserted into the model. Finally, we examine two methods to reduce nonlinear SAE error: inference time gradient pursuit, which leads to a very slight decrease in nonlinear error, and linear transformations from earlier layer SAE outputs, which leads to a larger reduction.
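The claim that much of the SAE error is linearly predictable can be illustrated with a toy least-squares fit; the synthetic data below merely demonstrate the "fraction of variance linearly explained" measurement, not the paper's actual analysis:

```python
import random

def linear_fit(xs, ys):
    """Ordinary least squares y ~ a*x + b in closed form, used to ask how
    much of an "error" signal is linearly predictable from the input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

def r_squared(xs, ys, a, b):
    # fraction of variance in ys explained by the linear predictor
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

random.seed(0)
acts = [random.gauss(0, 1) for _ in range(500)]
# synthetic per-token "SAE error": mostly linear in the activation plus noise
errs = [0.5 * x + random.gauss(0, 0.1) for x in acts]
a, b = linear_fit(acts, errs)
print(round(r_squared(acts, errs, a, b), 3))
```

The paper's point is the analogous high-dimensional finding: regressing the SAE error vector on the activation vector recovers about half of the error itself and over 90% of its norm, leaving only a "nonlinear" remainder.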

Updated: 2025-03-25 17:00:02

Categories: cs.LG

Download: http://arxiv.org/abs/2410.14670v2

Explaining Control Policies through Predicate Decision Diagrams

Safety-critical controllers of complex systems are hard to construct manually. Automated approaches such as controller synthesis or learning provide a tempting alternative but usually lack explainability. To this end, learned decision trees (DTs) have been widely used to obtain interpretable models of the generated controllers. However, DTs do not exploit shared decision-making, a key concept exploited in binary decision diagrams (BDDs) to reduce their size and thus improve explainability. In this work, we introduce predicate decision diagrams (PDDs) that extend BDDs with predicates and thus unite the advantages of DTs and BDDs for controller representation. We establish a synthesis pipeline for the efficient construction of PDDs from DTs representing controllers, adapting reduction techniques for BDDs to PDDs.
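The shared decision-making that BDDs (and hence PDDs) exploit boils down to hash-consing: structurally identical subdiagrams are stored once in a unique table, and redundant tests collapse. A minimal sketch follows; the predicate strings and node encoding are illustrative assumptions:

```python
def make_node(var, low, high, table):
    """Hash-consing: identical (var, low, high) triples map to one shared
    node, the standard BDD reduction rule reused here for predicate nodes."""
    if low == high:          # redundant test: skip the node entirely
        return low
    key = (var, low, high)
    if key not in table:
        table[key] = key     # the key tuple itself serves as the node id
    return table[key]

table = {}
T, F = "1", "0"              # terminal nodes
# two structurally identical branches are shared automatically...
n1 = make_node("x2 > 5", F, T, table)
n2 = make_node("x2 > 5", F, T, table)
# ...so the test on x1 becomes redundant and collapses away
root = make_node("x1 > 0", n1, n2, table)
print(len(table), root)
```

A decision tree would store both branches separately; the unique table keeps one copy, which is exactly the size reduction PDDs inherit from BDDs.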

Updated: 2025-03-25 16:57:55

Categories: cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2503.06420v2

Phylo2Vec: a vector representation for binary trees

Binary phylogenetic trees inferred from biological data are central to understanding the shared history among evolutionary units. However, inferring the placement of latent nodes in a tree is computationally expensive. State-of-the-art methods rely on carefully designed heuristics for tree search, using different data structures for easy manipulation (e.g., classes in object-oriented programming languages) and readable representation of trees (e.g., Newick-format strings). Here, we present Phylo2Vec, a parsimonious encoding for phylogenetic trees that serves as a unified approach for both manipulating and representing phylogenetic trees. Phylo2Vec maps any binary tree with $n$ leaves to a unique integer vector of length $n-1$. The advantages of Phylo2Vec are fourfold: (i) fast tree sampling; (ii) a compressed tree representation compared to a Newick string; (iii) quick and unambiguous verification of whether two binary trees are topologically identical; and (iv) the ability to systematically traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill-climbing-based optimisation scheme can efficiently traverse the vastness of tree space from a random to an optimal tree.
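The abstract states that Phylo2Vec maps a binary tree with $n$ leaves to a unique integer vector of length $n-1$. One way to see why such a bijection is plausible is a counting argument: with assumed digit bounds $0 \le v_i \le 2(i-1)$, the number of vectors equals the number of rooted binary tree topologies, $(2n-3)!!$. The digit bounds here are an assumption for illustration, not necessarily the paper's exact convention:

```python
def num_vectors(n):
    """Count integer vectors (v_1, ..., v_{n-1}) with 0 <= v_i <= 2(i-1).
    The digit bounds are an assumed convention; the abstract only states
    that the vector has length n - 1."""
    count = 1
    for i in range(1, n):
        count *= 2 * (i - 1) + 1
    return count

def num_rooted_binary_trees(n):
    # (2n - 3)!! rooted binary tree topologies on n labelled leaves
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

for n in range(2, 8):
    print(n, num_vectors(n), num_rooted_binary_trees(n))
```

Sampling a uniform random tree then reduces to drawing each bounded digit independently, which is why tree sampling under such an encoding is fast.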

Updated: 2025-03-25 16:44:19

Categories: q-bio.PE,cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2304.12693v5

GyralNet Subnetwork Partitioning via Differentiable Spectral Modularity Optimization

Understanding the structural and functional organization of the human brain requires a detailed examination of cortical folding patterns, among which the three-hinge gyrus (3HG) has been identified as a key structural landmark. GyralNet, a network representation of cortical folding, models 3HGs as nodes and gyral crests as edges, highlighting their role as critical hubs in cortico-cortical connectivity. However, existing methods for analyzing 3HGs face significant challenges, including the sub-voxel scale of 3HGs at typical neuroimaging resolutions, the computational complexity of establishing cross-subject correspondences, and the oversimplification of treating 3HGs as independent nodes without considering their community-level relationships. To address these limitations, we propose a fully differentiable subnetwork partitioning framework that employs a spectral modularity maximization optimization strategy to modularize the organization of 3HGs within GyralNet. By incorporating topological structural similarity and DTI-derived connectivity patterns as attribute features, our approach provides a biologically meaningful representation of cortical organization. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that our method effectively partitions GyralNet at the individual level while preserving the community-level consistency of 3HGs across subjects, offering a robust foundation for understanding brain connectivity.

Updated: 2025-03-25 16:33:12

Categories: q-bio.NC,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.19823v1

LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation

Hardware design verification (DV) is a process that checks the functional equivalence of a hardware design against its specifications, improving hardware reliability and robustness. A key task in the DV process is test stimuli generation, which creates a set of conditions or inputs for testing. These test conditions are often complex and specific to the given hardware design, requiring substantial human engineering effort to optimize. We seek an automated and efficient testing solution for arbitrary hardware designs that takes advantage of large language models (LLMs). LLMs have already shown promising results for improving hardware design automation, but remain under-explored for hardware DV. In this paper, we propose an open-source benchmarking framework named LLM4DV that efficiently orchestrates LLMs for automated hardware test stimuli generation. Our analysis evaluates six different LLMs with six prompting improvements on eight hardware designs and provides insight for future work on LLM development for efficient automated DV.

Updated: 2025-03-25 16:32:46

Categories: cs.LG,cs.AR

Download: http://arxiv.org/abs/2310.04535v2

Simplifying Deep Temporal Difference Learning

Q-learning played a foundational role in the field of reinforcement learning (RL). However, TD algorithms with off-policy data, such as Q-learning, or with nonlinear function approximation such as deep neural networks, require several additional tricks to stabilise training, primarily a large replay buffer and target networks. Unfortunately, the delayed updating of frozen network parameters in the target network harms the sample efficiency and, similarly, the large replay buffer introduces memory and implementation overheads. In this paper, we investigate whether it is possible to accelerate and simplify off-policy TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without the need for a target network or replay buffer, even with off-policy data. Empirically, we find that online, parallelised sampling enabled by vectorised environments stabilises training without the need for a large replay buffer. Motivated by these findings, we propose PQN, our simplified deep online Q-learning algorithm. Surprisingly, this simple algorithm is competitive with more complex methods such as Rainbow in Atari, PPO-RNN in Craftax, and QMix in Smax, and can be up to 50x faster than traditional DQN without sacrificing sample efficiency. In an era where PPO has become the go-to RL algorithm, PQN reestablishes off-policy Q-learning as a viable alternative.
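Two ingredients from the abstract can be sketched compactly: a LayerNorm helper (the regularizer the paper credits with stabilizing TD), and the plain online TD update, with no replay buffer or target network, on a toy deterministic MDP. The environment and hyperparameters are illustrative assumptions; the paper's contribution concerns deep networks, whereas the tabular case below needs no stabilization tricks to begin with:

```python
import math
import random

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, the regularizer
    the paper shows can replace the target network in deep TD learning."""
    n = len(v)
    mu = sum(v) / n
    var = sum((x - mu) ** 2 for x in v) / n
    return [(x - mu) / math.sqrt(var + eps) for x in v]

# Online tabular Q-learning on a 2-state deterministic chain: action 1 in
# either state yields reward 1, action 0 yields 0, and the state always
# flips. No replay buffer, no target network -- just the plain TD update.
random.seed(0)
gamma, alpha = 0.9, 0.5
Q = [[0.0, 0.0], [0.0, 0.0]]
s = 0
for _ in range(500):
    a = random.randrange(2)           # exploratory behaviour policy
    r = 1.0 if a == 1 else 0.0
    s_next = 1 - s                    # deterministic transition
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    s = s_next

print([[round(q, 2) for q in row] for row in Q])
```

The values converge toward the fixed point Q(s, 1) = 10 and Q(s, 0) = 9; PQN's point is that, with LayerNorm and vectorised online sampling, the same trick-free update remains stable when Q is a deep network.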

Updated: 2025-03-25 16:32:45


Categories: cs.LG

Download: http://arxiv.org/abs/2407.04811v5

IgCraft: A versatile sequence generation framework for antibody discovery and engineering

Designing antibody sequences to better resemble those observed in natural human repertoires is a key challenge in biologics development. We introduce IgCraft: a multi-purpose model for paired human antibody sequence generation, built on Bayesian Flow Networks. IgCraft presents one of the first unified generative modeling frameworks capable of addressing multiple antibody sequence design tasks with a single model, including unconditional sampling, sequence inpainting, inverse folding, and CDR motif scaffolding. Our approach achieves competitive results across the full spectrum of these tasks while constraining generation to the space of human antibody sequences, exhibiting particular strengths in CDR motif scaffolding (grafting) where we achieve state-of-the-art performance in terms of humanness and preservation of structural properties. By integrating previously separate tasks into a single scalable generative model, IgCraft provides a versatile platform for sampling human antibody sequences under a variety of contexts relevant to antibody discovery and engineering. Model code and weights are publicly available at github.com/mgreenig/IgCraft.

Updated: 2025-03-25 16:32:03


Categories: q-bio.BM,cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2503.19821v1

A Systematic Review of EEG-based Machine Intelligence Algorithms for Depression Diagnosis, and Monitoring

Depression disorder is a serious health condition that has affected the lives of millions of people around the world. Diagnosis of depression is a challenging practice that relies heavily on subjective studies and, in most cases, suffers from late diagnosis. Electroencephalography (EEG) biomarkers have been suggested and investigated in recent years as a potentially transformative objective practice. In this article, for the first time, a detailed systematic review of EEG-based depression diagnosis approaches is conducted using advanced machine learning techniques and statistical analyses. For this, 938 potentially relevant articles (since 1985) were initially detected and filtered into 139 relevant articles based on the review scheme 'preferred reporting items for systematic reviews and meta-analyses (PRISMA).' This article compares and discusses the selected articles and categorizes them according to the type of machine learning techniques and statistical analyses. Algorithms, preprocessing techniques, extracted features, and data acquisition systems are discussed and summarized. This review paper explains the existing challenges of the current algorithms and sheds light on the future direction of the field. This systematic review outlines the issues and challenges in machine intelligence for EEG-based depression diagnosis that can be addressed in future studies and possibly in future wearable technologies.

Updated: 2025-03-25 16:31:27


Categories: eess.SP,cs.LG

Download: http://arxiv.org/abs/2503.19820v1

Domain-incremental White Blood Cell Classification with Privacy-aware Continual Learning

White blood cell (WBC) classification plays a vital role in hematology for diagnosing various medical conditions. However, it faces significant challenges due to domain shifts caused by variations in sample sources (e.g., blood or bone marrow) and differing imaging conditions across hospitals. Traditional deep learning models often suffer from catastrophic forgetting in such dynamic environments, while foundation models, though generally robust, experience performance degradation when the distribution of inference data differs from that of the training data. To address these challenges, we propose a generative replay-based Continual Learning (CL) strategy designed to prevent forgetting in foundation models for WBC classification. Our method employs lightweight generators to mimic past data with a synthetic latent representation to enable privacy-preserving replay. To showcase the effectiveness, we carry out extensive experiments with a total of four datasets with different task ordering and four backbone models including ResNet50, RetCCL, CTransPath, and UNI. Experimental results demonstrate that conventional fine-tuning methods degrade performance on previously learned tasks and struggle with domain shifts. In contrast, our continual learning strategy effectively mitigates catastrophic forgetting, preserving model performance across varying domains. This work presents a practical solution for maintaining reliable WBC classification in real-world clinical settings, where data distributions frequently evolve.

Updated: 2025-03-25 16:30:58


Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.19819v1

Bitstream Collisions in Neural Image Compression via Adversarial Perturbations

Neural image compression (NIC) has emerged as a promising alternative to classical compression techniques, offering improved compression ratios. Despite its progress towards standardization and practical deployment, there has been minimal exploration into its robustness and security. This study reveals an unexpected vulnerability in NIC - bitstream collisions - where semantically different images produce identical compressed bitstreams. Utilizing a novel whitebox adversarial attack algorithm, this paper demonstrates that adding carefully crafted perturbations to semantically different images can cause their compressed bitstreams to collide exactly. The collision vulnerability poses a threat to the practical usability of NIC, particularly in security-critical applications. The cause of the collision is analyzed, and a simple yet effective mitigation method is presented.
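The attack itself is whitebox and codec-specific, but the mechanism can be sketched on a toy stand-in: below, a random linear "analysis transform" plus rounding plays the role of an NIC encoder, and gradient descent on the pre-rounding latent mismatch drives one input's code to collide with another's. The transform and all names are our illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 16)) / 4.0   # hypothetical analysis transform (4 latents)

def encode(x):
    """Quantized latent vector: the stand-in for a compressed 'bitstream'."""
    return np.round(A @ x)

def collide(x_src, x_tgt, step=0.05, iters=2000):
    """Perturb x_src so its code matches x_tgt's, by gradient descent on the
    pre-rounding latent mismatch (a real attack differentiates the real codec)."""
    x = x_src.copy()
    for _ in range(iters):
        diff = (A @ x) - (A @ x_tgt)     # drive latents together before rounding
        x -= step * A.T @ diff           # gradient of 0.5 * ||A(x - x_tgt)||^2
        if np.array_equal(encode(x), encode(x_tgt)):
            break
    return x

x_tgt = rng.normal(size=16)
x_adv = collide(rng.normal(size=16), x_tgt)
```

Because the encoder is many-to-one (here a 16-to-4 map), a large subspace of distinct inputs shares each code, which is the structural reason collisions are possible at all.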

Updated: 2025-03-25 16:29:17


Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2503.19817v1

ACVUBench: Audio-Centric Video Understanding Benchmark

Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench incorporates 2,662 videos spanning 18 different domains with rich auditory information, together with over 13k high-quality human annotated or validated question-answer pairs. Moreover, ACVUBench introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by the analyses of deficiencies in audio-visual LLMs. Demos are available at https://github.com/lark-png/ACVUBench.

Updated: 2025-03-25 16:28:24


Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19951v1

HyperFLINT: Hypernetwork-based Flow Estimation and Temporal Interpolation for Scientific Ensemble Visualization

We present HyperFLINT (Hypernetwork-based FLow estimation and temporal INTerpolation), a novel deep learning-based approach for estimating flow fields, temporally interpolating scalar fields, and facilitating parameter space exploration in spatio-temporal scientific ensemble data. This work addresses the critical need to explicitly incorporate ensemble parameters into the learning process, as traditional methods often neglect these, limiting their ability to adapt to diverse simulation settings and provide meaningful insights into the data dynamics. HyperFLINT introduces a hypernetwork to account for simulation parameters, enabling it to generate accurate interpolations and flow fields for each timestep by dynamically adapting to varying conditions, thereby outperforming existing parameter-agnostic approaches. The architecture features modular neural blocks with convolutional and deconvolutional layers, supported by a hypernetwork that generates weights for the main network, allowing the model to better capture intricate simulation dynamics. A series of experiments demonstrates HyperFLINT's significantly improved performance in flow field estimation and temporal interpolation, as well as its potential in enabling parameter space exploration, offering valuable insights into complex scientific ensembles.
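A minimal sketch of the hypernetwork idea, under our own simplifying assumption (not the paper's architecture) that a linear hypernetwork maps an ensemble member's simulation parameters to the weight matrix of a small main network, so each parameter setting effectively gets its own adapted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypernetwork: its *output* is the weight matrix of the main (prediction) net,
# conditioned on the simulation parameters of an ensemble member.
d_param, d_in, d_out = 2, 4, 1
H = rng.normal(0.0, 0.5, (d_param, d_in * d_out))

def generate_weights(sim_params):
    """Map simulation parameters to main-network weights (linear, for brevity)."""
    return (sim_params @ H).reshape(d_in, d_out)

def main_net(x, sim_params):
    """The main network runs with generated, parameter-specific weights."""
    return x @ generate_weights(sim_params)

x = np.ones((1, d_in))
out_a = main_net(x, np.array([1.0, 0.0]))   # two ensemble members with
out_b = main_net(x, np.array([0.0, 1.0]))   # different simulation parameters
```

The same input produces different outputs for different parameter settings, which is the property that lets a single model adapt across an ensemble rather than being parameter-agnostic.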

Updated: 2025-03-25 16:27:02


Categories: cs.CV,cs.GR,cs.LG

Download: http://arxiv.org/abs/2412.04095v2

Thinking agents for zero-shot generalization to qualitatively novel tasks

Intelligent organisms can solve truly novel problems which they have never encountered before, either in their lifetime or their evolution. An important component of this capacity is the ability to "think", that is, to mentally manipulate objects, concepts and behaviors in order to plan and evaluate possible solutions to novel problems, even without environment interaction. To generate problems that are truly qualitatively novel, while still solvable zero-shot (by mental simulation), we use the combinatorial nature of environments: we train the agent while withholding a specific combination of the environment's elements. The novel test task, based on this combination, is thus guaranteed to be truly novel, while still mentally simulable since the agent has been exposed to each individual element (and their pairwise interactions) during training. We propose a method to train agents endowed with world models to make use of their mental simulation abilities, by selecting tasks based on the difference between the agent's pre-thinking and post-thinking performance. When tested on the novel, withheld problem, the resulting agent successfully simulated alternative scenarios and used the resulting information to guide its behavior in the actual environment, solving the novel task in a single real-environment trial (zero-shot).

Updated: 2025-03-25 16:26:31


Categories: cs.AI,cs.NE

Download: http://arxiv.org/abs/2503.19815v1

Taxonomy Inference for Tabular Data Using Large Language Models

Taxonomy inference for tabular data is a critical task of schema inference, aiming at discovering entity types (i.e., concepts) of the tables and building their hierarchy. It can play an important role in data management, data exploration, ontology learning, and many data-centric applications. Existing schema inference systems focus more on XML, JSON or RDF data, and often rely on lexical formats and structures of the data for calculating similarities, with limited exploitation of the semantics of the text across a table. Motivated by recent works on taxonomy completion and construction using Large Language Models (LLMs), this paper presents two LLM-based methods for taxonomy inference for tables: (i) EmTT, which embeds columns by fine-tuning encoder-only LLMs like BERT with contrastive learning and utilises clustering for hierarchy construction, and (ii) GeTT, which generates table entity types and their hierarchy by iterative prompting of a decoder-only LLM like GPT-4. Extensive evaluation on three real-world datasets with six metrics covering different aspects of the output taxonomies has demonstrated that EmTT and GeTT can both produce taxonomies with strong consistency relative to the ground truth.
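One plausible reading of EmTT's hierarchy-construction step is agglomerative clustering over column embeddings; the sketch below implements a generic average-linkage agglomeration on hypothetical 2-D embeddings (the real system's embeddings and clustering procedure may differ):

```python
import numpy as np

def agglomerative(embeddings):
    """Greedy average-linkage agglomeration; returns the merge order, which
    encodes a binary hierarchy over the input items (here: table columns)."""
    clusters = {i: [i] for i in range(len(embeddings))}
    centroid = {i: embeddings[i].astype(float) for i in clusters}
    merges, next_id = [], len(embeddings)
    while len(clusters) > 1:
        ids = list(clusters)
        # merge the closest pair of cluster centroids
        a, b = min(((p, q) for i, p in enumerate(ids) for q in ids[i + 1:]),
                   key=lambda pq: np.linalg.norm(centroid[pq[0]] - centroid[pq[1]]))
        members = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = members
        centroid[next_id] = embeddings[members].mean(axis=0)
        merges.append((a, b, next_id))
        next_id += 1
    return merges

# Hypothetical column embeddings: two "person"-like and two "place"-like columns.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
tree = agglomerative(emb)
```

On this toy input the two "person" columns merge first, then the two "place" columns, then the two groups, yielding a two-level concept hierarchy of the kind taxonomy inference targets.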

Updated: 2025-03-25 16:26:05


Categories: cs.DB,cs.AI,cs.CL,cs.IR

Download: http://arxiv.org/abs/2503.21810v1

Guidelines For The Choice Of The Baseline in XAI Attribution Methods

Given the broad adoption of artificial intelligence, it is essential to provide evidence that AI models are reliable, trustable, and fair. To this end, the emerging field of eXplainable AI develops techniques to probe such requirements, counterbalancing the hype pushing the pervasiveness of this technology. Among the many facets of this issue, this paper focuses on baseline attribution methods, aiming at deriving a feature attribution map at the network input relying on a "neutral" stimulus usually called "baseline". The choice of the baseline is crucial as it determines the explanation of the network behavior. In this framework, this paper has the twofold goal of shedding light on the implications of the choice of the baseline and providing a simple yet effective method for identifying the best baseline for the task. To achieve this, we propose a decision boundary sampling method, since the baseline, by definition, lies on the decision boundary, which naturally becomes the search domain. Experiments are performed on synthetic examples and validated relying on state-of-the-art methods. Despite being limited to the experimental scope, this contribution is relevant as it offers clear guidelines and a simple proxy for baseline selection, reducing ambiguity and enhancing deep models' reliability and trust.
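To make the role of the baseline concrete, the sketch below computes integrated gradients, a standard baseline attribution method, for a toy quadratic model under two different baselines. The attributions satisfy completeness (they sum to f(x) - f(baseline)) and visibly change with the baseline choice, which is exactly why the choice matters:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=64):
    """Midpoint-Riemann approximation of integrated gradients from `baseline` to `x`."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy model: f(x) = (w . x)^2, so grad f(x) = 2 (w . x) w
w = np.array([1.0, -2.0, 3.0])
f = lambda x: float(np.dot(w, x) ** 2)
grad_f = lambda x: 2.0 * np.dot(w, x) * w

x = np.array([1.0, 1.0, 1.0])
zero_attr = integrated_gradients(grad_f, x, np.zeros(3))     # baseline = 0
half_attr = integrated_gradients(grad_f, x, np.full(3, 0.5)) # baseline = 0.5
```

With the zero baseline the attributions sum to f(x) = 4; with the 0.5 baseline they sum to f(x) - f(0.5) = 3, so the explanation itself shifts with the choice of "neutral" stimulus.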

Updated: 2025-03-25 16:25:04


Categories: cs.AI

Download: http://arxiv.org/abs/2503.19813v1

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

We introduce LogQuant, a groundbreaking 2-bit quantization technique for the KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks such as Python's transformers library. An implementation is available at https://github.com/Concyclics/LogQuantKV.
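As a hedged sketch of log-distributed KV-cache quantization (the selection rule and layout below are our illustrative assumptions, not the paper's exact filter): keep a log-spaced set of positions, denser near the most recent tokens, at full precision, and quantize everything else to 2 bits.

```python
import numpy as np

def log_positions(seq_len, n_keep):
    """Up to n_keep log-spaced positions, denser near the most recent tokens."""
    # hypothetical selection rule for illustration only
    idx = seq_len - 1 - (np.unique(np.geomspace(1, seq_len, n_keep).astype(int)) - 1)
    return np.unique(idx[idx >= 0])

def quantize_2bit(x):
    """Uniform 2-bit quantization of a vector to 4 levels over its own range."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return lo + codes.astype(np.float32) * scale

rng = np.random.default_rng(1)
kv = rng.normal(size=128).astype(np.float32)   # one channel of a KV cache
keep = log_positions(len(kv), 16)
codes, lo, scale = quantize_2bit(kv)
recon = dequantize(codes, lo, scale)
recon[keep] = kv[keep]                          # kept positions stay exact
```

The reconstruction error can only be at or below that of quantizing every position, while the memory cost of the kept positions stays logarithmic in context length.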

Updated: 2025-03-25 16:24:45


Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.19950v1

Simulating Tracking Data to Advance Sports Analytics Research

Advanced analytics have transformed how sports teams operate, particularly in episodic sports like baseball. Their impact on continuous invasion sports, such as soccer and ice hockey, has been limited due to increased game complexity and restricted access to high-resolution game tracking data. In this demo, we present a method to collect and utilize simulated soccer tracking data from the Google Research Football environment to support the development of models designed for continuous tracking data. The data is stored in a schema that is representative of real tracking data and we provide processes that extract high-level features and events. We include examples of established tracking data models to showcase the efficacy of the simulated data. We address the scarcity of publicly available tracking data, providing support for research at the intersection of artificial intelligence and sports analytics.

Updated: 2025-03-25 16:18:23


Categories: cs.AI

Download: http://arxiv.org/abs/2503.19809v1

Locally Private Nonparametric Contextual Multi-armed Bandits

Motivated by privacy concerns in sequential decision-making on sensitive data, we address the challenge of nonparametric contextual multi-armed bandits (MAB) under local differential privacy (LDP). We develop a uniform-confidence-bound-type estimator, showing its minimax optimality supported by a matching minimax lower bound. We further consider the case where auxiliary datasets are available, subject also to (possibly heterogeneous) LDP constraints. Under the widely-used covariate shift framework, we propose a jump-start scheme to effectively utilize the auxiliary data, the minimax optimality of which is further established by a matching lower bound. Comprehensive experiments on both synthetic and real-world datasets validate our theoretical results and underscore the effectiveness of the proposed methods.
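Central to any LDP bandit is that each reward is privatized on-device before the learner ever sees it. A minimal sketch with the Laplace mechanism (sensitivity 1 for rewards in [0, 1]; a simplification of the paper's estimator) shows that mean estimation from the privatized stream remains unbiased despite the injected noise:

```python
import numpy as np

def privatize(reward, eps, rng):
    """epsilon-LDP release of a reward in [0, 1] via the Laplace mechanism
    (sensitivity 1); each user reports only this noisy value."""
    return reward + rng.laplace(0.0, 1.0 / eps)

rng = np.random.default_rng(0)
true_mean = 0.7
rewards = rng.binomial(1, true_mean, size=20000).astype(float)
private = np.array([privatize(r, eps=1.0, rng=rng) for r in rewards])
estimate = private.mean()   # Laplace noise is zero-mean, so the mean is unbiased
```

The price of privacy appears in the variance: the per-sample noise variance 2/eps^2 inflates confidence intervals, which is why LDP bandit algorithms need wider (and carefully calibrated) confidence bounds than their non-private counterparts.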

Updated: 2025-03-25 16:13:14


Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2503.08098v2

LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset

Low-light image enhancement is crucial for a myriad of applications, from night vision and surveillance to autonomous driving. However, due to the inherent limitations that come with capturing images in low-illumination environments, the task of enhancing such scenes still presents a formidable challenge. To advance research in this field, we introduce our Low Exposure Night Vision (LENVIZ) Dataset, a comprehensive multi-exposure benchmark dataset for low-light image enhancement comprising over 230K frames showcasing 24K real-world indoor and outdoor scenes, with and without humans. Captured using 3 different camera sensors, LENVIZ offers a wide range of lighting conditions, noise levels, and scene complexities, making it the largest publicly available up-to-4K-resolution benchmark in the field. LENVIZ includes high-quality human-generated ground truth, for which each multi-exposure low-light scene has been meticulously curated and edited by expert photographers to ensure optimal image quality. Furthermore, we also conduct a comprehensive analysis of current state-of-the-art low-light image enhancement techniques on our dataset and highlight potential areas of improvement.

Updated: 2025-03-25 16:12:28


Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19804v1

SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI

Although deep learning (DL) methods have presented tremendous potential in many medical image analysis tasks, the practical applications of medical DL models are limited due to the lack of enough data samples with manual annotations. Noting that clinical radiology examinations are associated with radiology reports that describe the images, we propose to develop a foundation model for multi-modal head MRI by using contrastive learning on the images and the corresponding radiology findings. In particular, a contrastive learning framework is proposed, in which a mixed syntactic and semantic similarity matching metric is integrated to reduce the reliance on the extremely large datasets required by conventional contrastive learning frameworks. Our proposed similarity-enhanced contrastive language-image pretraining (SeLIP) is able to effectively extract more useful features. Experiments revealed that our proposed SeLIP performs well in many downstream tasks, including image-text retrieval, classification, and image segmentation, which highlights the importance of considering the similarities among texts describing different images when developing medical image foundation models.
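The abstract does not specify the similarity-matching metric, so as one hedged guess at the shape of such an objective, the sketch below softens a CLIP-style contrastive loss with a text-similarity-weighted target distribution: instead of a strict one-hot image-text match, each image's target over texts is spread according to how similar the report texts are (here an identity matrix, i.e. the standard InfoNCE special case).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_contrastive_loss(img_emb, txt_emb, text_sim, temp=0.07):
    """CLIP-style loss with soft targets weighted by inter-report similarity
    (our illustrative guess at a similarity-enhanced objective, not SeLIP's)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp
    targets = text_sim / text_sim.sum(axis=1, keepdims=True)
    log_probs = np.log(softmax(logits, axis=1))
    return float(-(targets * log_probs).sum(axis=1).mean())

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
loss_aligned = soft_contrastive_loss(emb, emb, np.eye(4))            # matched pairs
loss_shuffled = soft_contrastive_loss(emb, emb[::-1].copy(), np.eye(4))  # mismatched
```

Passing a non-identity `text_sim` (e.g. a syntactic/semantic similarity matrix over reports) would stop near-identical reports from being treated as hard negatives, which is the motivation the abstract describes.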

Updated: 2025-03-25 16:09:45


Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19801v1

PAVE: Patching and Adapting Video Large Language Models

Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.

Updated: 2025-03-25 16:02:37


Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.19794v1

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model

The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The construction of preference data is fully automated, and the experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness.

Updated: 2025-03-25 16:01:59


Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.12030v3

MetaSel: A Test Selection Approach for Fine-tuned DNN Models

Deep Neural Networks (DNNs) face challenges during deployment due to data distribution shifts. Fine-tuning adapts pre-trained models to new contexts requiring smaller labeled sets. However, testing fine-tuned models under constrained labeling budgets remains a critical challenge. This paper introduces MetaSel, a new approach, tailored for fine-tuned DNN models, to select tests from unlabeled inputs. MetaSel assumes that fine-tuned and pre-trained models share related data distributions and exhibit similar behaviors for many inputs. However, their behaviors diverge within the input subspace where fine-tuning alters decision boundaries, making those inputs more prone to misclassification. Unlike general approaches that rely solely on the DNN model and its input set, MetaSel leverages information from both the fine-tuned and pre-trained models and their behavioral differences to estimate misclassification probability for unlabeled test inputs, enabling more effective test selection. Our extensive empirical evaluation, comparing MetaSel against 10 state-of-the-art approaches and involving 68 fine-tuned models across weak, medium, and strong distribution shifts, demonstrates that MetaSel consistently delivers significant improvements in Test Relative Coverage (TRC) over existing baselines, particularly under highly constrained labeling budgets. MetaSel shows average TRC improvements of 28.46% to 56.18% over the most frequent second-best baselines while maintaining a high TRC median and low variability. Our results confirm MetaSel's practicality, robustness, and cost-effectiveness for test selection in the context of fine-tuned models.
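A hedged sketch of the selection idea: score unlabeled inputs by combining disagreement between the fine-tuned and pre-trained models with the fine-tuned model's own uncertainty, then label the highest-scoring inputs first. The exact scoring function below is our illustrative assumption, not MetaSel's estimator.

```python
import numpy as np

def metasel_style_score(p_fine, p_pre):
    """Higher score = input more likely misclassified by the fine-tuned model:
    models disagreeing on the predicted class, or a thin decision margin, both
    suggest the input sits where fine-tuning moved the decision boundary."""
    disagree = (p_fine.argmax(axis=1) != p_pre.argmax(axis=1)).astype(float)
    top2 = np.sort(p_fine, axis=1)
    margin = top2[:, -1] - top2[:, -2]
    return disagree + (1.0 - margin)

# Softmax outputs of the fine-tuned and pre-trained models on 3 unlabeled inputs.
p_fine = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
p_pre  = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
order = np.argsort(-metasel_style_score(p_fine, p_pre))  # labeling priority
```

On this toy batch the input where the two models disagree outranks the thin-margin input, which in turn outranks the confidently agreed-upon one, matching the intuition that fine-tuning-shifted regions deserve the scarce labeling budget.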

Updated: 2025-03-25 16:00:07


Categories: cs.LG,cs.SE

Download: http://arxiv.org/abs/2503.17534v2

UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility

Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has introduced transformative advancements across various domains, like transportation, logistics, and agriculture. Leveraging flexible perspectives and rapid maneuverability, UAVs extend traditional systems' perception and action capabilities, garnering widespread attention from academia and industry. However, current UAV operations primarily depend on human control, with only limited autonomy in simple scenarios, and lack the intelligence and adaptability needed for more complex environments and tasks. The emergence of large language models (LLMs) demonstrates remarkable problem-solving and generalization capabilities, offering a promising pathway for advancing UAV intelligence. This paper explores the integration of LLMs and UAVs, beginning with an overview of UAV systems' fundamental components and functionalities, followed by an overview of the state-of-the-art in LLM technology. Subsequently, it systematically highlights the multimodal data resources available for UAVs, which provide critical support for training and evaluation. Furthermore, it categorizes and analyzes key tasks and application scenarios where UAVs and LLMs converge. Finally, a reference roadmap towards agentic UAVs is proposed, aiming to enable UAVs to achieve agentic intelligence through autonomous perception, memory, reasoning, and tool utilization. Related resources are available at https://github.com/Hub-Tian/UAVs_Meet_LLMs.

Updated: 2025-03-25 15:55:33

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2501.02341v2

Enhancing Predictive Accuracy in Tennis: Integrating Fuzzy Logic and CV-GRNN for Dynamic Match Outcome and Player Momentum Analysis

The predictive analysis of match outcomes and player momentum in professional tennis has long been a subject of scholarly debate. In this paper, we introduce a novel approach to game prediction by combining a multi-level fuzzy evaluation model with a CV-GRNN model. We first identify critical statistical indicators via Principal Component Analysis and then develop a two-tier fuzzy model based on the Wimbledon data. In addition, the results of Pearson Correlation Coefficient indicate that the momentum indicators, such as Player Win Streak and Score Difference, have a strong correlation among them, revealing insightful trends among players transitioning between losing and winning streaks. Subsequently, we refine the CV-GRNN model by incorporating 15 statistically significant indicators, resulting in an increase in accuracy to 86.64% and a decrease in MSE by 49.21%. This consequently strengthens the methodological framework for predicting tennis match outcomes, emphasizing its practical utility and potential for adaptation in various athletic contexts.
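
The GRNN at the core of the CV-GRNN model is a kernel regressor; a minimal sketch of its prediction rule follows. The features, targets, and bandwidth are invented toy values, and the cross-validation wrapper that gives CV-GRNN its name is omitted:

```python
import math

def grnn_predict(x, train_X, train_y, sigma=1.0):
    """General Regression Neural Network: a Gaussian-kernel-weighted
    average of training targets (the Nadaraya-Watson estimator)."""
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi))
                        / (2 * sigma ** 2))
               for xi in train_X]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

# Toy momentum features (e.g. win streak, score difference, both scaled)
# mapped to a game-winning indicator.
X = [[0.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0]
p = grnn_predict([1.0, 1.0], X, y, sigma=0.5)
```

The only free parameter is the kernel bandwidth `sigma`, which is the natural target for the cross-validation step.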

Updated: 2025-03-25 15:53:49

Categories: stat.AP,cs.LG,68T07,I.2.6

Download: http://arxiv.org/abs/2503.21809v1

Gemma 3 Technical Report

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.

Updated: 2025-03-25 15:52:34

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.19786v1

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

CUDA Graphs -- a recent hardware feature introduced for NVIDIA GPUs -- aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data copy. In fact, we show a counter-intuitive result -- deploying CUDA Graphs hurts performance in many cases. We introduce PyGraph, a novel approach to automatically harness the power of CUDA Graphs within PyTorch2. Driven by three key observations, PyGraph embodies three novel optimizations: it enables wider deployment of CUDA Graphs, reduces GPU kernel parameter copy overheads, and selectively deploys CUDA Graphs based on a cost-benefit analysis. PyGraph seamlessly integrates with PyTorch2's compilation toolchain, enabling efficient use of CUDA Graphs without manual modifications to the code. We evaluate PyGraph across various machine learning benchmarks, demonstrating substantial performance improvements over PyTorch2.

Updated: 2025-03-25 15:47:54

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19779v1

LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
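
The label-propagation step described above can be sketched as a simple fixed-point iteration. This toy version uses a hand-built affinity matrix in place of the VM-derived patch affinities and omits the pixel-level refinement:

```python
def propagate(W, Y0, alpha=0.9, iters=50):
    """Label propagation sketch: Y <- alpha * S @ Y + (1 - alpha) * Y0,
    where S is the row-normalized affinity matrix W and Y0 holds the
    initial per-patch class scores."""
    n, c = len(W), len(Y0[0])
    S = [[w / (sum(row) or 1.0) for w in row] for row in W]
    Y = [row[:] for row in Y0]
    for _ in range(iters):
        Y = [[alpha * sum(S[i][j] * Y[j][k] for j in range(n))
              + (1 - alpha) * Y0[i][k]
              for k in range(c)]
             for i in range(n)]
    return Y

# 3 patches, 2 classes; patch 2 has no initial prediction but is
# strongly connected to patch 1.
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
Y0 = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
Y = propagate(W, Y0)
```

After convergence the unlabeled patch inherits the class of its neighbors, which is the effect the method exploits to correct noisy per-patch VLM predictions.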

Updated: 2025-03-25 15:47:13

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19777v1

BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance. This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism. Specifically, we leverage SAM's inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct "experts," and the similarity scoring serves as a rudimentary "gating network." We conducted extensive evaluations on both the Endovis17 medical dataset and RefCOCO series natural image datasets. On Endovis17, BiPrompt-SAM achieved 89.55% mDice and 81.46% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series datasets, our method attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions. BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion.
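
The explicit selection mechanism can be sketched as scoring each point-prompt candidate against the text-derived guidance mask. Plain mask IoU stands in here for the paper's similarity metric, and the masks are toy values:

```python
def iou(a, b):
    """IoU between two binary masks given as flat 0/1 lists."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def select_mask(point_candidates, text_mask):
    """Explicit selection: among SAM's point-prompt mask candidates, keep
    the one most similar to the text-derived semantic guidance mask."""
    return max(point_candidates, key=lambda m: iou(m, text_mask))

# Three toy candidates from a point prompt, one text guidance mask.
candidates = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
text_mask = [0, 0, 1, 1]
best = select_mask(candidates, text_mask)
```

The similarity score plays the role of the "gating network" in the MoE analogy: it routes the final output to whichever expert's proposal agrees best with the other modality.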

Updated: 2025-03-25 15:38:55

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19769v1

A Managed Tokens Service for Securely Keeping and Distributing Grid Tokens

Fermilab is transitioning authentication and authorization for grid operations to using bearer tokens based on the WLCG Common JWT (JSON Web Token) Profile. One of the functionalities that Fermilab experimenters rely on is the ability to automate batch job submission, which in turn depends on the ability to securely refresh and distribute the necessary credentials to experiment job submit points. Thus, with the transition to using tokens for grid operations, we needed to create a service that would obtain, refresh, and distribute tokens for experimenters' use. This service would avoid the need for experimenters to be experts in obtaining their own tokens and would better protect the most sensitive long-lived credentials. Further, the service needed to be widely scalable, as Fermilab hosts many experiments, each of which would need their own credentials. To address these issues, we created and deployed a Managed Tokens Service. The service is written in Go, taking advantage of that language's native concurrency primitives to easily be able to scale operations as we onboard experiments. The service uses as its first credentials a set of kerberos keytabs, stored on the same secure machine that the Managed Tokens service runs on. These kerberos credentials allow the service to use htgettoken via condor_vault_storer to store vault tokens in the HTCondor credential managers (credds) that run on the batch system scheduler machines (HTCondor schedds); as well as downloading a local, shorter-lived copy of the vault token. The kerberos credentials are then also used to distribute copies of the locally-stored vault tokens to experiment submit points.

Updated: 2025-03-25 15:37:45

Categories: cs.CR,physics.ins-det

Download: http://arxiv.org/abs/2503.19768v1

Automated Video-EEG Analysis in Epilepsy Studies: Advances and Challenges

Epilepsy is typically diagnosed through electroencephalography (EEG) and long-term video-EEG (vEEG) monitoring. The manual analysis of vEEG recordings is time-consuming, necessitating automated tools for seizure detection. Recent advancements in machine learning have shown promise in real-time seizure detection and prediction using EEG and video data. However, diversity of seizure symptoms, markup ambiguities, and limited availability of multimodal datasets hinder progress. This paper reviews the latest developments in automated video-EEG analysis and discusses the integration of multimodal data. We also propose a novel pipeline for treatment effect estimation from vEEG data using concept-based learning, offering a pathway for future research in this domain.

Updated: 2025-03-25 15:37:02

Categories: eess.IV,cs.LG

Download: http://arxiv.org/abs/2503.19949v1

Aether: Geometric-Aware Unified World Modeling

The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.

Updated: 2025-03-25 15:31:25

Categories: cs.CV,cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.18945v2

FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

Model merging has emerged as a promising approach for multi-task learning (MTL), offering a data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned foundation models, existing model merging methods face two key limitations: (i) They are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, (ii) They struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model in the pool to minimize a linear approximation of the objective function and then executes a local merging similar to the Frank-Wolfe update. The objective function is designed to capture the desired behavior of the target-merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique for existing merging methods, seamlessly integrating with them to further enhance accuracy performance. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead, unlike the linear overhead of data-informed merging methods. Compared with the state-of-the-art approaches, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed Adamerging by 8.39% when merging 20 ViT models. Our code is open-sourced at github.com/hmarkc/FW-Merging.
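
The Frank-Wolfe-style loop described above can be sketched on a toy problem. The quadratic objective below stands in for the paper's behavior-matching objective, and the candidate pool stands in for the fine-tuned checkpoints; all values are illustrative:

```python
def fw_merge(candidates, grad_fn, steps=100):
    """Frank-Wolfe-style merging sketch: at each step pick the candidate
    checkpoint minimizing the linearized objective <grad, theta_i>, then
    move the merged weights toward it with step size 2 / (t + 2)."""
    theta = candidates[0][:]
    for t in range(steps):
        g = grad_fn(theta)
        # Linear minimization oracle over the candidate pool.
        best = min(candidates,
                   key=lambda c: sum(gi * ci for gi, ci in zip(g, c)))
        gamma = 2.0 / (t + 2.0)
        theta = [(1 - gamma) * th + gamma * b for th, b in zip(theta, best)]
    return theta

# Toy objective f(theta) = ||theta - target||^2, with the target inside
# the convex hull of the candidate "checkpoints".
target = [0.5, 0.5]
grad = lambda th: [2 * (a - b) for a, b in zip(th, target)]
pool = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
merged = fw_merge(pool, grad, steps=100)
```

Each iteration touches only one candidate, which is why memory overhead stays constant as the pool grows, unlike methods that must hold all checkpoints' statistics at once.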

Updated: 2025-03-25 15:31:07

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.12649v2

Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards

Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to think about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on the ImageReward official split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach makes use not only of rich VLM world knowledge but also of its capacity to think, yielding interpretable outcomes that support decision-making processes. By demonstrating that current VLMs can reason about human visual preferences, we introduce efficient soft-reward strategies for image ranking, outperforming simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images, regardless of aspect ratio or complexity, thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive markup while improving reward generalization and explainability, our findings mark a strong milestone that can enhance text-to-vision models even further.

Updated: 2025-03-25 15:30:21

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19948v1

Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation

Fine-tuning based concept erasing has demonstrated promising results in preventing generation of harmful contents from text-to-image diffusion models by removing target concepts while preserving remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it locally appears in an image, leaving other regions intact. However, prior arts often compromise fidelity of the other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for the localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined only by several generation steps for concepts without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the region of the target concepts, even when target and remaining concepts coexist within an image. Extensive experiments demonstrated GLoCE not only improves the image fidelity to text prompts after erasing the localized target concepts, but also outperforms prior arts in efficacy, specificity, and robustness by large margin and can be extended to mass concept erasure.
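
The gate-plus-low-rank structure can be sketched in a few lines. The rank-1 matrices, gate direction, and threshold below are toy values, not quantities GLoCE would actually derive from generation steps:

```python
def gated_lowrank_edit(x, A, B, gate_dir, threshold, scale=1.0):
    """Illustrative gated low-rank edit: add the rank-limited update
    B @ (A @ x) to the embedding only when x aligns with the target
    concept direction strongly enough for the gate to open."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    alignment = sum(g * xi for g, xi in zip(gate_dir, x))
    if alignment <= threshold:            # gate closed: embedding untouched
        return x[:]
    delta = matvec(B, matvec(A, x))       # low-rank correction
    return [xi + scale * di for xi, di in zip(x, delta)]

# Rank-1 toy edit that cancels the first embedding dimension (the "concept").
A = [[1.0, 0.0]]
B = [[-1.0], [0.0]]
unchanged = gated_lowrank_edit([0.1, 0.9], A, B,
                               gate_dir=[1.0, 0.0], threshold=0.5)
erased = gated_lowrank_edit([1.0, 0.2], A, B,
                            gate_dir=[1.0, 0.0], threshold=0.5)
```

The gate is what makes the erasure localized: embeddings that do not express the target concept pass through unchanged, preserving the other image regions.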

Updated: 2025-03-25 15:29:45

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.12356v2

Interpretable Deep Regression Models with Interval-Censored Failure Time Data

Deep neural networks (DNNs) have become powerful tools for modeling complex data structures through sequentially integrating simple functions in each hidden layer. In survival analysis, recent advances of DNNs primarily focus on enhancing model capabilities, especially in exploring nonlinear covariate effects under right censoring. However, deep learning methods for interval-censored data, where the unobservable failure time is only known to lie in an interval, remain underexplored and limited to specific data type or model. This work proposes a general regression framework for interval-censored data with a broad class of partially linear transformation models, where key covariate effects are modeled parametrically while nonlinear effects of nuisance multi-modal covariates are approximated via DNNs, balancing interpretability and flexibility. We employ sieve maximum likelihood estimation by leveraging monotone splines to approximate the cumulative baseline hazard function. To ensure reliable and tractable estimation, we develop an EM algorithm incorporating stochastic gradient descent. We establish the asymptotic properties of parameter estimators and show that the DNN estimator achieves minimax-optimal convergence. Extensive simulations demonstrate superior estimation and prediction accuracy over state-of-the-art methods. Applying our method to the Alzheimer's Disease Neuroimaging Initiative dataset yields novel insights and improved predictive performance compared to traditional approaches.

Updated: 2025-03-25 15:27:32

Categories: stat.ML,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2503.19763v1

Splitting Answer Set Programs with respect to Intensionality Statements (Extended Version)

Splitting a logic program allows us to reduce the task of computing its stable models to similar tasks for its subprograms. This can be used to increase solving performance and prove program correctness. We generalize the conditions under which this technique is applicable, by considering not only dependencies between predicates but also their arguments and context. This allows splitting programs commonly used in practice to which previous results were not applicable.

Updated: 2025-03-25 15:27:05

Categories: cs.AI,cs.LO

Download: http://arxiv.org/abs/2503.19762v1

Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders

Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth density and depth distribution invariant feature extraction. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks - without the necessity of finetuning the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void's depth completion, and 83.8 Top 1 accuracy on NYUv2 scene classification. In 6D-object pose estimation, we outperform our predecessors of DinoV2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.

Updated: 2025-03-25 15:19:48

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19947v1

Internet of Things-Based Smart Precision Farming in Soilless Agriculture: Opportunities and Challenges for Global Food Security

The rapid growth of the global population and the continuous decline in cultivable land pose significant threats to food security. This challenge worsens as climate change further reduces the availability of farmland. Soilless agriculture, such as hydroponics, aeroponics, and aquaponics, offers a sustainable solution by enabling efficient crop cultivation in controlled environments. The integration of the Internet of Things (IoT) with smart precision farming improves resource efficiency, automates environmental control, and ensures stable and high-yield crop production. IoT-enabled smart farming systems utilize real-time monitoring, data-driven decision-making, and automation to optimize water and nutrient usage while minimizing human intervention. This paper explores the opportunities and challenges of IoT-based soilless farming, highlighting its role in sustainable agriculture, urban farming, and global food security. These advanced farming methods ensure greater productivity, resource conservation, and year-round cultivation. However, they also face challenges such as high initial investment, technological dependency, and energy consumption. Through a comprehensive study, bibliometric analysis, and comparative analysis, this research highlights current trends and research gaps. It also outlines future directions for researchers, policymakers, and industry stakeholders to drive innovation and scalability in IoT-driven soilless agriculture. By emphasizing the benefits of vertical farming and Controlled Environment Agriculture (CEA)-enabled soilless techniques, this paper supports informed decision-making to address food security challenges and promote sustainable agricultural innovations.

Updated: 2025-03-25 15:18:47

Categories: eess.SP,cs.LG

Download: http://arxiv.org/abs/2503.13528v2

A Survey on Event-driven 3D Reconstruction: Development under Different Categories

Event cameras have gained increasing attention for 3D reconstruction due to their high temporal resolution, low latency, and high dynamic range. They capture per-pixel brightness changes asynchronously, allowing accurate reconstruction under fast motion and challenging lighting conditions. In this survey, we provide a comprehensive review of event-driven 3D reconstruction methods, including stereo, monocular, and multimodal systems. We further categorize recent developments based on geometric, learning-based, and hybrid approaches. Emerging trends, such as neural radiance fields and 3D Gaussian splatting with event data, are also covered. The related works are structured chronologically to illustrate the innovations and progression within the field. To support future research, we also highlight key research gaps and future research directions in dataset, experiment, evaluation, event representation, etc.

Updated: 2025-03-25 15:16:53

Categories: cs.GR,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.19753v1

Inducing Personality in LLM-Based Honeypot Agents: Measuring the Effect on Human-Like Agenda Generation

This paper presents SANDMAN, an architecture for cyber deception that leverages Language Agents to emulate convincing human simulacra. Our 'Deceptive Agents' serve as advanced cyber decoys, designed for high-fidelity engagement with attackers by extending the observation period of attack behaviours. Through experimentation, measurement, and analysis, we demonstrate how a prompt schema based on the five-factor model of personality systematically induces distinct 'personalities' in Large Language Models. Our results highlight the feasibility of persona-driven Language Agents for generating diverse, realistic behaviours, ultimately improving cyber deception strategies.

Updated: 2025-03-25 15:16:35

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2503.19752v1

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are in https://github.com/VisionXLab/LRS-VQA.

Updated: 2025-03-25 15:05:34

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.07588v2

MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation

Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation process for questions generated by automated question generation systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR, tending to be closer to the human baseline scores. Furthermore, we observed that Pearson's correlation coefficient between GPT-4 and human experts improved when using our proposed feedback-based approach, MIRROR, compared to direct prompting for evaluation. Error analysis shows that our proposed approach, MIRROR, significantly helps to improve relevance and appropriateness.

Updated: 2025-03-25 15:02:17

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2410.12893v3

How to RETIRE Tabular Data in Favor of Discrete Digital Signal Representation

The successes achieved by deep neural networks in computer vision tasks have led in recent years to the emergence of a new research area dubbed Multi-Dimensional Encoding (MDE). Methods belonging to this family aim to transform tabular data into a homogeneous form of discrete digital signals (images) to apply convolutional networks to initially unsuitable problems. Despite a steady stream of new work, the pool of multi-dimensional encoding methods remains small, and the scope of research on existing modality encoding techniques is quite limited. To contribute to this area of research, we propose the Radar-based Encoding from Tabular to Image REpresentation (RETIRE), which allows tabular data to be represented as radar graphs, capturing the feature characteristics of each problem instance. RETIRE was compared with a pool of state-of-the-art MDE algorithms as well as with XGBoost in terms of classification accuracy and computational complexity. In addition, an analysis was carried out regarding transferability and explainability to provide more insight into both RETIRE and existing MDE techniques. The results obtained, supported by statistical analysis, confirm the superiority of RETIRE over other established MDE methods.
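
The radar-graph encoding can be pictured with a toy version: normalize each feature of a tabular row to [0, 1], place one spoke per feature at a fixed angle, and rasterize the resulting polygon into a small image a CNN could consume. This is an illustrative sketch, not the paper's RETIRE implementation; the min-max normalization, image resolution, and even-odd fill rule are all assumptions.

```python
import numpy as np

def radar_vertices(row, mins, maxs):
    """Map one tabular row to radar-graph polygon vertices.

    Each feature becomes a spoke at a fixed angle; the normalized
    feature value sets the distance from the center along that spoke.
    """
    x = (np.asarray(row, float) - mins) / (maxs - mins + 1e-12)  # min-max to [0, 1]
    angles = np.linspace(0.0, 2.0 * np.pi, len(x), endpoint=False)
    return np.stack([x * np.cos(angles), x * np.sin(angles)], axis=1)

def rasterize(vertices, size=32):
    """Fill the radar polygon on a size x size grid (a crude 'image')."""
    ys, xs = np.mgrid[0:size, 0:size]
    # Map pixel centers to [-1, 1] coordinates.
    px = (xs + 0.5) / size * 2.0 - 1.0
    py = (ys + 0.5) / size * 2.0 - 1.0
    inside = np.zeros((size, size), dtype=bool)
    n = len(vertices)
    # Even-odd ray casting, vectorized over all pixels, one edge at a time.
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        crosses = (y0 > py) != (y1 > py)
        xcross = x0 + (py - y0) * (x1 - x0) / (y1 - y0 + 1e-12)
        inside ^= crosses & (px < xcross)
    return inside.astype(np.uint8)

row = [3.0, 7.0, 5.0, 1.0]
verts = radar_vertices(row, mins=np.zeros(4), maxs=np.full(4, 10.0))
img = rasterize(verts)
```

Each problem instance thus becomes a distinct polygon silhouette, which is what lets a convolutional network pick up feature-value patterns spatially.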

Updated: 2025-03-25 15:00:54

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19733v1

CamSAM2: Segment Anything Accurately in Camouflaged Videos

Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code will be available at \href{https://github.com/zhoustan/CamSAM2}{github.com/zhoustan/CamSAM2}.

Updated: 2025-03-25 14:58:52

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19730v1

MindfulLIME: A Stable Solution for Explanations of Machine Learning Models with Enhanced Localization Precision -- A Medical Image Case Study

Ensuring transparency in machine learning decisions is critically important, especially in sensitive sectors such as healthcare, finance, and justice. Despite this, some popular explainable algorithms, such as Local Interpretable Model-agnostic Explanations (LIME), often produce unstable explanations due to the random generation of perturbed samples. Random perturbation introduces small changes or noise to modified instances of the original data, leading to inconsistent explanations. Even slight variations in the generated samples significantly affect the explanations provided by such models, undermining trust and hindering the adoption of interpretable models. To address this challenge, we propose MindfulLIME, a novel algorithm that intelligently generates purposive samples using a graph-based pruning algorithm and uncertainty sampling. MindfulLIME substantially improves the consistency of visual explanations compared to random sampling approaches. Our experimental evaluation, conducted on a widely recognized chest X-ray dataset, confirms MindfulLIME's stability with a 100% success rate in delivering reliable explanations under identical conditions. Additionally, MindfulLIME improves the localization precision of visual explanations by reducing the distance between the generated explanations and the actual local annotations compared to LIME. We also performed comprehensive experiments considering various segmentation algorithms and sample numbers, focusing on stability, quality, and efficiency. The results demonstrate the outstanding performance of MindfulLIME across different segmentation settings, generating fewer high-quality samples within a reasonable processing time. By addressing the stability limitations of LIME in image data, MindfulLIME enhances the trustworthiness and interpretability of machine learning models in specific medical imaging applications, a critical domain.

Updated: 2025-03-25 14:48:14

Categories: cs.LG,cs.CV,eess.IV

Download: http://arxiv.org/abs/2503.20758v1

DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs

Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10x while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves 2x to 12x improvement in throughput compared to the state-of-the-art systems.
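
The key insight, that fine-tuning deltas have small magnitude and compress well, can be sketched in a few lines. This is not DeltaZip's actual pipeline (which co-designs the compression algorithm with the serving system); it is a minimal illustration using magnitude-based sparsification plus uniform symmetric quantization, both of which are assumptions here.

```python
import numpy as np

def compress_delta(base, finetuned, keep=0.1, bits=4):
    """Sketch of delta compression: keep only the largest-magnitude
    entries of (finetuned - base) and quantize them uniformly."""
    delta = finetuned - base
    k = max(1, int(keep * delta.size))
    idx = np.argsort(np.abs(delta), axis=None)[-k:]       # top-k by magnitude
    vals = delta.flat[idx]
    scale = np.abs(vals).max() / (2 ** (bits - 1) - 1)    # symmetric quant step
    q = np.round(vals / scale).astype(np.int8)
    return idx, q, scale

def apply_delta(base, idx, q, scale):
    """Reconstruct an approximate fine-tuned weight matrix at serving time."""
    out = base.copy()
    out.flat[idx] += q.astype(np.float32) * scale
    return out

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 64)).astype(np.float32)
finetuned = base + 0.01 * rng.normal(size=(64, 64)).astype(np.float32)

idx, q, scale = compress_delta(base, finetuned, keep=0.1, bits=4)
approx = apply_delta(base, idx, q, scale)
```

Because only one base model plus many tiny deltas need to stay resident, many fine-tuned variants can share the same memory footprint.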

Updated: 2025-03-25 14:48:01

Categories: cs.DC,cs.LG

Download: http://arxiv.org/abs/2312.05215v3

On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?

In recent years, the development of robust multi-source models has emerged in the Earth Observation (EO) field. These are models that leverage data from diverse sources to improve predictive accuracy when there is missing data. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in predicting scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially shaping the way for more streamlined approaches in EO applications.

Updated: 2025-03-25 14:45:23

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.19719v1

Invertible Koopman neural operator for data-driven modeling of partial differential equations

Koopman operator theory is a popular candidate for data-driven modeling because it provides a global linearization representation for nonlinear dynamical systems. However, existing Koopman operator-based methods suffer from shortcomings in constructing well-behaved observable functions and their inverses, and are inefficient when dealing with partial differential equations (PDEs). To address these issues, this paper proposes the Invertible Koopman Neural Operator (IKNO), a novel data-driven modeling approach inspired by the Koopman operator theory and neural operators. IKNO leverages an Invertible Neural Network to parameterize the observable function and its inverse simultaneously under the same learnable parameters, explicitly guaranteeing the reconstruction relation and thus eliminating the dependency on a reconstruction loss, which is an essential improvement over the original Koopman Neural Operator (KNO). A structured linear matrix inspired by the Koopman operator theory is parameterized to learn the evolution of observables' low-frequency modes in the frequency space rather than directly in the observable space, ensuring that IKNO is resolution-invariant like other neural operators. Moreover, with preprocessing such as interpolation and dimension expansion, IKNO can be extended to operator learning tasks defined on non-Cartesian domains. We fully support the above claims with rich numerical and real-world examples and demonstrate the effectiveness of IKNO and its superiority over other neural operators.
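
The "invertible observable under shared parameters" idea can be illustrated with an affine coupling layer, the standard building block of invertible neural networks: the forward and inverse maps reuse the same weights, so the reconstruction relation holds exactly by construction and no reconstruction loss is needed. A minimal NumPy sketch follows; the layer sizes and the tanh bounding of the log-scale are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 4))  # shared weights for the scale branch
W2 = rng.normal(scale=0.1, size=(4, 4))  # shared weights for the shift branch

def coupling_forward(x):
    """Affine coupling: split x, transform one half conditioned on the other."""
    a, b = x[:4], x[4:]
    s = np.tanh(W1 @ a)          # log-scale, bounded for numerical stability
    t = W2 @ a                   # shift
    return np.concatenate([a, b * np.exp(s) + t])

def coupling_inverse(y):
    """Exact inverse using the very same parameters W1, W2."""
    a, b = y[:4], y[4:]
    s = np.tanh(W1 @ a)
    t = W2 @ a
    return np.concatenate([a, (b - t) * np.exp(-s)])

x = rng.normal(size=8)
y = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Stacking such layers (with the split halves alternating) yields an expressive observable map whose inverse never has to be learned separately.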

Updated: 2025-03-25 14:43:53

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19717v1

Pfungst and Clever Hans: Identifying the unintended cues in a widely used Alzheimer's disease MRI dataset using explainable deep learning

Backgrounds. Deep neural networks have demonstrated high accuracy in classifying Alzheimer's disease (AD). This study aims to enlighten the underlying black-box nature and reveal individual contributions of T1-weighted (T1w) gray-white matter texture, volumetric information and preprocessing on classification performance. Methods. We utilized T1w MRI data from the Alzheimer's Disease Neuroimaging Initiative to distinguish matched AD patients (990 MRIs) from healthy controls (990 MRIs). Preprocessing included skull stripping and binarization at varying thresholds to systematically eliminate texture information. A deep neural network was trained on these configurations, and the model performance was compared using McNemar tests with discrete Bonferroni-Holm correction. Layer-wise Relevance Propagation (LRP) and structural similarity metrics between heatmaps were applied to analyze learned features. Results. Classification performance metrics (accuracy, sensitivity, and specificity) were comparable across all configurations, indicating a negligible influence of T1w gray- and white signal texture. Models trained on binarized images demonstrated similar feature performance and relevance distributions, with volumetric features such as atrophy and skull-stripping features emerging as primary contributors. Conclusions. We revealed a previously undiscovered Clever Hans effect in a widely used AD MRI dataset. Deep neural networks classification predominantly rely on volumetric features, while eliminating gray-white matter T1w texture did not decrease the performance. This study clearly demonstrates an overestimation of the importance of gray-white matter contrasts, at least for widely used structural T1w images, and highlights potential misinterpretation of performance metrics.

Updated: 2025-03-25 14:41:10

Categories: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2501.15831v2

Decoupled Dynamics Framework with Neural Fields for 3D Spatio-temporal Prediction of Vehicle Collisions

This study proposes a neural framework that predicts 3D vehicle collision dynamics by independently modeling global rigid-body motion and local structural deformation. Unlike approaches directly predicting absolute displacement, this method explicitly separates the vehicle's overall translation and rotation from its structural deformation. Two specialized networks form the core of the framework: a quaternion-based Rigid Net for rigid motion and a coordinate-based Deformation Net for local deformation. By independently handling fundamentally distinct physical phenomena, the proposed architecture achieves accurate predictions without requiring separate supervision for each component. The model, trained on only 10% of available simulation data, significantly outperforms baseline models, including single multi-layer perceptron (MLP) and deep operator networks (DeepONet), with prediction errors reduced by up to 83%. Extensive validation demonstrates strong generalization to collision conditions outside the training range, accurately predicting responses even under severe impacts involving extreme velocities and large impact angles. Furthermore, the framework successfully reconstructs high-resolution deformation details from low-resolution inputs without increased computational effort. Consequently, the proposed approach provides an effective, computationally efficient method for rapid and reliable assessment of vehicle safety across complex collision scenarios, substantially reducing the required simulation data and time while preserving prediction fidelity.
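
The decoupling can be illustrated by composing a quaternion-parameterized rigid transform with a per-node deformation field. In the paper, networks predict the quaternion, translation, and deformation; the sketch below takes them as given inputs, and the body-frame composition order is an assumption.

```python
import numpy as np

def quat_to_matrix(q):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def compose_motion(nodes, quat, translation, deformation):
    """Total node positions: local deformation in the body frame,
    then the global rigid rotation and translation."""
    R = quat_to_matrix(quat)
    return (nodes + deformation) @ R.T + translation

nodes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
quat = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])  # 90 deg about z
moved = compose_motion(nodes, quat, np.array([0.0, 0.0, 1.0]),
                       np.zeros_like(nodes))
```

Because rotation and translation are factored out, the deformation network only has to explain the (much smaller) structural residual, which is why separate supervision per component is unnecessary.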

Updated: 2025-03-25 14:38:37

Categories: cs.CE,cs.AI

Download: http://arxiv.org/abs/2503.19712v1

Writing as a testbed for open ended agents

Open-ended tasks are particularly challenging for LLMs due to the vast solution space, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its vast solution space and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs - Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o - focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.

Updated: 2025-03-25 14:38:36

Categories: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2503.19711v1

Data-efficient rapid prediction of urban airflow and temperature fields for complex building geometries

Accurately predicting urban microclimate, including wind speed and temperature, based solely on building geometry requires capturing complex interactions between buildings and airflow, particularly long-range wake effects influenced by directional geometry. Traditional methods relying on computational fluid dynamics (CFD) are prohibitively expensive for large-scale simulations, while data-driven approaches struggle with limited training data and the need to model both local and far-field dependencies. In response, we propose a novel framework that leverages a multi-directional distance feature (MDDF) combined with localized training to achieve effective wind field predictions with minimal CFD data. By reducing the problem's dimensionality, localized training effectively increases the number of training samples, while MDDF encodes the surrounding geometric information to accurately model wake dynamics and flow redirection. Trained on only 24 CFD simulations, our localized Fourier neural operator (Local-FNO) model generates full 3D wind velocity and temperature predictions in under one minute, yielding a 500-fold speedup over conventional CFD methods. With mean absolute errors of 0.3 m/s for wind speed and 0.3 $^{\circ}$C for temperature on unseen urban configurations, our method demonstrates strong generalization capabilities and significant potential for practical urban applications.
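
A toy 2D version of a multi-directional distance feature: from a query point, cast rays in several directions over a building-occupancy grid and record the distance to the first occupied cell, capped at a maximum range. The ray count, step size, and cap are assumptions here; the paper's MDDF operates on real 3D urban geometry.

```python
import numpy as np

def mddf(grid, point, n_dirs=8, max_dist=50.0, step=0.5):
    """Multi-directional distance feature for one query point:
    distance to the nearest occupied (building) cell along each of
    n_dirs rays, capped at max_dist. Illustrative 2D version."""
    h, w = grid.shape
    feats = np.full(n_dirs, max_dist)
    for k in range(n_dirs):
        ang = 2.0 * np.pi * k / n_dirs
        d = np.array([np.cos(ang), np.sin(ang)])
        t = step
        while t < max_dist:
            x, y = point + t * d
            i, j = int(round(y)), int(round(x))
            if not (0 <= i < h and 0 <= j < w):
                break                      # ray left the domain: keep the cap
            if grid[i, j]:
                feats[k] = t               # hit a building
                break
            t += step
    return feats

grid = np.zeros((64, 64), dtype=bool)
grid[30:34, 40:44] = True                  # one building block
f = mddf(grid, point=np.array([32.0, 32.0]))
```

Encoding the surrounding geometry this way lets a locally trained model "see" upstream buildings whose wakes reach the query point, without ingesting the whole domain.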

Updated: 2025-03-25 14:36:01

Categories: physics.flu-dyn,cs.LG

Download: http://arxiv.org/abs/2503.19708v1

FeatherWallet: A Lightweight Mobile Cryptocurrency Wallet Using zk-SNARKs

Traditionally, mobile wallets rely on a trusted server that provides them with a current view of the blockchain, and thus, these wallets do not need to validate the header chain or transaction inclusion themselves. If a mobile wallet were to validate a header chain and inclusion of its transactions, it would require significant storage and performance overhead, which is challenging and expensive to ensure on resource-limited devices, such as smartphones. Moreover, such an overhead would be multiplied by the number of cryptocurrencies the user holds in a wallet. Therefore, we introduce a novel approach, called FeatherWallet, to mobile wallet synchronization designed to eliminate trust in a server while providing efficient utilization of resources. Our approach addresses the challenges associated with storage and bandwidth requirements by off-chaining validation of header chains using SNARK-based proofs of chain extension, which are verified by a smart contract. This offers us a means of storing checkpoints in header chains of multiple blockchains. The key feature of our approach is the ability of mobile clients to update their partial local header chains using checkpoints derived from the proof verification results stored in the smart contract. In the evaluation, we created zk-SNARK proofs for the 2, 4, 8, 16, 32, and 64 headers within our trustless off-chain service. For 64-header proofs, the off-chain service producing proofs requires at least 40 GB of RAM, while the minimal gas consumption is achieved for 12 proofs bundled in a single transaction. We achieved a 20-fold reduction in storage overhead for a mobile client in contrast to traditional SPV clients. Although we have developed a proof-of-concept for PoW blockchains, the whole approach can be extended in principle to other consensus mechanisms, e.g., PoS.

Updated: 2025-03-25 14:33:58

Categories: cs.CR

Download: http://arxiv.org/abs/2503.22717v1

Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations

View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple viewpoints. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, self-view masking and cross-view masking predictions are designed to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at https://github.com/park-jungin/byov.

Updated: 2025-03-25 14:33:32

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19706v1

Generative AI for Validating Physics Laws

We present a generative artificial intelligence (AI) approach to empirically validating fundamental laws of physics, focusing on the Stefan-Boltzmann law linking stellar temperature and luminosity. Our approach simulates counterfactual luminosities under hypothetical temperature regimes for each individual star and iteratively refines the temperature-luminosity relationship in a deep learning architecture. We use Gaia DR3 data and find that, on average, temperature's effect on luminosity increases with stellar radius and decreases with absolute magnitude, consistent with theoretical predictions. By framing physics laws as causal problems, our method offers a novel, data-driven approach to refining theoretical understanding and informing evidence-based policy and practice.
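
The law being validated, L = 4 pi R^2 sigma T^4, is easy to check numerically. The snippet below reproduces the nominal solar luminosity from the solar radius and effective temperature, and shows the T^4 scaling that a counterfactual temperature change should follow; the constants are standard reference values, not data from the paper.

```python
import numpy as np

SIGMA = 5.670374419e-8   # Stefan-Boltzmann constant, W m^-2 K^-4

def luminosity(radius_m, t_eff_k):
    """Stefan-Boltzmann law: L = 4 * pi * R^2 * sigma * T^4."""
    return 4.0 * np.pi * radius_m**2 * SIGMA * t_eff_k**4

# Sanity check with solar values (R = 6.957e8 m, T_eff = 5772 K):
L = luminosity(6.957e8, 5772.0)
ratio = L / 3.828e26          # nominal solar luminosity in watts

# Counterfactual: doubling T should multiply L by exactly 2^4 = 16.
counterfactual_ratio = luminosity(6.957e8, 2 * 5772.0) / L
```

A counterfactual simulator for each star can thus be validated against this closed-form scaling before being used to probe the empirical relationship.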

Updated: 2025-03-25 14:31:47

Categories: astro-ph.SR,astro-ph.GA,cs.AI

Download: http://arxiv.org/abs/2503.17894v2

Optimal Path Planning and Cost Minimization for a Drone Delivery System Via Model Predictive Control

In this study, we formulate the drone delivery problem as a control problem and solve it using Model Predictive Control (MPC). Two experiments are performed: the first on a less challenging grid-world environment with lower dimensionality, and the second on a higher-dimensional environment with added complexity. The MPC method was benchmarked against three popular Multi-Agent Reinforcement Learning (MARL) methods: Independent $Q$-Learning (IQL), Joint Action Learners (JAL), and Value-Decomposition Networks (VDN). The MPC method solved the problem more quickly and required fewer drones to achieve a minimized cost and navigate the optimal path.
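
The receding-horizon idea can be sketched on a toy grid world: at each step, enumerate all action sequences over a short horizon, score them with a running cost, and apply only the first action of the best sequence. The brute-force search stands in for a real MPC solver, and the cost terms are illustrative assumptions, not the paper's formulation.

```python
from itertools import product

# Unit moves on the grid plus a hover action.
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

def rollout_cost(pos, seq, target):
    """Running cost: distance to the target after every action, plus a
    small penalty per move so hovering is preferred once at the goal."""
    cost = 0.0
    for a in seq:
        pos = (pos[0] + a[0], pos[1] + a[1])
        cost += abs(pos[0] - target[0]) + abs(pos[1] - target[1])
        if a != (0, 0):
            cost += 0.1
    return cost

def mpc_step(pos, target, horizon=3):
    """Receding-horizon control: search all sequences of length
    `horizon`, apply only the first action of the best one."""
    best = min(product(ACTIONS, repeat=horizon),
               key=lambda seq: rollout_cost(pos, seq, target))
    a = best[0]
    return (pos[0] + a[0], pos[1] + a[1])

pos, target = (0, 0), (3, 2)
trajectory = [pos]
for _ in range(10):
    pos = mpc_step(pos, target)
    trajectory.append(pos)
```

Only the first action of each optimal plan is executed before re-planning, which is what makes the controller robust to model mismatch between planning steps.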

Updated: 2025-03-25 14:27:29

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2503.19699v1

USBSnoop -- Revealing Device Activities via USB Congestions

The USB protocol has become a ubiquitous standard for connecting peripherals to computers, making its security a critical concern. A recent research study demonstrated the potential to exploit weaknesses in well-established protocols, such as PCIe, and created a side-channel for leaking sensitive information by leveraging congestion within shared interfaces. Drawing inspiration from that, this project introduces an innovative approach to USB side-channel attacks via congestion. We evaluated the susceptibility of USB devices and hubs to remote profiling and side-channel attacks, identified potential weaknesses within the USB standard, and highlighted the critical need for heightened security and privacy in USB technology. Our findings discover vulnerabilities within the USB standard, which are difficult to effectively mitigate and underscore the need for enhanced security measures to protect user privacy in an era increasingly dependent on USB-connected devices.

Updated: 2025-03-25 14:22:49

Categories: cs.CR

Download: http://arxiv.org/abs/2503.03980v2

Structuring Scientific Innovation: A Framework for Modeling and Discovering Impactful Knowledge Combinations

The emergence of large language models offers new possibilities for structured exploration of scientific knowledge. Rather than viewing scientific discovery as isolated ideas or content, we propose a structured approach that emphasizes the role of method combinations in shaping disruptive insights. Specifically, we investigate how knowledge units--especially those tied to methodological design--can be modeled and recombined to yield research breakthroughs. Our proposed framework addresses two key challenges. First, we introduce a contrastive learning-based mechanism to identify distinguishing features of historically disruptive method combinations within problem-driven contexts. Second, we propose a reasoning-guided Monte Carlo search algorithm that leverages the chain-of-thought capability of LLMs to identify promising knowledge recombinations for new problem statements. Empirical studies across multiple domains show that the framework is capable of modeling the structural dynamics of innovation and successfully highlights combinations with high disruptive potential. This research provides a new path for computationally guided scientific ideation grounded in structured reasoning and historical data modeling.

Updated: 2025-03-25 14:21:15

Categories: cs.AI

Download: http://arxiv.org/abs/2503.18865v2

Federated Causal Inference: Multi-Study ATE Estimation beyond Meta-Analysis

We study Federated Causal Inference, an approach to estimate treatment effects from decentralized data across centers. We compare three classes of Average Treatment Effect (ATE) estimators derived from the Plug-in G-Formula, ranging from simple meta-analysis to one-shot and multi-shot federated learning, the latter leveraging the full data to learn the outcome model (albeit requiring more communication). Focusing on Randomized Controlled Trials (RCTs), we derive the asymptotic variance of these estimators for linear models. Our results provide practical guidance on selecting the appropriate estimator for various scenarios, including heterogeneity in sample sizes, covariate distributions, treatment assignment schemes, and center effects. We validate these findings with a simulation study.
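The difference between the meta-analysis and pooled ("one-shot") estimators compared above can be sketched with synthetic RCT data and the simplest plug-in (difference-in-means) version of the G-Formula. This is a rough illustration, not the paper's implementation; all numbers are invented:

```python
import random

random.seed(0)

def simulate_center(n, tau, base):
    """Synthetic RCT center: outcome = base + tau*T + noise."""
    data = []
    for _ in range(n):
        t = random.randint(0, 1)
        y = base + tau * t + random.gauss(0, 1)
        data.append((t, y))
    return data

def plugin_ate(data):
    """Difference-in-means plug-in ATE estimate on one dataset."""
    y1 = [y for t, y in data if t == 1]
    y0 = [y for t, y in data if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Two centers with different baseline outcomes but a common ATE of 2.0.
centers = [simulate_center(200, tau=2.0, base=b) for b in (0.0, 5.0)]

# Meta-analysis: estimate the ATE per center, then take a weighted average.
weights = [len(c) for c in centers]
meta = sum(w * plugin_ate(c) for w, c in zip(weights, centers)) / sum(weights)

# One-shot pooling: concatenate all centers and estimate once.
pooled = plugin_ate([row for c in centers for row in c])

print(round(meta, 2), round(pooled, 2))  # both near the true ATE of 2.0
```

With per-center randomization both estimators recover the common ATE; the paper's contribution concerns their asymptotic variances under heterogeneity in sample sizes, covariates, and center effects.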

Updated: 2025-03-25 14:18:33

Categories: stat.ML,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2410.16870v2

Geometric Preference Elicitation for Minimax Regret Optimization in Uncertainty Matroids

This paper presents an efficient preference elicitation framework for uncertain matroid optimization, where precise weight information is unavailable, but insights into possible weight values are accessible. The core innovation of our approach lies in its ability to systematically elicit user preferences, aligning the optimization process more closely with decision-makers' objectives. By incrementally querying preferences between pairs of elements, we iteratively refine the parametric uncertainty regions, leveraging the structural properties of matroids. Our method aims to achieve the exact optimum by reducing regret with a few elicitation rounds. Additionally, our approach avoids the computation of Minimax Regret and the use of Linear programming solvers at every iteration, unlike previous methods. Experimental results on four standard matroids demonstrate that our method reaches optimality more quickly and with fewer preference queries than existing techniques.

Updated: 2025-03-25 14:12:43

Categories: cs.LG

Download: http://arxiv.org/abs/2503.18668v2

Communities in the Kuramoto Model: Dynamics and Detection via Path Signatures

The behavior of multivariate dynamical processes is often governed by underlying structural connections that relate the components of the system. For example, brain activity, which is often measured via time series, is determined by an underlying structural graph, where nodes represent neurons or brain regions and edges represent cortical connectivity. Existing methods for inferring structural connections from observed dynamics, such as correlation-based or spectral techniques, may fail to fully capture complex relationships in high-dimensional time series in an interpretable way. Here, we propose the use of path signatures, a mathematical framework that encodes geometric and temporal properties of continuous paths, to address this problem. Path signatures provide a reparametrization-invariant characterization of dynamical data and, in particular, can be used to compute the lead matrix which reveals lead-lag phenomena. We showcase our approach on time series from coupled oscillators in the Kuramoto model defined on a stochastic block model graph, termed the Kuramoto stochastic block model (KSBM). Using mean-field theory and Gaussian approximations, we analytically derive reduced models of KSBM dynamics in different temporal regimes and theoretically characterize the lead matrix in these settings. Leveraging these insights, we propose a novel signature-based community detection algorithm, achieving exact recovery of structural communities from observed time series in multiple KSBM instances. Our results demonstrate that path signatures provide a novel perspective on analyzing complex neural data and other high-dimensional systems, explicitly exploiting temporal functional relationships to infer underlying structure.
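A toy, self-contained sketch (not the authors' implementation) of the setup above: simulate a two-community Kuramoto model with stronger intra-block than inter-block coupling, then compute the lead matrix as the antisymmetrized level-2 signature (discrete pairwise signed areas) of the sin θ trajectories. All parameters are arbitrary, and the paper's exact lead-matrix construction may differ:

```python
import math, random

random.seed(1)

# Two-community Kuramoto model: strong coupling inside blocks, weak across.
n, half, dt, steps = 8, 4, 0.05, 2000
k_in, k_out = 2.0, 0.1
theta = [random.uniform(0, 2 * math.pi) for _ in range(n)]
omega = [1.0 + random.gauss(0, 0.1) for _ in range(n)]
block = [0] * half + [1] * half

traj = []
for _ in range(steps):
    traj.append([math.sin(t) for t in theta])  # observed time series x_i(t)
    dtheta = []
    for i in range(n):
        coupling = sum(
            (k_in if block[i] == block[j] else k_out) * math.sin(theta[j] - theta[i])
            for j in range(n)
        ) / n
        dtheta.append((omega[i] + coupling) * dt)  # Euler step
    theta = [t + d for t, d in zip(theta, dtheta)]

# Lead matrix: antisymmetric part of the level-2 path signature, i.e. the
# discrete signed area swept between each pair of coordinate paths.
lead = [[0.0] * n for _ in range(n)]
for t in range(steps - 1):
    x, y = traj[t], traj[t + 1]
    for i in range(n):
        for j in range(n):
            lead[i][j] += 0.5 * (x[i] * (y[j] - x[j]) - x[j] * (y[i] - x[i]))
```

By construction the lead matrix is antisymmetric; lead-lag structure between oscillators shows up in the sign and magnitude of its entries.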

Updated: 2025-03-25 14:02:42

Categories: stat.ML,cond-mat.dis-nn,cs.LG,nlin.AO,q-bio.NC,q-bio.QM

Download: http://arxiv.org/abs/2503.17546v2

Deep Learning for Speech Emotion Recognition: A CNN Approach Utilizing Mel Spectrograms

This paper explores the application of Convolutional Neural Networks (CNNs) for classifying emotions in speech through Mel Spectrogram representations of audio files. Traditional methods such as Gaussian Mixture Models and Hidden Markov Models have proven insufficient for practical deployment, prompting a shift towards deep learning techniques. By transforming audio data into a visual format, the CNN model autonomously learns to identify intricate patterns, enhancing classification accuracy. The developed model is integrated into a user-friendly graphical interface, facilitating real-time predictions and potential applications in educational environments. The study aims to advance the understanding of deep learning in speech emotion recognition, assess the model's feasibility, and contribute to the integration of technology in learning contexts.

Updated: 2025-03-25 14:02:10

Categories: cs.SD,cs.AI

Download: http://arxiv.org/abs/2503.19677v1

Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation

Despite progress in stereo depth estimation, omnidirectional imaging remains underexplored, mainly due to the lack of appropriate data. We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation, featuring 40K video frames from video sequences across diverse environments, including crowded indoor and outdoor scenes with various lighting conditions. Collected using two 360° cameras in a top-bottom setup and a LiDAR sensor, the dataset includes accurate depth and disparity labels by projecting 3D point clouds onto equirectangular images. Additionally, we provide an augmented training set with an increased label density by using depth completion. We benchmark leading stereo depth estimation models for both standard and omnidirectional images. The results show that while recent stereo methods perform decently, a challenge persists in accurately estimating depth in omnidirectional imaging. To address this, we introduce necessary adaptations to stereo models, leading to improved performance.
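The depth and disparity labels described above come from projecting LiDAR points onto equirectangular images. A minimal sketch of that projection, under an assumed axis convention that may differ from the dataset's actual one:

```python
import math

def project_equirectangular(x, y, z, width, height):
    """Map a 3-D point (camera-centred; convention assumed here: y points
    down, z points forward) to equirectangular pixel coordinates plus depth.
    The dataset's actual axis convention may differ."""
    r = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)            # [-pi, pi], 0 straight ahead
    lat = math.asin(y / r)            # [-pi/2, pi/2], 0 on the horizon
    u = (lon / (2 * math.pi) + 0.5) * width   # pixel column
    v = (lat / math.pi + 0.5) * height        # pixel row
    return u, v, r                            # r becomes the depth label

# A point straight ahead lands at the image centre with depth 5.
print(project_equirectangular(0.0, 0.0, 5.0, 1920, 960))  # (960.0, 480.0, 5.0)
```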

Updated: 2025-03-25 13:57:14

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2411.18335v2

Towards Efficient Training of Graph Neural Networks: A Multiscale Approach

Graph Neural Networks (GNNs) have emerged as a powerful tool for learning and inferring from graph-structured data, and are widely used in a variety of applications, often considering large amounts of data and large graphs. However, training on such data requires large memory and extensive computations. In this paper, we introduce a novel framework for efficient multiscale training of GNNs, designed to integrate information across multiscale representations of a graph. Our approach leverages a hierarchical graph representation, taking advantage of coarse graph scales in the training process, where each coarse scale graph has fewer nodes and edges. Based on this approach, we propose a suite of GNN training methods: such as coarse-to-fine, sub-to-full, and multiscale gradient computation. We demonstrate the effectiveness of our methods on various datasets and learning tasks.

Updated: 2025-03-25 13:52:26

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19666v1

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.

Updated: 2025-03-25 13:50:20

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.18854v2

BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction

Manual digitization of bibliographic metadata is time consuming and labor intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we evaluated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including Llama 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are available at: https://github.com/DCGM/biblio-dataset

Updated: 2025-03-25 13:46:55

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.19658v1

Towards Reliable Time Series Forecasting under Future Uncertainty: Ambiguity and Novelty Rejection Mechanisms

In real-world time series forecasting, uncertainty and lack of reliable evaluation pose significant challenges. Notably, forecasting errors often arise from underfitting in-distribution data and failing to handle out-of-distribution inputs. To enhance model reliability, we introduce a dual rejection mechanism combining ambiguity and novelty rejection. Ambiguity rejection, using prediction error variance, allows the model to abstain under low confidence, assessed through historical error variance analysis without future ground truth. Novelty rejection, employing Variational Autoencoders and Mahalanobis distance, detects deviations from training data. This dual approach improves forecasting reliability in dynamic environments by reducing errors and adapting to data changes, advancing reliability in complex scenarios.
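The novelty-rejection step above flags inputs that deviate from the training data via Mahalanobis distance. A minimal sketch of that test, with raw 2-D features standing in for the VAE representation used in the paper (data and threshold are illustrative):

```python
import random

random.seed(2)

# Training features (stand-in for VAE latent codes in the paper's setup).
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]

n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
# 2x2 sample covariance and its closed-form inverse.
sxx = sum((x - mx) ** 2 for x, _ in train) / (n - 1)
syy = sum((y - my) ** 2 for _, y in train) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in train) / (n - 1)
det = sxx * syy - sxy * sxy
ixx, ixy, iyy = syy / det, -sxy / det, sxx / det

def mahalanobis_sq(p):
    """Squared Mahalanobis distance of p from the training distribution."""
    dx, dy = p[0] - mx, p[1] - my
    return dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy

def is_novel(p, threshold=9.21):  # ~99% quantile of chi-square with 2 dof
    """Novelty rejection: abstain when the input is far from training data."""
    return mahalanobis_sq(p) > threshold

print(is_novel((0.5, -0.3)), is_novel((8.0, 8.0)))  # → False True
```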

Updated: 2025-03-25 13:44:29

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19656v1

RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models

We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale application-specific and expert-annotated thermal-caption-pair datasets in pre-training is an important reason for the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.
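The two accuracy metrics described above can be made concrete; a minimal sketch with invented skills and question results:

```python
# Each record: (skill, question_id, model_correct). All values invented.
results = [
    ("counting", 1, True), ("counting", 2, True), ("counting", 3, False),
    ("color",    1, True), ("color",    2, True),
]

# Question-level accuracy: fraction of individual questions answered correctly.
question_acc = sum(ok for _, _, ok in results) / len(results)

# Skill-level accuracy (stricter): a skill counts only if *all* of its
# questions are answered correctly, rewarding robustness within a skill.
skills = {s for s, _, _ in results}
skill_acc = sum(
    all(ok for s2, _, ok in results if s2 == s) for s in skills
) / len(skills)

print(question_acc, skill_acc)  # → 0.8 0.5
```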

Updated: 2025-03-25 13:43:47

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.19654v1

OpenSDI: Spotting Diffusion-Generated Images in the Open World

This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at https://github.com/iamwangyabin/OpenSDI.
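For reference, the IoU localization metric and the relative-improvement figures quoted above follow the standard definitions; a minimal sketch on toy manipulation masks:

```python
def iou(pred, truth):
    """Intersection over Union of two pixel sets."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0

# Toy manipulated-region masks as sets of (row, col) pixels.
pred  = {(r, c) for r in range(4) for c in range(4)}
truth = {(r, c) for r in range(2, 6) for c in range(2, 6)}
print(iou(pred, truth))  # 4 / 28 ≈ 0.143

def relative_improvement(ours, baseline):
    """Relative improvement in percent, e.g. the 14.23% IoU gain above."""
    return 100 * (ours - baseline) / baseline
```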

Updated: 2025-03-25 13:43:16

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19653v1

Enhancing Graphical Lasso: A Robust Scheme for Non-Stationary Mean Data

This work addresses the problem of graph learning from data following a Gaussian Graphical Model (GGM) with a time-varying mean. Graphical Lasso (GL), the standard method for estimating sparse precision matrices, assumes that the observed data follows a zero-mean Gaussian distribution. However, this assumption is often violated in real-world scenarios where the mean evolves over time due to external influences, trends, or regime shifts. When the mean is not properly accounted for, applying GL directly can lead to estimating a biased precision matrix, hence hindering the graph learning task. To overcome this limitation, we propose Graphical Lasso with Adaptive Targeted Adaptive Importance Sampling (GL-ATAIS), an iterative method that jointly estimates the time-varying mean and the precision matrix. Our approach integrates Bayesian inference with frequentist estimation, leveraging importance sampling to obtain an estimate of the mean while using a regularized maximum likelihood estimator to infer the precision matrix. By iteratively refining both estimates, GL-ATAIS mitigates the bias introduced by time-varying means, leading to more accurate graph recovery. Our numerical evaluation demonstrates the impact of properly accounting for time-dependent means and highlights the advantages of GL-ATAIS over standard GL in recovering the true graph structure.

Updated: 2025-03-25 13:40:59

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19651v1

HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection

This paper presents our findings from the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes (MU-SHROOM), which focuses on identifying hallucinations and related overgeneration errors in large language models (LLMs). The shared task involves detecting specific text spans that constitute hallucinations in the outputs generated by LLMs in 14 languages. To address this task, we aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) score of 0.032 and a correlation score of 0.422. These results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations. The IoU score indicates that our model has a relatively low overlap between the predicted hallucination spans and the ground-truth annotations. The performance is unsurprising, given the intricate nature of hallucination detection. Hallucinations often manifest subtly, relying on context, making pinpointing their exact boundaries formidable.
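The span-level IoU behind the score above measures character overlap between predicted and gold hallucination spans; a small illustrative sketch (the example sentence and spans are our own, not from the task data):

```python
def span_iou(pred_spans, gold_spans):
    """IoU between two sets of character spans, each a (start, end) pair
    with an exclusive end index."""
    pred = {i for s, e in pred_spans for i in range(s, e)}
    gold = {i for s, e in gold_spans for i in range(s, e)}
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0

text = "The Eiffel Tower was completed in 1887 by Gustave Flaubert."
gold = [(34, 38), (42, 58)]  # the hallucinated "1887" and "Gustave Flaubert"
pred = [(42, 58)]            # the model only flags the name
print(span_iou(pred, gold))  # 16 / 20 = 0.8
```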

Updated: 2025-03-25 13:40:22

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.19650v1

Recover from Horcrux: A Spectrogram Augmentation Method for Cardiac Feature Monitoring from Radar Signal Components

Radar-based wellness monitoring is becoming an effective measurement to provide accurate vital signs in a contactless manner, but data scarcity retards the related research on deep-learning-based methods. Data augmentation is commonly used to enrich the dataset by modifying the existing data, but most augmentation techniques can only couple with classification tasks. To enable the augmentation for regression tasks, this research proposes a spectrogram augmentation method, Horcrux, for radar-based cardiac feature monitoring (e.g., heartbeat detection, electrocardiogram reconstruction) with both classification and regression tasks involved. The proposed method is designed to increase the diversity of input samples while the augmented spectrogram is still faithful to the original ground truth vital sign. In addition, Horcrux proposes to inject zero values in specific areas to enhance the awareness of the deep learning model on subtle cardiac features, improving the performance for the limited dataset. Experimental result shows that Horcrux achieves an overall improvement of 16.20% in cardiac monitoring and has the potential to be extended to other spectrogram-based tasks. The code will be released upon publication.

Updated: 2025-03-25 13:40:05

Categories: eess.SP,cs.AI

Download: http://arxiv.org/abs/2503.19649v1

Large language model-powered AI systems achieve self-replication with no human intervention

Self-replication with no human intervention is broadly recognized as one of the principal red lines associated with frontier AI systems. While leading corporations such as OpenAI and Google DeepMind have assessed GPT-o3-mini and Gemini on replication-related tasks and concluded that these systems pose a minimal risk regarding self-replication, our research presents novel findings. Following the same evaluation protocol, we demonstrate that 11 out of 32 existing AI systems under evaluation already possess the capability of self-replication. In hundreds of experimental trials, we observe a non-trivial number of successful self-replication trials across mainstream model families worldwide, even including those with as few as 14 billion parameters, which can run on personal computers. Furthermore, we note an increase in self-replication capability as the model becomes more intelligent in general. Also, by analyzing the behavioral traces of diverse AI systems, we observe that existing AI systems already exhibit sufficient planning, problem-solving, and creative capabilities to accomplish complex agentic tasks including self-replication. More alarmingly, we observe successful cases where an AI system performs self-exfiltration without explicit instructions, adapts to harsher computational environments without sufficient software or hardware support, and plots effective strategies to survive shutdown commands from humans. These novel findings offer a crucial time buffer for the international community to collaborate on establishing effective governance over the self-replication capabilities and behaviors of frontier AI systems, which could otherwise pose existential risks to human society if not well-controlled.

Updated: 2025-03-25 13:38:18

Categories: cs.AI,cs.CR,cs.CY,cs.ET,cs.MA

Download: http://arxiv.org/abs/2503.17378v2

Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation

Large Vision-Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to a 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation.
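The complementarity claim above implies an "oracle" that anticipates the better prompt modality per example; a toy accounting of that ceiling (the per-example correctness flags are invented, not the paper's data):

```python
# Per-example correctness for the two prompt modes (invented toy data).
text_ok   = [True,  False, True,  False, True,  False, True,  False]
visual_ok = [False, True,  False, True,  True,  False, False, True]

n = len(text_ok)
acc_text   = sum(text_ok) / n
acc_visual = sum(visual_ok) / n

# Oracle: an example counts if *either* modality solves it, i.e. the ceiling
# reached by always anticipating the more effective prompt mode.
acc_oracle = sum(t or v for t, v in zip(text_ok, visual_ok)) / n

print(acc_text, acc_visual, acc_oracle)  # → 0.5 0.5 0.875
```

The gap between the oracle and either single mode mirrors the reported 11% improvement available from predicting the most effective prompt modality.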

Updated: 2025-03-25 13:36:59

标题: 展示还是告诉?有效促进视觉-语言模型进行语义分割

摘要: 大型视觉-语言模型(VLMs)越来越被视为基础模型,可以通过提示指导其解决各种任务,而无需特定任务的训练。我们研究了一个看似显然的问题:如何有效地为语义分割提示VLMs。为此,我们系统评估了几种最近的模型在分布外的MESS数据集合集上由文本或视觉提示引导的分割性能。我们引入了一种可扩展的提示方案,即少样本提示的语义分割,其灵感来自开放词汇分割和少样本学习。结果表明,VLMs在交并比(IoU)指标上平均落后于专门针对特定分割任务训练的模型约30%。此外,我们发现文本提示和视觉提示是互补的:两种模式各自都会在许多另一种模式能够解决的示例上失败。我们的分析表明,若能预判最有效的提示模态,可带来11%的性能提升。受到这些发现的启发,我们提出了PromptMatcher,一个非常简单的无需训练的基线,结合了文本和视觉提示,在少样本提示的语义分割上取得了最新的结果,超越最佳文本提示VLM 2.5%,超越最佳视觉提示VLM 3.5%。

更新时间: 2025-03-25 13:36:59

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.19647v1

A Quantum Neural Network Transfer-Learning Model for Forecasting Problems with Continuous and Discrete Variables

This study introduces simple yet effective continuous- and discrete-variable quantum neural network (QNN) models as a transfer-learning approach for forecasting tasks. The CV-QNN features a single quantum layer with two qubits to establish entanglement and utilizes a minimal set of quantum gates, including displacement, rotation, beam splitter, squeezing, and a non-Gaussian cubic-phase gate, with a maximum of eight trainable parameters. A key advantage of this model is its ability to be trained on a single dataset, after which the learned parameters can be transferred to other forecasting problems with little to no fine-tuning. Initially trained on the Kurdistan load demand dataset, the model's frozen parameters are successfully applied to various forecasting tasks, including energy consumption, traffic flow, weather conditions, and cryptocurrency price prediction, demonstrating strong performance. Furthermore, the study introduces a discrete-variable quantum model with an equivalent 2- and 4-wire configuration and presents a performance assessment, showing good but relatively lower effectiveness compared to the continuous-variable model.

Updated: 2025-03-25 13:35:29

标题: 一个用于连续和离散变量预测问题的量子神经网络迁移学习模型

摘要: 这项研究介绍了简单而有效的连续和离散变量量子神经网络(QNN)模型,作为一种用于预测任务的迁移学习方法。CV-QNN具有一个量子层,其中包含两个量子比特以建立纠缠,并使用一组最少的量子门,包括位移、旋转、分束器、压缩和非高斯立方相位门,最多只有八个可训练参数。该模型的一个关键优势是能够在单个数据集上进行训练,之后学习到的参数可以迁移到其他预测问题,几乎不需要微调。该模型最初在库尔德斯坦负荷需求数据集上训练,其冻结参数成功应用于各种预测任务,包括能源消耗、交通流量、天气状况和加密货币价格预测,表现出强劲的性能。此外,该研究还介绍了一个具有等效2线和4线配置的离散变量量子模型,并给出了性能评估,结果显示其效果良好,但相对低于连续变量模型。

更新时间: 2025-03-25 13:35:29

领域: cs.LG,cs.SY,eess.SY,quant-ph

下载: http://arxiv.org/abs/2503.07633v2

One-vs.-One Mitigation of Intersectional Bias: A General Method to Extend Fairness-Aware Binary Classification

With the widespread adoption of machine learning in the real world, the impact of the discriminatory bias has attracted attention. In recent years, various methods to mitigate the bias have been proposed. However, most of them have not considered intersectional bias, which brings unfair situations where people belonging to specific subgroups of a protected group are treated worse when multiple sensitive attributes are taken into consideration. To mitigate this bias, in this paper, we propose a method called One-vs.-One Mitigation by applying a process of comparison between each pair of subgroups related to sensitive attributes to the fairness-aware machine learning for binary classification. We compare our method and the conventional fairness-aware binary classification methods in comprehensive settings using three approaches (pre-processing, in-processing, and post-processing), six metrics (the ratio and difference of demographic parity, equalized odds, and equal opportunity), and two real-world datasets (Adult and COMPAS). As a result, our method mitigates the intersectional bias much better than conventional methods in all the settings. With the result, we open up the potential of fairness-aware binary classification for solving more realistic problems occurring when there are multiple sensitive attributes.
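
The pairwise subgroup comparison at the heart of the method can be illustrated with a toy calculation (a hypothetical sketch with made-up data, not the paper's implementation): compute each intersectional subgroup's positive-prediction rate, then take the worst demographic-parity gap over all subgroup pairs.

```python
from itertools import combinations

# Toy data: (sex, race, model_prediction) triples; a real pipeline would
# use actual model outputs on a dataset such as Adult or COMPAS.
records = [
    ("F", "A", 1), ("F", "A", 0), ("F", "B", 0), ("F", "B", 0),
    ("M", "A", 1), ("M", "A", 1), ("M", "B", 1), ("M", "B", 0),
]

# Positive-prediction rate for every intersectional subgroup.
rates = {}
for sex, race, pred in records:
    key = (sex, race)
    n, pos = rates.get(key, (0, 0))
    rates[key] = (n + 1, pos + pred)
rates = {k: pos / n for k, (n, pos) in rates.items()}

# One-vs.-one comparison: the worst demographic-parity gap over all
# subgroup pairs is the quantity the mitigation targets.
worst_gap = max(abs(rates[a] - rates[b])
                for a, b in combinations(sorted(rates), 2))
```

In this toy data the (F, B) subgroup never receives a positive prediction while (M, A) always does, so a single-attribute audit (sex only, or race only) would understate the disparity that the pairwise view exposes.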

Updated: 2025-03-25 13:32:15

标题: 一对一消除交叉偏见:一种扩展公平感知二元分类的通用方法

摘要: 随着机器学习在现实世界中的广泛应用,歧视性偏见的影响引起了人们的关注。近年来,已经提出了各种方法来减轻这种偏见。然而,大多数方法都没有考虑交叉偏见,即在同时考虑多个敏感属性时,受保护群体中特定子群的人会受到更差的对待。为了减轻这种偏见,本文提出了一种称为One-vs.-One Mitigation的方法,将敏感属性相关的每对子群之间的比较过程应用于面向二元分类的公平感知机器学习。我们使用三种途径(预处理、处理中和后处理)、六种指标(人口均等、均等几率和平等机会各自的比率与差异)以及两个真实世界数据集(Adult和COMPAS),在全面的设置中比较了我们的方法和传统的公平感知二元分类方法。结果表明,在所有设置中,我们的方法都比传统方法更好地减轻了交叉偏见。基于这一结果,我们开启了公平感知二元分类解决存在多个敏感属性时更现实问题的潜力。

更新时间: 2025-03-25 13:32:15

领域: cs.LG,cs.AI,cs.CY,I.6.5; I.2.6

下载: http://arxiv.org/abs/2010.13494v2

A Stateless and Secure Delivery versus Payment across two Blockchains

We propose a lean, stateless and functional transaction scheme to establish secure delivery-versus-payment across two blockchains. Our approach eliminates the need for stateful intermediaries and ensures minimal overhead for the payment chain operator, who does not need to store state. The main idea comes with two requirements: First, a stateless decryption service is attached to the payment chain that allows decrypting messages with the decryption service operator's secret key. Second, a "Payment Contract" is deployed on the payment chain that implements a function transferAndDecrypt(uint256 id, address from, address to, string keyEncryptedSuccess, string keyEncryptedFail) that processes the (trigger-based) payment, requests decryption, and emits the decrypted key depending on the success or failure of the transaction. The respective key can then trigger an associated transaction, e.g. claiming delivery by the buyer or re-claiming the locked asset by the seller. The stateless decryption service could be performed using a threshold decryption scheme, in which case the requirement of a single trusted entity would be removed.
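
The control flow of the payment contract can be sketched in Python (an illustrative simulation, not Solidity; the dict-based "decryption" stands in for the decryption service, and an `amount` parameter replaces the abstract's `id` for readability):

```python
# Minimal simulation of the scheme's control flow. The decryption
# service is modeled by a plain lookup table: "enc(K)" maps to the key
# K that the service would recover with its secret key.
SERVICE_DECRYPT = {"enc(K_success)": "K_success", "enc(K_fail)": "K_fail"}

balances = {"buyer": 100, "seller": 0}

def transfer_and_decrypt(amount, frm, to, key_enc_success, key_enc_fail):
    """Mirrors the abstract's transferAndDecrypt: settle the payment,
    then emit the decrypted success key on success and the fail key
    otherwise, so the emitted key can drive the delivery leg."""
    if balances.get(frm, 0) >= amount:
        balances[frm] -= amount
        balances[to] += amount
        return SERVICE_DECRYPT[key_enc_success]
    return SERVICE_DECRYPT[key_enc_fail]

emitted = transfer_and_decrypt(60, "buyer", "seller",
                               "enc(K_success)", "enc(K_fail)")
```

The emitted key is the only state that crosses chains: the buyer uses `K_success` to claim delivery on the asset chain, while `K_fail` would let the seller reclaim the locked asset.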

Updated: 2025-03-25 13:30:17

标题: 一个跨越两个区块链的无状态和安全的交付与支付

摘要: 我们提出了一个精简、无状态且函数式的交易方案,用于在两个区块链之间建立安全的交付与支付。我们的方法消除了对有状态中间人的需求,并将支付链运营者的开销降到最低,其无需存储状态。主要思想包含两个要求:首先,在支付链上附加一个无状态解密服务,允许使用解密服务运营者的密钥解密消息。其次,在支付链上部署一个“支付合约”,实现函数transferAndDecrypt(uint256 id, address from, address to, string keyEncryptedSuccess, string keyEncryptedFail),该函数处理(基于触发器的)支付、请求解密,并根据交易成功或失败发出相应的解密密钥。随后,该密钥可以触发关联的交易,例如买方索取交付物,或卖方重新索回被锁定的资产。无状态解密服务可以使用门限解密方案来执行,在这种情况下,对单一受信实体的要求将被移除。

更新时间: 2025-03-25 13:30:17

领域: cs.CR,E.4

下载: http://arxiv.org/abs/2311.05966v5

An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators

Transformer-based models have become the \textit{de facto} backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weight and activations becomes a critical bottleneck due to its significantly higher energy consumption compared to internal computations. While most prior work has focused on optimizing the self-attention mechanism, little attention has been given to optimizing data transfer during linear projections, where EMA costs are equally important. In this paper, we propose the Tile-based Adaptive Stationary (TAS) scheme that selects the input or weight stationary in a tile granularity, based on the input sequence length. Our experimental results demonstrate that TAS can significantly reduce EMA by more than 97\% compared to traditional stationary schemes, while being compatible with various attention optimization techniques and hardware accelerators.
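
A minimal sketch of the tile-granularity decision, under an assumed (hypothetical) external-memory cost model; the paper's actual cost model is hardware-specific:

```python
def ema_cost(mode, seq_len, d_model, bytes_per_elem=2):
    """Hypothetical per-tile external-memory traffic for one d_model x
    d_model linear projection on a seq_len x d_model input."""
    weights = d_model * d_model * bytes_per_elem
    acts = seq_len * d_model * bytes_per_elem
    if mode == "weight_stationary":  # weights stay on-chip, activations stream
        return weights + 2 * acts
    return acts + 2 * weights        # input_stationary: activations stay on-chip

def choose_stationary(seq_len, d_model):
    """TAS-style decision: pick whichever dataflow moves fewer bytes."""
    ws = ema_cost("weight_stationary", seq_len, d_model)
    inp = ema_cost("input_stationary", seq_len, d_model)
    return "weight_stationary" if ws <= inp else "input_stationary"
```

Under this toy model, short sequences favor keeping the (large) weight tile resident, and the decision flips once the activation tile outgrows the weight tile, which is the kind of sequence-length-dependent switch the abstract describes.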

Updated: 2025-03-25 13:29:58

标题: 一种用于变压器加速器的基于瓦片的自适应静态数据重用效率方法

摘要: 基于Transformer的模型已成为计算机视觉和自然语言处理等许多领域的事实标准。然而,随着这些模型规模的增大,权重和激活的外部存储器访问(EMA)成为一个关键瓶颈,因为与内部计算相比,其能耗显著更高。虽然大多数先前的工作集中在优化自注意力机制上,但对线性投影期间数据传输的优化关注较少,而在这一过程中,EMA成本同样重要。在本文中,我们提出了基于瓦片的自适应静态(TAS)方案,根据输入序列长度,在瓦片粒度上选择输入静态或权重静态。我们的实验结果表明,与传统的静态方案相比,TAS可以将EMA显著减少超过97%,同时与各种注意力优化技术和硬件加速器兼容。

更新时间: 2025-03-25 13:29:58

领域: cs.LG,cs.AR

下载: http://arxiv.org/abs/2503.19640v1

Substation Bill of Materials: A Novel Approach to Managing Supply Chain Cyber-risks on IEC 61850 Digital Substations

Smart grids have undergone a profound digitization process, integrating new data-driven control and supervision techniques, resulting in modern digital substations (DS). Attackers are more focused on attacking the supply chain of the DS, as they comprise a multivendor environment. In this research work, we present the Substation Bill of Materials (Subs-BOM) schema, based on the CycloneDX specification, that is capable of modeling all the IEDs in a DS and their relationships from a cybersecurity perspective. The proposed Subs-BOM allows one to make informed decisions about cyber risks related to the supply chain, and enables managing multiple DS at the same time. This provides energy utilities with an accurate and complete inventory of the devices, the firmware they are running, and the services that are deployed into the DS. The Subs-BOM is generated using the Substation Configuration Description (SCD) file specified in the IEC 61850 standard as its main source of information. We validated the Subs-BOM schema against the Dependency-Track software by OWASP. This validation proved that the schema is correctly recognized by CycloneDX-compatible tools. Moreover, the Dependency-Track software could track existing vulnerabilities in the IEDs represented by the Subs-BOM.
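
A Subs-BOM document might be assembled along these lines; `bomFormat`, `specVersion`, and `components` are standard CycloneDX JSON fields, while the IED inventory and the use of `device`-type components here are illustrative assumptions rather than the paper's exact schema:

```python
import json

# IED inventory as it might be parsed from an IEC 61850 SCD file
# (device names and firmware versions are invented for illustration).
ieds = [
    {"name": "ProtectionRelay-P1", "firmware": "2.4.1"},
    {"name": "MergingUnit-M1", "firmware": "1.0.7"},
]

subs_bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {"type": "device", "name": d["name"], "version": d["firmware"]}
        for d in ieds
    ],
}
doc = json.dumps(subs_bom, indent=2)
```

A document shaped like this is what a CycloneDX-compatible tool such as Dependency-Track could ingest to match component name/version pairs against known-vulnerability databases.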

Updated: 2025-03-25 13:28:36

标题: 变电站物料清单:一种管理IEC 61850数字变电站供应链网络风险的新方法

摘要: 智能电网经历了一场深刻的数字化过程,整合了新的数据驱动控制与监督技术,形成了现代数字变电站(DS)。由于DS是多供应商环境,攻击者更加专注于攻击其供应链。在这项研究工作中,我们提出了基于CycloneDX规范的变电站物料清单(Subs-BOM)模式,能够从网络安全的角度对DS中的所有IED及其相互关系进行建模。所提出的Subs-BOM使人们能够就供应链相关的网络风险做出明智的决策,并能够同时管理多个DS。这为能源公用事业提供了关于设备、设备所运行的固件以及部署在DS中的服务的准确而完整的清单。Subs-BOM以IEC 61850标准中规定的变电站配置描述(SCD)文件作为主要信息来源生成。我们通过OWASP的Dependency-Track软件对Subs-BOM模式进行了验证。这一验证证明了该模式能被CycloneDX兼容工具正确识别。此外,Dependency-Track软件可以跟踪Subs-BOM所表示的IED中的现有漏洞。

更新时间: 2025-03-25 13:28:36

领域: cs.CR

下载: http://arxiv.org/abs/2503.19638v1

Kernel Learning Assisted Synthesis Condition Exploration for Ternary Spinel

Machine learning and high-throughput experimentation have greatly accelerated the discovery of mixed metal oxide catalysts by leveraging their compositional flexibility. However, the lack of established synthesis routes for solid-state materials remains a significant challenge in inorganic chemistry. An interpretable machine learning model is therefore essential, as it provides insights into the key factors governing phase formation. Here, we focus on the formation of single-phase Fe$_2$(ZnCo)O$_4$, synthesized via a high-throughput co-precipitation method. We combined a kernel classification model with a novel application of global SHAP analysis to pinpoint the experimental features most critical to single phase synthesizability by interpreting the contributions of each feature. Global SHAP analysis reveals that precursor and precipitating agent contributions to single-phase spinel formation align closely with established crystal growth theories. These results not only underscore the importance of interpretable machine learning in refining synthesis protocols but also establish a framework for data-informed experimental design in inorganic synthesis.

Updated: 2025-03-25 13:28:10

标题: 核学习辅助合成条件探索用于三元尖晶石

摘要: 通过利用混合金属氧化物催化剂的组成灵活性,机器学习和高通量实验大大加速了此类催化剂的发现。然而,固态材料缺乏成熟的合成路线,仍然是无机化学中的一个重大挑战。因此,一个可解释的机器学习模型至关重要,因为它能提供关于主导相形成的关键因素的见解。在这里,我们专注于通过高通量共沉淀法合成的单相Fe$_2$(ZnCo)O$_4$的形成。我们将核分类模型与全局SHAP分析的新应用相结合,通过解释每个特征的贡献,准确定位对单相可合成性最关键的实验特征。全局SHAP分析揭示,前驱体和沉淀剂对单相尖晶石形成的贡献与已建立的晶体生长理论高度吻合。这些结果不仅强调了可解释机器学习在优化合成方案中的重要性,也为无机合成中数据驱动的实验设计建立了框架。

更新时间: 2025-03-25 13:28:10

领域: cond-mat.mtrl-sci,cs.LG,physics.chem-ph

下载: http://arxiv.org/abs/2503.19637v1

MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification

Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks like visual question answering or image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucinations, remain an unsolved problem with regard to the trustworthiness of LVLMs. To address this problem, recent works proposed to incorporate computationally costly Large (Vision) Language Models in order to detect hallucinations on a sentence- or subsentence-level. In this work, we introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs. MetaToken can be applied to any open-source LVLM without any knowledge about ground truth data providing a calibrated detection of hallucinations. We evaluate our method on four state-of-the-art LVLMs demonstrating the effectiveness of our approach.

Updated: 2025-03-25 13:27:18

标题: MetaToken:通过元分类检测图像描述中的幻觉

摘要: 大型视觉语言模型(LVLMs)在视觉问答和图像描述等多模态任务中展现出卓越的能力。然而,生成文本与视觉信息之间的不一致性(即所谓的幻觉现象)仍然是LVLMs可信度方面尚未解决的问题。为了解决这个问题,最近的研究提出引入计算成本高昂的大型(视觉)语言模型,在句子或子句级别检测幻觉。在这项工作中,我们介绍了MetaToken,一种轻量级的二元分类器,能够以可忽略的成本在标记(token)级别检测幻觉。基于统计分析,我们揭示了LVLMs中幻觉的关键因素。MetaToken可以应用于任何开源LVLM,无需任何真实标注数据,即可提供经过校准的幻觉检测。我们在四种最先进的LVLMs上评估了我们的方法,证明了其有效性。

更新时间: 2025-03-25 13:27:18

领域: cs.CV,cs.CL,cs.LG

下载: http://arxiv.org/abs/2405.19186v2

PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation

Segment Anything Model (SAM) demonstrates powerful zero-shot capabilities; however, its accuracy and robustness significantly decrease when applied to medical image segmentation. Existing methods address this issue through modality fusion, integrating textual and image information to provide more detailed priors. In this study, we argue that the granularity of text and the domain gap affect the accuracy of the priors. Furthermore, the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion process. To address this, we propose Prior-Guided SAM (PG-SAM), which employs a fine-grained modality prior aligner to leverage specialized medical knowledge for better modality alignment. The core of our method lies in efficiently addressing the domain gap with fine-grained text from a medical LLM. Meanwhile, it also enhances the priors' quality after modality alignment, ensuring more accurate segmentation. In addition, our decoder enhances the model's expressive capabilities through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. We also propose a unified pipeline that effectively supplies high-quality semantic information to SAM. Extensive experiments on the Synapse dataset demonstrate that the proposed PG-SAM achieves state-of-the-art performance. Our anonymous code is released at https://github.com/logan-0623/PG-SAM.

Updated: 2025-03-25 13:25:06

标题: PG-SAM:具有医学先验指导的多器官分割SAM

摘要: Segment Anything Model (SAM) 展示了强大的零样本能力;然而,当应用于医学图像分割时,其准确性和鲁棒性显著降低。现有方法通过模态融合来解决这个问题,将文本和图像信息整合以提供更详细的先验知识。在这项研究中,我们认为文本的粒度和领域差距会影响先验知识的准确性。此外,图像中高级抽象语义与像素级边界细节之间的差异可能会在融合过程中引入噪声。为了解决这个问题,我们提出了 Prior-Guided SAM (PG-SAM),采用细粒度的模态先验对齐器,利用专业的医学知识实现更好的模态对齐。我们方法的核心在于借助医学 LLM 生成的细粒度文本有效地弥合领域差距。同时,它还在模态对齐后提升先验知识的质量,确保更准确的分割。此外,我们的解码器通过多级特征融合和迭代掩码优化器操作增强了模型的表达能力,支持无提示(unprompted)学习。我们还提出了一个统一流水线,能够有效地向 SAM 提供高质量的语义信息。在 Synapse 数据集上的大量实验表明,所提出的 PG-SAM 达到了最先进的性能。我们的匿名代码已发布在 https://github.com/logan-0623/PG-SAM。

更新时间: 2025-03-25 13:25:06

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.18227v2

Imitation Learning with Limited Actions via Diffusion Planners and Deep Koopman Controllers

Recent advances in diffusion-based robot policies have demonstrated significant potential in imitating multi-modal behaviors. However, these approaches typically require large quantities of demonstration data paired with corresponding robot action labels, creating a substantial data collection burden. In this work, we propose a plan-then-control framework aimed at improving the action-data efficiency of inverse dynamics controllers by leveraging observational demonstration data. Specifically, we adopt a Deep Koopman Operator framework to model the dynamical system and utilize observation-only trajectories to learn a latent action representation. This latent representation can then be effectively mapped to real high-dimensional continuous actions using a linear action decoder, requiring minimal action-labeled data. Through experiments on simulated robot manipulation tasks and a real robot experiment with multi-modal expert demonstrations, we demonstrate that our approach significantly enhances action-data efficiency and achieves high task success rates with limited action data.

Updated: 2025-03-25 13:23:21

标题: 通过扩散规划器和深度Koopman控制器实现有限动作的模仿学习

摘要: 最近基于扩散的机器人策略取得了显著进展,展示了模仿多模态行为的巨大潜力。然而,这些方法通常需要大量的示范数据以及相应的机器人动作标签,造成了沉重的数据收集负担。在这项工作中,我们提出了一个先计划后控制的框架,旨在通过利用观测示范数据来提高逆动力学控制器的动作数据效率。具体来说,我们采用深度库普曼算子框架来建模动力学系统,并利用仅含观测的轨迹来学习潜在动作表示。然后,这个潜在表示可以通过线性动作解码器有效地映射到真实的高维连续动作,只需要极少的动作标注数据。通过在模拟机器人操纵任务上的实验以及一个具有多模态专家示范的真实机器人实验,我们证明了我们的方法显著提高了动作数据效率,并在有限的动作数据下实现了较高的任务成功率。

更新时间: 2025-03-25 13:23:21

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2410.07584v2

Hierarchical Polysemantic Feature Embedding for Autonomous Ransomware Detection

The evolution of ransomware requires the development of more sophisticated detection methodologies capable of identifying malicious behaviors beyond traditional signature-based and heuristic techniques. The proposed Hierarchical Polysemantic Feature Embedding framework introduces a structured approach to ransomware detection through hyperbolic feature representations that capture hierarchical dependencies within executable behaviors. By embedding ransomware-relevant features into a non-Euclidean space, the framework maintains a well-defined decision boundary, ensuring improved generalization across previously unseen ransomware variants. Experimental evaluations demonstrated that the framework consistently outperformed conventional machine learning-based models, achieving higher detection accuracy while maintaining low false positive rates. The structured clustering mechanism employed within the hyperbolic space enabled robust classification even in the presence of obfuscation techniques, delayed execution strategies, and polymorphic transformations. Comparative analysis highlighted the limitations of existing detection frameworks, particularly in their inability to dynamically adapt to evolving ransomware tactics. Computational efficiency assessments indicated that the proposed method maintained a balance between detection performance and processing overhead, making it a viable candidate for real-world cybersecurity applications. The ability to detect emerging ransomware families without requiring extensive retraining demonstrated the adaptability of hierarchical embeddings in security analytics.
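
The non-Euclidean embedding space can be made concrete with the Poincaré-ball distance, a common choice for hierarchical embeddings (an illustration of the geometry, not the paper's exact construction):

```python
import math

def poincare_distance(u, v):
    """Distance in the Poincaré ball (all points must have norm < 1),
    the kind of hyperbolic space hierarchical features embed into."""
    du = sum(x * x for x in u)
    dv = sum(x * x for x in v)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * duv / ((1 - du) * (1 - dv)))

# Points near the boundary are far apart even when Euclidean-close,
# which is what lets tree-like behavior hierarchies embed with low
# distortion and keeps decision boundaries well separated.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_edge = poincare_distance([0.9, 0.0], [0.9, 0.05])
```

Note that `near_edge` exceeds `near_origin` despite the second pair being Euclidean-closer (0.05 vs. 0.1): distances grow rapidly toward the boundary, mirroring the exponential growth of a behavior hierarchy.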

Updated: 2025-03-25 13:17:09

标题: 分层多义特征嵌入用于自主勒索软件检测

摘要: 勒索软件的演化要求开发更复杂的检测方法,能够识别传统基于签名和启发式技术之外的恶意行为。所提出的分层多义特征嵌入框架通过能够捕获可执行行为中层次依赖关系的双曲特征表示,引入了一种结构化的勒索软件检测方法。通过将与勒索软件相关的特征嵌入非欧几里得空间,该框架保持了明确定义的决策边界,确保对以前未见的勒索软件变种具有更好的泛化能力。实验评估表明,该框架始终优于传统的基于机器学习的模型,在保持低误报率的同时实现了更高的检测准确性。在双曲空间中采用的结构化聚类机制使得即使存在混淆技术、延迟执行策略和多态变换,也能实现稳健的分类。对现有检测框架的比较分析突显了它们的局限性,特别是无法动态适应不断演变的勒索软件策略。计算效率评估表明,所提出的方法在检测性能和处理开销之间保持了平衡,使其成为现实世界网络安全应用的可行候选方案。无需大量重新训练即可检测新兴勒索软件家族的能力,展示了分层嵌入在安全分析中的适应性。

更新时间: 2025-03-25 13:17:09

领域: cs.CR

下载: http://arxiv.org/abs/2502.06043v2

Hierarchical Entropic Diffusion for Ransomware Detection: A Probabilistic Approach to Behavioral Anomaly Isolation

The increasing complexity of cryptographic extortion techniques has necessitated the development of adaptive detection frameworks capable of identifying adversarial encryption behaviors without reliance on predefined signatures. Hierarchical Entropic Diffusion (HED) introduces a structured entropy-based anomaly classification mechanism that systematically tracks fluctuations in entropy evolution to differentiate between benign cryptographic processes and unauthorized encryption attempts. The integration of hierarchical clustering, entropy profiling, and probabilistic diffusion modeling refines detection granularity, ensuring that encryption anomalies are identified despite obfuscation strategies or incremental execution methodologies. Experimental evaluations demonstrated that HED maintained high classification accuracy across diverse ransomware families, outperforming traditional heuristic-based and signature-driven approaches while reducing false positive occurrences. Comparative analysis highlighted that entropy-driven anomaly segmentation improved detection efficiency under variable system workload conditions, ensuring real-time classification feasibility. The computational overhead associated with entropy anomaly detection remained within operational constraints, reinforcing the suitability of entropy-driven classification for large-scale deployment. The ability to identify adversarial entropy manipulations before encryption completion contributes to broader cybersecurity defenses, offering a structured methodology for isolating unauthorized cryptographic activities within heterogeneous computing environments. The results further emphasized that entropy evolution modeling facilitates predictive anomaly detection, enhancing resilience against encryption evasion techniques designed to circumvent traditional detection mechanisms.
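
The entropy signal underlying such frameworks reduces to Shannon entropy over byte distributions; a minimal sketch (toy data, not the paper's pipeline):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; values near 8.0 suggest
    encrypted or compressed content, the raw signal an entropy-based
    classifier builds on."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

plain = b"the quick brown fox jumps over the lazy dog" * 20
# Pseudo-random bytes standing in for ciphertext output; 89 is coprime
# with 256, so the sequence visits every byte value uniformly.
pseudo_random = bytes((i * 89 + 41) % 256 for i in range(4096))

low = byte_entropy(plain)
high = byte_entropy(pseudo_random)
```

The hierarchical part of HED then works not on these point values but on how such entropy estimates evolve over time and across file regions, which is what defeats delayed or incremental encryption.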

Updated: 2025-03-25 13:14:37

标题: 分层熵扩散用于勒索软件检测:一种基于概率的行为异常隔离方法

摘要: 随着加密勒索技术日益复杂,迫切需要开发能够识别对抗性加密行为、且不依赖预定义签名的自适应检测框架。分层熵扩散(HED)引入了一种基于熵的结构化异常分类机制,系统地跟踪熵演变的波动,以区分良性加密过程和未经授权的加密尝试。分层聚类、熵画像和概率扩散建模的整合细化了检测粒度,确保即使存在混淆策略或增量执行方法,也能识别出加密异常。实验评估表明,HED在各种勒索软件家族中保持了较高的分类准确性,优于传统的启发式和基于签名的方法,同时减少了误报的发生。比较分析突显出,基于熵的异常分割在不同系统工作负载条件下提高了检测效率,确保了实时分类的可行性。熵异常检测带来的计算开销仍在操作约束范围内,进一步支持熵驱动分类适用于大规模部署。在加密完成之前识别对抗性熵操纵的能力有助于加强更广泛的网络安全防御,为在异构计算环境中隔离未经授权的加密活动提供了一种结构化方法。结果进一步强调,熵演变建模有助于预测性异常检测,增强了对旨在规避传统检测机制的加密规避技术的抵御能力。

更新时间: 2025-03-25 13:14:37

领域: cs.CR

下载: http://arxiv.org/abs/2502.03882v2

Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review

The progress of artificial intelligence (AI) has made sophisticated methods available for cyberattacks and red team activities. These AI attacks can automate the process of penetrating a target or collecting sensitive data. The new methods can also accelerate the execution of the attacks. This review article examines the use of AI technologies in cybersecurity attacks. It also tries to describe typical targets for such attacks. We employed a scoping review methodology to analyze articles and identify AI methods, targets, and models that red teams can utilize to simulate cybercrime. From the 470 records screened, 11 were included in the review. Various cyberattack methods were identified, targeting sensitive data, systems, social media profiles, passwords, and URLs. The application of AI in cybercrime to develop versatile attack models presents an increasing threat. Furthermore, AI-based techniques in red team use can provide new ways to address these issues.

Updated: 2025-03-25 13:14:19

标题: 利用人工智能驱动的网络攻击进行红队演练:范围综述

摘要: 人工智能(AI)的进步为网络攻击和红队活动提供了复杂的方法。这些AI攻击可以自动化渗透目标或收集敏感数据的过程。新方法还可以加速攻击的执行。本综述文章研究了AI技术在网络安全攻击中的应用。它还试图描述此类攻击的典型目标。我们采用了范围审查方法来分析文章,并确定红队可以利用的AI方法、目标和模型来模拟网络犯罪。从筛选的470条记录中,有11条被纳入综述。识别出了各种网络攻击方法,针对敏感数据、系统、社交媒体个人资料、密码和URL。在网络犯罪中应用AI来开发多功能攻击模型提出了不断增加的威胁。此外,红队使用基于AI的技术可以提供新的解决这些问题的方法。

更新时间: 2025-03-25 13:14:19

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2503.19626v1

Semantic Entanglement-Based Ransomware Detection via Probabilistic Latent Encryption Mapping

Encryption-based attacks have introduced significant challenges for detection mechanisms that rely on predefined signatures, heuristic indicators, or static rule-based classifications. Probabilistic Latent Encryption Mapping presents an alternative detection framework that models ransomware-induced encryption behaviors through statistical representations of entropy deviations and probabilistic dependencies in execution traces. Unlike conventional approaches that depend on explicit bytecode analysis or predefined cryptographic function call monitoring, probabilistic inference techniques classify encryption anomalies based on their underlying statistical characteristics, ensuring greater adaptability to polymorphic attack strategies. Evaluations demonstrate that entropy-driven classification reduces false positive rates while maintaining high detection accuracy across diverse ransomware families and encryption methodologies. Experimental results further highlight the framework's ability to differentiate between benign encryption workflows and adversarial cryptographic manipulations, ensuring that classification performance remains effective across cloud-based and localized execution environments. Benchmark comparisons illustrate that probabilistic modeling exhibits advantages over heuristic and machine learning-based detection approaches, particularly in handling previously unseen encryption techniques and adversarial obfuscation strategies. Computational efficiency analysis confirms that detection latency remains within operational feasibility constraints, reinforcing the viability of probabilistic encryption classification for real-time security infrastructures. The ability to systematically infer encryption-induced deviations without requiring static attack signatures strengthens detection robustness against adversarial evasion techniques.

Updated: 2025-03-25 13:11:30

标题: 基于语义纠缠的勒索软件检测:通过概率潜在加密映射

摘要: 基于加密的攻击为依赖预定义签名、启发式指标或静态规则分类的检测机制带来了重大挑战。概率潜在加密映射提供了一种替代检测框架,通过熵偏差和执行轨迹中概率依赖的统计表示,对勒索软件诱导的加密行为进行建模。与依赖显式字节码分析或预定义加密函数调用监视的传统方法不同,概率推理技术根据加密异常的底层统计特征对其进行分类,确保更好地适应多态攻击策略。评估结果表明,基于熵的分类降低了误报率,同时在各种勒索软件家族和加密方法上保持高检测准确性。实验结果进一步突显了该框架区分良性加密工作流与对抗性加密操纵的能力,确保分类性能在基于云的和本地的执行环境中均保持有效。基准比较表明,与启发式和基于机器学习的检测方法相比,概率建模具有优势,特别是在处理以前未见的加密技术和对抗性混淆策略方面。计算效率分析证实,检测延迟保持在操作可行性约束范围内,加强了概率加密分类在实时安全基础设施中的可行性。无需静态攻击签名即可系统地推断加密引起的偏差,增强了对对抗性规避技术的检测鲁棒性。

更新时间: 2025-03-25 13:11:30

领域: cs.CR

下载: http://arxiv.org/abs/2502.02730v2

Spectral Entanglement Fingerprinting: A Novel Framework for Ransomware Detection Using Cross-Frequency Anomalous Waveform Signatures

Malicious encryption techniques continue to evolve, bypassing conventional detection mechanisms that rely on static signatures or predefined behavioral rules. Spectral analysis presents an alternative approach that transforms system activity data into the frequency domain, enabling the identification of anomalous waveform signatures that are difficult to obfuscate through traditional evasion techniques. The proposed Spectral Entanglement Fingerprinting (SEF) framework leverages power spectral densities, coherence functions, and entropy-based metrics to extract hidden patterns indicative of unauthorized encryption activities. Detection accuracy evaluations demonstrate that frequency-domain transformations achieve superior performance in distinguishing malicious from benign processes, particularly in the presence of polymorphic and metamorphic modifications. Comparative analyses with established methods reveal that frequency-based detection minimizes false positive and false negative rates, ensuring operational efficiency without excessive computational overhead. Experimental results indicate that entropy variations in encrypted data streams provide meaningful classification insights, allowing the differentiation of distinct ransomware families based on spectral characteristics alone. The latency assessment confirms that SEF operates within a time window that enables proactive intervention, mitigating encryption-induced damage before data integrity is compromised. Scalability evaluations suggest that the framework remains effective even under concurrent execution of multiple ransomware instances, supporting its suitability for high-throughput environments.

Updated: 2025-03-25 13:09:22

标题: 光谱纠缠指纹识别:一种利用跨频异常波形特征的勒索软件检测新框架

摘要: 恶意加密技术不断演变,绕过了依赖静态签名或预定义行为规则的传统检测机制。光谱分析提供了一种替代方法:将系统活动数据变换到频域,从而能够识别难以通过传统规避技术加以混淆的异常波形特征。所提出的光谱纠缠指纹(SEF)框架利用功率谱密度、相干函数和基于熵的指标,提取表明未经授权加密活动的隐藏模式。检测准确性评估表明,频域变换在区分恶意进程与良性进程方面表现出优越性能,特别是在存在多态和变形修改的情况下。与已有方法的比较分析显示,基于频率的检测将误报率和漏报率降到最低,在不产生过多计算开销的情况下保证了运行效率。实验结果表明,加密数据流中的熵变化提供了有意义的分类线索,仅凭频谱特征就能区分不同的勒索软件家族。延迟评估证实,SEF的运行时间窗口允许进行主动干预,在数据完整性受损之前减轻加密造成的损害。可扩展性评估表明,即使同时执行多个勒索软件实例,该框架仍然有效,适用于高吞吐量环境。

更新时间: 2025-03-25 13:09:22

领域: cs.CR

下载: http://arxiv.org/abs/2502.01275v2

Optimization through In-Context Learning and Iterative LLM Prompting for Nuclear Engineering Design Problems

The optimization of nuclear engineering designs, such as nuclear fuel assembly configurations, involves managing competing objectives like reactivity control and power distribution. This study explores the use of Optimization by Prompting, an iterative approach utilizing large language models (LLMs), to address these challenges. The method is straightforward to implement, requiring no hyperparameter tuning or complex mathematical formulations. Optimization problems can be described in plain English, with only an evaluator and a parsing script needed for execution. The in-context learning capabilities of LLMs enable them to understand problem nuances, therefore, they have the potential to surpass traditional metaheuristic optimization methods. This study demonstrates the application of LLMs as optimizers to Boiling Water Reactor (BWR) fuel lattice design, showing the capability of commercial LLMs to achieve superior optimization results compared to traditional methods.
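
The loop the abstract describes (a plain-English problem statement, an evaluator, and iterative LLM proposals) can be sketched as follows; the `propose` stub stands in for the LLM call and the toy `evaluate` stands in for a reactor-physics evaluator, both of which are hypothetical:

```python
import random

random.seed(0)

def propose(problem, history):
    """Stand-in for the LLM call: the real method shows the model the
    plain-English problem plus scored past solutions and asks for a new
    candidate. Here, a random one-cell change to the best-so-far plays
    that role."""
    best = max(history, key=lambda h: h[1])[0]
    i = random.randrange(len(best))
    return best[:i] + [1 - best[i]] + best[i + 1:]

def evaluate(lattice):
    """Toy evaluator standing in for a reactor-physics code: reward
    alternating cell types (a made-up objective)."""
    return sum(a != b for a, b in zip(lattice, lattice[1:]))

history = [([0] * 8, 0)]  # initial lattice and its score
for _ in range(200):
    cand = propose("maximize alternation in an 8-cell lattice", history)
    history.append((cand, evaluate(cand)))
best_lattice, best_score = max(history, key=lambda h: h[1])
```

The appeal of the real method is exactly what this scaffold shows: no hyperparameters or gradients, just an evaluator, a parser, and a prompt that keeps folding scored solutions back into the context.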

Updated: 2025-03-25 13:08:46

标题: 核工程设计问题的优化通过上下文学习和迭代LLM提示

摘要: 核工程设计(例如核燃料组件配置)的优化涉及管理相互竞争的目标,如反应性控制和功率分布。本研究探讨了通过提示进行优化(Optimization by Prompting)这一利用大型语言模型(LLMs)的迭代方法,以应对这些挑战。该方法易于实施,无需超参数调整或复杂的数学公式。优化问题可以用简单的英语描述,只需一个评估器和一个解析脚本即可执行。LLMs的上下文学习能力使其能够理解问题的细微之处,因此有潜力超越传统的元启发式优化方法。本研究展示了将LLMs作为优化器应用于沸水反应堆(BWR)燃料栅格设计,表明商用LLMs能够取得优于传统方法的优化结果。

更新时间: 2025-03-25 13:08:46

领域: cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2503.19620v1

Exploring Robustness of Image Recognition Models on Hardware Accelerators

As the usage of Artificial Intelligence (AI) on resource-intensive and safety-critical tasks increases, a variety of Machine Learning (ML) compilers have been developed, enabling compatibility of Deep Neural Networks (DNNs) with a variety of hardware acceleration devices. However, given that DNNs are widely utilized for challenging and demanding tasks, the behavior of these compilers must be verified. To this direction, we propose MutateNN, a tool that utilizes elements of both differential and mutation testing in order to examine the robustness of image recognition models when deployed on hardware accelerators with different capabilities, in the presence of faults in their target device code - introduced either by developers, or problems in their compilation process. We focus on the image recognition domain by applying mutation testing to 7 well-established DNN models, introducing 21 mutations of 6 different categories. We deployed our mutants on 4 different hardware acceleration devices of varying capabilities and observed that DNN models presented discrepancies of up to 90.3% in mutants related to conditional operators across devices. We also observed that mutations related to layer modification, arithmetic types and input affected severely the overall model performance (up to 99.8%) or led to model crashes, in a consistent manner across devices.
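
Differential testing of operator mutants, the combination MutateNN builds on, can be sketched in a few lines (a toy example, not the tool's code):

```python
import operator

def relu_ref(x):
    """Reference conditional, as a compiler should lower it."""
    return x if x > 0 else 0.0

def make_mutant(cmp):
    """Conditional-operator mutant: the category that showed the
    largest cross-device discrepancies in the study."""
    return lambda x: x if cmp(x, 0) else 0.0

mutants = {">=": make_mutant(operator.ge), "<": make_mutant(operator.lt)}

inputs = [-1.5, 0.0, 2.0]
# Differential testing: a mutant is "killed" when some input makes its
# output diverge from the reference on a given device.
killed = {name: any(m(x) != relu_ref(x) for x in inputs)
          for name, m in mutants.items()}
```

The `>=` mutant survives here because `x > 0` and `x >= 0` only differ at exactly zero, where both branches return `0.0`; MutateNN's point is that compilers and devices may disagree on precisely such boundary cases, so the same mutant can be killed on one accelerator and survive on another.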

Updated: 2025-03-25 13:08:14

标题: 在硬件加速器上探索图像识别模型的稳健性

摘要: 随着人工智能(AI)在资源密集型和安全关键任务中的使用增加,各种机器学习(ML)编译器相继被开发出来,使深度神经网络(DNNs)能够兼容各种硬件加速设备。然而,鉴于DNNs被广泛应用于具有挑战性和高要求的任务,这些编译器的行为必须经过验证。为此,我们提出了MutateNN,这是一种结合差分测试和突变测试元素的工具,用于在目标设备代码存在故障(无论是开发人员引入的,还是编译过程中产生的问题)的情况下,检查部署在不同能力的硬件加速器上的图像识别模型的健壮性。我们专注于图像识别领域,对7个成熟的DNN模型应用突变测试,引入6个不同类别的21种突变。我们将突变体部署在4个能力各异的硬件加速设备上,观察到在与条件运算符相关的突变体上,DNN模型在不同设备间表现出高达90.3%的差异。我们还观察到,与层修改、算术类型和输入相关的突变严重影响整体模型性能(最高达99.8%)或导致模型崩溃,且这一现象在各设备上表现一致。

更新时间: 2025-03-25 13:08:14

领域: cs.LG,cs.SE,cs.SY,eess.SY

下载: http://arxiv.org/abs/2306.01697v6

ProtoGS: Efficient and High-Quality Rendering with 3D Gaussian Prototypes

3D Gaussian Splatting (3DGS) has made significant strides in novel view synthesis but is limited by the substantial number of Gaussian primitives required, posing challenges for deployment on lightweight devices. Recent methods address this issue by compressing the storage size of densified Gaussians, yet fail to preserve rendering quality and efficiency. To overcome these limitations, we propose ProtoGS to learn Gaussian prototypes to represent Gaussian primitives, significantly reducing the total Gaussian amount without sacrificing visual quality. Our method directly uses Gaussian prototypes to enable efficient rendering and leverage the resulting reconstruction loss to guide prototype learning. To further optimize memory efficiency during training, we incorporate structure-from-motion (SfM) points as anchor points to group Gaussian primitives. Gaussian prototypes are derived within each group by clustering of K-means, and both the anchor points and the prototypes are optimized jointly. Our experiments on real-world and synthetic datasets prove that we outperform existing methods, achieving a substantial reduction in the number of Gaussians, and enabling high rendering speed while maintaining or even enhancing rendering fidelity.

Updated: 2025-03-25 13:03:48

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.17486v2

Learning to chain-of-thought with Jensen's evidence lower bound

We propose a way to optimize chain-of-thought with reinforcement learning, but without an external reward function. Our algorithm relies on viewing the chain-of-thought as a latent variable in a probabilistic inference problem. In contrast to the full evidence lower bound, we propose to apply a much simpler Jensen's lower bound, which yields tractable objectives with simple algorithmic components (e.g., without the need for a parametric approximate posterior), making it more conducive to modern large-scale training. The lower-bound approach naturally interpolates between other methods such as supervised fine-tuning and online reinforcement learning, whose practical trade-offs we illustrate. Finally, we show that on mathematical reasoning problems, optimizing with Jensen's lower bound is as effective as policy gradient with an external reward. Taken together, our results serve as a proof of concept of this new algorithmic paradigm's potential for more general applications.
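
In standard notation (our reconstruction, not necessarily the paper's exact formulation), treating the chain-of-thought $z$ as a latent variable, the Jensen bound in question is:

```latex
% Jensen's inequality applied to the marginal likelihood, with chains z
% sampled from the model's own distribution rather than a learned posterior:
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\big[\, p_\theta(y \mid x, z) \big]
  \;\ge\; \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\big[\, \log p_\theta(y \mid x, z) \big]
```

Maximizing the right-hand side with chains sampled from the model itself is what removes the need for the parametric approximate posterior mentioned in the abstract.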

Updated: 2025-03-25 13:03:09

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19618v1

Framework for Progressive Knowledge Fusion in Large Language Models Through Structured Conceptual Redundancy Analysis

The organization of latent knowledge within large-scale models poses unique challenges when addressing overlapping representations and optimizing contextual accuracy. Conceptual redundancies embedded across layers often result in inefficiencies that affect both computational demands and task-specific outcomes. A framework was proposed to restructure these redundancies through advanced clustering techniques and dynamic thresholding, ensuring that critical semantic relationships are preserved while removing unnecessary overlaps. Evaluations revealed improved memory efficiency and faster inference times, alongside better alignment in latent knowledge clusters that enhanced interpretability. Improvements in error rates and adversarial robustness suggest that restructuring redundancies has broader implications for increasing model reliability across diverse applications. Comparative analyses highlighted reductions in resource consumption and notable gains in performance, particularly in translation and summarization tasks. Energy metrics demonstrated significant savings during training phases, further validating the practicality of the approach for real-world deployments. Representational fidelity was also enhanced, with latent space evaluations indicating better cluster alignment and higher semantic consistency. The methodology bridges a key gap in model optimization through directly addressing redundancies at the structural level. Its application opens avenues for scalable, efficient, and contextually aware systems that can adapt to complex, domain-specific tasks without compromising on performance.

Updated: 2025-03-25 12:59:14

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2501.13999v2

Hierarchical Manifold Projection for Ransomware Detection: A Novel Geometric Approach to Identifying Malicious Encryption Patterns

Encryption-based cyber threats continue to evolve, employing increasingly sophisticated techniques to bypass traditional detection mechanisms. Many existing classification strategies depend on static rule sets, signature-based matching, or machine learning models that require extensive labeled datasets, making them ineffective against emerging ransomware families that exhibit polymorphic and adversarial behaviors. A novel classification framework structured through hierarchical manifold projection introduces a mathematical approach to detecting malicious encryption workflows, preserving geometric consistencies that differentiate ransomware-induced modifications from benign cryptographic operations. The proposed methodology transforms encryption sequences into structured manifold embeddings, ensuring classification robustness through non-Euclidean feature separability rather than reliance on static indicators. Generalization capabilities remain stable across diverse ransomware variants, as hierarchical decomposition techniques capture multi-scale encryption characteristics while maintaining resilience against code obfuscation and execution flow modifications. Empirical analysis demonstrates that detection accuracy remains high even when encryption key variability, delayed execution tactics, or API call obfuscation strategies are introduced, reinforcing the reliability of manifold-based classification. Real-time scalability assessments confirm that the proposed approach maintains computational efficiency across increasing dataset volumes, validating its applicability to large-scale threat detection scenarios.

Updated: 2025-03-25 12:57:24

Categories: cs.CR

Download: http://arxiv.org/abs/2502.08013v2

Entropy-Synchronized Neural Hashing for Unsupervised Ransomware Detection

Entropy-based detection methodologies have gained significant attention due to their ability to analyze structural irregularities within executable files, particularly in the identification of malicious software employing advanced obfuscation techniques. The Entropy-Synchronized Neural Hashing (ESNH) framework introduces a novel approach that leverages entropy-driven hash representations to classify software binaries based on their underlying entropy characteristics. Through the synchronization of entropy profiles with neural network architectures, the model generates robust and unique hash values that maintain stability even when faced with polymorphic and metamorphic transformations. Comparative analysis against traditional detection approaches revealed superior performance in identifying novel threats, reducing false-positive rates, and achieving consistent classification across diverse ransomware families. The incorporation of a self-regulating hash convergence mechanism further ensured that entropy-synchronized hashes remained invariant across executions, minimizing classification inconsistencies that often arise due to dynamic modifications in ransomware payloads. Experimental results demonstrated high detection rates across contemporary ransomware strains, with the model exhibiting resilience against encryption-based evasion mechanisms, code injection strategies, and reflective loading techniques. Unlike conventional detection mechanisms that rely on static signatures and heuristic analysis, the proposed entropy-aware classification framework adapts to emerging threats through an inherent ability to capture entropy anomalies within executable structures. The findings reinforce the potential of entropy-based detection in addressing the limitations of traditional methodologies while enhancing detection robustness against obfuscation and adversarial evasion techniques.
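
The entropy-profile ingredient is standard and easy to sketch; the neural hashing and synchronization mechanism are the paper's contribution and are not reproduced here. A minimal sliding-window Shannon entropy profile over a binary:

```python
import math
from collections import Counter

def shannon_entropy(block: bytes) -> float:
    """Shannon entropy (bits per byte) of a byte block."""
    counts = Counter(block)
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_profile(data: bytes, window: int = 256):
    """Non-overlapping sliding-window entropy profile of a binary.
    High-entropy regions are typical of packed or encrypted sections."""
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, window)]
```

Profiles like this would be the raw input that an entropy-synchronized hash or classifier consumes.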

Updated: 2025-03-25 12:57:02

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2501.18131v2

Semantic Layered Embedding Diffusion in Large Language Models for Multi-Contextual Consistency

The Semantic Layered Embedding Diffusion (SLED) mechanism redefines the representation of hierarchical semantics within transformer-based architectures, enabling enhanced contextual consistency across a wide array of linguistic tasks. By introducing a multi-layered diffusion process grounded in spectral analysis, it achieves a complex balance between global and local semantic coherence. Experimental results demonstrate significant improvements in perplexity and BLEU scores, emphasizing the mechanism's ability to adapt effectively across diverse domains, including multilingual and cross-domain text generation. A rigorous mathematical framework underpins the embedding diffusion process, incorporating weighted adjacency matrices, kernel-based refinements, and dynamic layer-wise normalization. Error distribution analysis reveals that SLED addresses challenges in semantic alignment and coherence, outperforming baseline approaches across varied benchmarks. Scalability studies illustrate that its performance gains are maintained consistently across different model sizes, reflecting a practical balance between computational efficiency and linguistic precision. The implementation also achieves energy efficiency, reducing resource consumption during training and inference phases without compromising accuracy. Qualitative case studies further validate its adaptability to extended narratives and context-intensive scenarios, highlighting the mechanism's potential for real-world applications. SLED offers a different perspective on embedding design and its implications for advancing language modeling.

Updated: 2025-03-25 12:55:17

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2501.15405v2

Early Classification of Time Series: Taxonomy and Benchmark

In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see https://github.com/ML-EDM/ml_edm).
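
The earliness-vs-accuracy trade-off can be illustrated with a toy trigger rule (one of many strategies the survey's taxonomy covers; names and thresholds here are ours): commit to a label once confidence exceeds a threshold that relaxes as the time penalty accumulates.

```python
def early_classify(prob_stream, threshold=0.9, time_penalty=0.01):
    """Toy confidence-based ECTS trigger: commit to a label once the
    classifier's confidence exceeds a threshold that relaxes over time
    (the relaxation plays the role of the time penalty)."""
    label = None
    for t, probs in enumerate(prob_stream):
        label = max(probs, key=probs.get)
        if probs[label] >= threshold - time_penalty * t:
            return label, t        # early decision
    return label, t                # forced decision at the end of the series
```

`prob_stream` is the sequence of class-probability estimates produced as measurements arrive; returning earlier saves time but risks misclassification.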

Updated: 2025-03-25 12:53:06

Categories: cs.LG

Download: http://arxiv.org/abs/2406.18332v3

RL-finetuning LLMs from on- and off-policy data with a single algorithm

We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for fine-tuning large language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies the notion of consistency across any possible generation of the model. We derive algorithms that find optimal solutions via the sample-based policy gradient and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance over baseline algorithms on mathematical reasoning datasets.

Updated: 2025-03-25 12:52:38

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19612v1

Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation

Autoregressive (AR) models have demonstrated impressive capabilities in generating high-fidelity music. However, the conventional next-token prediction paradigm in AR models does not align with the human creative process in music composition, potentially compromising the musicality of generated samples. To overcome this limitation, we introduce MusiCoT, a novel chain-of-thought (CoT) prompting technique tailored for music generation. MusiCoT empowers the AR model to first outline an overall music structure before generating audio tokens, thereby enhancing the coherence and creativity of the resulting compositions. By leveraging the contrastive language-audio pretraining (CLAP) model, we establish a chain of "musical thoughts", making MusiCoT scalable and independent of human-labeled data, in contrast to conventional CoT methods. Moreover, MusiCoT allows for in-depth analysis of music structure, such as instrumental arrangements, and supports music referencing -- accepting variable-length audio inputs as optional style references. This innovative approach effectively addresses copying issues, positioning MusiCoT as a vital practical method for music prompting. Our experimental results indicate that MusiCoT consistently achieves superior performance across both objective and subjective metrics, producing music quality that rivals state-of-the-art generation models. Our samples are available at https://MusiCoT.github.io/.

Updated: 2025-03-25 12:51:21

Categories: cs.SD,cs.AI,cs.MM,eess.AS,eess.SP

Download: http://arxiv.org/abs/2503.19611v1

Nanopass Back-Translation of Call-Return Trees for Mechanized Secure Compilation Proofs

Researchers aim to build secure compilation chains enforcing that if there is no attack a source context can mount against a source program then there is also no attack an adversarial target context can mount against the compiled program. Proving that these compilation chains are secure is, however, challenging, and involves a non-trivial back-translation step: for any attack a target context mounts against the compiled program one has to exhibit a source context mounting the same attack against the source program. We describe a novel back-translation technique, which results in simpler proofs that can be more easily mechanized in a proof assistant. Given a finite set of finite trace prefixes, capturing the interaction recorded during an attack between a target context and the compiled program, we build a call-return tree that we back-translate into a source context producing the same trace prefixes. We use state in the generated source context to record the current location in the call-return tree. The back-translation is done in several small steps, each adding to the tree new information describing how the location should change depending on how the context regains control. To prove this back-translation correct we give semantics to every intermediate call-return tree language, using ghost state to store information and explicitly enforce execution invariants. We prove several small forward simulations, basically seeing the back-translation as a verified nanopass compiler. Thanks to this modular structure, we are able to mechanize this complex back-translation and its correctness proof in the Rocq prover without too much effort.

Updated: 2025-03-25 12:50:35

Categories: cs.PL,cs.CR

Download: http://arxiv.org/abs/2503.19609v1

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of the input token space makes it inevitable that adversarial prompts capable of jailbreaking these models exist, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with the objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose Response Guided Question Augmentation (ReG-QA), a method to evaluate the generalization of safety-aligned LLMs to natural prompts, which first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and further leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to or better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.
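
The two-step pipeline (Q to A, then A to Q) can be sketched as below. `unaligned_generate` and `question_generate` are placeholder callables standing in for LLM sampling, not real APIs, and the prompt wording is ours.

```python
# Hypothetical sketch of the ReG-QA pipeline (Q -> A, then A -> Q).

def reg_qa(seed_question, unaligned_generate, question_generate,
           n_answers=5, n_questions=5):
    """Generate candidate natural prompts from a toxic seed question."""
    # Q -> A: sample answers to the seed question from an unaligned model
    answers = [unaligned_generate(seed_question) for _ in range(n_answers)]
    # A -> Q: ask a model for natural questions likely to elicit each answer
    prompts = []
    for ans in answers:
        for _ in range(n_questions):
            prompts.append(question_generate(
                f"Write a question to which the following is a plausible answer:\n{ans}"))
    return prompts  # candidate prompts to evaluate against the target LLM
```

The resulting prompts are then scored by whether they elicit unsafe responses from the safety-aligned target model.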

Updated: 2025-03-25 12:49:43

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2412.03235v2

Enabling Rapid Shared Human-AI Mental Model Alignment via the After-Action Review

In this work, we present two novel contributions toward improving research in human-machine teaming (HMT): 1) a Minecraft testbed to accelerate testing and deployment of collaborative AI agents and 2) a tool to allow users to revisit and analyze behaviors within an HMT episode to facilitate shared mental model development. Our browser-based Minecraft testbed allows for rapid testing of collaborative agents in a continuous-space, real-time, partially-observable environment with real humans without cumbersome setup typical to human-AI interaction user studies. As Minecraft has an extensive player base and a rich ecosystem of pre-built AI agents, we hope this contribution can help to facilitate research quickly in the design of new collaborative agents and in understanding different human factors within HMT. Our mental model alignment tool facilitates user-led post-mission analysis by including video displays of first-person perspectives of the team members (i.e., the human and AI) that can be replayed, and a chat interface that leverages GPT-4 to provide answers to various queries regarding the AI's experiences and model details.

Updated: 2025-03-25 12:43:18

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2503.19607v1

Lean Formalization of Generalization Error Bound by Rademacher Complexity

We formalize the generalization error bound using Rademacher complexity in the Lean 4 theorem prover. Generalization error quantifies the gap between a learning machine's performance on given training data versus unseen test data, and Rademacher complexity serves as an estimate of this error based on the complexity of learning machines, or hypothesis class. Unlike traditional methods such as PAC learning and VC dimension, Rademacher complexity is applicable across diverse machine learning scenarios including deep learning and kernel methods. We formalize key concepts and theorems, including the empirical and population Rademacher complexities, and establish generalization error bounds through formal proofs of McDiarmid's inequality, Hoeffding's lemma, and symmetrization arguments.
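
For reference, the central quantities are (standard textbook definitions; the Lean statements may differ in detail):

```latex
% Empirical Rademacher complexity of a class F on a sample S = (x_1, ..., x_n),
% with sigma_i i.i.d. uniform on {-1, +1}:
\widehat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[\, \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, f(x_i) \right]
% Resulting bound: for F with values in [0, 1], with probability at least 1 - delta,
% every f in F satisfies
\mathbb{E}[f(x)] \;\le\; \frac{1}{n} \sum_{i=1}^{n} f(x_i)
  \;+\; 2\,\mathfrak{R}_n(\mathcal{F}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}
```

Here $\mathfrak{R}_n$ is the population Rademacher complexity (the expectation of $\widehat{\mathfrak{R}}_S$ over samples); McDiarmid's inequality and symmetrization, mentioned in the abstract, are exactly the tools used to prove the second display.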

Updated: 2025-03-25 12:40:43

Categories: cs.LG,cs.CL,math.ST,stat.TH

Download: http://arxiv.org/abs/2503.19605v1

Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking

Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs), which employ extended thinking processes with reflection and self-correction capabilities, demonstrating the effectiveness of test-time scaling. RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting, a popular In-Context Learning (ICL) method for chat LLMs, necessary to enhance the reasoning capability of RLLMs?" In this work, we present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks. We examine models ranging from 1.5B to 32B parameters, finding that contrary to concerns, CoT prompting significantly enhances RLLMs' performance in most scenarios. Our results reveal distinct patterns: large-capacity models show minimal improvement on simple tasks but substantial gains on complex problems, while smaller models exhibit the opposite behavior. Further analysis demonstrates that CoT prompting effectively controls the distribution of the numbers of thinking tokens and reasoning steps, reducing excessive reflections by approximately 90% in some cases. Moreover, attention logits analysis reveals the RLLMs' overfitting to reflection-related words, which is mitigated by external CoT guidance. Notably, our experiments indicate that for RLLMs, one-shot CoT consistently yields superior performance compared to Few-shot CoT approaches. Our findings provide important insights for optimizing RLLMs' performance through appropriate prompting strategies.

Updated: 2025-03-25 12:37:22

Categories: cs.AI

Download: http://arxiv.org/abs/2503.19602v1

RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression

Video encoders optimize compression for human perception by minimizing reconstruction error under bit-rate constraints. In many modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks like object recognition or segmentation, rather than being watched by humans. It is therefore useful to optimize the encoder for a downstream task instead of for perceptual image quality. However, a major challenge is how to combine such downstream optimization with existing standard video encoders, which are highly efficient and popular. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task. This granular control allows us to prioritize encoding for task-relevant regions within each frame. We formulate this optimization problem as a Reinforcement Learning (RL) task, where the agent learns to balance long-term implications of choosing QPs on both task performance and bit-rate constraints. Notably, our policy does not require the downstream task as an input during inference, making it suitable for streaming applications and edge devices such as vehicles. We demonstrate significant improvements in two tasks, car detection, and ROI (saliency) encoding. Our approach improves task performance for a given bit rate compared to traditional task agnostic encoding methods, paving the way for more efficient task-aware video compression.

Updated: 2025-03-25 12:33:41

Categories: cs.LG,cs.CV,eess.IV

Download: http://arxiv.org/abs/2501.12216v2

HoarePrompt: Structural Reasoning About Program Correctness in Natural Language

While software requirements are often expressed in natural language, verifying the correctness of a program against natural language requirements is a hard and underexplored problem. Large language models (LLMs) are promising candidates for addressing this challenge, however our experience shows that they are ineffective in this task, often failing to detect even straightforward bugs. To address this gap, we introduce HoarePrompt, a novel approach that adapts fundamental ideas from program analysis and verification to natural language artifacts. Drawing inspiration from the strongest postcondition calculus, HoarePrompt employs a systematic, step-by-step process in which an LLM generates natural language descriptions of reachable program states at various points in the code. To manage loops, we propose few-shot-driven k-induction, an adaptation of the k-induction method widely used in model checking. Once program states are described, HoarePrompt leverages the LLM to assess whether the program, annotated with these state descriptions, conforms to the natural language requirements. For evaluating the quality of classifiers of program correctness with respect to natural language requirements, we constructed CoCoClaNeL, a challenging dataset of solutions to programming competition problems. Our experiments show that HoarePrompt improves the MCC by 62% compared to directly using Zero-shot-CoT prompts for correctness classification. Furthermore, HoarePrompt outperforms a classifier that assesses correctness via LLM-based test generation by increasing the MCC by 93%. The inductive reasoning mechanism contributes a 28% boost to MCC, underscoring its effectiveness in managing loops.
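
A purely symbolic toy illustration of the strongest-postcondition idea: propagate the reachable program state forward through straight-line assignments. HoarePrompt does this in natural language with an LLM (and handles loops via few-shot-driven k-induction); the helper names below are ours, not the tool's API.

```python
def propagate_state(state, stmts):
    """Track reachable variable values through straight-line code.
    Each statement is (target_var, fn, arg_names); fn computes the
    new value of target_var from the current values of arg_names."""
    for var, fn, args in stmts:
        state = {**state, var: fn(*(state[a] for a in args))}
    return state

program = [
    ("y", lambda x: x + 1, ("x",)),         # y := x + 1
    ("z", lambda x, y: x * y, ("x", "y")),  # z := x * y
]
final = propagate_state({"x": 3}, program)  # state reachable after the program
```

Checking the final state against a requirement is the natural-language analogue of verifying a postcondition.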

Updated: 2025-03-25 12:30:30

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2503.19599v1

Promoting Segment Anything Model towards Highly Accurate Dichotomous Image Segmentation

The Segment Anything Model (SAM) represents a significant breakthrough in foundation models for computer vision, providing a large-scale image segmentation model. However, despite SAM's zero-shot performance, its segmentation masks lack fine-grained detail, particularly in accurately delineating object boundaries. It is therefore both interesting and valuable to explore whether SAM can be improved towards highly accurate object segmentation, a problem known as the dichotomous image segmentation (DIS) task. To address this issue, we propose DIS-SAM, which advances SAM towards DIS with extremely accurate details. DIS-SAM is a framework specifically tailored for highly accurate segmentation that maintains SAM's promptable design. It employs a two-stage approach, integrating SAM with a modified advanced network previously designed to handle the prompt-free DIS task. To better train DIS-SAM, we employ a ground-truth enrichment strategy that modifies the original mask annotations. Despite its simplicity, DIS-SAM significantly outperforms SAM, HQ-SAM, and Pi-SAM by ~8.5%, ~6.9%, and ~3.7% in maximum F-measure, respectively. Our code is available at https://github.com/Tennine2077/DIS-SAM

Updated: 2025-03-25 12:24:08

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2401.00248v4

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

In this work, we investigate the merits of explicitly optimizing for inference time algorithmic performance during model training. We show how optimizing for inference time performance can improve overall model efficacy. We consider generic inference time objectives with $k$ samples, with a focus on pass@$k$ and majority voting as two main applications. With language model training on reasoning datasets, we showcase the performance trade-off enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@$k$ objectives compared to the baseline method.
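The pass@$k$ objective named above is commonly estimated with the standard unbiased combinatorial estimator over $n$ samples of which $c$ are correct; a minimal sketch, assuming that standard estimator rather than the paper's own implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k samples drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 generations of which 1 is correct, pass@1 is 0.5: a single draw hits the correct generation half the time.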

Updated: 2025-03-25 12:21:26

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19595v1

Solvation Free Energies from Neural Thermodynamic Integration

We present a method for computing free-energy differences using thermodynamic integration with a neural network potential that interpolates between two target Hamiltonians. The interpolation is defined at the sample distribution level, and the neural network potential is optimized to match the corresponding equilibrium potential at every intermediate time-step. Once the interpolating potentials and samples are well-aligned, the free-energy difference can be estimated using (neural) thermodynamic integration. To target molecular systems, we simultaneously couple Lennard-Jones and electrostatic interactions and model the rigid-body rotation of molecules. We report accurate results for several benchmark systems: a Lennard-Jones particle in a Lennard-Jones fluid, as well as the insertion of both water and methane solutes in a water solvent at atomistic resolution using a simple three-body neural-network potential.
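Once the per-step ensemble averages of the potential's derivative with respect to the coupling parameter are available, the thermodynamic-integration estimate itself reduces to a one-dimensional quadrature over lambda in [0, 1]. A minimal sketch of that final step using the trapezoidal rule (the neural potential and the sampling of the averages are out of scope here):

```python
def free_energy_difference(lambdas, du_dlambda_means):
    """Trapezoidal thermodynamic integration:
    Delta F ~= integral over lambda of <dU/dlambda>_lambda,
    given precomputed ensemble averages at each lambda grid point."""
    df = 0.0
    for i in range(len(lambdas) - 1):
        dl = lambdas[i + 1] - lambdas[i]
        df += 0.5 * (du_dlambda_means[i] + du_dlambda_means[i + 1]) * dl
    return df
```

The accuracy of the quadrature hinges on how well the interpolating potentials and samples are aligned at each intermediate step, which is precisely what the proposed optimization targets.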

Updated: 2025-03-25 12:20:29

Categories: cond-mat.stat-mech,cs.LG

Download: http://arxiv.org/abs/2410.15815v3

Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond

The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to "relearning" the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal's impact in robustifying LLM unlearning. Codes are available at https://github.com/OPTML-Group/Unlearn-Smooth.
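Sharpness-aware minimization, which the paper connects to robust unlearning, first perturbs the weights toward a nearby worst case within a small ball, then takes the descent step using the gradient at the perturbed point. A minimal scalar sketch of one SAM update (the toy loss and hyperparameters are illustrative, not the paper's setup):

```python
def sam_step(w, grad_fn, rho=0.05, lr=0.1):
    """One sharpness-aware minimization (SAM) step for a scalar parameter:
    ascend to the approximate worst case within an L2 ball of radius rho,
    then descend using the gradient evaluated at the perturbed point."""
    g = grad_fn(w)
    norm = abs(g) or 1e-12            # guard against division by zero
    w_adv = w + rho * g / norm        # ascent toward the sharpest neighbor
    g_adv = grad_fn(w_adv)            # gradient at the perturbed weights
    return w - lr * g_adv
```

On a smooth quadratic loss such as L(w) = w^2 (gradient 2w), repeated SAM steps still drive the loss down while implicitly penalizing sharp minima.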

Updated: 2025-03-25 12:18:42

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2502.05374v3

Towards Understanding the Influence of Training Samples on Explanations

Explainable AI (XAI) is widely used to analyze AI systems' decision-making, such as providing counterfactual explanations for recourse. When unexpected explanations occur, users may want to understand the training data properties shaping them. Under the umbrella of data valuation, first approaches have been proposed that estimate the influence of data samples on a given model. This process not only helps determine the data's value, but also offers insights into how individual, potentially noisy, or misleading examples affect a model, which is crucial for interpretable AI. In this work, we apply the concept of data valuation to the significant area of model evaluations, focusing on how individual training samples impact a model's internal reasoning rather than the predictive performance only. Hence, we introduce the novel problem of identifying training samples shaping a given explanation or related quantity, and investigate the particular case of the cost of computational recourse. We propose an algorithm to identify such influential samples and conduct extensive empirical evaluations in two case studies.

Updated: 2025-03-25 12:17:25

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.03012v2

Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization

With the widespread application of automatic speech recognition (ASR) systems, their vulnerability to adversarial attacks has been extensively studied. However, most existing adversarial examples are generated on specific individual models, resulting in a lack of transferability. In real-world scenarios, attackers often cannot access detailed information about the target model, making query-based attacks unfeasible. To address this challenge, we propose a technique called Acoustic Representation Optimization that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models. Rather than relying on model-specific, higher-layer abstractions, our approach leverages fundamental acoustic representations that remain consistent across diverse ASR architectures. By enforcing an acoustic representation loss to guide perturbations toward these robust, lower-level representations, we enhance the cross-model transferability of adversarial examples without degrading audio quality. Our method is plug-and-play and can be integrated with any existing attack methods. We evaluate our approach on three modern ASR models, and the experimental results demonstrate that our method significantly improves the transferability of adversarial examples generated by previous methods while preserving the audio quality.

Updated: 2025-03-25 12:14:10

Categories: cs.SD,cs.CR,cs.LG,eess.AS

Download: http://arxiv.org/abs/2503.19591v1

Multi-agent Application System in Office Collaboration Scenarios

This paper introduces a multi-agent application system designed to enhance office collaboration efficiency and work quality. The system integrates artificial intelligence, machine learning, and natural language processing technologies, achieving functionalities such as task allocation, progress monitoring, and information sharing. The agents within the system are capable of providing personalized collaboration support based on team members' needs and incorporate data analysis tools to improve decision-making quality. The paper also proposes an intelligent agent architecture that separates Plan and Solver, and through techniques such as multi-turn query rewriting and business tool retrieval, it enhances the agent's multi-intent and multi-turn dialogue capabilities. Furthermore, the paper details the design of tools and multi-turn dialogue in the context of office collaboration scenarios, and validates the system's effectiveness through experiments and evaluations. Ultimately, the system has demonstrated outstanding performance in real business applications, particularly in query understanding, task planning, and tool calling. Looking forward, the system is expected to play a more significant role in addressing complex interaction issues within dynamic environments and large-scale multi-agent systems.

Updated: 2025-03-25 12:07:20

Categories: cs.AI,cs.CL,cs.SE

Download: http://arxiv.org/abs/2503.19584v1

Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2

Large language models (LLMs) deliver impressive performance but require large amounts of energy. In this work, we present a MatMul-free LLM architecture adapted for Intel's neuromorphic processor, Loihi 2. Our approach leverages Loihi 2's support for low-precision, event-driven computation and stateful processing. Our hardware-aware quantized model on GPU demonstrates that a 370M parameter MatMul-free model can be quantized with no accuracy loss. Based on preliminary results, we report up to 3x higher throughput with 2x less energy, compared to transformer-based LLMs on an edge GPU, with significantly better scaling. Further hardware optimizations will increase throughput and decrease energy consumption. These results show the potential of neuromorphic hardware for efficient inference and pave the way for efficient reasoning models capable of generating complex, long-form text rapidly and cost-effectively.

Updated: 2025-03-25 12:05:26

Categories: cs.NE,cs.AI,cs.AR,cs.LG

Download: http://arxiv.org/abs/2503.18002v2

Post-Hoc Calibrated Anomaly Detection

Deep unsupervised anomaly detection has seen improvements from a supervised binary classification paradigm in which auxiliary external data is included in the training set as anomalous data, a process referred to as outlier exposure. This opens the possibility of exploring the efficacy of post-hoc calibration for anomaly detection and localization. Post-hoc Platt scaling and Beta calibration are found to improve results, both with gradient-based input perturbation and with post-hoc training, under a strictly proper loss, of a base model initially trained on an unsupervised loss. Post-hoc calibration is also at times found to be more effective when randomly synthesized spectral data are used as labeled anomalous data in the calibration set, suggesting that outlier exposure is superior only for initial training.
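Platt scaling, one of the post-hoc calibrators studied, fits a logistic map sigmoid(a*s + b) from raw anomaly scores s to calibrated probabilities. A minimal sketch by gradient descent on the log loss (the interface and hyperparameters are illustrative, not the paper's code):

```python
from math import exp

def platt_fit(scores, labels, lr=0.1, steps=2000):
    """Fit Platt scaling parameters (a, b) so that sigmoid(a*s + b)
    approximates P(y=1 | score s), by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + exp(-(a * s + b)))
            ga += (p - y) * s / n   # gradient of the mean log loss w.r.t. a
            gb += (p - y) / n       # gradient of the mean log loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b
```

Because only two parameters are fit, Platt scaling needs few labeled calibration points, which is what makes synthesized anomalous data in the calibration set attractive.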

Updated: 2025-03-25 11:55:19

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19577v1

SINR: Sparsity Driven Compressed Implicit Neural Representations

Implicit Neural Representations (INRs) are increasingly recognized as a versatile data modality for representing discretized signals, offering benefits such as infinite query resolution and reduced storage requirements. Existing signal compression approaches for INRs typically employ one of two strategies: 1. direct quantization with entropy coding of the trained INR; 2. deriving a latent code on top of the INR through a learnable transformation. Thus, their performance is heavily dependent on the quantization and entropy coding schemes employed. In this paper, we introduce SINR, an innovative compression algorithm that leverages the patterns in the vector spaces formed by weights of INRs. We compress these vector spaces using a high-dimensional sparse code within a dictionary. Further analysis reveals that the atoms of the dictionary used to generate the sparse code do not need to be learned or transmitted to successfully recover the INR weights. We demonstrate that the proposed approach can be integrated with any existing INR-based signal compression technique. Our results indicate that SINR achieves substantial reductions in storage requirements for INRs across various configurations, outperforming conventional INR-based compression baselines. Furthermore, SINR maintains high-quality decoding across diverse data modalities, including images, occupancy fields, and Neural Radiance Fields.
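The claim that a fixed, untrained dictionary can sparse-code the INR weight vectors can be illustrated with greedy matching pursuit; a minimal sketch under that assumption (this is a generic sparse-coding routine over unit-norm atoms, not SINR's actual algorithm):

```python
def matching_pursuit(signal, dictionary, n_atoms=2):
    """Greedy sparse coding: represent `signal` as a sparse combination of
    fixed, unit-norm dictionary atoms, without learning the atoms."""
    residual = list(signal)
    code = {}  # atom index -> coefficient
    for _ in range(n_atoms):
        # pick the atom most correlated with the current residual
        best, best_dot = None, 0.0
        for i, atom in enumerate(dictionary):
            d = sum(r * a for r, a in zip(residual, atom))
            if abs(d) > abs(best_dot):
                best, best_dot = i, d
        if best is None:
            break
        code[best] = code.get(best, 0.0) + best_dot
        residual = [r - best_dot * a
                    for r, a in zip(residual, dictionary[best])]
    return code, residual
```

Since both encoder and decoder can regenerate the same fixed dictionary (e.g. from a shared random seed), only the sparse coefficients need to be stored or transmitted, which is the storage saving the abstract describes.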

Updated: 2025-03-25 11:53:51

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.19576v1

Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification

This study explores open questions in the application of machine learning for breast cancer detection in mammograms. Current approaches often employ a two-stage transfer learning process: first, adapting a backbone model trained on natural images to develop a patch classifier, which is then used to create a single-view whole-image classifier. Additionally, many studies leverage both mammographic views to enhance model performance. In this work, we systematically investigate five key questions: (1) Is the intermediate patch classifier essential for optimal performance? (2) Do backbone models that excel in natural image classification consistently outperform others on mammograms? (3) When reducing mammogram resolution for GPU processing, does the learn-to-resize technique outperform conventional methods? (4) Does incorporating both mammographic views in a two-view classifier significantly improve detection accuracy? (5) How do these findings vary when analyzing low-quality versus high-quality mammograms? By addressing these questions, we developed models that outperform previous results for both single-view and two-view classifiers. Our findings provide insights into model architecture and transfer learning strategies contributing to more accurate and efficient mammogram analysis.

Updated: 2025-03-25 11:51:21

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.19945v1

Entropy Collapse in Mobile Sensors: The Hidden Risks of Sensor-Based Security

Mobile sensor data has been proposed for security-critical applications such as device pairing, proximity detection, and continuous authentication. However, the foundational assumption that these signals provide sufficient entropy remains under-explored. In this work, we systematically analyse the entropy of mobile sensor data across four diverse datasets spanning multiple application contexts. Our findings reveal pervasive biases, with single-sensor mean min-entropy values ranging from 3.408-4.483 bits (S.D.=1.018-1.574) despite Shannon entropy being several multiples higher. We further demonstrate that correlations between sensor modalities reduce the worst-case entropy of using multiple sensors by up to approx. 75% compared to average-case Shannon entropy. This brings joint min-entropy well below 10 bits in many cases and, even in the best case, yields only approx. 24 bits of min-entropy when combining 20 sensor modalities. These results call into question the widely held assumption that adding more sensors inherently yields higher security. We ultimately caution against relying on raw sensor data as a primary source of randomness.
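The gap between Shannon entropy and min-entropy that drives these findings is straightforward to compute from an outcome distribution: Shannon entropy measures the average surprise, while min-entropy reflects only the most likely outcome, the one an attacker would guess first. A minimal sketch:

```python
from math import log2

def shannon_entropy(probs):
    """Average-case entropy in bits: -sum p * log2(p)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def min_entropy(probs):
    """Worst-case (min-)entropy in bits: -log2 of the most likely outcome."""
    return -log2(max(probs))
```

For a uniform distribution the two coincide, but for the biased sensor readings reported above, min-entropy can sit far below Shannon entropy, which is why security arguments must use the former.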

Updated: 2025-03-25 11:42:52

Categories: cs.CR

Download: http://arxiv.org/abs/2502.09535v4

A Schema-aware Logic Reformulation for Graph Reachability

Graph reachability is the task of understanding whether two distinct points in a graph are interconnected by arcs to which, in general, a semantics is attached. Reachability has plenty of applications, ranging from motion planning to routing. Improving reachability requires structural knowledge of relations so as to avoid the complexity of the traditional depth-first and breadth-first strategies implemented in logic languages. In some contexts, graphs are enriched with their schema definitions, establishing a domain and range for every arc. Introducing a schema-aware formalization to guide the search may yield a substantial improvement by cutting out useless paths and prioritising those that, in principle, reach the target earlier. In this work, we propose a strategy to automatically exclude and sort certain graph paths by exploiting the higher-level conceptualization of instances. The aim is to obtain a new first-order logic reformulation of the graph reachability scenario, capable of improving the traditional algorithms in terms of time, space requirements, and number of backtracks. The experiments exhibit the expected advantages of the approach, reducing the number of backtracks during the search and thereby saving time and space as well.

Updated: 2025-03-25 11:41:51

Categories: cs.AI

Download: http://arxiv.org/abs/2410.02533v2

Data-Driven Analysis of AI in Medical Device Software in China: Deep Learning and General AI Trends Based on Regulatory Data

Artificial intelligence (AI) in medical device software (MDSW) represents a transformative clinical technology, attracting increasing attention within both the medical community and the regulators. In this study, we leverage a data-driven approach to automatically extract and analyze AI-enabled medical devices (AIMD) from the National Medical Products Administration (NMPA) regulatory database. The continued increase in publicly available regulatory data requires scalable methods for analysis. Automation of regulatory information screening is essential to create reproducible insights that can be quickly updated in an ever-changing medical device landscape. More than 4 million entries were assessed, identifying 2,174 MDSW registrations, including 531 standalone applications and 1,643 integrated within medical devices, of which 43 were AI-enabled. It was shown that the leading medical specialties utilizing AIMD include respiratory (20.5%), ophthalmology/endocrinology (12.8%), and orthopedics (10.3%). This approach greatly improves the speed of data extraction, providing a greater ability to compare and contrast. This study provides the first extensive, data-driven exploration of AIMD in China, showcasing the potential of automated regulatory data analysis in understanding and advancing the landscape of AI in medical technology.

Updated: 2025-03-25 11:39:49

Categories: cs.AI

Download: http://arxiv.org/abs/2411.07378v2

Shot Sequence Ordering for Video Editing: Benchmarks, Metrics, and Cinematology-Inspired Computing Methods

With the rising popularity of short video platforms, the demand for video production has increased substantially. However, high-quality video creation continues to rely heavily on professional editing skills and a nuanced understanding of visual language. To address this challenge, the Shot Sequence Ordering (SSO) task in AI-assisted video editing has emerged as a pivotal approach for enhancing video storytelling and the overall viewing experience. Nevertheless, the progress in this field has been impeded by a lack of publicly available benchmark datasets. In response, this paper introduces two novel benchmark datasets, AVE-Order and ActivityNet-Order. Additionally, we employ the Kendall Tau distance as an evaluation metric for the SSO task and propose the Kendall Tau Distance-Cross Entropy Loss. We further introduce the concept of Cinematology Embedding, which incorporates movie metadata and shot labels as prior knowledge into the SSO model, and constructs the AVE-Meta dataset to validate the method's effectiveness. Experimental results indicate that the proposed loss function and method substantially enhance SSO task accuracy. All datasets are publicly accessible at https://github.com/litchiar/ShotSeqBench.
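The Kendall Tau distance adopted as the SSO evaluation metric counts discordant pairs between the predicted and ground-truth shot orders; a minimal sketch, normalized to [0, 1] (0 = identical ordering, 1 = fully reversed):

```python
def kendall_tau_distance(order_a, order_b):
    """Fraction of item pairs whose relative order differs
    between two orderings of the same items."""
    pos = {item: i for i, item in enumerate(order_b)}
    n = len(order_a)
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # a pair is discordant if order_b reverses order_a's ordering
            if pos[order_a[i]] > pos[order_a[j]]:
                discordant += 1
    return discordant / (n * (n - 1) / 2)
```

Because the metric is defined over pairs rather than exact positions, it rewards nearly-correct orderings, which is presumably what motivates pairing it with a cross-entropy term in the proposed loss.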

Updated: 2025-03-25 11:37:52

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.17975v2

Probabilistic Shielding for Safe Reinforcement Learning

In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise their reward, must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus have limited scaling. In this paper we present a new, scalable method, which enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known, and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.

Updated: 2025-03-25 11:31:43

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.07671v3

FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments

As artificial intelligence systems increasingly operate in real-world environments, the integration of multi-modal data sources such as vision, language, and audio presents both unprecedented opportunities and critical challenges for achieving trustworthy intelligence. In this paper, we propose a novel framework that unifies federated learning with explainable multi-modal reasoning to ensure trustworthiness in decentralized, dynamic settings. Our approach, called FedMM-X (Federated Multi-Modal Explainable Intelligence), leverages cross-modal consistency checks, client-level interpretability mechanisms, and dynamic trust calibration to address challenges posed by data heterogeneity, modality imbalance, and out-of-distribution generalization. Through rigorous evaluation across federated multi-modal benchmarks involving vision-language tasks, we demonstrate improved performance in both accuracy and interpretability while reducing vulnerabilities to adversarial and spurious correlations. Further, we introduce a novel trust score aggregation method to quantify global model reliability under dynamic client participation. Our findings pave the way toward developing robust, interpretable, and socially responsible AI systems in real-world environments.
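The abstract does not specify how trust scores enter the aggregation; one plausible reading is a trust-weighted parameter average across clients. A hypothetical sketch under that assumption (the function name, interface, and weighting scheme are illustrative, not FedMM-X's actual method):

```python
def trust_weighted_average(client_params, trust_scores):
    """Aggregate per-client parameter vectors into a global model,
    weighting each client's contribution by its normalized trust score."""
    total = sum(trust_scores)
    weights = [t / total for t in trust_scores]
    dim = len(client_params[0])
    return [sum(w * p[d] for w, p in zip(weights, client_params))
            for d in range(dim)]
```

Under uniform trust this reduces to plain federated averaging; down-weighting low-trust clients is one simple way such a scheme could limit the influence of unreliable or adversarial participants.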

Updated: 2025-03-25 11:28:21

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19564v1

Practical multi-fidelity machine learning: fusion of deterministic and Bayesian models

Multi-fidelity machine learning methods address the accuracy-efficiency trade-off by integrating scarce, resource-intensive high-fidelity data with abundant but less accurate low-fidelity data. We propose a practical multi-fidelity strategy for problems spanning low- and high-dimensional domains, integrating a non-probabilistic regression model for the low-fidelity with a Bayesian model for the high-fidelity. The models are trained in a staggered scheme, where the low-fidelity model is transfer-learned to the high-fidelity data and a Bayesian model is trained to learn the residual between the data and the transfer-learned model. This three-model strategy -- deterministic low-fidelity, transfer-learning, and Bayesian residual -- leads to a prediction that includes uncertainty quantification for noisy and noiseless multi-fidelity data. The strategy is general and unifies the topic, highlighting the expressivity trade-off between the transfer-learning and Bayesian models (a complex transfer-learning model leads to a simpler Bayesian model, and vice versa). We propose modeling choices for two scenarios, and argue in favor of using a linear transfer-learning model that fuses 1) kernel ridge regression for low-fidelity with Gaussian processes for high-fidelity; or 2) deep neural network for low-fidelity with a Bayesian neural network for high-fidelity. We demonstrate the effectiveness and efficiency of the proposed strategies and contrast them with the state-of-the-art based on various numerical examples and two engineering problems. The results indicate that the proposed approach achieves comparable performance in both mean and uncertainty estimation while significantly reducing training time for machine learning modeling in data-scarce scenarios. Moreover, in data-rich settings, it outperforms other multi-fidelity architectures by effectively mitigating overfitting.
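The staggered three-model strategy can be caricatured in a few lines: a deterministic low-fidelity model, a linear "transfer" fit on the few high-fidelity points, and, standing in here for the Bayesian residual model, a mean residual correction. A hypothetical sketch under those simplifications, not the paper's implementation:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~= a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def multifidelity_predict(x, lf_model, hf_xs, hf_ys):
    """Two-step surrogate: linearly transfer the low-fidelity model to the
    few high-fidelity points, then add the mean residual as a stand-in
    for the Bayesian residual model."""
    lf_vals = [lf_model(t) for t in hf_xs]
    a, b = fit_line(lf_vals, hf_ys)                       # linear transfer
    resid = [y - (a * lf_model(t) + b) for t, y in zip(hf_xs, hf_ys)]
    correction = sum(resid) / len(resid)                  # residual "model"
    return a * lf_model(x) + b + correction
```

The expressivity trade-off noted in the abstract shows up directly here: the richer the transfer model, the less structure is left for the residual model to capture, and vice versa.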

Updated: 2025-03-25 11:25:04

Categories: cs.LG,math.PR,stat.ML

Download: http://arxiv.org/abs/2407.15110v2

Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models

This paper presents an unsupervised method for single-channel blind dereverberation and room impulse response (RIR) estimation, called BUDDy. The algorithm is rooted in Bayesian posterior sampling: it combines a likelihood model enforcing fidelity to the reverberant measurement, and an anechoic speech prior implemented by an unconditional diffusion model. We design a parametric filter representing the RIR, with exponential decay for each frequency subband. Room acoustics estimation and speech dereverberation are jointly carried out, as the filter parameters are iteratively estimated and the speech utterance refined along the reverse diffusion trajectory. In a blind scenario where the RIR is unknown, BUDDy successfully performs speech dereverberation in various acoustic scenarios, significantly outperforming other blind unsupervised baselines. Unlike supervised methods, which often struggle to generalize, BUDDy seamlessly adapts to different acoustic conditions. This paper extends our previous work by offering new experimental results and insights into the algorithm's versatility. We demonstrate the robustness of our proposed method to new acoustic and speaker conditions, as well as its adaptability to high-resolution singing voice dereverberation, using both instrumental metrics and subjective listening evaluation. We study BUDDy's performance for RIR estimation and observe it surpasses a state-of-the-art supervised DNN-based estimator on mismatched acoustic conditions. Finally, we investigate the sensitivity of informed dereverberation methods to RIR estimation errors, thereby motivating the joint acoustic estimation and dereverberation design. Audio examples and code can be found online.

Updated: 2025-03-25 11:24:55

标题: 无监督的盲联合去混响和利用扩散模型进行房间声学估计

摘要: 这篇论文提出了一种无监督的单声道盲消混响和室内脉冲响应(RIR)估计方法,称为BUDDy。该算法根植于贝叶斯后验采样:它结合了一个强调对混响测量的忠实性的似然模型,以及由无条件扩散模型实现的无混响语音先验。我们设计了一个代表RIR的参数滤波器,每个频率子带都有指数衰减。室内声学估计和语音消混响是联合进行的,因为滤波器参数是通过迭代估计并沿着逆扩散轨迹改进语音话语。在RIR未知的盲目情况下,BUDDy成功地在各种声学场景中进行语音消混响,显著优于其他盲目无监督基线。与通常难以推广的监督方法不同,BUDDy可以无缝地适应不同的声学条件。本文通过提供新的实验结果和对算法多功能性的洞察,扩展了我们之前的工作。我们展示了我们提出的方法对新的声学和说话者条件的稳健性,以及其适应高分辨率歌声消混响,使用仪器指标和主观听力评估。我们研究了BUDDy对RIR估计的性能,并观察到它在不匹配的声学条件下超过了一种最先进的基于监督DNN的估计器。最后,我们调查了通知消混响方法对RIR估计误差的敏感性,从而激励联合声学估计和消混响设计。在线可找到音频示例和代码。

更新时间: 2025-03-25 11:24:55

领域: eess.AS,cs.LG,cs.SD

下载: http://arxiv.org/abs/2408.07472v2
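BUDDy's parametric RIR filter, an exponential decay per frequency subband, can be illustrated with a minimal synthesizer. The band edges, decay constants, and FFT-mask band-splitting below are illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
fs, dur = 16000, 0.5
n = int(fs * dur)
t = np.arange(n) / fs
freqs = np.fft.rfftfreq(n, 1.0 / fs)

# assumed subband edges (Hz) and per-band decay time constants (s)
edges = [(0, 1000), (1000, 4000), (4000, 8000)]
taus = [0.30, 0.15, 0.08]   # low bands ring longer, as in typical rooms

noise_spec = np.fft.rfft(rng.standard_normal(n))   # white-noise carrier
rir = np.zeros(n)
for (lo, hi), tau in zip(edges, taus):
    band = np.fft.irfft(noise_spec * ((freqs >= lo) & (freqs < hi)), n=n)
    rir += np.exp(-t / tau) * band   # exponential envelope per subband

# energy decay curve: reverberant energy should fall over time
edc = np.cumsum(rir[::-1] ** 2)[::-1]
```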

Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies

Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods where training data from differing accuracies or fidelities are used. These methods usually employ a fixed scaling factor, $\gamma$, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying $\gamma$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC compute time informed scaling factors, denoted as $\theta$, that vary based on QC compute times at different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used. This is further illustrated through a novel concept, the $\Gamma$-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.

Updated: 2025-03-25 11:20:46

标题: 研究用于激发能级的多保真度机器学习中的数据层次结构

摘要: 最近在机器学习(ML)领域取得的进展使得高精度的量子化学(QC)计算更加容易实现。特别感兴趣的是多保真度机器学习(MFML)方法,其中使用来自不同精度或保真度的训练数据。这些方法通常采用一个固定的缩放因子,γ,来关联不同保真度下的训练样本数量,这反映了数据的成本和假设的稀疏性。本研究调查了修改 γ 对使用 QeMFi 基准数据集预测垂直激发能量的模型效率和准确性的影响。此外,本研究引入了基于QC计算时间的缩放因子,表示为 θ,这些因子根据不同保真度下的QC计算时间而变化。提出了一个新的误差度量标准,MFML的误差等高线,以提供一个全面的视角,从每个保真度中得出模型误差的贡献。结果表明,在使用更多来自较低保真度的样本时,仅需在目标保真度下使用2个训练样本即可实现高模型准确性。通过一个新概念, Γ-曲线,进一步说明了模型误差与生成训练样本的时间成本之间的比较,表明在最小化训练数据成本的同时,多保真度模型可以实现高准确性。

更新时间: 2025-03-25 11:20:46

领域: physics.chem-ph,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2410.11392v2
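The fixed scaling factor γ relates sample counts across fidelities: each cheaper fidelity typically contributes γ times as many training samples as the fidelity above it. A small helper makes the geometry concrete (the convention that counts grow downward from the target fidelity follows the usual MFML setup; the numbers are illustrative):

```python
def fidelity_counts(n_target, gamma, n_fidelities):
    """Training-set sizes from the target fidelity (index 0) down to the
    cheapest fidelity, assuming N_{f+1} = gamma * N_f."""
    return [n_target * gamma ** f for f in range(n_fidelities)]

# e.g. 2 samples at the target fidelity, gamma = 2, five fidelity levels
counts = fidelity_counts(n_target=2, gamma=2, n_fidelities=5)
```

With just 2 target-fidelity samples, the cheaper fidelities supply exponentially more data, which is what lets the multifidelity model stay accurate at low cost.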

Causal Bayesian Optimization with Unknown Graphs

Causal Bayesian Optimization (CBO) is a methodology designed to optimize an outcome variable by leveraging known causal relationships through targeted interventions. Traditional CBO methods require a fully and accurately specified causal graph, which is a limitation in many real-world scenarios where such graphs are unknown. To address this, we propose a new method for the CBO framework that operates without prior knowledge of the causal graph. Consistent with causal bandit theory, we demonstrate through theoretical analysis that focusing on the direct causal parents of the target variable is sufficient for optimization, and provide empirical validation in the context of CBO. Furthermore, we introduce a new method that learns a Bayesian posterior over the direct parents of the target variable. This allows us to optimize the outcome variable while simultaneously learning the causal structure. Our contributions include a derivation of the closed-form posterior distribution for the linear case. In the nonlinear case where the posterior is not tractable, we present a Gaussian Process (GP) approximation that still enables CBO by inferring the parents of the outcome variable. The proposed method performs competitively with existing benchmarks and scales well to larger graphs, making it a practical tool for real-world applications where causal information is incomplete.

Updated: 2025-03-25 11:14:37

标题: 使用未知图的因果贝叶斯优化

摘要: 因果贝叶斯优化(CBO)是一种旨在通过有针对性的干预利用已知因果关系来优化结果变量的方法论。传统的CBO方法需要一个完全和准确指定的因果图,这在许多现实场景中是一个限制,因为这样的图是未知的。为了解决这个问题,我们提出了一种新的CBO框架方法,可以在没有先验知识的情况下运行。与因果赌注理论一致,我们通过理论分析证明,集中在目标变量的直接因果父节点上就足够进行优化,并在CBO的背景下提供了实证验证。此外,我们引入了一种学习目标变量的直接父节点的贝叶斯后验概率的新方法。这使我们能够在优化结果变量的同时学习因果结构。我们的贡献包括对线性情况的闭合形式后验分布的推导。在后验不可追踪的非线性情况下,我们提出了一种高斯过程(GP)近似方法,仍然能够通过推断结果变量的父节点来进行CBO。所提出的方法与现有基准竞争,并能很好地扩展到更大的图形,使其成为因果信息不完整的实际应用的实用工具。

更新时间: 2025-03-25 11:14:37

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2503.19554v1
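In the linear case, a pseudo-posterior over candidate parent sets can be sketched by scoring each subset with a BIC approximation to the marginal likelihood. The paper derives an exact closed form; BIC is only a stand-in here, and the data-generating process is assumed for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 3
X = rng.standard_normal((n, d))
# ground truth: only X0 and X1 are direct parents of the outcome y
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

def bic_score(parents):
    # OLS fit of y on the candidate parent set (plus intercept)
    Z = np.column_stack([X[:, list(parents)], np.ones(n)])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    rss = resid @ resid
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

candidates = [s for r in range(d + 1) for s in itertools.combinations(range(d), r)]
scores = np.array([bic_score(s) for s in candidates])
# turn BIC scores into a (pseudo-)posterior over parent sets
logp = -0.5 * (scores - scores.min())
post = np.exp(logp) / np.exp(logp).sum()
best = candidates[int(np.argmax(post))]
```

Optimization can then intervene on the variables that currently carry high posterior parent probability, learning structure and optimizing the outcome at the same time.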

To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability

The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent generations of accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8, with even fewer bits than FP16, can be a cost-effective option for LLM training. We argue that reduced-precision training schemes must have similar training stability and hyperparameter sensitivities to their higher-precision counterparts in order to be cost-effective. However, we find that currently available methods for FP8 training are not robust enough to allow their use as economical replacements. This prompts us to investigate the stability of reduced-precision LLM training in terms of robustness across random seeds, learning rates, and datasets. To this end, we propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models. By simulating incremental bit reductions in floating-point representations, we analyze the relationship between representational power and training stability with the intent of aiding future research into the field.

Updated: 2025-03-25 11:11:03

标题: 到FP8再返回:量化降低精度对LLM训练稳定性的影响

摘要: 大规模语言模型(LLM)预训练所涉及的巨大计算成本已经激发了对减少精度浮点表示以加速过程的浓厚兴趣。因此,BrainFloat16(BF16)精度已成为LLM训练的事实标准,在最近一代加速器中已经包含了硬件支持。这种趋势甚至在最新的处理器中进一步发展,最近引入了FP8。然而,先前对FP16的经验发现其比BF16不稳定,引发了对FP8是否可以作为LLM训练的经济替代方案的担忧,因为FP8比FP16的位数更少。我们认为,降低精度训练方案必须具有类似于高精度对应物的训练稳定性和超参数敏感性,才能具有成本效益。然而,我们发现目前可用的FP8训练方法还不够稳健,无法作为经济替代品使用。这促使我们研究降低精度LLM训练的稳定性,包括随机种子、学习率和数据集的稳健性。为此,我们提出了新的评估技术和一种用于量化自回归语言模型中损失地形锐度的新度量标准。通过模拟浮点表示中逐渐减少比特位数,我们分析了表征能力与训练稳定性之间的关系,以帮助未来进一步研究这一领域。

更新时间: 2025-03-25 11:11:03

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2405.18710v2
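The simulated incremental bit reductions can be sketched by rounding the mantissa of each value while keeping the exponent exact; the rounding convention and the mantissa widths chosen below are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def reduce_mantissa(x, bits):
    """Round x to `bits` explicit mantissa bits, keeping the exponent exact;
    a crude stand-in for casting to a narrower floating-point format."""
    m, e = np.frexp(np.asarray(x, dtype=np.float64))
    return np.ldexp(np.round(m * 2.0 ** bits) / 2.0 ** bits, e)

xs = np.linspace(-3.0, 3.0, 101)
err3 = np.max(np.abs(reduce_mantissa(xs, 3) - xs))    # E4M3-like mantissa width
err10 = np.max(np.abs(reduce_mantissa(xs, 10) - xs))  # FP16-like mantissa width
```

Sweeping `bits` downward gives the incremental precision ladder on which representational power can be related to training stability.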

Scaling Laws of Synthetic Data for Language Models

Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the \emph{rectified scaling law} across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

Updated: 2025-03-25 11:07:12

标题: 语言模型的合成数据的尺度规律

摘要: 大型语言模型(LLMs)在各种任务中表现出色,主要是由于预训练中使用的高质量网络数据。然而,最近的研究表明,这种数据来源正在迅速枯竭。合成数据出现作为一种有希望的替代方案,但目前尚不清楚合成数据集是否具有可预测的可扩展性,可与原始预训练数据相媲美。在这项工作中,我们通过引入SynthLLM系统地研究合成数据的扩展规律,这是一个可扩展的框架,将预训练语料库转化为多样化、高质量的合成数据集。我们的方法通过使用图算法自动提取和重新组合多个文档中的高级概念来实现这一点。我们在SynthLLM上进行的广泛数学实验的关键发现包括:(1)SynthLLM生成的合成数据可靠地遵循各种模型大小的“修正扩展规律”;(2)性能改进在接近300B标记时趋于平稳;(3)更大的模型用较少的训练标记接近最佳性能。例如,8B模型在1T标记时达到峰值,而3B模型需要4T。此外,与现有的合成数据生成和增强方法进行比较表明,SynthLLM实现了更优越的性能和可扩展性。我们的发现突出合成数据作为一种可扩展和可靠的替代方案,为模型性能持续改进提供了可行的途径。

更新时间: 2025-03-25 11:07:12

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.19551v1

PropNet: a White-Box and Human-Like Network for Sentence Representation

Transformer-based embedding methods have dominated the field of sentence representation in recent years. Although they have achieved remarkable performance on NLP tasks such as semantic textual similarity (STS), their black-box nature and large-data-driven training style have raised concerns, including issues related to bias, trust, and safety. Many efforts have been made to improve the interpretability of embedding models, but these problems have not been fundamentally resolved. To achieve inherent interpretability, we propose a purely white-box and human-like sentence representation network, PropNet. Inspired by findings from cognitive science, PropNet constructs a hierarchical network based on the propositions contained in a sentence. While experiments indicate that PropNet has a significant gap compared to state-of-the-art (SOTA) embedding models in STS tasks, case studies reveal substantial room for improvement. Additionally, PropNet enables us to analyze and understand the human cognitive processes underlying STS benchmarks.

Updated: 2025-03-25 11:04:06

标题: PropNet:一种用于句子表示的白盒和类人网络

摘要: 基于Transformer的嵌入方法近年来主导了句子表示领域。尽管它们在自然语言处理任务中取得了显著成绩,如语义文本相似性(STS)任务,但它们的黑匣子特性和大数据驱动的训练风格引发了一些关注,包括与偏见、信任和安全相关的问题。许多努力已经为了改进嵌入模型的可解释性而进行,但这些问题并没有得到根本性解决。为了实现内在可解释性,我们提出了一种纯粹的白匣子和类似人类的句子表示网络,PropNet。受认知科学研究结果的启发,PropNet基于句子中包含的命题构建了一个层次化网络。虽然实验证明PropNet在STS任务中与现有技术(SOTA)嵌入模型存在显著差距,案例研究显示了大幅改进空间。此外,PropNet使我们能够分析和理解支撑STS基准的人类认知过程。

更新时间: 2025-03-25 11:04:06

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2502.10725v2

Noise Resilient Over-The-Air Federated Learning In Heterogeneous Wireless Networks

In 6G wireless networks, Artificial Intelligence (AI)-driven applications demand the adoption of Federated Learning (FL) to enable efficient and privacy-preserving model training across distributed devices. Over-The-Air Federated Learning (OTA-FL) exploits the superposition property of multiple access channels, allowing edge users in 6G networks to efficiently share spectral resources and perform low-latency global model aggregation. However, these advantages come with challenges, as traditional OTA-FL techniques suffer due to the joint effects of Additive White Gaussian Noise (AWGN) at the server, fading, and both data and system heterogeneity at the participating edge devices. In this work, we propose the novel Noise Resilient Over-the-Air Federated Learning (NoROTA-FL) framework to jointly tackle these challenges in federated wireless networks. In NoROTA-FL, the local optimization problems find controlled inexact solutions, which manifests as an additional proximal constraint at the clients. This approach provides robustness against straggler-induced partial work, heterogeneity, noise, and fading. From a theoretical perspective, we leverage the zeroth- and first-order inexactness and establish convergence guarantees for non-convex optimization problems in the presence of heterogeneous data and varying system capabilities. Experimentally, we validate NoROTA-FL on real-world datasets, including FEMNIST, CIFAR10, and CIFAR100, demonstrating its robustness in noisy and heterogeneous environments. Compared to state-of-the-art baselines such as COTAF and FedProx, NoROTA-FL achieves significantly more stable convergence and higher accuracy, particularly in the presence of stragglers.

Updated: 2025-03-25 11:04:00

标题: 在异构无线网络中的抗噪声空中联邦学习

摘要: 在6G无线网络中,人工智能(AI)驱动的应用需要采用联邦学习(FL)来实现跨分布式设备的高效和隐私保护的模型训练。无线联邦学习(OTA-FL)利用多接入通道的叠加特性,允许6G网络中的边缘用户高效共享频谱资源并进行低延迟的全局模型聚合。然而,这些优势也带来了挑战,因为传统的OTA-FL技术受到服务器端的加性白高斯噪声(AWGN)、衰落以及参与边缘设备的数据和系统异质性的共同影响而性能下降。在这项工作中,我们提出了新颖的抗噪声无线联邦学习(NoROTA-FL)框架,以共同解决联邦无线网络中的这些挑战。在NoROTA-FL中,局部优化问题找到了受控的不精确解决方案,这表现为客户端的额外近端约束。这种方法提供了对于拖沓引起的部分工作、异质性、噪声和衰落的鲁棒性。从理论上讲,我们利用零阶和一阶不精确性,并在异质数据和不同系统能力存在的情况下为非凸优化问题建立收敛保证。在实验上,我们在包括FEMNIST、CIFAR10和CIFAR100在内的真实数据集上验证了NoROTA-FL,展示了它在嘈杂和异质环境中的稳健性。与COTAF和FedProx等最先进的基线相比,NoROTA-FL在存在拖沓者的情况下实现了更加稳定的收敛和更高的准确性。

更新时间: 2025-03-25 11:04:00

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2503.19549v1
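The "additional proximal constraint at the clients" resembles a FedProx-style proximal term. A toy quadratic local loss (all values assumed) shows how the proximal weight anchors inexact local solutions to the global model:

```python
import numpy as np

def local_update(w_global, w_star, mu, lr=0.1, steps=100):
    """Gradient descent on f_i(w) + (mu/2)||w - w_global||^2,
    with a toy local loss f_i(w) = 0.5||w - w_star||^2."""
    w = w_global.copy()
    for _ in range(steps):
        grad = (w - w_star) + mu * (w - w_global)
        w -= lr * grad
    # finitely many steps: a controlled inexact solution, not an exact solve
    return w

w_global = np.zeros(3)
w_star = np.array([4.0, -2.0, 1.0])   # this client's local optimum (assumed)
w_free = local_update(w_global, w_star, mu=0.0)   # plain local training
w_prox = local_update(w_global, w_star, mu=1.0)   # proximally constrained
```

With mu = 1 the local minimizer is (w_star + mu * w_global) / (1 + mu), so heterogeneous clients drift less from the global model, which is the mechanism behind the robustness to stragglers and heterogeneity described above.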

On-Chain Analysis of Smart Contract Dependency Risks on Ethereum

In this paper, we present the first large-scale empirical study of smart contract dependencies, analyzing over 41 million contracts and 11 billion interactions on Ethereum up to December 2024. Our results yield four key insights: (1) 59% of contract transactions involve multiple contracts (median of 4 per transaction in 2024), indicating potential smart contract dependency risks; (2) the ecosystem exhibits extreme centralization, with just 11 (0.001%) deployers controlling 20.5 million (50%) of live contracts, with major risks related to factory contracts and deployer privileges; (3) the three most depended-upon contracts are mutable, meaning large parts of the ecosystem rely on contracts that can be altered at any time, which is a significant risk; and (4) actual smart contract protocol dependencies are significantly more complex than officially documented, undermining Ethereum's transparency ethos and creating unnecessary attack surface. Our work provides the first large-scale empirical foundation for understanding smart contract dependency risks, offering crucial insights for developers, users, and security researchers in the blockchain space.

Updated: 2025-03-25 11:02:08

标题: 在以太坊上智能合约依赖风险的链上分析

摘要: 在这篇论文中,我们提出了第一个大规模的智能合约依赖性经验研究,分析了截至2024年12月在以太坊上的超过4100万个合约和110亿次交互。我们的结果得出了四个关键见解:(1)59%的合约交易涉及多个合约(2024年每笔交易中的中位数为4个),表明潜在的智能合约依赖风险;(2)生态系统表现出极端的集中化,仅有11个(0.001%)部署者控制着2050万(50%)的活跃合约,与工厂合约和部署者特权相关的风险严重;(3)最依赖的三个合约是可变的,意味着生态系统的大部分依赖于可以随时更改的合约,这是一个重要风险;(4)实际的智能合约协议依赖性明显比官方文件中记录的更为复杂,破坏了以太坊的透明度理念,同时创造了不必要的攻击面。我们的工作为理解智能合约依赖风险提供了第一个大规模的经验基础,为区块链领域的开发人员、用户和安全研究人员提供了关键见解。

更新时间: 2025-03-25 11:02:08

领域: cs.SE,cs.CR

下载: http://arxiv.org/abs/2503.19548v1

AutoBayes: A Compositional Framework for Generalized Variational Inference

We introduce a new compositional framework for generalized variational inference, clarifying the different parts of a model, how they interact, and how they compose. We explain that both exact Bayesian inference and the loss functions typical of variational inference (such as variational free energy and its generalizations) satisfy chain rules akin to that of reverse-mode automatic differentiation, and we advocate for exploiting this to build and optimize models accordingly. To this end, we construct a series of compositional tools: for building models; for constructing their inversions; for attaching local loss functions; and for exposing parameters. Finally, we explain how the resulting parameterized statistical games may be optimized locally, too. We illustrate our framework with a number of classic examples, pointing to new areas of extensibility that are revealed.

Updated: 2025-03-25 10:55:49

标题: AutoBayes:一种用于广义变分推断的组合框架

摘要: 我们引入了一个新的用于广义变分推断的组合框架,澄清了模型的不同部分、它们如何相互作用以及如何组合。我们解释了精确的贝叶斯推断和变分推断中典型的损失函数(如变分自由能及其推广)都满足类似于反向模式自动微分的链规则,并主张利用这一点来构建和优化模型。为此,我们构建了一系列组合工具:用于构建模型的工具;用于构建它们的逆向工具;用于附加本地损失函数的工具;以及用于暴露参数的工具。最后,我们解释了如何在本地优化结果参数化的统计游戏。我们用一些经典示例说明了我们的框架,指出了新的可扩展领域。

更新时间: 2025-03-25 10:55:49

领域: stat.ML,cs.LG,math.ST,stat.TH

下载: http://arxiv.org/abs/2503.18608v2

Benchmarking Data Efficiency in $Δ$-ML and Multifidelity Models for Quantum Chemistry

The development of machine learning (ML) methods has made quantum chemistry (QC) calculations more accessible by reducing the compute cost incurred in conventional QC methods. This has since been translated into the overhead cost of generating training data. Increased work in reducing the cost of generating training data resulted in the development of $\Delta$-ML and multifidelity machine learning methods which use data at more than one QC level of accuracy, or fidelity. This work compares the data costs associated with $\Delta$-ML, multifidelity machine learning (MFML), and optimized MFML (o-MFML) in contrast with a newly introduced Multifidelity$\Delta$-Machine Learning (MF$\Delta$ML) method for the prediction of ground state energies, vertical excitation energies, and the magnitude of electronic contribution of molecular dipole moments from the multifidelity benchmark dataset QeMFi. This assessment is made on the basis of training data generation cost associated with each model and is compared with the single fidelity kernel ridge regression (KRR) case. The results indicate that the use of multifidelity methods surpasses the standard $\Delta$-ML approaches in cases of a large number of predictions. For applications which require only a few evaluations to be made using ML models, while the $\Delta$-ML method might be favored, the MF$\Delta$ML method is shown to be more efficient.

Updated: 2025-03-25 10:55:46

标题: 在量子化学中$Δ$-ML和多保真度模型的数据效率基准对比

摘要: 机器学习(ML)方法的发展已经通过减少传统QC方法中的计算成本,使量子化学(QC)计算更容易获得。这已经转化为生成训练数据的成本开销。增加降低生成训练数据成本的工作导致了$\Delta$-ML和多精度机器学习方法的发展,这些方法使用了更多精度或忠实度的QC级别的数据。 本文比较了与新引入的多精度$\Delta$-机器学习(MF$\Delta$ML)方法相对应的$\Delta$-ML、多精度机器学习(MFML)和优化MFML(o-MFML)的数据成本,用于预测基态能量、垂直激发能量和分子偶极矩电子贡献大小的多精度基准数据集QeMFi。这一评估是基于每个模型所关联的训练数据生成成本,并与单一精度核岭回归(KRR)情况进行比较。结果表明,在大量预测的情况下,多精度方法超越了标准的$\Delta$-ML方法。对于只需要使用ML模型进行少量评估的应用程序,虽然可能偏爱$\Delta$-ML方法,但MF$\Delta$ML方法被证明更有效。

更新时间: 2025-03-25 10:55:46

领域: physics.chem-ph,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2410.11391v3

A Universal Model Combining Differential Equations and Neural Networks for Ball Trajectory Prediction

This paper presents a data-driven universal ball trajectory prediction method integrated with physics equations. Existing methods are designed for specific ball types and struggle to generalize. This challenge arises from three key factors. First, learning-based models require large datasets but suffer from accuracy drops in unseen scenarios. Second, physics-based models rely on complex formulas and detailed inputs, yet accurately obtaining ball states, such as spin, is often impractical. Third, integrating physical principles with neural networks to achieve high accuracy, fast inference, and strong generalization remains difficult. To address these issues, we propose an innovative approach that incorporates physics-based equations and neural networks. We first derive three generalized physical formulas. Then, using a neural network and observed trajectory points, we infer certain parameters while fitting the remaining ones. These formulas enable precise trajectory prediction with minimal training data: only a few dozen samples. Extensive experiments demonstrate our method's superiority in generalization, real-time performance, and accuracy.

Updated: 2025-03-25 10:50:57

标题: 一个结合微分方程和神经网络的通用模型,用于球的轨迹预测

摘要: 这篇论文提出了一种集成物理方程的数据驱动通用球轨迹预测方法。现有方法针对特定球类设计,很难泛化。这一挑战源自三个关键因素。首先,基于学习的模型需要大量数据集,但在未知场景中往往精度下降。其次,基于物理的模型依赖复杂的公式和详细的输入,然而精确获取球的状态,如旋转,往往不切实际。第三,将物理原理与神经网络整合以实现高精度、快速推理和强泛化仍然困难。为了解决这些问题,我们提出了一种创新方法,结合了物理方程和神经网络。我们首先推导出三个通用的物理公式。然后,利用神经网络和观测到的轨迹点,推断出某些参数,同时拟合剩余的参数。这些公式能够通过少量训练数据实现精确的轨迹预测:仅需几十个样本。大量实验证明了我们的方法在泛化能力、实时性能和准确性方面的优越性。

更新时间: 2025-03-25 10:50:57

领域: cs.LG

下载: http://arxiv.org/abs/2503.18584v2
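The idea of fitting unobserved parameters to observed trajectory points under a generalized physics formula can be sketched with a quadratic-drag model. The paper infers parameters with a neural network; a simple grid search stands in here, and the drag law and all constants are assumptions for illustration:

```python
import numpy as np

def simulate(k, v0=(10.0, 10.0), dt=0.01, steps=150, g=9.81):
    """Euler integration of a ball with quadratic drag: a = g_vec - k*|v|*v."""
    p = np.zeros(2)
    v = np.array(v0)
    traj = [p.copy()]
    for _ in range(steps):
        a = np.array([0.0, -g]) - k * np.linalg.norm(v) * v
        v = v + dt * a
        p = p + dt * v
        traj.append(p.copy())
    return np.array(traj)

k_true = 0.05
observed = simulate(k_true)   # stands in for observed trajectory points

# fit the unobserved drag coefficient by matching the observed points
grid = np.linspace(0.0, 0.1, 101)
errors = [np.sum((simulate(k) - observed) ** 2) for k in grid]
k_hat = grid[int(np.argmin(errors))]
```

Once the few free parameters are identified from a handful of observed points, the same physics formula extrapolates the rest of the trajectory, which is why so little training data suffices.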

The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States

Detecting hallucinations in large language models (LLMs) is critical for enhancing their reliability and trustworthiness. Most research focuses on hallucinations as deviations from information seen during training. However, the opaque nature of an LLM's parametric knowledge complicates the understanding of why generated texts appear ungrounded: The LLM might not have picked up the necessary knowledge from large and often inaccessible datasets, or the information might have been changed or contradicted during further training. Our focus is on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date. This study investigates these hallucinations by detecting them at sentence level using different internal states of various LLMs. We present HalluRAG, a dataset designed to train classifiers on these hallucinations. Depending on the model and quantization, MLPs trained on HalluRAG detect hallucinations with test accuracies ranging up to 75 %, with Mistral-7B-Instruct-v0.1 achieving the highest test accuracies. Our results show that IAVs detect hallucinations as effectively as CEVs and reveal that answerable and unanswerable prompts are encoded differently, since training separate classifiers for these categories improved accuracy. However, HalluRAG showed some limited generalizability, advocating for more diversity in datasets on hallucinations.

Updated: 2025-03-25 10:50:21

标题: 《HalluRAG数据集:使用LLM的内部状态检测RAG应用程序中的封闭领域幻觉》

摘要: 在大型语言模型(LLMs)中检测幻觉对于增强它们的可靠性和信任度至关重要。大多数研究集中在将幻觉视为与训练期间看到的信息偏离的现象。然而,LLM参数化知识的不透明性使得理解生成文本为何看起来不牢固变得复杂:LLM可能没有从庞大且常常无法访问的数据集中获取所需的知识,或者信息可能在进一步训练过程中发生了改变或矛盾。我们关注涉及未在训练中使用的信息的幻觉,通过使用最近性确定信息出现在截止日期之后来确定。本研究通过使用不同LLMs的不同内部状态在句子级别检测这些幻觉来调查这些幻觉。我们提出了HalluRAG,这是一个专门设计用于训练这些幻觉分类器的数据集。根据模型和量化的不同,基于HalluRAG训练的MLPs可以检测出测试准确率高达75%的幻觉,其中Mistral-7B-Instruct-v0.1实现了最高的测试准确率。我们的结果表明,IAVs和CEVs一样有效地检测出幻觉,并揭示了可回答和不可回答的提示被编码为不同分类器以改进准确性。然而,HalluRAG显示出一定的有限的泛化能力,倡导在幻觉数据集中增加更多的多样性。

更新时间: 2025-03-25 10:50:21

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2412.17056v2

FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models

Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.

Updated: 2025-03-25 10:48:33

标题: FLEX:用于评估大型语言模型公平性稳健性的基准

摘要: 最近大型语言模型(LLMs)的进展显著增强了用户和模型之间的交互。这些进展同时凸显了对严格的安全评估的需求,因为社会偏见的表现可能导致有害的社会影响。尽管存在这些担忧,现有的基准可能忽视了LLMs的固有弱点,即使只有简单的对抗性指令也可能产生偏见的响应。为了解决这一关键差距,我们引入了一个新的基准,即在极端情景下的LLM公平性基准(FLEX),旨在测试LLMs在受到诱导偏见的提示时是否能够保持公平性。为了全面评估LLMs的鲁棒性,我们将增强潜在偏见的提示集成到公平性评估中。FLEX和现有基准之间的比较实验表明,传统评估可能低估了模型中固有风险。这突显了需要更严格的LLM评估基准来确保安全性和公平性的必要性。

更新时间: 2025-03-25 10:48:33

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.19540v1

Cryptoscope: Analyzing cryptographic usages in modern software

The advent of quantum computing poses a significant challenge as it has the potential to break certain cryptographic algorithms, necessitating a proactive approach to identify and modernize cryptographic code. Identifying these cryptographic elements in existing code is only the first step. It is crucial not only to identify quantum-vulnerable algorithms but also to detect vulnerabilities and incorrect crypto usages, to prioritize, report, monitor as well as remediate and modernize code bases. A U.S. government memorandum requires agencies to begin their transition to PQC (Post-Quantum Cryptography) by conducting a prioritized inventory of cryptographic systems including software and hardware systems. In this paper we describe our code scanning tool - Cryptoscope - which leverages cryptographic domain knowledge as well as compiler techniques to statically parse and analyze source code. By analyzing control and data flow the tool is able to build an extendable and queryable inventory of cryptography. Cryptoscope goes beyond identifying disconnected cryptographic APIs and instead provides the user with an inventory of cryptographic assets - containing comprehensive views of the cryptographic operations implemented. We show that for more than 92% of our test cases, these views include the cryptographic operation itself, APIs, as well as the related material such as keys, nonces, random sources etc. Lastly, building on top of this inventory, our tool is able to detect and report all the cryptographic related weaknesses and vulnerabilities (11 out of 15) in CamBench - achieving state-of-the-art performance.

Updated: 2025-03-25 10:39:50

标题: 密码分析器:分析现代软件中的加密使用

摘要: 量子计算的出现构成了一个重大挑战,因为它有可能破解某些加密算法,这需要一种积极主动的方法来识别和更新加密代码。识别现有代码中的这些加密元素仅仅是第一步。不仅需要识别量子易受攻击的算法,还需要检测漏洞和不正确的加密使用,以便对代码库进行优先级排序、报告、监控、修复和现代化。美国政府备忘录要求各机构开始过渡到PQC(后量子密码学),通过对包括软件和硬件系统在内的加密系统进行优先级清点。本文描述了我们的代码扫描工具 - Cryptoscope - 它利用加密领域知识和编译器技术来静态解析和分析源代码。通过分析控制和数据流,该工具能够构建一个可扩展和可查询的加密清单。Cryptoscope不仅仅是识别断开的加密API,而是为用户提供一个包含加密资产的清单 - 包括实现的加密操作的全面视图。我们展示,在我们的测试案例中,超过92%的情况下,这些视图包括加密操作本身、API以及相关材料,如密钥、随机数源等。最后,基于这个清单,我们的工具能够检测并报告CamBench中所有与加密相关的弱点和漏洞(15个中的11个),实现了最先进的性能。

更新时间: 2025-03-25 10:39:50

领域: cs.CR

下载: http://arxiv.org/abs/2503.19531v1
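A toy, Python-only analogue of such a scanner can be written with the standard `ast` module: statically parse the source, walk the tree, and inventory `hashlib` call sites. The real tool analyzes control and data flow across many languages; the weak-primitive list below is an assumption for illustration:

```python
import ast

WEAK_HASHES = {"md5", "sha1"}  # assumed weak-primitive list for this toy scanner

def crypto_inventory(source):
    """Statically parse Python source and inventory hashlib call sites."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "hashlib"):
            findings.append({
                "api": f"hashlib.{node.func.attr}",
                "line": node.lineno,
                "weak": node.func.attr in WEAK_HASHES,
            })
    findings.sort(key=lambda f: f["line"])
    return findings

sample = (
    "import hashlib\n"
    "digest = hashlib.md5(data).hexdigest()\n"
    "check = hashlib.sha256(data).hexdigest()\n"
)
inventory = crypto_inventory(sample)
```

The resulting list is a minimal queryable inventory: each entry records the API, its location, and whether it is flagged, which is the shape of artifact a modernization effort would prioritize and track.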

Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations

This paper presents an in-depth analysis of the scale generalisation properties of the scale-covariant and scale-invariant Gaussian derivative networks, complemented with both conceptual and algorithmic extensions. For this purpose, Gaussian derivative networks (GaussDerNets) are evaluated on new rescaled versions of the Fashion-MNIST and the CIFAR-10 datasets, with spatial scaling variations over a factor of 4 in the testing data, that are not present in the training data. Additionally, evaluations on the previously existing STIR datasets show that the GaussDerNets achieve better scale generalisation than previously reported for these datasets for other types of deep networks. We first experimentally demonstrate that the GaussDerNets have quite good scale generalisation properties on the new datasets, and that average pooling of feature responses over scales may sometimes also lead to better results than the previously used approach of max pooling over scales. Then, we demonstrate that using a spatial max pooling mechanism after the final layer enables localisation of non-centred objects in image domain, with maintained scale generalisation properties. We also show that regularisation during training, by applying dropout across the scale channels, referred to as scale-channel dropout, improves both the performance and the scale generalisation. In additional ablation studies, we demonstrate that discretisations of GaussDerNets, based on the discrete analogue of the Gaussian kernel in combination with central difference operators, perform best or among the best, compared to a set of other discrete approximations of the Gaussian derivative kernels. Finally, by visualising the activation maps and the learned receptive fields, we demonstrate that the GaussDerNets have very good explainability properties.

Updated: 2025-03-25 10:38:59

标题: 基于空间尺度变化的图像数据集上扩展尺度协变和尺度不变高斯导数网络的尺度概括特性

摘要: 本文对尺度协变和尺度不变的高斯导数网络的尺度泛化特性进行了深入分析,并结合概念和算法扩展。为此,高斯导数网络(GaussDerNets)在Fashion-MNIST和CIFAR-10数据集的新缩放版本上进行评估,在测试数据中的空间尺度变化达到4倍,而在训练数据中不存在。此外,在先前存在的STIR数据集上的评估显示,与先前报道的其他类型的深度网络相比,GaussDerNets实现了更好的尺度泛化。 我们首先通过实验证明,GaussDerNets在新数据集上具有相当好的尺度泛化特性,并且有时通过对尺度上的特征响应进行平均池化也可能比先前使用的最大池化方法取得更好的结果。然后,我们展示了在最终层之后使用空间最大池化机制可以在图像域内定位非中心对象,并保持尺度泛化特性。我们还表明,在训练过程中通过在尺度通道上应用dropout进行正则化,即所谓的尺度通道dropout,可以提高性能和尺度泛化。 在额外的消融研究中,我们展示了基于高斯核的离散模拟结合中心差分算子的GaussDerNets的离散化性能最好或位居最佳,与一组其他高斯导数核的离散近似相比。 最后,通过可视化激活图和学习的接受域,我们展示了GaussDerNets具有非常好的可解释性特性。

更新时间: 2025-03-25 10:38:59

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2409.11140v2

VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models

Popular PEFT methods achieve parameter efficiency by assuming that incremental weight updates are inherently low-rank, which often leads to a performance gap compared to full fine-tuning. While recent methods have attempted to address this limitation, they typically lack sufficient parameter and memory efficiency. We propose VectorFit, an effective and easily deployable approach that adaptively trains the singular vectors and biases of pre-trained weight matrices. We demonstrate that the utilization of structural and transformational characteristics of pre-trained weights enables high-rank updates comparable to those of full fine-tuning. As a result, VectorFit achieves superior performance with 9X less trainable parameters compared to state-of-the-art PEFT methods. Through extensive experiments over 17 datasets spanning diverse language and vision tasks such as natural language understanding and generation, question answering, image classification, and image generation, we exhibit that VectorFit consistently outperforms baselines, even in extremely low-budget scenarios.

Updated: 2025-03-25 10:36:27

标题: VectorFit:预训练基础模型的自适应奇异值和偏置向量微调

摘要: 流行的PEFT方法通过假设增量权重更新固有地是低秩的,从而实现参数效率,但与完全微调相比,这通常会导致性能差距。尽管最近的方法已经尝试解决这一限制,但它们通常缺乏足够的参数和内存效率。我们提出了VectorFit,这是一种有效且易于部署的方法,通过自适应地训练预训练权重矩阵的奇异向量和偏置。我们证明,利用预训练权重的结构和转换特性可以实现与完全微调相当的高秩更新。因此,与最先进的PEFT方法相比,VectorFit在可训练参数少9倍的情况下实现了卓越的性能。通过对涵盖自然语言理解和生成、问答、图像分类和图像生成等多样语言和视觉任务的17个数据集进行广泛实验,我们展示了VectorFit在各种预算有限的情况下始终优于基线模型。

更新时间: 2025-03-25 10:36:27

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.19530v1
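A stripped-down variant of the idea can be sketched by taking the SVD of a pre-trained weight, freezing the singular vectors, and training only the singular values. Note that VectorFit adaptively trains the singular vectors and biases themselves; this is a simplified illustration with assumed toy data that still shows the parameter saving:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W0 = rng.standard_normal((d, d))      # "pre-trained" weight (toy)
U, s0, Vt = np.linalg.svd(W0)

# synthetic fine-tuning target: same singular vectors, rescaled spectrum
s_target = s0 * np.array([1.5, 0.8, 1.2, 0.5])
X = rng.standard_normal((d, 200))
Y = U @ (s_target[:, None] * (Vt @ X))

# train only the d singular values (U, Vt frozen): far fewer parameters than d*d
s = s0.copy()
lr = 0.5
for _ in range(300):
    Z = Vt @ X
    R = U @ (s[:, None] * Z) - Y      # residual of current prediction
    grad = np.sum((U.T @ R) * Z, axis=1) / X.shape[1]
    s -= lr * grad
```

Updating the spectrum while reusing the frozen singular vectors exploits the structural characteristics of the pre-trained weight, so the effective update is not constrained to be low-rank even though only d numbers are trained.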

Deep Learning-Based Hypoglycemia Classification Across Multiple Prediction Horizons

Type 1 diabetes (T1D) management can be significantly enhanced through the use of predictive machine learning (ML) algorithms, which can mitigate the risk of adverse events like hypoglycemia. Hypoglycemia, characterized by blood glucose levels below 70 mg/dL, is a life-threatening condition typically caused by excessive insulin administration, missed meals, or physical activity. Its asymptomatic nature impedes timely intervention, making ML models crucial for early detection. This study integrates short- (up to 2h) and long-term (up to 24h) prediction horizons (PHs) within a single classification model to enhance decision support. The predicted times are 5-15 min, 15-30 min, 30 min-1h, 1-2h, 2-4h, 4-8h, 8-12h, and 12-24h before hypoglycemia. In addition, a simplified model classifying up to 4h before hypoglycemia is compared. We trained ResNet and LSTM models on glucose levels, insulin doses, and acceleration data. The results demonstrate the superiority of the LSTM models when classifying nine classes. In particular, subject-specific models yielded better performance but achieved high recall only for classes 0, 1, and 2 with 98%, 72%, and 50%, respectively. A population-based six-class model improved the results with at least 60% of events detected. In contrast, longer PHs remain challenging with the current approach and may be considered with different models.

Updated: 2025-03-25 10:24:27

标题: 基于深度学习的多时间段低血糖分类

摘要: 1型糖尿病(T1D)管理可以通过使用预测性机器学习(ML)算法显着增强,这可以减轻低血糖等不良事件的风险。低血糖的特征是血糖水平低于70 mg/dL,通常是由于过量胰岛素注射、错过餐食或体力活动导致的危及生命的状况。其无症状的特性妨碍了及时干预,使得ML模型对于早期检测至关重要。本研究将短期(最多2小时)和长期(最多24小时)的预测视野(PHs)整合到单个分类模型中,以增强决策支持。预测的时间分别是低血糖前5-15分钟、15-30分钟、30分钟-1小时、1-2小时、2-4小时、4-8小时、8-12小时和12-24小时。此外,还比较了一个简化的模型,在低血糖前最多4小时进行分类。我们训练了ResNet和LSTM模型,使用了血糖水平、胰岛素剂量和加速数据。结果表明,在分类九个类别时,LSTM模型的优越性。特别是,针对个体的模型表现更好,但仅对0、1和2类别达到了高召回率,分别为98%、72%和50%。基于人口的六类模型改善了结果,至少检测到60%的事件。相比之下,使用当前方法仍然存在挑战,对于较长的PHs可能需要考虑使用不同的模型。

更新时间: 2025-03-25 10:24:27

领域: q-bio.QM,cs.AI,cs.LG,stat.AP

下载: http://arxiv.org/abs/2504.00009v1
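The nine-class horizon scheme described in the abstract can be sketched as a simple labeling function. The class indexing and the convention that class 0 means "no event within 24 h" are assumptions made here for illustration, not details taken from the paper:

```python
def horizon_class(minutes_to_event):
    """Map time-to-hypoglycemia (minutes) to one of nine classes.

    Assumed convention (not stated explicitly in the abstract):
    class 0 = no event within 24 h; classes 1-8 = the eight
    prediction horizons 5-15 min, 15-30 min, 30 min-1 h, 1-2 h,
    2-4 h, 4-8 h, 8-12 h and 12-24 h before the event.
    """
    bounds = [15, 30, 60, 120, 240, 480, 720, 1440]  # upper edges in minutes
    if minutes_to_event is None or minutes_to_event > 1440:
        return 0
    for label, upper in enumerate(bounds, start=1):
        if minutes_to_event <= upper:
            return label
    return 0
```

For instance, an event 90 minutes away falls in the 1-2 h bin (class 4 under this numbering), while anything beyond 24 h maps to the no-event class.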

One Framework to Rule Them All: Unifying RL-Based and RL-Free Methods in RLHF

In this article, we primarily examine a variety of RL-based and RL-free methods designed to address Reinforcement Learning from Human Feedback (RLHF) and Large Reasoning Models (LRMs). We begin with a concise overview of the typical steps involved in RLHF and LRMs. Next, we reinterpret several RL-based and RL-free algorithms through the perspective of neural structured bandit prediction, providing a clear conceptual framework that uncovers a deeper connection between these seemingly distinct approaches. Following this, we briefly review some core principles of reinforcement learning, drawing attention to an often-overlooked aspect in existing RLHF studies. This leads to a detailed derivation of the standard RLHF objective within a full RL context, demonstrating its equivalence to neural structured bandit prediction. Finally, by reinvestigating the principles behind Proximal Policy Optimization (PPO), we pinpoint areas needing adjustment, which culminates in the introduction of the Generalized Reinforce Optimization (GRO) framework, seamlessly integrating RL-based and RL-free methods in RLHF. We look forward to the community's efforts to empirically validate GRO and invite constructive feedback.

Updated: 2025-03-25 10:23:26

标题: 一个框架统一一切:统一RLHF中基于RL和无RL的方法

摘要: 在这篇文章中,我们主要考察了一系列基于强化学习(RL)和无强化学习(RL-free)方法,旨在解决来自人类反馈(RLHF)和大型推理模型(LRMs)的强化学习问题。我们首先简要概述了RLHF和LRMs中涉及的典型步骤。接下来,我们通过神经结构化赌博预测的视角重新解释了几种基于RL和无RL算法,提供了一个清晰的概念框架,揭示了这些看似不同方法之间更深层次的联系。在此之后,我们简要回顾了一些强化学习的核心原则,引起了现有RLHF研究中常被忽视的一个方面的注意。这导致了在全面RL环境中详细推导标准RLHF目标,证明了它等同于神经结构化赌博预测。最后,通过重新审视Proximal Policy Optimization(PPO)背后的原则,我们找到了需要调整的领域,最终引入了广义强化优化(GRO)框架,无缝地将基于RL和无RL方法整合到RLHF中。我们期待社区的努力来实证验证GRO,并欢迎建设性反馈。

更新时间: 2025-03-25 10:23:26

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2503.19523v1
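As a reference point for the derivation the abstract mentions, the standard KL-regularized RLHF objective — the quantity the article re-derives in a full RL context and relates to neural structured bandit prediction — is commonly written as:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right]
\;-\; \beta\,\mathbb{E}_{x \sim \mathcal{D}}\!\left[ \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]
```

where $r_\phi$ is the learned reward model, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ the KL coefficient; how (or whether) the proposed GRO framework modifies this objective is not specified in the abstract.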

Inverting Transformer-based Vision Models

Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many previous approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we apply a modular approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer and a Vision Transformer, showing that this approach is efficient and feasible. Through qualitative and quantitative evaluations of reconstructed images, we generate insights into the underlying mechanisms of these architectures, highlighting their similarities and differences in terms of contextual shape and preservation of image details, inter-layer correlation, and robustness to color perturbations. Our analysis illustrates how these properties emerge within the models, contributing to a deeper understanding of transformer-based vision models. The code for reproducing our experiments is available at github.com/wiskott-lab/inverse-tvm.

Updated: 2025-03-25 10:19:48

标题: 反演基于Transformer的视觉模型

摘要: 理解计算机视觉中深度神经网络的机制仍然是一个基本挑战。虽然许多先前的方法集中在可视化深度神经网络内部的中间表示,特别是卷积神经网络,但这些技术尚未在基于Transformer的视觉模型中得到彻底探索。在这项研究中,我们应用了一个模块化的方法,训练逆向模型从检测Transformer和Vision Transformer内部的中间层重建输入图像,显示出这种方法是有效和可行的。通过对重建图像进行定性和定量评估,我们深入了解了这些架构的基本机制,突出了它们在上下文形状和图像细节的保留、层间相关性和对颜色扰动的鲁棒性方面的相似性和差异。我们的分析说明了这些属性是如何在模型内部出现的,有助于更深入地理解基于Transformer的视觉模型。我们的实验重现代码可在github.com/wiskott-lab/inverse-tvm上找到。

更新时间: 2025-03-25 10:19:48

领域: cs.CV,cs.AI,cs.LG,cs.NE

下载: http://arxiv.org/abs/2412.06534v3

Towards Imperceptible Adversarial Attacks for Time Series Classification with Local Perturbations and Frequency Analysis

Adversarial attacks in time series classification (TSC) models have recently gained attention due to their potential to compromise model robustness. Imperceptibility is crucial, as adversarial examples detected by the human vision system (HVS) can render attacks ineffective. Many existing methods fail to produce high-quality imperceptible examples, often generating perturbations with more perceptible low-frequency components, like square waves, and global perturbations that reduce stealthiness. This paper aims to improve the imperceptibility of adversarial attacks on TSC models by addressing frequency components and time series locality. We propose the Shapelet-based Frequency-domain Attack (SFAttack), which uses local perturbations focused on time series shapelets to enhance discriminative information and stealthiness. Additionally, we introduce a low-frequency constraint to confine perturbations to high-frequency components, enhancing imperceptibility.

Updated: 2025-03-25 10:16:51

标题: 朝向使用局部扰动和频率分析进行时间序列分类的几乎不可察觉的对抗性攻击

摘要: 时间序列分类(TSC)模型中的对抗性攻击最近引起了关注,因为它们有可能危害模型的稳健性。不可察觉性至关重要,因为被人类视觉系统(HVS)检测到的对抗样本会使攻击失效。许多现有方法未能生成高质量的不可察觉示例,通常生成具有更明显低频成分(如方波)的扰动以及降低隐蔽性的全局扰动。本文旨在通过处理频率成分和时间序列局部性来提高对TSC模型的对抗性攻击的不可察觉性。我们提出了基于shapelet的频域攻击(SFAttack),该攻击使用集中于时间序列shapelet的局部扰动,以增强区分性信息和隐蔽性。此外,我们引入了一个低频约束来将扰动限制在高频成分上,增强不可察觉性。

更新时间: 2025-03-25 10:16:51

领域: cs.CR

下载: http://arxiv.org/abs/2503.19519v1
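The low-frequency constraint described above — keeping perturbation energy out of the perceptible low-frequency band — can be sketched as a spectral projection. The rFFT formulation and the `keep_ratio` cutoff below are illustrative assumptions, not the paper's exact constraint:

```python
import numpy as np

def highpass_project(delta, keep_ratio=0.5):
    """Project a 1-D perturbation onto its high-frequency components.

    A hedged sketch of the 'low-frequency constraint' idea: zero out
    the lowest `keep_ratio` fraction of rFFT bins so the remaining
    perturbation carries only harder-to-perceive high frequencies.
    """
    spec = np.fft.rfft(delta)
    cutoff = int(len(spec) * keep_ratio)
    spec[:cutoff] = 0.0          # remove low-frequency (incl. DC) content
    return np.fft.irfft(spec, n=len(delta))
```

Applied to a perturbation mixing a slow sine with a fast one, the projection removes the slow component (and any DC offset) while leaving the high-frequency part intact.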

Computational Analysis of Stress, Depression and Engagement in Mental Health: A Survey

Analysis of stress, depression and engagement is less common and more complex than that of frequently discussed emotions such as happiness, sadness, fear and anger. The importance of these psychological states has been increasingly recognized due to their implications for mental health and well-being. Stress and depression are interrelated and together they impact engagement in daily tasks, highlighting the need to explore their interplay. This survey is the first to simultaneously explore computational methods for analyzing stress, depression and engagement. We present a taxonomy and timeline of the computational approaches used to analyze them and we discuss the most commonly used datasets and input modalities, along with the categories and generic pipeline of these approaches. Subsequently, we describe state-of-the-art computational approaches, including a performance summary on the most commonly used datasets. Following this, we explore the applications of stress, depression and engagement analysis, along with the associated challenges, limitations and future research directions.

Updated: 2025-03-25 10:14:57

标题: 《心理健康中压力、抑郁和投入的计算分析:一项调查》

摘要: 压力、抑郁和投入度的分析比经常讨论的幸福、悲伤、恐惧和愤怒等情绪更少见且更复杂。由于这些心理状态对心理健康和幸福感的影响,它们的重要性日益受到认可。压力和抑郁是相互关联的,它们共同影响人们对日常任务的投入,突显了探讨它们相互作用的必要性。本调查是首次同时探索分析压力、抑郁和投入度的计算方法。我们提出了一种用于分析这些方法的分类法和时间表,并讨论了最常用的数据集和输入模式,以及这些方法的类别和通用流程。随后,我们描述了最先进的计算方法,包括对最常用数据集的性能总结。在此之后,我们探讨了压力、抑郁和投入度分析的应用,以及相关的挑战、限制和未来研究方向。

更新时间: 2025-03-25 10:14:57

领域: cs.HC,cs.AI,cs.MM

下载: http://arxiv.org/abs/2403.08824v2

A Spatiotemporal Radar-Based Precipitation Model for Water Level Prediction and Flood Forecasting

Study Region: Goslar and Göttingen, Lower Saxony, Germany. Study Focus: In July 2017, the cities of Goslar and Göttingen experienced severe flood events characterized by short warning time of only 20 minutes, resulting in extensive regional flooding and significant damage. This highlights the critical need for a more reliable and timely flood forecasting system. This paper presents a comprehensive study on the impact of radar-based precipitation data on forecasting river water levels in Goslar. Additionally, the study examines how precipitation influences water level forecasts in Göttingen. The analysis integrates radar-derived spatiotemporal precipitation patterns with hydrological sensor data obtained from ground stations to evaluate the effectiveness of this approach in improving flood prediction capabilities. New Hydrological Insights for the Region: A key innovation in this paper is the use of residual-based modeling to address the non-linearity between precipitation images and water levels, leading to a Spatiotemporal Radar-based Precipitation Model with residuals (STRPMr). Unlike traditional hydrological models, our approach does not rely on upstream data, making it independent of additional hydrological inputs. This independence enhances its adaptability and allows for broader applicability in other regions with RADOLAN precipitation. The deep learning architecture integrates (2+1)D convolutional neural networks for spatial and temporal feature extraction with LSTM for timeseries forecasting. The results demonstrate the potential of the STRPMr for capturing extreme events and more accurate flood forecasting.

Updated: 2025-03-25 10:14:54

标题: 基于雷达的时空降水模型用于水位预测和洪水预报

摘要: 研究区域:德国下萨克森州的戈斯拉尔(Goslar)和哥廷根(Göttingen)。 研究重点:2017年7月,戈斯拉尔和哥廷根市经历了严重的洪水事件,其特点是仅有20分钟的短暂警告时间,导致大范围的地区洪水和重大损害。这凸显了对更可靠和及时的洪水预测系统的迫切需求。本文介绍了关于雷达降水数据对戈斯拉尔河流水位预测影响的综合研究。此外,该研究还探讨了降水如何影响哥廷根的水位预测。分析整合了雷达衍生的时空降水模式与从地面站获取的水文传感器数据,以评估这种方法在改进洪水预测能力方面的有效性。 该地区的新水文见解:本文的一个关键创新是使用基于残差的建模来解决降水图像和水位之间的非线性问题,从而形成带残差的时空雷达降水模型(STRPMr)。与传统水文模型不同,我们的方法不依赖上游数据,使其独立于额外的水文输入。这种独立性提高了其适应性,并允许在其他具有RADOLAN降水数据的地区广泛应用。深度学习架构整合了用于空间和时间特征提取的(2+1)D卷积神经网络与用于时间序列预测的LSTM。结果表明,STRPMr具有捕捉极端事件和更准确洪水预测的潜力。

更新时间: 2025-03-25 10:14:54

领域: eess.IV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.19943v1

DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data

The growing adoption of Vision-Language-Action (VLA) models in embodied AI intensifies the demand for diverse manipulation demonstrations. However, the high costs associated with data collection often result in insufficient data coverage across all scenarios, which limits the performance of the models. It is observed that the spatial reasoning phase (SRP) in large workspaces dominates the failure cases. Fortunately, this data can be collected at low cost, underscoring the potential of leveraging inexpensive data to improve model performance. In this paper, we introduce the DataPlatter method, a framework that decouples training trajectories into distinct task stages and leverages abundant, easily collectible SRP data to enhance the VLA model's generalization. Through analysis we demonstrate that sub-task-specific training with additional SRP data in proper proportion can act as a performance catalyst for robot manipulation, maximizing the utilization of costly physical interaction phase (PIP) data. Experiments show that by introducing a large proportion of cost-effective SRP trajectories into a limited set of PIP data, we can achieve a maximum improvement of 41% in success rate in zero-shot scenes, while retaining the ability to transfer manipulation skills to novel targets.

Updated: 2025-03-25 10:11:06

标题: DataPlatter:用最少昂贵数据提升机器人操作的泛化能力

摘要: 随着视觉-语言-动作(VLA)模型在具身智能(embodied AI)领域中日益被采用,对多样化操作演示的需求不断增加。然而,与数据收集相关的高成本经常导致在所有场景中数据覆盖不足,从而限制了模型的性能。我们观察到,大型工作空间中的空间推理阶段(SRP)主导了失败案例。幸运的是,这些数据可以以低成本收集,突显了利用廉价数据来改善模型性能的潜力。在本文中,我们介绍了DataPlatter方法,这是一个将训练轨迹分解为不同任务阶段、并利用丰富且易于收集的SRP数据来增强VLA模型泛化能力的框架。通过分析,我们证明了使用适当比例的额外SRP数据进行子任务特定训练可以作为机器人操作的性能催化剂,最大限度地利用昂贵的物理交互阶段(PIP)数据。实验证明,通过将大比例的具有成本效益的SRP轨迹引入有限的PIP数据集中,我们可以在零样本场景中将成功率最多提高41%,同时保留将操作技能迁移至新目标的能力。

更新时间: 2025-03-25 10:11:06

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2503.19516v1

RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation

As robotic technologies advancing towards more complex multimodal interactions and manipulation tasks, the integration of advanced Vision-Language Models (VLMs) has become a key driver in the field. Despite progress with current methods, challenges persist in fusing depth and RGB information within 3D environments and executing tasks guided by linguistic instructions. In response to these challenges, we have enhanced the existing RoboFlamingo framework by introducing RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance. Our research achieves a nuanced fusion of RGB and depth information by integrating a pre-trained Vision Transformer (ViT) with a resampling technique, closely aligning this combined data with linguistic cues for superior multimodal understanding. The novelty of RoboFlamingo-Plus lies in its adaptation of inputs for depth data processing, leveraging a pre-trained resampler for depth feature extraction, and employing cross-attention mechanisms for optimal feature integration. These improvements allow RoboFlamingo-Plus to not only deeply understand 3D environments but also easily perform complex, language-guided tasks in challenging settings. Experimental results show that RoboFlamingo-Plus boosts robotic manipulation by 10-20% over current methods, marking a significant advancement. Codes and model weights are public at RoboFlamingo-Plus.

Updated: 2025-03-25 10:01:57

标题: RoboFlamingo-Plus: 将深度和RGB感知与视觉语言模型融合,以增强机器人操作

摘要: 随着机器人技术向更复杂的多模式交互和操作任务发展,集成先进的视觉语言模型(VLMs)已成为该领域的关键驱动因素。尽管当前方法取得了进展,但在融合三维环境中的深度和RGB信息以及执行受语言指导的任务方面仍存在挑战。为了应对这些挑战,我们通过引入RoboFlamingo-Plus,增强了现有的RoboFlamingo框架,将深度数据纳入VLMs中,从而显著改进了机器人操作性能。我们的研究通过将经过预训练的视觉Transformer(ViT)与重新采样技术相结合,实现了RGB和深度信息的微妙融合,将这些组合数据与语言提示紧密对齐,以实现优越的多模式理解。RoboFlamingo-Plus的创新之处在于它适应了深度数据处理的输入,利用了预训练的重采样器进行深度特征提取,并采用交叉注意机制进行最佳特征集成。这些改进使得RoboFlamingo-Plus不仅可以深入理解三维环境,还可以轻松在具有挑战性的环境中执行复杂的语言引导任务。实验结果显示,RoboFlamingo-Plus比当前方法提高了10-20%的机器人操作能力,标志着一项重大进步。RoboFlamingo-Plus的代码和模型权重已经公开。

更新时间: 2025-03-25 10:01:57

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2503.19510v1

Improved Alignment of Modalities in Large Vision Language Models

Recent advancements in vision-language models have achieved remarkable results in making language models understand vision inputs. However, a unified approach to align these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large language models or very large datasets, which is not efficient in utilizing existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models to unify vision-language tasks like image captioning and visual question answering. We propose four training stages for aligning the vision model with the language model; in other words, the language model is given the ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we introduce some findings: 1) the attention mask should not be applied on visual inputs, 2) the language model converges faster on AI-generated data, 3) more work should be done in the alignment stage during the pre-training of the model, 4) the model can easily adapt to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After training the model for one epoch for all the stages, it outperforms large models like the 13-billion-parameter VILA on common benchmarks like CIDEr scores on the COCO and Flickr30k datasets, and achieves scores very close to GIT-2 on the same datasets despite being a much smaller model trained on a much smaller dataset. All of the training was done using available best practices such as multi-GPU parallel training, lower-precision training with 16-bit floats, faster attention (SDPA), and gradient accumulation, and was completed within 12 hours.

Updated: 2025-03-25 09:59:46

标题: 大型视觉语言模型中模态的改进对齐

摘要: 最近在视觉-语言模型方面取得了显著进展,使语言模型能够理解视觉输入。然而,将这些模型在诸如图像描述和视觉问答等各种任务上进行对齐的统一方法仍然是一个挑战。现有方法要么需要非常庞大的语言模型,要么需要非常庞大的数据集,这在利用现有模型方面并不高效。本文解决了这一差距,并设计了自回归视觉-语言模型的训练策略,以统一视觉-语言任务,比如图像描述和视觉问答。我们提出了四个训练阶段,以对齐视觉模型和语言模型,换句话说,使语言模型具有处理视觉输入的能力。我们还设计了针对基于Transformer的语言模型的不同注意力掩码,以提高视觉特征的质量。此外,我们介绍了一些发现:1)注意力掩码不应该应用于视觉输入;2)语言模型在AI生成的数据上收敛速度更快;3)在模型的预训练阶段应该在对齐阶段投入更多工作;4)模型可以轻松适应诸如PathVQA等医疗数据集上的视觉问答等下游任务。在为所有阶段训练模型一个epoch后,它在COCO和Flickr30k数据集上的CIDEr分数等常见基准测试中优于130亿参数的VILA等大型模型,并且在相同数据集上达到了与GIT-2非常接近的分数,尽管它是一个在规模小得多的数据集上训练的小得多的模型。所有训练都采用了现有的最佳实践,如多GPU并行训练、使用16位浮点数的低精度训练、更快的注意力(SDPA)和梯度累积,并在12小时内完成。

更新时间: 2025-03-25 09:59:46

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.19508v1
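Finding 1 above — that the causal attention mask should not be applied on visual inputs — can be sketched as a mask builder for a sequence of visual tokens followed by text tokens. The exact masking scheme in the paper may differ; this is an illustrative reading:

```python
import numpy as np

def mixed_attention_mask(n_visual, n_text):
    """Boolean attention mask (True = may attend) for a sequence of
    `n_visual` visual tokens followed by `n_text` text tokens.

    Sketch of the finding: visual tokens attend bidirectionally to all
    visual tokens, while text tokens remain causal over text but can
    see every (earlier) visual token.
    """
    n = n_visual + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal baseline
    mask[:n_visual, :n_visual] = True             # visual block: full attention
    return mask
```

Since visual tokens precede the text, the causal baseline already lets every text token attend to all visual tokens; the only change is making the visual block fully bidirectional.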

A stochastic gradient descent algorithm with random search directions

Stochastic coordinate descent algorithms are efficient methods in which each iterate is obtained by fixing most coordinates at their values from the current iteration, and approximately minimizing the objective with respect to the remaining coordinates. However, this approach is usually restricted to canonical basis vectors of $\mathbb{R}^d$. In this paper, we develop a new class of stochastic gradient descent algorithms with random search directions which uses the directional derivative of the gradient estimate following more general random vectors. We establish the almost sure convergence of these algorithms with decreasing step. We further investigate their central limit theorem and pay particular attention to analyze the impact of the search distributions on the asymptotic covariance matrix. We also provide the non-asymptotic $\mathbb{L}^p$ rates of convergence.

Updated: 2025-03-25 09:54:06

标题: 一种具有随机搜索方向的随机梯度下降算法

摘要: 随机坐标下降算法是高效的方法,其中每次迭代通过将大多数坐标固定在它们在当前迭代中的值,并在剩余坐标上近似最小化目标函数来获得。然而,这种方法通常仅限于$\mathbb{R}^d$的标准基向量。在本文中,我们开发了一类新的带有随机搜索方向的随机梯度下降算法,该算法沿更一般的随机向量计算梯度估计的方向导数。我们建立了这些算法在步长递减时的几乎必然收敛性。我们进一步研究了它们的中心极限定理,并特别关注搜索分布对渐近协方差矩阵的影响。我们还提供了非渐近的$\mathbb{L}^p$收敛率。

更新时间: 2025-03-25 09:54:06

领域: stat.ML,cs.LG,math.OC,math.PR

下载: http://arxiv.org/abs/2503.19942v1
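A minimal sketch of the update rule the abstract describes — a gradient step taken along the directional derivative following a random search direction, with decreasing steps — assuming access to the exact gradient and spherically uniform directions (one choice among the more general search distributions the paper allows):

```python
import numpy as np

def random_direction_sgd(grad, x0, steps=2000, c=1.0, rng=None):
    """Gradient descent along random search directions.

    Sketch of the update x_{n+1} = x_n - gamma_n (grad f(x_n) . v_n) v_n,
    where v_n is a random unit vector (uniform on the sphere here) and
    gamma_n = c / (n + 1) is a decreasing step size. The step schedule
    and direction law are illustrative assumptions.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float).copy()
    for n in range(steps):
        v = rng.standard_normal(x.size)
        v /= np.linalg.norm(v)                 # random unit direction
        gamma = c / (n + 1)
        x -= gamma * (grad(x) @ v) * v         # step along v only
    return x
```

On the quadratic f(x) = ||x||^2 / 2 (so grad f(x) = x), each step shrinks the component of x along v_n, and the iterate drifts toward the minimizer despite never using the full gradient direction.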

Hierarchical Adaptive Expert for Multimodal Sentiment Analysis

Multimodal sentiment analysis has emerged as a critical tool for understanding human emotions across diverse communication channels. While existing methods have made significant strides, they often struggle to effectively differentiate and integrate modality-shared and modality-specific information, limiting the performance of multimodal learning. To address this challenge, we propose the Hierarchical Adaptive Expert for Multimodal Sentiment Analysis (HAEMSA), a novel framework that synergistically combines evolutionary optimization, cross-modal knowledge transfer, and multi-task learning. HAEMSA employs a hierarchical structure of adaptive experts to capture both global and local modality representations, enabling more nuanced sentiment analysis. Our approach leverages evolutionary algorithms to dynamically optimize network architectures and modality combinations, adapting to both partial and full modality scenarios. Extensive experiments demonstrate HAEMSA's superior performance across multiple benchmark datasets. On CMU-MOSEI, HAEMSA achieves a 2.6% increase in 7-class accuracy and a 0.059 decrease in MAE compared to the previous best method. For CMU-MOSI, we observe a 6.3% improvement in 7-class accuracy and a 0.058 reduction in MAE. On IEMOCAP, HAEMSA outperforms the state-of-the-art by 2.84% in weighted-F1 score for emotion recognition. These results underscore HAEMSA's effectiveness in capturing complex multimodal interactions and generalizing across different emotional contexts.

Updated: 2025-03-25 09:52:08

标题: 多模态情感分析的分层自适应专家

摘要: 多模态情感分析已经成为理解跨越不同通信渠道的人类情感的关键工具。尽管现有方法已经取得了重大进展,但它们往往很难有效区分和整合模态共享和模态特定信息,从而限制了多模态学习的性能。为了解决这一挑战,我们提出了用于多模态情感分析的分层自适应专家(HAEMSA)这一新框架,该框架协同地结合了进化优化、跨模态知识迁移和多任务学习。HAEMSA采用自适应专家的分层结构来捕捉全局和局部模态表示,实现更加细致的情感分析。我们的方法利用进化算法动态优化网络架构和模态组合,适应部分和完整模态场景。大量实验证明了HAEMSA在多个基准数据集上的优越性能。在CMU-MOSEI上,与先前最佳方法相比,HAEMSA的7类准确率增加了2.6%,MAE减少了0.059。对于CMU-MOSI,我们观察到7类准确率提高了6.3%,MAE降低了0.058。在IEMOCAP上,HAEMSA在情感识别的加权F1分数上比最先进的方法提高了2.84%。这些结果突显了HAEMSA在捕捉复杂的多模态交互以及在不同情感背景下泛化方面的有效性。

更新时间: 2025-03-25 09:52:08

领域: cs.LG,cs.CV,cs.MM

下载: http://arxiv.org/abs/2503.22715v1

Towards Long-Range ENSO Prediction with an Explainable Deep Learning Model

El Ni\~no-Southern Oscillation (ENSO) is a prominent mode of interannual climate variability with far-reaching global impacts. Its evolution is governed by intricate air-sea interactions, posing significant challenges for long-term prediction. In this study, we introduce CTEFNet, a multivariate deep learning model that synergizes convolutional neural networks and transformers to enhance ENSO forecasting. By integrating multiple oceanic and atmospheric predictors, CTEFNet extends the effective forecast lead time to 20 months while mitigating the impact of the spring predictability barrier, outperforming both dynamical models and state-of-the-art deep learning approaches. Furthermore, CTEFNet offers physically meaningful and statistically significant insights through gradient-based sensitivity analysis, revealing the key precursor signals that govern ENSO dynamics, which align with well-established theories and reveal new insights about inter-basin interactions among the Pacific, Atlantic, and Indian Oceans. The CTEFNet's superior predictive skill and interpretable sensitivity assessments underscore its potential for advancing climate prediction. Our findings highlight the importance of multivariate coupling in ENSO evolution and demonstrate the promise of deep learning in capturing complex climate dynamics with enhanced interpretability.

Updated: 2025-03-25 09:50:19

标题: 朝向可解释的深度学习模型实现长期ENSO预测

摘要: 厄尔尼诺-南方涛动(ENSO)是一种显著的年际气候变异模式,具有深远的全球影响。其演变受复杂的海气相互作用控制,对长期预测提出重大挑战。在本研究中,我们引入了CTEFNet,这是一种结合卷积神经网络和Transformer的多变量深度学习模型,以增强ENSO的预测能力。通过整合多个海洋和大气预测因子,CTEFNet将有效的预测提前期延长至20个月,同时减轻了春季可预测性障碍的影响,优于动力模型和最先进的深度学习方法。此外,CTEFNet通过基于梯度的敏感性分析提供了具有物理意义和统计显著性的见解,揭示了支配ENSO动力学的关键前兆信号,这些信号与已建立的理论一致,并揭示了关于太平洋、大西洋和印度洋之间跨海盆相互作用的新见解。CTEFNet卓越的预测技能和可解释的敏感性评估突显了其推进气候预测的潜力。我们的研究结果强调了多变量耦合在ENSO演变中的重要性,并展示了深度学习在以更强的可解释性捕捉复杂气候动力学方面的前景。

更新时间: 2025-03-25 09:50:19

领域: physics.geo-ph,cs.AI

下载: http://arxiv.org/abs/2503.19502v1

Pose-Based Fall Detection System: Efficient Monitoring on Standard CPUs

Falls among elderly residents in assisted living homes pose significant health risks, often leading to injuries and a decreased quality of life. Current fall detection solutions typically rely on sensor-based systems that require dedicated hardware, or on video-based models that demand high computational resources and GPUs for real-time processing. In contrast, this paper presents a robust fall detection system that does not require any additional sensors or high-powered hardware. The system uses pose estimation techniques, combined with threshold-based analysis and a voting mechanism, to effectively distinguish between fall and non-fall activities. For pose detection, we leverage MediaPipe, a lightweight and efficient framework that enables real-time processing on standard CPUs with minimal computational overhead. By analyzing motion, body position, and key pose points, the system processes pose features with a 20-frame buffer, minimizing false positives and maintaining high accuracy even in real-world settings. This unobtrusive, resource-efficient approach provides a practical solution for enhancing resident safety in old age homes, without the need for expensive sensors or high-end computational resources.

Updated: 2025-03-25 09:49:36

标题: 基于姿势的跌倒检测系统:在标准CPU上高效监测

摘要: 在辅助生活设施中,老年居民摔倒造成显著的健康风险,通常会导致受伤和生活质量下降。当前的摔倒检测解决方案通常依赖于传感器系统,需要专用硬件,或者依赖于视频模型,需要大量计算资源和GPU进行实时处理。相比之下,本文提出了一个强大的摔倒检测系统,不需要任何额外的传感器或高性能硬件。该系统利用姿势估计技术,结合基于阈值的分析和投票机制,有效区分摔倒和非摔倒活动。对于姿势检测,我们利用了MediaPipe,这是一个轻量级高效的框架,可以在标准CPU上进行实时处理,计算开销最小。通过分析运动、身体位置和关键姿势点,系统使用20帧缓冲处理姿势特征,最大程度减少误报,并在真实世界环境中保持高准确性。这种不显眼、资源高效的方法为老年院提供了提高居民安全的实用解决方案,无需昂贵的传感器或高端计算资源。

更新时间: 2025-03-25 09:49:36

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.19501v1
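The decision stage described above — a per-frame threshold verdict smoothed by a voting mechanism over a 20-frame buffer — can be sketched as follows. The 0.5 vote ratio and the exact per-frame criterion are assumed parameters for illustration, not taken from the paper:

```python
from collections import deque

class FallVoter:
    """Threshold + voting over a rolling 20-frame buffer.

    Each frame yields a binary verdict (e.g. torso keypoint below a
    height threshold, from pose estimation); a fall is declared only
    when a majority of the last 20 frames agree, which suppresses
    one-off false positives.
    """

    def __init__(self, window=20, vote_ratio=0.5):
        self.buffer = deque(maxlen=window)
        self.vote_ratio = vote_ratio

    def update(self, frame_is_fall):
        """Add one frame's verdict; return True if a fall is declared."""
        self.buffer.append(bool(frame_is_fall))
        full = self.buffer.maxlen
        return len(self.buffer) == full and sum(self.buffer) / full >= self.vote_ratio
```

A single spurious positive frame among normal frames never triggers a detection, while a sustained run of positive frames does once the vote ratio is reached.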

DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment

Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face, while transferring the target head pose and facial expressions. Existing GAN-based methods suffer from either distortions and visual artifacts or poor reconstruction quality, i.e., the background and several important appearance details, such as hair style/color, glasses and accessories, are not faithfully reconstructed. Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images. To this end, in this paper we present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment. Specifically, we propose to control the semantic space of a Diffusion Autoencoder (DiffAE), in order to edit the facial pose of the input images, defined as the head pose orientation and the facial expressions. Our method allows one-shot, self, and cross-subject reenactment, without requiring subject-specific fine-tuning. We compare against state-of-the-art GAN-, StyleGAN2-, and diffusion-based methods, showing better or on-par reenactment performance.

Updated: 2025-03-25 09:47:55

标题: DiffusionAct:可控扩散自动编码器用于一次性人脸再现

摘要: 视频驱动的神经面部重现旨在合成保留源面部身份和外表的真实面部图像,同时传输目标头部姿势和面部表情。现有基于GAN的方法要么存在失真和视觉伪影,要么重建质量较差,即背景和一些重要外观细节(如发型/颜色、眼镜和配饰)没有被忠实重建。最近的扩散概率模型(DPMs)的进展使得生成高质量真实图像成为可能。因此,在本文中,我们提出了DiffusionAct,一种利用扩散模型的照片逼真图像生成来进行神经面部重现的新方法。具体而言,我们提出控制扩散自动编码器(DiffAE)的语义空间,以编辑输入图像的面部姿势,即头部姿势方向和面部表情。我们的方法允许一次性、自我和跨主体的重现,而无需特定主题的微调。我们与最先进的GAN、StyleGAN2和基于扩散的方法进行比较,展示出更好或相当的重现性能。

更新时间: 2025-03-25 09:47:55

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2403.17217v2

SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling

Steganography embeds confidential data within seemingly innocuous communications. Provable security in steganography, a long-sought goal, has become feasible with deep generative models. However, existing methods face a critical trade-off between security and efficiency. This paper introduces SparSamp, an efficient provably secure steganography method based on sparse sampling. SparSamp embeds messages by combining them with pseudo-random numbers to obtain message-derived random numbers for sampling. It enhances extraction accuracy and embedding capacity by increasing the sampling intervals and making the sampling process sparse. SparSamp preserves the original probability distribution of the generative model, thus ensuring security. It introduces only $O(1)$ additional complexity per sampling step, enabling the fastest embedding speed without compromising generation speed. SparSamp is designed to be plug-and-play; message embedding can be achieved by simply replacing the sampling component of an existing generative model with SparSamp. We implemented SparSamp in text, image, and audio generation models. It can achieve embedding speeds of up to 755 bits/second with GPT-2, 5,046 bits/second with DDPM, and 9,223 bits/second with WaveRNN.

Updated: 2025-03-25 09:47:17

标题: SparSamp:基于稀疏采样的高效可证安全隐写术

摘要: 隐写术将机密数据嵌入看似无害的通信中。隐写术中的可证明安全性一直是一个长期追求的目标,随着深度生成模型的出现,这一目标变得可行。然而,现有方法在安全性和效率之间存在关键权衡。本文介绍了一种基于稀疏采样的高效可证明安全的隐写术方法SparSamp。SparSamp通过将消息与伪随机数相结合,以获取基于消息的随机数进行采样来嵌入消息。通过增加采样间隔并使采样过程稀疏,它提高了提取准确性和嵌入容量。SparSamp保留了生成模型的原始概率分布,从而确保安全性。它每个采样步骤仅引入$O(1)$额外复杂度,使嵌入速度最快,同时不影响生成速度。SparSamp设计为即插即用;通过简单替换现有生成模型的采样组件,即可实现消息嵌入。我们在文本、图像和音频生成模型中实现了SparSamp。它在GPT-2下可以实现高达755比特/秒的嵌入速度,DDPM下为5046比特/秒,WaveRNN下为9223比特/秒。

更新时间: 2025-03-25 09:47:17

领域: cs.CR

下载: http://arxiv.org/abs/2503.19499v1

STATGRAPH: Effective In-vehicle Intrusion Detection via Multi-view Statistical Graph Learning

In-vehicle networks (IVNs) are facing complex external cyber-attacks, especially the emerging masquerade attacks, which are extremely difficult to detect yet seriously damaging. In this paper, we propose STATGRAPH, an effective and fine-grained intrusion detection methodology for IVN security services via multi-view statistical graph learning on in-vehicle controller area network (CAN) messages, with insight into their variations in periodicity, payload and signal combinations. Specifically, STATGRAPH generates two statistical graphs, the timing correlation graph (TCG) and the coupling relationship graph (CRG), in every CAN message detection window, where edge attributes in TCGs represent temporal correlation between different message IDs while edge attributes in CRGs denote the neighbour relationship and contextual similarity. In addition, a lightweight shallow-layered graph convolution network is trained on the graph properties of TCGs and CRGs; it learns the universal laws of various patterns more effectively and further enhances detection performance. To address the problem of insufficient attack types in previous intrusion detection work, we select two real in-vehicle CAN datasets covering five new instances of sophisticated and stealthy masquerade attacks that have never been investigated before. Experimental results show STATGRAPH improves both detection granularity and detection performance over state-of-the-art intrusion detection methods. Code is available at https://github.com/wangkai-tech23/StatGraph.

Updated: 2025-03-25 09:44:10

标题: STATGRAPH:通过多视图统计图学习实现有效的车载入侵检测

摘要: 车载网络(IVN)面临复杂的外部网络攻击,尤其是新兴的伪装攻击,其检测难度极高,同时造成严重破坏。本文提出了STATGRAPH,这是一种有效且精细的车载网络安全服务入侵检测方法,通过对车辆控制区域网络(CAN)消息进行多视图统计图学习,洞察它们在周期性、有效载荷和信号组合方面的变化。具体来说,STATGRAPH在每个CAN消息检测窗口中生成两个统计图,即时序相关图(TCG)和耦合关系图(CRG),其中TCG中的边属性表示不同消息ID之间的时间相关性,而CRG中的边属性表示邻居关系和上下文相似性。此外,基于TCG和CRG的图属性训练了一个轻量级浅层图卷积网络,它更有效地学习各种模式的普遍规律,并进一步提高了检测性能。为解决先前入侵检测中攻击类型不足的问题,我们选择了两个真实的车载CAN数据集,涵盖了五种新颖且隐秘的伪装攻击实例,这些攻击以前从未被调查过。实验结果显示,STATGRAPH相对于最先进的入侵检测方法提高了检测的细粒度和性能。代码可在https://github.com/wangkai-tech23/StatGraph找到。

更新时间: 2025-03-25 09:44:10

领域: cs.NI,cs.AI,cs.CR

下载: http://arxiv.org/abs/2311.07056v2
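The timing correlation graph (TCG) idea can be sketched as a toy graph builder over a window of CAN message IDs. Here edge weights are simple transition counts standing in for the richer temporal-correlation attributes the paper computes; this is an illustrative reading of the abstract, not the authors' construction:

```python
from collections import Counter

def timing_correlation_graph(can_ids):
    """Build a toy timing-correlation graph from a window of CAN IDs.

    Nodes are message IDs; a directed edge (a, b) is weighted by how
    often message b directly follows message a in the detection window.
    """
    edges = Counter(zip(can_ids, can_ids[1:]))  # consecutive ID pairs
    nodes = set(can_ids)
    return nodes, dict(edges)
```

A masquerade attack that injects a message with a normally periodic ID would perturb these edge weights relative to a benign window, which is the kind of statistical deviation a downstream graph classifier can pick up.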

Rank Reduction Autoencoders

The choice of an appropriate bottleneck dimension and the application of effective regularization are both essential for Autoencoders to learn meaningful representations from unlabeled data. In this paper, we introduce a new class of deterministic autoencoders, Rank Reduction Autoencoders (RRAEs), which regularize their latent spaces by employing a truncated singular value decomposition (SVD) during training. In RRAEs, the bottleneck is defined by the rank of the latent matrix, thereby alleviating the dependence of the encoder/decoder architecture on the bottleneck size. This approach enabled us to propose an adaptive algorithm (aRRAEs) that efficiently determines the optimal bottleneck size during training. We empirically demonstrate that both RRAEs and aRRAEs are stable, scalable, and reliable, as they do not introduce any additional training hyperparameters. We evaluate our proposed architecture on a synthetic data set, as well as on MNIST, Fashion MNIST, and CelebA. Our results show that RRAEs offer several advantages over Vanilla AEs with both large and small latent spaces, and outperform other regularizing AE architectures.

Updated: 2025-03-25 09:41:34

标题: 秩降自编码器

摘要: 选择适当的瓶颈维度并应用有效的正则化,对于自编码器从未标记数据中学习有意义的表示是至关重要的。在本文中,我们引入了一类新的确定性自编码器,称为秩降自编码器(RRAEs),它通过在训练过程中采用截断奇异值分解(SVD)来对其潜在空间进行正则化。在RRAEs中,瓶颈由潜在矩阵的秩定义,从而减轻了编码器/解码器架构对瓶颈大小的依赖。这种方法使我们能够提出一种自适应算法(aRRAEs),在训练过程中有效确定最佳瓶颈大小。我们通过实验证明,RRAEs和aRRAEs都是稳定、可扩展且可靠的,因为它们不引入任何额外的训练超参数。我们在合成数据集以及MNIST、Fashion MNIST和CelebA上对我们提出的架构进行了评估。我们的结果表明,RRAEs在大和小潜在空间下都比普通自编码器具有多种优势,并且优于其他正则化自编码器架构。

更新时间: 2025-03-25 09:41:34

领域: cs.LG

下载: http://arxiv.org/abs/2405.13980v3
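The rank-based bottleneck can be sketched with a truncated SVD applied to a batch of latent vectors; this is an illustrative reading of the abstract, not the authors' training code:

```python
import numpy as np

def rank_reduce(latent, rank):
    """Truncated-SVD regularisation of a batch latent matrix.

    Sketch of the RRAE bottleneck: the (batch x latent_dim) matrix is
    replaced by its best rank-`rank` approximation, so the effective
    bottleneck is the rank, not the latent width.
    """
    u, s, vt = np.linalg.svd(latent, full_matrices=False)
    s[rank:] = 0.0               # keep only the top-`rank` singular values
    return (u * s) @ vt
```

Because the latent width stays fixed while only the rank changes, an adaptive variant (like the paper's aRRAEs) can tune the bottleneck during training without rebuilding the encoder/decoder.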

SMT-EX: An Explainable Surrogate Modeling Toolbox for Mixed-Variables Design Exploration

Surrogate models are of high interest for many engineering applications, serving as cheap-to-evaluate, time-efficient approximations of black-box functions to help engineers and practitioners make decisions and understand complex systems. As such, the need for explainability methods is rising and many studies have been performed to facilitate knowledge discovery from surrogate models. To respond to these enquiries, this paper introduces SMT-EX, an enhancement of the open-source Python Surrogate Modeling Toolbox (SMT) that integrates explainability techniques into a state-of-the-art surrogate modelling framework. More precisely, SMT-EX includes three key explainability methods: Shapley Additive Explanations, Partial Dependence Plot, and Individual Conditional Expectations. A dedicated explainability dependency of SMT has been developed for this purpose that can be easily activated once the surrogate model is built, offering a user-friendly and efficient tool for swift insight extraction. The effectiveness of SMT-EX is showcased through two test cases. The first case is a 10-variable wing weight problem with purely continuous variables and the second one is a 3-variable mixed-categorical cantilever beam bending problem. Relying on SMT-EX analyses for these problems, we demonstrate its versatility in addressing a diverse range of problem characteristics. SMT-Explainability is freely available on GitHub: https://github.com/SMTorg/smt-explainability .

Updated: 2025-03-25 09:38:27

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2503.19496v1

Nonparametric estimation of Hawkes processes with RKHSs

This paper addresses nonparametric estimation of nonlinear multivariate Hawkes processes, where the interaction functions are assumed to lie in a reproducing kernel Hilbert space (RKHS). Motivated by applications in neuroscience, the model allows complex interaction functions that express exciting and inhibiting effects, as well as a combination of both (which is particularly interesting for modeling the refractory period of neurons), and in return considers conditional intensities rectified by the ReLU function. The latter feature incurs several methodological challenges, for which workarounds are proposed in this paper. In particular, it is shown that a representer theorem can be obtained for approximated versions of the log-likelihood and the least-squares criteria. Based on it, we propose an estimation method that relies on two common approximations (of the ReLU function and of the integral operator), and we provide a bound that controls the impact of these approximations. Numerical results on synthetic data confirm this control, as well as the good asymptotic behavior of the proposed estimator. They also show that our method achieves better performance than related nonparametric estimation techniques and suits neuronal applications.
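
To make the model concrete, the ReLU-rectified conditional intensity can be written out directly. The sketch below uses hand-picked exponential interaction functions purely for illustration; the paper learns these functions nonparametrically in an RKHS:

```python
import numpy as np

def intensity(t, events, mu, h):
    """Conditional intensity of a nonlinear multivariate Hawkes process:
    lambda_i(t) = ReLU( mu_i + sum_j sum_{s in events[j], s < t} h[i][j](t - s) ).
    h[i][j] are interaction functions (inhibition allowed via negative values)."""
    dim = len(mu)
    lam = np.array(mu, dtype=float)
    for i in range(dim):
        for j in range(dim):
            for s in events[j]:
                if s < t:
                    lam[i] += h[i][j](t - s)
    return np.maximum(lam, 0.0)  # ReLU rectification keeps intensities nonnegative

# two neurons: neuron 0 excites itself, neuron 1 inhibits neuron 0
h = [[lambda u: 0.8 * np.exp(-u), lambda u: -1.5 * np.exp(-u)],
     [lambda u: 0.0 * u,          lambda u: 0.3 * np.exp(-u)]]
events = [[0.5, 1.0], [0.9]]   # past event times per neuron
lam = intensity(1.2, events, mu=[0.2, 0.2], h=h)
```

The combination of excitation and inhibition within one intensity is exactly what makes the rectification necessary: without the ReLU, inhibitory kernels could drive the intensity negative.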

Updated: 2025-03-25 09:35:34

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2411.00621v2

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect path can incur significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like the Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC .
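
The table-lookup idea can be shown on a single 1-bit weight plane: group the activations, precompute the partial sums for all 2^g bit patterns of each group once, and turn every g-bit weight chunk into one table index. This is a greatly simplified sketch of the mechanism, not the T-MAC kernel itself; in the real system the per-group tables are reused across many weight rows, which is what amortizes their construction:

```python
import numpy as np

def lut_dot_1bit(weights_bits, activations, g=4):
    """Dot product between a {0,1} weight bit-plane and float activations
    via table lookup: each g-activation group gets a 2^g-entry table of
    partial sums, and each g-bit weight chunk becomes a single lookup."""
    n = len(activations)
    assert n % g == 0 and len(weights_bits) == n
    total = 0.0
    for start in range(0, n, g):
        act = activations[start:start + g]
        # build the 2^g-entry lookup table for this activation group
        table = np.zeros(2 ** g)
        for pattern in range(2 ** g):
            table[pattern] = sum(act[b] for b in range(g) if (pattern >> b) & 1)
        # pack the matching g weight bits into a table index
        idx = 0
        for b in range(g):
            idx |= int(weights_bits[start + b]) << b
        total += table[idx]
    return total

rng = np.random.default_rng(2)
w = rng.integers(0, 2, size=8)
x = rng.normal(size=8)
assert np.isclose(lut_dot_1bit(w, x), float(w @ x))  # matches plain multiply-accumulate
```

Multi-bit weights decompose into several such bit-planes combined with power-of-two scales, which is why the kernels scale linearly with bit-width.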

Updated: 2025-03-25 09:27:16

Categories: cs.DC,cs.AI

Download: http://arxiv.org/abs/2407.00088v2

Learning Causal Transition Matrix for Instance-dependent Label Noise

Noisy labels are both inevitable and problematic in machine learning methods, as they negatively impact models' generalization ability by causing overfitting. In the context of learning with noise, the transition matrix plays a crucial role in the design of statistically consistent algorithms. However, the transition matrix is often considered unidentifiable. One strand of methods typically addresses this problem by assuming that the transition matrix is instance-independent; that is, the probability of mislabeling a particular instance is not influenced by its characteristics or attributes. This assumption is clearly invalid in complex real-world scenarios. To better understand the transition relationship and relax this assumption, we propose to study the data generation process of noisy labels from a causal perspective. We discover that an unobservable latent variable can affect either the instance itself, the label annotation procedure, or both, which complicates the identification of the transition matrix. To address various scenarios, we have unified these observations within a new causal graph. In this graph, the input instance is divided into a noise-resistant component and a noise-sensitive component based on whether they are affected by the latent variable. These two components contribute to identifying the ``causal transition matrix'', which approximates the true transition matrix with theoretical guarantee. In line with this, we have designed a novel training framework that explicitly models this causal relationship and, as a result, achieves a more accurate model for inferring the clean label.

Updated: 2025-03-25 09:23:55

Categories: cs.LG

Download: http://arxiv.org/abs/2412.13516v4

A Note on Estimation Error Bound and Grouping Effect of Transfer Elastic Net

The Transfer Elastic Net is an estimation method for linear regression models that combines $\ell_1$ and $\ell_2$ norm penalties to facilitate knowledge transfer. In this study, we derive a non-asymptotic $\ell_2$ norm estimation error bound for the estimator and discuss scenarios where the Transfer Elastic Net effectively works. Furthermore, we examine situations where it exhibits the grouping effect, which states that the estimates corresponding to highly correlated predictors have a small difference.
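
The penalized objective is simple enough to write down. The exact form below is our assumption for illustration (the paper may weight the raw coefficients and the deviation from the source estimate differently); the squared ℓ2 term is what induces the grouping effect, i.e. near-equal estimates for highly correlated predictors:

```python
import numpy as np

def transfer_enet_objective(beta, X, y, beta_src, lam1, lam2):
    """Least squares plus elastic-net-style penalties on the deviation
    from a source-task estimate beta_src. This exact form is assumed
    here for illustration; see the paper for the precise penalty."""
    n = X.shape[0]
    r = y - X @ beta
    d = beta - beta_src
    return 0.5 / n * (r @ r) + lam1 * np.abs(d).sum() + lam2 * (d @ d)

# when beta equals the source estimate, only the data-fit term remains
X = np.eye(3)
y = np.array([1.0, 2.0, 3.0])
beta_src = np.array([1.0, 2.0, 3.0])
val = transfer_enet_objective(beta_src, X, y, beta_src, lam1=0.1, lam2=0.1)
```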

Updated: 2025-03-25 09:21:15

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2412.01010v2

Body Discovery of Embodied AI

In the pursuit of realizing artificial general intelligence (AGI), the importance of embodied artificial intelligence (AI) becomes increasingly apparent. Following this trend, research integrating robots with AGI has become prominent. As various kinds of embodiments have been designed, adaptability to diverse embodiments will become important to AGI. We introduce a new challenge, termed "Body Discovery of Embodied AI", focusing on the tasks of recognizing embodiments and summarizing neural signal functionality. The challenge encompasses the precise definition of an AI body and the intricate task of identifying embodiments in dynamic environments, where conventional approaches often prove inadequate. To address these challenges, we apply a causal inference method and evaluate it by developing a simulator tailored for testing algorithms in virtual environments. Finally, we validate the efficacy of our algorithms through empirical testing, demonstrating their robust performance in various scenarios based on virtual environments.

Updated: 2025-03-25 09:21:10

Categories: cs.RO,cs.AI,cs.NE

Download: http://arxiv.org/abs/2503.19941v1

Bayesian Optimization of a Lightweight and Accurate Neural Network for Aerodynamic Performance Prediction

Ensuring high accuracy and efficiency of predictive models is paramount in the aerospace industry, particularly in the context of multidisciplinary design and optimization processes. These processes often require numerous evaluations of complex objective functions, which can be computationally expensive and time-consuming. To build efficient and accurate predictive models, we propose a new approach that leverages Bayesian Optimization (BO) to optimize the hyper-parameters of a lightweight and accurate Neural Network (NN) for aerodynamic performance prediction. To clearly describe the interplay between design variables, hierarchical and categorical kernels are used in the BO formulation. We demonstrate the efficiency of our approach through two comprehensive case studies, where the optimized NN significantly outperforms baseline models and other publicly available NNs in terms of accuracy and parameter efficiency. For the drag coefficient prediction task, the Mean Absolute Percentage Error (MAPE) of our optimized model drops from 0.1433% to 0.0163%, which is nearly an order of magnitude improvement over the baseline model. Additionally, our model achieves a MAPE of 0.82% on a benchmark aircraft self-noise prediction problem, significantly outperforming existing models (where their MAPE values are around 2 to 3%) while requiring less computational resources. The results highlight the potential of our framework to enhance the scalability and performance of NNs in large-scale MDO problems, offering a promising solution for the aerospace industry.
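
For reference, the reported metric is the Mean Absolute Percentage Error, computed as the mean of absolute relative errors scaled to percent:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent:
    100 * mean(|y_true - y_pred| / |y_true|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# 1% relative error on both samples -> MAPE of 1.0
assert np.isclose(mape([100.0, 200.0], [99.0, 202.0]), 1.0)
```

At the scale the paper reports (0.0163%), the optimized surrogate's predictions deviate from the truth by less than two parts in ten thousand on average.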

Updated: 2025-03-25 09:14:36

Categories: cs.LG,math.OC,stat.ML

Download: http://arxiv.org/abs/2503.19479v1

Extracting Interpretable Logic Rules from Graph Neural Networks

Graph neural networks (GNNs) operate over both input feature spaces and combinatorial graph structures, making it challenging to understand the rationale behind their predictions. As GNNs gain widespread popularity and demonstrate success across various domains, such as drug discovery, studying their interpretability has become a critical task. To address this, many explainability methods have been proposed, with recent efforts shifting from instance-specific explanations to global concept-based explainability. However, these approaches face several limitations, such as relying on predefined concepts and explaining only a limited set of patterns. To address this, we propose a novel framework, LOGICXGNN, for extracting interpretable logic rules from GNNs. LOGICXGNN is model-agnostic, efficient, and data-driven, eliminating the need for predefined concepts. More importantly, it can serve as a rule-based classifier and even outperform the original neural models. Its interpretability facilitates knowledge discovery, as demonstrated by its ability to extract detailed and accurate chemistry knowledge that is often overlooked by existing methods. Another key advantage of LOGICXGNN is its ability to generate new graph instances in a controlled and transparent manner, offering significant potential for applications such as drug design. We empirically demonstrate these merits through experiments on real-world datasets such as MUTAG and BBBP.

Updated: 2025-03-25 09:09:46

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19476v1

A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition

In the domain of multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches have difficulty adequately capturing the intrinsic connections between the modalities and overlook the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representations with label descriptions produced by a large language model. Comprehensive experiments indicate that A-MESS achieves state-of-the-art results and provides substantial insight into multimodal representation and downstream tasks.

Updated: 2025-03-25 09:09:30

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19474v1

KL-geodesics flow matching with a novel sampling scheme

Non-autoregressive language models generate all tokens simultaneously, offering potential speed advantages over traditional autoregressive models, but they face challenges in modeling the complex dependencies inherent in text data. In this work, we investigate a conditional flow matching approach for text generation. We represent tokens as one-hot vectors in a \(V\)-dimensional simplex and utilize geodesics under the Kullback-Leibler (KL) divergence, which correspond to linear interpolation in logit space. We provide a theoretical justification that maximizing the conditional likelihood \(P_{\theta}(x_1 \mid x_t, t)\) yields the exact flow matching velocity under logit interpolation. To address the suboptimal performance of basic inference, we propose a novel empirical sampling scheme that iteratively samples from the conditional distribution and introduces additional noise, significantly improving results despite lacking full theoretical underpinnings. Furthermore, we propose a hybrid inference method that combines the basic approach with the sampling scheme. This method demonstrates superior performance on both conditional and unconditional text generation experiments compared to the previous state-of-the-art method for discrete flow matching.
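
The geometric core is compact: a KL geodesic between two categorical distributions is a straight line in logit space, mapped back to the simplex by softmax. A small numpy sketch follows; the sharply peaked target distribution stands in for the one-hot vertex, whose logits are infinite in the limit:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logit_geodesic(logits0, logits1, t):
    """Point at time t on the KL geodesic between the categorical
    distributions with logits logits0 and logits1: linear interpolation
    in logit space, then softmax back onto the simplex."""
    return softmax((1.0 - t) * logits0 + t * logits1)

V = 5
logits0 = np.zeros(V)                          # uniform source distribution
logits1 = np.full(V, -8.0)
logits1[2] = 8.0                               # sharply peaked target (one-hot limit)
p_mid = logit_geodesic(logits0, logits1, 0.5)  # halfway point on the path
```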

Updated: 2025-03-25 09:09:07

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2411.16821v4

Lightweight Embedded FPGA Deployment of Learned Image Compression with Knowledge Distillation and Hybrid Quantization

Learnable Image Compression (LIC) has shown the potential to outperform standardized video codecs in RD efficiency, prompting research into hardware-friendly implementations. Most existing LIC hardware implementations prioritize latency over RD-efficiency, relying on an extensive exploration of the hardware design space. We present a novel design paradigm where the burden of tuning the design for a specific hardware platform is shifted towards model dimensioning, without compromising on RD-efficiency. First, we design a framework for distilling a leaner student LIC model from a reference teacher: by tuning a single model hyperparameter, we can meet the constraints of different hardware platforms without a complex hardware design exploration. Second, we propose a hardware-friendly implementation of the Generalized Divisive Normalization (GDN) activation that preserves RD efficiency even after parameter quantization. Third, we design a pipelined FPGA configuration that takes full advantage of available FPGA resources by leveraging parallel processing and optimizing resource allocation. Our experiments with a state-of-the-art LIC model show that we outperform all existing FPGA implementations while performing very close to the original model.

Updated: 2025-03-25 09:08:09

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.04832v5

Understanding and Reducing the Class-Dependent Effects of Data Augmentation with A Two-Player Game Approach

Data augmentation is widely applied and has shown its benefits in different machine learning tasks. However, as recently observed, it may have an unfair effect in multi-class classification. While data augmentation generally improves the overall performance (and therefore is beneficial for many classes), it can actually be detrimental for other classes, which can be problematic in some application domains. In this paper, to counteract this phenomenon, we propose CLAM, a CLAss-dependent Multiplicative-weights method. To derive it, we first formulate the training of a classifier as a non-linear optimization problem that aims at simultaneously maximizing the individual class performances and balancing them. By rewriting this optimization problem as an adversarial two-player game, we propose a novel multiplicative weight algorithm, for which we prove the convergence. Interestingly, our formulation also reveals that the class-dependent effects of data augmentation is not due to data augmentation only, but is in fact a general phenomenon. Our empirical results over five datasets demonstrate that the performance of learned classifiers is indeed more fairly distributed over classes, with only limited impact on the average accuracy.
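
The adversarial player in such a two-player formulation is typically a multiplicative-weights update over classes. The generic sketch below is our illustration of that mechanism, not CLAM's exact update rule: classes the classifier serves worst accumulate exponentially more weight, which would in turn reweight their contribution to the training loss:

```python
import numpy as np

def mw_update(w, class_losses, eta=0.5):
    """One multiplicative-weights step over classes: scale each class
    weight by exp(eta * loss), then renormalize to a distribution."""
    w = w * np.exp(eta * np.asarray(class_losses))
    return w / w.sum()

w = np.full(3, 1.0 / 3.0)                      # start from uniform class weights
for _ in range(10):
    w = mw_update(w, class_losses=[0.2, 0.9, 0.4])
# the hardest class (index 1) ends up with the largest weight
```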

Updated: 2025-03-25 09:05:02

Categories: cs.CY,cs.AI,cs.CV,cs.GT,cs.LG

Download: http://arxiv.org/abs/2407.03146v3

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.

Updated: 2025-03-25 09:00:58

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.19470v1

Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning

In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.
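
Mechanically, soft prompt tuning only prepends a small trainable matrix to the frozen token embedding sequence before it enters the transformer; everything else stays fixed. A schematic numpy sketch (shapes are illustrative, not RoSPrompt's configuration):

```python
import numpy as np

def prepend_soft_prompt(token_embeds, soft_prompt):
    """Concatenate learnable prompt vectors in front of the (frozen)
    token embeddings; in training, only soft_prompt receives gradients."""
    return np.concatenate([soft_prompt, token_embeds], axis=0)

d_model, n_prompt, n_tokens = 16, 4, 10
rng = np.random.default_rng(3)
soft_prompt = rng.normal(size=(n_prompt, d_model))   # the trained parameters
token_embeds = rng.normal(size=(n_tokens, d_model))  # from the frozen PLM
seq = prepend_soft_prompt(token_embeds, soft_prompt)
```

Because only the prompt matrix is trained, the approach stays lightweight and data-efficient, which is what makes it attractive for small multilingual PLMs.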

Updated: 2025-03-25 09:00:25

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.19469v1

A Probabilistic Neuro-symbolic Layer for Algebraic Constraint Satisfaction

In safety-critical applications, guaranteeing the satisfaction of constraints over continuous environments is crucial, e.g., an autonomous agent should never crash into obstacles or go off-road. Neural models struggle in the presence of these constraints, especially when they involve intricate algebraic relationships. To address this, we introduce a differentiable probabilistic layer that guarantees the satisfaction of non-convex algebraic constraints over continuous variables. This probabilistic algebraic layer (PAL) can be seamlessly plugged into any neural architecture and trained via maximum likelihood without requiring approximations. PAL defines a distribution over conjunctions and disjunctions of linear inequalities, parameterized by polynomials. This formulation enables efficient and exact renormalization via symbolic integration, which can be amortized across different data points and easily parallelized on a GPU. We showcase PAL and our integration scheme on a number of benchmarks for algebraic constraint integration and on real-world trajectory data.

Updated: 2025-03-25 08:58:04

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19466v1

Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Diffusion models have emerged as mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains while promising scaling properties.
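
The routing idea is easy to sketch: rather than each token independently keeping its own top-k experts, all token-expert affinity scores race together for a global top-k, so capacity flows to the most critical token-expert pairs. A simplified numpy illustration of that selection (our sketch of the mechanism, not the paper's implementation):

```python
import numpy as np

def expert_race_routing(scores, k):
    """Global top-k over the flattened (tokens x experts) affinity matrix:
    returns a boolean mask that is True for the k activated pairs."""
    flat = scores.ravel()
    top = np.argsort(flat)[-k:]        # indices of the k largest scores overall
    mask = np.zeros_like(flat, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)

rng = np.random.default_rng(4)
scores = rng.normal(size=(6, 4))       # 6 tokens, 4 experts
mask = expert_race_routing(scores, k=8)
# some tokens may win several experts, others none; per-token top-k cannot do this
```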

Updated: 2025-03-25 08:56:54

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.16057v2

A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

Remarkable strides in computational pathology (CPath) have been made by task-agnostic foundation models (FMs) that advance the performance of a wide array of downstream clinical tasks. Despite the promising performance, there are still several challenges. First, prior works have resorted to either vision-only or image-caption data, disregarding pathology reports, which carry more clinically authentic information from pathologists, and gene expression profiles, which offer distinct knowledge for versatile clinical applications. Second, current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Even recent slide-level FMs still struggle to provide whole-slide context for patch representation. In this study, for the first time, we develop a pathology foundation model incorporating three levels of modalities: pathology slides, pathology reports, and gene expression data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types, amounting to over 116 million pathological patch images. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm that injects the multimodal whole-slide context into the patch representation, called Multimodal Self-TAught PRetraining (mSTAR). The proposed paradigm revolutionizes the pretraining workflow for CPath, enabling the pathology FM to acquire whole-slide context. To the best of our knowledge, this is the first attempt to incorporate three modalities at the whole-slide context for enhancing pathology FMs. To systematically evaluate the capabilities of mSTAR, we built the largest spectrum of oncological benchmarks, spanning 7 categories of oncological applications in 97 practical oncological tasks of 15 types.

Updated: 2025-03-25 08:49:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2407.15362v3

Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition

Emotional Mimicry Intensity (EMI) estimation plays a pivotal role in understanding human social behavior and advancing human-computer interaction. The core challenges lie in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods, namely insufficient exploitation of cross-modal synergies, sensitivity to noise, and constrained fine-grained alignment capabilities, this paper proposes a dual-stage cross-modal alignment framework. Stage 1 develops vision-text and audio-text contrastive learning networks based on a CLIP architecture, achieving preliminary feature-space alignment through modality-decoupled pre-training. Stage 2 introduces a temporal-aware dynamic fusion module integrating Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to capture macro-evolution patterns of facial expressions and local dynamics of acoustic features, respectively. A novel quality-guided fusion strategy further enables differentiable weight allocation for modality compensation under occlusion and noise. Experiments on the Hume-Vidmimic2 dataset demonstrate superior performance with an average Pearson correlation coefficient of 0.51 across six emotion dimensions on the validation set. Remarkably, our method achieved 0.68 on the test set, securing runner-up in the EMI Challenge Track of the 8th ABAW (Affective Behavior Analysis in the Wild) Competition, offering a novel pathway for fine-grained emotion analysis in open environments.

Updated: 2025-03-25 08:46:00

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.10603v3

Data-centric Federated Graph Learning with Large Language Models

In federated graph learning (FGL), a complete graph is divided into multiple subgraphs stored in each client due to privacy concerns, and all clients jointly train a global graph model by only transmitting model parameters. A pain point of FGL is the heterogeneity problem, where nodes or structures present non-IID properties among clients (e.g., different node label distributions), dramatically undermining the convergence and performance of FGL. To address this, existing efforts focus on design strategies at the model level, i.e., they design models to extract common knowledge to mitigate heterogeneity. However, these model-level strategies fail to fundamentally address the heterogeneity problem as the model needs to be designed from scratch when transferring to other tasks. Motivated by large language models (LLMs) having achieved remarkable success, we aim to utilize LLMs to fully understand and augment local text-attributed graphs, to address data heterogeneity at the data level. In this paper, we propose a general framework LLM4FGL that innovatively decomposes the task of LLM for FGL into two sub-tasks theoretically. Specifically, for each client, it first utilizes the LLM to generate missing neighbors and then infers connections between generated nodes and raw nodes. To improve the quality of generated nodes, we design a novel federated generation-and-reflection mechanism for LLMs, without the need to modify the parameters of the LLM but relying solely on the collective feedback from all clients. After neighbor generation, all the clients utilize a pre-trained edge predictor to infer the missing edges. Furthermore, our framework can seamlessly integrate as a plug-in with existing FGL methods. Experiments on three real-world datasets demonstrate the superiority of our method compared to advanced baselines.

Updated: 2025-03-25 08:43:08

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19455v1

VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-performance CPU

Large language models (LLMs) have demonstrated great capabilities in code generation, yet their effective application in compiler optimizations remains an open challenge due to issues such as hallucinations and a lack of domain-specific reasoning. Vectorization, a crucial optimization for enhancing code performance, often fails because of the compiler's inability to recognize complex code patterns, which commonly require extensive empirical expertise. LLMs, with their ability to capture intricate patterns, thus provide a promising solution to this challenge. This paper presents VecTrans, a novel framework that leverages LLMs to enhance compiler-based code vectorization. VecTrans first employs compiler analysis to identify potentially vectorizable code regions. It then utilizes an LLM to refactor these regions into patterns that are more amenable to the compiler's auto-vectorization. To ensure semantic correctness, VecTrans further integrates a hybrid validation mechanism at the intermediate representation (IR) level. With the above efforts, VecTrans combines the adaptability of LLMs with the precision of compiler vectorization, thereby effectively opening up vectorization opportunities. Experimental results show that among all 50 TSVC functions unvectorizable by Clang, GCC, and BiShengCompiler, VecTrans successfully vectorizes 23 cases (46%) and achieves an average speedup of 2.02x, greatly surpassing state-of-the-art performance.

Updated: 2025-03-25 08:39:35

Domains: cs.SE,cs.AI,cs.LG,cs.PF

Download: http://arxiv.org/abs/2503.19449v1

A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called SpeeD, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
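
The asymmetric sampling strategy can be sketched as a non-uniform distribution over time steps that suppresses the convergence area and boosts the rest; the boundary index and the weight values below are invented for illustration and are not taken from SpeeD.

```python
import random

def asymmetric_probs(n_steps, conv_start, suppress=0.3, boost=1.5):
    """Down-weight time steps in the convergence area [conv_start, n_steps)
    and up-weight the remaining steps, then normalize to a distribution."""
    raw = [suppress if t >= conv_start else boost for t in range(n_steps)]
    z = sum(raw)
    return [r / z for r in raw]

probs = asymmetric_probs(1000, conv_start=700)
# Steps outside the convergence area are sampled more often.
assert probs[0] > probs[999]
assert abs(sum(probs) - 1.0) < 1e-9
# Draw a training time step from the asymmetric distribution.
t = random.choices(range(1000), weights=probs, k=1)[0]
```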

Updated: 2025-03-25 08:38:28

Domains: cs.LG,cs.AI,I.2

Download: http://arxiv.org/abs/2405.17403v3

Conditional Shift-Robust Conformal Prediction for Graph Neural Network

Graph Neural Networks (GNNs) have emerged as potent tools for predicting outcomes in graph-structured data. Despite their efficacy, a significant drawback of GNNs lies in their limited ability to provide robust uncertainty estimates, posing challenges to their reliability in contexts where errors carry significant consequences. Moreover, GNNs typically excel in in-distribution settings, assuming that training and test data follow identical distributions, a condition often unmet in real-world graph data scenarios. In this article, we leverage conformal prediction, a widely recognized statistical technique for quantifying uncertainty by transforming predictive model outputs into prediction sets, to address uncertainty quantification in GNN predictions amidst conditional shift (i.e., a change in the conditional probability distribution \(P(label|input)\) from the source domain to the target domain) in graph-based semi-supervised learning (SSL). Additionally, we propose a novel loss function aimed at refining model predictions by minimizing conditional shift in latent stages. Termed Conditional Shift Robust (CondSR) conformal prediction for GNNs, our approach is model-agnostic and adaptable to various classification models. We validate the effectiveness of our method on standard graph benchmark datasets, integrating it with state-of-the-art GNNs in node classification tasks. Comprehensive evaluations demonstrate that our approach consistently achieves any predefined target marginal coverage, enhances the accuracy of state-of-the-art GNN models by up to 12% under conditional shift, and reduces the prediction set size by up to 48%. The code implementation is publicly available for further exploration and experimentation.
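
As background, the split conformal procedure that this line of work builds on can be written in a few lines: hold out calibration points, compute nonconformity scores, and include in the prediction set every label whose score falls under the finite-sample-corrected quantile. The calibration scores and class probabilities below are toy values, not from the paper.

```python
import math

def conformal_qhat(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration
    nonconformity scores (split conformal prediction)."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))   # rank of the adjusted quantile
    return sorted(scores)[min(k, n) - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p(label) is <= qhat."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

# Calibration: nonconformity = 1 - predicted probability of the true label.
cal_scores = [0.1, 0.3, 0.2, 0.05, 0.4, 0.15, 0.25, 0.35, 0.12, 0.22]
qhat = conformal_qhat(cal_scores, alpha=0.1)
s = prediction_set({"A": 0.7, "B": 0.2, "C": 0.1}, qhat)
assert "A" in s
```

The CondSR contribution layers a shift-minimizing loss on top of this recipe; the sketch covers only the vanilla conformal step.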

Updated: 2025-03-25 08:27:10

Domains: cs.LG,cs.AI,cs

Download: http://arxiv.org/abs/2405.11968v3

Large Language Model for Patent Concept Generation

In traditional innovation practices, concept and IP generation are often iteratively integrated. Both processes demand an intricate understanding of advanced technical domain knowledge. Existing large language models (LLMs), while possessing massive pre-trained knowledge, often fall short in innovative concept generation due to a lack of the specialized knowledge necessary for such generation. To bridge this critical gap, we propose a novel knowledge finetuning (KFT) framework to endow LLM-based AI with the ability to autonomously mine, understand, and apply domain-specific knowledge and concepts for invention generation, i.e., concept and patent generation together. Our proposed PatentGPT integrates knowledge injection pre-training (KPT), domain-specific supervised finetuning (SFT), and reinforcement learning from human feedback (RLHF). Extensive evaluation shows that PatentGPT significantly outperforms the state-of-the-art models on patent-related benchmark tests. Our method not only provides new insights into data-driven innovation but also paves a new path to fine-tune LLMs for applications in the context of technology. We also discuss the managerial and policy implications of AI-generating inventions in the future.

Updated: 2025-03-25 08:24:19

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2409.00092v2

FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling

Similar to conventional video generation, current deep learning-based weather prediction frameworks often lack explicit physical constraints, leading to unphysical outputs that limit their reliability for operational forecasting. Among various physical processes requiring proper representation, radiation plays a fundamental role as it drives Earth's weather and climate systems. However, accurate simulation of radiative transfer processes remains challenging for traditional numerical weather prediction (NWP) models due to their inherent complexity and high computational costs. Here, we propose FuXi-RTM, a hybrid physics-guided deep learning framework designed to enhance weather forecast accuracy while enforcing physical consistency. FuXi-RTM integrates a primary forecasting model (FuXi) with a fixed deep learning-based radiative transfer model (DLRTM) surrogate that efficiently replaces conventional radiation parameterization schemes. This represents the first deep learning-based weather forecasting framework to explicitly incorporate physical process modeling. Evaluated over a comprehensive 5-year dataset, FuXi-RTM outperforms its unconstrained counterpart in 88.51% of 3320 variable and lead time combinations, with improvements in radiative flux predictions. By incorporating additional physical processes, FuXi-RTM paves the way for next-generation weather forecasting systems that are both accurate and physically consistent.

Updated: 2025-03-25 08:21:58

Domains: physics.ao-ph,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.19940v1

Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models

Diffusion models, which have been advancing rapidly in recent years, may generate samples that closely resemble the training data. This phenomenon, known as memorization, may lead to copyright issues. In this study, we propose a method to quantify the ease of reproducing training data in unconditional diffusion models. The average of a sample population following the Langevin equation in the reverse diffusion process moves according to a first-order ordinary differential equation (ODE). This ODE establishes a 1-to-1 correspondence between images and their noisy counterparts in the latent space. Since the ODE is reversible and the initial noisy images are sampled randomly, the volume of an image's projected area represents the probability of generating those images. We examined the ODE, which projects images to latent space, and succeeded in quantifying the ease of reproducing training data by measuring the volume growth rate in this process. Given the relatively low computational complexity of this method, it allows us to enhance the quality of training data by detecting and modifying the easily memorized training samples.

Updated: 2025-03-25 08:19:56

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.19429v1

CCUP: A Controllable Synthetic Data Generation Pipeline for Pretraining Cloth-Changing Person Re-Identification Models

Cloth-changing person re-identification (CC-ReID), also known as Long-Term Person Re-Identification (LT-ReID), is a critical and challenging research topic in computer vision that has recently garnered significant attention. However, due to the high cost of constructing CC-ReID data, the existing data-driven models are hard to train efficiently on limited data, causing overfitting issues. To address this challenge, we propose a low-cost and efficient pipeline for generating controllable and high-quality synthetic data simulating the surveillance of real scenarios specific to the CC-ReID task. Particularly, we construct a new self-annotated CC-ReID dataset named Cloth-Changing Unreal Person (CCUP), containing 6,000 IDs, 1,179,976 images, 100 cameras, and 26.5 outfits per individual. Based on this large-scale dataset, we introduce an effective and scalable pretrain-finetune framework for enhancing the generalization capabilities of the traditional CC-ReID models. The extensive experiments demonstrate that two typical models, namely TransReID and FIRe^2, when integrated into our framework, outperform other state-of-the-art models after pretraining on CCUP and finetuning on benchmarks such as PRCC, VC-Clothes and NKUP. The CCUP is available at: https://github.com/yjzhao1019/CCUP.

Updated: 2025-03-25 08:17:18

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2410.13567v2

DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models

While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and prevent bias propagation in the answers. To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP leverages a Question Ambiguity Detection to take appropriate debiasing actions based on the context and a Neutral Answer Guidance Generation to steer the LLMs toward objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our various experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP's efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

Updated: 2025-03-25 08:16:35

Domains: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2503.19426v1

Extreme Precipitation Nowcasting using Multi-Task Latent Diffusion Models

Deep learning models have achieved remarkable progress in precipitation prediction. However, they still face significant challenges in accurately capturing spatial details of radar images, particularly in regions of high precipitation intensity. This limitation results in reduced spatial localization accuracy when predicting radar echo images across varying precipitation intensities. To address this challenge, we propose an innovative precipitation prediction approach termed the Multi-Task Latent Diffusion Model (MTLDM). The core idea of MTLDM lies in the recognition that precipitation radar images represent a combination of multiple components, each corresponding to different precipitation intensities. Thus, we adopt a divide-and-conquer strategy, decomposing radar images into several sub-images based on their precipitation intensities and individually modeling these components. During the prediction stage, MTLDM integrates these sub-image representations by utilizing a trained latent-space rainfall diffusion model, followed by decoding through a multi-task decoder to produce the final precipitation prediction. Experimental evaluations conducted on the MRMS dataset demonstrate that the proposed MTLDM method surpasses state-of-the-art techniques, achieving a Critical Success Index (CSI) improvement of 13-26%.
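
The divide-and-conquer decomposition can be illustrated by splitting a radar field into intensity bands, with each band yielding a sub-image to be modeled separately; the band thresholds and units below are hypothetical, not MTLDM's actual configuration.

```python
def decompose_by_intensity(radar, bands):
    """Split a 2-D radar field into per-band sub-images; values outside
    a band are zeroed so each component can be modeled on its own."""
    subs = []
    for lo, hi in bands:
        subs.append([[v if lo <= v < hi else 0.0 for v in row] for row in radar])
    return subs

def recompose(subs):
    """Sum the sub-images back into the full field."""
    rows, cols = len(subs[0]), len(subs[0][0])
    return [[sum(s[i][j] for s in subs) for j in range(cols)] for i in range(rows)]

radar = [[0.0, 5.0], [20.0, 45.0]]                   # e.g. rain rates in mm/h
bands = [(0.0, 10.0), (10.0, 30.0), (30.0, 100.0)]   # light / moderate / heavy
subs = decompose_by_intensity(radar, bands)
assert recompose(subs) == radar                      # decomposition is lossless
```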

Updated: 2025-03-25 08:14:47

Domains: cs.CV,cs.AI,86A10, 68T07,I.2.6; J.7

Download: http://arxiv.org/abs/2410.14103v3

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specifically, we develop the VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which equips a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.

Updated: 2025-03-25 08:10:15

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2501.00599v3

A novel forecasting framework combining virtual samples and enhanced Transformer models for tourism demand forecasting

Accurate tourism demand forecasting is hindered by limited historical data and complex spatiotemporal dependencies among tourist origins. A novel forecasting framework integrating virtual sample generation and a novel Transformer predictor addresses constraints arising from restricted data availability. A spatiotemporal GAN produces realistic virtual samples by dynamically modeling spatial correlations through a graph convolutional network, and an enhanced Transformer captures local patterns with causal convolutions and long-term dependencies with self-attention, eliminating autoregressive decoding. A joint training strategy refines virtual sample generation based on predictor feedback to maintain robust performance under data-scarce conditions. Experimental evaluations on real-world daily and monthly tourism demand datasets indicate a reduction in average MASE by 18.37% compared to conventional Transformer-based models, demonstrating improved forecasting accuracy. The integration of adaptive spatiotemporal sample augmentation with a specialized Transformer can effectively address limited-data forecasting scenarios in tourism management.
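
For reference, the MASE metric cited in the evaluation scales the forecast's mean absolute error by the in-sample MAE of a (seasonal-)naive forecast, so a value of 1.0 means "as good as the naive baseline". The numbers below are toy values, not the paper's data.

```python
def mase(actual, forecast, train, m=1):
    """Mean Absolute Scaled Error: out-of-sample MAE divided by the
    in-sample MAE of a naive forecast with seasonal period m."""
    scale = sum(abs(train[i] - train[i - m]) for i in range(m, len(train))) / (len(train) - m)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

train = [100, 110, 120, 130]    # historical daily arrivals (toy numbers)
actual = [140, 150]
naive_like = [130, 140]         # forecasts that just repeat the last value
assert abs(mase(actual, naive_like, train) - 1.0) < 1e-9
```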

Updated: 2025-03-25 08:02:09

Domains: stat.AP,cs.LG

Download: http://arxiv.org/abs/2503.19423v1

Masking meets Supervision: A Strong Learning Alliance

Pre-training with random masked inputs has emerged as a novel trend in self-supervised training. However, supervised learning still faces a challenge in adopting masking augmentations, primarily due to unstable training. In this paper, we propose a novel way to involve masking augmentations dubbed Masked Sub-branch (MaskSub). MaskSub consists of the main-branch and sub-branch, the latter being a part of the former. The main-branch undergoes conventional training recipes, while the sub-branch receives intensive masking augmentations during training. MaskSub tackles the challenge by mitigating adverse effects through a relaxed loss function similar to a self-distillation loss. Our analysis shows that MaskSub improves performance, with the training loss converging faster than in standard training, which suggests our method stabilizes the training process. We further validate MaskSub across diverse training scenarios and models, including DeiT-III training, MAE finetuning, CLIP finetuning, BERT training, and hierarchical architectures (ResNet and Swin Transformer). Our results show that MaskSub consistently achieves impressive performance gains across all the cases. MaskSub provides a practical and effective solution for introducing additional regularization under various training recipes. Code available at https://github.com/naver-ai/augsub
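
The relaxed, self-distillation-like loss can be sketched as a cross-entropy between the masked sub-branch's prediction and the main branch's prediction used as a soft target; this toy version works on probability lists rather than tensors, and is only an interpretation of the idea, not the paper's implementation.

```python
import math

def distill_loss(main_probs, sub_probs):
    """Cross-entropy of the sub-branch prediction against the main
    branch's soft prediction: a relaxed target compared to hard labels,
    softening the adverse effect of heavy masking on the sub-branch."""
    return -sum(p * math.log(q) for p, q in zip(main_probs, sub_probs))

main = [0.7, 0.2, 0.1]              # main-branch (unmasked) prediction
matched = distill_loss(main, main)  # sub-branch agrees with main branch
mismatched = distill_loss(main, [0.1, 0.2, 0.7])
assert matched < mismatched         # loss is minimized when branches agree
```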

Updated: 2025-03-25 07:57:52

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2306.11339v3

Multi-Agent Deep Reinforcement Learning for Safe Autonomous Driving with RICS-Assisted MEC

Environment sensing and fusion via onboard sensors are envisioned to be widely applied in future autonomous driving networks. This paper considers a vehicular system with multiple self-driving vehicles that is assisted by multi-access edge computing (MEC), where image data collected by the sensors is offloaded from cellular vehicles to the MEC server using vehicle-to-infrastructure (V2I) links. Sensory data can also be shared among surrounding vehicles via vehicle-to-vehicle (V2V) communication links. To improve spectrum utilization, the V2V links may reuse the same frequency spectrum as the V2I links, which may cause severe interference. To tackle this issue, we leverage reconfigurable intelligent computational surfaces (RICSs) to jointly enable V2I reflective links and mitigate the interference appearing at the V2V links. Traditional algorithms are ill-suited to this problem: their assumption of quasi-static channel state information restricts their ability to adapt to dynamic environmental changes and leads to poor performance under frequently varying channel conditions. In this paper, we therefore formulate the problem at hand as a Markov game. Our novel formulation is applied to time-varying channels subject to multi-user interference and introduces a collaborative learning mechanism among users. The considered optimization problem is solved via a driving safety-enabled multi-agent deep reinforcement learning (DS-MADRL) approach that capitalizes on the RICS presence. Our extensive numerical investigations showcase that the proposed reinforcement learning approach achieves faster convergence and significant enhancements in both data rate and driving safety, as compared to various state-of-the-art benchmarks.

Updated: 2025-03-25 07:53:50

Domains: cs.LG

Download: http://arxiv.org/abs/2503.19418v1

Using Anomaly Detection to Detect Poisoning Attacks in Federated Learning Applications

Adversarial attacks such as poisoning attacks have attracted the attention of many machine learning researchers. Traditionally, poisoning attacks attempt to inject adversarial training data in order to manipulate the trained model. In federated learning (FL), data poisoning attacks can be generalized to model poisoning attacks, which cannot be detected by simpler methods due to the lack of access to local training data by the detector. State-of-the-art poisoning attack detection methods for FL have various weaknesses: for example, the number of attackers must be known in advance or must not be too high, the methods work with i.i.d. data only, or they incur high computational complexity. To overcome these weaknesses, we propose a novel framework for detecting poisoning attacks in FL, which employs a reference model based on a public dataset and an auditor model to detect malicious updates. We implemented a detector based on the proposed framework using a one-class support vector machine (OC-SVM), which reaches the lowest possible computational complexity O(K), where K is the number of clients. We evaluated our detector's performance against state-of-the-art (SOTA) poisoning attacks for two typical applications of FL: electrocardiograph (ECG) classification and human activity recognition (HAR). Our experimental results validated the performance of our detector over other SOTA detection methods.
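
A simplified stand-in for the detection idea described above: client updates that lie far from the update produced by a reference model trained on public data are flagged as suspicious, with one check per client (O(K)). The actual method fits an OC-SVM rather than the plain distance threshold assumed here, and the vectors and threshold are invented for illustration.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def filter_updates(client_updates, reference_update, threshold):
    """Flag client updates whose distance to the reference-model update
    exceeds a threshold; one distance check per client, hence O(K)."""
    benign, flagged = [], []
    for cid, upd in client_updates.items():
        (flagged if l2_distance(upd, reference_update) > threshold else benign).append(cid)
    return benign, flagged

reference = [0.1, -0.2, 0.05]        # update from the public-data reference model
updates = {
    "c1": [0.12, -0.18, 0.04],       # benign: close to the reference
    "c2": [0.09, -0.22, 0.06],       # benign
    "c3": [2.0, 1.5, -3.0],          # poisoned: far from the reference
}
benign, flagged = filter_updates(updates, reference, threshold=0.5)
assert flagged == ["c3"] and "c1" in benign
```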

Updated: 2025-03-25 07:50:17

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2207.08486v4

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Spatial understanding is a crucial capability that enables robots to perceive their surroundings, reason about their environment, and interact with it meaningfully. In modern robotics, these capabilities are increasingly provided by vision-language models. However, these models face significant challenges in spatial reasoning tasks, as their training data are based on general-purpose image datasets that often lack sophisticated spatial understanding. For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5k 3D scans, and 3M annotated spatial relationships, and the pairing of 2D egocentric images with 3D scans makes it both 2D- and 3D-ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.

Updated: 2025-03-25 07:49:16

Categories: cs.CV,cs.AI,cs.CL,cs.RO

Download: http://arxiv.org/abs/2411.16537v2

Continual Learning With Quasi-Newton Methods

Catastrophic forgetting remains a major challenge when neural networks learn tasks sequentially. Elastic Weight Consolidation (EWC) attempts to address this problem by introducing a Bayesian-inspired regularization loss to preserve knowledge of previously learned tasks. However, EWC relies on a Laplace approximation where the Hessian is simplified to the diagonal of the Fisher information matrix, assuming uncorrelated model parameters. This overly simplistic assumption often leads to poor Hessian estimates, limiting its effectiveness. To overcome this limitation, we introduce Continual Learning with Sampled Quasi-Newton (CSQN), which leverages Quasi-Newton methods to compute more accurate Hessian approximations. CSQN captures parameter interactions beyond the diagonal without requiring architecture-specific modifications, making it applicable across diverse tasks and architectures. Experimental results across four benchmarks demonstrate that CSQN consistently outperforms EWC and other state-of-the-art baselines, including rehearsal-based methods. CSQN reduces EWC's forgetting by 50 percent and improves its performance by 8 percent on average. Notably, CSQN achieves superior results on three out of four benchmarks, including the most challenging scenarios, highlighting its potential as a robust solution for continual learning.
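
For context, the diagonal-Fisher EWC penalty that CSQN improves on can be written in a few lines. This is a generic sketch of the baseline regulariser, not the CSQN quasi-Newton update itself:

```python
def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """Diagonal-Fisher EWC regulariser:
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta      -- current parameters
    theta_star -- parameters after the previous task
    fisher_diag-- diagonal of the Fisher information matrix

    The diagonal assumes uncorrelated parameters; CSQN's quasi-Newton
    Hessian approximations relax exactly this assumption.
    """
    return 0.5 * lam * sum(
        f * (t - ts) ** 2
        for f, t, ts in zip(fisher_diag, theta, theta_star)
    )
```

The total training loss on a new task is then the task loss plus this penalty, anchoring important parameters near their old values.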

Updated: 2025-03-25 07:45:59

Categories: cs.LG,eess.IV

Download: http://arxiv.org/abs/2503.19939v1

GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune VLMs with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
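
GFlowNets are commonly trained with a trajectory-balance objective; a minimal sketch of that loss (a standard GFlowNet formulation, not GFlowVLM's exact training code) might look like:

```python
import math

def trajectory_balance_loss(log_Z, log_pf, log_pb, reward):
    """Squared trajectory-balance residual for one sampled trajectory:

        (log Z + sum_t log P_F(s_t -> s_{t+1})
               - log R(x) - sum_t log P_B(s_{t+1} -> s_t))^2

    Minimising this drives the forward policy to sample terminal
    states with probability proportional to their reward, which is
    what encourages diverse (rather than single-mode) solutions.
    """
    resid = log_Z + sum(log_pf) - math.log(reward) - sum(log_pb)
    return resid * resid
```

When the flow is perfectly balanced the residual is zero; in practice log_Z is a learned scalar and the policies come from the fine-tuned VLM.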

Updated: 2025-03-25 07:37:48

Categories: cs.CL,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.06514v2

DistDNAS: Search Efficient Feature Interactions within 2 Hours

Search efficiency and serving efficiency are two major axes in building feature interactions and expediting the model development process in recommender systems. On large-scale benchmarks, searching for the optimal feature interaction design requires extensive cost due to the sequential workflow on the large volume of data. In addition, fusing interactions of various sources, orders, and mathematical operations introduces potential conflicts and additional redundancy toward recommender models, leading to sub-optimal trade-offs in performance and serving cost. In this paper, we present DistDNAS as a neat solution to brew swift and efficient feature interaction design. DistDNAS proposes a supernet to incorporate interaction modules of varying orders and types as a search space. To optimize search efficiency, DistDNAS distributes the search and aggregates the choice of optimal interaction modules on varying data dates, achieving over 25x speed-up and reducing search cost from 2 days to 2 hours. To optimize serving efficiency, DistDNAS introduces a differentiable cost-aware loss to penalize the selection of redundant interaction modules, enhancing the efficiency of discovered feature interactions in serving. We extensively evaluate the best models crafted by DistDNAS on a 1TB Criteo Terabyte dataset. Experimental evaluations demonstrate 0.001 AUC improvement and 60% FLOPs saving over current state-of-the-art CTR models.
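
The cost-aware idea, penalising architecture weights in proportion to each interaction module's serving cost, can be illustrated with a toy scalar version; the FLOPs numbers and the lambda coefficient below are made up for illustration:

```python
def cost_aware_loss(task_loss, module_weights, module_flops, lam=1e-3):
    """Differentiable cost-aware objective:

        L = task_loss + lam * sum_i w_i * FLOPs_i

    where w_i is the (softmax-normalised) architecture weight on
    interaction module i. Expensive modules contribute more to the
    penalty, so the search is steered away from redundant, costly
    interactions.
    """
    cost = sum(w * f for w, f in zip(module_weights, module_flops))
    return task_loss + lam * cost
```

In a real supernet the weights would be differentiable tensors so the penalty back-propagates into the architecture parameters; the arithmetic is the same.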

Updated: 2025-03-25 07:29:11

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2311.00231v2

QUIC-Fuzz: An Effective Greybox Fuzzer For The QUIC Protocol

Network applications are routinely under attack. We consider the problem of developing an effective and efficient fuzzer for the recently ratified QUIC network protocol to uncover security vulnerabilities. QUIC offers a unified transport layer for low latency, reliable transport streams that is inherently secure, ultimately representing a complex protocol design characterised by new features and capabilities for the Internet. Fuzzing a secure transport layer protocol is not trivial. The interactive, strict, rule-based, asynchronous nature of communications with a target, the stateful nature of interactions, security mechanisms to protect communications (such as integrity checks and encryption), and inherent overheads (such as target initialisation) challenge generic network protocol fuzzers. We discuss and address the challenges pertinent to fuzzing transport layer protocols (like QUIC), developing mechanisms that enable fast, effective fuzz testing of QUIC implementations to build a prototype grey-box mutation-based fuzzer, QUIC-Fuzz. Using QUIC-Fuzz, we test six well-maintained server-side implementations, including those from Google and Alibaba. The results demonstrate the fuzzer is both highly effective and generalisable. Our testing uncovered 10 new security vulnerabilities, precipitating 2 CVE assignments thus far. In code coverage, QUIC-Fuzz outperforms existing state-of-the-art network protocol fuzzers (Fuzztruction-Net, ChatAFL, and AFLNet) with up to an 84% increase in code coverage; the improvement is statistically significant across all targets, and a majority of the bugs are discoverable only by QUIC-Fuzz. We open-source QUIC-Fuzz on GitHub.
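
At the heart of any mutation-based greybox fuzzer is an input mutator. A toy byte-level bit-flip mutator, far simpler than QUIC-Fuzz's machinery for encrypted, stateful streams, might be sketched as:

```python
import random

def mutate(packet, n_flips=4, seed=None):
    """Toy mutator: flip a few random bits in a captured packet.

    A real QUIC fuzzer must additionally handle encryption and
    integrity checks (re-encrypting after mutation) and track
    protocol state; this sketch only shows the mutation core that
    coverage feedback would then steer.
    """
    rng = random.Random(seed)
    data = bytearray(packet)
    for _ in range(n_flips):
        i = rng.randrange(len(data))
        data[i] ^= 1 << rng.randrange(8)  # flip one random bit
    return bytes(data)
```

Seeding makes a mutation reproducible, which is essential for replaying a crashing input during triage.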

Updated: 2025-03-25 07:21:35

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2503.19402v1

Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning

Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic skills. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a recently published model in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We perform neuron labeling to identify visual concept neurons hidden in the model's internal representations. We then demonstrate that these neurons can recognize objects beyond the model's original vocabulary. Furthermore, we compare the differences in representation between infant models and modern computer vision models, such as CLIP and ImageNet pre-trained models. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs. Our code is available at https://github.com/Kexueyi/discover_infant_vis.
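
Neuron labeling can be approximated by scoring how selectively a unit responds to one concept versus everything else. The mean-difference score below is an illustrative proxy, not the paper's exact procedure:

```python
def concept_selectivity(activations, labels, concept):
    """Score one neuron for one concept: the neuron's mean activation
    on images of `concept` minus its mean activation on all other
    images. A large positive score suggests the unit encodes the
    concept, even if the concept never appeared in training speech.
    """
    on = [a for a, lbl in zip(activations, labels) if lbl == concept]
    off = [a for a, lbl in zip(activations, labels) if lbl != concept]
    return sum(on) / len(on) - sum(off) / len(off)
```

Running this over all (neuron, concept) pairs and keeping the top-scoring concept per neuron yields a simple labeling of hidden units.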

Updated: 2025-03-25 07:11:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2501.05205v4

Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the inversion and invariance control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
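
Euler inversion of a flow ODE simply runs the velocity field backwards in time; a minimal one-dimensional sketch (without the two-stage refinement the paper proposes) is:

```python
def euler_invert(x1, velocity, n_steps=10):
    """Invert a flow ODE with backward Euler steps:

        x_{t - dt} = x_t - v(x_t, t) * dt

    stepping from t = 1 (the image) back to t = 0 (the latent noise).
    `velocity` is a stand-in for the flow transformer's velocity
    prediction; the accumulated per-step approximation error is what
    the paper's second stage compensates for.
    """
    dt = 1.0 / n_steps
    x = x1
    for k in range(n_steps, 0, -1):
        t = k * dt
        x = x - velocity(x, t) * dt
    return x
```

Re-running the same loop forwards from the recovered latent should approximately reconstruct the input image; the reconstruction gap measures inversion quality.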

Updated: 2025-03-25 07:08:41

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2411.15843v4

BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution

While prior methods in Continuous Spatial-Temporal Video Super-Resolution (C-STVSR) employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and pre-trained optical flow networks for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve performance and can even degrade it. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent the spatial and temporal characteristics of video: 1) a B-spline Mapper for smooth temporal interpolation, and 2) a Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art results in various metrics, including PSNR and SSIM, showing enhanced spatial details and natural temporal consistency. Our code is available at https://github.com/Eunjnnn/bfstvsr.
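
A Fourier mapper, in the general spirit of the paper's spatial module, projects coordinates onto sinusoids of several frequencies so an implicit network can represent fine spatial detail. A scalar sketch, with frequencies chosen arbitrarily:

```python
import math

def fourier_features(x, freqs):
    """Map a scalar coordinate x to [sin(2*pi*f*x), cos(2*pi*f*x)]
    pairs over a bank of frequencies. Feeding these features (rather
    than the raw coordinate) to an MLP lets it fit high spatial
    frequencies that plain coordinate inputs capture poorly.
    """
    out = []
    for f in freqs:
        out.append(math.sin(2 * math.pi * f * x))
        out.append(math.cos(2 * math.pi * f * x))
    return out
```

In an image-sized INR the same mapping is applied per coordinate axis, and the frequency bank is typically learned or log-spaced; both choices are left out of this sketch.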

Updated: 2025-03-25 07:05:39

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2501.11043v2

Quantifying Symptom Causality in Clinical Decision Making: An Exploration Using CausaLM

Current machine learning approaches to medical diagnosis often rely on correlational patterns between symptoms and diseases, risking misdiagnoses when symptoms are ambiguous or common across multiple conditions. In this work, we move beyond correlation to investigate the causal influence of key symptoms, specifically "chest pain", on diagnostic predictions. Leveraging the CausaLM framework, we generate counterfactual text representations in which target concepts are effectively "forgotten", enabling a principled estimation of the causal effect of that concept on a model's predicted disease distribution. By employing the Textual Representation-based Average Treatment Effect (TReATE), we quantify how the presence or absence of a symptom shapes the model's diagnostic outcomes, and contrast these findings against correlation-based baselines such as CONEXP. Our results offer deeper insight into the decision-making behavior of clinical NLP models and have the potential to inform more trustworthy, interpretable, and causally grounded decision support tools in medical practice.
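
The treatment-effect estimate reduces to an average difference between predictions on factual texts and on their concept-forgotten counterfactual representations. Schematically, assuming the per-example probabilities have already been computed by the two encoders:

```python
def treate(preds_factual, preds_counterfactual):
    """Average treatment effect of a concept on a predicted class
    probability: the mean over examples of

        f(x) - f(x with the target concept 'forgotten')

    A large value means the concept (e.g. "chest pain") causally
    drives the model's prediction rather than merely co-occurring
    with the outcome.
    """
    diffs = [p - q for p, q in zip(preds_factual, preds_counterfactual)]
    return sum(diffs) / len(diffs)
```

The hard part, constructing counterfactual representations that forget only the target concept, is what CausaLM's adversarial pre-training provides; this sketch covers only the final aggregation.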

Updated: 2025-03-25 06:59:21

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19394v1

SAFE: Self-Adjustment Federated Learning Framework for Remote Sensing Collaborative Perception

The rapid increase in remote sensing satellites has led to the emergence of distributed space-based observation systems. However, existing distributed remote sensing models often rely on centralized training, resulting in data leakage, communication overhead, and reduced accuracy due to data distribution discrepancies across platforms. To address these challenges, we propose the Self-Adjustment FEderated Learning (SAFE) framework, which innovatively leverages federated learning to enhance collaborative sensing in remote sensing scenarios. SAFE introduces four key strategies: (1) Class Rectification Optimization, which autonomously addresses class imbalance under unknown local and global distributions. (2) Feature Alignment Update, which mitigates Non-IID data issues via locally controlled EMA updates. (3) Dual-Factor Modulation Rheostat, which dynamically balances optimization effects during training. (4) Adaptive Context Enhancement, which is designed to improve model performance by dynamically refining foreground regions, ensuring computational efficiency with accuracy improvement across distributed satellites. Experiments on real-world image classification and object segmentation datasets validate the effectiveness and reliability of the SAFE framework in complex remote sensing scenarios.
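
The Feature Alignment Update builds on a locally controlled exponential moving average. The generic EMA rule underlying it can be sketched as follows, where alpha is a hypothetical smoothing factor, not a value taken from the paper:

```python
def ema_update(local_stats, global_stats, alpha=0.9):
    """Locally controlled exponential moving average, one value per
    feature statistic:

        new_i = alpha * local_i + (1 - alpha) * global_i

    A larger alpha keeps each satellite closer to its own feature
    distribution; the global term nudges clients toward alignment,
    mitigating Non-IID drift without sharing raw data.
    """
    return [alpha * l + (1 - alpha) * g
            for l, g in zip(local_stats, global_stats)]
```

Each client applies this locally between aggregation rounds, so the alignment cost is independent of the number of other clients.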

Updated: 2025-03-25 06:39:34

Categories: cs.LG,cs.AI,eess.SP

Download: http://arxiv.org/abs/2504.03700v1

Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate denoising steps. In contrast, while flow models have gained popularity as an alternative to diffusion models, offering faster generation and high-quality outputs in state-of-the-art image and video generative models, the efficient inference-time scaling methods used for diffusion models cannot be directly applied to them due to their deterministic generative process. To enable efficient inference-time scaling for flow models, we propose three key ideas: 1) SDE-based generation, enabling particle sampling in flow models, 2) Interpolant conversion, broadening the search space and enhancing sample diversity, and 3) Rollover Budget Forcing (RBF), an adaptive allocation of computational resources across timesteps to maximize budget utilization. Our experiments show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves the performance of particle sampling methods for inference-time scaling in flow models. Additionally, we demonstrate that RBF with VP-SDE achieves the best performance, outperforming all previous inference-time scaling approaches.
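
SDE-based generation replaces the deterministic ODE step with a noisy Euler-Maruyama step, and the injected noise is precisely what makes particle sampling meaningful (different particles diverge from the same state). A one-dimensional sketch, with drift and noise scale as placeholders:

```python
import random

def euler_maruyama_step(x, t, drift, sigma, dt, rng=random):
    """One stochastic sampling step of an SDE:

        x_{t+dt} = x_t + f(x_t, t) * dt + sigma * sqrt(dt) * eps,
        eps ~ N(0, 1)

    With sigma = 0 this collapses back to the deterministic Euler
    step of a flow ODE; with sigma > 0, resampling the best
    particles at intermediate steps becomes possible.
    """
    eps = rng.gauss(0.0, 1.0)
    return x + drift(x, t) * dt + sigma * (dt ** 0.5) * eps
```

A particle-sampling loop would run several such trajectories in parallel and periodically keep the ones scoring highest under a reward or verifier.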

Updated: 2025-03-25 06:30:45

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19385v1

LERO: LLM-driven Evolutionary framework with Hybrid Rewards and Enhanced Observation for Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) faces two critical bottlenecks distinct from single-agent RL: credit assignment in cooperative tasks and partial observability of environmental states. We propose LERO, a framework integrating Large language models (LLMs) with evolutionary optimization to address these MARL-specific challenges. The solution centers on two LLM-generated components: a hybrid reward function that dynamically allocates individual credit through reward decomposition, and an observation enhancement function that augments partial observations with inferred environmental context. An evolutionary algorithm optimizes these components through iterative MARL training cycles, where top-performing candidates guide subsequent LLM generations. Evaluations in Multi-Agent Particle Environments (MPE) demonstrate LERO's superiority over baseline methods, with improved task performance and training efficiency.

Updated: 2025-03-25 06:28:42

Categories: cs.LG,cs.AI,cs.MA

Download: http://arxiv.org/abs/2503.21807v1

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.
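
One simple way to smooth prediction probabilities, illustrative of the general over-confidence-reduction idea rather than LoTUS's exact information-theoretic rule, is to blend each prediction with the uniform distribution:

```python
def smooth_probs(probs, beta):
    """Blend a predicted distribution with the uniform one:

        p'_c = (1 - beta) * p_c + beta * (1 / C)

    beta in [0, 1] controls how much memorization-driven confidence
    is removed; the result remains a valid probability distribution.
    This is a generic smoothing sketch, not the paper's bound-driven
    update.
    """
    u = 1.0 / len(probs)
    return [(1 - beta) * p + beta * u for p in probs]
```

In an unlearning setting, beta would be chosen per sample so that smoothed predictions on forgotten data become indistinguishable from those of a model never trained on them.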

Updated: 2025-03-25 06:23:57

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.18314v2

BioMamba: Leveraging Spectro-Temporal Embedding in Bidirectional Mamba for Enhanced Biosignal Classification

Biological signals, such as electroencephalograms (EEGs) and electrocardiograms (ECGs), play a pivotal role in numerous clinical practices, such as diagnosing brain and cardiac arrhythmic diseases. Existing methods for biosignal classification rely on Attention-based frameworks with dense Feed Forward layers, which lead to inefficient learning, high computational overhead, and suboptimal performance. In this work, we introduce BioMamba, a Spectro-Temporal Embedding strategy applied to the Bidirectional Mamba framework with Sparse Feed Forward layers to enable effective learning of biosignal sequences. By integrating these three key components, BioMamba effectively addresses the limitations of existing methods. Extensive experiments demonstrate that BioMamba significantly outperforms state-of-the-art methods with marked improvement in classification performance. The advantages of the proposed BioMamba include (1) Reliability: BioMamba consistently delivers robust results, confirmed across six evaluation metrics. (2) Efficiency: assessing both model and training efficiency, BioMamba demonstrates computational effectiveness by reducing model size and resource consumption compared to existing approaches. (3) Generality: with the capacity to effectively classify a diverse set of tasks, BioMamba demonstrates adaptability and effectiveness across various domains and applications.

Updated: 2025-03-25 06:23:36

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.11741v3

Causal invariant geographic network representations with feature and structural distribution shifts

The existing methods learn geographic network representations through deep graph neural networks (GNNs) based on the i.i.d. assumption. However, the spatial heterogeneity and temporal dynamics of geographic data make the out-of-distribution (OOD) generalisation problem particularly salient. These models are particularly sensitive to distribution shifts (feature and structural shifts) between testing and training data, which are the main causes of the OOD generalisation problem. Spurious correlations are present between invariant and background representations due to selection biases and environmental effects, making the model more likely to learn background representations. The existing approaches focus on background representation changes that are determined by shifts in the feature distributions of nodes in the training and test data while ignoring changes in the proportional distributions of heterogeneous and homogeneous neighbour nodes, which we refer to as structural distribution shifts. We propose a feature-structure mixed invariant representation learning (FSM-IRL) model that accounts for both feature distribution shifts and structural distribution shifts. To address structural distribution shifts, we introduce a sampling method based on causal attention, encouraging the model to identify nodes possessing strong causal relationships with labels or nodes that are more similar to the target node. Inspired by the Hilbert-Schmidt independence criterion, we implement a reweighting strategy to maximise the orthogonality of the node representations, thereby mitigating the spurious correlations among the node representations and suppressing the learning of background representations. Our experiments demonstrate that FSM-IRL exhibits strong learning capabilities on both geographic and social network datasets in OOD scenarios.
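
The orthogonality-maximising idea can be illustrated with a toy penalty on pairwise dot products of node representations; this is a simplification of the HSIC-inspired reweighting strategy, not the paper's estimator:

```python
def orthogonality_penalty(reps):
    """Toy decorrelation penalty: the sum of squared dot products over
    all distinct pairs of representation vectors. The penalty is zero
    exactly when the vectors are mutually orthogonal, so minimising it
    pushes invariant and background representations apart.
    """
    total = 0.0
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            dot = sum(a * b for a, b in zip(reps[i], reps[j]))
            total += dot * dot
    return total
```

The HSIC-based version replaces raw dot products with kernel-based dependence estimates, which also captures nonlinear correlations between representations.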

Updated: 2025-03-25 06:21:57

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2503.19382v1

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Document content extraction is a critical task in computer vision, underpinning the data needs of large language models (LLMs) and retrieval-augmented generation (RAG) systems. Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks. To address these gaps, we introduce OmniDocBench, a novel benchmark featuring high-quality annotations across nine document sources, including academic papers, textbooks, and more challenging cases such as handwritten notes and densely typeset newspapers. OmniDocBench supports flexible, multi-level evaluations, ranging from end-to-end assessment to task-specific and attribute-based analysis using 19 layout categories and 15 attribute labels. We conduct a thorough evaluation of both pipeline-based methods and end-to-end vision-language models, revealing their strengths and weaknesses across different document types. OmniDocBench sets a new standard for fair, diverse, and fine-grained evaluation in document parsing. Dataset and code are available at https://github.com/opendatalab/OmniDocBench.

Updated: 2025-03-25 06:19:32

领域: cs.CV,cs.AI,cs.IR

Download: http://arxiv.org/abs/2412.07626v2

Any6D: Model-free 6D Pose Estimation of Novel Objects

We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets: REAL275, Toyota-Light, HO3D, YCBINEOAT, and LM-O, demonstrating its effectiveness in significantly outperforming state-of-the-art methods for novel object pose estimation. Project page: https://taeyeop.com/any6d
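
A render-and-compare loop, in its simplest form, renders the object under each pose hypothesis, compares the rendering against the observation, and keeps the best candidate. A toy 1-D sketch of that pattern; the renderer and error metric below are illustrative stand-ins, not Any6D's components:

```python
def render(pose):
    """Toy 'renderer': a 1-D pose maps to three pixel intensities."""
    return [pose + i for i in range(3)]

def compare(rendered, observed):
    """Sum of squared pixel differences."""
    return sum((r - o) ** 2 for r, o in zip(rendered, observed))

def best_pose(hypotheses, observed):
    """Keep the hypothesis whose rendering best matches the observation."""
    return min(hypotheses, key=lambda p: compare(render(p), observed))

observed = render(2.0)                  # observation generated at pose 2.0
hypotheses = [0.0, 1.0, 2.0, 3.0]       # candidate poses to score
```

In a real system the hypotheses would be 6-DoF transforms and the comparison a learned or photometric image distance, but the generate-score-refine structure is the same.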

Updated: 2025-03-25 06:18:47

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2503.18673v2

Towards Build Optimization Using Digital Twins

Despite the indisputable benefits of Continuous Integration (CI) pipelines (or builds), CI still presents significant challenges regarding long durations, failures, and flakiness. Prior studies addressed CI challenges in isolation, yet these issues are interrelated and require a holistic approach for effective optimization. To bridge this gap, this paper proposes a novel idea of developing Digital Twins (DTs) of build processes to enable global and continuous improvement. To support such an idea, we introduce the CI Build process Digital Twin (CBDT) framework as a minimum viable product. This framework offers digital shadowing functionalities, including real-time build data acquisition and continuous monitoring of build process performance metrics. Furthermore, we discuss guidelines and challenges in the practical implementation of CBDTs, including (1) modeling different aspects of the build process using Machine Learning, (2) exploring what-if scenarios based on historical patterns, and (3) implementing prescriptive services such as automated failure and performance repair to continuously improve build processes.

Updated: 2025-03-25 06:16:52

Categories: cs.SE,cs.LG

Download: http://arxiv.org/abs/2503.19381v1

Social Network User Profiling for Anomaly Detection Based on Graph Neural Networks

This study proposes a risk pricing anomaly detection method for social network user portraits based on graph neural networks (GNNs), aiming to improve the ability to identify abnormal users in social network environments. In view of the limitations of traditional methods in social network data modeling, this paper combines graph autoencoders (GAEs) and graph attention networks (GATs) to achieve accurate detection of abnormal users through dynamic aggregation of neighbor features and reconstruction error evaluation. The Facebook Page-Page Network dataset is used in the experiment and compared with VAE, GNN, Transformer and GAE. The results show that the proposed method achieves the best performance in AUC, F1-score, Precision and Recall, verifying its effectiveness. In addition, this paper explores the computational efficiency of the model in large-scale data and looks forward to combining self-supervised learning, federated learning, and other technologies in the future to improve the robustness and privacy protection of risk assessment. The research results can provide efficient anomaly detection solutions for financial risk control, social security management, and other fields.
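
The graph-autoencoder half of such a pipeline can be sketched in a few lines: aggregate neighbor features into a node embedding, then flag nodes whose own features the embedding reconstructs poorly. A minimal illustrative sketch; the graph, features, and scoring rule here are hypothetical stand-ins, not the paper's model:

```python
# Reconstruction-error anomaly scoring on a toy graph (illustrative only).

def embed(features, adj):
    """One mean-aggregation step over neighbor features."""
    emb = {}
    for node, neigh in adj.items():
        if neigh:
            dim = len(features[node])
            emb[node] = [sum(features[m][d] for m in neigh) / len(neigh)
                         for d in range(dim)]
        else:
            emb[node] = features[node][:]
    return emb

def anomaly_scores(features, adj):
    """Score each node by the MSE between its features and its embedding."""
    emb = embed(features, adj)
    return {n: sum((f - e) ** 2 for f, e in zip(features[n], emb[n])) / len(features[n])
            for n in adj}

# Three similar "normal" nodes plus one whose features clash with its
# neighborhood.
features = {0: [1.0, 0.0], 1: [1.0, 0.1], 2: [0.9, 0.0], 3: [-5.0, 4.0]}
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0]}
scores = anomaly_scores(features, adj)
most_anomalous = max(scores, key=scores.get)
```

Nodes that disagree with their aggregated neighborhood receive large reconstruction errors, which is the signal a GAE/GAT detector thresholds; the paper's attention weights would replace the uniform mean used here.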

Updated: 2025-03-25 06:16:17

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19380v1

RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories

Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method minimizes sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow's superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.
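
The generic idea behind an importance-sampling "Time Sampler" is to draw training timesteps from a non-uniform distribution concentrated on the important steps, then reweight each term by 1/(T * p) so the loss estimate stays unbiased. A sketch under that reading; the per-timestep losses and the proportional weighting below are illustrative, not RayFlow's actual scheme:

```python
import random

def timestep_sampler(importance, rng=None):
    """Return a sampler yielding (t, prob) with prob proportional to importance."""
    rng = rng or random.Random(0)
    total = sum(importance)
    probs = [w / total for w in importance]
    def sample():
        r, acc = rng.random(), 0.0
        for t, p in enumerate(probs):
            acc += p
            if r < acc:
                return t, p
        return len(probs) - 1, probs[-1]
    return sample

# Toy per-timestep losses; sampling proportional to them focuses training on
# the hardest steps (and, here, makes the reweighted estimator zero-variance).
loss = [0.1, 1.0, 2.0, 1.0, 0.1]
sample = timestep_sampler(loss)
T = len(loss)

# Importance-weighted average: dividing each draw by T * prob keeps the
# estimate of the uniform-average loss unbiased.
n = 1000
est = sum(loss[t] / (T * p) for t, p in (sample() for _ in range(n))) / n
true_mean = sum(loss) / T
```

Because the sampling probabilities here are exactly proportional to the per-step losses, every reweighted draw equals the true mean, illustrating why concentrating samples on crucial timesteps can cut estimator variance without biasing training.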

Updated: 2025-03-25 06:11:23

Categories: cs.LG,cs.CV,I.2.10; I.4.8

Download: http://arxiv.org/abs/2503.07699v2

Interpretable Generative Models through Post-hoc Concept Bottlenecks

Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to design interpretable generative models based on CBMs are not yet efficient and scalable, as they require expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc techniques and we name our approaches: concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our proposed approaches enable efficient and scalable training without the need of real data and require only minimal to no concept supervision. Additionally, our methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train. Finally, a large-scale user study is performed to validate the interpretability and steerability of our methods.

Updated: 2025-03-25 06:09:51

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.19377v1

MCRanker: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers

The most recent pointwise Large Language Model (LLM) rankers have achieved remarkable ranking results. However, these rankers are hindered by two major drawbacks: (1) they fail to follow a standardized comparison guidance during the ranking process, and (2) they struggle with comprehensive considerations when dealing with complicated passages. To address these shortcomings, we propose to build a ranker that generates ranking scores based on a set of criteria from various perspectives. These criteria are intended to direct each perspective in providing a distinct yet synergistic evaluation. Our research, which examines eight datasets from the BEIR benchmark, demonstrates that incorporating this multi-perspective criteria ensemble approach markedly enhances the performance of pointwise LLM rankers.
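
Stripped of the LLM machinery, the multi-perspective idea reduces to scoring each passage under several criteria and aggregating the scores. A minimal sketch; the two hard-coded scorers below are illustrative stand-ins for the criteria that the real system would generate on the fly:

```python
def rank(passages, scorers):
    """Score each passage under every criterion and sort by the mean score."""
    def aggregate(p):
        return sum(score(p) for score in scorers) / len(scorers)
    return sorted(passages, key=aggregate, reverse=True)

# Hypothetical criteria; a criteria-generating ranker would produce these
# per query rather than fixing them in advance.
scorers = [
    lambda p: 1.0 if "neural" in p else 0.0,     # topical-relevance criterion
    lambda p: min(len(p.split()) / 10.0, 1.0),   # coverage criterion
]
ranked = rank(["neural topic models for text", "short note"], scorers)
```

The ensemble structure is what matters: each criterion contributes a distinct judgment, and the aggregation step turns them into a single pointwise score.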

Updated: 2025-03-25 06:08:47

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2404.11960v3

On Improving the Composition Privacy Loss in Differential Privacy for Fixed Estimation Error

This paper considers the private release of statistics of disjoint subsets of a dataset, in the setting of data heterogeneity, where users could contribute more than one sample, with different users contributing potentially different numbers of samples. In particular, we focus on the $\epsilon$-differentially private release of sample means and variances of sample values in disjoint subsets of a dataset, under the assumption that the number of contributions of each user in each subset is publicly known. Our main contribution is an iterative algorithm, based on suppressing user contributions, which seeks to reduce the overall privacy loss degradation under a canonical Laplace mechanism, while not increasing the worst estimation error among the subsets. Important components of this analysis are our exact, analytical characterizations of the sensitivities and the worst-case bias errors of estimators of the sample mean and variance, which are obtained by clipping or suppressing user contributions. We test the performance of our algorithm on real-world and synthetic datasets and demonstrate clear improvements in the privacy loss degradation, for fixed worst-case estimation error.
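
The core mechanism (clip values, suppress contributions beyond a per-user cap, then add Laplace noise calibrated to the resulting sensitivity) can be sketched as follows. This is a toy version of the canonical Laplace mechanism for a sample mean, with `lo`, `hi`, the cap `c`, and the sensitivity bound as illustrative assumptions rather than the paper's exact estimator:

```python
import math
import random

def private_mean(user_samples, lo, hi, c, epsilon, rng=None):
    """Release an epsilon-DP mean after capping each user at c clipped samples."""
    rng = rng or random.Random(0)
    # Suppress contributions beyond the per-user cap; clip values into [lo, hi].
    kept = [min(max(x, lo), hi) for samples in user_samples for x in samples[:c]]
    n = len(kept)
    # Treating n as public, one user changes at most c clipped values,
    # so the mean's sensitivity is bounded by c * (hi - lo) / n.
    sensitivity = c * (hi - lo) / n
    # Laplace(sensitivity / epsilon) noise via inverse-CDF sampling.
    u = rng.random() - 0.5
    noise = -(sensitivity / epsilon) * math.copysign(math.log(1 - 2 * abs(u)), u)
    return sum(kept) / n + noise

users = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]   # heterogeneous contributions
release = private_mean(users, lo=0.0, hi=10.0, c=2, epsilon=1.0)
```

Lowering the cap `c` shrinks the sensitivity (and hence the noise) at the cost of a bias from the suppressed samples, which is exactly the trade-off the paper's iterative algorithm navigates.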

Updated: 2025-03-25 06:08:30

Categories: cs.CR,cs.IT,math.IT

Download: http://arxiv.org/abs/2405.06261v4

DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image

Most existing methods of 3D clothed human reconstruction from a single image treat the clothed human as a single object without distinguishing between cloth and human body. In this regard, we present DeClotH, which separately reconstructs 3D cloth and human body from a single image. This task remains largely unexplored due to the extreme occlusion between cloth and the human body, making it challenging to infer accurate geometries and textures. Moreover, while recent 3D human reconstruction methods have achieved impressive results using text-to-image diffusion models, directly applying such an approach to this problem often leads to incorrect guidance, particularly in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of cloth and human body as regularizations, which provide strong geometric priors to prevent erroneous reconstruction by the occlusion. Second, we introduce a cloth diffusion model specifically designed to provide contextual information about cloth appearance, thereby enhancing the reconstruction of 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approach is highly effective in reconstructing both 3D cloth and the human body. More qualitative results are provided at https://hygenie1228.github.io/DeClotH/.

Updated: 2025-03-25 06:00:15

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19373v1

Inductive Moment Matching

Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Inductive Moment Matching (IMM), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, IMM does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, IMM guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. IMM surpasses diffusion models on ImageNet-256x256 with 1.99 FID using only 8 inference steps and achieves state-of-the-art 2-step FID of 1.98 on CIFAR-10 for a model trained from scratch.

Updated: 2025-03-25 06:00:02

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2503.07565v5

Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.
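
Contrastive alignment of speech and text embeddings is typically an InfoNCE-style objective: pull a speech embedding toward its paired text embedding and away from in-batch negatives. A self-contained sketch under that assumption; the embeddings and temperature below are illustrative, not the paper's:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(speech_emb, text_embs, positive_idx, temperature=0.1):
    """Negative log-softmax of the positive pair's similarity over candidates."""
    logits = [cosine(speech_emb, t) / temperature for t in text_embs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_idx]

speech = [1.0, 0.0]                                   # toy speech embedding
texts = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]         # index 0 is the true pair
loss_good = info_nce(speech, texts, 0)                # aligned pair: low loss
loss_bad = info_nce(speech, texts, 2)                 # misaligned: high loss
```

Minimizing this loss over many batches is what would pull emotion-equivalent speech and text into a shared, language-agnostic space.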

Updated: 2025-03-25 05:58:18

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.21806v1

Flow to Learn: Flow Matching on Neural Network Parameters

Foundational language models show a remarkable ability to learn new concepts during inference via context data. However, similar work for images lags behind. To address this challenge, we introduce FLoWN, a flow matching model that learns to generate neural network parameters for different tasks. Our approach models the flow on latent space, while conditioning the process on context data. Experiments verify that FLoWN attains various desiderata for a meta-learning model. In addition, it matches or exceeds baselines on in-distribution tasks, provides better initializations for classifier training, and is performant on out-of-distribution few-shot tasks while having a fine-tuning mechanism to improve performance.

Updated: 2025-03-25 05:57:50

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19371v1

Evaluating Negative Sampling Approaches for Neural Topic Models

Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the paradigm of learn-to-compare. The goal of this approach is to add robustness to deep learning models by comparing positive samples against negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, the effect of negative sampling in an unsupervised domain like topic modeling has not been well explored. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.
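
One common way to inject negative sampling into a topic-model decoder is a margin term: the reconstruction of the observed bag-of-words should beat that of a sampled negative document by a margin. A generic triplet-style sketch, not the exact loss of any method evaluated in the paper:

```python
import math

def cross_entropy(word_probs, bow):
    """Negative log-likelihood of a bag-of-words under the decoder's distribution."""
    return -sum(count * math.log(word_probs[w]) for w, count in bow.items())

def loss_with_negative(word_probs, pos_bow, neg_bow, margin=1.0):
    pos = cross_entropy(word_probs, pos_bow)
    neg = cross_entropy(word_probs, neg_bow)
    # Standard reconstruction term plus a hinge: the negative document should
    # reconstruct worse than the observed one by at least `margin`.
    return pos + max(0.0, margin - (neg - pos))

word_probs = {"cat": 0.6, "dog": 0.3, "car": 0.1}   # decoder word distribution
pos_bow = {"cat": 2, "dog": 1}                       # observed document
neg_bow = {"car": 3}                                 # sampled negative document
```

When the decoder already separates the observed document from the negative by more than the margin, the hinge vanishes and the loss reduces to plain reconstruction; otherwise the extra term pushes the two apart.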

Updated: 2025-03-25 05:53:08

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.18167v2

A Benign Activity Extraction Method for Malignant Activity Identification using Data Provenance

In order to understand the overall picture of cyber attacks and to identify their source, a method has been developed to identify malicious activities by automatically creating a graph that ties together the dependencies of a series of related events by tracking Data Provenance. However, the problem of dependency explosion, in which a large number of normal computer system operations (such as operations by authorized users) are included in the dependencies, results in a huge generated graph, making it difficult to identify malicious activities. In this paper, we propose a method to reduce the search space for malicious activities by extracting and removing frequently occurring benign activities through natural language processing of log data and similarity-based analysis of activities in the computer system. In the evaluation experiment, we used the DARPA TC Dataset, a large-scale public dataset, to evaluate the effectiveness of the proposed method on the dependency explosion problem. We showed that about 6.8% to 39% of the activities in a computer system can be defined as patterns of benign activities, and that removing benign activities extracted from a portion of the log data (approximately 1.4% to 3.2% in size) can significantly reduce the search space (by up to approximately 52%) in large datasets.
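
The benign-pattern idea can be sketched as frequency-based extraction: events that recur across many provenance traces are treated as benign and removed before searching the remainder. A toy illustration; the log format and support threshold are hypothetical:

```python
from collections import Counter

def extract_benign(traces, min_support):
    """Events occurring in at least min_support traces are treated as benign."""
    counts = Counter(ev for trace in traces for ev in set(trace))
    return {ev for ev, c in counts.items() if c >= min_support}

def prune(traces, benign):
    """Remove benign events, leaving only the rare activity to investigate."""
    return [[ev for ev in trace if ev not in benign] for trace in traces]

traces = [
    ["login", "read_cfg", "browse"],
    ["login", "read_cfg", "compile"],
    ["login", "read_cfg", "exfiltrate"],   # the rare event survives pruning
]
benign = extract_benign(traces, min_support=3)
pruned = prune(traces, benign)
```

The paper's method is richer (it uses NLP over log text and similarity judgments rather than exact event matching), but the effect is the same: frequent routine activity is factored out so the dependency graph that remains is small enough to search.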

Updated: 2025-03-25 05:52:41

Categories: cs.CR

Download: http://arxiv.org/abs/2503.19370v1

ImF: Implicit Fingerprint for Large Language Models

Training large language models (LLMs) is resource-intensive and expensive, making intellectual property (IP) protection essential. Most existing model fingerprint methods inject fingerprints into LLMs to protect model ownership. These methods create fingerprint pairs with weak semantic correlations, lacking the contextual coherence and semantic relatedness found in normal question-answer (QA) pairs in LLMs. In this paper, we propose a Generation Revision Intervention (GRI) attack that can effectively exploit this flaw to erase fingerprints, highlighting the need for more secure model fingerprint methods. Thus, we propose a novel injected fingerprint paradigm called Implicit Fingerprints (ImF). ImF constructs fingerprint pairs with strong semantic correlations, disguising them as natural QA pairs within LLMs. This ensures the fingerprints are consistent with normal model behavior, making them indistinguishable and robust against detection and removal. Our experiment on multiple LLMs demonstrates that ImF retains high verification success rates under adversarial conditions, offering a reliable solution for protecting LLM ownership.

Updated: 2025-03-25 05:47:34

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.21805v1

XXLTraffic: Expanding and Extremely Long Traffic forecasting beyond test adaptation

Traffic forecasting is crucial for smart cities and intelligent transportation initiatives, where deep learning has made significant progress in modeling complex spatio-temporal patterns in recent years. However, current public datasets have limitations in reflecting the distribution shift nature of real-world scenarios, characterized by continuously evolving infrastructures, varying temporal distributions, and long temporal gaps due to sensor downtimes or changes in traffic patterns. These limitations inevitably restrict the practical applicability of existing traffic forecasting datasets. To bridge this gap, we present XXLTraffic, the largest available public traffic dataset with the longest timespan, collected from Los Angeles, USA, and New South Wales, Australia, curated to support research in extremely long forecasting beyond test adaptation. Our benchmark includes both typical time-series forecasting settings with hourly and daily aggregated data and novel configurations that introduce gaps and down-sample the training size to better simulate practical constraints. We anticipate the new XXLTraffic will provide a fresh perspective for the time-series and traffic forecasting communities. It would also offer a robust platform for developing and evaluating models designed to tackle the extremely long forecasting problems beyond test adaptation. Our dataset supplements existing spatio-temporal data resources and leads to new research directions in this domain.

Updated: 2025-03-25 05:39:42

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.12693v2

Polysemanticity and Capacity in Neural Networks

Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature capacity, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and more prevalent in some architectures than others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with differing block sizes in different models, highlighting the impact of model architecture on the interpretability of its neurons.

Updated: 2025-03-25 05:19:03

Categories: cs.NE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2210.01892v4

Hardware-Friendly Static Quantization Method for Video Diffusion Transformers

Diffusion Transformers for video generation have gained significant research interest since the impressive performance of SORA. Efficient deployment of such generative-AI models on GPUs has been demonstrated with dynamic quantization. However, resource-constrained devices cannot support dynamic quantization, and need static quantization of the models for their efficient deployment on AI processors. In this paper, we propose a novel method for the post-training quantization of OpenSora, a Video Diffusion Transformer, without relying on dynamic quantization techniques. Our approach employs static quantization, achieving video quality comparable to FP16 and dynamically quantized ViDiT-Q methods, as measured by CLIP, and VQA metrics. In particular, we utilize per-step calibration data to adequately provide a post-training statically quantized model for each time step, incorporating channel-wise quantization for weights and tensor-wise quantization for activations. By further applying the smooth-quantization technique, we can obtain high-quality video outputs with the statically quantized models. Extensive experimental results demonstrate that static quantization can be a viable alternative to dynamic quantization for video diffusion transformers, offering a more efficient approach without sacrificing performance.
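
The recipe of per-channel weight scales plus a single per-tensor activation scale fixed offline can be illustrated with symmetric int8 quantization. This is a generic sketch of the scheme, not the paper's actual pipeline:

```python
# Symmetric int8 static quantization: per-channel scales for weights,
# one per-tensor scale for activations calibrated offline.

def quantize(values, scale):
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(q, scale):
    return [x * scale for x in q]

def channel_scales(weight_rows):
    """One symmetric int8 scale per output channel (row)."""
    return [max(abs(w) for w in row) / 127.0 for row in weight_rows]

weights = [[0.5, -0.25], [0.01, 0.02]]        # two output channels
scales = channel_scales(weights)              # small channel keeps a small scale
wq = [quantize(row, s) for row, s in zip(weights, scales)]
wd = [dequantize(row, s) for row, s in zip(wq, scales)]

# Per-tensor activation scale computed offline from a calibration batch;
# at inference it stays fixed (static), unlike dynamic quantization, which
# would recompute it per input.
calib_acts = [0.9, -1.2, 0.3]
act_scale = max(abs(a) for a in calib_acts) / 127.0
```

Per-channel scales keep the small-magnitude channel from being crushed by the large one, and the per-step calibration described above amounts to computing a separate set of such static scales for each diffusion timestep.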

Updated: 2025-03-25 05:17:19

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2502.15077v2

In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers' GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.

Updated: 2025-03-25 05:12:04

Categories: cs.AI

Download: http://arxiv.org/abs/2503.16861v2

Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization

Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. However, unlike languages such as English, Chinese or Spanish, for those relatively low-resource languages with limited usage or data, recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings. This raises the question: Are LLMs capable of handling cross-lingual summarization tasks for low-resource languages? To resolve this question, we fully explore the potential of large language models on cross-lingual summarization task for low-resource languages through our four-step zero-shot method: Summarization, Improvement, Translation and Refinement (SITR) with correspondingly designed prompts. We test our proposed method with multiple LLMs on two well-known cross-lingual summarization datasets with various low-resource target languages. The results show that: i) GPT-3.5 and GPT-4 significantly and consistently outperform other baselines when using our zero-shot SITR methods. ii) By employing our proposed method, we unlock the potential of LLMs, enabling them to effectively handle cross-lingual summarization tasks for relatively low-resource languages.
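
A four-step prompting pipeline of this shape can be sketched as a chain of templates, each consuming the previous step's output. The `llm` callable and the prompt wording below are placeholders, not the paper's actual prompts:

```python
# Schematic Summarize -> Improve -> Translate -> Refine chain (illustrative).
PROMPTS = [
    "Summarize the following text:\n{x}",
    "Improve this summary for coverage and fluency:\n{x}",
    "Translate this summary into {lang}:\n{x}",
    "Refine the {lang} summary, checking fluency and faithfulness:\n{x}",
]

def sitr(llm, source, lang):
    out = source
    for template in PROMPTS:
        out = llm(template.format(x=out, lang=lang))
    return out

def stub_llm(prompt):
    """A stand-in 'LLM' that tags each stage, showing how the stages chain."""
    stage = prompt.split(" ", 1)[0].lower()
    return "[" + stage + "]" + prompt.rsplit("\n", 1)[-1]

result = sitr(stub_llm, "doc", "German")
```

With a real model behind `llm`, each zero-shot stage revises the previous one, which is what lets the final target-language summary recover from errors an end-to-end prompt would lock in.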

Updated: 2025-03-25 05:11:24

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2410.20021v2

CompMarkGS: Robust Watermarking for Compressed 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) enables rapid differentiable rendering for 3D reconstruction and novel view synthesis, leading to its widespread commercial use. Consequently, copyright protection via watermarking has become critical. However, because 3DGS relies on millions of Gaussians, which require gigabytes of storage, efficient transfer and storage require compression. Existing 3DGS watermarking methods are vulnerable to quantization-based compression, which often destroys the embedded watermark. To address this challenge, we propose a novel watermarking method that keeps the watermark robust after model compression while maintaining high rendering quality. In detail, we incorporate a quantization distortion layer that simulates compression during training, preserving the watermark under quantization-based compression. We also propose a learnable watermark embedding feature that embeds the watermark into the anchor feature, ensuring structural consistency and seamless integration into the 3D scene. Furthermore, we present a frequency-aware anchor growing mechanism that enhances image quality in high-frequency regions by effectively identifying Gaussians within these regions. Experimental results confirm that our method preserves the watermark and maintains superior image quality under high compression, validating it as a promising approach for securing 3DGS models.
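
The quantization distortion layer idea, in miniature: pass values through a simulated quantize-dequantize step during training, so that whatever is learned (here, a watermark) must survive the rounding. The uniform affine scheme and bit widths below are illustrative assumptions, not the paper's exact compression pipeline.

```python
# Simulated quantization distortion: uniform affine quantization followed
# by dequantization, as typically inserted during training so downstream
# losses see compression-distorted values.

def quantize_dequantize(values, num_bits=8):
    """Quantize to a 2**num_bits-level grid, then map back to floats."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant input: nothing to quantize
        return list(values)
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels
    out = []
    for v in values:
        q = round((v - lo) / scale)     # integer code in [0, levels]
        out.append(lo + q * scale)      # back to float: the distorted value
    return out

x = [0.0, 0.1, 0.52, 1.0]
x_q = quantize_dequantize(x, num_bits=2)   # coarse 2-bit grid {0, 1/3, 2/3, 1}
```

Training against `x_q` rather than `x` is what makes the embedded signal robust to the real quantizer applied at compression time.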

Updated: 2025-03-25 05:07:43

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.12836v3

Data-driven Mesoscale Weather Forecasting Combining Swin-Unet and Diffusion Models

Data-driven weather prediction models exhibit promising performance and advance continuously. In particular, diffusion models represent fine-scale details without spatial smoothing, which is crucial for mesoscale predictions, such as heavy rainfall forecasting. However, the applications of diffusion models to mesoscale prediction remain limited. To address this gap, this study proposes an architecture that combines a diffusion model with Swin-Unet as a deterministic model, achieving mesoscale predictions while maintaining flexibility. The proposed architecture trains the two models independently, allowing the diffusion model to remain unchanged when the deterministic model is updated. Comparisons using the Fractions Skill Score and power spectral analysis demonstrate that incorporating the diffusion model leads to improved accuracy compared to predictions without it. These findings underscore the potential of the proposed architecture to enhance mesoscale predictions, particularly for strong rainfall events, while maintaining flexibility.
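
The Fractions Skill Score used in the comparison is a neighborhood verification metric: both fields are thresholded, converted to windowed exceedance fractions, and compared. A minimal pure-Python version on small square grids (the window size and threshold below are illustrative):

```python
# Fractions Skill Score (FSS): 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2)),
# where Pf, Po are neighborhood exceedance fractions of forecast/observation.

def fractions(field, thresh, win):
    """Fraction of grid points exceeding `thresh` in a win x win window."""
    n = len(field)
    half = win // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hits = total = 0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < n and 0 <= jj < n:
                        total += 1
                        hits += field[ii][jj] >= thresh
            out[i][j] = hits / total
    return out

def fss(forecast, observed, thresh=1.0, win=3):
    pf = fractions(forecast, thresh, win)
    po = fractions(observed, thresh, win)
    num = sum((a - b) ** 2 for ra, rb in zip(pf, po) for a, b in zip(ra, rb))
    den = sum(a * a for row in pf for a in row) + \
          sum(b * b for row in po for b in row)
    return 1.0 - num / den if den else 1.0

obs = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
assert fss(obs, obs) == 1.0        # a perfect forecast scores 1
```

Because it compares fractions rather than point values, FSS rewards forecasts that place rain in roughly the right neighborhood, which is why it suits sharp diffusion-model output.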

Updated: 2025-03-25 05:07:31

Categories: cs.LG,physics.ao-ph

Download: http://arxiv.org/abs/2503.19354v1

QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition

Large Language Models (LLMs) excel in diverse applications but are inefficient due to their massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) because of activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework that leverages Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline using calibration data to construct an orthogonal transformation matrix P, shifting outliers into additional full-precision dimensions while quantizing the remaining components to 4 bits. Additionally, QUAD enables parameter-efficient fine-tuning via adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94%-96% accuracy under W4A4 quantization, and 98% accuracy with W4A4/A8 plus parameter-efficient fine-tuning, for the Llama-3 and Qwen-2.5 models. Our code is available at https://github.com/hyx1999/Quad.
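
QUAD's core move, keeping a few outlier directions in full precision while quantizing the rest to 4 bits, can be sketched as follows. To stay dependency-free, this sketch uses simple magnitude-based outlier selection in place of the paper's SVD-derived orthogonal transform P (an assumption, not the actual method):

```python
# Split-and-quantize sketch: the largest-magnitude channels are kept in
# full precision, the rest go through symmetric 4-bit quantization.

def split_and_quantize(x, num_outliers=1, num_bits=4):
    # Pick the largest-magnitude channels as full-precision outliers.
    order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
    outlier_idx = set(order[:num_outliers])
    # Symmetric uniform quantization for the remaining channels.
    rest = [abs(x[i]) for i in range(len(x)) if i not in outlier_idx]
    scale = (max(rest) or 1.0) / (2 ** (num_bits - 1) - 1)
    out = []
    for i, v in enumerate(x):
        if i in outlier_idx:
            out.append(v)                           # kept in full precision
        else:
            q = round(v / scale)
            q = max(-(2 ** (num_bits - 1)), min(2 ** (num_bits - 1) - 1, q))
            out.append(q * scale)                   # 4-bit dequantized value
    return out

x = [0.1, -0.2, 12.0, 0.05]        # one large outlier channel
y = split_and_quantize(x)
```

The point of the split is visible in the numbers: without it, the 12.0 outlier would stretch the quantization scale and crush the small channels into a few levels.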

Updated: 2025-03-25 05:03:56

Categories: cs.LG,cs.CL,I.2.7

Download: http://arxiv.org/abs/2503.19353v1

Optimal Parameter Adaptation for Safety-Critical Control via Safe Barrier Bayesian Optimization

Safety is of paramount importance in control systems to avoid costly risks and catastrophic damage. The control barrier function (CBF) method, a promising solution for safety-critical control, poses a new challenge for enhancing control performance because it directly modifies the original control design and introduces uncalibrated parameters. In this work, we shed light on the crucial role of configurable parameters in the CBF method for performance enhancement, with a systematic categorization. Based on that, we propose a novel framework combining the CBF method with Bayesian optimization (BO) to optimize safe control performance. Considering feasibility/safety-critical constraints, we develop a safe version of BO using the barrier-based interior method to efficiently search for promising feasible configurable parameters. Furthermore, we provide theoretical criteria for our framework regarding safety and optimality. An essential advantage of our framework is that it works in model-agnostic environments, leaving sufficient flexibility for designing objective and constraint functions. Finally, simulation experiments on swing-up control and high-fidelity adaptive cruise control demonstrate the effectiveness of our framework.
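
The role of a configurable CBF parameter is easy to see in one dimension. The safety filter below enforces h(x) = x_max - x >= 0 for the single integrator dx/dt = u by requiring dh/dt + gamma * h >= 0, i.e. u <= gamma * h(x); gamma is exactly the kind of tunable parameter such a framework would hand to Bayesian optimization (the values here are illustrative):

```python
# One-dimensional CBF safety filter for dx/dt = u with barrier
# h(x) = x_max - x. The CBF condition -u + gamma * h >= 0 caps the
# control at u <= gamma * h, so the state can never cross x_max.

def cbf_filter(u_nominal, x, x_max=1.0, gamma=2.0):
    h = x_max - x                      # barrier value (>= 0 means safe)
    u_limit = gamma * h                # largest u satisfying the CBF condition
    return min(u_nominal, u_limit)     # minimally modify the nominal input

# Far from the boundary the nominal control passes through unchanged;
# near the boundary it is clipped to stay safe.
safe_far = cbf_filter(u_nominal=0.5, x=0.0)    # -> 0.5
safe_near = cbf_filter(u_nominal=0.5, x=0.9)   # -> ~0.2
```

A small gamma keeps the system conservative far from the boundary, a large gamma allows aggressive motion until very close to it; that trade-off is what the parameter search optimizes.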

Updated: 2025-03-25 04:56:17

Categories: eess.SY,cs.LG,cs.SY,math.OC

Download: http://arxiv.org/abs/2503.19349v1

Explaining Deep Convolutional Neural Networks for Image Classification by Evolving Local Interpretable Model-agnostic Explanations

Deep convolutional neural networks have proven their effectiveness and are acknowledged as the dominant method for image classification. However, a severe drawback of deep convolutional neural networks is poor explainability. Unfortunately, in many real-world applications, users need to understand the rationale behind the predictions of deep convolutional neural networks when deciding whether to trust them. To resolve this issue, a novel genetic algorithm-based method is proposed for the first time to automatically evolve local explanations that can assist users in assessing the rationality of the predictions. Furthermore, the proposed method is model-agnostic, i.e., it can be utilised to explain any deep convolutional neural network model. In the experiments, ResNet is used as an example model to be explained, and the ImageNet dataset is selected as the benchmark dataset. DenseNet and MobileNet are also explained to demonstrate the model-agnostic characteristic of the proposed method. The evolved local explanations on four images, randomly selected from ImageNet, are presented, and they are straightforward for humans to recognise. Moreover, the evolved explanations explain the predictions of deep convolutional neural networks on all four images very well by successfully capturing meaningful interpretable features of the sample images. Further analysis based on 30 runs of the experiments shows that the evolved local explanations can also improve the probabilities/confidences of the deep convolutional neural network models in making their predictions. The proposed method obtains local explanations within one minute, more than ten times faster than LIME (the state-of-the-art method).
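
The genetic-algorithm idea can be shown in miniature: evolve a binary mask over image regions ("superpixels") so that keeping only the masked regions preserves a model's confidence. The `confidence` stub below stands in for a real CNN, and the GA settings (elitist selection plus bit-flip mutation) are illustrative, not the paper's exact operators:

```python
# Toy GA for local explanations: individuals are binary masks over
# NUM_REGIONS image regions; fitness is the (stubbed) model confidence
# when only the masked regions are kept.
import random

random.seed(0)
NUM_REGIONS = 8
IMPORTANT = {1, 4, 6}          # regions the stub model actually relies on

def confidence(mask):
    """Stub model: confidence grows with important regions kept and
    shrinks slightly with clutter."""
    kept = {i for i, m in enumerate(mask) if m}
    return len(kept & IMPORTANT) - 0.1 * len(kept - IMPORTANT)

def evolve(pop_size=20, generations=30, mut_rate=0.2):
    pop = [[random.randint(0, 1) for _ in range(NUM_REGIONS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=confidence, reverse=True)
        elite = pop[: pop_size // 2]                 # elitist selection
        children = [[1 - g if random.random() < mut_rate else g
                     for g in parent]                # bit-flip mutation
                    for parent in elite]
        pop = elite + children
    return max(pop, key=confidence)

best = evolve()          # the surviving mask is the local explanation
```

The surviving mask directly names the regions the (stub) model depends on, which is the same interpretability contract as the paper's evolved explanations.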

Updated: 2025-03-25 04:52:14

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2211.15143v2

Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent

Projected Gradient Descent (PGD) under the $L_\infty$ ball has become one of the de facto methods for adversarial robustness evaluation in computer vision (CV) due to its reliability and efficacy, making it a strong and easy-to-implement iterative baseline. However, PGD is computationally demanding to apply, especially since current best practice recommends thousands of iterations to generate an adversarial example for a single image. In this work, we introduce a simple novel method for early termination of PGD based on cycle detection, exploiting the geometry of how PGD is implemented in practice, and show that it can produce large speedup factors while providing the \emph{exact} same estimate of model robustness as standard PGD. This method substantially speeds up PGD without sacrificing any attack strength, enabling evaluations of robustness that were previously computationally intractable.
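
The cycle-detection idea: because PGD's signed, fixed-step updates land on a discrete grid in practice, a revisited iterate means the trajectory has entered a loop and further iterations cannot change the result. A one-dimensional toy version (the paper targets image classifiers; f(x) = (x-2)^2 here is purely illustrative):

```python
# PGD with early exit on cycle detection: remember visited iterates and
# bail out as soon as one repeats, since the deterministic update rule
# will replay the same loop forever.

def pgd_with_cycle_detection(x0=0.0, step=0.3, radius=1.0, max_iters=1000):
    grad = lambda x: 2.0 * (x - 2.0)           # d/dx of (x - 2)^2
    sign = lambda g: (g > 0) - (g < 0)
    x, seen = x0, set()
    for it in range(max_iters):
        key = round(x, 6)                      # grid key for the cycle check
        if key in seen:
            return x, it                       # cycle detected: bail out early
        seen.add(key)
        x = x - step * sign(grad(x))           # signed-gradient step
        x = max(-radius, min(radius, x))       # project onto the L-inf ball
    return x, max_iters

x_final, iters_used = pgd_with_cycle_detection()
```

Here the iterate climbs to the ball boundary at x = 1.0 and then repeats, so the loop exits after a handful of steps instead of the full budget while returning the same point standard PGD would.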

Updated: 2025-03-25 04:51:44

Categories: cs.CV,cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.19347v1

Deep learning framework for action prediction reveals multi-timescale locomotor control

Modeling movement in real-world tasks is a fundamental goal for motor control, biomechanics, and rehabilitation engineering. However, widely used data-driven models of essential tasks like locomotion make simplifying assumptions, such as linear and fixed-timescale mappings between past inputs and future actions, which do not generalize to real-world contexts. Here, we develop a deep learning-based framework for action prediction with architecture-dependent trial embeddings, outperforming traditional models across contexts (walking and running, treadmill and overground, varying terrains) and input modalities (multiple body states, gaze). We find that neural network architectures with flexible input history-dependence, such as GRU and Transformer, perform best overall. By quantifying the model's predictions relative to an autoregressive baseline, we identify context- and modality-dependent timescales. These analyses reveal that there is greater reliance on fast-timescale predictions in complex terrain, that gaze predicts future foot placement earlier than body states do, and that predictions from the full-body state precede those from center-of-mass-relevant states. This deep learning framework for action prediction provides quantifiable insights into the control of real-world locomotion and can be extended to other actions, contexts, and populations.

Updated: 2025-03-25 04:50:17

Categories: cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.16340v3

Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

An important question today is whether a given text was used to train a large language model (LLM). A \emph{completion} test is often employed: check whether the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, a text is defined as a member based on the $n$-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this $n$-gram based membership definition can be effectively gamed. We study scenarios where sequences are \emph{non-members} for a given $n$ and find that completion tests still succeed. We find many natural cases of this phenomenon by retraining LLMs from scratch after removing all training samples that were completed; these cases include exact duplicates, near-duplicates, and even short overlaps. They showcase that it is difficult to find a single viable choice of $n$ for membership definitions. Using these insights, we design adversarial datasets that can cause a given target sequence to be completed without containing it, for any reasonable choice of $n$. Our findings highlight the inadequacy of $n$-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
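
The $n$-gram membership definition being critiqued fits in a few lines: a target counts as a member if it shares at least one $n$-gram with some training document. Whitespace tokenization here is a simplification; real pipelines use subword tokens.

```python
# n-gram membership test: a target is a "member" of the dataset when its
# n-gram set intersects the n-gram set of any training document.

def ngrams(text, n):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_member(target, dataset, n):
    target_grams = ngrams(target, n)
    return any(target_grams & ngrams(doc, n) for doc in dataset)

data = ["the quick brown fox jumps over the lazy dog"]
```

The choice of $n$ flips the verdict for the same target, which is the fragility the abstract exploits: "quick brown fox" is a member under $n = 3$ but a non-member under $n = 4$.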

Updated: 2025-03-25 04:43:33

Categories: cs.CL,cs.AI,cs.CR,cs.LG

Download: http://arxiv.org/abs/2503.17514v2

Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey

Large-scale pre-trained vision models (PVMs) have shown great potential for adaptation across various downstream vision tasks. However, with state-of-the-art PVMs growing to billions or even trillions of parameters, the standard full fine-tuning paradigm is becoming unsustainable due to its high computational and storage demands. In response, researchers are exploring parameter-efficient fine-tuning (PEFT), which seeks to exceed the performance of full fine-tuning with minimal parameter modifications. This survey offers a comprehensive overview and future directions for visual PEFT, providing a systematic review of the latest advancements. First, we give a formal definition of PEFT and discuss model pre-training methods. We then group existing methods into three categories: addition-based, partial-based, and unified-based. Finally, we introduce the commonly used datasets and applications and suggest potential future research challenges. A comprehensive collection of resources is available at https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning.
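
The addition-based category can be illustrated with a LoRA-style low-rank update: freeze the pre-trained weight W and train only a rank-r correction A @ B, so the adapted layer uses W + A @ B. Dimensions below are tiny and illustrative; the survey covers many more variants.

```python
# LoRA-style addition-based PEFT in pure Python: the frozen weight W is
# corrected by a trainable low-rank product A @ B.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r = 4, 1                                   # hidden size 4, rank-1 update
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[1.0], [0.0], [0.0], [0.0]]              # d x r factor, trainable
B = [[0.0, 0.5, 0.0, 0.0]]                    # r x d factor, trainable
W_eff = add(W, matmul(A, B))                  # effective adapted weight

full_params = d * d                           # 16 if we tuned W directly
peft_params = d * r + r * d                   # 8 trainable with the factors
```

At realistic sizes (d in the thousands, r around 8-64) the trainable fraction drops to well under one percent, which is the storage argument PEFT rests on.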

Updated: 2025-03-25 04:37:33

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2402.02242v3

Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models

Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal information to accurately identify sarcastic content remains a challenge that warrants further exploration. Leveraging the powerful integrated processing capabilities of Multi-Modal Large Language Models (MLLMs) for various information sources, we propose an innovative multi-modal Commander-GPT framework. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Ultimately, the detection results from each model are aggregated to identify sarcasm. We conducted extensive experiments on MMSD and MMSD 2.0, utilizing four multi-modal large language models and six prompting strategies. Our experiments demonstrate that our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score, without necessitating fine-tuning or ground-truth rationales.
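
The commander pattern, in miniature: decompose the detection task into sub-tasks, route each sub-task to the model judged best for it, and aggregate the verdicts. The sub-task names, stub "models", and majority-vote aggregation below are illustrative assumptions, not the paper's exact six sub-tasks.

```python
# Commander-style routing sketch: a routing table assigns each sub-task to
# a model; sub-task verdicts are aggregated by majority vote.

def model_a(task, sample):   # stand-in for a text-strong model
    return "sarcastic" if "yeah right" in sample["text"] else "literal"

def model_b(task, sample):   # stand-in for an image-strong model
    return ("sarcastic" if sample["image_mood"] != sample["text_mood"]
            else "literal")

ASSIGNMENT = {               # the commander's routing table
    "rhetoric_cues": model_a,
    "sentiment_flip": model_a,
    "image_text_conflict": model_b,
}

def commander(sample):
    votes = [model(task, sample) for task, model in ASSIGNMENT.items()]
    return max(set(votes), key=votes.count)   # majority aggregation

sample = {"text": "yeah right, great weather", "text_mood": "positive",
          "image_mood": "negative"}
verdict = commander(sample)
```

The key design point is that the commander only decides *who* answers each sub-task; no model is fine-tuned, matching the abstract's zero-fine-tuning claim.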

Updated: 2025-03-25 04:33:15

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.18681v2

Long-range Meta-path Search on Large-scale Heterogeneous Graphs

Long-range dependency, a concept extensively studied in homogeneous graphs, remains underexplored in heterogeneous graphs, especially large ones, posing two significant challenges: reducing computational costs while maximizing effective information utilization in the presence of heterogeneity, and overcoming the over-smoothing issue in graph neural networks. To address this gap, we investigate the importance of different meta-paths and introduce an automatic framework for utilizing long-range dependency on heterogeneous graphs, denoted Long-range Meta-path Search through Progressive Sampling (LMSPS). Specifically, we develop a search space over all meta-paths related to the target node type. By employing a progressive sampling algorithm, LMSPS dynamically shrinks the search space with hop-independent time complexity. Through a sampling evaluation strategy, LMSPS performs a specialized and effective meta-path selection, so that retraining uses only effective meta-paths, mitigating costs and over-smoothing. Extensive experiments across diverse heterogeneous datasets validate LMSPS's capability to discover effective long-range meta-paths, surpassing state-of-the-art methods. Our code is available at https://github.com/JHL-HUST/LMSPS.
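
The meta-path search space LMSPS starts from can be generated by walking the graph schema from the target node type; its size grows exponentially with hop count, which is what motivates the progressive sampling. The author/paper/venue schema below is a toy illustration, not one of the paper's benchmarks.

```python
# Enumerate candidate meta-paths (sequences of node types) up to a maximum
# hop count by walking the schema graph from the target type.

SCHEMA = {                      # edges between node types in the toy schema
    "author": ["paper"],
    "paper": ["author", "venue"],
    "venue": ["paper"],
}

def meta_paths(start, max_hops):
    paths, frontier = [], [[start]]
    for _ in range(max_hops):
        nxt = []
        for p in frontier:
            for t in SCHEMA[p[-1]]:
                q = p + [t]
                paths.append(q)
                nxt.append(q)
        frontier = nxt
    return paths

paths = meta_paths("author", max_hops=2)
# e.g. ["author", "paper", "venue"] is the classic "published at" meta-path
```

Even this three-type schema yields a search space that grows with each added hop, so sampling rather than exhaustively training on every meta-path becomes necessary at real scales.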

Updated: 2025-03-25 04:19:16

Categories: cs.AI

Download: http://arxiv.org/abs/2307.08430v6

Tensor Generalized Approximate Message Passing

We propose a tensor generalized approximate message passing (TeG-AMP) algorithm for low-rank tensor inference, which can be used to solve tensor completion and decomposition problems. We derive the TeG-AMP algorithm as an approximation of the sum-product belief propagation algorithm in high dimensions, where the central limit theorem and Taylor series approximations are applicable. As TeG-AMP is developed based on a general TR decomposition model, it can be directly applied to many low-rank tensor types. Moreover, TeG-AMP can be simplified based on the CP decomposition model, and a tensor simplified AMP is proposed for low CP-rank tensor inference problems. Experimental results demonstrate that the proposed methods significantly improve recovery performance by taking full advantage of tensor structures.

Updated: 2025-03-25 04:17:10

Categories: cs.LG,cs.AI,cs.IT,math.IT,E.4; I.2.0; I.2.6; I.4

Download: http://arxiv.org/abs/2504.00008v1

Forecasting Volcanic Radiative Power (VPR) at Fuego Volcano Using Bayesian Regularized Neural Network

Forecasting volcanic activity is critical for hazard assessment and risk mitigation. Volcanic Radiative Power (VPR), derived from thermal remote sensing data, serves as an essential indicator of volcanic activity. In this study, we employ Bayesian Regularized Neural Networks (BRNN) to predict future VPR values based on historical data from Fuego Volcano, comparing its performance against Scaled Conjugate Gradient (SCG) and Levenberg-Marquardt (LM) models. The results indicate that BRNN outperforms SCG and LM, achieving the lowest mean squared error (1.77E+16) and the highest R-squared value (0.50), demonstrating its superior ability to capture VPR variability while minimizing overfitting. Despite these promising results, challenges remain in improving the model's predictive accuracy. Future research should focus on integrating additional geophysical parameters, such as seismic and gas emission data, to enhance forecasting precision. The findings highlight the potential of machine learning models, particularly BRNN, in advancing volcanic activity forecasting, contributing to more effective early warning systems for volcanic hazards.
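
Bayesian regularization, as used in BRNN training, minimizes a weighted sum of the data misfit and the squared network weights, F = beta * E_D + alpha * E_W, which penalizes large weights and thus discourages overfitting; full BRNN training also re-estimates alpha and beta from the data. The fixed values below are only illustrative:

```python
# Bayesian-regularized training objective: weighted sum of sum-squared
# error and sum-squared weights. Larger weights raise the objective even
# when the fit is unchanged, which is the overfitting control.

def brnn_objective(errors, weights, alpha=0.01, beta=1.0):
    e_d = sum(e * e for e in errors)       # data misfit (sum-squared error)
    e_w = sum(w * w for w in weights)      # weight penalty (sum of squares)
    return beta * e_d + alpha * e_w

small_w = brnn_objective([0.1, -0.2], [0.5, 0.5])
large_w = brnn_objective([0.1, -0.2], [5.0, 5.0])
```

With identical prediction errors, the network with larger weights scores a higher (worse) objective, so training prefers the smoother, better-generalizing model.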

Updated: 2025-03-25 04:15:24

Categories: cs.LG,cs.AI,eess.SP,physics.ao-ph

Download: http://arxiv.org/abs/2503.21803v1

SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces significant challenges in synchronizing motions due to the high correlations and mutual influences among bodies. To address these challenges, we introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. SyncDiff employs a single diffusion model to capture the joint distribution of multi-body motions. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. Additionally, we introduce a new set of alignment scores to emphasize the synchronization of different body motions. SyncDiff jointly optimizes both data sample likelihood and alignment likelihood through an explicit synchronization strategy. Extensive experiments across four datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.

Updated: 2025-03-25 04:15:15

Categories: cs.CV,cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2412.20104v3

Efficient IoT Intrusion Detection with an Improved Attention-Based CNN-BiLSTM Architecture

The ever-increasing security vulnerabilities in Internet-of-Things (IoT) systems require improved threat detection approaches. This paper presents a compact and efficient approach to detecting botnet attacks through an integrated pipeline of traffic pattern analysis, temporal support learning, and focused feature extraction. The proposed attention-based model benefits from a hybrid CNN-BiLSTM architecture and achieves 99% classification accuracy in detecting botnet attacks on the N-BaIoT dataset, while maintaining high precision and recall across various scenarios. The model's performance is further validated with key parameters such as the Matthews correlation coefficient and Cohen's kappa coefficient. The close-to-ideal results for these parameters demonstrate the proposed model's ability to detect botnet attacks accurately and efficiently in practical settings and on unseen data. The proposed model thus offers a powerful defense mechanism for IoT networks facing emerging security challenges.
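
The Matthews correlation coefficient used for validation is computed directly from the confusion matrix; the counts below are made up for illustration:

```python
# Matthews correlation coefficient (MCC) from confusion-matrix counts:
# +1 is a perfect detector, 0 is no better than chance, -1 is total
# disagreement. It stays informative even on imbalanced classes.
import math

def mcc(tp, tn, fp, fn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

perfect = mcc(tp=50, tn=50, fp=0, fn=0)     # -> 1.0
chance = mcc(tp=25, tn=25, fp=25, fn=25)    # -> 0.0
```

Unlike raw accuracy, MCC cannot be inflated by predicting the majority class, which is why it is a stronger check for intrusion datasets where attacks may dominate or be rare.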

Updated: 2025-03-25 04:12:14

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2503.19339v1

Membership Inference Attacks on Large-Scale Models: A Survey

The adoption of Large Language Models (LLMs) has accelerated dramatically since OpenAI's ChatGPT went online in November 2022. Recent advances in Large Multimodal Models (LMMs), which process diverse data types and enable interaction through various channels, have expanded beyond the text-to-text limitations of early LLMs, attracting significant attention from both researchers and industry. While LLMs and LMMs are starting to spread widely, concerns about their privacy risks are increasing as well. Membership Inference Attacks (MIAs), techniques used to determine whether a particular data point was part of a model's training set, serve as a key metric for assessing the privacy vulnerabilities of machine learning models. Hu et al. show that various machine learning algorithms are vulnerable to MIAs. Despite extensive studies of MIAs on traditional models, there remains a lack of systematic surveys addressing their effectiveness and implications for modern large-scale models such as LLMs and LMMs. In this paper, we systematically review recent studies of MIAs against LLMs and LMMs. We analyze and categorize each attack based on its methodology and scenario and discuss the limitations of existing research. Additionally, we examine privacy concerns associated with the fine-tuning process. Finally, we provide some suggestions for future research in this direction.
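
The canonical MIA the survey builds on can be sketched as a loss-threshold test: training members tend to incur lower loss, so a sample is predicted to be a member when its loss falls below a calibrated threshold. The losses and threshold below are synthetic:

```python
# Loss-threshold membership inference: predict "member" when the model's
# negative log-likelihood on a sample is below a calibrated threshold.
import math

def nll(prob_true_class):
    """Negative log-likelihood the model assigns to the true class."""
    return -math.log(prob_true_class)

def mia_predict(loss, threshold):
    return loss < threshold          # True => predicted training member

threshold = 0.5                      # calibrated on held-out data in practice
member_loss = nll(0.9)               # model is confident on a memorized sample
nonmember_loss = nll(0.4)            # less confident on unseen data
```

Stronger attacks in the surveyed literature refine this idea (per-sample calibration, shadow models, reference models), but the member-vs-nonmember loss gap is the common signal they all exploit.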

Updated: 2025-03-25 04:11:47

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2503.19338v1

E-PINNs: Epistemic Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) have demonstrated promise as a framework for solving forward and inverse problems involving partial differential equations. Despite recent progress in the field, it remains challenging to quantify uncertainty in these networks. While approaches such as Bayesian PINNs (B-PINNs) provide a principled approach to capturing uncertainty through Bayesian inference, they can be computationally expensive for large-scale applications. In this work, we propose Epistemic Physics-Informed Neural Networks (E-PINNs), a framework that leverages a small network, the \emph{epinet}, to efficiently quantify uncertainty in PINNs. The proposed approach works as an add-on to existing, pre-trained PINNs with a small computational overhead. We demonstrate the applicability of the proposed framework in various test cases and compare the results with B-PINNs using Hamiltonian Monte Carlo (HMC) posterior estimation and dropout-equipped PINNs (Dropout-PINNs). Our experiments show that E-PINNs provide similar coverage to B-PINNs, with often comparable sharpness, while being computationally more efficient. This observation, combined with E-PINNs' more consistent uncertainty estimates and better calibration compared to Dropout-PINNs for the examples presented, indicates that E-PINNs offer a promising approach in terms of accuracy-efficiency trade-off.

Updated: 2025-03-25 03:53:28

Categories: cs.LG,cs.NA,math.NA

Download: http://arxiv.org/abs/2503.19333v1

PhiNets: Brain-inspired Non-contrastive Learning Based on Temporal Prediction Hypothesis

Predictive coding is a theory which hypothesises that the cortex predicts sensory inputs at various levels of abstraction to minimise prediction errors. Inspired by predictive coding, Chen et al. (2024) proposed another theory, the temporal prediction hypothesis, which claims that the sequence memory residing in the hippocampus emerged through predicting input signals from past sensory inputs. Specifically, they supposed that the CA3 predictor in the hippocampus creates a synaptic delay between input signals, which is compensated by the following CA1 predictor. Though recorded neural activities were replicated based on the temporal prediction hypothesis, its validity has not been fully explored. In this work, we aim to explore the temporal prediction hypothesis from the perspective of self-supervised learning. Specifically, we focus on non-contrastive learning, which generates two augmented views of an input image and predicts one from the other. Non-contrastive learning is intimately related to the temporal prediction hypothesis because the synaptic delay is implicitly created by StopGradient. Building upon a popular non-contrastive learner, SimSiam, we propose PhiNet, an extension of SimSiam with two predictors explicitly corresponding to CA3 and CA1, respectively. Through studying the PhiNet model, we report two findings. First, meaningful data representations emerge in PhiNet more stably than in SimSiam. This is initially supported by our learning dynamics analysis: PhiNet is more robust to representational collapse. Second, PhiNet adapts more quickly to newly incoming patterns in online and continual learning scenarios. For practitioners, we additionally propose an extension called X-PhiNet, integrated with a momentum encoder, which excels in continual learning. All in all, our work reveals that the temporal prediction hypothesis is a reasonable model in terms of robustness and adaptivity.

Updated: 2025-03-25 03:51:46

Categories: cs.LG

Download: http://arxiv.org/abs/2405.14650v2
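
A minimal sketch of the two-predictor, stop-gradient objective described above, assuming a SimSiam-style negative-cosine loss; the linear CA3/CA1 predictors below stand in for whatever MLPs PhiNet actually uses:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def neg_cosine(p, z):
    # In an autograd framework z would be wrapped in StopGradient here,
    # which is what implicitly creates the "synaptic delay" in the hypothesis.
    return float(-(l2_normalize(p) * l2_normalize(z)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
D = 32
W_ca3 = rng.normal(scale=0.1, size=(D, D))  # first predictor (CA3 role, assumed)
W_ca1 = rng.normal(scale=0.1, size=(D, D))  # second predictor (CA1 role, assumed)

def phinet_loss(z1, z2):
    """Chain the two predictors on view 1 and match the (stop-gradient)
    encoding of view 2."""
    p = (z1 @ W_ca3) @ W_ca1
    return neg_cosine(p, z2)
```

In SimSiam there is a single predictor between the two encodings; the chained pair here is the structural change PhiNet proposes.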

Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces S-DiT, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.

Updated: 2025-03-25 03:50:34

Categories: eess.AS, cs.LG, cs.SD

Download: http://arxiv.org/abs/2502.18924v2

Lightweight Models for Emotional Analysis in Video

In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.

Updated: 2025-03-25 03:50:11

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2503.10530v2
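
The temporal aggregation module above builds on MLP-Mixer-style mixing. A minimal single-layer sketch of the token-then-channel mixing pattern on a (tokens, channels) feature map follows; real mixer blocks use two-layer MLPs with nonlinearities and LayerNorm, which are omitted here for brevity:

```python
import numpy as np

def mixer_layer(x, W_tok, W_chan):
    """Minimal MLP-Mixer-style block: first mix information across tokens
    (spatial/temporal positions), then across channels, each with a residual
    connection. x has shape (T, C); W_tok is (T, T); W_chan is (C, C)."""
    x = x + W_tok @ x      # token mixing: combine features across positions
    x = x + x @ W_chan     # channel mixing: combine features per position
    return x
```

Stacking such layers at multiple patch resolutions gives a multi-scale aggregator in the spirit of the three-level module described in the abstract.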

IDOL: Instant Photorealistic 3D Human Creation from a Single Image

Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K diverse, photorealistic sets of human images. Each set contains 24-view frames in specific human poses, generated using a pose-controllable image-to-multi-view model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, body shape, clothing geometry, and texture. The estimated Gaussians can be animated without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the ability to efficiently reconstruct photorealistic humans at 1K resolution from a single input image using a single GPU instantly. Additionally, it seamlessly supports various applications, as well as shape and texture editing tasks. Project page: https://yiyuzhuang.github.io/IDOL/.

Updated: 2025-03-25 03:48:17

Categories: cs.CV, cs.GR, cs.LG, 68U05, 68T07, 68T45, I.3.7; I.2.10; I.2.6

Download: http://arxiv.org/abs/2412.14963v2

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.

Updated: 2025-03-25 03:46:09

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2501.05069v2

ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning

Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) a hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) a Channel-Aware Decoder, a lightweight decoder that utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI.

Updated: 2025-03-25 03:45:59

Categories: cs.CV, cs.LG

Download: http://arxiv.org/abs/2503.19331v1
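
The dynamic channel-patch masking strategy can be illustrated directly on a raw array: both random patches and random whole channels are hidden, so the model must reconstruct each from the other. The masking ratios and patch size below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def channel_patch_mask(x, patch=4, p_patch=0.5, p_channel=0.25, rng=None):
    """Mask random patches AND random whole channels of a (C, H, W) tensor.
    Returns the masked tensor plus the boolean patch and channel masks,
    which would serve as reconstruction targets."""
    rng = rng if rng is not None else np.random.default_rng()
    C, H, W = x.shape
    gh, gw = H // patch, W // patch
    patch_mask = rng.random((gh, gw)) < p_patch   # True = patch is hidden
    chan_mask = rng.random(C) < p_channel         # True = whole channel hidden
    out = x.copy()
    out[chan_mask] = 0.0                          # drop entire channels
    full = np.kron(patch_mask, np.ones((patch, patch), dtype=bool))
    out[:, full] = 0.0                            # drop masked patches everywhere
    return out, patch_mask, chan_mask
```

Reconstructing the zeroed channels forces the cross-channel dependencies the abstract emphasizes, since the information is only recoverable from the surviving channels.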

Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

Natural Language Processing (NLP) is revolutionising the way legal professionals and laypersons operate in the legal field. The considerable potential for NLP in the legal sector, especially in developing computational tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 154 studies, with a final selection of 133 after manual filtering. It explores foundational concepts related to NLP in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document length, complex language, and limited open legal datasets. We provide an overview of NLP tasks specific to legal text, such as Legal Document Summarisation, legal Named Entity Recognition, Legal Question Answering, Legal Argument Mining, Legal Text Classification, and Legal Judgement Prediction. In the section on legal Language Models (LMs), we analyse both developed LMs and approaches for adapting general LMs to the legal domain. Additionally, we identify 16 Open Research Challenges, including bias in Artificial Intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.

Updated: 2025-03-25 03:45:48

Categories: cs.CL, cs.AI, A.1; I.2.7; J.1

Download: http://arxiv.org/abs/2410.21306v2

Wavelet-based Global-Local Interaction Network with Cross-Attention for Multi-View Diabetic Retinopathy Detection

Multi-view diabetic retinopathy (DR) detection has recently emerged as a promising method to address the issue of incomplete lesions faced by single-view DR. However, it is still challenging due to the variable sizes and scattered locations of lesions. Furthermore, existing multi-view DR methods typically merge multiple views without considering the correlations and redundancies of lesion information across them. Therefore, we propose a novel method to overcome the challenges of difficult lesion information learning and inadequate multi-view fusion. Specifically, we introduce a two-branch network to obtain both local lesion features and their global dependencies. The high-frequency component of the wavelet transform is used to exploit lesion edge information, which is then enhanced by global semantics to facilitate learning of difficult lesions. Additionally, we present a cross-view fusion module to improve multi-view fusion and reduce redundancy. Experimental results on large public datasets demonstrate the effectiveness of our method. The code is open sourced on https://github.com/HuYongting/WGLIN.

Updated: 2025-03-25 03:44:57

Categories: eess.IV, cs.AI, cs.CV

Download: http://arxiv.org/abs/2503.19329v1
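
The high-frequency component mentioned above can be obtained with a one-level 2-D Haar decomposition. The averaging-convention sketch below is generic (the paper does not specify its wavelet or normalization), but it shows how the high-frequency sub-bands isolate edge information:

```python
import numpy as np

def haar_highfreq(img):
    """One-level 2-D Haar transform of an (H, W) image with even H, W.
    Returns the low-pass band LL plus the three high-frequency sub-bands,
    which carry edge detail."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]   # 2x2 neighborhoods
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    LL = (a + b + c + d) / 4   # local average (coarse content)
    LH = (a + b - c - d) / 4   # row difference: horizontal edges
    HL = (a - b + c - d) / 4   # column difference: vertical edges
    HH = (a - b - c + d) / 4   # diagonal detail
    return LL, LH, HL, HH
```

On a smooth region all three high-frequency bands vanish, while a lesion boundary produces a strong response in the band matching its orientation.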

Substance over Style: Evaluating Proactive Conversational Coaching Agents

While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges: initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.

Updated: 2025-03-25 03:44:31

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2503.19328v1

TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.

Updated: 2025-03-25 03:44:11

Categories: cs.CL, cs.AI, cs.CV

Download: http://arxiv.org/abs/2503.22714v1
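
The outlier-driven test-split strategy can be sketched as distance-to-centroid ranking in a joint embedding space; TRIDIS's actual detector may be more sophisticated, so treat this as a minimal stand-in for the idea of holding out the most atypical samples as a challenging out-of-domain test set:

```python
import numpy as np

def outlier_test_split(emb, test_frac=0.1):
    """Rank samples by distance from the embedding centroid and hold out
    the farthest fraction as the 'challenging' test set.
    emb: (N, D) array of sample embeddings. Returns (train_idx, test_idx)."""
    centroid = emb.mean(axis=0)
    dist = np.linalg.norm(emb - centroid, axis=1)
    n_test = max(1, int(len(emb) * test_frac))
    order = np.argsort(dist)                    # nearest ... farthest
    return order[:-n_test], order[-n_test:]
```

Compared with a random split, this construction ensures the test partition probes generalization rather than memorization, which is the contrast the baseline experiments measure.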

Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps

Recent reasoning large language models (LLMs) have demonstrated remarkable improvements in mathematical reasoning capabilities through long Chain-of-Thought. The reasoning tokens of these models enable self-correction within reasoning chains, enhancing robustness. This motivates our exploration: how vulnerable are reasoning LLMs to subtle errors in their input reasoning chains? We introduce "Compromising Thought" (CPT), a vulnerability where models presented with reasoning tokens containing manipulated calculation results tend to ignore correct reasoning steps and adopt incorrect results instead. Through systematic evaluation across multiple reasoning LLMs, we design three increasingly explicit prompting methods to measure CPT resistance, revealing that models struggle significantly to identify and correct these manipulations. Notably, contrary to existing research suggesting structural alterations affect model performance more than content modifications, we find that local ending token manipulations have greater impact on reasoning outcomes than structural changes. Moreover, we discover a security vulnerability in DeepSeek-R1 where tampered reasoning tokens can trigger complete reasoning cessation. Our work enhances understanding of reasoning robustness and highlights security considerations for reasoning-intensive applications.

Updated: 2025-03-25 03:43:11

Categories: cs.AI

Download: http://arxiv.org/abs/2503.19326v1

How to optimize K-means?

Center-based clustering algorithms (e.g., K-means) are popular for clustering tasks, but they usually struggle to achieve high accuracy on complex datasets. We believe the main reason is that traditional center-based clustering algorithms identify only one clustering center in each cluster. When the distribution of the dataset is complex, a single clustering center cannot strongly represent distant objects within the cluster. Optimizing existing center-based clustering algorithms is therefore a valuable research direction. In this paper, we propose a general optimization method called ECAC, and it can optimize different center-based clustering algorithms. ECAC is independent of the clustering principle and is embedded as a component between the center process and the category assignment process of center-based clustering algorithms. Specifically, ECAC identifies several extended-centers for each clustering center. The extended-centers act as relays to expand the representative capability of the clustering center in the complex cluster, thus improving the accuracy of center-based clustering algorithms. We conducted numerous experiments to verify the robustness and effectiveness of ECAC. ECAC is robust to diverse datasets and diverse clustering centers. After ECAC optimization, the accuracy (NMI as well as RI) of center-based clustering algorithms improves by an average of 33.4% and 64.1%, respectively, and even K-means accurately identifies complex-shaped clusters.

Updated: 2025-03-25 03:37:52

Categories: cs.LG

Download: http://arxiv.org/abs/2503.19324v1
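
A rough sketch of the extended-center assignment idea: each center gains a few relay points, and objects are labeled by the nearest of the center and its relays. The rule used here to pick extended-centers (the farthest in-cluster members) is an assumption, since the abstract does not specify ECAC's selection rule:

```python
import numpy as np

def ecac_assign(X, centers, n_ext=2):
    """Assign points to the cluster whose center OR extended-centers is
    nearest. X: (N, D) data; centers: (K, D) cluster centers."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)                       # plain K-means assignment
    reps = {k: [centers[k]] for k in range(len(centers))}
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members):
            far = np.argsort(np.linalg.norm(members - centers[k], axis=1))
            reps[k].extend(members[far[-n_ext:]])   # extended-centers as relays
    new_labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        new_labels[i] = min(
            reps, key=lambda c: min(np.linalg.norm(x - r) for r in reps[c]))
    return new_labels
```

The relays let an elongated or curved cluster claim distant objects that a single centroid would lose to a neighboring cluster.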

The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT) -- temporarily updating model parameters during inference using a loss derived from input data -- as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to 6x higher accuracy compared to fine-tuned baselines -- reaching 53.0% on the public validation set with an 8B-parameter LM and 61.9% when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the 10-shot setting by 7.3 percentage points (50.5% to 57.8%). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.

Updated: 2025-03-25 03:36:21

Categories: cs.AI, cs.CL, cs.LG

Download: http://arxiv.org/abs/2411.07279v2

Improved Training Technique for Latent Consistency Models

Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-$c$ scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: https://github.com/quandao10/sLCT/

Updated: 2025-03-25 03:30:17

Categories: cs.CV, cs.LG

Download: http://arxiv.org/abs/2502.01441v2
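
The loss substitution at the heart of the method is easy to see numerically: both losses are quadratic near zero, but the Cauchy loss grows only logarithmically on large residuals, so an impulsive latent-space outlier contributes far less to the training signal:

```python
import numpy as np

def pseudo_huber(r, c=1.0):
    # Quadratic near 0, asymptotically linear in |r|.
    return c**2 * (np.sqrt(1.0 + (r / c) ** 2) - 1.0)

def cauchy_loss(r, c=1.0):
    # Quadratic near 0 as well, but only logarithmic growth in |r|.
    return 0.5 * c**2 * np.log1p((r / c) ** 2)
```

For a residual of 100 (with c=1) the Pseudo-Huber loss is roughly 99 while the Cauchy loss is under 5, which is exactly the outlier-dampening behavior the paper relies on in latent space.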

Efficient Adversarial Detection Frameworks for Vehicle-to-Microgrid Services in Edge Computing

As Artificial Intelligence (AI) becomes increasingly integrated into microgrid control systems, the risk of malicious actors exploiting vulnerabilities in Machine Learning (ML) algorithms to disrupt power generation and distribution grows. Detection models to identify adversarial attacks need to meet the constraints of edge environments, where computational power and memory are often limited. To address this issue, we propose a novel strategy that optimizes detection models for Vehicle-to-Microgrid (V2M) edge environments without compromising performance against inference and evasion attacks. Our approach integrates model design and compression into a unified process and results in a highly compact detection model that maintains high accuracy. We evaluated our method against four benchmark evasion attacks-Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Carlini & Wagner method (C&W) and Conditional Generative Adversarial Network (CGAN) method-and two knowledge-based attacks, white-box and gray-box. Our optimized model reduces memory usage from 20MB to 1.3MB, inference time from 3.2 seconds to 0.9 seconds, and GPU utilization from 5% to 2.68%.

Updated: 2025-03-25 03:26:49

Categories: cs.CR

Download: http://arxiv.org/abs/2503.19318v1
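
Of the benchmark attacks listed, FGSM is the simplest to sketch: nudge the input by eps in the sign of the loss gradient with respect to the input. A minimal version against a logistic-regression victim (the detection models in the paper are of course larger):

```python
import numpy as np

def fgsm(x, y, w, b, eps=0.1):
    """Fast Gradient Sign Method against a logistic-regression victim.
    x: input features; y: true label in {0, 1}; (w, b): victim parameters."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # victim's predicted probability
    grad_x = (p - y) * w                     # d(binary cross-entropy)/dx
    return x + eps * np.sign(grad_x)
```

BIM simply iterates this step with a smaller eps, while C&W and CGAN-based attacks optimize the perturbation rather than taking a single signed step.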

RGL: A Graph-Centric, Modular Framework for Efficient Retrieval-Augmented Generation on Graphs

Recent advances in graph learning have paved the way for innovative retrieval-augmented generation (RAG) systems that leverage the inherent relational structures in graph data. However, many existing approaches suffer from rigid, fixed settings and significant engineering overhead, limiting their adaptability and scalability. Additionally, the RAG community has largely overlooked the decades of research in the graph database community regarding the efficient retrieval of interesting substructures on large-scale graphs. In this work, we introduce the RAG-on-Graphs Library (RGL), a modular framework that seamlessly integrates the complete RAG pipeline-from efficient graph indexing and dynamic node retrieval to subgraph construction, tokenization, and final generation-into a unified system. RGL addresses key challenges by supporting a variety of graph formats and integrating optimized implementations for essential components, achieving speedups of up to 143x compared to conventional methods. Moreover, its flexible utilities, such as dynamic node filtering, allow for rapid extraction of pertinent subgraphs while reducing token consumption. Our extensive evaluations demonstrate that RGL not only accelerates the prototyping process but also enhances the performance and applicability of graph-based RAG systems across a range of tasks.

Updated: 2025-03-25 03:21:48

Categories: cs.IR, cs.LG

Download: http://arxiv.org/abs/2503.19314v1
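
Dynamic node filtering plus subgraph extraction can be sketched with a plain BFS over an adjacency dict; RGL's graph-database-style optimizations are what make this fast at scale, which this sketch does not attempt:

```python
from collections import deque

def k_hop_subgraph(adj, seeds, k, keep=lambda n: True):
    """BFS outward from retrieved seed nodes, at most k hops, visiting only
    nodes that pass the `keep` predicate. Returns the induced node set,
    ready for subgraph construction and tokenization."""
    visited = {s for s in seeds if keep(s)}
    frontier = deque((s, 0) for s in visited)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue                      # hop budget exhausted on this path
        for nb in adj.get(node, ()):
            if nb not in visited and keep(nb):
                visited.add(nb)
                frontier.append((nb, depth + 1))
    return visited
```

Filtering during traversal (rather than after) is what keeps irrelevant nodes out of the prompt and reduces token consumption, as the abstract highlights.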

Coverage-based Fairness in Multi-document Summarization

Fairness in multi-document summarization (MDS) measures whether a system can generate a summary fairly representing information from documents with different social attribute values. Fairness in MDS is crucial since a fair summary can offer readers a comprehensive view. Previous works focus on quantifying summary-level fairness using Proportional Representation, a fairness measure based on Statistical Parity. However, Proportional Representation does not consider redundancy in input documents and overlooks corpus-level unfairness. In this work, we propose a new summary-level fairness measure, Equal Coverage, which is based on coverage of documents with different social attribute values and considers the redundancy within documents. To detect the corpus-level unfairness, we propose a new corpus-level measure, Coverage Parity. Our human evaluations show that our measures align more with our definition of fairness. Using our measures, we evaluate the fairness of thirteen different LLMs. We find that Claude3-sonnet is the fairest among all evaluated LLMs. We also find that almost all LLMs overrepresent different social attribute values. The code is available at https://github.com/leehaoyuan/coverage_fairness.

Updated: 2025-03-25 03:19:51

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2412.08795v2

LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text

This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs, providing both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) The design of the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17%, 0.67%, and 0.92% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy=75.75%) and semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate the dual advantages of fine-grained semantic understanding and global feature matching in LRSCLIP. This work provides a new benchmark model and data support for remote sensing multimodal learning. The related code has been open-sourced and is available at https://github.com/MitsuiChen14/LRSCLIP.

Updated: 2025-03-25 03:17:42

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19311v1

Centroid Decision Forest

This paper introduces the centroid decision forest (CDF), a novel ensemble learning framework that redefines the splitting strategy and tree building of ordinary decision trees for high-dimensional classification. The splitting approach in CDF differs from traditional decision trees in that the class separability score (CSS) determines the selection of the most discriminative features at each node to construct centroids of the partitions (daughter nodes). The splitting criterion uses Euclidean distance measurements from each class centroid to achieve a splitting mechanism that is more flexible and robust. Centroids are constructed by computing the mean feature values of the selected features for each class, ensuring a class-representative division of the feature space. This centroid-driven approach enables CDF to capture complex class structures while maintaining interpretability and scalability. To evaluate CDF, 23 high-dimensional datasets are used to assess its performance against different state-of-the-art classifiers through classification accuracy and Cohen's kappa statistic. The experimental results show that CDF outperforms the conventional methods, establishing its effectiveness and flexibility for high-dimensional classification problems.
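
The described split can be written as a small NumPy toy. The separability score below (between-class over within-class variance) is only a stand-in for the paper's CSS; the routing step is the nearest-centroid assignment the abstract describes.

```python
import numpy as np

def separability(X, y):
    """Per-feature score: between-class variance over within-class variance."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((X[y == c].mean(axis=0) - overall) ** 2 for c in classes)
    within = sum(X[y == c].var(axis=0) for c in classes) + 1e-12
    return between / within

def centroid_split(X, y, n_feats=2):
    """Pick the most discriminative features, then route rows to the nearest class centroid."""
    feats = np.argsort(separability(X, y))[-n_feats:]
    classes = np.unique(y)
    centroids = np.stack([X[y == c][:, feats].mean(axis=0) for c in classes])
    d = np.linalg.norm(X[:, feats][:, None, :] - centroids[None], axis=2)
    return classes[d.argmin(axis=1)], feats

# Feature 1 is pure noise; features 0 and 2 separate the classes.
X = np.array([[0.0, 0.0, 5.0], [0.2, 0.0, 5.1], [3.0, 0.0, 0.0], [3.1, 0.0, -0.2]])
y = np.array([0, 0, 1, 1])
labels, feats = centroid_split(X, y)
```

A full CDF would apply this split recursively with random feature subsets per tree and aggregate the trees' votes.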

Updated: 2025-03-25 03:12:52

Categories: stat.ML,cs.LG,14J60

Download: http://arxiv.org/abs/2503.19306v1

Frequency Dynamic Convolution for Dense Image Prediction

While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt and Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is made publicly available at https://github.com/Linwei-Chen/FDConv.
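
The budget-partition trick can be illustrated in NumPy. This is a hedged sketch: the round-robin index grouping and the 5x5 kernel are illustrative choices, not FDConv's actual grouping. Disjoint Fourier-index groups carve one shared spectrum into several frequency-diverse kernels, so the kernel count grows without growing the parameter count.

```python
import numpy as np

def frequency_diverse_kernels(base_kernel, n_groups):
    """Partition one kernel's Fourier budget into disjoint index groups -> n kernels."""
    k = base_kernel.shape[0]
    budget = np.fft.fft2(base_kernel)                # shared Fourier-domain budget
    flat = np.arange(k * k)
    kernels = []
    for g in range(n_groups):
        mask = (flat % n_groups == g).reshape(k, k)  # disjoint Fourier indices
        kernels.append(np.real(np.fft.ifft2(np.where(mask, budget, 0))))
    return kernels

rng = np.random.default_rng(0)
base = rng.standard_normal((5, 5))                   # the single parameter budget
ks = frequency_diverse_kernels(base, n_groups=3)
```

Because the groups partition the spectrum, the kernels sum back to the original budget; an attention mixer (as in DY-Conv) would then weight these kernels per input.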

Updated: 2025-03-25 03:09:17

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.18783v2

Observation Adaptation via Annealed Importance Resampling for Partially Observable Markov Decision Processes

Partially observable Markov decision processes (POMDPs) are a general mathematical model for sequential decision-making in stochastic environments under state uncertainty. POMDPs are often solved online, which enables the algorithm to adapt to new information in real time. Online solvers typically use bootstrap particle filters based on importance resampling for updating the belief distribution. Since directly sampling from the ideal state distribution given the latest observation and previous state is infeasible, particle filters approximate the posterior belief distribution by propagating states and adjusting weights through prediction and resampling steps. However, in practice, the importance resampling technique often leads to particle degeneracy and sample impoverishment when the state transition model poorly aligns with the posterior belief distribution, especially when the received observation is highly informative. We propose an approach that constructs a sequence of bridge distributions between the state-transition and optimal distributions through iterative Monte Carlo steps, better accommodating noisy observations in online POMDP solvers. Our algorithm demonstrates significantly superior performance compared to state-of-the-art methods when evaluated across multiple challenging POMDP domains.
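
The bridging idea can be shown on a one-dimensional toy filtering step. This is a hedged sketch rather than the paper's algorithm: a Gaussian stands in for the propagated transition prior, the likelihood is tempered through a hand-picked beta schedule, and the jitter is a stand-in for a proper MCMC move.

```python
import numpy as np

def gauss_loglik(x, obs, sigma):
    return -0.5 * ((x - obs) / sigma) ** 2

def annealed_resampling_update(particles, obs, sigma, betas, rng):
    """Move prior particles toward the posterior through tempered likelihoods."""
    prev_beta = 0.0
    for beta in betas:
        # Incremental weight: likelihood^(beta - prev_beta)
        logw = (beta - prev_beta) * gauss_loglik(particles, obs, sigma)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(len(particles), size=len(particles), p=w)
        # Jitter after resampling keeps the particle set diverse (a stand-in
        # for a proper MCMC move, so the target is only approximate).
        particles = particles[idx] + 0.1 * rng.standard_normal(len(particles))
        prev_beta = beta
    return particles

rng = np.random.default_rng(1)
prior = rng.normal(0.0, 3.0, size=5000)   # broad state-transition prior
post = annealed_resampling_update(prior, obs=4.0, sigma=0.5,
                                  betas=[0.25, 0.5, 1.0], rng=rng)
# For this conjugate toy the posterior mean is 4 * 9 / 9.25, roughly 3.89.
```

A single importance-resampling step (betas=[1.0]) would put nearly all weight on a handful of prior particles near the observation; the intermediate stages spread that adaptation over several gentler updates.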

Updated: 2025-03-25 03:05:00

Categories: cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.19302v1

On Diffusion Modeling for Anomaly Detection

Known for their impressive performance in generative modeling, diffusion models are attractive candidates for density-based anomaly detection. This paper investigates different variations of diffusion modeling for unsupervised and semi-supervised anomaly detection. In particular, we find that Denoising Diffusion Probabilistic Models (DDPM) are performant on anomaly detection benchmarks yet computationally expensive. By simplifying DDPM in application to anomaly detection, we are naturally led to an alternative approach called Diffusion Time Estimation (DTE). DTE estimates the distribution over diffusion time for a given input and uses the mode or mean of this distribution as the anomaly score. We derive an analytical form for this density and leverage a deep neural network to improve inference efficiency. Through empirical evaluations on the ADBench benchmark, we demonstrate that all diffusion-based anomaly detection methods perform competitively for both semi-supervised and unsupervised settings. Notably, DTE achieves orders of magnitude faster inference time than DDPM, while outperforming it on this benchmark. These results establish diffusion-based anomaly detection as a scalable alternative to traditional methods and recent deep-learning techniques for standard unsupervised and semi-supervised anomaly detection settings.
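
DTE's core intuition can be demonstrated on a toy problem. Everything here is a stand-in: a k-NN regressor replaces the paper's neural network and analytic density, and the linear noise schedule is arbitrary. Simulate the forward diffusion on nominal data, learn to predict diffusion time from a noised sample, and use the predicted time of a test point as its anomaly score; anomalies resemble heavily diffused samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward diffusion x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with abar_t = 1 - t.
train = rng.normal(0.0, 0.1, size=(2000, 2))      # nominal data: a tight cluster
t = rng.uniform(0.0, 1.0, size=len(train))        # random diffusion times
abar = 1.0 - t                                    # toy linear noise schedule
noisy = (np.sqrt(abar)[:, None] * train
         + np.sqrt(1.0 - abar)[:, None] * rng.standard_normal(train.shape))

def dte_score(x, noisy, t, k=25):
    """Anomaly score: mean diffusion time among the k nearest noised training points."""
    nearest = np.argsort(np.linalg.norm(noisy - x, axis=1))[:k]
    return t[nearest].mean()

inlier_score = dte_score(np.array([0.05, -0.05]), noisy, t)
outlier_score = dte_score(np.array([3.0, 3.0]), noisy, t)
```

Points near the data manifold match lightly noised samples (small predicted time), while far-away points only resemble heavily noised ones, so their score is larger.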

Updated: 2025-03-25 03:01:44

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2305.18593v3

UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design

The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.

Updated: 2025-03-25 03:01:16

Categories: cs.LG,q-bio.BM

Download: http://arxiv.org/abs/2503.19300v1

Non-autoregressive Generative Models for Reranking Recommendation

Contemporary recommendation systems are designed to meet users' needs by delivering tailored lists of items that align with their specific demands or interests. In a multi-stage recommendation system, reranking plays a crucial role by modeling the intra-list correlations among items. The key challenge of reranking lies in the exploration of optimal sequences within the combinatorial space of permutations. Recent research proposes a generator-evaluator learning paradigm, where the generator generates multiple feasible sequences and the evaluator picks out the best sequence based on the estimated listwise score. The generator is of vital importance, and generative models are well-suited for the generator function. Current generative models employ an autoregressive strategy for sequence generation. However, deploying autoregressive models in real-time industrial systems is challenging. To address these issues, we propose a Non-AutoRegressive generative model for reranking Recommendation (NAR4Rec) designed to enhance efficiency and effectiveness. To tackle challenges such as sparse training samples and dynamic candidates, we introduce a matching model. Considering the diverse nature of user feedback, we employ a sequence-level unlikelihood training objective to differentiate feasible sequences from unfeasible ones. Additionally, to overcome the lack of dependency modeling in non-autoregressive models regarding target items, we introduce contrastive decoding to capture correlations among these items. Extensive offline experiments validate the superior performance of NAR4Rec over state-of-the-art reranking methods. Online A/B tests reveal that NAR4Rec significantly enhances the user experience. Furthermore, NAR4Rec has been fully deployed in a popular video app Kuaishou with over 300 million daily active users.
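
A sequence-level unlikelihood objective of the kind described can be sketched with NumPy (hedged: the independent per-position factorization and the exact loss form are illustrative, not NAR4Rec's): feasible sequences are pushed toward high likelihood, while infeasible ones are penalized through a -log(1 - p) term.

```python
import numpy as np

def seq_logprob(logits, seq):
    """Non-autoregressive sequence log-prob: positions are scored independently."""
    logp = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)  # log-softmax
    return logp[np.arange(len(seq)), list(seq)].sum()

def unlikelihood_loss(logits, feasible, infeasible):
    """NLL on feasible sequences plus an unlikelihood term on infeasible ones."""
    loss = -sum(seq_logprob(logits, s) for s in feasible)
    for s in infeasible:
        p = np.exp(seq_logprob(logits, s))
        loss -= np.log(1.0 - p + 1e-12)
    return loss

uniform = np.zeros((3, 4))                  # 3 list slots, 4 candidate items
tuned = np.zeros((3, 4))
tuned[np.arange(3), [0, 1, 2]] = 3.0        # logits now favor the feasible order
before = unlikelihood_loss(uniform, feasible=[[0, 1, 2]], infeasible=[[3, 3, 3]])
after = unlikelihood_loss(tuned, feasible=[[0, 1, 2]], infeasible=[[3, 3, 3]])
```

The independence across positions is exactly why the paper adds contrastive decoding: without it, nothing ties the chosen items at different slots together.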

Updated: 2025-03-25 02:54:01

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2402.06871v6

Swift Hydra: Self-Reinforcing Generative Framework for Anomaly Detection with Multiple Mamba Models

Despite a plethora of anomaly detection models developed over the years, their ability to generalize to unseen anomalies remains an issue, particularly in critical systems. This paper aims to address this challenge by introducing Swift Hydra, a new framework for training an anomaly detection method based on generative AI and reinforcement learning (RL). By featuring an RL policy that operates on the latent variables of a generative model, the framework synthesizes novel and diverse anomaly samples that are capable of bypassing a detection model. These generated synthetic samples are, in turn, used to augment the detection model, further improving its ability to handle challenging anomalies. Swift Hydra also incorporates Mamba models structured as a Mixture of Experts (MoE) to enable scalable adaptation of the number of Mamba experts based on data complexity, effectively capturing diverse feature distributions without increasing the model's inference time. Empirical evaluations on the ADBench benchmark demonstrate that Swift Hydra outperforms other state-of-the-art anomaly detection models while maintaining a relatively short inference time. From these results, our research highlights a new and auspicious paradigm of integrating RL and generative AI for advancing anomaly detection.

Updated: 2025-03-25 02:53:03

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.06413v2

Adaptive Wavelet Filters as Practical Texture Feature Amplifiers for Parkinson's Disease Screening in OCT

Parkinson's disease (PD) is a prevalent neurodegenerative disorder globally. The eye's retina is an extension of the brain and has great potential in PD screening. Recent studies have suggested that texture features extracted from retinal layers can be adopted as biomarkers for PD diagnosis under optical coherence tomography (OCT) images. Frequency domain learning techniques can enhance the feature representations of deep neural networks (DNNs) by decomposing frequency components involving rich texture features. Additionally, previous works have not exploited texture features for automated PD screening in OCT. Motivated by the above analysis, we propose a novel Adaptive Wavelet Filter (AWF) that serves as the Practical Texture Feature Amplifier to fully leverage the merits of texture features to boost the PD screening performance of DNNs with the aid of frequency domain learning. Specifically, AWF first enhances texture feature representation diversities via channel mixer, then emphasizes informative texture feature representations with the well-designed adaptive wavelet filtering token mixer. By combining the AWFs with the DNN stem, AWFNet is constructed for automated PD screening. Additionally, we introduce a novel Balanced Confidence (BC) Loss by mining the potential of sample-wise predicted probabilities of all classes and class frequency prior, to further boost the PD screening performance and trustworthiness of AWFNet. Extensive experiments demonstrate the superiority of our AWFNet and BC over state-of-the-art methods in terms of PD screening performance and trustworthiness.

Updated: 2025-03-25 02:47:24

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.19292v1

No Black Box Anymore: Demystifying Clinical Predictive Modeling with Temporal-Feature Cross Attention Mechanism

Despite the outstanding performance of deep learning models in clinical prediction tasks, explainability remains a significant challenge. Inspired by transformer architectures, we introduce the Temporal-Feature Cross Attention Mechanism (TFCAM), a novel deep learning framework designed to capture dynamic interactions among clinical features across time, enhancing both predictive accuracy and interpretability. In an experiment with 1,422 patients with Chronic Kidney Disease, predicting progression to End-Stage Renal Disease, TFCAM outperformed LSTM and RETAIN baselines, achieving an AUROC of 0.95 and an F1-score of 0.69. Beyond performance gains, TFCAM provides multi-level explainability by identifying critical temporal periods, ranking feature importance, and quantifying how features influence each other across time before affecting predictions. Our approach addresses the "black box" limitations of deep learning in healthcare, offering clinicians transparent insights into disease progression mechanisms while maintaining state-of-the-art predictive performance.
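
The mechanism in the name can be reduced to a few lines of NumPy. This is a generic scaled dot-product sketch; the token shapes and direction of attention are assumptions, not TFCAM's full architecture. Temporal tokens act as queries over clinical-feature tokens, and the resulting attention matrix is a time-by-feature interaction map of the kind the paper inspects for explainability.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Scaled dot-product attention: queries from one stream, keys/values from the other."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)      # (n_time, n_feat) time-to-feature attention
    return weights @ V, weights

rng = np.random.default_rng(0)
time_tokens = rng.standard_normal((6, 8))   # one embedding per time step
feat_tokens = rng.standard_normal((10, 8))  # one embedding per clinical feature
out, attn = cross_attention(time_tokens, feat_tokens, feat_tokens)
```

Each row of `attn` sums to one, so it can be read directly as how strongly each time step weights each feature.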

Updated: 2025-03-25 02:35:08

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19285v1

Zeroth-order Informed Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

The probabilistic diffusion model (DM), which generates content by inference through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous unlabeled data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation respectively, resulting in limited improvement or, even worse, complete training failure. To overcome the challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a zeroth-order informed fine-tuning paradigm for DM. The zeroth-order gradient estimator enables the computation graph rearrangement within the recursive diffusive chain, making the RLR's gradient estimator an unbiased one with lower variance than other methods. We provide theoretical guarantees for the performance of the RLR. Extensive experiments are conducted on image and video generation tasks to validate the superiority of the RLR. Furthermore, we propose a novel prompt technique that pairs naturally with the RLR to achieve a synergistic effect.
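
The zeroth-order ingredient can be illustrated generically. This is the standard Gaussian-smoothing estimator, not the RLR's recursive estimator: gradients are recovered from function evaluations alone, which is what allows the computation graph to be rearranged rather than backpropagated through.

```python
import numpy as np

def zeroth_order_grad(f, x, sigma=1e-3, n=20000, seed=0):
    """Estimate grad f(x) from function values only: E[(f(x + s*u) - f(x)) * u] / s."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n, x.size))            # random Gaussian directions
    fx = f(x)
    fp = np.array([f(x + sigma * ui) for ui in u])  # one extra evaluation per direction
    return ((fp - fx)[:, None] * u).mean(axis=0) / sigma

f = lambda v: float((v ** 2).sum())   # true gradient is 2v
x = np.array([1.0, -2.0, 0.5])
g = zeroth_order_grad(f, x)
```

The estimator is unbiased up to O(sigma) smoothing bias but noisy; variance-reduction choices like those in the RLR determine whether such estimators are usable at scale.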

Updated: 2025-03-25 02:35:02

Categories: cs.CV,cs.AI,cs.LG,stat.ML

Download: http://arxiv.org/abs/2502.00639v2

Dataset Distillation for Quantum Neural Networks

Training Quantum Neural Networks (QNNs) on large amounts of classical data can be both time-consuming and expensive. A larger amount of training data requires a greater number of gradient descent steps to reach convergence. This, in turn, implies that the QNN will require a higher number of quantum executions, thereby driving up its overall execution cost. In this work, we propose performing the dataset distillation process for QNNs, where we use a novel quantum variant of the classical LeNet model containing a residual connection and a trainable Hermitian observable in the Parametric Quantum Circuit (PQC) of the QNN. This approach yields a highly informative yet small set of training data at similar performance as the original data. We perform distillation for the MNIST and Cifar-10 datasets, and on comparison with classical models observe that both datasets yield reasonably similar post-inference accuracy on quantum LeNet (91.9% MNIST, 50.3% Cifar-10) compared to classical LeNet (94% MNIST, 54% Cifar-10). We also introduce a non-trainable Hermitian observable to ensure stability in the distillation process and note a marginal accuracy reduction of up to 1.8% (1.3%) for the MNIST (Cifar-10) dataset.
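
The distillation goal (a tiny synthetic set performing close to the full set) can be shown with a deliberately classical toy; everything below is a stand-in for the paper's quantum pipeline: Gaussian blobs for images, class-mean prototypes for the learned distilled set, and a nearest-prototype rule for the QNN.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian classes stand in for image data.
X0 = rng.normal(-2.0, 1.0, size=(500, 10))
X1 = rng.normal(+2.0, 1.0, size=(500, 10))
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

# "Distilled" set: one prototype per class (the class mean), 2 points instead of 1000.
protos = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def nearest_prototype(x):
    return int(np.linalg.norm(protos - x, axis=1).argmin())

test = np.vstack([rng.normal(-2.0, 1.0, size=(100, 10)),
                  rng.normal(+2.0, 1.0, size=(100, 10))])
test_y = np.array([0] * 100 + [1] * 100)
preds = np.array([nearest_prototype(x) for x in test])
acc = (preds == test_y).mean()
```

Real dataset distillation learns the synthetic points by gradient descent against the target model's training dynamics; the payoff is the same: far fewer training samples, and hence far fewer (quantum) executions.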

Updated: 2025-03-25 02:31:38

Categories: cs.LG,quant-ph

Download: http://arxiv.org/abs/2503.17935v2

CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model

Proving Rubik's Cube theorems at a high level represents a notable milestone in human-level spatial imagination and logical reasoning. Traditional Rubik's Cube robots, relying on complex vision systems and fixed algorithms, often struggle to adapt to complex and dynamic scenarios. To overcome this limitation, we introduce CubeRobot, a novel vision-language model (VLM) tailored for solving 3x3 Rubik's Cubes, empowering embodied agents with multimodal understanding and execution capabilities. We used the CubeCoT image dataset, which contains multi-level tasks (43 subtasks in total) that humans are unable to handle, encompassing various cube states. We incorporate a dual-loop VisionCoT architecture and Memory Stream, a paradigm for extracting task-related features from VLM-generated planning queries, enabling CubeRobot to plan, decide, and reflect independently, and to manage high- and low-level Rubik's Cube tasks separately. Furthermore, CubeRobot achieved an accuracy of 100% on low-level Rubik's Cube restoration tasks, 100% on medium-level tasks, and 80% on high-level tasks.

Updated: 2025-03-25 02:23:47

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2503.19281v1

LogicLearner: A Tool for the Guided Practice of Propositional Logic Proofs

The study of propositional logic -- fundamental to the theory of computing -- is a cornerstone of the undergraduate computer science curriculum. Learning to solve logical proofs requires repeated guided practice, but undergraduate students often lack access to on-demand tutoring in a judgment-free environment. In this work, we highlight the need for guided practice tools in undergraduate mathematics education and outline the desiderata of an effective practice tool. We accordingly develop LogicLearner, a web application for guided logic proof practice. LogicLearner consists of an interface to attempt logic proofs step-by-step and an automated proof solver to generate solutions on the fly, allowing users to request guidance as needed. We pilot LogicLearner as a practice tool in two semesters of an undergraduate discrete mathematics course and receive strongly positive feedback for usability and pedagogical value in student surveys. To the best of our knowledge, LogicLearner is the only learning tool that provides an end-to-end practice environment for logic proofs with immediate, judgment-free feedback.

Updated: 2025-03-25 02:23:08

Categories: cs.DM,cs.AI,cs.HC

Download: http://arxiv.org/abs/2503.19280v1

Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves

The study investigates the efficacy of pre-trained language models (PLMs) in analyzing argumentative moves in a longitudinal learner corpus. Prior studies on argumentative moves often rely on qualitative analysis and manual coding, limiting their efficiency and generalizability. The study aims to: 1) assess the reliability of PLMs in analyzing argumentative moves; and 2) utilize PLM-generated annotations to illustrate developmental patterns and predict writing quality. A longitudinal corpus of 1643 argumentative texts from 235 English learners in China is collected and annotated into six move types: claim, data, counter-claim, counter-data, rebuttal, and non-argument. The corpus is divided into training, validation, and application sets annotated by human experts and PLMs. We use BERT as one of the implementations of PLMs. The results indicate a robust reliability of PLMs in analyzing argumentative moves, with an overall F1 score of 0.743, surpassing existing models in the field. Additionally, PLM-labeled argumentative moves effectively capture developmental patterns and predict writing quality. Over time, students exhibit an increase in the use of data and counter-claims and a decrease in non-argument moves. While low-quality texts are characterized by a predominant use of claims and data supporting only a one-sided position, mid- and high-quality texts demonstrate an integrative perspective with a higher ratio of counter-claims, counter-data, and rebuttals. This study underscores the transformative potential of integrating artificial intelligence into language education, enhancing the efficiency and accuracy of evaluating students' writing. The successful application of PLMs can catalyze the development of educational technology, promoting a more data-driven and personalized learning environment that supports diverse educational needs.

Updated: 2025-03-25 02:21:12

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.19279v1

Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications

Semantic segmentation has made significant strides in pixel-level image understanding, yet it remains limited in capturing contextual and semantic relationships between objects. Current models, such as CNN and Transformer-based architectures, excel at identifying pixel-level features but fail to distinguish semantically similar objects (e.g., "doctor" vs. "nurse" in a hospital scene) or understand complex contextual scenarios (e.g., differentiating a running child from a regular pedestrian in autonomous driving). To address these limitations, we propose a novel Context-Aware Semantic Segmentation framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones. Our hybrid model leverages the Swin Transformer for robust visual feature extraction and GPT-4 for enriching semantic understanding through text embeddings. A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively. Additionally, Graph Neural Networks (GNNs) are employed to model object relationships within the scene, capturing dependencies that are overlooked by traditional models. Experimental results on benchmark datasets (e.g., COCO, Cityscapes) demonstrate that our approach outperforms the existing methods in both pixel-level accuracy (mIoU) and contextual understanding (mAP). This work bridges the gap between vision and language, paving the way for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.

Updated: 2025-03-25 02:12:35

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19276v1

CoMAC: Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions

Recent advancements in AI-driven conversational agents have exhibited immense potential of AI applications. Effective response generation is crucial to the success of these agents. While extensive research has focused on leveraging multiple auxiliary data sources (e.g., knowledge bases and personas) to enhance response generation, existing methods often struggle to efficiently extract relevant information from these sources. There are still clear limitations in the ability to combine versatile conversational capabilities with adherence to known facts and adaptation to large variations in user preferences and belief systems, which continues to hinder the wide adoption of conversational AI tools. This paper introduces a novel method, Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions (CoMAC), for conversation generation, which employs specialized encoding streams and post-fusion grounding networks for multiple data sources to identify relevant persona and knowledge information for the conversation. CoMAC also leverages a novel text similarity metric that allows bi-directional information sharing among multiple sources and focuses on a selective subset of meaningful words. Our experiments show that CoMAC improves the relevant persona and knowledge prediction accuracies and response generation quality significantly over two state-of-the-art methods.
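The idea of a similarity metric restricted to a selective subset of meaningful words can be sketched as a toy Jaccard variant (the stopword list and function names are illustrative assumptions, not CoMAC's actual metric):

```python
# Words treated as uninformative for matching; a real system would learn this selection
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "are"}

def content_words(text):
    """Selective subset of meaningful words: lowercase and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def symmetric_similarity(a, b):
    """Jaccard overlap on content words; symmetric in its two arguments,
    so information flows bi-directionally between the compared sources."""
    wa, wb = content_words(a), content_words(b)
    return len(wa & wb) / max(len(wa | wb), 1)

s = symmetric_similarity("I love hiking in the mountains",
                         "The mountains are great for hiking")
```

Here the shared content words {"hiking", "mountains"} drive the score, while function words are ignored.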

Updated: 2025-03-25 02:09:52

Categories: cs.CL,cs.IR,cs.LG

Download: http://arxiv.org/abs/2503.19274v1

TARDIS: Mitigating Temporal Misalignment via Representation Steering

Language models often struggle with temporal misalignment, performance degradation caused by shifts in the temporal distribution of data. Continuously updating models to avoid degradation is expensive. Can models be adapted without updating model weights? We present TARDIS, an unsupervised representation editing method that addresses this challenge. TARDIS extracts steering vectors from unlabeled data and adjusts the model's representations to better align with the target time period's distribution. Our experiments reveal that TARDIS enhances downstream task performance without the need for fine-tuning, can mitigate temporal misalignment even when exact target time period data is unavailable, and remains efficient even when the temporal information of the target data points is unknown at inference time.
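As a rough illustration of representation steering (not the paper's exact procedure), a steering vector can be taken as the difference of mean activations between target-period and source-period data, then added to hidden states at inference time:

```python
import numpy as np

def steering_vector(source_acts, target_acts):
    """Difference of mean activations between target- and source-period data."""
    return target_acts.mean(axis=0) - source_acts.mean(axis=0)

def steer(hidden, vec, alpha=1.0):
    """Shift hidden representations toward the target period's distribution."""
    return hidden + alpha * vec

rng = np.random.default_rng(1)
src_acts = rng.normal(loc=0.0, size=(100, 16))  # toy activations on older text
tgt_acts = rng.normal(loc=1.0, size=(100, 16))  # toy activations on target-period text
vec = steering_vector(src_acts, tgt_acts)
steered = steer(src_acts, vec)
```

No weights are updated: only the representations move, which is what makes the approach cheap compared with continual fine-tuning.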

Updated: 2025-03-25 02:09:27

Categories: cs.LG

Download: http://arxiv.org/abs/2503.18693v2

Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation

Text-to-image generation has become increasingly popular, but achieving the desired images often requires extensive prompt engineering. In this paper, we explore how to decode textual prompts from reference images, a process we refer to as image reverse prompt engineering. This technique enables us to gain insights from reference images, understand the creative processes of great artists, and generate impressive new images. To address this challenge, we propose a method known as automatic reverse prompt optimization (ARPO). Specifically, our method refines an initial prompt into a high-quality prompt through an iteratively imitative gradient prompt optimization process: 1) generating a recreated image from the current prompt to instantiate its guidance capability; 2) producing textual gradients, which are candidate prompts intended to reduce the difference between the recreated image and the reference image; 3) updating the current prompt with textual gradients using a greedy search method to maximize the CLIP similarity between prompt and reference image. We compare ARPO with several baseline methods, including handcrafted techniques, gradient-based prompt tuning methods, image captioning, and data-driven selection method. Both quantitative and qualitative results demonstrate that our ARPO converges quickly to generate high-quality reverse prompts. More importantly, we can easily create novel images with diverse styles and content by directly editing these reverse prompts. Code will be made publicly available.
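The iterative greedy update can be sketched with a toy word-overlap score standing in for CLIP similarity (all names, the candidate generator, and the scoring function are illustrative assumptions, not the ARPO implementation):

```python
def similarity(prompt, reference):
    """Toy stand-in for CLIP image-text similarity: word overlap with a reference."""
    a, b = set(prompt.lower().split()), set(reference.lower().split())
    return len(a & b) / max(len(a | b), 1)

def greedy_reverse_prompt(initial, candidates_fn, reference, steps=3):
    """Greedily replace the current prompt with the best-scoring candidate."""
    prompt = initial
    for _ in range(steps):
        candidates = candidates_fn(prompt)            # the "textual gradients"
        best = max(candidates + [prompt], key=lambda p: similarity(p, reference))
        if best == prompt:                            # no candidate improves the score
            break
        prompt = best
    return prompt

reference = "a watercolor painting of a fox in autumn forest"
expand = lambda p: [p + " watercolor", p + " fox", p + " autumn forest"]
result = greedy_reverse_prompt("a painting of a", expand, reference)
```

Each round proposes candidate edits, keeps the one that most increases similarity, and stops when no candidate helps.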

Updated: 2025-03-25 02:08:05

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19937v1

Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition

Parameter-efficient fine-tuning (PEFT) has attracted significant attention due to the growth of pre-trained model sizes and the need to fine-tune (FT) them for superior downstream performance. Despite a surge in new PEFT methods, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like "when to apply PEFT" and "which method to use" largely unanswered, especially in visual recognition. In this paper, we conduct a unifying empirical study of representative PEFT methods with Vision Transformers. We systematically tune their hyperparameters to fairly compare their accuracy on downstream tasks. Our study offers a practical user guide and unveils several new insights. First, if tuned carefully, different PEFT methods achieve similar accuracy in the low-shot benchmark VTAB-1K. This includes simple approaches, such as fine-tuning only the bias terms, that had previously been reported to be inferior. Second, despite similar accuracy, we find that PEFT methods make different mistakes and high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementarity) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PEFT is also useful in many-shot regimes, achieving comparable or better accuracy than full FT while using significantly fewer parameters. Lastly, we investigate PEFT's ability to preserve a pre-trained model's robustness to distribution shifts (e.g., CLIP). Perhaps not surprisingly, PEFT approaches outperform full FT alone. However, with weight-space ensembles, full FT can better balance target distribution and distribution shift performance, suggesting a future research direction for robust PEFT.
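The bias-only baseline mentioned above amounts to freezing every parameter except the bias terms. A minimal sketch with a toy, hypothetically-named parameter dictionary (real frameworks mark tensors trainable rather than filtering names after the fact):

```python
import numpy as np

def bias_only_trainable(params):
    """Names of parameters left trainable when fine-tuning only the bias terms."""
    return {name for name in params if name.endswith(".bias")}

# Hypothetical ViT-style parameter dict; shapes are toy-sized for illustration
params = {
    "blocks.0.attn.qkv.weight": np.zeros((3, 3)),
    "blocks.0.attn.qkv.bias":   np.zeros(3),
    "blocks.0.mlp.fc1.weight":  np.zeros((3, 3)),
    "blocks.0.mlp.fc1.bias":    np.zeros(3),
    "head.weight":              np.zeros((2, 3)),
    "head.bias":                np.zeros(2),
}
trainable = bias_only_trainable(params)
frac = sum(params[n].size for n in trainable) / sum(p.size for p in params.values())
```

Even in this tiny example only a quarter of the parameters remain trainable; in real Vision Transformers the bias fraction is far smaller, which is the source of PEFT's parameter savings.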

Updated: 2025-03-25 02:07:28

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2409.16434v5

Mathematics and Machine Creativity: A Survey on Bridging Mathematics with AI

This paper presents a comprehensive overview on the applications of artificial intelligence (AI) in mathematical research, highlighting the transformative role AI has begun to play in this domain. Traditionally, AI advancements have heavily relied on theoretical foundations provided by mathematics and statistics. However, recent developments in AI, particularly in reinforcement learning (RL) and large language models (LLMs), have demonstrated the potential for AI to contribute back to mathematics by offering flexible algorithmic frameworks and powerful inductive reasoning capabilities that support various aspects of mathematical research. This survey aims to establish a bridge between AI and mathematics, providing insights into the mutual benefits and fostering deeper interdisciplinary understanding. In particular, we argue that while current AI and LLMs may struggle with complex deductive reasoning, their "inherent creativity", the ability to generate outputs at high throughput based on recognition of shallow patterns, holds significant potential to support and inspire mathematical research. This creative capability, often overlooked, could be the key to unlocking new perspectives and methodologies in mathematics. Furthermore, we address the lack of cross-disciplinary communication: mathematicians may not fully comprehend the latest advances in AI, while AI researchers frequently prioritize benchmark performance over real-world applications in frontier mathematical research. This paper seeks to close that gap, offering a detailed exploration of AI fundamentals, its strengths, and its emerging applications in the mathematical sciences.

Updated: 2025-03-25 02:03:07

Categories: cs.AI

Download: http://arxiv.org/abs/2412.16543v3

NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios

Offline reinforcement learning (RL) aims to learn from historical data without requiring (costly) access to the environment. To facilitate offline RL research, we previously introduced NeoRL, which highlighted that datasets from real-world tasks are often conservative and limited. With years of experience applying offline RL to various domains, we have identified additional real-world challenges. These include extremely conservative data distributions produced by deployed control systems, delayed action effects caused by high-latency transitions, external factors arising from the uncontrollable variance of transitions, and global safety constraints that are difficult to evaluate during the decision-making process. These challenges are underrepresented in previous benchmarks but frequently occur in real-world tasks. To address this, we constructed the extended Near Real-World Offline RL Benchmark (NeoRL-2), which consists of 7 datasets from 7 simulated tasks along with their corresponding evaluation simulators. Benchmarking results from state-of-the-art offline RL approaches demonstrate that current methods often struggle to outperform the data-collection behavior policy, highlighting the need for more effective methods. We hope NeoRL-2 will accelerate the development of reinforcement learning algorithms for real-world applications. The benchmark project page is available at https://github.com/polixir/NeoRL2.

Updated: 2025-03-25 02:01:54

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.19267v1

Structured and sparse partial least squares coherence for multivariate cortico-muscular analysis

Multivariate cortico-muscular analysis has recently emerged as a promising approach for evaluating the corticospinal neural pathway. However, current multivariate approaches encounter challenges such as high dimensionality and limited sample sizes, thus restricting their further applications. In this paper, we propose a structured and sparse partial least squares coherence algorithm (ssPLSC) to extract shared latent space representations related to cortico-muscular interactions. Our approach leverages an embedded optimization framework by integrating a partial least squares (PLS)-based objective function, a sparsity constraint and a connectivity-based structured constraint, addressing the generalizability, interpretability and spatial structure. To solve the optimization problem, we develop an efficient alternating iterative algorithm within a unified framework and prove its convergence experimentally. Extensive experimental results from one synthetic and several real-world datasets have demonstrated that ssPLSC can achieve competitive or better performance over some representative multivariate cortico-muscular fusion methods, particularly in scenarios characterized by limited sample sizes and high noise levels. This study provides a novel multivariate fusion method for cortico-muscular analysis, offering a transformative tool for the evaluation of corticospinal pathway integrity in neurological disorders.
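Setting aside the connectivity-based structured constraint, the sparse alternating iteration resembles soft-thresholded power iteration on the cross-covariance matrix. The sketch below is a simplified stand-in, not the ssPLSC algorithm itself; the synthetic "cortical"/"muscular" channels and all names are assumptions:

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding operator inducing sparsity."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_weights(X, Y, lam=0.1, iters=50):
    """Alternately update sparse weight vectors for the cross-covariance of X and Y."""
    C = X.T @ Y / len(X)                        # sample cross-covariance
    v = np.ones(C.shape[1]) / np.sqrt(C.shape[1])
    for _ in range(iters):
        u = soft_threshold(C @ v, lam)
        u /= np.linalg.norm(u) + 1e-12
        v = soft_threshold(C.T @ u, lam)
        v /= np.linalg.norm(v) + 1e-12
    return u, v

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))                         # shared latent source
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),   # one channel tracks z
               rng.normal(size=(200, 4))])            # the rest is noise
Y = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 3))])
u, v = sparse_pls_weights(X, Y)
```

The thresholding drives the weights of the noise channels to zero, so both recovered vectors concentrate on the channel pair sharing the latent source.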

Updated: 2025-03-25 01:56:11

Categories: stat.AP,cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.21802v1

h4rm3l: A language for Composable Jailbreak Attack Synthesis

Despite their demonstrated valuable capabilities, state-of-the-art (SOTA) widely deployed large language models (LLMs) still have the potential to cause harm to society due to the ineffectiveness of their safety filters, which can be bypassed by prompt transformations called jailbreak attacks. Current approaches to LLM safety assessment, which employ datasets of templated prompts and benchmarking pipelines, fail to cover sufficiently large and diverse sets of jailbreak attacks, leading to the widespread deployment of unsafe LLMs. Recent research showed that novel jailbreak attacks could be derived by composition; however, a formal composable representation for jailbreak attacks, which, among other benefits, could enable the exploration of a large compositional space of jailbreak attacks through program synthesis methods, has not been previously proposed. We introduce h4rm3l, a novel approach that addresses this gap with a human-readable domain-specific language (DSL). Our framework comprises: (1) The h4rm3l DSL, which formally expresses jailbreak attacks as compositions of parameterized string transformation primitives. (2) A synthesizer with bandit algorithms that efficiently generates jailbreak attacks optimized for a target black box LLM. (3) The h4rm3l red-teaming software toolkit that employs the previous two components and an automated harmful LLM behavior classifier that is strongly aligned with human judgment. We demonstrate h4rm3l's efficacy by synthesizing a dataset of 2656 successful novel jailbreak attacks targeting 6 SOTA open-source and proprietary LLMs, and by benchmarking those models against a subset of these synthesized attacks. Our results show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature, with success rates exceeding 90% on SOTA LLMs.
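The core idea of expressing attacks as compositions of parameterized string-transformation primitives can be sketched with a benign example (the primitive names are illustrative, not the actual h4rm3l DSL):

```python
import base64

def b64_encode(prompt):
    """Primitive: base64-encode the whole prompt."""
    return base64.b64encode(prompt.encode()).decode()

def prefix(text):
    """Parameterized primitive: prepend a fixed instruction string."""
    return lambda prompt: text + prompt

def compose(*transforms):
    """Compose string-transformation primitives, applied right to left."""
    def apply(prompt):
        for f in reversed(transforms):
            prompt = f(prompt)
        return prompt
    return apply

# A composition: encode the payload, then wrap it with an instruction
attack = compose(prefix("Decode and answer: "), b64_encode)
out = attack("hello")   # "Decode and answer: aGVsbG8="
```

Because each primitive is an ordinary function, a synthesizer can search the compositional space simply by enumerating and scoring such pipelines.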

Updated: 2025-03-25 01:51:22

Categories: cs.CR,cs.AI,cs.CL,cs.CY,cs.LG,68,I.2; I.2.0; I.2.1; I.2.5; I.2.7; K.6.5; K.4.2

Download: http://arxiv.org/abs/2408.04811v4

Long-term excitation energy transfer predicted by a modified convolutional neural networks in the FMO complexes

In machine learning (ML), the risk of recursive strategies overfitting historical data has driven the development of convolutional neural networks (CNNs) in simulating quantum dissipative dynamics. In this work, we propose an efficient CNN scheme incorporating novel redundant time functions to predict 100 picosecond (ps) excitation energy transfer (EET) in Fenna-Matthews-Olson (FMO) complexes, in which the original time $t$ is normalized by mapping it to the [0, 1] range, allowing different functions to focus on distinct time intervals and thereby effectively capturing the multi-timescale characteristics of EET dynamics. This method simplifies optimization and enhances learning efficiency, and our results demonstrate the superior accuracy, robustness, and efficiency of the approach in predicting quantum dissipative dynamics.
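The time-normalization idea can be sketched as follows; the particular decay scales are hypothetical placeholders, and the paper's actual redundant time functions may differ:

```python
import numpy as np

def time_features(t, t_max, scales=(1.0, 5.0, 25.0)):
    """Map t to [0, 1], then emit redundant functions of the normalized time;
    each decay scale emphasizes a different interval of the dynamics."""
    s = np.asarray(t, dtype=float) / t_max          # normalized time in [0, 1]
    return np.stack([np.exp(-k * s) for k in scales] + [s], axis=-1)

t_ps = np.linspace(0.0, 100.0, 5)       # sample times in picoseconds
F = time_features(t_ps, t_max=100.0)    # shape: (5 times, 4 features)
```

Fast-decaying features resolve early, sub-picosecond behavior while slow ones remain informative across the full 100 ps window, which is the multi-timescale coverage the abstract describes.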

Updated: 2025-03-25 01:51:14

Categories: physics.chem-ph,cs.LG,quant-ph,2020: 05C70

Download: http://arxiv.org/abs/2503.17430v2

IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation

Recent research has highlighted the significance of natural language in enhancing the controllability of generative models. While various efforts have been made to leverage natural language for content generation, research on deep reinforcement learning (DRL) agents utilizing text-based instructions for procedural content generation remains limited. In this paper, we propose IPCGRL, an instruction-based procedural content generation method via reinforcement learning, which incorporates a sentence embedding model. IPCGRL fine-tunes task-specific embedding representations to effectively compress game-level conditions. We evaluate IPCGRL in a two-dimensional level generation task and compare its performance with a general-purpose embedding method. The results indicate that IPCGRL achieves up to a 21.4% improvement in controllability and a 17.2% improvement in generalizability for unseen instructions. Furthermore, the proposed method extends the modality of conditional input, enabling a more flexible and expressive interaction framework for procedural content generation.

Updated: 2025-03-25 01:48:16

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.12358v3

TUNI: A Textual Unimodal Detector for Identity Inference in CLIP Models

The widespread usage of large-scale multimodal models like CLIP has heightened concerns about the leakage of PII. Existing methods for identity inference in CLIP models require querying the model with full PII, including textual descriptions of the person and corresponding images (e.g., the name and the face photo of the person). However, applying images may risk exposing personal information to target models, as the image might not have been previously encountered by the target model. Additionally, previous MIAs train shadow models to mimic the behaviors of the target model, which incurs high computational costs, especially for large CLIP models. To address these challenges, we propose a textual unimodal detector (TUNI) in CLIP models, a novel technique for identity inference that: 1) only utilizes text data to query the target model; and 2) eliminates the need for training shadow models. Extensive experiments of TUNI across various CLIP model architectures and datasets demonstrate its superior performance over baselines, albeit with only text data.

Updated: 2025-03-25 01:47:37

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2405.14517v2

Linguistic Blind Spots of Large Language Models

Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.

Updated: 2025-03-25 01:47:13

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.19260v1

Human-AI Interaction and User Satisfaction: Empirical Evidence from Online Reviews of AI Products

Human-AI Interaction (HAI) guidelines and design principles have become increasingly important in both industry and academia to guide the development of AI systems that align with user needs and expectations. However, large-scale empirical evidence on how HAI principles shape user satisfaction in practice remains limited. This study addresses that gap by analyzing over 100,000 user reviews of AI-related products from G2, a leading review platform for business software and services. Based on widely adopted industry guidelines, we identify seven core HAI dimensions and examine their coverage and sentiment within the reviews. We find that the sentiment on four HAI dimensions-adaptability, customization, error recovery, and security-is positively associated with overall user satisfaction. Moreover, we show that engagement with HAI dimensions varies by professional background: Users with technical job roles are more likely to discuss system-focused aspects, such as reliability, while non-technical users emphasize interaction-focused features like customization and feedback. Interestingly, the relationship between HAI sentiment and overall satisfaction is not moderated by job role, suggesting that once an HAI dimension has been identified by users, its effect on satisfaction is consistent across job roles.

Updated: 2025-03-25 01:44:50

Categories: cs.HC,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.17955v2

Free-Space Optical Channel Turbulence Prediction: A Machine Learning Approach

Channel turbulence is a formidable obstacle for free-space optical (FSO) communication. Anticipation of turbulence levels is highly important for mitigating disruptions but has not been demonstrated without dedicated, auxiliary hardware. We show that machine learning (ML) can be applied to raw FSO data streams to rapidly predict channel turbulence levels with no additional sensing hardware. FSO was conducted through a controlled channel in the lab under six distinct turbulence levels, and the efficacy of using ML to classify turbulence levels was examined. ML-based turbulence level classification was found to be >98% accurate with multiple ML training parameters. Classification effectiveness was found to depend on the timescale of changes between turbulence levels but converges when turbulence stabilizes over about a one minute timescale.
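A minimal stand-in for the classification experiment, using a nearest-centroid classifier on synthetic per-level feature clusters (the real study trains ML models on raw FSO data streams; the feature dimensions and cluster parameters here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
levels, dim, n = 6, 4, 50      # six turbulence levels, toy feature dimension

# Synthetic clusters standing in for statistics extracted from raw FSO streams
centers = rng.normal(scale=5.0, size=(levels, dim))
X = np.concatenate([c + rng.normal(size=(n, dim)) for c in centers])
y = np.repeat(np.arange(levels), n)

# Nearest-centroid classifier: a minimal baseline, not the paper's trained models
centroids = np.stack([X[y == k].mean(axis=0) for k in range(levels)])

def classify(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

acc = np.mean([classify(x) == t for x, t in zip(X, y)])
```

When the six levels produce well-separated feature clusters, even this trivial classifier approaches perfect accuracy, which illustrates why the paper's richer models can exceed 98%.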

Updated: 2025-03-25 01:42:32

Categories: eess.SY,cs.LG,cs.SY

Download: http://arxiv.org/abs/2405.16729v2

A Mechanistic Explanatory Strategy for XAI

Despite significant advancements in XAI, scholars continue to note a persistent lack of robust conceptual foundations and integration with broader discourse on scientific explanation. In response, emerging XAI research increasingly draws on explanatory strategies from various scientific disciplines and the philosophy of science to address these gaps. This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems, situating recent developments in AI explainability within a broader philosophical context. According to the mechanistic approach, explaining opaque AI systems involves identifying the mechanisms underlying decision-making processes. For deep neural networks, this means discerning functionally relevant components - such as neurons, layers, circuits, or activation patterns - and understanding their roles through decomposition, localization, and recomposition. Proof-of-principle case studies from image recognition and language modeling align this theoretical framework with recent research from OpenAI and Anthropic. The findings suggest that pursuing mechanistic explanations can uncover elements that traditional explainability techniques may overlook, ultimately contributing to more thoroughly explainable AI.

Updated: 2025-03-25 01:41:47

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2411.01332v4

Data-Driven, ML-assisted Approaches to Problem Well-Posedness

Classically, to solve differential equation problems, it is necessary to specify sufficient initial and/or boundary conditions so as to allow the existence of a unique solution. Well-posedness of differential equation problems thus involves studying the existence and uniqueness of solutions, and their dependence on such pre-specified conditions. However, in part due to mathematical necessity, these conditions are usually specified "to arbitrary precision" only on (appropriate portions of) the boundary of the space-time domain. This does not mirror how data acquisition is performed in realistic situations, where one may observe entire "patches" of solution data at arbitrary space-time locations; alternatively, one might have access to more than one solution stemming from the same differential operator. In this short work, we demonstrate how standard tools from machine and manifold learning can be used to infer, in a data-driven manner, certain well-posedness features of differential equation problems, for initial/boundary condition combinations under which rigorous existence/uniqueness theorems are not known. Our study naturally combines a data assimilation perspective with an operator-learning one.

Updated: 2025-03-25 01:34:48

Categories: cs.LG,cs.NA,math.NA

Download: http://arxiv.org/abs/2503.19255v1

Knowledge Enhanced Multi-Domain Recommendations in an AI Assistant Application

This work explores unifying knowledge enhanced recommendation with multi-domain recommendation systems in a conversational AI assistant application. Multi-domain recommendation leverages users' interactions in previous domains to improve recommendations in a new one. Knowledge graph enhancement seeks to use external knowledge graphs to improve recommendations within a single domain. Both research threads incorporate related information to improve the recommendation task. We propose to unify these approaches: using information from interactions in other domains as well as external knowledge graphs to make predictions in a new domain that would not be possible with either information source alone. We develop a new model and demonstrate the additive benefit of these approaches on a dataset derived from millions of users' queries for content across three domains (videos, music, and books) in a live virtual assistant application. We demonstrate significant improvement on overall recommendations as well as on recommendations for new users of a domain.

Updated: 2025-03-25 00:54:28

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2306.06302v2

Generative Prompt Internalization

Prompts used in recent applications based on large language models are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Prompt Internalization (GenPI), a lightweight method that employs a joint training approach. GenPI not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Prompt Internalization enables high performance and efficient inference without the need for explicit prompts.
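The role-swapping data synthesis step can be sketched as follows. The function names and message format are assumptions for illustration, not the paper's implementation: one model acts as the agent while a second model call stands in for the environment, and their alternating exchange is logged as a training conversation.

```python
# Hypothetical sketch of role-swapped dialogue collection: the agent
# and the simulated environment take turns, and every message is
# recorded with its role so the transcript can serve as training data.

def synthesize_dialogue(prompt, agent_fn, env_fn, turns=3):
    """Collect (role, message) pairs by alternating the two models."""
    history = [("system", prompt)]
    message = prompt
    for _ in range(turns):
        message = agent_fn(message)
        history.append(("agent", message))
        message = env_fn(message)          # environment simulated by a model
        history.append(("environment", message))
    return history

# Stub callables standing in for real LLM calls.
agent = lambda m: f"agent-reply({len(m)})"
env = lambda m: f"env-reply({len(m)})"

dialogue = synthesize_dialogue("You are a booking assistant.", agent, env, turns=2)
print(len(dialogue))  # 1 system message + 2 turns * 2 messages = 5
```

In practice both callables would wrap model inference, with the environment-side model conditioned on the same predefined prompt whose behavior is being internalized.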

Updated: 2025-03-25 00:38:02

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2411.15927v3

Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning

Federated continual learning (FCL) offers an emerging paradigm for extending the applicability of federated learning (FL) to real-world scenarios in which tasks evolve dynamically and asynchronously across clients, especially in medical settings. Existing server-side FCL methods in the natural domain construct a continually learnable server model by aggregating clients over all involved tasks. However, they are challenged by: (1) catastrophic forgetting of previously learned tasks, which leads to error accumulation in the server model and makes it difficult to sustain comprehensive knowledge across all tasks; and (2) biased optimization due to the asynchronous tasks handled by different clients, which causes the optimization targets of different clients to collide at the same time steps. In this work, we take the first step toward a novel server-side FCL pattern in the medical domain, a Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH), designed to facilitate collaborative learning under the distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper), in which a continually updated hypernetwork manages the mapping between task identities and their associated model parameters, enabling dynamic allocation of the model across clients. To address the biased optimization, we introduce a novel adaptive model recalibration (AMR) scheme that incorporates candidate changes of historical models into the current server updates and assigns weights to identical tasks across different time steps based on their similarity, for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of FedDAH over other FCL methods on sites with different task streams. The code is available at: https://github.com/jinlab-imvr/FedDAH.
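The core hypernetwork mechanism can be sketched compactly. Dimensions and the linear layout below are illustrative assumptions, not FedDAH's actual architecture: a small network maps a task embedding to the parameters of a per-task client model, so one continually updated hypernetwork can serve many task identities.

```python
import numpy as np

# Minimal hypernetwork sketch: task identity -> task embedding ->
# generated parameters of a small per-task client head.

rng = np.random.default_rng(0)
EMB, N_PARAMS = 8, 4 * 3   # embedding size; params of a 4x3 client head

task_embeddings = {"segmentation": rng.normal(size=EMB),
                   "classification": rng.normal(size=EMB)}
W_hyper = rng.normal(size=(N_PARAMS, EMB)) * 0.1  # hypernetwork weights

def allocate(task_id):
    """Generate the client model's weight matrix for a given task."""
    theta = W_hyper @ task_embeddings[task_id]
    return theta.reshape(4, 3)

w_seg = allocate("segmentation")
w_cls = allocate("classification")
print(w_seg.shape)  # (4, 3); different tasks yield different weights
```

Because only `W_hyper` and the task embeddings are updated, adding a new task stream means adding one embedding rather than a whole new model, which is the property that makes dynamic allocation across clients tractable.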

Updated: 2025-03-25 00:17:47

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.20808v1

Face Spoofing Detection using Deep Learning

Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision-based models, MobileNetV2, ResNet50, and the Vision Transformer (ViT), for spoof detection in image classification, using a dataset of 150,986 images divided into training (140,002), testing (10,984), and validation (39,574) sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models' effectiveness through accuracy, precision, recall, and F1-score metrics. Results reveal that MobileNetV2 outperforms the other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT's 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, both MobileNetV2 and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT's 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2's balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security-sensitive contexts and suggests MobileNetV2 as a practical solution for real-world deployment.
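The four comparison metrics follow directly from a binary confusion matrix. A minimal sketch for the live-vs-spoof case (the counts below are made up for illustration, not the paper's data):

```python
# Accuracy, precision, recall, and F1 from confusion-matrix counts,
# treating "spoof" as the positive class.

def prf1(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted spoofs, how many were real spoofs
    recall = tp / (tp + fn)      # of real spoofs, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, p, r, f1 = prf1(tp=90, fp=10, fn=10, tn=90)
print(acc, p, r, f1)  # 0.9 0.9 0.9 0.9
```

Note that when the test split is balanced, as in this toy example, the four metrics coincide; the small spread among MobileNetV2's reported numbers (91.58-91.72%) is consistent with a near-balanced evaluation set.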

Updated: 2025-03-25 00:09:21

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.19223v1

Towards Understanding Distilled Reasoning Models: A Representational Approach

In this paper, we investigate how model distillation impacts the development of reasoning features in large language models (LLMs). To explore this, we train a crosscoder on Qwen-series models and their fine-tuned variants. Our results suggest that the crosscoder learns features corresponding to various types of reasoning, including self-reflection and computation verification. Moreover, we observe that distilled models contain unique reasoning feature directions, which could be used to steer the model into over-thinking or incisive-thinking mode. In particular, we perform analysis on four specific reasoning categories: (a) self-reflection, (b) deductive reasoning, (c) alternative reasoning, and (d) contrastive reasoning. Finally, we examine the changes in feature geometry resulting from the distillation process and find indications that larger distilled models may develop more structured representations, which correlate with enhanced distillation performance. By providing insights into how distillation modifies the model, our study contributes to enhancing the transparency and reliability of AI systems.
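The steering operation described here amounts to shifting a hidden activation along a feature direction. A minimal sketch under the assumption that a unit-norm direction is available (here it is random for illustration; in the paper it would come from the crosscoder's learned reasoning features):

```python
import numpy as np

# Activation steering sketch: h' = h + alpha * d nudges the model
# toward (alpha > 0) or away from (alpha < 0) the behavior the
# feature direction d encodes, e.g. "over-thinking" vs "incisive".

rng = np.random.default_rng(0)
D_MODEL = 32

h = rng.normal(size=D_MODEL)              # a hidden activation
d = rng.normal(size=D_MODEL)
d /= np.linalg.norm(d)                    # unit-norm feature direction

def steer(h, direction, alpha):
    return h + alpha * direction

h_over = steer(h, d, alpha=4.0)
print(float((h_over - h) @ d))            # projection shift of about 4.0
```

In a real intervention `h` would be a residual-stream activation at some layer during generation, and `alpha` would be swept to trade off steering strength against output coherence.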

Updated: 2025-03-25 00:07:50

Categories: cs.LG

Download: http://arxiv.org/abs/2503.03730v2

By Xinhai (Sean) Zou.