Arxiv Day: Article

Privacy Auditing of Large Language Models

Current techniques for privacy auditing of large language models (LLMs) have limited efficacy -- they rely on basic approaches to generate canaries which leads to weak membership inference attacks that in turn give loose lower bounds on the empirical privacy leakage. We develop canaries that are far more effective than those used in prior work under threat models that cover a range of realistic settings. We demonstrate through extensive experiments on multiple families of fine-tuned LLMs that our approach sets a new standard for detection of privacy leakage. For measuring the memorization rate of non-privately trained LLMs, our designed canaries surpass prior approaches. For example, on the Qwen2.5-0.5B model, our designed canaries achieve $49.6\%$ TPR at $1\%$ FPR, vastly surpassing the prior approach's $4.2\%$ TPR at $1\%$ FPR. Our method can be used to provide a privacy audit of $\varepsilon \approx 1$ for a model trained with theoretical $\varepsilon$ of 4. To the best of our knowledge, this is the first time that a privacy audit of LLM training has achieved nontrivial auditing success in the setting where the attacker cannot train shadow models, insert gradient canaries, or access the model at every iteration.
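The headline metric above, TPR at a fixed FPR, can be illustrated with a small sketch. It assumes each canary receives a scalar membership score where higher means "more likely seen in training"; the function name and the toy scores are hypothetical and do not come from the paper.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """Return the highest TPR achievable while keeping FPR <= target_fpr.

    Assumption: higher scores mean "more likely a training member".
    """
    # Candidate thresholds: every observed score, from highest to lowest.
    thresholds = sorted(set(member_scores) | set(nonmember_scores), reverse=True)
    best_tpr = 0.0
    for t in thresholds:
        # A sample is flagged as a member when its score is >= the threshold.
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr > target_fpr:
            break  # FPR can only grow as the threshold drops further
        best_tpr = sum(s >= t for s in member_scores) / len(member_scores)
    return best_tpr


# Toy (made-up) scores for inserted canaries and held-out canaries.
members = [0.9, 0.8, 0.7, 0.2]
nonmembers = [0.6, 0.3, 0.1, 0.05]
print(tpr_at_fpr(members, nonmembers, target_fpr=0.0))
```

A stronger canary design shows up in this metric as a larger gap between member and non-member scores, which raises the TPR that can be reached before the FPR budget is spent.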

Updated: 2025-03-09 23:32:15

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.06808v1

The Parametric Complexity of Operator Learning

Neural operator architectures employ neural networks to approximate operators mapping between Banach spaces of functions; they may be used to accelerate model evaluations via emulation, or to discover models from data. Consequently, the methodology has received increasing attention over recent years, giving rise to the rapidly growing field of operator learning. The first contribution of this paper is to prove that for general classes of operators which are characterized only by their $C^r$- or Lipschitz-regularity, operator learning suffers from a "curse of parametric complexity", which is an infinite-dimensional analogue of the well-known curse of dimensionality encountered in high-dimensional approximation problems. The result is applicable to a wide variety of existing neural operators, including PCA-Net, DeepONet and the FNO. The second contribution of the paper is to prove that this general curse can be overcome for solution operators defined by the Hamilton-Jacobi equation; this is achieved by leveraging additional structure in the underlying solution operator, going beyond regularity. To this end, a novel neural operator architecture is introduced, termed HJ-Net, which explicitly takes into account characteristic information of the underlying Hamiltonian system. Error and complexity estimates are derived for HJ-Net which show that this architecture can provably beat the curse of parametric complexity related to the infinite-dimensional input and output function spaces.

Updated: 2025-03-09 23:19:24

Categories: cs.LG,cs.NA,math.NA

Download: http://arxiv.org/abs/2306.15924v4

Actionable AI: Enabling Non Experts to Understand and Configure AI Systems

Interaction between humans and AI systems raises the question of how people understand AI systems. This has been addressed with explainable AI, the interpretability arising from users' domain expertise, or collaborating with AI in a stable environment. In the absence of these elements, we discuss designing Actionable AI, which allows non-experts to configure black-box agents. In this paper, we experiment with an AI-powered cartpole game and observe 22 pairs of participants configuring it via direct manipulation. Our findings suggest that, in uncertain conditions, non-experts were able to achieve good levels of performance. By influencing the behaviour of the agent, they exhibited an operational understanding of it, which proved sufficient to reach their goals. Based on this, we derive implications for designing Actionable AI systems. In conclusion, we propose Actionable AI as a way to open access to AI-based agents, giving end users the agency to influence such agents towards their own goals.

Updated: 2025-03-09 23:09:04

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2503.06803v1

Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Data abundance across different domains exhibits a long-tailed distribution: few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation compared to Scalarization, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting -- achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.
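The relationship between the two strategies can be sketched numerically. The abstract does not spell out the formulas, so this example assumes the common convention that Temperature Sampling draws language i with probability proportional to n_i^(1/tau), and that Scalarization samples proportionally to data size while multiplying each language's loss by a weight. All function names and dataset sizes are hypothetical.

```python
def temperature_probs(sizes, tau):
    """Sampling probability per language, assumed proportional to n_i ** (1 / tau)."""
    scaled = [n ** (1.0 / tau) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def equivalent_loss_weights(sizes, tau):
    """Loss weights that, under proportional sampling, give each language the
    same expected contribution as temperature sampling with this tau."""
    total = sum(sizes)
    proportional = [n / total for n in sizes]
    target = temperature_probs(sizes, tau)
    return [t / p for t, p in zip(target, proportional)]


# Hypothetical corpus: one high-resource and one low-resource language.
sizes = [900, 100]
probs = temperature_probs(sizes, tau=2.0)
weights = equivalent_loss_weights(sizes, tau=2.0)
print(probs)
print(weights)
```

In expectation, proportional sampling combined with these weights reproduces the temperature-sampling mixture, which is consistent with the stated equivalence under full gradient descent. The sketch says nothing about minibatch gradient variance, which is where the abstract reports that the two methods differ.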

Updated: 2025-03-09 23:07:33

Categories: cs.CL,cs.LG,stat.ML

Download: http://arxiv.org/abs/2410.04579v5

One-step Diffusion Models with $f$-Divergence Distribution Matching

Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching which is known to be mode seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution, when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill
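To make the family of divergences concrete, here is a minimal sketch for discrete distributions. It uses the textbook definition D_f(p || q) = sum_x q(x) f(p(x)/q(x)) and assumes p plays the teacher role and q the student role; it is not the paper's training objective or gradient estimator, and the function names are hypothetical.

```python
import math


def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)), for strictly positive p and q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def forward_kl_f(t):
    # Generator for KL(p || q): mass-covering when p is the teacher.
    return t * math.log(t)


def reverse_kl_f(t):
    # Generator for KL(q || p): the mode-seeking choice.
    return -math.log(t)


def jensen_shannon_f(t):
    # Generator for the Jensen-Shannon divergence (symmetric in p and q).
    return 0.5 * (t * math.log(t) - (t + 1) * math.log((t + 1) / 2))


# Toy distributions over two outcomes (made-up numbers).
teacher = [0.5, 0.5]
student = [0.9, 0.1]
for name, f in [("forward KL", forward_kl_f),
                ("reverse KL", reverse_kl_f),
                ("Jensen-Shannon", jensen_shannon_f)]:
    print(name, f_divergence(teacher, student, f))
```

Swapping the generator function is the only change needed to move between divergences, which mirrors the way the abstract treats reverse-KL distillation as one special case of a broader framework.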

Updated: 2025-03-09 22:53:27

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2502.15681v2

Re-Imagining Multimodal Instruction Tuning: A Representation View

Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.

Updated: 2025-03-09 22:44:30

Categories: cs.LG

Download: http://arxiv.org/abs/2503.00723v2

Characterizing Learning in Spiking Neural Networks with Astrocyte-Like Units

Traditional artificial neural networks take inspiration from biological networks, using layers of neuron-like nodes to pass information for processing. More realistic models include spiking in the neural network, capturing the electrical characteristics more closely. However, a large proportion of brain cells are of the glial cell type, in particular astrocytes, which have been suggested to play a role in performing computations. Here, we introduce a modified spiking neural network model with added astrocyte-like units and assess their impact on learning. We implement the network as a liquid state machine and task the network with performing a chaotic time-series prediction task. We varied the number and ratio of neuron-like and astrocyte-like units in the network to examine the latter units' effect on learning. We show that the combination of neurons and astrocytes together, as opposed to neuron-only and astrocyte-only networks, is critical for driving learning. Interestingly, we found that the highest learning rate was achieved when the ratio between astrocyte-like and neuron-like units was roughly 2 to 1, mirroring some estimates of the ratio of biological astrocytes to neurons. Our results demonstrate that incorporating astrocyte-like units which represent information across longer timescales can alter the learning rates of neural networks, and the proportion of astrocytes to neurons should be tuned appropriately to a given task.

Updated: 2025-03-09 22:36:58

Categories: cs.LG,cs.AI,physics.bio-ph

Download: http://arxiv.org/abs/2503.06798v1

Multimodal AI-driven Biomarker for Early Detection of Cancer Cachexia

Cancer cachexia is a multifactorial syndrome characterized by progressive muscle wasting, metabolic dysfunction, and systemic inflammation, leading to reduced quality of life and increased mortality. Despite extensive research, no single definitive biomarker exists, as cachexia-related indicators such as serum biomarkers, skeletal muscle measurements, and metabolic abnormalities often overlap with other conditions. Existing composite indices, including the Cancer Cachexia Index (CXI), Modified CXI (mCXI), and Cachexia Score (CASCO), integrate multiple biomarkers but lack standardized thresholds, limiting their clinical utility. This study proposes a multimodal AI-based biomarker for early cancer cachexia detection, leveraging open-source large language models (LLMs) and foundation models trained on medical data. The approach integrates heterogeneous patient data, including demographics, disease status, lab reports, radiological imaging (CT scans), and clinical notes, using a machine learning framework that can handle missing data. Unlike previous AI-based models trained on curated datasets, this method utilizes routinely collected clinical data, enhancing real-world applicability. Additionally, the model incorporates confidence estimation, allowing the identification of cases requiring expert review for precise clinical interpretation. Preliminary findings demonstrate that integrating multiple data modalities improves cachexia prediction accuracy at the time of cancer diagnosis. The AI-based biomarker dynamically adapts to patient-specific factors such as age, race, ethnicity, weight, cancer type, and stage, avoiding the limitations of fixed-threshold biomarkers. This multimodal AI biomarker provides a scalable and clinically viable solution for early cancer cachexia detection, facilitating personalized interventions and potentially improving treatment outcomes and patient survival.

Updated: 2025-03-09 22:32:37

Categories: eess.IV,cs.AI,q-bio.QM

Download: http://arxiv.org/abs/2503.06797v1

AutoMisty: A Multi-Agent LLM Framework for Automated Code Generation in the Misty Social Robot

The social robot's open API allows users to customize open-domain interactions. However, it remains inaccessible to those without programming experience. In this work, we introduce AutoMisty, the first multi-agent collaboration framework powered by large language models (LLMs), to enable the seamless generation of executable Misty robot code from natural language instructions. AutoMisty incorporates four specialized agent modules to manage task decomposition, assignment, problem-solving, and result synthesis. Each agent incorporates a two-layer optimization mechanism, with self-reflection for iterative refinement and human-in-the-loop for better alignment with user preferences. AutoMisty ensures a transparent reasoning process, allowing users to iteratively refine tasks through natural language feedback for precise execution. To evaluate AutoMisty's effectiveness, we designed a benchmark task set spanning four levels of complexity and conducted experiments in a real Misty robot environment. Extensive evaluations demonstrate that AutoMisty not only consistently generates high-quality code but also enables precise code control, significantly outperforming direct reasoning with ChatGPT-4o and ChatGPT-o1. All code, optimized APIs, and experimental videos will be publicly released through the webpage: https://wangxiaoshawn.github.io/AutoMisty.html

Updated: 2025-03-09 22:07:46

Categories: cs.RO,cs.AI,cs.HC,cs.MA

Download: http://arxiv.org/abs/2503.06791v1

GenDR: Lightning Generative Detail Restorator

Recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable success. However, fundamental misalignments between T2I and SR targets result in a dilemma between inference speed and detail fidelity. Specifically, T2I tasks prioritize multi-step inversion to synthesize coherent outputs aligned with textual prompts and shrink the latent space to reduce generating complexity. Contrariwise, SR tasks preserve most information from low-resolution input while solely restoring high-frequency details, thus necessitating sufficient latent space and fewer inference steps. To bridge the gap, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand latent space without enlarging the model size. Regarding step-distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.

Updated: 2025-03-09 22:02:18

Categories: cs.CV,cs.AI,eess.IV

Download: http://arxiv.org/abs/2503.06790v1

Efficient Feature Extraction and Classification Architecture for MRI-Based Brain Tumor Detection and Localization

Uncontrolled cell division in the brain is what gives rise to brain tumors. If the tumor size increases by more than half, there is little hope for the patient's recovery. This emphasizes the need for rapid and precise brain tumor diagnosis. When it comes to analyzing, diagnosing, and planning therapy for brain tumors, MRI imaging plays a crucial role. A brain tumor's development history is crucial information for doctors to have. When it comes to distinguishing between human soft tissues, MRI scans are superior. In order to get reliable classification results from MRI scans quickly, deep learning is one of the most practical methods. Early human illness diagnosis has been demonstrated to be more accurate when deep learning methods are used. In the case of diagnosing a brain tumor, where even a small misdiagnosis might have serious consequences, accuracy is especially important. Detecting brain tumors in medical images is still a difficult task. Brain MRIs are notoriously imprecise in revealing the presence or absence of tumors. In this research, a CNN was trained on MRI scans of the brain to identify the presence of a tumor. Results from the CNN model showed an accuracy of 99.17%. The CNN model's features were also extracted, and we localized the tumor regions in the unannotated images using GradCAM, a deep learning explainability tool. In order to evaluate the CNN model's capability for processing images, we fed the features into different ML models. The CNN and machine learning models were also evaluated using the standard metrics of Precision, Recall, Specificity, and F1 score. The significance of the doctor's diagnosis enhanced the accuracy of the CNN model's assistance in identifying the existence of tumor and treating the patient.
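For reference, the four evaluation metrics named above can be computed from confusion-matrix counts as in the sketch below. The function name and the counts are hypothetical and are not results from the paper; the sketch also assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Assumes no denominator below is zero.
    """
    precision = tp / (tp + fp)      # flagged tumors that are real
    recall = tp / (tp + fn)         # real tumors that were flagged
    specificity = tn / (tn + fp)    # healthy scans correctly cleared
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for illustration only.
print(classification_metrics(tp=90, fp=10, tn=85, fn=15))
```

Reporting specificity alongside recall matters in a screening setting, because accuracy alone can hide a model that rarely clears healthy scans or rarely flags real tumors when the classes are imbalanced.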

Updated: 2025-03-09 22:00:46

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2410.22619v2

Dubito Ergo Sum: Exploring AI Ethics

We paraphrase Descartes' famous dictum in the area of AI ethics where the "I doubt and therefore I am" is suggested as a necessary aspect of morality. Therefore AI, which cannot doubt itself, cannot possess moral agency. Of course, this is not the end of the story. We explore various aspects of the human mind that substantially differ from AI, which includes the sensory grounding of our knowing, the act of understanding, and the significance of being able to doubt ourselves. The foundation of our argument is the discipline of ethics, one of the oldest and largest knowledge projects of human history, yet, we seem only to be beginning to get a grasp of it. After a couple of thousand years of studying the ethics of humans, we (humans) arrived at a point where moral psychology suggests that our moral decisions are intuitive, and all the models from ethics become relevant only when we explain ourselves. This recognition has a major impact on what and how we can do regarding AI ethics. We do not offer a solution, we explore some ideas and leave the problem open, but we hope somewhat better understood than before our study.

Updated: 2025-03-09 21:59:43

Categories: cs.AI

Download: http://arxiv.org/abs/2503.06788v1

BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments

Agents based on large language models have shown great potential in accelerating scientific discovery by leveraging their rich background knowledge and reasoning capabilities. In this paper, we introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. We demonstrate our agent on the problem of designing genetic perturbation experiments, where the aim is to find a small subset out of many possible genes that, when perturbed, result in a specific phenotype (e.g., cell growth). Utilizing its biological knowledge, BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model or explicitly design an acquisition function as in Bayesian optimization. Moreover, BioDiscoveryAgent, using Claude 3.5 Sonnet, achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets, and a 46% improvement in the harder task of non-essential gene perturbation, compared to existing Bayesian optimization baselines specifically trained for this task. Our evaluation includes one dataset that is unpublished, ensuring it is not part of the language model's training data. Additionally, BioDiscoveryAgent predicts gene combinations to perturb more than twice as accurately as a random baseline, a task so far not explored in the context of closed-loop experiment design. The agent also has access to tools for searching the biomedical literature, executing code to analyze biological datasets, and prompting another agent to critically evaluate its predictions. Overall, BioDiscoveryAgent is interpretable at every stage, representing an accessible new paradigm in the computational design of biological experiments with the potential to augment scientists' efficacy.

Updated: 2025-03-09 21:57:20

Categories: cs.AI,cs.CE,cs.MA

Download: http://arxiv.org/abs/2405.17631v3

Deep Learning Foundation and Pattern Models: Challenges in Hydrological Time Series

There has been active investigation into deep learning approaches for time series analysis, including foundation models. However, most studies do not address significant scientific applications. This paper aims to identify key features in time series by examining hydrology data. Our work advances computer science by emphasizing critical application features and contributes to hydrology and other scientific fields by identifying modeling approaches that effectively capture these features. Scientific time series data are inherently complex, involving observations from multiple locations, each with various time-dependent data streams and exogenous factors that may be static or time-varying and either application-dependent or purely mathematical. This research analyzes hydrology time series from the CAMELS and Caravan global datasets, which encompass rainfall and runoff data across catchments, featuring up to six observed streams and 209 static parameters across approximately 8,000 locations. Our investigation assesses the impact of exogenous data through eight different model configurations for key hydrology tasks. Results demonstrate that integrating exogenous information enhances data representation, reducing mean squared error by up to 40% in the largest dataset. Additionally, we present a detailed performance comparison of over 20 state-of-the-art pattern and foundation models. The analysis is fully open-source, facilitated by Jupyter Notebook on Google Colab for LSTM-based modeling, data preprocessing, and model comparisons. Preliminary findings using alternative deep learning architectures reveal that models incorporating comprehensive observed and exogenous data outperform more limited approaches, including foundation models. Notably, natural annual periodic exogenous time series contribute the most significant improvements, though static and other periodic factors are also valuable.

Updated: 2025-03-09 21:54:42

Categories: cs.LG

Download: http://arxiv.org/abs/2410.15218v2

Key Establishment in the Space Environment

As reliance on space systems continues to increase, so does the need to ensure security for them. However, public work in space standards has struggled with defining security protocols that are well tailored to the domain and its risks. In this work, we investigate various space networking paradigms and security approaches, and identify trade-offs and gaps. Furthermore, we describe potential existing security protocol approaches that fit well into the space network paradigm in terms of both functionality and security. Finally, we establish future directions for enabling strong security for space communication.

Updated: 2025-03-09 21:48:13

Categories: cs.CR,cs.NI

Download: http://arxiv.org/abs/2503.06785v1

Infinite Leagues Under the Sea: Photorealistic 3D Underwater Terrain Generation by Latent Fractal Diffusion Models

This paper tackles the problem of generating representations of underwater 3D terrain. Off-the-shelf generative models, trained on Internet-scale data but not on specialized underwater images, exhibit downgraded realism, as images of the seafloor are relatively uncommon. To this end, we introduce DreamSea, a generative model to generate hyper-realistic underwater scenes. DreamSea is trained on real-world image databases collected from underwater robot surveys. Images from these surveys contain massive real seafloor observations and cover large areas, but are prone to noise and artifacts from the real world. We extract 3D geometry and semantics from the data with visual foundation models, and train a diffusion model that generates realistic seafloor images in RGBD channels, conditioned on novel fractal distribution-based latent embeddings. We then fuse the generated images into a 3D map, building a 3DGS model supervised by 2D diffusion priors which allows photorealistic novel view rendering. DreamSea is rigorously evaluated, demonstrating the ability to robustly generate large-scale underwater scenes that are consistent, diverse, and photorealistic. Our work drives impact in multiple domains, spanning filming, gaming, and robot simulation.

Updated: 2025-03-09 21:43:37

标题: 海底无限联盟：潜在分形扩散模型实现的逼真3D水下地形生成

摘要: 这篇论文解决了生成水下3D地形表示的问题。现成的生成模型在互联网规模数据上训练，但没有在专门的水下图像上训练，因此显示出降级的逼真度，因为海底图像相对较少。为此，我们介绍了DreamSea，一个生成超逼真水下场景的生成模型。DreamSea在从水下机器人调查中收集的真实世界图像数据库上进行训练。这些调查中的图像包含大量真实的海底观察结果，并涵盖大面积，但容易受到来自真实世界的噪音和伪影的影响。我们利用视觉基础模型从数据中提取3D几何和语义，并训练一个扩散模型，它可以生成在RGBD通道中的逼真海底图像，条件是基于新颖的分形分布型潜在嵌入。然后我们将生成的图像融合到一个3D地图中，构建一个由2D扩散先验监督的3DGS模型，这允许进行逼真的新视图渲染。DreamSea经过严格评估，展示了能够稳健地生成一致、多样和逼真的大规模水下场景的能力。我们的工作在多个领域产生影响，涵盖拍摄、游戏和机器人模拟。

更新时间: 2025-03-09 21:43:37

领域: cs.GR,cs.AI,cs.CV,cs.LG,cs.RO

下载: http://arxiv.org/abs/2503.06784v1

Statistical Study of Sensor Data and Investigation of ML-based Calibration Algorithms for Inexpensive Sensor Modules: Experiments from Cape Point

In this paper we present a statistical analysis of data from inexpensive sensors. We also present the performance of machine learning algorithms when used for automatic calibration of such sensors. In this study we used a low-cost Non-Dispersive Infrared CO$_2$ sensor placed at a co-located site at Cape Point, South Africa (maintained by Weather South Africa). The collected low-cost sensor data and site truth data are investigated and compared. We compare the performance of Random Forest Regression, Support Vector Regression, 1D Convolutional Neural Network and 1D-CNN Long Short-Term Memory Network models as methods for automatic calibration, and investigate the statistical properties of these model predictions. In addition, we investigate the drift in performance of these algorithms with time.

Updated: 2025-03-09 21:38:46

标题: 便宜传感器模块的传感器数据统计研究和基于机器学习的校准算法探究：来自开普角的实验

摘要: 在本文中，我们展示了来自廉价传感器的数据的统计分析。我们还展示了当用于自动校准这些传感器时，机器学习算法的性能。在这项研究中，我们使用了一种放置在南非开普角（由南非气象局维护）的低成本非色散红外CO$_2$传感器。我们对收集到的低成本传感器数据和现场真实数据进行了调查和比较。我们比较并调查了随机森林回归、支持向量回归、1D卷积神经网络和1D-CNN长短期记忆网络模型作为自动校准的方法以及这些模型预测的统计特性。此外，我们还调查了随着时间推移这些算法性能的漂移情况。

更新时间: 2025-03-09 21:38:46

领域: eess.SP,cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.13487v1

Optimizing Posterior Samples for Bayesian Optimization via Rootfinding

Bayesian optimization devolves the global optimization of a costly objective function to the global optimization of a sequence of acquisition functions. This inner-loop optimization can be catastrophically difficult if it involves posterior sample paths, especially in higher dimensions. We introduce an efficient global optimization strategy for posterior samples based on global rootfinding. It provides gradient-based optimizers with two sets of judiciously selected starting points, designed to combine exploration and exploitation. The number of starting points can be kept small without sacrificing optimization quality. Remarkably, even with just one point from each set, the global optimum is discovered most of the time. The algorithm scales practically linearly to high dimensions, breaking the curse of dimensionality. For Gaussian process Thompson sampling (GP-TS), we demonstrate remarkable improvement in both inner- and outer-loop optimization, surprisingly outperforming alternatives like EI and GP-UCB in most cases. Our approach also improves the performance of other posterior sample-based acquisition functions, such as variants of entropy search. Furthermore, we propose a sample-average formulation of GP-TS, which has a parameter to explicitly control exploitation and can be computed at the cost of one posterior sample. Our implementation is available at https://github.com/UQUH/TSRoots .

Updated: 2025-03-09 21:38:43

标题: 通过根查找优化后验样本以进行贝叶斯优化

摘要: 贝叶斯优化将昂贵的目标函数的全局优化转化为一系列收获函数的全局优化。如果涉及后验样本路径，特别是在更高维度中，这种内循环优化可能非常困难。我们介绍了一种基于全局根查找的后验样本的高效全局优化策略。它为基于梯度的优化器提供了两组谨慎选择的起始点，旨在结合探索和利用。可以保持起始点数量较少而不牺牲优化质量。值得注意的是，即使只有每组一个点，大多数时候也能发现全局最优解。该算法在高维度上实际上线性扩展，打破了维度诅咒。对于高斯过程汤普森采样（GP-TS），我们展示了内循环和外循环优化的显著改进，令人惊讶地在大多数情况下优于EI和GP-UCB等替代方案。我们的方法还提高了其他基于后验样本的收获函数的性能，如熵搜索的变体。此外，我们提出了GP-TS的样本平均公式，它具有一个参数来明确控制利用，可以以一个后验样本的成本计算。我们的实现可在https://github.com/UQUH/TSRoots上找到。

更新时间: 2025-03-09 21:38:43

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2410.22322v3

Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting

Generic text rewriting is a prevalent large language model (LLM) application that covers diverse real-world tasks, such as style transfer, fact correction, and email editing. These tasks vary in rewriting objectives (e.g., factual consistency vs. semantic preservation), making it challenging to develop a unified model that excels across all dimensions. Existing methods often specialize in either a single task or a specific objective, limiting their generalizability. In this work, we introduce a generic model proficient in factuality, stylistic, and conversational rewriting tasks. To simulate real-world user rewrite requests, we construct a conversational rewrite dataset, ChatRewrite, that presents "natural"-sounding instructions generated from raw emails using LLMs. Combined with other popular rewrite datasets, including LongFact for the factuality rewrite task and RewriteLM for the stylistic rewrite task, this forms a broad benchmark for training and evaluating generic rewrite models. To align with task-specific objectives, we propose Dr Genre, a Decoupled-reward learning framework for Generic rewriting, that utilizes objective-oriented reward models with a task-specific weighting. Evaluation shows that Dr Genre delivers higher-quality rewrites across all targeted tasks, improving objectives including instruction following (agreement), internal consistency (coherence), and minimal unnecessary edits (conciseness).

Updated: 2025-03-09 21:23:52

标题: Dr Genre: 从解耦的LLM反馈中进行通用文本重写的强化学习

摘要: 通用文本重写是一种普遍的大型语言模型（LLM）应用，涵盖了多种现实世界任务，如风格转换、事实校正和电子邮件编辑。这些任务在重写目标上有所不同（例如事实一致性与语义保留），这使得开发一个在所有维度上表现出色的统一模型具有挑战性。现有方法通常专注于单一任务或特定目标，限制了它们的泛化能力。在这项工作中，我们介绍了一个精通事实性、风格和对话重写任务的通用模型。为了模拟真实世界用户的重写请求，我们构建了一个对话重写数据集ChatRewrite，从原始电子邮件中使用LLM生成“自然”听起来的指令。结合其他流行的重写数据集，包括用于事实性重写任务的LongFact和用于风格重写任务的RewriteLM，这形成了一个广泛的基准，用于训练和评估通用重写模型。为了与特定任务目标保持一致，我们提出了一种名为Dr Genre的通用重写解耦奖励学习框架，该框架利用具有任务特定权重的面向目标的奖励模型。评估结果显示，该方法在所有目标任务上提供了更高质量的重写，改善了诸如指令遵循（一致性）、内部一致性（连贯性）和最小不必要编辑（简洁性）等目标。

更新时间: 2025-03-09 21:23:52

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.06781v1

Practitioner Motives to Select Hyperparameter Optimization Methods

Programmatic hyperparameter optimization (HPO) methods, such as Bayesian optimization and evolutionary algorithms, show high sampling efficiency in finding optimal hyperparameter configurations in the development of machine learning (ML) models. Yet, practitioners often use less sample-efficient HPO methods, such as grid search, which often results in under-optimized ML models. As a reason for this behavior, we suspect practitioners choose HPO methods based on different motives. Practitioner motives, however, still need to be clarified to enhance user-centered development of HPO tools. To uncover practitioner motives to use different HPO methods, we conducted 20 semi-structured interviews and an online questionnaire with 49 ML experts. By presenting main goals (e.g., increase ML model understanding) and contextual factors affecting practitioners' selection of HPO methods (e.g., available computer resources), this study offers a conceptual foundation to better understand why practitioners use different HPO methods, supporting design of more user-centered and context-adaptive HPO tools in automated ML.

Updated: 2025-03-09 21:18:34

标题: 从业者选择超参数优化方法的动机

摘要: 程序化超参数优化（HPO）方法，如贝叶斯优化和进化算法，在机器学习（ML）模型开发中寻找最佳超参数配置方面表现出高采样效率。然而，实践者通常使用采样效率较低的HPO方法，例如网格搜索，这通常导致ML模型未经优化。作为这种行为的原因，我们怀疑实践者选择HPO方法是基于不同的动机。然而，实践者的动机仍需要澄清，以增强面向用户的HPO工具的开发。为了揭示实践者使用不同HPO方法的动机，我们进行了20次半结构化访谈和一份包含49名ML专家的在线问卷调查。通过介绍主要目标（例如增加ML模型理解）和影响实践者选择HPO方法的情境因素（例如可用的计算资源），本研究提供了一个概念基础，更好地理解为什么实践者使用不同的HPO方法，支持设计更加面向用户和情境自适应的自动化ML中HPO工具。

更新时间: 2025-03-09 21:18:34

领域: cs.LG

下载: http://arxiv.org/abs/2203.01717v3

Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation, they are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When LLMs are used to extract event variables to assist expert annotators, the experts agree more with the extracted variables than with fully automated LLM annotations.

Updated: 2025-03-09 21:14:14

标题: 大型语言模型是有效的人类标注助手，但并不是好的独立标注者

摘要: 事件注释对于识别市场变化、监测突发新闻和理解社会趋势至关重要。尽管专家注释者设定了黄金标准，但人工编码成本高昂且效率低下。与专注于单一上下文的信息提取实验不同，我们评估了一个综合的工作流程，该工作流程删除了无关文件，合并了关于同一事件的文件，并对事件进行了注释。尽管基于LLM的自动注释优于传统的TF-IDF方法或事件集合策划，但与人类专家相比，它们仍然不是可靠的注释者。然而，将LLMs添加到协助专家进行事件集合策划可以减少变量注释所需的时间和精力。当使用LLMs提取事件变量以协助专家注释者时，它们与提取的变量更加一致，而与完全自动化的LLMs进行注释相比。

更新时间: 2025-03-09 21:14:14

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.06778v1

Agile Climate-Sensor Design and Calibration Algorithms Using Machine Learning: Experiments From Cape Point

In this paper, we describe the design of an inexpensive and agile climate sensor system which can be repurposed easily to measure various pollutants. We also propose the use of machine learning regression methods to calibrate CO2 data from this cost-effective sensing platform to a reference sensor at the South African Weather Service's Cape Point measurement facility. We report the performance of these methods and find that Random Forest Regression was the best in this scenario. This shows that these machine learning methods can be used to improve the performance of cost-effective sensor platforms and possibly extend the time between manual calibrations of sensor networks.

Updated: 2025-03-09 21:13:20

标题: 敏捷气候传感器设计和校准算法使用机器学习：来自开普角的实验

摘要: 在这篇论文中，我们描述了一种廉价且灵活的气候传感器系统的设计，该系统可以轻松重新用于测量各种污染物。我们还提出使用机器学习回归方法来校准这种成本效益的传感平台上的二氧化碳数据，以使其与南非天气局Cape Point测量设施的参考传感器相匹配。我们展示了这些方法的性能，并发现在这种情况下，随机森林回归是最佳选择。这表明这些机器学习方法可以用于提高成本效益传感平台的性能，并可能延长传感器网络手动校准之间的时间间隔。

更新时间: 2025-03-09 21:13:20

领域: eess.SY,cs.LG,cs.SY

下载: http://arxiv.org/abs/2503.06777v1

Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision

In contemporary computer vision applications, particularly image classification, architectural backbones pre-trained on large datasets like ImageNet are commonly employed as feature extractors. Despite the widespread use of these pre-trained convolutional neural networks (CNNs), there remains a gap in understanding the performance of various resource-efficient backbones across diverse domains and dataset sizes. Our study systematically evaluates multiple lightweight, pre-trained CNN backbones under consistent training settings across a variety of datasets, including natural images, medical images, galaxy images, and remote sensing images. This comprehensive analysis aims to aid machine learning practitioners in selecting the most suitable backbone for their specific problem, especially in scenarios involving small datasets where fine-tuning a pre-trained network is crucial. Even though attention-based architectures are gaining popularity, we observed that they tend to perform poorly on low-data fine-tuning tasks compared to CNNs. We also observed that some CNN architectures, such as ConvNeXt, RegNet and EfficientNet, consistently perform well compared to others across a diverse set of domains. Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones, facilitating informed decision-making in model selection for a broad spectrum of computer vision domains. Our code is available here: https://github.com/pranavphoenix/Backbones

Updated: 2025-03-09 21:00:14

标题: 应该选择哪种骨干网络：计算机视觉中资源高效的领域特定对比

摘要: 在当代计算机视觉应用中，特别是图像分类中，通常会使用在大型数据集（如ImageNet）上预训练的架构骨干作为特征提取器。尽管这些预训练的卷积神经网络（CNN）被广泛使用，但在了解各种资源高效骨干在不同领域和数据集大小中的性能方面仍存在差距。我们的研究系统地评估了多个轻量级、预训练的CNN骨干，在一系列数据集（包括自然图像、医学图像、星系图像和遥感图像）上在一致的训练设置下的表现。这项全面的分析旨在帮助机器学习从业者选择最适合其特定问题的骨干，尤其是在涉及小数据集的情况下，微调预训练网络至关重要。尽管基于注意力的架构变得越来越受欢迎，但我们观察到在低数据微调任务中，它们相对于CNN表现不佳。我们还观察到一些CNN架构，如ConvNeXt、RegNet和EfficientNet，在各种领域中相对于其他架构表现良好且一致。我们的研究结果为不同骨干的性能权衡和有效性提供了可操作的见解，有助于在广泛的计算机视觉领域中进行基于信息的模型选择决策。我们的代码可以在此处找到：https://github.com/pranavphoenix/Backbones

更新时间: 2025-03-09 21:00:14

领域: cs.CV,cs.AI,cs.LG,I.2.10; I.4.0; I.4.1; I.4.2; I.4.6; I.4.7; I.4.8; I.4.9; I.4.10; I.2.10; I.5.1; I.5.2; I.5.4; J.2

下载: http://arxiv.org/abs/2406.05612v3

How Much is Unseen Depends Chiefly on Information About the Seen

The missing mass refers to the proportion of data points in an unknown population of classifier inputs that belong to classes not present in the classifier's training data, which is assumed to be a random sample from that unknown population. We find that in expectation the missing mass is entirely determined by the numbers $f_k$ of classes that appear exactly $k$ times in the training data, together with an exponentially decaying error. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from an impractically high variance. However, our theory suggests a large search space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem to find a distribution-specific estimator with a minimized mean-squared error (MSE), given only the sample. In our experiments, our search algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimators' MSE is roughly 80% of the Good-Turing estimator's.
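
As a point of reference for the baseline named above, here is a minimal sketch of the classical Good-Turing estimate of the missing mass, which divides the number of classes seen exactly once ($f_1$) by the sample size $n$. This is the textbook baseline only, not the estimators found by the authors' search; the function name and the toy sample are illustrative assumptions.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Classical Good-Turing estimate of the missing mass: f_1 / n.

    f_1 is the number of classes observed exactly once, n the sample size.
    (Illustrative helper; not taken from the paper's code.)
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    class_counts = Counter(sample)                  # class -> times observed
    freq_of_freq = Counter(class_counts.values())   # k -> f_k
    return freq_of_freq[1] / n


# Toy sample: "b" and "d" are singletons, so f_1 = 2 and n = 7.
sample = ["a", "a", "b", "c", "c", "c", "d"]
estimate = good_turing_missing_mass(sample)
```

The intermediate `freq_of_freq` table holds every $f_k$, which is the information the abstract says determines the expected missing mass; the Good-Turing baseline uses only its $k = 1$ entry.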

Updated: 2025-03-09 20:56:37

标题: 看不见的东西取决于关于看得见的信息的多少

摘要: 缺失质量指的是分类器输入未知总体中数据点的比例，这些数据点属于分类器训练数据中不存在的类别，假设该训练数据是从未知总体中随机抽样的。我们发现，从期望值来看，缺失质量完全由训练数据中出现相同次数的类别数量$f_k$和呈指数衰减的误差决定。虽然这是对样本中期望缺失质量的首次精确描述，但所导出的估计器存在着不切实际的高方差。然而，我们的理论提出了一个几乎无偏的估计器大搜索空间，可以有效和高效地搜索。因此，我们将无分布估计视为一个优化问题，以找到一个最小化均方误差（MSE）的特定分布估计器，仅给定样本。在我们的实验中，我们的搜索算法发现了比现有技术Good-Turing估计器具有显著更小MSE的估计器。当样本数至少与类别数相同时，超过93%的运行结果表明这一点。我们的估计器的MSE大约是Good-Turing估计器的80%。

更新时间: 2025-03-09 20:56:37

领域: cs.LG,cs.NE,stat.ML

下载: http://arxiv.org/abs/2402.05835v2

Task-Oriented Connectivity for Networked Robotics with Generative AI and Semantic Communications

The convergence of robotics, advanced communication networks, and artificial intelligence (AI) holds the promise of transforming industries through fully automated and intelligent operations. In this work, we introduce a novel co-working framework for robots that unifies goal-oriented semantic communication (SemCom) with a Generative AI (GenAI)-agent under a semantic-aware network. SemCom prioritizes the exchange of meaningful information among robots and the network, thereby reducing overhead and latency. Meanwhile, the GenAI-agent leverages generative AI models to interpret high-level task instructions, allocate resources, and adapt to dynamic changes in both network and robotic environments. This agent-driven paradigm ushers in a new level of autonomy and intelligence, enabling complex tasks of networked robots to be conducted with minimal human intervention. We validate our approach through a multi-robot anomaly detection use-case simulation, where robots detect, compress, and transmit relevant information for classification. Simulation results confirm that SemCom significantly reduces data traffic while preserving critical semantic details, and the GenAI-agent ensures task coordination and network adaptation. This synergy provides a robust, efficient, and scalable solution for modern industrial environments.

Updated: 2025-03-09 20:56:04

标题: 面向任务的连接性：具有生成式人工智能和语义通信的网络化机器人

摘要: 机器人技术、先进通信网络和人工智能（AI）的融合有望通过完全自动化和智能化的运作来改变行业。在这项工作中，我们引入了一个新颖的机器人协作框架，将面向目标的语义通信（SemCom）与生成式AI（GenAI）代理结合在一个语义感知网络下。SemCom优先考虑机器人和网络之间有意义信息的交流，从而减少开销和延迟。同时，GenAI代理利用生成式AI模型来解释高级任务指令，分配资源，并适应网络和机器人环境的动态变化。这种代理驱动的范式引入了一个新的自治和智能水平，使得网络化机器人的复杂任务可以在最小人类干预下进行。我们通过一个多机器人异常检测用例模拟验证了我们的方法，在这个模拟中，机器人检测、压缩并传输相关信息以进行分类。模拟结果证实，SemCom显著减少了数据流量，同时保留了关键的语义细节，而GenAI代理确保了任务协调和网络适应。这种协同作用为现代工业环境提供了一个强大、高效和可扩展的解决方案。

更新时间: 2025-03-09 20:56:04

领域: cs.RO,cs.LG,cs.NI

下载: http://arxiv.org/abs/2503.06771v1

Unique Rashomon Sets for Robust Active Learning

Collecting labeled data for machine learning models is often expensive and time-consuming. Active learning addresses this challenge by selectively labeling the most informative observations, but when initial labeled data is limited, it becomes difficult to distinguish genuinely informative points from those appearing uncertain primarily due to noise. Ensemble methods like random forests are a powerful approach to quantifying this uncertainty but do so by aggregating all models indiscriminately. This includes poorly performing models and redundant models, a problem that worsens in the presence of noisy data. We introduce UNique Rashomon Ensembled Active Learning (UNREAL), which selectively ensembles only distinct models from the Rashomon set, which is the set of nearly optimal models. Restricting ensemble membership to high-performing models with different explanations helps distinguish genuine uncertainty from noise-induced variation. We show that UNREAL achieves faster theoretical convergence rates than traditional active learning approaches and demonstrates empirical improvements of up to 20% in predictive accuracy across five benchmark datasets, while simultaneously enhancing model interpretability.

Updated: 2025-03-09 20:50:34

标题: 独特的拉肇门集合用于稳健的主动学习

摘要: 为机器学习模型收集标记数据通常是昂贵且耗时的。主动学习通过有选择地标记最具信息量的观测数据来解决这一挑战，但当初始标记数据有限时，很难区分真正具有信息量的数据点和由于噪声而显得不确定的数据点。集成方法如随机森林是一个强大的方法来量化这种不确定性，但是通过不加区分地聚合所有模型来实现。这包括表现不佳的模型和冗余的模型，这个问题在存在嘈杂数据时会变得更加严重。我们介绍了UNique Rashomon Ensembled Active Learning（UNREAL），它选择性地从Rashomon集合中仅聚合独特的模型，这是一组几乎最优的模型。将集成成员限制在表现优异且具有不同解释的模型有助于区分真正的不确定性和由噪声引起的变化。我们展示了UNREAL比传统的主动学习方法实现了更快的理论收敛速度，并在五个基准数据集上实现了高达20%的预测准确率的实证改进，同时提高了模型的可解释性。

更新时间: 2025-03-09 20:50:34

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2503.06770v1

Effectiveness of Zero-shot-CoT in Japanese Prompts

We compare the effectiveness of zero-shot Chain-of-Thought (CoT) prompting in Japanese and English using ChatGPT-3.5 and 4o-mini. The technique of zero-shot CoT, which involves appending a phrase such as "Let's think step by step" to a prompt to encourage reasoning before answering, has been shown to offer LLM performance improvements in mathematical and reasoning tasks, particularly in English. We investigate how these effects transfer to Japanese using the Japanese Multi-task Language Understanding Benchmark (JMMLU) and the Multi-task Language Understanding Benchmark (MMLU). Our results show that while zero-shot CoT prompting can lead to notable performance gains for some prompt categories in GPT-3.5, its impact in GPT-4o-mini is associated with significant performance declines. However, for Japanese prompts there remain certain categories, such as college mathematics and abstract algebra, that still exhibit improvements, despite the broader trend of diminishing effectiveness in more advanced models.
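
The zero-shot CoT technique described above amounts to a small change in how the prompt is assembled. The sketch below shows that step only; the English trigger is the phrase quoted in the abstract, while the Japanese trigger string, the function name, and the prompt layout are assumptions for illustration and may differ from what the authors used.

```python
from typing import Optional

# Trigger phrase quoted in the abstract.
COT_TRIGGER_EN = "Let's think step by step."
# Hypothetical Japanese counterpart; the abstract does not give the exact wording.
COT_TRIGGER_JA = "段階的に考えてみましょう。"


def build_prompt(question: str, trigger: Optional[str] = None) -> str:
    """Return the plain question, or the question with a CoT trigger appended."""
    question = question.strip()
    if trigger is None:
        return question              # baseline prompt without CoT
    return f"{question}\n{trigger}"  # zero-shot CoT prompt


baseline = build_prompt("What is 2 + 3?")
with_cot = build_prompt("What is 2 + 3?", COT_TRIGGER_EN)
```

Comparing model accuracy on `baseline` versus `with_cot` prompts, per benchmark category and per language, is the kind of measurement the abstract reports.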

Updated: 2025-03-09 20:42:38

标题: Zero-shot-CoT在日语提示中的有效性

摘要: 我们比较了在使用ChatGPT-3.5和4o-mini时，零射链式思维（CoT）提示在日语和英语中的有效性。零射CoT技术涉及在提示中添加诸如“让我们一步一步地思考”之类的短语，以鼓励在回答之前进行推理，已被证明在数学和推理任务中提供了LLM性能改进，特别是在英语中。我们调查了这些效果如何在日语中转移，使用了日语多任务语言理解基准（JMMLU）和多任务语言理解基准（MMLU）。我们的结果表明，虽然零射CoT提示可以在GPT-3.5中导致一些提示类别的显着性能提升，但在GPT-4o-mini中，其影响与显着的性能下降相关。然而，对于日语提示，仍然存在某些类别，如大学数学和抽象代数，尽管更先进模型中的效果逐渐减弱的更广泛趋势，仍表现出改善。

更新时间: 2025-03-09 20:42:38

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.06765v1

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

We present SemHiTok, a unified image Tokenizer via a Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation tasks, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction and pixel reconstruction and equips the tokenizer with low-level texture feature extraction capability without degrading its high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves a state-of-the-art rFID score at 256×256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.

Updated: 2025-03-09 20:42:34

标题: SemHiTok：通过语义引导的分层码书实现统一的图像标记器，用于多模态理解和生成

摘要: 我们提出了SemHiTok，一种通过语义引导的分层码书实现的统一图像Tokenizer，为多模态理解和生成任务提供一致的离散特征表示。最近，用于理解和生成的统一多模态大型模型（MLLMs）在研究界引发了探索。先前的研究尝试通过结合语义特征重建和像素重建的损失函数来训练统一图像Tokenizer。然而，由于多模态理解和生成任务优先考虑的特征级别不同，联合训练方法面临着很大的挑战，难以取得良好的权衡。SemHiTok通过语义引导的分层码书来解决这一挑战，该码书在预训练的语义码书上构建纹理子码书。这种设计将语义重建和像素重建的训练分开，并为Tokenizer提供了低级纹理特征提取能力，而不会降低高级语义特征提取能力。我们的实验表明，与其他统一Tokenizer相比，SemHiTok在256X256分辨率下实现了最先进的rFID分数，并在多模态理解和生成任务上表现出竞争力。

更新时间: 2025-03-09 20:42:34

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.06764v1

E-Gen: Leveraging E-Graphs to Improve Continuous Representations of Symbolic Expressions

Vector representations have been pivotal in advancing natural language processing (NLP), with prior research focusing on embedding techniques for mathematical expressions using mathematically equivalent formulations. While effective, these approaches are constrained by the size and diversity of training data. In this work, we address these limitations by introducing E-Gen, a novel e-graph-based dataset generation scheme that synthesizes large and diverse mathematical expression datasets, surpassing prior methods in size and operator variety. Leveraging this dataset, we train embedding models using two strategies: (1) generating mathematically equivalent expressions, and (2) contrastive learning to explicitly group equivalent expressions. We evaluate these embeddings on both in-distribution and out-of-distribution mathematical language processing tasks, comparing them against prior methods. Finally, we demonstrate that our embedding-based approach outperforms state-of-the-art large language models (LLMs) on several tasks, underscoring the necessity of optimizing embedding methods for the mathematical data modality. The source code and datasets are available at https://github.com/MLPgroup/E-Gen.

Updated: 2025-03-09 20:31:19

标题: E-Gen：利用E-图改进符号表达式的连续表示

摘要: 矢量表示在推动自然语言处理（NLP）方面起着关键作用，先前的研究主要集中在使用数学等价表达式的嵌入技术。虽然有效，但这些方法受到训练数据规模和多样性的限制。在这项工作中，我们通过引入E-Gen，一种新颖的基于e图的数据集生成方案，解决了这些限制，合成了大规模且多样化的数学表达式数据集，超越了先前的方法在大小和运算符种类上的限制。利用这个数据集，我们使用两种策略训练嵌入模型：（1）生成数学上等价的表达式，（2）对等价表达式进行对比学习以明确地分组。我们评估这些嵌入在分布内和分布外的数学语言处理任务上的表现，并将它们与先前的方法进行比较。最后，我们证明我们基于嵌入的方法在几个任务上优于最先进的大型语言模型（LLMs），强调了优化数学数据模态的嵌入方法的必要性。源代码和数据集可在https://github.com/MLPgroup/E-Gen上找到。

更新时间: 2025-03-09 20:31:19

领域: cs.LG,cs.CL,cs.SC

下载: http://arxiv.org/abs/2501.14951v2

Primal-Dual Sample Complexity Bounds for Constrained Markov Decision Processes with Multiple Constraints

This paper addresses the challenge of solving Constrained Markov Decision Processes (CMDPs) with $d > 1$ constraints when the transition dynamics are unknown, but samples can be drawn from a generative model. We propose a model-based algorithm for infinite horizon CMDPs with multiple constraints in the tabular setting, aiming to derive and prove sample complexity bounds for learning near-optimal policies. Our approach tackles both the relaxed and strict feasibility settings, where relaxed feasibility allows some constraint violations, and strict feasibility requires adherence to all constraints. The main contributions include the development of the algorithm and the derivation of sample complexity bounds for both settings. For the relaxed feasibility setting we show that our algorithm requires $\tilde{\mathcal{O}} \left( \frac{d |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^3\epsilon^2} \right)$ samples to return an $\epsilon$-optimal policy, while in the strict feasibility setting it requires $\tilde{\mathcal{O}} \left( \frac{d^3 |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^5\epsilon^2{\zeta_{\mathbf{c}}^*}^2} \right)$ samples.
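
To get a feel for how the two bounds above scale, the following sketch evaluates only the leading terms inside the $\tilde{\mathcal{O}}(\cdot)$ expressions. The constants and polylogarithmic factors hidden by the notation are not given in the abstract and are omitted here, so the outputs are useful for comparing settings rather than as sample counts; the function and parameter names are illustrative.

```python
import math


def relaxed_bound(d, n_states, n_actions, gamma, eps, delta):
    """Leading term of the relaxed-feasibility bound (hidden factors omitted)."""
    return (d * n_states * n_actions * math.log(1 / delta)
            / ((1 - gamma) ** 3 * eps ** 2))


def strict_bound(d, n_states, n_actions, gamma, eps, delta, zeta):
    """Leading term of the strict-feasibility bound; zeta stands for zeta_c^*."""
    return (d ** 3 * n_states * n_actions * math.log(1 / delta)
            / ((1 - gamma) ** 5 * eps ** 2 * zeta ** 2))


# Arbitrary example values, chosen only to illustrate the scaling.
params = dict(d=2, n_states=10, n_actions=4, gamma=0.9, eps=0.1, delta=0.05)
ratio = strict_bound(zeta=0.5, **params) / relaxed_bound(**params)
```

Dividing the two expressions gives $d^2 / \big((1-\gamma)^2 {\zeta_{\mathbf{c}}^*}^2\big)$, so under these leading terms the extra cost of strict feasibility grows with the number of constraints, the effective horizon, and the inverse of $\zeta_{\mathbf{c}}^*$.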

Updated: 2025-03-09 20:10:35

标题: 多约束条件下的约束马尔可夫决策过程的原始-对偶样本复杂性界限

摘要: 本文讨论了在转移动态未知但可以从生成模型中抽取样本时解决具有$d>1$约束的受限马尔可夫决策过程（CMDPs）的挑战。我们提出了一种基于模型的算法，用于在表格设置中具有多个约束的无限地平CMDPs，旨在推导并证明学习接近最优策略的样本复杂性界限。我们的方法解决了放宽和严格可行性设置，其中放宽可行性允许一些约束违反，而严格可行性要求遵守所有约束。主要贡献包括算法的开发和两种设置的样本复杂性界限的推导。对于放宽可行性设置，我们表明我们的算法需要$\tilde{\mathcal{O}}\left(\frac{d|\mathcal{S}||\mathcal{A}|\log(1/\delta)}{(1-\gamma)^3\epsilon^2}\right)$个样本返回$\epsilon$-最优策略，而在严格可行性设置中，需要$\tilde{\mathcal{O}}\left(\frac{d^3|\mathcal{S}||\mathcal{A}|\log(1/\delta)}{(1-\gamma)^5\epsilon^2{\zeta_{\mathbf{c}}^*}^2}\right)$个样本。

更新时间: 2025-03-09 20:10:35

领域: cs.LG,68T05, 68T05,F.2.2; I.2.6; G.1.6

下载: http://arxiv.org/abs/2503.06751v1

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering, obtaining a 200K multimodal CoT dataset, the Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose the Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released at: https://github.com/Osilly/Vision-R1 .
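
For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization in isolation, as it is commonly described; it is not the Vision-R1 training code. The function name, the use of the population standard deviation, the `eps` stabilizer, and the 0/1 example rewards are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical 0/1 rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group whose responses all receive the same reward yields zero advantages.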

Updated: 2025-03-09 20:06:45

标题: Vision-R1：激励多模态大型语言模型的推理能力

摘要: DeepSeek-R1-Zero通过强化学习成功展示了LLMs中推理能力的出现。受到这一突破的启发，我们探索了如何利用强化学习来增强MLLMs的推理能力。然而，直接用强化学习进行训练在激活MLLMs中复杂的推理能力（如提问和反思）方面存在困难，这是因为缺乏大量高质量的多模态推理数据。为了解决这一问题，我们提出了推理MLLM Vision-R1，以提高多模态推理能力。具体来说，我们首先通过利用现有的MLLM和DeepSeek-R1进行模态桥接和数据过滤来构建一个高质量的无人注释的多模态CoT数据集，从而获得一个20万个多模态CoT数据集Vision-R1-cold数据集。它作为Vision-R1的冷启动初始化数据。为了缓解冷启动后过度思考导致的优化挑战，我们提出了渐进式思维抑制训练（PTST）策略，并采用硬格式化结果奖励函数的群体相对策略优化（GRPO）来逐步改进模型学习正确和复杂推理过程的能力，在一个1万个多模态数学数据集上进行实验。综合实验显示，我们的模型在各种多模态数学推理基准上取得了平均约6%的改进。Vision-R1-7B在广泛使用的MathVista基准测试中取得了73.5%的准确率，仅比领先的推理模型OpenAI O1低0.4%。数据集和代码将在https://github.com/Osilly/Vision-R1发布。

更新时间: 2025-03-09 20:06:45

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.06749v1

Fully-Decentralized MADDPG with Networked Agents

In this paper, we devise three actor-critic algorithms with decentralized training for multi-agent reinforcement learning in cooperative, adversarial, and mixed settings with continuous action spaces. To this end, we adapt the MADDPG algorithm by applying a networked communication approach between agents. We introduce surrogate policies in order to decentralize the training while allowing for local communication during training. The decentralized algorithms achieve comparable results to the original MADDPG in empirical tests, while reducing computational cost. This reduction is more pronounced with larger numbers of agents.

Updated: 2025-03-09 20:05:32

标题: 具有网络化代理的完全去中心化的MADDPG

摘要: 在本文中，我们设计了三种具有分散式训练的演员-评论家算法，用于协作、对抗和混合设置下的多智能体强化学习，其中包含连续动作空间。为实现这一目标，我们通过在智能体之间应用网络通信方法来改进MADDPG算法。我们引入了替代策略，以便在训练过程中实现分散式训练，同时允许在训练过程中进行本地通信。在经验测试中，分散式算法实现了与原始MADDPG相当的结果，同时降低了计算成本。这种效果在智能体数量较大时更为明显。

更新时间: 2025-03-09 20:05:32

领域: cs.LG,cs.AI,cs.MA

下载: http://arxiv.org/abs/2503.06747v1

Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems

The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges for observing, analyzing and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the non-deterministic, context-sensitive, and dynamic nature of these systems. This paper explores key challenges and opportunities in analyzing and optimizing agentic systems across development, testing, and maintenance. We explore critical issues such as natural language variability and unpredictable execution flows, which hinder predictability and control, demanding adaptive strategies to manage input variability and evolving behaviors. Our user study supported these hypotheses. In particular, we found 79% agreement that the non-deterministic flow of agentic systems is a major challenge. Finally, we validated our statements empirically, advocating the need to move beyond classical benchmarking. To bridge these gaps, we introduce taxonomies to present expected analytics outcomes and the ways to collect them by extending standard observability frameworks. Building on these foundations, we introduce and demonstrate a novel approach for benchmarking of agent evaluation systems. Unlike traditional "black box" performance evaluation approaches, our benchmark is built from agent runtime logs as input, and analytics outcomes including discovered flows and issues. By addressing key limitations in existing methodologies, we aim to set the stage for more advanced and holistic evaluation strategies, which could foster the development of adaptive, interpretable, and robust agentic AI systems.

Updated: 2025-03-09 20:02:04

标题: 超越黑盒基准测试：主体系统的可观察性、分析和优化

摘要: 代理型人工智能系统的崛起，其中代理协作执行各种任务，提出了观察、分析和优化其行为的新挑战。传统的评估和基准测试方法难以处理这些系统的非确定性、上下文敏感和动态性质。本文探讨了在开发、测试和维护过程中分析和优化代理系统的关键挑战和机遇。我们探讨了诸如自然语言变化和难以预测的执行流程等关键问题，这些问题阻碍了可预测性和控制，要求采用适应性策略来管理输入变化和不断发展的行为。通过我们的用户研究，我们支持了这些假设。特别地，我们展示了79%的一致性意见，即代理系统的非确定性流程是一个重要挑战。最后，我们通过实证验证了我们的观点，主张超越传统的基准测试的必要性。为了弥补这些差距，我们引入了分类法，展示了预期的分析结果和通过扩展标准可观察性框架来收集这些结果的方式。基于这些基础，我们介绍并展示了用于代理评估系统基准测试的新方法。与传统的“黑匣子”性能评估方法不同，我们的基准测试是基于代理运行时日志作为输入，并包括发现的流程和问题的分析结果。通过解决现有方法学中的关键局限性，我们旨在为更先进和全面的评估策略奠定基础，这些策略可以促进自适应、可解释和强大的代理型人工智能系统的发展。

更新时间: 2025-03-09 20:02:04

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2503.06745v1

Green Prompting

Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this work, we empirically study how different prompt and response characteristics directly impact LLM inference energy cost. We conduct experiments leveraging three open-source transformer-based LLMs across three task types: question answering, sentiment analysis, and text generation. For each inference, we analyzed prompt and response characteristics (length, semantic meaning, time taken, energy consumption). Our results demonstrate that even when presented with identical tasks, models generate responses with varying characteristics and subsequently exhibit distinct energy consumption patterns. We found that prompt length is less significant than the semantic meaning of the task itself. In addition, we identified specific keywords associated with higher or lower energy usage that vary between associated tasks. These findings highlight the importance of prompt design in optimizing inference efficiency. We conclude that the semantic meaning of prompts and certain task-related keywords significantly impact inference costs, leading the way for deeper exploration towards creating energy-adaptive LLMs.

Updated: 2025-03-09 19:49:31

标题: 绿色提示

摘要: 大型语言模型（LLMs）已被广泛应用于跨领域的搜索引擎、代码生成和文本创作。然而，与其采用相关的一个主要关注点是推理成本高昂，影响了它们的可持续性和财务可行性。在这项研究中，我们通过实证研究直接探讨了不同提示和响应特征如何影响LLM推理能耗。我们进行了实验，利用了三种开源基于Transformer的LLMs，涵盖了三种任务类型——问题回答、情感分析和文本生成。对于每个推理，我们分析了提示和响应特征（长度、语义含义、时间花费、能量消耗）。我们的结果表明，即使面对相同的任务，模型生成的响应具有不同的特征，并随后展现出不同的能耗模式。我们发现提示的长度不如任务本身的语义含义重要。此外，我们确定了与不同任务相关的能量使用更高或更低的特定关键词。这些发现突显了提示设计在优化推理效率方面的重要性。我们得出结论，提示的语义含义和某些与任务相关的关键词显著影响推理成本，为深入探索创造能量自适应LLMs铺平了道路。

更新时间: 2025-03-09 19:49:31

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.10666v1

Training Sparse Mixture Of Experts Text Embedding Models

Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models' increased memory requirements constrain dataset ingestion capacity, and their higher latency directly impacts query-time performance. While causal language models have addressed similar efficiency challenges using Mixture of Experts (MoE) architectures, this approach hasn't been successfully adapted to the general text embedding setting. In this paper, we introduce Nomic Embed v2, the first general purpose MoE text embedding model. Our model outperforms models in the same parameter class on both monolingual and multilingual benchmarks while also maintaining competitive performance with models twice its size. We open-source all code, models, and evaluation data to ensure full reproducibility of our training pipeline at https://github.com/nomic-ai/contrastors.

Updated: 2025-03-09 19:39:00

标题: 训练稀疏专家混合文本嵌入模型

摘要: 基于Transformer的文本嵌入模型通过增加参数数量提高了在MIRACL和BEIR等基准上的性能。然而，这种扩展方法引入了重要的部署挑战，包括增加的推理延迟和内存使用。这些挑战在检索增强生成（RAG）应用中尤为严重，大型模型增加的内存需求限制了数据集摄入能力，而它们更高的延迟直接影响查询性能。虽然因果语言模型已经使用专家混合（MoE）架构解决了类似的效率挑战，但这种方法尚未成功地适应到一般文本嵌入设置中。在本文中，我们介绍了Nomic Embed v2，第一个通用的MoE文本嵌入模型。我们的模型在单语和多语言基准上优于相同参数类别的模型，同时也保持了与其两倍大小模型竞争性能。我们开源所有代码、模型和评估数据，以确保我们的训练流程的完全可重现性。

更新时间: 2025-03-09 19:39:00

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2502.07972v3

SupReMix: Supervised Contrastive Learning for Medical Imaging Regression with Mixup

In medical image analysis, regression plays a critical role in computer-aided diagnosis. It enables quantitative measurements such as age prediction from structural imaging, cardiac function quantification, and molecular measurement from PET scans. While deep learning has shown promise for these tasks, most approaches focus solely on optimizing regression loss or model architecture, neglecting the quality of learned feature representations which are crucial for robust clinical predictions. Directly applying representation learning techniques designed for classification to regression often results in fragmented representations in the latent space, yielding sub-optimal performance. In this paper, we argue that the potential of contrastive learning for medical image regression has been overshadowed due to the neglect of two crucial aspects: ordinality-awareness and hardness. To address these challenges, we propose Supervised Contrastive Learning for Medical Imaging Regression with Mixup (SupReMix). It takes anchor-inclusive mixtures (mixup of the anchor and a distinct negative sample) as hard negative pairs and anchor-exclusive mixtures (mixup of two distinct negative samples) as hard positive pairs at the embedding level. This strategy formulates harder contrastive pairs by integrating richer ordinal information. Through theoretical analysis and extensive experiments on six datasets spanning MRI, X-ray, ultrasound, and PET modalities, we demonstrate that SupReMix fosters continuous ordered representations, significantly improving regression performance.

Updated: 2025-03-09 19:37:46

标题: SupReMix：医学成像回归的监督对比学习与混合

摘要: 在医学图像分析中，回归在计算机辅助诊断中起着关键作用。它使得从结构成像中预测年龄、心脏功能量化和从PET扫描中测量分子等定量测量成为可能。虽然深度学习在这些任务中显示出了潜力，但大多数方法仅专注于优化回归损失或模型架构，忽视了学习特征表示的质量，这对于稳健的临床预测是至关重要的。直接将为分类设计的表示学习技术应用于回归往往会导致潜在空间中的片段化表示，从而产生次优性能。本文认为，由于忽视了两个关键方面：顺序感知和困难性，医学图像回归的对比学习潜力已经被掩盖。为了解决这些挑战，我们提出了带有Mixup的医学图像回归监督对比学习（SupReMix）。它将锚点包含混合（锚点和不同负样本的混合）作为困难负对，将锚点排除混合（两个不同负样本的混合）作为困难正对在嵌入级别。这种策略通过整合更丰富的顺序信息来制定更困难的对比对。通过理论分析和在MRI、X射线、超声和PET模态跨越六个数据集上的广泛实验，我们证明了SupReMix促进了连续有序表示，显著改善了回归性能。

更新时间: 2025-03-09 19:37:46

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2309.16633v3

Faster and Space Efficient Indexing for Locality Sensitive Hashing

This work suggests faster and space-efficient index construction algorithms for LSH for Euclidean distance (a.k.a. ELSH) and cosine similarity (a.k.a. SRP). The index construction step of these LSHs relies on grouping data points into several bins of hash tables based on their hashcode. To generate an $m$-dimensional hashcode of the $d$-dimensional data point, these LSHs first project the data point onto a $d$-dimensional random Gaussian vector and then discretise the resulting inner product. The time and space complexity of both ELSH and SRP for computing an $m$-sized hashcode of a $d$-dimensional vector is $O(md)$, which becomes impractical for large values of $m$ and $d$. To overcome this problem, we propose two alternative LSH hashcode generation algorithms both for Euclidean distance and cosine similarity, namely, CSELSH, HCSELSH and CSSRP, HCSSRP, respectively. CSELSH and CSSRP are based on count sketch \cite{count_sketch}, and HCSELSH and HCSSRP utilize higher-order count sketch \cite{shi2019higher}. These proposals significantly reduce the hashcode computation time from $O(md)$ to $O(d)$. Additionally, both CSELSH and CSSRP reduce the space complexity from $O(md)$ to $O(d)$, and HCSELSH and HCSSRP reduce the space complexity from $O(md)$ to $O(N \sqrt[N]{d})$, where $N\geq 1$ denotes the size of the input/reshaped tensor. Our proposals are backed by strong mathematical guarantees, and we validate their performance through simulations on various real-world datasets.
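
The cost gap described in the abstract above can be illustrated with a small sketch. It is only one plausible reading of a count-sketch-based sign hash, not the paper's exact construction: the function names `make_count_sketch`, `cs_srp_hashcode`, and `dense_srp_hashcode` are hypothetical, and the bucket and sign hashes are stored as plain lists rather than drawn from a pairwise-independent hash family.

```python
import random


def make_count_sketch(d, m, seed=0):
    """Draw one bucket index and one random sign per input coordinate.

    Storage is O(d), versus O(m*d) for m dense Gaussian vectors.
    (Hypothetical helper; a real implementation would use hash functions.)
    """
    rng = random.Random(seed)
    buckets = [rng.randrange(m) for _ in range(d)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    return buckets, signs


def cs_srp_hashcode(x, buckets, signs, m):
    """m-bit code from a single O(d) pass: sketch x, then keep the signs."""
    sketch = [0.0] * m
    for xi, b, s in zip(x, buckets, signs):
        sketch[b] += s * xi
    return [1 if v >= 0 else 0 for v in sketch]


def dense_srp_hashcode(x, m, seed=0):
    """Baseline sign random projection: m Gaussian inner products, O(m*d)."""
    rng = random.Random(seed)
    code = []
    for _ in range(m):
        dot = sum(rng.gauss(0.0, 1.0) * xi for xi in x)
        code.append(1 if dot >= 0 else 0)
    return code
```

Each input coordinate is touched once in `cs_srp_hashcode`, whereas the dense baseline touches every coordinate once per output bit. Whether the sketched code keeps the collision guarantees of the dense one is exactly what the paper's analysis would need to establish; the sketch makes no such claim.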

Updated: 2025-03-09 19:33:01

标题: 更快更节省空间的局部敏感哈希索引化

摘要: 这项工作提出了更快和空间高效的LSH索引构建算法，用于欧几里德距离（\textit{即}~\ELSH）和余弦相似度（\textit{即}~\SRP）。这些LSH的索引构建步骤依赖于根据它们的哈希码将数据点分组到多个哈希表的桶中。为了生成$d$维数据点的$m$维哈希码，这些LSH首先将数据点投影到一个$d$维随机高斯向量上，然后离散化得到的内积。计算$d$维向量的$m$维哈希码的时间和空间复杂度为$O(md)$，对于较大的$m$和$d$值来说变得不切实际。为了解决这个问题，我们提出了两种替代的LSH哈希码生成算法，分别用于欧几里德距离和余弦相似度，即\CSELSH、\HCSELSH和\CSSRP、\HCSSRP。 \CSELSH和\CSSRP基于计数草图\cite{count_sketch}，\HCSELSH和\HCSSRP利用了高阶计数草图\cite{shi2019higher}。这些提议将哈希码计算时间从$O(md)$显著降低到$O(d)$。此外，\CSELSH和\CSSRP将空间复杂度从$O(md)$降低到$O(d)$；\HCSELSH和\HCSSRP将空间复杂度分别从$O(md)$降低到$O(N \sqrt[N]{d})$，其中$N\geq 1$表示输入/重塑张量的大小。我们的提议得到了强大的数学保证，并通过在各种真实世界数据集上进行模拟验证了它们的性能。

更新时间: 2025-03-09 19:33:01

领域: cs.DS,cs.LG

下载: http://arxiv.org/abs/2503.06737v1

Gender Encoding Patterns in Pretrained Language Model Representations

Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.

Updated: 2025-03-09 19:17:46

标题: 预训练语言模型表示中的性别编码模式

摘要: 预训练语言模型（PLMs）中的性别偏见带来了重大的社会和伦理挑战。尽管人们对此越来越意识到，但对不同模型如何内部表示和传播这种偏见缺乏全面的调查。本研究采用信息论方法分析了各种基于编码器的体系结构中性别偏见是如何编码的。我们关注三个关键方面：识别模型如何编码性别信息和偏见，研究偏见缓解技术和微调对编码偏见及其有效性的影响，以及探索模型设计差异如何影响偏见的编码。通过严格和系统的调查，我们的发现揭示了跨不同模型的性别编码的一致模式。令人惊讶的是，去偏见技术通常表现出有限的功效，有时会无意中增加内部表示中的编码偏见，同时减少模型输出分布中的偏见。这突显了在输出分布中减少偏见与解决其内部表示之间的脱节。本研究为推进偏见缓解策略和促进更公平的语言模型的发展提供了宝贵的指导。

更新时间: 2025-03-09 19:17:46

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.06734v1

Machine learning for triage of strokes with large vessel occlusion using photoplethysmography biomarkers

Objective. Large vessel occlusion (LVO) stroke presents a major challenge in clinical practice due to the potential for poor outcomes with delayed treatment. Treatment for LVO involves highly specialized care, in particular endovascular thrombectomy, and is available only at certain hospitals. Therefore, prehospital identification of LVO by emergency ambulance services can be critical for triaging LVO stroke patients directly to a hospital with access to endovascular therapy. Clinical scores exist to help distinguish LVO from less severe strokes, but they are based on a series of examinations that can take minutes and may be impractical for patients with dementia or those who cannot follow commands due to their stroke. There is a need for a fast and reliable method to aid in the early identification of LVO. In this study, our objective was to assess the feasibility of using a 30-second photoplethysmography (PPG) recording to assist in recognizing LVO stroke. Method. A total of 88 patients, including 25 with LVO, 27 with stroke mimic (SM), and 36 non-LVO stroke patients (NL), were recorded at the Liverpool Hospital emergency department in Sydney, Australia. Demographics (age, sex), as well as morphological features and beating rate variability measures, were extracted from the PPG. A binary classification approach was employed to differentiate between LVO stroke and NL+SM (NL.SM). A stratified 2:1 train-test split was repeated randomly across 100 iterations. Results. The best model achieved a median test set area under the receiver operating characteristic curve (AUROC) of 0.77 (0.71--0.82). Conclusion. Our study demonstrates the potential of utilizing a 30-second PPG recording for identifying LVO stroke.
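
The evaluation protocol in the abstract above (a stratified 2:1 split repeated 100 times, reporting the median test AUROC) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: `scores_fn` is a hypothetical callback standing in for whatever model is fitted on the training indices and scored on the test indices.

```python
import random
from statistics import median


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_split(labels, test_fraction, rng):
    """Split indices so each class contributes about test_fraction of its members to the test set."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test


def repeated_median_auroc(labels, scores_fn, n_iter=100, seed=0):
    """Repeat the stratified 2:1 split n_iter times and return the median test AUROC.

    scores_fn(train_idx, test_idx) is a hypothetical callback: it should fit
    a model on train_idx and return one score per entry of test_idx.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_iter):
        train, test = stratified_split(labels, 1 / 3, rng)
        scores = scores_fn(train, test)
        results.append(auroc([labels[i] for i in test], scores))
    return median(results)
```

With 25 LVO and 63 non-LVO recordings, each repetition in this sketch holds out 8 LVO and 21 non-LVO cases. Reporting the median across repetitions reduces the influence of any single lucky or unlucky split in a cohort this small.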

Updated: 2025-03-09 19:12:32

标题: 使用光电容积脉搏波生物标志物进行大血管闭塞卒中的分诊的机器学习

摘要: 目的。大血管闭塞（LVO）卒中在临床实践中构成了一项重大挑战，因为延迟治疗可能导致不良结果。LVO的治疗需要高度专业化护理，尤其是内血管溶栓术，只有在特定医院才能进行。因此，急救车服务机构对LVO进行院前识别可以对LVO卒中患者直接将其分流至具备内血管治疗条件的医院而至关重要。存在临床评分用于帮助区分LVO和较轻的卒中，但它们基于一系列可能需要几分钟的检查，对于患有痴呆症或因卒中而无法遵循指令的患者可能不切实际。需要一种快速可靠的方法来帮助早期识别LVO。本研究的目标是评估使用30秒光电容积脉搏图（PPG）记录协助识别LVO卒中的可行性。方法。在澳大利亚悉尼利物浦医院急诊科记录了88名患者，包括25名LVO患者、27名卒中模拟患者（SM）和36名非LVO卒中患者（NL）。从PPG中提取了人口统计学特征（年龄、性别）以及形态特征和心率变化度量。采用二元分类方法区分LVO卒中和NL+SM（NL.SM）。对2:1的训练-测试分组进行分层，随机重复100次。结果。最佳模型在测试集下的接收者操作特征曲线（AUROC）的中位数为0.77（0.71-0.82）。结论。我们的研究展示了利用30秒PPG记录识别LVO卒中的潜力。

更新时间: 2025-03-09 19:12:32

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2503.13486v1

Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents

Motivated by applications such as online labor markets, we consider a variant of the stochastic multi-armed bandit problem where we have a collection of arms representing strategic agents with different performance characteristics. The platform (principal) chooses an agent in each round to complete a task. Unlike the standard setting, when an arm is pulled it can modify its reward by absorbing it or improving it at the expense of a higher cost. The principal has to solve a mechanism design problem to incentivize the arms to give their best performance. However, since even with an effective mechanism agents may still deviate from rational behavior, the principal wants a robust algorithm that also gives a non-vacuous guarantee on the total accumulated rewards under non-equilibrium behavior. In this paper, we introduce a class of bandit algorithms that meet the two objectives of performance incentivization and robustness simultaneously. We do this by identifying a collection of intuitive properties that a bandit algorithm has to satisfy to achieve these objectives. Finally, we show that settings where the principal has no information about the arms' performance characteristics can be handled by combining ideas from second price auctions with our algorithms.

Updated: 2025-03-09 19:11:21

标题: 具有战略代理的多臂老虎机的稳健和激励表现算法

摘要: 受在线劳动市场等应用的启发，我们考虑了一种变体的随机多臂赌博问题，其中我们有一组代表具有不同表现特征的战略性代理的手臂。平台（主体）在每一轮选择一个代理来完成任务。与标准设置不同，当拉动一个手臂时，它可以通过吸收或改进奖励来修改奖励，以牺牲更高的成本。主体必须解决一个机制设计问题，以激励手臂发挥最佳表现。然而，即使有一个有效的机制，代理仍可能偏离理性行为，主体希望一个既能激励绩效又能在非均衡行为下给出一个非平庸保证的稳健算法。在本文中，我们介绍了一类满足绩效激励和稳健性两个目标的赌博算法。我们通过确定一组直观属性来满足这些目标来实现这一点。最后，我们展示了在主体没有关于手臂表现特征的信息的情况下，通过将第二价格拍卖的思想与我们的算法相结合来处理。

更新时间: 2025-03-09 19:11:21

领域: cs.GT,cs.LG

下载: http://arxiv.org/abs/2312.07929v2

Data Efficient Subset Training with Differential Privacy

Private machine learning introduces a trade-off between the privacy budget and training performance. Training convergence is substantially slower and extensive hyperparameter tuning is required. Consequently, efficient methods to conduct private training of models are thoroughly investigated in the literature. To this end, we investigate the strength of data-efficient model training methods in the private training setting. We adapt GLISTER (Killamsetty et al., 2021b) to the private setting and extensively assess its performance. We empirically find that practical choices of privacy budgets are too restrictive for data-efficient training in the private setting.

Updated: 2025-03-09 19:05:10

标题: 具有差分隐私的数据高效子集训练

摘要: 私人机器学习引入了隐私预算与训练性能之间的权衡。训练收敛速度明显较慢，需要进行大量的超参数调整。因此，文献中对进行模型私人训练的高效方法进行了深入研究。为此，我们研究了在私人训练环境中数据高效模型训练方法的强大性。我们将GLISTER（Killamsetty等人，2021b）调整为私人设置，并对其性能进行了广泛评估。我们在实证研究中发现，实际的隐私预算选择对私人环境中数据高效训练过于限制。

更新时间: 2025-03-09 19:05:10

领域: cs.LG

下载: http://arxiv.org/abs/2503.06732v1

Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention

Concept bottleneck models (CBMs) aim to improve model interpretability by predicting human level "concepts" in a bottleneck within a deep learning model architecture. However, how the predicted concepts are used in predicting the target still either remains black-box or is simplified to maintain interpretability at the cost of prediction performance. We propose to use Fast Interpretable Greedy Sum-Trees (FIGS) to obtain Binary Distillation (BD). This new method, called FIGS-BD, distills a binary-augmented concept-to-target portion of the CBM into an interpretable tree-based model, while mimicking the competitive prediction performance of the CBM teacher. FIGS-BD can be used in downstream tasks to explain and decompose CBM predictions into interpretable binary-concept-interaction attributions and guide adaptive test-time intervention. Across 4 datasets, we demonstrate that adaptive test-time intervention identifies key concepts that significantly improve performance for realistic human-in-the-loop settings that allow for limited concept interventions.

Updated: 2025-03-09 19:03:48

标题: 通过二进制蒸馏提升CBMs，并应用于测试时间干预

摘要: 概念瓶颈模型（CBM）旨在通过在深度学习模型架构中预测人类水平的“概念”来提高模型的可解释性。然而，如何在预测目标时使用预测的概念，仍然要么保持黑盒性，要么简化以维持可解释性，但牺牲了预测性能。我们提出使用快速可解释的贪心和树（FIGS）来获得二进制蒸馏（BD）。这种新方法称为FIGS-BD，将CBM的二进制增强概念到目标部分蒸馏成一个可解释的基于树的模型，同时模仿CBM教师的竞争性预测性能。FIGS-BD可以用于下游任务，将CBM的预测解释为可解释的二进制概念交互属性，并指导自适应的测试时间干预。在4个数据集上，我们证明了自适应的测试时间干预可以识别关键概念，显著提高适用于允许有限概念干预的现实人类参与环境的性能。

更新时间: 2025-03-09 19:03:48

领域: cs.LG

下载: http://arxiv.org/abs/2503.06730v1

ACAI for SBOs: AI Co-creation for Advertising and Inspiration for Small Business Owners

Small business owners (SBOs) often lack the resources and design experience needed to produce high-quality advertisements. To address this, we developed ACAI (AI Co-Creation for Advertising and Inspiration), a GenAI-powered multimodal advertisement creation tool, and conducted a user study with 16 SBOs in London to explore their perceptions of and interactions with ACAI in advertisement creation. Our findings reveal that structured inputs enhance user agency and control while improving AI outputs by facilitating better brand alignment, enhancing AI transparency, and offering scaffolding that assists novice designers, such as SBOs, in formulating prompts. We also found that ACAI's multimodal interface bridges the design skill gap for SBOs who have a clear advertisement vision but lack the design jargon necessary for effective prompting. Building on our findings, we propose three capabilities: contextual intelligence, adaptive interactions, and data management, with corresponding design recommendations to advance the co-creative attributes of AI-mediated design tools.

Updated: 2025-03-09 19:00:36

标题: ACAI对SBOs的影响：AI共创广告和对小企业主的启发

摘要: 小型企业所有者（SBOs）常常缺乏生产高质量广告所需的资源和设计经验。为了解决这个问题，我们开发了ACAI（AI广告和灵感的共创），这是一个由GenAI驱动的多模式广告创作工具，并与伦敦的16名SBOs进行了用户研究，以探索他们对ACAI在广告创作中的感知和互动。我们的研究结果表明，结构化输入提高了用户的代理权和控制能力，同时通过促进更好的品牌对齐，增强AI的透明度，并提供帮助初学设计者（如SBOs）制定提示的脚手架，从而改善了AI的输出。我们还发现，ACAI的多模式界面弥合了对于有清晰广告愿景但缺乏有效提示所需的设计术语的SBOs的设计技能差距。基于我们的发现，我们提出了三种能力：情境智能，自适应互动和数据管理，并提出相应的设计建议，以推进AI中介设计工具的共创属性。

更新时间: 2025-03-09 19:00:36

领域: cs.HC,cs.AI,cs.CY,cs.ET

下载: http://arxiv.org/abs/2503.06729v1

Pull-Based Query Scheduling for Goal-Oriented Semantic Communication

This paper addresses query scheduling for goal-oriented semantic communication in pull-based status update systems. We consider a system where multiple sensing agents (SAs) observe a source characterized by various attributes and provide updates to multiple actuation agents (AAs), which act upon the received information to fulfill their heterogeneous goals at the endpoint. A hub serves as an intermediary, querying the SAs for updates on observed attributes and maintaining a knowledge base, which is then broadcast to the AAs. The AAs leverage the knowledge to perform their actions effectively. To quantify the semantic value of updates, we introduce a grade of effectiveness (GoE) metric. Furthermore, we integrate cumulative prospect theory (CPT) into the long-term effectiveness analysis to account for risk awareness and loss aversion in the system. Leveraging this framework, we compute effect-aware scheduling policies aimed at maximizing the expected discounted sum of CPT-based total GoE provided by the transmitted updates while complying with a given query cost constraint. To achieve this, we propose a model-based solution based on dynamic programming and model-free solutions employing state-of-the-art deep reinforcement learning (DRL) algorithms. Our findings demonstrate that effect-aware scheduling significantly enhances the effectiveness of communicated updates compared to benchmark scheduling methods, particularly in settings with stringent cost constraints where optimal query scheduling is vital for system performance and overall effectiveness.
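
The risk awareness and loss aversion that CPT contributes in the abstract above can be made concrete with a small sketch of cumulative prospect theory in Tversky and Kahneman's 1992 form. It is not the paper's GoE formulation: the default parameters are Tversky and Kahneman's published estimates, used here as assumptions, and the function names are hypothetical.

```python
def value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Value function: concave for gains, convex and steeper (factor lam) for losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta


def weight(p, gamma):
    """Inverse-S probability weighting: overweights small p, underweights large p."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)


def cpt_value(outcomes, probs, gamma_gain=0.61, gamma_loss=0.69):
    """CPT valuation of a discrete prospect using rank-dependent decision weights."""
    pairs = sorted(zip(outcomes, probs))
    total = 0.0
    # Losses: accumulate probability from the worst outcome upwards.
    cum = 0.0
    for x, p in pairs:
        if x >= 0:
            break
        total += (weight(cum + p, gamma_loss) - weight(cum, gamma_loss)) * value(x)
        cum += p
    # Gains: accumulate probability from the best outcome downwards.
    cum = 0.0
    for x, p in reversed(pairs):
        if x < 0:
            break
        total += (weight(cum + p, gamma_gain) - weight(cum, gamma_gain)) * value(x)
        cum += p
    return total
```

Under these assumed parameters, a fair coin flip between gaining and losing the same amount receives a negative CPT value, which is the loss aversion that a CPT-based objective would carry into a scheduling policy.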

Updated: 2025-03-09 18:51:14

标题: 基于拉取的目标导向语义通信的查询调度

摘要: 本文讨论了面向目标导向语义通信的查询调度问题，该问题发生在基于拉取的状态更新系统中。我们考虑一个系统，其中多个感知代理（SAs）观察具有各种属性的来源，并向多个执行代理（AAs）提供更新，AAs根据接收到的信息来实现其异构目标。一个中心作为中介，向SAs查询有关观察属性的更新，并维护一个知识库，然后广播给AAs。AAs利用这些知识来有效地执行他们的动作。为了量化更新的语义价值，我们引入了一个效果级别（GoE）指标。此外，我们将累积展望理论（CPT）整合到长期效果分析中，以考虑系统中的风险意识和损失厌恶。利用这一框架，我们计算了旨在最大化传输更新提供的基于CPT的总GoE的期望折现和同时遵守给定查询成本约束的效果感知调度策略。为了实现这一目标，我们提出了基于动态规划的基于模型的解决方案和采用最先进的深度强化学习（DRL）算法的无模型解决方案。我们的研究结果表明，效果感知调度明显提升了与基准调度方法相比的通信更新的效果，特别是在成本约束严格的情况下，优化查询调度对系统性能和整体效果至关重要的设置中。

更新时间: 2025-03-09 18:51:14

领域: cs.IT,cs.AI,cs.NI,math.IT

下载: http://arxiv.org/abs/2503.06725v1

Diffusing DeBias: Synthetic Bias Amplification for Model Debiasing

Deep learning model effectiveness in classification tasks is often challenged by the quality and quantity of training data whenever they are affected by strong spurious correlations between specific attributes and target labels. This results in a form of bias affecting training data, which typically leads to unrecoverable weak generalization in prediction. This paper aims at facing this problem by leveraging bias amplification with generated synthetic data: we introduce Diffusing DeBias (DDB), a novel approach acting as a plug-in for common methods of unsupervised model debiasing exploiting the inherent bias-learning tendency of diffusion models in data generation. Specifically, our approach adopts conditional diffusion models to generate synthetic bias-aligned images, which replace the original training set for learning an effective bias amplifier model that we subsequently incorporate into an end-to-end and a two-step unsupervised debiasing approach. By tackling the fundamental issue of bias-conflicting training samples memorization in learning auxiliary models, typical of this type of techniques, our proposed method beats current state-of-the-art in multiple benchmark datasets, demonstrating its potential as a versatile and effective tool for tackling bias in deep learning models.

Updated: 2025-03-09 18:41:50

标题: 扩散去偏见：用于模型去偏见的合成偏见放大

摘要: 深度学习模型在分类任务中的有效性常常受到训练数据的质量和数量的挑战，特别是当它们受到特定属性和目标标签之间存在强烈虚假相关性的影响时。这导致了一种影响训练数据的偏见，通常会导致预测中无法恢复的弱泛化。本文旨在通过利用生成的合成数据来解决这个问题：我们引入了Diffusing DeBias（DDB），这是一种新颖的方法，作为常见的无监督模型去偏方法的插件，利用扩散模型在数据生成中的固有偏见学习倾向。具体而言，我们的方法采用条件扩散模型生成合成的偏见对齐图像，这些图像替换原始训练集，用于学习一个有效的偏见放大器模型，然后将其整合到端到端和两步无监督去偏方法中。通过解决与此类技术典型的辅助模型学习中的偏见冲突训练样本记忆的根本问题，我们提出的方法在多个基准数据集中超越了当前的最新技术水平，展示了其作为解决深度学习模型中偏见的多功能有效工具的潜力。

更新时间: 2025-03-09 18:41:50

领域: cs.LG,cs.CV,I.4; I.5

下载: http://arxiv.org/abs/2502.09564v3

Function-Space MCMC for Bayesian Wide Neural Networks

Bayesian Neural Networks represent a fascinating confluence of deep learning and probabilistic reasoning, offering a compelling framework for understanding uncertainty in complex predictive models. In this paper, we investigate the use of the preconditioned Crank-Nicolson algorithm and its Langevin version to sample from a reparametrised posterior distribution of the neural network's weights, as the widths grow larger. In addition to being robust in the infinite-dimensional setting, we prove that the acceptance probabilities of the proposed algorithms approach 1 as the width of the network increases, independently of any stepsize tuning. Moreover, we examine and compare how the mixing speeds of the underdamped Langevin Monte Carlo, the preconditioned Crank-Nicolson and the preconditioned Crank-Nicolson Langevin samplers are influenced by changes in the network width in some real-world cases. Our findings suggest that, in wide Bayesian Neural Networks configurations, the preconditioned Crank-Nicolson algorithm allows for a scalable and more efficient sampling of the reparametrised posterior distribution, as also evidenced by a higher effective sample size and improved diagnostic results compared with the other analysed algorithms.
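
The preconditioned Crank-Nicolson (pCN) sampler named in the abstract above can be sketched in its textbook form. The sketch assumes a standard Gaussian prior and a user-supplied negative log-likelihood, and it does not reproduce the paper's reparametrised posterior or its Langevin variant. The callback name `neg_log_likelihood` and the step size `beta` are hypothetical.

```python
import math
import random


def pcn_sample(neg_log_likelihood, dim, n_steps, beta=0.2, seed=0):
    """Textbook pCN for a target proportional to exp(-Phi(x)) times a N(0, I) prior.

    The proposal sqrt(1 - beta^2) * x + beta * xi leaves the prior invariant,
    so the acceptance ratio involves only the likelihood term Phi.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    phi_x = neg_log_likelihood(x)
    shrink = math.sqrt(1.0 - beta ** 2)
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = [shrink * xi + beta * rng.gauss(0.0, 1.0) for xi in x]
        phi_prop = neg_log_likelihood(proposal)
        log_accept = phi_x - phi_prop  # log of min(1, exp(Phi(x) - Phi(x')))
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x, phi_x = proposal, phi_prop
            accepted += 1
        samples.append(list(x))
    return samples, accepted / n_steps
```

Because the prior cancels in the acceptance ratio, a flat likelihood gives an acceptance rate of exactly 1 in any dimension. That is the mechanism behind pCN's robustness in the infinite-dimensional setting, and it gives some intuition for the abstract's result that acceptance probabilities approach 1 as the network widens.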

Updated: 2025-03-09 18:32:27

标题: Bayesian宽神经网络的函数空间MCMC

摘要: 贝叶斯神经网络代表了深度学习和概率推理的迷人交汇，为理解复杂预测模型中的不确定性提供了引人入胜的框架。在本文中，我们研究了预处理的Crank-Nicolson算法及其Langevin版本在神经网络权重的重参数化后分布中进行抽样，随着宽度增大。除了在无限维设置中具有鲁棒性外，我们证明了所提出算法的接受概率在网络宽度增加时逼近1，独立于任何步长调整。此外，我们研究并比较了欠阻尼Langevin蒙特卡洛、预处理的Crank-Nicolson和预处理的Crank-Nicolson Langevin抽样器在一些实际案例中如何受网络宽度变化影响。我们的发现表明，在宽贝叶斯神经网络配置中，预处理的Crank-Nicolson算法允许对重参数化后分布进行可扩展且更有效的抽样，与其他分析算法相比，还表现出更高的有效样本量和改进的诊断结果。

更新时间: 2025-03-09 18:32:27

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2408.14325v4

ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models

The use of Large Language Models (LLMs) in climate science has recently gained significant attention. However, a critical issue remains: the lack of a comprehensive evaluation framework capable of assessing the quality and scientific validity of model outputs. To address this issue, we develop ClimaGen (Climate QA Generator), an adaptive learning framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. As a result, we present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science. Finally, we develop evaluation strategies and compare different LLMs on our benchmarks. Our results offer novel insights into various approaches used to enhance knowledge of climate LLMs. The source code is publicly available at https://github.com/Rose-STL-Lab/genie-climaqa

Updated: 2025-03-09 18:31:12

标题: ClimaQA：气候问答模型的自动评估框架

摘要: 最近，大型语言模型(LLMs)在气候科学中的应用引起了广泛关注。然而，一个关键问题仍然存在：缺乏一个全面的评估框架，能够评估模型输出的质量和科学有效性。为了解决这个问题，我们开发了ClimaGen (Climate QA Generator)，这是一个自适应学习框架，可以从气候科学研究生教科书中生成问题-答案对，并让气候科学家参与其中。因此，我们提出了ClimaQA-Gold，这是一个专家注释的基准数据集，以及ClimaQA-Silver，一个大规模、全面的合成气候科学问答数据集。最后，我们开发评估策略，并比较不同LLMs在我们的基准上的表现。我们的结果为增进对气候LLMs知识的各种方法提供了新的见解。源代码可以在https://github.com/Rose-STL-Lab/genie-climaqa上公开获取。

更新时间: 2025-03-09 18:31:12

领域: cs.LG

下载: http://arxiv.org/abs/2410.16701v2

Imagine-2-Drive: Leveraging High-Fidelity World Models via Multi-Modal Diffusion Policies

World Model-based Reinforcement Learning (WMRL) enables sample efficient policy learning by reducing the need for online interactions which can potentially be costly and unsafe, especially for autonomous driving. However, existing world models often suffer from low prediction fidelity and compounding one-step errors, leading to policy degradation over long horizons. Additionally, traditional RL policies, often deterministic or single Gaussian-based, fail to capture the multi-modal nature of decision-making in complex driving scenarios. To address these challenges, we propose Imagine-2-Drive, a novel WMRL framework that integrates a high-fidelity world model with a multi-modal diffusion-based policy actor. It consists of two key components: DiffDreamer, a diffusion-based world model that generates future observations simultaneously, mitigating error accumulation, and DPA (Diffusion Policy Actor), a diffusion-based policy that models diverse and multi-modal trajectory distributions. By training DPA within DiffDreamer, our method enables robust policy learning with minimal online interactions. We evaluate our method in CARLA using standard driving benchmarks and demonstrate that it outperforms prior world model baselines, improving Route Completion and Success Rate by 15% and 20% respectively.

Updated: 2025-03-09 18:06:08

标题: Imagine-2-Drive：通过多模态扩散策略利用高保真世界模型

摘要: 基于世界模型的强化学习（WMRL）通过减少在线交互的需求，从而实现了高效的策略学习，这可以降低成本和提高安全性，特别是对于自动驾驶。然而，现有的世界模型往往存在低预测精度和累积一步误差的问题，导致长期政策下降。此外，传统的强化学习策略，通常是确定性或基于单个高斯的，无法捕捉复杂驾驶场景中决策的多模态特性。为了解决这些挑战，我们提出了Imagine-2-Drive，一种新颖的WMRL框架，将高精度的世界模型与基于多模态扩散的策略执行器相结合。它包括两个关键组件：DiffDreamer，一个基于扩散的世界模型，同时生成未来观察结果，减少误差累积；以及DPA（扩散策略执行器），一种基于扩散的策略，模拟多样性和多模态的轨迹分布。通过在DiffDreamer中训练DPA，我们的方法实现了在最少在线交互的情况下的强化学习。我们在CARLA中使用标准驾驶基准评估了我们的方法，并展示其优于先前的世界模型基准，分别将路线完成率和成功率提高了15%和20%。

更新时间: 2025-03-09 18:06:08

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2411.10171v2

Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed models and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.09% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/
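
A power-law relationship like the one reported in the abstract above is a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below shows that generic fitting step for a single variable, for example optimal batch size against data size. It contains none of the paper's fitted coefficients, and the function names are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = coeff * x**exponent by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    exponent = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
        / sum((a - mean_x) ** 2 for a in lx)
    )
    coeff = math.exp(mean_y - exponent * mean_x)
    return coeff, exponent


def predict_power_law(x, coeff, exponent):
    """Evaluate the fitted law at a new scale x."""
    return coeff * x ** exponent
```

In practice the inputs would be grid-search optima measured at several data sizes, and the fitted law would then be extrapolated to a larger run. A two-variable law in model size and data size would need a multivariate regression instead.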

Updated: 2025-03-09 17:59:40

标题: 可预测的规模：第一部分--大型语言模型预训练中的最佳超参数缩放定律

摘要: 大型语言模型（LLM）在各种任务中展现出的令人印象深刻的能力已经得到确认，然而它们的有效部署需要仔细的超参数优化。通过对各种配置进行广泛的网格搜索进行了大量的实证研究，我们发现了控制这些超参数的普适性缩放定律：最佳学习率与模型参数和数据大小都遵循幂律关系，而最佳批量大小主要与数据大小相关。我们的分析揭示了在固定模型和数据大小条件下的超参数的凸优化景观。这种凸性意味着存在一个最佳的超参数平台。我们为社区贡献了一个通用的、即插即用的最佳超参数工具。它在测试集上的估计值与通过详尽搜索找到的全局最佳LLM性能仅相差0.09%。这些定律展示了在模型稀疏性、训练数据分布和模型形状变化中的显著稳健性。据我们所知，这是第一个将不同模型形状和结构（如专家混合模型和密集变换器）统一起来，并建立了跨多样数据分布的最佳超参数缩放定律的工作。这一详尽的优化过程需要大量的计算资源，利用了近一百万个英伟达H800 GPU小时来训练3700个不同大小和超参数的LLM，并总共消耗了约100万亿个标记。为了促进可重现性和进一步研究，我们将逐步通过我们指定的存储库 https://step-law.github.io/ 发布所有损失测量和模型检查点。

更新时间: 2025-03-09 17:59:40

领域: cs.LG,cs.AI,F.2.2; I.2.7

下载: http://arxiv.org/abs/2503.04715v2

Delusions of Large Language Models

Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high-belief hallucinations: incorrect outputs with abnormally high confidence, which makes them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self-reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval-augmented generation and multi-agent debating. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.

Updated: 2025-03-09 17:59:16

标题: 大型语言模型的错觉

摘要: 大型语言模型通常会生成事实上错误但似乎合理的输出，被称为幻觉。我们确定了一个更为隐匿的现象，LLM妄想，定义为高信念幻觉，即输出不正确但置信度异常高，使其更难检测和缓解。与普通幻觉不同，妄想以低不确定性持续存在，对模型可靠性构成重大挑战。通过在多个问答任务上跨不同模型家族和规模的经验分析，我们展示了妄想是普遍存在且与幻觉有所不同的现象。LLMs在妄想方面表现出更低的诚实度，更难通过微调或自我反思来覆盖。我们将妄想形成与训练动态和数据集噪音联系起来，并探索缓解妄想的策略，如检索增强生成和多智能体辩论。通过系统地调查LLM妄想的性质、普遍性和缓解措施，我们的研究为了解这一现象的根本原因并概述改善模型可靠性的未来方向提供了见解。

更新时间: 2025-03-09 17:59:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.06709v1

Probabilistic Shielding for Safe Reinforcement Learning

In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise its reward must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus have limited scaling. In this paper we present a new, scalable method, which enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known, and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.

Updated: 2025-03-09 17:54:33

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.07671v1

Unifying Self-Supervised Clustering and Energy-Based Models

Self-supervised learning excels at learning representations from large amounts of data. At the same time, generative models offer the complementary property of learning information about the underlying data generation process. In this study, we aim to establish a principled connection between these two paradigms and highlight the benefits of their complementarity. In particular, we perform an analysis of self-supervised learning objectives, elucidating the underlying probabilistic graphical models and presenting a standardized methodology for their derivation from first principles. The analysis suggests a natural means of integrating self-supervised learning with likelihood-based generative models. We instantiate this concept within the realm of cluster-based self-supervised learning and energy models, introducing a lower bound proven to reliably penalize the most important failure modes. Our theoretical findings are substantiated through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, demonstrating that our objective function allows a backbone network to be trained jointly in a discriminative and generative fashion, consequently outperforming existing self-supervised learning strategies in terms of clustering, generation, and out-of-distribution detection performance by a wide margin. We also demonstrate that the solution can be integrated into a neuro-symbolic framework to tackle a simple yet non-trivial instantiation of the symbol grounding problem.

Updated: 2025-03-09 17:47:51

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2401.00873v4

ReynoldsFlow: Exquisite Flow Estimation via Reynolds Transport Theorem

Optical flow is a fundamental technique for motion estimation, widely applied in video stabilization, interpolation, and object tracking. Traditional optical flow estimation methods rely on restrictive assumptions like brightness constancy and slow motion constraints. Recent deep learning-based flow estimation methods require extensive training on large domain-specific datasets, making them computationally demanding. In addition, advances in artificial intelligence (AI) have enabled deep learning models to use optical flow as an important feature for object tracking and motion analysis. Since optical flow is commonly encoded in HSV for visualization, its conversion to RGB for neural network processing is nonlinear and may introduce perceptual distortions. These transformations amplify the sensitivity to estimation errors, potentially affecting the predictive accuracy of the networks. To address these challenges, which affect the performance of downstream network models, we propose Reynolds flow, a novel training-free flow estimation method inspired by the Reynolds transport theorem, offering a principled approach to modeling complex motion dynamics. In addition to the conventional HSV-based visualization of Reynolds flow, we introduce an RGB-encoded representation of Reynolds flow designed to improve flow visualization and feature enhancement for neural networks. We evaluate the effectiveness of Reynolds flow in video-based tasks. Experimental results on three benchmarks (tiny object detection on UAVDB, infrared object detection on Anti-UAV, and pose estimation on GolfDB) demonstrate that networks trained with RGB-encoded Reynolds flow achieve state-of-the-art (SOTA) performance, exhibiting improved robustness and efficiency across all tasks.

Updated: 2025-03-09 17:47:41

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.04500v2

PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts

Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples and a 0.5B model trained on the full dataset can both surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released at https://github.com/KongLongGeFDU/PFDial.

Updated: 2025-03-09 17:43:30

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.06706v1

Generative Distribution Prediction: A Unified Approach to Multimodal Learning

Accurate prediction with multimodal data, encompassing tabular, textual, and visual inputs or outputs, is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic data generation, such as conditional diffusion models, to enhance predictive performance across structured and unstructured modalities. GDP is model-agnostic, compatible with any high-fidelity generative model, and supports transfer learning for domain adaptation. We establish a rigorous theoretical foundation for GDP, providing statistical guarantees on its predictive accuracy when using diffusion models as the generative backbone. By estimating the data-generating distribution and adapting to various loss functions for risk minimization, GDP enables accurate point predictions across multimodal settings. We empirically validate GDP on four supervised learning tasks (tabular data prediction, question answering, image captioning, and adaptive quantile regression), demonstrating its versatility and effectiveness across diverse domains.
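
The step from an estimated distribution to a point prediction can be sketched as empirical risk minimization over generated samples. The snippet below is an illustration under assumptions, not the authors' implementation: the `samples` array stands in for draws from a conditional generative model, and `point_prediction` is a hypothetical helper covering three standard losses.

```python
import numpy as np


def point_prediction(samples: np.ndarray, loss: str = "squared", tau: float = 0.5) -> float:
    """Pick the prediction that minimizes empirical risk over synthetic responses for one input."""
    if loss == "squared":
        return float(np.mean(samples))            # the mean minimizes mean squared error
    if loss == "absolute":
        return float(np.median(samples))          # the median minimizes mean absolute error
    if loss == "pinball":
        # A sample quantile (approximately) minimizes the pinball loss at level tau.
        return float(np.quantile(samples, tau))
    raise ValueError(f"unknown loss: {loss}")


# Stand-in for draws from a conditional generative model (hypothetical data).
samples = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
pred_mean = point_prediction(samples, "squared")
pred_median = point_prediction(samples, "absolute")
pred_q75 = point_prediction(samples, "pinball", tau=0.75)
```

The same set of generated samples serves every loss, which is one way to read the claim that the approach adapts to the loss function without retraining the generator.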

Updated: 2025-03-09 17:40:18

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.07090v2

Neural Snowflakes: Universal Latent Graph Inference via Trainable Latent Geometries

The inductive bias of a graph neural network (GNN) is largely encoded in its specified graph. Latent graph inference relies on latent geometric representations to dynamically rewire or infer a GNN's graph to maximize the GNN's predictive downstream performance, but it lacks solid theoretical foundations in terms of embedding-based representation guarantees. This paper addresses this issue by introducing a trainable deep learning architecture, coined neural snowflake, that can adaptively implement fractal-like metrics on $\mathbb{R}^d$. We prove that any given finite weighted graph can be isometrically embedded by a standard MLP encoder. Furthermore, when the latent graph can be represented in the feature space of a sufficiently regular kernel, we show that the combined neural snowflake and MLP encoder do not succumb to the curse of dimensionality, using only a low-degree polynomial number of parameters in the number of nodes. This implementation enables a low-dimensional isometric embedding of the latent graph. We conduct synthetic experiments to demonstrate the superior metric learning capabilities of neural snowflakes when compared to more familiar spaces like Euclidean space. Additionally, we carry out latent graph inference experiments on graph benchmarks. Consistently, the neural snowflake model achieves predictive performance that either matches or surpasses that of the state-of-the-art latent graph inference models. Importantly, this performance improvement is achieved without requiring random search for optimal latent geometry. Instead, the neural snowflake model achieves this enhancement in a differentiable manner.

Updated: 2025-03-09 17:34:50

Categories: cs.LG,cs.DM,cs.NE,math.MG

Download: http://arxiv.org/abs/2310.15003v2

Precise Insulin Delivery for Artificial Pancreas: A Reinforcement Learning Optimized Adaptive Fuzzy Control Approach

This paper explores the application of reinforcement learning to optimize the parameters of a Type-1 Takagi-Sugeno fuzzy controller, designed to operate as an artificial pancreas for Type 1 diabetes. The primary challenge in diabetes management is the dynamic nature of blood glucose levels, which are influenced by several factors such as meal intake and timing. Traditional controllers often struggle to adapt to these changes, leading to suboptimal insulin administration. To address this issue, we employ a reinforcement learning agent tasked with adjusting 27 parameters of the Takagi-Sugeno fuzzy controller at each time step, ensuring real-time adaptability. The study's findings demonstrate that this approach significantly enhances the robustness of the controller against variations in meal size and timing, while also stabilizing glucose levels with minimal exogenous insulin. This adaptive method holds promise for improving the quality of life and health outcomes for individuals with Type 1 diabetes by providing a more responsive and precise management tool. Simulation results are given to highlight the effectiveness of the proposed approach.

Updated: 2025-03-09 17:34:09

Categories: eess.SY,cs.LG,cs.SY,math.OC

Download: http://arxiv.org/abs/2503.06701v1

Neural Spacetimes for DAG Representation Learning

We propose a class of trainable deep learning-based geometries called Neural Spacetimes (NSTs), which can universally represent nodes in weighted directed acyclic graphs (DAGs) as events in a spacetime manifold. While most works in the literature focus on undirected graph representation learning or causality embedding separately, our differentiable geometry can encode both graph edge weights in its spatial dimensions and causality in the form of edge directionality in its temporal dimensions. We use a product manifold that combines a quasi-metric (for space) and a partial order (for time). NSTs are implemented as three neural networks trained in an end-to-end manner: an embedding network, which learns to optimize the location of nodes as events in the spacetime manifold, and two other networks that optimize the space and time geometries in parallel, which we call a neural (quasi-)metric and a neural partial order, respectively. The latter two networks leverage recent ideas at the intersection of fractal geometry and deep learning to shape the geometry of the representation space in a data-driven fashion, unlike other works in the literature that use fixed spacetime manifolds such as Minkowski space or De Sitter space to embed DAGs. Our main theoretical guarantee is a universal embedding theorem, showing that any $k$-point DAG can be embedded into an NST with $1+\mathcal{O}(\log(k))$ distortion while exactly preserving its causal structure. The total number of parameters defining the NST is sub-cubic in $k$ and linear in the width of the DAG. If the DAG has a planar Hasse diagram, this is improved to $\mathcal{O}(\log(k)) + 2$ spatial and 2 temporal dimensions. We validate our framework computationally with synthetic weighted DAGs and real-world network embeddings; in both cases, the NSTs achieve lower embedding distortions than their counterparts using fixed spacetime geometries.

Updated: 2025-03-09 17:33:35

Categories: cs.LG,cs.DM,cs.NE,math.MG,stat.ML

Download: http://arxiv.org/abs/2408.13885v2

Unsupervised Multi-Clustering and Decision-Making Strategies for 4D-STEM Orientation Mapping

This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-Component Loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.
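
As a rough illustration of choosing the number of NMF components, the sketch below fits NMF for several values of k and records the relative reconstruction error. It is a stand-in under assumptions: the abstract does not define the K-Component Loss or the IQA metrics, so plain Frobenius reconstruction error is used instead, and the `nmf` helper is a generic multiplicative-update implementation rather than the authors' code.

```python
import numpy as np


def nmf(x: np.ndarray, k: int, iters: int = 300, seed: int = 0):
    """Generic multiplicative-update NMF that reduces ||x - w @ h||_F."""
    rng = np.random.default_rng(seed)
    w = rng.random((x.shape[0], k)) + 1e-3
    h = rng.random((k, x.shape[1])) + 1e-3
    eps = 1e-12  # guards against division by zero
    for _ in range(iters):
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h


def reconstruction_errors(x: np.ndarray, ks) -> dict:
    """Relative reconstruction error for each candidate number of components."""
    errors = {}
    for k in ks:
        w, h = nmf(x, k)
        errors[k] = float(np.linalg.norm(x - w @ h) / np.linalg.norm(x))
    return errors


# Synthetic data with two non-overlapping components (illustrative only).
x = np.zeros((20, 20))
x[:10, :10] = 1.0
x[10:, 10:] = 2.0
errors = reconstruction_errors(x, ks=[1, 2, 3])
```

On data like this the error drops sharply from k = 1 to k = 2 and gains little afterwards, which is the kind of fidelity-versus-complexity trade-off the selection framework is meant to balance.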

Updated: 2025-03-09 17:31:57

Categories: cs.LG,cs.CV,eess.IV

Download: http://arxiv.org/abs/2503.06699v1

What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

Domain Generalization aims to develop models that can generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to effectively leverage them for domain generalization. Specifically, given a pre-trained feature space, we first discover latent domain structures, referred to as pseudo-domains, that capture domain-specific variations in an unsupervised manner. Next, we augment existing classifiers with these complementary pseudo-domain representations, making them more amenable to diverse unseen test domains. We analyze how different pre-training feature spaces differ in the domain-specific variances they capture. Our empirical studies reveal that features from diffusion models excel at separating domains in the absence of explicit domain labels and capture nuanced domain-specific information. On 5 datasets, we show that our very simple framework improves generalization to unseen domains, with a maximum test accuracy improvement of over 4% compared to the standard baseline, Empirical Risk Minimization (ERM). Crucially, our method outperforms most algorithms that access domain labels during training.

Updated: 2025-03-09 17:29:01

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.06698v1

InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
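
The iterative scheme described above can be sketched as a loop in which each round sees only the question and the latest summary. This is a minimal sketch under assumptions: `step` is a hypothetical model interface returning a reasoning segment, an updated summary, and an optional final answer, and `toy_step` is a toy stand-in rather than a real language model.

```python
from typing import Callable, Optional, Tuple

# Hypothetical model interface: prompt -> (reasoning segment, new summary, final answer or None).
StepFn = Callable[[str], Tuple[str, str, Optional[str]]]


def iterative_reasoning(question: str, step: StepFn, max_rounds: int = 8) -> Tuple[Optional[str], int]:
    """Alternate short reasoning segments with summaries; only the latest summary is carried over."""
    summary = ""
    peak_prompt_len = 0
    for _ in range(max_rounds):
        prompt = f"Question: {question}\nProgress so far: {summary}\nContinue reasoning."
        peak_prompt_len = max(peak_prompt_len, len(prompt))
        # The long segment is dropped once it has been summarized, so the context stays short.
        _segment, summary, answer = step(prompt)
        if answer is not None:
            return answer, peak_prompt_len
    return None, peak_prompt_len


def toy_step(prompt: str) -> Tuple[str, str, Optional[str]]:
    """Toy stand-in for a model: tracks rounds through the summary and answers on round three."""
    done = prompt.count("step")  # completed rounds recorded in the summary
    segment = f"long reasoning text for round {done + 1} " * 50
    new_summary = " ".join(["step"] * (done + 1))
    return segment, new_summary, ("42" if done + 1 == 3 else None)


answer, peak = iterative_reasoning("What is 6 * 7?", toy_step)
```

Because each segment is discarded after summarization, the prompt length rises within a round and falls back at the summary, which corresponds to the sawtooth memory pattern mentioned in the abstract.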

Updated: 2025-03-09 16:59:14

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.06692v1

LegalCore: A Dataset for Event Coreference Resolution in Legal Documents

Recognizing events and their coreferential mentions in a document is essential for understanding semantic meanings of text. The existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution tasks, and find that this dataset poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform significantly worse than a supervised baseline. We will publish the dataset as well as the code.

Updated: 2025-03-09 16:53:11

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2502.12509v3

Censoring-Aware Tree-Based Reinforcement Learning for Estimating Dynamic Treatment Regimes with Censored Outcomes

Dynamic Treatment Regimes (DTRs) provide a systematic approach for making sequential treatment decisions that adapt to individual patient characteristics, particularly in clinical contexts where survival outcomes are of interest. Censoring-Aware Tree-Based Reinforcement Learning (CA-TRL) is a novel framework to address the complexities associated with censored data when estimating optimal DTRs. We explore ways to learn effective DTRs from observational data. By enhancing traditional tree-based reinforcement learning methods with augmented inverse probability weighting (AIPW) and censoring-aware modifications, CA-TRL delivers robust and interpretable treatment strategies. We demonstrate its effectiveness through extensive simulations and real-world applications using the SANAD epilepsy dataset, where it outperformed the recently proposed ASCL method in key metrics such as restricted mean survival time (RMST) and decision-making accuracy. This work represents a step forward in advancing personalized and data-driven treatment strategies across diverse healthcare settings.
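
For reference, the restricted mean survival time used as an evaluation metric has a standard definition, given here as background rather than taken from the abstract. For a survival time $T$ with survival function $S(t) = P(T > t)$ and a truncation horizon $\tau$ (notation assumed here):

$$\mathrm{RMST}(\tau) = \mathbb{E}[\min(T, \tau)] = \int_0^{\tau} S(t)\,dt.$$

A regime with a larger RMST keeps patients event-free for longer, on average, within the first $\tau$ time units.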

Updated: 2025-03-09 16:53:09

Categories: cs.LG,cs.AI,stat.ME

Download: http://arxiv.org/abs/2503.06690v1

Robust Clustering on High-Dimensional Data with Stochastic Quantization

This paper addresses the limitations of conventional vector quantization algorithms, particularly K-Means and its variant K-Means++, and investigates the Stochastic Quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning tasks. Traditional clustering algorithms often suffer from inefficient memory utilization during computation, necessitating the loading of all data samples into memory, which becomes impractical for large-scale datasets. While variants such as Mini-Batch K-Means partially mitigate this issue by reducing memory usage, they lack robust theoretical convergence guarantees due to the non-convex nature of clustering problems. In contrast, the Stochastic Quantization algorithm provides strong theoretical convergence guarantees, making it a robust alternative for clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data, comparing model accuracy across various ratios of labeled to unlabeled data. To address the challenge of high dimensionality, we employ a Triplet Network to encode images into low-dimensional representations in a latent space, which serve as a basis for comparing the efficiency of both the Stochastic Quantization algorithm and traditional quantization algorithms. Furthermore, we enhance the algorithm's convergence speed by introducing modifications with an adaptive learning rate.
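
The memory argument can be illustrated with a stochastic-gradient update that touches one sample at a time. This is a generic sketch under assumptions, not the paper's algorithm: it uses squared Euclidean distance, a fixed learning rate `lr`, and synthetic data, whereas the abstract specifies neither the distance nor the step-size schedule.

```python
import numpy as np


def stochastic_quantization(stream, centers: np.ndarray, lr: float = 0.05) -> np.ndarray:
    """One pass of SGD on E[min_k ||y_k - x||^2]; only one sample is held in memory at a time."""
    centers = centers.astype(float).copy()
    for x in stream:
        k = int(np.argmin(np.sum((centers - x) ** 2, axis=1)))  # index of the nearest center
        centers[k] -= lr * 2.0 * (centers[k] - x)                # gradient step on ||y_k - x||^2
    return centers


rng = np.random.default_rng(0)
# Two synthetic, well-separated clusters around (0, 0) and (5, 5) (illustrative data).
data = np.concatenate([rng.normal(0.0, 0.3, (500, 2)), rng.normal(5.0, 0.3, (500, 2))])
rng.shuffle(data)
init = np.array([[1.0, 1.0], [4.0, 4.0]])
centers = stochastic_quantization(iter(data), init)
```

The function accepts any iterator, so samples could equally be read from disk one by one; an adaptive learning rate, as mentioned at the end of the abstract, would replace the fixed `lr`.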

Updated: 2025-03-09 16:53:00

Categories: cs.LG,math.OC,90C15

Download: http://arxiv.org/abs/2409.02066v5

UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion

Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation.

Updated: 2025-03-09 16:43:07

Categories: cs.LG,cond-mat.mtrl-sci,cs.AI,physics.bio-ph,physics.chem-ph

Download: http://arxiv.org/abs/2503.06687v1

IDInit: A Universal and Stable Initialization Method for Neural Network Training

Deep neural networks have achieved remarkable accomplishments in practice. The success of these networks hinges on effective initialization methods, which are vital for ensuring stable and rapid convergence during training. Recently, initialization methods that maintain identity transition within layers have shown good efficiency in network training. These techniques (e.g., Fixup) set specific weights to zero to achieve identity control. However, how the remaining weights are set (e.g., Fixup initializes non-zero weights with random values) affects the inductive bias that only zero weights achieve, which may be harmful to training. Addressing this concern, we introduce fully identical initialization (IDInit), a novel method that preserves identity in both the main and sub-stem layers of residual networks. IDInit employs a padded identity-like matrix to overcome rank constraints in non-square weight matrices. Furthermore, we show that the convergence problem of an identity matrix can be solved by stochastic gradient descent. Additionally, we enhance the universality of IDInit by processing higher-order weights and addressing dead neuron problems. IDInit is a straightforward yet effective initialization method, with improved convergence, stability, and performance across various settings, including large-scale datasets and deep models.
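
One plausible reading of a "padded identity-like matrix" for a non-square layer is to repeat the identity pattern instead of padding with zeros. The sketch below follows that reading; the `identity_like` helper and its cyclic construction are assumptions for illustration, and the paper's exact construction may differ.

```python
import numpy as np


def identity_like(out_dim: int, in_dim: int) -> np.ndarray:
    """Weight matrix whose entry (i, j) is 1 when i and j agree modulo min(out_dim, in_dim)."""
    m = min(out_dim, in_dim)
    rows = np.arange(out_dim)[:, None] % m
    cols = np.arange(in_dim)[None, :] % m
    return (rows == cols).astype(np.float32)


W = identity_like(4, 2)                       # two stacked 2 x 2 identity blocks
x = np.array([3.0, -1.0], dtype=np.float32)
y = W @ x                                     # every output unit copies one input coordinate
```

With zero padding, the last two rows of `W` would be all zeros and the corresponding units would start with no signal; repeating the identity avoids that while reducing to the ordinary identity when the matrix is square.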

Updated: 2025-03-09 16:31:31

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.04626v2

Small Vision-Language Models: A Survey on Compact Architectures and Techniques

The emergence of small vision-language models (sVLMs) marks a critical advancement in multimodal AI, enabling efficient processing of visual and textual data in resource-constrained environments. This survey offers a comprehensive exploration of sVLM development, presenting a taxonomy of architectures (transformer-based, Mamba-based, and hybrid) that highlights innovations in compact design and computational efficiency. Techniques such as knowledge distillation, lightweight attention mechanisms, and modality pre-fusion are discussed as enablers of high performance with reduced resource requirements. Through an in-depth analysis of models like TinyGPT-V, MiniGPT-4, and VL-Mamba, we identify trade-offs between accuracy, efficiency, and scalability. Persistent challenges, including data biases and generalization to complex tasks, are critically examined, with proposed pathways for addressing them. By consolidating advancements in sVLMs, this work underscores their transformative potential for accessible AI, setting a foundation for future research into efficient multimodal systems.

Updated: 2025-03-09 16:14:46

Categories: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.10665v1

Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform

With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower these costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate a quantization bit-width to it accordingly. Afterwards, we (c) convert these patches to the DCT domain and quantize each patch at its allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing the original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models, including: (1) recently released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
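
Steps (a) to (c) of the Delta-DCT abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the patch size, the L2-norm importance score, the rank-based bit allocation, and the uniform quantizer are hypothetical choices, and the function names are not taken from the paper.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x transforms x, C.T inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c


def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantizer (assumed); bits == 0 drops the patch."""
    if bits == 0:
        return np.zeros_like(x)
    scale = np.abs(x).max()
    if scale == 0:
        return x.copy()
    levels = 2 ** bits - 1
    q = np.round((x / scale + 1.0) / 2.0 * levels)  # integers in [0, levels]
    return (q / levels * 2.0 - 1.0) * scale


def compress_delta(delta, patch=8, bit_choices=(0, 1, 2)):
    """Return the de-quantized delta after patch-wise DCT quantization."""
    rows, cols = delta.shape
    assert rows % patch == 0 and cols % patch == 0
    c = dct_matrix(patch)
    # (a) group the delta matrix into patch x patch blocks
    blocks = delta.reshape(rows // patch, patch, cols // patch, patch).swapaxes(1, 2)
    # (b) importance = L2 norm (assumed); rank patches into bit-width groups
    norms = np.linalg.norm(blocks, axis=(2, 3))
    ranks = norms.ravel().argsort().argsort().reshape(norms.shape)
    group = ranks * len(bit_choices) // ranks.size
    out = np.empty_like(blocks)
    for r in range(blocks.shape[0]):
        for s in range(blocks.shape[1]):
            bits = bit_choices[group[r, s]]
            # (c) 2D DCT, quantize in the DCT domain, then invert
            coeff = c @ blocks[r, s] @ c.T
            out[r, s] = c.T @ quantize(coeff, bits) @ c
    return out.swapaxes(1, 2).reshape(rows, cols)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    delta = rng.normal(scale=0.01, size=(16, 32))
    approx = compress_delta(delta)
    print(np.abs(approx - delta).mean())
```

No data or training is involved in this sketch: the only inputs are the delta weights themselves, which is the sense in which such a scheme is data-free.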

Updated: 2025-03-09 16:03:48

Categories: cs.CV,cs.LG,eess.IV

Download: http://arxiv.org/abs/2503.06676v1

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on this data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, in both in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming the prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.

Updated: 2025-03-09 15:40:29

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.06669v1

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This innovative approach enables Transformer models to more effectively learn and reason across modalities. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Leveraging this method, we develop Being-VL-0, a model that demonstrates superior performance across various benchmarks and shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.
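
To make the principle of applying Byte-Pair Encoding to quantized visual data concrete, here is a minimal sketch that runs greedy BPE merges over sequences of integer codes. It assumes each image has already been quantized and flattened into a 1D list of code indices, which is a simplification; the paper's tokenizer and its handling of 2D structure are not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def _apply_merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_merge(sequences, num_merges, first_new_id):
    """Greedy BPE over integer token sequences (e.g., flattened VQ codes).

    Returns the re-tokenized sequences and the learned merge table.
    """
    seqs = [list(s) for s in sequences]
    merges = {}
    for step in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left that repeats
            break
        new_id = first_new_id + step
        merges[best] = new_id
        seqs = [_apply_merge(s, best, new_id) for s in seqs]
    return seqs, merges


if __name__ == "__main__":
    codes = [[1, 2, 1, 2, 3], [1, 2, 3]]
    print(bpe_merge(codes, num_merges=1, first_new_id=100))
```

Frequently co-occurring codes are fused into single tokens, so recurring visual patterns end up represented by one token instead of several.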

Updated: 2025-03-09 15:36:53

Categories: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2410.02155v3

Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources that can severely degrade model performance. Detecting and correcting these issues typically require tailor-made solutions and demand extensive domain expertise. Consequently, automation is challenging, rendering the process labor-intensive and tedious. In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning. We set up an experiment in which an LLM, paired with Python, is tasked with cleaning the training dataset to improve the performance of a learning algorithm without having the ability to modify the training pipeline or perform any feature engineering. We run this experiment on multiple Kaggle datasets that have been intentionally corrupted with errors. Our results show that LLMs can identify and correct erroneous entries, such as illogical values or outliers, by leveraging contextual information from other features within the same row, as well as feedback from previous iterations. However, they struggle to detect more complex errors that require understanding data distribution across multiple rows, such as trends and biases.

Updated: 2025-03-09 15:29:46

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06664v1

Variational Entropy Search for Adjusting Expected Improvement

Bayesian optimization is a widely used technique for optimizing black-box functions, with Expected Improvement (EI) being the most commonly utilized acquisition function in this domain. While EI is often viewed as distinct from other information-theoretic acquisition functions, such as entropy search (ES) and max-value entropy search (MES), our work reveals that EI can be considered a special case of MES when approached through variational inference (VI). In this context, we have developed the Variational Entropy Search (VES) methodology and the VES-Gamma algorithm, which adapts EI by incorporating principles from information-theoretic concepts. The efficacy of VES-Gamma is demonstrated across a variety of test functions and real-world datasets, highlighting its theoretical and practical utility in Bayesian optimization scenarios.
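
For readers unfamiliar with the acquisition function being adjusted, the standard closed form of Expected Improvement under a Gaussian posterior is sketched below. This is the textbook EI for maximization, not the VES-Gamma algorithm itself; the function and argument names are chosen here for illustration.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for maximization under a posterior N(mu, sigma^2).

    EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma,
    where Phi and phi are the standard normal CDF and PDF.
    """
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


if __name__ == "__main__":
    print(expected_improvement(mu=0.0, sigma=1.0, best=0.0))
```

The first term rewards points whose predicted mean already exceeds the incumbent, while the second rewards posterior uncertainty.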

Updated: 2025-03-09 15:29:40

Categories: stat.ML,cs.LG,math.OC

Download: http://arxiv.org/abs/2402.11345v2

AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP

Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved through a straightforward yet effective two-stage approach: it first creates anomaly-aware text anchors to differentiate normal and abnormal semantics clearly, then aligns patch-level visual features with these anchors for precise anomaly localization. This two-stage strategy, with the help of residual adapters, gradually adapts CLIP in a controlled manner, achieving effective AD while maintaining CLIP's class knowledge. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. The code is available at https://github.com/Mwxinnn/AA-CLIP.
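
The second stage described above, aligning patch-level visual features with normal and abnormal text anchors, suggests a simple scoring rule that can be sketched as follows. The cosine-similarity softmax, the temperature value, and all names here are assumptions for illustration; the released AA-CLIP code may differ.

```python
import numpy as np


def anomaly_map(patch_feats, normal_anchor, abnormal_anchor, temperature=0.07):
    """Per-patch probability of the 'abnormal' anchor.

    patch_feats: array of shape (h, w, d) with patch-level visual features.
    normal_anchor, abnormal_anchor: text anchor embeddings of shape (d,).
    temperature: softmax temperature (assumed value, not from the paper).
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    anchors = unit(np.stack([normal_anchor, abnormal_anchor]))  # (2, d)
    logits = unit(patch_feats) @ anchors.T / temperature        # (h, w, 2)
    logits -= logits.max(axis=-1, keepdims=True)                # stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[..., 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 4, 16))
    print(anomaly_map(feats, rng.normal(size=16), rng.normal(size=16)).shape)
```

The resulting map has one score per patch, which is what allows localization rather than only an image-level decision.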

Updated: 2025-03-09 15:22:52

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06661v1

A Causal World Model Underlying Next Token Prediction in GPT

Are generative pre-trained transformer (GPT) models only trained to predict the next token, or do they implicitly learn a world model from which a sequence is generated one token at a time? We examine this question by deriving a causal interpretation of the attention mechanism in GPT, and suggesting a causal world model that arises from this interpretation. Furthermore, we propose that GPT-models, at inference time, can be utilized for zero-shot causal structure learning for in-distribution sequences. Empirical evaluation is conducted in a controlled synthetic environment using the setup and rules of the Othello board game. A GPT, pre-trained on real-world games played with the intention of winning, is tested on synthetic data that only adheres to the game rules, oblivious to the goal of winning. We find that the GPT model is likely to generate moves that adhere to the game rules for sequences for which a causal structure is encoded in the attention mechanism with high confidence. In general, in cases for which the GPT model generates moves that do not adhere to the game rules, it also fails to capture any causal structure.

Updated: 2025-03-09 15:02:01

Categories: cs.AI,cs.CL,cs.LG,stat.ML

Download: http://arxiv.org/abs/2412.07446v2

Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training

Standard NLP benchmarks often fail to capture vulnerabilities stemming from dataset artifacts and spurious correlations. Contrast sets address this gap by challenging models near decision boundaries but are traditionally labor-intensive to create and limited in diversity. This study leverages large language models to automate the generation of diverse contrast sets. Using the SNLI dataset, we created a 3,000-example contrast set to evaluate and improve model robustness. Fine-tuning on these contrast sets enhanced performance on systematically perturbed examples, maintained standard test accuracy, and modestly improved generalization to novel perturbations. This automated approach offers a scalable solution for evaluating and improving NLP models, addressing systematic generalization challenges, and advancing robustness in real-world applications.

Updated: 2025-03-09 14:52:53

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.06648v1

Personalized Class Incremental Context-Aware Food Classification for Food Intake Monitoring Systems

Accurate food intake monitoring is crucial for maintaining a healthy diet and preventing nutrition-related diseases. With the diverse range of foods consumed across various cultures, classic food classification models have limitations due to their reliance on fixed-sized food datasets. Studies show that each person consumes only a small subset of the existing foods, and that this subset differs from person to person. Existing class-incremental models have low accuracy for new classes and lack personalization. This paper introduces a personalized, class-incremental food classification model designed to overcome these challenges and improve the performance of food intake monitoring systems. Our approach adapts to new food classes and uses personalization to maintain applicability and accuracy for both new and existing classes. Our model's primary focus is personalization, which improves classification accuracy by prioritizing a subset of foods based on an individual's eating habits, including meal frequency, times, and locations. A modified version of DSN is used to expand the model as new food classes appear. Additionally, we propose a comprehensive framework that integrates this model into a food intake monitoring system. This system analyzes meal images provided by users, uses a smart scale to estimate food weight, uses a nutrient content database to calculate the amount of each macro-nutrient, and creates a dietary user profile through a mobile application. Finally, experimental evaluations on two new benchmark datasets, FOOD101-Personal and VFN-Personal, which are personalized versions of well-known food classification datasets, demonstrate the effectiveness of our model in improving the classification accuracy of both new and existing classes, addressing the limitations of both conventional and class-incremental food classification models.

Updated: 2025-03-09 14:50:56

Categories: cs.CV,cs.CE,cs.LG

Download: http://arxiv.org/abs/2503.06647v1

Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

Group Relative Policy Optimization (GRPO) was introduced and used successfully to train DeepSeek R1 models for promoting reasoning capabilities of LLMs using verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback-Leibler ($\mathsf{KL}$) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_{n}$ can be expressed explicitly in terms of the binary reward, as well as the first and second order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_0$. Iterating this scheme, we obtain a sequence of policies $\pi_{n}$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function that depends on the initial probability of success $p_0$ and the regularization parameter $\beta$ of the $\mathsf{KL}$ regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_0$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
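
The role of the first and second order statistics can be seen in a small sketch of group-normalized advantages for binary rewards. It assumes the common GRPO normalization (subtract the group mean, divide by the group standard deviation, here the population standard deviation); the function name and the handling of the degenerate all-equal case are choices made for this illustration.

```python
import math


def grpo_binary_advantages(rewards):
    """Group-normalized advantages for binary (0/1) rewards.

    With success rate p in the group, the mean is p and the population
    standard deviation is sqrt(p * (1 - p)), so a success receives
    sqrt((1 - p) / p) and a failure receives -sqrt(p / (1 - p)).
    """
    n = len(rewards)
    p = sum(rewards) / n
    std = math.sqrt(p * (1.0 - p))
    if std == 0.0:  # all successes or all failures: no learning signal
        return [0.0] * n
    return [(r - p) / std for r in rewards]


if __name__ == "__main__":
    print(grpo_binary_advantages([1, 0, 0, 0]))
```

When successes are rare (small p), each success receives a large positive weight, which gives some intuition for why iterating the update can raise the success probability.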

Updated: 2025-03-09 14:36:45

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.06639v1

Treatment Effect Estimation for Graph-Structured Targets

Treatment effect estimation, which helps understand the causality between a treatment and an outcome variable, is a central task in decision-making across various domains. While most studies focus on treatment effect estimation on individual targets, some contexts require understanding the treatment effect on a group of targets, especially targets whose relationships form a graph structure. In such cases, treatment assignment tends to depend on a particular node of the graph, such as the one with the highest degree, resulting in an observational bias that originates from a small part of the entire graph. Although the bias tends to be caused by this small part, straightforward extensions of previous studies use the information of the entire graph and therefore cannot mitigate the bias efficiently. In this study, we propose Graph-target Treatment Effect Estimation (GraphTEE), a framework designed to estimate treatment effects specifically on graph-structured targets. GraphTEE aims to mitigate observational bias by focusing on confounding variable sets and by considering a new regularization framework. Additionally, we provide a theoretical analysis of how GraphTEE performs better in terms of bias mitigation. Experiments on synthetic and semi-synthetic datasets demonstrate the effectiveness of our proposed method.

Updated: 2025-03-09 14:36:33

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2412.20436v2

Deep Cut-informed Graph Embedding and Clustering

Graph clustering aims to divide the graph into different clusters. The recently emerging deep graph clustering approaches are largely built on graph neural networks (GNNs). However, GNNs are designed for general graph encoding, and representation collapse is a common issue in existing GNN-based deep graph clustering algorithms. We attribute this issue to two main causes: (i) the inductive bias of GNN models: GNNs tend to generate similar representations for proximal nodes. Since graphs often contain a non-negligible amount of inter-cluster links, the bias results in erroneous message passing and leads to biased clustering; (ii) the clustering-guided loss function: most traditional approaches strive to make all samples closer to pre-learned cluster centers, which causes a degenerate solution that assigns all data points to a single label and thus makes the samples less discriminative. To address these challenges, we investigate graph clustering from a graph cut perspective and propose an innovative, non-GNN-based Deep Cut-informed Graph embedding and Clustering framework, namely DCGC. This framework includes two modules: (i) cut-informed graph encoding; (ii) self-supervised graph clustering via optimal transport. For the encoding module, we derive a cut-informed graph embedding objective to fuse graph structure and attributes by minimizing their joint normalized cut. For the clustering module, we utilize optimal transport theory to obtain the clustering assignments, which can balance the guidance of proximity to the pre-learned cluster centers. With these two tailored designs, DCGC is better suited to the graph clustering task, effectively alleviating the problem of representation collapse and achieving better performance. We conduct extensive experiments to demonstrate that our method is simple but effective compared with benchmarks.
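
One common way to obtain balanced cluster assignments from optimal transport is the Sinkhorn-Knopp iteration, sketched below. The abstract does not state which solver DCGC uses, so this is an assumed, generic instance of the technique; the function name and default parameters are hypothetical.

```python
import numpy as np


def sinkhorn_assign(scores, epsilon=0.05, n_iters=50):
    """Balanced soft cluster assignments via Sinkhorn-Knopp.

    scores: (n, k) similarities between n samples and k cluster centers.
    Returns an (n, k) matrix whose rows sum to 1 and whose columns each
    carry roughly n / k total mass, which discourages collapse of all
    samples onto a single cluster.
    """
    q = np.exp((scores - scores.max()) / epsilon)
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True) * k  # each column gets mass 1/k
        q /= q.sum(axis=1, keepdims=True) * n  # each row gets mass 1/n
    return q * n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random((12, 3))
    scores[:, 0] += 1.0  # every sample naively prefers cluster 0
    print(sinkhorn_assign(scores, epsilon=0.5, n_iters=200).sum(axis=0))
```

Even when every sample's raw score favors the same center, the equal-mass constraint spreads the assignments across clusters, which is the balancing effect the abstract refers to.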

Updated: 2025-03-09 14:24:09

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06635v1

A Block-Based Heuristic Algorithm for the Three-Dimensional Nuclear Waste Packing Problem

In this study, we present a block-based heuristic search algorithm to address the nuclear waste container packing problem in the context of real-world nuclear power plants. Additionally, we provide a dataset comprising 1600 problem instances for future researchers to use. Experimental results on this dataset demonstrate that the proposed algorithm effectively enhances the disposal pool's space utilization while minimizing the radiation dose within the pool. The code and data employed in this study are publicly available to facilitate reproducibility and further investigation.

Updated: 2025-03-09 14:20:48

Categories: math.OC,cs.AI

Download: http://arxiv.org/abs/2503.08705v1

BTFL: A Bayesian-based Test-Time Generalization Method for Internal and External Data Distributions in Federated learning

Federated Learning (FL) enables multiple clients to collaboratively develop a global model while maintaining data privacy. However, online FL deployment faces challenges due to distribution shifts and evolving test samples. Personalized Federated Learning (PFL) tailors the global model to individual client distributions, but struggles with Out-Of-Distribution (OOD) samples during testing, leading to performance degradation. In real-world scenarios, balancing personalization and generalization during online testing is crucial, yet existing methods primarily focus on training-phase generalization. To address the test-time trade-off, we introduce a new scenario: Test-time Generalization for Internal and External Distributions in Federated Learning (TGFL), which evaluates adaptability under Internal Distribution (IND) and External Distribution (EXD). We propose BTFL, a Bayesian-based test-time generalization method for TGFL, which balances generalization and personalization at the sample level during testing. BTFL employs a two-head architecture to store local and global knowledge, interpolating predictions via a dual-Bayesian framework that considers both historical test data and current sample characteristics, with a theoretical guarantee and faster speed. Our experiments demonstrate that BTFL achieves improved performance across various datasets and models at a lower time cost. The source code is publicly available at https://github.com/ZhouYuCS/BTFL .

Updated: 2025-03-09 14:16:34

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06633v1

Hardware-Accelerated Event-Graph Neural Networks for Low-Latency Time-Series Classification on SoC FPGA

As the quantities of data recorded by embedded edge sensors grow, so too does the need for intelligent local processing. Such data often comes in the form of time-series signals, based on which real-time predictions can be made locally using an AI model. However, a hardware-software approach capable of making low-latency predictions with low power consumption is required. In this paper, we present a hardware implementation of an event-graph neural network for time-series classification. We leverage an artificial cochlea model to convert the input time-series signals into a sparse event-data format that allows the event-graph to drastically reduce the number of calculations relative to other AI methods. We implemented the design on a SoC FPGA and applied it to the real-time processing of the Spiking Heidelberg Digits (SHD) dataset to benchmark our approach against competitive solutions. Our method achieves a floating-point accuracy of 92.7% on the SHD dataset for the base model, which is only 2.4% and 2% below the state-of-the-art models while using over 10% and 67% fewer model parameters than them, respectively. It also outperforms FPGA-based spiking neural network implementations by 19.3% and 4.5%, achieving 92.3% accuracy for the quantised model while using fewer computational resources and reducing latency.

Updated: 2025-03-09 14:08:46

Categories: cs.LG,cs.AI,eess.SP

Download: http://arxiv.org/abs/2503.06629v1

Revisiting Early Detection of Sexual Predators via Turn-level Optimization

Online grooming is a severe social threat in which sexual predators gradually entrap child victims through subtle manipulation. Timely intervention is therefore critical for proactive protection. However, previous methods fail to determine the optimal intervention points (i.e., they jump to conclusions) because they rely on chat-level risk labels, which provide only weak supervision of risky utterances. For timely detection, we propose speed control reinforcement learning (SCoRL) (the code and supplementary materials are available at https://github.com/jinmyeongAN/SCoRL), incorporating a practical strategy derived from luring communication theory (LCT). To capture the predator's turn-level entrapment, we use a turn-level risk label based on the LCT. Then, we design a novel speed control reward function that balances the trade-off between speed and accuracy based on the turn-level risk label; thus, SCoRL can identify the optimal intervention moment. In addition, we introduce a turn-level metric for precise evaluation, identifying limitations in previously used chat-level metrics. Experimental results show that SCoRL effectively preempts online grooming, offering a more proactive and timely solution. Further analysis reveals that our method enhances performance while intuitively identifying optimal early intervention points.

Updated: 2025-03-09 14:05:27

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.06627v1

DiffCLIP: Differential Attention Meets CLIP

We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at https://github.com/hammoudhasan/DiffCLIP.
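
The differential attention mechanism mentioned above can be sketched in a few lines: two softmax attention maps are computed and one is subtracted from the other, so that attention mass shared by both (treated as noise) cancels out. This single-head version with a fixed scalar `lam` is a simplified sketch; in the original formulation the subtraction weight is learned, and normalization details are omitted here. All names are illustrative.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Single-head differential attention over tokens x of shape (n, d_model).

    Two attention maps are built from separate query/key projections and
    combined as A1 - lam * A2 before being applied to the values.
    lam is a fixed scalar here (assumption; it is learnable in practice).
    """
    d = wq1.shape[1]
    a1 = softmax((x @ wq1) @ (x @ wk1).T / np.sqrt(d))
    a2 = softmax((x @ wq2) @ (x @ wk2).T / np.sqrt(d))
    return (a1 - lam * a2) @ (x @ wv)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))
    wq1, wk1, wq2, wk2 = (rng.normal(size=(8, 4)) for _ in range(4))
    wv = rng.normal(size=(8, 6))
    print(differential_attention(x, wq1, wk1, wq2, wk2, wv).shape)
```

Setting `lam` to 0 recovers ordinary softmax attention, which makes clear that the extra cost is one additional attention map per head.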

Updated: 2025-03-09 14:04:09

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.06626v1

Synthetic Data Generation for Minimum-Exposure Navigation in a Time-Varying Environment using Generative AI Models

We study the problem of synthetic generation of samples of environmental features for autonomous vehicle navigation. These features are described by a spatiotemporally varying scalar field that we refer to as a threat field. The threat field is known to have some underlying dynamics subject to process noise. Some "real-world" data of observations of various threat fields are also available. The assumption is that the volume of "real-world" data is relatively small. The objective is to synthesize samples that are statistically similar to the data. The proposed solution is a generative artificial intelligence model that we refer to as a split variational recurrent neural network (S-VRNN). The S-VRNN merges the capabilities of a variational autoencoder, which is a widely used generative model, and a recurrent neural network, which is used to learn temporal dependencies in data. The main innovation in this work is that we split the latent space of the S-VRNN into two subspaces. The latent variables in one subspace are learned using the "real-world" data, whereas those in the other subspace are learned using the data as well as the known underlying system dynamics. Through numerical experiments we demonstrate that the proposed S-VRNN can synthesize data that are statistically similar to the training data even in the case of very small volume of "real-world" training data.

Updated: 2025-03-09 13:45:15

标题: 使用生成式人工智能模型生成合成数据，用于在时变环境中进行最小暴露导航

摘要: 我们研究了用于自主车辆导航的环境特征样本的合成生成问题。这些特征由一个我们称之为威胁场的时空变化的标量场描述。已知威胁场具有一些受过程噪声影响的基本动态。还有一些观测到的各种威胁场的“现实世界”数据可用。假设“现实世界”数据的量相对较小。目标是合成与数据在统计上相似的样本。提出的解决方案是一种我们称之为分裂变分递归神经网络（S-VRNN）的生成人工智能模型。S-VRNN合并了变分自动编码器的功能，这是一种广泛使用的生成模型，以及递归神经网络，用于学习数据中的时间依赖关系。这项工作的主要创新在于我们将S-VRNN的潜在空间拆分为两个子空间。一个子空间中的潜在变量是使用“现实世界”数据学习的，而另一个子空间中的潜在变量是使用数据以及已知的基本系统动态学习的。通过数值实验，我们证明了提出的S-VRNN即使在“现实世界”训练数据量非常小的情况下也能合成与训练数据在统计上相似的数据。

更新时间: 2025-03-09 13:45:15

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.06619v1

Using Subgraph GNNs for Node Classification: an Overlooked Potential Approach

Previous studies have demonstrated the strong performance of Graph Neural Networks (GNNs) in node classification. However, most existing GNNs adopt a node-centric perspective and rely on global message passing, leading to high computational and memory costs that hinder scalability. To mitigate these challenges, subgraph-based methods have been introduced, leveraging local subgraphs as approximations of full computational trees. While this approach improves efficiency, it often suffers from performance degradation due to the loss of global contextual information, limiting its effectiveness compared to global GNNs. To address this trade-off between scalability and classification accuracy, we reformulate the node classification task as a subgraph classification problem and propose SubGND (Subgraph GNN for NoDe). This framework introduces a differentiated zero-padding strategy and an Ego-Alter subgraph representation method to resolve label conflicts while incorporating an Adaptive Feature Scaling Mechanism to dynamically adjust feature contributions based on dataset-specific dependencies. Experimental results on six benchmark datasets demonstrate that SubGND achieves performance comparable to or surpassing global message-passing GNNs, particularly in heterophilic settings, highlighting its effectiveness and scalability as a promising solution for node classification.

Updated: 2025-03-09 13:37:38

标题: 使用子图GNNs进行节点分类：一个被忽视的潜在方法

摘要: 先前的研究表明，图神经网络（GNNs）在节点分类中表现出很强的性能。然而，大多数现有的GNNs采用以节点为中心的视角，并依赖全局消息传递，导致高计算和内存成本，限制了可扩展性。为了解决这些挑战，引入了基于子图的方法，利用本地子图作为完整计算树的近似。虽然这种方法提高了效率，但由于全局上下文信息的丢失，通常会导致性能下降，从而限制了其与全局GNNs相比的有效性。为了解决可扩展性和分类准确性之间的权衡，我们将节点分类任务重新定义为子图分类问题，并提出了SubGND（节点的子图GNN）。该框架引入了一种差异化的零填充策略和一种Ego-Alter子图表示方法，以解决标签冲突，并结合自适应特征缩放机制，根据数据集特定的依赖动态调整特征贡献。对六个基准数据集的实验结果表明，SubGND在异质设置中实现了与或超过全局消息传递GNNs相当的性能，突显了其作为节点分类的一种有前途的解决方案的有效性和可扩展性。

更新时间: 2025-03-09 13:37:38

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.06614v1

DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion

Story visualization aims to create visually compelling images or videos corresponding to textual narratives. Despite recent advances in diffusion models yielding promising results, existing methods still struggle to create a coherent sequence of subject-consistent frames based solely on a story. To this end, we propose DreamStory, an automatic open-domain story visualization framework that leverages LLMs and a novel multi-subject consistent diffusion model. DreamStory consists of (1) an LLM acting as a story director and (2) an innovative Multi-Subject consistent Diffusion model (MSD) for generating multiple consistent subjects across the images. First, DreamStory employs the LLM to generate descriptive prompts for subjects and scenes aligned with the story, annotating each scene's subjects for subsequent subject-consistent generation. Second, DreamStory utilizes these detailed subject descriptions to create portraits of the subjects, with these portraits and their corresponding textual information serving as multimodal anchors (guidance). Finally, the MSD uses these multimodal anchors to generate story scenes with multiple consistent subjects. Specifically, the MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules. MMSA and MMCA modules ensure appearance and semantic consistency with reference images and text, respectively. Both modules employ masking mechanisms to prevent subject blending. To validate our approach and promote progress in story visualization, we established a benchmark, DS-500, which can assess the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model. Extensive experiments validate the effectiveness of DreamStory in both subjective and objective evaluations. Please visit our project homepage at https://dream-xyz.github.io/dreamstory.

Updated: 2025-03-09 13:33:40

标题: 梦境故事：LLM引导的多主题一致扩散打开域故事可视化

摘要: 故事可视化旨在创建与文本叙述相对应的视觉吸引力的图像或视频。尽管最近扩散模型取得了令人期待的成果，但现有方法仍然难以仅基于故事创建一致的主题一致帧序列。为此，我们提出了DreamStory，一种通过利用LLMs和一种新颖的多主题一致扩散模型的自动开放领域故事可视化框架。DreamStory包括（1）充当故事导演的LLM和（2）用于生成一致多主题跨图像的创新多主题一致扩散模型（MSD）。首先，DreamStory利用LLM生成与故事对齐的主题和场景描述性提示，为后续主题一致生成注释每个场景的主题。其次，DreamStory利用这些详细的主题描述创建主题的肖像，这些肖像及其对应的文本信息作为多模式锚点（指导）。最后，MSD使用这些多模态锚点生成具有一致多主题的故事场景。具体而言，MSD包括蒙版互相自注意（MMSA）和蒙版互相交叉关注（MMCA）模块。MMSA和MMCA模块确保与参考图像和文本的外观和语义一致性。两个模块都使用蒙版机制以防止主题混合。为验证我们的方法并促进故事可视化的进展，我们建立了一个基准DS-500，它可以评估整体性能、主题识别准确性和生成模型的一致性。广泛的实验验证了DreamStory在主观和客观评估中的有效性。请访问我们的项目主页https://dream-xyz.github.io/dreamstory。

更新时间: 2025-03-09 13:33:40

领域: cs.CV,cs.AI,cs.MM

下载: http://arxiv.org/abs/2407.12899v2

Improving Graph Neural Networks on Multi-node Tasks with the Labeling Trick

In this paper, we study using graph neural networks (GNNs) for multi-node representation learning, where a representation for a set of more than one node (such as a link) is to be learned. Existing GNNs are mainly designed to learn single-node representations. When used for multi-node representation learning, a common practice is to directly aggregate the single-node representations obtained by a GNN. In this paper, we show a fundamental limitation of such an approach, namely the inability to capture the dependence among multiple nodes in the node set. A straightforward solution is to distinguish target nodes from others. Formalizing this idea, we propose the labeling trick, which first labels nodes in the graph according to their relationships with the target node set before applying a GNN and then aggregates node representations obtained in the labeled graph for multi-node representations. Besides node sets in graphs, we also extend labeling tricks to posets, subsets and hypergraphs. Experiments verify that the labeling trick technique can boost GNNs on various tasks, including undirected link prediction, directed link prediction, hyperedge prediction, and subgraph prediction. Our work explains the superior performance of previous node-labeling-based methods and establishes a theoretical foundation for using GNNs for multi-node representation learning.
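The limitation and the fix described above can be illustrated on a four-node path graph 0-1-2-3 with identical node features. Nodes 0 and 3 are structurally symmetric, so aggregating plain single-node representations cannot tell the pair {0, 1} (a link) from the pair {1, 3} (not a link). The sketch below uses the simplest labeling, a 0/1 flag marking target nodes, with a toy mean-aggregation layer standing in for a GNN. All function names are hypothetical, and the paper's own labeling schemes may be richer.

```python
def zero_one_labels(num_nodes, target_set):
    """Label 1.0 for nodes in the target set, 0.0 for all others."""
    return [1.0 if v in target_set else 0.0 for v in range(num_nodes)]


def mean_aggregate(adj, feats):
    """One toy message-passing round: average over a node and its neighbours."""
    out = []
    for v, nbrs in enumerate(adj):
        group = [v] + list(nbrs)
        dim = len(feats[v])
        out.append([sum(feats[u][i] for u in group) / len(group) for i in range(dim)])
    return out


def multi_node_representation(adj, feats, target_set, rounds=2):
    """Label the graph w.r.t. the target set, propagate, then sum target nodes."""
    labels = zero_one_labels(len(adj), target_set)
    h = [list(f) + [lab] for f, lab in zip(feats, labels)]
    for _ in range(rounds):
        h = mean_aggregate(adj, h)
    dim = len(h[0])
    return [sum(h[v][i] for v in sorted(target_set)) for i in range(dim)]


# Path graph 0-1-2-3, every node has the same raw feature.
adj = [[1], [0, 2], [1, 3], [2]]
feats = [[1.0], [1.0], [1.0], [1.0]]
rep_link = multi_node_representation(adj, feats, {0, 1})
rep_non_link = multi_node_representation(adj, feats, {1, 3})
```

In both results the first coordinate (the raw feature channel) is the same, which is all an unlabeled aggregation would see. The second coordinate (the label channel) differs, so the two node sets become distinguishable.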

Updated: 2025-03-09 13:31:18

标题: 使用标签技巧改进多节点任务上的图神经网络

摘要: 在这篇论文中，我们研究了使用图神经网络（GNN）进行多节点表示学习，其中需要学习一组超过一个节点（例如链接）的表示。现有的GNN主要设计用于学习单节点表示。在用于多节点表示学习时，一种常见做法是直接聚合GNN获得的单节点表示。在本文中，我们展示了这种方法的一个基本局限，即无法捕捉节点集中多个节点之间的依赖关系。一个直接的解决方案是区分目标节点和其他节点。形式化这个想法，我们提出了“标记技巧”，它首先根据节点与目标节点集的关系对图中的节点进行标记，然后应用GNN并聚合标记图中获得的节点表示以获得多节点表示。除了图中的节点集，我们还将标记技巧扩展到偏序集、子集和超图中。实验证明，标记技巧技术可以提升GNN在各种任务中的表现，包括无向链接预测、有向链接预测、超边预测和子图预测。我们的工作解释了先前基于节点标记的方法表现优越的原因，并为使用GNN进行多节点表示学习建立了理论基础。

更新时间: 2025-03-09 13:31:18

领域: cs.LG

下载: http://arxiv.org/abs/2304.10074v2

Inverse Reinforcement Learning for Minimum-Exposure Paths in Spatiotemporally Varying Scalar Fields

Performance and reliability analyses of autonomous vehicles (AVs) can benefit from tools that "amplify" small datasets to synthesize larger volumes of plausible samples of the AV's behavior. We consider a specific instance of this data synthesis problem that addresses minimizing the AV's exposure to adverse environmental conditions during travel to a fixed goal location. The environment is characterized by a threat field, which is a strictly positive scalar field with higher intensities corresponding to hazardous and unfavorable conditions for the AV. We address the problem of synthesizing datasets of minimum exposure paths that resemble a training dataset of such paths. The main contribution of this paper is an inverse reinforcement learning (IRL) model to solve this problem. We consider time-invariant (static) as well as time-varying (dynamic) threat fields. We find that the proposed IRL model provides excellent performance in synthesizing paths from initial conditions not seen in the training dataset, when the threat field is the same as that used for training. Furthermore, we evaluate model performance on unseen threat fields and find low error in that case as well. Finally, we demonstrate the model's ability to synthesize distinct datasets when trained on different datasets with distinct characteristics.

Updated: 2025-03-09 13:30:11

标题: 反向强化学习用于时空变化标量场中的最小暴露路径

摘要: 自主车辆（AVs）的性能和可靠性分析可以从“放大”小数据集的工具中受益，以合成更大量的AV行为的合理样本。我们考虑了一个特定的数据合成问题实例，该问题涉及最小化AV在前往固定目标位置的旅行过程中暴露于不利环境条件的情况。环境由威胁场特征化，威胁场是一个严格正的标量场，其中更高的强度对应于危险和不利条件。我们解决了合成具有类似于这些路径的训练数据集的最小暴露路径的问题。本文的主要贡献是提出了一种逆强化学习（IRL）模型来解决这个问题。我们考虑了时间不变（静态）以及时间变化（动态）的威胁场。我们发现所提出的IRL模型在合成从训练数据集中未见过的初始条件的路径方面表现出色，当威胁场与用于训练的威胁场相同时。此外，我们评估了模型在未见过的威胁场上的表现，并发现在这种情况下误差很小。最后，我们展示了该模型在训练于具有不同特点的不同数据集时合成不同数据集的能力。

更新时间: 2025-03-09 13:30:11

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.06611v1

GroMo: Plant Growth Modeling with Multiview Images

Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish, okra, wheat, and mustard. Each crop consists of multiple plants (p1, p2, ..., pn) captured over different days (d1, d2, ..., dm) and categorized into five levels (L1, L2, L3, L4, L5). Each plant is captured from 24 different angles with a 15-degree gap between images. Participants are required to perform both tasks for all four crops with these multiview images. We propose a Multiview Vision Transformer (MVVT) model for the GroMo challenge and evaluate its crop-wise performance on GroMo25. MVVT reports an average MAE of 7.74 for age prediction and an MAE of 5.52 for leaf count. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth. The GitHub repository is publicly available at https://github.com/mriglab/GroMo-Plant-Growth-Modeling-with-Multiview-Images.
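Two details in the abstract are easy to make concrete: 24 views at a 15-degree gap cover a full 360-degree rotation, and both tasks are scored with mean absolute error (MAE). The short sketch below shows both. The function names and the sample numbers are illustrative assumptions, not values from GroMo25.

```python
def view_angles(num_views=24, gap_degrees=15):
    """Camera angles for one plant: 0, 15, ..., 345 degrees."""
    return [i * gap_degrees for i in range(num_views)]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ages in days for three plants, purely for illustration.
true_age = [10, 20, 30]
pred_age = [12, 18, 33]
age_mae = mean_absolute_error(true_age, pred_age)  # (2 + 2 + 3) / 3
```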

Updated: 2025-03-09 13:23:16

标题: GroMo: 利用多视角图像进行植物生长建模

摘要: 理解植物生长动态对于农业和植物表型学应用至关重要。我们提出了生长建模（GroMo）挑战，旨在解决两个主要任务：（1）植物年龄预测和（2）叶片计数估算，这两个任务对于作物监测和精准农业至关重要。为此挑战，我们引入了GroMo25数据集，其中包含四种作物（萝卜、秋葵、小麦和芥菜）的图像。每种作物包括多个植物（p1、p2、...、pn），在不同日期（d1、d2、...、dm）拍摄，并分为五个级别（L1、L2、L3、L4、L5）。每棵植物从24个不同角度拍摄，图像之间间隔15度。参与者需要使用这些多视角图像为所有四种作物执行这两个任务。我们为GroMo挑战提出了一个多视角视觉Transformer（MVVT）模型，并评估了其在GroMo25上的按作物的性能。MVVT在年龄预测方面报告了平均MAE为7.74，叶片计数方面的MAE为5.52。GroMo挑战旨在通过鼓励跟踪和预测植物生长的创新解决方案推动植物表型学研究。GitHub存储库公开可用于https://github.com/mriglab/GroMo-Plant-Growth-Modeling-with-Multiview-Images。

更新时间: 2025-03-09 13:23:16

领域: cs.CV,cs.LG,cs.MM

下载: http://arxiv.org/abs/2503.06608v1

Interpretable Model Drift Detection

Data in the real world often has an evolving distribution. Thus, machine learning models trained on such data become outdated over time. This phenomenon is called model drift. Knowledge of this drift serves two purposes: (i) retaining an accurate model and (ii) discovering knowledge or insights about changes in the relationship between input features and the output variable w.r.t. the model. Most existing works focus only on detecting model drift but offer no interpretability. In this work, we take a principled approach to study the problem of interpretable model drift detection from a risk perspective using a feature-interaction aware hypothesis testing framework, which enjoys guarantees on test power. The proposed framework is generic, i.e., it can be adapted to both classification and regression tasks. Experiments on several standard drift detection datasets show that our method is superior to existing interpretable methods (especially on real-world datasets) and on par with state-of-the-art black-box drift detection methods. We also quantitatively and qualitatively study the interpretability aspect, including a case study on the USENET2 dataset. We find that our method focuses on model- and drift-sensitive features compared to baseline interpretable drift detectors.

Updated: 2025-03-09 13:19:06

标题: 可解释的模型漂移检测

摘要: 真实世界中的数据往往具有不断演变的分布。因此，在此类数据上训练的机器学习模型会随着时间过去而过时。这种现象被称为模型漂移。对这种漂移的了解有两个目的：（i）保持准确的模型和（ii）发现关于输入特征和输出变量之间关系变化的知识或见解相对于模型。大多数现有的研究仅关注检测模型漂移，但没有提供可解释性。在这项工作中，我们采用一种基于风险的原则方法，从特征交互意识的假设检验框架来研究可解释的模型漂移检测问题，该框架具有测试功率的保证。所提出的框架是通用的，即可以适应分类和回归任务。对几个标准漂移检测数据集的实验表明，我们的方法优于现有的可解释方法（特别是在真实世界数据集上），并与最先进的黑盒漂移检测方法不相上下。我们还定量和定性地研究了可解释性方面，包括对USENET2数据集的案例研究。我们发现，与基准可解释的漂移检测器相比，我们的方法更专注于模型和漂移敏感的特征。

更新时间: 2025-03-09 13:19:06

领域: cs.LG

下载: http://arxiv.org/abs/2503.06606v1

FW-Shapley: Real-time Estimation of Weighted Shapley Values

Fair credit assignment is essential in various machine learning (ML) applications, and Shapley values have emerged as a valuable tool for this purpose. However, in critical ML applications such as data valuation and feature attribution, the uniform weighting of Shapley values across subset cardinalities leads to unintuitive credit assignments. To address this, weighted Shapley values were proposed as a generalization, allowing different weights for subsets with different cardinalities. Despite their advantages, weighted Shapley values, like Shapley values, suffer from exponential compute costs, making them impractical for high-dimensional datasets. To tackle this issue, we present two key contributions. First, we provide a weighted least squares characterization of weighted Shapley values. Next, using this characterization, we propose Fast Weighted Shapley (FW-Shapley), an amortized framework for efficiently computing weighted Shapley values using a learned estimator. We further show that our estimator's training procedure is theoretically valid even though we do not use ground-truth weighted Shapley values during training. On the feature attribution task, we outperform the learned estimator FastSHAP by $27\%$ (on average) in terms of Inclusion AUC. For data valuation, we are much faster (14 times) while being comparable to the state-of-the-art KNN Shapley.
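To see where the exponential cost comes from, it helps to write out the exact computation: each player's value is a weighted sum of marginal contributions over every subset of the other players. The sketch below enumerates those subsets directly. The default weight is the classical Shapley coefficient; the `weight` argument is a hypothetical hook for a different per-cardinality weighting, since the abstract does not spell out the paper's exact weighting scheme. FW-Shapley's learned estimator is not reproduced here.

```python
from itertools import combinations
from math import factorial


def shapley_weight(n, s):
    """Classical Shapley coefficient for a subset of size s among n players."""
    return factorial(s) * factorial(n - s - 1) / factorial(n)


def exact_values(n, value, weight=shapley_weight):
    """Exact cardinality-weighted values by brute-force enumeration.

    value:  maps a frozenset of player indices to a number.
    weight: maps (n, subset_size) to a coefficient (hypothetical hook).
    Visits 2^(n-1) subsets per player, hence the exponential cost.
    """
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for s in range(n):
            for subset in combinations(others, s):
                coalition = frozenset(subset)
                marginal = value(coalition | {i}) - value(coalition)
                total += weight(n, s) * marginal
        phi.append(total)
    return phi


# Toy additive game: each player's Shapley value equals its own worth.
worth = [1.0, 2.0, 4.0]
phi = exact_values(3, lambda coalition: sum(worth[j] for j in coalition))
```

Passing a different `weight` function changes how much each subset size counts, which is the degree of freedom that weighted variants exploit.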

Updated: 2025-03-09 13:13:14

标题: FW-Shapley: 加权Shapley值的实时估计

摘要: 公平的信用分配在各种机器学习（ML）应用中至关重要，Shapley值已经成为这一目的的有价值工具。然而，在关键的ML应用中，如数据估值和特征归因，Shapley值在子集基数上的均匀加权导致了不直观的信用分配。为了解决这个问题，提出了加权Shapley值作为一种泛化方法，允许对具有不同基数的子集应用不同的权重。尽管具有这些优点，与Shapley值类似，加权Shapley值受到指数级计算成本的影响，使其对于高维数据集不切实际。为了解决这个问题，我们提出了两个关键贡献。首先，我们提供了对加权Shapley值的加权最小二乘特征描述。接下来，利用这个特征描述，我们提出了快速加权Shapley（FW-Shapley），这是一个用于高效计算加权Shapley值的摊销框架，使用一个学习的估计器。我们进一步表明，我们的估计器的训练过程在理论上是有效的，即使我们在训练过程中没有使用地面真实的加权Shapley值。在特征归因任务中，我们在包含AUC方面比学习的估计器FastSHAP表现出了27％的提升（平均值）。对于数据估值，我们速度快得多（14倍），同时与最先进的KNN Shapley相媲美。

更新时间: 2025-03-09 13:13:14

领域: cs.LG

下载: http://arxiv.org/abs/2503.06602v1

Automated Proof of Polynomial Inequalities via Reinforcement Learning

Polynomial inequality proving is fundamental to many mathematical disciplines and finds wide applications in diverse fields. Traditional algebraic methods are based on searching for a positive definite representation of a polynomial over a given basis. However, these methods are limited by the truncation degree. To address this issue, this paper proposes an approach based on reinforcement learning to find a Krivine-basis representation for proving polynomial inequalities. Specifically, we formulate the inequality proving problem as a linear programming (LP) problem and encode it as a basis selection problem using reinforcement learning (RL), achieving a non-negative Krivine basis. Moreover, a fast multivariate polynomial multiplication method based on the Fast Fourier Transform (FFT) is employed to enhance the efficiency of action space search. Furthermore, we have implemented a tool called APPIRL (Automated Proof of Polynomial Inequalities via Reinforcement Learning). Experimental evaluation on benchmark problems demonstrates the feasibility and effectiveness of our approach. In addition, APPIRL has been successfully applied to solve the maximum stable set problem.
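FFT-based polynomial multiplication works by evaluating both polynomials at roots of unity, multiplying pointwise, and transforming back. The sketch below shows this for the simpler univariate case with integer coefficients; this is a deliberate simplification, since the paper uses a multivariate method whose details are not given in the abstract. The function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 transform; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def poly_multiply(p, q):
    """Multiply integer-coefficient polynomials given lowest degree first."""
    result_len = len(p) + len(q) - 1
    size = 1
    while size < result_len:
        size *= 2
    fp = fft([complex(c) for c in p] + [0j] * (size - len(p)))
    fq = fft([complex(c) for c in q] + [0j] * (size - len(q)))
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    # The inverse transform needs a 1/size scaling; rounding assumes integers.
    return [round((c / size).real) for c in prod[:result_len]]


# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
product = poly_multiply([1, 2, 3], [4, 5])
```

The transform costs O(n log n) rather than the O(n^2) of schoolbook multiplication, which matters when many candidate basis products have to be formed during search.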

Updated: 2025-03-09 12:50:28

标题: 通过强化学习实现多项式不等式的自动证明

摘要: 多项式不等式证明是许多数学学科的基础，并在各个领域中得到广泛应用。当前传统的代数方法是基于在一组基础上搜索多项式正定表示。然而，这些方法受到截断程度的限制。为了解决这个问题，本文提出了一种基于强化学习的方法，用于找到证明多项式不等式的 {Krivine基} 表示。具体而言，我们将不等式证明问题形式化为线性规划（LP）问题，并使用强化学习（RL）对其进行编码，实现一个非负的 {Krivine基}。此外，基于快速傅里叶变换（FFT）的快速多变量多项式乘法方法被用于增强行动空间搜索的效率。此外，我们实现了一个名为 {APPIRL}（通过强化学习自动证明多项式不等式）的工具。对基准问题的实验评估显示了我们方法的可行性和有效性。此外，{APPIRL} 已成功应用于解决最大稳定集问题。

更新时间: 2025-03-09 12:50:28

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2503.06592v1

Navigating Conflicting Views: Harnessing Trust for Learning

Resolving conflicts is essential to make the decisions of multi-view classification more reliable. Much research has been conducted on learning consistent informative representations among different views, assuming that all views are identically important and strictly aligned. However, real-world multi-view data may not always conform to these assumptions, as some views may express distinct information. To address this issue, we develop a computational trust-based discounting method to enhance the existing trustworthy framework in scenarios where conflicts between different views may arise. Its belief fusion process considers the trustworthiness of predictions made by individual views via an instance-wise probability-sensitive trust discounting mechanism. We evaluate our method on six real-world datasets, using Top-1 Accuracy, AUC-ROC for Uncertainty-Aware Prediction, Fleiss' Kappa, and a new metric called Multi-View Agreement with Ground Truth that takes into consideration the ground truth labels. The experimental results show that computational trust can effectively resolve conflicts, paving the way for more reliable multi-view classification models in real-world applications.

Updated: 2025-03-09 12:32:00

标题: 导航冲突观点：利用信任进行学习

摘要: 解决冲突对于使多视角分类的决策更可靠至关重要。已经进行了许多研究，旨在学习在不同视角之间保持一致且信息丰富的表示，假设所有视角都具有相同重要性并严格对齐。然而，真实世界中的多视角数据可能并不总是符合这些假设，因为某些视角可能表达不同的信息。为了解决这个问题，我们开发了一个基于计算信任的折扣方法，以增强现有的值得信赖的框架，在不同视角之间可能出现冲突的情况下。其信念融合过程通过基于实例的概率敏感信任折扣机制考虑了各个视角所做预测的可信度。我们在六个真实世界数据集上评估了我们的方法，使用Top-1准确率、不确定性感知预测的AUC-ROC、Fleiss' Kappa和一个考虑了地面真相标签的新度量标准，即与地面真相一致的多视图协议。实验结果表明，计算信任可以有效解决冲突，为真实世界应用中更可靠的多视图分类模型铺平道路。

更新时间: 2025-03-09 12:32:00

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2406.00958v2

IPO: Iterative Preference Optimization for Text-to-Video Generation

Video foundation models have advanced significantly thanks to network upgrades and model scale-up. However, they still struggle to meet application requirements because of unsatisfactory generation quality. To solve this problem, we propose to align video foundation models with human preferences through post-training. To this end, we introduce an Iterative Preference Optimization (IPO) strategy to enhance generated video quality by incorporating human feedback. Specifically, IPO exploits a critic model to assess video generations, either by pairwise ranking as in Direct Preference Optimization or by point-wise scoring as in Kahneman-Tversky Optimization. Given this, IPO optimizes video foundation models under the guidance of signals from preference feedback, which helps improve generated video quality in subject consistency, motion smoothness, aesthetic quality, and so on. In addition, IPO builds the critic model on a multi-modality large language model, which enables it to automatically assign preference labels without the need for retraining or relabeling. In this way, IPO can efficiently perform multi-round preference optimization in an iterative manner, without tedious manual labeling. Comprehensive experiments demonstrate that the proposed IPO can effectively improve the video generation quality of a pretrained model and help a model with only 2B parameters surpass one with 5B parameters. Besides, IPO achieves new state-of-the-art performance on the VBench benchmark.

Updated: 2025-03-09 12:30:00

标题: IPO：文本到视频生成的迭代偏好优化

摘要: 视频基础模型在网络升级和模型规模扩大的帮助下取得了重大进展。然而，由于生成质量不佳，它们仍然难以满足应用程序的要求。为了解决这个问题，本文提出从后训练的角度调整视频基础模型与人类偏好一致。因此，我们引入了一个迭代偏好优化策略，通过整合人类反馈来提高生成的视频质量。具体而言，IPO利用一个评论模型来验证视频生成的成对排名，就像直接偏好优化或卡内曼-特瓦斯基优化中的点评分一样。基于此，IPO通过偏好反馈信号指导视频基础模型的优化，有助于提高生成的视频质量，包括主题一致性、动作平滑度和美感质量等。此外，IPO将评论模型与多模态大语言模型结合，使其能够自动分配偏好标签，无需重新训练或重新标记。这样，IPO可以有效地以迭代方式进行多轮偏好优化，而无需繁琐的手动标记。全面的实验证明，所提出的IPO可以有效地提高预训练模型的视频生成质量，并帮助一个只有20亿参数的模型超越一个有50亿参数的模型。此外，IPO在VBench基准测试中取得了新的最先进性能。

更新时间: 2025-03-09 12:30:00

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.02088v3

Agent models: Internalizing Chain-of-Action Generation into Reasoning models

Traditional agentic workflows rely on external prompts to manage interactions with tools and the environment, which limits the autonomy of reasoning models. We position Large Agent Models (LAMs) that internalize the generation of Chain-of-Action (CoA), enabling the model to autonomously decide when and how to use external tools. Our proposed AutoCoA framework combines supervised fine-tuning (SFT) and reinforcement learning (RL), allowing the model to seamlessly switch between reasoning and action while efficiently managing environment interactions. Main components include step-level action triggering, trajectory-level CoA optimization, and an internal world model to reduce real-environment interaction costs. Evaluations on open-domain QA tasks demonstrate that AutoCoA-trained agent models significantly outperform ReAct-based workflows in task completion, especially in tasks that require long-term reasoning and multi-step actions. Code and dataset are available at https://github.com/ADaM-BJTU/AutoCoA.

Updated: 2025-03-09 12:19:47

标题: 代理模型：将行动链内化到推理模型中

摘要: 传统的代理式工作流依赖外部提示来管理与工具和环境的交互，这限制了推理模型的自主性。我们提出了\emph{大型代理模型（LAMs）}，它们内部化生成\emph{行动链（CoA）}，使模型能够自主决定何时以及如何使用外部工具。我们提出的AutoCoA框架结合了监督微调（SFT）和强化学习（RL），使模型能够在有效地管理环境交互的同时，无缝地在推理和行动之间切换。主要组成部分包括步骤级行动触发、轨迹级CoA优化和内部世界模型，以降低真实环境交互成本。在开放域QA任务上的评估表明，经过AutoCoA训练的代理模型在任务完成方面明显优于基于ReAct的工作流程，特别是在需要长期推理和多步行动的任务中。代码和数据集可在https://github.com/ADaM-BJTU/AutoCoA获取。

更新时间: 2025-03-09 12:19:47

领域: cs.AI

下载: http://arxiv.org/abs/2503.06580v1

WildIFEval: Instruction Following in the Wild

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models experience performance degradation with an increasing number of constraints. Thus, we show that all models have substantial room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction-following under complex, realistic conditions.

Updated: 2025-03-09 12:06:29

标题: WildIFEval：野外环境中的指示遵循

摘要: 最近的大型语言模型在遵循用户指令方面取得了显著的成功，但处理具有多个约束条件的指令仍然是一个重大挑战。在这项工作中，我们介绍了WildIFEval - 一个包含12K个真实用户指令的大规模数据集，具有多样化的多约束条件。与先前的数据集不同，我们的收集涵盖了广泛的词汇和主题谱系中自然用户提示中的约束条件。我们将这些约束条件分类为八个高级类别，以捕捉它们在现实情境中的分布和动态。利用WildIFEval，我们进行了大量实验来评估领先的大型语言模型在遵循指令方面的能力。我们的研究结果显示，所有评估的模型在约束条件数量增加时性能会下降。因此，我们表明所有模型在这类任务上仍有很大的改进空间。此外，我们观察到特定类型的约束条件在模型性能中起着关键作用。我们发布我们的数据集以促进在复杂、现实条件下的指令遵循方面的进一步研究。

更新时间: 2025-03-09 12:06:29

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.06573v1

SHIP: A Shapelet-based Approach for Interpretable Patient-Ventilator Asynchrony Detection

Patient-ventilator asynchrony (PVA) is a common and critical issue during mechanical ventilation, affecting up to 85% of patients. PVA can result in clinical complications such as discomfort, sleep disruption, and potentially more severe conditions like ventilator-induced lung injury and diaphragm dysfunction. Traditional PVA management, which relies on manual adjustments by healthcare providers, is often inadequate due to delays and errors. While various computational methods, including rule-based, statistical, and deep learning approaches, have been developed to detect PVA events, they face challenges related to dataset imbalances and lack of interpretability. In this work, we propose a shapelet-based approach SHIP for PVA detection, utilizing shapelets - discriminative subsequences in time-series data - to enhance detection accuracy and interpretability. Our method addresses dataset imbalances through shapelet-based data augmentation and constructs a shapelet pool to transform the dataset for more effective classification. The combined shapelet and statistical features are then used in a classifier to identify PVA events. Experimental results on medical datasets show that SHIP significantly improves PVA detection while providing interpretable insights into model decisions.
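A shapelet-based transform of the kind described above usually represents each time series by its distance to every shapelet in a pool, where the distance to one shapelet is the minimum over all sliding windows of the series. The sketch below shows that standard construction with Euclidean distance. The function names and toy waveform are assumptions, and SHIP's augmentation step and additional statistical features are not reproduced.

```python
import math


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any window."""
    m = len(shapelet)
    if m > len(series):
        raise ValueError("shapelet is longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        dist = math.sqrt(
            sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        )
        best = min(best, dist)
    return best


def shapelet_transform(dataset, shapelet_pool):
    """Map each series to a feature vector of distances to the pool."""
    return [[shapelet_distance(s, sh) for sh in shapelet_pool] for s in dataset]


# Toy waveforms: the first contains the bump [1, 2, 1], the second is flat.
dataset = [[0, 0, 1, 2, 1, 0], [0, 0, 0, 0, 0, 0]]
pool = [[1, 2, 1]]
features = shapelet_transform(dataset, pool)
```

A small distance means the pattern occurs somewhere in the series. Each feature can be traced back to a concrete waveform segment, which is the source of the interpretability.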

Updated: 2025-03-09 11:58:03

标题: SHIP：一种基于形状特征的可解释患者-呼吸机不同步检测方法

摘要: 患者-呼吸机不同步（PVA）是机械通气过程中常见且关键的问题，影响高达85%的患者。PVA可能导致临床并发症，如不适感、睡眠中断，以及潜在更严重的情况，如呼吸机诱导肺损伤和横膈功能障碍。传统的PVA管理依赖于医护人员手动调整，往往由于延迟和错误而不足。虽然各种计算方法，包括基于规则、统计和深度学习的方法已经被开发用于检测PVA事件，但它们面临与数据集不平衡和缺乏可解释性相关的挑战。在这项工作中，我们提出了一种基于shapelet的方法SHIP用于PVA检测，利用shapelet - 时间序列数据中的具有辨别性的子序列 - 来增强检测准确性和可解释性。我们的方法通过基于shapelet的数据增强来解决数据集不平衡问题，并构建一个shapelet池来转换数据集以实现更有效的分类。然后将结合shapelet和统计特征用于分类器来识别PVA事件。对医学数据集的实验结果表明，SHIP显著改善了PVA检测的能力，同时提供了对模型决策的可解释洞察。

更新时间: 2025-03-09 11:58:03

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.06571v1

Conceptrol: Concept Control of Zero-shot Personalized Image Generation

Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image model has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA. The source code is available at https://github.com/QY-H00/Conceptrol.

Updated: 2025-03-09 11:54:08

标题: Conceptrol：零样本个性化图像生成的概念控制

摘要: 个性化图像生成与文本到图像扩散模型生成基于参考图像内容的未见图像。IP-Adapter和OminiControl等零样本适配器方法尤其有趣，因为它们不需要测试时微调。然而，它们在平衡保留个性化内容和遵循文本提示方面遇到困难。我们确定了一个关键设计缺陷导致了这种性能差距：当前的适配器未能充分整合个性化图像和文本描述。因此，生成的图像复制了个性化内容，而不是遵循文本提示指令。然而，基础的文本到图像具有强大的概念理解能力，可以加以利用。我们提出了Conceptrol，这是一个简单而有效的框架，可以增强零样本适配器而不增加计算负担。Conceptrol通过用文本概念蒙版约束视觉规范的注意力，提高了主体驱动的生成能力。它在个性化基准测试上比原始IP-Adapter提高了高达89％，甚至可以胜过Dreambooth LoRA等微调方法。源代码可在https://github.com/QY-H00/Conceptrol 找到。

更新时间: 2025-03-09 11:54:08

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.06568v1

Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

Large language models (LLMs) have demonstrated transformative potential across various domains, yet they face significant challenges in knowledge integration and complex problem reasoning, often leading to hallucinations and unreliable outputs. Retrieval-Augmented Generation (RAG) has emerged as a promising solution to enhance LLM accuracy by incorporating external knowledge. However, traditional RAG systems struggle with processing complex relational information and multi-step reasoning, limiting their effectiveness in advanced problem-solving tasks. To address these limitations, we propose CogGRAG, a cognition-inspired graph-based RAG framework designed to improve LLM performance in Knowledge Graph Question Answering (KGQA). Inspired by the human cognitive process of decomposing complex problems and performing self-verification, our framework introduces a three-stage methodology: decomposition, retrieval, and reasoning with self-verification. By integrating these components, CogGRAG enhances the accuracy of LLMs in complex problem solving. We conduct systematic experiments with three LLM backbones on four benchmark datasets, where CogGRAG outperforms the baselines.

Updated: 2025-03-09 11:50:39

标题: 受人类认知启发的具有知识图谱的复杂问题解决RAG

摘要: 大型语言模型（LLMs）已经在各个领域展示了变革潜力，但它们面临着知识整合和复杂问题推理方面的重大挑战，往往导致幻觉和不可靠的输出。检索增强生成（RAG）已经成为一个有希望的解决方案，通过整合外部知识来提高LLMs的准确性。然而，传统的RAG系统在处理复杂关系信息和多步推理方面存在困难，限制了它们在高级问题解决任务中的效力。为了解决这些限制，我们提出了CogGRAG，这是一个受认知启发的基于图的RAG框架，旨在提高LLMs在知识图问题回答（KGQA）中的性能。受到人类认知过程中分解复杂问题和进行自我验证的启发，我们的框架引入了一个三阶段方法：分解、检索和推理与自我验证。通过整合这些组件，CogGRAG增强了LLMs在复杂问题解决中的准确性。我们在四个基准数据集上对三个LLM骨干进行了系统实验，结果表明CogGRAG胜过了基准线。

更新时间: 2025-03-09 11:50:39

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.06567v1

When Unsupervised Domain Adaptation meets One-class Anomaly Detection: Addressing the Two-fold Unsupervised Curse by Leveraging Anomaly Scarcity

This paper introduces the first fully unsupervised domain adaptation (UDA) framework for unsupervised anomaly detection (UAD). The performance of UAD techniques degrades significantly in the presence of a domain shift, difficult to avoid in a real-world setting. While UDA has contributed to solving this issue in binary and multi-class classification, such a strategy is ill-posed in UAD. This might be explained by the unsupervised nature of the two tasks, namely, domain adaptation and anomaly detection. Herein, we first formulate this problem that we call the two-fold unsupervised curse. Then, we propose a pioneering solution to this curse, considered intractable so far, by assuming that anomalies are rare. Specifically, we leverage clustering techniques to identify a dominant cluster in the target feature space. Posed as the normal cluster, the latter is aligned with the source normal features. Concretely, given a one-class source set and an unlabeled target set composed mostly of normal data and some anomalies, we fit the source features within a hypersphere while jointly aligning them with the features of the dominant cluster from the target set. The paper provides extensive experiments and analysis on common adaptation benchmarks for anomaly detection, demonstrating the relevance of both the newly introduced paradigm and the proposed approach. The code will be made publicly available.

Updated: 2025-03-09 11:44:12

标题: 当无监督领域自适应遇到单类异常检测：通过利用异常稀缺性解决双重无监督诅咒

摘要: 这篇论文介绍了第一个完全无监督域自适应（UDA）框架，用于无监督异常检测（UAD）。在真实世界环境中，领域漂移会显著降低UAD技术的性能，难以避免。虽然UDA在二元和多类分类中已经有所贡献，但在UAD中，这样的策略是不可行的。这可能是由于这两个任务的无监督性质，即领域自适应和异常检测。在这里，我们首先明确这个问题，我们称之为两重无监督诅咒。然后，我们提出了对此诅咒的开创性解决方案，目前被认为是难以解决的，假设异常是罕见的。具体而言，我们利用聚类技术在目标特征空间中识别一个主要的簇。将其作为正常簇，与源正常特征对齐。具体而言，给定一个一类源集和一个主要由正常数据和一些异常组成的未标记目标集，我们在一个超球面内拟合源特征，同时将其与目标集中主要簇的特征对齐。该论文对常见的异常检测适应基准进行了广泛的实验和分析，展示了新引入的范式和提出的方法的相关性。代码将被公开提供。

更新时间: 2025-03-09 11:44:12

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2502.21022v2

LSA: Latent Style Augmentation Towards Stain-Agnostic Cervical Cancer Screening

The deployment of computer-aided diagnosis systems for cervical cancer screening using whole slide images (WSIs) faces critical challenges due to domain shifts caused by staining variations across different scanners and imaging environments. While existing stain augmentation methods improve patch-level robustness, they fail to scale to WSIs due to two key limitations: (1) inconsistent stain patterns when extending patch operations to gigapixel slides, and (2) prohibitive computational/storage costs from offline processing of augmented WSIs. To address this, we propose Latent Style Augmentation (LSA), a framework that performs efficient, online stain augmentation directly on WSI-level latent features. We first introduce WSAug, a WSI-level stain augmentation method ensuring consistent stain across patches within a WSI. Using WSIs augmented offline by WSAug, we design and train Stain Transformer, which can simulate targeted style in the latent space, efficiently enhancing the robustness of the WSI-level classifier. We validate our method on a multi-scanner WSI dataset for cervical cancer diagnosis. Despite being trained on data from a single scanner, our approach achieves significant performance improvements on out-of-distribution data from other scanners. Code will be available at https://github.com/caijd2000/LSA.

Updated: 2025-03-09 11:33:59

标题: LSA: 潜在风格增强：面向不受染色影响的宫颈癌筛查

摘要: 计算机辅助诊断系统在使用全幻灯片图像（WSIs）进行宫颈癌筛查时面临着关键挑战，这是由于不同扫描仪和成像环境之间的染色变化引起的领域转移。虽然现有的染色增强方法改善了补丁级别的稳健性，但由于两个关键限制，它们无法扩展到WSIs：（1）将补丁操作扩展到巨像素幻灯片时，染色模式不一致，（2）增强WSIs的离线处理会导致计算/存储成本过高。为了解决这个问题，我们提出了潜在风格增强（LSA）框架，该框架在WSI级别潜在特征上直接执行高效的在线染色增强。我们首先介绍了WSAug，这是一种保证WSI内补丁之间染色一致性的WSI级染色增强方法。使用由WSAug离线增强的WSIs，我们设计和训练了Stain Transformer，它可以在潜在空间中模拟目标风格，从而有效增强WSI级分类器的稳健性。我们在一个用于宫颈癌诊断的多扫描仪WSI数据集上验证了我们的方法。尽管是在单个扫描仪的数据上训练的，但我们的方法在来自其他扫描仪的分布外数据上取得了显著的性能提升。代码将在https://github.com/caijd2000/LSA上提供。

更新时间: 2025-03-09 11:33:59

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2503.06563v1

Generative modelling with jump-diffusions

Score-based diffusion models generate samples from an unknown target distribution using a time-reversed diffusion process. While such models represent state-of-the-art approaches in industrial applications such as artificial image generation, it has recently been noted that their performance can be further improved by considering injection noise with heavy-tailed characteristics. Here, I present a generalization of generative diffusion processes to a wide class of non-Gaussian noise processes. I consider forward processes driven by standard Gaussian noise with superimposed Poisson jumps representing a finite-activity Lévy process. The generative process is shown to be governed by a generalized score function that depends on the jump amplitude distribution. Both probability flow ODE and SDE formulations are derived with modest technical effort, and are implemented for jump amplitudes drawn from a multivariate Laplace distribution. Remarkably, for the problem of capturing a heavy-tailed target distribution, the jump-diffusion Laplace model outperforms models driven by alpha-stable noise despite not containing any heavy-tailed characteristics. The framework can be readily applied to other jump statistics that could further improve on the performance of standard diffusion models.
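A forward process of the kind described here, Gaussian noise plus Poisson-timed jumps with Laplace amplitudes, can be simulated in a few lines. The sketch below is a one-dimensional Euler-type discretization under assumptions that do not come from the abstract: an Ornstein-Uhlenbeck style drift `-0.5 * beta * x`, and the parameter names `beta`, `jump_rate` and `jump_scale`. It shows only the noising direction, not the generalized score or the reverse process.

```python
import math
import random

def laplace(rng, scale):
    """Zero-mean Laplace sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def poisson(rng, mean):
    """Knuth's method; adequate for the small per-step means used here."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def forward_path(x0, beta=1.0, jump_rate=2.0, jump_scale=0.5,
                 t_end=1.0, n_steps=200, seed=0):
    """Simulate dX = -0.5*beta*X dt + sqrt(beta) dW + dJ, J compound Poisson (assumed form)."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    x, path = x0, [x0]
    for _ in range(n_steps):
        drift = -0.5 * beta * x * dt
        diffusion = math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)
        n_jumps = poisson(rng, jump_rate * dt)  # number of jumps in this step
        jumps = sum(laplace(rng, jump_scale) for _ in range(n_jumps))
        x += drift + diffusion + jumps
        path.append(x)
    return path
```

Setting `jump_rate=0.0` recovers a purely Gaussian diffusion path, which makes it easy to compare the two noise models on the same starting point.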

Updated: 2025-03-09 11:08:03

标题: 使用跳跃扩散的生成建模

摘要: 基于分数的扩散模型通过一个时间反转的扩散过程从未知的目标分布中生成样本。虽然这种模型在工业应用中，如人工图像生成中代表了最先进的方法，但最近注意到通过考虑具有重尾特征的注入噪声可以进一步提高它们的性能。在这里，我提出了将生成扩散过程推广到一类广泛的非高斯噪声过程。我考虑由标准高斯噪声驱动的前向过程，其上叠加了代表有限活动Levy过程的泊松跃变。结果表明，生成过程受一个依赖于跃变振幅分布的广义分数函数控制。基于基本技术工作导出了概率流ODE和SDE的公式，并为从多元拉普拉斯分布中抽取的跃变振幅实施了。值得注意的是，对于捕捉重尾目标分布的问题，跃变扩散拉普拉斯模型优于受α-稳定噪声驱动的模型，尽管它不包含任何重尾特征。这个框架可以很容易地应用于其他可能进一步改善标准扩散模型性能的跃变统计。

更新时间: 2025-03-09 11:08:03

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2503.06558v1

BDPFL: Backdoor Defense for Personalized Federated Learning via Explainable Distillation

Federated learning is a distributed learning paradigm that facilitates the collaborative training of a global model across multiple clients while preserving the privacy of local datasets. To address inherent challenges related to data heterogeneity and satisfy personalized needs, a new direction within FL, known as personalized Federated Learning (pFL), has gradually evolved. Extensive attention has been directed toward developing novel frameworks and methods to enhance the performance of pFL. Regrettably, the aspect of security in pFL has been largely overlooked. Our objective is to fill this gap. Similar to FL, pFL is susceptible to backdoor attacks. However, existing backdoor defense strategies are primarily tailored to general FL frameworks, and pFL lacks robustness against backdoor attacks. We propose a novel, backdoor-robust pFL framework named BDPFL to address these challenges. First, BDPFL introduces layer-wise mutual distillation that enables clients to learn their personalized local models while mitigating potential backdoors. Then, BDPFL employs explanation heatmaps to learn high-quality intermediate representations and enhance the effect of eliminating deeper and more entrenched backdoors. Moreover, we perform empirical evaluations of BDPFL's performance on three datasets and compare BDPFL with four backdoor defense methods. The experiments demonstrate that BDPFL outperforms baseline methods and is effective under various settings.

Updated: 2025-03-09 10:59:18

标题: BDPFL: 通过可解释蒸馏实现个性化联邦学习的后门防御

摘要: 联邦学习是一种分布式学习范式，它促进了在多个客户端之间协作训练全局模型，同时保护本地数据集的隐私。为了解决与数据异质性相关的固有挑战并满足个性化需求，一个新的方向在联邦学习内逐渐发展，称为个性化联邦学习（pFL）。人们已经极大关注于开发新的框架和方法来提升pFL的性能。遗憾的是，在pFL中安全性方面已经被大大忽视。我们的目标是填补这一空白。与联邦学习类似，pFL容易受到后门攻击。然而，现有的后门防御策略主要针对一般的联邦学习框架，而pFL缺乏对抗后门攻击的鲁棒性。我们提出了一种新颖的、具有后门鲁棒性的pFL框架，名为BDPFL，以解决这些挑战。首先，BDPFL引入了逐层相互蒸馏，使客户端能够学习他们的个性化本地模型，同时减轻潜在的后门风险。然后，BDPFL采用解释热图来学习高质量的中间表示，并增强消除更深层次和更根深蒂固的后门的效果。此外，我们对BDPFL在三个数据集上的性能进行了实证评估，并将其与四种后门防御方法进行了比较。实验表明，BDPFL优于基线方法，并在各种设置下表现出有效性。

更新时间: 2025-03-09 10:59:18

领域: cs.CR

下载: http://arxiv.org/abs/2503.06554v1

Small but Mighty: Enhancing Time Series Forecasting with Lightweight LLMs

While LLMs have demonstrated remarkable potential in time series forecasting, their practical deployment remains constrained by excessive computational demands and memory footprints. Existing LLM-based approaches typically suffer from three critical limitations: inefficient parameter utilization in handling numerical time series patterns; modality misalignment between continuous temporal signals and discrete text embeddings; and inflexibility for real-time expert knowledge integration. We present SMETimes, the first systematic investigation of sub-3B parameter SLMs for efficient and accurate time series forecasting. Our approach centers on three key innovations: a statistically-enhanced prompting mechanism that bridges numerical time series with textual semantics through descriptive statistical features; an adaptive fusion embedding architecture that aligns temporal patterns with language model token spaces through learnable parameters; and a dynamic mixture-of-experts framework enabled by SLMs' computational efficiency, adaptively combining base predictions with domain-specific models. Extensive evaluations across seven benchmark datasets demonstrate that our 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while maintaining 3.8x faster training and 5.2x lower memory consumption compared to 7B-parameter LLM baselines. Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLMs. Ablation studies validate that our statistical prompting and cross-modal fusion modules respectively contribute 15.7% and 18.2% error reduction in long-horizon forecasting tasks. By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting. Code and models are available at https://github.com/xiyan1234567/SMETimes.
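The statistically-enhanced prompting idea can be pictured as prepending summary statistics to the raw values before they reach the language model. The sketch below is only an illustration: the choice of statistics, the prompt wording, and the function name `stats_prompt` are assumptions, not taken from SMETimes.

```python
import statistics

def stats_prompt(series, horizon):
    """Build a forecasting prompt that states descriptive statistics before the raw values."""
    # Average change per step, a crude trend indicator (assumed feature).
    slope = (series[-1] - series[0]) / (len(series) - 1)
    features = [
        f"mean={statistics.fmean(series):.2f}",
        f"std={statistics.pstdev(series):.2f}",
        f"min={min(series):.2f}",
        f"max={max(series):.2f}",
        f"trend={slope:.2f}",
    ]
    values = ", ".join(f"{v:.2f}" for v in series)
    return (
        f"Statistics: {'; '.join(features)}\n"
        f"History: {values}\n"
        f"Predict the next {horizon} values."
    )
```

For example, `stats_prompt([1.0, 2.0, 3.0, 4.0], 2)` yields a three-line prompt whose first line summarizes the window, so the model does not have to infer scale and trend from digits alone.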

Updated: 2025-03-09 10:56:53

标题: 小而强大：利用轻量级LLMs增强时间序列预测

摘要: 虽然LLMs在时间序列预测中展现出了显著的潜力，但它们的实际部署仍受到计算需求和内存占用过多的限制。现有的基于LLM的方法通常存在三个关键限制：处理数值时间序列模式时参数利用效率低下；连续时间信号与离散文本嵌入之间的模态不对齐；以及无法实时集成专家知识。我们提出了SMETimes，对小于3B参数的SLMs进行了首次系统研究，以实现高效准确的时间序列预测。我们的方法围绕三个关键创新展开：通过描述性统计特征构建统计增强提示机制，将数值时间序列与文本语义联系起来；通过可学习参数调整融合嵌入架构，将时间模式与语言模型令牌空间对齐；以及通过SLMs的高效性实现的动态专家混合框架，自适应地将基础预测与领域特定模型结合起来。对七个基准数据集的广泛评估表明，我们的3B参数SLM在五个主要数据集上取得了最先进的性能，同时相比于7B参数LLM基线，训练速度快3.8倍，内存消耗低5.2倍。值得注意的是，所提出的模型具有更好的学习能力，比传统LLM的均方误差低12.3%。消融研究验证了我们的统计提示和跨模态融合模块分别在长期预测任务中贡献了15.7%和18.2%的误差降低。通过重新定义效率-准确性权衡的格局，这项工作将SLMs确立为实际时间序列预测中资源密集型LLMs的可行替代方案。代码和模型可在https://github.com/xiyan1234567/SMETimes 上找到。

更新时间: 2025-03-09 10:56:53

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.03594v2

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All the resources will be released to foster future research of reliable multi-modal process evaluation.

Updated: 2025-03-09 10:55:51

标题: ProJudge：基于MLLM的多模多学科基准和指导调整数据集

摘要: 随着多模态大型语言模型（MLLMs）在解决科学问题时经常出现错误，评估其推理过程的有效性对于确保可靠性和发现细粒度模型弱点至关重要。由于人工评估费时费力，促使MLLMs作为自动化过程评判者已成为一种常见做法。然而，这些基于模型的评判者的可靠性仍然不确定。为了解决这个问题，我们引入了ProJudgeBench，这是第一个专门设计用于评估基于MLLM的过程评判者能力的全面基准。ProJudgeBench包括2,400个测试案例和50,118个步骤级标签，涵盖了四个科学学科，难度级别各异，具有多模态内容。在ProJudgeBench中，每个步骤都由人类专家进行细致注释，包括正确性、错误类型和解释，从而实现对评判者检测、分类和诊断错误能力的系统评估。在ProJudgeBench的评估中，开源模型和专有模型之间存在显著的性能差距。为了弥补这一差距，我们进一步提出ProJudge-173k，一个大规模指令调整数据集，以及一种动态双阶段微调策略，鼓励模型在评估解决方案之前明确地推理解决问题。这两项贡献显著增强了开源模型的过程评估能力。所有资源将被释放，以促进未来对可靠多模态过程评估的研究。

更新时间: 2025-03-09 10:55:51

领域: cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.06553v1

An Efficient Intelligent Semi-Automated Warehouse Inventory Stocktaking System

In the context of evolving supply chain management, the significance of efficient inventory management has grown substantially for businesses. However, conventional manual and experience-based approaches often struggle to meet the complexities of modern market demands. This research introduces an intelligent inventory management system to address challenges related to inaccurate data, delayed monitoring, and overreliance on subjective experience in forecasting. The proposed system integrates bar code and distributed Flutter application technologies for intelligent perception, alongside comprehensive big data analytics to enable data-driven decision-making. Through meticulous analysis, system design, critical technology exploration, and simulation validation, the effectiveness of the proposed system is successfully demonstrated. The intelligent system facilitates second-level monitoring, high-frequency checks, and artificial intelligence-driven forecasting, consequently enhancing the automation, precision, and intelligence of inventory management. This system contributes to cost reduction and optimized inventory sizes through accurate predictions and informed decisions, ultimately achieving a mutually beneficial scenario. The outcomes of this research offer

Updated: 2025-03-09 10:53:29

标题: 一个高效的智能半自动化仓库库存盘点系统

摘要: 在不断发展的供应链管理背景下，高效的库存管理对企业的重要性显著增加。然而，传统的手动和经验主义方法往往难以满足现代市场需求的复杂性。本研究引入了一个智能库存管理系统，以解决与数据不准确、监控延迟和过分依赖主观经验预测相关的挑战。所提出的系统集成了条形码和分布式Flutter应用技术，用于智能感知，同时结合全面的大数据分析，以实现数据驱动的决策。通过细致的分析、系统设计、关键技术探索和模拟验证，成功地证明了所提出系统的有效性。智能系统实现了二级监测、高频检查和人工智能驱动的预测，从而提升了库存管理的自动化、精度和智能化。该系统通过准确的预测和明智的决策，有助于降低成本并优化库存规模，最终实现双赢的局面。本研究的结果提供了

更新时间: 2025-03-09 10:53:29

领域: cs.HC,cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2309.12365v2

ChatGPT-4 in the Turing Test: A Critical Analysis

This paper critically examines the recent publication "ChatGPT-4 in the Turing Test" by Restrepo Echavarr\'ia (2025), challenging its central claims regarding the absence of minimally serious test implementations and the conclusion that ChatGPT-4 fails the Turing Test. The analysis reveals that the criticisms based on rigid criteria and limited experimental data are not fully justified. More importantly, the paper makes several constructive contributions that enrich our understanding of Turing Test implementations. It demonstrates that two distinct formats--the three-player and two-player tests--are both valid, each with unique methodological implications. The work distinguishes between absolute criteria (reflecting an optimal 50% identification rate in a three-player format) and relative criteria (which measure how closely a machine's performance approximates that of a human), offering a more nuanced evaluation framework. Furthermore, the paper clarifies the probabilistic underpinnings of both test types by modeling them as Bernoulli experiments--correlated in the three-player version and uncorrelated in the two-player version. This formalization allows for a rigorous separation between the theoretical criteria for passing the test, defined in probabilistic terms, and the experimental data that require robust statistical methods for proper interpretation. In doing so, the paper not only refutes key aspects of the criticized study but also lays a solid foundation for future research on objective measures of how closely an AI's behavior aligns with, or deviates from, that of a human being.
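The absolute criterion mentioned above, a 50% identification rate, can be compared with experimental counts through an exact binomial test. The sketch below assumes each trial is an independent Bernoulli outcome ("the interrogator correctly identified the human") and does not reproduce the paper's own statistical procedure; the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(k, n, p=0.5):
    """Exact two-sided test: total probability of outcomes no more likely than the observed one."""
    observed = binom_pmf(k, n, p)
    total = sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, total)
```

A large p-value means the observed count is compatible with chance-level identification, while a small one indicates that interrogators separate human from machine more (or less) often than chance.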

Updated: 2025-03-09 10:43:17

标题: ChatGPT-4 在图灵测试中的应用：一项关键分析

摘要: 本文对最近由Restrepo Echavarr\'ia（2025）发表的《ChatGPT-4在图灵测试中的应用》进行了批判性审查，挑战了其关于缺乏最小程度严肃测试实施和ChatGPT-4未能通过图灵测试的核心主张。分析揭示了基于严格标准和有限实验数据的批评并不完全合理。更重要的是，本文提出了几点建设性的贡献，丰富了我们对图灵测试实施的理解。它表明了两种不同格式——三人和两人测试——都是有效的，每种都具有独特的方法论意义。本文区分了绝对标准（反映三人格式中的最佳50％识别率）和相对标准（衡量机器表现如何接近人类），提供了更加细致的评估框架。此外，本文通过将它们建模为伯努利实验，明确了两种测试类型的概率基础——在三人版本中存在相关性，在两人版本中不存在相关性。这种形式化允许严格地区分通过测试的理论标准（以概率术语定义）和需要健全统计方法进行正确解释的实验数据。通过这样做，本文不仅驳斥了受批评研究的关键方面，还为将来研究提供了坚实的基础，以客观衡量人工智能行为与人类行为的接近程度或偏离程度。

更新时间: 2025-03-09 10:43:17

领域: cs.AI,cs.CY,cs.HC,68T01

下载: http://arxiv.org/abs/2503.06551v1

Path To Gain Functional Transparency In Artificial Intelligence With Meaningful Explainability

Artificial Intelligence (AI) is rapidly integrating into various aspects of our daily lives, influencing decision-making processes in areas such as targeted advertising and matchmaking algorithms. As AI systems become increasingly sophisticated, ensuring their transparency and explainability becomes crucial. Functional transparency is a fundamental aspect of algorithmic decision-making systems, allowing stakeholders to comprehend the inner workings of these systems and enabling them to evaluate their fairness and accuracy. However, achieving functional transparency poses significant challenges that need to be addressed. In this paper, we propose a design for user-centered compliant-by-design transparency in transparent systems. We emphasize that the development of transparent and explainable AI systems is a complex and multidisciplinary endeavor, necessitating collaboration among researchers from diverse fields such as computer science, artificial intelligence, ethics, law, and social science. By providing a comprehensive understanding of the challenges associated with transparency in AI systems and proposing a user-centered design framework, we aim to facilitate the development of AI systems that are accountable, trustworthy, and aligned with societal values.

Updated: 2025-03-09 10:34:16

标题: 通往人工智能功能透明性的路径：具有有意义的解释能力

摘要: 人工智能（AI）正在迅速融入我们日常生活的各个方面，影响着诸如定向广告和匹配算法等领域的决策过程。随着AI系统变得越来越复杂，确保其透明性和可解释性变得至关重要。功能透明性是算法决策系统的一个基本方面，使利益相关者能够理解这些系统的内部运作，并使他们能够评估其公平性和准确性。然而，实现功能透明性面临着需要解决的重大挑战。在本文中，我们提出了一个用户中心的合规性透明设计，以实现透明系统。我们强调，开发透明和可解释的AI系统是一个复杂且跨学科的努力，需要计算机科学、人工智能、伦理学、法律和社会科学等各个领域的研究人员之间的合作。通过全面了解与AI系统透明性相关的挑战，并提出一个用户中心的设计框架，我们旨在促进开发符合社会价值观的、可信赖的AI系统。

更新时间: 2025-03-09 10:34:16

领域: cs.AI

下载: http://arxiv.org/abs/2310.08849v2

AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

While large language models provide significant convenience for software development, they can lead to ethical issues in job interviews and student assignments. Therefore, determining whether a piece of code is written by a human or generated by an artificial intelligence (AI) model is a critical issue. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python code samples, with the AI-generated samples created using CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments conducted with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.

Updated: 2025-03-09 10:31:29

标题: AIGCodeSet：一种用于检测AI生成代码的新注释数据集

摘要: 大型语言模型为软件开发提供了重要的便利，但可能会在面试和学生作业中引发道德问题。因此，确定一段代码是由人类编写还是由人工智能(AI)模型生成是一个关键问题。在这项研究中，我们提出了AIGCodeSet，其中包括2,828个由AI生成和4,755个由人类编写的Python代码，使用了CodeLlama 34B、Codestral 22B和Gemini 1.5 Flash进行创建。此外，我们分享了我们使用基线检测方法进行的实验结果。我们的实验表明，贝叶斯分类器的表现优于其他模型。

更新时间: 2025-03-09 10:31:29

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2412.16594v2

Solving the encoding bottleneck: of the HHL algorithm, by the HHL algorithm

The Harrow-Hassidim-Lloyd (HHL) algorithm offers exponential speedup for solving the quantum linear-system problem. But some caveats for the speedup could be hard to meet. One of the difficulties is the encoding bottleneck, i.e., the efficient preparation of the initial quantum state. To prepare an arbitrary $N$-dimensional state exactly, existing state-preparation approaches generally require a runtime of $O(N)$, which will ruin the speedup of the HHL algorithm. Here we show that the states can be prepared approximately with a runtime of $O(\mathrm{poly}(\log N))$ by employing a slightly modified version of the HHL algorithm itself. Thus, applying this approach to prepare the initial state of the original HHL algorithm can preserve the exponential speedup advantage. It can also serve as a standalone solution for other applications demanding fast state preparation.

Updated: 2025-03-09 10:29:36

标题: 解决编码瓶颈：HHL算法的HHL算法

摘要: HHL算法为解决量子线性系统问题提供了指数级加速。但是要实现加速可能会有一些困难。其中一个困难是编码瓶颈，即高效准备初始量子态。为了准确准备任意的N维状态，现有的状态准备方法通常需要O(N)的运行时间，这将破坏HHL算法的加速效果。在这里，我们展示了通过使用HHL算法的略微修改版本，可以在O(poly(log N))的运行时间内近似准备状态。因此，将这种方法应用于准备原始HHL算法的初始状态可以保持指数级加速优势。它还可以作为独立解决方案，用于其他需要快速状态准备的应用程序。

更新时间: 2025-03-09 10:29:36

领域: quant-ph,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.13534v2

Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation

Understanding and interpreting the internal representations of large language models (LLMs) remains an open challenge. Patchscopes introduced a method for probing internal activations by patching them into new prompts, prompting models to self-explain their hidden representations. We introduce Superscopes, a technique that systematically amplifies superposed features in multilayer perceptron (MLP) outputs and hidden states before patching them into new contexts. Inspired by the "features as directions" perspective and the Classifier-Free Guidance (CFG) approach from diffusion models, Superscopes amplifies weak but meaningful features, enabling the interpretation of internal representations that previous methods failed to explain, all without requiring additional training. This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.

Updated: 2025-03-09 10:27:43

标题: 超范围：放大语言模型解释的内部特征表示

摘要: 理解和解释大型语言模型（LLMs）的内部表示仍然是一个开放性挑战。Patchscopes引入了一种方法，通过将内部激活打补丁到新的提示中，促使模型自我解释其隐藏表示。我们介绍了Superscopes，这是一种技术，系统地放大MLP输出（多层感知器）和隐藏状态中的叠加特征，然后将它们打补丁到新的上下文中。受“特征作为方向”视角和扩散模型中的无分级指导（CFG）方法的启发，Superscopes放大了弱但有意义的特征，使得能够解释以前的方法无法解释的内部表示-而无需额外的训练。这种方法为我们提供了新的见解，帮助理解LLMs如何构建上下文和表达复杂概念，进一步推动了机械可解释性的发展。

更新时间: 2025-03-09 10:27:43

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.02078v2

ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy

Unified models (UniMs) for multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image content. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify the embedding space integrating textual and visual modalities, enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a "what or how to generate" algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at https://armor.github.io.

Updated: 2025-03-09 10:15:39

标题: ARMOR v0.1：通过不对称协同作用实现交错多模态生成的自回归多模态理解模型增强

摘要: 最近，统一模型(UniMs)在视觉和语言领域受到了广泛关注，用于多模态理解和生成。现有的UniMs旨在同时学习多模态理解和生成能力，需要大量的计算资源，并且经常难以生成交错的文本-图像。我们提出了ARMOR，一种资源高效且纯自回归框架，通过微调现有的多模态大型语言模型(MLLMs)实现理解和生成。具体来说，ARMOR从三个方面扩展了现有的MLLMs：(1)对于模型架构，引入了一种带有正向切换机制的不对称编码器-解码器架构，用于统一嵌入空间，整合文本和视觉模态，以便以最小的计算开销实现自然的文本-图像交错生成。(2)对于训练数据，收集了精心策划的高质量交错数据集，用于微调MLLMs。(3)对于训练算法，我们提出了一个“生成什么或如何生成”的算法，通过基于收集的数据集的三个渐进训练阶段，赋予现有的MLLMs多模态生成能力，同时保留它们的多模态理解能力。实验结果表明，ARMOR将现有的MLLMs升级为具有前景图像生成能力的UniMs，同时使用有限的训练资源。我们的代码将很快发布在https://armor.github.io。

更新时间: 2025-03-09 10:15:39

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.06542v1

DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs

In recent years, large language models (LLMs) have had great success in tasks such as casual conversation, contributing to significant advancements in domains like virtual assistance. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space. Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art conversational models, even in adversarial jailbreaking scenarios that challenge response safety. We also highlight DIESEL's generalization capabilities, showing that it can be used in use cases other than safety, providing general-purpose response filtering.
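The reranking step described here can be sketched with toy vectors. The scoring rule below, probability scaled down by the largest cosine similarity to any negative concept, is an assumption made for illustration and is not claimed to be DIESEL's actual formula; the embeddings are hand-written placeholders rather than the output of a real encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rerank(candidates, negative_concepts, alpha=1.0):
    """Sort (token, prob, embedding) triples by a similarity-penalized score.

    alpha controls how strongly closeness to a negative concept is penalized;
    alpha=0.0 leaves the model's original ordering by probability.
    """
    def score(candidate):
        _, prob, emb = candidate
        closeness = max(cosine(emb, neg) for neg in negative_concepts)
        return prob * (1.0 - alpha * max(closeness, 0.0))
    return sorted(candidates, key=score, reverse=True)
```

Because only the ordering of already-proposed tokens changes, a filter of this shape needs no retraining of the underlying model, which matches the lightweight framing in the abstract.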

Updated: 2025-03-09 09:54:02

标题: DIESEL - 通过规避LLMs中的语义嵌入实现动态推理指导

摘要: 近年来，大型语言模型（LLMs）在任务中取得了巨大成功，例如随意对话，为虚拟助手等领域的重大进展做出了贡献。然而，它们经常生成与人类价值观不符的响应（例如，道德标准、安全性），导致潜在的不安全或不适当的输出。虽然已经提出了多种技术来解决这个问题，但它们需要昂贵的计算训练或大幅增加推理时间。在本文中，我们提出了DIESEL，一种轻量级推理引导技术，可以无缝集成到任何自回归LLM中，以从响应中语义过滤不良概念。DIESEL可以作为独立的保护措施，也可以作为额外的防御层，通过根据它们与潜在空间中预定义的负面概念的相似性对LLM提出的标记进行重新排名，增强响应的安全性。我们的评估表明，DIESEL在最先进的对话模型上的有效性，甚至在挑战响应安全性的对抗性越狱场景中也表现出色。我们还强调了DIESEL的泛化能力，表明它可以在除了安全性之外的用例中使用，提供通用响应过滤。

更新时间: 2025-03-09 09:54:02

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2411.19038v2

Conformal Structured Prediction

Conformal prediction has recently emerged as a promising strategy for quantifying the uncertainty of a predictive model; these algorithms modify the model to output sets of labels that are guaranteed to contain the true label with high probability. However, existing conformal prediction algorithms have largely targeted classification and regression settings, where the structure of the prediction set has a simple form as a level set of the scoring function. For complex structured outputs such as text generation, these prediction sets might include a large number of labels and therefore be hard for users to interpret. In this paper, we propose a general framework for conformal prediction in the structured prediction setting that modifies existing conformal prediction algorithms to output structured prediction sets that implicitly represent sets of labels. In addition, we demonstrate how our approach can be applied in domains where the prediction sets can be represented as a set of nodes in a directed acyclic graph; for instance, for hierarchical labels such as image classification, a prediction set might be a small subset of coarse labels implicitly representing the prediction set of all their finer-grained descendants. We demonstrate how our algorithm can be used to construct prediction sets that satisfy a desired coverage guarantee in several domains.
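For background, the standard split conformal recipe for plain classification, which the paper generalizes to structured outputs, fits in a few lines. The sketch below is the textbook version and not the structured algorithm proposed in the paper: it uses the nonconformity score `1 - p(true label)` on a held-out calibration set and keeps every label whose score falls under the calibrated threshold.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration scores for target coverage 1 - alpha."""
    n = len(cal_scores)
    # Rank ceil((n + 1) * (1 - alpha)); the small epsilon guards against float round-up.
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p is at most the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}
```

The level-set structure is visible in `prediction_set`: the output is simply every label scoring below a cutoff, which is what becomes unwieldy when the label space is as large as the space of generated texts.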

Updated: 2025-03-09 09:52:24

标题: 保持一致的结构化预测

摘要: 最近，依从性预测作为一种有前途的策略出现，用于量化预测模型的不确定性；这些算法修改模型以输出一组标签，保证这些标签具有很高的概率包含真实标签。然而，现有的依从性预测算法主要针对分类和回归设置，其中预测集的结构以评分函数的水平集形式简单表示。然而，对于复杂的结构化输出，如文本生成，这些预测集可能包含大量标签，因此用户可能难以解释。在本文中，我们提出了一个通用框架，用于在结构化预测设置中进行依从性预测，修改现有的依从性预测算法以输出隐式表示一组标签的结构化预测集。此外，我们演示了我们的方法如何可以应用于预测集可以表示为有向无环图中的一组节点的领域；例如，对于层次标签如图像分类，预测集可能是一小部分粗粒度标签，隐式表示其更精细后代的预测集。我们演示了我们的算法如何用于构建在几个领域中满足所需覆盖保证的预测集。

更新时间: 2025-03-09 09:52:24

领域: cs.LG

下载: http://arxiv.org/abs/2410.06296v2

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several aspects. Even the best performing GPT-4o lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors with different urban views, especially with respect to understanding cross-view relations.

Updated: 2025-03-09 09:48:31

标题: UrBench：用于评估多视角城市场景中大型多模态模型的综合基准

摘要: 最近对大型多模型模型（LMMs）的评估已经探索了它们在各个领域的能力，但只有少数基准专门关注城市环境。此外，现有的城市基准仅限于评估具有基本区域级城市任务的LMMs在单一视图下的能力，导致对LMMs在城市环境中的能力进行不完整评估。为了解决这些问题，我们提出了UrBench，这是一个专门设计用于评估LMMs在复杂多视角城市场景中的综合基准。UrBench包含11.6K个经过精心策划的问题，涵盖了区域级和角色级两个维度，涵盖了4个任务维度：地理定位、场景推理、场景理解和物体理解，共计14种任务类型。在构建UrBench时，我们利用现有数据集的数据，同时从11个城市收集数据，使用跨视图检测匹配方法创建新的注释。利用这些图像和注释，我们然后整合了基于LMM、基于规则和基于人类的方法，构建大规模高质量的问题。我们对21个LMMs的评估显示，目前的LMMs在城市环境中在多个方面都存在困难。即使表现最好的GPT-4o在大多数任务中也落后于人类，从简单的计数到复杂的方向、定位和物体属性识别等任务，平均表现差距为17.4%。我们的基准还显示，LMMs在不同的城市视图中表现出不一致的行为，特别是在理解跨视图关系方面。

更新时间: 2025-03-09 09:48:31

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2408.17267v3

Comparative clinical evaluation of "memory-efficient" synthetic 3D generative adversarial networks (GAN) head-to-head to state of art: results on computed tomography of the chest

Introduction: Generative Adversarial Networks (GANs) are increasingly used to generate synthetic medical images, addressing the critical shortage of annotated data for training Artificial Intelligence (AI) systems. This study introduces a novel memory-efficient GAN architecture, incorporating Conditional Random Fields (CRFs) to generate high-resolution 3D medical images, and evaluates its performance against the state-of-the-art hierarchical (HA)-GAN model. Materials and Methods: The CRF-GAN was trained using the open-source lung CT LUNA16 dataset. The architecture was compared to HA-GAN through a quantitative evaluation, using Fréchet Inception Distance (FID) and Maximum Mean Discrepancy (MMD) metrics, and a qualitative evaluation, through a two-alternative forced choice (2AFC) test completed by a pool of 12 resident radiologists, in order to assess the realism of the generated images. Results: CRF-GAN outperformed HA-GAN with lower FID (0.047 vs. 0.061) and MMD (0.084 vs. 0.086) scores, indicating better image fidelity. The 2AFC test showed a significant preference for images generated by CRF-GAN over those generated by HA-GAN, with a p-value of 1.93e-05. Additionally, CRF-GAN demonstrated 9.34% lower memory usage at 256 resolution and achieved up to 14.6% faster training speeds, offering substantial computational savings. Discussion: The CRF-GAN model successfully generates high-resolution 3D medical images with non-inferior quality to conventional models, while being more memory-efficient and faster. Computational power and time saved can be used to improve the spatial resolution and anatomical accuracy of generated images, which is still a critical factor limiting their direct clinical applicability.
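Of the two quantitative metrics named above, MMD is the simpler to write down. The sketch below computes the biased estimate of squared MMD with an RBF kernel on small lists of feature vectors. The kernel choice, the bandwidth parameter `gamma`, and the use of raw tuples instead of image features are assumptions; the study's exact evaluation pipeline is not described in the abstract.

```python
import math

def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy
```

The estimate is zero when both samples are identical and grows as the two sets of features move apart, which is why a lower score is read as generated images lying closer to the real ones.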

Updated: 2025-03-09 09:46:24

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2501.15572v2

AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection

As object detection becomes integral to many safety-critical applications, understanding its vulnerabilities is essential. Backdoor attacks, in particular, pose a serious threat by implanting hidden triggers in victim models, which adversaries can later exploit to induce malicious behaviors during inference. However, current understanding is limited to single-target attacks, where adversaries must define a fixed malicious behavior (target) before training, making inference-time adaptability impossible. Given the large output space of object detection (including object existence prediction, bounding box estimation, and classification), the feasibility of flexible, inference-time model control remains unexplored. This paper introduces AnywhereDoor, a multi-target backdoor attack for object detection. Once implanted, AnywhereDoor allows adversaries to make objects disappear, fabricate new ones, or mislabel them, either across all object classes or specific ones, offering an unprecedented degree of control. This flexibility is enabled by three key innovations: (i) objective disentanglement to scale the number of supported targets; (ii) trigger mosaicking to ensure robustness even against region-based detectors; and (iii) strategic batching to address object-level data imbalances that hinder manipulation. Extensive experiments demonstrate that AnywhereDoor grants attackers a high degree of control, improving attack success rates by 26% compared to adaptations of existing methods for such flexible control.

Updated: 2025-03-09 09:24:24

Categories: cs.CR,cs.AI,cs.CV

Download: http://arxiv.org/abs/2503.06529v1

Higher Order Reduced Rank Regression

Reduced Rank Regression (RRR) is a widely used method for multi-response regression. However, RRR assumes a linear relationship between features and responses. While linear models are useful and often provide a good approximation, many real-world problems involve more complex relationships that cannot be adequately captured by simple linear interactions. One way to model such relationships is via multilinear transformations. This paper introduces Higher Order Reduced Rank Regression (HORRR), an extension of RRR that leverages multilinear transformations and is therefore capable of capturing nonlinear interactions in multi-response regression. HORRR employs tensor representations for the coefficients and a Tucker decomposition to impose multilinear rank constraints as regularization, akin to the rank constraints in RRR. Encoding these constraints as a manifold allows us to use Riemannian optimization to solve HORRR problems. We theoretically and empirically analyze the use of Riemannian optimization for solving HORRR problems.

Updated: 2025-03-09 09:21:38

Categories: stat.ML,cs.LG,cs.NA,math.NA,math.OC

Download: http://arxiv.org/abs/2503.06528v1

From Motion Signals to Insights: A Unified Framework for Student Behavior Analysis and Feedback in Physical Education Classes

Analyzing student behavior in educational scenarios is crucial for enhancing teaching quality and student engagement. Existing AI-based models often rely on classroom video footage to identify and analyze student behavior. While these video-based methods can partially capture and analyze student actions, they struggle to accurately track each student's actions in physical education classes, which take place in outdoor, open spaces with diverse activities, and are challenging to generalize to the specialized technical movements involved in these settings. Furthermore, current methods typically lack the ability to integrate specialized pedagogical knowledge, limiting their ability to provide in-depth insights into student behavior and offer feedback for optimizing instructional design. To address these limitations, we propose a unified end-to-end framework that leverages human activity recognition technologies based on motion signals, combined with advanced large language models, to conduct more detailed analyses and feedback of student behavior in physical education classes. Our framework begins with the teacher's instructional designs and the motion signals from students during physical education sessions, ultimately generating automated reports with teaching insights and suggestions for improving both learning and class instructions. This solution provides a motion signal-based approach for analyzing student behavior and optimizing instructional design tailored to physical education classes. Experimental results demonstrate that our framework can accurately identify student behaviors and produce meaningful pedagogical insights.

Updated: 2025-03-09 09:04:36

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2503.06525v1

AdaSVD: Adaptive Singular Value Decomposition for Large Language Models

Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices $\mathcal{U}$ and $\mathcal{V}^\top$. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM/VLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. Code and models of AdaSVD will be available at https://github.com/ZHITENGLI/AdaSVD.
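
The memory saving behind SVD-based compression comes from simple parameter counting: a rank-k factorization of an m x n weight matrix stores k(m + n) values instead of mn. The sketch below turns a target compression ratio into a rank. Defining the ratio as the fraction of parameters removed is an assumption, and the helper names and layer shapes are hypothetical; this does not reproduce adaComp or adaCR.

```python
def factored_params(m, n, k):
    """Parameters stored by a rank-k factorization of an m x n matrix."""
    return k * (m + n)

def rank_for_ratio(m, n, ratio):
    """Largest rank whose factors keep at most (1 - ratio) of the m * n parameters.

    `ratio` is taken to be the fraction of parameters removed (an assumption).
    """
    budget = (1.0 - ratio) * m * n
    return max(1, int(budget // (m + n)))

# Hypothetical 4096 x 4096 projection layer under two different per-layer ratios.
print(rank_for_ratio(4096, 4096, 0.5))    # → 1024
print(rank_for_ratio(4096, 4096, 0.75))   # → 512
print(factored_params(4096, 4096, 1024))  # → 8388608
```

Assigning a smaller ratio to layers judged more important, as the abstract describes for adaCR, would simply mean calling `rank_for_ratio` with a different `ratio` per layer.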

Updated: 2025-03-09 09:04:18

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2502.01403v3

Generative AI as Digital Media

Generative AI is frequently portrayed as revolutionary or even apocalyptic, prompting calls for novel regulatory approaches. This essay argues that such views are misguided. Instead, generative AI should be understood as an evolutionary step in the broader algorithmic media landscape, alongside search engines and social media. Like these platforms, generative AI centralizes information control, relies on complex algorithms to shape content, and extensively uses user data, thus perpetuating common problems: unchecked corporate power, echo chambers, and weakened traditional gatekeepers. Regulation should therefore share a consistent objective: ensuring media institutions remain trustworthy. Without trust, public discourse risks fragmenting into isolated communities dominated by comforting, tribal beliefs -- a threat intensified by generative AI's capacity to bypass gatekeepers and personalize truth. Current governance frameworks, such as the EU's AI Act and the US Executive Order 14110, emphasize reactive risk mitigation, addressing measurable threats like national security, public health, and algorithmic bias. While effective for novel technological risks, this reactive approach fails to adequately address broader issues of trust and legitimacy inherent to digital media. Proactive regulation fostering transparency, accountability, and public confidence is essential. Viewing generative AI exclusively as revolutionary risks repeating past regulatory failures that left social media and search engines insufficiently regulated. Instead, regulation must proactively shape an algorithmic media environment serving the public good, supporting quality information and robust civic discourse.

Updated: 2025-03-09 08:58:17

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2503.06523v1

LiteNeXt: A Novel Lightweight ConvMixer-based Model with Self-embedding Representation Parallel for Medical Image Segmentation

The emergence of deep learning techniques has advanced the image segmentation task, especially for medical images. Many neural network models introduced in the last decade have brought automated segmentation accuracy close to that of manual segmentation. However, cutting-edge models such as Transformer-based architectures rely on large-scale annotated training data and are generally designed with densely consecutive layers in the encoder, decoder, and skip connections, resulting in a large number of parameters. Additionally, for better performance, they are often pretrained on larger datasets, which requires more memory and increases resource expenses. In this study, we propose a new lightweight but efficient model for medical image segmentation, named LiteNeXt, based on convolutions and mixing modules with a simplified decoder. The model is trained from scratch with a small number of parameters (0.71M) and Giga Floating Point Operations Per Second (0.42). To handle fuzzy boundaries as well as occlusion or clutter in objects, especially in medical image regions, we propose the Marginal Weight Loss, which helps to effectively determine the marginal boundary between object and background. Additionally, the Self-embedding Representation Parallel technique is proposed as an innovative data augmentation strategy that utilizes the network architecture itself for self-learning augmentation, enhancing feature extraction robustness without external data. Experiments on public datasets including Data Science Bowls, GlaS, ISIC2018, PH2, Sunnybrook, and Lung X-ray data show promising results compared to other state-of-the-art CNN-based and Transformer-based architectures. Our code is released at: https://github.com/tranngocduvnvp/LiteNeXt.

Updated: 2025-03-09 08:54:13

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2405.15779v2

Derivation of Output Correlation Inferences for Multi-Output (aka Multi-Task) Gaussian Process

The Gaussian process (GP) is arguably one of the most widely used machine learning algorithms in practice. One of its prominent applications is Bayesian optimization (BO). Although the vanilla GP is already a powerful tool for BO, it is often beneficial to consider the dependencies among multiple outputs. The Multi-task GP (MTGP) was formulated for this purpose, but it is not trivial to fully understand the derivations of its formulations and their gradients from the previous literature. This paper provides reader-friendly derivations of the MTGP formulations and their gradients.

Updated: 2025-03-09 08:53:55

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2501.07964v2

Bayesian WeakS-to-Strong from Text Classification to Generation

Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
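
For readers unfamiliar with direct preference optimization, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The abstract does not say how preference pairs are built from the weak models' supervision, so the log-probability values and `beta` used here are hypothetical placeholders.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed token log-probability of a response under
    the trainable policy (student) or the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_gain - rejected_gain))

# Hypothetical log-probabilities: the policy favors the chosen response
# more strongly than the reference does, so the loss drops below log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss shrinks as the policy raises the chosen response relative to the rejected one.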

Updated: 2025-03-09 08:52:56

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.03199v3

CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

Updated: 2025-03-09 08:47:34

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2412.12932v3

Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation

Small language models (SLMs) have emerged as promising alternatives to large language models (LLMs) due to their low computational demands, enhanced privacy guarantees, and comparable performance in specific domains through lightweight fine-tuning. Deploying SLMs on edge devices, such as smartphones and smart vehicles, has become a growing trend. However, the security implications of SLMs have received less attention than those of LLMs, particularly regarding jailbreak attacks, which OWASP recognizes as one of the top threats to LLMs. In this paper, we conduct the first large-scale empirical study of SLMs' vulnerabilities to jailbreak attacks. Through a systematic evaluation of 63 SLMs from 15 mainstream SLM families against 8 state-of-the-art jailbreak methods, we demonstrate that 47.6% of the evaluated SLMs show high susceptibility to jailbreak attacks (ASR > 40%) and 38.1% of them cannot even resist direct harmful queries (ASR > 50%). We further analyze the reasons behind the vulnerabilities and identify four key factors: model size, model architecture, training datasets, and training techniques. Moreover, we assess the effectiveness of three prompt-level defense methods and find that none of them achieves perfect performance, with detection accuracy varying across different SLMs and attack methods. Notably, we point out that inherent security awareness plays a critical role in SLM security, and that models with strong security awareness can promptly terminate unsafe responses with minimal reminders. Building upon these findings, we highlight the urgent need for security-by-design approaches in SLM development and provide valuable insights for building a more trustworthy SLM ecosystem.

Updated: 2025-03-09 08:47:16

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2503.06519v1

Towards Superior Quantization Accuracy: A Layer-sensitive Approach

Large Vision and Language Models have exhibited remarkable human-like intelligence in tasks such as natural language comprehension, problem-solving, logical reasoning, and knowledge retrieval. However, training and serving these models require substantial computational resources, posing a significant barrier to their widespread application and further research. To mitigate this challenge, various model compression techniques have been developed to reduce computational requirements. Nevertheless, existing methods often employ uniform quantization configurations, failing to account for the varying difficulties across different layers in quantizing large neural network models. This paper tackles this issue by leveraging layer-sensitivity features, such as activation sensitivity and weight distribution Kurtosis, to identify layers that are challenging to quantize accurately and allocate additional memory budget. The proposed methods, named SensiBoost and KurtBoost, respectively, demonstrate notable improvement in quantization accuracy, achieving up to 9% lower perplexity with only a 2% increase in memory budget on LLama models compared to the baseline.
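
One way to picture the kurtosis-based idea is to score each layer by the kurtosis of its weights and give the heaviest-tailed layers a larger bit budget. The sketch below does exactly that and nothing more: the selection rule, the bit widths, the boost fraction, and the toy weights are all assumptions, and it is not the SensiBoost or KurtBoost procedure itself.

```python
def kurtosis(values):
    """Population kurtosis m4 / m2**2 (about 3 for Gaussian-like data).

    Assumes the values are not all identical, so the variance is non-zero.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)

def allocate_bits(layer_weights, base_bits=4, boosted_bits=8, boost_fraction=0.25):
    """Give `boosted_bits` to the highest-kurtosis layers, `base_bits` to the rest."""
    scores = {name: kurtosis(w) for name, w in layer_weights.items()}
    n_boost = max(1, round(boost_fraction * len(scores)))
    boosted = set(sorted(scores, key=scores.get, reverse=True)[:n_boost])
    return {name: (boosted_bits if name in boosted else base_bits) for name in scores}

# Hypothetical flattened weights; layer "b" has two large outliers.
layers = {
    "a": [-1.0, 1.0, -1.0, 1.0],
    "b": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, -10.0],
    "c": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "d": [-1.0, 0.0, 1.0],
}
print(allocate_bits(layers))  # → {'a': 4, 'b': 8, 'c': 4, 'd': 4}
```

Outlier-heavy weight distributions are harder to cover with a uniform quantization grid, which is the intuition for spending the extra memory on high-kurtosis layers.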

Updated: 2025-03-09 08:45:03

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06518v1

DMin: Scalable Training Data Influence Estimation for Diffusion Models

Identifying the training data samples that most influence a generated image is a critical task in understanding diffusion models (DMs), yet existing influence estimation methods are constrained to small-scale or LoRA-tuned models due to computational limitations. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. To the best of our knowledge, it is the first method capable of influence estimation for DMs with billions of parameters. Leveraging efficient gradient compression, DMin reduces storage requirements from hundreds of TBs to mere MBs or even KBs, and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate DMin is both effective in identifying influential training samples and efficient in terms of computational and storage requirements.

Updated: 2025-03-09 08:41:48

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.08637v2

Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control

Humanoid robots require both robust lower-body locomotion and precise upper-body manipulation. While recent Reinforcement Learning (RL) approaches provide whole-body loco-manipulation policies, they lack precise manipulation with high-DoF arms. In this paper, we propose decoupling upper-body control from locomotion, using inverse kinematics (IK) and motion retargeting for precise manipulation, while RL focuses on robust lower-body locomotion. We introduce PMP (Predictive Motion Priors), trained with a Conditional Variational Autoencoder (CVAE) to effectively represent upper-body motions. The locomotion policy is trained conditioned on this upper-body motion representation, ensuring that the system remains robust during both manipulation and locomotion. We show that CVAE features are crucial for stability and robustness, and that our approach significantly outperforms RL-based whole-body control in precise manipulation. With precise upper-body motion and robust lower-body locomotion control, operators can remotely control the humanoid to walk around and explore different environments while performing diverse manipulation tasks.

Updated: 2025-03-09 08:41:46

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.07773v2

Instance-wise Supervision-level Optimization in Active Learning

Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost. This code is available at https://github.com/matsuo-shinnosuke/ISOAL.
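
The budgeted selection described in this abstract can be pictured as a greedy pass over (instance, annotation level) candidates ranked by value-to-cost ratio. The sketch below is a simplification under stated assumptions: the value estimates and costs are hypothetical inputs, and the diversity term that ISO also enforces is left out entirely.

```python
def select_by_vcr(candidates, budget):
    """Greedy selection by value-to-cost ratio (VCR).

    candidates: list of (instance_id, level, value, cost) tuples.
    Picks at most one annotation level per instance without exceeding `budget`.
    """
    ranked = sorted(candidates, key=lambda c: c[2] / c[3], reverse=True)
    chosen, spent = {}, 0.0
    for instance_id, level, value, cost in ranked:
        if instance_id in chosen or spent + cost > budget:
            continue  # already annotated at a better-ranked level, or too expensive
        chosen[instance_id] = level
        spent += cost
    return chosen, spent

# Hypothetical values and costs: full labels cost 4 units, weak labels cost 1.
candidates = [
    ("x1", "full", 1.0, 4.0), ("x1", "weak", 0.6, 1.0),
    ("x2", "full", 2.0, 4.0), ("x2", "weak", 0.3, 1.0),
    ("x3", "full", 0.8, 4.0), ("x3", "weak", 0.1, 1.0),
]
print(select_by_vcr(candidates, budget=6.0))
# → ({'x1': 'weak', 'x2': 'full', 'x3': 'weak'}, 6.0)
```

In this toy run only `x2` gets a full label, because it is the one instance where the extra cost buys a proportionally large value.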

Updated: 2025-03-09 08:39:22

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.06517v1

GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.

Updated: 2025-03-09 08:38:10

Categories: cs.CL,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.06514v1

HFedCKD: Toward Robust Heterogeneous Federated Learning via Data-free Knowledge Distillation and Two-way Contrast

Most current federated learning frameworks are modeled as static processes, ignoring the dynamic characteristics of the learning system. Under the limited communication budget of the central server, the flexible model architecture of a large number of clients participating in knowledge transfer requires a lower participation rate, active clients have uneven contributions, and the client scale seriously hinders the performance of FL. We consider a more general and practical federation scenario and propose a system heterogeneous federation method based on data-free knowledge distillation and two-way contrast (HFedCKD). We apply the Inverse Probability Weighted Distillation (IPWD) strategy to the data-free knowledge transfer framework. The generator completes the data features of the nonparticipating clients. IPWD implements a dynamic evaluation of the prediction contribution of each client under different data distributions. Based on the antibiased weighting of its prediction loss, the weight distribution of each client is effectively adjusted to fairly integrate the knowledge of participating clients. At the same time, the local model is split into a feature extractor and a classifier. Through differential contrast learning, the feature extractor is aligned with the global model in the feature space, while the classifier maintains personalized decision-making capabilities. HFedCKD effectively alleviates the knowledge offset caused by a low participation rate under data-free knowledge distillation and improves the performance and stability of the model. We conduct extensive experiments on image and IoT datasets to comprehensively evaluate and verify the generalization and robustness of the proposed HFedCKD framework.
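
The inverse-probability weighting idea can be illustrated generically: clients that rarely participate receive larger weights, so their distillation losses are not drowned out by frequent participants. The sketch below shows plain normalized inverse-probability weights; the participation probabilities and per-client losses are hypothetical, and the abstract does not specify how IPWD actually estimates or applies them.

```python
def ipw_weights(participation_probs):
    """Normalized inverse-probability weights, one per client."""
    inverse = {cid: 1.0 / p for cid, p in participation_probs.items()}
    total = sum(inverse.values())
    return {cid: w / total for cid, w in inverse.items()}

def weighted_loss(losses, weights):
    """Weighted sum of per-client prediction (distillation) losses."""
    return sum(weights[cid] * losses[cid] for cid in losses)

# Hypothetical clients: "b" participates half as often as "a".
probs = {"a": 0.5, "b": 0.25}
weights = ipw_weights(probs)
print(weights)  # "b" receives twice the weight of "a"
print(weighted_loss({"a": 3.0, "b": 0.0}, weights))
```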

Updated: 2025-03-09 08:32:57

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06511v1

A Light and Tuning-free Method for Simulating Camera Motion in Video Generation

Existing camera motion-controlled video generation methods face computational bottlenecks in fine-tuning and inference. This paper proposes LightMotion, a light and tuning-free method for simulating camera motion in video generation. Operating in the latent space, it eliminates additional fine-tuning, inpainting, and depth estimation, making it more streamlined than existing methods. The endeavors of this paper comprise: (i) The latent space permutation operation effectively simulates various camera motions like panning, zooming, and rotation. (ii) The latent space resampling strategy combines background-aware sampling and cross-frame alignment to accurately fill new perspectives while maintaining coherence across frames. (iii) Our in-depth analysis shows that the permutation and resampling cause an SNR shift in latent space, leading to poor-quality generation. To address this, we propose latent space correction, which reintroduces noise during denoising to mitigate SNR shift and enhance video generation quality. Exhaustive experiments show that our LightMotion outperforms existing methods, both quantitatively and qualitatively.

Updated: 2025-03-09 08:28:40

标题: 一个轻量且无需调参的方法用于在视频生成中模拟相机运动

摘要: 现有的摄像机运动控制视频生成方法在微调和推断方面面临计算瓶颈。本文提出了一种称为LightMotion的轻量级、无需微调的方法，用于模拟视频生成中的摄像机运动。在潜在空间中操作，它消除了额外的微调、修补和深度估计，使其比现有方法更为简洁。本文的努力包括：(i)潜在空间置换操作有效地模拟了诸如平移、缩放和旋转等各种摄像机运动。(ii)潜在空间重采样策略结合了背景感知采样和跨帧对齐，以准确填补新视角同时保持帧间的连贯性。(iii)我们的深入分析显示，置换和重采样导致了潜在空间中的信噪比偏移，导致生成质量较差。为了解决这个问题，我们提出了潜在空间校正，通过在去噪过程中重新引入噪声来减轻信噪比偏移，并增强视频生成质量。详尽的实验表明，我们的LightMotion在定量和定性方面都优于现有方法。

更新时间: 2025-03-09 08:28:40

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06508v1

Semantic Wave Functions: Exploring Meaning in Large Language Models Through Quantum Formalism

Large Language Models (LLMs) encode semantic relationships in high-dimensional vector embeddings. This paper explores the analogy between LLM embedding spaces and quantum mechanics, positing that LLMs operate within a quantized semantic space where words and phrases behave as quantum states. To capture nuanced semantic interference effects, we extend the standard real-valued embedding space to the complex domain, drawing parallels to the double-slit experiment. We introduce a "semantic wave function" to formalize this quantum-derived representation and utilize potential landscapes, such as the double-well potential, to model semantic ambiguity. Furthermore, we propose a complex-valued similarity measure that incorporates both magnitude and phase information, enabling a more sensitive comparison of semantic representations. We develop a path integral formalism, based on a nonlinear Schrödinger equation with a gauge field and Mexican hat potential, to model the dynamic evolution of LLM behavior. This interdisciplinary approach offers a new theoretical framework for understanding and potentially manipulating LLMs, with the goal of advancing both artificial and natural language understanding.
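One way to picture a similarity measure that carries both magnitude and phase is the normalized Hermitian inner product of two complex vectors. The sketch below is an assumption for illustration only: the abstract does not state the exact measure the paper proposes, and the function names are hypothetical.

```python
import cmath
import math


def complex_similarity(u, v):
    """Normalized Hermitian inner product <u, v> / (|u| |v|).

    The result is a complex number: its modulus plays the role of a cosine
    similarity, and its argument is the relative phase between the vectors.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    inner = sum(a.conjugate() * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(abs(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(abs(b) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vectors have no defined similarity")
    return inner / (norm_u * norm_v)


def magnitude_and_phase(z):
    """Split a complex similarity into (magnitude, phase in radians)."""
    return abs(z), cmath.phase(z)


# Two vectors that differ only by a 90-degree phase rotation.
u = [1 + 0j, 0j]
v = [1j, 0j]
magnitude, phase = magnitude_and_phase(complex_similarity(u, v))
```

A real-valued cosine similarity would discard the rotation between `u` and `v`; keeping the phase is what allows interference-like effects to be represented.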

Updated: 2025-03-09 08:23:31

标题: 语义波函数：通过量子形式主义探索大型语言模型中的含义

摘要: 大型语言模型（LLMs）编码高维向量嵌入中的语义关系。本文探讨了LLM嵌入空间和量子力学之间的类比，认为LLMs在一个量子化的语义空间中运作，其中单词和短语表现为量子态。为了捕捉微妙的语义干扰效应，我们将标准的实值嵌入空间扩展到复数域，类比于双缝实验。我们引入了一个“语义波函数”来形式化这种基于量子的表示，并利用潜在的景观，比如双井势，来模拟语义模糊性。此外，我们提出了一个复值相似度度量，结合了幅度和相位信息，使得语义表示的比较更加敏感。我们开发了一个基于非线性Schr\"odinger方程的路径积分形式主义，带有规范场和墨西哥帽势能，来模拟LLM行为的动态演化。这种跨学科方法提供了一个新的理论框架，用于理解和潜在地操纵LLMs，旨在推动人工和自然语言理解的发展。

更新时间: 2025-03-09 08:23:31

Categories: cs.CL,cs.LG,quant-ph

Download: http://arxiv.org/abs/2503.10664v1

Visual Privacy Auditing with Diffusion Models

Data reconstruction attacks on machine learning models pose a substantial threat to privacy, potentially leaking sensitive information. Although defending against such attacks using differential privacy (DP) provides theoretical guarantees, determining appropriate DP parameters remains challenging. Current formal guarantees on the success of data reconstruction suffer from overly stringent assumptions regarding adversary knowledge about the target data, particularly in the image domain, raising questions about their real-world applicability. In this work, we empirically investigate this discrepancy by introducing a reconstruction attack based on diffusion models (DMs) that only assumes adversary access to real-world image priors and specifically targets the DP defense. We find that (1) real-world data priors significantly influence reconstruction success, (2) current reconstruction bounds do not model the risk posed by data priors well, and (3) DMs can serve as heuristic auditing tools for visualizing privacy leakage.

Updated: 2025-03-09 08:21:00

标题: 使用扩散模型进行视觉隐私审计

摘要: 对机器学习模型的数据重建攻击构成了对隐私的重大威胁，可能泄露敏感信息。虽然使用差分隐私（DP）来抵御此类攻击提供了理论保证，但确定适当的DP参数仍然具有挑战性。目前对数据重建成功的正式保证受到了对于对手对目标数据的知识的过于严格的假设的影响，特别是在图像领域，引发了关于其实际适用性的疑问。在这项工作中，我们通过引入基于扩散模型（DMs）的重建攻击，仅假设对手能够访问真实世界的图像先验，并专门针对DP防御进行了经验性研究。我们发现（1）真实世界数据先验显著影响重建成功，（2）当前的重建界限并未很好地模拟数据先验所带来的风险，（3）DMs可以作为启发式审计工具，用于可视化隐私泄漏。

更新时间: 2025-03-09 08:21:00

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2403.07588v2

Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation

Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts remains a formidable challenge, with failures such as missing entities, attribute binding errors, and incorrect relationships. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the extracted constraints from the text, without any unnecessary additions. These constraints are formulated as losses (entity missing, entity mixing, attribute binding, and spatial relationships) and integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to carry out the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at https://github.com/hadi-hosseini/noise-refinement.

Updated: 2025-03-09 08:18:43

标题: 细粒度对齐和噪声细化用于组合文本到图像生成

摘要: 文本到图像生成模型在近年来取得了显著进展；然而，准确捕捉文本提示中的复杂细节，如实体缺失、属性绑定错误和不正确的关系仍然是一个巨大挑战。为此，我们提出了一种创新的、无需训练的方法，通过融入定制目标以解决这些挑战。与基于布局的方法不同，这些方法强制实施严格的结构并限制多样性，我们提出的方法通过仅仅施加从文本中提取的约束，而不添加任何不必要的内容，提供了一个更灵活的场景布局。这些约束被制定为损失函数 - 实体缺失、实体混合、属性绑定和空间关系，并集成到一个统一的损失中，应用于第一代阶段。此外，我们引入了一个反馈驱动系统用于细粒度的初始噪声细化。该系统集成了一个验证器，评估生成的图像，识别不一致之处，并提供纠正性反馈。利用这个反馈，我们的改进方法首先通过优化与这些约束相关的选择性损失，通过优化引起初始噪声的错误注意力图来解决未达到的约束。随后，我们的统一损失函数被重新应用，以进行第二代阶段的进展。实验结果表明，我们的方法仅依赖于我们提出的目标函数，显著提高了组合性能，人类评估结果提高了24%，空间关系提高了25%。此外，我们的细粒度噪声细化方法证明了有效性，性能提升了高达5%。源代码可在https://github.com/hadi-hosseini/noise-refinement获取。

更新时间: 2025-03-09 08:18:43

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.06506v1

DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

Recent advancements in text-to-image generation have spurred interest in personalized human image generation, which aims to create novel images featuring specific human identities as reference images indicate. Although existing methods achieve high-fidelity identity preservation, they often struggle with limited multi-ID usability and inadequate facial editability. We present DynamicID, a tuning-free framework supported by a dual-stage training paradigm that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the original model when injecting ID features and achieve multi-ID personalization without requiring multi-ID samples during training. 2) Identity-Motion Reconfigurator (IMR), which leverages contrastive learning to effectively disentangle and re-entangle facial motion and identity features, thereby enabling flexible facial editing. Additionally, we have developed a curated VariFace-10k facial dataset, comprising 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability.

Updated: 2025-03-09 08:16:19

标题: DynamicID：灵活面部可编辑的零镜头多身份图像个性化

摘要: 最近在文本到图像生成方面的进展引发了对个性化人像生成的兴趣，旨在创建具有特定人物身份的新颖图像，参考图像指示。尽管现有方法实现了高保真度的身份保留，但它们经常在有限的多个身份可用性和不足的面部可编辑性方面遇到困难。我们提出了DynamicID，这是一个无需调整的框架，支持双阶段训练范式，从根本上促进了单一身份和多个身份的个性化生成，保真度高且面部可编辑性灵活。我们的关键创新包括：1）语义激活注意力（SAA），利用查询级激活门控制，减少向原始模型注入身份特征时对原模型的干扰，实现多个身份个性化，无需在训练过程中使用多个身份样本。2）身份-动作重构器（IMR），利用对比学习有效地解开和重新编织面部动作和身份特征，从而实现灵活的面部编辑。此外，我们还开发了一个精心策划的VariFace-10k面部数据集，包括10k个独特的个体，每个个体由35个不同的面部图像表示。实验结果表明，DynamicID在身份保真度、面部可编辑性和多个身份个性化能力方面优于现有方法。

更新时间: 2025-03-09 08:16:19

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06505v1

Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections

The shared topology of human skeletons motivated the recent investigation of graph convolutional network (GCN) solutions for action recognition. However, most of the existing GCNs rely on the binary connection of two neighboring vertices (joints) formed by an edge (bone), overlooking the potential of constructing multi-vertex convolution structures. Although some studies have attempted to utilize hyper-graphs to represent the topology, they rely on a fixed construction strategy, which limits their adaptivity in uncovering the intricate latent relationships within the action. In this paper, we address this oversight and explore the merits of an adaptive hyper-graph convolutional network (Hyper-GCN) to achieve the aggregation of rich semantic information conveyed by skeleton vertices. In particular, our Hyper-GCN adaptively optimises the hyper-graphs during training, revealing the action-driven multi-vertex relations. Besides, virtual connections are often designed to support efficient feature aggregation, implicitly extending the spectrum of dependencies within the skeleton. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted. The results of experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets demonstrate the merits of our Hyper-GCN, compared to the state-of-the-art methods. Specifically, we outperform the existing solutions on NTU-120, achieving 90.5\% and 91.7\% in terms of the top-1 recognition accuracy on X-Sub and X-Set.

Updated: 2025-03-09 08:14:25

标题: 自适应超图卷积网络用于基于骨架的人体动作识别与虚拟连接

摘要: 人类骨骼的共享拓扑结构激发了最近对动作识别的图卷积网络（GCN）解决方案的研究。然而，大多数现有的GCNs依赖于由边（骨骼）形成的两个相邻顶点（关节）的二进制连接，忽视了构建多顶点卷积结构的潜力。虽然一些研究尝试利用超图来表示拓扑结构，但它们依赖于固定的构建策略，这限制了它们在揭示动作中复杂潜在关系方面的适应性。在本文中，我们解决了这一疏忽，并探讨了自适应超图卷积网络（Hyper-GCN）的优点，以实现对骨骼顶点传达的丰富语义信息的聚合。特别是，我们的Hyper-GCN在训练过程中自适应优化超图，揭示了由动作驱动的多顶点关系。此外，虚拟连接通常被设计为支持高效的特征聚合，隐含地扩展了骨骼内的依赖关系范围。通过将虚拟连接注入超图，可以突出显示不同动作类别的语义线索。对NTU-60、NTU-120和NW-UCLA数据集进行的实验结果显示了我们的Hyper-GCN相对于最先进方法的优点。具体而言，我们在NTU-120上的表现优于现有解决方案，在X-Sub和X-Set的前1识别准确率分别达到90.5%和91.7%。

更新时间: 2025-03-09 08:14:25

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2411.14796v2

Summary of the NOTSOFAR-1 Challenge: Highlights and Learnings

The first Natural Office Talkers in Settings of Far-field Audio Recordings (NOTSOFAR-1) Challenge is a pivotal initiative that sets new benchmarks by offering datasets more representative of the needs of real-world business applications than those previously available. The challenge provides a unique combination of 280 recorded meetings across 30 diverse environments, capturing real-world acoustic conditions and conversational dynamics, and a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. In this paper, we provide an overview of the systems submitted to the challenge and analyze the top-performing approaches, hypothesizing the factors behind their success. Additionally, we highlight promising directions left unexplored by participants. By presenting key findings and actionable insights, this work aims to drive further innovation and progress in DASR research and applications.

Updated: 2025-03-09 08:01:06

标题: NOTSOFAR-1挑战赛总结：重点和收获

摘要: NOTSOFAR-1挑战是一个重要的举措，通过提供比以往更符合真实商业应用需求的数据集，设立了新的标准。该挑战提供了280个录制的会议在30个不同环境中的数据集，捕捉了真实世界的声学条件和对话动态，并提供了一个1000小时的模拟训练数据集，通过增强真实性来实现真实世界的泛化，包含了15000个真实声学传递函数。在本文中，我们提供了对挑战中提交的系统的概述，并分析了表现最佳的方法，推测其成功的因素。此外，我们还突出了参与者未探索的有前途的方向。通过呈现关键发现和可操作见解，本研究旨在推动DASR研究和应用的进一步创新和进展。

更新时间: 2025-03-09 08:01:06

Categories: cs.SD,cs.LG,eess.AS

Download: http://arxiv.org/abs/2501.17304v2

ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) a Motion Base Construction, which builds a gesture library using the training dataset; (2) a Motion Retrieval Module, employing contrastive learning and momentum distillation for fine-grained reference pose retrieval; and (3) a Precision Control Module, integrating partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2% and improves motion diversity by 5.3% over EMAGE, with user studies revealing a 71.3% preference for its naturalness and semantic relevance. Code will be released upon acceptance.

Updated: 2025-03-09 07:59:39

标题: ExGes: 用于音频驱动手势合成的表现性人体动作检索和调制

摘要: 音频驱动的人体手势合成是一项至关重要的任务，在虚拟化身、人机交互和创意内容生成等广泛应用中具有重要意义。尽管取得了显著进展，但现有方法通常产生粗糙、缺乏表现力并且未能完全与音频语义对齐的手势。为了解决这些挑战，我们提出了一种新颖的检索增强扩散框架ExGes，具有三个关键设计：（1）基于动作的构造，利用训练数据集构建手势库；（2）动作检索模块，采用对比学习和动量蒸馏进行细粒度参考姿势检索；以及（3）精度控制模块，整合部分遮罩和随机遮罩以实现灵活和细粒度控制。在BEAT2上的实验评估表明，ExGes将弗雷歇手势距离减少了6.2％，并且比EMAGE提高了5.3％的动作多样性，用户研究显示，71.3％的用户更偏好其自然性和语义相关性。代码将在接受后发布。

更新时间: 2025-03-09 07:59:39

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06499v1

Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving

Assessing the safety of vision-language models (VLMs) in autonomous driving is particularly important; however, existing work mainly focuses on traditional benchmark evaluations. As interactive components within autonomous driving systems, VLMs must maintain strong safety cognition during interactions. From this perspective, we propose a novel evaluation method: Safety Cognitive Driving Benchmark (SCD-Bench). To address the large-scale annotation challenge for SCD-Bench, we develop the Autonomous Driving Image-Text Annotation System (ADA). Additionally, to ensure data quality in SCD-Bench, our dataset undergoes manual refinement by experts with professional knowledge in autonomous driving. We further develop an automated evaluation method based on large language models (LLMs). To verify its effectiveness, we compare its evaluation results with those of expert human evaluations, achieving a consistency rate of 99.74%. Preliminary experimental results indicate that existing open-source models still lack sufficient safety cognition, showing a significant gap compared to GPT-4o. Notably, lightweight models (1B-4B) demonstrate minimal safety cognition. However, since lightweight models are crucial for autonomous driving systems, this presents a significant challenge for integrating VLMs into the field.

Updated: 2025-03-09 07:53:19

标题: 自动驾驶中视觉语言模型的安全认知能力评估

摘要: 评估自动驾驶中视觉-语言模型（VLMs）的安全性尤为重要；然而，现有工作主要集中在传统基准评估上。作为自动驾驶系统中的交互组件，VLMs在交互过程中必须保持强大的安全认知能力。从这个角度出发，我们提出了一种新颖的评估方法：安全认知驾驶基准（SCD-Bench）。为了解决SCD-Bench的大规模注释挑战，我们开发了自动驾驶图像-文本注释系统（ADA）。此外，为了确保SCD-Bench中数据的质量，我们的数据集经过专业知识的专家手工精炼。我们进一步基于大型语言模型（LLMs）开发了自动评估方法。为验证其有效性，我们将其评估结果与专家人工评估进行比较，达到一致性率为99.74%。初步实验结果表明，现有开源模型仍然缺乏足够的安全认知能力，与GPT-4o相比存在显著差距。值得注意的是，轻量级模型（1B-4B）表现出最低的安全认知能力。然而，由于轻量级模型对自动驾驶系统至关重要，这为将VLMs整合到该领域中提出了重大挑战。

更新时间: 2025-03-09 07:53:19

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06497v1

A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases

Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.
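The point about accuracy on imbalanced data can be made concrete with a small worked example. The sketch below uses made-up labels (95 legitimate items, 5 deceptive ones) and a degenerate classifier that never flags anything; the data and function names are illustrative assumptions, not results from the reviewed studies.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = deceptive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1, with 0.0 where a ratio is undefined."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced sample: 95 legitimate (0) and 5 deceptive (1) items.
y_true = [0] * 95 + [1] * 5
# A classifier that labels everything as legitimate.
y_pred = [0] * 100
result = metrics(y_true, y_pred)
# accuracy is 0.95 even though recall and F1 are both 0.0
```

The classifier misses every deceptive item yet still scores 95% accuracy, which is why precision, recall, and F1 are the more informative metrics in this setting.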

Updated: 2025-03-09 07:42:04

标题: 社交媒体欺诈活动检测的机器学习方法、挑战和偏见的系统性回顾

摘要: 社交媒体平台如Twitter、Facebook和Instagram已经促进了虚假信息的传播，这需要自动检测系统。这篇系统性综述评估了36项研究，这些研究应用了机器学习（ML）和深度学习（DL）模型来检测社交媒体上的假新闻、垃圾信息和虚假账号。使用预测模型偏倚评估工具（PROBAST），综述确定了ML生命周期中的关键偏倚：由于非代表性抽样而引起的选择偏倚，对类别不平衡的不充分处理，语言预处理不足（例如否定）和超参数调整不一致。尽管支持向量机（SVM）、随机森林和长短期记忆（LSTM）网络等模型显示出强大的潜力，但在不平衡数据设置中过度依赖准确度作为评估指标是一个常见缺陷。这篇综述强调了改进数据预处理（例如重采样技术）、一致的超参数调整和使用适当的指标如精确度、召回率、F1分数和AUROC的需求。解决这些限制可以导致更可靠和可推广的ML/DL模型，从而有助于减少社交媒体上的虚假信息。

更新时间: 2025-03-09 07:42:04

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2410.20293v4

Enhancing Malware Fingerprinting through Analysis of Evasive Techniques

As malware detection evolves, attackers adopt sophisticated evasion tactics. Traditional file-level fingerprinting, such as cryptographic and fuzzy hashes, is often overlooked as a target for evasion. Malware variants exploit minor binary modifications to bypass detection, as seen in Microsoft's discovery of GoldMax variations (2020-2021). However, no large-scale empirical studies have assessed the limitations of traditional fingerprinting methods on real-world malware samples or explored improvements. This paper fills this gap by addressing three key questions: (a) How prevalent are file variants in malware samples? Analyzing 4 million Windows Portable Executable (PE) files, 21 million sections, and 48 million resources, we find up to 80% deep structural similarities, including common APIs and executable sections. (b) What evasion techniques are used? We identify resilient fingerprints (clusters of malware variants with high similarity) validated via VirusTotal. Our analysis reveals non-functional mutations, such as altered section numbers, virtual sizes, and section names, as primary evasion tactics. We also classify two key section types: malicious sections (high entropy >5) and camouflage sections (entropy = 0). (c) How can fingerprinting be improved? We propose two novel approaches that enhance detection, improving identification rates from 20% (traditional methods) to over 50% using our refined fingerprinting techniques. Our findings highlight the limitations of existing methods and propose new strategies to strengthen malware fingerprinting against evolving threats.
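The entropy thresholds mentioned for section types can be illustrated with a byte-level Shannon entropy calculation. This is a minimal sketch that assumes entropy is measured in bits per byte (range 0 to 8); the abstract does not state the unit or how the authors compute it, and the function names and the "ordinary" label are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def classify_section(data: bytes, high_threshold: float = 5.0) -> str:
    """Bucket a section by entropy using the thresholds quoted in the abstract.

    Entropy alone does not prove a section is malicious; the labels only
    mirror the two entropy ranges the abstract singles out.
    """
    h = shannon_entropy(data)
    if h == 0.0:
        return "camouflage"       # a single repeated byte value, or empty
    if h > high_threshold:
        return "high-entropy"     # e.g. packed or encrypted content
    return "ordinary"


padding = b"\x00" * 64            # constant bytes: entropy 0
uniform = bytes(range(256))       # every byte value once: entropy 8
labels = [classify_section(padding), classify_section(uniform)]
```

A section filled with one repeated byte scores exactly 0, while compressed or encrypted payloads approach 8 bits per byte, which is why the two extremes are easy to separate.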

Updated: 2025-03-09 07:41:49

标题: 通过分析规避技术增强恶意软件指纹识别

摘要: 随着恶意软件检测的发展，攻击者采用了复杂的规避策略。传统的文件级指纹识别，如加密和模糊哈希，经常被忽视作为规避的目标。恶意软件变种利用微小的二进制修改来规避检测，正如微软发现的GoldMax变种（2020-2021年）。然而，没有大规模的实证研究评估传统指纹方法在现实世界恶意软件样本上的限制或探索改进。本文通过回答三个关键问题填补了这一空白：（a）恶意软件样本中文件变种有多普遍？分析了400万个Windows可移植可执行文件（PE）文件，2100万个节和4800万个资源，我们发现高达80%的深层结构相似性，包括常见的API和可执行节。（b）使用了什么规避技术？我们识别出经由VirusTotal验证的具有高相似性的恶意软件变种的弹性指纹（指纹集群）。我们的分析揭示了非功能性突变，如修改的节编号、虚拟大小和节名称，作为主要规避策略。我们还对两种关键节类型进行分类：恶意节（高熵>5）和伪装节（熵=0）。（c）如何改进指纹识别？我们提出了两种增强检测的新方法，使用我们精细的指纹识别技术，将识别率从20%（传统方法）提高到50%以上。我们的研究结果突出了现有方法的局限性，并提出了新的策略，以加强对不断演变的威胁的恶意软件指纹识别。

更新时间: 2025-03-09 07:41:49

Categories: cs.CR

Download: http://arxiv.org/abs/2503.06495v1

Dialogue Systems for Emotional Support via Value Reinforcement

Emotional support dialogue systems aim to reduce help-seekers' distress and help them overcome challenges. While human values (core beliefs that shape an individual's priorities) are increasingly emphasized in contemporary psychological therapy for their role in fostering internal transformation and long-term emotional well-being, their integration into emotional support systems remains underexplored. To bridge this gap, we present a value-driven method for training emotional support dialogue systems designed to reinforce positive values in seekers. Notably, our model identifies which values to reinforce at each turn and how to do so, by leveraging online support conversations from Reddit. We evaluate the method across support skills, seekers' emotional intensity, and value reinforcement. Our method consistently outperforms various baselines, effectively exploring and eliciting values from seekers. Additionally, leveraging crowd knowledge from Reddit significantly enhances its effectiveness. Therapists highlighted its ability to validate seekers' challenges and emphasize positive aspects of their situations, both crucial elements of value reinforcement. Our work, being the first to integrate value reinforcement into emotional support systems, demonstrates its promise and establishes a foundation for future research.

Updated: 2025-03-09 07:37:22

标题: 对话系统通过价值强化提供情感支持

摘要: 情感支持对话系统旨在减少寻求帮助者的痛苦，并帮助他们克服挑战。尽管人类价值观-塑造个体优先事项的核心信念-在当代心理治疗中越来越受重视，因为它们在促进内在转变和长期情感幸福感方面起着重要作用，但它们在情感支持系统中的整合仍未得到充分探讨。为了弥合这一差距，我们提出了一种基于价值驱动的方法，用于训练旨在加强求助者正面价值观的情感支持对话系统。值得注意的是，我们的模型通过利用来自Reddit的在线支持对话，确定了在每个转折点上要加强哪些价值观以及如何加强。我们通过支持技能、求助者的情感强度和价值观的强化来评估该方法。我们的方法始终优于各种基线，有效地探索和引发求助者的价值观。此外，利用Reddit上的群体知识显著增强了其效果。治疗师强调了该方法验证求助者的挑战并强调他们情况的积极方面的能力-这两者都是价值观强化的关键要素。我们的工作是第一个将价值观强化整合到情感支持系统中的工作，展示了它的潜力，并为未来研究奠定了基础。

更新时间: 2025-03-09 07:37:22

Categories: cs.CL,cs.AI,cs.CY,cs.HC,I.2.7

Download: http://arxiv.org/abs/2501.17182v2

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Long-context video modeling is critical for multimodal large language models (MLLMs), enabling them to process movies, online video streams, and so on. Despite recent advances, handling long videos remains challenging due to the difficulty in efficiently understanding the extremely long video context. This paper aims to address this issue from aspects of model architecture, training data, training strategy and evaluation benchmark. First, we propose a novel Hierarchical video token Compression (HiCo) method, which leverages visual redundancy in long videos to compress long video context from Clip-level to Video-level, reducing the computation significantly while preserving essential details, achieving an extreme compression ratio of approximately 1/50 with almost no performance loss. Second, we introduce a multi-stage short-to-long learning scheme, a large-scale dataset of real-world long videos named LongVid, and a challenging "Multi-Hop Needle-In-A-Video-Haystack" benchmark. Finally, we build a powerful video MLLM named VideoChat-Flash, which shows a leading performance on both mainstream long and short video benchmarks at the 2B and 7B model scales. Among open-source models, it is the first to reach 99.1% accuracy over 10,000 frames in NIAH.

Updated: 2025-03-09 07:32:35

标题: 视频聊天-Flash：用于长上下文视频建模的分层压缩

摘要: 长上下文视频建模对于多模态大语言模型（MLLMs）至关重要，使它们能够处理电影、在线视频流等。尽管取得了进展，但由于有效理解极长视频上下文而仍然具有挑战性。本文旨在从模型架构、训练数据、训练策略和评估基准等方面解决这一问题。首先，我们提出了一种新颖的分层视频令牌压缩（HiCo）方法，利用长视频中的视觉冗余将长视频上下文从剪辑级别压缩到视频级别，显著减少计算量同时保留关键细节，实现了近乎无性能损失的极端压缩比率约为1/50。其次，我们引入了一个多阶段短到长学习方案，一个名为LongVid的实际长视频大规模数据集，以及一个具有挑战性的“多跳视频中的一根针”基准测试。最后，我们构建了一个强大的视频MLLM命名为VideoChat-Flash，在2B和7B模型规模上在主流长短视频基准测试中表现出领先性能。在NIAH中，它首次在开源模型中获得了超过10,000帧的99.1%准确率。

更新时间: 2025-03-09 07:32:35

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2501.00574v3

Temporal Analysis of NetFlow Datasets for Network Intrusion Detection Systems

This paper investigates the temporal analysis of NetFlow datasets for machine learning (ML)-based network intrusion detection systems (NIDS). Although many previous studies have highlighted the critical role of temporal features, such as inter-packet arrival time and flow length/duration, in NIDS, the currently available NetFlow datasets for NIDS lack these temporal features. This study addresses this gap by creating and making publicly available a set of NetFlow datasets that incorporate these temporal features [1]. With these temporal features, we provide a comprehensive temporal analysis of NetFlow datasets by examining the distribution of various features over time and presenting time-series representations of NetFlow features. This temporal analysis has not been previously provided in the existing literature. We also borrow an idea from signal processing, time-frequency analysis, and test how much the time-frequency signal presentations (TFSPs) differ across attacks. The results indicate that many attacks have unique patterns, which could help ML models to identify them more easily.
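The temporal features named above, inter-packet arrival time and flow duration, can be derived from packet timestamps as in the following minimal sketch. It assumes per-packet arrival times (in seconds) are available for a single flow; the function name and output keys are hypothetical and do not correspond to field names in the published datasets.

```python
def temporal_features(timestamps):
    """Compute simple temporal features for one flow.

    timestamps: packet arrival times in seconds (any order, at least one).
    """
    if not timestamps:
        raise ValueError("a flow needs at least one packet timestamp")
    ts = sorted(timestamps)
    # Inter-arrival times: gaps between consecutive packets.
    iats = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return {
        "packet_count": len(ts),
        "duration": ts[-1] - ts[0],
        "mean_iat": sum(iats) / len(iats) if iats else 0.0,
        "max_iat": max(iats, default=0.0),
    }


# Three packets arriving at t = 0 s, 1 s, and 3 s.
features = temporal_features([0.0, 1.0, 3.0])
```

Features like these give a classifier a view of traffic rhythm (bursts, long idle gaps) that byte and packet counts alone do not capture.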

Updated: 2025-03-09 07:31:18

标题: 网络入侵检测系统中NetFlow数据集的时间分析

摘要: 本文研究了基于机器学习（ML）的网络入侵检测系统（NIDS）的NetFlow数据集的时间分析。尽管许多先前的研究已经强调了时间特征在NIDS中的关键作用，例如数据包到达时间和流长度/持续时间，但目前可用的供NIDS使用的NetFlow数据集缺乏这些时间特征。本研究通过创建并公开一个包含这些时间特征的NetFlow数据集来填补这一空白。借助这些时间特征，我们对NetFlow数据集进行了全面的时间分析，通过检查各种特征随时间的分布并呈现NetFlow特征的时间序列表示。这种时间分析以前在现有文献中尚未提供。我们还借鉴了信号处理中的时间频率分析的思想，并进行了测试，以查看不同攻击的时间频率信号表示（TFSPs）有多大差异。结果表明许多攻击具有独特的模式，这有助于ML模型更容易地识别它们。

更新时间: 2025-03-09 07:31:18

Categories: cs.LG,cs.CR,cs.NI

Download: http://arxiv.org/abs/2503.04404v2

MoFE: Mixture of Frozen Experts Architecture

We propose the Mixture of Frozen Experts (MoFE) architecture, which integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network (FFN) layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while still allowing for effective knowledge transfer from the expert models. This facilitates the creation of models proficient in multiple domains. We conduct experiments to evaluate the trade-offs between performance and efficiency, compare MoFE with other PEFT methodologies, assess the impact of domain expertise in the constituent models, and determine the optimal training strategy. The results show that, although there may be some trade-offs in performance, the efficiency gains are substantial, making MoFE a reasonable solution for real-world, resource-constrained environments.

Updated: 2025-03-09 07:24:36

标题: MoFE：冻结专家混合架构

摘要: 我们提出了Mixture of Frozen Experts（MoFE）架构，该架构集成了Parameter-efficient Fine-tuning（PEFT）和Mixture of Experts（MoE）架构，以增强训练效率和模型可扩展性。通过在MoE框架内冻结前馈网络（FFN）层，MoFE显著减少了可训练参数的数量，提高了训练效率，同时仍允许有效地从专家模型中进行知识转移。这有助于创建精通多个领域的模型。我们进行了实验，评估了性能和效率之间的权衡，将MoFE与其他PEFT方法进行比较，评估了组成模型中领域专业知识的影响，并确定了最佳训练策略。结果表明，尽管在性能方面可能会有一些权衡，但效率提升是巨大的，使MoFE成为资源受限的现实环境的合理解决方案。

更新时间: 2025-03-09 07:24:36

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.06491v1

Fourier Circuits in Neural Networks and Transformers: A Case Study of Modular Arithmetic with Multiple Inputs

In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study embarks on an exploration of the underlying reasons behind networks adopting specific computational strategies. We direct our focus to the complex algebraic learning task of modular addition involving $k$ inputs. Our research presents a thorough analytical characterization of the features learned by stylized one-hidden layer neural networks and one-layer Transformers in addressing this task. A cornerstone of our theoretical framework is the elucidation of how the principle of margin maximization shapes the features adopted by one-hidden layer neural networks. Let $p$ denote the modulus, $D_p$ denote the dataset of modular arithmetic with $k$ inputs and $m$ denote the network width. We demonstrate that with a neuron count of $m \geq 2^{2k-2} \cdot (p-1)$, these networks attain a maximum $L_{2,k+1}$-margin on the dataset $D_p$. Furthermore, we establish that each hidden-layer neuron aligns with a specific Fourier spectrum, integral to solving modular addition problems. By correlating our findings with the empirical observations of similar studies, we contribute to a deeper comprehension of the intrinsic computational mechanisms of neural networks. Furthermore, we observe similar computational mechanisms in attention matrices of one-layer Transformers. Our work stands as a significant stride in unraveling their operational complexities, particularly in the realm of complex algebraic tasks.
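To get a feel for the width bound $m \geq 2^{2k-2} \cdot (p-1)$ stated above, the following sketch evaluates it for a few settings and builds the corresponding modular addition dataset. The function names are hypothetical, and the dataset layout (a tuple of $k$ inputs paired with their sum modulo $p$) is an assumption about how $D_p$ would be enumerated.

```python
from itertools import product


def min_width(k: int, p: int) -> int:
    """Smallest width m satisfying m >= 2^(2k-2) * (p-1)."""
    if k < 1 or p < 2:
        raise ValueError("need k >= 1 inputs and modulus p >= 2")
    return 2 ** (2 * k - 2) * (p - 1)


def modular_addition_dataset(k: int, p: int):
    """All (inputs, label) pairs with label = (a_1 + ... + a_k) mod p."""
    return [(inputs, sum(inputs) % p) for inputs in product(range(p), repeat=k)]


# The bound grows by a factor of 4 with each additional input.
width_two_inputs = min_width(2, 97)     # 2^2 * 96 = 384
width_three_inputs = min_width(3, 97)   # 2^4 * 96 = 1536
small_dataset = modular_addition_dataset(2, 5)  # 5^2 = 25 examples
```

The required width is linear in the modulus $p$ but exponential in the number of inputs $k$, while the dataset itself has $p^k$ examples.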

Updated: 2025-03-09 07:14:46

标题: 神经网络和变压器中的傅立叶电路：多输入模数算术的案例研究

摘要: 在机器学习不断发展的领域中，一个关键挑战在于解读神经网络和Transformer所利用的内部表示。借鉴近期对网络如何执行不同目标函数的理解进展，我们的研究开始探索网络采用特定计算策略背后的原因。我们将焦点放在涉及$k$个输入的模块化加法这一复杂代数学习任务上。我们的研究在解决这一任务时，对具有风格化单隐藏层神经网络和一层Transformer所学习的特征进行了深入的分析。我们理论框架的一个基石是澄清边缘最大化原则如何塑造单隐藏层神经网络所采用的特征。让$p$表示模数，$D_p$表示具有$k$个输入的模块算术数据集，$m$表示网络宽度。我们证明，在神经元计数$m \geq 2^{2k-2} \cdot (p-1)$的情况下，这些网络在数据集$D_p$上实现最大$ L_{2,k+1} $边缘。此外，我们建立了每个隐藏层神经元与解决模块化加法问题密切相关的特定傅里叶谱的对齐关系。通过将我们的发现与类似研究的经验观察相互关联，我们有助于更深入地理解神经网络的固有计算机制。此外，我们在一层Transformer的注意力矩阵中观察到类似的计算机制。我们的工作在揭示它们的操作复杂性方面取得了重要进展，特别是在复杂代数任务领域。

更新时间: 2025-03-09 07:14:46

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2402.09469v4

A Study of Effectiveness of Brand Domain Identification Features for Phishing Detection in 2025

Phishing websites continue to pose a significant security challenge, making the development of robust detection mechanisms essential. Brand Domain Identification (BDI) serves as a crucial step in many phishing detection approaches. This study systematically evaluates the effectiveness of features employed over the past decade for BDI, focusing on their weighted importance in phishing detection as of 2025. The primary objective is to determine whether the identified brand domain matches the claimed domain, utilizing popular features for phishing detection. To validate feature importance and evaluate performance, we conducted two experiments on a dataset comprising 4,667 legitimate sites and 4,561 phishing sites. In Experiment 1, we used the Weka tool's four attribute ranking evaluators to identify optimized and important feature sets from five features: CN Information (CN), Logo Domain (LD), Form Action Domain (FAD), Most Common Link in Domain (MCLD), and Cookie Domain. The results revealed that none of the features were redundant, and Random Forest emerged as the best classifier, achieving an impressive accuracy of 99.7% with an average response time of 0.08 seconds. In Experiment 2, we trained five machine learning models (Random Forest, Decision Tree, Support Vector Machine, Multilayer Perceptron, and XGBoost) to assess the performance of individual BDI features and their combinations. The results demonstrated an accuracy of 99.8%, achieved with combinations of only three features, (Most Common Link Domain, Logo Domain, Form Action) and (Most Common Link Domain, CN Info, Logo Domain), with Random Forest as the best classifier. This study underscores the importance of leveraging key domain features for efficient phishing detection and paves the way for the development of real-time, scalable detection systems.

Updated: 2025-03-09 07:14:04

Categories: cs.CR

Download: http://arxiv.org/abs/2503.06487v1

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures caption quality at the concept level. We therefore introduce HalFscore, a novel metric that is built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.

Updated: 2025-03-09 07:07:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06486v1

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. However, Large Language Models (LLMs) struggle with simple tasks such as arranging 3D assets in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.

Updated: 2025-03-09 07:05:27

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2412.02193v2

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Vision-Language Pre-training (VLP) models have exhibited unprecedented capability in many applications by taking full advantage of the multimodal alignment. However, previous studies have shown they are vulnerable to maliciously crafted adversarial samples. Despite recent success, these methods are generally instance-specific and require generating perturbations for each input sample. In this paper, we reveal that VLP models are also vulnerable to the instance-agnostic universal adversarial perturbation (UAP). Specifically, we design a novel Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) to achieve the attack. Given that the pivotal multimodal alignment is achieved through the advanced contrastive learning technique, we turn this powerful weapon against the models themselves: we employ a malicious version of contrastive learning to train the C-PGC on our carefully crafted positive and negative image-text pairs, essentially destroying the alignment relationship learned by VLP models. In addition, C-PGC fully utilizes the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information as effective guidance. Extensive experiments show that C-PGC successfully forces adversarial samples to move away from their original area in the VLP model's feature space, thus essentially enhancing attacks across various victim models and V+L tasks. The GitHub repository is available at https://github.com/ffhibnese/CPGC_VLP_Universal_Attacks.

Updated: 2025-03-09 07:02:23

Categories: cs.CV,cs.CR

Download: http://arxiv.org/abs/2406.05491v3

Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms

Accurate sign language understanding serves as a crucial communication channel for individuals with disabilities. Current sign language translation algorithms predominantly rely on RGB frames, which may be limited by fixed frame rates, variable lighting conditions, and motion blur caused by rapid hand movements. Inspired by the recent successful application of event cameras in other fields, we propose to leverage event streams to assist RGB cameras in capturing gesture data, addressing the various challenges mentioned above. Specifically, we first collect a large-scale RGB-Event sign language translation dataset using the DVS346 camera, termed VECSL, which contains 15,676 RGB-Event samples, 15,191 glosses, and covers 2,568 Chinese characters. These samples were gathered across a diverse range of indoor and outdoor environments, capturing multiple viewing angles, varying light intensities, and different camera motions. Due to the absence of benchmark algorithms for comparison in this new task, we retrained and evaluated multiple state-of-the-art SLT algorithms, and believe that this benchmark can effectively support subsequent related research. Additionally, we propose a novel RGB-Event sign language translation framework (i.e., M$^2$-SLT) that incorporates fine-grained micro-sign and coarse-grained macro-sign retrieval, achieving state-of-the-art results on the proposed dataset. Both the source code and dataset will be released on https://github.com/Event-AHU/OpenESL.

Updated: 2025-03-09 06:55:46

Categories: cs.CV,cs.AI,cs.NE

Download: http://arxiv.org/abs/2503.06484v1

The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing

Models are expected to engage in invariance learning, which involves distinguishing the core relations that remain consistent across varying environments to ensure the predictions are safe, robust and fair. While existing works consider specific algorithms to realize invariance learning, we show that a model has the potential to learn invariance through standard training procedures. In other words, this paper studies the implicit bias of Stochastic Gradient Descent (SGD) over heterogeneous data and shows that the implicit bias drives the model learning towards an invariant solution. We call this phenomenon implicit invariance learning. Specifically, we theoretically investigate the multi-environment low-rank matrix sensing problem where in each environment, the signal comprises (i) a lower-rank invariant part shared across all environments; and (ii) a significantly varying environment-dependent spurious component. The key insight is that, by simply employing large-step-size, large-batch SGD sequentially in each environment without any explicit regularization, the oscillation caused by heterogeneity can provably prevent the model from learning spurious signals. The model reaches the invariant solution after a certain number of iterations. In contrast, a model trained using pooled SGD over all data would simultaneously learn both the invariant and spurious signals. Overall, we unveil another implicit bias that results from the symbiosis between the heterogeneity of data and modern algorithms, which is, to the best of our knowledge, the first in the literature.

Updated: 2025-03-09 06:47:55

Categories: cs.LG,math.OC

Download: http://arxiv.org/abs/2403.01420v4

KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation

Although widely adopted for evaluating generated audio signals, the Fr\'echet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
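
As a rough illustration of the kind of quantity an MMD-based distance computes, the following sketch implements the standard unbiased estimator of squared MMD with a Gaussian RBF kernel. It is not the kadtk implementation: the function names, the kernel choice, the bandwidth, and the random vectors standing in for audio embeddings are all assumptions made for this example.

```python
import math
import random


def rbf_kernel(x, y, bandwidth):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel_within(zs, bandwidth):
    """Average kernel value over all ordered pairs i != j of one sample set."""
    n = len(zs)
    total = sum(rbf_kernel(zs[i], zs[j], bandwidth)
                for i in range(n) for j in range(n) if i != j)
    # Excluding the i == j terms is what makes the estimator unbiased.
    return total / (n * (n - 1))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two sample sets."""
    cross = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    cross /= len(xs) * len(ys)
    return (mean_kernel_within(xs, bandwidth)
            + mean_kernel_within(ys, bandwidth)
            - 2.0 * cross)


# Hypothetical stand-ins for embedding vectors (not real audio embeddings).
rng = random.Random(0)
real = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(60)]
same = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(60)]
shifted = [[rng.gauss(1.5, 1.0) for _ in range(4)] for _ in range(60)]

near = mmd2_unbiased(real, same, bandwidth=2.0)
far = mmd2_unbiased(real, shifted, bandwidth=2.0)
print(f"same distribution: {near:.4f}, shifted distribution: {far:.4f}")
```

The estimate needs no Gaussian fit to the embeddings, only pairwise kernel evaluations. Because the estimator is unbiased, it can come out slightly negative when the two sample sets come from the same distribution.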

Updated: 2025-03-09 06:46:13

Categories: cs.SD,cs.AI,cs.LG,eess.AS

Download: http://arxiv.org/abs/2502.15602v2

State space models can express n-gram languages

Recent advancements in recurrent neural networks (RNNs) have reinvigorated interest in their application to natural language processing tasks, particularly with the development of more efficient and parallelizable variants known as state space models (SSMs), which have shown competitive performance against transformer models while maintaining a lower memory footprint. While RNNs and SSMs (e.g., Mamba) have been empirically more successful than rule-based systems based on n-gram models, a rigorous theoretical explanation for this success has not yet been developed, as it is unclear how these models encode the combinatorial rules that govern the next-word prediction task. In this paper, we construct state space language models that can solve the next-word prediction task for languages generated from n-gram rules, thereby showing that the former are more expressive. Our proof shows how SSMs can encode n-gram rules using new theoretical results on their memorization capacity, and demonstrates how their context window can be controlled by restricting the spectrum of the state transition matrix. We conduct experiments with a small dataset generated from n-gram rules to show how our framework can be applied to SSMs and RNNs obtained through gradient-based optimization.
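
To make the link between the transition matrix's spectrum and the context window concrete, here is a toy sketch, not the construction used in the paper's proof. A linear recurrence h_t = A h_{t-1} + B x_t with a block shift matrix A (strictly lower triangular, so every eigenvalue is 0 and A is nilpotent) keeps exactly the last `context_len` one-hot tokens and forgets everything older. The helper names, the three-symbol vocabulary, and the trigram rule table are hypothetical.

```python
def make_shift_ssm(vocab_size, context_len):
    """Build (A, B) so the state stores the last `context_len` one-hot tokens."""
    dim = vocab_size * context_len
    A = [[0.0] * dim for _ in range(dim)]
    for i in range(vocab_size, dim):
        A[i][i - vocab_size] = 1.0  # block k copies block k-1 (shift by one slot)
    B = [[0.0] * vocab_size for _ in range(dim)]
    for i in range(vocab_size):
        B[i][i] = 1.0  # the newest token is written into block 0
    return A, B


def ssm_step(A, B, h, x):
    """One step of the linear recurrence h_t = A h_{t-1} + B x_t."""
    return [sum(a * hj for a, hj in zip(row_a, h))
            + sum(b * xj for b, xj in zip(row_b, x))
            for row_a, row_b in zip(A, B)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def decode_context(h, vocab_size):
    """Read token ids out of the state, oldest first (None for empty slots)."""
    blocks = [h[i:i + vocab_size] for i in range(0, len(h), vocab_size)]
    ids = [block.index(1.0) if 1.0 in block else None for block in blocks]
    return tuple(reversed(ids))


vocab = ["a", "b", "c"]
# Hypothetical trigram rules: the previous two tokens determine the next one.
rules = {("a", "b"): "c", ("b", "c"): "a", ("c", "a"): "b"}

A, B = make_shift_ssm(vocab_size=len(vocab), context_len=2)
h = [0.0] * (len(vocab) * 2)
for token in ["a", "b", "c", "a"]:
    h = ssm_step(A, B, h, one_hot(vocab.index(token), len(vocab)))

context_ids = decode_context(h, len(vocab))
context = tuple(vocab[i] for i in context_ids)
predicted = rules[context]
print(context, "->", predicted)
```

After the four input tokens the state holds only the last two, ("c", "a"), and the rule table returns "b". The lookup table stands in for a learned readout; the point of the sketch is only that a zero spectrum caps the memory at `context_len` tokens.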

Updated: 2025-03-09 06:40:39

Categories: cs.CL,cs.LG,I.2.7

Download: http://arxiv.org/abs/2306.17184v3

ExKG-LLM: Leveraging Large Language Models for Automated Expansion of Cognitive Neuroscience Knowledge Graphs

The paper introduces ExKG-LLM, a framework designed to automate the expansion of cognitive neuroscience knowledge graphs (CNKG) using large language models (LLMs). It addresses limitations in existing tools by enhancing accuracy, completeness, and usefulness in CNKG. The framework leverages a large dataset of scientific papers and clinical reports, applying state-of-the-art LLMs to extract, optimize, and integrate new entities and relationships. Evaluation metrics include precision, recall, and graph density. Results show significant improvements: precision (0.80, +6.67%), recall (0.81, +15.71%), F1 score (0.805, +11.81%), and increased edge nodes (21.13% and 31.92%). Graph density slightly decreased, reflecting a broader but more fragmented structure. Engagement rates rose by 20%, while CNKG diameter increased to 15, indicating a more distributed structure. Time complexity improved to O(n log n), but space complexity rose to O(n^2), indicating higher memory usage. ExKG-LLM demonstrates potential for enhancing knowledge generation, semantic search, and clinical decision-making in cognitive neuroscience, adaptable to broader scientific fields.
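
The reported F1 score can be checked against the reported precision and recall with the usual harmonic-mean formula. The short sketch below does that, and also backs out the baseline precision and recall under the assumption (not stated in the abstract) that the percentages in parentheses are relative gains over a baseline. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def baseline_from_relative_gain(value, gain_percent):
    """Baseline implied by `value`, assuming `gain_percent` is a relative gain."""
    return value / (1.0 + gain_percent / 100.0)


f1 = f1_score(0.80, 0.81)
baseline_precision = baseline_from_relative_gain(0.80, 6.67)
baseline_recall = baseline_from_relative_gain(0.81, 15.71)

print(round(f1, 3))
print(round(baseline_precision, 2), round(baseline_recall, 2))
```

With precision 0.80 and recall 0.81 the formula gives roughly 0.805, matching the abstract. Under the relative-gain reading, the implied baselines are about 0.75 precision and 0.70 recall.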

Updated: 2025-03-09 06:32:56

Categories: cs.AI

Download: http://arxiv.org/abs/2503.06479v1

PDB: Not All Drivers Are the Same -- A Personalized Dataset for Understanding Driving Behavior

Driving behavior is inherently personal, influenced by individual habits, decision-making styles, and physiological states. However, most existing datasets treat all drivers as homogeneous, overlooking driver-specific variability. To address this gap, we introduce the Personalized Driving Behavior (PDB) dataset, a multi-modal dataset designed to capture personalization in driving behavior under naturalistic driving conditions. Unlike conventional datasets, PDB minimizes external influences by maintaining consistent routes, vehicles, and lighting conditions across sessions. It includes sources from 128-line LiDAR, front-facing camera video, GNSS, 9-axis IMU, CAN bus data (throttle, brake, steering angle), and driver-specific signals such as facial video and heart rate. The dataset features 12 participants, approximately 270,000 LiDAR frames, 1.6 million images, and 6.6 TB of raw sensor data. The processed trajectory dataset consists of 1,669 segments, each spanning 10 seconds with a 0.2-second interval. By explicitly capturing drivers' behavior, PDB serves as a unique resource for human factor analysis, driver identification, and personalized mobility applications, contributing to the development of human-centric intelligent transportation systems.

Updated: 2025-03-09 06:28:39

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06477v1

SKG-LLM: Developing a Mathematical Model for Stroke Knowledge Graph Construction Using Large Language Models

The purpose of this study is to introduce SKG-LLM. A knowledge graph (KG) is constructed from stroke-related articles using mathematical and large language models (LLMs). SKG-LLM extracts and organizes complex relationships from the biomedical literature and uses them to increase the accuracy and depth of the KG in stroke research. In the proposed method, GPT-4 was used for data pre-processing, and the extraction of embeddings was also done by GPT-4 throughout the KG construction process. The performance of the proposed model was tested with two evaluation criteria: precision and recall. For further validation of the proposed model, GPT-4 was used. Compared with Wikidata and WN18RR, the proposed KG-LLM approach performs better, especially in precision and recall. By including GPT-4 in the preprocessing process, the SKG-LLM model achieved a precision score of 0.906 and a recall score of 0.923. After expert review, precision rose to 0.923 and recall was 0.918. The knowledge graph constructed by SKG-LLM contains 2692 nodes and 5012 edges, spanning 13 distinct node types and 24 edge types.

Updated: 2025-03-09 06:25:37

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.06475v1

Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy

Synthetic Data Generation (SDG) based on Artificial Intelligence (AI) can transform the way clinical medicine is delivered by overcoming privacy barriers that currently render clinical data sharing difficult. This is the key to accelerating the development of digital tools contributing to enhanced patient safety. Such tools include robust data-driven clinical decision support systems, and example-based digital training tools that will enable healthcare professionals to improve their diagnostic performance for enhanced patient safety. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. Its scientific contributions include a) a novel protocol for the systematic Clinical Evaluation of Medical Image Synthesis (CEMIS); b) a novel variational autoencoder-based model for the generation of high-resolution synthetic WCE images; and c) a comprehensive evaluation of the synthetic images using the CEMIS protocol by 10 international WCE specialists, in terms of image quality, diversity, and realism, as well as their utility for clinical decision-making. The results show that TIDE-II generates clinically plausible, very realistic WCE images, of improved quality compared to relevant state-of-the-art generative models. In conclusion, CEMIS can serve as a reference for future research on medical image-generation techniques, while adapting or extending the architecture of TIDE-II to other imaging domains appears promising.

Updated: 2025-03-09 06:23:54

Categories: cs.CV,cs.AI,eess.IV

Download: http://arxiv.org/abs/2411.00178v2

HuixiangDou2: A Robustly Optimized GraphRAG Approach

Large Language Models (LLMs) perform well on familiar queries but struggle with specialized or emerging topics. Graph-based Retrieval-Augmented Generation (GraphRAG) addresses this by structuring domain knowledge as a graph for dynamic retrieval. However, existing pipelines involve complex engineering workflows, making it difficult to isolate the impact of individual components. Evaluating retrieval effectiveness is also challenging due to dataset overlap with LLM pretraining data. In this work, we introduce HuixiangDou2, a robustly optimized GraphRAG framework. Specifically, we leverage the effectiveness of dual-level retrieval and optimize its performance in a 32k context for maximum precision, and compare logic-based retrieval and dual-level retrieval to enhance overall functionality. Our implementation includes comparative experiments on a test set, where Qwen2.5-7B-Instruct initially underperformed. With our approach, the score improved significantly from 60 to 74.5, as illustrated in the Figure. Experiments on domain-specific datasets reveal that dual-level retrieval enhances fuzzy matching, while logic-form retrieval improves structured reasoning. Furthermore, we propose a multi-stage verification mechanism to improve retrieval robustness without increasing computational cost. Empirical results show significant accuracy gains over baselines, highlighting the importance of adaptive retrieval. To support research and adoption, we release HuixiangDou2 as an open-source resource https://github.com/tpoisonooo/huixiangdou2.

Updated: 2025-03-09 06:20:24

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.06474v1

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30\% reduction in training time while enhancing performance in tasks such as image classification and object detection.
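
The redundancy measure described above can be illustrated with a small sketch that computes the KL divergence between the attention weights of adjacent layers and flags layers that barely differ from their predecessor. The fixed threshold used here is a simple stand-in and is not the EBQM procedure, whose details the abstract does not give. The direction of the divergence, the function names, and the example weights are likewise assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def redundant_layers(attention_weights, threshold):
    """Indices of layers whose weights are nearly identical to the previous layer's."""
    return [i for i in range(1, len(attention_weights))
            if kl_divergence(attention_weights[i], attention_weights[i - 1]) < threshold]


# Hypothetical per-layer attention distributions over three earlier layers.
layers = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],  # almost the same as the previous layer
    [0.10, 0.30, 0.60],  # clearly different
]
skipped = redundant_layers(layers, threshold=0.01)
print(skipped)
```

Here only layer 1 is flagged, since its divergence from layer 0 is on the order of 1e-4 while layer 2 diverges from layer 1 by roughly 1.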

Updated: 2025-03-09 06:20:11

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.06473v1

Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing

The design of artificial neural networks (ANNs) is inspired by the structure of the human brain, and in turn, ANNs offer a potential means to interpret and understand brain signals. Existing methods primarily align brain signals with real-world signals using Mean Squared Error (MSE), which solely focuses on local point-wise alignment, and ignores global matching, leading to coarse interpretations and inaccuracies in brain signal decoding. In this paper, we address these issues through optimal transport (OT) and theoretically demonstrate why OT provides a more effective alignment strategy than MSE. Specifically, we construct a transport plan between brain voxel embeddings and image embeddings, enabling more precise matching. By controlling the amount of transport, we mitigate the influence of redundant information. We apply our alignment model directly to the Brain Captioning task by feeding brain signals into a large language model (LLM) instead of images. Our approach achieves state-of-the-art performance across ten evaluation metrics, surpassing the previous best method by an average of 6.11\% in single-subject training and 3.81\% in cross-subject training. Additionally, we have uncovered several insightful conclusions that align with existing brain research. We unveil the redundancy and synergy of brain information processing through region masking and data dimensionality reduction visualization experiments. We believe our approach paves the way for a more precise understanding of brain signals in the future. The code will be available soon.

Updated: 2025-03-09 06:14:23

Categories: q-bio.NC,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.10663v1

Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems

Humans can flexibly switch between different modes of thinking based on task complexity: from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems, which locate interface elements based on natural language instructions, rely solely on immediate prediction without reasoning. They struggle to understand complex interface layouts with nested structures and hierarchical relationships, which limits their effectiveness on complex interfaces. Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis. The framework dynamically switches between rapid and deliberate processing through adaptive system switching based on task complexity, optimizing both efficiency and accuracy. Focus decomposes grounding into progressive stages: interface summarization, visual focused analysis, and precise coordinate prediction. This structured decomposition enables systematic understanding of both interface layouts and visual relationships. Extensive experiments show that Focus achieves state-of-the-art performance using only 300K of the training data with a 2B parameter model compared to existing approaches. Focus demonstrates superior performance particularly in complex GUI scenarios, achieving 77.4% average accuracy on ScreenSpot and 13.3% on the more challenging ScreenSpot-Pro. Our analysis reveals the effectiveness of this dual-system approach while demonstrating its potential for improving complex GUI interaction scenarios.

Updated: 2025-03-09 06:14:17

Categories: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2503.06470v1

Life-Cycle Routing Vulnerabilities of LLM Router

Large language models (LLMs) have achieved remarkable success in natural language processing, yet their performance and computational costs vary significantly. LLM routers play a crucial role in dynamically balancing these trade-offs. While previous studies have primarily focused on routing efficiency, security vulnerabilities throughout the entire LLM router life cycle, from training to inference, remain largely unexplored. In this paper, we present a comprehensive investigation into the life-cycle routing vulnerabilities of LLM routers. We evaluate both white-box and black-box adversarial robustness, as well as backdoor robustness, across several representative routing models under extensive experimental settings. Our experiments uncover several key findings: 1) Mainstream DNN-based routers tend to exhibit the weakest adversarial and backdoor robustness, largely due to their strong feature extraction capabilities that amplify vulnerabilities during both training and inference; 2) Training-free routers demonstrate the strongest robustness across different attack types, benefiting from the absence of learnable parameters that can be manipulated. These findings highlight critical security risks spanning the entire life cycle of LLM routers and provide insights for developing more robust models.

Updated: 2025-03-09 06:00:35

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2503.08704v1

Learning time-scales in two-layers neural networks

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically `simpler' or `easier to learn' although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

Updated: 2025-03-09 05:50:54

标题: 双层神经网络中的学习时间尺度

摘要: 多层神经网络中的梯度下降学习显示出许多引人注目的特征。特别是，经验风险的降低速率甚至在对大批量进行平均后也是非单调的。长时间段内观察到几乎没有进展的平台与快速降低的间隔交替出现。这些连续的学习阶段通常发生在非常不同的时间尺度上。最后，在早期阶段学到的模型通常更“简单”或更“容易学习”，尽管以一种难以形式化的方式。尽管已提出了这些现象的理论解释，但每个解释最多只能捕捉到某些特定的情况。在本文中，我们研究了在高维度中根据单指数模型（即，目标函数依赖于协变量的一维投影）分布数据时的宽二层神经网络的梯度流动力学。基于一系列新的严格结果、非严格的数学推导和数值模拟，我们提出了这种设置中学习动态的一种情景。特别是，所提出的演变展示了时间尺度的分离和间歇性。这些行为自然地产生，因为人口梯度流可以被重新解释为一个奇异扰动的动力系统。

更新时间: 2025-03-09 05:50:54

领域: cs.LG,math.OC,stat.ML,34E15, 37N40, 68T07

下载: http://arxiv.org/abs/2303.00055v4

StructGS: Adaptive Spherical Harmonics and Rendering Enhancements for Superior 3D Gaussian Splatting

Recent advancements in 3D reconstruction coupled with neural rendering techniques have greatly improved the creation of photo-realistic 3D scenes, influencing both academic research and industry applications. The technique of 3D Gaussian Splatting (3DGS) and its variants incorporate the strengths of both primitive-based and volumetric representations, achieving superior rendering quality. While 3DGS and its variants have advanced the field of 3D representation, they fall short in capturing the stochastic properties of non-local structural information during the training process. Additionally, the initialisation of spherical functions in 3DGS-based methods often fails to engage higher-order terms in early training rounds, leading to unnecessary computational overhead as training progresses. Furthermore, current 3DGS-based approaches require training on higher resolution images to render higher resolution outputs, significantly increasing memory demands and prolonging training durations. We introduce StructGS, a framework that enhances 3DGS for improved novel-view synthesis in 3D reconstruction. StructGS innovatively incorporates a patch-based SSIM loss, dynamic spherical harmonics initialisation and a Multi-scale Residual Network (MSRN) to address the above-mentioned limitations, respectively. Our framework significantly reduces computational redundancy, enhances detail capture and supports high-resolution rendering from low-resolution inputs. Experimentally, StructGS demonstrates superior performance over state-of-the-art (SOTA) models, achieving higher quality and more detailed renderings with fewer artifacts.

Updated: 2025-03-09 05:39:44

标题: StructGS：自适应球谐和渲染增强技术，用于优秀的3D高斯飞溅

摘要: 最近在3D重建和神经渲染技术方面的进展大大提高了逼真的3D场景的创建，影响了学术研究和行业应用。3D高斯散点和其变体技术结合了基于原语和体积表示的优势，实现了卓越的渲染质量。虽然3D几何散射（3DGS）及其变体推动了3D表示领域的发展，但在训练过程中未能捕捉到非局部结构信息的随机特性。此外，基于3DGS的方法中球函数的初始化通常无法在早期训练轮次中涉及更高阶的项，导致在训练过程中不必要的计算开销。此外，当前的3DGS方法需要对更高分辨率的图像进行训练以渲染更高分辨率的输出，显著增加了内存需求并延长了训练时间。我们引入了StructGS，一个增强3D高斯散点（3DGS）以改善3D重建中新视角合成的框架。StructGS创新地结合了基于补丁的SSIM损失、动态球谐函数初始化和多尺度残差网络（MSRN）来分别解决上述限制。我们的框架显著减少了计算冗余，增强了细节捕捉，并支持从低分辨率输入生成高分辨率渲染。在实验中，StructGS表现出优于最先进模型的性能，实现了更高质量、更详细的渲染结果并减少了瑕疵。

更新时间: 2025-03-09 05:39:44

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.06462v1

Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advancements in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the necessity of combinatorial reasoning across multiple pieces of perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal current models' poor performance on combinatorial reasoning benchmarks: even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA, and drops to 7.38% on CPVQA. Notably, our approach improves the performance of models on combinatorial reasoning, with a 22.17% boost on CVQA and 9.40% on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning with multiple perceptual inputs in complex scenarios. The code will be publicly available.

Updated: 2025-03-09 05:35:07

标题: 大型语言模型能揭示奥秘吗？它们在解锁复杂场景中的信息能力的探索

摘要: 将多种感知输入结合起来并在复杂场景中进行组合推理是人类的一种复杂认知功能。随着多模态大语言模型的进步，最近的基准往往倾向于评估跨多个图像的视觉理解。然而，它们经常忽视了跨多个感知信息的组合推理的必要性。为了探索先进模型在复杂场景中整合多个感知输入进行组合推理的能力，我们引入了两个基准：Clue-Visual Question Answering (CVQA)，包括三种任务类型来评估视觉理解和综合能力，以及Clue of Password-Visual Question Answering (CPVQA)，包括两种任务类型专注于准确解释和应用视觉数据。对于我们的基准，我们提出了三种即插即用的方法：利用模型输入进行推理，通过最小边界解码与随机生成增强推理，以及检索语义相关的视觉信息以实现有效数据整合。综合结果显示，当前模型在组合推理基准上表现不佳，即使是最先进的封闭源模型也仅在CVQA上达到33.04%的准确率，在CPVQA上下降至7.38%。值得注意的是，我们的方法提高了模型在组合推理上的性能，在CVQA上提高了22.17%，在CPVQA上提高了9.40%，超过了SOTA封闭源模型，证明了其在增强复杂场景中多个感知输入的组合推理方面的有效性。该代码将公开提供。

更新时间: 2025-03-09 05:35:07

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.19973v2

Reconstructing Depth Images of Moving Objects from Wi-Fi CSI Data

This study proposes a new deep learning method for reconstructing depth images of moving objects within a specific area using Wi-Fi channel state information (CSI). The Wi-Fi-based depth imaging technique has novel applications in domains such as security and elder care. However, reconstructing depth images from CSI is challenging because learning the mapping function between CSI and depth images, both of which are high-dimensional data, is particularly difficult. To address the challenge, we propose a new approach called Wi-Depth. The main idea behind the design of Wi-Depth is that a depth image of a moving object can be decomposed into three core components: the shape, depth, and position of the target. Therefore, in the depth-image reconstruction task, Wi-Depth simultaneously estimates the three core pieces of information as auxiliary tasks in our proposed VAE-based teacher-student architecture, enabling it to output images with the consistency of a correct shape, depth, and position. In addition, the design of Wi-Depth is based on our idea that this decomposition efficiently takes advantage of the fact that shape, depth, and position relate to primitive information inferred from CSI such as angle-of-arrival, time-of-flight, and Doppler frequency shift.

Updated: 2025-03-09 05:30:33

标题: 从Wi-Fi CSI数据重建移动物体的深度图像

摘要: 这项研究提出了一种新的深度学习方法，用于利用Wi-Fi信道状态信息（CSI）在特定区域内重建移动物体的深度图像。基于Wi-Fi的深度成像技术在安全和老年护理等领域具有新颖的应用。然而，从CSI重建深度图像具有挑战性，因为学习CSI和深度图像之间的映射函数，这两者都是高维数据，尤其困难。为了解决这一挑战，我们提出了一种名为Wi-Depth的新方法。Wi-Depth设计背后的主要思想是，移动物体的深度图像可以分解为三个核心组成部分：目标的形状、深度和位置。因此，在深度图像重建任务中，Wi-Depth同时估计三个核心信息作为我们提出的基于VAE的师生架构中的辅助任务，使其能够输出具有正确形状、深度和位置一致性的图像。此外，Wi-Depth的设计基于我们的想法，即这种分解有效地利用了形状、深度和位置与从CSI推断出的原始信息相关，例如到达角、飞行时间和多普勒频移。

更新时间: 2025-03-09 05:30:33

领域: cs.CV,cs.LG,eess.IV

下载: http://arxiv.org/abs/2503.06458v1

Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning

Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence. Code published at: https://github.com/WeiDai-David/2025CVPR_GGEUR

Updated: 2025-03-09 05:30:28

标题: 几何知识引导的本地化全局分布对齐用于联邦学习

摘要: 联邦学习中的数据异质性，表现为本地和全局分布之间的显著不一致，导致本地优化方向的分歧并阻碍全局模型训练。现有研究主要集中在优化本地更新或全局聚合，但这些间接方法在处理高度异质数据分布时表现不稳定，特别是在标签偏斜和域偏斜并存的情况下。为了解决这个问题，我们提出了一种基于几何引导的数据生成方法，重点是在本地模拟全局嵌入分布。我们首先介绍了嵌入分布的几何形状的概念，然后解决了在隐私约束下获取全局几何形状的挑战。随后，我们提出了GGEUR，利用全局几何形状来引导新样本的生成，实现对理想全局分布的更接近逼近。在单域场景中，我们基于全局几何形状增强样本以提升模型泛化能力；在多域场景中，我们进一步利用类原型来模拟跨域的全局分布。广泛的实验结果表明，我们的方法显著提升了现有方法在处理高度异质数据的性能，包括标签偏斜、域偏斜及它们共存的情况。源代码发布在：https://github.com/WeiDai-David/2025CVPR_GGEUR

更新时间: 2025-03-09 05:30:28

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.06457v1

Privacy Protection in Prosumer Energy Management Based on Federated Learning

With the booming development of prosumers, there is an urgent need for a prosumer energy management system that takes full advantage of the flexibility of prosumers while taking into account the interests of other parties. However, building such a system will inevitably expose users' private information. In this paper, we address the non-independent and identically distributed (Non-IID) data problem in federated learning with the federated cluster average (FedClusAvg) algorithm, so that prosumers' information can efficiently participate in the intelligent decision making of the system without revealing private data. In the proposed FedClusAvg algorithm, each client performs cluster stratified sampling and multiple iterations. Then, the averaging weight of each parameter at the sub-server is determined according to the degree of deviation of that parameter from the average parameter. Finally, the sub-server performs multiple local iterations and updates, and then uploads the result to the main server. The FedClusAvg algorithm has two advantages. First, the accuracy of the model in the Non-IID case is improved through clustering and parameter-weighted averaging. Second, multiple local iterations and the three-tier framework can effectively reduce the number of communication rounds.
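
The deviation-based weighting step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only states that the averaging weights depend on how far each parameter deviates from the average, so the inverse-distance rule, the function name `deviation_weighted_average`, and the `eps` smoothing term are all assumptions made here.

```python
import math

def deviation_weighted_average(client_params, eps=1e-8):
    """Aggregate client parameter vectors, down-weighting outliers.

    Assumption: weight_i is proportional to 1 / (deviation_i + eps), where
    deviation_i is the Euclidean distance of client i from the plain mean.
    The exact rule used by FedClusAvg is not given in the abstract.
    """
    n = len(client_params)
    dim = len(client_params[0])
    # Plain (unweighted) mean of all client parameter vectors.
    mean = [sum(p[j] for p in client_params) / n for j in range(dim)]
    # Distance of each client from that mean.
    deviations = [math.dist(p, mean) for p in client_params]
    raw = [1.0 / (d + eps) for d in deviations]
    total = sum(raw)
    weights = [r / total for r in raw]
    # Weighted average, coordinate by coordinate.
    aggregated = [
        sum(w * p[j] for w, p in zip(weights, client_params))
        for j in range(dim)
    ]
    return aggregated, weights

# Two similar clients and one outlier (hypothetical values).
clients = [[1.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
aggregated, weights = deviation_weighted_average(clients)
```

Under this assumed rule the outlying client receives a smaller weight, so the aggregate stays closer to the majority than a plain mean would.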

Updated: 2025-03-09 05:29:29

标题: 基于联邦学习的产消者能源管理中的隐私保护

摘要: 随着生产者和消费者的蓬勃发展，迫切需要一个生产者能源管理系统，充分利用生产者的灵活性，并考虑其他方的利益。然而，建立这样一个系统无疑会泄露用户的隐私。本文通过使用联邦聚类平均（FedClusAvg）算法解决联邦学习中数据的非独立和相同分布（Non-IID）问题，使生产者的信息能够有效参与系统的智能决策，而不泄露隐私。在提出的FedClusAvg算法中，每个客户端进行聚类分层抽样和多次迭代。然后，根据参数与平均参数的偏差确定子服务器参数的平均权重。最后，子服务器进行多次本地迭代和更新，然后上传至主服务器。FedClusAvg算法的优势在于以下两个方面。首先，通过聚类和参数加权平均方法提高了在非独立和相同分布情况下模型的准确性。其次，本地多次迭代和三层框架可以有效减少通信轮次。

更新时间: 2025-03-09 05:29:29

领域: cs.LG

下载: http://arxiv.org/abs/2503.06455v1

NaviDet: Efficient Input-level Backdoor Detection on Text-to-Image Synthesis via Neuron Activation Variation

In recent years, text-to-image (T2I) diffusion models have garnered significant attention for their ability to generate high-quality images reflecting text prompts. However, their growing popularity has also led to the emergence of backdoor threats, posing substantial risks. Currently, effective defense strategies against such threats are lacking due to the diversity of backdoor targets in T2I synthesis. In this paper, we propose NaviDet, the first general input-level backdoor detection framework for identifying backdoor inputs across various backdoor targets. Our approach is based on the new observation that trigger tokens tend to induce significant neuron activation variation in the early stage of the diffusion generation process, a phenomenon we term Early-step Activation Variation. Leveraging this insight, NaviDet detects malicious samples by analyzing neuron activation variations caused by input tokens. Through extensive experiments, we demonstrate the effectiveness and efficiency of our method against various T2I backdoor attacks, surpassing existing baselines with significantly lower computational overhead. Furthermore, we rigorously demonstrate that our method remains effective against potential adaptive attacks.

Updated: 2025-03-09 05:27:44

标题: NaviDet：通过神经元激活变化在文本到图像合成中有效检测输入级后门

摘要: 最近几年，文本到图像（T2I）扩散模型因其能够生成反映文本提示的高质量图像而引起了广泛关注。然而，它们日益增长的受欢迎程度也导致了后门威胁的出现，带来了重大风险。目前，由于T2I合成中后门目标的多样性，针对此类威胁的有效防御策略尚未形成。在本文中，我们提出了NaviDet，这是第一个用于识别各种后门目标下的后门输入的一般输入级后门检测框架。我们的方法基于一个新的观察结果，即触发令牌往往会在扩散生成过程的早期阶段引起显著的神经元激活变化，我们将这种现象称为早期激活变异。利用这一观察结果，NaviDet通过分析输入令牌引起的神经元激活变化来检测恶意样本。通过大量实验证明，我们的方法在各种T2I后门攻击中表现出了高效性和效果，超越了现有基准并且具有显著更低的计算开销。此外，我们严格证明了我们的方法对潜在的自适应攻击仍然有效。

更新时间: 2025-03-09 05:27:44

领域: cs.CR

下载: http://arxiv.org/abs/2503.06453v1

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest

Artificial intelligence has made significant strides in medical visual question answering (Med-VQA), yet prevalent studies often interpret images holistically, overlooking the visual regions of interest that may contain crucial information. Such regions can align with a doctor's prior knowledge and can be incorporated with minimal annotations (e.g., bounding boxes). To address this gap, this paper introduces R-LLaVA, designed to enhance biomedical VQA understanding by integrating simple medical annotations as prior knowledge directly into the image space through CLIP. These annotated visual regions of interest are then fed into the LLaVA model during training, aiming to enrich the model's understanding of biomedical queries. Experimental evaluation on four standard Med-VQA datasets demonstrates R-LLaVA's superiority over existing state-of-the-art (SoTA) methods. Additionally, to verify the model's capability in visual comprehension, a novel multiple-choice medical visual understanding dataset is introduced, confirming the positive impact of focusing on visual regions of interest in advancing biomedical VQA understanding.

Updated: 2025-03-09 05:23:35

标题: R-LLaVA：通过视觉感兴趣区域提高Med-VQA理解

摘要: 人工智能在医学视觉问答（Med-VQA）领域取得了重大进展，然而普遍的研究往往将图像解释为整体，忽视了可能包含关键信息的视觉感兴趣区域，这些信息可能与医生的先前知识相一致，可以在最小的注释（例如边界框）的情况下整合。为填补这一空白，本文介绍了R-LLaVA，旨在通过将简单的医学注释作为先验知识直接整合到图像空间中，通过CLIP来增强生物医学VQA的理解。这些带注释的视觉感兴趣区域然后在训练期间馈送到LLaVA模型中，旨在丰富模型对生物医学查询的理解。对四个标准的Med-VQA数据集进行的实验评估显示，R-LLaVA优于现有的最先进方法。此外，为验证模型在视觉理解方面的能力，引入了一种新颖的多项选择医学视觉理解数据集，证实将关注视觉感兴趣区域对推进生物医学VQA理解具有积极影响。

更新时间: 2025-03-09 05:23:35

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2410.20327v5

A Survey on LLM-as-a-Judge

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discuss practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

Updated: 2025-03-09 05:21:22

标题: 《关于以LLM为法官的调查》

摘要: 准确和一致的评估对于跨越众多领域的决策至关重要，然而由于固有的主观性、变化性和规模而仍然是一项具有挑战性的任务。大型语言模型（LLMs）在各种领域取得了显著成功，导致了“LLM作为评判者”的出现，其中LLMs被用作复杂任务的评估者。由于它们能够处理各种数据类型并提供可扩展、成本效益和一致的评估，LLMs为传统专家驱动的评估提供了引人注目的替代方案。然而，确保LLM作为评判者系统的可靠性仍然是一个需要仔细设计和标准化的重大挑战。本文提供了对LLM作为评判者的综合调查，解决了核心问题：如何构建可靠的LLM作为评判者系统？我们探讨了增强可靠性的策略，包括提高一致性、减轻偏见和适应多样的评估场景。此外，我们提出了评估LLM作为评判者系统可靠性的方法论，支持一个专为此目的设计的新型基准。为了推进LLM作为评判者系统的发展和实际部署，我们还讨论了实际应用、挑战和未来方向。这项调查为这个快速发展的领域的研究人员和从业者提供了一个基础参考。

更新时间: 2025-03-09 05:21:22

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2411.15594v5

Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

As text-to-image diffusion models gain widespread commercial applications, there are increasing concerns about unethical or harmful use, including the unauthorized generation of copyrighted or sensitive content. Concept unlearning has emerged as a promising solution to these challenges by removing undesired and harmful information from the pre-trained model. However, the previous evaluations primarily focus on whether target concepts are removed while preserving image quality, neglecting the broader impacts such as unintended side effects. In this work, we propose Holistic Unlearning Benchmark (HUB), a comprehensive framework for evaluating unlearning methods across six key dimensions: faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency. Our benchmark covers 33 target concepts, including 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW. Our investigation reveals that no single method excels across all evaluation criteria. By releasing our evaluation code and dataset, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.

Updated: 2025-03-09 05:17:36

标题: 整体式遗忘基准：用于文本-图像扩散模型遗忘的多方面评估

摘要: 随着文本到图像扩散模型在广泛的商业应用中得到应用，人们对于不道德或有害使用的担忧日益增加，包括未经授权生成受版权保护或敏感内容。概念去学习已经成为解决这些挑战的一个有前途的解决方案，通过从预先训练的模型中删除不需要的和有害的信息。然而，先前的评估主要关注于是否移除目标概念同时保持图像质量，而忽视了更广泛的影响，如意外副作用。在这项工作中，我们提出了全面的概念去学习基准（HUB），这是一个评估去学习方法的综合框架，涵盖了六个关键维度：忠实度、对齐度、准确性、多语言稳健性、攻击稳健性和效率。我们的基准涵盖了33个目标概念，每个概念包括16,000个提示，涵盖了四个类别：名人、风格、知识产权和NSFW。我们的调查显示没有一种方法在所有评估标准上都表现优异。通过发布我们的评估代码和数据集，我们希望激发这一领域的进一步研究，促进更可靠和有效的去学习方法的发展。

更新时间: 2025-03-09 05:17:36

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2410.05664v2

Leverage Knowledge Graph and Large Language Model for Law Article Recommendation: A Case Study of Chinese Criminal Law

Court efficiency is vital for social stability. However, in most countries around the world, the grassroots courts face case backlogs, with decisions relying heavily on judicial personnel's cognitive labor, lacking intelligent tools to improve efficiency. To address this issue, we propose an efficient law article recommendation approach utilizing a Knowledge Graph (KG) and a Large Language Model (LLM). Firstly, we propose a Case-Enhanced Law Article Knowledge Graph (CLAKG) as a database to store current law statutes, historical case information, and correspondence between law articles and historical cases. Additionally, we introduce an automated CLAKG construction method based on LLM. On this basis, we propose a closed-loop law article recommendation method. Finally, through a series of experiments using judgment documents from the website "China Judgements Online", we have improved the accuracy of law article recommendation in cases from 0.549 to 0.694, demonstrating that our proposed method significantly outperforms baseline approaches.

Updated: 2025-03-09 05:10:23

标题: 利用知识图谱和大型语言模型进行法律文章推荐：以中国刑法为例的案例研究

摘要: 法院效率对社会稳定至关重要。然而，在世界大多数国家，基层法院面临案件积压问题，决定往往严重依赖司法人员的认知劳动，缺乏智能工具来提高效率。为解决这一问题，我们提出了一种利用知识图谱（KG）和大型语言模型（LLM）的高效法律文章推荐方法。首先，我们提出了一个案例增强的法律文章知识图谱（CLAKG），作为存储当前法律法规、历史案例信息以及法律文章与历史案例之间对应关系的数据库。此外，我们介绍了一种基于LLM的自动化CLAKG构建方法。在此基础上，我们提出了一个闭环法律文章推荐方法。最后，通过使用来自“中国裁判文书网”网站的判决文书进行一系列实验，我们将案例中法律文章推荐的准确性从0.549提高到0.694，证明我们提出的方法明显优于基准方法。

更新时间: 2025-03-09 05:10:23

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2410.04949v2

CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data

Diffusion-based tabular data synthesis models have yielded promising results. However, we observe that when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To address this issue, we propose CtrTab, a condition-controlled diffusion model for tabular data synthesis, to improve the performance of diffusion-based generative models in high-dimensional, low-data scenarios. Through CtrTab, we inject samples with added Laplace noise as control signals to improve data diversity, and we show that this resembles L2 regularization, which enhances model robustness. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with an average accuracy gap of over 80%. Our source code will be released upon paper publication.
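
As a rough illustration of the noise-injection idea only, the sketch below perturbs a numeric table row with Laplace noise to form a control signal. How CtrTab actually feeds such signals into the diffusion model is not described in the abstract, so the function names `laplace_noise` and `make_control_signal`, the `scale` value, and the example row are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with rate
    1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def make_control_signal(row, scale=0.1, rng=None):
    """Return a noisy copy of a numeric row (assumed already normalised)."""
    rng = rng or random.Random()
    return [x + laplace_noise(scale, rng) for x in row]

rng = random.Random(0)           # fixed seed so the sketch is reproducible
row = [0.5, 1.0, -2.0]           # hypothetical normalised feature values
control = make_control_signal(row, scale=0.1, rng=rng)
```

A larger `scale` spreads the control signals further from the original rows, which trades fidelity for diversity; the value used here is arbitrary.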

Updated: 2025-03-09 05:01:56

标题: CtrTab：具有高维和有限数据的表格数据综合

摘要: 基于扩散的表格数据合成模型已经取得了令人满意的结果。然而，我们观察到当数据的维度增加时，现有模型往往会退化，甚至可能表现比简单的非扩散模型还要差。这是因为在高维空间中有限的训练样本经常会阻碍生成模型准确捕捉分布。为了解决这个问题，我们提出了CtrTab-一种用于改进高维低数据情况下扩散型生成模型性能的条件控制扩散模型。通过CtrTab，我们注入了添加了拉普拉斯噪声的样本作为控制信号，以改善数据多样性，并展示其类似于L2正则化，从而增强模型的鲁棒性。跨多个数据集的实验结果显示，CtrTab在准确性方面表现优于最先进的模型，平均准确率的性能差距超过80%。我们的源代码将在论文发表后发布。

更新时间: 2025-03-09 05:01:56

领域: cs.LG,cs.AI,cs.DB

下载: http://arxiv.org/abs/2503.06444v1

PerfRL: A Small Language Model Framework for Efficient Code Optimization

Code optimization is a challenging task requiring a substantial level of expertise from developers. Nonetheless, this level of human capacity is not sufficient considering the rapid evolution of new hardware architectures and software environments. In light of this, recent research proposes adopting machine learning and artificial intelligence techniques to automate the code optimization process. In this paper, we introduce PerfRL, an innovative framework designed to tackle the problem of code optimization. Our framework leverages the capabilities of small language models (SLMs) and reinforcement learning (RL), facilitating a system where SLMs can assimilate feedback from their environment during the fine-tuning phase, notably through unit tests. When benchmarked against existing models, PerfRL demonstrates superior efficiency in terms of speed and computational resource usage, attributed to its reduced need for training steps and its compatibility with SLMs. Furthermore, it substantially diminishes the risk of logical and syntactical errors. To evaluate our framework, we conduct experiments on the PIE dataset using a lightweight large language model (i.e., CodeT5) and a new reinforcement learning algorithm, namely RRHF. For evaluation purposes, we use a list of evaluation metrics related to optimization quality and speedup. The evaluation results show that our approach achieves similar or better results compared to state-of-the-art models using shorter training times and smaller pre-trained models.

Updated: 2025-03-09 05:01:42

标题: PerfRL：一种用于高效代码优化的小型语言模型框架

摘要: 代码优化是一项具有挑战性的任务，需要开发人员具备相当水平的专业知识。然而，考虑到新硬件架构和软件环境的快速演变，这种人力容量水平显然是不够的。基于此，最近的研究提出采用机器学习和人工智能技术来自动化代码优化过程。本文介绍了PerfRL，一个创新性框架，旨在解决代码优化问题。我们的框架利用了小型语言模型（SLM）和强化学习（RL）的能力，促进了一个系统，其中SLMs可以在微调阶段通过单元测试等方式吸收来自环境的反馈。与现有模型进行基准测试时，PerfRL在速度和计算资源使用效率方面表现出优越性，这归因于其对训练步骤的需求减少以及与SLMs的兼容性。此外，它大大减少了逻辑和语法错误的风险。为了评估我们的框架，我们在PIE数据集上进行了实验，使用了轻量级大型语言模型（即CodeT5）和一种新的强化学习算法，即RRHF。为了评估目的，我们使用了与优化质量和加速度相关的评估指标列表。评估结果显示，我们的方法在较短的训练时间和较小的预训练模型的情况下，实现了与最先进模型相似或更好的结果。

更新时间: 2025-03-09 05:01:42

领域: cs.LG,cs.AI,cs.PL,cs.SE

下载: http://arxiv.org/abs/2312.05657v2

Generalizable Machine Learning Models for Predicting Data Center Server Power, Efficiency, and Throughput

In the rapidly evolving digital era, comprehending the intricate dynamics influencing server power consumption, efficiency, and performance is crucial for sustainable data center operations. However, existing models lack the ability to provide a detailed and reliable understanding of these intricate relationships. This study employs a machine learning-based approach, using the SPECPower_ssj2008 database, to facilitate user-friendly and generalizable server modeling. The resulting models demonstrate high accuracy, with errors falling within approximately 10% on the testing dataset, showcasing their practical utility and generalizability. Through meticulous analysis, predictive features related to hardware availability date, server workload level, and specifications are identified, providing insights into optimizing energy conservation, efficiency, and performance in server deployment and operation. By systematically measuring biases and uncertainties, the study underscores the need for caution when employing historical data for prospective server modeling, considering the dynamic nature of technology landscapes. Collectively, this work offers valuable insights into the sustainable deployment and operation of servers in data centers, paving the way for enhanced resource use efficiency and more environmentally conscious practices.

Updated: 2025-03-09 04:39:53

标题: 通用的机器学习模型用于预测数据中心服务器的功耗、效率和吞吐量

摘要: 在快速发展的数字时代，理解影响服务器功耗、效率和性能的复杂动态对可持续数据中心运营至关重要。然而，现有模型缺乏提供这些复杂关系的详细和可靠理解的能力。本研究采用基于机器学习的方法，利用SPECPower_ssj2008数据库，以促进用户友好和可推广的服务器建模。所得模型表现出高准确性，测试数据集中误差约为10%，展示了它们的实用性和可推广性。通过细致分析，识别了与硬件上市日期、服务器工作负载水平和规格相关的预测特征，为优化服务器部署和运行中的能源节约、效率和性能提供了见解。通过系统地测量偏见和不确定性，本研究强调了在考虑技术景观的动态性时，使用历史数据进行未来服务器建模时需要谨慎的必要性。总的来说，这项工作为数据中心中服务器的可持续部署和运营提供了宝贵的见解，为提高资源利用效率和更环保的实践铺平了道路。

更新时间: 2025-03-09 04:39:53

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.06439v1

Core Knowledge Deficits in Multi-Modal Language Models

While Multimodal Large Language Models (MLLMs) demonstrate impressive abilities over high level perception and reasoning, their robustness in the wild still lags behind humans and exhibits diminished efficacy on simple tasks that are intuitive for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge, rudimentary cognitive abilities innate to humans from early childhood. To probe core knowledge representation in MLLMs, we draw from developmental cognitive sciences and develop a large-scale benchmark, CoreCognition dataset, encompassing 12 core cognitive concepts. We evaluate 219 models with 10 different prompts, leading to a total of 2409 data points for analysis. Our findings reveal core knowledge deficits in early developed core abilities while models demonstrate human-comparable performance in high level cognition. Moreover, we find that low level abilities show little to no scaling, in stark contrast to high level abilities. Finally, we introduce an evaluation technique, Concept Hacking, through which we demonstrate that MLLMs do not genuinely advance toward core knowledge but instead rely on illusory understanding and shortcut learning as they scale. Website: https://growing-ai-like-a-child.github.io/

Updated: 2025-03-09 04:39:42

标题: 多模态语言模型中的核心知识缺陷

摘要: 尽管多模式大型语言模型（MLLMs）展示了在高水平知觉和推理方面令人印象深刻的能力，但它们在野外的稳健性仍然落后于人类，并在人类直觉上简单任务上表现出降低的效力。我们检验了这样一个假设，即这些不足源于核心知识的缺失，这是人类从幼年时期就具有的基本认知能力。为了探索MLLMs中的核心知识表征，我们借鉴了发展性认知科学，并开发了一个大规模基准测试，CoreCognition数据集，包括12个核心认知概念。我们评估了219个模型，使用10个不同的提示，共计2409个数据点进行分析。我们的研究结果揭示了早期发展的核心能力中的核心知识缺陷，而模型在高水平认知方面表现出类似于人类的性能。此外，我们发现低水平能力几乎没有扩展，与高水平能力形成鲜明对比。最后，我们引入了一个评估技术，Concept Hacking，通过该技术我们证明MLLMs并没有真正朝着核心知识前进，而是依赖于虚假的理解和捷径学习来扩展规模。该文献的网站链接为：https://growing-ai-like-a-child.github.io/。

更新时间: 2025-03-09 04:39:42

领域: cs.CL,cs.AI,cs.CV

下载: http://arxiv.org/abs/2410.10855v3

SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding

We present SEED (Semantic Evaluation for Visual Brain Decoding), a novel metric for evaluating the semantic decoding performance of visual brain decoding models. It integrates three complementary metrics, each capturing a different aspect of semantic similarity between images. Using carefully crowd-sourced human judgment data, we demonstrate that SEED achieves the highest alignment with human evaluations, outperforming other widely used metrics. Through the evaluation of existing visual brain decoding models, we further reveal that crucial information is often lost in translation, even in state-of-the-art models that achieve near-perfect scores on existing metrics. To facilitate further research, we open-source the human judgment data, encouraging the development of more advanced evaluation methods for brain decoding models. Additionally, we propose a novel loss function designed to enhance semantic decoding performance by leveraging the order of pairwise cosine similarity in CLIP image embeddings. This loss function is compatible with various existing methods and has been shown to consistently improve their semantic decoding performances when used for training, with respect to both existing metrics and SEED.

Updated: 2025-03-09 04:25:39

标题: SEED：朝向更准确的视觉脑解码语义评估

摘要: 我们提出SEED（Semantic Evaluation for Visual Brain Decoding），这是一个用于评估视觉脑解码模型语义解码性能的新型度量。它整合了三种互补的度量标准，每种都捕捉了图像之间不同方面的语义相似性。通过精心策划的众包人类判断数据，我们展示了SEED与人类评估之间的最高对齐度，超过了其他广泛使用的度量标准。通过评估现有的视觉脑解码模型，我们进一步揭示了在翻译过程中经常丢失关键信息，甚至是在达到现有度量标准几乎完美分数的最先进模型中也是如此。为了促进进一步研究，我们开源了人类判断数据，鼓励开发更先进的评估方法用于脑解码模型。此外，我们提出了一个新型损失函数，旨在通过利用CLIP图像嵌入中成对余弦相似性的顺序来增强语义解码性能。当用于训练时，这个损失函数与各种现有方法兼容，并已证明在语义解码性能方面持续改进，无论是对现有度量标准还是SEED。

更新时间: 2025-03-09 04:25:39

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.06437v1

Physics-Informed Residual Neural Ordinary Differential Equations for Enhanced Tropical Cyclone Intensity Forecasting

Accurate tropical cyclone (TC) intensity prediction is crucial for mitigating storm hazards, yet its complex dynamics pose challenges to traditional methods. Here, we introduce a Physics-Informed Residual Neural Ordinary Differential Equation (PIR-NODE) model to precisely forecast TC intensity evolution. This model leverages the powerful non-linear fitting capabilities of deep learning, integrates residual connections to enhance model depth and training stability, and explicitly models the continuous temporal evolution of TC intensity using Neural ODEs. Experimental results on the SHIPS dataset demonstrate that the PIR-NODE model achieves a significant improvement in 24-hour intensity prediction accuracy compared to traditional statistical models and benchmark deep learning methods, with a 25.2% reduction in the root mean square error (RMSE) and a 19.5% increase in R-squared ($R^2$) relative to a baseline neural network. Crucially, the residual structure effectively preserves initial state information, and the model exhibits robust generalization capabilities. This study details the PIR-NODE model architecture, physics-informed integration strategies, and comprehensive experimental validation, revealing the substantial potential of deep learning techniques in predicting complex geophysical systems and laying the foundation for future refined TC forecasting research.
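
The link between residual connections and continuous-time evolution mentioned above can be illustrated with a small sketch: a forward-Euler step of an ODE has the same form as a residual update (state plus an increment). The code below is only a toy under stated assumptions; `toy_intensity_rate` is a hand-written stand-in for the learned network, and its constants (`v_max`, `rate`) and the initial intensity are invented for illustration and do not come from the paper.

```python
import math

def euler_integrate(f, x0, t0, t1, steps):
    """Integrate dx/dt = f(t, x) from t0 to t1 with forward Euler."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        # Residual-style update: new state = old state + step * increment.
        x = x + h * f(t, x)
        t += h
    return x

def toy_intensity_rate(t, v, v_max=70.0, rate=0.1):
    """Hypothetical dynamics: intensity relaxes toward v_max.

    In a Neural ODE this function would be a trained network; here it is
    a fixed formula so the sketch runs without any ML library.
    """
    return rate * (v_max - v)

v0 = 30.0                                   # hypothetical initial intensity
v24 = euler_integrate(toy_intensity_rate, v0, 0.0, 24.0, steps=2400)

# Closed-form solution of the toy ODE, used only as a sanity check.
v24_exact = 70.0 - (70.0 - v0) * math.exp(-0.1 * 24.0)
```

Because the state is carried forward additively at every step, the initial value `v0` influences the whole trajectory, which is one intuition for why residual structures preserve initial-state information.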

Updated: 2025-03-09 04:23:07

标题: 物理信息残差神经普通微分方程用于增强热带气旋强度预测

摘要: 准确的热带气旋（TC）强度预测对于减轻风暴危害至关重要，然而其复杂的动力学对传统方法提出了挑战。在这里，我们引入了一种物理信息残差神经常微分方程（PIR-NODE）模型，以精确预测TC强度演变。该模型利用深度学习强大的非线性拟合能力，集成残差连接以增强模型深度和训练稳定性，并使用神经常微分方程明确地建模TC强度的连续时间演变。在SHIPS数据集中的实验结果表明，与传统统计模型和基准深度学习方法相比，PIR-NODE模型在24小时强度预测准确性方面取得了显著改善，均方根误差（RMSE）减少了25.2％，R平方值（R2）相对于神经网络基线增加了19.5％。重要的是，残差结构有效地保留了初始状态信息，模型表现出强大的泛化能力。本研究详细介绍了PIR-NODE模型架构、物理信息集成策略和全面的实验验证，揭示了深度学习技术在预测复杂地球物理系统方面的巨大潜力，并为未来精细TC预测研究奠定了基础。

更新时间: 2025-03-09 04:23:07

领域: physics.ao-ph,cs.AI

下载: http://arxiv.org/abs/2503.06436v1

Why Train Everything? Tint a Single Layer for Multi-task Model Merging

Model merging integrates independently fine-tuned models into a single multi-task model, offering a flexible alternative to joint training. However, many existing model merging methods introduce additional task-specific components, increasing complexity and requiring extra modifications. We propose Model Tinting, a lightweight yet highly effective approach that improves model merging by updating just a single layer, accounting for as low as 0.5% of total parameters. Our key observation is that explicit task-specific modules are not necessary; instead, subtle adjustments to a single layer can effectively capture task-specific variations within the merged model while maintaining generalization. We introduce a confidence-based filtering mechanism to alleviate the impact of unreliable predictions from individual models on the merged model. Extensive experiments across vision and NLP tasks demonstrate that Model Tinting achieves state-of-the-art performance, even in challenging dense prediction tasks. Our code is available at https://github.com/AIM-SKKU/ModelTinting
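
To make the "single layer, as low as 0.5% of total parameters" figure concrete, the following sketch computes the trainable share when only one layer is updated. The layer names and parameter counts are hypothetical and were chosen so the arithmetic is easy to follow; they do not describe any model from the paper.

```python
def trainable_fraction(layer_param_counts, tinted_layer):
    """Share of parameters that remain trainable when one layer is updated."""
    total = sum(layer_param_counts.values())
    return layer_param_counts[tinted_layer] / total

# Hypothetical layer sizes, for illustration only.
layer_param_counts = {
    "embedding": 40_000_000,
    "encoder_blocks": 159_000_000,
    "tinted_layer": 1_000_000,
}

fraction = trainable_fraction(layer_param_counts, "tinted_layer")
print(f"{fraction:.2%} of parameters are trainable")
```

In a real framework the same effect would typically be obtained by freezing every parameter except those of the chosen layer; that step is framework-specific and is omitted here.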

Updated: 2025-03-09 04:21:56

标题: 为什么要训练所有模型？为多任务模型合并着色一个单一层

摘要: 模型合并将独立微调的模型整合到单个多任务模型中，为联合训练提供了一种灵活的替代方案。然而，许多现有的模型合并方法引入了额外的任务特定组件，增加了复杂性并需要额外的修改。我们提出了Model Tinting，这是一种轻量但非常有效的方法，通过仅更新单个层来改进模型合并，其占总参数的比例可以低至0.5%。我们的关键观察是明确的任务特定模块并非必要；相反，对单个层进行微调可以有效地捕捉合并模型中的任务特定变化，同时保持泛化能力。我们引入了基于置信度的过滤机制，以减轻个别模型对合并模型的不可靠预测的影响。在视觉和NLP任务中进行的大量实验表明，Model Tinting实现了最先进的性能，甚至在具有挑战性的密集预测任务中也是如此。我们的代码可在https://github.com/AIM-SKKU/ModelTinting上获得。

更新时间: 2025-03-09 04:21:56

领域: cs.LG

下载: http://arxiv.org/abs/2412.19098v2

Seesaw: High-throughput LLM Inference via Model Re-sharding

To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics of the two stages of LLM inference (prefilling and decoding) render a single static parallelization strategy insufficient for optimizing both stages effectively. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that facilitates the dynamic reconfiguration of parallelization strategies across stages, thereby maximizing throughput in both stages. To mitigate re-sharding overhead and optimize computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. These approaches work synergistically to reduce the overhead caused by frequent stage transitions while ensuring maximum batching efficiency. Our evaluation demonstrates that Seesaw achieves a throughput increase of up to 1.78x (1.36x on average) compared to vLLM, the most widely used state-of-the-art LLM inference engine.

Updated: 2025-03-09 04:14:06

标题: Seesaw: 通过模型重新分片实现的高吞吐量LLM推断

摘要: 为了提高分布式大型语言模型（LLM）推理的效率，提出了各种并行化策略，如张量和管道并行。然而，LLM推理的两个阶段之间存在不同的计算特性-预填充和解码-使得单一静态并行化策略不足以有效优化两个阶段。在这项工作中，我们提出了Seesaw，一种针对吞吐量导向任务进行优化的LLM推理引擎。Seesaw背后的关键思想是动态模型重新分片，这是一种促进跨阶段并行化策略动态重新配置的技术，从而最大化两个阶段的吞吐量。为了减少重新分片的开销并优化计算效率，我们采用分层KV缓存缓冲和最小化转换调度。这些方法协同作用，以减少频繁阶段转换造成的开销，同时确保最大批处理效率。我们的评估表明，与vLLM相比，Seesaw实现了高达1.78倍（平均1.36倍）的吞吐量增加，vLLM是目前最广泛使用的最先进的LLM推理引擎。

更新时间: 2025-03-09 04:14:06

领域: cs.DC,cs.AI

下载: http://arxiv.org/abs/2503.06433v1

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaved with audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. The training of such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni's language understanding capability throughout the training process. To the best of our knowledge, M2-omni is currently a highly competitive open-source alternative to GPT-4o, characterized by its comprehensive modality and task support, as well as its exceptional performance. We expect M2-omni will advance the development of omni-MLLMs, thus facilitating future research in this domain.

Updated: 2025-03-09 04:11:38

标题: M2-omni：推进Omni-MLLM以实现全面的模态支持并具有竞争性能

摘要: 我们介绍了M2-omni，一个先进的开源omni-MLLM，其性能与GPT-4o相比具有竞争力。M2-omni采用统一的多模态序列建模框架，赋予大型语言模型（LLMs）全面的跨模态理解和生成能力。具体而言，M2-omni能够处理任意组合的音频、视频、图像和文本模态作为输入，生成与音频、图像或文本输出交织的多模态序列，从而实现先进且互动的实时体验。这种omni-MLLM的训练受到数据数量和收敛速度在不同模态之间的显著差异的挑战。为了解决这些挑战，我们提出了在预训练期间采用步骤平衡策略来处理模态特定数据的数量差异。此外，在指导调整阶段引入了动态自适应平衡策略，以同步各模态的训练进度，确保最佳收敛。值得注意的是，我们优先保持在纯文本任务上的强大性能，以确保M2-omni在整个训练过程中的语言理解能力稳健。据我们所知，M2-omni目前是一个非常有竞争力的开源模型，具有全面的模态和任务支持，以及卓越的性能。我们期待M2-omni将推动omni-MLLM的发展，从而促进该领域的未来研究。

更新时间: 2025-03-09 04:11:38

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.18778v2

Fairness-aware organ exchange and kidney paired donation

The kidney paired donation (KPD) program provides an innovative solution to overcome incompatibility challenges in kidney transplants by matching incompatible donor-patient pairs and facilitating kidney exchanges. To address unequal access to transplant opportunities, there are two widely used fairness criteria: group fairness and individual fairness. However, these criteria do not consider protected patient features, which refer to characteristics legally or ethically recognized as needing protection from discrimination, such as race and gender. Motivated by the calibration principle in machine learning, we introduce a new fairness criterion: the matching outcome should be conditionally independent of the protected feature, given the sensitization level. We integrate this fairness criterion as a constraint within the KPD optimization framework and propose a computationally efficient solution. Theoretically, we analyze the associated price of fairness using random graph models. Empirically, we compare our fairness criterion with group fairness and individual fairness through both simulations and a real-data example.
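
The proposed criterion can be read as a conditional independence statement, $M \perp A \mid S$, where $M$ is the matching outcome, $A$ the protected feature, and $S$ the sensitization level (these symbols are introduced here for illustration and are not the paper's notation). A minimal sketch of an empirical check, assuming records of the hypothetical form `(sensitization_level, protected_group, matched)`, compares match rates across groups within each sensitization stratum:

```python
from collections import defaultdict

def conditional_match_gaps(records):
    """Return {sensitization_level: max - min match rate across groups}.

    A gap near zero in every stratum is consistent with the matching
    outcome being independent of the protected feature given the level.
    """
    # level -> group -> [number matched, number of pairs]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for level, group, matched in records:
        cell = counts[level][group]
        cell[0] += int(matched)
        cell[1] += 1
    gaps = {}
    for level, groups in counts.items():
        rates = [m / n for m, n in groups.values()]
        gaps[level] = max(rates) - min(rates)
    return gaps
```

This only diagnoses a given matching; the paper instead imposes the criterion as a constraint inside the KPD optimization, which this sketch does not attempt.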

Updated: 2025-03-09 04:01:08

标题: 公平感知的器官交换和肾配对捐赠

摘要: 肾脏配对捐赠（KPD）计划提供了一种创新的解决方案，克服肾脏移植中的不相容挑战，通过匹配不相容的捐赠者-患者对，并促进肾脏交换。为了解决移植机会不平等的问题，有两种广泛使用的公平标准：群体公平和个人公平。然而，这些标准并不考虑受保护患者特征，这些特征指的是在法律或伦理上被认为需要免受歧视的特征，如种族和性别。受到机器学习中校准原则的启发，我们引入了一个新的公平标准：匹配结果应在给定敏感性水平的条件下独立于受保护特征。我们将这一公平标准作为约束集成到KPD优化框架中，并提出一个计算效率高的解决方案。在理论上，我们使用随机图模型分析了相关的公平代价。在实证方面，我们通过模拟和实际数据示例比较了我们的公平标准与群体公平和个人公平。

更新时间: 2025-03-09 04:01:08

领域: stat.ME,cs.LG

下载: http://arxiv.org/abs/2503.06431v1

Graph Retrieval-Augmented LLM for Conversational Recommendation Systems

Conversational Recommender Systems (CRSs) have emerged as a transformative paradigm for offering personalized recommendations through natural language dialogue. However, they face challenges with knowledge sparsity, as users often provide brief, incomplete preference statements. While recent methods have integrated external knowledge sources to mitigate this, they still struggle with semantic understanding and complex preference reasoning. Recent Large Language Models (LLMs) demonstrate promising capabilities in natural language understanding and reasoning, showing significant potential for CRSs. Nevertheless, due to the lack of domain knowledge, existing LLM-based CRSs either produce hallucinated recommendations or demand expensive domain-specific training, which largely limits their applicability. In this work, we present G-CRS (Graph Retrieval-Augmented Large Language Model for Conversational Recommender Systems), a novel training-free framework that combines graph retrieval-augmented generation and in-context learning to enhance LLMs' recommendation capabilities. Specifically, G-CRS employs a two-stage retrieve-and-recommend architecture, where a GNN-based graph reasoner first identifies candidate items, followed by Personalized PageRank exploration to jointly discover potential items and similar user interactions. These retrieved contexts are then transformed into structured prompts for LLM reasoning, enabling contextually grounded recommendations without task-specific training. Extensive experiments on two public datasets show that G-CRS achieves superior recommendation performance compared to existing methods without requiring task-specific training.
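
The Personalized PageRank step mentioned above can be sketched with plain power iteration. This is a generic textbook formulation, not G-CRS's implementation; the adjacency-dict input format, the restart probability `alpha`, and the iteration count are assumptions made for illustration.

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power iteration for Personalized PageRank.

    adj maps each node to a list of out-neighbors (every neighbor must
    also be a key of adj); seeds is the set of restart nodes, e.g. items
    mentioned in the current dialogue. Returns {node: score}.
    """
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                # Dangling node: return its mass to the seed set
                for s in seeds:
                    nxt[s] += (1 - alpha) * rank[n] / len(seeds)
                continue
            share = (1 - alpha) * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank
```

Nodes with high scores are those most reachable from the seed items, which is what makes the scores usable for surfacing candidate items and similar user interactions.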

Updated: 2025-03-09 03:56:22

标题: 图检索增强的LLM用于对话式推荐系统

摘要: 会话式推荐系统（CRSs）已经成为通过自然语言对话提供个性化推荐的变革性范式。然而，他们面临知识稀疏的挑战，因为用户经常提供简短、不完整的偏好陈述。尽管最近的方法已经整合了外部知识源来缓解这一问题，但它们仍然在语义理解和复杂偏好推理方面面临困难。最近的大型语言模型（LLMs）展示了在自然语言理解和推理方面的有希望的能力，显示出对CRSs具有重要潜力。然而，由于缺乏领域知识，现有基于LLM的CRSs要么产生幻觉般的推荐，要么需要昂贵的领域特定训练，这在很大程度上限制了它们的适用性。在这项工作中，我们提出了G-CRS（用于会话式推荐系统的图检索增强大型语言模型），这是一个新颖的无需训练的框架，结合了图检索增强生成和上下文学习，以增强LLM的推荐能力。具体地，G-CRS采用了一个两阶段的检索和推荐架构，其中基于GNN的图推理器首先识别候选项，然后使用个性化PageRank探索来共同发现潜在项和类似用户交互。然后，这些检索到的上下文被转换为结构化提示，用于LLM推理，实现了基于上下文的推荐，而无需特定任务的训练。在两个公共数据集上进行的大量实验表明，与不需要特定任务训练的现有方法相比，G-CRS实现了更优越的推荐性能。

更新时间: 2025-03-09 03:56:22

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2503.06430v1

log-RRIM: Yield Prediction via Local-to-global Reaction Representation Learning and Interaction Modeling

Accurate prediction of chemical reaction yields is crucial for optimizing organic synthesis, potentially reducing time and resources spent on experimentation. With the rise of artificial intelligence (AI), there is growing interest in leveraging AI-based methods to accelerate yield predictions without conducting in vitro experiments. We present log-RRIM, an innovative graph transformer-based framework designed for predicting chemical reaction yields. A key feature of log-RRIM is its integration of a cross-attention mechanism that focuses on the interplay between reagents and reaction centers. This design reflects a fundamental principle in chemical reactions: the crucial role of reagents in influencing bond-breaking and formation processes, which ultimately affect reaction yields. log-RRIM also implements a local-to-global reaction representation learning strategy. This approach initially captures detailed molecule-level information and then models and aggregates intermolecular interactions. Through this hierarchical process, log-RRIM effectively captures how different molecular fragments contribute to and influence the overall reaction yield, regardless of their size variations. log-RRIM shows superior performance in our experiments, especially for medium to high-yielding reactions, proving its reliability as a predictor. The framework's sophisticated modeling of reactant-reagent interactions and precise capture of molecular fragment contributions make it a valuable tool for reaction planning and optimization in chemical synthesis. The data and codes of log-RRIM are accessible through https://github.com/ninglab/Yield_log_RRIM.

Updated: 2025-03-09 03:43:34

标题: log-RRIM: 通过局部到全局反应表示学习和交互建模进行产量预测

摘要: 化学反应产率的准确预测对于优化有机合成至关重要，可以潜在地减少在实验上花费的时间和资源。随着人工智能（AI）的兴起，利用基于AI的方法加速产率预测的兴趣日益增长，而无需进行体外实验。我们提出了log-RRIM，这是一个基于图转换器的创新框架，旨在预测化学反应产率。log-RRIM的一个关键特性是其整合了一个交叉注意机制，专注于试剂和反应中心之间的相互作用。这种设计反映了化学反应中的一个基本原理：试剂在影响断裂和形成过程中发挥的关键作用，最终影响反应产率。log-RRIM还实施了一种从局部到全局的反应表示学习策略。这种方法首先捕获了详细的分子级信息，然后对分子间相互作用进行建模和聚合。通过这种分层过程，log-RRIM有效地捕捉了不同分子片段如何对整体反应产率产生影响，无论它们的大小变化如何。log-RRIM在我们的实验中表现出卓越的性能，特别是对于中等到高产率的反应，证明了它作为一个预测器的可靠性。该框架对反应物-试剂相互作用的复杂建模和对分子片段贡献的精确捕捉使其成为化学合成中反应规划和优化的宝贵工具。log-RRIM的数据和代码可以通过https://github.com/ninglab/Yield_log_RRIM 访问。

更新时间: 2025-03-09 03:43:34

领域: q-bio.BM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2411.03320v4

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security, which cannot be addressed by traditional textual-harm-focused LLM guardrails. We propose GuardAgent, the first guardrail agent to protect the target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution. By performing the code execution, GuardAgent can deterministically follow the safety guard request and safeguard target agents. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing experiences from previous tasks. GuardAgent can understand different safety guard requests and provide reliable code-based guardrails with high flexibility and low operational overhead. In addition, we propose two novel benchmarks: EICU-AC benchmark to assess the access control for healthcare agents and Mind2Web-SC benchmark to evaluate the safety policies for web agents. We show that GuardAgent effectively moderates the violation actions for different types of agents on these two benchmarks with over 98% and 83% guardrail accuracies, respectively. Project page: https://guardagent.github.io/

Updated: 2025-03-09 03:42:18

标题: GuardAgent：通过知识启用的推理保护LLM代理

摘要: 大型语言模型（LLM）代理的快速发展引起了对其安全性和保密性的新关注，这些问题无法通过传统的以文本为中心的LLM防护措施来解决。我们提出了GuardAgent，这是第一个通过动态检查其行为是否符合给定的安全防护请求来保护目标代理的防护代理。具体来说，GuardAgent首先分析安全防护请求以生成任务计划，然后将该计划映射到防护代码以执行。通过执行代码，GuardAgent可以确定性地遵循安全防护请求并保护目标代理。在这两个步骤中，LLM被用作推理组件，辅以从存储有先前任务经验的记忆模块检索的上下文演示。GuardAgent能够理解不同的安全防护请求，并提供具有高灵活性和低操作开销的可靠基于代码的防护措施。此外，我们提出了两个新颖的基准：EICU-AC基准用于评估医疗代理的访问控制，Mind2Web-SC基准用于评估网络代理的安全策略。我们展示了GuardAgent在这两个基准上有效地调节了不同类型代理的违规行为，分别达到了98%和83%以上的防护准确性。项目页面：https://guardagent.github.io/

更新时间: 2025-03-09 03:42:18

领域: cs.LG

下载: http://arxiv.org/abs/2406.09187v2

Interference-Aware Edge Runtime Prediction with Conformal Matrix Completion

Accurately estimating workload runtime is a longstanding goal in computer systems, and plays a key role in efficient resource provisioning, latency minimization, and various other system management tasks. Runtime prediction is particularly important for managing increasingly complex distributed systems in which more sophisticated processing is pushed to the edge in search of better latency. Previous approaches for runtime prediction in edge systems suffer from poor data efficiency or require intensive instrumentation; these challenges are compounded in heterogeneous edge computing environments, where historical runtime data may be sparsely available and instrumentation is often challenging. Moreover, edge computing environments often feature multi-tenancy due to limited resources at the network edge, potentially leading to interference between workloads and further complicating the runtime prediction problem. Drawing from insights across machine learning and computer systems, we design a matrix factorization-inspired method that generates accurate interference-aware predictions with tight provably-guaranteed uncertainty bounds. We validate our method on a novel WebAssembly runtime dataset collected from 24 unique devices, achieving a prediction error of 5.2% -- 2x better than a naive application of existing methods.
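
To illustrate what a provably guaranteed uncertainty bound means here, the following is a minimal sketch of generic split conformal prediction with absolute-residual scores. It is not the paper's conformal matrix completion method; the function name and inputs are hypothetical, and the point predictions are assumed to come from any runtime model.

```python
import math

def conformal_interval(cal_true, cal_pred, new_pred, alpha=0.1):
    """Split conformal interval around new_pred with target coverage 1 - alpha.

    cal_true and cal_pred are held-out calibration runtimes and the
    model's predictions for them.
    """
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha))
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)
```

The coverage guarantee relies on the calibration and test points being exchangeable, which is worth checking when workloads or devices shift over time.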

Updated: 2025-03-09 03:41:32

标题: 具有符合矩阵完成的干扰感知边缘运行时预测

摘要: 准确估计工作负载运行时间是计算机系统中长期以来的目标，并在资源分配效率、延迟最小化和各种其他系统管理任务中发挥关键作用。运行时间预测在管理日益复杂的分布式系统中尤为重要，其中更复杂的处理被推送到边缘，以寻求更好的延迟。以往在边缘系统中进行运行时间预测的方法存在数据效率低下或需要大量的仪器设备；这些挑战在异构边缘计算环境中更为复杂，其中历史运行时间数据可能稀缺，仪器设备通常具有挑战性。此外，边缘计算环境通常具有多租户特性，因为网络边缘资源有限，可能导致工作负载之间的干扰，并进一步复杂化运行时间预测问题。借鉴机器学习和计算机系统的见解，我们设计了一种受矩阵分解启发的方法，生成具有紧凑可证明保证的不确定性边界的准确的干扰感知预测。我们在从24台独特设备收集的新型WebAssembly运行时间数据集上验证了我们的方法，实现了5.2%的预测误差--比现有方法的朴素应用好两倍。

更新时间: 2025-03-09 03:41:32

领域: cs.LG

下载: http://arxiv.org/abs/2503.06428v1

Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning

Visual generative abductive learning studies jointly training a symbol-grounded neural visual generator and inducing logic rules from data, such that after learning, the visual generation process is guided by the induced logic rules. A major challenge for this task is to reduce the time cost of logic abduction during learning, an essential step when the logic symbol set is large and the logic rule to induce is complicated. To address this challenge, we propose a pre-training method for obtaining a meta-rule selection policy for the recently proposed visual generative learning approach AbdGen [Peng et al., 2023], aiming at significantly reducing the candidate meta-rule set and pruning the search space. The selection model is built on the embedding representations of both the symbol grounding of cases and the meta-rules, and can be effectively integrated with both the neural model and the logic reasoning system. The pre-training process is done on pure symbol data, not involving symbol grounding learning of raw visual inputs, making the entire learning process low-cost. An additional interesting observation is that the selection policy can rectify symbol grounding errors unseen during pre-training, which results from the memorization ability of the attention mechanism and the relative stability of symbolic patterns. Experimental results show that our method is able to effectively address the meta-rule selection problem for visual abduction, boosting the efficiency of visual generative abductive learning. Code is available at https://github.com/future-item/metarule-select.

Updated: 2025-03-09 03:41:11

标题: 预训练视觉生成性推理学习的元规则选择策略

摘要: 视觉生成溯因学习研究联合训练符号接地的神经视觉生成器并从数据中归纳逻辑规则，使得学习之后视觉生成过程由归纳出的逻辑规则引导。这一任务的一个主要挑战是在学习过程中减少逻辑溯因的时间成本，当逻辑符号集很大且要归纳的逻辑规则很复杂时，逻辑溯因是一个必要的步骤。为了解决这一挑战，我们提出了一种预训练方法，用于获取最近提出的视觉生成学习方法AbdGen [Peng等人，2023]的元规则选择策略，旨在显著减少候选元规则集并剪枝搜索空间。选择模型是基于案例的符号接地和元规则的嵌入表示构建的，可以有效地与神经模型和逻辑推理系统集成。预训练过程是在纯符号数据上完成的，不涉及原始视觉输入的符号接地学习，使整个学习过程成本低廉。另一个有趣的观察结果是，选择策略可以纠正在预训练期间未见过的符号接地错误，这得益于注意力机制的记忆能力和符号模式的相对稳定性。实验结果表明，我们的方法能够有效解决视觉溯因的元规则选择问题，提高了视觉生成溯因学习的效率。代码可在https://github.com/future-item/metarule-select上找到。

更新时间: 2025-03-09 03:41:11

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2503.06427v1

Federated Learning for Diffusion Models

Diffusion models are powerful generative models that can produce highly realistic samples for various tasks. Typically, these models are constructed using centralized, independently and identically distributed (IID) training data. However, in practical scenarios, data is often distributed across multiple clients and frequently manifests non-IID characteristics. Federated Learning (FL) can leverage this distributed data to train diffusion models, but the performance of existing FL methods is unsatisfactory in non-IID scenarios. To address this, we propose FedDDPM (Federated Learning with Denoising Diffusion Probabilistic Models), which leverages the data generative capability of diffusion models to facilitate model training. In particular, before FL training, the server uses well-trained local diffusion models uploaded by each client to generate auxiliary data that can approximately represent the global data distribution. Following each round of model aggregation, the server further optimizes the global model using the auxiliary dataset to alleviate the impact of heterogeneous data on model performance. We provide a rigorous convergence analysis of FedDDPM and propose an enhanced algorithm, FedDDPM+, to reduce training overheads. FedDDPM+ detects instances of slow model learning and performs a one-shot correction using the auxiliary dataset. Experimental results validate that our proposed algorithms outperform the state-of-the-art FL algorithms on the MNIST, CIFAR10 and CIFAR100 datasets.
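
The round structure described above (aggregate client models, then refine the global model on server-side auxiliary data) can be sketched as follows. The aggregation is standard sample-size-weighted FedAvg; `server_refine` and its `aux_grad_fn` argument are hypothetical placeholders for the server's optimization on the generated auxiliary dataset, and models are represented as flat lists of floats for simplicity.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

def server_refine(weights, aux_grad_fn, lr=0.1, steps=1):
    """Take gradient steps on the auxiliary data after aggregation.

    aux_grad_fn(weights) is assumed to return the loss gradient computed
    on the server's auxiliary dataset.
    """
    for _ in range(steps):
        grad = aux_grad_fn(weights)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights
```

A round would then look like `server_refine(fedavg(ws, ns), aux_grad_fn)`. As described in the abstract, FedDDPM+ applies the correction once, when slow learning is detected, rather than after every round.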

Updated: 2025-03-09 03:41:10

标题: 面向扩散模型的联邦学习

摘要: 扩散模型是一种强大的生成模型，可以为各种任务生成高度逼真的样本。通常，这些模型是使用集中、独立和同分布（IID）的训练数据构建的。然而，在实际场景中，数据通常分布在多个客户端之间，并且经常表现出非IID特征。联邦学习（FL）可以利用这种分布式数据来训练扩散模型，但现有的FL方法在非IID场景中的表现令人不满意。为了解决这个问题，我们提出了FedDDPM-使用去噪扩散概率模型的联邦学习，利用扩散模型的数据生成能力来促进模型训练。具体来说，服务器在FL训练之前使用每个客户端上传的训练有素的本地扩散模型来生成辅助数据，这些数据可以近似表示全局数据分布。在每一轮模型聚合之后，服务器进一步优化全局模型，使用辅助数据集来减轻异构数据对模型性能的影响。我们对FedDDPM进行了严格的收敛分析，并提出了一个增强算法FedDDPM+，以减少训练开销。FedDDPM+检测到模型学习缓慢的实例，并使用辅助数据集进行一次性校正。实验结果验证了我们提出的算法在MNIST、CIFAR10和CIFAR100数据集上优于最先进的FL算法。

更新时间: 2025-03-09 03:41:10

领域: cs.LG,cs.CV,cs.DC

下载: http://arxiv.org/abs/2503.06426v1

WECAR: An End-Edge Collaborative Inference and Training Framework for WiFi-Based Continuous Human Activity Recognition

WiFi-based human activity recognition (HAR) holds significant promise for ubiquitous sensing in smart environments. A critical challenge lies in enabling systems to dynamically adapt to evolving scenarios, learning new activities without catastrophic forgetting of prior knowledge, while adhering to the stringent computational constraints of edge devices. Current approaches struggle to reconcile these requirements due to prohibitive storage demands for retaining historical data and inefficient parameter utilization. We propose WECAR, an end-edge collaborative inference and training framework for WiFi-based continuous HAR, which decouples computational workloads to overcome these limitations. In this framework, edge devices handle model training, lightweight optimization, and updates, while end devices perform efficient inference. WECAR introduces two key innovations, i.e., dynamic continual learning with parameter efficiency and hierarchical distillation for end deployment. For the former, we propose a transformer-based architecture enhanced by task-specific dynamic model expansion and stability-aware selective retraining. For the latter, we propose a dual-phase distillation mechanism that includes multi-head self-attention relation distillation and prefix relation distillation. We implement WECAR based on heterogeneous hardware using Jetson Nano as edge devices and the ESP32 as end devices, respectively. Our experiments across three public WiFi datasets reveal that WECAR not only outperforms several state-of-the-art methods in performance and parameter efficiency, but also achieves a substantial reduction in the model's parameter count post-optimization without sacrificing accuracy. This validates its practicality for resource-constrained environments.

Updated: 2025-03-09 03:40:27

标题: WECAR：一种基于WiFi的连续人体活动识别的端边协作推理和训练框架

摘要: 基于WiFi的人体活动识别（HAR）在智能环境中具有重要的潜力。一个关键挑战在于使系统能够动态适应不断变化的场景，学习新的活动而不会忘记先前的知识，同时受限于边缘设备的严格计算约束。当前的方法由于要求保留历史数据的存储需求过高以及参数利用效率低而难以解决这些要求。我们提出了WECAR，这是一个基于WiFi的连续HAR的端-边协作推理和训练框架，它解耦了计算负载以克服这些限制。在这个框架中，边缘设备处理模型训练、轻量级优化和更新，而端设备执行高效的推理。WECAR引入了两个关键创新，即具有参数效率的动态持续学习和用于端部署的分层蒸馏。对于前者，我们提出了一个基于变压器的架构，通过特定任务的动态模型扩展和稳定性感知的选择性重训练来增强。对于后者，我们提出了一个包括多头自注意关系蒸馏和前缀关系蒸馏的双相蒸馏机制。我们基于异构硬件使用Jetson Nano作为边缘设备，ESP32作为端设备来实现WECAR。我们在三个公共WiFi数据集上的实验表明，WECAR不仅在性能和参数效率方面优于几种最先进的方法，而且在参数优化后实现了模型参数数量的大幅减少，而不会牺牲准确性。这验证了它在资源受限环境中的实用性。

更新时间: 2025-03-09 03:40:27

领域: cs.LG

下载: http://arxiv.org/abs/2503.07669v1

Training Free Guided Flow Matching with Optimal Control

Controlled generation with pre-trained Diffusion and Flow Matching models has vast applications. One strategy for guiding ODE-based generative models is through optimizing a target loss $R(x_1)$ while staying close to the prior distribution. Along this line, some recent work showed the effectiveness of guiding flow models by differentiating through their ODE sampling process. Despite the superior performance, the theoretical understanding of this line of methods is still preliminary, leaving space for algorithm improvement. Moreover, existing methods predominantly focus on Euclidean data manifolds, and there is a compelling need for guided flow methods on complex geometries such as SO(3), which is prevalent in high-stakes scientific applications like protein design. We present OC-Flow, a general and theoretically grounded training-free framework for guided flow matching using optimal control. Building upon advances in optimal control theory, we develop effective and practical algorithms for solving optimal control in guided ODE-based generation and provide a systematic theoretical analysis of the convergence guarantee in both Euclidean space and SO(3). We show that existing backprop-through-ODE methods can be interpreted as special cases of Euclidean OC-Flow. OC-Flow achieved superior performance in extensive experiments on text-guided image manipulation, conditional molecule generation, and all-atom peptide design.

Updated: 2025-03-09 03:35:34

标题: 基于最优控制的免训练引导流匹配

摘要: 使用预先训练的Diffusion和Flow Matching模型进行控制生成具有广泛的应用。引导基于ODE的生成模型的一种策略是通过优化目标损失$R(x_1)$，同时保持接近先验分布。沿着这条线，一些最近的工作展示了通过对其ODE采样过程求导来引导流模型的有效性。尽管表现优异，但对这一系列方法的理论理解仍处于初步阶段，留下了算法改进的空间。此外，现有方法主要集中在欧几里德数据流形上，对于SO(3)等复杂几何形状的引导流方法有迫切需求，这类几何在蛋白设计等高风险科学应用中十分普遍。我们提出OC-Flow，这是一个通用且有理论依据的免训练框架，用于使用最优控制进行引导流匹配。借鉴最优控制理论的进展，我们开发了解决引导ODE生成中的最优控制的有效和实用算法，并在欧几里德和SO(3)中提供了收敛保证的系统理论分析。我们展示了现有的反向传播通过ODE方法可以解释为欧几里德OC-Flow的特例。OC-Flow在文本引导图像操作、条件分子生成和全原子肽设计的广泛实验中取得了优越的性能。

更新时间: 2025-03-09 03:35:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.18070v3

GenAI for Simulation Model in Model-Based Systems Engineering

Generative AI (GenAI) has demonstrated remarkable capabilities in code generation, and its integration into complex product modeling and simulation code generation can significantly enhance the efficiency of the system design phase in Model-Based Systems Engineering (MBSE). In this study, we introduce a generative system design methodology framework for MBSE, offering a practical approach for the intelligent generation of simulation models for system physical properties. First, we employ inference techniques, generative models, and integrated modeling and simulation languages to construct simulation models for system physical properties based on product design documents. Subsequently, we fine-tune the language model used for simulation model generation on an existing library of simulation models and additional datasets generated through generative modeling. Finally, we introduce evaluation metrics for the generated simulation models for system physical properties. Our proposed approach to simulation model generation presents the innovative concept of scalable templates for simulation models. Using these templates, GenAI generates simulation models for system physical properties through code completion. The experimental results demonstrate that, for mainstream open-source Transformer-based models, the quality of the simulation model is significantly improved using the simulation model generation method proposed in this paper.

Updated: 2025-03-09 03:33:25

标题: 基于模型的系统工程中的仿真模型的GenAI

摘要: Generative AI（GenAI）已经在代码生成方面展示出了卓越的能力，将其整合到复杂产品建模和仿真代码生成中可以显著提高基于模型的系统工程（MBSE）中系统设计阶段的效率。在这项研究中，我们介绍了一个适用于MBSE的生成式系统设计方法框架，为系统物理属性的智能生成仿真模型提供了一种实用方法。首先，我们利用推理技术、生成模型和集成建模和仿真语言，基于产品设计文件构建系统物理属性的仿真模型。随后，我们通过对现有仿真模型库和通过生成式建模产生的额外数据集进行微调，以用于仿真模型生成的语言模型。最后，我们引入了用于评估系统物理属性生成的仿真模型的评估指标。我们提出的仿真模型生成方法呈现了可扩展模板的创新概念，通过这些模板，GenAI通过代码补全生成系统物理属性的仿真模型。实验结果表明，对于主流开源基于Transformer的模型，使用本文提出的仿真模型生成方法可以显著提高仿真模型的质量。

更新时间: 2025-03-09 03:33:25

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2503.06422v1

Explaining Control Policies through Predicate Decision Diagrams

Safety-critical controllers of complex systems are hard to construct manually. Automated approaches such as controller synthesis or learning provide a tempting alternative but usually lack explainability. To this end, decision tree (DT) learning has been widely used to obtain an interpretable model of the generated controllers. However, DTs do not exploit shared decision-making, a key concept used in binary decision diagrams (BDDs) to reduce their size and thus improve explainability. In this work, we introduce predicate decision diagrams (PDDs) that extend BDDs with predicates and thus unite the advantages of DTs and BDDs for controller representation. We establish a synthesis pipeline for efficient construction of PDDs from DTs representing controllers, applying BDD reduction techniques to PDDs as well.
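
The size reduction that sharing brings can be illustrated with the two classic BDD reduction rules applied to a decision tree whose inner nodes test predicates: merge isomorphic subtrees, and drop tests whose branches lead to the same node. This is a minimal sketch, not the paper's synthesis pipeline; the tuple encoding of trees is an assumption, and predicate ordering is not handled.

```python
def share_subtrees(tree, table=None):
    """Convert a decision tree into a DAG with shared nodes.

    tree is either a leaf action (a string) or a tuple
    (predicate, low_subtree, high_subtree). Returns (node_id, table),
    where table maps each unique node to an integer id.
    """
    if table is None:
        table = {}
    if isinstance(tree, tuple):
        pred, low, high = tree
        low_id, _ = share_subtrees(low, table)
        high_id, _ = share_subtrees(high, table)
        if low_id == high_id:
            return low_id, table  # redundant test: both branches agree
        key = (pred, low_id, high_id)
    else:
        key = ("leaf", tree)
    if key not in table:
        table[key] = len(table)  # first occurrence creates the node
    return table[key], table
```

For example, a tree in which the subtree `("y > 1", "brake", "accelerate")` occurs under two different branches stores that subtree once, so the diagram has fewer nodes than the tree it was built from.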

Updated: 2025-03-09 03:31:48

标题: 用谓词决策图解释控制策略

摘要: 安全关键控制器是难以手动构建的复杂系统。自动化方法，如控制器合成或学习，提供了一种诱人的替代方案，但通常缺乏可解释性。为此，学习决策树（DTs）已广泛用于生成控制器的可解释模型。然而，DTs没有利用共享决策这一在二进制决策图（BDDs）中利用的关键概念，以减少其大小，从而提高可解释性。在这项工作中，我们引入了谓词决策图（PDDs），将谓词与BDDs相结合，从而统一了DTs和BDDs的优势，用于控制器表示。我们建立了一个合成流水线，用于从表示控制器的DTs有效构建PDDs，利用了BDDs的缩减技术也适用于PDDs。

更新时间: 2025-03-09 03:31:48

领域: cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.06420v1

Advancing AI Negotiations: New Theory and Evidence from a Large-Scale Autonomous Negotiations Competition

Despite the rapid proliferation of artificial intelligence (AI) negotiation agents, there has been limited integration of computer science research and established negotiation theory to develop new theories of AI negotiation. To bridge this gap, we conducted an International AI Negotiations Competition in which participants iteratively designed and refined prompts for large language model (LLM) negotiation agents. We then facilitated over 120,000 negotiations between these agents across multiple scenarios with diverse characteristics and objectives. Our findings revealed that fundamental principles from established human-human negotiation theory remain crucial in AI-AI negotiations. Specifically, agents exhibiting high warmth fostered higher counterpart subjective value and reached deals more frequently, which enabled them to create and claim more value in integrative settings. However, conditional on reaching a deal, warm agents claimed less value while dominant agents claimed more value. These results align with classic negotiation theory emphasizing relationship-building, assertiveness, and preparation. Our analysis also revealed unique dynamics in AI-AI negotiations not fully explained by negotiation theory, particularly regarding the effectiveness of AI-specific strategies like chain-of-thought reasoning and prompt injection. The agent that won our competition implemented an approach that blended traditional negotiation preparation frameworks with AI-specific methods. Together, these results suggest the importance of establishing a new theory of AI negotiations that integrates established negotiation theory with AI-specific strategies to optimize agent performance. Our research suggests this new theory must account for the unique characteristics of autonomous agents and establish the conditions under which traditional negotiation theory applies in automated settings.

Updated: 2025-03-09 03:25:48

标题: 推进人工智能谈判：来自大规模自主谈判竞赛的新理论和证据

摘要: 尽管人工智能（AI）谈判代理的快速增长，但计算机科学研究和已建立的谈判理论之间的整合仍然有限，以开发新的AI谈判理论。为了弥合这一差距，我们开展了一项国际AI谈判竞赛，参与者反复设计和完善了大型语言模型（LLM）谈判代理的提示。然后，我们在具有多样化特征和目标的多个场景中促成了超过12万次这些代理之间的谈判。我们的研究结果显示，已建立的人际谈判理论中的基本原则在AI-AI谈判中仍然至关重要。具体而言，表现出高热情度的代理促进了对手的主观价值更高，并更频繁达成协议，从而使它们在整合性环境中创造并主张更多价值。然而，在达成协议的前提下，热情的代理主张的价值较少，而主导的代理主张的价值较多。这些结果符合强调建立关系、自信和准备工作的经典谈判理论。我们的分析还揭示了AI-AI谈判中不完全由谈判理论解释的独特动态，特别是涉及AI特定策略如思维链推理和提示注入的有效性。赢得我们比赛的代理实施了一种将传统谈判准备框架与AI特定方法结合的方法。总之，这些结果表明建立一种整合已建立谈判理论和AI特定策略的新AI谈判理论对于优化代理性能至关重要。我们的研究表明，这种新理论必须考虑自主代理的独特特征，并确定传统谈判理论在自动化环境中适用的条件。

更新时间: 2025-03-09 03:25:48

领域: cs.AI,cs.HC

下载: http://arxiv.org/abs/2503.06416v1

Hierarchical graph sampling based minibatch learning with chain preservation and variance reduction

Graph sampling based Graph Convolutional Networks (GCNs) decouple the sampling from the forward and backward propagation during minibatch training, and thus exhibit good scalability in terms of layer depth and graph size. We propose HIS_GCNs, a hierarchical importance graph sampling based learning method. By constructing minibatches using sampled subgraphs, HIS_GCNs gives attention to the importance of both core and periphery nodes/edges in a scale-free training graph. Specifically, it preserves the centrum of the core in most minibatches, which maintains connectivity between periphery nodes, and samples periphery edges without core node interference, in order to keep more long chains composed entirely of low-degree nodes in the same minibatch. HIS_GCNs can maximize the discrete Ricci curvature (i.e., Ollivier-Ricci curvature) of the edges in a subgraph, which enables the preservation of important chains for information propagation, and can achieve a low node embedding variance and a high convergence speed. Diverse experiments on Graph Neural Networks (GNNs) with node classification tasks confirm the superior performance of HIS_GCNs in both accuracy and training time.

Updated: 2025-03-09 03:23:09

标题: Hierarchical graph sampling 基于 minibatch 学习，同时保持链条并减少方差

摘要: 基于图采样的图卷积网络（GCNs）将采样与前向和后向传播在小批量训练期间分离，具有很好的层深度和图大小的可扩展性。我们提出了HIS_GCNs，一种基于分层重要性图采样的学习方法。通过使用采样的子图构建小批量，HIS_GCNs关注训练图中核心和外围节点/边的重要性。具体来说，它将核心的中心保留到大多数小批量中，以保持外围节点之间的连接，并在没有核心节点干扰的情况下对外围边进行采样，以保留更多完全由低度节点组成的长链在同一个小批量中。HIS_GCNs可以最大化子图中边的离散黎曼曲率（即Ollivier-Ricci曲率），从而保留信息传播的重要链，并实现低节点嵌入方差和高收敛速度。对具有节点分类任务的图神经网络（GNNs）进行的多样实验证实了HIS_GCNs在准确性和训练时间方面的优越性能。

更新时间: 2025-03-09 03:23:09

领域: cs.LG

下载: http://arxiv.org/abs/2503.00860v3

WinTSR: A Windowed Temporal Saliency Rescaling Method for Interpreting Time Series Deep Learning Models

Interpreting complex time series forecasting models is challenging due to the temporal dependencies between time steps and the dynamic relevance of input features over time. Existing interpretation methods are limited by focusing mostly on classification tasks, evaluating using custom baseline models instead of the latest time series models, using simple synthetic datasets, and requiring training another model. We introduce a novel interpretation method, Windowed Temporal Saliency Rescaling (WinTSR), that addresses these limitations. WinTSR explicitly captures temporal dependencies among the past time steps and efficiently scales the feature importance with this time importance. We benchmark WinTSR against 10 recent interpretation techniques with 5 state-of-the-art deep-learning models of different architectures, including a time series foundation model. We use 3 real-world datasets for both time-series classification and regression. Our comprehensive analysis shows that WinTSR significantly outperforms other local interpretation methods in overall performance. Finally, we provide a novel, open-source framework to interpret the latest time series transformers and foundation models.

Updated: 2025-03-09 03:16:36

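Title: WinTSR: A Windowed Temporal Saliency Rescaling Method for Interpreting Time Series Deep Learning Models

Abstract: Interpreting complex time series forecasting models is challenging because of the temporal dependencies between time steps and because the relevance of input features changes over time. Existing interpretation methods are limited in that they focus mostly on classification tasks, evaluate on custom baseline models rather than the latest time series models, use simple synthetic datasets, and require training another model. We introduce a novel interpretation method, Windowed Temporal Saliency Rescaling (WinTSR), that addresses these limitations. WinTSR explicitly captures the temporal dependencies among past time steps and efficiently scales feature importance by this time importance. We benchmark WinTSR against 10 recent interpretation techniques, using 5 state-of-the-art deep learning models of different architectures, including a time series foundation model. We use 3 real-world datasets covering both time series classification and regression. Our comprehensive analysis shows that WinTSR significantly outperforms other local interpretation methods in overall performance. Finally, we provide a novel open-source framework for interpreting the latest time series transformers and foundation models.

Updated: 2025-03-09 03:16:36

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2412.04532v3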

Swift Hydra: Self-Reinforcing Generative Framework for Anomaly Detection with Multiple Mamba Models

Despite a plethora of anomaly detection models developed over the years, their ability to generalize to unseen anomalies remains an issue, particularly in critical systems. This paper aims to address this challenge by introducing Swift Hydra, a new framework for training an anomaly detection method based on generative AI and reinforcement learning (RL). By featuring an RL policy that operates on the latent variables of a generative model, the framework synthesizes novel and diverse anomaly samples that are capable of bypassing a detection model. These generated synthetic samples are, in turn, used to augment the detection model, further improving its ability to handle challenging anomalies. Swift Hydra also incorporates Mamba models structured as a Mixture of Experts (MoE) to enable scalable adaptation of the number of Mamba experts based on data complexity, effectively capturing diverse feature distributions without increasing the model's inference time. Empirical evaluations on the ADBench benchmark demonstrate that Swift Hydra outperforms other state-of-the-art anomaly detection models while maintaining a relatively short inference time. These results highlight a new and promising paradigm of integrating RL and generative AI for advancing anomaly detection.

Updated: 2025-03-09 03:15:15

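Title: Swift Hydra: Self-Reinforcing Generative Framework for Anomaly Detection with Multiple Mamba Models

Abstract: Although many anomaly detection models have been developed over the years, their ability to generalize to unseen anomalies remains a problem, particularly in critical systems. This paper addresses the challenge by introducing Swift Hydra, a new framework for training an anomaly detection method based on generative AI and reinforcement learning (RL). By featuring an RL policy that operates on the latent variables of a generative model, the framework synthesizes novel and diverse anomaly samples capable of bypassing a detection model. These synthetic samples are in turn used to augment the detection model, further improving its ability to handle challenging anomalies. Swift Hydra also incorporates Mamba models structured as a Mixture of Experts (MoE), so that the number of Mamba experts can be adapted scalably to data complexity, effectively capturing diverse feature distributions without increasing the model's inference time. Empirical evaluations on the ADBench benchmark show that Swift Hydra outperforms other state-of-the-art anomaly detection models while maintaining a relatively short inference time. These results highlight a new and promising paradigm of integrating RL and generative AI to advance anomaly detection.

Updated: 2025-03-09 03:15:15

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.06413v1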

Decoding the Black Box: Integrating Moral Imagination with Technical AI Governance

This paper examines the intricate interplay among AI safety, security, and governance by integrating technical systems engineering with principles of moral imagination and ethical philosophy. Drawing on foundational insights from Weapons of Math Destruction and Thinking in Systems alongside contemporary debates in AI ethics, we develop a comprehensive multi-dimensional framework designed to regulate AI technologies deployed in high-stakes domains such as defense, finance, healthcare, and education. Our approach combines rigorous technical analysis, quantitative risk assessment, and normative evaluation to expose systemic vulnerabilities inherent in opaque, black-box models. Detailed case studies, including analyses of Microsoft Tay (2016) and the UK A-Level Grading Algorithm (2020), demonstrate how security lapses, bias amplification, and lack of accountability can precipitate cascading failures that undermine public trust. We conclude by outlining targeted strategies for enhancing AI resilience through adaptive regulatory mechanisms, robust security protocols, and interdisciplinary oversight, thereby advancing the state of the art in ethical and technical AI governance.

Updated: 2025-03-09 03:11:32

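Title: Decoding the Black Box: Integrating Moral Imagination with Technical AI Governance

Abstract: This paper examines the intricate interplay among AI safety, security, and governance by integrating technical systems engineering with principles of moral imagination and ethical philosophy. Drawing on foundational insights from Weapons of Math Destruction and Thinking in Systems, together with contemporary debates in AI ethics, we develop a comprehensive multi-dimensional framework for regulating AI technologies deployed in high-stakes domains such as defense, finance, healthcare, and education. Our approach combines rigorous technical analysis, quantitative risk assessment, and normative evaluation to expose the systemic vulnerabilities inherent in opaque, black-box models. Detailed case studies, including analyses of Microsoft Tay (2016) and the UK A-Level Grading Algorithm (2020), show how security lapses, bias amplification, and lack of accountability can trigger cascading failures that undermine public trust. We conclude by outlining targeted strategies for enhancing AI resilience through adaptive regulatory mechanisms, robust security protocols, and interdisciplinary oversight, thereby advancing the state of the art in ethical and technical AI governance.

Updated: 2025-03-09 03:11:32

Categories: eess.SY,cs.AI,cs.SY

Download: http://arxiv.org/abs/2503.06411v1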

Performant LLM Agentic Framework for Conversational AI

The rise of Agentic applications and automation in the Voice AI industry has led to an increased reliance on Large Language Models (LLMs) to navigate graph-based logic workflows composed of nodes and edges. However, existing methods face challenges such as alignment errors in complex workflows and hallucinations caused by excessive context size. To address these limitations, we introduce the Performant Agentic Framework (PAF), a novel system that assists LLMs in selecting appropriate nodes and executing actions in order when traversing complex graphs. PAF combines LLM-based reasoning with a mathematically grounded vector scoring mechanism, achieving both higher accuracy and reduced latency. Our approach dynamically balances strict adherence to predefined paths with flexible node jumps to handle various user inputs efficiently. Experiments demonstrate that PAF significantly outperforms baseline methods, paving the way for scalable, real-time Conversational AI systems in complex business environments.

Updated: 2025-03-09 02:58:34

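Title: Performant LLM Agentic Framework for Conversational AI

Abstract: The rise of agentic applications and automation in the Voice AI industry has increased reliance on Large Language Models (LLMs) to navigate graph-based logic workflows composed of nodes and edges. However, existing methods face challenges such as alignment errors in complex workflows and hallucinations caused by excessive context size. To address these limitations, we introduce the Performant Agentic Framework (PAF), a novel system that helps LLMs select appropriate nodes and execute actions in order when traversing complex graphs. PAF combines LLM-based reasoning with a mathematically grounded vector scoring mechanism, achieving both higher accuracy and lower latency. Our approach dynamically balances strict adherence to predefined paths with flexible node jumps in order to handle varied user inputs efficiently. Experiments show that PAF significantly outperforms baseline methods, paving the way for scalable, real-time Conversational AI systems in complex business environments.

Updated: 2025-03-09 02:58:34

Categories: cs.AI

Download: http://arxiv.org/abs/2503.06410v1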

GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design

Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art graph learning model. However, running GCN inference over large graph datasets can be notoriously challenging, which limits their application to large real-world graphs and hinders the exploration of deeper and more sophisticated GCN graphs. This is because real-world graphs can be extremely large and sparse. Furthermore, the node degrees in GCN graphs tend to follow a power-law distribution, which yields highly irregular adjacency matrices, resulting in prohibitive inefficiencies in both data processing and movement and thus substantially limiting the achievable GCN acceleration efficiency. To this end, this paper proposes a GCN algorithm and accelerator co-design framework dubbed GCoD, which can largely alleviate the aforementioned GCN irregularity and boost GCNs' inference efficiency. Specifically, on the algorithm level, GCoD integrates a split and conquer GCN training strategy that polarizes the graphs to be either denser or sparser in local neighborhoods without compromising the model accuracy, resulting in graph adjacency matrices that (mostly) have merely two levels of workload and enjoy largely enhanced regularity and thus ease of acceleration. On the hardware level, we further develop a dedicated two-pronged accelerator with a separate engine to process each of the aforementioned denser and sparser workloads, further boosting the overall utilization and acceleration efficiency. Extensive experiments and ablation studies validate that our GCoD consistently reduces the number of off-chip accesses, leading to speedups of 15286x, 294x, 7.8x, and 2.5x over CPUs, GPUs, and the prior-art GCN accelerators HyGCN and AWB-GCN, respectively, while maintaining or even improving the task accuracy. Code is available at https://github.com/RICE-EIC/GCoD.

Updated: 2025-03-09 02:58:24

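Title: GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design

Abstract: Graph Convolutional Networks (GCNs) have become the state-of-the-art graph learning model. However, running GCN inference on large graph datasets is often challenging, which limits their application to large real-world graphs and hinders the exploration of deeper and more sophisticated GCN graphs. This is because real-world graphs can be extremely large and sparse. In addition, node degrees tend to follow a power-law distribution, giving highly irregular adjacency matrices, which leads to inefficiency in both data processing and data movement and substantially limits the achievable GCN acceleration efficiency. This paper therefore proposes GCoD, a GCN algorithm and accelerator co-design framework that can largely alleviate this irregularity and boost GCN inference efficiency. Specifically, at the algorithm level, GCoD integrates a split and conquer GCN training strategy that polarizes the graphs to be either denser or sparser in local neighborhoods without compromising model accuracy, producing graph adjacency matrices that have only two levels of workload and much greater regularity, and are therefore easier to accelerate. At the hardware level, we further develop a dedicated two-pronged accelerator with separate engines for the denser and the sparser workloads, further improving overall utilization and acceleration efficiency. Extensive experiments and ablation studies validate that GCoD consistently reduces the number of off-chip accesses, achieving speedups of 15286x, 294x, 7.8x, and 2.5x over CPUs, GPUs, and prior GCN accelerators (including HyGCN and AWB-GCN), respectively, while maintaining or improving task accuracy. Code is available at https://github.com/RICE-EIC/GCoD.

Updated: 2025-03-09 02:58:24

Categories: cs.AR,cs.LG

Download: http://arxiv.org/abs/2112.11594v3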

CodeBrain: Imputing Any Brain MRI via Modality- and Instance-Specific Codes

Unified MRI imputation, which can adapt to diverse imputation scenarios, is highly desirable as it reduces scanning costs and provides comprehensive MRI information for improved clinical diagnosis. Existing unified MRI imputation methods either rely on specific prompts to guide their transformation network or require multiple modality-specific modules. However, these approaches struggle to capture large modality and instance variations or become too complex to generalize effectively. To address these limitations, we propose CodeBrain, a fundamentally different pipeline for unified brain MRI imputation. Our key idea is to reframe various inter-modality transformations as a full-modality code prediction task via a two-stage framework. In the first stage, CodeBrain reconstructs a target modality from any other modalities by learning a compact scalar-quantized code for each instance and modality. Any target modality can then be reconstructed with high fidelity by combining the corresponding code with shared features extracted from any available modality. In the second stage, a projection encoder is trained to predict full-modality compact codes from any incomplete MRI samples, effectively simulating various imputation scenarios. We evaluate our CodeBrain on two public brain MRI datasets (i.e., IXI and BraTS 2023). Extensive experiments demonstrate that CodeBrain outperforms state-of-the-art methods, setting a new benchmark for unified brain MRI imputation. Our code will be released.

Updated: 2025-03-09 02:55:58

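Title: CodeBrain: Imputing Any Brain MRI via Modality- and Instance-Specific Codes

Abstract: Unified MRI imputation that adapts to diverse imputation scenarios is highly desirable, because it reduces scanning costs and provides comprehensive MRI information for improved clinical diagnosis. Existing unified MRI imputation methods either rely on specific prompts to guide their transformation network or require multiple modality-specific modules. However, these approaches struggle to capture large modality and instance variations, or become too complex to generalize effectively. To address these limitations, we propose CodeBrain, a fundamentally different pipeline for unified brain MRI imputation. Our key idea is to reframe the various inter-modality transformations as a full-modality code prediction task through a two-stage framework. In the first stage, CodeBrain reconstructs a target modality from any other modalities by learning a compact scalar-quantized code for each instance and modality. Any target modality can then be reconstructed with high fidelity by combining the corresponding code with shared features extracted from any available modality. In the second stage, a projection encoder is trained to predict full-modality compact codes from any incomplete MRI samples, effectively simulating various imputation scenarios. We evaluate CodeBrain on two public brain MRI datasets (IXI and BraTS 2023). Extensive experiments show that CodeBrain outperforms state-of-the-art methods, setting a new benchmark for unified brain MRI imputation. Our code will be released.

Updated: 2025-03-09 02:55:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2501.18328v2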

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) leverage both their rich parametric knowledge and the dynamic, external knowledge to excel in tasks such as Question Answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance poses a critical yet underexplored safety risk: knowledge poisoning attacks, where misinformation or irrelevant knowledge is intentionally injected into external knowledge bases to manipulate model outputs to be incorrect and even harmful. To expose such vulnerabilities in multimodal RAG, we propose MM-PoisonRAG, a novel knowledge poisoning attack framework with two attack strategies: Localized Poisoning Attack (LPA), which injects query-specific misinformation in both text and images for targeted manipulation, and Globalized Poisoning Attack (GPA) to provide false guidance during MLLM generation to elicit nonsensical responses across all queries. We evaluate our attacks across multiple tasks, models, and access settings, demonstrating that LPA successfully manipulates the MLLM to generate attacker-controlled answers, with a success rate of up to 56% on MultiModalQA. Moreover, GPA completely disrupts model generation to 0% accuracy with just a single irrelevant knowledge injection. Our results highlight the urgent need for robust defenses against knowledge poisoning to safeguard multimodal RAG frameworks.

Updated: 2025-03-09 02:52:43

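Title: MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Abstract: Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) draw on both their rich parametric knowledge and dynamic external knowledge to excel at tasks such as question answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance creates a critical yet underexplored safety risk: knowledge poisoning attacks, in which misinformation or irrelevant knowledge is deliberately injected into external knowledge bases to manipulate model outputs into being incorrect or even harmful. To expose such vulnerabilities in multimodal RAG, we propose MM-PoisonRAG, a novel knowledge poisoning attack framework with two attack strategies: the Localized Poisoning Attack (LPA), which injects query-specific misinformation in both text and images for targeted manipulation, and the Globalized Poisoning Attack (GPA), which provides false guidance during MLLM generation to elicit nonsensical responses across all queries. We evaluate our attacks across multiple tasks, models, and access settings, and show that LPA successfully manipulates the MLLM into generating attacker-controlled answers, with a success rate of up to 56% on MultiModalQA. Moreover, GPA completely disrupts model generation, driving accuracy to 0% with just a single irrelevant knowledge injection. Our results highlight the urgent need for robust defenses against knowledge poisoning to safeguard multimodal RAG frameworks.

Updated: 2025-03-09 02:52:43

Categories: cs.LG,cs.AI,cs.CR,cs.CV

Download: http://arxiv.org/abs/2502.17832v2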

Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide

Deep learning (DL) has demonstrated significant potential across various safety-critical applications, yet ensuring its robustness remains a key challenge. While adversarial robustness has been extensively studied in worst-case scenarios, probabilistic robustness (PR) offers a more practical perspective by quantifying the likelihood of failures under stochastic perturbations. This paper provides a concise yet comprehensive overview of PR, covering its formal definitions, evaluation and enhancement methods. We introduce a reformulated "min-max" optimisation framework for adversarial training specifically designed to improve PR. Furthermore, we explore the integration of PR verification evidence into system-level safety assurance, addressing challenges in translating DL model-level robustness to system-level claims. Finally, we highlight open research questions, including benchmarking PR evaluation methods, extending PR to generative AI tasks, and developing rigorous methodologies and case studies for system-level integration.

Updated: 2025-03-09 02:51:41

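Title: Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide

Abstract: Deep learning (DL) has shown significant potential across a range of safety-critical applications, yet ensuring its robustness remains a key challenge. While adversarial robustness has been studied extensively in worst-case scenarios, probabilistic robustness (PR) offers a more practical perspective by quantifying the likelihood of failure under stochastic perturbations. This paper gives a concise yet comprehensive overview of PR, covering its formal definitions and its evaluation and enhancement methods. We introduce a reformulated "min-max" optimisation framework for adversarial training, designed specifically to improve PR. We also explore how PR verification evidence can be integrated into system-level safety assurance, addressing the challenges of translating DL model-level robustness into system-level claims. Finally, we highlight open research questions, including benchmarking PR evaluation methods, extending PR to generative AI tasks, and developing rigorous methodologies and case studies for system-level integration.

Updated: 2025-03-09 02:51:41

Categories: cs.LG

Download: http://arxiv.org/abs/2502.14833v2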
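
The notion of probabilistic robustness in the entry above, the likelihood of failure under stochastic perturbations, can be illustrated with a plain Monte Carlo estimate. The sketch below is not taken from the paper: the one-dimensional `toy_classifier`, the uniform noise model, and the function name `estimate_failure_probability` are all assumptions made for illustration.

```python
import random


def toy_classifier(x):
    """Hypothetical 1-D classifier: predicts 1 when x >= 0, else 0."""
    return 1 if x >= 0.0 else 0


def estimate_failure_probability(classifier, x, radius, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P[classifier(x + delta) != classifier(x)],
    with delta drawn uniformly from [-radius, radius] (an assumed noise model)."""
    rng = random.Random(seed)
    reference = classifier(x)  # prediction on the unperturbed input
    failures = 0
    for _ in range(n_samples):
        delta = rng.uniform(-radius, radius)
        if classifier(x + delta) != reference:
            failures += 1
    return failures / n_samples


if __name__ == "__main__":
    p_fail = estimate_failure_probability(toy_classifier, x=0.5, radius=1.0)
    print(f"estimated failure probability: {p_fail:.3f}")
```

In this toy setup the prediction flips only when the perturbation is below -0.5, so the exact failure probability is 0.25 and the estimate should land close to it. A worst-case (adversarial) check would instead report the input as non-robust as soon as a single flipping perturbation exists.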

Heterogeneous bimodal attention fusion for speech emotion recognition

Multi-modal emotion recognition in conversations is a challenging problem due to the complex and complementary interactions between different modalities. Audio and textual cues are particularly important for understanding emotions from a human perspective. Most existing studies focus on exploring interactions between audio and text modalities at the same representation level. However, a critical issue is often overlooked: the heterogeneous modality gap between low-level audio representations and high-level text representations. To address this problem, we propose a novel framework called Heterogeneous Bimodal Attention Fusion (HBAF) for multi-level multi-modal interaction in conversational emotion recognition. The proposed method comprises three key modules: the uni-modal representation module, the multi-modal fusion module, and the inter-modal contrastive learning module. The uni-modal representation module incorporates contextual content into low-level audio representations to bridge the heterogeneous multi-modal gap, enabling more effective fusion. The multi-modal fusion module uses dynamic bimodal attention and a dynamic gating mechanism to filter incorrect cross-modal relationships and fully exploit both intra-modal and inter-modal interactions. Finally, the inter-modal contrastive learning module captures complex absolute and relative interactions between audio and text modalities. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed HBAF method outperforms existing state-of-the-art baselines.

Updated: 2025-03-09 02:50:49

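Title: Heterogeneous bimodal attention fusion for speech emotion recognition

Abstract: Multi-modal emotion recognition in conversations is a challenging problem because of the complex and complementary interactions between different modalities. Audio and textual cues are particularly important for understanding emotions from a human perspective. Most existing studies focus on exploring interactions between the audio and text modalities at the same representation level. However, a critical issue is often overlooked: the heterogeneous modality gap between low-level audio representations and high-level text representations. To address this problem, we propose a novel framework called Heterogeneous Bimodal Attention Fusion (HBAF) for multi-level multi-modal interaction in conversational emotion recognition. The proposed method comprises three key modules: the uni-modal representation module, the multi-modal fusion module, and the inter-modal contrastive learning module. The uni-modal representation module incorporates contextual content into low-level audio representations to bridge the heterogeneous multi-modal gap, enabling more effective fusion. The multi-modal fusion module uses dynamic bimodal attention and a dynamic gating mechanism to filter out incorrect cross-modal relationships and fully exploit both intra-modal and inter-modal interactions. Finally, the inter-modal contrastive learning module captures complex absolute and relative interactions between the audio and text modalities. Experiments on the MELD and IEMOCAP datasets show that the proposed HBAF method outperforms existing state-of-the-art baselines.

Updated: 2025-03-09 02:50:49

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2503.06405v1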

Differential Machine Learning for Time Series Prediction

Accurate time series prediction is challenging due to the inherent nonlinearity and sensitivity to initial conditions. We propose a novel approach that enhances neural network predictions through differential learning, which involves training models on both the original time series and its differential series. Specifically, we develop a differential long short-term memory (Diff-LSTM) network that uses a shared LSTM cell to simultaneously process both data streams, effectively capturing intrinsic patterns and temporal dynamics. Evaluated on the Mackey-Glass, Lorenz, and Rössler chaotic time series, as well as a real-world financial dataset from ACI Worldwide Inc., our results demonstrate that the Diff-LSTM network outperforms prevalent models such as recurrent neural networks, convolutional neural networks, and bidirectional and encoder-decoder LSTM networks in both short-term and long-term predictions. This framework offers a promising solution for enhancing time series prediction, even when comprehensive knowledge of the underlying dynamics of the time series is not fully available.

Updated: 2025-03-09 02:42:26

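Title: Differential Machine Learning for Time Series Prediction

Abstract: Accurate time series prediction is challenging because of inherent nonlinearity and sensitivity to initial conditions. We propose a novel approach that enhances neural network predictions through differential learning, in which models are trained on both the original time series and its differential series. Specifically, we develop a differential long short-term memory (Diff-LSTM) network that uses a shared LSTM cell to process both data streams simultaneously, effectively capturing intrinsic patterns and temporal dynamics. Evaluated on the Mackey-Glass, Lorenz, and Rössler chaotic time series, as well as a real-world financial dataset from ACI Worldwide Inc., our results show that the Diff-LSTM network outperforms prevalent models such as recurrent neural networks, convolutional neural networks, and bidirectional and encoder-decoder LSTM networks in both short-term and long-term prediction. The framework offers a promising way to enhance time series prediction even when comprehensive knowledge of the underlying dynamics of the time series is not fully available.

Updated: 2025-03-09 02:42:26

Categories: cs.LG

Download: http://arxiv.org/abs/2503.03302v2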
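
The entry above describes training on both the original time series and its differential series. As a minimal sketch of only the data-preparation side of that idea, the code below builds a first-order differential series and aligned input windows. The helper names `difference` and `paired_windows`, the first-order difference, and the window layout are assumptions for illustration; the Diff-LSTM network itself is not reproduced here.

```python
def difference(series):
    """First-order differential series: d[t] = x[t + 1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def paired_windows(series, window):
    """Build (original window, differential window, next value) samples.

    Each differential window holds the increments that end at the
    corresponding positions of the original window, so the two streams
    are aligned step by step.
    """
    diffs = difference(series)
    samples = []
    # Start at 1 so that every element of the window has a preceding increment.
    for start in range(1, len(series) - window):
        x_win = series[start:start + window]
        d_win = diffs[start - 1:start - 1 + window]
        target = series[start + window]
        samples.append((x_win, d_win, target))
    return samples


if __name__ == "__main__":
    demo = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
    for x_win, d_win, target in paired_windows(demo, window=3):
        print(x_win, d_win, target)
```

A model that consumes both streams would receive `x_win` and `d_win` for the same time steps and be trained to predict `target`.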

Causality Enhanced Origin-Destination Flow Prediction in Data-Scarce Cities

Accurate origin-destination (OD) flow prediction is of great importance to developing cities, as it can help optimize urban structures and layouts. However, with the common issues of missing regional features and scarce OD flow data, predicting OD flow in developing cities is daunting. To address this challenge, we propose Causality-Enhanced OD Flow Prediction (CE-OFP), a novel unified framework that aims to transfer urban knowledge between cities and improve the accuracy of OD flow predictions across data-scarce cities. Specifically, we propose a novel reinforcement learning model to discover universal causalities among urban features in data-rich cities and build corresponding causal graphs. Then, we build a Causality-Enhanced Variational Auto-Encoder (CE-VAE) that incorporates the causal graphs for effective feature reconstruction in data-scarce cities. Finally, with the reconstructed features, we devise a knowledge distillation method with a graph attention network to migrate the OD prediction model from data-rich cities to data-scarce cities. Extensive experiments on two pairs of real-world datasets validate that the proposed CE-OFP remarkably outperforms state-of-the-art baselines, reducing the RMSE of OD flow prediction for data-scarce cities by up to 11%.

Updated: 2025-03-09 02:36:36

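Title: Causality Enhanced Origin-Destination Flow Prediction in Data-Scarce Cities

Abstract: Accurate origin-destination (OD) flow prediction is very important to developing cities because it can help optimize urban structures and layouts. However, given the common problems of missing regional features and scarce OD flow data, predicting OD flow in developing cities is quite difficult. To address this challenge, we propose Causality-Enhanced OD Flow Prediction (CE-OFP), a novel unified framework that aims to transfer urban knowledge between cities and improve the accuracy of OD flow prediction in data-scarce cities. Specifically, we propose a novel reinforcement learning model to discover universal causal relationships among urban features in data-rich cities and to build the corresponding causal graphs. We then build a Causality-Enhanced Variational Auto-Encoder (CE-VAE) that incorporates the causal graphs for effective feature reconstruction in data-scarce cities. Finally, using the reconstructed features, we devise a knowledge distillation method with a graph attention network to migrate the OD prediction model from data-rich cities to data-scarce cities. Extensive experiments on two pairs of real-world datasets show that the proposed CE-OFP clearly outperforms state-of-the-art baselines, reducing the RMSE of OD flow prediction for data-scarce cities by up to 11%.

Updated: 2025-03-09 02:36:36

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06398v1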

Optimizing Minimum Vertex Cover Solving via a GCN-assisted Heuristic Algorithm

The problem of finding a minimum vertex cover (MVC) in a graph is a well-known NP-hard problem with significant practical applications in optimization and scheduling. Its complexity, combined with the increasing scale of problems, underscores the need for efficient and effective algorithms. However, existing heuristic algorithms for MVC often rely on simplistic initialization strategies and overlook the impact of edge attributes and neighborhood information on vertex selection. In this paper, we introduce GCNIVC, a novel heuristic search algorithm designed to address the limitations of existing methods for solving MVC problems in large-scale graphs. Our approach features two main innovations. First, it utilizes a Graph Convolutional Network (GCN) to capture the global structure of graphs, which enables the generation of high-quality initial solutions that enhance the efficiency of the subsequent search process. Second, GCNIVC introduces a new heuristic that employs three containers and the concept of double-covered edges (dc-edges), improving search efficiency and providing greater flexibility for adding and removing operations based on edge attributes. Through extensive experiments on benchmark datasets, we demonstrate that GCNIVC outperforms state-of-the-art MVC algorithms in terms of both accuracy and efficiency. Our results highlight the effectiveness of GCNIVC's GCN-assisted initialization and its edge-informed search strategy. This study not only advances the understanding of MVC problem-solving but also contributes a new tool for addressing large-scale graph optimization challenges.

Updated: 2025-03-09 02:31:03

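Title: Optimizing Minimum Vertex Cover Solving via a GCN-assisted Heuristic Algorithm

Abstract: Finding a minimum vertex cover (MVC) in a graph is a well-known NP-hard problem with important practical applications in optimization and scheduling. Its complexity, combined with the growing scale of problems, underscores the need for efficient and effective algorithms. However, existing heuristic algorithms for MVC often rely on simplistic initialization strategies and overlook the effect of edge attributes and neighborhood information on vertex selection. In this paper, we introduce GCNIVC, a novel heuristic search algorithm designed to address the limitations of existing methods for solving MVC problems on large-scale graphs. Our approach has two main innovations. First, it uses a Graph Convolutional Network (GCN) to capture the global structure of graphs, which produces high-quality initial solutions that improve the efficiency of the subsequent search. Second, GCNIVC introduces a new heuristic that uses three containers and the concept of double-covered edges (dc-edges), improving search efficiency and giving greater flexibility for add and remove operations based on edge attributes. Through extensive experiments on benchmark datasets, we show that GCNIVC outperforms state-of-the-art MVC algorithms in both accuracy and efficiency. Our results highlight the effectiveness of GCNIVC's GCN-assisted initialization and its edge-informed search strategy. This study not only advances the understanding of MVC problem solving but also contributes a new tool for tackling large-scale graph optimization challenges.

Updated: 2025-03-09 02:31:03

Categories: cs.AI

Download: http://arxiv.org/abs/2503.06396v1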
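
For readers unfamiliar with the problem in the entry above, the following sketch shows the kind of simple degree-based greedy construction that vertex cover heuristics commonly start from, together with a validity check. It is not GCNIVC: there is no GCN, no three-container search, and no dc-edge bookkeeping. The function names and the tie-breaking rule are assumptions, and vertices are assumed to be sortable (for example integers).

```python
def greedy_vertex_cover(edges):
    """Repeatedly pick the vertex that covers the most still-uncovered edges.

    This is a plain greedy baseline, not an optimal algorithm: the returned
    set is always a valid cover but may be larger than a minimum cover.
    """
    uncovered = {frozenset(edge) for edge in edges}
    cover = set()
    while uncovered:
        # Count how many uncovered edges touch each vertex.
        degree = {}
        for edge in uncovered:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        # Ties are broken toward the smallest vertex to keep the result deterministic.
        best = max(sorted(degree), key=degree.get)
        cover.add(best)
        uncovered = {edge for edge in uncovered if best not in edge}
    return cover


def is_vertex_cover(edges, cover):
    """Check that every edge has at least one endpoint in the cover."""
    return all(u in cover or v in cover for u, v in edges)


if __name__ == "__main__":
    path = [(1, 2), (2, 3), (3, 4)]
    cover = greedy_vertex_cover(path)
    print(sorted(cover), is_vertex_cover(path, cover))
```

A local-search method would typically take such an initial cover and then try to remove or swap vertices while keeping `is_vertex_cover` true.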

How to Strategize Human Content Creation in the Era of GenAI?

Generative AI (GenAI) will have a significant impact on content creation platforms. In this paper, we study the dynamic competition between a GenAI and a human contributor. Unlike the human, the GenAI's content only improves when more content is created by the human over time; however, GenAI has the advantage of generating content at a lower cost. We study the algorithmic problem in this dynamic competition model of how the human contributor can maximize her utility when competing against the GenAI for content generation over a set of topics. In time-sensitive content domains (e.g., news or pop music creation) where the value of content diminishes over time, we show that there is no polynomial time algorithm for finding the human's optimal (dynamic) strategy, unless the randomized exponential time hypothesis is false. Fortunately, we are able to design a polynomial time algorithm that naturally cycles between myopically optimizing over a short time window and pausing, and that provably guarantees an approximation ratio of $\frac{1}{2}$. We then turn to time-insensitive content domains where content does not lose its value (e.g., content on historical facts). Interestingly, we show that this setting permits a polynomial time algorithm that maximizes the human's utility in the long run. Finally, we conduct simulations that demonstrate the advantage of our algorithms in comparison to a collection of baselines.

Updated: 2025-03-09 02:23:50

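Title: How to Strategize Human Content Creation in the Era of GenAI?

Abstract: Generative AI (GenAI) will have a significant impact on content creation platforms. In this paper, we study the dynamic competition between a GenAI and a human contributor. Unlike the human, the GenAI's content improves only when the human creates more content over time; however, GenAI has the advantage of generating content at lower cost. We study the algorithmic problem in this dynamic competition model: how the human contributor can maximize her utility when competing against the GenAI for content generation over a set of topics. In time-sensitive content domains (for example, news or pop music creation), where the value of content declines over time, we show that no polynomial time algorithm can find the human's optimal (dynamic) strategy unless the randomized exponential time hypothesis is false. Fortunately, we are able to design a polynomial time algorithm that naturally cycles between myopic optimization over a short time window and pausing, and that provably guarantees an approximation ratio of 1/2. We then turn to time-insensitive content domains, where content does not lose its value (for example, content on historical facts). Interestingly, we show that this setting admits a polynomial time algorithm that maximizes the human's utility in the long run. Finally, we run simulations that demonstrate the advantage of our algorithms over a collection of baselines.

Updated: 2025-03-09 02:23:50

Categories: cs.GT,cs.AI,cs.HC,cs.LG

Download: http://arxiv.org/abs/2406.05187v2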

Learning Mamba as a Continual Learner: Meta-learning Selective State Space Models for Efficient Continual Learning

Continual learning (CL) aims to efficiently learn from a non-stationary data stream, without storing or recomputing all seen samples. CL enables prediction on new tasks by incorporating sequential training samples. Building on this connection between CL and sequential modeling, meta-continual learning (MCL) aims to meta-learn an efficient continual learner as a sequence prediction model, with advanced sequence models like Transformers being natural choices. However, despite decent performance, Transformers rely on a linearly growing cache to store all past representations, conflicting with CL's objective of not storing all seen samples and limiting efficiency. In this paper, we focus on meta-learning sequence-prediction-based continual learners without retaining all past representations. While attention-free models with fixed-size hidden states (e.g., Linear Transformers) align with CL's essential goal and efficiency needs, they have shown limited effectiveness in MCL in previous literature. Given Mamba's strong sequence modeling performance and attention-free nature, we explore a key question: Can attention-free models like Mamba perform well on MCL? By formulating Mamba and the SSM for MCL tasks, we propose MambaCL, a meta-learned continual learner. To enhance MambaCL's training, we introduce selectivity regularization, leveraging the connection between Mamba and Transformers to guide its behavior over sequences. Furthermore, we study how Mamba and other models perform across various MCL scenarios through extensive and well-designed experiments. Our results highlight the promising performance and strong generalization of Mamba and attention-free models in MCL, demonstrating its potential for efficient continual learning and adaptation.

Updated: 2025-03-09 02:19:22

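Title: Learning Mamba as a Continual Learner: Meta-learning Selective State Space Models for Efficient Continual Learning

Abstract: Continual learning (CL) aims to learn efficiently from a non-stationary data stream without storing or recomputing all seen samples. CL enables prediction on new tasks by incorporating sequential training samples. Building on this connection between CL and sequence modeling, meta-continual learning (MCL) aims to meta-learn an efficient continual learner as a sequence prediction model, for which advanced sequence models such as Transformers are natural choices. However, despite decent performance, Transformers rely on a linearly growing cache to store all past representations, which conflicts with CL's objective of not storing all seen samples and limits efficiency. In this paper, we focus on meta-learning sequence-prediction-based continual learners without retaining all past representations. While attention-free models with fixed-size hidden states (e.g., Linear Transformers) match CL's essential goal and efficiency needs, they have shown limited effectiveness in MCL in previous literature. Given Mamba's strong sequence modeling performance and attention-free nature, we explore a key question: can attention-free models like Mamba perform well on MCL? By formulating Mamba and the SSM for MCL tasks, we propose MambaCL, a meta-learned continual learner. To improve MambaCL's training, we introduce selectivity regularization, which uses the connection between Mamba and Transformers to guide its behavior over sequences. We further study how Mamba and other models perform across various MCL scenarios through extensive and well-designed experiments. Our results highlight the promising performance and strong generalization of Mamba and attention-free models in MCL, demonstrating their potential for efficient continual learning and adaptation.

Updated: 2025-03-09 02:19:22

Categories: cs.LG

Download: http://arxiv.org/abs/2412.00776v3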

Quantum Langevin Dynamics for Optimization

We initiate the study of utilizing Quantum Langevin Dynamics (QLD) to solve optimization problems, particularly those with non-convex objective functions that present substantial obstacles for traditional gradient descent algorithms. Specifically, we examine the dynamics of a system coupled with an infinite heat bath. This interaction induces both random quantum noise and a deterministic damping effect on the system, which nudge the system towards a steady state that hovers near the global minimum of the objective function. We theoretically prove the convergence of QLD in convex landscapes, demonstrating that the average energy of the system can approach zero in the low temperature limit with an exponential decay rate correlated with the evolution time. Numerically, we first show the energy dissipation capability of QLD by retracing its origins to spontaneous emission. Furthermore, we provide a detailed discussion of the impact of each parameter. Finally, based on observations from comparing QLD with the classical Fokker-Planck-Smoluchowski equation, we propose a time-dependent QLD by making temperature and $\hbar$ time-dependent parameters, which can be theoretically proven to converge better than the time-independent case and which also outperforms a series of state-of-the-art quantum and classical optimization algorithms in many non-convex landscapes.

Updated: 2025-03-09 02:15:43

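Title: Quantum Langevin Dynamics for Optimization

Abstract: We initiate the study of using Quantum Langevin Dynamics (QLD) to solve optimization problems, particularly those with non-convex objective functions that pose substantial obstacles for traditional gradient descent algorithms. Specifically, we examine the dynamics of a system coupled to an infinite heat bath. This interaction induces both random quantum noise and a deterministic damping effect on the system, which push the system towards a steady state close to the global minimum. We theoretically prove the convergence of QLD in convex landscapes, showing that the average energy of the system can approach zero in the low temperature limit, with an exponential decay rate correlated with the evolution time. Numerically, we first demonstrate the energy dissipation capability of QLD by tracing its origins back to spontaneous emission. We also discuss the impact of each parameter in detail. Finally, based on observations from comparing QLD with the classical Fokker-Planck-Smoluchowski equation, we propose a time-dependent QLD in which temperature and $\hbar$ are time-dependent parameters; it can be theoretically proven to converge better than the time-independent case, and it outperforms a series of state-of-the-art quantum and classical optimization algorithms in many non-convex landscapes.

Updated: 2025-03-09 02:15:43

Categories: quant-ph,cs.DS,cs.LG,math.OC

Download: http://arxiv.org/abs/2311.15587v3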
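
The entry above compares QLD with classical Langevin-type dynamics and proposes making the temperature time-dependent. As a point of reference only, the sketch below simulates classical overdamped Langevin dynamics with a decaying temperature on a one-dimensional non-convex function. It is a classical analogue, not the quantum method: the test function, the 1/(1 + decay * k) cooling schedule, the step size, and all names are assumptions chosen for illustration.

```python
import math
import random


def objective(x):
    """Assumed non-convex test function with two minima (global one near x = -1.30)."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    """Derivative of the test function above."""
    return 4 * x**3 - 6 * x + 1


def annealed_langevin(x0, steps=5000, step=0.01, t0=1.0, decay=0.01, seed=0):
    """Euler-Maruyama discretization of overdamped Langevin dynamics.

    Update: x <- x - step * grad(x) + sqrt(2 * step * T_k) * N(0, 1),
    where T_k = t0 / (1 + decay * k) is a decaying temperature.
    Returns the final position and the best position visited.
    """
    rng = random.Random(seed)
    x = best = x0
    for k in range(steps):
        temperature = t0 / (1.0 + decay * k)
        noise = math.sqrt(2.0 * step * temperature) * rng.gauss(0.0, 1.0)
        x = x - step * gradient(x) + noise
        if objective(x) < objective(best):
            best = x
    return x, best


if __name__ == "__main__":
    final_x, best_x = annealed_langevin(x0=1.13)  # start near the local minimum
    print(f"final x = {final_x:.3f}, best f = {objective(best_x):.3f}")
```

With `t0=0.0` the noise term vanishes and the update reduces to plain gradient descent, which stays in whichever basin it starts in; the thermal noise is what gives the dynamics a chance to cross the barrier before the temperature becomes small.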

Causal Discovery and Inference towards Urban Elements and Associated Factors

To uncover the city's fundamental functioning mechanisms, it is important to acquire a deep understanding of complicated relationships among citizens, location, and mobility behaviors. Previous research studies have applied direct correlation analysis to investigate such relationships. Nevertheless, due to the ubiquitous confounding effects, empirical correlation analysis may not accurately reflect underlying causal relationships among basic urban elements. In this paper, we propose a novel urban causal computing framework to comprehensively explore causalities and confounding effects among a variety of factors across different types of urban elements. In particular, we design a reinforcement learning algorithm to discover the potential causal graph, which depicts the causal relations between urban factors. The causal graph further serves as the guidance for estimating causal effects between pair-wise urban factors by propensity score matching. After removing the confounding effects from correlations, we leverage significance levels of causal effects in downstream urban mobility prediction tasks. Experimental studies on open-source urban datasets show that the discovered causal graph demonstrates a hierarchical structure, where citizens affect locations, and they both cause changes in urban mobility behaviors. Experimental results in urban mobility prediction tasks further show that the proposed method can effectively reduce confounding effects and enhance performance of urban computing tasks.

Updated: 2025-03-09 02:15:04

Categories: cs.AI

Download: http://arxiv.org/abs/2503.06395v1

How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders

Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.

Updated: 2025-03-09 02:13:44

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.06394v1

EPR-GAIL: An EPR-Enhanced Hierarchical Imitation Learning Framework to Simulate Complex User Consumption Behaviors

User consumption behavior data, which records individuals' online spending history at various types of stores, has been widely used in various applications, such as store recommendation, site selection, and sales forecasting. However, its high worth is limited due to deficiencies in data comprehensiveness and changes of application scenarios. Thus, generating high-quality sequential consumption data by simulating complex user consumption behaviors is of great importance to real-world applications. Two branches of existing sequence generation methods are both limited in quality. Model-based methods with simplified assumptions fail to model the complex decision process of user consumption, while data-driven methods that emulate real-world data are prone to noise, unobserved behaviors, and dynamic decision space. In this work, we propose to enhance the fidelity and trustworthiness of the data-driven Generative Adversarial Imitation Learning (GAIL) method by blending it with the Exploration and Preferential Return (EPR) model. The core idea of our EPR-GAIL framework is to model user consumption behaviors as a complex EPR decision process, which consists of purchase, exploration, and preference decisions. Specifically, we design the hierarchical policy function in the generator as a realization of the EPR decision process and employ the probability distributions of the EPR model to guide the reward function in the discriminator. Extensive experiments on two real-world datasets of user consumption behaviors on an online platform demonstrate that the EPR-GAIL framework outperforms the best state-of-the-art baseline by over 19% in terms of data fidelity. Furthermore, the generated consumption behavior data can improve the performance of sales prediction and location recommendation by up to 35.29% and 11.19%, respectively, validating its advantage for practical applications.

Updated: 2025-03-09 01:56:42

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06392v1
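
For readers unfamiliar with the EPR decision process named in the EPR-GAIL entry above, here is a minimal sketch of the classic exploration and preferential return rule from the human-mobility literature. It is not the authors' hierarchical policy. It assumes the usual form in which a user who has visited S distinct stores explores a new one with probability rho * S^(-gamma), and otherwise returns to a known store with probability proportional to past visits. The function name `simulate_epr`, the integer store indexing, and the default values of `rho` and `gamma` are illustrative assumptions.

```python
import random
from collections import Counter


def simulate_epr(n_steps, n_stores, rho=0.6, gamma=0.21, seed=0):
    """Simulate a sequence of store visits under a basic EPR rule.

    All parameter names and defaults are hypothetical; they are not taken
    from the EPR-GAIL paper.
    """
    if n_stores < 1:
        raise ValueError("n_stores must be at least 1")
    rng = random.Random(seed)
    visits = Counter()  # store id -> number of past visits
    sequence = []
    for _ in range(n_steps):
        s = len(visits)  # number of distinct stores visited so far
        unvisited = [x for x in range(n_stores) if x not in visits]
        # Exploration probability decays as more distinct stores are known.
        p_explore = 1.0 if s == 0 else min(1.0, rho * s ** -gamma)
        if unvisited and rng.random() < p_explore:
            # Exploration: pick a store that has never been visited.
            store = rng.choice(unvisited)
        else:
            # Preferential return: favor frequently visited stores.
            known = list(visits)
            weights = [visits[x] for x in known]
            store = rng.choices(known, weights=weights, k=1)[0]
        visits[store] += 1
        sequence.append(store)
    return sequence


if __name__ == "__main__":
    print(simulate_epr(n_steps=20, n_stores=10))
```

Fixing the seed makes the sequence reproducible, which helps when comparing generated sequences against real ones.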

R+R: Security Vulnerability Dataset Quality Is Critical

Large Language Models (LLMs) are of great interest in vulnerability detection and repair. The effectiveness of these models hinges on the quality of the datasets used for both training and evaluation. Our investigation reveals that a number of studies featured in prominent software engineering conferences have employed datasets that are plagued by high duplication rates, questionable label accuracy, and incomplete samples. Using these datasets for experimentation will yield incorrect results that are significantly different from actual expected behavior. For example, the state-of-the-art VulRepair Model, which is reported to have 44% accuracy, on average yielded 9% accuracy when test-set duplicates were removed from its training set and 13% accuracy when training-set duplicates were removed from its test set. In an effort to tackle these data quality concerns, we have retrained models from several papers without duplicates and conducted an accuracy assessment of labels for the top ten most hazardous Common Weakness Enumerations (CWEs). Our findings indicate that 56% of the samples had incorrect labels and 44% comprised incomplete samples--only 31% were both accurate and complete. Finally, we employ transfer learning using a large deduplicated bugfix corpus to show that these models can exhibit better performance if given larger amounts of high-quality pre-training data, leading us to conclude that while previous studies have over-estimated performance due to poor dataset quality, this does not demonstrate that better performance is not possible.

Updated: 2025-03-09 01:49:30

Categories: cs.SE,cs.CR

Download: http://arxiv.org/abs/2503.06387v1
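
The train/test leakage described in the entry above can be illustrated with a small deduplication sketch. This is an assumption about one simple way to do it (exact matching after whitespace normalization), not the procedure used in the paper; the helper names `normalize`, `fingerprint`, and `remove_test_duplicates` are hypothetical.

```python
import hashlib
import re


def normalize(code):
    """Collapse whitespace so formatting-only differences compare equal."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code):
    """Stable hash of the normalized sample."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def remove_test_duplicates(train, test):
    """Return training samples that do not also occur in the test set.

    Repeats inside the training set are dropped as well, keeping the
    first occurrence.
    """
    test_hashes = {fingerprint(sample) for sample in test}
    seen = set()
    kept = []
    for sample in train:
        h = fingerprint(sample)
        if h in test_hashes or h in seen:
            continue
        seen.add(h)
        kept.append(sample)
    return kept


if __name__ == "__main__":
    train = ["int a = 1;", "int  a =\n1;", "free(p);", "strcpy(d, s);"]
    test = ["free(p);"]
    print(remove_test_duplicates(train, test))
```

Exact hashing only catches identical samples after normalization; near-duplicates (renamed variables, reordered statements) would need a fuzzier comparison.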

A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization

To adapt to real-world data streams, continual learning (CL) systems must rapidly learn new concepts while preserving and utilizing prior knowledge. When it comes to adding new information to continually-trained deep neural networks (DNNs), classifier weights for newly encountered categories are typically initialized randomly, leading to high initial training loss (spikes) and instability. Consequently, achieving optimal convergence and accuracy requires prolonged training, increasing computational costs. Inspired by Neural Collapse (NC), we propose a weight initialization strategy to improve learning efficiency in CL. In DNNs trained with mean-squared-error, NC gives rise to a Least-Square (LS) classifier in the last layer, whose weights can be analytically derived from learned features. We leverage this LS formulation to initialize classifier weights in a data-driven manner, aligning them with the feature distribution rather than using random initialization. Our method mitigates initial loss spikes and accelerates adaptation to new tasks. We evaluate our approach in large-scale CL settings, demonstrating faster adaptation and improved CL performance.

Updated: 2025-03-09 01:44:22

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2503.06385v1
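
The least-squares idea in the entry above can be sketched as follows. This is a generic ridge-regularized least-squares fit of one-hot targets on features, W = (X^T X + reg * I)^(-1) X^T Y, which is an assumption about the general form rather than the paper's exact initialization. It is written without dependencies for illustration; in practice a numerical library would be used. The names `solve_linear`, `ls_classifier_init`, and `reg` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    `a` is an n x n matrix and `b` an n x k matrix, both as lists of lists.
    """
    n = len(a)
    k = len(b[0])
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + k):
                m[r][c] -= factor * m[col][c]
    x = [[0.0] * k for _ in range(n)]
    for r in range(n - 1, -1, -1):  # back substitution
        for j in range(k):
            acc = m[r][n + j] - sum(m[r][c] * x[c][j] for c in range(r + 1, n))
            x[r][j] = acc / m[r][r]
    return x


def ls_classifier_init(features, labels, num_classes, reg=1e-3):
    """Return (weights, bias) of a least-squares classifier on the features."""
    rows = [list(f) + [1.0] for f in features]  # append a constant bias input
    d = len(rows[0])
    # Regularized Gram matrix X^T X + reg * I.
    gram = [[sum(r[i] * r[j] for r in rows) + (reg if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    # X^T Y with one-hot Y: sum each input over the samples of each class.
    rhs = [[sum(r[i] for r, y in zip(rows, labels) if y == c)
            for c in range(num_classes)] for i in range(d)]
    sol = solve_linear(gram, rhs)  # shape: d x num_classes
    weights = [[sol[i][c] for i in range(d - 1)] for c in range(num_classes)]
    bias = [sol[d - 1][c] for c in range(num_classes)]
    return weights, bias


if __name__ == "__main__":
    feats = [(0.0, 0.0), (0.2, 0.1), (5.0, 0.0), (5.1, 0.2), (0.0, 5.0), (0.2, 5.1)]
    labels = [0, 0, 1, 1, 2, 2]
    w, b = ls_classifier_init(feats, labels, num_classes=3)
    print(w, b)
```

The returned `weights` and `bias` would be copied into the rows of the classifier head for the new classes before gradient training resumes.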

Bayesian Optimization for Robust Identification of Ornstein-Uhlenbeck Model

This paper deals with the identification of the stochastic Ornstein-Uhlenbeck (OU) process error model, which is characterized by an inverse time constant, and the unknown variances of the process and observation noises. Although the availability of the explicit expression of the log-likelihood function allows one to obtain the maximum likelihood estimator (MLE), this entails evaluating the nontrivial gradient and also often struggles with local optima. To address these limitations, we put forth a sample-efficient global optimization approach based on the Bayesian optimization (BO) framework, which relies on a Gaussian process (GP) surrogate model for the objective function that effectively balances exploration and exploitation to select the query points. Specifically, each evaluation of the objective is implemented efficiently through the Kalman filter (KF) recursion. Comprehensive experiments on various parameter settings and sampling intervals corroborate that the BO-based estimator consistently outperforms MLE implemented by the steady-state KF approximation and the expectation-maximization algorithm (whose derivation is a side contribution) in terms of root mean-square error (RMSE) and statistical consistency, confirming the effectiveness and robustness of BO for identification of the stochastic OU process. Notably, the RMSE values produced by the BO-based estimator are smaller than the classical Cramér-Rao lower bound, especially for the inverse time constant, estimating which has been a long-standing challenge. This seemingly counterintuitive result can be explained by the data-driven prior for the learning parameters indirectly injected by BO through the GP prior over the objective function.

Updated: 2025-03-09 01:38:21

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2503.06381v1
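
The entry above evaluates its objective through a Kalman filter recursion. A minimal sketch of such an evaluation is given below under stated assumptions: a scalar OU state dx = -theta * x dt + sqrt(q) dW observed at a fixed interval `dt` with additive Gaussian noise of variance `r`, and a stationary prior on the first state. The parameterization and the function name `ou_log_likelihood` are assumptions for illustration and may differ from the paper's model.

```python
import math


def ou_log_likelihood(y, dt, theta, q, r):
    """Log-likelihood of noisy OU observations via a scalar Kalman filter.

    theta: inverse time constant (> 0); q: process noise intensity (> 0);
    r: observation noise variance (>= 0). All names are hypothetical.
    """
    a = math.exp(-theta * dt)                 # exact one-step transition
    q_d = q * (1.0 - a * a) / (2.0 * theta)   # discrete process noise variance
    mean, var = 0.0, q / (2.0 * theta)        # stationary prior on the state
    loglik = 0.0
    for obs in y:
        # Innovation and its variance under the current prediction.
        innov = obs - mean
        s = var + r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innov * innov / s)
        # Measurement update.
        gain = var / s
        mean += gain * innov
        var *= 1.0 - gain
        # Time update to the next sampling instant.
        mean = a * mean
        var = a * a * var + q_d
    return loglik


if __name__ == "__main__":
    print(ou_log_likelihood([0.3, 0.1, -0.2, 0.05], dt=0.1, theta=1.0, q=2.0, r=0.01))
```

A global optimizer such as BO would treat the negative of this value as a black-box function of `(theta, q, r)`, so no gradient of the likelihood is needed.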

Tensor-Fused Multi-View Graph Contrastive Learning

Graph contrastive learning (GCL) has emerged as a promising approach to enhance graph neural networks' (GNNs) ability to learn rich representations from unlabeled graph-structured data. However, current GCL models face challenges with computational demands and limited feature utilization, often relying only on basic graph properties like node degrees and edge attributes. This constrains their capacity to fully capture the complex topological characteristics of real-world phenomena represented by graphs. To address these limitations, we propose Tensor-Fused Multi-View Graph Contrastive Learning (TensorMV-GCL), a novel framework that integrates extended persistent homology (EPH) with GCL representations and facilitates multi-scale feature extraction. Our approach uniquely employs tensor aggregation and compression to fuse information from graph and topological features obtained from multiple augmented views of the same graph. By incorporating tensor concatenation and contraction modules, we reduce computational overhead by separating feature tensor aggregation and transformation. Furthermore, we enhance the quality of learned topological features and model robustness through noise-injected EPH. Experiments on molecular, bioinformatic, and social network datasets demonstrate TensorMV-GCL's superiority, outperforming 15 state-of-the-art methods in graph classification tasks across 9 out of 11 benchmarks while achieving comparable results on the remaining two. The code for this paper is publicly available at https://github.com/CS-SAIL/Tensor-MV-GCL.git.

Updated: 2025-03-09 01:31:59

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2410.15247v2

General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead.

Updated: 2025-03-09 01:13:56

Categories: cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2503.06378v1

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results indicate that even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%. Our findings point to the need for methods that can enhance LMMs' capabilities to interpret and benefit from feedback.

Updated: 2025-03-09 01:07:59

Categories: cs.CL,cs.AI,cs.CV,cs.HC

Download: http://arxiv.org/abs/2502.15027v2

VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings

Texture recognition has recently been dominated by ImageNet-pre-trained deep Convolutional Neural Networks (CNNs), with specialized modifications and feature engineering required to achieve state-of-the-art (SOTA) performance. However, although Vision Transformers (ViTs) were introduced a few years ago, little is known about their texture recognition ability. Therefore, in this work, we introduce VORTEX (ViTs with Orderless and Randomized Token Encodings for Texture Recognition), a novel method that enables the effective use of ViTs for texture analysis. VORTEX extracts multi-depth token embeddings from pre-trained ViT backbones and employs a lightweight module to aggregate hierarchical features and perform orderless encoding, obtaining a better image representation for texture recognition tasks. This approach allows seamless integration with any ViT with the common transformer architecture. Moreover, no fine-tuning of the backbones is performed, since they are used only as frozen feature extractors, and the features are fed to a linear SVM. We evaluate VORTEX on nine diverse texture datasets, demonstrating its ability to achieve or surpass SOTA performance in a variety of texture analysis scenarios. By bridging the gap between texture recognition with CNNs and transformer-based architectures, VORTEX paves the way for adopting emerging transformer foundation models. Furthermore, VORTEX demonstrates robust computational efficiency when coupled with ViT backbones compared to CNNs with similar costs. The method implementation and experimental scripts are publicly available in our online repository.

Updated: 2025-03-09 00:36:02

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.06368v1

Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics

With recent dramatic increases in AI system capabilities, there has been growing interest in utilizing machine learning for reasoning-heavy, quantitative tasks, particularly mathematics. While there are many resources capturing mathematics at the high-school, undergraduate, and graduate level, there are far fewer resources available that align with the level of difficulty and open-endedness encountered by professional mathematicians working on open problems. To address this, we introduce a new collection of datasets, the Algebraic Combinatorics Dataset Repository (ACD Repo), representing either foundational results or open problems in algebraic combinatorics, a subfield of mathematics that studies discrete structures arising from abstract algebra. Further differentiating our dataset collection is the fact that it aims at the conjecturing process. Each dataset includes an open-ended research-level question and a large collection of examples (up to 10M in some cases) from which conjectures should be generated. We describe all nine datasets, the different ways machine learning models can be applied to them (e.g., training with narrow models followed by interpretability analysis or program synthesis with LLMs), and discuss some of the challenges involved in designing datasets like these.

Updated: 2025-03-09 00:11:40

Categories: cs.LG,cs.AI,math.CO,math.RT

Download: http://arxiv.org/abs/2503.06366v1