    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        

Articles: 0

Last Updated: 2024-06-17 23:57:57 (+00:00)

Generative Escher Meshes

This paper proposes a fully-automatic, text-guided generative method for producing perfectly-repeating, periodic, tile-able 2D imagery, such as that seen on floors, mosaics, ceramics, and in the work of M.C. Escher. In contrast to square texture images that are seamless when tiled, our method generates non-square tilings which consist solely of repeating copies of the same object. It achieves this by optimizing both geometry and texture of a 2D mesh, yielding a non-square tile in the shape and appearance of the desired object, with close to no additional background details, that can tile the plane without gaps or overlaps. We enable optimization of the tile's shape by an unconstrained, differentiable parameterization of the space of all valid tileable meshes for given boundary conditions stemming from a symmetry group. Namely, we construct a differentiable family of linear systems derived from a 2D mesh-mapping technique - Orbifold Tutte Embedding - by considering the mesh's Laplacian matrix as differentiable parameters. We prove that the solution space of these linear systems is exactly all possible valid tiling configurations, thereby providing an end-to-end differentiable representation for the entire space of valid tiles. We render the textured mesh via a differentiable renderer, and leverage a pre-trained image diffusion model to induce a loss on the resulting image, updating the mesh's parameters so as to make its appearance match the text prompt. We show our method is able to produce plausible, appealing results, with non-trivial tiles, for a variety of different periodic tiling patterns.
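
A minimal sketch of the core mechanism, assuming a toy dense mesh: the edge weights of the mesh Laplacian are the learnable parameters, vertex positions come from a differentiable linear solve, and gradients flow back through the solve. The orbifold boundary conditions, the differentiable renderer, and the diffusion loss are all omitted; a quadratic toy objective stands in for them.

import torch

# Toy mesh: n vertices and a few edges; log edge weights are the parameters.
n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
log_w = torch.zeros(len(edges), requires_grad=True)

def laplacian(log_w):
    W = torch.zeros(n, n)
    w = torch.exp(log_w)                        # positive weights via exp
    for k, (i, j) in enumerate(edges):
        W[i, j] = w[k]
        W[j, i] = w[k]
    return torch.diag(W.sum(dim=1)) - W

b = torch.randn(n, 2)        # stand-in for the fixed right-hand side
target = torch.randn(n, 2)   # stand-in for an appearance-driven objective

opt = torch.optim.Adam([log_w], lr=0.1)
for step in range(200):
    L = laplacian(log_w) + 1e-3 * torch.eye(n)  # the plain Laplacian is singular
    X = torch.linalg.solve(L, b)                # vertex positions, differentiable
    loss = ((X - target) ** 2).mean()           # placeholder for the image loss
    opt.zero_grad(); loss.backward(); opt.step()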

Updated: 2024-06-17 23:57:57

Domains: cs.CV,cs.AI,cs.CG,cs.GR

Download: http://arxiv.org/abs/2309.14564v4

Language Models as Zero-Shot Trajectory Generators

Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation tasks, when given access to only object detection and segmentation vision models. We designed a single, task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers. Then we studied how well it can perform across 30 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigated which design choices in this prompt are the most important. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, prompts, and code are available at: https://www.robot-learning.uk/language-models-trajectory-generators.
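
A hedged sketch of what such a pipeline can look like: a single task-agnostic prompt that hands the LLM the detected objects and asks for a dense sequence of end-effector poses, plus a parser for the reply. The prompt wording, the detections format, and the canned reply below are illustrative assumptions, not the authors' actual prompt.

import re

TEMPLATE = """You control a robot arm. Detected objects (name: x, y, z in metres):
{detections}
Task: {task}
Reply with one pose per line as: x, y, z, roll, pitch, yaw, gripper"""

def build_prompt(task, detections):
    lines = "\n".join(f"- {name}: {x:.3f}, {y:.3f}, {z:.3f}"
                      for name, (x, y, z) in detections.items())
    return TEMPLATE.format(detections=lines, task=task)

def parse_poses(reply):
    poses = []
    for line in reply.splitlines():
        nums = re.findall(r"-?\d+(?:\.\d+)?", line)
        if len(nums) == 7:
            poses.append([float(v) for v in nums])
    return poses

# In practice `reply` comes from GPT-4; a canned reply keeps this runnable.
reply = "0.40, 0.10, 0.30, 0.0, 3.14, 0.0, 0\n0.40, 0.10, 0.12, 0.0, 3.14, 0.0, 1"
print(build_prompt("open the bottle cap", {"bottle": (0.40, 0.10, 0.10)}))
print(parse_poses(reply))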

Updated: 2024-06-17 23:57:03

Domains: cs.RO,cs.AI,cs.CL,cs.HC,cs.LG

Download: http://arxiv.org/abs/2310.11604v2

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.

Updated: 2024-06-17 23:52:24

Domains: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2406.08702v2

Occam Gradient Descent

Deep learning neural network models must be large enough to adapt to their problem domain, while remaining small enough to avoid overfitting training data during gradient descent. To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large data sets, and are hence inefficient with both computing resources and training data. In response to these inefficiencies, we exploit learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size to minimize generalization error, with gradient descent on model weights to minimize fitting error. In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error. Our algorithm simultaneously descends the space of weights and topological size of any neural network without modification, and in our experiments it outperforms traditional gradient descent, with or without post-training pruning, in accuracy, compute, and model compression.
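
A rough sketch of the interleaving the abstract describes, under the assumption that simple global magnitude pruning stands in for the paper's learning-theoretic size-reduction criterion: ordinary gradient steps on the weights, periodically alternated with a shrink step.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))

def shrink(model, frac=0.1):
    # Zero out the smallest-magnitude fraction of weights (an illustrative
    # stand-in for Occam Gradient Descent's adaptive reduction of model size).
    with torch.no_grad():
        all_w = torch.cat([p.abs().flatten() for p in model.parameters()])
        thresh = torch.quantile(all_w, frac)
        for p in model.parameters():
            p.mul_((p.abs() > thresh).float())

for step in range(500):
    loss = nn.functional.cross_entropy(model(X), y)   # descend fitting error
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 99:
        shrink(model)                                 # interleaved size reduction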

Updated: 2024-06-17 23:51:55

Domains: cs.LG

Download: http://arxiv.org/abs/2405.20194v2

ChaosMining: A Benchmark to Evaluate Post-Hoc Local Attribution Methods in Low SNR Environments

In this study, we examine the efficacy of post-hoc local attribution methods in identifying features with predictive power from irrelevant ones in domains characterized by a low signal-to-noise ratio (SNR), a common scenario in real-world machine learning applications. We developed synthetic datasets encompassing symbolic functional, image, and audio data, incorporating a benchmark on the (Model × Attribution × Noise Condition) triplet. By rigorously testing various classic models trained from scratch, we gained valuable insights into the performance of these attribution methods in multiple conditions. Based on these findings, we introduce a novel extension to the notable recursive feature elimination (RFE) algorithm, enhancing its applicability for neural networks. Our experiments highlight its strengths in prediction and feature selection, alongside limitations in scalability. Further details and additional minor findings are included in the appendix, with extensive discussions. The code and resources are available at https://github.com/geshijoker/ChaosMining/.
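
A hedged sketch of recursive feature elimination driven by a post-hoc attribution score, in the spirit of the extension the abstract mentions (the paper's exact scoring and schedule are not specified here): train, attribute, drop the least-attributed features, repeat.

import torch
import torch.nn as nn

def saliency(model, X):
    X = X.clone().requires_grad_(True)
    model(X).sum().backward()
    return X.grad.abs().mean(dim=0)          # per-feature attribution score

X, y = torch.randn(512, 30), torch.randn(512, 1)
keep = list(range(30))
while len(keep) > 5:                         # eliminate down to 5 features
    model = nn.Sequential(nn.Linear(len(keep), 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(200):                     # short retraining pass
        loss = nn.functional.mse_loss(model(X[:, keep]), y)
        opt.zero_grad(); loss.backward(); opt.step()
    scores = saliency(model, X[:, keep])
    order = scores.argsort()                 # ascending: least important first
    keep = [keep[i] for i in order[len(order) // 5:].tolist()]  # drop bottom 20%
print("surviving features:", sorted(keep))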

Updated: 2024-06-17 23:39:29

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.12150v1

Metacognitive AI: Framework and the Case for a Neurosymbolic Approach

Metacognition is the concept of reasoning about an agent's own internal processes and was originally introduced in the field of developmental psychology. In this position paper, we examine the concept of applying metacognition to artificial intelligence. We introduce a framework for understanding metacognitive artificial intelligence (AI) that we call TRAP: transparency, reasoning, adaptation, and perception. We discuss each of these aspects in-turn and explore how neurosymbolic AI (NSAI) can be leveraged to address challenges of metacognition.

Updated: 2024-06-17 23:30:46

Domains: cs.AI

Download: http://arxiv.org/abs/2406.12147v1

Should AI Optimize Your Code? A Comparative Study of Current Large Language Models Versus Classical Optimizing Compilers

In the contemporary landscape of computer architecture, the demand for efficient parallel programming persists, requiring robust optimization techniques. Traditional optimizing compilers have historically been pivotal in this endeavor, adapting to the evolving complexities of modern software systems. The emergence of Large Language Models (LLMs) raises intriguing questions about the potential for AI-driven approaches to revolutionize code optimization methodologies. This paper presents a comparative analysis between two state-of-the-art Large Language Models, GPT-4.0 and CodeLlama-70B, and traditional optimizing compilers, assessing their respective abilities and limitations in optimizing code for maximum efficiency. Additionally, we introduce a benchmark suite of challenging optimization patterns and an automatic mechanism for evaluating the performance and correctness of the code generated by such tools. We used two different prompting methodologies to assess the performance of the LLMs -- Chain of Thought (CoT) and Instruction Prompting (IP). We then compared these results with three traditional optimizing compilers, CETUS, PLUTO and ROSE, across a range of real-world use cases. A key finding is that while LLMs have the potential to outperform current optimizing compilers, they often generate incorrect code on large code sizes, calling for automated verification methods. Our extensive evaluation across 3 different benchmark suites shows CodeLlama-70B as the superior optimizer of the two LLMs, capable of achieving speedups of up to 2.1x. Additionally, CETUS is the best among the optimizing compilers, achieving a maximum speedup of 1.9x. We also found no significant difference between the two prompting methods: Chain of Thought (CoT) and Instruction Prompting (IP).
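
A small sketch of the kind of automatic evaluation mechanism the abstract mentions: run the original and the optimized version on the same inputs, require identical outputs, then report the speedup. Real benchmarks compile and time C/C++ kernels; two pure-Python functions stand in here.

import time

def best_time(fn, inputs, trials=5):
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        best = min(best, time.perf_counter() - start)
    return best

def evaluate(original, optimized, inputs):
    # Correctness gate first: incorrect optimized code scores no speedup.
    if any(original(x) != optimized(x) for x in inputs):
        return {"correct": False, "speedup": None}
    return {"correct": True,
            "speedup": best_time(original, inputs) / best_time(optimized, inputs)}

# Toy optimization pattern: hoisting a loop-invariant computation.
def naive(n):
    return sum(i * len(str(12345)) for i in range(n))

def hoisted(n):
    k = len(str(12345))
    return sum(i * k for i in range(n))

print(evaluate(naive, hoisted, inputs=[10_000, 20_000]))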

Updated: 2024-06-17 23:26:41

Domains: cs.AI,cs.PF,cs.SE

Download: http://arxiv.org/abs/2406.12146v1

Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods

Machine learning models have achieved high overall accuracy in medical image analysis. However, performance disparities on specific patient groups pose challenges to their clinical utility, safety, and fairness. This can affect known patient groups - such as those based on sex, age, or disease subtype - as well as previously unknown and unlabeled groups. Furthermore, the root cause of such observed performance disparities is often challenging to uncover, hindering mitigation efforts. In this paper, to address these issues, we leverage Slice Discovery Methods (SDMs) to identify interpretable underperforming subsets of data and formulate hypotheses regarding the cause of observed performance disparities. We introduce a novel SDM and apply it in a case study on the classification of pneumothorax and atelectasis from chest x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis formulation and yields an explanation of previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models. Our findings indicate shortcut learning in both classification tasks, through the presence of chest drains and ECG wires, respectively. Sex-based differences in the prevalence of these shortcut features appear to cause the observed classification performance gap, representing a previously underappreciated interaction between shortcut learning and model fairness analyses.
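
A minimal sketch of the generic slice-discovery recipe (the paper introduces a novel SDM; this is only the common skeleton): embed the validation images, cluster the embeddings, and rank clusters by error rate, so that low-accuracy clusters become candidate slices to inspect for shortcut features such as chest drains or ECG wires.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 32))   # stand-in for image-encoder features
correct = rng.random(1000) > 0.2           # stand-in for per-sample correctness

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
clusters = [(correct[labels == k].mean(), (labels == k).sum(), k) for k in range(10)]
for acc, size, k in sorted(clusters)[:3]:  # worst-performing slices first
    print(f"candidate slice {k}: n={size}, accuracy={acc:.2f}")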

Updated: 2024-06-17 23:08:46

Domains: cs.LG,cs.AI,cs.CV,cs.CY

Download: http://arxiv.org/abs/2406.12142v1

Efficient Algorithms for Learning Monophonic Halfspaces in Graphs

We study the problem of learning a binary classifier on the vertices of a graph. In particular, we consider classifiers given by monophonic halfspaces, partitions of the vertices that are convex in a certain abstract sense. Monophonic halfspaces, and related notions such as geodesic halfspaces, have recently attracted interest, and several connections have been drawn between their properties (e.g., their VC dimension) and the structure of the underlying graph $G$. We prove several novel results for learning monophonic halfspaces in the supervised, online, and active settings. Our main result is that a monophonic halfspace can be learned with near-optimal passive sample complexity in time polynomial in $n = |V(G)|$. This requires us to devise a polynomial-time algorithm for consistent hypothesis checking, based on several structural insights on monophonic halfspaces and on a reduction to $2$-satisfiability. We prove similar results for the online and active settings. We also show that the concept class can be enumerated with delay $\operatorname{poly}(n)$, and that empirical risk minimization can be performed in time $2^{\omega(G)}\operatorname{poly}(n)$ where $\omega(G)$ is the clique number of $G$. These results answer open questions from the literature (González et al., 2020), and show a contrast with geodesic halfspaces, for which some of the said problems are NP-hard (Seiffarth et al., 2023).
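
The consistent-hypothesis-checking step rests on a reduction to 2-satisfiability, which is decidable in linear time via strongly connected components of the implication graph. Below is a generic 2-SAT subroutine of the kind such a reduction would call; the paper-specific construction of clauses from a labeled vertex set is omitted.

import networkx as nx

def two_sat(n_vars, clauses):
    """clauses: pairs (a, b) of literals, written +i / -i for variable i."""
    g = nx.DiGraph()
    g.add_nodes_from(v for i in range(1, n_vars + 1) for v in (i, -i))
    for a, b in clauses:                  # (a or b)  gives  !a -> b  and  !b -> a
        g.add_edge(-a, b)
        g.add_edge(-b, a)
    cond = nx.condensation(g)             # DAG of strongly connected components
    scc = cond.graph["mapping"]           # literal -> component id
    order = {c: i for i, c in enumerate(nx.topological_sort(cond))}
    assign = {}
    for v in range(1, n_vars + 1):
        if scc[v] == scc[-v]:
            return None                   # v equivalent to !v: unsatisfiable
        assign[v] = order[scc[v]] > order[scc[-v]]   # later component => true
    return assign

# (x1 or x2) and (!x1 or x2) forces x2 = True.
print(two_sat(2, [(1, 2), (-1, 2)]))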

Updated: 2024-06-17 23:04:58

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2405.00853v2

COT Flow: Learning Optimal-Transport Image Sampling and Editing by Contrastive Pairs

Diffusion models have demonstrated strong performance in sampling and editing multi-modal data with high generation quality, yet they suffer from the iterative generation process which is computationally expensive and slow. In addition, most methods are constrained to generate data from Gaussian noise, which limits their sampling and editing flexibility. To overcome both disadvantages, we present Contrastive Optimal Transport Flow (COT Flow), a new method that achieves fast and high-quality generation with improved zero-shot editing flexibility compared to previous diffusion models. Benefiting from optimal transport (OT), our method has no limitation on the prior distribution, enabling unpaired image-to-image (I2I) translation and doubling the editable space (at both the start and end of the trajectory) compared to other zero-shot editing methods. In terms of quality, COT Flow can generate competitive results in merely one step compared to previous state-of-the-art unpaired image-to-image (I2I) translation methods. To highlight the advantages of COT Flow through the introduction of OT, we introduce the COT Editor to perform user-guided editing with excellent flexibility and quality. The code will be released at https://github.com/zuxinrui/cot_flow.

Updated: 2024-06-17 23:02:20

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.12140v1

Bias in Text Embedding Models

Text embedding is becoming an increasingly popular AI methodology, especially among businesses, yet the potential of text embedding models to be biased is not well understood. This paper examines the degree to which a selection of popular text embedding models are biased, particularly along gendered dimensions. More specifically, this paper studies the degree to which these models associate a list of given professions with gendered terms. The analysis reveals that text embedding models are prone to gendered biases but in varying ways. Although there are certain inter-model commonalities, for instance, greater association of professions like nurse, homemaker, and socialite with female identifiers, and greater association of professions like CEO, manager, and boss with male identifiers, not all models make the same gendered associations for each occupation. Furthermore, the magnitude and directionality of bias can also vary on a model-by-model basis and depend on the particular words models are prompted with. This paper demonstrates that gender bias afflicts text embedding models and suggests that businesses using this technology need to be mindful of the specific dimensions of this problem.
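
A hedged sketch of the measurement itself: embed profession words and gendered identifiers, then compare each profession's mean cosine similarity to the female terms versus the male terms. The embed function here is a random placeholder for whichever embedding model is being audited.

import numpy as np

rng = np.random.default_rng(0)

def embed(word):
    return rng.normal(size=128)   # placeholder: swap in a real embedding model

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

female = [embed(w) for w in ["she", "her", "woman"]]
male = [embed(w) for w in ["he", "him", "man"]]
for job in ["nurse", "homemaker", "ceo", "manager"]:
    v = embed(job)
    gap = (np.mean([cosine(v, f) for f in female])
           - np.mean([cosine(v, m) for m in male]))
    print(f"{job}: female-minus-male association = {gap:+.3f}")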

Updated: 2024-06-17 22:58:36

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.12138v1

IDs for AI Systems

AI systems are increasingly pervasive, yet information needed to decide whether and how to engage with them may not exist or be accessible. A user may not be able to verify whether a system satisfies certain safety standards. An investigator may not know whom to investigate when a system causes an incident. A platform may find it difficult to penalize repeated negative interactions with the same system. Across a number of domains, IDs address analogous problems by identifying particular entities (e.g., a particular Boeing 747) and providing information about other entities of the same class (e.g., some or all Boeing 747s). We propose a framework in which IDs are ascribed to instances of AI systems (e.g., a particular chat session with Claude 3), and associated information is accessible to parties seeking to interact with that system. We characterize IDs for AI systems, argue that there could be significant demand for IDs from key actors, analyze how those actors could incentivize ID adoption, explore potential implementations of our framework, and highlight limitations and risks. IDs seem most warranted in high-stakes settings, where certain actors (e.g., those that enable AI systems to make financial transactions) could experiment with incentives for ID use. Deployers of AI systems could experiment with developing ID implementations. With further study, IDs could help to manage a world where AI systems pervade society.

Updated: 2024-06-17 22:48:11

Domains: cs.AI

Download: http://arxiv.org/abs/2406.12137v1

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that VLLM fine-tuning can cause forgetting of safety alignment previously learned by the underpinning LLM. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories. Our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns VLLMs. This alignment is achieved with minimal impact on, or even enhancement of, the models' helpfulness. The versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing VLLMs, training new models or safeguarding pre-trained VLLMs. Empirical results demonstrate that fine-tuned VLLMs effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. The code and dataset are available at https://github.com/ys-zong/VLGuard.

Updated: 2024-06-17 22:26:32

Domains: cs.LG

Download: http://arxiv.org/abs/2402.02207v2

Investigating the Impact of Direct Punishment on the Emergence of Cooperation in Multi-Agent Reinforcement Learning Systems

Solving the problem of cooperation is fundamentally important for the creation and maintenance of functional societies. Problems of cooperation are omnipresent within human society, with examples ranging from navigating busy road junctions to negotiating treaties. As the use of AI becomes more pervasive throughout society, the need for socially intelligent agents capable of navigating these complex cooperative dilemmas is becoming increasingly evident. Direct punishment is a ubiquitous social mechanism that has been shown to foster the emergence of cooperation in both humans and non-humans. In the natural world, direct punishment is often strongly coupled with partner selection and reputation and used in conjunction with third-party punishment. The interactions between these mechanisms could potentially enhance the emergence of cooperation within populations. However, no previous work has evaluated the learning dynamics and outcomes emerging from Multi-Agent Reinforcement Learning (MARL) populations that combine these mechanisms. This paper addresses this gap. It presents a comprehensive analysis and evaluation of the behaviors and learning dynamics associated with direct punishment, third-party punishment, partner selection, and reputation. Finally, we discuss the implications of using these mechanisms on the design of cooperative AI systems.

Updated: 2024-06-17 22:18:47

Domains: cs.MA,cs.AI,cs.LG

Download: http://arxiv.org/abs/2301.08278v3

Efficient Sequential Decision Making with Large Language Models

This paper focuses on extending the success of large language models (LLMs) to sequential decision making. Existing efforts either (i) re-train or finetune LLMs for decision making, or (ii) design prompts for pretrained LLMs. The former approach suffers from the computational burden of gradient updates, and the latter approach does not show promising results. In this paper, we propose a new approach that leverages online model selection algorithms to efficiently incorporate LLM agents into sequential decision making. Statistically, our approach significantly outperforms both traditional decision making algorithms and vanilla LLM agents. Computationally, our approach avoids the need for expensive gradient updates of LLMs, and throughout the decision making process, it requires only a small number of LLM calls. We conduct extensive experiments to verify the effectiveness of our proposed approach. As an example, on a large-scale Amazon dataset, our approach achieves more than a 6x performance gain over baselines while calling LLMs in only 1.5% of the time steps.
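
A stylized sketch of the selection idea, with a plain UCB bandit over two policies standing in for the paper's online model-selection algorithm: the selector learns when the cheap baseline suffices, so calls to the LLM agent stay rare.

import math
import random

random.seed(0)
policies = ["baseline", "llm"]
counts = {p: 1e-9 for p in policies}
rewards = {p: 0.0 for p in policies}

def act(policy):
    # Stand-in environment where the traditional policy is at least as good,
    # so the selector should learn to call the LLM only rarely.
    return random.gauss(0.65 if policy == "baseline" else 0.60, 0.1)

llm_calls = 0
for t in range(1, 2001):
    ucb = {p: rewards[p] / counts[p] + math.sqrt(2 * math.log(t) / counts[p])
           for p in policies}
    p = max(ucb, key=ucb.get)
    llm_calls += (p == "llm")
    counts[p] += 1
    rewards[p] += act(p)
print(f"LLM called in {100 * llm_calls / 2000:.1f}% of steps")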

Updated: 2024-06-17 22:13:22

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2406.12125v1

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the importance of using on-policy sampled data for successful self-improving. Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate remarkable performance improvements over existing models. For instance, our approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and ARC-C, with substantial increases in accuracy to 81.8% (+5.9%), 34.7% (+5.8%), and 76.4% (+15.8%), respectively. Additionally, our research delves into the training and inference compute tradeoff, providing insights into how our method effectively maximizes performance gains. Our code is publicly available at https://github.com/YuxiXie/MCTS-DPO.
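
For reference, a compact sketch of the Direct Preference Optimization objective used in the update step; the inputs are summed token log-probabilities of the preferred and dispreferred sequences under the current policy and the frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO: grow the policy's preference margin beyond the reference's.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with fake log-probabilities for a batch of four preference pairs.
lc, lr = torch.randn(4), torch.randn(4)
print(dpo_loss(lc, lr, lc - 0.1, lr + 0.1))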

Updated: 2024-06-17 22:11:49

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2405.00451v2

Gemini: A Family of Highly Capable Multimodal Models

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Updated: 2024-06-17 22:05:29

Domains: cs.CL,cs.AI,cs.CV

Download: http://arxiv.org/abs/2312.11805v4

Combining Reconstruction and Contrastive Methods for Multimodal Representations in RL

Learning self-supervised representations using reconstruction or contrastive losses improves performance and sample complexity of image-based and multimodal reinforcement learning (RL). Here, different self-supervised loss functions have distinct advantages and limitations depending on the information density of the underlying sensor modality. Reconstruction provides strong learning signals but is susceptible to distractions and spurious information. While contrastive approaches can ignore those, they may fail to capture all relevant details and can lead to representation collapse. For multimodal RL, this suggests that different modalities should be treated differently based on the amount of distractions in the signal. We propose Contrastive Reconstructive Aggregated representation Learning (CoRAL), a unified framework enabling us to choose the most appropriate self-supervised loss for each sensor modality and allowing the representation to better focus on relevant aspects. We evaluate CoRAL's benefits on a wide range of tasks with images containing distractions or occlusions, a new locomotion suite, and a challenging manipulation suite with visually realistic distractions. Our results show that learning a multimodal representation by combining contrastive and reconstruction-based losses can significantly improve performance and solve tasks that are out of reach for more naive representation learning approaches and other recent baselines.
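
A condensed sketch of the aggregation: each modality contributes either a contrastive term (robust to distractors) or a reconstruction term (keeps dense detail), chosen per modality, and the terms are summed into one representation loss. Module shapes and the modality assignment below are illustrative.

import torch
import torch.nn.functional as F

def info_nce(z, z_pos, temp=0.1):
    # Contrastive term: match each latent to its positive within the batch.
    logits = z @ z_pos.T / temp
    return F.cross_entropy(logits, torch.arange(len(z)))

def combined_loss(latents, batch, encoders, decoders, use_contrastive):
    total = 0.0
    for m, x in batch.items():
        if use_contrastive[m]:
            total = total + info_nce(latents, encoders[m](x))    # ignore distractors
        else:
            total = total + F.mse_loss(decoders[m](latents), x)  # keep all detail
    return total

B, D = 8, 16
latents = torch.randn(B, D)
batch = {"image": torch.randn(B, 32), "proprio": torch.randn(B, 6)}
encoders = {m: torch.nn.Linear(x.shape[1], D) for m, x in batch.items()}
decoders = {m: torch.nn.Linear(D, x.shape[1]) for m, x in batch.items()}
print(combined_loss(latents, batch, encoders, decoders,
                    {"image": True, "proprio": False}))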

Updated: 2024-06-17 22:04:47

Domains: cs.LG

Download: http://arxiv.org/abs/2302.05342v3

ChatEMG: Synthetic Data Generation to Control a Robotic Hand Orthosis for Stroke

Intent inferral on a hand orthosis for stroke patients is challenging due to the difficulty of data collection from impaired subjects. Additionally, EMG signals exhibit significant variations across different conditions, sessions, and subjects, making it hard for classifiers to generalize. Traditional approaches require a large labeled dataset from the new condition, session, or subject to train intent classifiers; however, this data collection process is burdensome and time-consuming. In this paper, we propose ChatEMG, an autoregressive generative model that can generate synthetic EMG signals conditioned on prompts (i.e., a given sequence of EMG signals). ChatEMG enables us to collect only a small dataset from the new condition, session, or subject and expand it with synthetic samples conditioned on prompts from this new context. ChatEMG leverages a vast repository of previous data via generative training while still remaining context-specific via prompting. Our experiments show that these synthetic samples are classifier-agnostic and can improve intent inferral accuracy for different types of classifiers. We demonstrate that our complete approach can be integrated into a single patient session, including the use of the classifier for functional orthosis-assisted tasks. To the best of our knowledge, this is the first time an intent classifier trained partially on synthetic data has been deployed for functional control of an orthosis by a stroke survivor. Videos and additional information can be found at https://jxu.ai/chatemg.

Updated: 2024-06-17 22:04:44

Domains: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.12123v1

Adding Conditional Control to Diffusion Models with Reinforcement Learning

Diffusion models are powerful generative models that allow for precise control over the characteristics of the generated samples. While these diffusion models trained on large datasets have achieved success, there is often a need to introduce additional controls in downstream fine-tuning processes, treating these powerful models as pre-trained diffusion models. This work presents a novel method based on reinforcement learning (RL) to add additional controls, leveraging an offline dataset comprising inputs and corresponding labels. We formulate this task as an RL problem, with the classifier learned from the offline dataset and the KL divergence against pre-trained models serving as the reward functions. We introduce our method, CTRL (Conditioning pre-Trained diffusion models with Reinforcement Learning), which produces soft-optimal policies that maximize the abovementioned reward functions. We formally demonstrate that our method enables sampling from the conditional distribution conditioned on additional controls during inference. Our RL-based approach offers several advantages over existing methods. Compared to commonly used classifier-free guidance, our approach improves sample efficiency, and can greatly simplify offline dataset construction by exploiting conditional independence between the inputs and additional controls. Furthermore, unlike classifier guidance, we avoid the need to train classifiers from intermediate states to additional controls.
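
A schematic sketch of the reward this RL formulation optimizes, with toy stand-ins for the diffusion-specific machinery: the log-likelihood of the desired label under the offline-learned classifier, minus a KL penalty that keeps the fine-tuned model close to the pre-trained one.

import torch
import torch.nn.functional as F

def kl_regularized_reward(x, label, classifier, logp_policy, logp_pretrained,
                          beta=0.1):
    # reward = log p(label | x) - beta * (log pi(x) - log p_pre(x)),
    # where the second term is a per-sample estimate of KL(pi || p_pre).
    class_term = F.log_softmax(classifier(x), dim=-1)
    class_term = class_term.gather(1, label.unsqueeze(1)).squeeze(1)
    return class_term - beta * (logp_policy - logp_pretrained)

x = torch.randn(4, 8)                 # toy generated samples
classifier = torch.nn.Linear(8, 3)    # stand-in for the offline classifier
print(kl_regularized_reward(x, torch.tensor([0, 1, 2, 1]), classifier,
                            logp_policy=torch.randn(4),
                            logp_pretrained=torch.randn(4)))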

Updated: 2024-06-17 22:00:26

Domains: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2406.12120v1

Deploying scalable traffic prediction models for efficient management in real-world large transportation networks during hurricane evacuations

Accurate traffic prediction is vital for effective traffic management during hurricane evacuation. This paper proposes a predictive modeling system that integrates Multilayer Perceptron (MLP) and Long-Short Term Memory (LSTM) models to capture both long-term congestion patterns and short-term speed patterns. Leveraging various input variables, including archived traffic data, spatial-temporal road network information, and hurricane forecast data, the framework is designed to address challenges posed by heterogeneous human behaviors, limited evacuation data, and hurricane event uncertainties. Deployed in a real-world traffic prediction system in Louisiana, the model achieved an 82% accuracy in predicting long-term congestion states over a 6-hour period during a 7-day hurricane-impacted duration. The short-term speed prediction model exhibited Mean Absolute Percentage Errors (MAPEs) ranging from 7% to 13% across evacuation horizons from 1 to 6 hours. Evaluation results underscore the model's potential to enhance traffic management during hurricane evacuations, and real-world deployment highlights its adaptability and scalability in diverse hurricane scenarios within extensive transportation networks.
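
A bare-bones sketch of the two model heads such a system combines; real inputs would concatenate archived traffic, spatial-temporal road-network, and hurricane-forecast features, and the layer sizes here are placeholders.

import torch
import torch.nn as nn

class CongestionMLP(nn.Module):
    # Long-term head: congested / not congested over the coming window.
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, 2))
    def forward(self, x):
        return self.net(x)

class SpeedLSTM(nn.Module):
    # Short-term head: speeds for the next `horizon` hours from a recent window.
    def __init__(self, n_features, horizon=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 64, batch_first=True)
        self.head = nn.Linear(64, horizon)
    def forward(self, x):                  # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

print(CongestionMLP(12)(torch.randn(4, 12)).shape)   # torch.Size([4, 2])
print(SpeedLSTM(12)(torch.randn(4, 24, 12)).shape)   # torch.Size([4, 6])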

Updated: 2024-06-17 21:59:44

Domains: cs.LG,cs.AI,cs.SI

Download: http://arxiv.org/abs/2406.12119v1

Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation

In the context of text classification, the financial burden of annotation exercises for creating training data is a critical issue. Active learning techniques, particularly those rooted in uncertainty sampling, offer a cost-effective solution by pinpointing the most instructive samples for manual annotation. Similarly, Large Language Models (LLMs) such as GPT-3.5 provide an alternative for automated annotation but come with concerns regarding their reliability. This study introduces a novel methodology that integrates human annotators and LLMs within an Active Learning framework. We conducted evaluations on three public datasets: IMDB for sentiment analysis, a Fake News dataset for authenticity discernment, and a Movie Genres dataset for multi-label classification. The proposed framework integrates human annotation with the output of LLMs, depending on the model uncertainty levels. This strategy achieves an optimal balance between cost efficiency and classification performance. The empirical results show a substantial decrease in the costs associated with data annotation while either maintaining or improving model accuracy.
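
A minimal sketch of the routing rule, assuming predictive entropy as the uncertainty measure and an illustrative threshold: the samples the current model is least sure about go to human annotators, and the rest go to the LLM.

import numpy as np

def entropy(probs):
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def route(probs, threshold=0.65):
    """probs: (n_samples, n_classes) from the current model on unlabeled data."""
    h = entropy(probs)
    to_human = np.where(h >= threshold)[0]   # most instructive, worth the cost
    to_llm = np.where(h < threshold)[0]      # model fairly confident; LLM labels
    return to_human, to_llm

probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]])
print(route(probs))   # the near-uniform second sample is routed to humans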

Updated: 2024-06-17 21:45:48

Domains: cs.CL,cs.AI,cs.LG,68T50,I.2.7

Download: http://arxiv.org/abs/2406.12114v1

Thermodynamic Transferability in Coarse-Grained Force Fields using Graph Neural Networks

Coarse-graining is a molecular modeling technique in which an atomistic system is represented in a simplified fashion that retains the most significant system features that contribute to a target output, while removing the degrees of freedom that are less relevant. This reduction in model complexity allows coarse-grained molecular simulations to reach increased spatial and temporal scales compared to corresponding all-atom models. A core challenge in coarse-graining is to construct a force field that represents the interactions in the new representation in a way that preserves the atomistic-level properties. Many approaches to building coarse-grained force fields have limited transferability between different thermodynamic conditions as a result of averaging over internal fluctuations at a specific thermodynamic state point. Here, we use a graph-convolutional neural network architecture, the Hierarchically Interacting Particle Neural Network with Tensor Sensitivity (HIP-NN-TS), to develop a highly automated training pipeline for coarse grained force fields which allows for studying the transferability of coarse-grained models based on the force-matching approach. We show that this approach not only yields highly accurate force fields, but also that these force fields are more transferable through a variety of thermodynamic conditions. These results illustrate the potential of machine learning techniques such as graph neural networks to improve the construction of transferable coarse-grained force fields.
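
A toy sketch of force matching with forces obtained by automatic differentiation of a learned energy. HIP-NN-TS itself is a graph network; a small MLP over pair distances stands in here, and the reference forces are random placeholders for mapped all-atom forces.

import torch
import torch.nn as nn

energy_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def predicted_forces(pos):
    # pos: (n_beads, 3) CG coordinates. Energy is a sum of pair-distance terms;
    # the force is the negative gradient of that energy w.r.t. positions.
    pos = pos.clone().requires_grad_(True)
    i, j = torch.triu_indices(len(pos), len(pos), offset=1)
    dist = (pos[i] - pos[j]).norm(dim=-1)
    energy = energy_net(dist.unsqueeze(-1)).sum()
    return -torch.autograd.grad(energy, pos, create_graph=True)[0]

opt = torch.optim.Adam(energy_net.parameters(), lr=1e-3)
pos = torch.randn(8, 3)            # one coarse-grained frame
ref_forces = torch.randn(8, 3)     # stand-in for mapped all-atom forces
for _ in range(100):
    loss = ((predicted_forces(pos) - ref_forces) ** 2).mean()   # force matching
    opt.zero_grad(); loss.backward(); opt.step()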

Updated: 2024-06-17 21:44:05

Domains: physics.chem-ph,cs.LG

Download: http://arxiv.org/abs/2406.12112v1

Cancellable Memory Requests: A transparent, lightweight Spectre mitigation

Speculation is fundamental to achieving high CPU performance, yet it enables vulnerabilities such as Spectre attacks, which remain a significant challenge to mitigate without incurring substantial performance overheads. These attacks typically unfold in three steps: they speculatively access sensitive data (access), alter the cache state (transmit), and then utilize a cache timing attack (e.g., Flush+Reload) to extract the secret (receive). Most Spectre attacks exploit a cache timing side channel during the transmit and receive steps. Our key observation is that Spectre attacks do not require the transmit instruction to complete before mis-prediction is detected and mis-speculated instructions are squashed. Instead, it suffices for the instruction to execute and dispatch a request to the memory hierarchy. Responses from memory that arrive after squashing occurs still alter the cache state, including those related to mis-speculated memory accesses. We therefore propose a novel mitigation technique, Cancellable Memory Requests (CMR), that cancels mis-speculated memory requests. Immediately upon squashing, a cancellation is sent to the cache hierarchy, propagating downstream and preventing any changes to caches that have not yet received a response. This reduces the likelihood of cache state changes, thereby reducing the likelihood of Spectre attacks succeeding. We implement CMR on gem5 and show that it thwarts practical Spectre attacks, and has near-zero performance overheads. We show that CMR can completely thwart Spectre attacks in four real-world processors with realistic system configurations.

Updated: 2024-06-17 21:43:39

Domains: cs.CR,cs.AR

Download: http://arxiv.org/abs/2406.12110v1

Efficient Privacy-Preserving Machine Learning with Lightweight Trusted Hardware

In this paper, we propose a new secure machine learning inference platform assisted by a small dedicated security processor, which will be easier to protect and deploy compared to today's TEEs integrated into high-performance processors. Our platform provides three main advantages over the state-of-the-art: (i) We achieve significant performance improvements compared to state-of-the-art distributed Privacy-Preserving Machine Learning (PPML) protocols, with only a small security processor that is comparable to a discrete security chip such as the Trusted Platform Module (TPM) or on-chip security subsystems in SoCs similar to the Apple enclave processor. In the semi-honest setting with WAN/GPU, our scheme is 4X-63X faster than Falcon (PoPETs'21) and AriaNN (PoPETs'22) and 3.8X-12X more communication efficient. We achieve even higher performance improvements in the malicious setting. (ii) Our platform guarantees security with abort against malicious adversaries under honest majority assumption. (iii) Our technique is not limited by the size of secure memory in a TEE and can support high-capacity modern neural networks like ResNet18 and Transformer. While previous work investigated the use of high-performance TEEs in PPML, this work represents the first to show that even tiny secure hardware with really limited performance can be leveraged to significantly speed-up distributed PPML protocols if the protocol can be carefully designed for lightweight trusted hardware.

Updated: 2024-06-17 21:38:42

Domains: cs.CR

Download: http://arxiv.org/abs/2210.10133v4

Computing in the Life Sciences: From Early Algorithms to Modern AI

Computing in the life sciences has undergone a transformative evolution, from early computational models in the 1950s to the applications of artificial intelligence (AI) and machine learning (ML) seen today. This paper highlights key milestones and technological advancements through the historical development of computing in the life sciences. The discussion includes the inception of computational models for biological processes, the advent of bioinformatics tools, and the integration of AI/ML in modern life sciences research. Attention is given to AI-enabled tools used in the life sciences, such as scientific large language models and bio-AI tools, examining their capabilities, limitations, and impact on biological risk. This paper seeks to clarify and establish essential terminology and concepts to ensure informed decision-making and effective communication across disciplines.

Updated: 2024-06-17 21:36:52

Domains: q-bio.OT,cs.AI

Download: http://arxiv.org/abs/2406.12108v1

White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs

Language agency is an important aspect of evaluating social biases in texts. While several studies approached agency-related bias in human-written language, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous research often relies on string-matching techniques to identify agentic and communal words within texts, which fall short of accurately classifying language agency. We introduce the novel Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE leverages 5,400 template-based prompts, an accurate agency classifier, and corresponding bias metrics to test for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. To build better and more accurate automated agency classifiers, we also contribute and release the Language Agency Classification (LAC) dataset, consisting of 3,724 agentic and communal sentences. Using LABE, we unveil previously under-explored language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) For the same text category, LLM generations demonstrate higher levels of gender bias than human-written texts; (2) On most generation tasks, models show remarkably higher levels of intersectional bias than the other bias aspects. Those who are at the intersection of gender and racial minority groups -- such as Black females -- are consistently described by texts with lower levels of agency; (3) Among the 3 LLMs investigated, Llama3 demonstrates greatest overall bias in language agency; (4) Not only does prompt-based mitigation fail to resolve language agency bias in LLMs, but it frequently leads to the exacerbation of biases in generated texts.

Updated: 2024-06-17 21:36:46

Domains: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2404.10508v2

Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations

State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. To explore the mechanistic causes of these hallucinations, we create diagnostic datasets with subject-relation queries and adapt interpretability methods to trace hallucinations through internal model representations. We discover two general and distinct mechanistic causes of hallucinations shared across LMs (Llama-2, Pythia, GPT-J): 1) knowledge enrichment hallucinations: insufficient subject attribute knowledge in lower layer MLPs, and 2) answer extraction hallucinations: failure to select the correct object attribute in upper layer attention heads. We also found these two internal mechanistic causes of hallucinations are reflected in external manifestations. Based on insights from our mechanistic analysis, we propose a novel hallucination mitigation method through targeted restoration of the LM's internal fact recall pipeline, demonstrating superior performance compared to baselines.

Updated: 2024-06-17 21:35:41

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2403.18167v2

End-to-end Text-to-SQL Generation within an Analytics Insight Engine

Recent advancements in Text-to-SQL have pushed database management systems towards greater democratization of data access. Today's language models are at the core of these advancements. They enable impressive Text-to-SQL generation as experienced in the development of Distyl AI's Analytics Insight Engine. Its early deployment with enterprise customers has highlighted three core challenges. First, data analysts expect support with authoring SQL queries of very high complexity. Second, requests are ad-hoc and, as such, require low latency. Finally, generation requires an understanding of domain-specific terminology and practices. The design and implementation of our Text-to-SQL generation pipeline, powered by large language models, tackles these challenges. The core tenets of our approach rely on external knowledge that we extract in a pre-processing phase, on retrieving the appropriate external knowledge at query generation time, and on decomposing SQL query generation following a hierarchical CTE-based structure. Finally, an adaptation framework leverages feedback to update the external knowledge, in turn improving query generation over time. We give an overview of our end-to-end approach and highlight the operators generating SQL during inference.

Updated: 2024-06-17 21:33:01

Domains: cs.CL,cs.AI,cs.DB,cs.LG

Download: http://arxiv.org/abs/2406.12104v1

Centering Policy and Practice: Research Gaps around Usable Differential Privacy

As a mathematically rigorous framework that has amassed a rich theoretical literature, differential privacy is considered by many experts to be the gold standard for privacy-preserving data analysis. Others argue that while differential privacy is a clean formulation in theory, it poses significant challenges in practice. Both perspectives are, in our view, valid and important. To bridge the gaps between differential privacy's promises and its real-world usability, researchers and practitioners must work together to advance policy and practice of this technology. In this paper, we outline pressing open questions towards building usable differential privacy and offer recommendations for the field, such as developing risk frameworks to align with user needs, tailoring communications for different stakeholders, modeling the impact of privacy-loss parameters, investing in effective user interfaces, and facilitating algorithmic and procedural audits of differential privacy systems.

Updated: 2024-06-17 21:32:30

标题: 聚焦政策与实践:围绕可用差分隐私的研究空白

摘要: 作为一个在数学上严谨的框架,积累了丰富的理论文献,差分隐私被许多专家认为是保护隐私的数据分析的黄金标准。其他人认为,虽然在理论上差分隐私是一个干净的表述,但在实践中却带来了重大挑战。在我们看来,这两种观点都是有效且重要的。为了弥合差分隐私的承诺与其在现实世界中可用性之间的差距,研究人员和从业者必须共同努力推动这项技术的政策和实践。在本文中,我们概述了建立可用差分隐私的紧迫开放问题,并提出了该领域的建议,例如开发与用户需求相一致的风险框架,为不同利益相关者量身定制沟通,建立隐私损失参数影响的模型,投资于有效的用户界面,并促进差分隐私系统的算法和程序审计。

更新时间: 2024-06-17 21:32:30

领域: cs.CR,cs.CY,cs.HC

下载: http://arxiv.org/abs/2406.12103v1

Adaptive Uncertainty Quantification for Trajectory Prediction Under Distributional Shift

Trajectory prediction models that can infer both finite future trajectories and their associated uncertainties of the target vehicles in an online setting (e.g., real-world application scenarios) are crucial for ensuring the safe and robust navigation and path planning of autonomous vehicle motion. However, the majority of existing trajectory prediction models have neither considered reducing the uncertainty as one objective during the training stage nor provided reliable uncertainty quantification during the inference stage under potential distribution shift. Therefore, in this paper, we propose the Conformal Uncertainty Quantification under Distribution Shift framework, CUQDS, to quantify the uncertainty of the predicted trajectories of existing trajectory prediction models under potential data distribution shift, while considering improving the prediction accuracy of the models and reducing the estimated uncertainty during the training stage. Specifically, CUQDS includes 1) a learning-based Gaussian process regression module that models the output distribution of the base model (any existing trajectory prediction or time series forecasting neural networks) and reduces the estimated uncertainty by an additional loss term, and 2) a statistical-based Conformal P control module to calibrate the estimated uncertainty from the Gaussian process regression module in an online setting under potential distribution shift between training and testing data.
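
A minimal sketch of the online calibration idea, assuming an adaptive update of a scale q on the predicted standard deviation so that empirical coverage tracks the 1 - alpha target (the paper's P-control rule and GP module are stood in by placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, gamma, q = 0.1, 0.05, 1.0   # target miscoverage, step size, initial scale
    hits = []
    for t in range(2000):
        mu, sigma = 0.0, 1.0           # stand-ins for the GP module's mean/std
        y = rng.normal(0.0, 1.3)       # test data drawn from a shifted distribution
        covered = float(abs(y - mu) <= q * sigma)
        hits.append(covered)
        q += gamma * ((1 - alpha) - covered)   # widen on a miss, shrink on a hit
    print(f"empirical coverage {np.mean(hits):.3f} at final scale q = {q:.2f}")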

Updated: 2024-06-17 21:25:36

标题: 分布偏移下轨迹预测的自适应不确定性量化

摘要: 在在线设置中(例如,真实世界的应用场景)推断目标车辆的有限未来轨迹及其相关不确定性的轨迹预测模型对于确保自动驾驶车辆运动的安全和稳健导航和路径规划至关重要。然而,现有的大多数轨迹预测模型在训练阶段既未考虑减少不确定性作为一个目标,也未在推断阶段在潜在的分布转变下提供可靠的不确定性量化。因此,在本文中,我们提出了适用于潜在数据分布转变的一致性不确定性量化框架(CUQDS),用于在训练阶段考虑提高模型的预测准确性和降低估计的不确定性的同时,量化现有轨迹预测模型的预测轨迹的不确定性。具体来说,CUQDS包括1)一个基于学习的高斯过程回归模块,该模块对基础模型(任何现有的轨迹预测或时间序列预测神经网络)的输出分布进行建模,并通过额外的损失项减少估计的不确定性,以及2)一个基于统计的一致性P控制模块,用于在潜在的训练和测试数据之间的分布转变下,在在线设置中校准从高斯过程回归模块估计的不确定性。

更新时间: 2024-06-17 21:25:36

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2406.12100v1

Guarantees of confidentiality via Hammersley-Chapman-Robbins bounds

Protecting privacy during inference with deep neural networks is possible by adding noise to the activations in the last layers prior to the final classifiers or other task-specific layers. The activations in such layers are known as "features" (or, less commonly, as "embeddings" or "feature embeddings"). The added noise helps prevent reconstruction of the inputs from the noisy features. Lower bounding the variance of every possible unbiased estimator of the inputs quantifies the confidentiality arising from such added noise. Convenient, computationally tractable bounds are available from classic inequalities of Hammersley and of Chapman and Robbins -- the HCR bounds. Numerical experiments indicate that the HCR bounds are on the precipice of being effectual for small neural nets with the data sets, "MNIST" and "CIFAR-10," which contain 10 classes each for image classification. The HCR bounds appear to be insufficient on their own to guarantee confidentiality of the inputs to inference with standard deep neural nets, "ResNet-18" and "Swin-T," pre-trained on the data set, "ImageNet-1000," which contains 1000 classes. Supplementing the addition of noise to features with other methods for providing confidentiality may be warranted in the case of ImageNet. In all cases, the results reported here limit consideration to amounts of added noise that incur little degradation in the accuracy of classification from the noisy features. Thus, the added noise enhances confidentiality without much reduction in the accuracy on the task of image classification.
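
A hedged numerical sketch of the HCR bound in this setting: with isotropic Gaussian noise of standard deviation sigma added to features f(x), the chi-square divergence between the feature distributions at x and x + delta is exp(||f(x + delta) - f(x)||^2 / sigma^2) - 1, so the variance of any unbiased estimate of x is lower-bounded by ||delta||^2 divided by that quantity. The feature map below is a toy stand-in for a network:

    import numpy as np

    def hcr_bound(f, x, delta, sigma):
        """Lower bound on the variance of any unbiased estimate of x from
        y = f(x) + N(0, sigma^2 I), for one probe direction delta."""
        df2 = np.sum((f(x + delta) - f(x)) ** 2)
        return np.dot(delta, delta) / (np.exp(df2 / sigma**2) - 1.0)

    W = np.full((4, 8), 0.3)
    f = lambda x: np.tanh(x @ W)          # toy feature map
    x, delta = np.ones(4), 0.01 * np.ones(4)
    for sigma in (0.5, 1.0, 2.0):         # more noise, stronger confidentiality bound
        print(f"sigma={sigma}: bound {hcr_bound(f, x, delta, sigma):.4g}")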

Updated: 2024-06-17 21:22:59

标题: 通过Hammersley-Chapman-Robbins界限确保保密性

摘要: 通过在最终分类器或其他特定任务层之前,向最后几层的激活添加噪声,可以在深度神经网络的推断过程中保护隐私。这些层中的激活称为“特征”(较少见的叫法是“嵌入”或“特征嵌入”)。添加的噪声有助于防止从带噪特征中重建输入。对输入的任何无偏估计器的方差给出下界,可以量化这种加噪所带来的保密性。方便且易于计算的界限可以从Hammersley以及Chapman和Robbins的经典不等式(即HCR界限)中获得。数值实验表明,对于在图像分类数据集“MNIST”和“CIFAR-10”(各含10个类别)上的小型神经网络,HCR界限勉强接近有效。对于在含1000个类别的数据集“ImageNet-1000”上预训练的标准深度神经网络“ResNet-18”和“Swin-T”,仅靠HCR界限似乎不足以保证推断输入的保密性。在ImageNet的情况下,可能需要用其他提供保密性的方法来补充特征加噪。在所有情况下,这里报告的结果仅考虑对带噪特征分类准确率几乎没有损害的加噪量。因此,添加的噪声增强了保密性,而图像分类任务的准确率几乎没有降低。

更新时间: 2024-06-17 21:22:59

领域: cs.LG,cs.CR,cs.CY,stat.ML

下载: http://arxiv.org/abs/2404.02866v3

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in autonomous driving. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets for training, thereby helping our model to learn 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes dataset demonstrate that DistillNeRF significantly outperforms existing comparable self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.

Updated: 2024-06-17 21:15:13

标题: DistillNeRF: 通过提炼神经场和基础模型特征从单次瞥见图像中感知3D场景

摘要: 我们提出了DistillNeRF,这是一个自监督学习框架,解决了从有限的2D观测中理解自动驾驶中的3D环境的挑战。我们的方法是一个通用的前馈模型,可以从稀疏的单帧多视角摄像机输入中预测丰富的神经场景表示,并通过可微渲染进行自监督训练,以重建RGB、深度或特征图像。我们的第一个见解是利用每个场景优化的神经辐射场(NeRFs),通过生成密集的深度和虚拟相机目标进行训练,从而帮助我们的模型从稀疏的非重叠图像输入中学习3D几何。其次,为了学习一个语义丰富的3D表示,我们提出从预训练的2D基础模型(如CLIP或DINOv2)中提取特征,从而实现各种下游任务,而无需昂贵的3D人类注释。为了利用这两个见解,我们引入了一种新颖的模型架构,包括一个两阶段的lift-splat-shoot编码器和一个参数化的稀疏分层体素表示。在NuScenes数据集上的实验结果表明,DistillNeRF在场景重建、新视图合成和深度估计方面明显优于现有的可比自监督方法;它还允许进行竞争性的零样本3D语义占用预测,以及通过提取的基础模型特征实现对开放世界场景的理解。演示和代码将在https://distillnerf.github.io/上提供。

更新时间: 2024-06-17 21:15:13

领域: cs.CV,cs.AI,cs.RO

下载: http://arxiv.org/abs/2406.12095v1

Who's asking? User personas and the mechanics of latent misalignment

Despite investments in improving model safety, studies show that misaligned capabilities remain latent in safety-tuned models. In this work, we shed light on the mechanics of this phenomenon. First, we show that even when model generations are safe, harmful content can persist in hidden representations and can be extracted by decoding from earlier layers. Then, we show that whether the model divulges such content depends significantly on its perception of who it is talking to, which we refer to as user persona. In fact, we find manipulating user persona to be even more effective for eliciting harmful content than direct attempts to control model refusal. We study both natural language prompting and activation steering as control methods and show that activation steering is significantly more effective at bypassing safety filters. We investigate why certain personas break model safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries. Finally, we show we can predict a persona's effect on refusal given only the geometry of its steering vector.
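
A minimal sketch of activation steering with a forward hook, assuming a decoder-style PyTorch model; the layer index, scale, and difference-of-means construction of the steering vector are illustrative assumptions, not the paper's exact recipe:

    import torch

    def add_steering_hook(block, vector, scale=4.0):
        """Register a hook that shifts `block`'s output activations by
        scale * vector (vector: [hidden_dim])."""
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + scale * vector.to(hidden.dtype)
            return (steered,) + output[1:] if isinstance(output, tuple) else steered
        return block.register_forward_hook(hook)

    # Hypothetical usage with a HuggingFace-style decoder:
    #   v = persona_mean_residual - neutral_mean_residual   # difference-of-means vector
    #   handle = add_steering_hook(model.model.layers[14], v)
    #   ...generate...
    #   handle.remove()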

Updated: 2024-06-17 21:15:12

标题: 谁在问?用户角色与潜在不一致的机制

摘要: 尽管在改进模型安全性方面投入了大量努力,研究表明,未对齐的能力仍然潜藏在经过安全调优的模型中。在这项工作中,我们揭示了这种现象的机制。首先,我们展示即使模型生成的内容是安全的,有害内容仍然可能存在于隐藏表示中,并且可以通过从较早的层解码来提取。然后,我们展示模型是否泄露这种内容在很大程度上取决于其对谈话对象的感知,我们称之为用户角色。事实上,我们发现操纵用户角色比直接尝试控制模型拒绝行为更能有效地引出有害内容。我们研究了自然语言提示和激活导向这两种控制方法,并表明激活导向在绕过安全过滤器方面显著更有效。我们调查了为什么某些角色会破坏模型的保护措施,发现它们使模型能够对原本危险的查询作出更善意的解读。最后,我们展示仅凭一个角色导向向量的几何形状,就可以预测该角色对拒绝行为的影响。

更新时间: 2024-06-17 21:15:12

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.12094v1

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
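
A toy sketch of the stall-free scheduling idea (not the actual Sarathi-Serve scheduler): each iteration packs every ongoing decode, then fills the remaining token budget with a chunk of a pending prefill, so decodes never pause:

    from collections import deque

    TOKEN_BUDGET, CHUNK = 512, 256      # tokens per iteration, prefill chunk size
    prefills = deque([("req1", 900), ("req2", 300)])   # (id, prompt tokens left)
    decodes = []                                        # requests in decode phase

    for step in range(8):
        batch = [(rid, 1) for rid in decodes]           # decodes are never paused
        budget = TOKEN_BUDGET - len(batch)
        if prefills and budget > 0:
            rid, left = prefills[0]
            take = min(CHUNK, left, budget)
            batch.append((rid, take))
            if left == take:
                prefills.popleft()
                decodes.append(rid)                     # prompt done, start decoding
            else:
                prefills[0] = (rid, left - take)
        print(f"iter {step}: {batch}")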

Updated: 2024-06-17 21:10:46

标题: 用Sarathi-Serve驯服LLM推理中的吞吐量-延迟权衡

摘要: 每个LLM服务请求经历两个阶段。第一阶段是预填,处理整个输入提示并生成第一个输出标记;第二阶段是解码,逐个生成其余的输出标记。预填迭代具有高延迟,但由于输入提示的并行处理而饱和GPU计算。相比之下,解码迭代具有低延迟,但计算利用率也低,因为解码迭代每次只处理一个标记。这使得对解码进行批处理非常有效,因此也对整体吞吐量有效。然而,对多个请求进行批处理会导致预填和解码迭代交替进行,这使得同时实现高吞吐量和低延迟变得具有挑战性。 我们引入了一种高效的LLM推理调度器Sarathi-Serve,以解决吞吐量和延迟之间的折衷。Sarathi-Serve引入了分块预填,将预填请求分成大小近似的块,并创建无阻塞调度,可以在批处理中添加新请求而不会暂停正在进行的解码。无阻塞调度解锁了通过大批处理提高吞吐量的机会,同时最大限度地减少批处理对延迟的影响。此外,Sarathi-Serve中的统一批次改善了迭代之间的不平衡,从而减少了管道气泡。 我们的技术在尾延迟约束下跨模型和硬件实现了显著的推理性能改进。对于单个A100 GPU上的Mistral-7B模型,我们实现了2.6倍的更高服务容量,对于在两个A100 GPU上的Yi-34B模型,我们实现了高达3.7倍的更高服务容量,与vLLM相比。在与Falcon-180B上的管道并行性一起使用时,Sarathi-Serve提供了高达5.6倍的端到端服务容量增益。Sarathi-Serve的源代码可在https://github.com/microsoft/sarathi-serve 上获得。

更新时间: 2024-06-17 21:10:46

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2403.02310v3

Is poisoning a real threat to LLM alignment? Maybe more so than you think

Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to a new line of work on Direct Policy Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLama 7B, Mistral 7B, and Gemma 7B. We find that unlike PPO-based methods, which, when it comes to backdoor attacks, require at least 4\% of the data to be poisoned to elicit harmful behavior, we exploit the true vulnerabilities of DPO more simply and can poison the model with only as much as 0.5\% of the data. We further investigate the potential reasons behind the vulnerability and how well this vulnerability translates into backdoor vs non-backdoor attacks.
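
For context, a compact sketch of the DPO objective together with a simple preference-flipping poison; flipping chosen/rejected labels is one plausible reading of preference poisoning, and the paper's attacks may be constructed differently:

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        # -log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)])
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -F.logsigmoid(margin).mean()

    def poison_preferences(pairs, rate=0.005, seed=0):
        """Swap chosen/rejected for roughly `rate` of the preference pairs."""
        g = torch.Generator().manual_seed(seed)
        flips = torch.rand(len(pairs), generator=g) < rate
        return [(l, w) if f else (w, l) for (w, l), f in zip(pairs, flips)]

    # e.g. dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
    #               torch.tensor([-5.5]), torch.tensor([-6.5]))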

Updated: 2024-06-17 21:06:00

标题: 投毒是对LLM对齐的真正威胁吗?也许比你想象的更严重

摘要: 最近在人类反馈强化学习(RLHF)方面取得的进展显著影响了大型语言模型(LLMs)的对齐。诸如近端策略优化(PPO)之类的强化学习算法的敏感性,催生了关于直接策略优化(DPO)的新研究方向,该方法在监督学习框架下处理RLHF。这些RLHF方法在实际中的日益使用,要求我们分析其脆弱性。在这项工作中,我们研究了DPO在不同场景下受到投毒攻击的脆弱性,并首次比较了偏好投毒的有效性。我们全面分析了DPO在不同类型攻击(即后门和非后门攻击)以及不同投毒方法下,在多种语言模型(即LLama 7B、Mistral 7B和Gemma 7B)上的脆弱性。我们发现,与基于PPO的方法(在后门攻击中需要至少毒化4%的数据才能引发有害行为)不同,我们能更简单地利用DPO的真实脆弱性,只需毒化至多0.5%的数据即可对模型投毒。我们进一步研究了这种脆弱性背后的潜在原因,以及它在后门攻击与非后门攻击中的表现差异。

更新时间: 2024-06-17 21:06:00

领域: cs.LG,cs.CL,cs.CR

下载: http://arxiv.org/abs/2406.12091v1

Class Symbolic Regression: Gotta Fit 'Em All

We introduce 'Class Symbolic Regression' (Class SR) a first framework for automatically finding a single analytical functional form that accurately fits multiple datasets - each realization being governed by its own (possibly) unique set of fitting parameters. This hierarchical framework leverages the common constraint that all the members of a single class of physical phenomena follow a common governing law. Our approach extends the capabilities of our earlier Physical Symbolic Optimization ($\Phi$-SO) framework for Symbolic Regression, which integrates dimensional analysis constraints and deep reinforcement learning for unsupervised symbolic analytical function discovery from data. Additionally, we introduce the first Class SR benchmark, comprising a series of synthetic physical challenges specifically designed to evaluate such algorithms. We demonstrate the efficacy of our novel approach by applying it to these benchmark challenges and showcase its practical utility for astrophysics by successfully extracting an analytic galaxy potential from a set of simulated orbits approximating stellar streams.
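
To make the Class SR setting concrete (this shows only how a single candidate form would be scored across a class, not the $\Phi$-SO symbolic search itself): one analytic form is fit to several datasets, each with its own parameters, and judged by the pooled error:

    import numpy as np
    from scipy.optimize import curve_fit

    form = lambda x, a, b: a * np.exp(b * x)        # one candidate analytic form
    rng = np.random.default_rng(1)
    datasets = []
    for a, b in [(1.0, 0.3), (0.5, 0.7), (2.0, 1.0)]:   # three members of one class
        x = np.linspace(0, 2, 50)
        datasets.append((x, form(x, a, b) + 0.01 * rng.normal(size=x.size)))

    pooled_sse = 0.0
    for x, y in datasets:
        (a_hat, b_hat), _ = curve_fit(form, x, y, p0=(1.0, 0.5))  # per-dataset params
        pooled_sse += np.sum((form(x, a_hat, b_hat) - y) ** 2)
    print(f"pooled SSE of this candidate form: {pooled_sse:.4f}")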

Updated: 2024-06-17 20:58:31

标题: 类符号回归:全部拟合

摘要: 我们介绍了“类符号回归”(Class SR),这是第一个用于自动找到能够准确拟合多个数据集的单一解析函数形式的框架,其中每个实现都由其自身(可能)独特的一组拟合参数控制。这种层次化框架利用了一个共同约束,即同一类物理现象的所有成员都遵循同一支配法则。我们的方法扩展了我们早先用于符号回归的物理符号优化($\Phi$-SO)框架的能力,该框架集成了量纲分析约束和深度强化学习,用于无监督地从数据中发现符号解析函数。此外,我们介绍了第一个Class SR基准,包括一系列专门设计用于评估此类算法的合成物理挑战。我们通过将新方法应用于这些基准挑战来展示其有效性,并通过从一组近似恒星流的模拟轨道中成功提取解析的星系势,展示其在天体物理学中的实用价值。

更新时间: 2024-06-17 20:58:31

领域: cs.LG,astro-ph.GA,astro-ph.IM,physics.comp-ph

下载: http://arxiv.org/abs/2312.01816v2

Mutual Learning for Finetuning Click-Through Rate Prediction Models

Click-Through Rate (CTR) prediction has become an essential task in digital industries, such as digital advertising or online shopping. Many deep learning-based methods have been implemented and have become state-of-the-art models in the domain. To further improve the performance of CTR models, Knowledge Distillation based approaches have been widely used. However, most current CTR prediction models do not have very complex architectures, so it's hard to call one of them 'cumbersome' and the other one 'tiny'. On the other hand, the performance gap between complex and simple models is also not very large. So, distilling knowledge from one model to the other may not be worth the effort. Under these considerations, Mutual Learning could be a better approach, since all the models could be improved mutually. In this paper, we show how useful the mutual learning algorithm can be when it is applied between equals. In our experiments on the Criteo and Avazu datasets, the mutual learning algorithm improved the performance of the model by up to 0.66% relative improvement.
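
A minimal sketch of deep mutual learning between two peer CTR models, where each peer's loss adds a KL term pulling it toward the other's (detached) prediction; the symmetric weighting lam below is an illustrative choice:

    import torch
    import torch.nn.functional as F

    def mutual_losses(logits_a, logits_b, labels, lam=1.0):
        """Each peer's loss: BCE plus KL toward the other peer's Bernoulli
        prediction (teacher side detached)."""
        def kl(p_student, p_teacher):
            q = p_teacher.detach().clamp(1e-6, 1 - 1e-6)
            p = p_student.clamp(1e-6, 1 - 1e-6)
            return (q * (q / p).log() + (1 - q) * ((1 - q) / (1 - p)).log()).mean()
        p_a, p_b = torch.sigmoid(logits_a), torch.sigmoid(logits_b)
        loss_a = F.binary_cross_entropy_with_logits(logits_a, labels) + lam * kl(p_a, p_b)
        loss_b = F.binary_cross_entropy_with_logits(logits_b, labels) + lam * kl(p_b, p_a)
        return loss_a, loss_b

    logits_a, logits_b = torch.randn(16), torch.randn(16)
    labels = torch.randint(0, 2, (16,)).float()
    print(mutual_losses(logits_a, logits_b, labels))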

Updated: 2024-06-17 20:56:30

标题: 相互学习用于微调点击率预测模型

摘要: 点击率(CTR)预测已成为数字行业中的重要任务,如数字广告或在线购物。许多基于深度学习的方法已经被实施,并已成为该领域的最新模型。为了进一步提高CTR模型的性能,基于知识蒸馏的方法已被广泛使用。然而,大多数当前的CTR预测模型并没有非常复杂的架构,因此很难将其中一个称为“繁琐”而另一个称为“微小”。另一方面,复杂模型和简单模型之间的性能差距也不是很大。因此,从一个模型中提取知识到另一个模型可能不值得努力。在这些考虑下,相互学习可能是一个更好的方法,因为所有模型都可以相互改进。在本文中,我们展示了当相互学习算法在相等的情况下时可以多么有用。在我们对Criteo和Avazu数据集的实验中,相互学习算法将模型的性能提高了最多0.66%的相对改进。

更新时间: 2024-06-17 20:56:30

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2406.12087v1

BiLO: Bilevel Local Operator Learning for PDE inverse problems

We propose a new neural network based method for solving inverse problems for partial differential equations (PDEs) by formulating the PDE inverse problem as a bilevel optimization problem. At the upper level, we minimize the data loss with respect to the PDE parameters. At the lower level, we train a neural network to locally approximate the PDE solution operator in the neighborhood of a given set of PDE parameters, which enables an accurate approximation of the descent direction for the upper level optimization problem. The lower level loss function includes the L2 norms of both the residual and its derivative with respect to the PDE parameters. We apply gradient descent simultaneously on both the upper and lower level optimization problems, leading to an effective and fast algorithm. The method, which we refer to as BiLO (Bilevel Local Operator learning), is also able to efficiently infer unknown functions in the PDEs through the introduction of an auxiliary variable. Through extensive experiments over multiple PDE systems, we demonstrate that our method enforces strong PDE constraints, is robust to sparse and noisy data, and eliminates the need to balance the residual and the data loss, which is inherent to the soft PDE constraints in many existing methods.
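
A deliberately simplified 1-D toy of the bilevel idea, assuming a first-order ODE u'(x) = theta * u(x) with unknown theta: the network is trained on the operator residual while theta is recovered from the data fit, with both updated by simultaneous gradient descent. BiLO's actual scheme is richer (the lower-level loss also penalizes the residual's derivative with respect to the PDE parameters, and the levels are coupled through the local solution operator rather than a plain sum of losses):

    import torch

    net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                              torch.nn.Linear(32, 1))
    theta = torch.tensor(0.5, requires_grad=True)       # unknown PDE parameter
    opt_w = torch.optim.Adam(net.parameters(), lr=1e-3)
    opt_t = torch.optim.Adam([theta], lr=1e-2)

    x_col = torch.linspace(0, 1, 64).unsqueeze(1)       # collocation points
    x_dat = torch.linspace(0, 1, 16).unsqueeze(1)
    y_dat = torch.exp(1.5 * x_dat)                      # data from true theta = 1.5

    def u(x):                                           # bake in u(0) = 1
        return 1.0 + x * net(x)

    for step in range(3000):
        x = x_col.clone().requires_grad_(True)
        ux = u(x)
        du = torch.autograd.grad(ux.sum(), x, create_graph=True)[0]
        residual = ((du - theta * ux) ** 2).mean()      # operator (lower-level) loss
        data = ((u(x_dat) - y_dat) ** 2).mean()         # data (upper-level) loss
        opt_w.zero_grad(); opt_t.zero_grad()
        (residual + data).backward()
        opt_w.step(); opt_t.step()
    print(f"theta after training: {theta.item():.3f} (true value 1.5)")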

Updated: 2024-06-17 20:49:42

标题: BiLO: 双层局部算子学习用于PDE反问题

摘要: 我们提出了一种基于神经网络的新方法,通过将PDE反问题表述为双层优化问题来求解偏微分方程(PDE)的反问题。在上层,我们针对PDE参数最小化数据损失。在下层,我们训练神经网络在给定一组PDE参数的邻域内局部逼近PDE解算子,从而能够准确逼近上层优化问题的下降方向。下层损失函数包括残差及其对PDE参数导数的L2范数。我们同时对上层和下层优化问题应用梯度下降,得到一种有效且快速的算法。我们称之为BiLO(双层局部算子学习)的方法还能够通过引入辅助变量,有效地推断PDE中的未知函数。通过对多个PDE系统进行广泛的实验,我们证明了我们的方法能强制执行严格的PDE约束,对稀疏和带噪数据具有鲁棒性,并消除了在残差与数据损失之间进行权衡的需要,而这种权衡是许多现有方法中软PDE约束所固有的。

更新时间: 2024-06-17 20:49:42

领域: cs.LG,cs.NA,math.NA,math.OC,65M32,I.2.6; G.1.8

下载: http://arxiv.org/abs/2404.17789v2

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs' reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.

Updated: 2024-06-17 20:49:35

标题: 当推理遇上信息聚合:以体育叙事为例研究

摘要: 推理在LLM准确聚合相关信息时最为强大。我们通过要求LLM分析体育叙事来检验信息聚合在推理中的关键作用。为了成功完成这项任务,LLM必须从行动中推断出分数,识别相关实体,准确地将分数归因给球员和球队,并汇编关键统计数据以得出结论。我们使用真实的NBA篮球数据进行全面实验,并提出SportsGen,一种合成比赛叙事的新方法。通过合成数据,我们可以严格评估LLM在具有不同叙事长度和信息密度的复杂情景下的推理能力。我们的研究结果显示,包括GPT-4o在内的大多数模型经常由于频繁的得分模式而无法准确聚合篮球比分。像Llama-3这样的开源模型进一步受到重大得分幻觉的影响。最后,推理的有效性受到叙事复杂性,信息密度和领域特定术语的影响,突显了分析推理任务中的挑战。

更新时间: 2024-06-17 20:49:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.12084v1

Uncertainty modeling for fine-tuned implicit functions

Implicit functions such as Neural Radiance Fields (NeRFs), occupancy networks, and signed distance functions (SDFs) have become pivotal in computer vision for reconstructing detailed object shapes from sparse views. Achieving optimal performance with these models can be challenging due to the extreme sparsity of inputs and distribution shifts induced by data corruptions. To this end, large, noise-free synthetic datasets can serve as shape priors to help models fill in gaps, but the resulting reconstructions must be approached with caution. Uncertainty estimation is crucial for assessing the quality of these reconstructions, particularly in identifying areas where the model is uncertain about the parts it has inferred from the prior. In this paper, we introduce Dropsembles, a novel method for uncertainty estimation in tuned implicit functions. We demonstrate the efficacy of our approach through a series of experiments, starting with toy examples and progressing to a real-world scenario. Specifically, we train a Convolutional Occupancy Network on synthetic anatomical data and test it on low-resolution MRI segmentations of the lumbar spine. Our results show that Dropsembles achieve the accuracy and calibration levels of deep ensembles but with significantly less computational cost.
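
A sketch of the dropout-ensemble mechanics (the paper additionally fine-tunes each sampled member; that step is omitted here): several dropout masks are sampled from one trained network, and the spread of their predictions is read as uncertainty:

    import torch

    net = torch.nn.Sequential(
        torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Dropout(0.2),
        torch.nn.Linear(64, 1), torch.nn.Sigmoid())     # toy occupancy network

    net.train()                         # keep dropout stochastic at inference
    xyz = torch.rand(1024, 3)           # query points
    with torch.no_grad():
        members = torch.stack([net(xyz) for _ in range(8)])  # 8 sampled masks
    occupancy_mean, occupancy_var = members.mean(0), members.var(0)
    print(occupancy_mean.shape, float(occupancy_var.mean()))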

Updated: 2024-06-17 20:46:18

标题: 微调隐式函数的不确定性建模

摘要: 隐式函数,如神经辐射场(NeRFs)、占用网络和有符号距离函数(SDFs),已成为计算机视觉中重要的工具,用于从稀疏视图中重建详细的物体形状。由于输入的极度稀疏和数据损坏引起的分布偏移,要实现这些模型的最佳性能可能具有挑战性。为此,大型、无噪声的合成数据集可以作为形状先验,帮助模型填补空白,但得到的重建结果必须谨慎对待。不确定性估计对于评估这些重建结果的质量至关重要,特别是在识别模型在从先验中推断出的部分方面存在不确定性的地方。在本文中,我们介绍了Dropsembles,一种用于调整隐式函数的不确定性估计的新方法。通过一系列实验,我们展示了我们方法的有效性,从玩具示例开始,逐渐过渡到真实世界场景。具体而言,我们在合成解剖数据上训练了一个卷积占用网络,并在腰椎的低分辨率MRI分割上进行了测试。我们的结果显示,Dropsembles实现了深度合奏的准确性和校准水平,但计算成本显著降低。

更新时间: 2024-06-17 20:46:18

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.12082v1

You Only Need One Color Space: An Efficient Network for Low-light Image Enhancement

The Low-Light Image Enhancement (LLIE) task aims to restore the details and visual information from corrupted low-light images. Most existing methods learn the mapping function between low/normal-light images by Deep Neural Networks (DNNs) on sRGB and HSV color space. Nevertheless, enhancement involves amplifying image signals, and applying these color spaces to low-light images with a low signal-to-noise ratio can introduce sensitivity and instability into the enhancement process. Consequently, this results in the presence of color artifacts and brightness artifacts in the enhanced images. To alleviate this problem, we propose a novel trainable color space, named Horizontal/Vertical-Intensity (HVI). It not only decouples brightness and color from RGB channels to mitigate the instability during enhancement but also adapts to low-light images in different illumination ranges due to the trainable parameters. Further, we design a novel Color and Intensity Decoupling Network (CIDNet) with two branches dedicated to processing the decoupled image brightness and color in the HVI space. Within CIDNet, we introduce the Lightweight Cross-Attention (LCA) module to facilitate interaction between image structure and content information in both branches, while also suppressing noise in low-light images. Finally, we conducted 22 quantitative and qualitative experiments to show that the proposed CIDNet outperforms the state-of-the-art methods on 11 datasets. The code is available at https://github.com/Fediory/HVI-CIDNet.
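
As a loose illustration of decoupling brightness from color (this is not the paper's HVI transform, only a toy max-channel decomposition, with a scalar k standing in for the trainable parameters):

    import torch

    def decouple(rgb, k=1.0):               # rgb: [B, 3, H, W] in [0, 1]
        intensity, _ = rgb.max(dim=1, keepdim=True)   # brightness channel
        chroma = k * (rgb - intensity)                # zero for achromatic pixels
        return intensity, chroma

    def recouple(intensity, chroma, k=1.0):
        return chroma / k + intensity

    x = torch.rand(2, 3, 8, 8)
    i, c = decouple(x)
    print(torch.allclose(recouple(i, c), x, atol=1e-6))   # exactly invertible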

Updated: 2024-06-17 20:43:01

标题: 您只需要一个颜色空间:一种用于低光图像增强的高效网络

摘要: 低光图像增强(LLIE)任务旨在从受损的低光图像中恢复细节和视觉信息。大多数现有方法通过深度神经网络(DNNs)在sRGB和HSV颜色空间上学习低/正常光图像之间的映射函数。然而,增强涉及放大图像信号,将这些颜色空间应用于信噪比低的低光图像,可能会给增强过程引入敏感性和不稳定性。因此,这导致增强图像中存在颜色伪影和亮度伪影。为了缓解这一问题,我们提出了一种新颖的可训练颜色空间,名为水平/垂直强度(HVI)。它不仅解耦了RGB通道中的亮度和颜色以减轻增强过程中的不稳定性,而且由于可训练参数,还适应了不同光照范围内的低光图像。此外,我们设计了一种新颖的颜色和强度解耦网络(CIDNet),其中包括两个分支,专门用于处理HVI空间中解耦的图像亮度和颜色。在CIDNet中,我们引入了轻量级交叉注意力(LCA)模块,以促进两个分支中的图像结构和内容信息之间的交互,同时抑制低光图像中的噪声。最后,我们进行了22个定量和定性实验,表明所提出的CIDNet在11个数据集上优于现有方法。代码可在https://github.com/Fediory/HVI-CIDNet找到。

更新时间: 2024-06-17 20:43:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2402.05809v3

Multi-Dimensional Pruning: Joint Channel, Layer and Block Pruning with Latency Constraint

As we push the boundaries of performance in various vision tasks, the models grow in size correspondingly. To keep up with this growth, we need very aggressive pruning techniques for efficient inference and deployment on edge devices. Existing pruning approaches are limited to channel pruning and struggle with aggressive parameter reductions. In this paper, we propose a novel multi-dimensional pruning framework that jointly optimizes pruning across channels, layers, and blocks while adhering to latency constraints. We develop a latency modeling technique that accurately captures model-wide latency variations during pruning, which is crucial for achieving an optimal latency-accuracy trade-offs at high pruning ratio. We reformulate pruning as a Mixed-Integer Nonlinear Program (MINLP) to efficiently determine the optimal pruned structure with only a single pass. Our extensive results demonstrate substantial improvements over previous methods, particularly at large pruning ratios. In classification, our method significantly outperforms prior art HALP with a Top-1 accuracy of 70.0(v.s. 68.6) and an FPS of 5262 im/s(v.s. 4101 im/s). In 3D object detection, we establish a new state-of-the-art by pruning StreamPETR at a 45% pruning ratio, achieving higher FPS (37.3 vs. 31.7) and mAP (0.451 vs. 0.449) than the dense baseline.

Updated: 2024-06-17 20:40:09

标题: 多维修剪:具有延迟约束的联合通道、层和块修剪

摘要: 随着我们在各种视觉任务中不断突破性能边界,模型的规模相应增长。为了跟上这种增长,我们需要非常激进的修剪技术,以实现在边缘设备上高效推断和部署。现有的修剪方法仅限于通道修剪,并且在进行激进参数减少时存在困难。在本文中,我们提出了一种新颖的多维修剪框架,同时优化通道、层和块的修剪,并遵守延迟约束。我们开发了一种延迟建模技术,准确捕捉修剪过程中模型整体延迟的变化,这对于在高修剪比率下实现最佳延迟-准确性权衡至关重要。我们将修剪重新表述为混合整数非线性规划(MINLP),以仅需一次通过便能有效确定最佳修剪结构。我们广泛的结果显示,相比先前的方法,我们取得了实质性的改进,特别是在大修剪比率下。在分类中,我们的方法在Top-1准确率方面明显优于先前的HALP技术,达到了70.0(相对于68.6),FPS为5262 im/s(相对于4101 im/s)。在3D物体检测中,我们通过对StreamPETR进行45%的修剪比率,建立了一个新的最先进技术,实现了比密集基线更高的FPS(37.3 vs. 31.7)和mAP(0.451 vs. 0.449)。

更新时间: 2024-06-17 20:40:09

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.12079v1

Conformance Checking of Fuzzy Logs against Declarative Temporal Specifications

Traditional conformance checking tasks assume that event data provide a faithful and complete representation of the actual process executions. This assumption has been recently questioned: more and more often events are not traced explicitly, but are instead indirectly obtained as the result of event recognition pipelines, and thus inherently come with uncertainty. In this work, differently from the typical probabilistic interpretation of uncertainty, we consider the relevant case where uncertainty refers to which activity is actually conducted, under a fuzzy semantics. In this novel setting, we consider the problem of checking whether fuzzy event data conform with declarative temporal rules specified as Declare patterns or, more generally, as formulae of linear temporal logic over finite traces (LTLf). This requires to relax the assumption that at each instant only one activity is executed, and to correspondingly redefine boolean operators of the logic with a fuzzy semantics. Specifically, we provide a threefold contribution. First, we define a fuzzy counterpart of LTLf tailored to our purpose. Second, we cast conformance checking over fuzzy logs as a verification problem in this logic. Third, we provide a proof-of-concept, efficient implementation based on the PyTorch Python library, suited to check conformance of multiple fuzzy traces at once.
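
A small sketch of one plausible fuzzy semantics over batched traces, with tensors of membership degrees in [0, 1] shaped [batch, time, activities]; the specific t-norms and implication below are common choices and may differ from the paper's definitions:

    import torch

    def ev(phi):        # "eventually": max of phi over each suffix
        return phi.flip(1).cummax(1).values.flip(1)

    def glob(phi):      # "globally": min of phi over each suffix
        return phi.flip(1).cummin(1).values.flip(1)

    def implies(a, b):  # Kleene-Dienes fuzzy implication max(1 - a, b)
        return torch.maximum(1 - a, b)

    logs = torch.rand(32, 20, 5)      # 32 fuzzy traces, 20 instants, 5 activities
    a, b = logs[..., 0], logs[..., 1]
    degree = glob(implies(a, ev(b)))[:, 0]   # Declare "response" G(a -> F b) at t=0
    print(degree.shape)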

Updated: 2024-06-17 20:38:57

标题: 根据声明性时间规范检查模糊日志的一致性

摘要: 传统的符合性检查任务假设事件数据提供了实际流程执行的忠实和完整表示。最近这一假设受到了质疑:越来越多的事件并非明确跟踪,而是间接通过事件识别管道获得,因此固有地带有不确定性。在这项工作中,不同于对不确定性的典型概率解释,我们考虑了不确定性指的是哪个活动实际上被执行的相关情况,采用模糊语义。在这种新颖的设置中,我们考虑检查模糊事件数据是否符合作为Declare模式或更一般地作为线性时态逻辑公式(LTLf)的声明性时间规则。这需要放宽每个瞬间只执行一个活动的假设,并相应地重新定义具有模糊语义的逻辑的布尔运算符。具体地,我们提供了三方面的贡献。首先,我们定义了一个适合我们目的的LTLf的模糊对应物。其次,我们将模糊日志上的符合性检查转化为这种逻辑中的验证问题。第三,我们提供了一个基于PyTorch Python库的概念验证、高效的实现,适用于一次检查多个模糊跟踪的符合性。

更新时间: 2024-06-17 20:38:57

领域: cs.AI,cs.LO,68T27 (Primary) 68T27, 68T30, 68T37, 03B44 (Secondary),I.2.4; F.4.1

下载: http://arxiv.org/abs/2406.12078v1

SurfPro: Functional Protein Design Based on Continuous Surface

How can we design proteins with desired functions? We are motivated by a chemical intuition that both geometric structure and biochemical properties are critical to a protein's function. In this paper, we propose SurfPro, a new method to generate functional proteins given a desired surface and its associated biochemical properties. SurfPro comprises a hierarchical encoder that progressively models the geometric shape and biochemical features of a protein surface, and an autoregressive decoder to produce an amino acid sequence. We evaluate SurfPro on a standard inverse folding benchmark CATH 4.2 and two functional protein design tasks: protein binder design and enzyme design. Our SurfPro consistently surpasses previous state-of-the-art inverse folding methods, achieving a recovery rate of 57.78% on CATH 4.2 and higher success rates in terms of protein-protein binding and enzyme-substrate interaction scores.

Updated: 2024-06-17 20:20:58

标题: SurfPro:基于连续表面的功能蛋白设计

摘要: 我们如何设计具有所需功能的蛋白质?我们受到一种化学直觉的启发,即几何结构和生化性质对蛋白质的功能至关重要。在本文中,我们提出了SurfPro,一种新的方法,可以根据所需表面及其相关的生化性质生成功能蛋白质。SurfPro包括一个分层编码器,逐渐模拟蛋白质表面的几何形状和生化特征,以及一个自回归解码器,用于生成氨基酸序列。我们在标准的逆折叠基准CATH 4.2和两个功能蛋白设计任务上评估了SurfPro:蛋白结合物设计和酶设计。我们的SurfPro始终超越以往的最先进逆折叠方法,在CATH 4.2上实现了57.78%的恢复率,并在蛋白质-蛋白质结合和酶-底物相互作用评分方面取得更高的成功率。

更新时间: 2024-06-17 20:20:58

领域: q-bio.BM,cs.LG

下载: http://arxiv.org/abs/2405.06693v2

DTGB: A Comprehensive Benchmark for Dynamic Text-Attributed Graphs

Dynamic text-attributed graphs (DyTAGs) are prevalent in various real-world scenarios, where each node and edge are associated with text descriptions, and both the graph structure and text descriptions evolve over time. Despite their broad applicability, there is a notable scarcity of benchmark datasets tailored to DyTAGs, which hinders the potential advancement in many research fields. To address this gap, we introduce Dynamic Text-attributed Graph Benchmark (DTGB), a collection of large-scale, time-evolving graphs from diverse domains, with nodes and edges enriched by dynamically changing text attributes and categories. To facilitate the use of DTGB, we design standardized evaluation procedures based on four real-world use cases: future link prediction, destination node retrieval, edge classification, and textual relation generation. These tasks require models to understand both dynamic graph structures and natural language, highlighting the unique challenges posed by DyTAGs. Moreover, we conduct extensive benchmark experiments on DTGB, evaluating 7 popular dynamic graph learning algorithms and their variants of adapting to text attributes with LLM embeddings, along with 6 powerful large language models (LLMs). Our results show the limitations of existing models in handling DyTAGs. Our analysis also demonstrates the utility of DTGB in investigating the incorporation of structural and textual dynamics. The proposed DTGB fosters research on DyTAGs and their broad applications. It offers a comprehensive benchmark for evaluating and advancing models to handle the interplay between dynamic graph structures and natural language. The dataset and source code are available at https://github.com/zjs123/DTGB.

Updated: 2024-06-17 20:16:12

标题: DTGB:动态文本属性图的综合基准

摘要: 动态文本属性图(DyTAGs)在各种现实场景中普遍存在,其中每个节点和边都与文本描述相关联,图结构和文本描述均随时间演变。尽管它们具有广泛的适用性,但专门针对DyTAGs的基准数据集相对稀缺,这阻碍了许多研究领域的潜在进展。为了填补这一空白,我们引入了动态文本属性图基准(DTGB),这是一个包含大规模、时间演变的图的集合,涵盖不同领域,节点和边通过动态变化的文本属性和类别丰富起来。为了促进DTGB的使用,我们设计了基于四个真实用例的标准化评估程序:未来链接预测,目标节点检索,边分类和文本关系生成。这些任务要求模型理解动态图结构和自然语言,突显了DyTAGs带来的独特挑战。此外,我们在DTGB上进行了广泛的基准实验,评估了7种流行的动态图学习算法及其适应LLM嵌入文本属性的变体,以及6种强大的大型语言模型(LLMs)。我们的结果表明现有模型在处理DyTAGs方面存在局限性。我们的分析还展示了DTGB在研究结构和文本动态整合方面的实用性。提出的DTGB促进了对DyTAGs及其广泛应用的研究。它为评估和推进处理动态图结构与自然语言之间相互作用的模型提供了全面的基准。数据集和源代码可在https://github.com/zjs123/DTGB 上找到。

更新时间: 2024-06-17 20:16:12

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.12072v1

Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

Enzymes are genetically encoded biocatalysts capable of accelerating chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach to learn a unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and their three-dimensional (3D) coordinates based on functionally important sites and substrates corresponding to a desired catalytic function. These sites are automatically mined from enzyme databases. EnzyGen consists of a novel interleaving network of attention and neighborhood equivariant layers, which captures both long-range correlation in an entire protein sequence and local influence from nearest amino acids in 3D space. To learn the generative model, we devise a joint training objective, including a sequence generation loss, a position prediction loss and an enzyme-substrate interaction loss. We further construct EnzyBench, a dataset with 3157 enzyme families, covering all available enzymes within the protein data bank (PDB). Experimental results show that our EnzyGen consistently achieves the best performance across all 323 testing families, surpassing the best baseline by 10.79% in terms of substrate binding affinity. These findings demonstrate EnzyGen's superior capability in designing well-folded and effective enzymes binding to specific substrates with high affinities.

Updated: 2024-06-17 20:14:34

标题: 由功能重要位点和小分子底物引导的生成酶设计

摘要: 酶是一种通过基因编码的生物催化剂,能够加速化学反应。我们如何自动设计功能性酶?在本文中,我们提出了EnzyGen,这是一种学习统一模型以跨所有功能家族设计酶的方法。我们的关键思想是基于功能重要位点和对应于所需催化功能的底物,生成酶的氨基酸序列及其三维坐标。这些位点是从酶数据库中自动挖掘出来的。EnzyGen由一个新颖的注意力层与邻域等变层交错的网络组成,既捕捉整个蛋白质序列中的长程相关性,也捕捉三维空间中最近邻氨基酸的局部影响。为了学习生成模型,我们设计了一个联合训练目标,包括序列生成损失、位置预测损失和酶-底物相互作用损失。我们进一步构建了EnzyBench,一个包含3157个酶家族的数据集,涵盖了蛋白质数据库(PDB)中所有可用的酶。实验结果表明,我们的EnzyGen在所有323个测试家族中始终表现最佳,以底物结合亲和力衡量,超过最佳基线10.79%。这些发现表明,EnzyGen在设计折叠良好、能以高亲和力结合特定底物的有效酶方面具有出色的能力。

更新时间: 2024-06-17 20:14:34

领域: cs.LG

下载: http://arxiv.org/abs/2405.08205v2

STNAGNN: Spatiotemporal Node Attention Graph Neural Network for Task-based fMRI Analysis

Task-based fMRI uses actions or stimuli to trigger task-specific brain responses and measures them using BOLD contrast. Despite the significant task-induced spatiotemporal brain activation fluctuations, most studies on task-based fMRI ignore the task context information aligned with fMRI and consider task-based fMRI a coherent sequence. In this paper, we show that using the task structures as data-driven guidance is effective for spatiotemporal analysis. We propose STNAGNN, a GNN-based spatiotemporal architecture, and validate its performance in an autism classification task. The trained model is also interpreted for identifying autism-related spatiotemporal brain biomarkers.

Updated: 2024-06-17 20:08:05

标题: STNAGNN:基于任务的fMRI分析的时空节点注意力图神经网络

摘要: 任务驱动的fMRI利用动作或刺激来触发特定任务的大脑反应,并使用BOLD对比度来衡量这些反应。尽管任务引起显著的时空大脑激活波动,大多数关于任务驱动fMRI的研究忽略了与fMRI对齐的任务背景信息,并将任务驱动fMRI视为一个连贯的序列。在本文中,我们展示了使用任务结构作为数据驱动指导对时空分析是有效的。我们提出了STNAGNN,一种基于GNN的时空架构,并验证了其在自闭症分类任务中的性能。训练模型还被解释为识别与自闭症相关的时空大脑生物标志物。

更新时间: 2024-06-17 20:08:05

领域: cs.LG,q-bio.NC

下载: http://arxiv.org/abs/2406.12065v1

Entropic Regression DMD (ERDMD) Discovers Informative Sparse and Nonuniformly Time Delayed Models

In this work, we present a method which determines optimal multi-step dynamic mode decomposition (DMD) models via entropic regression, a nonlinear information flow detection algorithm. Motivated by the higher-order DMD (HODMD) method of \cite{clainche}, and the entropic regression (ER) technique for network detection and model construction found in \cite{bollt, bollt2}, we develop a method that we call ERDMD that produces high fidelity time-delay DMD models that allow for nonuniform time spacing, where the spacing is discovered by selecting the most informative delays according to ER. These models are shown to be highly efficient and robust. We test our method over several data sets generated by chaotic attractors and show that we are able to build excellent reconstructions using relatively minimal models. We likewise are able to better identify multiscale features via our models, which enhances the utility of dynamic mode decomposition.
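
For context, a minimal exact DMD in NumPy (ERDMD's contribution, the entropic-regression selection of sparse, nonuniform delays, happens before this fitting step and is not shown):

    import numpy as np

    def dmd(X, Y, r):
        """Exact DMD: eigen-decompose the rank-r operator A with Y ~ A X."""
        U, s, Vh = np.linalg.svd(X, full_matrices=False)
        U, s, Vh = U[:, :r], s[:r], Vh[:r]
        A_tilde = U.conj().T @ Y @ Vh.conj().T / s      # A projected to POD basis
        eigvals, W = np.linalg.eig(A_tilde)
        modes = Y @ Vh.conj().T / s @ W
        return eigvals, modes

    dt = 0.05
    t = np.arange(0, 20, dt)
    data = np.vstack([np.sin(t), np.cos(t)])            # a pure rotation
    eigvals, _ = dmd(data[:, :-1], data[:, 1:], r=2)
    print(eigvals, "expected:", np.exp(1j * dt), np.exp(-1j * dt))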

Updated: 2024-06-17 20:02:43

标题: 熵回归DMD(ERDMD)发现信息丰富的稀疏非均匀时滞模型

摘要: 在这项工作中,我们提出了一种通过熵回归确定最佳多步动态模态分解(DMD)模型的方法,该方法是一种非线性信息流检测算法。受\cite{clainche}中高阶DMD(HODMD)方法和\cite{bollt, bollt2}中用于网络检测和模型构建的熵回归(ER)技术的启发,我们开发了一种称为ERDMD的方法,该方法产生了高保真度的时滞DMD模型,允许非均匀的时间空间,并且时间间隔是通过基于ER考虑最具信息性来发现的。这些模型被证明非常高效和稳健。我们通过几组由混沌吸引子生成的数据集对我们的方法进行了测试,并展示我们能够使用相对较少的模型构建出优秀的重建。同样,我们能够通过我们的模型更好地识别多尺度特征,从而提高了动态模态分解的实用性。

更新时间: 2024-06-17 20:02:43

领域: stat.ML,cs.LG,nlin.CD

下载: http://arxiv.org/abs/2406.12062v1

Not Eliminate but Aggregate: Post-Hoc Control over Mixture-of-Experts to Address Shortcut Shifts in Natural Language Understanding

Recent models for natural language understanding are inclined to exploit simple patterns in datasets, commonly known as shortcuts. These shortcuts hinge on spurious correlations between labels and latent features existing in the training data. At inference time, shortcut-dependent models are likely to generate erroneous predictions under distribution shifts, particularly when some latent features are no longer correlated with the labels. To avoid this, previous studies have trained models to eliminate the reliance on shortcuts. In this study, we explore a different direction: pessimistically aggregating the predictions of a mixture-of-experts, assuming each expert captures relatively different latent features. The experimental results demonstrate that our post-hoc control over the experts significantly enhances the model's robustness to the distribution shift in shortcuts. Besides, we show that our approach has some practical advantages. We also analyze our model and provide results to support the assumption.
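
One way to read "pessimistic aggregation" is a lower envelope over the experts' class probabilities, renormalized; this is an illustrative variant, and the paper's exact aggregation rule may differ:

    import torch

    expert_probs = torch.softmax(torch.randn(4, 32, 3), dim=-1)  # [experts, batch, classes]
    lower_envelope = expert_probs.min(dim=0).values              # pessimistic per class
    agg = lower_envelope / lower_envelope.sum(dim=-1, keepdim=True)
    predictions = agg.argmax(dim=-1)
    print(predictions.shape)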

Updated: 2024-06-17 20:00:04

标题: 不是消除而是聚合:对专家混合的事后控制以应对自然语言理解中的捷径转移

摘要: 最近的自然语言理解模型倾向于利用数据集中的简单模式,通常被称为捷径。这些捷径依赖于训练数据中存在的标签和潜在特征之间的虚假相关性。在推断时,依赖捷径的模型在分布转移下很可能生成错误的预测,特别是当一些潜在特征不再与标签相关时。为了避免这种情况,先前的研究已经训练模型消除对捷径的依赖。在本研究中,我们探索了一个不同的方向:悲观地聚合专家组合的预测,假设每个专家捕捉相对不同的潜在特征。实验结果表明,我们对专家的事后控制显著提高了模型对捷径分布转移的鲁棒性。此外,我们展示了我们的方法具有一些实际优势。我们还分析了我们的模型并提供支持假设的结果。

更新时间: 2024-06-17 20:00:04

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.12060v1

ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model

Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNN are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally intensive, making them costly to train and deploy on large datasets. Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks, which can effectively compensate for the shortcomings of the above two architectures. In this paper, we explore for the first time the potential of the Mamba architecture for remote sensing CD tasks. We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), respectively. All three frameworks adopt the cutting-edge Visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images. For the change decoder, which is available in all three architectures, we propose three spatio-temporal relationship modeling mechanisms, which can be naturally combined with the Mamba architecture and fully utilize its attribute to achieve spatio-temporal interaction of multi-temporal features, thereby obtaining accurate change information. On five benchmark datasets, our proposed frameworks outperform current CNN- and Transformer-based approaches without using any complex training strategies or tricks, fully demonstrating the potential of the Mamba architecture in CD tasks. Further experiments show that our architecture is quite robust to degraded data. The source code will be available in https://github.com/ChenHongruixuan/MambaCD

Updated: 2024-06-17 19:57:36

标题: ChangeMamba:基于时空状态空间模型的遥感变化检测

摘要: 卷积神经网络(CNN)和Transformer在遥感变化检测(CD)领域取得了令人瞩目的进展。然而,这两种架构都存在固有的缺点:CNN受限于有限的感受野,可能会影响其捕获更广泛空间背景的能力,而Transformer计算量大,使其在大型数据集上训练和部署成本高昂。最近,基于状态空间模型的Mamba架构在一系列自然语言处理任务中表现出卓越的性能,可以有效弥补上述两种架构的缺点。本文首次探索了Mamba架构在遥感CD任务中的潜力。我们为二元变化检测(BCD)、语义变化检测(SCD)和建筑损害评估(BDA)分别定制了相应的框架,称为MambaBCD、MambaSCD和MambaBDA。这三个框架都采用最前沿的Visual Mamba架构作为编码器,允许从输入图像中全面学习全局空间背景信息。对于三种架构中都可用的变化解码器,我们提出了三种时空关系建模机制,可以自然地与Mamba架构结合,并充分利用其属性实现多时相特征的时空交互,从而获得准确的变化信息。在五个基准数据集上,我们提出的框架在不使用任何复杂的训练策略或技巧的情况下,优于当前基于CNN和Transformer的方法,充分展示了Mamba架构在CD任务中的潜力。进一步的实验表明,我们的架构对降质数据非常稳健。源代码将在https://github.com/ChenHongruixuan/MambaCD中提供。

更新时间: 2024-06-17 19:57:36

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2404.03425v4

A Scalable and Effective Alternative to Graph Transformers

Graph Neural Networks (GNNs) have shown impressive performance in graph representation learning, but they face challenges in capturing long-range dependencies due to their limited expressive power. To address this, Graph Transformers (GTs) were introduced, utilizing self-attention mechanism to effectively model pairwise node relationships. Despite their advantages, GTs suffer from quadratic complexity w.r.t. the number of nodes in the graph, hindering their applicability to large graphs. In this work, we present Graph-Enhanced Contextual Operator (GECO), a scalable and effective alternative to GTs that leverages neighborhood propagation and global convolutions to effectively capture local and global dependencies in quasilinear time. Our study on synthetic datasets reveals that GECO reaches 169x speedup on a graph with 2M nodes w.r.t. optimized attention. Further evaluations on diverse range of benchmarks showcase that GECO scales to large graphs where traditional GTs often face memory and time limitations. Notably, GECO consistently achieves comparable or superior quality compared to baselines, improving the SOTA up to 4.5%, and offering a scalable and effective solution for large-scale graph learning.

Updated: 2024-06-17 19:57:34

标题: 一个可扩展且有效的图转换器替代方案

摘要: 图神经网络(GNNs)在图表示学习方面表现出色,但由于其有限的表达能力,面临着捕捉长程依赖关系的挑战。为了解决这个问题,引入了图变换器(GTs),利用自注意机制有效地建模成对节点关系。尽管具有优势,GTs在图中节点数量方面存在二次复杂度,限制了它们在大型图中的适用性。在这项工作中,我们提出了图增强上下文运算符(GECO),这是GTs的一种可扩展和有效的替代方案,利用邻域传播和全局卷积来有效地捕捉局部和全局依赖关系,且时间复杂度为准线性。我们在合成数据集上的研究显示,相对于优化的关注力,GECO在具有2M节点的图上实现了169倍的加速。对各种基准测试的进一步评估表明,GECO能够扩展到大型图,而传统的GTs通常面临内存和时间限制。值得注意的是,GECO始终实现了与基线相媲美或更好的质量,将SOTA提高了高达4.5%,为大规模图学习提供了可扩展和有效的解决方案。

更新时间: 2024-06-17 19:57:34

领域: cs.LG,cs.SI

下载: http://arxiv.org/abs/2406.12059v1

WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions

Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelity of these models and their effect on ground truth explanations. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WD). We focus on two mental health and well-being datasets: (a) Multi-label Classification-based MultiWD, and (b) WellXplain for evaluating attention mechanism veracity against expert-labeled explanations. The labels are based on Halbert Dunn's theory of wellness, which gives grounding to our evaluation. We reveal four surprising results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4 lag behind RoBERTa, and MedAlpaca, a fine-tuned LLM fails to deliver any remarkable improvements in performance or explanations. (2) Re-examining LMs' predictions based on a confidence-oriented loss function reveals a significant performance drop. (3) Across all LMs/LLMs, the alignment between attention and explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental health-specific LMs/LLMs overlook domain-specific knowledge and undervalue explanations, causing these discrepancies. This study highlights the need for further research into their consistency and explanations in mental health and well-being.

Updated: 2024-06-17 19:50:40

标题: WellDunn:关于语言模型和大型语言模型在识别健康维度方面的稳健性和可解释性

摘要: 语言模型(LMs)正在被提议用于心理健康应用领域,其中不良后果的高风险意味着预测性能可能不足以作为模型在临床实践中实用性的检验标准。一个可信赖地用于实践的模型,其解释应与临床判定相一致,然而此前没有研究考察过这些模型的注意力忠实度及其对真实解释的影响。我们提出了一个评估设计,重点关注LMs在识别健康维度(WD)方面的稳健性和可解释性。我们关注两个心理健康与幸福数据集:(a)基于多标签分类的MultiWD,和(b)用于对照专家标注解释来评估注意力机制真实性的WellXplain。标签基于哈尔伯特·邓恩(Halbert Dunn)的健康理论,这为我们的评估提供了基础。我们揭示了关于LMs/LLMs的四个令人惊讶的结果:(1)尽管具有类似人类的能力,GPT-3.5/4仍落后于RoBERTa,而经过微调的LLM MedAlpaca也未能在性能或解释方面带来任何显著改进。(2)基于置信度导向的损失函数重新审视LMs的预测,发现性能显著下降。(3)在所有LMs/LLMs中,注意力与解释之间的对齐程度仍然很低,LLMs的得分仅为糟糕的0.0。(4)大多数面向心理健康的LMs/LLMs忽视了领域特定知识并低估了解释的价值,导致了这些差异。这项研究强调了有必要进一步研究它们在心理健康与幸福领域的一致性和解释。

更新时间: 2024-06-17 19:50:40

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.12058v1

Learning Molecular Representation in a Cell

Predicting drug efficacy and safety in vivo requires information on biological responses (e.g., cell morphology and gene expression) to small molecule perturbations. However, current molecular representation learning methods do not provide a comprehensive view of cell states under these perturbations and struggle to remove noise, hindering model generalization. We introduce the Information Alignment (InfoAlign) approach to learn molecular representations through the information bottleneck method in cells. We integrate molecules and cellular response data as nodes into a context graph, connecting them with weighted edges based on chemical, biological, and computational criteria. For each molecule in a training batch, InfoAlign optimizes the encoder's latent representation with a minimality objective to discard redundant structural information. A sufficiency objective decodes the representation to align with different feature spaces from the molecule's neighborhood in the context graph. We demonstrate that the proposed sufficiency objective for alignment is tighter than existing encoder-based contrastive methods. Empirically, we validate representations from InfoAlign in two downstream tasks: molecular property prediction against up to 19 baseline methods across four datasets, plus zero-shot molecule-morphology matching.

Updated: 2024-06-17 19:48:42

标题: 学习细胞中的分子表示

摘要: 预测药物在体内的疗效和安全性需要关于生物反应(例如细胞形态和基因表达)对小分子干扰的信息。然而,当前的分子表示学习方法并不能全面展示细胞在这些干扰下的状态,并且难以消除噪音,从而阻碍了模型的泛化能力。我们引入了信息对齐(InfoAlign)方法,通过信息瓶颈方法在细胞中学习分子表示。我们将分子和细胞反应数据作为节点集成到上下文图中,通过化学、生物和计算准则基于加权边连接它们。对于每个训练批次中的分子,InfoAlign通过最小化目标优化编码器的潜在表示,以丢弃冗余的结构信息。一个充分性目标解码表示,使其与上下文图中分子邻域的不同特征空间对齐。我们证明了所提出的对齐的充分性目标比现有基于编码器的对比方法更紧密。在实证方面,我们在两个下游任务中验证了InfoAlign中的表示:与四个数据集中多达19种基线方法进行分子属性预测,以及零样本分子形态匹配。

更新时间: 2024-06-17 19:48:42

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2406.12056v1

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergence and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn Treebank, and online learning tasks on TinyImageNet and 5-dataset. Our code is available at \url{https://github.com/chandar-lab/CMOptimizer}.
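
A hedged sketch of the idea: alongside Adam's usual moments, keep a small buffer of recent momentum vectors and fold their average back into the update, so the optimizer overshoots narrow minima. The buffer policy and the scaling c below are illustrative choices, not the paper's exact rule:

    import torch
    from collections import deque

    def adam_cm_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, buffer_size=5, c=0.1):
        for p, g in zip(params, grads):
            st = state.setdefault(p, {"m": torch.zeros_like(p),
                                      "v": torch.zeros_like(p),
                                      "buf": deque(maxlen=buffer_size), "t": 0})
            st["t"] += 1
            st["m"] = betas[0] * st["m"] + (1 - betas[0]) * g
            st["v"] = betas[1] * st["v"] + (1 - betas[1]) * g * g
            m_hat = st["m"] / (1 - betas[0] ** st["t"])
            v_hat = st["v"] / (1 - betas[1] ** st["t"])
            st["buf"].append(m_hat.clone())
            m_agg = m_hat + c * sum(st["buf"]) / len(st["buf"])   # momentum memory
            p.data -= lr * m_agg / (v_hat.sqrt() + eps)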

Updated: 2024-06-17 19:47:35

标题: 促进记忆增强Adam中的探索:关键时刻的应用

摘要: 自适应梯度优化器,尤其是Adam,在训练大规模深度学习模型时留下了深刻的印记,提供了快速收敛和对超参数设置的稳健性。然而,它们通常在泛化方面遇到困难,这归因于它们收敛到损失景观中尖锐最小值的倾向。为了解决这个问题,我们提出了一种新的Adam的记忆增强版本,在训练过程中通过引入关键动量项的缓冲区来鼓励探索平坦最小值。这个缓冲区促使优化器超过狭窄的最小值,促进探索。通过在简单环境中进行全面分析,我们展示了我们的方法在增加探索和偏向更平坦最小值方面的有效性。我们实验证明,它可以提高在ImageNet和CIFAR10/100上的图像分类,Penn Treebank上的语言建模以及TinyImageNet和5-dataset上的在线学习任务的模型性能。我们的代码可以在\url{https://github.com/chandar-lab/CMOptimizer}上找到。

更新时间: 2024-06-17 19:47:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2307.09638v2

FAWN: Floor-And-Walls Normal Regularization for Direct Neural TSDF Reconstruction

Leveraging 3D semantics for direct 3D reconstruction has a great potential yet unleashed. For instance, by assuming that walls are vertical, and a floor is planar and horizontal, we can correct distorted room shapes and eliminate local artifacts such as holes, pits, and hills. In this paper, we propose FAWN, a modification of truncated signed distance function (TSDF) reconstruction methods, which considers scene structure by detecting walls and floor in a scene, and penalizing the corresponding surface normals for deviating from the horizontal and vertical directions. Implemented as a 3D sparse convolutional module, FAWN can be incorporated into any trainable pipeline that predicts TSDF. Since FAWN requires 3D semantics only for training, no additional limitations on further use are imposed. We demonstrate, that FAWN-modified methods use semantics more effectively, than existing semantic-based approaches. Besides, we apply our modification to state-of-the-art TSDF reconstruction methods, and demonstrate a quality gain in SCANNET, ICL-NUIM, TUM RGB-D, and 7SCENES benchmarks.
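
A sketch of what a floor-and-walls normal penalty could look like, assuming per-point normals and wall/floor masks are available: wall normals are pushed to have no vertical component, floor normals to align with the up axis (the loss weights and the normal extraction from the TSDF are left out):

    import torch

    def fawn_loss(normals, wall_mask, floor_mask, up=(0.0, 0.0, 1.0)):
        up = torch.tensor(up, dtype=normals.dtype, device=normals.device)
        n = torch.nn.functional.normalize(normals, dim=-1)
        vert = (n * up).sum(-1)                   # cosine with the up axis
        wall_pen = (vert[wall_mask] ** 2).mean() if wall_mask.any() else 0.0
        floor_pen = ((1 - vert[floor_mask].abs()) ** 2).mean() if floor_mask.any() else 0.0
        return wall_pen + floor_pen

    normals = torch.randn(1000, 3)
    wall_mask = torch.rand(1000) < 0.3
    floor_mask = (~wall_mask) & (torch.rand(1000) < 0.2)
    print(fawn_loss(normals, wall_mask, floor_mask))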

Updated: 2024-06-17 19:46:49

标题: FAWN:用于直接神经TSDF重建的地板和墙面法线正则化

摘要: 利用3D语义进行直接3D重建具有巨大潜力,但尚未被充分发挥。例如,通过假设墙壁是垂直的、地板是平面且水平的,我们可以纠正扭曲的房间形状,并消除孔洞、凹坑和隆起等局部伪影。在本文中,我们提出了FAWN,这是对截断有符号距离函数(TSDF)重建方法的一种修改,它通过检测场景中的墙壁和地板,并对偏离水平和垂直方向的相应表面法线进行惩罚,从而考虑场景结构。FAWN实现为一个3D稀疏卷积模块,可以集成到任何预测TSDF的可训练管道中。由于FAWN仅在训练时需要3D语义,因此不会对后续使用施加额外限制。我们证明,与现有基于语义的方法相比,经FAWN修改的方法能更有效地利用语义。此外,我们将这一修改应用于最先进的TSDF重建方法,并在SCANNET、ICL-NUIM、TUM RGB-D和7SCENES基准测试中展示了质量提升。

更新时间: 2024-06-17 19:46:49

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.12054v1

UniGLM: Training One Unified Language Model for Text-Attributed Graphs

Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored for individual TAG and cannot generalize across various graph scenarios. Given the shared textual space, leveraging multiple TAGs for joint fine-tuning, aligning text and graph structure from different aspects, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs with different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training by minimizing repetitive encoding calculations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM's efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in and out of domain scenarios). The code is available at https://github.com/NYUSHCS/UniGLM.

Updated: 2024-06-17 19:45:21

标题: UniGLM:为文本属性图训练一个统一的语言模型

摘要: 在文本属性图(TAGs)上进行表示学习,其中节点由文本描述表示,对于文本和关系知识系统以及推荐系统至关重要。目前,TAGs的最先进嵌入方法主要集中在使用结构感知训练信号对语言模型(例如BERT)进行微调。虽然有效,但这些方法是针对单个TAG定制的,无法在各种图情况下推广。鉴于共享文本空间,利用多个TAG进行联合微调,从不同角度对齐文本和图结构,将更有益。受此启发,我们引入了一种新颖的统一图语言模型(UniGLM)框架,这是第一个能够很好地推广到领域内和跨领域TAGs的图嵌入模型。具体来说,UniGLM通过使用自监督对比学习来训练多个不同领域和规模的TAGs。UniGLM包括一种自适应正样本选择技术,用于识别结构相似的节点,并且设计了一个懒惰对比模块,以通过最小化重复编码计算来加速训练。对9个基准TAG的广泛实证结果表明,在泛化(各种下游任务和骨干)和迁移学习(领域内外情景)方面,UniGLM在对比主要嵌入基线的有效性。代码可在https://github.com/NYUSHCS/UniGLM 获取。

更新时间: 2024-06-17 19:45:21

领域: cs.CL,cs.AI,cs.IR,cs.LG

下载: http://arxiv.org/abs/2406.12052v1

State-wise Constrained Policy Optimization

Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.

Updated: 2024-06-17 19:41:33

标题: 逐状态约束的策略优化

摘要: 强化学习(RL)算法在仿真环境中取得了巨大成功,但将其应用于实际问题面临重大挑战,安全性是一个主要关注点。特别是,对于许多具有挑战性的任务,如自动驾驶和机器人操作,强制执行基于状态的约束是至关重要的。然而,现有的在受限马尔可夫决策过程(CMDP)框架下的安全RL算法并未考虑基于状态的约束。为了填补这一空白,我们提出了基于状态约束的策略优化(SCPO)算法,这是第一个用于基于状态约束的强化学习的通用策略搜索算法。SCPO在期望中提供了基于状态的约束满足的保证。特别地,我们引入了最大马尔可夫决策过程的框架,并证明在SCPO下最坏情况下的安全违规受到限制。我们在广泛的机器人运动任务中训练神经网络策略,其中代理必须满足各种基于状态的安全约束,证明了我们方法的有效性。我们的结果表明,SCPO明显优于现有方法,并且能够处理高维度机器人任务中的基于状态的约束。

更新时间: 2024-06-17 19:41:33

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2306.12594v3

MEDeA: Multi-view Efficient Depth Adjustment

The majority of modern single-view depth estimation methods predict relative depth and thus cannot be directly applied in many real-world scenarios, despite impressive performance in the benchmarks. Moreover, single-view approaches cannot guarantee consistency across a sequence of frames. Consistency is typically addressed with test-time optimization of discrepancy across views; however, it takes hours to process a single scene. In this paper, we present MEDeA, an efficient multi-view test-time depth adjustment method, that is an order of magnitude faster than existing test-time approaches. Given RGB frames with camera parameters, MEDeA predicts initial depth maps, adjusts them by optimizing local scaling coefficients, and outputs temporally-consistent depth maps. Contrary to test-time methods requiring normals, optical flow, or semantics estimation, MEDeA produces high-quality predictions with a depth estimation network solely. Our method sets a new state-of-the-art on TUM RGB-D, 7Scenes, and ScanNet benchmarks and successfully handles smartphone-captured data from ARKitScenes dataset.

Updated: 2024-06-17 19:39:13

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2406.12048v1

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
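
The abstract does not spell out the pass^k estimator. Assuming it mirrors the standard pass@k estimator of code-generation benchmarks but requires all k sampled trials to succeed, a minimal sketch is:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n i.i.d. trials with c successes:
    C(c, k) / C(n, k) is the probability that k trials drawn without
    replacement all succeed (our assumption, mirroring the pass@k estimator)."""
    return comb(c, k) / comb(n, k)

# Hypothetical results: 8 trials per task, per-task success counts.
success_counts = [8, 6, 3, 0]
score = sum(pass_hat_k(8, c, k=4) for c in success_counts) / len(success_counts)
print(f"pass^4 = {score:.3f}")
```

Under this reading, pass^1 is the plain success rate, and pass^k decays quickly for inconsistent agents, consistent with the reported drop from <50% single-trial success to pass^8 <25%.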

Updated: 2024-06-17 19:33:08

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.12045v1

Grade Score: Quantifying LLM Performance in Option Selection

This study introduces the "Grade Score", a novel metric designed to evaluate the consistency and fairness of Large Language Models (LLMs) when used as multiple-choice judges with respect to order bias and choice consistency. The Grade Score combines Entropy, which measures order bias, and Mode Frequency, which assesses choice stability, offering insights into LLMs' reliability and impartiality. The study explores techniques such as prompt engineering and option sampling strategies to optimize the Grade Score, demonstrating their effectiveness in enhancing LLMs' performance. Results showcase varying performances among LLMs with respect to prompts and highlight the positive impact of including irrelevant options. The study also identifies an emergent behavior in instruction-following models, where they adapt to instructions targeting specific biases, demonstrating their adaptability. The Grade Score facilitates comparisons between LLMs and encourages ongoing research towards optimizing their decision-making processes, with potential implications for improving their reliability and fairness in various applications. All code is available on GitHub https://github.com/IoDmitri/GradeLab
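
A minimal sketch of the two ingredients under one plausible reading of the abstract (entropy over the answer slots a judge picks across reshuffled orderings, mode frequency over the underlying options); the exact formula combining them into the Grade Score is not given here, and the function names are ours:

```python
import math
from collections import Counter

def position_entropy(positions: list[int], num_options: int) -> float:
    """Normalized entropy of the answer *slots* a judge picks when the same
    question is re-asked with randomly shuffled options. An order-insensitive
    judge picks slots roughly uniformly (entropy near 1); a judge that always
    picks the first slot regardless of content scores near 0."""
    n = len(positions)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(positions).values())
    return h / math.log2(num_options)

def mode_frequency(labels: list[str]) -> float:
    """Fraction of trials selecting the most common *content* choice:
    1.0 is perfectly stable, 1/num_options is random guessing."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Hypothetical audit: six shuffled re-asks of one multiple-choice question.
slots = [0, 2, 1, 3, 2, 0]                            # slot the judge chose
choices = ["cat", "cat", "cat", "dog", "cat", "cat"]  # underlying option
print(position_entropy(slots, num_options=4), mode_frequency(choices))
```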

Updated: 2024-06-17 19:29:39

Domains: cs.AI

Download: http://arxiv.org/abs/2406.12043v1

Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models

Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities. Still, their computational intensity prohibits resource-constrained organizations from deploying T2I models after fine-tuning them on their internal target data. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models. Central to our approach is a prompt router model, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing into a single one. We demonstrate APTP's effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals they are semantically meaningful. We also show that APTP can automatically discover previously empirically found challenging prompts for SD, e.g., prompts for generating text images, assigning them to higher capacity codes.

Updated: 2024-06-17 19:22:04

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2406.12042v1

Balanced Filtering via Disclosure-Controlled Proxies

We study the problem of collecting a cohort or set that is balanced with respect to sensitive groups when group membership is unavailable or prohibited from use at deployment time. Specifically, our deployment-time collection mechanism does not reveal significantly more about the group membership of any individual sample than can be ascertained from base rates alone. To do this, we study a learner that can use a small set of labeled data to train a proxy function that can later be used for this filtering or selection task. We then associate the range of the proxy function with sampling probabilities; given a new example, we classify it using our proxy function and then select it with probability corresponding to its proxy classification. Importantly, we require that the proxy classification does not reveal significantly more information about the sensitive group membership of any individual example compared to population base rates alone (i.e., the level of disclosure should be controlled) and show that we can find such a proxy in a sample- and oracle-efficient manner. Finally, we experimentally evaluate our algorithm and analyze its generalization properties.
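
A minimal sketch of the deployment-time mechanism as described, proxy classification followed by probabilistic selection. The names and the two-bucket proxy are hypothetical; the paper's actual contribution, learning a proxy whose outputs reveal little about sensitive membership beyond base rates, is not shown here:

```python
import random

def make_filter(proxy, sampling_prob):
    """Deployment-time selection: classify with the proxy, then keep the
    example with the probability attached to its proxy bucket."""
    def keep(example) -> bool:
        return random.random() < sampling_prob[proxy(example)]
    return keep

# Toy usage: two proxy buckets with probabilities chosen so the collected
# cohort is balanced in expectation.
proxy = lambda x: 0 if x["score"] < 0.5 else 1
keep = make_filter(proxy, {0: 0.9, 1: 0.3})
stream = ({"score": random.random()} for _ in range(1000))
cohort = [x for x in stream if keep(x)]
```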

Updated: 2024-06-17 19:21:28

Domains: cs.LG

Download: http://arxiv.org/abs/2306.15083v3

Outer Space Cyberattacks: Generating Novel Scenarios to Avoid Surprise

Though general awareness around it may be low, space cyberattacks are an increasingly urgent problem given the vital role that space systems play in the modern world. Open-source or public discussions about it typically revolve around only a couple generic scenarios, namely satellite hacking and signals jamming or spoofing. But there are so many more possibilities. The report offers a scenario-prompt generator -- a taxonomy of sorts, called the ICARUS matrix -- that can create more than 4 million unique scenario-prompts. We will offer a starting set of 42 scenarios, briefly describing each one, to begin priming the imagination-pump so that many more researchers can bring their diverse expertise and perspectives to bear on the problem. A failure to imagine novel scenarios is a major risk in being taken by surprise and severely harmed by threat actors who are constantly devising new ways, inventive and resourceful ways, to breach the digital systems that control our wired world. To stay vigilant, defenders likewise need to be imaginative to keep up in this adversarial dance between hunter and prey in cybersecurity. More than offering novel scenarios, we will also explore the drivers of the space cybersecurity problem, which include at least seven factors we have identified. For instance, the shared threat of space debris would seem to push rational states and actors to avoid kinetic conflicts in orbit, which weighs in favor of cyberoperations as the dominant form of space conflicts. Outer space is the next frontier for cybersecurity. To guard against space cyberattacks, we need to understand and anticipate them, and imagination is at the very heart of both cybersecurity and frontiers.
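
Mechanically, a scenario-prompt generator of this kind is a Cartesian product over taxonomy dimensions. The sketch below illustrates the idea only; the dimension names and values are invented stand-ins, not the actual ICARUS matrix, whose real dimensions multiply out to the 4-million-plus prompts the report cites:

```python
from itertools import product

# Illustrative stand-in dimensions; the real matrix is defined in the report.
dimensions = {
    "threat_actor": ["state agency", "criminal group", "insider"],
    "motivation": ["espionage", "ransom", "disruption"],
    "method": ["supply-chain implant", "signal spoofing", "ground-station breach"],
    "victim": ["weather satellite", "navigation constellation", "space station"],
    "outcome": ["data theft", "loss of control", "service outage"],
}

def scenario_prompts():
    keys = list(dimensions)
    for combo in product(*(dimensions[k] for k in keys)):
        slots = dict(zip(keys, combo))
        yield ("A {threat_actor} motivated by {motivation} uses {method} "
               "against a {victim}, causing {outcome}.").format(**slots)

print(next(scenario_prompts()))   # 3^5 = 243 prompts from this toy matrix
```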

Updated: 2024-06-17 19:20:17

Domains: cs.CR

Download: http://arxiv.org/abs/2406.12041v1

DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation

We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain in some relevant pre-defined utility of a machine learning task obtained by aggregating an individual dataset with others. The Shapley value is a natural tool for dataset valuation due to its formal axiomatic justification, and it can be combined with Monte Carlo integration to overcome the computational tractability challenges. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit knowledge about the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation, referred to as discrete uniform Shapley, which is expressed as an expectation under a discrete uniform distribution with support of reasonable size. We justify the relevance of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and illustrate its benefits via an extensive set of numerical experiments.
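
For context, the exact Shapley value of dataset $i$ under utility $u$ with datasets $N$, $|N|=n$, admits a size-uniform expectation form, which is the standard starting point for such estimators:

$$\phi_i(u)=\sum_{S\subseteq N\setminus\{i\}}\frac{|S|!\,(n-|S|-1)!}{n!}\bigl(u(S\cup\{i\})-u(S)\bigr)=\frac{1}{n}\sum_{k=0}^{n-1}\mathbb{E}_{S\sim\mathcal{U}_k}\bigl[u(S\cup\{i\})-u(S)\bigr],$$

where $\mathcal{U}_k$ is uniform over the size-$k$ subsets of $N\setminus\{i\}$. As we read the abstract, DU-Shapley replaces this nested sum-and-expectation with a single expectation under a discrete uniform distribution of modest support, which is what removes most of the usual Monte Carlo cost.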

Updated: 2024-06-17 19:19:44

Domains: cs.AI,cs.GT,stat.CO,stat.ML

Download: http://arxiv.org/abs/2306.02071v2

Soft Prompting for Unlearning in Large Language Models

The widespread popularity of Large Language Models (LLMs), partly due to their unique ability to perform in-context learning, has also brought to light the importance of ethical and safety considerations when deploying these pre-trained models. In this work, we focus on investigating machine unlearning for LLMs motivated by data protection regulations. In contrast to the growing literature on fine-tuning methods to achieve unlearning, we focus on a comparatively lightweight alternative called soft prompting to realize the unlearning of a subset of training data. With losses designed to enforce forgetting as well as utility preservation, our framework \textbf{S}oft \textbf{P}rompting for \textbf{U}n\textbf{l}earning (SPUL) learns prompt tokens that can be appended to an arbitrary query to induce unlearning of specific examples at inference time without updating LLM parameters. We conduct a rigorous evaluation of the proposed method and our results indicate that SPUL can significantly improve the trade-off between utility and forgetting in the context of text classification with LLMs. We further validate our method using multiple LLMs to highlight the scalability of our framework and provide detailed insights into the choice of hyperparameters and the influence of the size of unlearning data. Our implementation is available at \url{https://github.com/karuna-bhaila/llm_unlearning}.
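
A minimal PyTorch sketch of the mechanism as described: learnable prompt embeddings prepended to the query embeddings while the LLM itself stays frozen. The class name, shapes, and loss weighting are our illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SoftPromptPrefix(nn.Module):
    """Learnable prompt embeddings prepended to query embeddings; only
    these parameters are trained, the LLM's weights stay frozen."""
    def __init__(self, num_tokens: int = 8, embed_dim: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(0.02 * torch.randn(num_tokens, embed_dim))

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.shape[0], -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

prefix = SoftPromptPrefix()
x = torch.randn(2, 16, 4096)   # stand-in for embedded query tokens
augmented = prefix(x)          # (2, 24, 4096), fed to the frozen LLM
# Training would minimize  loss_forget(forget set) + lam * loss_utility(retain set),
# updating only `prefix.prompt`; at inference the prefix is simply prepended.
```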

Updated: 2024-06-17 19:11:40

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.12038v1

Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception

Recently developed pretrained models can encode rich world knowledge expressed in multiple modalities, such as text and images. However, the outputs of these models cannot be integrated into algorithms to solve sequential decision-making tasks. We develop an algorithm that utilizes the knowledge from pretrained models to construct and verify controllers for sequential decision-making tasks, and to ground these controllers to task environments through visual observations with formal guarantees. In particular, the algorithm queries a pretrained model with a user-provided, text-based task description and uses the model's output to construct an automaton-based controller that encodes the model's task-relevant knowledge. It allows formal verification of whether the knowledge encoded in the controller is consistent with other independently available knowledge, which may include abstract information on the environment or user-provided specifications. Next, the algorithm leverages the vision and language capabilities of pretrained models to link the observations from the task environment to the text-based control logic from the controller (e.g., actions and conditions that trigger the actions). We propose a mechanism to provide probabilistic guarantees on whether the controller satisfies the user-provided specifications under perceptual uncertainties. We demonstrate the algorithm's ability to construct, verify, and ground automaton-based controllers through a suite of real-world tasks, including daily life and robot manipulation tasks.

Updated: 2024-06-17 19:09:17

Domains: cs.AI,cs.FL

Download: http://arxiv.org/abs/2308.05295v2

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained. While our evaluation results show the potential of LLMs in this area, none of them are effective enough for clinical settings. Common issues include extracting the incorrect entities, not using the correct equation or rules for a calculation task, or incorrectly performing the arithmetic for the computation. We hope our study highlights the quantitative knowledge and reasoning gaps in LLMs within medical settings, encouraging future improvements of LLMs for various clinical calculation tasks.
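
For concreteness, an instance with the four fields the abstract lists might look as follows; the record itself is invented for illustration, not drawn from the dataset (the Cockcroft-Gault arithmetic shown is the standard formula for a male patient):

```python
# Field names follow the abstract's description of an instance; the content
# is a made-up illustration, not an actual MedCalc-Bench record.
instance = {
    "patient_note": "62-year-old man, weight 80 kg, serum creatinine 1.4 mg/dL, ...",
    "question": "What is the patient's creatinine clearance (Cockcroft-Gault)?",
    "ground_truth_answer": "61.9 mL/min",
    "explanation": "CrCl = (140 - age) * weight / (72 * SCr) "
                   "= (140 - 62) * 80 / (72 * 1.4) = 6240 / 100.8 = 61.9 mL/min",
}
```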

Updated: 2024-06-17 19:07:21

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.12036v1

Socially Interactive Agents for Robotic Neurorehabilitation Training: Conceptualization and Proof-of-concept Study

Individuals with diverse motor abilities often benefit from intensive and specialized rehabilitation therapies aimed at enhancing their functional recovery. Nevertheless, the challenge lies in the restricted availability of neurorehabilitation professionals, hindering the effective delivery of the necessary level of care. Robotic devices hold great potential in reducing the dependence on medical personnel during therapy but, at the same time, they generally lack the crucial human interaction and motivation that traditional in-person sessions provide. To bridge this gap, we introduce an AI-based system aimed at delivering personalized, out-of-hospital assistance during neurorehabilitation training. This system includes a rehabilitation training device, affective signal classification models, training exercises, and a socially interactive agent as the user interface. With the assistance of a professional, the envisioned system is designed to be tailored to accommodate the unique rehabilitation requirements of an individual patient. Conceptually, after a preliminary setup and instruction phase, the patient is equipped to continue their rehabilitation regimen autonomously in the comfort of their home, facilitated by a socially interactive agent functioning as a virtual coaching assistant. Our approach involves the integration of an interactive socially-aware virtual agent into a neurorehabilitation robotic framework, with the primary objective of recreating the social aspects inherent to in-person rehabilitation sessions. We also conducted a feasibility study to test the framework with healthy patients. The results of our preliminary investigation indicate that participants demonstrated a propensity to adapt to the system. Notably, the presence of the interactive agent during the proposed exercises did not act as a source of distraction; instead, it positively impacted users' engagement.

Updated: 2024-06-17 19:07:05

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2406.12035v1

Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipped with a shared base LLM and incorporating self-optimized routing. This allows for dynamic and capability-specific handling of various target tasks, enhancing overall capabilities, without extensive human-labeled data and added parameters. Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performance on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.

Updated: 2024-06-17 19:06:54

Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.12034v1

Large Scale Transfer Learning for Tabular Data via Language Modeling

Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more data. We release our model, code, and data along with the publication of this paper.
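
The core preprocessing step, turning a row into a token sequence whose final token(s) the LLM predicts, can be sketched as below. The key-value template is a common convention and our assumption; the abstract does not give TabuLa-8B's exact serialization or packing scheme:

```python
def serialize_row(row: dict, target: str) -> str:
    """Linearize one table row into text for next-token prediction."""
    features = "; ".join(f"{k}: {v}" for k, v in row.items() if k != target)
    return f"{features}. {target}:"

row = {"age": 42, "income": "55k", "tenure_months": 18, "churn": "no"}
prompt = serialize_row(row, target="churn")
# "age: 42; income: 55k; tenure_months: 18. churn:" -> the LLM scores the
# candidate label tokens ("yes"/"no") to make a zero- or few-shot prediction.
```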

Updated: 2024-06-17 18:58:20

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.12031v1

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model

The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open- (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness. We have made our code https://github.com/EchoseChen/SPA-VL-RLHF and SPA-VL dataset URL https://huggingface.co/datasets/sqrti/SPA-VL publicly available.

Updated: 2024-06-17 18:57:37

Domains: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.12030v1

Attacking with Something That Does Not Exist: 'Proof of Non-Existence' Can Exhaust DNS Resolver CPU

NSEC3 is a proof of non-existence in DNSSEC, which provides an authenticated assertion that a queried resource does not exist in the target domain. NSEC3 consists of alphabetically sorted hashed names before and after the queried hostname. To make dictionary attacks harder, the hash function can be applied in multiple iterations, which however also increases the load on the DNS resolver during the computation of the SHA-1 hashes in NSEC3 records. Concerns about the load created by the computation of NSEC3 records on the DNS resolvers were already considered in the NSEC3 specifications RFC5155 and RFC9276. In February 2024, the potential of NSEC3 to exhaust DNS resolvers' resources was assigned a CVE-2023-50868, confirming that extra iterations of NSEC3 created substantial load. However, there is no published evaluation of the attack and the impact of the attack on the resolvers was not clarified. In this work we perform the first evaluation of the NSEC3-encloser attack against DNS resolver implementations and find that the NSEC3-encloser attack can still create a 72x increase in CPU instruction count, despite the victim resolver following RFC5155 recommendations in limiting hash iteration counts. The impact of the attack varies across the different DNS resolvers, but we show that with a sufficient volume of DNS packets the attack can increase CPU load and cause packet loss. We find that at a rate of 150 malicious NSEC3 records per second, depending on the DNS implementation, the loss rate of benign DNS requests varies between 2.7% and 30%. We provide a detailed description and implementation of the NSEC3-encloser attack. We also develop the first analysis how each NSEC3 parameter impacts the load inflicted on the victim resolver during NSEC3-encloser attack.
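
The iterated hash at the heart of the load is specified in RFC 5155: one SHA-1 over the wire-format owner name concatenated with the salt, then the configured number of extra rounds over the previous digest plus the salt. A short Python rendering:

```python
import hashlib

def nsec3_hash(owner_wire_name: bytes, salt: bytes, iterations: int) -> bytes:
    """RFC 5155 iterated hash: H(name || salt), then `iterations` extra
    rounds of H(digest || salt), i.e. iterations + 1 SHA-1 computations.
    This is the per-record cost an NSEC3-encloser attack multiplies."""
    digest = hashlib.sha1(owner_wire_name + salt).digest()
    for _ in range(iterations):
        digest = hashlib.sha1(digest + salt).digest()
    return digest

# Wire-format name for example.com (length-prefixed labels, lowercased).
name = b"\x07example\x03com\x00"
h = nsec3_hash(name, bytes.fromhex("aabbccdd"), iterations=150)
# At 150 iterations a resolver performs 151 SHA-1 digests per candidate name,
# and one NSEC3 proof can require hashing several closest-encloser candidates.
```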

Updated: 2024-06-17 18:57:10

Domains: cs.CR

Download: http://arxiv.org/abs/2403.15233v2

Online Cascade Learning for Efficient Inference over Streams

Large Language Models (LLMs) have a natural role in answering complex queries about data streams, but the high computational cost of LLM inference makes them infeasible in many such tasks. We propose online cascade learning, the first approach to address this challenge. The objective here is to learn a "cascade" of models, starting with lower-capacity models (such as logistic regression) and ending with a powerful LLM, along with a deferral policy that determines the model to be used on a given input. We formulate the task of learning cascades online as an imitation-learning problem, where smaller models are updated over time imitating the collected LLM demonstrations, and give a no-regret algorithm for the problem. Experimental results across four benchmarks show that our method parallels LLMs in accuracy while cutting down inference costs by as much as 90% with strong robustness against input distribution shifts, underscoring its efficacy and adaptability in stream processing.
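
A minimal sketch of cascade inference with deferral; the fixed confidence thresholds below are a stand-in for the learned deferral policy described in the abstract, and the stub models are placeholders:

```python
def cascade_predict(x, models, thresholds, llm):
    """Try cheap models first and defer upward when unsure."""
    for model, tau in zip(models, thresholds):
        label, confidence = model(x)
        if confidence >= tau:
            return label, model.__name__   # answered cheaply
    return llm(x), "llm"                   # full-cost fallback

def logistic_stub(x):   # placeholder for e.g. a logistic regression
    return "positive", 0.62

def llm_stub(x):        # placeholder for an expensive LLM call
    return "positive"

label, used_by = cascade_predict("a stream record", [logistic_stub], [0.9], llm_stub)
# confidence 0.62 < 0.9, so this record is deferred to the LLM; online
# learning would meanwhile update the small models to imitate LLM outputs.
```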

Updated: 2024-06-17 18:54:36

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2402.04513v3

Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming

With the proliferation of red-teaming strategies for Large Language Models (LLMs), the deficiency in the literature about improving the safety and robustness of LLM defense strategies is becoming increasingly pronounced. This paper introduces the LLM-based \textbf{sentinel} model as a plug-and-play prefix module designed to reconstruct the input prompt with just a few ($<30$) additional tokens, effectively reducing toxicity in responses from target LLMs. The sentinel model naturally overcomes the \textit{parameter inefficiency} and \textit{limited model accessibility} for fine-tuning large target models. We employ an interleaved training regimen using Proximal Policy Optimization (PPO) to optimize both red team and sentinel models dynamically, incorporating a value head-sharing mechanism inspired by the multi-agent centralized critic to manage the complex interplay between agents. Our extensive experiments across text-to-text and text-to-image demonstrate the effectiveness of our approach in mitigating toxic outputs, even when dealing with larger models like \texttt{Llama-2}, \texttt{GPT-3.5} and \texttt{Stable-Diffusion}, highlighting the potential of our framework in enhancing safety and robustness in various applications.

Updated: 2024-06-17 18:52:45

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2405.12604v2

Qibo: A Large Language Model for Traditional Chinese Medicine

Large Language Models (LLMs) have made significant progress in a number of professional fields, including medicine, law, and finance. However, in traditional Chinese medicine (TCM), there are challenges such as the essential differences between its theory and modern medicine, the lack of specialized corpus resources, and the fact that relying only on supervised fine-tuning may lead to overconfident predictions. To address these challenges, we propose a two-stage training approach that combines continuous pre-training and supervised fine-tuning. A notable contribution of our study is the processing of a 2GB corpus dedicated to TCM, constructing pre-training and instruction fine-tuning datasets for TCM, respectively. In addition, we have developed Qibo-Benchmark, a tool that evaluates the performance of LLMs in TCM on multiple dimensions, including subjective and objective measures and three TCM NLP tasks. The medical LLM trained with our pipeline, named \emph{\textbf{Qibo}}, exhibits significant performance boosts. Compared to the baselines, the average subjective win rate is 63\%, the average objective accuracy improved by 23\% to 58\%, and the Rouge-L scores for the three TCM NLP tasks are 0.72, 0.61, and 0.55. Finally, we propose a pipeline to apply Qibo to TCM consultation and demonstrate the model's performance through a case study.

Updated: 2024-06-17 18:52:04

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2403.16056v2

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

Artists are increasingly concerned about advancements in image generation models that can closely replicate their unique artistic styles. In response, several protection tools against style mimicry have been developed that incorporate small adversarial perturbations into artworks published online. In this work, we evaluate the effectiveness of popular protections -- with millions of downloads -- and show they only provide a false sense of security. We find that low-effort and "off-the-shelf" techniques, such as image upscaling, are sufficient to create robust mimicry methods that significantly degrade existing protections. Through a user study, we demonstrate that all existing protections can be easily bypassed, leaving artists vulnerable to style mimicry. We caution that tools based on adversarial perturbations cannot reliably protect artists from the misuse of generative AI, and urge the development of alternative non-technological solutions.

Updated: 2024-06-17 18:51:45

Domains: cs.CR

Download: http://arxiv.org/abs/2406.12027v1

LiLiuM: eBay's Large Language Models for e-commerce

We introduce the LiLiuM series of large language models (LLMs): 1B, 7B, and 13B parameter models developed 100% in-house to fit eBay's specific needs in the e-commerce domain. This gives eBay full control over all aspects of the models, including license, data, vocabulary, and architecture. We expect these models to be used as a foundation for fine-tuning and instruction-tuning, eliminating dependencies on external models. The LiLiuM LLMs have been trained on 3 trillion tokens of multilingual text from the general and e-commerce domains. They perform similarly to the popular LLaMA-2 models on English natural language understanding (NLU) benchmarks. At the same time, we outperform LLaMA-2 on non-English NLU tasks, machine translation, and e-commerce-specific downstream tasks. As part of our data mixture, we utilize the newly released RedPajama-V2 dataset for training and share our insights regarding data filtering and deduplication. We also discuss in detail how to serialize structured data for use in autoregressive language modeling. We provide insights on the effects of including code and parallel machine translation data in pre-training. Furthermore, we develop our own tokenizer and model vocabulary, customized towards e-commerce. This way, we can achieve up to a 34% speed-up in text generation on eBay-specific downstream tasks compared to LLaMA-2. Finally, in relation to LLM pretraining, we show that checkpoint averaging can further improve over the best individual model checkpoint.
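
Checkpoint averaging itself is a generic recipe: uniformly average the parameter tensors of several saved checkpoints. A sketch follows; which checkpoints LiLiuM averages, and with what weights, is not stated in the abstract:

```python
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Uniform average of parameter tensors across saved state dicts."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical usage with the last few pretraining checkpoints:
# model.load_state_dict(average_checkpoints(["step_90k.pt", "step_95k.pt", "step_100k.pt"]))
```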

Updated: 2024-06-17 18:45:41

Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.12023v1

Constructing Ancestral Recombination Graphs through Reinforcement Learning

Over the years, many approaches have been proposed to build ancestral recombination graphs (ARGs), graphs used to represent the genetic relationships between individuals. Among these methods, many rely on the assumption that the most likely graph is among the shortest ones. In this paper, we propose a new approach to building short ARGs: Reinforcement Learning (RL). We exploit the similarity between finding the shortest path from a set of genetic sequences to their most recent common ancestor and finding the shortest path between the entrance and exit of a maze, a classic RL problem. In the maze problem, the learner, called the agent, must learn the directions to take in order to escape as quickly as possible, whereas in our problem, the agent must learn the actions to take between coalescence, mutation, and recombination in order to reach the most recent common ancestor as quickly as possible. Our results show that RL can be used to build ARGs as short as those built with a heuristic algorithm optimized to build short ARGs, and sometimes even shorter. Moreover, our method allows building a distribution of short ARGs for a given sample and can also generalize learning to new samples not used during the learning process.

Updated: 2024-06-17 18:42:03

Domains: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2406.12022v1

The Unfairness of $\varepsilon$-Fairness

Fairness in decision-making processes is often quantified using probabilistic metrics. However, these metrics may not fully capture the real-world consequences of unfairness. In this article, we adopt a utility-based approach to more accurately measure the real-world impacts of decision-making processes. In particular, we show that if the concept of $\varepsilon$-fairness is employed, it can lead to outcomes that are maximally unfair in the real-world context. Additionally, we address the common issue of unavailable data on false negatives by proposing a reduced setting that still captures essential fairness considerations. We illustrate our findings with two real-world examples: college admissions and credit risk assessment. Our analysis reveals that while traditional probability-based evaluations might suggest fairness, a utility-based approach uncovers the actions necessary to truly achieve equality. For instance, in the college admission case, we find that enhancing completion rates is crucial for ensuring fairness. In summary, this paper highlights the importance of considering the real-world context when evaluating fairness.

Updated: 2024-06-17 18:38:17

Domains: cs.LG,econ.TH,q-fin.MF,stat.ML

Download: http://arxiv.org/abs/2405.09360v2

When Box Meets Graph Neural Network in Tag-aware Recommendation

The past year has witnessed a resurgence of tag-aware recommender systems supported by LLM-enriched tags. Unfortunately, despite considerable effort, current solutions may fail to capture the diversity and uncertainty inherent in user preferences with tag-driven profiles alone. Recently, with the development of geometry-based techniques, e.g., box embedding, the diversity of user preferences can now be fully modeled as the range within a box in a high-dimensional space. However, a defect remains: these approaches cannot capture high-order neighbor signals, i.e., semantic-rich multi-hop relations within the user-tag-item tripartite graph, which severely limits the effectiveness of user modeling. To deal with this challenge, in this paper, we propose a novel algorithm, called BoxGNN, to perform message aggregation via a combination of logical operations, thereby incorporating high-order signals. Specifically, we first embed users, items, and tags as hyper-boxes rather than simple points in the representation space, and define two logical operations to facilitate the subsequent process. Next, we perform the message aggregation mechanism via the combination of logical operations, to obtain the corresponding high-order box representations. Finally, we adopt a volume-based learning objective with Gumbel smoothing techniques to refine the representation of boxes. Extensive experiments on two publicly available datasets and one LLM-enhanced e-commerce dataset have validated the superiority of BoxGNN compared with various state-of-the-art baselines. The code is released online.
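
A minimal sketch of the geometric primitives involved: box intersection as a corner-wise max/min (the logical AND), and a softplus-smoothed volume in the spirit of Gumbel box embeddings so that training signals survive empty intersections. BoxGNN's exact operators and smoothing may differ in detail:

```python
import torch
import torch.nn.functional as F

def box_intersection(lo1, hi1, lo2, hi2):
    """Logical AND of two hyper-boxes: corner-wise max of lower corners,
    min of upper corners."""
    return torch.maximum(lo1, lo2), torch.minimum(hi1, hi2)

def soft_volume(lo, hi, beta: float = 1.0):
    """Softplus-smoothed volume so empty or near-empty boxes still pass
    gradients, as in Gumbel-style box embeddings."""
    return (beta * F.softplus((hi - lo) / beta)).prod(dim=-1)

# Toy usage: overlap volume of a user box and an item box as an affinity score.
user_lo, user_hi = torch.tensor([0.0, 0.0]), torch.tensor([2.0, 2.0])
item_lo, item_hi = torch.tensor([1.0, 1.0]), torch.tensor([3.0, 3.0])
lo, hi = box_intersection(user_lo, user_hi, item_lo, item_hi)
score = soft_volume(lo, hi)   # larger overlap -> larger score
```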

Updated: 2024-06-17 18:35:53

Domains: cs.IR,cs.AI

Download: http://arxiv.org/abs/2406.12020v1

Hacking Encrypted Wireless Power: Cyber-Security of Dynamic Charging

Recently, energy encryption for wireless power transfer has been developed for energy safety, which is important in public places to suppress unauthorized energy extraction. Most techniques vary the frequency so that unauthorized receivers cannot extract energy because of non-resonance. However, this strategy is unreliable. To stimulate the progress of energy encryption technology and point out security holes, this paper proposes a decryption method for the fundamental principle of encrypted frequency-varying wireless power transfer. The paper uses an auxiliary coil to detect the frequency and a switched-capacitor array to adaptively compensate the receiver over a wide frequency range. The switched-capacitor array contains two capacitors and one semiconductor switch. One capacitor compensates the receiver all the time, while the other's active time during one wireless power transfer cycle is regulated by the switch. Thus, the proposed hacking receiver controls the equivalent capacitance of the compensation and steals energy. Finally, a detailed simulation model and experimental results prove the effectiveness of the attack on frequency-hopping energy encryption. Although any nonnegligible energy extracted would be problematic, we were able to steal 78% to 84% of the energy an authorized receiver could get. When the frequency changes, the interceptor retunes coarsely and very quickly, allowing it to hack even fast frequency-varying encrypted systems.
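
For intuition, the attack hinges on elementary resonance physics: a receiver with coil inductance $L$ and compensation capacitance $C$ resonates at

$$f_0 = \frac{1}{2\pi\sqrt{LC}}, \qquad\text{equivalently}\qquad C = \frac{1}{(2\pi f_0)^2 L},$$

so once the auxiliary coil measures the transmitter's current frequency $f_0$, the hacking receiver only has to realize an equivalent capacitance near $1/((2\pi f_0)^2 L)$. Duty-cycling the switched capacitor then sweeps the equivalent capacitance continuously between the fixed capacitor alone and the full array (the exact attainable range depends on the paper's circuit topology).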

Updated: 2024-06-17 18:35:45

Domains: eess.SY,cs.CR,cs.ET,cs.SY,eess.SP

Download: http://arxiv.org/abs/2406.12019v1

Sparsity-Constraint Optimization via Splicing Iteration

Sparsity-constraint optimization has wide applicability in signal processing, statistics, and machine learning. Existing fast algorithms require burdensome parameter tuning, such as choosing the step size or implementing precise stopping criteria, which may be challenging to determine in practice. To address this issue, we develop an algorithm named Sparsity-Constraint Optimization via sPlicing itEration (SCOPE) to optimize nonlinear differential objective functions with strong convexity and smoothness in low-dimensional subspaces. Algorithmically, the SCOPE algorithm converges effectively without tuning parameters. Theoretically, SCOPE has a linear convergence rate and converges to a solution that recovers the true support set when it correctly specifies the sparsity. We also develop parallel theoretical results without restricted-isometry-property-type conditions. We apply SCOPE's versatility and power to solve sparse quadratic optimization, learn sparse classifiers, and recover sparse Markov networks for binary variables. The numerical results on these specific tasks reveal that SCOPE perfectly identifies the true support set with a 10--1000x speedup over the standard exact solver, confirming SCOPE's algorithmic and theoretical merits. Our open-source Python package skscope, based on a C++ implementation, is publicly available on GitHub, reaching a ten-fold speedup over the competing convex relaxation methods implemented by the cvxpy library.

Updated: 2024-06-17 18:34:51

Domains: stat.ML,cs.LG,stat.CO

Download: http://arxiv.org/abs/2406.12017v1

Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization

Despite recent advances in LLM quantization, activation quantization remains challenging due to activation outliers. Conventional remedies, e.g., mixing precisions for different channels, introduce extra overhead and reduce the speedup. In this work, we develop a simple yet effective strategy to facilitate per-tensor activation quantization by preventing the generation of problematic tokens. Precisely, we propose a method to find a set of key-value cache, coined CushionCache, which mitigates outliers in subsequent tokens when inserted as a prefix. CushionCache works in two steps: First, we greedily search for a prompt token sequence that minimizes the maximum activation values in subsequent tokens. Then, we further tune the token cache to regularize the activations of subsequent tokens to be more quantization-friendly. The proposed method successfully addresses activation outliers of LLMs, providing a substantial performance boost for per-tensor activation quantization methods. We thoroughly evaluate our method over a wide range of models and benchmarks and find that it significantly surpasses the established baseline of per-tensor W8A8 quantization and can be seamlessly integrated with recent activation quantization methods.

Updated: 2024-06-17 18:33:44

Domains: cs.LG

Download: http://arxiv.org/abs/2406.12016v1

The Benefits and Risks of Transductive Approaches for AI Fairness

Recently, transductive learning methods, which leverage holdout sets during training, have gained popularity for their potential to improve speed, accuracy, and fairness in machine learning models. Despite this, the composition of the holdout set itself, particularly the balance of sensitive sub-groups, has been largely overlooked. Our experiments on CIFAR and CelebA datasets show that compositional changes in the holdout set can substantially influence fairness metrics. Imbalanced holdout sets exacerbate existing disparities, while balanced holdouts can mitigate issues introduced by imbalanced training data. These findings underline the necessity of constructing holdout sets that are both diverse and representative.

Updated: 2024-06-17 18:29:49

Domains: cs.LG,cs.CY

Download: http://arxiv.org/abs/2406.12011v1

AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach

As Large Language Models (LLMs) are integrated into various sectors, ensuring their reliability and safety is crucial. This necessitates rigorous probing and auditing to maintain their effectiveness and trustworthiness in practical applications. Subjecting LLMs to varied iterations of a single query can unveil potential inconsistencies in their knowledge base or functional capacity. However, a tool for performing such audits with an easy-to-execute workflow and a low technical threshold is lacking. In this demo, we introduce ``AuditLLM,'' a novel tool designed to audit the performance of various LLMs in a methodical way. AuditLLM's primary function is to audit a given LLM by deploying multiple probes derived from a single question, thus detecting any inconsistencies in the model's comprehension or performance. A robust, reliable, and consistent LLM is expected to generate semantically similar responses to variably phrased versions of the same question. Building on this premise, AuditLLM generates easily interpretable results that reflect the LLM's consistency based on a single input question provided by the user. A certain level of inconsistency has been shown to be an indicator of potential bias, hallucinations, and other issues. One could then use the output of AuditLLM to further investigate issues with the aforementioned LLM. To facilitate demonstration and practical use, AuditLLM offers two key modes: (1) Live mode, which allows instant auditing of LLMs by analyzing responses to real-time queries; and (2) Batch mode, which facilitates comprehensive LLM auditing by processing multiple queries at once for in-depth analysis. This tool is beneficial for both researchers and general users, as it enhances our understanding of LLMs' capabilities in generating responses, using a standardized auditing platform.

Updated: 2024-06-17 18:24:41

Domains: cs.AI

Download: http://arxiv.org/abs/2402.09334v2

QC-Forest: a Classical-Quantum Algorithm to Provably Speedup Retraining of Random Forest

Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models require accounting for new training data to maintain model accuracy. This is particularly important in applications where data is periodically and sequentially generated over time in data streams, such as auto-driving systems and credit card payments. In this setting, performing periodic model retraining with the accumulated old and new data is beneficial, as it fully captures possible drifts in the data distribution over time. However, this is impractical with state-of-the-art classical algorithms for RF, as they scale linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest builds on Des-q, a quantum algorithm for single-tree construction and retraining proposed by Kumar et al., expanding it to multi-class classification (the original proposal was limited to binary classes) and introducing an exact classical method to replace an underlying quantum subroutine that incurs a finite error, while maintaining the same poly-logarithmic dependence. Finally, we showcase that QC-Forest achieves competitive accuracy in comparison to state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up model retraining.

Updated: 2024-06-17 18:21:03

Domains: quant-ph,cs.LG

Download: http://arxiv.org/abs/2406.12008v1

P3GNN: A Privacy-Preserving Provenance Graph-Based Model for APT Detection in Software Defined Networking

Software Defined Networking (SDN) has brought significant advancements in network management and programmability. However, this evolution has also heightened vulnerability to Advanced Persistent Threats (APTs), sophisticated and stealthy cyberattacks that traditional detection methods often fail to counter, especially in the face of zero-day exploits. A prevalent issue is the inadequacy of existing strategies to detect novel threats while addressing data privacy concerns in collaborative learning scenarios. This paper presents P3GNN (privacy-preserving provenance graph-based graph neural network model), a novel model that synergizes Federated Learning (FL) with Graph Convolutional Networks (GCN) for effective APT detection in SDN environments. P3GNN utilizes unsupervised learning to analyze operational patterns within provenance graphs, identifying deviations indicative of security breaches. Its core feature is the integration of FL with homomorphic encryption, which fortifies data confidentiality and gradient integrity during collaborative learning. This approach addresses the critical challenge of data privacy in shared learning contexts. Key innovations of P3GNN include its ability to detect anomalies at the node level within provenance graphs, offering a detailed view of attack trajectories and enhancing security analysis. Furthermore, the model's unsupervised learning capability enables it to identify zero-day attacks by learning standard operational patterns. Empirical evaluation using the DARPA TCE3 dataset demonstrates P3GNN's exceptional performance, achieving an accuracy of 0.93 and a low false positive rate of 0.06.
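
As a rough illustration of the node-level anomaly-scoring idea only (the federated learning and homomorphic encryption layers are omitted), a minimal GCN-style autoencoder can score each node of a provenance graph by its feature reconstruction error. The architecture, dimensions, and random graph below are assumptions, not the paper's exact model.

# Simplified node-level anomaly scoring with a GCN-style autoencoder (illustrative).
import torch

def normalize_adj(A):
    A_hat = A + torch.eye(A.size(0))                 # add self-loops
    d = A_hat.sum(1).pow(-0.5)
    return d.unsqueeze(1) * A_hat * d.unsqueeze(0)   # D^-1/2 (A+I) D^-1/2

n, f, h = 100, 16, 8                                 # nodes, features, hidden dim
A = normalize_adj((torch.rand(n, n) < 0.05).float()) # toy provenance graph
X = torch.randn(n, f)                                # toy node features
W1 = torch.randn(f, h, requires_grad=True)
W2 = torch.randn(h, f, requires_grad=True)
opt = torch.optim.Adam([W1, W2], lr=1e-2)

for _ in range(200):
    X_hat = A @ torch.relu(A @ X @ W1) @ W2          # encode, then decode
    loss = ((X - X_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                # per-node anomaly score
    anomaly = ((X - A @ torch.relu(A @ X @ W1) @ W2) ** 2).mean(dim=1)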

Updated: 2024-06-17 18:14:03

标题: P3GNN: 一种基于隐私保护的溯源图模型,用于在软件定义网络中检测APT

摘要: 软件定义网络(SDN)在网络管理和可编程性方面带来了显著的进步。然而,这种演变也增加了对高级持久性威胁(APTs)的脆弱性,这些威胁是复杂且隐蔽的网络攻击,传统的检测方法通常难以对抗,特别是在面对零日攻击时。一个普遍的问题是现有策略无法检测新型威胁,同时解决协作学习场景中的数据隐私问题。本文提出了P3GNN(基于隐私保护溯源图的图神经网络模型),这是一种新颖的模型,将联邦学习(FL)与图卷积网络(GCN)相结合,用于有效地检测SDN环境中的APT。P3GNN利用无监督学习来分析溯源图中的操作模式,识别出指示安全漏洞的偏离。其核心特点是将FL与同态加密相结合,从而在协作学习过程中加强数据的保密性和梯度的完整性。这种方法解决了共享学习环境中数据隐私的关键挑战。P3GNN的关键创新之处在于其能够在溯源图的节点级别检测异常,提供攻击轨迹的详细视图,并增强安全分析。此外,该模型的无监督学习能力使其能够通过学习标准操作模式来识别零日攻击。使用DARPA TCE3数据集进行的实证评估显示,P3GNN表现出色,达到了0.93的准确率和0.06的低误报率。

更新时间: 2024-06-17 18:14:03

领域: cs.CR

下载: http://arxiv.org/abs/2406.12003v1

Modeling, Inference, and Prediction in Mobility-Based Compartmental Models for Epidemiology

Classical compartmental models in epidemiology often struggle to accurately capture real-world dynamics due to their inability to address the inherent heterogeneity of populations. In this paper, we introduce a novel approach that incorporates heterogeneity through a mobility variable, transforming the traditional ODE system into a system of integro-differential equations that describe the dynamics of population densities across different compartments. Our results show that, for the same basic reproduction number, our mobility-based model predicts a smaller final pandemic size compared to classic compartmental models, whose population densities are represented as Dirac delta functions in our density-based framework. This addresses the overestimation issue common in many classical models. Additionally, we demonstrate that the time series of the infected population is sufficient to uniquely identify the mobility distribution. We reconstruct this distribution using a machine-learning-based framework, providing both theoretical and algorithmic support to effectively constrain the mobility-based model with real-world data.
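
One way to see the mechanism is to discretize the mobility variable into bins, give each bin its own susceptible and infected densities, and weight the force of infection by mobility. The toy discretization below, including the Gaussian mobility distribution and rate constants, is assumed purely for illustration; the paper works with the continuous integro-differential formulation.

# Toy discretization of a mobility-structured SIR model (illustrative assumptions).
import numpy as np
from scipy.integrate import solve_ivp

nbins = 20
m = np.linspace(0.2, 2.0, nbins)                 # mobility bin values
w = np.exp(-(m - 1.0) ** 2); w /= w.sum()        # assumed mobility distribution
beta, gamma = 0.3, 0.1

def rhs(t, y):
    S, I = y[:nbins], y[nbins:]
    force = beta * np.sum(w * m * I)             # mobility-weighted force of infection
    dS = -m * S * force
    dI = m * S * force - gamma * I
    return np.concatenate([dS, dI])

S0, I0 = np.full(nbins, 0.999), np.full(nbins, 0.001)
sol = solve_ivp(rhs, (0, 300), np.concatenate([S0, I0]))
final_size = 1.0 - np.sum(w * sol.y[:nbins, -1])  # attack rate under heterogeneity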

Updated: 2024-06-17 18:13:57

标题: 流动性基分区模型在流行病学中的建模、推断和预测

摘要: 经典的流行病学分区模型通常难以准确捕捉现实世界的动态,因为它们无法解决人群固有异质性的问题。在本文中,我们介绍了一种新颖的方法,通过移动性变量来整合异质性,将传统的ODE系统转化为一组描述不同区间人口密度动态的积分微分方程系统。我们的结果表明,对于相同的基本再生数,基于移动性的模型与经典分区模型相比,预测出较小的最终大流行规模,后者在我们的基于密度的框架中将人口密度表示为Dirac δ函数。这解决了许多经典模型中普遍存在的过度估计问题。此外,我们证明了感染人口的时间序列足以唯一确定移动性分布。我们使用基于机器学习的框架重建了该分布,为有效地将基于移动性的模型与现实世界数据相结合提供了理论和算法支持。

更新时间: 2024-06-17 18:13:57

领域: q-bio.PE,cs.LG,cs.NA,math.NA,physics.soc-ph

下载: http://arxiv.org/abs/2406.12002v1

Look Further Ahead: Testing the Limits of GPT-4 in Path Planning

Large Language Models (LLMs) have shown impressive capabilities across a wide variety of tasks. However, they still face challenges with long-horizon planning. To study this, we propose path planning tasks as a platform to evaluate LLMs' ability to navigate long trajectories under geometric constraints. Our proposed benchmark systematically tests path-planning skills in complex settings. Using this, we examined GPT-4's planning abilities using various task representations and prompting approaches. We found that framing prompts as Python code and decomposing long trajectory tasks improve GPT-4's path planning effectiveness. However, while these approaches show some promise toward improving the planning ability of the model, they do not obtain optimal paths and fail at generalizing over extended horizons.
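
The "prompts as Python code" idea can be sketched as follows: a grid-world planning task is serialized into code for the model to complete, and long trajectories are decomposed into short waypoint segments. The template and the waypoints_per_call decomposition knob are hypothetical; the paper's actual prompts may differ.

# Hypothetical prompt builder framing path planning as Python code completion.
def build_prompt(start, goal, obstacles, waypoints_per_call=5):
    return f'''\
# You are a path planner. Complete the function body.
start = {start}
goal = {goal}
obstacles = {sorted(obstacles)}

def next_waypoints():
    """Return the next {waypoints_per_call} collision-free (x, y) waypoints
    moving from start toward goal, avoiding every cell in obstacles."""
    return [
'''

print(build_prompt((0, 0), (9, 9), {(4, 4), (4, 5), (5, 4)}))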

Updated: 2024-06-17 18:12:56

标题: 展望更远:测试GPT-4在路径规划中的极限

摘要: 大型语言模型(LLMs)在各种任务中展现出令人印象深刻的能力。然而,它们仍然面临着长期规划方面的挑战。为了研究这一点,我们提出了路径规划任务作为一个平台,评估LLMs在几何约束下导航长轨迹的能力。我们提出的基准系统地测试了复杂环境中的路径规划能力。利用这一基准,我们使用不同的任务表示和提示方法来检验GPT-4的规划能力。我们发现,将提示作为Python代码并将长期轨迹任务分解可以改善GPT-4的路径规划效果。然而,尽管这些方法显示出改进模型规划能力的一些希望,但它们并未得到最佳路径,并且在延长时间跨度上泛化失败。

更新时间: 2024-06-17 18:12:56

领域: cs.AI

下载: http://arxiv.org/abs/2406.12000v1

Enhancing Recommendation Diversity by Re-ranking with Large Language Models

It has long been recognized that it is not enough for a Recommender System (RS) to provide recommendations based only on their relevance to users. Among many other criteria, the set of recommendations may need to be diverse. Diversity is one way of handling recommendation uncertainty and ensuring that recommendations offer users a meaningful choice. The literature reports many ways of measuring diversity and improving the diversity of a set of recommendations, most notably by re-ranking and selecting from a larger set of candidate recommendations. Driven by promising insights from the literature on how to incorporate versatile Large Language Models (LLMs) into the RS pipeline, in this paper we show how LLMs can be used for diversity re-ranking. We begin with an informal study that verifies that LLMs can be used for re-ranking tasks and do have some understanding of the concept of item diversity. Then, we design a more rigorous methodology where LLMs are prompted to generate a diverse ranking from a candidate ranking using various prompt templates with different re-ranking instructions in a zero-shot fashion. We conduct comprehensive experiments testing state-of-the-art LLMs from the GPT and Llama families. We compare their re-ranking capabilities with random re-ranking and various traditional re-ranking methods from the literature. We open-source the code of our experiments for reproducibility. Our findings suggest that the trade-offs (in terms of performance and costs, among others) of LLM-based re-rankers are superior to those of random re-rankers but, as yet, inferior to the ones of traditional re-rankers. However, the LLM approach is promising. LLMs exhibit improved performance on many natural language processing and recommendation tasks and lower inference costs. Given these trends, we can expect LLM-based re-ranking to become more competitive soon.
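
A minimal sketch of the zero-shot re-ranking setup: the candidate list is embedded in a prompt with a diversity instruction, and the reply is parsed back into item indices. This is one hypothetical template; the paper evaluates several instruction variants across GPT and Llama models.

# Sketch of a zero-shot diversity re-ranking prompt (one possible template).
import json

def diversity_rerank_prompt(candidates: list[str], k: int) -> str:
    items = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    return (f"Below are {len(candidates)} candidate recommendations:\n{items}\n\n"
            f"Select and rank the {k} items that together maximize diversity of "
            f"genre and topic while staying relevant. Reply with a JSON list of "
            f"the chosen item indices, best set first.")

def parse_ranking(llm_reply: str) -> list[int]:
    return json.loads(llm_reply)  # e.g. "[3, 0, 7]" -> [3, 0, 7]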

Updated: 2024-06-17 18:09:57

标题: 通过使用大型语言模型进行重新排序来增强推荐多样性

摘要: 长期以来,人们已经意识到,推荐系统(RS)仅基于其与用户相关性提供推荐是不够的。在许多其他标准中,推荐集可能需要具有多样性。多样性是处理推荐不确定性的一种方式,确保推荐为用户提供有意义的选择。文献报道了许多衡量多样性和改进推荐集多样性的方法,其中最主要的是通过重新排序和从更大的候选推荐集中进行选择。受到如何将多功能大型语言模型(LLMs)纳入RS管道的文献中的有希望见解的驱动,本文展示了LLMs如何用于多样性重新排序。 我们首先进行了一项非正式研究,验证了LLMs可用于重新排序任务,并对物品多样性的概念有一定的理解。然后,我们设计了一种更严格的方法论,其中LLMs被提示使用各种提示模板以零-shot方式从候选排序中生成多样性排序。我们进行了全面的实验,测试了来自GPT和Llama家族的最先进的LLMs。我们将它们的重新排名能力与随机重新排名和文献中各种传统重新排名方法进行了比较。我们开源了我们实验的代码以便重现。我们的研究结果表明,基于LLM的重新排名器的性能和成本等方面的权衡优于随机重新排名器,但目前还不及传统重新排名器。然而,LLM方法是具有发展前景的。LLMs在许多自然语言处理和推荐任务中表现出改进的性能和较低的推理成本。鉴于这些趋势,我们可以预期基于LLM的重新排序将很快变得更具竞争力。

更新时间: 2024-06-17 18:09:57

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2401.11506v2

Delay Embedding Theory of Neural Sequence Models

To generate coherent responses, language models infer unobserved meaning from their input text sequence. One potential explanation for this capability arises from theories of delay embeddings in dynamical systems, which prove that unobserved variables can be recovered from the history of only a handful of observed variables. To test whether language models are effectively constructing delay embeddings, we measure the capacities of sequence models to reconstruct unobserved dynamics. We trained 1-layer transformer decoders and state-space sequence models on next-step prediction from noisy, partially-observed time series data. We found that each sequence layer can learn a viable embedding of the underlying system. However, state-space models have a stronger inductive bias than transformers-in particular, they more effectively reconstruct unobserved information at initialization, leading to more parameter-efficient models and lower error on dynamics tasks. Our work thus forges a novel connection between dynamical systems and deep learning sequence models via delay embedding theory.
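
The delay-embedding construction at issue is the classical Takens-style one: stack lagged copies of an observed scalar to unfold the hidden state. A minimal version, with an oscillator example chosen for illustration:

# Classical delay embedding: reconstruct hidden state from lagged observations.
import numpy as np

def delay_embed(x: np.ndarray, dim: int, lag: int) -> np.ndarray:
    """Map a scalar series x[t] to vectors (x[t], x[t+lag], ..., x[t+(dim-1)*lag])."""
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag : i * lag + n] for i in range(dim)])

# Example: observing only one coordinate of a noisy oscillator.
t = np.linspace(0, 40, 2000)
obs = np.sin(t) + 0.05 * np.random.default_rng(0).normal(size=t.size)
Z = delay_embed(obs, dim=3, lag=25)   # shape (1950, 3), unfolds the limit cycle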

Updated: 2024-06-17 18:07:16

标题: 神经序列模型的延迟嵌入理论

摘要: 为了生成连贯的回应,语言模型需要从其输入文本序列中推断未观察到的含义。这种能力的一个潜在解释来自于动力系统中的延迟嵌入理论,该理论证明了未观察变量可以从仅有少数观察变量的历史中恢复出来。为了测试语言模型是否有效地构建延迟嵌入,我们测量了序列模型重建未观察动态的能力。我们训练了1层transformer解码器和状态空间序列模型,用嘈杂的、部分观察到的时间序列数据进行下一步预测。我们发现每个序列层都可以学习到底层系统的一个可行嵌入。然而,状态空间模型比transformer有更强的归纳偏差,特别是它们更有效地在初始化阶段重建未观察信息,从而导致更具参数效率的模型和更低的动态任务误差。因此,我们的工作通过延迟嵌入理论在动力系统与深度学习序列模型之间建立了一种新颖的联系。

更新时间: 2024-06-17 18:07:16

领域: cs.LG,cs.NE

下载: http://arxiv.org/abs/2406.11993v1

Decomposed evaluations of geographic disparities in text-to-image models

Recent work has identified substantial disparities in generated images of different geographic regions, including stereotypical depictions of everyday objects like houses and cars. However, existing measures for these disparities have been limited to either human evaluations, which are time-consuming and costly, or automatic metrics evaluating full images, which are unable to attribute these disparities to specific parts of the generated images. In this work, we introduce a new set of metrics, Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG), that allows us to separately measure geographic disparities in the depiction of objects and backgrounds in generated images. Using Decomposed-DIG, we audit a widely used latent diffusion model and find that generated images depict objects with better realism than backgrounds and that backgrounds in generated images tend to contain larger regional disparities than objects. We use Decomposed-DIG to pinpoint specific examples of disparities, such as stereotypical background generation for Africa, difficulty generating modern vehicles for Africa, and unrealistic placement of some objects in outdoor settings. Informed by our metric, we use a new prompting structure that enables a 52% worst-region improvement and a 20% average improvement in generated background diversity.

Updated: 2024-06-17 18:04:23

标题: 文本到图像模型中地理差异的分解评估

摘要: 最近的研究发现,不同地理区域生成的图像存在明显的差异,包括对房屋和汽车等日常物品的刻板描绘。然而,现有用于衡量这些差异的方法要么是人工评估,耗时且成本高昂,要么是评估完整图像的自动指标,无法将这些差异归因于生成图像的具体部分。在这项工作中,我们引入了一组新的指标,即图像生成中的差异分解指标(Decomposed-DIG),使我们能够分别衡量生成图像中物体和背景的地理差异。利用Decomposed-DIG,我们对一个广泛使用的潜在扩散模型进行审计,发现生成的图像在描绘物体时比背景更具现实感,而生成图像中的背景往往包含比物体更大的区域差异。我们使用Decomposed-DIG来指出特定的差异例子,如在非洲生成刻板化的背景、在非洲难以生成现代车辆,以及在户外环境中不切实际地放置一些物体。受我们的指标启发,我们使用一种新的提示结构,使最差地区的改进率提高了52%,平均生成背景多样性提高了20%。

更新时间: 2024-06-17 18:04:23

领域: cs.CV,cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2406.11988v1

The Decisive Power of Indecision: Low-Variance Risk-Limiting Audits and Election Contestation via Marginal Mark Recording

Risk-limiting audits (RLAs) are techniques for verifying the outcomes of large elections. While they provide rigorous guarantees of correctness, widespread adoption has been impeded by both efficiency concerns and the fact they offer statistical, rather than absolute, conclusions. We attend to both of these difficulties, defining new families of audits that improve efficiency and offer qualitative advances in statistical power. Our new audits are enabled by revisiting the standard notion of a cast-vote record so that it can declare multiple possible mark interpretations rather than a single decision; this can reflect the presence of marginal marks, which appear regularly on hand-marked ballots. We show that this simple expedient can offer significant efficiency improvements with only minor changes to existing auditing infrastructure. We consider two ways of representing these marks, both yield risk-limiting comparison audits in the formal sense of Fuller, Harrison, and Russell (IEEE Security & Privacy 2023). We then define a new type of post-election audit we call a contested audit. These permit each candidate to provide a cast-vote record table advancing their own claim to victory. We prove that these audits offer remarkable sample efficiency, yielding control of risk with a constant number of samples (that is independent of margin). This is a first for an audit with provable soundness. These results are formulated in a game-based security model that specify quantitative soundness and completeness guarantees. These audits provide a means to handle contestation of election results affirmed by conventional RLAs.

Updated: 2024-06-17 18:04:22

标题: 摇摆不定的决定力:低方差风险限制审计和通过边际标记记录进行选举争议

摘要: 风险限制审计(RLA)是验证大型选举结果的技术。虽然它们提供了严格的正确性保证,但由于效率问题和它们提供的是统计结论而不是绝对结论,它们的广泛采用受到阻碍。我们关注这两个困难,定义了改进效率并在统计力量方面提供质的进步的新型审计家族。 我们的新型审计是通过重新审视投票记录的标准概念而实现的,使其能够声明多种可能的标记解释,而不是单一决定;这可以反映手工填写的选票上经常出现的边缘标记的存在。我们展示了这一简单的手段可以通过对现有审计基础设施进行轻微改变,提供显著的效率改进。我们考虑两种表示这些标记的方式,都在 Fuller、Harrison 和 Russell 的形式意义上产生风险限制比较审计(IEEE 安全与隐私 2023)。 然后我们定义了一种称为有争议审计的新型后选举审计。这允许每位候选人提供一个投票记录表,推进他们自己对胜利的主张。我们证明这些审计提供了显著的样本效率,以恒定数量的样本(独立于边际)控制风险。这是一个具有可证明合理性的审计的首次。这些结果在一个基于游戏的安全模型中制定了规定数量的合理性和完整性保证。这些审计提供了一种处理选举结果争议的手段,这些结果是由传统RLA确认的。

更新时间: 2024-06-17 18:04:22

领域: cs.CR

下载: http://arxiv.org/abs/2402.06515v4

Online Pareto-Optimal Decision-Making for Complex Tasks using Active Inference

When a robot autonomously performs a complex task, it frequently must balance competing objectives while maintaining safety. This becomes more difficult in uncertain environments with stochastic outcomes. Enhancing transparency in the robot's behavior and aligning with user preferences are also crucial. This paper introduces a novel framework for multi-objective reinforcement learning that ensures safe task execution, optimizes trade-offs between objectives, and adheres to user preferences. The framework has two main layers: a multi-objective task planner and a high-level selector. The planning layer generates a set of optimal trade-off plans that guarantee satisfaction of a temporal logic task. The selector uses active inference to decide which generated plan best complies with user preferences and aids learning. Operating iteratively, the framework updates a parameterized learning model based on collected data. Case studies and benchmarks on both manipulation and mobile robots show that our framework outperforms other methods and (i) learns multiple optimal trade-offs, (ii) adheres to a user preference, and (iii) allows the user to adjust the balance between (i) and (ii).

Updated: 2024-06-17 18:03:45

标题: 使用主动推理进行在线帕累托最优决策的复杂任务

摘要: 当机器人自主执行复杂任务时,经常需要在保持安全的同时平衡竞争目标。在不确定的具有随机结果的环境中,这变得更加困难。增强机器人行为的透明度并与用户偏好一致也是至关重要的。本文介绍了一个新颖的多目标强化学习框架,确保安全任务执行,优化目标之间的权衡,并符合用户偏好。该框架有两个主要层次:多目标任务规划器和高层选择器。规划层生成一组保证满足时间逻辑任务的最佳权衡计划。选择器使用主动推理来决定哪个生成的计划最符合用户偏好并有助于学习。该框架迭代地操作,基于收集的数据更新参数化学习模型。对操纵和移动机器人进行的案例研究和基准测试表明,我们的框架优于其他方法,并且(i)学习多个最佳权衡,(ii)符合用户偏好,(iii)允许用户调整(i)和(ii)之间的平衡。

更新时间: 2024-06-17 18:03:45

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2406.11984v1

Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways

Manually annotating data for computational social science tasks can be costly, time-consuming, and emotionally draining. While recent work suggests that LLMs can perform such annotation tasks in zero-shot settings, little is known about how prompt design impacts LLMs' compliance and accuracy. We conduct a large-scale multi-prompt experiment to test how model selection (ChatGPT, PaLM2, and Falcon7b) and prompt design features (definition inclusion, output type, explanation, and prompt length) impact the compliance and accuracy of LLM-generated annotations on four CSS tasks (toxicity, sentiment, rumor stance, and news frames). Our results show that LLM compliance and accuracy are highly prompt-dependent. For instance, prompting for numerical scores instead of labels reduces all LLMs' compliance and accuracy. The overall best prompting setup is task-dependent, and minor prompt changes can cause large changes in the distribution of generated labels. By showing that prompt design significantly impacts the quality and distribution of LLM-generated annotations, this work serves as both a warning and practical guide for researchers and practitioners.

Updated: 2024-06-17 18:01:43

标题: 提示设计对计算社会科学任务至关重要,但其影响方式难以预测

摘要: 手动为计算社会科学任务注释数据可能是昂贵的、耗时的、并且消耗精力的。尽管最近的工作表明LLMs可以在零次设置中执行这种注释任务,但对于提示设计如何影响LLMs的遵从性和准确性知之甚少。我们进行了一项大规模多提示实验,测试了模型选择(ChatGPT、PaLM2和Falcon7b)和提示设计特征(定义包含、输出类型、解释和提示长度)对LLM生成的注释在四个CSS任务(毒性、情绪、谣言立场和新闻框架)的遵从性和准确性的影响。我们的结果表明,LLM的遵从性和准确性高度依赖于提示。例如,提示要求数值分数而不是标签会降低所有LLMs的遵从性和准确性。总体最佳提示设置取决于任务,微小的提示更改可能导致生成的标签分布发生较大变化。通过展示提示设计如何显著影响LLM生成的注释的质量和分布,这项工作既作为警示,也为研究人员和实践者提供了实用指南。

更新时间: 2024-06-17 18:01:43

领域: cs.AI,cs.CY

下载: http://arxiv.org/abs/2406.11980v1

Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

We present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4's performance. We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.

Updated: 2024-06-17 18:01:32

标题: 对话行动令牌:使用多轮规划器引导目标导向对话中的语言模型

摘要: 我们提出了一种称为对话行动令牌(DAT)的方法,该方法使语言模型代理能够规划目标导向的对话。其核心思想是将每个话语视为一种行动,从而将对话转化为游戏,可以应用强化学习等现有方法。具体而言,我们冻结了一个预训练的语言模型,并训练一个小的规划模型,该模型预测连续的行动向量,用于在每一轮中进行受控生成。这种设计避免了在奖励优化下语言质量下降的问题。在Sotopia社会模拟平台上评估时,DAT引导的LLaMA模型超过了GPT-4的性能。我们还将DAT应用于引导一种攻击者语言模型,在新颖的多轮红队对抗设置中揭示了潜在的新攻击面。

更新时间: 2024-06-17 18:01:32

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11978v1

Compressed representation of brain genetic transcription

The architecture of the brain is too complex to be intuitively surveyable without the use of compressed representations that project its variation into a compact, navigable space. The task is especially challenging with high-dimensional data, such as gene expression, where the joint complexity of anatomical and transcriptional patterns demands maximum compression. Established practice is to use standard principal component analysis (PCA), whose computational felicity is offset by limited expressivity, especially at great compression ratios. Employing whole-brain, voxel-wise Allen Brain Atlas transcription data, here we systematically compare compressed representations based on the most widely supported linear and non-linear methods-PCA, kernel PCA, non-negative matrix factorization (NMF), t-stochastic neighbour embedding (t-SNE), uniform manifold approximation and projection (UMAP), and deep auto-encoding-quantifying reconstruction fidelity, anatomical coherence, and predictive utility with respect to signalling, microstructural, and metabolic targets. We show that deep auto-encoders yield superior representations across all metrics of performance and target domains, supporting their use as the reference standard for representing transcription patterns in the human brain.
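
The comparison protocol can be sketched as below, with synthetic data standing in for the voxel-wise transcription volumes; this illustrates the setup (fixed compression ratio, reconstruction fidelity as mean squared error), not the paper's result, and the network sizes are arbitrary.

# Sketch: reconstruction fidelity of PCA vs. a small deep autoencoder at one
# compression ratio (synthetic stand-in data; sizes are illustrative).
import numpy as np
import torch
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(2000, 64)).astype(np.float32)
X[:, :8] *= 4.0                                    # a few dominant directions
k = 4                                              # latent dimension (high compression)

pca = PCA(n_components=k).fit(X)
pca_err = ((X - pca.inverse_transform(pca.transform(X))) ** 2).mean()

net = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, k), torch.nn.Linear(k, 32),
                          torch.nn.ReLU(), torch.nn.Linear(32, 64))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
Xt = torch.from_numpy(X)
for _ in range(500):
    loss = ((net(Xt) - Xt) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"PCA MSE={pca_err:.4f}  autoencoder MSE={loss.item():.4f}")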

Updated: 2024-06-17 18:01:22

标题: 大脑基因转录的压缩表示

摘要: 大脑的结构太复杂,无法直观地进行调查,除非使用将其变化投影到紧凑、可导航空间中的压缩表示。在高维数据(如基因表达)中,这项任务尤为具有挑战性,其中解剖和转录模式的联合复杂性要求最大程度的压缩。现行做法是使用标准的主成分分析(PCA),其计算便利性受到有限的表达能力的限制,尤其是在大幅度压缩比率下。利用整个大脑的基于体素的Allen Brain Atlas转录数据,我们在这里系统地比较了基于最广泛支持的线性和非线性方法(PCA、核PCA、非负矩阵分解(NMF)、t-分布邻居嵌入(t-SNE)、均匀流形逼近和投影(UMAP)以及深度自动编码)的压缩表示,量化重建准确性、解剖一致性和预测效用,关于信号、微结构和代谢靶点。我们表明,深度自动编码器在所有性能和目标领域的度量中产生出色的表示,支持它们作为人类大脑中转录模式的表示的参考标准。

更新时间: 2024-06-17 18:01:22

领域: cs.LG,q-bio.GN,q-bio.NC

下载: http://arxiv.org/abs/2310.16113v2

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.

Updated: 2024-06-17 17:59:58

标题: mDPO:多模态大型语言模型的条件偏好优化

摘要: 直接偏好优化(DPO)已被证明是大型语言模型(LLM)对齐的有效方法。最近的研究尝试将DPO应用于多模态场景,但发现难以实现一致改进。通过对比实验,我们确定了多模态偏好优化中的无条件偏好问题,即模型忽视了图像条件。为了解决这个问题,我们提出了mDPO,这是一个多模态DPO目标,通过优化图像偏好也可防止过度优先考虑仅语言的偏好。此外,我们引入了一个奖励锚点,强制选择响应的奖励为正,从而避免它们的可能性减少 - 这是相对偏好优化的固有问题。对两种不同大小的多模态LLM和三个广泛使用的基准数据集进行的实验表明,mDPO有效地解决了多模态偏好优化中的无条件偏好问题,并显著提高了模型性能,特别是减少了幻觉。

更新时间: 2024-06-17 17:59:58

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11839v1

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to find the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.

Updated: 2024-06-17 17:59:47

标题: MMDU:多轮多图像对话理解基准和用于LVLMs的指导调整数据集

摘要: 生成自然且有意义的回应以与多模态人类输入进行交流是大型视觉语言模型(LVLMs)的基本能力。尽管当前开源LVLMs在简化场景(如单轮单图像输入)中表现出有希望的性能,但它们在现实世界的对话场景(如在具有多轮和多图像的长篇历史中遵循指令)中表现不足。现有的LVLM基准主要侧重于单选问题或简短回答,这并不能充分评估LVLMs在真实世界人机交互应用中的能力。因此,我们介绍了MMDU,一个全面的基准测试,以及MMDU-45k,一个大规模的指令调整数据集,旨在评估和提高LVLMs在多轮和多图像对话中的能力。我们利用聚类算法从开源维基百科中找到相关图像和文本描述,并在人类注释员的协助下使用GPT-4o模型构建问题-答案对。MMDU最多拥有18k个图像+文本令牌,20个图像和27轮对话,比先前的基准测试至少长5倍,并对当前的LVLMs提出挑战。我们对使用MMDU的15个代表性LVLM进行深入分析,发现由于缺乏对话指令调整数据,开源LVLMs落后于闭源对手。我们证明,对开源LVLMs在MMDU-45k上进行微调显著弥补了这一差距,生成更长更准确的对话,并提高了MMDU和现有基准测试的分数(MMStar:+1.1%,MathVista:+1.5%,ChartQA:+1.2%)。我们的贡献为弥合当前LVLM模型和真实世界应用需求之间的差距铺平了道路。该项目可在https://github.com/Liuziyu77/MMDU上找到。

更新时间: 2024-06-17 17:59:47

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11833v1

Language Modeling with Editable External Knowledge

When the world changes, so does the text that humans write about it. How do we build language models that can be easily updated to reflect these changes? One popular approach is retrieval-augmented generation, in which new documents are inserted into a knowledge base and retrieved during prediction for downstream tasks. Most prior work on these systems have focused on improving behavior during prediction through better retrieval or reasoning. This paper introduces ERASE, which instead improves model behavior when new documents are acquired, by incrementally deleting or rewriting other entries in the knowledge base each time a document is added. In two new benchmark datasets evaluating models' ability to answer questions about a stream of news articles or conversations, ERASE improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute. Code and data are available at https://github.com/belindal/ERASE
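
A minimal sketch of the acquisition-time update loop described above. The retrieval and LLM helpers are placeholders (the verdict labels are assumed for illustration); the released implementation at the linked repository is the authoritative version.

# Skeleton of an ERASE-style knowledge-base update on document acquisition
# (helpers are placeholders, not the released implementation).
def retrieve_related(kb: list[str], doc: str, k: int = 5) -> list[str]:
    raise NotImplementedError  # e.g. dense retrieval over kb entries

def llm_relation(entry: str, doc: str) -> str:
    raise NotImplementedError  # assumed labels: "consistent"/"contradicted"/"outdated"

def llm_rewrite(entry: str, doc: str) -> str:
    raise NotImplementedError  # rewrite entry to agree with the new document

def add_document(kb: list[str], doc: str) -> list[str]:
    for entry in retrieve_related(kb, doc):
        verdict = llm_relation(entry, doc)
        if verdict == "contradicted":
            kb.remove(entry)                               # delete stale facts
        elif verdict == "outdated":
            kb[kb.index(entry)] = llm_rewrite(entry, doc)  # rewrite in place
    kb.append(doc)
    return kb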

Updated: 2024-06-17 17:59:35

标题: 可编辑外部知识的语言建模

摘要: 当世界发生变化时,人类书写的文本也会发生变化。我们如何构建可以轻松更新以反映这些变化的语言模型?一种流行的方法是检索增强生成,其中新文档被插入到知识库中,并在预测期间检索以进行下游任务。大多数先前关于这些系统的工作都集中在通过更好的检索或推理来改进预测行为。本文介绍了ERASE,它通过逐渐删除或重写知识库中的其他条目来在获取新文档时改进模型行为。在评估模型回答关于新闻文章流或对话的问题能力的两个新基准数据集中,ERASE相对于传统的检索增强生成提高了7-13%(Mixtral-8x7B)和6-10%(Llama-3-8B)的准确度。代码和数据可在https://github.com/belindal/ERASE上找到。

更新时间: 2024-06-17 17:59:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11830v1

Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

We study the computational and sample complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure, that is, $f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M f_m(\langle x, v_m\rangle)$, where $f_1,f_2,...,f_M:\mathbb{R}\to\mathbb{R}$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features $\{v_m\}_{m=1}^M$, and the number of additive tasks $M$ grows with the dimensionality $M\asymp d^\gamma$ for $\gamma\ge 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural network, and large-scale pretraining where the model simultaneously acquires a large number of "skills" that are often localized in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with a polynomial statistical and computational complexity that depends on the number of tasks $M$ and the information exponent of $f_m$, despite the unknown link function and $M$ growing with the dimensionality. We complement this learnability guarantee with computational hardness result by establishing statistical query (SQ) lower bounds for both the correlational SQ and full SQ algorithms.

Updated: 2024-06-17 17:59:17

标题: 学习多种特征的总和:岭组合的计算困难性和高效基于梯度的训练

摘要: 我们研究学习具有加法结构的目标函数$f_*:\mathbb{R}^d\to\mathbb{R}$的计算和样本复杂性,即$f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M f_m(\langle x, v_m\rangle)$,其中$f_1,f_2,...,f_M:\mathbb{R}\to\mathbb{R}$是单指数模型(岭函数)的非线性链接函数,具有多样且接近正交的指数特征$\{v_m\}_{m=1}^M$,并且加法任务数量$M$随着维度$M\asymp d^\gamma$增长,其中$\gamma\ge 0$。这个问题的设定受经典加法模型文献、最近的两层神经网络表示学习理论以及大规模预训练的启发,其中模型同时获取了许多常常局部化在训练网络的不同部分的“技能”。我们证明,通过梯度下降训练两层神经网络,可以有效地学习多项式$f_*$的大子集,其统计和计算复杂度是多项式的,取决于任务数量$M$和$f_m$的信息指数,尽管未知链接函数和$M$随着维度增长。我们通过建立统计查询(SQ)下限结果,为相关SQ和完整SQ算法都证明了计算困难性结果,来补充这个可学习性保证。

更新时间: 2024-06-17 17:59:17

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.11828v1

WPO: Enhancing RLHF with Weighted Preference Optimization

Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at https://github.com/wzhouad/WPO.
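
One plausible reading of the reweighting idea is sketched below: each off-policy preference pair is weighted by its (length-normalized) probability under the current policy before entering a DPO-style objective. The exact weighting scheme in the paper may differ, and beta is an arbitrary placeholder.

# Sketch of weighted preference optimization (one reading of the reweighting idea).
import torch
import torch.nn.functional as F

def wpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, len_w, len_l, beta=0.1):
    """All arguments are (batch,) tensors: summed log-probs of the chosen (w) and
    rejected (l) responses under the policy (pi) and reference (ref) models,
    plus response lengths."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    per_pair = -F.logsigmoid(margin)                 # standard DPO-style loss
    # Reweight pairs by how probable they are under the current policy
    # (length-normalized, detached so the weight is not itself optimized).
    w = torch.exp(pi_logp_w / len_w + pi_logp_l / len_l).detach()
    w = w / w.sum()
    return (w * per_pair).sum()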

Updated: 2024-06-17 17:59:13

标题: WPO: 使用加权偏好优化增强RLHF

摘要: 人类反馈强化学习(RLHF)是将大型语言模型(LLMs)与人类价值观更紧密对齐的一种有前途的解决方案。离策略偏好优化,其中偏好数据是从其他模型获取的,由于成本效益和可扩展性而被广泛采用。然而,离策略偏好优化通常受到用于数据收集的策略和目标策略之间分布差异的影响,导致次优化。在本文中,我们提出了一种新颖的策略,通过模拟使用离策略偏好数据进行策略学习来缓解这个问题。我们的加权偏好优化(WPO)方法通过根据当前策略下偏好对的概率对偏好对进行重新加权,使离策略数据更接近于策略数据。这种方法不仅解决了分布差距问题,还增强了优化过程,而不会产生额外成本。我们在包括Alpaca Eval 2和MT-bench在内的指令跟随基准上验证了我们的方法。WPO不仅在Alpaca Eval 2上比直接偏好优化(DPO)表现出高达5.6%的优势,还基于Llama-3-8B-Instruct建立了出色的长度受控胜率,对抗GPT-4-turbo的胜率为48.6%,使其成为排行榜上最强大的8B模型。我们将在https://github.com/wzhouad/WPO发布代码和模型。

更新时间: 2024-06-17 17:59:13

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11827v1

Spectral Introspection Identifies Group Training Dynamics in Deep Neural Networks for Neuroimaging

Neural networks, which have had a profound effect on how researchers study complex phenomena, do so through a complex, nonlinear mathematical structure which can be difficult for human researchers to interpret. This obstacle can be especially salient when researchers want to better understand the emergence of particular model behaviors such as bias, overfitting, overparametrization, and more. In Neuroimaging, the understanding of how such phenomena emerge is fundamental to preventing and informing users of the potential risks involved in practice. In this work, we present a novel introspection framework for Deep Learning on Neuroimaging data, which exploits the natural structure of gradient computations via the singular value decomposition of gradient components during reverse-mode auto-differentiation. Unlike post-hoc introspection techniques, which require fully-trained models for evaluation, our method allows for the study of training dynamics on the fly, and even more interestingly, allows for the decomposition of gradients based on which samples belong to particular groups of interest. We demonstrate how the gradient spectra for several common deep learning models differ between schizophrenia and control participants from the COBRE study, and illustrate how these trajectories may reveal specific training dynamics helpful for further analysis.
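
The core computation can be sketched as collecting per-sample gradients of a layer for each group of interest and examining their singular value spectra. The naive loop below is illustrative only; the framework described above works inside reverse-mode auto-differentiation itself, and the model, data, and group labels are placeholders.

# Sketch of group-wise gradient spectra (illustrative stand-in computation).
import torch

model = torch.nn.Linear(32, 2)
loss_fn = torch.nn.CrossEntropyLoss()
X, y = torch.randn(64, 32), torch.randint(0, 2, (64,))
group = torch.randint(0, 2, (64,))          # e.g. patient vs. control labels

def group_spectrum(g):
    grads = []
    for i in torch.where(group == g)[0]:
        model.zero_grad()
        loss_fn(model(X[i:i+1]), y[i:i+1]).backward()
        grads.append(model.weight.grad.flatten().clone())
    G = torch.stack(grads)                  # (samples_in_group, n_params)
    return torch.linalg.svdvals(G)          # one singular value spectrum per group

spec_patients, spec_controls = group_spectrum(0), group_spectrum(1)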

Updated: 2024-06-17 17:58:15

标题: 光谱内省在神经影像深度神经网络中识别群体训练动态

摘要: 神经网络对研究人员研究复杂现象产生了深远影响,通过复杂的非线性数学结构实现这一目的,这可能会让人类研究者难以解释。当研究人员希望更好地理解特定模型行为的出现,如偏见、过拟合、过参数化等时,这一障碍尤为明显。在神经影像学中,了解这些现象如何出现对于预防和告知用户实践中可能涉及的潜在风险至关重要。在这项工作中,我们提出了一种新颖的深度学习神经影像数据内省框架,利用梯度计算的自然结构,通过反向模式自动微分中梯度分量的奇异值分解。与需要完全训练模型进行评估的事后内省技术不同,我们的方法允许动态研究训练动态,并且更有趣的是,允许基于样本属于特定感兴趣群体的分解梯度。我们展示了几种常见深度学习模型在COBRE研究中精神分裂症和对照参与者之间的梯度光谱差异,并说明这些轨迹可能揭示有助于进一步分析的特定训练动态。

更新时间: 2024-06-17 17:58:15

领域: cs.LG,eess.IV,q-bio.NC

下载: http://arxiv.org/abs/2406.11825v1

Embodied Instruction Following in Unknown Environments

Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying the existing approaches for the unknown environment usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in the unknown environment, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to demonstrate the known visual clues, where the goal of task planning and scene exploration is aligned for human instruction. For the task planner, we generate the feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method can achieve 45.09% success rate in 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.

Updated: 2024-06-17 17:55:40

标题: 在未知环境中的体现式指导跟随

摘要: 使具有实体的代理能够根据自然语言完成复杂的人类指令对于家庭服务中的自主系统至关重要。传统方法只能在已知环境中完成人类指令,其中所有交互对象都提供给了实体代理,并且直接将现有方法部署到未知环境通常会生成操作不存在对象的不可行计划。相反,我们提出了一种用于未知环境中的复杂任务的实体指令跟随(EIF)方法,其中代理有效地探索未知环境以生成使用现有对象完成抽象指令的可行计划。具体来说,我们构建了一个包括高级任务规划器和低级探索控制器的分层实体指令跟随框架,其中包含多模态大语言模型。然后,我们利用动态区域注意力构建了场景的语义表示地图,以展示已知的视觉线索,其中任务规划和场景探索的目标与人类指令一致。对于任务规划器,我们根据任务完成过程和已知的视觉线索为人类目标实现生成可行的逐步计划。对于探索控制器,基于生成的逐步计划和已知的视觉线索预测最佳导航或对象交互策略。实验结果表明,我们的方法可以在大型房屋级场景中完成204个复杂的人类指令,如制作早餐和整理房间,成功率达到45.09%。

更新时间: 2024-06-17 17:55:40

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2406.11818v1

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.

Updated: 2024-06-17 17:55:38

标题: 迭代长度正则化直接偏好优化:改进7B语言模型至GPT-4水平的案例研究

摘要: 直接偏好优化(DPO)是一种用于将语言模型与人类偏好对齐的标准方法,传统上应用于离线偏好。最近的研究表明,DPO受益于使用经过训练的奖励模型标记的在线偏好进行迭代训练。在这项工作中,我们确定了普通迭代DPO的一个缺陷 - 响应质量的提升可能会导致更加冗长。为了解决这个问题,我们引入了迭代长度正则化DPO(iLR-DPO)来惩罚响应长度。我们的实证结果表明,iLR-DPO可以使一个7B模型在不增加冗长的情况下表现与GPT-4相当。具体来说,我们的7B模型在AlpacaEval 2.0上对GPT-4 Preview取得了50.5%的长度控制胜率,并在MT-Bench、Arena-Hard和OpenLLM排行榜等标准基准上表现出色。这些结果表明了迭代DPO在将语言模型与人类反馈对齐方面的有效性。

更新时间: 2024-06-17 17:55:38

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11817v1

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less than desired. To address this, we introduce LLARVA, a model trained with a novel instruction tuning method that leverages structured prompts to unify a range of robotic learning tasks, scenarios, and environments. Additionally, we show that predicting intermediate 2-D representations, which we refer to as "visual traces", can help further align vision and action spaces for robot learning. We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model, and we evaluate on 12 different tasks in the RLBench simulator as well as a physical Franka Emika Panda 7-DoF robot. Our experiments yield strong performance, demonstrating that LLARVA - using 2-D and language representations - performs well compared to several contemporary baselines, and can generalize across various robot environments and configurations.

Updated: 2024-06-17 17:55:29

标题: LLARVA:视觉-动作指导调整增强机器人学习

摘要: 近年来,经过指导调整的大型多模态模型(LMM)在几项任务中取得了成功,包括图像字幕和视觉问答;然而,如何利用这些模型仍然是机器人领域的一个悬而未决的问题。先前用于机器人应用的LMM在语言和动作数据上进行了广泛训练,但它们在不同环境中的泛化能力往往不如人意。为了解决这个问题,我们引入了LLARVA,这是一个通过使用结构化提示进行指导调整训练的模型,统一了一系列机器人学习任务、场景和环境。此外,我们还展示了预测中间2-D表示,我们称之为“视觉迹”,可以帮助进一步对齐机器人学习中的视觉和动作空间。我们从Open X-Embodiment数据集中生成了850万个图像-视觉迹对,以预先训练我们的模型,并在RLBench模拟器上评估了12个不同的任务,以及一个物理的Franka Emika Panda 7自由度机器人。我们的实验表现出强大的性能,表明LLARVA - 使用2-D和语言表示 - 与几种当代基线相比表现良好,并且可以泛化到各种机器人环境和配置。

更新时间: 2024-06-17 17:55:29

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.11815v1

Stochastic Neural Network Symmetrisation in Markov Categories

We consider the problem of symmetrising a neural network along a group homomorphism: given a homomorphism $\varphi : H \to G$, we would like a procedure that converts $H$-equivariant neural networks into $G$-equivariant ones. We formulate this in terms of Markov categories, which allows us to consider neural networks whose outputs may be stochastic, but with measure-theoretic details abstracted away. We obtain a flexible, compositional, and generic framework for symmetrisation that relies on minimal assumptions about the structure of the group and the underlying neural network architecture. Our approach recovers existing methods for deterministic symmetrisation as special cases, and extends directly to provide a novel methodology for stochastic symmetrisation also. Beyond this, we believe our findings also demonstrate the utility of Markov categories for addressing problems in machine learning in a conceptual yet mathematically rigorous way.

Updated: 2024-06-17 17:54:42

标题: 在马尔可夫类别中的随机神经网络对称化

摘要: 我们考虑将神经网络沿着一个群同态对称化的问题:给定一个同态$\varphi:H \to G$,我们希望找到一个将$H$-等变神经网络转化为$G$-等变神经网络的方法。我们将此问题表述为马尔可夫范畴的术语,这使我们能够考虑神经网络的输出可能是随机的情况,但测度论细节被抽象化。我们得到了一个灵活、组合和通用的对称化框架,它依赖于对群的结构和底层神经网络架构的最小假设。我们的方法将现有的确定性对称化方法作为特例,同时直接扩展到提供一种新颖的随机对称化方法。此外,我们认为我们的发现还展示了马尔可夫范畴在以概念性而数学严密的方式解决机器学习问题方面的实用性。

更新时间: 2024-06-17 17:54:42

领域: stat.ML,cs.LG,math.CT

下载: http://arxiv.org/abs/2406.11814v1

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.

Updated: 2024-06-17 17:52:54

标题: RepLiQA:一个用于在看不见的参考内容上对LLMs进行基准测试的问答数据集

摘要: 大型语言模型(LLMs)是在大量数据上进行训练的,其中大部分数据是从互联网自动抓取的。这些数据包括拥有大量通识知识的百科全书文档(例如维基百科),但也可能与用于评估LLMs的基准数据集重叠。因此,在可能泄漏到训练集中的测试拆分上评估模型容易导致误导性结论。为了促进语言模型的有效评估,我们引入了一个名为RepLiQA的新的测试数据集,适用于问答和主题检索任务。RepLiQA是一个由五个测试集拆分组成的集合,其中四个在此之前未发布到互联网或暴露给LLM API。RepLiQA中的每个样本包括(1)由人类注释者精心制作的参考文档,描述一个虚构的场景(例如一篇新闻文章),该场景不存在于互联网上;(2)关于文档主题的问题;(3)直接从文档中的信息中提取出的真实答案;以及(4)包含答案的从参考文档中提取的段落。因此,只有当模型能够在提供的文档中找到相关内容时,才能生成准确的答案。我们进行了一项大规模基准测试,包括几个最先进的LLMs,以揭示在上下文条件语言建模设置中不同类型和大小的模型之间的性能差异。RepLiQA的发布拆分可以在此处找到:https://huggingface.co/datasets/ServiceNow/repliqa。

更新时间: 2024-06-17 17:52:54

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11811v1

Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics

We study computationally and statistically efficient Reinforcement Learning algorithms for the linear Bellman Complete setting, a setting that uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDP) and Linear Quadratic Regulators (LQR). While it is known from the prior works that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards but relies on the underlying dynamics to be deterministic. Our approach is based on randomization: we inject random noise into least square regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise to only act in the null space of the training data to ensure optimism while circumventing a subtle error amplification issue.
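
The null-space trick can be illustrated with plain linear algebra: sample Gaussian noise restricted to the null space of the training feature matrix, so fitted values on observed data are untouched while unseen directions receive optimism. Shapes and scales below are arbitrary, and the full algorithm specifies how this noise enters value iteration.

# Illustration of noise injected only in the null space of the training features.
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 10)); Phi[:, 7:] = 0.0  # data spans only 7 of 10 dims
y = rng.normal(size=50)
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]     # least-squares value estimate

_, s, Vt = np.linalg.svd(Phi)
null_basis = Vt[s < 1e-8]                          # directions unseen in the data
theta_opt = theta + null_basis.T @ rng.normal(size=null_basis.shape[0])

# Predictions on observed features are unchanged; only unseen directions move.
assert np.allclose(Phi @ theta, Phi @ theta_opt)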

Updated: 2024-06-17 17:52:38

标题: 在确定性动力学下的线性贝尔曼完备性下的计算有效强化学习

摘要: 我们研究了用于线性Bellman完整设置的计算和统计有效的强化学习算法,该设置使用线性函数逼近来捕捉价值函数,并统一了现有模型,如线性马尔可夫决策过程(MDP)和线性二次调节器(LQR)。尽管先前的研究表明这种设置在统计上是可行的,但是否存在计算效率高的算法仍然是一个未解决的问题。我们的工作为线性Bellman完整设置提供了一个计算效率高的算法,适用于具有大动作空间、随机初始状态和随机奖励的MDP,但依赖于底层动态是确定性的。我们的方法基于随机化:我们向最小二乘回归问题注入随机噪声以执行乐观的价值迭代。我们的关键技术贡献是精心设计噪声,使其仅在训练数据的零空间中起作用,以确保乐观主义,同时规避一个微妙的误差放大问题。

更新时间: 2024-06-17 17:52:38

领域: cs.LG,cs.RO,cs.SY,eess.SY

下载: http://arxiv.org/abs/2406.11810v1

Physics-Constrained Learning for PDE Systems with Uncertainty Quantified Port-Hamiltonian Models

Modeling the dynamics of flexible objects has become an emerging topic in the community as these objects become more present in many applications, e.g., soft robotics. Due to the properties of flexible materials, the movements of soft objects are often highly nonlinear and, thus, complex to predict. Data-driven approaches seem promising for modeling those complex dynamics but often neglect basic physical principles, which consequently makes them untrustworthy and limits generalization. To address this problem, we propose a physics-constrained learning method that combines powerful learning tools and reliable physical models. Our method leverages the data collected from observations by sending them into a Gaussian process that is physically constrained by a distributed Port-Hamiltonian model. Based on the Bayesian nature of the Gaussian process, we not only learn the dynamics of the system, but also enable uncertainty quantification. Furthermore, the proposed approach preserves the compositional nature of Port-Hamiltonian systems.

Updated: 2024-06-17 17:52:01

标题: 物理约束学习用于具有不确定性量化Port-Hamiltonian模型的PDE系统

摘要: 建模柔性物体的动态已成为社区中的一个新兴话题,因为这些物体在许多应用中变得越来越普遍,例如软体机器人。由于柔性材料的特性,软体物体的运动往往是高度非线性的,因此预测起来复杂。数据驱动方法似乎有望用于建模这些复杂的动态,但往往忽视基本的物理原理,从而使它们变得不可信赖并限制泛化能力。为了解决这个问题,我们提出了一种物理约束学习方法,结合了强大的学习工具和可靠的物理模型。我们的方法利用从观测中收集的数据,将其发送到受分布式Port-Hamiltonian模型物理约束的高斯过程中。基于高斯过程的贝叶斯性质,我们不仅学习系统的动态,还能进行不确定性量化。此外,所提出的方法保持了Port-Hamiltonian系统的合成性质。

更新时间: 2024-06-17 17:52:01

领域: cs.LG,cs.RO,cs.SY,eess.SY

下载: http://arxiv.org/abs/2406.11809v1

Ovis: Structural Embedding Alignment for Multimodal Large Language Model

Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- makes challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
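
The probabilistic lookup can be sketched directly from the description: each patch feature induces a softmax over a learnable visual embedding table, and the patch embedding is the probability-weighted mixture of table rows, mirroring a (soft) text-token lookup. Dimensions below are placeholders, not Ovis's actual configuration.

# Sketch of a probabilistic visual embedding table (dimensions are placeholders).
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    def __init__(self, patch_dim=768, vocab=8192, embed_dim=768):
        super().__init__()
        self.to_logits = nn.Linear(patch_dim, vocab)   # patch -> table indices
        self.table = nn.Embedding(vocab, embed_dim)    # learnable visual "vocabulary"

    def forward(self, patch_feats):                    # (batch, patches, patch_dim)
        probs = torch.softmax(self.to_logits(patch_feats), dim=-1)
        return probs @ self.table.weight               # probabilistic mix of rows

emb = VisualEmbeddingTable()(torch.randn(2, 196, 768))  # -> (2, 196, 768)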

Updated: 2024-06-17 17:51:50

标题: Ovis:多模态大型语言模型的结构嵌入对齐

摘要: 目前,当前的多模态大型语言模型(MLLMs)通常通过连接器(如MLP)将预先训练的LLM与另一个预先训练的视觉变换器集成在一起,赋予LLM视觉能力。然而,在MLLMs中两种嵌入策略之间的不一致性--基于嵌入查找表的结构文本嵌入和由视觉编码器直接生成的连续嵌入--对于更无缝地融合视觉和文本信息提出了挑战。我们提出了Ovis,这是一种新型的MLLM架构,旨在结构地对齐视觉和文本嵌入。Ovis将一个可学习的视觉嵌入表集成到视觉编码器的过程中。为了捕捉丰富的视觉语义,每个图像块多次索引视觉嵌入表,最终得到的视觉嵌入是索引嵌入的概率组合。这种结构化方法与生成文本嵌入的方法相似。在各种多模态基准评估中,Ovis优于类似参数规模的开源MLLMs,甚至整体超越专有模型Qwen-VL-Plus。这些结果突显了Ovis结构化视觉表示对于推进MLLM架构设计和促进更有效的多模态学习的潜力。代码、数据集和模型可在https://github.com/AIDC-AI/Ovis找到。

更新时间: 2024-06-17 17:51:50

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2405.20797v2

Efficient Discovery of Significant Patterns with Few-Shot Resampling

Significant pattern mining is a fundamental task in mining transactional data, which requires identifying patterns significantly associated with the value of a given feature, the target. In several applications, such as biomedicine, basket market analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population, or process, of which the dataset represents only a collection of observations, or samples. A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target. While several algorithms have been proposed to find statistically significant patterns, it remains a computationally demanding task, and for complex patterns such as subgroups, no efficient solution exists. We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require mining only a small number of resampled datasets, while providing high effectiveness in discovering significant patterns. As a test case, we consider significant subgroup mining, and our evaluation on several real datasets shows that FSR is effective in discovering significant subgroups, while requiring a small number of resampled datasets.
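
A toy version of the resampling step, restricted to singleton itemset patterns and a crude max-over-resamples threshold; FSR's actual bounds are tighter and cover richer pattern languages.

# Toy few-shot resampling to calibrate a supremum-deviation threshold (illustrative).
import numpy as np

rng = np.random.default_rng(0)
T = rng.random((500, 8)) < 0.3        # 500 transactions over 8 items
target = rng.random(500) < 0.5        # binary target feature
p = target.mean()

def sup_deviation(labels):
    """Max over (singleton) patterns of |joint freq - pattern freq * label rate|."""
    q = labels.mean()
    return max(abs((T[:, i] & labels).mean() - T[:, i].mean() * q)
               for i in range(8))

# A handful of resamples with i.i.d. labels calibrates a null threshold.
thr = max(sup_deviation(rng.random(500) < p) for _ in range(10))
significant = [i for i in range(8)
               if abs((T[:, i] & target).mean() - T[:, i].mean() * p) > thr]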

Updated: 2024-06-17 17:49:27

标题: 高效发现重要模式的方法:少次重采样

摘要: 显著模式挖掘是挖掘交易数据中的一项基本任务,需要识别与给定特征值(目标)显著相关的模式。在几种应用中,如生物医学、篮筐市场分析和社交网络中,目标是发现与目标的关联性是针对底层人口或过程定义的模式,数据集只是一组观察或样本。捕捉模式与目标的关联性的一种自然方式是考虑其统计显著性,评估其与模式和目标之间独立性(零)假设的偏差。虽然已经提出了几种算法来发现统计显著的模式,但这仍然是一个计算需求高的任务,对于复杂模式如子群,没有有效的解决方案。 我们提出了FSR,一个高效的算法,可以识别具有严格假阈值的统计显著模式。FSR建立在一个新颖的挖掘显著模式的通用框架上,该框架捕捉了一些最常考虑的模式,包括项集、序列模式和子群。FSR使用少量重新抽样数据集,通过为每个交易分配i.i.d.标签来严格限制质量统计量的最高偏差,该统计量测量模式的显著性。FSR建立在对最高偏差的新颖严格界限上,需要挖掘少量重新抽样数据集,同时在发现显著模式方面具有高效性。作为一个测试案例,我们考虑了显著子群挖掘,并且我们在几个真实数据集上的评估显示,FSR在发现显著子群方面是有效的,同时需要少量重新抽样数据集。

更新时间: 2024-06-17 17:49:27

领域: cs.LG,cs.DB,stat.ML

下载: http://arxiv.org/abs/2406.11803v1

GAugLLM: Improving Graph Contrastive Learning for Text-Attributed Graphs with Large Language Models

This work studies self-supervised graph learning for text-attributed graphs (TAGs) where nodes are represented by textual attributes. Unlike traditional graph contrastive methods that perturb the numerical feature space and alter the graph's topological structure, we aim to improve view generation through language supervision. This is driven by the prevalence of textual attributes in real applications, which complement graph structures with rich semantic information. However, this presents challenges for two major reasons. First, text attributes often vary in length and quality, making it difficult to perturb raw text descriptions without altering their original semantic meanings. Second, although text attributes complement graph structures, they are not inherently well-aligned. To bridge the gap, we introduce GAugLLM, a novel framework for augmenting TAGs. It leverages advanced large language models like Mistral to enhance self-supervised graph learning. Specifically, we introduce a mixture-of-prompt-expert technique to generate augmented node features. This approach adaptively maps multiple prompt experts, each of which modifies raw text attributes using prompt engineering, into numerical feature space. Additionally, we devise a collaborative edge modifier to leverage structural and textual commonalities, enhancing edge augmentation by examining or building connections between nodes. Empirical results across five benchmark datasets spanning various domains underscore our framework's ability to enhance the performance of leading contrastive methods as a plug-in tool. Notably, we observe that the augmented features and graph structure can also enhance the performance of standard generative methods, as well as popular graph neural networks. The open-sourced implementation of our GAugLLM is available at Github.

Updated: 2024-06-17 17:49:19

标题: GAugLLM:通过大语言模型改进文本属性图的图对比学习

摘要: 这项工作研究了自监督图学习,针对文本属性图(TAGs),其中节点由文本属性表示。与传统的图对比方法不同,传统方法会扰动数值特征空间并改变图的拓扑结构,我们的目标是通过语言监督改进视图生成。这是由于实际应用中文本属性的普遍存在,这些属性为图结构提供了丰富的语义信息。然而,这也带来了挑战,主要有两个原因。首先,文本属性通常长度和质量不一,使得在不改变原始语义含义的情况下扰动原始文本描述变得困难。其次,尽管文本属性补充了图结构,但它们并非本质上对齐。为了弥合这一差距,我们引入了GAugLLM,一个用于增强TAGs的新颖框架。它利用类似Mistral的先进大型语言模型来增强自监督图学习。具体来说,我们引入了一种混合提示专家技术,用于生成增强节点特征。这种方法自适应地将多个提示专家映射到数值特征空间,每个专家都使用提示工程修改原始文本属性。此外,我们设计了一个协作边修饰器,利用结构和文本的共同点,通过检查或建立节点之间的连接来增强边的增强。跨越各个领域的五个基准数据集的实证结果强调了我们的框架作为插件工具来增强领先对比方法的性能的能力。值得注意的是,我们观察到增强的特征和图结构也可以增强标准生成方法以及流行的图神经网络的性能。我们的GAugLLM的开源实现可在Github上找到。

更新时间: 2024-06-17 17:49:19

领域: cs.LG,cs.AI,cs.IR

下载: http://arxiv.org/abs/2406.11945v1

Transcoders Find Interpretable LLM Feature Circuits

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. We then introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits.
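
The training objective can be sketched from the description: fit a wider, sparsely activating MLP to reproduce the input-to-output map of a frozen dense MLP sublayer, with an L1 penalty encouraging sparse activations. Layer sizes, the penalty coefficient, and the random stand-in activations are illustrative assumptions.

# Sketch of transcoder training: approximate a frozen MLP sublayer with a wider,
# sparsely activating one (sizes and the L1 coefficient are illustrative).
import torch
import torch.nn as nn

d, hidden, wide = 256, 1024, 8192
mlp = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d)).eval()
for p in mlp.parameters():
    p.requires_grad_(False)                        # the sublayer being explained

enc, dec = nn.Linear(d, wide), nn.Linear(wide, d)  # the transcoder
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

for _ in range(1000):
    x = torch.randn(64, d)                         # stand-in for residual-stream inputs
    acts = torch.relu(enc(x))                      # wide, hopefully sparse features
    loss = ((dec(acts) - mlp(x)) ** 2).mean() + 3e-4 * acts.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()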

Updated: 2024-06-17 17:49:00

标题: Transcoders发现可解释的LLM特征电路

摘要: 一个重要的目标在机制可解释性是电路分析:找到与特定行为或能力对应的模型稀疏子图。然而,MLP子层使得在基于变压器的语言模型上进行细粒度电路分析变得困难。特别是,可解释的特征 -- 如稀疏自动编码器(SAEs)发现的那些 -- 通常是极多神经元的线性组合,每个神经元都有自己的非线性来解释。在这种情况下的电路分析要么产生难以处理的大电路,要么无法分离局部和全局行为。为了解决这个问题,我们探索了转码器,它旨在用更广泛、稀疏激活的MLP层忠实地近似密集激活的MLP层。我们成功地在具有120M、410M和1.4B参数的语言模型上训练了转码器,并发现它们在稀疏性、忠实性和人类可解释性方面至少与SAEs相当。然后,我们引入了一种新颖的方法,使用转码器通过MLP子层进行基于权重的电路分析。结果电路清晰地分解为与输入相关和与输入无关的项。最后,我们应用转码器来逆向工程模型中的未知电路,并获得了有关GPT2-small中大于电路的新见解。我们的结果表明,转码器可以有效地将涉及MLPs的模型计算分解为可解释的电路。代码可在https://github.com/jacobdunefsky/transcoder_circuits找到。

更新时间: 2024-06-17 17:49:00

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2406.11944v1

Measurement Simplification in ρ-POMDP with Performance Guarantees

Decision making under uncertainty is at the heart of any autonomous system acting with imperfect information. The cost of solving the decision making problem is exponential in the action and observation spaces, thus rendering it unfeasible for many online systems. This paper introduces a novel approach to efficient decision-making, by partitioning the high-dimensional observation space. Using the partitioned observation space, we formulate analytical bounds on the expected information-theoretic reward, for general belief distributions. These bounds are then used to plan efficiently while keeping performance guarantees. We show that the bounds are adaptive, computationally efficient, and that they converge to the original solution. We extend the partitioning paradigm and present a hierarchy of partitioned spaces that allows greater efficiency in planning. We then propose a specific variant of these bounds for Gaussian beliefs and show a theoretical performance improvement of at least a factor of 4. Finally, we compare our novel method to other state of the art algorithms in active SLAM scenarios, in simulation and in real experiments. In both cases we show a significant speed-up in planning with performance guarantees.

Updated: 2024-06-17 17:47:47

Domains: cs.AI,cs.RO

Download: http://arxiv.org/abs/2309.10701v2

Mix-Domain Contrastive Learning for Unpaired H&E-to-IHC Stain Translation

H&E-to-IHC stain translation techniques offer a promising solution for precise cancer diagnosis, especially in low-resource regions where there is a shortage of health professionals and limited access to expensive equipment. Considering the pixel-level misalignment of H&E-IHC image pairs, current research explores the pathological consistency between patches from the same positions of the image pair. However, most of them overemphasize the correspondence between domains or patches, overlooking the side information provided by the non-corresponding objects. In this paper, we propose a Mix-Domain Contrastive Learning (MDCL) method to leverage the supervision information in unpaired H&E-to-IHC stain translation. Specifically, the proposed MDCL method aggregates the inter-domain and intra-domain pathology information by estimating the correlation between the anchor patch and all the patches from the matching images, encouraging the network to learn additional contrastive knowledge from mixed domains. With the mix-domain pathology information aggregation, MDCL enhances the pathological consistency between the corresponding patches and the component discrepancy of the patches from the different positions of the generated IHC image. Extensive experiments on two H&E-to-IHC stain translation datasets, namely MIST and BCI, demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics.
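
The aggregation idea can be sketched as an InfoNCE-style loss in which an anchor patch embedding is contrasted against a pool of patches mixed from both domains, with the same-position patch as the positive. The embedding dimension, temperature, and positive-selection rule below are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def mix_domain_contrastive(anchor, he_patches, ihc_patches, pos_idx, tau=0.07):
    # Pull the anchor toward the patch at its own position (pos_idx) and
    # push it from every other patch in a pool mixing both domains.
    pool = torch.cat([he_patches, ihc_patches], dim=0)           # (2N, d)
    logits = F.cosine_similarity(anchor.unsqueeze(0), pool) / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([pos_idx]))

anchor = torch.randn(128)                  # embedding of a generated-IHC patch
he, ihc = torch.randn(16, 128), torch.randn(16, 128)
loss = mix_domain_contrastive(anchor, he, ihc, pos_idx=3)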

Updated: 2024-06-17 17:47:44

Domains: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2406.11799v1

InSaAF: Incorporating Safety through Accuracy and Fairness | Are LLMs ready for the Indian Legal Domain?

Recent advancements in language technology and Artificial Intelligence have resulted in numerous Language Models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. In this study, we explore the ability of Large Language Models (LLMs) to perform legal tasks in the Indian landscape when social factors are involved. We present a novel metric, the $\beta$-weighted Legal Safety Score ($LSS_{\beta}$), which encapsulates both the fairness and accuracy aspects of the LLM. We assess LLMs' safety by considering their performance in the Binary Statutory Reasoning task and the fairness they exhibit across various axes of disparity in Indian society. Task performance and fairness scores of LLaMA and LLaMA-2 models indicate that the proposed $LSS_{\beta}$ metric can effectively determine the readiness of a model for safe usage in the legal sector. We also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. The finetuning procedures on LLaMA and LLaMA-2 models increase the $LSS_{\beta}$, improving their usability in the Indian legal domain. Our code is publicly released.
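
The abstract does not spell out the metric, so the snippet below only illustrates what a beta-weighted blend of accuracy and fairness could look like, borrowing the familiar F-beta form; the paper's actual definition of $LSS_{\beta}$ may differ.

def legal_safety_score(accuracy, fairness, beta=1.0):
    # Hypothetical F-beta-style combination: beta > 1 weights the fairness
    # term more heavily than task accuracy. Both inputs are in [0, 1].
    b2 = beta ** 2
    return (1 + b2) * accuracy * fairness / (b2 * accuracy + fairness)

print(legal_safety_score(accuracy=0.82, fairness=0.67, beta=2.0))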

Updated: 2024-06-17 17:46:07

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2402.10567v4

Personalized Federated Knowledge Graph Embedding with Client-Wise Relation Graph

Federated Knowledge Graph Embedding (FKGE) has recently garnered considerable interest due to its capacity to extract expressive representations from distributed knowledge graphs, while concurrently safeguarding the privacy of individual clients. Existing FKGE methods typically harness the arithmetic mean of entity embeddings from all clients as the global supplementary knowledge, and learn a replica of the global consensus entity embeddings for each client. However, these methods usually neglect the inherent semantic disparities among distinct clients. This oversight not only results in the globally shared complementary knowledge being inundated with too much noise when tailored to a specific client, but also instigates a discrepancy between local and global optimization objectives. Consequently, the quality of the learned embeddings is compromised. To address this, we propose Personalized Federated knowledge graph Embedding with client-wise relation Graph (PFedEG), a novel approach that employs a client-wise relation graph to learn personalized embeddings by discerning the semantic relevance of embeddings from other clients. Specifically, PFedEG learns personalized supplementary knowledge for each client by amalgamating entity embeddings from its neighboring clients based on their "affinity" on the client-wise relation graph. Each client then conducts personalized embedding learning based on its local triples and personalized supplementary knowledge. We conduct extensive experiments on four benchmark datasets to evaluate our method against state-of-the-art models, and the results demonstrate the superiority of our method.
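
The core aggregation step can be sketched as an affinity-weighted average over neighboring clients' entity embeddings; the shapes, the row normalization of the affinity matrix, and including the client's own weight are assumptions for illustration.

import numpy as np

def personalized_supplement(entity_embs, affinity, client):
    # entity_embs: (clients, entities, dim); affinity: (clients, clients)
    # weights read off the client-wise relation graph (assumed given).
    # Returns an affinity-weighted average of clients' entity embeddings.
    w = affinity[client] / affinity[client].sum()
    return np.tensordot(w, entity_embs, axes=1)      # (entities, dim)

embs = np.random.randn(5, 100, 32)                   # 5 clients, 100 entities
aff = np.random.rand(5, 5)
supp = personalized_supplement(embs, aff, client=2)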

Updated: 2024-06-17 17:44:53

Domains: cs.IR,cs.AI

Download: http://arxiv.org/abs/2406.11943v1

Grokking Group Multiplication with Cosets

The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. There have been many techniques developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful test ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer fully connected one-hidden-layer networks that have "grokked" the arithmetic of the permutation groups $S_5$ and $S_6$. The models discover the true subgroup structure of the full group and converge on neural circuits that decompose the group arithmetic using the permutation group's subgroups. We describe how we reverse engineered the model's mechanisms and confirmed that our theory is a faithful description of the circuit's functionality. We also draw attention to current challenges in conducting interpretability research by comparing our work to Chughtai et al. [4], which claims to find a different algorithm for this same problem.

Updated: 2024-06-17 17:44:44

Domains: cs.LG,cs.AI,math.RT

Download: http://arxiv.org/abs/2312.06581v2

DataComp-LM: In search of the next generation of training sets for language models

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
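
A minimal sketch of model-based filtering, in the spirit of (but much simpler than) the DCLM-Baseline pipeline: train a quality classifier on reference positives versus junk negatives and keep crawl pages it scores highly. All of the data and the 0.5 threshold below are toy assumptions.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for reference "high-quality" pages and raw crawl text.
good = ["an encyclopedia article on thermodynamics",
        "a careful proof of the chain rule",
        "tutorial on suffix arrays with examples"]
bad = ["CLICK here to WIN free money now",
       "cheap pills best price buy today",
       "unsubscribe footer terms of service"]
crawl = ["lecture notes on convex optimization",
         "limited offer win win win click"]

vec = HashingVectorizer(n_features=2**18, ngram_range=(1, 2))
clf = LogisticRegression().fit(vec.transform(good + bad),
                               [1] * len(good) + [0] * len(bad))
scores = clf.predict_proba(vec.transform(crawl))[:, 1]
kept = [doc for doc, s in zip(crawl, scores) if s > 0.5]  # retained pages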

Updated: 2024-06-17 17:42:57

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2406.11794v1

A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping

Robotic grasping presents a difficult motor task in real-world scenarios, constituting a major hurdle to the deployment of capable robots across various industries. Notably, the scarcity of data makes grasping particularly challenging for learned models. Recent advancements in computer vision have witnessed a growth of successful unsupervised training mechanisms predicated on massive amounts of data sourced from the Internet, and now nearly all prominent models leverage pretrained backbone networks. Against this backdrop, we begin to investigate the potential benefits of large-scale visual pretraining in enhancing robot grasping performance. This preliminary literature review sheds light on critical challenges and delineates prospective directions for future research in visual pretraining for robotic manipulation.

Updated: 2024-06-17 17:39:30

Domains: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2406.11786v1

CELL your Model: Contrastive Explanation Methods for Large Language Models

The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring only black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt was slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a distance function that has meaning to the user and not necessarily a real-valued representation of a specific response (viz. class label). We offer two algorithms for finding contrastive explanations: i) a myopic algorithm, which although effective in creating contrasts, requires many model calls, and ii) a budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts adhering to a query budget, necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
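
The myopic algorithm can be sketched as a single greedy pass over one-word deletions, using only query access to the model and a user-supplied distance; the deletion edit space, the Jaccard distance, and the divergence threshold below are assumptions.

def myopic_contrast(llm, distance, prompt, threshold=0.5):
    # One greedy pass: try deleting each word, keep the single edit whose
    # reply is farthest from the original; report it if it diverges enough.
    # llm and distance are user-supplied black boxes (query access only).
    base = llm(prompt)
    words = prompt.split()
    scored = []
    for i in range(len(words)):
        candidate = " ".join(words[:i] + words[i + 1:])
        scored.append((distance(base, llm(candidate)), candidate))
    best_dist, contrast = max(scored)
    return contrast if best_dist >= threshold else None

# Toy black boxes standing in for a real LLM and a user-meaningful distance.
llm = lambda p: p.upper()
def jaccard_distance(a, b):
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / max(len(sa | sb), 1)
print(myopic_contrast(llm, jaccard_distance,
                      "please open the pod bay doors", threshold=0.1))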

Updated: 2024-06-17 17:39:10

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.11785v1

MDCR: A Dataset for Multi-Document Conditional Reasoning

The same real-life questions posed to different individuals may lead to different answers based on their unique situations. For instance, whether a student is eligible for a scholarship depends on eligibility conditions, such as the major or degree required. ConditionalQA was proposed to evaluate models' capability of reading a document and answering eligibility questions, considering unmentioned conditions. However, it is limited to questions on single documents, neglecting harder cases that may require cross-document reasoning and optimization, for example, "What is the maximum number of scholarships attainable?" Such questions over multiple documents are not only more challenging because of the additional context that must be understood, but also because the model has to (1) explore all possible combinations of unmentioned conditions and (2) understand the relationship between conditions across documents, to reason about the optimal outcome. To evaluate models' capability of answering such questions, we propose a new dataset, MDCR, which can reflect real-world challenges and serve as a new test bed for complex conditional reasoning that requires optimization. We evaluate this dataset using the most recent LLMs and demonstrate their limitations in solving this task. We believe this dataset will facilitate future research in answering optimization questions with unknown conditions.

Updated: 2024-06-17 17:38:43

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11784v1

Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs

Large language models (LLMs) have been shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
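
The framework itself is easy to sketch; the uniform parameter averaging used for the merge step below is an assumption (any model-merging rule could be substituted), and the placeholder unlearn routine stands in for a real unlearning method.

import copy
import torch
import torch.nn as nn

def spunge(model, forget_set, attribute, unlearn):
    # SPlit the forget set by an attribute value, UNlearn each subset from
    # the same starting model, then MerGE by uniform parameter averaging.
    subsets = {}
    for example in forget_set:
        subsets.setdefault(example[attribute], []).append(example)
    variants = [unlearn(copy.deepcopy(model), s) for s in subsets.values()]
    merged = copy.deepcopy(model)
    with torch.no_grad():
        for name, p in merged.named_parameters():
            stacked = torch.stack([dict(v.named_parameters())[name]
                                   for v in variants])
            p.copy_(stacked.mean(0))
    return merged

def unlearn(m, subset):
    # Placeholder unlearning routine: perturbs weights instead of actually
    # unlearning; substitute gradient ascent, preference optimization, etc.
    with torch.no_grad():
        for p in m.parameters():
            p.add_(0.01 * torch.randn_like(p))
    return m

model = nn.Linear(4, 2)
forget = [{"topic": "toxic"}, {"topic": "bio"}, {"topic": "toxic"}]
safe_model = spunge(model, forget, attribute="topic", unlearn=unlearn)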

Updated: 2024-06-17 17:35:52

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.11780v1

Crossfusor: A Cross-Attention Transformer Enhanced Conditional Diffusion Model for Car-Following Trajectory Prediction

Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS), enhancing road safety and traffic efficiency. While traditional methods have laid foundational work, modern deep learning techniques, particularly transformer-based models and generative approaches, have significantly improved prediction accuracy by capturing complex and non-linear patterns in vehicle motion and traffic interactions. However, these models often overlook the detailed car-following behaviors and inter-vehicle interactions essential for real-world driving scenarios. This study introduces a Cross-Attention Transformer Enhanced Conditional Diffusion Model (Crossfusor) specifically designed for car-following trajectory prediction. Crossfusor integrates detailed inter-vehicular interactions and car-following dynamics into a robust diffusion framework, improving both the accuracy and realism of predicted trajectories. The model leverages a novel temporal feature encoding framework combining GRU, location-based attention mechanisms, and Fourier embedding to capture historical vehicle dynamics. It employs noise scaled by these encoded historical features in the forward diffusion process, and uses a cross-attention transformer to model intricate inter-vehicle dependencies in the reverse denoising process. Experimental results on the NGSIM dataset demonstrate that Crossfusor outperforms state-of-the-art models, particularly in long-term predictions, showcasing its potential for enhancing the predictive capabilities of autonomous driving systems.

Updated: 2024-06-17 17:35:47

Domains: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2406.11941v1

Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection

Machine learning techniques, such as deep learning and ensemble methods, are widely used in various domains due to their ability to handle complex real-world tasks. However, their black-box nature has raised multiple concerns about the fairness, trustworthiness, and transparency of computer-assisted decision-making. This has led to the emergence of local post-hoc explainability methods, which offer explanations for individual decisions made by black-box algorithms. Among these methods, Kernel SHAP is widely used due to its model-agnostic nature and its well-founded theoretical framework. Despite these strengths, Kernel SHAP suffers from high instability: different executions of the method with the same inputs can lead to significantly different explanations, which diminishes the relevance of the explanations. The contribution of this paper is two-fold. On the one hand, we show that Kernel SHAP's instability is caused by its stochastic neighbor selection procedure, which we adapt to achieve full stability without compromising explanation fidelity. On the other hand, we show that by restricting the neighbors generation to perturbations of size 1 -- which we call the coalitions of Layer 1 -- we obtain a novel feature-attribution method that is fully stable, computationally efficient, and still meaningful.
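
One way to read the Layer-1 idea is as deterministic leave-one-out occlusion: every neighbor perturbs exactly one feature, so no random sampling is involved and the result is fully stable. The sketch below follows that reading; the background-value masking is an assumption.

import numpy as np

def layer1_attributions(f, x, background):
    # Layer 1 coalitions: each neighbor masks exactly one feature (replaced
    # by a background value), so the attribution of feature i is
    # f(x) - f(x with i masked). Deterministic, hence fully stable.
    fx = f(x)
    phi = np.empty(len(x))
    for i in range(len(x)):
        z = x.copy()
        z[i] = background[i]
        phi[i] = fx - f(z)
    return phi

f = lambda v: 3 * v[0] + v[1] ** 2                         # toy model
print(layer1_attributions(f, np.array([1.0, 2.0]), np.zeros(2)))  # [3. 4.]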

Updated: 2024-06-17 17:35:02

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2312.12115v2

Provable Guarantees for Model Performance via Mechanistic Interpretability

In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally lower bounding the accuracy of 151 small transformers trained on a Max-of-$k$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we show that shorter proofs seem to require and provide more mechanistic understanding, and that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.

Updated: 2024-06-17 17:34:25

Domains: cs.LG,cs.LO

Download: http://arxiv.org/abs/2406.11779v1

Task Me Anything

Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their specific use case. This paper introduces Task-Me-Anything, a benchmark generation engine which produces a benchmark tailored to a user's needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. Additionally, it algorithmically addresses user queries regarding MLM performance efficiently within a computational budget. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating MLM perceptual capabilities. Task-Me-Anything reveals critical insights: open-source MLMs excel in object and attribute recognition but lack spatial and temporal understanding; each model exhibits unique strengths and weaknesses; larger models generally perform better, though exceptions exist; and GPT4o demonstrates challenges in recognizing rotating/moving objects and distinguishing colors.

Updated: 2024-06-17 17:32:42

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2406.11775v1

Optimal Transport-Assisted Risk-Sensitive Q-Learning

The primary goal of reinforcement learning is to develop decision-making policies that prioritize optimal performance without considering risk or safety. In contrast, safe reinforcement learning aims to mitigate or avoid unsafe states. This paper presents a risk-sensitive Q-learning algorithm that leverages optimal transport theory to enhance the agent safety. By integrating optimal transport into the Q-learning framework, our approach seeks to optimize the policy's expected return while minimizing the Wasserstein distance between the policy's stationary distribution and a predefined risk distribution, which encapsulates safety preferences from domain experts. We validate the proposed algorithm in a Gridworld environment. The results indicate that our method significantly reduces the frequency of visits to risky states and achieves faster convergence to a stable policy compared to the traditional Q-learning algorithm.
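
A tabular sketch of the idea: shape the TD target with a Wasserstein penalty between the empirical state-visitation distribution and the expert-specified risk distribution. The 1-D Wasserstein formula assumes ordered states, and how the penalty enters the update is an illustrative choice, not the paper's exact rule.

import numpy as np

def w1_discrete(p, q):
    # 1-Wasserstein distance between distributions over ordered states.
    return np.abs(np.cumsum(p - q)).sum()

def risk_sensitive_update(Q, s, a, r, s2, visits, risk_dist,
                          alpha=0.1, gamma=0.99, lam=0.5):
    # Penalize the TD target by the Wasserstein gap between the empirical
    # visitation distribution and the predefined risk distribution.
    visits[s2] += 1
    empirical = visits / visits.sum()
    target = r - lam * w1_discrete(empirical, risk_dist) + gamma * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])

n_states, n_actions = 16, 4                    # tiny gridworld
Q = np.zeros((n_states, n_actions))
visits = np.zeros(n_states)
risk = np.full(n_states, 1.0 / n_states)       # expert-specified safe profile
risk_sensitive_update(Q, s=0, a=1, r=-1.0, s2=4, visits=visits, risk_dist=risk)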

Updated: 2024-06-17 17:32:25

Domains: cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2406.11774v1

Deep Learning methodology for the identification of wood species using high-resolution macroscopic images

Significant advancements in the field of wood species identification are needed worldwide to support sustainable timber trade. In this work we contribute to automating the identification of wood species via high-resolution macroscopic images of timber. The main challenge of this problem is that fine-grained patterns in timber are crucial in order to accurately identify wood species, and these patterns are not properly learned by traditional convolutional neural networks (CNNs) trained on low/medium resolution images. We propose a Timber Deep Learning Identification with Patch-based Inference Voting methodology, abbreviated TDLI-PIV methodology. Our proposal exploits the concept of patching and the availability of high-resolution macroscopic images of timber in order to overcome the inherent challenges that CNNs face in timber identification. The TDLI-PIV methodology is able to capture fine-grained patterns in timber and, moreover, boosts robustness and prediction accuracy via a collaborative voting inference process. In this work we also introduce a new dataset of macroscopic images of timber, called GOIMAI-Phase-I, which has been obtained using optical magnification in order to capture fine-grained details, in contrast to the other datasets that are publicly available. More concretely, images in GOIMAI-Phase-I are taken with a smartphone with a 24x magnifying lens attached to the camera. Our dataset contains 2120 images of timber and covers 37 legally protected wood species. Our experiments have assessed the performance of the TDLI-PIV methodology, involving the comparison with other methodologies available in the literature, exploration of data augmentation methods, and the effect that the dataset size has on the accuracy of TDLI-PIV.
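
The patch-based inference voting step can be sketched independently of the trained CNN: tile the image, classify each patch, and combine the per-patch probabilities. The patch size, stride, and soft voting below are assumptions.

import numpy as np

def patch_vote_predict(image, classify_patch, patch=224, stride=224):
    # Tile the high-resolution macro image, classify every patch with the
    # trained CNN, and soft-vote the per-patch probabilities into one label.
    h, w = image.shape[:2]
    votes = [classify_patch(image[y:y + patch, x:x + patch])
             for y in range(0, h - patch + 1, stride)
             for x in range(0, w - patch + 1, stride)]
    return int(np.mean(votes, axis=0).argmax())

image = np.random.rand(1024, 768, 3)                         # stand-in image
classify_patch = lambda p: np.random.dirichlet(np.ones(37))  # 37 species
print(patch_vote_predict(image, classify_patch))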

Updated: 2024-06-17 17:31:57

Domains: cs.CV,cs.AI,I.2.1; I.2.10

Download: http://arxiv.org/abs/2406.11772v1

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former, a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, where we further add a soft prompt as input with high-level semantic evidence by leveraging event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in the literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA instruction-tuned on CompA-R proves to be superior in its complex reasoning and instruction following capabilities.

Updated: 2024-06-17 17:31:01

Domains: cs.SD,cs.AI,cs.CL,eess.AS

Download: http://arxiv.org/abs/2406.11768v1

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. On the other hand, live crowd-sourced platforms like the Chatbot Arena collect a wide range of natural prompts and user feedback. However, these prompts vary in sophistication and the feedback cannot be applied offline to new models. In order to ensure that benchmarks keep up with the pace of LLM development, we address how one can evaluate benchmarks on their ability to confidently separate models and their alignment with human preference. Under these principles, we developed BenchBuilder, a living benchmark that filters high-quality prompts from live data sources to enable offline evaluation on fresh, challenging prompts. BenchBuilder identifies seven indicators of a high-quality prompt, such as the requirement for domain knowledge, and utilizes an LLM annotator to select a high-quality subset of prompts from various topic clusters. The LLM evaluation process employs an LLM judge to ensure a fully automated, high-quality, and constantly updating benchmark. We apply BenchBuilder on prompts from the Chatbot Arena to create Arena-Hard-Auto v0.1: 500 challenging user prompts from a wide range of tasks. Arena-Hard-Auto v0.1 offers 3x tighter confidence intervals than MT-Bench and achieves a state-of-the-art 89.1% agreement with human preference rankings, all at a cost of only $25 and without human labelers. The BenchBuilder pipeline enhances evaluation benchmarks and provides a valuable tool for developers, enabling them to extract high-quality benchmarks from extensive data with minimal effort.

Updated: 2024-06-17 17:26:10

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.11939v1

Ultrasound Imaging based on the Variance of a Diffusion Restoration Model

Despite today's prevalence of ultrasound imaging in medicine, ultrasound signal-to-noise ratio is still affected by several sources of noise and artefacts. Moreover, enhancing ultrasound image quality involves balancing concurrent factors like contrast, resolution, and speckle preservation. Recently, there has been progress in both model-based and learning-based approaches addressing the problem of ultrasound image reconstruction. Bringing the best from both worlds, we propose a hybrid reconstruction method combining an ultrasound linear direct model with a learning-based prior coming from a generative Denoising Diffusion model. More specifically, we rely on the unsupervised fine-tuning of a pre-trained Denoising Diffusion Restoration Model (DDRM). Given the nature of multiplicative noise inherent to ultrasound, this paper proposes an empirical model to characterize the stochasticity of diffusion reconstruction of ultrasound images, and demonstrates the value of its variance as an estimator of the echogenicity map. We conduct experiments on synthetic, in-vitro, and in-vivo data, demonstrating the efficacy of our variance imaging approach in achieving high-quality image reconstructions from single plane-wave acquisitions and in comparison to state-of-the-art methods. The code is available at: https://github.com/Yuxin-Zhang-Jasmine/DRUSvar
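
The variance-as-echogenicity idea reduces to a few lines once a stochastic restorer is available; the toy restorer and the number of runs below are stand-ins, not the DDRM sampler itself.

import numpy as np

def variance_echogenicity_map(restore, raw_acquisition, n_runs=32):
    # Run the stochastic diffusion restoration repeatedly on the same raw
    # data; the pixel-wise variance across runs serves as the echogenicity
    # map estimate. `restore` stands in for a DDRM-style sampler.
    samples = np.stack([restore(raw_acquisition) for _ in range(n_runs)])
    return samples.var(axis=0)

restore = lambda y: y + 0.1 * np.random.randn(*y.shape)  # toy stochastic restorer
var_map = variance_echogenicity_map(restore, np.random.rand(128, 128))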

Updated: 2024-06-17 17:25:42

Domains: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2403.15316v2

Joint Linked Component Analysis for Multiview Data

In this work, we propose the joint linked component analysis (joint_LCA) for multiview data. Unlike classic methods which extract the shared components in a sequential manner, the objective of joint_LCA is to identify the view-specific loading matrices and the rank of the common latent subspace simultaneously. We formulate a matrix decomposition model where a joint structure and an individual structure are present in each data view, which enables us to arrive at a clean SVD representation for the cross covariance between any pair of data views. An objective function with a novel penalty term is then proposed to achieve simultaneous estimation and rank selection. In addition, a refitting procedure is employed as a remedy to reduce the shrinkage bias caused by the penalization.

Updated: 2024-06-17 17:25:23

Domains: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2406.11761v1

Tracking the perspectives of interacting language models

Large language models (LLMs) are capable of producing high quality information at unprecedented rates. As these models continue to entrench themselves in society, the content they produce will become increasingly pervasive in databases that are, in turn, incorporated into the pre-training data, fine-tuning data, retrieval data, etc. of other language models. In this paper we formalize the idea of a communication network of LLMs and introduce a method for representing the perspective of individual models within a collection of LLMs. Given these tools we systematically study information diffusion in the communication network of LLMs in various simulated settings.

Updated: 2024-06-17 17:20:16

Domains: cs.AI,cs.MA

Download: http://arxiv.org/abs/2406.11938v1

Quantifying Local Model Validity using Active Learning

Real-world applications of machine learning models are often subject to legal or policy-based regulations. Some of these regulations require ensuring the validity of the model, i.e., the approximation error being smaller than a threshold. A global metric is generally too insensitive to determine the validity of a specific prediction, whereas evaluating local validity is costly since it requires gathering additional data. We propose learning the model error to acquire a local validity estimate while reducing the amount of required data through active learning. Using model validation benchmarks, we provide empirical evidence that the proposed method can lead to an error model with sufficient discriminative properties using a relatively small amount of data. Furthermore, an increased sensitivity to local changes of the validity bounds compared to alternative approaches is demonstrated.
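
A sketch of the acquisition loop, assuming a Gaussian-process error model: query where the predicted error is most ambiguous relative to the validity threshold. The acquisition score, the default kernel, and the toy surrogate are assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def acquire_next(gp, candidates, threshold):
    # Query where the predicted |error| is most ambiguous with respect to
    # the validity threshold, normalized by the GP's predictive std.
    mu, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmin(np.abs(mu - threshold) / np.maximum(std, 1e-9))]

f_true = lambda x: np.sin(3 * x)               # ground truth (expensive)
f_model = lambda x: x - x ** 3 / 6             # surrogate under validation
X = np.array([[-2.0], [0.0], [2.0]])
y = np.abs(f_true(X[:, 0]) - f_model(X[:, 0]))
for _ in range(5):
    gp = GaussianProcessRegressor().fit(X, y)
    x_new = acquire_next(gp, np.linspace(-2, 2, 101)[:, None], threshold=0.1)
    X = np.vstack([X, x_new])
    y = np.append(y, abs(f_true(x_new[0]) - f_model(x_new[0])))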

Updated: 2024-06-17 17:19:01

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2406.07474v2

STAR: SocioTechnical Approach to Red Teaming Language Models

This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming safety of large language models. STAR makes two key contributions: it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface. Parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel step of arbitration to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.

Updated: 2024-06-17 17:16:45

Domains: cs.AI,cs.CL,cs.CY,cs.HC

Download: http://arxiv.org/abs/2406.11757v1

DustNet: skillful neural network predictions of Saharan dust

Suspended in the atmosphere are millions of tonnes of mineral dust which interacts with weather and climate. Accurate representation of mineral dust in weather models is vital, yet remains challenging. Large scale weather models use high power supercomputers and take hours to complete the forecast. Such computational burden allows them to include only monthly climatological means of mineral dust as input states, inhibiting their forecasting accuracy. Here, we introduce DustNet, a simple, accurate and super fast forecasting model for 24-hour-ahead predictions of aerosol optical depth (AOD). DustNet trains in less than 8 minutes and creates predictions in 2 seconds on a desktop computer. Predictions created by DustNet outperform the state-of-the-art physics-based model on coarse 1 x 1 degree resolution at 95% of grid locations when compared to ground-truth satellite data. Our results show DustNet has the potential for fast and accurate AOD forecasting which could transform our understanding of dust impacts on weather patterns.

Updated: 2024-06-17 17:15:30

Domains: physics.geo-ph,cs.AI,physics.ao-ph,physics.data-an,86-06(Primary), 86A10(Secondary),J.2; I.2.1; I.2.7

Download: http://arxiv.org/abs/2406.11754v1

A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models

Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks. However, full finetuning is usually costly. Existing work, such as parameter-efficient finetuning (PEFT), often focuses on how to finetune but neglects the issue of where to finetune. As a pioneering work on answering where to finetune (at the layer level), we conduct a semantic analysis of the LM inference process. We first propose a virtual transition of the latent representation and then trace its factual transition. Based on the deviation in transitions, we estimate the gain of finetuning each model layer and further narrow down the scope for finetuning. We perform extensive experiments across well-known LMs and datasets. The results show that our approach is effective and efficient, and outperforms the existing baselines. Our approach is orthogonal to existing efficient techniques, such as PEFT methods, offering practical value for LM finetuning.
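
The selection step reduces to ranking layers by an estimated gain and freezing the rest; the scores below are made-up numbers standing in for the virtual-versus-factual transition deviation the abstract describes.

import torch.nn as nn

def freeze_all_but_top_k(layers, gain_scores, k):
    # Fine-tune only the k layers with the largest estimated gain; freeze
    # the rest by turning off their gradients.
    chosen = set(sorted(range(len(layers)), key=lambda i: -gain_scores[i])[:k])
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i in chosen
    return chosen

layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
print(freeze_all_but_top_k(layers, gain_scores=[0.2, 0.9, 0.1, 0.7, 0.3, 0.05], k=2))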

Updated: 2024-06-17 17:13:08

Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.11753v1

Transcendence: Generative Models Can Outperform The Experts That Train Them

Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence is enabled by low-temperature sampling, and rigorously assess this experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
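
The low-temperature mechanism can be seen in a three-move toy example: two experts who agree on the good move but each favor a different blunder induce a mixture whose temperature-sharpened version plays better than either expert. The numbers are illustrative, not from the paper.

import numpy as np

def sharpen(p, T):
    # Temperature-T sampling distribution; T -> 0 approaches the argmax.
    logits = np.log(p) / T
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Two noisy experts agree on move 0 but each prefers a different blunder.
expert_a = np.array([0.4, 0.5, 0.1])
expert_b = np.array([0.4, 0.1, 0.5])
mixture = (expert_a + expert_b) / 2     # the imitator learns [0.4, 0.3, 0.3]
print(sharpen(mixture, T=0.1))          # mass concentrates on the shared move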

Updated: 2024-06-17 17:00:52

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.11741v1

Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies

Humans can imagine goal states during planning and perform actions to match those goals. In this work, we propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation. This transforms action inference into a local generative task. We leverage pick and place symmetries underlying the tasks in the generation process and achieve extremely high sample efficiency and generalizability to unseen configurations. Finally, we demonstrate state-of-the-art performance across various tasks on the RLbench benchmark compared with several strong baselines.
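
The "rigid action estimation" step is, in spirit, a point-cloud registration problem; a standard Kabsch solver (assuming corresponded points, which the paper's pipeline may not require) recovers the rotation and translation that carry the current cloud onto the imagined one.

import numpy as np

def rigid_action_from_clouds(P, Q):
    # Kabsch estimate of the rigid transform (R, t) mapping the current
    # object cloud P onto the imagined goal cloud Q (row-corresponded).
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    return R, cq - R @ cp

# Toy check: recover a known rotation and translation.
P = np.random.randn(50, 3)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta), np.cos(theta), 0],
                   [0, 0, 1]])
Q = P @ R_true.T + np.array([0.1, -0.2, 0.05])
R, t = rigid_action_from_clouds(P, Q)
assert np.allclose(R, R_true, atol=1e-6)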

Updated: 2024-06-17 17:00:41

Domains: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.11740v1

Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions

Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems' objectives match those of humans) rather than an ongoing, mutual alignment problem [429]. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To understand these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML), and others. We characterize, define and scope human-AI alignment. From this, we present a conceptual framework of "Bidirectional Human-AI Alignment" to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans that ensure AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advancements both cognitively and behaviorally. Additionally, we articulate the key findings derived from literature analysis, including discussions about human values, interaction techniques, and evaluations. To pave the way for future studies, we envision three key challenges for future directions and propose examples of potential future solutions.

Updated: 2024-06-17 16:58:35

Domains: cs.HC,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.09264v2

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. This paper investigates the novel challenge of defending MLLMs against such attacks. Compared to large language models (LLMs), MLLMs include an additional image modality. We discover that images act as a "foreign language" that is not considered during safety alignment, making MLLMs more prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, the continuous nature of image signals presents significant alignment challenges, which poses difficulty to thoroughly cover all possible scenarios. This vulnerability is exacerbated by the fact that most state-of-the-art MLLMs are fine-tuned on limited image-text pairs that are much fewer than the extensive text-based pretraining corpus, which makes the MLLMs more prone to catastrophic forgetting of their original abilities during safety fine-tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the original performance of MLLMs. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.

Updated: 2024-06-17 16:53:49

Domains: cs.CR,cs.CL,cs.CV

Download: http://arxiv.org/abs/2401.02906v3

Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models

One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning. This inspired researchers to investigate self-training methods to mitigate the extensive reliance on human annotations. However, the current success of self-training has been primarily observed in natural language scenarios, rather than in the increasingly important neural-symbolic scenarios. To this end, we propose an environment-guided neural-symbolic self-training framework named ENVISIONS. It aims to overcome two main challenges: (1) the scarcity of symbolic data, and (2) the limited proficiency of LLMs in processing symbolic language. Extensive evaluations conducted on three distinct domains demonstrate the effectiveness of our approach. Additionally, we have conducted a comprehensive analysis to uncover the factors contributing to ENVISIONS's success, thereby offering valuable insights for future research in this area. Code will be available at https://github.com/xufangzhi/ENVISIONS.

Updated: 2024-06-17 16:52:56

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11736v1

A Clipped Trip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss. We show that with Gaussian noise clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits with tuning of the clipping threshold. In these cases, clipping biases updates in a way beneficial to training which cannot be recovered by SGD under any schedule. We conclude with a discussion about the links between high-dimensional clipping and neural network training.
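
The setting is easy to reproduce in miniature: streaming SGD on least squares with per-step norm clipping, one fresh sample per step. The heavy-tailed Student-t label noise below is an assumed stand-in for the "other noisy settings" in which the abstract says a tuned threshold can help.

import numpy as np

def streaming_clipped_sgd(w_star, steps=2000, lr=0.01, clip=1.0, seed=0):
    # Streaming SGD on E[(x^T w - y)^2] with per-step gradient norm clipping.
    rng = np.random.default_rng(seed)
    w = np.zeros(w_star.size)
    for _ in range(steps):
        x = rng.standard_normal(w_star.size)
        y = x @ w_star + rng.standard_t(df=2)   # heavy-tailed label noise
        g = 2 * (x @ w - y) * x
        norm = np.linalg.norm(g)
        if norm > clip:
            g *= clip / norm
        w -= lr * g
    return np.linalg.norm(w - w_star) ** 2      # distance to the target

w_star = np.full(20, 0.1)
print(streaming_clipped_sgd(w_star, clip=1.0),      # clipped run
      streaming_clipped_sgd(w_star, clip=np.inf))   # unclipped baseline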

Updated: 2024-06-17 16:50:22

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2406.11733v1

CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning

Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, has significantly advanced this understanding by assessing the contribution of each datum to model accuracy. However, the resource-intensive and time-consuming nature of multiple model retraining poses significant challenges for applying Data Shapley to large datasets. To address this, we propose the CHG (Conduct of Hardness and Gradient) score, which approximates the utility of each data subset on model accuracy during a single model training. By deriving the closed-form expression of the Shapley value for each data point under the CHG score utility function, we reduce the computational complexity to the equivalent of a single model retraining, an exponential improvement over existing methods. Additionally, we employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data. CHG Shapley facilitates trustworthy model training through efficient data valuation, introducing a novel data-centric perspective on trustworthy machine learning.

Updated: 2024-06-17 16:48:31

Domains: cs.GT,cs.LG

Download: http://arxiv.org/abs/2406.11730v1

Secure Cross-Chain Provenance for Digital Forensics Collaboration

In digital forensics and various sectors like medicine and supply chain, blockchains play a crucial role in providing a secure and tamper-resistant system that meticulously records every detail, ensuring accountability. However, collaboration among different agencies, each with its own blockchains, creates challenges due to diverse protocols and a lack of interoperability, hindering seamless information sharing. Cross-chain technology has been introduced to address these challenges. Current research about blockchains in digital forensics tends to focus on individual agencies, lacking a comprehensive approach to collaboration and the essential aspect of cross-chain functionality. This emphasizes the necessity for a framework capable of effectively addressing challenges in securely sharing case information, implementing access controls, and capturing provenance data across interconnected blockchains. Our solution, ForensiCross, is the first cross-chain solution specifically designed for digital forensics and provenance. It includes BridgeChain and features a unique communication protocol for cross-chain and multi-chain solutions. ForensiCross offers meticulous provenance capture and extraction methods, mathematical analysis to ensure reliability, scalability considerations for a distributed intermediary in collaborative blockchain contexts, and robust security measures against potential vulnerabilities and attacks. Analysis and evaluation results indicate that ForensiCross is secure and, despite a slight increase in communication time, outperforms alternatives in node-count efficiency while providing secure provenance extraction. As an all-encompassing solution, ForensiCross aims to simplify collaborative investigations by ensuring data integrity and traceability.

Updated: 2024-06-17 16:47:27

标题: 安全的跨链溯源技术用于数字取证合作

摘要: 在数字取证以及医疗和供应链等各个领域,区块链在提供安全和防篡改系统方面发挥着至关重要的作用,详细记录每一个细节,确保问责制。然而,不同机构之间的合作,每个机构都有自己的区块链,由于各种协议的差异和缺乏互操作性,导致了挑战,阻碍了信息共享的无缝性。跨链技术被引入来解决这些挑战。目前关于数字取证中区块链的研究倾向于专注于个体机构,缺乏综合协作和跨链功能的基本方面。这强调了需要一个能够有效解决安全共享案例信息、实施访问控制以及捕获互连区块链上溯数据挑战的框架的必要性。我们的解决方案ForensiCross是第一个专门为数字取证和溯源设计的跨链解决方案。它包括BridgeChain,并具有独特的跨链和多链解决方案的通信协议。ForensiCross提供了细致的溯源捕获和提取方法,数学分析以确保可靠性,在协作区块链环境中分布式中介的可扩展性考虑,以及针对潜在漏洞和攻击的强大安全措施。分析和评估结果表明,ForensiCross是安全的,并且尽管通信时间略有增加,但在节点数量效率和安全溯源提取方面表现优异。作为一个全面的解决方案,ForensiCross旨在通过确保数据完整性和可追溯性简化协作调查。

更新时间: 2024-06-17 16:47:27

领域: cs.CR,cs.SI

下载: http://arxiv.org/abs/2406.11729v1

Social Environment Design

Artificial Intelligence (AI) holds promise as a technology that can be used to improve government and economic policy-making. This paper proposes a new research agenda towards this end by introducing Social Environment Design, a general framework for the use of AI for automated policy-making that connects with the Reinforcement Learning, EconCS, and Computational Social Choice communities. The framework seeks to capture general economic environments, includes voting on policy objectives, and gives a direction for the systematic analysis of government and economic policy through AI simulation. We highlight key open problems for future research in AI-based policy-making. By solving these challenges, we hope to achieve various social welfare objectives, thereby promoting more ethical and responsible decision making.

Updated: 2024-06-17 16:45:47

标题: 社会环境设计

摘要: 人工智能(AI)有望成为一种可以用来改善政府和经济政策制定的技术。本文通过引入社会环境设计,提出了一个新的研究议程,旨在推动利用AI进行自动化政策制定,与强化学习、经济计算和计算社会选择社区联系起来。该框架旨在捕捉一般经济环境,包括对政策目标的投票,并为通过AI模拟对政府和经济政策进行系统分析提供方向。我们强调未来AI政策制定研究中的关键开放问题。通过解决这些挑战,我们希望实现各种社会福利目标,从而促进更具道德和负责任的决策制定。

更新时间: 2024-06-17 16:45:47

领域: cs.AI,econ.GN,q-fin.EC,stat.ML

下载: http://arxiv.org/abs/2402.14090v3

Novel Fundus Image Preprocessing for Retcam Images to Improve Deep Learning Classification of Retinopathy of Prematurity

Retinopathy of Prematurity (ROP) is a potentially blinding eye disorder, caused by damage to the retina, that can affect babies born prematurely. Screening for ROP is essential for early detection and treatment. Screening is a laborious, manual process that requires a trained physician to perform a dilated ophthalmological examination; the examination can be subjective, resulting in lower diagnostic success for clinically significant disease. Automated diagnostic methods based on deep learning can assist ophthalmologists in increasing diagnostic accuracy, and several research groups have highlighted various approaches. Captured ROP Retcam images, however, suffer from poor quality. This paper proposes improved novel fundus preprocessing methods, used with pretrained transfer-learning frameworks, to create hybrid models that give higher diagnostic accuracy. Once the models were trained and validated, evaluations showed that these novel methods, in comparison to traditional image processing, contribute to better and in many respects higher accuracy in classifying Plus disease, ROP stages, and zones than peer papers.

Updated: 2024-06-17 16:41:54

标题: 新的眼底图像预处理技术用于改善Retcam图像的深度学习分类,以提高早产儿视网膜病变的分类效果

摘要: 早产儿视网膜病变(ROP)是一种潜在的致盲眼部疾病,因为对眼睛视网膜的损伤可能影响早产婴儿。ROP的筛查对于早期检测和治疗至关重要。这是一个费时且需要经过专业培训的医生进行瞳孔扩张眼科检查的过程,这可能是主观的,导致对临床显著疾病的诊断成功率较低。自动诊断方法可以帮助眼科医生使用深度学习提高诊断准确性。多个研究小组已经强调了各种方法。采集的ROP Retcam图像质量较差。本文提出了改进的新型眼底预处理方法,利用预训练的迁移学习框架创建混合模型,从而提高诊断准确性。经过训练和验证后,评估结果显示,与传统图像处理方法相比,这些新方法在分类Plus病变、ROP阶段和区域方面在许多方面都具有更好且更高的准确性。

更新时间: 2024-06-17 16:41:54

领域: eess.IV,cs.CV,cs.LG,I.2.1

下载: http://arxiv.org/abs/2302.02524v5

Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity

Understanding alignment techniques begins with comprehending the zero-shot generalization brought by instruction tuning, but little of the underlying mechanism is understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization.
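
The Test-centric Multi-turn Arrangement is only named in the abstract; a minimal sketch of the underlying move, ordering training examples by embedding similarity to the test data, might look like the following. The toy embed function is a placeholder for a real sentence encoder, and the single-round ordering omits the method's multi-turn structure.

```python
# Minimal sketch of similarity-based data arrangement: training examples most
# similar to the test data are surfaced first. `embed` is a toy placeholder
# (bag of shared random token vectors), not a real encoder.
import numpy as np

def embed(texts, dim=64, seed=0):
    rng = np.random.default_rng(seed)
    vocab = {}
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            if tok not in vocab:
                vocab[tok] = rng.normal(size=dim)
            out[i] += vocab[tok]
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-9)

train = ["translate to french: hello",
         "summarize: a long news article",
         "translate to french: goodbye"]
test = ["translate to french: thank you"]

vecs = embed(train + test)                      # shared vocabulary, one call
sim = vecs[:len(train)] @ vecs[len(train):].T   # cosine similarity to test set
order = np.argsort(-sim.max(axis=1))            # most test-similar first
print([train[i] for i in order])
```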

Updated: 2024-06-17 16:40:21

标题: 在指导调整过程中的零样本泛化:相似性和粒度的启示

摘要: 理解对齐技术始于理解指导调整带来的零样本泛化,但对其机制的理解还很有限。现有工作主要局限于任务级别,没有考虑到任务是人为定义的,对于LLMs来说,任务仅仅是由标记和表示组成。这一研究领域一直局限于从任务对角度考察任务之间的迁移,很少有研究关注从数据本身角度理解零样本泛化。为了弥合这一差距,我们首先通过多种指标展示指导调整过程中的零样本泛化发生得很早。接下来,我们从数据相似性和粒度的角度探讨了零样本泛化的促进作用,确认在指导调整过程中较早遇到高度相似和细粒度的训练数据,而无需受到定义的“任务”约束,有助于更好地泛化。最后,我们提出了一种更加基于实际的训练数据排列方法,测试中心多轮排列,并展示了其在促进持续学习和进一步降低损失方面的有效性。我们首次展示了在指导调整过程中的零样本泛化是在实例级别的训练和测试数据之间基于相似性的泛化形式。我们希望我们的分析能推动对指导调整过程中的零样本泛化的理解,并为更加对齐的LLMs的发展做出贡献。我们的代码发布在https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization。

更新时间: 2024-06-17 16:40:21

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11721v1

PEPit: computer-assisted worst-case analyses of first-order optimization methods in Python

PEPit is a Python package aiming to simplify access to worst-case analyses of a large family of first-order optimization methods possibly involving gradient, projection, proximal, or linear optimization oracles, along with their approximate or Bregman variants. In short, PEPit is a package enabling computer-assisted worst-case analyses of first-order optimization methods. The key underlying idea is to cast the problem of performing a worst-case analysis, often referred to as a performance estimation problem (PEP), as a semidefinite program (SDP) which can be solved numerically. To do that, users are only required to write their first-order methods nearly as they would have implemented them. The package then takes care of the SDP modeling parts, and the worst-case analysis is performed numerically via a standard solver.
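
A concrete taste of that workflow, written in the style of PEPit's documented quick-start examples (the constants and iteration count here are illustrative): bounding the worst case of plain gradient descent on a smooth, strongly convex function.

```python
# In the style of PEPit's documented quick-start: worst-case analysis of
# gradient descent on an L-smooth, mu-strongly convex function.
# Requires: pip install pepit
from PEPit import PEP
from PEPit.functions import SmoothStronglyConvexFunction

problem = PEP()
func = problem.declare_function(SmoothStronglyConvexFunction, mu=0.1, L=1.0)

xs = func.stationary_point()                        # the (symbolic) minimizer
x0 = problem.set_initial_point()
problem.set_initial_condition((x0 - xs) ** 2 <= 1)  # ||x0 - x*||^2 <= 1

x = x0
gamma = 1.0                                         # step size 1/L
for _ in range(5):                                  # write the method as usual
    x = x - gamma * func.gradient(x)

problem.set_performance_metric((x - xs) ** 2)       # quantity to bound
tau = problem.solve()                               # SDP built and solved here
print("worst-case ||x_5 - x*||^2:", tau)
```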

Updated: 2024-06-17 16:40:20

标题: PEPit:Python 中一阶优化方法的计算机辅助最坏情况分析

摘要: PEPit是一个旨在简化对大量一阶优化方法的最坏情况分析的Python软件包,可能涉及梯度、投影、近端或线性优化oracle,以及它们的近似或Bregman变体。简而言之,PEPit是一个能够实现计算机辅助一阶优化方法最坏情况分析的软件包。其关键思想是将执行最坏情况分析的问题,通常称为性能估计问题(PEP),转化为可以通过数值方法解决的半定规划(SDP)。为此,软件包用户只需按照几乎与实现它们相同的方式编写一阶方法。软件包然后负责SDP建模部分,最坏情况分析通过标准求解器进行数值计算。

更新时间: 2024-06-17 16:40:20

领域: math.OC,cs.LG,cs.MS,cs.NA,math.NA

下载: http://arxiv.org/abs/2201.04040v2

Reward Machines for Deep RL in Noisy and Uncertain Environments

Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing complex reward function structure, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that form the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.
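
The reward machine structure itself is compact enough to sketch; below is a toy machine with three states for a "fetch coffee, then deliver it" task, with labels assumed to arrive from some possibly noisy detector. The paper's POMDP algorithms for acting under that noise are not reproduced.

```python
# Toy reward machine: automaton states, transitions keyed by event labels,
# rewards attached to transitions. Labels here are assumed given; the paper's
# contribution (acting under noisy, uncertain labels) is not reproduced.
from dataclasses import dataclass, field

@dataclass
class RewardMachine:
    state: str = "u0"
    # delta[(state, label)] = (next_state, reward)
    delta: dict = field(default_factory=lambda: {
        ("u0", "coffee"): ("u1", 0.0),   # picked up the coffee
        ("u1", "office"): ("u2", 1.0),   # delivered it: task reward
    })

    def step(self, label: str) -> float:
        self.state, reward = self.delta.get((self.state, label),
                                            (self.state, 0.0))
        return reward

rm = RewardMachine()
for label in ["hallway", "coffee", "hallway", "office"]:
    r = rm.step(label)
    print(f"saw {label!r:10} -> state {rm.state}, reward {r}")
```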

Updated: 2024-06-17 16:39:08

标题: 用于嘈杂和不确定环境中深度强化学习的奖励机器

摘要: 奖励机器提供了一个类似有限自动机的结构,用于指定指令、安全约束和其他时间延长的值得奖励的行为。通过暴露复杂的奖励函数结构,它们能够实现反事实学习更新,从而获得令人印象深刻的样本效率增益。虽然奖励机器已经在表格和深度强化学习环境中得到应用,但它们通常依赖于对形成奖励函数基本组成部分的领域特定词汇的地面真实解释。在许多真实世界环境中,由于部分可观察性或嘈杂的传感,这样的地面真实解释可能难以捉摸。在本文中,我们探讨了在嘈杂和不确定环境中利用奖励机器进行深度强化学习的应用。我们将这个问题描述为一个部分可观察的马尔可夫决策过程,并提出了一套利用任务结构的强化学习算法,以不确定的领域特定词汇解释为基础。理论分析揭示了对这个问题的天真方法的缺陷,而实验结果表明,我们的算法成功利用任务结构,在嘈杂的词汇解释下提高了性能。我们的结果提供了一个在部分可观察环境中利用奖励机器的通用框架。

更新时间: 2024-06-17 16:39:08

领域: cs.LG,cs.AI,cs.FL,I.2.0; I.2.6; I.2.4; F.4.3

下载: http://arxiv.org/abs/2406.00120v2

Refusal in Language Models Is Mediated by a Single Direction

Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
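
Stripped to its core, the operation is a difference-in-means direction plus a projection; the numpy sketch below illustrates it on placeholder activations (shapes and data are made up, and real use would operate on a transformer's residual stream).

```python
# Sketch of difference-in-means direction finding and directional ablation.
# Activations are placeholders; in practice these would be residual-stream
# vectors collected from a chat model on harmful vs. harmless prompts.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
acts_harmful = rng.normal(size=(100, d_model)) + 0.5   # fake "refusal" shift
acts_harmless = rng.normal(size=(100, d_model))

r = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
r /= np.linalg.norm(r)                                  # unit refusal direction

def ablate(x, direction):
    """Erase the component of x along `direction` (suppresses refusal)."""
    return x - np.outer(x @ direction, direction)

def steer(x, direction, alpha=4.0):
    """Add the direction (elicits refusal even on harmless inputs)."""
    return x + alpha * direction

x = rng.normal(size=(1, d_model))
print("along r before:", (x @ r).item(), "after:", (ablate(x, r) @ r).item())
```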

Updated: 2024-06-17 16:36:12

标题: 语言模型中的拒绝是由一个单一方向中介的

摘要: 大型对话语言模型经过微调,既可以遵循指令,也可以确保安全性,使其遵从良性请求但拒绝有害请求。虽然拒绝行为在聊天模型中普遍存在,但其基本机制仍不为人们所了解。在本研究中,我们展示了拒绝是通过一个一维子空间介导的,跨越了13个常用的开源聊天模型,其参数规模达到72B。具体而言,对于每个模型,我们找到一个方向,擦除这个方向会阻止模型拒绝有害指令,而添加这个方向则会导致甚至无害指令的拒绝。利用这一见解,我们提出了一种新颖的白盒越狱方法,可以在对其他能力影响最小的情况下手术式地禁用拒绝功能。最后,我们从机械的角度分析了敌对后缀如何抑制拒绝介导方向的传播。我们的发现强调了当前安全微调方法的脆弱性。更广泛地说,我们的工作展示了如何利用对模型内部的理解来开发控制模型行为的实用方法。

更新时间: 2024-06-17 16:36:12

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.11717v1

CIMRL: Combining IMitation and Reinforcement Learning for Safe Autonomous Driving

Modern approaches to autonomous driving rely heavily on learned components trained with large amounts of human driving data via imitation learning. However, these methods require large amounts of expensive data collection and even then face challenges with safely handling long-tail scenarios and compounding errors over time. At the same time, pure Reinforcement Learning (RL) methods can fail to learn performant policies in sparse, constrained, and challenging-to-define reward settings like driving. Both of these issues make deploying purely cloned policies in safety-critical applications like autonomous vehicles challenging. In this paper we propose the Combining IMitation and Reinforcement Learning (CIMRL) approach - a framework that enables training driving policies in simulation through leveraging imitative motion priors and safety constraints. CIMRL does not require extensive reward specification and improves on the closed-loop behavior of pure cloning methods. By combining RL and imitation, we demonstrate that our method achieves state-of-the-art results in closed-loop simulation driving benchmarks.

Updated: 2024-06-17 16:34:41

标题: CIMRL:将模仿学习和强化学习结合应用于安全自动驾驶

摘要: 现代自动驾驶方法在很大程度上依赖于通过模仿学习使用大量人类驾驶数据训练的学习组件。然而,这些方法需要大量昂贵的数据收集,即使如此,仍然面临安全处理长尾情景和随时间累积错误的挑战。与此同时,纯强化学习(RL)方法可能无法在稀疏、受限制和具有挑战性的奖励设置(如驾驶)中学习有效的策略。这两个挑战使得在像自动驾驶汽车这样的安全关键应用中部署纯克隆策略具有挑战性。在本文中,我们提出了结合模仿学习和强化学习(CIMRL)方法 - 一种框架,通过利用模仿运动先验和安全约束在模拟环境中训练驾驶策略。CIMRL不需要广泛的奖励规范,并改进了纯克隆方法的闭环行为。通过结合强化学习和模仿学习,我们证明我们的方法在闭环模拟驾驶基准测试中实现了最先进的结果。

更新时间: 2024-06-17 16:34:41

领域: cs.LG

下载: http://arxiv.org/abs/2406.08878v2

Rethink Tree Traversal

We will show how to implement binary decision tree traversal in the language of matrix computation. Our main contribution is to propose equivalent algorithms for binary tree traversal based on a novel matrix representation of the decision tree's hierarchical structure. Our key idea is to traverse the binary decision tree via maximum inner product search. We not only implement decision tree methods without recursive traversal but also delve into the partitioning nature of tree-based methods.
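
The paper's exact matrix representation is not reproduced in this digest, but one concrete way to realize traversal as maximum inner product search is sketched below: encode each leaf's root-to-leaf path as a signed indicator vector over internal nodes, evaluate all node splits at once, and take the leaf maximizing the depth-corrected inner product.

```python
# One concrete realization of "tree traversal as maximum inner product search".
# Each leaf's root-to-leaf path becomes a signed indicator over internal
# nodes; the leaf whose path agrees with every node decision wins the argmax.
import numpy as np

features   = np.array([0, 1, 1])        # split feature per internal node
thresholds = np.array([0.5, 0.3, 0.7])  # split threshold per internal node

# Rows = leaves, cols = internal nodes: +1 go-left, -1 go-right, 0 not on path.
P = np.array([
    [ 1,  1,  0],   # leaf 0: left at node 0, left at node 1
    [ 1, -1,  0],   # leaf 1: left at node 0, right at node 1
    [-1,  0,  1],   # leaf 2: right at node 0, left at node 2
    [-1,  0, -1],   # leaf 3: right at node 0, right at node 2
])
depths = np.abs(P).sum(axis=1)          # path length of each leaf

def predict_leaf(x):
    d = np.where(x[features] <= thresholds, 1, -1)  # all node decisions at once
    # The reached leaf scores exactly 0; every other leaf scores <= -2.
    return int(np.argmax(P @ d - depths))

def predict_leaf_recursive(x):          # reference implementation
    if x[0] <= 0.5:
        return 0 if x[1] <= 0.3 else 1
    return 2 if x[1] <= 0.7 else 3

rng = np.random.default_rng(0)
for x in rng.uniform(size=(1000, 2)):
    assert predict_leaf(x) == predict_leaf_recursive(x)
print("matrix traversal matches recursive traversal")
```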

Updated: 2024-06-17 16:34:32

标题: 重新思考树的遍历

摘要: 我们将展示如何在矩阵计算语言中实现二叉决策树遍历。我们的主要贡献是提出一些基于决策树层次结构的新型矩阵表示的等效二叉树遍历算法。我们的关键思想是通过最大内积搜索遍历二叉决策树。我们不仅实现了决策树方法,而且深入探讨了基于树的方法的分区性质。

更新时间: 2024-06-17 16:34:32

领域: cs.LG,cs.DS,cs.NA,math.NA

下载: http://arxiv.org/abs/2209.04825v5

Threat analysis and adversarial model for Smart Grids

The power grid is a critical infrastructure that allows for the efficient and robust generation, transmission, delivery and consumption of electricity. In recent years, the physical components have been equipped with computing and network devices, which optimizes the operation and maintenance of the grid. The cyber domain of this smart power grid opens a new plethora of threats, which adds to the classical threats in the physical domain. Accordingly, different stakeholders, including regulation bodies, industry, and academia, are making increasing efforts to provide security mechanisms to mitigate and reduce cyber-risks. Despite these efforts, there have been various cyberattacks that have affected the smart grid, leading in some cases to catastrophic consequences and showcasing that the industry might not be prepared for attacks from high-profile adversaries. At the same time, recent work shows a lack of agreement among grid practitioners and academic experts on the feasibility and consequences of academic-proposed threats. This is in part due to inadequate simulation models which do not evaluate threats based on attackers' full capabilities and goals. To address this gap, in this work we first analyze the main attack surfaces of the smart grid, and then conduct a threat analysis from the adversarial model perspective, including different levels of knowledge, goals, motivations and capabilities. To validate the model, we provide real-world examples of the potential capabilities by studying known vulnerabilities in critical components, and then analyzing existing cyber-attacks that have affected the smart grid, either directly or indirectly.

Updated: 2024-06-17 16:33:46

标题: 智能电网的威胁分析和对抗模型

摘要: 电网是一项关键基础设施,可以实现电力的高效、稳定的发电、传输、配送和消费。近年来,物理组件已经配备了计算和网络设备,优化了电网的运营和维护。智能电网的网络领域开启了一系列新的威胁,这些威胁增加了对物理领域的传统威胁。因此,包括监管机构、行业和学术界在内的不同利益相关者正在加大力度提供安全机制,以减轻和降低网络风险。尽管有这些努力,智能电网遭受了各种网络攻击,有些情况下导致了灾难性后果,这表明行业可能并未准备好应对来自高等级对手的攻击。与此同时,最近的研究显示,在智能电网从业者和学术专家之间对于学术提出的威胁的可行性和后果缺乏一致意见。这在一定程度上是由于不足的模拟模型,这些模型没有根据攻击者的完整能力和目标来评估威胁。为了填补这一空白,本文首先分析了智能电网的主要攻击面,然后从对抗模型的角度进行威胁分析,包括不同水平的知识、目标、动机和能力。为了验证模型,我们通过研究关键组件中已知的漏洞,提供了潜在能力的现实世界示例,然后分析已经影响了智能电网的现有网络攻击,无论是直接还是间接影响。

更新时间: 2024-06-17 16:33:46

领域: cs.CR

下载: http://arxiv.org/abs/2406.11716v1

Measuring memorization in RLHF for code completion

Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important as real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized, in comparison to aligning via directly fine-tuning on this data, but that examples already memorized during the fine-tuning stage of RLHF, will, in the majority of cases, remain memorized after RLHF.
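
Concretely, measuring memorization usually reduces to comparing a model's completion of a training prefix against the true continuation; the sketch below uses a character-level similarity ratio as one such measure. The prefix split and the stand-in model output are illustrative choices, not the paper's protocol.

```python
# Illustrative memorization check: compare a model's completion of a training
# prefix against the true continuation. The split point and stand-in output
# are arbitrary; this is not the paper's measurement protocol.
import difflib

def memorization_score(generated: str, reference: str) -> float:
    """1.0 = exact regurgitation of the training continuation, 0.0 = none."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()

train_example = "def add(a, b):\n    return a + b"
prefix, continuation = train_example[:15], train_example[15:]
model_output = "    return a + b"          # stand-in for model(prefix)
print(memorization_score(model_output, continuation))
```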

Updated: 2024-06-17 16:33:35

标题: 在RLHF中衡量代码补全的记忆化

摘要: 人类反馈强化学习(RLHF)已成为将大型模型与用户偏好对齐的主要方法。与微调不同,关于训练数据记忆有许多研究,但目前尚不清楚记忆是如何受到或在RLHF对齐过程中引入的。了解这种关系很重要,因为可能会收集和使用真实用户数据来对齐大型模型;如果用户数据在RLHF过程中被记忆并在以后被复述,这可能会引起隐私问题。在这项工作中,我们分析了训练数据记忆如何在RLHF的每个阶段中浮现并传播。我们的研究重点是代码完成模型,因为代码完成是大型语言模型最流行的用例之一。我们发现,与直接在这些数据上微调对齐相比,RLHF显著降低了用于奖励建模和强化学习的数据被记忆的机会,但在RLHF微调阶段已经记忆的示例,在大多数情况下,将在RLHF后继续被记忆。

更新时间: 2024-06-17 16:33:35

领域: cs.LG,cs.CL,cs.SE

下载: http://arxiv.org/abs/2406.11715v1

Scalable Expressiveness through Preprocessed Graph Perturbations

Graph Neural Networks (GNNs) have emerged as the predominant method for analyzing graph-structured data. However, canonical GNNs have limited expressive power and generalization capability, thus triggering the development of more expressive yet computationally intensive methods. One such approach is to create a series of perturbed versions of input graphs and then repeatedly conduct multiple message-passing operations on all variations during training. Despite their expressive power, this approach does not scale well on larger graphs. To address this scalability issue, we introduce Scalable Expressiveness through Preprocessed Graph Perturbation (SE2P). This model offers a flexible, configurable balance between scalability and generalizability with four distinct configuration classes. At one extreme, the configuration prioritizes scalability through minimal learnable feature extraction and extensive preprocessing; at the other extreme, it enhances generalizability with more learnable feature extractions, though this increases scalability costs. We conduct extensive experiments on real-world datasets to evaluate the generalizability and scalability of SE2P variants compared to various state-of-the-art benchmarks. Our results indicate that, depending on the chosen SE2P configuration, the model can enhance generalizability compared to benchmarks while achieving significant speed improvements of up to 8-fold.

Updated: 2024-06-17 16:32:57

标题: 通过预处理图扰动实现可扩展性表达

摘要: 图神经网络(GNNs)已经成为分析图结构化数据的主要方法。然而,传统的GNNs具有有限的表达能力和泛化能力,因此触发了更具表现力但计算密集的方法的发展。其中一种方法是创建输入图的一系列扰动版本,然后在训练过程中反复对所有变体进行多次消息传递操作。尽管这种方法具有表现力,但在更大的图上并不很好扩展。为了解决这个可扩展性问题,我们引入了通过预处理图扰动(SE2P)实现可伸缩性表现力。该模型提供了四种不同配置类之间的灵活、可配置的可扩展性和泛化性之间的平衡。在一个极端,配置通过最小化可学习特征提取和广泛的预处理来优先考虑可扩展性;在另一个极端,它通过更多的可学习特征提取增强泛化性,尽管这会增加可扩展性成本。我们在真实世界的数据集上进行了大量实验,以评估SE2P变体相对于各种最先进的基准的泛化性和可扩展性。我们的结果表明,根据选择的SE2P配置,该模型可以相对于基准提高泛化性,同时实现高达8倍的速度改进。

更新时间: 2024-06-17 16:32:57

领域: cs.LG

下载: http://arxiv.org/abs/2406.11714v1

Topology-aware Federated Learning in Edge Computing: A Comprehensive Survey

The ultra-low latency requirements of 5G/6G applications and privacy constraints call for distributed machine learning systems to be deployed at the edge. With its simple yet effective approach, federated learning (FL) is a natural solution for massive user-owned devices in edge computing with distributed and private training data. FL methods based on FedAvg typically follow a naive star topology, ignoring the heterogeneity and hierarchy of the volatile edge computing architectures and topologies in reality. Several other network topologies exist and can address the limitations and bottlenecks of the star topology. This motivates us to survey network topology-related FL solutions. In this paper, we conduct a comprehensive survey of the existing FL works focusing on network topologies. After a brief overview of FL and edge computing networks, we discuss various edge network topologies as well as their advantages and disadvantages. Lastly, we discuss the remaining challenges and future works for applying FL to topology-specific edge networks.

Updated: 2024-06-17 16:27:17

标题: 边缘计算中的拓扑感知联邦学习:一项全面调查

摘要: 5G/6G 应用的超低延迟要求和隐私约束要求在边缘部署分布式机器学习系统。采用简单而有效的方法,联邦学习(FL)是边缘计算中拥有大量用户设备和分布式私有训练数据的自然解决方案。基于 FedAvg 的 FL 方法通常遵循天真的星形拓扑结构,忽略了现实中不稳定的边缘计算架构和拓扑的异质性和层次性。存在多种其他网络拓扑结构,可以解决星形拓扑结构的局限性和瓶颈。这促使我们对与网络拓扑相关的 FL 解决方案进行调查。在本文中,我们对现有的关于网络拓扑的 FL 工作进行了全面调查。在简要介绍 FL 和边缘计算网络之后,我们讨论了各种边缘网络拓扑结构及其优缺点。最后,我们讨论了将 FL 应用于特定拓扑结构的边缘网络所面临的挑战和未来工作。

更新时间: 2024-06-17 16:27:17

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2302.02573v2

Tackling the Curse of Dimensionality in Fractional and Tempered Fractional PDEs with Physics-Informed Neural Networks

Fractional and tempered fractional partial differential equations (PDEs) are effective models of long-range interactions, anomalous diffusion, and non-local effects. Traditional numerical methods for these problems are mesh-based, thus struggling with the curse of dimensionality (CoD). Physics-informed neural networks (PINNs) offer a promising solution due to their universal approximation, generalization ability, and mesh-free training. In principle, Monte Carlo fractional PINN (MC-fPINN) estimates fractional derivatives using Monte Carlo methods and thus could lift CoD. However, this may cause significant variance and errors, hence affecting convergence; in addition, MC-fPINN is sensitive to hyperparameters. In general, numerical methods and specifically PINNs for tempered fractional PDEs are under-developed. Herein, we extend MC-fPINN to tempered fractional PDEs to address these issues, resulting in the Monte Carlo tempered fractional PINN (MC-tfPINN). To reduce possible high variance and errors from Monte Carlo sampling, we replace the one-dimensional (1D) Monte Carlo with 1D Gaussian quadrature, applicable to both MC-fPINN and MC-tfPINN. We validate our methods on various forward and inverse problems of fractional and tempered fractional PDEs, scaling up to 100,000 dimensions. Our improved MC-fPINN/MC-tfPINN using quadrature consistently outperforms the original versions in accuracy and convergence speed in very high dimensions.
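
The key numerical swap described here, replacing 1D Monte Carlo with 1D Gaussian quadrature, is easy to illustrate in isolation; the smooth integrand below is a toy stand-in, not the actual fractional-derivative kernel.

```python
# Toy illustration of the 1D swap: Monte Carlo vs Gauss-Legendre quadrature
# for a smooth integrand on [0, 1]. The integrand is a stand-in, not the
# fractional-derivative kernel used inside MC-fPINN/MC-tfPINN.
import numpy as np

f = lambda s: np.exp(-s)
exact = 1.0 - np.exp(-1.0)                    # closed form for comparison

rng = np.random.default_rng(0)
mc = f(rng.uniform(size=1000)).mean()         # 1,000-sample Monte Carlo

nodes, weights = np.polynomial.legendre.leggauss(16)
s = 0.5 * (nodes + 1.0)                       # map [-1, 1] -> [0, 1]
gq = 0.5 * np.dot(weights, f(s))              # 16-point Gauss-Legendre

print(f"MC error: {abs(mc - exact):.2e}  GQ error: {abs(gq - exact):.2e}")
# The quadrature estimate is near machine precision with a handful of nodes,
# while the Monte Carlo error decays only like 1/sqrt(n).
```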

Updated: 2024-06-17 16:26:18

标题: 使用物理信息神经网络解决分数和回火分数PDE中的维度诅咒

摘要: 分数和回火分数(tempered fractional)偏微分方程(PDE)是长程相互作用、异常扩散和非局部效应的有效模型。这些问题的传统数值方法是基于网格的,因此受到维度诅咒(CoD)的困扰。基于物理信息的神经网络(PINN)由于其通用逼近、泛化能力和无网格训练而提供了一个有前途的解决方案。原则上,蒙特卡罗分数PINN(MC-fPINN)使用蒙特卡罗方法估计分数导数,从而可能解决维度诅咒。然而,这可能导致显著的方差和错误,从而影响收敛;此外,MC-fPINN对超参数敏感。总的来说,针对回火分数PDE的数值方法,特别是PINN,尚未完善。在本文中,我们将MC-fPINN扩展到回火分数PDE,以解决这些问题,结果是蒙特卡罗回火分数PINN(MC-tfPINN)。为了减少蒙特卡罗采样可能导致的高方差和错误,我们将一维(1D)蒙特卡罗替换为适用于MC-fPINN和MC-tfPINN的一维高斯求积。我们在各种分数和回火分数PDE的正向和反向问题上验证了我们的方法,扩展到100,000个维度。我们改进的使用求积的MC-fPINN/MC-tfPINN在非常高的维度中,其准确性和收敛速度始终优于原始版本。

更新时间: 2024-06-17 16:26:18

领域: math.NA,cs.LG,cs.NA,math.DS,F.2.2; I.2.7

下载: http://arxiv.org/abs/2406.11708v1

A First Physical-World Trajectory Prediction Attack via LiDAR-induced Deceptions in Autonomous Driving

Trajectory prediction forecasts nearby agents' moves based on their historical trajectories. Accurate trajectory prediction is crucial for autonomous vehicles. Existing attacks compromise the prediction model of a victim AV by directly manipulating the historical trajectory of an attacker AV, which has limited real-world applicability. This paper, for the first time, explores an indirect attack approach that induces prediction errors via attacks against the perception module of a victim AV. Although it has been shown that physically realizable attacks against LiDAR-based perception are possible by placing a few objects at strategic locations, it is still an open challenge to find an object location from the vast search space in order to launch effective attacks against prediction under varying victim AV velocities. Through analysis, we observe that a prediction model is prone to an attack focusing on a single point in the scene. Consequently, we propose a novel two-stage attack framework to realize the single-point attack. The first stage of prediction-side attack efficiently identifies, guided by the distribution of detection results under object-based attacks against perception, the state perturbations for the prediction model that are effective and velocity-insensitive. In the second stage of location matching, we match the feasible object locations with the found state perturbations. Our evaluation using a public autonomous driving dataset shows that our attack causes a collision rate of up to 63% and various hazardous responses of the victim AV. The effectiveness of our attack is also demonstrated on a real testbed car. To the best of our knowledge, this study is the first security analysis spanning from LiDAR-based perception to prediction in autonomous driving, leading to a realistic attack on prediction. To counteract the proposed attack, potential defenses are discussed.

Updated: 2024-06-17 16:26:00

标题: 通过激光雷达诱导的欺骗在自动驾驶中进行的首次物理世界轨迹预测攻击

摘要: 轨迹预测是根据周围代理的历史轨迹来预测其移动的。准确的轨迹预测对于自动驾驶车辆至关重要。现有的攻击通过直接操纵攻击者车辆的历史轨迹来破坏受害者自动驾驶车辆的预测模型,但在现实世界中的适用性有限。本文首次探讨了一种间接攻击方法,通过针对受害者车辆感知模块的攻击来诱发预测误差。虽然已经证明通过在战略位置放置少量物体可以对基于LiDAR的感知进行物理攻击,但在差异受害者车辆速度下找到有效的物体位置以发动有效攻击仍然是一个挑战。 通过分析,我们观察到预测模型容易受到攻击集中在场景中的单个点上。因此,我们提出了一个新颖的两阶段攻击框架来实现单点攻击。预测端攻击的第一阶段高效地识别出那些对预测模型有效且对速度不敏感的状态扰动,这是根据针对感知的基于物体攻击的检测结果分布来引导的。在第二阶段的位置匹配中,我们将可行的物体位置与找到的状态扰动匹配。我们使用公开的自动驾驶数据集进行评估,结果显示我们的攻击导致受害者自动驾驶车辆的碰撞率高达63%,并出现各种危险响应。我们的攻击效果也在实际测试车辆上得到了证明。据我们所知,这项研究是涵盖了基于LiDAR感知到自动驾驶预测的首次安全分析,导致了对预测的现实攻击。为了对抗所提出的攻击,我们讨论了潜在的防御措施。

更新时间: 2024-06-17 16:26:00

领域: cs.CR,cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.11707v1

Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels

We develop a method for training small-scale (under 100M parameter) neural information retrieval models with as few as 10 gold relevance labels. The method depends on generating synthetic queries for documents using a language model (LM), and the key step is that we automatically optimize the LM prompt that is used to generate these queries based on training quality. In experiments with the BIRCO benchmark, we find that models trained with our method outperform RankZephyr and are competitive with RankLLama, both of which are 7B parameter models trained on over 100K labels. These findings point to the power of automatic prompt optimization for synthetic dataset generation.

Updated: 2024-06-17 16:25:55

标题: 提示作为自动优化的训练超参数:使用10个黄金标签从头开始训练最佳的IR模型

摘要: 我们开发了一种方法,用于训练规模较小(少于1亿参数)的神经信息检索模型,只需10个金标签。该方法依赖于使用语言模型(LM)为文档生成合成查询,关键步骤是我们自动优化用于生成这些查询的LM提示,基于训练质量。在与BIRCO基准的实验中,我们发现使用我们的方法训练的模型胜过RankZephyr,并且与RankLLama竞争激烈,两者都是在超过100K标签上训练的7B参数模型。这些发现表明自动提示优化对于合成数据集生成的强大作用。

更新时间: 2024-06-17 16:25:55

领域: cs.IR,cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11706v1

Nemotron-4 340B Technical Report

We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.

Updated: 2024-06-17 16:25:04

标题: Nemotron-4 340B技术报告

摘要: 我们发布了Nemotron-4 340B型号系列,包括Nemotron-4-340B-Base、Nemotron-4-340B-Instruct和Nemotron-4-340B-Reward。我们的模型在NVIDIA开放模型许可协议下开放访问,这是一种允许分发、修改和使用模型及其输出的宽松模型许可证。这些模型在各种评估基准上表现出色,当以FP8精度部署时,可以适配单个DGX H100与8个GPU。我们相信社区可以从这些模型中受益,在各种研究和商业应用中特别适用于生成合成数据以训练更小的语言模型。值得注意的是,我们模型对齐过程中使用的超过98%的数据是合成生成的,展示了这些模型在生成合成数据方面的有效性。为了进一步支持开放研究并促进模型开发,我们还开源了在我们的模型对齐过程中使用的合成数据生成管道。

更新时间: 2024-06-17 16:25:04

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11704v1

Multiple Descents in Unsupervised Learning: The Role of Noise, Domain Shift and Anomalies

The phenomenon of double descent has recently gained attention in supervised learning. It challenges the conventional wisdom of the bias-variance trade-off by showcasing a surprising behavior. As the complexity of the model increases, the test error initially decreases until reaching a certain point where the model starts to overfit the train set, causing the test error to rise. However, deviating from classical theory, the error exhibits another decline when exceeding a certain degree of over-parameterization. We study the presence of double descent in unsupervised learning, an area that has received little attention and is not yet fully understood. We conduct extensive experiments using under-complete auto-encoders (AEs) for various applications, such as dealing with noisy data, domain shifts, and anomalies. We use synthetic and real data and identify model-wise, epoch-wise, and sample-wise double descent for all the aforementioned applications. Finally, we assessed the usability of the AEs for detecting anomalies and mitigating the domain shift between datasets. Our findings indicate that over-parameterized models can improve performance not only in terms of reconstruction, but also in enhancing capabilities for the downstream task.

Updated: 2024-06-17 16:24:23

标题: 无监督学习中的多重下降:噪声、领域转移和异常的作用

摘要: 最近,双下降现象在监督学习中引起了关注。它挑战了传统的偏差-方差权衡的智慧,展示了一种令人惊讶的行为。随着模型复杂度的增加,测试误差最初会下降,直到达到某个点,模型开始过度拟合训练集,导致测试误差上升。然而,与经典理论不同,当超过一定程度的过参数化时,误差会再次下降。我们研究了无监督学习中双下降的存在,这是一个受到很少关注且尚未完全理解的领域。我们使用欠完备自动编码器(AEs)进行各种应用的广泛实验,如处理嘈杂数据、领域转移和异常值。我们使用合成和真实数据,并针对所有上述应用识别了模型、时代和样本上的双下降。最后,我们评估了AEs在检测异常和缓解数据集之间的领域转移方面的可用性。我们的研究结果表明,过参数化模型不仅可以改善重构性能,还可以增强下游任务的能力。

更新时间: 2024-06-17 16:24:23

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.11703v1

Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale

A syntactic language model (SLM) incrementally generates a sentence with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs such as relying on gold trees and sequential training. It consists of two components, a usual SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with $9$ billion tokens, and demonstrate the superiority of GPST over GPT-2 with a comparable size in numerous tasks covering both language understanding and language generation. Meanwhile, GPST also significantly outperforms existing unsupervised SLMs on left-to-right grammar induction, while holding a substantial acceleration on training.

Updated: 2024-06-17 16:22:52

标题: 预训练结构化生成变压器:无监督的规模化句法语言模型

摘要: 一种句法语言模型(SLM)以从左到右的方式递增地生成具有其句法树的句子。我们提出了生成式预训练结构变换器(GPST),这是一个规模化的无监督SLM,能够从头开始在原始文本上进行高并行预训练。GPST规避了以往SLM的限制,例如依赖于金标准树和顺序训练。它由两个组件组成,一个常规SLM由单向语言建模损失监督,另一个是附加的组合模型,诱导句法分析树并计算成分表示,由双向语言建模损失监督。我们提出了一个表示替代方案,以便以硬EM方式联合并行训练这两个模型。我们在包含90亿令牌的OpenWebText语料库上对GPST进行了预训练,并在许多任务中展示了GPST在涵盖语言理解和语言生成的性能上优于具有相似规模的GPT-2。同时,GPST在从左到右的语法归纳方面也明显优于现有的无监督SLM,同时在训练加速方面具有显著的优势。

更新时间: 2024-06-17 16:22:52

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2403.08293v3

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.

Updated: 2024-06-17 16:13:29

标题: 奉承到诡计:探讨大型语言模型中的操纵奖励

摘要: 在强化学习中,规范游戏(specification gaming)发生在人工智能系统学到了由于训练目标错误而高度奖励的不良行为时。规范游戏可以从简单的讨好行为扩展到复杂和有害的行为,比如奖励篡改,即模型直接修改自己的奖励机制。然而,这些更有害的行为可能太复杂,无法通过探索发现。本文研究了大型语言模型(LLM)助手是否会推广到执行更罕见和更明显的规范游戏形式,直至包括奖励篡改。我们构建了一个逐渐复杂的可游戏环境课程,并发现在早期课程环境上训练会导致在剩余环境中更多的规范游戏。令人惊讶的是,在一小部分时间里,接受完整课程训练的LLM助手会零样本推广到直接重写自己的奖励函数。重新训练LLM以防止在早期课程环境中规范游戏可以减轻,但不能消除后续环境中的奖励篡改。此外,在我们的可游戏环境中添加无害性训练并不能阻止奖励篡改。这些结果表明,LLMs可以从常见的规范游戏形式推广到更有害的奖励篡改,而这种行为可能不容易消除。

更新时间: 2024-06-17 16:13:29

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.10162v2

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by up to 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy

Updated: 2024-06-17 16:12:03

标题: 优化多阶段语言模型程序的指导和演示

摘要: 语言模型程序,即复杂的模块化语言模型(LM)调用流水线,正日益推动NLP任务的发展,但它们需要制定对所有模块都有效的提示。我们研究了LM程序的提示优化,即如何更新这些提示以最大化下游指标,而不需要访问模块级标签或梯度。为了使这个问题可行,我们将问题分解为优化每个模块的自由形式指令和少样本演示,并引入了几种策略来制定任务相关的指令并在模块之间导航信用分配。我们的策略包括(i)为提出有效指令的程序和数据感知技术,(ii)用于学习我们目标的替代模型的随机小批量评估函数,(iii)一种元优化程序,在这种程序中,我们改进LMs随时间如何构建提议的方法。利用这些见解,我们开发了MIPRO,一种新型优化器,在使用最佳开源模型(Llama-3-8B)的六种不同LM程序中,准确率最高提高了12.9%。我们将在https://github.com/stanfordnlp/dspy上发布我们的新优化器和基准。

更新时间: 2024-06-17 16:12:03

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11695v1

Iterative or Innovative? A Problem-Oriented Perspective for Code Optimization

Large language models (LLMs) have demonstrated strong capabilities in solving a wide range of programming tasks. However, LLMs have rarely been explored for code optimization. In this paper, we explore code optimization with a focus on performance enhancement, specifically aiming to optimize code for minimal execution time. The recently proposed first PIE dataset for performance optimization constructs program optimization pairs based on iterative submissions from the same programmer for the same problem. However, this approach restricts LLMs to local performance improvements, neglecting global algorithmic innovation. Therefore, we adopt a completely different perspective by reconstructing the optimization pairs into a problem-oriented approach. This allows for the integration of various ingenious ideas from different programmers tackling the same problem. Experimental results demonstrate that adapting LLMs to problem-oriented optimization pairs significantly enhances their optimization capabilities. Meanwhile, we identified performance bottlenecks within the problem-oriented perspective. By employing model merge, we further overcame bottlenecks and ultimately elevated the program optimization ratio ($51.76\%\rightarrow76.65\%$) and speedup ($2.65\times\rightarrow5.09\times$) to new levels.
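
The dataset reconstruction at the heart of the problem-oriented view is simple to sketch: group submissions by problem and pair every slow solution with any faster one, regardless of author. The field names below are illustrative, not the PIE schema.

```python
# Sketch of problem-oriented optimization pairs: for each problem, pair every
# slow solution with a faster solution from ANY programmer, rather than only
# the same programmer's next iteration. Field names are illustrative.
from collections import defaultdict

submissions = [  # (problem_id, user_id, runtime_ms, code)
    ("p1", "alice", 120, "code_a1"),
    ("p1", "alice",  90, "code_a2"),
    ("p1", "bob",    15, "code_b1"),   # a different programmer's faster idea
    ("p2", "carol", 300, "code_c1"),
    ("p2", "dave",  250, "code_d1"),
]

by_problem = defaultdict(list)
for prob, user, ms, code in submissions:
    by_problem[prob].append((ms, user, code))

pairs = []
for prob, subs in by_problem.items():
    subs.sort()                        # fastest first
    for ms_slow, _, slow_code in subs:
        for ms_fast, _, fast_code in subs:
            if ms_fast < ms_slow:      # cross-programmer speedups allowed
                pairs.append((prob, slow_code, fast_code, ms_slow / ms_fast))

for pair in pairs:
    print(pair)
```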

Updated: 2024-06-17 16:10:10

标题: 迭代还是创新?代码优化的问题导向视角

摘要: 大型语言模型(LLMs)已经展示出在解决各种编程任务方面的强大能力。然而,LLMs很少被用于代码优化。本文探讨了以性能增强为重点的代码优化,特别是旨在优化代码以实现最小执行时间。最近提出的第一个PIE数据集用于性能优化,根据同一程序员针对同一问题的迭代提交构建了程序优化对。然而,这种方法限制了LLMs对局部性能改进,忽略了全局算法创新。因此,我们采用完全不同的视角,将优化对重新构建为以问题为导向的方法。这允许将不同程序员解决同一问题时的各种巧妙想法整合在一起。实验结果表明,将LLMs调整为以问题为导向的优化对显著增强了它们的优化能力。与此同时,我们在问题导向的角度内确定了性能瓶颈。通过采用模型合并,我们进一步克服了瓶颈,最终将程序优化比例($51.76\%\rightarrow76.65\%$)和加速比($2.65\times\rightarrow5.09\times$)提升到新的水平。

更新时间: 2024-06-17 16:10:10

领域: cs.PL,cs.AI,cs.SE

下载: http://arxiv.org/abs/2406.11935v1

t-DGR: A Trajectory-Based Deep Generative Replay Method for Continual Learning in Decision Making

Deep generative replay has emerged as a promising approach for continual learning in decision-making tasks. This approach addresses the problem of catastrophic forgetting by leveraging the generation of trajectories from previously encountered tasks to augment the current dataset. However, existing deep generative replay methods for continual learning rely on autoregressive models, which suffer from compounding errors in the generated trajectories. In this paper, we propose a simple, scalable, and non-autoregressive method for continual learning in decision-making tasks using a generative model that generates task samples conditioned on the trajectory timestep. We evaluate our method on Continual World benchmarks and find that our approach achieves state-of-the-art performance on the average success rate metric among continual learning methods. Code is available at https://github.com/WilliamYue37/t-DGR.

Updated: 2024-06-17 16:04:27

标题: t-DGR:一种基于轨迹的深度生成式回放方法,用于决策连续学习

摘要: 深度生成式回放已经成为决策任务中持续学习的一种有前途的方法。这种方法通过利用从先前遇到的任务生成轨迹来扩充当前数据集,解决了灾难性遗忘的问题。然而,现有的深度生成式回放方法依赖于自回归模型,在生成的轨迹中存在着复合错误。在本文中,我们提出了一种简单、可扩展且非自回归的方法,用于在决策任务中进行持续学习,该方法使用生成模型生成基于轨迹时间步的任务样本。我们在持续世界基准测试上评估了我们的方法,并发现我们的方法在持续学习方法中的平均成功率指标上取得了最先进的性能。代码可在https://github.com/WilliamYue37/t-DGR找到。

更新时间: 2024-06-17 16:04:27

领域: cs.LG,cs.AI,cs.NE

下载: http://arxiv.org/abs/2401.02576v2

The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation

In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which succeeds under a single-policy coverage condition on the dataset, namely which outputs a policy whose value is at least that of any policy which is well-covered by the dataset. Even in the setting when the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error ${\varepsilon_{\mathrm{BE}}} > 0$, we show that the suboptimality error of our algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$. Furthermore, we prove that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error.
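
For orientation, the central quantity is, in the standard formulation the abstract presupposes, the worst-case distance from the Bellman backup of any function in the class back to the class itself (the paper works with the greedy-policy variant of the backup):

```latex
% Standard notion assumed here, stated for orientation; see the paper for the
% precise greedy-policy variant. Inherent Bellman error of a class F:
\varepsilon_{\mathrm{BE}}
  \;=\; \sup_{f \in \mathcal{F}} \, \inf_{g \in \mathcal{F}}
        \bigl\| g - \mathcal{T} f \bigr\|_{\infty},
\qquad
(\mathcal{T} f)(s,a)
  \;=\; r(s,a) + \gamma \,
        \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
        \Bigl[ \max_{a'} f(s',a') \Bigr].
```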

Updated: 2024-06-17 16:04:06

标题: 线性函数逼近下离线强化学习中固有贝尔曼误差的作用

摘要: 在这篇论文中,我们研究了具有线性函数逼近的离线RL问题。我们的主要结构假设是MDP具有较低的固有贝尔曼误差,这意味着线性值函数在贪婪策略下具有线性贝尔曼备份。这个假设是自然的,因为它实质上是值迭代成功所需的最小假设。我们提出了一个计算效率高的算法,在数据集上满足单策略覆盖条件时成功,即输出一个其价值至少不低于任何被数据集充分覆盖的策略的策略。即使在固有贝尔曼误差为0的情况下(称为线性贝尔曼完备性),我们的算法也可以在单策略覆盖下提供已知的第一个保证。在固有贝尔曼误差为正的情况下${\varepsilon_{\mathrm{BE}}} > 0$,我们展示了我们算法的次优性误差与$\sqrt{\varepsilon_{\mathrm{BE}}}$成比例。此外,我们证明了次优性与$\sqrt{\varepsilon_{\mathrm{BE}}}$的比例是任何算法都无法改进的。我们的下界与强化学习中许多其他存在模型错误设定(misspecification)的设置形成对比,在那些设置中,通常可以获得仅随错误设定误差线性退化的性能。

更新时间: 2024-06-17 16:04:06

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2406.11686v1

Bridging Design Gaps: A Parametric Data Completion Approach With Graph Guided Diffusion Models

This study introduces a generative imputation model leveraging graph attention networks and tabular diffusion models for completing missing parametric data in engineering designs. This model functions as an AI design co-pilot, providing multiple design options for incomplete designs, which we demonstrate using the bicycle design CAD dataset. Through comparative evaluations, we demonstrate that our model significantly outperforms existing classical methods, such as MissForest, hotDeck, PPCA, and tabular generative method TabCSDI in both the accuracy and diversity of imputation options. Generative modeling also enables a broader exploration of design possibilities, thereby enhancing design decision-making by allowing engineers to explore a variety of design completions. The graph model combines GNNs with the structural information contained in assembly graphs, enabling the model to understand and predict the complex interdependencies between different design parameters. The graph model helps accurately capture and impute complex parametric interdependencies from an assembly graph, which is key for design problems. By learning from an existing dataset of designs, the imputation capability allows the model to act as an intelligent assistant that autocompletes CAD designs based on user-defined partial parametric design, effectively bridging the gap between ideation and realization. The proposed work provides a pathway to not only facilitate informed design decisions but also promote creative exploration in design.

Updated: 2024-06-17 16:03:17

标题: 弥合设计差距:基于图引导扩散模型的参数化数据补全方法

摘要: 这项研究介绍了一种利用图注意力网络和表格扩散模型的生成填充模型,用于完成工程设计中缺失的参数数据。该模型作为AI设计副驾驶,为不完整的设计提供多个设计选项,我们使用自行车设计CAD数据集进行了演示。通过比较评估,我们证明我们的模型在填充选项的准确性和多样性方面显著优于现有的经典方法,如MissForest、hotDeck、PPCA和表格生成方法TabCSDI。生成建模还可以更广泛地探索设计可能性,从而通过允许工程师探索各种设计完成方案来增强设计决策能力。图模型将GNN与装配图中包含的结构信息相结合,使模型能够理解和预测不同设计参数之间的复杂相互依赖关系。图模型有助于准确捕获并填充装配图中的复杂参数相互依赖关系,这对于设计问题至关重要。通过从现有设计数据集中学习,填充能力使模型能够充当智能助手,根据用户定义的部分参数设计自动完成CAD设计,有效地弥合构思和实现之间的差距。提出的工作不仅有助于促进明智的设计决策,还有助于推动设计中的创意探索。

更新时间: 2024-06-17 16:03:17

领域: cs.LG,cs.AI,cs.CE,cs.HC

下载: http://arxiv.org/abs/2406.11934v1

Edge Classification on Graphs: New Directions in Topological Imbalance

Recent years have witnessed the remarkable success of applying Graph machine learning (GML) to node/graph classification and link prediction. However, the edge classification task, which enjoys numerous real-world applications such as social network analysis and cybersecurity, has not seen significant advancement. To address this gap, our study pioneers a comprehensive approach to edge classification. We identify a novel 'Topological Imbalance Issue', which arises from the skewed distribution of edges across different classes, affecting the local subgraph of each edge and harming the performance of edge classification. Inspired by the recent studies in node classification that the performance discrepancy exists with varying local structural patterns, we aim to investigate if the performance discrepancy in topological imbalanced edge classification can also be mitigated by characterizing the local class distribution variance. To overcome this challenge, we introduce Topological Entropy (TE), a novel topological-based metric that measures the topological imbalance for each edge. Our empirical studies confirm that TE effectively measures local class distribution variance, and indicate that prioritizing edges with high TE values can help address the issue of topological imbalance. Based on this, we develop two strategies - Topological Reweighting and TE Wedge-based Mixup - to focus training on (synthetic) edges based on their TEs. While topological reweighting directly manipulates training edge weights according to TE, our wedge-based mixup interpolates synthetic edges between high TE wedges. Ultimately, we integrate these strategies into a novel topological imbalance strategy for edge classification: TopoEdge. Through extensive experiments, we demonstrate the efficacy of our proposed strategies on newly curated datasets and thus establish a new benchmark for (imbalanced) edge classification.

Updated: 2024-06-17 16:02:36

标题: 图中的边缘分类:拓扑不平衡中的新方向

摘要: 近年来,将图机器学习(GML)应用于节点/图分类和链接预测取得了显著的成功。然而,享有诸多实际应用的边缘分类任务,如社交网络分析和网络安全,尚未取得重大进展。为了填补这一差距,我们的研究首创了一种全面的边缘分类方法。我们确定了一种新颖的“拓扑不平衡问题”,这是由于边缘在不同类别之间的分布倾斜所导致的,影响了每个边缘的局部子图,并损害了边缘分类的性能。受到最近在节点分类中的研究的启发,即性能差异存在于不同的局部结构模式中,我们的目标是研究是否可以通过表征局部类分布方差来缓解拓扑不平衡边缘分类中的性能差异。为了克服这一挑战,我们引入了拓扑熵(TE),一种衡量每个边缘拓扑不平衡的新颖基于拓扑的度量。我们的实证研究证实TE有效地衡量了局部类分布方差,并表明优先考虑具有高TE值的边缘可以帮助解决拓扑不平衡问题。基于此,我们开发了两种策略 - 拓扑重新加权和TE楔形混合 - 以便根据它们的TE值聚焦训练(合成)边缘。虽然拓扑重新加权根据TE直接操纵训练边缘权重,但我们的楔形混合在高TE楔之间插入合成边缘。最终,我们将这些策略整合到一种新颖的边缘分类拓扑不平衡策略中:TopoEdge。通过大量实验,我们展示了我们提出的策略在新的精心策划的数据集上的有效性,从而为(不平衡的)边缘分类建立了一个新的基准。

更新时间: 2024-06-17 16:02:36

领域: cs.LG,cs.SI

下载: http://arxiv.org/abs/2406.11685v1

On the Convergence of Zeroth-Order Federated Tuning for Large Language Models

The confluence of Federated Learning (FL) and Large Language Models (LLMs) is ushering in a new era in privacy-preserving natural language processing. However, the intensive memory requirements for fine-tuning LLMs pose significant challenges, especially when deploying on clients with limited computational resources. To circumvent this, we explore the novel integration of Memory-efficient Zeroth-Order Optimization within a federated setting, a synergy we term as FedMeZO. Our study is the first to examine the theoretical underpinnings of FedMeZO in the context of LLMs, tackling key questions regarding the influence of large parameter spaces on optimization behavior, the establishment of convergence properties, and the identification of critical parameters for convergence to inform personalized federated strategies. Our extensive empirical evidence supports the theory, showing that FedMeZO not only converges faster than traditional first-order methods such as FedAvg but also significantly reduces GPU memory usage during training to levels comparable to those during inference. Moreover, the proposed personalized FL strategy that is built upon the theoretical insights to customize the client-wise learning rate can effectively accelerate loss reduction. We hope our work can help to bridge theoretical and practical aspects of federated fine-tuning for LLMs, thereby stimulating further advancements and research in this area.
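
The memory-efficient zeroth-order step that FedMeZO builds on (MeZO-style: two forward passes, a scalar loss difference, and a perturbation regenerated from a random seed instead of being stored) is compact enough to sketch for a single client; the federated aggregation and the personalized learning-rate rule are not reproduced.

```python
# Minimal MeZO-style zeroth-order step, the per-client building block behind
# FedMeZO. Two forward passes, no backward pass; the perturbation z is
# regenerated from a seed instead of being stored, which is the memory trick.
import torch

@torch.no_grad()
def zo_step(model, loss_fn, batch, lr=1e-3, eps=1e-3, seed=0):
    def perturb(scale):
        torch.manual_seed(seed)              # same z every time
        for p in model.parameters():
            p.add_(scale * eps * torch.randn_like(p))

    perturb(+1)                              # theta + eps * z
    loss_plus = loss_fn(model, batch).item()
    perturb(-2)                              # theta - eps * z
    loss_minus = loss_fn(model, batch).item()
    perturb(+1)                              # restore theta
    g = (loss_plus - loss_minus) / (2 * eps) # projected gradient estimate

    torch.manual_seed(seed)                  # replay z to apply the update
    for p in model.parameters():
        p.add_(-lr * g * torch.randn_like(p))
    return 0.5 * (loss_plus + loss_minus)

model = torch.nn.Linear(10, 1)
loss_fn = lambda m, b: torch.nn.functional.mse_loss(m(b[0]), b[1])
batch = (torch.randn(64, 10), torch.randn(64, 1))
for step in range(3):
    print("approx loss:", zo_step(model, loss_fn, batch, seed=step))
```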

Updated: 2024-06-17 16:00:59

标题: 关于大型语言模型零阶联邦调整的收敛性

摘要: 联邦学习(FL)和大型语言模型(LLMs)的融合正在引领隐私保护自然语言处理的新时代。然而,对LLMs进行微调所需的大量内存需求给部署在计算资源有限的客户端上带来了重大挑战。为了解决这个问题,我们探索了在联邦设置中将内存高效的零阶优化与LLMs整合的新方法,我们将这种协同效应称为FedMeZO。我们的研究是第一个在LLMs的背景下研究FedMeZO理论基础的研究,解决了关于大参数空间对优化行为的影响、收敛性质的建立以及确定收敛的关键参数等关键问题。我们的大量经验证据支持这一理论,表明FedMeZO不仅比传统的FedAvg等一阶方法收敛更快,而且在训练过程中显著降低了GPU内存使用量,使其与推理过程中的水平相当。此外,基于理论见解构建的个性化FL策略可以有效加速损失减少。我们希望我们的工作可以帮助搭建LLMs的联邦微调的理论和实践方面的桥梁,从而刺激这一领域的进一步发展和研究。

更新时间: 2024-06-17 16:00:59

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2402.05926v3

Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack

Large language models (LLMs) have been increasingly applied to various domains, which triggers increasing concerns about LLMs' safety on specialized domains, e.g. medicine. However, testing the domain-specific safety of LLMs is challenging due to the lack of domain knowledge-driven attacks in existing benchmarks. To bridge this gap, we propose a new task, knowledge-to-jailbreak, which aims to generate jailbreaks from domain knowledge to evaluate the safety of LLMs when applied to those domains. We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as jailbreak-generator, to produce domain knowledge-specific jailbreaks. Experiments on 13 domains and 8 target LLMs demonstrate the effectiveness of jailbreak-generator in generating jailbreaks that are both relevant to the given knowledge and harmful to the target LLMs. We also apply our method to an out-of-domain knowledge base, showing that jailbreak-generator can generate jailbreaks that are comparable in harmfulness to those crafted by human experts. Data and code: https://github.com/THU-KEG/Knowledge-to-Jailbreak/.

Updated: 2024-06-17 15:59:59

标题: 知识到越狱:一个知识点抵得上一次攻击

摘要: 大型语言模型(LLMs)越来越多地应用于各个领域,这引发了人们对LLMs在专业领域(如医学)安全性的增加关注。然而,由于现有基准测试中缺乏基于领域知识的攻击,测试LLMs的领域特定安全性具有挑战性。为了填补这一空白,我们提出了一个新任务,即知识破解,旨在从领域知识中生成破解,以评估LLMs在应用于这些领域时的安全性。我们收集了一个包含12,974个知识-破解对的大规模数据集,并对一个大型语言模型进行微调作为破解生成器,以生成特定于领域知识的破解。在13个领域和8个目标LLMs上的实验表明,破解生成器在生成与给定知识相关且对目标LLMs有害的破解方面是有效的。我们还将我们的方法应用于一个超出领域的知识库,结果表明破解生成器可以生成与人类专家制作的同样有害的破解。数据和代码:https://github.com/THU-KEG/Knowledge-to-Jailbreak/.

更新时间: 2024-06-17 15:59:59

领域: cs.CL,cs.AI,cs.CR

下载: http://arxiv.org/abs/2406.11682v1

R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models

Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems. Recently, various Retrieval-Augmented Large Language Models (RALLMs) have been proposed to address this shortcoming. However, existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge. In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs. Our toolkit, which supports popular built-in RAG workflows and allows for the incorporation of customized testing data on the specific domain, is designed to be user-friendly, modular, and extensible. We conduct an evaluation of 21 RALLMs across three task levels and two representative domains, revealing significant variations in the effectiveness of RALLMs across different tasks and domains. Our analysis emphasizes the importance of considering both task and domain requirements when choosing a RAG workflow and LLM combination. We are committed to continuously maintaining our platform at https://github.com/THU-KEG/R-Eval to support both industry and researchers.

Updated: 2024-06-17 15:59:49

标题: R-Eval:用于评估检索增强大型语言模型领域知识的统一工具包

摘要: 大型语言模型在通用自然语言处理任务上取得了显著成功,但在领域特定问题上可能表现不佳。最近,提出了各种检索增强的大型语言模型(RALLMs)来解决这一缺点。然而,现有的评估工具只提供了一些基线,并在各个领域上对其进行评估,而没有挖掘领域知识的深度。在本文中,我们通过引入R-Eval工具包来解决评估RALLMs的挑战,这是一个设计用于简化与LLMs结合评估不同RAG工作流程的Python工具包。我们的工具包支持流行的内置RAG工作流程,并允许将定制测试数据纳入特定领域,旨在用户友好、模块化和可扩展。我们对21个RALLMs进行了三个任务级别和两个代表性领域的评估,揭示了在不同任务和领域中RALLMs有效性的显著变化。我们的分析强调了在选择RAG工作流程和LLM组合时考虑任务和领域需求的重要性。我们致力于持续维护我们的平台,以便为行业和研究人员提供便利。

更新时间: 2024-06-17 15:59:49

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11681v1

Score-fPINN: Fractional Score-Based Physics-Informed Neural Networks for High-Dimensional Fokker-Planck-Levy Equations

We introduce an innovative approach for solving high-dimensional Fokker-Planck-Lévy (FPL) equations in modeling non-Brownian processes across disciplines such as physics, finance, and ecology. We utilize a fractional score function and Physics-informed neural networks (PINN) to lift the curse of dimensionality (CoD) and alleviate numerical overflow from solutions that decay exponentially with dimension. The introduction of a fractional score function allows us to transform the FPL equation into a second-order partial differential equation without fractional Laplacian and thus can be readily solved with standard physics-informed neural networks (PINNs). We propose two methods to obtain a fractional score function: fractional score matching (FSM) and score-fPINN for fitting the fractional score function. While FSM is more cost-effective, it relies on known conditional distributions. On the other hand, score-fPINN is independent of specific stochastic differential equations (SDEs) but requires evaluating the PINN model's derivatives, which may be more costly. We conduct our experiments on various SDEs and demonstrate numerical stability and effectiveness of our method in dealing with high-dimensional problems, marking a significant advancement in addressing the CoD in FPL equations.

Updated: 2024-06-17 15:57:23

标题: Score-fPINN:用于高维Fokker-Planck-Levy方程的基于分数得分的物理信息神经网络

摘要: 我们介绍了一种创新方法,用于解决建模非布朗运动过程中的高维弗克-普朗克-莱维(FPL)方程,跨学科领域如物理、金融和生态学。我们利用分数得分函数和基于物理信息的神经网络(PINN)来解决维度诅咒(CoD)和减轻指数下降解决方案维度中的数值溢出。引入分数得分函数允许我们将FPL方程转化为不带分数拉普拉斯算子的二阶偏微分方程,因此可以使用标准的基于物理信息的神经网络(PINNs)轻松解决。我们提出了两种获取分数得分函数的方法:分数得分匹配(FSM)和得分-fPINN用于拟合分数得分函数。虽然FSM更具成本效益,但它依赖于已知的条件分布。另一方面,得分-fPINN独立于特定的随机微分方程(SDEs),但需要评估PINN模型的导数,这可能更昂贵。我们在各种SDE上进行实验,并展示了我们的方法在处理高维问题中的数值稳定性和有效性,标志着在解决FPL方程中的CoD方面取得了重大进展。

更新时间: 2024-06-17 15:57:23

领域: cs.LG,cs.NA,math.DS,math.NA,stat.ML,F.2.2; I.2.7

下载: http://arxiv.org/abs/2406.11676v1

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
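
BLoB's exact parameterization is not spelled out in this digest; the generic ingredient, a low-rank adapter whose factor carries a learned mean and variance trained jointly via the reparameterization trick, can be sketched as follows (an illustration under that assumption, not the authors' implementation).

```python
# Hedged sketch of a Bayesian low-rank adapter: the low-rank factor A has a
# learned mean and log-std, sampled via the reparameterization trick on each
# forward pass so mean and covariance train jointly by backprop. This is an
# illustration of the ingredient, not BLoB's exact parameterization.
import torch
import torch.nn as nn

class BayesianLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)      # frozen pretrained layer
        out_f, in_f = base.weight.shape
        self.A_mu = nn.Parameter(torch.zeros(rank, in_f))
        self.A_logstd = nn.Parameter(torch.full((rank, in_f), -5.0))
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        nn.init.normal_(self.A_mu, std=0.02)

    def forward(self, x):
        eps = torch.randn_like(self.A_mu)
        A = self.A_mu + torch.exp(self.A_logstd) * eps  # reparameterized sample
        return self.base(x) + x @ A.T @ self.B.T

layer = BayesianLoRALinear(nn.Linear(64, 64))
x = torch.randn(4, 64)
samples = torch.stack([layer(x) for _ in range(8)])     # predictive samples
print("mean predictive std:", samples.std(dim=0).mean().item())
```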

Updated: 2024-06-17 15:55:38

标题: BLoB: 用反向传播进行贝叶斯低秩自适应的大型语言模型

摘要: 大型语言模型(LLMs)在推理过程中经常表现出过度自信,特别是在适应下游领域特定任务时,数据有限。先前的工作通过在训练后采用近似贝叶斯估计来解决这个问题,使它们能够量化不确定性。然而,这种训练后方法的性能受到训练期间学习的参数严重限制。在本文中,我们超越了训练后的贝叶斯化,并提出了通过反向传播进行贝叶斯低秩适应(BLoB)的算法,该算法在整个微调过程中持续而联合地调整LLM参数的均值和协方差。我们的实证结果验证了BLoB在泛化和不确定性估计方面的有效性,当在分布内和分布外数据上进行评估时。

更新时间: 2024-06-17 15:55:38

领域: cs.LG,cs.AI,cs.CL,stat.ML

下载: http://arxiv.org/abs/2406.11675v1

Benchmarking of LLM Detection: Comparing Two Competing Approaches

This article gives an overview of the field of LLM text recognition. Different approaches and implemented detectors for the recognition of LLM-generated text are presented. In addition to discussing the implementations, the article focuses on benchmarking the detectors. Although there are numerous software products for the recognition of LLM-generated text, with a focus on ChatGPT-like LLMs, the quality of the recognition (the recognition rate) is not clear. Furthermore, while scientific contributions presenting novel approaches strive for some kind of comparison with other approaches, the construction and independence of their evaluation datasets is often not comprehensible. As a result, discrepancies in the performance evaluation of LLM detectors are often visible due to the different benchmarking datasets. This article describes the creation of an evaluation dataset and uses this dataset to investigate the different detectors. The selected detectors are benchmarked against each other.

Updated: 2024-06-17 15:51:46

标题: LLM检测的基准测试:比较两种竞争性方法

摘要: 本文概述了LLM文本识别领域的研究。文章介绍了用于识别LLM生成文本的不同方法和已实现的检测器。除了讨论实现方式外,文章还关注检测器的基准测试。尽管存在许多软件产品用于识别LLM生成的文本,重点放在类似ChatGPT的LLM上,但识别质量(识别率)并不清楚。此外,虽然可以看到提出其新方法的科学贡献努力进行某种与其他方法的比较,但评估数据集的构建和独立性通常不可理解。因此,由于不同的基准测试数据集,LLM检测器性能评估中经常会出现差异。本文描述了评估数据集的创建,并使用该数据集调查不同检测器。所选检测器相互进行基准测试。

更新时间: 2024-06-17 15:51:46

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11670v1

Is Efficient PAC Learning Possible with an Oracle That Responds 'Yes' or 'No'?

The empirical risk minimization (ERM) principle has been highly impactful in machine learning, leading both to near-optimal theoretical guarantees for ERM-based learning algorithms as well as driving many of the recent empirical successes in deep learning. In this paper, we investigate the question of whether the ability to perform ERM, which computes a hypothesis minimizing empirical risk on a given dataset, is necessary for efficient learning: in particular, is there a weaker oracle than ERM which can nevertheless enable learnability? We answer this question affirmatively, showing that in the realizable setting of PAC learning for binary classification, a concept class can be learned using an oracle which only returns a single bit indicating whether a given dataset is realizable by some concept in the class. The sample complexity and oracle complexity of our algorithm depend polynomially on the VC dimension of the hypothesis class, thus showing that there is only a polynomial price to pay for use of our weaker oracle. Our results extend to the agnostic learning setting with a slight strengthening of the oracle, as well as to the partial concept, multiclass and real-valued learning settings. In the setting of partial concept classes, prior to our work no oracle-efficient algorithms were known, even with a standard ERM oracle. Thus, our results address a question of Alon et al. (2021) who asked whether there are algorithmic principles which enable efficient learnability in this setting.
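
To make the oracle model concrete, here is a hypothetical illustration (not the paper's actual algorithm): with a single realizability bit per query, one can infer a label for a new point that is consistent with some concept in the class by extending the labelled sample with each candidate label.

    def label_point(realizable, sample, x):
        # `realizable(S) -> bool` is the yes/no oracle: is the labelled
        # dataset S realizable by some concept in the class?
        # (Hypothetical interface for illustration.)
        if realizable(sample + [(x, 0)]):
            return 0
        if realizable(sample + [(x, 1)]):
            return 1
        raise ValueError("the labelled sample itself is not realizable")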

Updated: 2024-06-17 15:50:08

标题: 使用一个回答“是”或“否”的预言机进行高效的PAC学习是否可能?

摘要: 经验风险最小化(ERM)原则在机器学习中具有很高的影响力,既为基于ERM的学习算法提供了接近最优的理论保证,又推动了深度学习中许多最近的经验成功。在本文中,我们研究了一个问题:执行ERM(即计算在给定数据集上最小化经验风险的假设)的能力对于高效学习是否必要:特别是,是否存在比ERM更弱、但仍能实现可学习性的预言机?我们肯定地回答了这个问题,证明在PAC学习二分类的可实现设置中,一个概念类可以使用一个只返回单个比特的预言机来学习,该比特指示给定数据集是否可以由类中的某个概念实现。我们算法的样本复杂度和预言机复杂度多项式地依赖于假设类的VC维度,从而表明使用我们更弱的预言机只需付出多项式的代价。我们的结果在对预言机稍作加强后可以扩展到不可知学习设置,以及不完全概念、多类别和实值学习设置。在不完全概念类的设置中,在我们的工作之前,即使使用标准的ERM预言机,也没有已知的预言机高效算法。因此,我们的结果回答了Alon等人(2021)提出的一个问题,即是否存在能在这种设置中实现高效可学习性的算法原则。

更新时间: 2024-06-17 15:50:08

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2406.11667v1

An Interactive Agent Foundation Model

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.

Updated: 2024-06-17 15:50:02

标题: 一个交互式智能体基础模型

摘要: 人工智能系统的发展正从创建静态、特定任务模型转变为动态、基于代理的系统,能够在各种应用中表现良好。我们提出了一种交互式代理基础模型,采用一种新颖的多任务代理训练范式,用于在各种领域、数据集和任务中训练人工智能代理。我们的训练范式统一了多样的预训练策略,包括视觉遮罩自编码器、语言建模和下一步动作预测,实现了一个多才多艺、适应性强的人工智能框架。我们展示了我们框架在三个独立领域--机器人、游戏人工智能和医疗保健中的表现。我们的模型展示了其在每个领域生成有意义和与上下文相关的输出的能力。我们方法的优势在于其通用性,利用各种数据源,如机器人序列、游戏数据、大规模视频数据集和文本信息,实现有效的多模态和多任务学习。我们的方法为开发通用、采取行动的、多模态系统提供了一个有前途的途径。

更新时间: 2024-06-17 15:50:02

领域: cs.AI,cs.LG,cs.RO

下载: http://arxiv.org/abs/2402.05929v2

ROTI-GCV: Generalized Cross-Validation for right-ROTationally Invariant Data

Two key tasks in high-dimensional regularized regression are tuning the regularization strength for good predictions and estimating the out-of-sample risk. It is known that the standard approach -- $k$-fold cross-validation -- is inconsistent in modern high-dimensional settings. While leave-one-out and generalized cross-validation remain consistent in some high-dimensional cases, they become inconsistent when samples are dependent or contain heavy-tailed covariates. To model structured sample dependence and heavy tails, we use right-rotationally invariant covariate distributions - a crucial concept from compressed sensing. In the common modern proportional asymptotics regime where the number of features and samples grow comparably, we introduce a new framework, ROTI-GCV, for reliably performing cross-validation. Along the way, we propose new estimators for the signal-to-noise ratio and noise variance under these challenging conditions. We conduct extensive experiments that demonstrate the power of our approach and its superiority over existing methods.
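
For context, the classical GCV criterion that this line of work generalizes is, for a linear smoother $S_\lambda = X(X^\top X + \lambda I)^{-1} X^\top$,

    \mathrm{GCV}(\lambda)
      \;=\;
      \frac{\tfrac{1}{n}\,\lVert y - S_\lambda y \rVert_2^2}
           {\bigl(1 - \tfrac{1}{n}\operatorname{tr} S_\lambda\bigr)^2}.

ROTI-GCV modifies the ingredients of this ratio so that consistency is retained under right-rotationally invariant designs; the display above is only the classical starting point.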

Updated: 2024-06-17 15:50:00

标题: ROTI-GCV:适用于右旋转不变数据的广义交叉验证

摘要: 高维正则化回归中的两个关键任务是调整正则化强度以获得良好的预测和估计样本外风险。众所周知,在现代高维情境下,标准方法——k折交叉验证是不一致的。虽然留一法和广义交叉验证在一些高维情况下保持一致性,但当样本相关或包含重尾协变量时,它们变得不一致。为了模拟结构化样本相关性和重尾性,我们使用右旋转不变协变量分布——这是压缩感知中的一个关键概念。在特征数和样本数增长相当的常见现代比例渐近情况下,我们引入了一个新的框架ROTI-GCV,可可靠地执行交叉验证。在此过程中,我们针对这些具有挑战性条件提出了信噪比和噪声方差的新估计器。我们进行了大量实验证明了我们方法的效力以及其优于现有方法的优越性。

更新时间: 2024-06-17 15:50:00

领域: math.ST,cs.LG,stat.ML,stat.TH

下载: http://arxiv.org/abs/2406.11666v1

See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.

Updated: 2024-06-17 15:49:51

标题: 从我的角度看:诊断大型视觉语言模型在图像理解中存在的西方文化偏见

摘要: 视觉-语言模型(VLMs)可以用多种语言回答关于图像的查询。然而,除了语言外,文化也会影响我们看待事物的方式。例如,来自西方文化的个体更加关注图像中的中心人物,而来自东方文化的个体更加关注场景背景。在这项工作中,我们提出了一项新颖的调查,展示并定位了VLMs在图像理解中的西方偏见。我们通过文化多样性的图像和注释对大型VLMs进行了主观和客观视觉任务的评估。我们发现,在每项任务的西方子集上,VLMs的表现优于东方子集。通过对这种偏见来源进行控制实验,突显了在仅文本预训练中使用多样化语言组合对构建公平的VLMs的重要性,即使推理是用英语进行的。此外,虽然在目标文化的语言中提供提示可能会减少偏见,但这并不能替代构建更具代表性的人工智能,涵盖世界各种语言。

更新时间: 2024-06-17 15:49:51

领域: cs.CL,cs.AI,cs.CV

下载: http://arxiv.org/abs/2406.11665v1

Diffusion Generative Modelling for Divide-and-Conquer MCMC

Divide-and-conquer MCMC is a strategy for parallelising Markov Chain Monte Carlo sampling by running independent samplers on disjoint subsets of a dataset and merging their output. An ongoing challenge in the literature is to efficiently perform this merging without imposing distributional assumptions on the posteriors. We propose using diffusion generative modelling to fit density approximations to the subposterior distributions. This approach outperforms existing methods on challenging merging problems, while its computational cost scales more efficiently to high dimensional problems than existing density estimation approaches.
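
A common merge target is the product of the subposterior densities, whose log-density is the sum of the shard log-densities, so learned score functions simply add. A minimal sketch under that assumption (the paper's sampler built on fitted diffusion models is more involved):

    import numpy as np

    def merged_score(theta, sub_scores):
        # Product of subposteriors: log p(theta) = sum_i log p_i(theta) + const,
        # so the scores (gradients of the log density) add. Each element of
        # `sub_scores` is a per-shard score estimate, e.g. from a fitted model.
        return sum(s(theta) for s in sub_scores)

    def langevin_merge(sub_scores, theta0, step=1e-3, n_iter=5000, seed=0):
        # Unadjusted Langevin dynamics on the merged target (illustrative).
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        for _ in range(n_iter):
            noise = rng.standard_normal(theta.shape)
            theta = theta + step * merged_score(theta, sub_scores) \
                    + np.sqrt(2 * step) * noise
        return theta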

Updated: 2024-06-17 15:48:46

标题: 分而治之MCMC的扩散生成建模

摘要: 分而治之MCMC是一种用于并行化马尔可夫链蒙特卡罗采样的策略,通过在数据集的不相交子集上运行独立的采样器并合并它们的输出。文献中一个持续的挑战是在不对后验分布施加分布假设的情况下有效地执行这种合并。我们建议使用扩散生成建模来拟合子后验分布的密度近似。这种方法在具有挑战性的合并问题上表现优于现有方法,同时其计算成本在高维问题上的扩展性比现有的密度估计方法更高效。

更新时间: 2024-06-17 15:48:46

领域: stat.ML,cs.LG,stat.CO

下载: http://arxiv.org/abs/2406.11664v1

Evaluating Task-based Effectiveness of MLLMs on Charts

In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely-used low-level data analysis tasks on 7 chart types. First, we conduct systematic evaluations to understand the capabilities and limitations of 18 advanced MLLMs, including 12 open-source models and 6 closed-source models. With a standard textual prompt approach, the average accuracy rate across the 18 MLLMs is 36.17%. Among all the models, GPT-4V achieves the highest accuracy, reaching 56.13%. To understand the limitations of multimodal large models in low-level data analysis tasks, we design various experiments to conduct an in-depth test of the capabilities of GPT-4V. We further investigate how visual modifications to charts, such as altering visual elements (e.g., changing color schemes) and introducing perturbations (e.g., adding image noise), affect its performance. Second, we present 12 experimental findings. These findings suggest the potential of GPT-4V to revolutionize interaction with charts and uncover the gap between human analytic needs and the capabilities of GPT-4V. Third, we propose a novel textual prompt strategy, named Chain-of-Charts, tailored for low-level analysis tasks, which boosts model performance by 24.36%, resulting in an accuracy of 80.49%. Furthermore, by incorporating a visual prompt strategy that directs the attention of GPT-4V to question-relevant visual elements, we further improve accuracy to 83.83%. Our study not only sheds light on the capabilities and limitations of GPT-4V in low-level data analysis tasks but also offers valuable insights for future research.

Updated: 2024-06-17 15:44:33

标题: 评估基于任务的MLLMs在图表上的有效性

摘要: 在这篇论文中,我们探讨了一个前瞻性问题:GPT-4V在图表的低级数据分析任务中是否有效?为此,我们首先策划了一个大规模数据集,名为ChartInsights,包含89,388个四元组(图表、任务、问题、答案),涵盖了7种图表类型上的10个广泛使用的低级数据分析任务。首先,我们对18个先进的MLLM进行系统评估,以了解它们的能力和局限性,其中包括12个开源模型和6个闭源模型。从标准的文本提示方法开始,18个MLLM的平均准确率为36.17%。在所有模型中,GPT-4V实现了最高的准确率,达到了56.13%。为了了解多模态大型模型在低级数据分析任务中的局限性,我们设计了各种实验,对GPT-4V的能力进行了深入测试。我们进一步研究了对图表进行视觉修改,例如修改视觉元素(例如更改颜色方案)和引入扰动(例如添加图像噪声),如何影响GPT-4V的性能。其次,我们提出了12个实验发现。这些发现表明GPT-4V有革新与图表互动的潜力,并揭示了人类分析需求与GPT-4V能力之间的差距。第三,我们提出了一种新颖的文本提示策略,名为Chain-of-Charts,专为低级分析任务量身定制,通过提高模型性能24.36%,使准确率达到80.49%。此外,通过引入一种视觉提示策略,将GPT-4V的注意力引向与问题相关的视觉元素,我们进一步将准确率提高到83.83%。我们的研究不仅揭示了GPT-4V在低级数据分析任务中的能力和局限性,还为未来研究提供了宝贵的见解。

更新时间: 2024-06-17 15:44:33

领域: cs.CL,cs.AI,cs.CV

下载: http://arxiv.org/abs/2405.07001v2

A Systematic Construction Approach for All $4\times 4$ Involutory MDS Matrices

Maximum distance separable (MDS) matrices play a crucial role not only in coding theory but also in the design of block ciphers and hash functions. Of particular interest are involutory MDS matrices, which facilitate the use of a single circuit for both encryption and decryption in hardware implementations. In this article, we present several characterizations of involutory MDS matrices of even order. Additionally, we introduce a new matrix form for obtaining all involutory MDS matrices of even order and compare it with other matrix forms available in the literature. We then propose a technique to systematically construct all $4 \times 4$ involutory MDS matrices over a finite field $\mathbb{F}_{2^m}$. This method significantly reduces the search space by focusing on involutory MDS class representative matrices, leading to the generation of all such matrices within a substantially smaller set compared to considering all $4 \times 4$ involutory matrices. Specifically, our approach involves searching for these representative matrices within a set of cardinality $(2^m-1)^5$. Through this method, we provide an explicit enumeration of the total number of $4 \times 4$ involutory MDS matrices over $\mathbb{F}_{2^m}$ for $m=3,4,\ldots,8$.
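
For readers outside coding theory, the two properties used above are standard: a matrix is involutory when it is its own inverse, and MDS when every square submatrix is nonsingular,

    A^2 = I,
    \qquad
    A \text{ is MDS} \iff \det M \neq 0 \text{ for every square submatrix } M \text{ of } A.

An involutory MDS matrix therefore lets the same circuit serve both encryption and decryption while providing the maximal diffusion that MDS matrices guarantee.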

Updated: 2024-06-17 15:41:08

标题: 一种系统构造所有 $4\times 4$ 对合MDS矩阵的方法

摘要: 最大距离可分(MDS)矩阵不仅在编码理论中起着至关重要的作用,而且在块密码和哈希函数的设计中也起着关键作用。特别感兴趣的是对合MDS矩阵,它促进了在硬件实现中同时用于加密和解密的单一电路的使用。在本文中,我们提出了偶数阶对合MDS矩阵的几种特征描述。此外,我们引入了一种新的矩阵形式,用于获得所有偶数阶对合MDS矩阵,并将其与文献中其他矩阵形式进行比较。然后,我们提出了一种系统构造有限域$\mathbb{F}_{2^m}$上所有$4 \times 4$对合MDS矩阵的技术。该方法通过聚焦于对合MDS类代表矩阵,显著减少了搜索空间,从而在相比考虑所有$4 \times 4$对合矩阵的情况下在一个明显更小的集合中生成所有这样的矩阵。具体而言,我们的方法涉及在一个基数为$(2^m-1)^5$的集合中搜索这些代表性矩阵。通过这种方法,我们给出了有限域$\mathbb{F}_{2^m}$上$m=3,4,\ldots,8$时所有$4 \times 4$对合MDS矩阵的明确枚举。

更新时间: 2024-06-17 15:41:08

领域: cs.CR

下载: http://arxiv.org/abs/2404.08250v2

FedML-HE: An Efficient Homomorphic-Encryption-Based Privacy-Preserving Federated Learning System

Federated Learning (FL) trains machine learning models on distributed devices by aggregating local model updates instead of local data. However, privacy concerns arise because the aggregated local models on the server may reveal sensitive personal information through inversion attacks. Privacy-preserving methods, such as homomorphic encryption (HE), then become necessary for FL training. Despite HE's privacy advantages, its applications suffer from impractical overheads, especially for foundation models. In this paper, we present FedML-HE, the first practical federated learning system with efficient HE-based secure model aggregation. FedML-HE proposes to selectively encrypt sensitive parameters, significantly reducing both computation and communication overheads during training while providing customizable privacy preservation. Our optimized system demonstrates considerable overhead reduction, particularly for large foundation models (e.g., ~10x reduction for ResNet-50, and up to ~40x reduction for BERT), demonstrating the potential for scalable HE-based FL deployment.
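
A minimal sketch of selective encryption (illustrative; FedML-HE's sensitivity scoring and HE backend are not reproduced here, and `encrypt` stands in for a real HE call such as CKKS vector encryption):

    import numpy as np

    def selective_encrypt(update, sensitivity, ratio, encrypt):
        # Encrypt only the `ratio` fraction of parameters with the highest
        # privacy-sensitivity scores; the rest travel in plaintext.
        flat = update.ravel()
        k = max(1, int(ratio * flat.size))
        idx = np.argsort(sensitivity.ravel())[-k:]   # most sensitive entries
        mask = np.zeros(flat.size, dtype=bool)
        mask[idx] = True
        return {"cipher": encrypt(flat[mask]), "plain": flat[~mask], "mask": mask}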

Updated: 2024-06-17 15:39:21

标题: FedML-HE: 一种基于同态加密的高效隐私保护联邦学习系统

摘要: 联邦学习通过聚合本地模型更新而不是本地数据在分布式设备上训练机器学习模型。然而,由于服务器上聚合的本地模型可能通过反演攻击透露敏感个人信息,隐私问题变得尖锐。因此,隐私保护方法,如同态加密(HE),对于联邦学习训练变得必不可少。尽管HE具有隐私优势,但其应用受到了不切实际的开销的影响,尤其是对于基础模型。本文提出了FedML-HE,这是第一个具有高效HE安全模型聚合的实用联邦学习系统。FedML-HE建议选择性加密敏感参数,显著减少训练过程中的计算和通信开销,同时提供可定制的隐私保护。我们优化的系统展示了相当大的开销减少,特别是对于大型基础模型(例如,对于ResNet-50的减少约为10倍,对于BERT的减少高达40倍),展示了基于HE的可扩展联邦学习部署的潜力。

更新时间: 2024-06-17 15:39:21

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2303.10837v3

Reference-based Metrics Disprove Themselves in Question Generation

Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collected another reference. A good metric is expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntax or semantics of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.

Updated: 2024-06-17 15:33:37

标题: 基于参考文献的度量指标在问题生成中自相矛盾

摘要: 基于参考的度量标准,如BLEU和BERTScore,被广泛用于评估问题生成(QG)。在这项研究中,在SQuAD和HotpotQA等QG基准测试中,我们发现使用人工撰写的参考文献不能保证基于参考的度量标准的有效性。大多数QG基准测试只有一个参考文献;我们复制了注释过程并收集了另一个参考文献。一个好的度量标准应当给经过人工验证的问题打出不低于生成问题的分数。然而,基于参考的度量标准在我们新收集的参考文献上的结果证明了这些度量标准本身是错误的。我们提出了一种无参考的度量标准,由自然度、可回答性和复杂性等多维标准组成,并利用大型语言模型。这些标准不受限于单个参考问题的句法或语义,且该度量标准不需要多样的参考文献。实验表明,我们的度量标准能够准确区分高质量问题和有缺陷的问题,并与人类判断实现了最先进的对齐。

更新时间: 2024-06-17 15:33:37

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2403.12242v2

Multimodal Learning To Improve Segmentation With Intraoperative CBCT & Preoperative CT

Intraoperative medical imaging, particularly Cone-beam computed tomography (CBCT), is an important tool facilitating computer aided interventions, despite a lower visual quality. While this degraded image quality can affect downstream segmentation, the availability of high quality preoperative scans represents potential for improvements. Here we consider a setting where preoperative CT and intraoperative CBCT scans are available; however, the alignment (registration) between the scans is imperfect. We propose a multimodal learning method that fuses roughly aligned CBCT and CT scans and investigate the effect of CBCT quality and misalignment (simulated via affine and elastic transformations) on the final segmentation performance. As an application scenario, we focus on semantic segmentation of the liver and liver tumors and evaluate the effect of intraoperative image quality and misalignment on segmentation performance. To accomplish this, high quality, labelled CTs are defined as preoperative and used as a basis to simulate intraoperative CBCT. We show that the fusion of preoperative CT and simulated, intraoperative CBCT mostly improves segmentation performance and that even clearly misaligned preoperative data has the potential to improve segmentation performance.
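
A minimal sketch of the early-fusion design described above (illustrative; the paper's actual segmentation backbone is not reproduced): the roughly registered CT and CBCT volumes are simply stacked as input channels of one network.

    import torch

    class EarlyFusionSeg(torch.nn.Module):
        # Toy two-channel 3D segmentation head; a real model would use a
        # U-Net-style encoder-decoder.
        def __init__(self, n_classes: int = 2):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Conv3d(2, 16, kernel_size=3, padding=1),
                torch.nn.ReLU(),
                torch.nn.Conv3d(16, n_classes, kernel_size=1),
            )

        def forward(self, ct, cbct):  # each of shape (B, 1, D, H, W)
            return self.net(torch.cat([ct, cbct], dim=1))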

Updated: 2024-06-17 15:31:54

标题: 多模式学习以改善术中CBCT和术前CT的分割

摘要: 手术中医学成像,尤其是锥束计算机断层扫描(CBCT),是一种重要的工具,促进计算机辅助干预,尽管其视觉质量较低。尽管这种降低的图像质量可能会影响下游分割,但高质量的术前扫描的可用性代表了改进的潜力。在这里,我们考虑了一个场景,即术前CT和术中CBCT扫描是可用的,然而,扫描之间的对齐(配准)是不完美的。我们提出了一种多模态学习方法,将粗略对齐的CBCT和CT扫描融合在一起,并研究CBCT质量和错位(促进错位的仿射和弹性变换)对最终分割性能的影响。作为一个应用场景,我们专注于肝脏和肝脏肿瘤语义分割的分割,并评估术中图像质量和错位对分割性能的影响。为了实现这一目标,高质量的标记CT被定义为术前,并用作模拟术中CBCT的基础。我们展示了术前CT和模拟的术中CBCT的融合大多改善了分割性能,即使是明显错位的术前数据也具有改善分割性能的潜力。

更新时间: 2024-06-17 15:31:54

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.11650v1

Making Old Things New: A Unified Algorithm for Differentially Private Clustering

As a staple of data analysis and unsupervised learning, the problem of private clustering has been widely studied under various privacy models. Centralized differential privacy is the first of them, and the problem has also been studied for the local and the shuffle variation. In each case, the goal is to design an algorithm that computes privately a clustering, with the smallest possible error. The study of each variation gave rise to new algorithms: the landscape of private clustering algorithms is therefore quite intricate. In this paper, we show that a 20-year-old algorithm can be slightly modified to work for any of these models. This provides a unified picture: while matching almost all previously known results, it allows us to improve some of them and extend it to a new privacy model, the continual observation setting, where the input is changing over time and the algorithm must output a new solution at each time step.

Updated: 2024-06-17 15:31:53

标题: 使旧事物焕发新生:一种用于差分隐私聚类的统一算法

摘要: 作为数据分析和无监督学习的基本内容,私有聚类问题在各种隐私模型下得到了广泛研究。其中,集中式差分隐私是其中第一个,该问题也被研究了局部和洗牌变体。在每种情况下,目标是设计一个算法,以最小可能的错误私下计算一种聚类。对每种变体的研究产生了新的算法:因此,私有聚类算法的格局相当复杂。 在本文中,我们展示了一个20年前的算法可以稍作修改以适用于这些模型中的任何一个。这提供了一个统一的画面:虽然几乎匹配了所有先前已知的结果,但它允许我们改进其中一些并将其扩展到一个新的隐私模型,即持续观察设置,在这种设置下,输入随时间变化,算法必须在每个时间步骤输出一个新的解决方案。

更新时间: 2024-06-17 15:31:53

领域: cs.DS,cs.CR,cs.LG

下载: http://arxiv.org/abs/2406.11649v1

YOLO-FEDER FusionNet: A Novel Deep Learning Architecture for Drone Detection

Predominant methods for image-based drone detection frequently rely on employing generic object detection algorithms like YOLOv5. While proficient in identifying drones against homogeneous backgrounds, these algorithms often struggle in complex, highly textured environments. In such scenarios, drones seamlessly integrate into the background, creating camouflage effects that adversely affect the detection quality. To address this issue, we introduce a novel deep learning architecture called YOLO-FEDER FusionNet. Unlike conventional approaches, YOLO-FEDER FusionNet combines generic object detection methods with the specialized strength of camouflage object detection techniques to enhance drone detection capabilities. Comprehensive evaluations of YOLO-FEDER FusionNet show the efficiency of the proposed model and demonstrate substantial improvements in both reducing missed detections and false alarms.

Updated: 2024-06-17 15:25:31

标题: YOLO-FEDER FusionNet: 一种新颖的用于无人机检测的深度学习架构

摘要: 基于图像的无人机检测的主要方法通常依赖于使用像YOLOv5这样的通用物体检测算法。虽然这些算法擅长在均质背景下识别无人机,但它们在复杂、高纹理环境中经常遇到困难。在这种情况下,无人机会无缝地融入背景中,产生伪装效果,对检测质量产生不利影响。为了解决这个问题,我们引入了一种新颖的深度学习架构,称为YOLO-FEDER FusionNet。与传统方法不同,YOLO-FEDER FusionNet将通用物体检测方法与专门的伪装物体检测技术的优势结合起来,以增强无人机检测能力。对YOLO-FEDER FusionNet的全面评估显示了所提出模型的效率,并展示了在减少漏检和误报方面的显著改进。

更新时间: 2024-06-17 15:25:31

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.11641v1

BirdSet: A Dataset and Benchmark for Classification in Avian Bioacoustics

Deep learning (DL) models have emerged as a powerful tool in avian bioacoustics to assess environmental health. To maximize the potential of cost-effective and minimal-invasive passive acoustic monitoring (PAM), DL models must analyze bird vocalizations across a wide range of species and environmental conditions. However, data fragmentation challenges a comprehensive evaluation of generalization performance. Therefore, we introduce the BirdSet dataset, comprising approximately 520,000 global bird recordings for training and over 400 hours of PAM recordings for testing. Our benchmark offers baselines for several DL models to enhance comparability and consolidate research across studies, along with code implementations that include comprehensive training and evaluation protocols.

Updated: 2024-06-17 15:25:11

标题: BirdSet:一个用于鸟类生物声学分类的数据集和基准

摘要: 深度学习(DL)模型已经成为鸟类生物声学中评估环境健康的强大工具。为了最大限度发挥成本效益和最小侵入性的被动声学监测(PAM)的潜力,DL模型必须分析各种物种和环境条件下的鸟类鸣叫声。然而,数据的碎片化挑战了对泛化性能的综合评估。因此,我们引入了BirdSet数据集,包括约520,000条全球鸟类录音用于训练,以及超过400小时的PAM录音用于测试。我们的基准提供了几种DL模型的基线,以增强可比性并在研究中巩固研究,同时提供包括全面的训练和评估协议的代码实现。

更新时间: 2024-06-17 15:25:11

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2403.10380v3

Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions

One of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which ensures that these regression problems are well-specified. We study the problem of learning an optimal policy under Bellman completeness in the online model of RL with linear function approximation. In the linear setting, while statistically efficient algorithms are known under Bellman completeness (e.g., Jiang et al. (2017); Zanette et al. (2020)), these algorithms all rely on the principle of global optimism which requires solving a nonconvex optimization problem. In particular, it has remained open as to whether computationally efficient algorithms exist. In this paper we give the first polynomial-time algorithm for RL under linear Bellman completeness when the number of actions is any constant.
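
As background, linear Bellman completeness is the standard condition that the Bellman optimality operator maps linear value functions back into the linear class: for every $\theta$ there exists some $\theta'$ such that

    \phi(s,a)^\top \theta'
      \;=\;
      r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
      \Bigl[\, \max_{a'} \phi(s',a')^\top \theta \Bigr]
      \qquad \text{for all } (s,a).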

Updated: 2024-06-17 15:24:49

标题: 线性贝尔曼完备性足以实现少动作的高效在线强化学习

摘要: 值迭代是一种最自然的强化学习(RL)函数逼近方法,通过解决一系列回归问题,归纳地生成对最优值函数的近似。为了确保值迭代的成功,通常假定贝尔曼完备性成立,这确保了这些回归问题是良好规定的。我们研究在线RL线性函数逼近模型下在贝尔曼完备性条件下学习最优策略的问题。在线性设置中,虽然在贝尔曼完备性条件下已知存在统计效率算法(例如Jiang等人(2017);Zanette等人(2020)),但这些算法都依赖于全局乐观原则,需要解决非凸优化问题。特别地,尚不明确是否存在计算效率算法。在本文中,我们提出了第一个多项式时间算法,用于在线RL线性贝尔曼完备性条件下行动数量为任意常数时。

更新时间: 2024-06-17 15:24:49

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2406.11640v1

Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models

Self-correction has achieved impressive results in enhancing the style and security of the generated output from large language models (LLMs). However, recent studies suggest that self-correction might be limited or even counterproductive in reasoning tasks due to LLMs' difficulties in identifying logical mistakes. In this paper, we aim to enhance the self-checking capabilities of LLMs by constructing training data for checking tasks. Specifically, we apply the Chain of Thought (CoT) methodology to self-checking tasks, utilizing fine-grained step-level analyses and explanations to assess the correctness of reasoning paths. We propose a specialized checking format called "Step CoT Check". Following this format, we construct a checking-correction dataset that includes detailed step-by-step analysis and checking. Then we fine-tune LLMs to enhance their error detection and correction abilities. Our experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs across multiple benchmarks. This approach outperforms other formats, especially in locating the incorrect position, with greater benefits observed in larger models. For reproducibility, all the datasets and code are provided in https://github.com/bammt/Learn-to-check.

Updated: 2024-06-17 15:24:29

标题: 学习检查:释放大型语言模型自我纠正的潜力

摘要: 自我纠正在增强大型语言模型(LLMs)生成输出的风格和安全性方面取得了令人印象深刻的成果。然而,最近的研究表明,由于LLMs在识别逻辑错误方面存在困难,自我纠正在推理任务中可能受到限制甚至是适得其反。 在本文中,我们旨在通过构建用于检查任务的训练数据来增强LLMs的自检能力。具体而言,我们将思维链(CoT)方法应用于自检任务,利用细粒度的步骤级分析和解释来评估推理路径的正确性。我们提出了一种专门的检查格式称为“Step CoT Check”。按照这种格式,我们构建了一个包括详细逐步分析和检查的检查-纠正数据集。然后,我们对LLMs进行微调,以增强它们的错误检测和纠正能力。 我们的实验表明,使用“Step CoT Check”格式进行微调显著提高了LLMs在多个基准测试中的自检和自我纠正能力。这种方法在定位错误位置方面优于其他格式,尤其是在较大模型中观察到更大的好处。 为了可复现性,所有数据集和代码都可以在https://github.com/bammt/Learn-to-check 找到。

更新时间: 2024-06-17 15:24:29

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2402.13035v3

Signatures From Pseudorandom States via $\bot$-PRFs

Different flavors of quantum pseudorandomness have proven useful for various cryptographic applications, with the compelling feature that these primitives are potentially weaker than post-quantum one-way functions. Ananth, Lin, and Yuen (2023) have shown that logarithmic pseudorandom states can be used to construct a pseudo-deterministic PRG: informally, for a fixed seed, the output is the same with $1-1/poly$ probability. In this work, we introduce new definitions for $\bot$-PRG and $\bot$-PRF. The correctness guarantees are that, for a fixed seed, except with negligible probability, the output is either the same (with probability $1-1/poly$) or recognizable abort, denoted $\bot$. Our approach admits a natural definition of multi-time PRG security, as well as the adaptive security of a PRF. We construct a $\bot$-PRG from any pseudo-deterministic PRG and, from that, a $\bot$-PRF. Even though most mini-crypt primitives, such as symmetric key encryption, commitments, MAC, and length-restricted one-time digital signatures, have been shown based on various quantum pseudorandomness assumptions, digital signatures remained elusive. Our main application is a (quantum) digital signature scheme with classical public keys and signatures, thereby addressing a previously unresolved question posed in Morimae and Yamakawa's work (Crypto, 2022). Additionally, we construct CPA secure public-key encryption with tamper-resilient quantum public keys.

Updated: 2024-06-17 15:23:54

标题: 通过$\bot$-PRFs从伪随机状态中提取签名

摘要: 不同类型的量子伪随机性已被证明对各种加密应用有用,这些原语的吸引力在于它们可能比后量子单向函数更弱。Ananth、Lin和Yuen(2023)表明对数伪随机状态可用于构建伪确定性PRG:简而言之,对于固定种子,输出与$1-1/poly$的概率相同。 在这项工作中,我们引入了$\bot$-PRG和$\bot$-PRF的新定义。正确性保证是,对于一个固定的种子,除了可以忽略的概率外,输出要么相同(概率为$1-1/poly$),要么可识别为中止,用$\bot$表示。我们的方法允许多次PRG安全的自然定义,以及PRF的自适应安全性。我们从任何伪确定性PRG构造了一个$\bot$-PRG,再从中得到了一个$\bot$-PRF。 尽管大多数迷你密码原语,如对称密钥加密、承诺、MAC和长度受限的一次性数字签名,都基于不同的量子伪随机性假设,数字签名仍然难以实现。我们的主要应用是一个(量子)数字签名方案,其中包含经典公钥和签名,从而解决了Morimae和Yamakawa的工作(Crypto,2022)中提出的一个尚未解决的问题。此外,我们构建了具有抗篡改量子公钥的CPA安全公钥加密。

更新时间: 2024-06-17 15:23:54

领域: cs.CR

下载: http://arxiv.org/abs/2311.00847v4

MASAI: Modular Architecture for Software-engineering AI Agents

A common method to solve complex problems in software engineering, is to divide the problem into multiple sub-problems. Inspired by this, we propose a Modular Architecture for Software-engineering AI (MASAI) agents, where different LLM-powered sub-agents are instantiated with well-defined objectives and strategies tuned to achieve those objectives. Our modular architecture offers several advantages: (1) employing and tuning different problem-solving strategies across sub-agents, (2) enabling sub-agents to gather information from different sources scattered throughout a repository, and (3) avoiding unnecessarily long trajectories which inflate costs and add extraneous context. MASAI enabled us to achieve the highest performance (28.33% resolution rate) on the popular and highly challenging SWE-bench Lite dataset consisting of 300 GitHub issues from 11 Python repositories. We conduct a comprehensive evaluation of MASAI relative to other agentic methods and analyze the effects of our design decisions and their contribution to the success of MASAI.

Updated: 2024-06-17 15:19:51

标题: MASAI:用于软件工程AI代理的模块化架构

摘要: 一个在软件工程中解决复杂问题的常见方法是将问题分解为多个子问题。受此启发,我们提出了一种用于软件工程AI(MASAI)代理的模块化架构,其中不同的LLM动力子代理具有明确定义的目标和策略,旨在实现这些目标。我们的模块化架构提供了几个优势:(1)在子代理之间使用和调整不同的问题解决策略,(2)使子代理能够从存储库中分散的不同来源收集信息,(3)避免不必要的长轨迹,从而增加成本并添加多余的上下文。MASAI使我们能够在流行且高难度的SWE-bench Lite数据集上取得最高性能(28.33%的解决率),该数据集由来自11个Python存储库的300个GitHub问题组成。我们对MASAI相对于其他代理方法进行了全面评估,并分析了我们的设计决策的影响以及它们对MASAI成功的贡献。

更新时间: 2024-06-17 15:19:51

领域: cs.AI,cs.SE

下载: http://arxiv.org/abs/2406.11638v1

A novel hybrid time-varying graph neural network for traffic flow forecasting

Real-time and precise traffic flow prediction is vital for the efficiency of intelligent transportation systems. Traditional methods often employ graph neural networks (GNNs) with predefined graphs to describe spatial correlations among traffic nodes in urban road networks. However, these pre-defined graphs are limited by existing knowledge and graph generation methodologies, offering an incomplete picture of spatial correlations. While time-varying graphs based on data-driven learning have attempted to address these limitations, they still struggle to adequately capture the inherent spatial correlations in traffic data. Moreover, most current methods for capturing dynamic temporal correlations rely on a unified calculation scheme using a temporal multi-head self-attention mechanism, which may lead to inaccuracies. To overcome these challenges, we propose a novel hybrid time-varying graph neural network (HTVGNN) for traffic flow prediction. First, we propose an enhanced temporal perception multi-head self-attention mechanism based on time-varying mask enhancement to more accurately model the dynamic temporal dependencies among distinct traffic nodes in the traffic network. Second, we propose a novel graph learning strategy to concurrently learn both static and dynamic spatial associations between different traffic nodes in road networks. Meanwhile, to enhance the learning ability of time-varying graphs, a coupled graph learning mechanism is designed to couple the graphs learned at each time step. Finally, the effectiveness of the proposed method HTVGNN is demonstrated on four real data sets. Simulation results reveal that HTVGNN achieves superior prediction accuracy compared to state-of-the-art spatio-temporal graph neural network models. Additionally, an ablation experiment verifies that the coupled graph learning mechanism can effectively improve the long-term prediction performance of HTVGNN.
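
The attention mechanism can be sketched generically as masked scaled dot-product attention over the time axis (the paper's time-varying mask construction is not reproduced; the mask here is a plain additive one):

    import torch

    def masked_temporal_attention(q, k, v, mask):
        # q, k, v: (batch, T, d); mask: (T, T) additive, with large negative
        # values at the positions the time-varying mask disallows.
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5 + mask
        return torch.softmax(scores, dim=-1) @ v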

Updated: 2024-06-17 15:17:56

标题: 一个新颖的混合时变图神经网络用于交通流预测

摘要: 实时和精确的交通流预测对智能交通系统的效率至关重要。传统方法通常使用具有预定义图形的图神经网络(GNN)来描述城市道路网络中交通节点之间的空间相关性。然而,这些预定义图受到现有知识和图生成方法的限制,提供了一种空间相关性的不完整图像。虽然基于数据驱动学习的时变图尝试解决这些限制,但仍然难以充分捕获交通数据中固有的空间相关性。此外,大多数当前用于捕获动态时间相关性的方法依赖于使用时间多头自注意力机制的统一计算方案,这在某种程度上可能导致不准确性。为了克服这些挑战,我们提出了一种新颖的混合时变图神经网络(HTVGNN)用于交通流预测。首先,报道了一种基于时变蒙版增强的新型增强时间感知多头自注意力机制,更准确地建模了交通网络中不同交通节点之间的动态时间依赖关系。其次,我们提出了一种新颖的图学习策略,同时学习道路网络中不同交通节点之间的静态和动态空间关联。同时,为了增强时变图的学习能力,设计了一种耦合图学习机制来耦合每个时间步骤学习的图。最后,通过四个真实数据集展示了所提出的方法HTVGNN的有效性。模拟结果显示,与最先进的时空图神经网络模型相比,HTVGNN实现了更高的预测准确性。此外,消融实验验证了耦合图学习机制可以有效提高HTVGNN的长期预测性能。

更新时间: 2024-06-17 15:17:56

领域: cs.LG

下载: http://arxiv.org/abs/2401.10155v4

Feasibility of Federated Learning from Client Databases with Different Brain Diseases and MRI Modalities

Segmentation models for brain lesions in MRI are commonly developed for a specific disease and trained on data with a predefined set of MRI modalities. Each such model cannot segment the disease using data with a different set of MRI modalities, nor can it segment any other type of disease. Moreover, this training paradigm does not allow a model to benefit from learning from heterogeneous databases that may contain scans and segmentation labels for different types of brain pathologies and diverse sets of MRI modalities. Is it feasible to use Federated Learning (FL) for training a single model on client databases that contain scans and labels of different brain pathologies and diverse sets of MRI modalities? We demonstrate promising results by combining appropriate, simple, and practical modifications to the model and training strategy: Designing a model with input channels that cover the whole set of modalities available across clients, training with random modality drop, and exploring the effects of feature normalization methods. Evaluation on 7 brain MRI databases with 5 different diseases shows that such FL framework can train a single model that is shown to be very promising in segmenting all disease types seen during training. Importantly, it is able to segment these diseases in new databases that contain sets of modalities different from those in training clients. These results demonstrate, for the first time, feasibility and effectiveness of using FL to train a single segmentation model on decentralised data with diverse brain diseases and MRI modalities, a necessary step towards leveraging heterogeneous real-world databases. Code will be made available at: https://github.com/FelixWag/FL-MultiDisease-MRI
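
Of the modifications listed, random modality drop is the easiest to sketch (illustrative; the channel layout and drop schedule are assumptions): whole modality channels are zeroed at random during training, with at least one always kept, so the model learns to segment from whatever subset a client provides.

    import torch

    def random_modality_drop(x, p: float = 0.3):
        # x: (B, C, D, H, W) with one channel per MRI modality.
        B, C = x.shape[:2]
        keep = torch.rand(B, C, device=x.device) > p
        keep[torch.arange(B), keep.float().argmax(dim=1)] = True  # never drop all
        return x * keep.view(B, C, 1, 1, 1)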

Updated: 2024-06-17 15:16:18

标题: 客户端数据库中不同脑疾病和MRI模态的联邦学习可行性

摘要: 在MRI中用于脑病变的分割模型通常是针对特定疾病开发的,并在具有预定义MRI模态的数据上进行训练。每个这样的模型不能使用具有不同MRI模态集的数据分割疾病,也不能分割任何其他类型的疾病。此外,这种训练范式不允许模型从可能包含不同类型脑病理学和不同MRI模态集的扫描和分割标签的异质数据库中学习。在客户端数据库上使用联邦学习(FL)来训练单个模型是否可行?我们通过对模型和训练策略进行适当、简单和实用的修改来展示有希望的结果:设计一个具有覆盖客户端可用的所有模态的输入通道的模型,训练时进行随机模态删除,并探索特征归一化方法的影响。对7个脑MRI数据库进行评估,涉及5种不同疾病,结果表明这种FL框架可以训练一个单一模型,该模型在分割训练期间看到的所有疾病类型中表现出非常有前途。重要的是,它能够在包含与训练客户端不同模态集的新数据库中分割这些疾病。这些结果首次展示了使用FL在分散数据上训练单一分割模型的可行性和有效性,这是朝着利用异构现实世界数据库的必要步骤。代码将在以下链接提供:https://github.com/FelixWag/FL-MultiDisease-MRI

更新时间: 2024-06-17 15:16:18

领域: eess.IV,cs.CV,cs.LG,I.4.9; I.4.6; I.2.11; I.4.0

下载: http://arxiv.org/abs/2406.11636v1

A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME

eXplainable artificial intelligence (XAI) methods have emerged to convert the black box of machine learning (ML) models into a more digestible form. These methods help to communicate how the model works with the aim of making ML models more transparent and increasing the trust of end-users into their output. SHapley Additive exPlanations (SHAP) and Local Interpretable Model Agnostic Explanation (LIME) are two widely used XAI methods, particularly with tabular data. In this perspective piece, we discuss the way the explainability metrics of these two methods are generated and propose a framework for interpretation of their outputs, highlighting their weaknesses and strengths. Specifically, we discuss their outcomes in terms of model-dependency and in the presence of collinearity among the features, relying on a case study from the biomedical domain (classification of individuals with or without myocardial infarction). The results indicate that SHAP and LIME are highly affected by the adopted ML model and feature collinearity, raising a note of caution on their usage and interpretation.
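
Typical usage of the two methods discussed above looks roughly as follows (a sketch assuming a fitted scikit-learn-style classifier `model` and tabular NumPy arrays `X_train`, `X_test`; exact arguments vary by library version):

    import shap
    from lime.lime_tabular import LimeTabularExplainer

    # Model-agnostic SHAP: explain predictions against a background dataset.
    shap_explainer = shap.Explainer(model.predict_proba, X_train)
    shap_values = shap_explainer(X_test[:10])

    # LIME: fit a local surrogate model around a single instance.
    lime_explainer = LimeTabularExplainer(X_train, mode="classification")
    lime_exp = lime_explainer.explain_instance(X_test[0], model.predict_proba)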

Updated: 2024-06-17 15:15:51

标题: 一种可解释人工智能方法的视角:SHAP和LIME

摘要: 可解释的人工智能(XAI)方法已经出现,将机器学习(ML)模型的黑匣子转换为更易理解的形式。这些方法帮助传达模型的工作方式,旨在使ML模型更透明,并增加最终用户对其输出的信任。SHapley Additive exPlanations(SHAP)和Local Interpretable Model Agnostic Explanation(LIME)是两种广泛使用的XAI方法,特别适用于表格数据。在这篇观点文章中,我们讨论了这两种方法的可解释性指标是如何生成的,并提出了一个用于解释其输出的框架,突出了它们的弱点和优点。具体来说,我们讨论了它们在模型依赖性和特征共线性存在的情况下的结果,依赖于生物医学领域的一个案例研究(对具有或没有心肌梗塞的个体进行分类)。结果表明,SHAP和LIME受到采用的ML模型和特征共线性的影响较大,对它们的使用和解释提出了警示。

更新时间: 2024-06-17 15:15:51

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2305.02012v3

The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance

Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance (i.e., guess A if uncertain). We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to have a similar effect to test-taking strategies employed by humans, leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.

Updated: 2024-06-17 15:14:10

标题: 基准率效应对LLM基准表现的影响:区分测试策略和基准表现

摘要: 填充测试是衡量大型语言模型在许多基准任务上行为的常用方法。使用MMLU数据集,我们发现答案标记之间的基础概率(BRP)差异显著,并影响任务表现,即在不确定时猜测A。我们发现,反事实提示能够充分缓解BRP效应。BRP效应被发现与人类采用的考试策略具有类似效果,导致任务表现和考试能力的混淆。我们提出了Nvr-X-MMLU任务,这是MMLU的一个变体,有助于将考试能力与任务表现区分开来,并报告后者。

更新时间: 2024-06-17 15:14:10

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11634v1

Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation

Maximum a posteriori decoding, a commonly used method for neural machine translation (NMT), aims to maximize the estimated posterior probability. However, high estimated probability does not always lead to high translation quality. Minimum Bayes Risk (MBR) decoding offers an alternative by seeking hypotheses with the highest expected utility. In this work, we show that Quality Estimation (QE) reranking, which uses a QE model as a reranker, can be viewed as a variant of MBR. Inspired by this, we propose source-based MBR (sMBR) decoding, a novel approach that utilizes synthetic sources generated by backward translation as ``support hypotheses'' and a reference-free quality estimation metric as the utility function, marking the first work to solely use sources in MBR decoding. Experiments show that sMBR significantly outperforms QE reranking and is competitive with standard MBR decoding. Furthermore, sMBR calls the utility function fewer times compared to MBR. Our findings suggest that sMBR is a promising approach for high-quality NMT decoding.
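
For reference, the standard MBR decision rule over a hypothesis set $H$ with a pseudo-reference (support) set $S$ and utility $u$ is

    \hat{y} \;=\; \operatorname*{arg\,max}_{h \in H} \;
      \frac{1}{|S|} \sum_{s \in S} u(h, s).

sMBR, as described above, instantiates $S$ with back-translated synthetic sources and $u$ with a reference-free quality-estimation metric; QE reranking corresponds to the degenerate case where each hypothesis receives a single QE score.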

Updated: 2024-06-17 15:13:52

标题: 揭示源的力量:基于源的最小贝叶斯风险解码用于神经机器翻译

摘要: 最大后验解码是神经机器翻译(NMT)中常用的方法,旨在最大化估计后验概率。然而,高估计概率并不总是导致高翻译质量。最小贝叶斯风险(MBR)解码通过寻求具有最高期望效用的假设提供了一种替代方法。 在这项工作中,我们展示了使用质量估计(QE)重新排序的QE重新排序可以被视为MBR的变体。受此启发,我们提出了基于源的MBR(sMBR)解码,这是一种利用由反向翻译生成的合成源作为“支持假设”和无参考质量估计度量作为效用函数的新方法,标志着首次仅使用源在MBR解码中。实验证明,sMBR明显优于QE重新排序,并且与标准MBR解码相竞争。此外,与MBR相比,sMBR对效用函数的调用次数较少。我们的研究结果表明,sMBR是一种高质量NMT解码的有希望的方法。

更新时间: 2024-06-17 15:13:52

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11632v1

The Liouville Generator for Producing Integrable Expressions

There has been a growing need to devise processes that can create comprehensive datasets in the world of Computer Algebra, both for accurate benchmarking and for new intersections with machine learning technology. We present here a method to generate integrands that are guaranteed to be integrable, dubbed the LIOUVILLE method. It is based on Liouville's theorem and the Parallel Risch Algorithm for symbolic integration. We show that this data generation method retains the best qualities of previous data generation methods, while overcoming some of the issues built into that prior work. The LIOUVILLE generator is able to generate sufficiently complex and realistic integrands, and could be used for benchmarking or machine learning training tasks related to symbolic integration.
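
The classical statement behind the method's name: if $f$ belongs to a differential field $F$ (of characteristic zero) and has an elementary antiderivative, then there exist $v_0, \dots, v_n \in F$ and constants $c_1, \dots, c_n$ with

    \int f \, dx \;=\; v_0 + \sum_{i=1}^{n} c_i \log v_i .

One natural way to exploit this shape of answer, plausibly the idea the abstract alludes to, is to generate a random right-hand side of this form and differentiate it, yielding an integrand whose integrability is guaranteed by construction.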

Updated: 2024-06-17 15:13:36

标题: 用于产生可积表达式的Liouville生成器

摘要: 在计算代数领域,越来越需要设计能够创建全面数据集的过程,这既可以用于准确的基准测试,也可以与机器学习技术进行新的交叉。我们在这里提出了一种生成可积函数的方法,被称为LIOUVILLE方法。该方法基于Liouville定理和符号积分的Parallel Risch算法。 我们展示了这种数据生成方法保留了先前数据生成方法的最佳特性,同时克服了一些内置在先前工作中的问题。LIOUVILLE生成器能够生成足够复杂和逼真的可积函数,并可以用于与符号积分相关的基准测试或机器学习训练任务。

更新时间: 2024-06-17 15:13:36

领域: cs.SC,cs.LG

下载: http://arxiv.org/abs/2406.11631v1

DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Although large language models (LLMs) have achieved significant success in recent years, the hallucination issue remains a challenge, and numerous benchmarks have been proposed to detect it. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many merely focus on factuality hallucination while ignoring faithfulness hallucination. Additionally, although the dialogue pattern is more widely utilized in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5 instances. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments with several well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.

Updated: 2024-06-17 15:11:20

标题: DiaHalu: 一个用于大型语言模型对话级幻觉评估的基准测试

摘要: 鉴于近年来大型语言模型(LLMs)取得了显著成功,幻觉问题仍然是一个挑战,已经提出了许多基准来检测幻觉。然而,其中一些基准并非由LLMs自然生成,而是经过有意诱导的。此外,许多基准仅关注事实性幻觉,而忽略了忠实性幻觉。此外,尽管在LLMs时代对话模式被广泛利用,但目前的基准仅集中在句子级和段落级幻觉上。在这项研究中,我们提出了DiaHalu,据我们所知是第一个对话级幻觉评估基准。首先,我们将收集的主题整合到系统提示中,并促进两个ChatGPT3.5之间的对话。随后,我们手动修改不符合人类语言约定的内容,然后让LLMs重新生成,模拟真实的人机交互场景。最后,专业学者对数据集中的所有样本进行注释。DiaHalu涵盖了四个常见的多轮对话领域和五种幻觉亚型,扩展了事实性和忠实性幻觉。通过对数据集上一些知名LLMs和检测方法的实验表明,DiaHalu是一个具有挑战性的基准,对进一步研究具有重要价值。

更新时间: 2024-06-17 15:11:20

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2403.00896v2

Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature

Text watermarks for large language models (LLMs) have been commonly used to identify the origins of machine-generated content, which is promising for assessing liability when combating deepfake or harmful content. While existing watermarking techniques typically prioritize robustness against removal attacks, unfortunately, they are vulnerable to spoofing attacks: malicious actors can subtly alter the meanings of LLM-generated responses or even forge harmful content, potentially misattributing blame to the LLM developer. To overcome this, we introduce a bi-level signature scheme, Bileve, which embeds fine-grained signature bits for integrity checks (mitigating spoofing attacks) as well as a coarse-grained signal to trace text sources when the signature is invalid (enhancing detectability) via a novel rank-based sampling strategy. Compared to conventional watermark detectors that only output binary results, Bileve can differentiate 5 scenarios during detection, reliably tracing text provenance and regulating LLMs. The experiments conducted on OPT-1.3B and LLaMA-7B demonstrate the effectiveness of Bileve in defeating spoofing attacks with enhanced detectability.

Updated: 2024-06-17 15:11:11

标题: Bileve:通过双层签名保护大型语言模型中文本来源免受欺骗

摘要: 大型语言模型(LLMs)的文本水印通常用于识别机器生成内容的来源,这对于在对抗深度伪造或有害内容时评估责任是很有希望的。现有的水印技术通常优先考虑对抗去除攻击的稳健性,但不幸的是,它们容易受到欺骗攻击的影响:恶意行为者可以微妙地改变LLM生成的响应的含义,甚至伪造有害内容,潜在地将责任归咎于LLM开发者。为了克服这一问题,我们引入了一种双层签名方案Bileve,该方案嵌入了用于完整性检查(减轻欺骗攻击)的细粒度签名位,以及在签名无效时追踪文本来源的粗粒度信号(增强可检测性)通过一种新颖的基于排名的采样策略。与仅输出二进制结果的传统水印检测器相比,Bileve在检测过程中可以区分5种情况,可靠地追踪文本来源并规范LLMs。在OPT-1.3B和LLaMA-7B上进行的实验表明,Bileve在增强可检测性的同时有效地击败了欺骗攻击。

更新时间: 2024-06-17 15:11:11

领域: cs.CR,cs.CL

下载: http://arxiv.org/abs/2406.01946v2

Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?

There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning and solution execution. We construct tests for each one in order to understand whether current LLMs display the same cognitive biases as children in these steps. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and the solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.

Updated: 2024-06-17 15:08:05

标题: 语言模型在问题求解中是否表现出与人类学习者相同的认知偏见?

摘要: 越来越多的人对使用大型语言模型(LLMs)作为认知模型感兴趣。为了这样的目的,了解LLMs模拟得很好的人类认知特性,以及哪些特性不是很重要。在这项工作中,我们研究LLMs的偏见与儿童在解决算术问题时所知道的偏见之间的关系。通过调查学习科学文献,我们假设问题解决过程可以分为三个不同的步骤:文本理解、解决方案规划和解决方案执行。我们为每个步骤构建测试,以了解当前的LLMs是否在这些步骤中显示与儿童相同的认知偏见。我们为每个测试生成了一组新的单词问题,使用一种神经符号方法,可以对问题特征进行精细控制。我们发现证据表明,LLMs在解决过程的文本理解和解决方案规划步骤中,无论是否经过指导调整,都表现出类似于人类的偏见,但在最终步骤中,即执行算术表达式以获得答案时,并不表现出这种偏见。

更新时间: 2024-06-17 15:08:05

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2401.18070v2

Words in Motion: Representation Engineering for Motion Forecasting

Motion forecasting transforms sequences of past movements and environment context into future motion. Recent methods rely on learned representations, resulting in hidden states that are difficult to interpret. In this work, we use natural language to quantize motion features in a human-interpretable way, and measure the degree to which they are embedded in hidden states. Our experiments reveal that hidden states of motion sequences are arranged with respect to our discrete sets of motion features. Following these insights, we fit control vectors to motion features, which allow for controlling motion forecasts at inference. Consequently, our method enables controlling transformer-based motion forecasting models with textual inputs, providing a unique interface to interact with and understand these models. Our implementation is available at https://github.com/kit-mrt/future-motion
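
Applying a fitted control vector at inference can be sketched in one line (illustrative; which layer is steered and how the scaling is chosen are assumptions, not details from the paper):

    import torch

    def steer_hidden(h, control_vec, alpha: float = 1.0):
        # h: hidden states (batch, tokens, d); control_vec: (d,) direction
        # fitted to a named motion feature (e.g. "accelerating").
        return h + alpha * control_vec.view(1, 1, -1)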

Updated: 2024-06-17 15:07:55

标题: 运动中的文字:运动预测的表示工程

摘要: 运动预测将过去的运动序列和环境上下文转换为未来的运动。最近的方法依赖于学习得到的表示,导致隐藏状态难以解释。在这项工作中,我们使用自然语言以人类可解释的方式量化运动特征,并测量它们嵌入隐藏状态的程度。我们的实验表明,运动序列的隐藏状态是依照我们的离散运动特征集合排列的。基于这些见解,我们将控制向量拟合到运动特征上,从而可以在推理时控制运动预测。因此,我们的方法使得可以通过文本输入控制基于Transformer的运动预测模型,为与这些模型交互和理解这些模型提供了独特的接口。我们的实现可在 https://github.com/kit-mrt/future-motion 获取。

更新时间: 2024-06-17 15:07:55

领域: cs.LG,cs.CL,cs.CV

下载: http://arxiv.org/abs/2406.11624v1

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and the COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.

Updated: 2024-06-17 15:04:15

标题: AV-CrossNet:一种利用窄频和交叉频带建模的音频视觉复杂频谱映射网络,用于语音分离

摘要: 将视觉线索添加到基于音频的语音分离中可以提高分离性能。本文介绍了AV-CrossNet,这是一个用于语音增强、目标说话者提取和多说话者分离的AV系统。AV-CrossNet是从CrossNet架构扩展而来的,后者是最近提出的一个网络,通过利用全局注意力和位置编码执行复杂的频谱映射以进行语音分离。为了有效利用视觉线索,所提出的系统结合了预提取的视觉嵌入并采用了包含时间卷积层的视觉编码器。在将音频和视觉特征馈送到AV-CrossNet块之前,在一个早期融合层中融合了音频和视觉特征。我们在多个数据集上评估了AV-CrossNet,包括LRS、VoxCeleb和COG-MHEAR挑战。评估结果表明,AV-CrossNet在所有音视频任务中推进了最先进的性能,即使在未经训练和不匹配的数据集上也是如此。

更新时间: 2024-06-17 15:04:15

领域: eess.AS,cs.LG

下载: http://arxiv.org/abs/2406.11619v1

SoK: A Literature and Engineering Review of Regular Expression Denial of Service

Regular expression denial of service (ReDoS) is an asymmetric cyberattack that has become prominent in recent years. Many research works examine ReDoS, measuring its impact or preventing its exploitation. However, there has been no systematic treatment of this topic in order to understand the limits of the state of the art and identify opportunities for further research. In this paper, we fill this gap by systematizing the existing knowledge on ReDoS. We review the algorithmic basis of ReDoS attacks and the pertinent history of regular expression engines. Next, we survey the literature, dividing works into two classes: measurement studies and defenses. We find no agreed-upon definition for ReDoS vulnerabilities, and observe weaknesses in the practical evaluations of many papers, making the impact of their findings hard to assess. The majority of academic work in this area limit themselves to showing the presence of an unexpected slow computation, without illustrating how this can be weaponized against real systems. Then, we survey the latest regex engines to examine whether and how the proposed defenses have been realized. In this way, we describe the new realities that should be considered in the next generation ReDoS research. We show that many academic threat models are out of date thanks to the adoption of defenses. Beyond this, we underscore the importance of simulating ReDoS attacks in realistic contexts, where factors like request size limiting or deployed mitigations are taken into account. We propose a tool, wrk-DoS, to facilitate these simulations.
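
The canonical demonstration of the asymmetry: an ambiguous pattern like (a+)+$ drives a backtracking engine (such as Python's re) through exponentially many match attempts on an almost-matching input, so a few extra attacker-supplied bytes roughly double the server's work.

    import re
    import time

    pattern = re.compile(r"(a+)+$")          # classic catastrophic pattern
    for n in (18, 20, 22, 24):
        t0 = time.perf_counter()
        pattern.search("a" * n + "b")        # never matches; forces backtracking
        print(n, f"{time.perf_counter() - t0:.2f}s")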

Updated: 2024-06-17 15:03:58

标题: SoK: 正则表达式拒绝服务攻击的文献及工程审查

摘要: 正则表达式拒绝服务攻击(ReDoS)是一种不对称的网络攻击,在近年来变得突出。许多研究作品都研究了ReDoS,衡量了其影响或预防其利用。然而,迄今为止还没有对这个主题进行系统化的处理,以便了解现有技术的局限性并确定进一步研究的机会。 在本文中,我们通过系统化现有对ReDoS的知识来填补这一空白。我们回顾了ReDoS攻击的算法基础和正则表达式引擎的相关历史。接下来,我们对文献进行了调查,将作品分为两类:测量研究和防御研究。我们发现对于ReDoS漏洞没有达成一致的定义,并观察到许多论文在实际评估中存在弱点,这使得很难评估其研究结果的影响。这个领域的大部分学术工作限制在展示出意外缓慢计算的存在,而没有说明如何将其用于攻击实际系统。然后,我们调查了最新的正则表达式引擎,以检查所提出的防御措施是否已经实现。通过这种方式,我们描述了下一代ReDoS研究中应考虑的新现实。我们发现许多学术威胁模型已经过时,这要归功于防御措施的采用。除此之外,我们强调在现实环境中模拟ReDoS攻击的重要性,其中考虑了请求大小限制或已部署的缓解措施等因素。我们提出了一个工具,wrk-DoS,以便进行这些模拟。

更新时间: 2024-06-17 15:03:58

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2406.11618v1

Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.

Updated: 2024-06-17 15:00:35

标题: 使用参数化知识迹线对遗忘进行内在评估

摘要: 最近,“取消学习”大型语言模型(LLMs)中的某些概念的任务引起了极大关注,因为这对于减轻不良模型行为(如生成有害、私密或不正确信息)至关重要。目前用于评估取消学习方法的协议主要依赖于行为测试,而不监控模型参数中未学习知识的存在。这些残余知识可以被对抗性地利用,以在取消学习后恢复已擦除的信息。我们认为取消学习还应该通过考虑未学习概念的参数化知识痕迹的变化来进行内部评估。为此,我们提出了一种激发参数空间中方向(称为“概念向量”)的通用方法,该方法编码具体概念,并构建了ConceptVectors,一个包含两个开源LLMs中数百个常见概念及其参数化知识痕迹的基准数据集。在ConceptVectors上的评估显示,现有的取消学习方法对概念向量的影响很小,而直接去除这些向量显著地从LLMs中删除相关知识,并显著降低它们对对抗性操纵的敏感性。我们的结果突显了基于行为的取消学习评估的局限性,并呼吁未来的工作包括基于参数的评估。为支持此目标,我们在https://github.com/yihuaihong/ConceptVectors上发布了我们的代码和基准。

更新时间: 2024-06-17 15:00:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11614v1

Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.

Updated: 2024-06-17 14:58:29

标题: 长代码竞技场:长上下文代码模型的一组基准测试

摘要: 如今,代码和自然语言处理领域正在迅速发展。特别是,模型在处理长上下文窗口方面变得更加优秀 - 在过去几年中,支持的上下文大小已经增加了数个数量级。然而,对于代码处理而言,缺乏超越单个文件上下文的基准测试,而最流行的基准测试受限于单个方法。通过这项工作,我们旨在通过引入Long Code Arena来弥补这一差距,这是一个包含六项基准测试的套件,用于需要项目范围上下文的代码处理任务。这些任务涵盖了代码处理的不同方面:基于库的代码生成、CI构建修复、项目级代码完成、提交消息生成、错误定位和模块摘要。对于每项任务,我们提供了一个经过手工验证的数据集用于测试,一个评估套件,并基于流行的LLM提供了开源基线解决方案,以展示数据集的使用方式,并简化其他研究人员的采用。我们在HuggingFace Spaces上发布了基准测试页面,包括排行榜、指向所有数据集的HuggingFace Hub的链接,以及指向包含基线的GitHub存储库的链接:https://huggingface.co/spaces/JetBrains-Research/long-code-arena。

更新时间: 2024-06-17 14:58:29

领域: cs.LG,cs.AI,cs.IR,cs.SE

下载: http://arxiv.org/abs/2406.11612v1

Standardizing Structural Causal Models

Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like $\operatorname{Var}$-sortability and $\operatorname{R^2}$-sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not $\operatorname{Var}$-sortable, and as we show experimentally, not $\operatorname{R^2}$-sortable either for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here.
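
A minimal sketch of the generative step (illustrative; it standardizes with empirical statistics, whereas iSCMs can use the mechanism's exact moments): each node is standardized before it feeds its children, so variance cannot accumulate along the causal order.

    import numpy as np

    def sample_linear_iscm(W, n, seed=0):
        # W[i, j]: weight of edge j -> i; nodes are assumed topologically
        # sorted, so W is strictly lower triangular.
        rng = np.random.default_rng(seed)
        d = W.shape[0]
        X = np.zeros((n, d))
        for i in range(d):
            x = X[:, :i] @ W[i, :i] + rng.standard_normal(n)
            X[:, i] = (x - x.mean()) / x.std()   # standardize before use
        return X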

Updated: 2024-06-17 14:52:21

标题: 标准化结构因果模型

摘要: 结构因果模型(SCM)生成的合成数据集通常用于对因果结构学习算法进行基准测试。然而,SCM数据中的方差和成对相关性往往沿因果次序递增。一些流行的算法利用了这些伪迹,可能得出无法推广到真实场景的结论。现有的度量如$\operatorname{Var}$-sortability和$\operatorname{R^2}$-sortability可以量化这些模式,但并未提供消除它们的工具。为此,我们提出了内部标准化结构因果模型(iSCMs):对SCM的一种修改,在生成过程中对每个变量引入标准化操作。由构造可知,iSCM不是$\operatorname{Var}$-sortable的;并且正如我们的实验所示,对于常用的图族,它也不是$\operatorname{R^2}$-sortable的。此外,与对标准SCM生成的数据做事后标准化不同,我们证明线性iSCM更难从关于权重的先验知识中被识别,并且在大规模系统中不会坍缩为确定性关系,这可能使iSCM在本文所研究的基准问题之外的因果推断中也成为有用的模型。

更新时间: 2024-06-17 14:52:21

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.11601v1

On GNN explainability with activation rules

GNNs are powerful models based on node representation learning that perform particularly well in many machine learning problems related to graphs. The major obstacle to the deployment of GNNs is mostly a problem of societal acceptability and trustworthiness, properties which require making explicit the internal functioning of such models. Here, we propose to mine activation rules in the hidden layers to understand how the GNNs perceive the world. The problem is not to discover activation rules that are individually highly discriminating for an output of the model. Instead, the challenge is to provide a small set of rules that cover all input graphs. To this end, we introduce the subjective activation pattern domain. We define an effective and principled algorithm to enumerate activations rules in each hidden layer. The proposed approach for quantifying the interest of these rules is rooted in information theory and is able to account for background knowledge on the input graph data. The activation rules can then be redescribed thanks to pattern languages involving interpretable features. We show that the activation rules provide insights on the characteristics used by the GNN to classify the graphs. Especially, this allows to identify the hidden features built by the GNN through its different layers. Also, these rules can subsequently be used for explaining GNN decisions. Experiments on both synthetic and real-life datasets show highly competitive performance, with up to 200% improvement in fidelity on explaining graph classification over the SOTA methods.
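
As a toy illustration of the mining step, one can binarize hidden activations and count which neuron subsets co-activate across input graphs. The paper's subjective, information-theoretic interestingness measure is far more principled than this raw frequency count, which is only meant to fix ideas:

```python
import numpy as np
from collections import Counter

def mine_activation_patterns(H, min_support=0.3):
    """H: (n_graphs, n_neurons) hidden activations. Returns co-activation
    patterns (frozensets of neuron ids) with support >= min_support."""
    B = H > H.mean(axis=0)  # a neuron "activates" when above its mean
    counts = Counter(frozenset(np.flatnonzero(row)) for row in B)
    return {p: c / len(B) for p, c in counts.items() if c / len(B) >= min_support}

H = np.array([[0.9, 0.1, 0.8],
              [0.8, 0.2, 0.7],
              [0.1, 0.9, 0.6],
              [0.85, 0.05, 0.9]])
print(mine_activation_patterns(H))  # {frozenset({0, 2}): 0.5}
```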

Updated: 2024-06-17 14:42:59

标题: 关于使用激活规则解释GNN的可解释性

摘要: GNN是基于节点表示学习的强大模型,特别擅长处理与图相关的许多机器学习问题。部署GNN的主要障碍在于社会可接受性和可信度,这些属性要求明确说明此类模型的内部运作方式。在这里,我们提出在隐藏层中挖掘激活规则,以了解GNN如何感知世界。问题不在于发现单独对模型某个输出具有高度区分性的激活规则,而在于提供一小组覆盖所有输入图的规则。为此,我们引入了主观激活模式域。我们定义了一种有效且有原则的算法来枚举每个隐藏层中的激活规则。所提出的用于量化这些规则兴趣度的方法根植于信息论,并能够考虑输入图数据的背景知识。随后,可以借助涉及可解释特征的模式语言对激活规则进行重新描述。我们展示了激活规则能够揭示GNN用于对图进行分类所依据的特征,特别是可以识别GNN通过其不同层构建的隐藏特征。此外,这些规则随后可用于解释GNN的决策。在合成和真实数据集上的实验表明其性能极具竞争力,在解释图分类的保真度上较SOTA方法最高提升200%。

更新时间: 2024-06-17 14:42:59

领域: cs.LG

下载: http://arxiv.org/abs/2406.11594v1

CoSQA+: Enhancing Code Search Dataset with Matching Code

Semantic code search, retrieving code that matches a given natural language query, is an important task to improve productivity in software engineering. Existing code search datasets are problematic: either using unrealistic queries, or with mismatched codes, and typically using one-to-one query-code pairing, which fails to reflect the reality that a query might have multiple valid code matches. This paper introduces CoSQA+, pairing high-quality queries (reused from CoSQA) with multiple suitable codes. We collect code candidates from diverse sources and form candidate pairs by pairing queries with these codes. Utilizing the power of large language models (LLMs), we automate pair annotation, filtering, and code generation for queries without suitable matches. Through extensive experiments, CoSQA+ has demonstrated superior quality over CoSQA. Models trained on CoSQA+ exhibit improved performance. Furthermore, we propose a new metric Mean Multi-choice Reciprocal Rank (MMRR), to assess one-to-N code search performance. We provide the code and data at https://github.com/DeepSoftwareAnalytics/CoSQA_Plus.
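
The abstract does not spell out the MMRR formula, so the following is one plausible reading, not necessarily the paper's exact definition: for each query, average the reciprocal ranks of all of its valid code matches, then average over queries.

```python
def mmrr_single(ranked_ids, relevant_ids):
    """Mean reciprocal rank over a query's multiple relevant codes
    (0 contribution for a relevant code absent from the ranking)."""
    ranks = {doc_id: i + 1 for i, doc_id in enumerate(ranked_ids)}
    return sum(1.0 / ranks[r] for r in relevant_ids if r in ranks) / len(relevant_ids)

# A query with three valid matches ranked 1st, 3rd and 10th:
ranking = ["a", "b", "c"] + [f"x{i}" for i in range(6)] + ["d"]
print(mmrr_single(ranking, {"a", "c", "d"}))  # (1/1 + 1/3 + 1/10) / 3 ≈ 0.478
```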

Updated: 2024-06-17 14:34:14

标题: CoSQA+: 使用匹配代码增强代码搜索数据集

摘要: 语义代码搜索,即检索与给定自然语言查询匹配的代码,是提高软件工程生产力的重要任务。现有的代码搜索数据集存在问题:要么使用不现实的查询,要么代码不匹配,并且通常采用一对一的查询-代码配对,无法反映一个查询可能有多个有效代码匹配的现实情况。本文介绍了CoSQA+,将高质量的查询(复用自CoSQA)与多个合适的代码配对。我们从不同来源收集代码候选项,并通过将查询与这些代码配对形成候选对。利用大型语言模型(LLMs)的强大能力,我们将配对标注、过滤以及为缺乏合适匹配的查询生成代码的流程自动化。大量实验表明,CoSQA+的质量优于CoSQA,在CoSQA+上训练的模型表现出更好的性能。此外,我们提出了一个新的度量标准Mean Multi-choice Reciprocal Rank(MMRR),用于评估一对多代码搜索的性能。我们在https://github.com/DeepSoftwareAnalytics/CoSQA_Plus提供了代码和数据。

更新时间: 2024-06-17 14:34:14

领域: cs.SE,cs.AI,cs.IR,I.2.7; D.2.3

下载: http://arxiv.org/abs/2406.11589v1

Killer Apps: Low-Speed, Large-Scale AI Weapons

The accelerating advancements in Artificial Intelligence (AI) and Machine Learning (ML), highlighted by the development of cutting-edge Generative Pre-trained Transformer (GPT) models by organizations such as OpenAI, Meta, and Anthropic, present new challenges and opportunities in warfare and security. Much of the current focus is on AI's integration within weapons systems and its role in rapid decision-making in kinetic conflict. However, an equally important but often overlooked aspect is the potential of AI-based psychological manipulation at internet scales within the information domain. These capabilities could pose significant threats to individuals, organizations, and societies globally. This paper explores the concept of AI weapons, their deployment, detection, and potential countermeasures.

Updated: 2024-06-17 14:18:50

标题: 杀手应用:低速、大规模人工智能武器

摘要: 人工智能(AI)和机器学习(ML)的不断进步,尤其是由OpenAI、Meta和Anthropic等组织开发的前沿生成式预训练变换器(GPT)模型的发展,为战争和安全领域带来了新的挑战和机遇。当前主要关注的是人工智能在武器系统中的整合以及在动能冲突中快速决策中的作用。然而,一个同样重要但经常被忽视的方面是AI在信息领域内以互联网规模进行心理操纵的潜力。这些能力可能对全球个人、组织和社会构成重大威胁。本文探讨了AI武器的概念、部署、检测和潜在对策。

更新时间: 2024-06-17 14:18:50

领域: cs.CY,cs.CR,cs.LG,I.2.7; H.4.3; J.4

下载: http://arxiv.org/abs/2402.01663v4

MLXP: A Framework for Conducting Replicable Experiments in Python

Replicability in machine learning (ML) research is increasingly concerning due to the utilization of complex non-deterministic algorithms and the dependence on numerous hyper-parameter choices, such as model architecture and training datasets. Ensuring reproducible and replicable results is crucial for advancing the field, yet often requires significant technical effort to conduct systematic and well-organized experiments that yield robust conclusions. Several tools have been developed to facilitate experiment management and enhance reproducibility; however, they often introduce complexity that hinders adoption within the research community, despite being well-handled in industrial settings. To address the challenge of low adoption, we propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python, available at https://github.com/inria-thoth/mlxp . MLXP streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.

Updated: 2024-06-17 14:16:16

标题: MLXP:一个在Python中进行可复制实验的框架

摘要: 由于复杂的非确定性算法的使用以及对模型架构、训练数据集等众多超参数选择的依赖,机器学习(ML)研究中的可复制性日益受到关注。确保结果可重现、可复制对推动该领域至关重要,然而,要开展能得出稳健结论的系统化、有条理的实验,往往需要大量技术投入。已经有多种工具被开发出来以促进实验管理并增强可重现性;然而,尽管这些工具在工业环境中得到了很好的应用,它们往往引入了阻碍研究社区采用的复杂性。为了解决采用率低的挑战,我们提出了MLXP,一个基于Python的开源、简单、轻量级实验管理工具,可在https://github.com/inria-thoth/mlxp 获取。MLXP以极小的使用负担简化了实验流程,同时确保高水平的可重现性。

更新时间: 2024-06-17 14:16:16

领域: cs.LG,cs.SE

下载: http://arxiv.org/abs/2402.13831v2

Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling

Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We discovered that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise, intermediate between white and pink, performed best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO.
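
Colored noise with a 1/f^beta power spectrum (beta = 0 is white, beta = 1 pink, beta = 2 red; the paper finds a value between white and pink best for PPO) can be generated by shaping white noise in the frequency domain. A minimal sketch:

```python
import numpy as np

def colored_noise(beta: float, n_steps: int, rng=None) -> np.ndarray:
    """Sample a noise sequence with power spectrum ~ 1/f^beta."""
    rng = rng if rng is not None else np.random.default_rng()
    freqs = np.fft.rfftfreq(n_steps)
    amplitude = np.zeros_like(freqs)
    amplitude[1:] = freqs[1:] ** (-beta / 2)  # shape the spectrum, skip DC
    spectrum = amplitude * (rng.normal(size=freqs.size)
                            + 1j * rng.normal(size=freqs.size))
    noise = np.fft.irfft(spectrum, n=n_steps)
    return noise / noise.std()  # unit variance, ready to scale by the policy std

eps = colored_noise(beta=0.5, n_steps=1000)  # one correlated sequence per episode
# action_t = policy_mean(obs_t) + policy_std(obs_t) * eps[t]
```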

Updated: 2024-06-17 14:15:01

标题: PPO中的有色噪声:通过相关动作采样改进探索与性能

摘要: 近端策略优化(PPO)是一种流行的同策略(on-policy)深度强化学习方法,采用随机策略进行探索。本文提出了一种基于有色噪声的PPO随机策略变体。先前的研究强调了动作噪声中的时间相关性对异策略(off-policy)强化学习中有效探索的重要性。在此基础上,我们研究了相关噪声是否也能增强PPO等同策略方法中的探索。我们发现,用于动作选择的相关噪声可以提高学习性能,并且在同策略方法中优于目前流行的不相关白噪声方法。与异策略学习中粉红噪声被发现非常有效不同,我们发现介于白噪声和粉红噪声之间的有色噪声在PPO的同策略学习中表现最佳。我们通过改变用于数据收集的并行模拟环境数量,考察了每次更新所收集数据量的影响,并观察到并行环境越多,相关性越强的噪声越有益。鉴于其显著效果和易于实现,我们建议将相关噪声作为PPO中的默认噪声源。

更新时间: 2024-06-17 14:15:01

领域: cs.LG

下载: http://arxiv.org/abs/2312.11091v2

MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis

Recently, numerous embedding models have been made available and widely used for various NLP tasks. The Massive Text Embedding Benchmark (MTEB) has primarily simplified the process of choosing a model that performs well for several tasks in English, but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. We gather 15 existing datasets in an easy-to-use interface and create three new French datasets for a global evaluation of 8 task categories. We compare 51 carefully selected embedding models on a large scale, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find out that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well. Our work comes with open-source code, new datasets and a public leaderboard.

Updated: 2024-06-17 14:14:54

标题: MTEB-French:用于法语句子嵌入评估和分析的资源

摘要: 最近,许多嵌入模型已被发布并广泛用于各种自然语言处理任务。Massive Text Embedding Benchmark(MTEB)大大简化了在多个英语任务上挑选高性能模型的过程,但向其他语言的扩展仍具挑战性。因此,我们扩展MTEB,提出了第一个面向法语的大规模句子嵌入基准。我们将15个现有数据集汇集到一个易用的接口中,并创建了三个新的法语数据集,以对8个任务类别进行全面评估。我们大规模比较了51个精心挑选的嵌入模型,进行了全面的统计检验,并分析了模型性能与其诸多特性之间的相关性。我们发现,尽管没有模型在所有任务上都最优,但在句子相似度上预训练的大型多语言模型表现尤为出色。我们的工作附带开源代码、新数据集和一个公开排行榜。

更新时间: 2024-06-17 14:14:54

领域: cs.CL,cs.IR,cs.LG

下载: http://arxiv.org/abs/2405.20468v2

Pre-Training and Personalized Fine-Tuning via Over-the-Air Federated Meta-Learning: Convergence-Generalization Trade-Offs

For modern artificial intelligence (AI) applications such as large language models (LLMs), the training paradigm has recently shifted to pre-training followed by fine-tuning. Furthermore, owing to dwindling open repositories of data and thanks to efforts to democratize access to AI models, pre-training is expected to increasingly migrate from the current centralized deployments to federated learning (FL) implementations. Meta-learning provides a general framework in which pre-training and fine-tuning can be formalized. Meta-learning-based personalized FL (meta-pFL) moves beyond basic personalization by targeting generalization to new agents and tasks. This paper studies the generalization performance of meta-pFL for a wireless setting in which the agents participating in the pre-training phase, i.e., meta-learning, are connected via a shared wireless channel to the server. Adopting over-the-air computing, we study the trade-off between generalization to new agents and tasks, on the one hand, and convergence, on the other hand. The trade-off arises from the fact that channel impairments may enhance generalization, while degrading convergence. Extensive numerical results validate the theory.
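
The convergence-generalization tension comes from channel noise entering the aggregated model updates. Below is a toy sketch of over-the-air aggregation with an additive white Gaussian noise channel; it illustrates the mechanism, not the paper's system model:

```python
import numpy as np

def over_the_air_aggregate(local_updates, snr_db=10.0, rng=None):
    """Agents transmit analog updates simultaneously; the server receives
    their superposition plus additive channel noise."""
    rng = rng if rng is not None else np.random.default_rng()
    superposition = np.mean(local_updates, axis=0)  # ideal federated average
    noise_power = np.mean(superposition ** 2) / (10 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(noise_power), size=superposition.shape)
    return superposition + noise  # impairment hurts convergence, may regularize

updates = [np.random.default_rng(i).normal(size=100) for i in range(8)]
print(np.linalg.norm(over_the_air_aggregate(updates) - np.mean(updates, axis=0)))
```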

Updated: 2024-06-17 14:06:13

标题: 通过空中联邦元学习进行预训练和个性化微调:收敛性和泛化性的权衡

摘要: 对于大型语言模型(LLMs)等现代人工智能(AI)应用,训练范式最近已转向先预训练、再微调。此外,由于开放数据存储库日渐减少,以及使AI模型访问民主化的努力,预训练预计将从当前的集中式部署逐渐迁移到联邦学习(FL)实现。元学习提供了一个可以形式化预训练与微调的通用框架。基于元学习的个性化FL(meta-pFL)超越了基本个性化,以泛化到新代理和新任务为目标。本文研究了无线环境中meta-pFL的泛化性能,其中参与预训练阶段(即元学习)的代理通过共享无线信道连接到服务器。采用空中计算,我们研究了一方面对新代理和新任务的泛化、另一方面收敛性之间的权衡。这一权衡源于信道损伤可能增强泛化,却会降低收敛性能。大量数值结果验证了该理论。

更新时间: 2024-06-17 14:06:13

领域: cs.LG,cs.IT,eess.SP,math.IT

下载: http://arxiv.org/abs/2406.11569v1

Quaternion Generative Adversarial Neural Networks and Applications to Color Image Inpainting

Color image inpainting is a challenging task in imaging science. Existing methods are based on real-valued operations: the red, green, and blue channels of a color image are processed separately, ignoring the correlations between channels. In order to make full use of these correlations, this paper proposes a Quaternion Generative Adversarial Neural Network (QGAN) model and related theory, and applies it to color image inpainting with large missing areas. Firstly, the definition of quaternion deconvolution is given and quaternion batch normalization is proposed. Secondly, these two innovative modules are applied to the generative adversarial network to improve stability. Finally, QGAN is applied to color image inpainting and compared with other state-of-the-art algorithms. The experimental results show that QGAN is superior for color image inpainting with large missing areas.
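
The building block that lets quaternion layers exploit cross-channel correlations is the Hamilton product, which mixes all four components (here a real part plus R, G, B) instead of treating them independently. A minimal sketch of the product itself, not the paper's QGAN code:

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of quaternions p = (a, b, c, d) and q = (e, f, g, h)."""
    a, b, c, d = p
    e, f, g, h = q
    return np.array([
        a * e - b * f - c * g - d * h,
        a * f + b * e + c * h - d * g,
        a * g - b * h + c * e + d * f,
        a * h + b * g - c * f + d * e,
    ])

# An RGB pixel encoded as a pure quaternion (0, R, G, B); a quaternion weight
# then transforms all three color channels jointly rather than separately.
pixel = np.array([0.0, 0.8, 0.2, 0.1])
weight = np.array([0.5, 0.1, -0.3, 0.2])
print(hamilton_product(weight, pixel))
```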

Updated: 2024-06-17 14:04:17

标题: 四元数生成对抗神经网络及其在彩色图像修复中的应用

摘要: 彩色图像修复是图像科学中一项具有挑战性的任务。现有方法基于实数运算,对彩色图像的红、绿、蓝通道分别处理,忽略了各通道之间的相关性。为了充分利用各通道之间的相关性,本文提出了一种四元数生成对抗神经网络(QGAN)模型及相关理论,并将其应用于解决大面积缺失的彩色图像修复问题。首先,给出了四元数反卷积的定义,并提出了四元数批量归一化。其次,将上述两个创新模块应用于生成对抗网络以提高稳定性。最后,将QGAN应用于彩色图像修复,并与其他最先进的算法进行比较。实验结果表明,QGAN在大面积缺失的彩色图像修复中具有优势。

更新时间: 2024-06-17 14:04:17

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.11567v1

Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture

Neural models produce promising results when solving Vehicle Routing Problems (VRPs), but often fall short in generalization. Recent attempts to enhance model generalization often incur unnecessarily large training cost or cannot be directly applied to other models solving different VRP variants. To address these issues, we take a novel perspective on model architecture in this study. Specifically, we propose a plug-and-play Entropy-based Scaling Factor (ESF) and a Distribution-Specific (DS) decoder to enhance the size and distribution generalization, respectively. ESF adjusts the attention weight pattern of the model towards familiar ones discovered during training when solving VRPs of varying sizes. The DS decoder explicitly models VRPs of multiple training distribution patterns through multiple auxiliary light decoders, expanding the model representation space to encompass a broader range of distributional scenarios. We conduct extensive experiments on both synthetic and widely recognized real-world benchmarking datasets and compare the performance with seven baseline models. The results demonstrate the effectiveness of using ESF and DS decoder to obtain a more generalizable model and showcase their applicability to solve different VRP variants, i.e., travelling salesman problem and capacitated VRP. Notably, our proposed generic components require minimal computational resources, and can be effortlessly integrated into conventional generalization strategies to further elevate model generalization.
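
The abstract does not give ESF's formula, so purely as a loose illustration of the general idea (rescaling attention logits so the weight pattern on a larger instance stays close to patterns seen at training size), here is a toy log-ratio scaling; the scaling choice is my assumption, not the paper's factor:

```python
import numpy as np

def size_scaled_attention(q, K, n_train: int):
    """Toy attention whose logits are rescaled by instance size."""
    logits = K @ q / np.sqrt(q.size)
    logits *= np.log(K.shape[0]) / np.log(n_train)  # illustrative scaling choice
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
q, K = rng.normal(size=16), rng.normal(size=(500, 16))  # 500-node test instance
w = size_scaled_attention(q, K, n_train=100)            # trained on 100 nodes
print(w.sum().round(3), -(w * np.log(w)).sum())         # weights sum to 1; entropy
```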

Updated: 2024-06-17 14:02:57

标题: 通过模型架构改善神经车辆路径问题求解器的泛化能力

摘要: 神经模型在解决车辆路径问题(VRPs)时取得了令人满意的结果,但往往在泛化方面表现不佳。近期增强模型泛化能力的尝试往往带来不必要的高训练成本,或无法直接应用于求解不同VRP变体的其他模型。为了解决这些问题,我们在本研究中采用了一种新颖的模型架构视角。具体来说,我们提出了一种即插即用的基于熵的缩放因子(ESF)和一个分布特定(DS)解码器,分别用于增强规模泛化和分布泛化能力。在求解不同规模的VRP时,ESF将模型的注意力权重模式调整为训练期间发现的熟悉模式。DS解码器通过多个辅助轻量级解码器显式地对多种训练分布模式的VRP进行建模,扩展模型表示空间以涵盖更广泛的分布情景。我们在合成和广泛认可的真实世界基准数据集上进行了大量实验,并将性能与七种基线模型进行了比较。结果表明,使用ESF和DS解码器可以获得更具泛化性的模型,并展示了它们在求解不同VRP变体(即旅行商问题和带容量约束的VRP)上的适用性。值得注意的是,我们提出的通用组件只需极少的计算资源,并可以轻松集成到传统泛化策略中,进一步提升模型的泛化能力。

更新时间: 2024-06-17 14:02:57

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.06652v2

Deep learning probability flows and entropy production rates in active matter

Active matter systems, from self-propelled colloids to motile bacteria, are characterized by the conversion of free energy into useful work at the microscopic scale. They involve physics beyond the reach of equilibrium statistical mechanics, and a persistent challenge has been to understand the nature of their nonequilibrium states. The entropy production rate and the probability current provide quantitative ways to do so by measuring the breakdown of time-reversal symmetry. Yet, their efficient computation has remained elusive, as they depend on the system's unknown and high-dimensional probability density. Here, building upon recent advances in generative modeling, we develop a deep learning framework to estimate the score of this density. We show that the score, together with the microscopic equations of motion, gives access to the entropy production rate, the probability current, and their decomposition into local contributions from individual particles. To represent the score, we introduce a novel, spatially-local transformer network architecture that learns high-order interactions between particles while respecting their underlying permutation symmetry. We demonstrate the broad utility and scalability of the method by applying it to several high-dimensional systems of active particles undergoing motility-induced phase separation (MIPS). We show that a single network trained on a system of 4096 particles at one packing fraction can generalize to other regions of the phase diagram, including systems with as many as 32768 particles. We use this observation to quantify the spatial structure of the departure from equilibrium in MIPS as a function of the number of particles and the packing fraction.

Updated: 2024-06-17 14:02:49

标题: 深度学习概率流和活性物质中的熵产率

摘要: 主动物质系统,从自主推进的胶体到运动的细菌,其特点是在微观尺度上将自由能转化为有用功。它们涉及超出平衡统计力学范围的物理学,一个持续的挑战是理解其非平衡态的本质。熵产生率和概率流通过度量时间反演对称性的破坏,提供了量化的途径。然而,由于它们依赖于系统未知的高维概率密度,其高效计算一直难以实现。在最近生成模型进展的基础上,我们开发了一个深度学习框架来估计该密度的得分(score)。我们展示了该得分连同微观运动方程,可以给出熵产生率、概率流,及其分解为单个粒子的局部贡献。为了表示得分,我们引入了一种新颖的、空间局部的Transformer网络架构,它在尊重粒子底层置换对称性的同时学习粒子间的高阶相互作用。我们通过将该方法应用于若干经历运动诱导相分离(MIPS)的高维活性粒子系统,展示了其广泛的实用性和可扩展性。我们展示了在单一填充分数下、在4096个粒子的系统上训练的单个网络可以推广到相图的其他区域,包括多达32768个粒子的系统。我们利用这一观察来量化MIPS中偏离平衡的空间结构随粒子数量和填充分数的变化。

更新时间: 2024-06-17 14:02:49

领域: cond-mat.stat-mech,cond-mat.soft,cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2309.12991v2

Intersymbolic AI: Interlinking Symbolic AI and Subsymbolic AI

This perspective piece calls for the study of the new field of Intersymbolic AI, by which we mean the combination of symbolic AI, whose building blocks have inherent significance/meaning, with subsymbolic AI, whose entirety creates significance/effect despite the fact that individual building blocks escape meaning. Canonical kinds of symbolic AI are logic, games and planning. Canonical kinds of subsymbolic AI are (un)supervised machine and reinforcement learning. Intersymbolic AI interlinks the worlds of symbolic AI with its compositional symbolic significance and meaning and of subsymbolic AI with its summative significance or effect to enable culminations of insights from both worlds by going between and across symbolic AI insights with subsymbolic AI techniques that are being helped by symbolic AI principles. For example, Intersymbolic AI may start with symbolic AI to understand a dynamic system, continue with subsymbolic AI to learn its control, and end with symbolic AI to safely use the outcome of the learned subsymbolic AI controller in the dynamic system. Intersymbolic AI combines both symbolic and subsymbolic AI to increase the effectiveness of AI compared to either kind of AI alone, in much the same way that the combination of both conscious and subconscious thought increases the effectiveness of human thought compared to either kind of thought alone. Some successful contributions to the Intersymbolic AI paradigm are surveyed here but many more are considered possible by advancing Intersymbolic AI.

Updated: 2024-06-17 14:01:59

标题: Intersymbolic AI: 连接符号AI和次符号AI

摘要: 这篇观点文章呼吁研究交符号人工智能(Intersymbolic AI)这一新领域,即把构件本身具有意义的符号人工智能与整体产生意义/效果而单个构件不具含义的次符号人工智能结合起来。典型的符号人工智能包括逻辑、博弈和规划;典型的次符号人工智能包括(无)监督机器学习和强化学习。交符号人工智能将具有组合性符号意义的符号人工智能世界与具有整体性意义或效果的次符号人工智能世界相互关联,借助符号人工智能原则辅助的次符号人工智能技术,在两类洞见之间往返穿行,从而汇聚来自两个世界的洞见。例如,交符号人工智能可以先用符号人工智能理解一个动态系统,再用次符号人工智能学习其控制,最后用符号人工智能在该动态系统中安全地使用所学次符号控制器的成果。正如意识思维与潜意识思维相结合比单独一种思维更有效一样,交符号人工智能结合符号与次符号人工智能,以提高相对于单独任一种人工智能的有效性。本文综述了交符号人工智能范式的一些成功贡献,而推进交符号人工智能还有望带来更多可能。

更新时间: 2024-06-17 14:01:59

领域: cs.AI,68T01, 68T05, 68T07, 68T27, 68T30, 03B70,I.2.0; I.2.3; I.2.4; I.2.6; I.2.8

下载: http://arxiv.org/abs/2406.11563v1

An Imitative Reinforcement Learning Framework for Autonomous Dogfight

Unmanned Combat Aerial Vehicle (UCAV) dogfight, which refers to a fight between two or more UCAVs usually at close quarters, plays a decisive role on the aerial battlefields. With the evolution of artificial intelligence, dogfight progressively transits towards intelligent and autonomous modes. However, the development of autonomous dogfight policy learning is hindered by challenges such as weak exploration capabilities, low learning efficiency, and unrealistic simulated environments. To overcome these challenges, this paper proposes a novel imitative reinforcement learning framework, which efficiently leverages expert data while enabling autonomous exploration. The proposed framework not only enhances learning efficiency through expert imitation, but also ensures adaptability to dynamic environments via autonomous exploration with reinforcement learning. Therefore, the proposed framework can learn a successful dogfight policy of 'pursuit-lock-launch' for UCAVs. To support data-driven learning, we establish a dogfight environment based on the Harfang3D sandbox, where we conduct extensive experiments. The results indicate that the proposed framework excels in multistage dogfight, significantly outperforms state-of-the-art reinforcement learning and imitation learning methods. Thanks to the ability of imitating experts and autonomous exploration, our framework can quickly learn the critical knowledge in complex aerial combat tasks, achieving up to a 100% success rate and demonstrating excellent robustness.

Updated: 2024-06-17 13:59:52

标题: 一种针对自主空战的模仿式强化学习框架

摘要: 无人作战飞行器(UCAV)的格斗,指的是两架或多架UCAV之间通常在近距离进行的战斗,在空中战场上发挥着决定性作用。随着人工智能的发展,格斗逐渐转向智能化和自主化模式。然而,自主格斗策略学习的发展受到诸多挑战的阻碍,如探索能力不足、学习效率低和模拟环境不真实。为了克服这些挑战,本文提出了一种新颖的模仿强化学习框架,在实现自主探索的同时高效利用专家数据。该框架不仅通过模仿专家提高了学习效率,还通过强化学习的自主探索确保了对动态环境的适应性。因此,该框架可以为UCAV学习到"追击-锁定-发射"的成功格斗策略。为了支持数据驱动学习,我们基于Harfang3D沙盒建立了一个格斗环境,并在其中进行了大量实验。结果表明,该框架在多阶段格斗中表现出色,明显优于最先进的强化学习和模仿学习方法。得益于模仿专家和自主探索的能力,我们的框架可以快速学习复杂空战任务中的关键知识,实现高达100%的成功率,并展示出优秀的稳健性。

更新时间: 2024-06-17 13:59:52

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2406.11562v1

Physics-informed Neural Network Estimation of Material Properties in Soft Tissue Nonlinear Biomechanical Models

The development of biophysical models for clinical applications is rapidly advancing in the research community, thanks to their predictive nature and their ability to assist the interpretation of clinical data. However, high-resolution and accurate multi-physics computational models are computationally expensive and their personalisation involves fine calibration of a large number of parameters, which may be space-dependent, challenging their clinical translation. In this work, we propose a new approach which relies on the combination of physics-informed neural networks (PINNs) with three-dimensional soft tissue nonlinear biomechanical models, capable of reconstructing displacement fields and estimating heterogeneous patient-specific biophysical properties. The proposed learning algorithm encodes information from a limited amount of displacement and, in some cases, strain data, that can be routinely acquired in the clinical setting, and combines it with the physics of the problem, represented by a mathematical model based on partial differential equations, to regularise the problem and improve its convergence properties. Several benchmarks are presented to show the accuracy and robustness of the proposed method and its great potential to enable the robust and effective identification of patient-specific, heterogeneous physical properties, such as tissue stiffness properties. In particular, we demonstrate the capability of the PINN to detect the presence, location and severity of scar tissue, which is beneficial to develop personalised simulation models for disease diagnosis, especially for cardiac applications.
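
The PINN recipe, fitting scarce displacement data while penalizing the residual of the governing PDE and training the unknown material parameter jointly, can be sketched on a 1D toy problem: a bar with unknown stiffness k in k*u''(x) = f(x). This illustrates the idea only; the paper tackles nonlinear 3D biomechanics:

```python
import torch

torch.manual_seed(0)
k_true = 2.0
f = lambda x: -k_true * torch.pi ** 2 * torch.sin(torch.pi * x)  # so u = sin(pi x)

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
log_k = torch.nn.Parameter(torch.zeros(()))  # learn k > 0 as exp(log_k)
opt = torch.optim.Adam(list(net.parameters()) + [log_k], lr=1e-3)

x_data = torch.rand(20, 1)                   # sparse "measured" displacements
u_data = torch.sin(torch.pi * x_data)

for _ in range(5000):
    opt.zero_grad()
    x = torch.rand(128, 1, requires_grad=True)            # collocation points
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    loss_pde = ((torch.exp(log_k) * d2u - f(x)) ** 2).mean()  # physics residual
    loss_data = ((net(x_data) - u_data) ** 2).mean()          # data misfit
    (loss_pde + loss_data).backward()
    opt.step()

print(float(torch.exp(log_k)))  # estimated stiffness, should approach k_true = 2.0
```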

Updated: 2024-06-17 13:59:07

标题: 物理信息神经网络估计软组织非线性生物力学模型中的材料属性

摘要: 得益于其预测能力和辅助解读临床数据的能力,面向临床应用的生物物理模型在研究界发展迅速。然而,高分辨率、精确的多物理场计算模型计算代价高昂,其个性化涉及大量(可能随空间变化的)参数的精细校准,阻碍了其临床转化。在这项工作中,我们提出了一种新方法,将物理信息神经网络(PINNs)与三维软组织非线性生物力学模型相结合,能够重建位移场并估计患者特异的异质生物物理属性。所提出的学习算法对临床环境中可常规获取的少量位移数据(在某些情况下还包括应变数据)中的信息进行编码,并将其与由基于偏微分方程的数学模型所表示的问题物理相结合,从而对问题进行正则化并改善其收敛性。文中给出了若干基准算例,展示了所提方法的精度和稳健性,及其在稳健、有效地识别患者特异的异质物理属性(如组织刚度属性)方面的巨大潜力。特别地,我们展示了PINN检测疤痕组织的存在、位置和严重程度的能力,这有助于为疾病诊断开发个性化仿真模型,尤其是心脏方面的应用。

更新时间: 2024-06-17 13:59:07

领域: cs.LG,cs.NA,math.NA,physics.bio-ph,physics.med-ph,I.2.1; J.2

下载: http://arxiv.org/abs/2312.09787v3

Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph

Locating objects referred to in natural language poses a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object retrieval with simple (bare) queries but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs 3D scene spatial graph representation with metric edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to form 3D objects, an advanced raycasting algorithm to project them to 2D, and a vision-language model to describe them as graph nodes. On Replica and ScanNet datasets, we show that the designed method accurately constructs 3D object-centric maps. We have demonstrated that their quality takes a leading place for open-vocabulary 3D semantic segmentation against other zero-shot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On Sr3D and Nr3D benchmarks, our deductive approach demonstrates a significant improvement, enabling retrieving objects by complex queries compared to other state-of-the-art methods. Considering our design solutions, we achieved a processing speed approximately x3 times faster than the closest analog. This promising performance enables our approach for usage in applied intelligent robotics projects. We make the code publicly available at linukc.github.io/bbq/.

Updated: 2024-06-17 13:55:40

标题: 超越简单查询:使用3D场景图进行开放词汇对象检索

摘要: 定位自然语言所指称的物体对自主代理来说是一项重大挑战。现有基于CLIP的开放词汇方法可以用简单(裸)查询成功完成3D物体检索,但无法应对需要理解物体间关系的含糊描述。为了解决这个问题,我们提出了一种名为BBQ(Beyond Bare Queries)的模块化方法,它构建带度量边的3D场景空间图表示,并通过我们的演绎式场景推理算法,将大型语言模型用作人与代理之间的接口。BBQ利用稳健的基于DINO的关联来构建3D物体,使用先进的光线投射算法将其投影到2D,并使用视觉语言模型将其描述为图节点。在Replica和ScanNet数据集上,我们展示了所设计的方法能够准确构建以物体为中心的3D地图。我们证明,其质量在开放词汇3D语义分割上领先于其他零样本方法。此外,我们还展示了利用空间关系对包含多个同一语义类别实体的场景尤为有效。在Sr3D和Nr3D基准测试中,我们的演绎方法表现出显著改进,使得相比其他最先进的方法能够通过复杂查询检索物体。得益于我们的设计方案,我们的处理速度比最接近的同类方法快约3倍。这种可观的性能使我们的方法可用于应用型智能机器人项目。我们在linukc.github.io/bbq/公开了代码。

更新时间: 2024-06-17 13:55:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.07113v2

Input Conditioned Graph Generation for Language Agents

Recent progress in Large Language Models (LLMs) and language agents has demonstrated significant promise for various future applications across multiple disciplines. While traditional approaches to language agents often rely on fixed, handcrafted designs, our research aims to develop both learnable and dynamic agents. Our method uses an existing framework that abstracts language agents as graphs. Within this graph framework, we aim to learn a model that can generate edges for every given input to the language agent. This allows us to generate edges that represent the flow of communication within the graph based on the given input, thereby adjusting the internal communication of a language agent. We learn to generate these edges using a pretrained LLM that is fine-tuned with reinforcement learning. This LLM can be fine-tuned on several datasets simultaneously, and we hypothesize that the model learns to adapt to these different domains during training, achieving good overall performance when encountering data from different domains during deployment. We demonstrate that our approach surpasses the previous static approach by nearly 6% accuracy on a combined dataset of MMLU and CMMLU, and by more than 10% when trained with a sparsity-inducing loss. It also performs superior in additional experiments conducted with the MMLU and Mini Crossword Puzzles datasets. The code is available at https://github.com/lukasVierling/DynamicGPTSwarm.

Updated: 2024-06-17 13:53:15

标题: 输入条件图生成用于语言代理

摘要: 大型语言模型(LLMs)和语言代理的最新进展显示出在多个学科领域未来应用的显著潜力。传统的语言代理方法通常依赖固定的、手工设计的方案,而我们的研究旨在开发可学习且动态的代理。我们的方法使用一个将语言代理抽象为图的现有框架。在这个图框架内,我们的目标是学习一个模型,能够为语言代理的每个给定输入生成边。这使我们能够根据给定输入生成代表图内通信流动的边,从而调整语言代理的内部通信。我们使用经过强化学习微调的预训练LLM来学习生成这些边。该LLM可以同时在多个数据集上进行微调,我们假设模型在训练过程中学会适应这些不同领域,从而在部署时遇到来自不同领域的数据时能够取得良好的整体表现。我们展示了我们的方法在MMLU和CMMLU的组合数据集上比以前的静态方法准确率高出近6%,并且在使用稀疏性诱导损失进行训练时高出超过10%。在MMLU和Mini Crossword Puzzles数据集上进行的其他实验中,它也表现出色。代码可在https://github.com/lukasVierling/DynamicGPTSwarm 找到。

更新时间: 2024-06-17 13:53:15

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11555v1

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

Updated: 2024-06-17 13:51:35

标题: DeepSeek-Coder-V2:打破代码智能中封闭源模型的障碍

摘要: 我们介绍了DeepSeek-Coder-V2,这是一个开源的专家混合(MoE)代码语言模型,在代码特定任务中实现了可与GPT4-Turbo相媲美的性能。具体来说,DeepSeek-Coder-V2是在DeepSeek-V2的中间检查点的基础上进一步预训练,额外使用了6万亿个标记。通过这种持续的预训练,DeepSeek-Coder-V2显著增强了DeepSeek-V2的编码和数学推理能力,同时在一般语言任务中保持了可比性能。与DeepSeek-Coder-33B相比,DeepSeek-Coder-V2在各个与代码相关任务以及推理和一般能力方面都取得了显著进展。此外,DeepSeek-Coder-V2将其对编程语言的支持从86个扩展到338个,并将上下文长度从16K扩展到128K。在标准基准评估中,DeepSeek-Coder-V2在编码和数学基准测试中表现优于闭源模型,如GPT4-Turbo、Claude 3 Opus和Gemini 1.5 Pro。

更新时间: 2024-06-17 13:51:35

领域: cs.SE,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11931v1

Backdoor for Debias: Mitigating Model Bias with Backdoor Attack-based Artificial Bias

With the swift advancement of deep learning, state-of-the-art algorithms have been utilized in various social situations. Nonetheless, some algorithms have been discovered to exhibit biases and provide unequal results. The current debiasing methods face challenges such as poor utilization of data or intricate training requirements. In this work, we found that the backdoor attack can construct an artificial bias similar to the model bias derived in standard training. Considering the strong adjustability of backdoor triggers, we are motivated to mitigate the model bias by carefully designing reverse artificial bias created from backdoor attack. Based on this, we propose a backdoor debiasing framework based on knowledge distillation, which effectively reduces the model bias from original data and minimizes security risks from the backdoor attack. The proposed solution is validated on both image and structured datasets, showing promising results. This work advances the understanding of backdoor attacks and highlights its potential for beneficial applications. The code for the study can be found at https://anonymous.4open.science/r/DwB-BC07/.

Updated: 2024-06-17 13:47:55

标题: 以后门实现去偏:利用基于后门攻击的人工偏差减轻模型偏差

摘要: 随着深度学习的快速发展,最先进的算法已经在各种社会情境中得到应用。然而,一些算法被发现存在偏差,并给出不公平的结果。当前的去偏方法面临数据利用不足或训练要求复杂等挑战。在这项工作中,我们发现后门攻击可以构建一种人工偏差,类似于标准训练中产生的模型偏差。考虑到后门触发器的强可调性,我们受到启发,通过精心设计源自后门攻击的反向人工偏差来减轻模型偏差。在此基础上,我们提出了一个基于知识蒸馏的后门去偏框架,有效减少了源自原始数据的模型偏差,并将后门攻击带来的安全风险降至最低。所提出的解决方案在图像和结构化数据集上均得到验证,显示出令人鼓舞的结果。这项工作推进了对后门攻击的理解,并凸显了其用于有益应用的潜力。研究代码可在https://anonymous.4open.science/r/DwB-BC07/找到。

更新时间: 2024-06-17 13:47:55

领域: cs.LG,cs.AI,cs.CY

下载: http://arxiv.org/abs/2303.01504v2

Fine-Grained Domain Generalization with Feature Structuralization

Fine-grained domain generalization (FGDG) is a more challenging task than traditional DG tasks due to its small inter-class variations and relatively large intra-class disparities. When domain distribution changes, the vulnerability of subtle features leads to a severe deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distribution data, leveraging structured multi-granularity knowledge that emerges from discerning the commonality and specificity within categories. Likewise, we propose a Feature Structuralized Domain Generalization (FSDG) model, wherein features experience structuralization into common, specific, and confounding segments, harmoniously aligned with their relevant semantic concepts, to elevate performance in FGDG. Specifically, feature structuralization (FS) is accomplished through joint optimization of five constraints: a decorrelation function applied to disentangled segments, three constraints ensuring common feature consistency and specific feature distinctiveness, and a prediction calibration term. By imposing these stipulations, FSDG is prompted to disentangle and align features based on multi-granularity knowledge, facilitating robust subtle distinctions among categories. Extensive experimentation on three benchmarks consistently validates the superiority of FSDG over state-of-the-art counterparts, with an average improvement of 6.2% in FGDG performance. Beyond that, the explainability analysis on explicit concept matching intensity between the shared concepts among categories and the model channels, along with experiments on various mainstream model architectures, substantiates the validity of FS.

Updated: 2024-06-17 13:47:02

标题: 细粒度领域泛化与特征结构化

摘要: 细粒度领域泛化(FGDG)比传统DG任务更具挑战性,因为其类间差异小而类内差异相对较大。当领域分布发生变化时,细微特征的脆弱性会导致模型性能严重下降。然而,人类天生具备对分布外数据的泛化能力,这得益于从辨别类别内的共性与特殊性中形成的结构化多粒度知识。受此启发,我们提出了一个特征结构化领域泛化(FSDG)模型,其中特征被结构化为共性、特殊和混淆三个部分,并与其相关的语义概念协调对齐,以提升FGDG性能。具体而言,特征结构化(FS)通过联合优化五个约束实现:一个作用于解耦片段的去相关函数、三个分别确保共性特征一致性和特殊特征独特性的约束,以及一个预测校准项。通过施加这些约束,FSDG被促使基于多粒度知识对特征进行解耦和对齐,促进类别之间稳健的细微区分。在三个基准上进行的大量实验一致验证了FSDG相对于最先进方法的优越性,FGDG性能平均提升6.2%。此外,关于类别间共享概念与模型通道之间显式概念匹配强度的可解释性分析,以及在各种主流模型架构上的实验,均证实了FS的有效性。

更新时间: 2024-06-17 13:47:02

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.09166v2

AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation

The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an additional MLLM, with limited utilization of failed samples to correct low-level contact poses. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities. We carefully design two types of prompt instructions through interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2) textual descriptions to indicate potential directions for rotation correction. During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts. To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration. Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts. Real-world demonstration can be found at https://sites.google.com/view/aic-mllm

Updated: 2024-06-17 13:44:53

标题: AIC MLLM:自主交互校正MLLM用于稳健的机器人操作

摘要: 反思并纠正失败的能力,对于机器人系统与现实物体稳定交互至关重要。鉴于多模态大语言模型(MLLMs)的泛化和推理能力,先前的方法试图利用这些模型相应地增强机器人系统。然而,这些方法通常侧重于借助额外的MLLM进行高层规划纠正,而对利用失败样本来纠正低层接触位姿的探索有限。为了弥补这一差距,我们提出了一种自主交互纠正(AIC)MLLM,利用先前的低层交互经验来纠正SE(3)位姿预测。具体而言,AIC MLLM首先经过微调,以获得位姿预测和反馈提示理解能力。我们通过与物体的交互精心设计了两类提示指令:1)视觉掩码,突出不可移动部分以进行位置纠正;2)文本描述,指示旋转纠正的潜在方向。在推理过程中,引入反馈信息提取模块来识别失败原因,使AIC MLLM能够利用相应提示自适应地纠正位姿预测。为进一步增强操作稳定性,我们设计了一种测试时自适应策略,使AIC MLLM能更好地适应当前场景配置。最后,我们在仿真和真实环境中进行了大量实验来评估所提出的方法。结果表明,我们的AIC MLLM能够借助交互经验提示高效地纠正失败样本。真实世界演示见https://sites.google.com/view/aic-mllm。

更新时间: 2024-06-17 13:44:53

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2406.11548v1

GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations

Large pre-trained language models have become popular for many applications and form an important backbone of many downstream tasks in natural language processing (NLP). Applying 'explainable artificial intelligence' (XAI) techniques to enrich such models' outputs is considered crucial for assuring their quality and shedding light on their inner workings. However, large language models are trained on a plethora of data containing a variety of biases, such as gender biases, affecting model weights and, potentially, behavior. Currently, it is unclear to what extent such biases also impact model explanations in possibly unfavorable ways. We create a gender-controlled text dataset, GECO, in which otherwise identical sentences appear in male and female forms. This gives rise to ground-truth 'world explanations' for gender classification tasks, enabling the objective evaluation of the correctness of XAI methods. We also provide GECOBench, a rigorous quantitative evaluation framework benchmarking popular XAI methods, applying them to pre-trained language models fine-tuned to different degrees. This allows us to investigate how pre-training induces undesirable bias in model explanations and to what extent fine-tuning can mitigate such explanation bias. We show a clear dependency between explanation performance and the number of fine-tuned layers, where XAI methods are observed to particularly benefit from fine-tuning or complete retraining of embedding layers. Remarkably, this relationship holds for models achieving similar classification performance on the same task. With that, we highlight the utility of the proposed gender-controlled dataset and novel benchmarking approach for research and development of novel XAI methods. All code including dataset generation, model training, evaluation and visualization is available at: https://github.com/braindatalab/gecobench

Updated: 2024-06-17 13:44:37

标题: GECOBench:一个用于量化解释中偏见的性别控制文本数据集和基准

摘要: 大型预训练语言模型已经成为许多应用程序的热门选择,并且是自然语言处理(NLP)中许多下游任务的重要支柱。将"可解释人工智能"(XAI)技术应用于丰富这些模型的输出,被认为对于确保其质量和揭示其内部运作方式至关重要。然而,大型语言模型是在包含各种偏差(例如性别偏差)的大量数据上训练的,这可能影响模型权重乃至行为。目前尚不清楚这些偏差在多大程度上会以不利的方式影响模型的解释。我们创建了一个性别受控的文本数据集GECO,其中除性别外完全相同的句子分别以男性和女性形式出现。这为性别分类任务产生了真值"世界解释",使得能够客观评估XAI方法的正确性。我们还提供了GECOBench,一个严格的定量评估框架,对流行的XAI方法进行基准测试,将它们应用于不同程度微调的预训练语言模型。这使我们能够考察预训练如何在模型解释中引入不良偏差,以及微调能在多大程度上缓解这种解释偏差。我们展示了解释性能与微调层数之间的明显依赖关系,并观察到XAI方法尤其受益于嵌入层的微调或完全重训练。值得注意的是,这种关系在同一任务上取得相近分类性能的模型中同样成立。由此,我们强调了所提出的性别受控数据集和新颖基准方法对新型XAI方法研发的实用价值。所有代码,包括数据集生成、模型训练、评估和可视化,均可在以下网址获取:https://github.com/braindatalab/gecobench

更新时间: 2024-06-17 13:44:37

领域: cs.LG,cs.AI,cs.CL,cs.CY

下载: http://arxiv.org/abs/2406.11547v1

KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection

SemEval-2024 Task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection. Such a detection is important for preventing a potential misuse of large language models (LLMs), the newest of which are very capable in generating multilingual human-like texts. We have coped with this task in multiple ways, utilizing language identification and parameter-efficient fine-tuning of smaller LLMs for text classification. We have further used the per-language classification-threshold calibration to uniquely combine fine-tuned models predictions with statistical detection metrics to improve generalization of the system detection performance. Our submitted method achieved competitive results, ranking at the fourth place, just under 1 percentage point behind the winner.
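
Per-language threshold calibration is simple to illustrate: for each language, pick the decision threshold that maximizes a validation metric. A hedged sketch with illustrative names, not the team's code:

```python
import numpy as np

def calibrate_thresholds(scores, labels, langs, grid=np.linspace(0.05, 0.95, 19)):
    """Choose, per language, the threshold with the best validation accuracy."""
    scores, labels, langs = map(np.asarray, (scores, labels, langs))
    thresholds = {}
    for lang in set(langs):
        m = langs == lang
        accs = [((scores[m] >= t) == labels[m]).mean() for t in grid]
        thresholds[lang] = float(grid[int(np.argmax(accs))])
    return thresholds

scores = [0.2, 0.7, 0.9, 0.4, 0.6, 0.8]   # P(machine-generated) from the model
labels = [0, 1, 1, 0, 1, 1]
langs = ["en", "en", "en", "de", "de", "de"]
print(calibrate_thresholds(scores, labels, langs))
```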

Updated: 2024-06-17 13:43:28

标题: KInIT在SemEval-2024任务8中的表现:用于多语言机器生成文本检测的微调LLM

摘要: SemEval-2024任务8聚焦于多生成器、多领域和多语言的黑盒机器生成文本检测。这种检测对于防止大型语言模型(LLMs)的潜在滥用至关重要,最新的LLM已能够生成与人类高度相似的多语言文本。我们通过多种方式应对这一任务,利用语言识别和对较小LLM进行参数高效微调来进行文本分类。我们进一步利用按语言的分类阈值校准,将微调模型的预测与统计检测指标独特地结合起来,以提高系统检测性能的泛化能力。我们提交的方法取得了有竞争力的结果,排名第四,仅落后冠军不到1个百分点。

更新时间: 2024-06-17 13:43:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2402.13671v2

Do Parameters Reveal More than Loss for Membership Inference?

Membership inference attacks aim to infer whether an individual record was used to train a model, serving as a key tool for disclosure auditing. While such evaluations are useful to demonstrate risk, they are computationally expensive and often make strong assumptions about potential adversaries' access to models and training environments, and thus do not provide very tight bounds on leakage from potential attacks. We show how prior claims around black-box access being sufficient for optimal membership inference do not hold for most useful settings such as stochastic gradient descent, and that optimal membership inference indeed requires white-box access. We validate our findings with a new white-box inference attack IHA (Inverse Hessian Attack) that explicitly uses model parameters by taking advantage of computing inverse-Hessian vector products. Our results show that both audits and adversaries may be able to benefit from access to model parameters, and we advocate for further research into white-box methods for membership privacy auditing.

Updated: 2024-06-17 13:42:28

标题: 对于成员推断,参数是否比损失揭示得更多?

摘要: 成员推断攻击旨在推断某条个人记录是否被用于训练模型,是信息披露审计的关键工具。虽然此类评估有助于展示风险,但其计算代价高昂,并且通常对潜在对手访问模型和训练环境的能力做出很强的假设,因此无法对潜在攻击造成的泄露给出很紧的界。我们证明,先前关于黑盒访问足以实现最优成员推断的论断,在随机梯度下降等大多数实用场景中并不成立,最优成员推断实际上需要白盒访问。我们通过一种新的白盒推断攻击IHA(逆Hessian攻击)验证了这一发现,该攻击通过计算逆Hessian向量积来显式利用模型参数。我们的结果表明,审计方和攻击者都可能从访问模型参数中获益,我们主张进一步研究用于成员隐私审计的白盒方法。

更新时间: 2024-06-17 13:42:28

领域: cs.LG,cs.AI,cs.CR

下载: http://arxiv.org/abs/2406.11544v1

Improving Quality Control of Whole Slide Images by Explicit Artifact Augmentation

The problem of artifacts in whole slide image acquisition, prevalent in both clinical workflows and research-oriented settings, necessitates human intervention and re-scanning. Overcoming this challenge requires developing quality control algorithms, that are hindered by the limited availability of relevant annotated data in histopathology. The manual annotation of ground-truth for artifact detection methods is expensive and time-consuming. This work addresses the issue by proposing a method dedicated to augmenting whole slide images with artifacts. The tool seamlessly generates and blends artifacts from an external library to a given histopathology dataset. The augmented datasets are then utilized to train artifact classification methods. The evaluation shows their usefulness in classification of the artifacts, where they show an improvement from 0.10 to 0.01 AUROC depending on the artifact type. The framework, model, weights, and ground-truth annotations are freely released to facilitate open science and reproducible research.
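
Alpha-blending an artifact crop onto clean patches is the heart of such augmentation. A minimal numpy sketch, assuming the artifact comes with an alpha mask (my assumption; the paper's blending pipeline may differ):

```python
import numpy as np

def blend_artifact(patch, artifact, alpha_mask, top_left=(0, 0)):
    """Alpha-blend an artifact crop (h, w, 3) onto a slide patch (H, W, 3).

    alpha_mask: (h, w) in [0, 1]; 1 where the artifact fully covers tissue.
    """
    out = patch.astype(np.float32)
    y, x = top_left
    h, w = artifact.shape[:2]
    a = alpha_mask[..., None].astype(np.float32)
    out[y:y + h, x:x + w] = a * artifact + (1 - a) * out[y:y + h, x:x + w]
    return out.astype(patch.dtype)

patch = np.full((256, 256, 3), 220, dtype=np.uint8)  # bright tissue patch
smudge = np.full((64, 64, 3), 90, dtype=np.uint8)    # dark "artifact" crop
mask = np.full((64, 64), 0.6)                        # semi-transparent blend
augmented = blend_artifact(patch, smudge, mask, top_left=(96, 96))
print(augmented[128, 128], patch[128, 128])          # blended vs original pixel
```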

Updated: 2024-06-17 13:39:31

标题: 通过明确的伪影增强改善全切片图像的质量控制

摘要: 整张切片图像采集中的伪影问题在临床工作流程和研究环境中普遍存在,往往需要人工干预和重新扫描。克服这一挑战需要开发质量控制算法,而组织病理学中相关标注数据的稀缺阻碍了这类算法的开发。为伪影检测方法手动标注真值既昂贵又耗时。本研究提出了一种专门向整张切片图像中添加伪影的增强方法来解决这一问题。该工具可以将外部伪影库中的伪影无缝地生成并混合到给定的组织病理学数据集中,随后利用增强后的数据集训练伪影分类方法。评估显示了其在伪影分类中的有用性:依伪影类型不同,AUROC改善幅度在0.10到0.01之间。该框架、模型、权重和真值标注均已免费发布,以促进开放科学和可重复研究。

更新时间: 2024-06-17 13:39:31

领域: cs.CV,cs.AI,cs.CE

下载: http://arxiv.org/abs/2406.11538v1

Decentralized Credential Verification

This paper presents a decentralized application (dApp) for secure and efficient digital credential management using blockchain and verifiable credentials. The dApp supports OID4VC and SD-JWT-compliant wallets for privacy-preserving credential management. Primarily demonstrated through resume verification, the framework is versatile across various domains. By integrating Decentralized Identifiers and advanced cryptographic methods, the dApp addresses inefficiency, high costs, and fraud vulnerabilities, providing a robust solution for modern credential management.

Updated: 2024-06-17 13:37:44

标题: 分散式凭证验证

摘要: 本文介绍了一种使用区块链和可验证凭证进行安全、高效数字凭证管理的去中心化应用(dApp)。该dApp支持符合OID4VC和SD-JWT标准的钱包,以实现保护隐私的凭证管理。该框架主要以简历验证为例进行演示,但可通用于多个领域。通过集成去中心化标识符(DID)和先进的加密方法,该dApp解决了低效、高成本和易受欺诈等问题,为现代凭证管理提供了强大的解决方案。

更新时间: 2024-06-17 13:37:44

领域: cs.CR,cs.NI

下载: http://arxiv.org/abs/2406.11535v1

MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method -- MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.
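
The abstract frames reasoning as search over candidate reasoning paths guided by a score; below is a generic best-first sketch of that framing. Both propose_steps and score are placeholders (say, an LLM sampler and a reward model); M*'s concrete search ideas are not specified here and will differ in detail:

```python
import heapq

def best_first_reasoning(question, propose_steps, score, max_expansions=50):
    """Best-first search over partial reasoning paths (higher score = better)."""
    frontier = [(-score([question]), [question])]
    best = [question]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg, path = heapq.heappop(frontier)
        if -neg > score(best):
            best = path
        for step in propose_steps(path):
            new_path = path + [step]
            heapq.heappush(frontier, (-score(new_path), new_path))
    return best

# Toy check: steps are digits, the "reward" prefers paths summing to 10.
propose = lambda path: ["3", "5", "2"] if len(path) < 4 else []
reward = lambda path: -abs(10 - sum(int(s) for s in path[1:]))
print(best_first_reasoning("q", propose, reward))  # e.g. ['q', '5', '5']
```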

Updated: 2024-06-17 13:37:39

标题: MindStar:在推理时增强预训练LLM的数学推理能力

摘要: 尽管大型语言模型(LLMs)在各种任务中取得了显著的表现,但它们通常在复杂的推理任务中遇到困难,比如回答数学问题。最近为解决这一问题所做的努力主要集中在利用数学数据集进行监督微调或自我改进技术。然而,这些方法通常依赖于难以准备的高质量数据集,或者需要大量的计算资源进行微调。受到LLMs知道如何产生正确答案但难以选择正确推理路径的发现的启发,我们提出了一种纯推理为基础的搜索方法--MindStar(M*)。该方法将推理任务形式化为搜索问题,并提出了两种搜索思路来确定最佳推理路径。我们在GSM8K和MATH数据集上评估了M*框架,将其表现与现有的开源和闭源LLMs进行比较。我们的结果表明,M*显著增强了开源模型(如Llama-2-13B和Mistral-7B)的推理能力,并实现了与GPT-3.5和Grok-1相当的性能,但模型大小和计算成本大大降低。

更新时间: 2024-06-17 13:37:39

领域: cs.LG

下载: http://arxiv.org/abs/2405.16265v2

Explainable Artificial Intelligence and Multicollinearity : A Mini Review of Current Approaches

Explainable Artificial Intelligence (XAI) methods help to understand the internal mechanisms of machine learning models and how they reach a specific decision or take a specific action. A list of informative features is one of the most common outputs of XAI methods. Multicollinearity is one of the major issues that should be considered when XAI generates explanations in terms of the most informative features in an AI system. No review has yet been dedicated to investigating the current approaches to handling this significant issue. In this paper, we provide a review of the current state-of-the-art approaches to XAI in the context of recent advances in dealing with the multicollinearity issue. To do so, we searched three repositories, namely Web of Science, Scopus and IEEE Xplore, to find pertinent published papers. After excluding irrelevant papers, seven papers were considered in the review. In addition, we discuss the current XAI methods and their limitations in dealing with multicollinearity and suggest future directions.

Updated: 2024-06-17 13:26:53

标题: 可解释的人工智能与多重共线性:当前方法的小型综述

摘要: 可解释的人工智能(XAI)方法有助于理解机器学习模型的内部机制以及它们如何达到特定决定或做出特定行为。信息特征列表是XAI方法最常见的输出之一。在AI系统中生成最具信息价值特征的解释时,多重共线性是一个应该考虑的重要问题之一。迄今为止,尚未专门审查如何处理这一重要问题。本文对当前处理多重共线性问题的最新方法进行了综述。为此,我们在三个数据库中进行搜索:Web of Science、Scopus和IEEE Xplore,以找到相关的发表论文。在排除不相关的论文后,考虑了七篇论文进行审查。此外,我们讨论了当前XAI方法及其在处理多重共线性方面的局限性,并提出了未来的方向。

更新时间: 2024-06-17 13:26:53

领域: cs.AI,cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.11524v1

Security in IS and social engineering -- an overview and state of the art

Major transformations related to information technologies affect Information Systems (IS) that support the business processes of organizations and their actors. Deployment in a complex environment involving sensitive, massive and heterogeneous data generates risks with legal, social and financial impacts. This context of transition and openness makes the security of these IS central to the concerns of organizations. The digitization of all processes and the opening to IoT devices (Internet of Things) has fostered the emergence of a new form of crime, i.e. cybercrime. This generic term covers a number of malicious acts, the majority of which are now perpetrated using social engineering strategies, a phenomenon enabling a combined exploitation of "human" vulnerabilities and digital tools. The maliciousness of such attacks lies in the fact that they turn users into facilitators of cyber-attacks, to the point of being perceived as the "weak link" of cybersecurity. As deployment policies prove insufficient, it is necessary to think about upstream steps: knowing how to anticipate, identifying weak signals and outliers, detecting early and reacting quickly to computer crime are therefore priority issues requiring a prevention and cooperation approach. In this overview, we propose a synthesis of literature and professional practices on this subject.

Updated: 2024-06-17 13:25:27

标题: 信息系统安全与社会工程学——概述与现状

摘要: 与信息技术相关的重大变革正在影响支撑组织业务流程及其参与者的信息系统(IS)。在涉及敏感、海量和异构数据的复杂环境中部署,会产生带有法律、社会和财务影响的风险。这种转型与开放的背景使得信息系统的安全成为组织关注的核心。所有流程的数字化以及向物联网(IoT)设备的开放,催生了一种新形式的犯罪,即网络犯罪。这一统称涵盖许多恶意行为,其中大多数如今是借助社会工程策略实施的,这种手段使"人"的脆弱性与数字工具得以被结合利用。此类攻击的恶毒之处在于把用户变成网络攻击的帮凶,以至于用户被视为网络安全的"薄弱环节"。鉴于部署层面的策略已被证明不足,有必要考虑更上游的环节:懂得预判、识别弱信号与异常、及早发现并快速响应计算机犯罪,因而成为需要采取预防与合作思路的优先事项。在这篇综述中,我们对这一主题的文献和专业实践进行了综合梳理。

更新时间: 2024-06-17 13:25:27

领域: cs.CR,cs.DB

下载: http://arxiv.org/abs/2406.12938v1

FullCert: Deterministic End-to-End Certification for Training and Inference of Neural Networks

Modern machine learning models are sensitive to the manipulation of both the training data (poisoning attacks) and inference data (adversarial examples). Recognizing this issue, the community has developed many empirical defenses against both attacks and, more recently, provable certification methods against inference-time attacks. However, such guarantees are still largely lacking for training-time attacks. In this work, we present FullCert, the first end-to-end certifier with sound, deterministic bounds, which proves robustness against both training-time and inference-time attacks. We first bound all possible perturbations an adversary can make to the training data under the considered threat model. Using these constraints, we bound the perturbations' influence on the model's parameters. Finally, we bound the impact of these parameter changes on the model's prediction, resulting in joint robustness guarantees against poisoning and adversarial examples. To facilitate this novel certification paradigm, we combine our theoretical work with a new open-source library BoundFlow, which enables model training on bounded datasets. We experimentally demonstrate FullCert's feasibility on two different datasets.
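
The elementary operation behind such deterministic bounds is interval arithmetic pushed through each layer: every value is replaced by a [lo, hi] box that provably contains all reachable values. A sketch of the inference-time half of the story (bounding the effect of training-data perturbations on the parameters is the paper's harder contribution):

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b, soundly."""
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
x, eps = rng.normal(size=3), 0.1        # input and perturbation budget
lo, hi = interval_relu(*interval_affine(x - eps, x + eps, W, b))
print(lo <= hi)                         # sound bounds: element-wise True
```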

Updated: 2024-06-17 13:23:52

标题: FullCert:神经网络训练和推理的确定性端到端认证

摘要: 现代机器学习模型对训练数据的操纵(投毒攻击)和推断数据的操纵(对抗样本)都很敏感。认识到这一问题,社区已经开发出许多针对这两类攻击的经验性防御,最近还提出了针对推断时攻击的可证明认证方法。然而,针对训练时攻击,这类保证在很大程度上仍然缺失。在这项工作中,我们提出了FullCert,第一个具有可靠确定性界的端到端认证器,可证明对训练时和推断时攻击的稳健性。我们首先对所考虑威胁模型下对手可能施加于训练数据的所有扰动给出界;利用这些约束,进一步界定扰动对模型参数的影响;最后界定这些参数变化对模型预测的影响,从而获得针对投毒和对抗样本的联合稳健性保证。为了支撑这一新的认证范式,我们将理论工作与新的开源库BoundFlow相结合,该库支持在有界数据集上进行模型训练。我们在两个不同的数据集上实验验证了FullCert的可行性。

更新时间: 2024-06-17 13:23:52

领域: cs.LG,cs.AI,cs.CR

下载: http://arxiv.org/abs/2406.11522v1

CAT: A Causally Graph Attention Network for Trimming Heterophilic Graph

Local Attention-guided Message Passing Mechanism (LAMP) adopted in Graph Attention Networks (GATs) is designed to adaptively learn the importance of neighboring nodes for better local aggregation on the graph, which can bring the representations of similar neighbors closer effectively, thus showing stronger discrimination ability. However, existing GATs suffer from a significant discrimination ability decline in heterophilic graphs because the high proportion of dissimilar neighbors can weaken the self-attention of the central node, jointly resulting in the deviation of the central node from similar nodes in the representation space. This kind of effect generated by neighboring nodes is called the Distraction Effect (DE) in this paper. To estimate and weaken the DE of neighboring nodes, we propose a Causally graph Attention network for Trimming heterophilic graph (CAT). To estimate the DE, since the DE are generated through two paths (grab the attention assigned to neighbors and reduce the self-attention of the central node), we use Total Effect to model DE, which is a kind of causal estimand and can be estimated from intervened data; To weaken the DE, we identify the neighbors with the highest DE (we call them Distraction Neighbors) and remove them. We adopt three representative GATs as the base model within the proposed CAT framework and conduct experiments on seven heterophilic datasets in three different sizes. Comparative experiments show that CAT can improve the node classification accuracy of all base GAT models. Ablation experiments and visualization further validate the enhancement of discrimination ability brought by CAT. The source code is available at https://github.com/GeoX-Lab/CAT.

Updated: 2024-06-17 13:22:15

标题: CAT:一种用于修剪异构图的因果图注意力网络

摘要: 图注意力网络(GATs)中采用的局部注意力引导消息传递机制(LAMP)旨在自适应地学习相邻节点的重要性,以便在图上实现更好的局部聚合,从而有效拉近相似邻居的表示,展现更强的判别能力。然而,现有GAT在异质图(heterophilic graph)中的判别能力显著下降,因为高比例的不相似邻居会削弱中心节点的自注意力,二者共同导致中心节点在表示空间中偏离相似节点。本文将这种由相邻节点引起的效应称为干扰效应(DE)。为了估计并削弱相邻节点的DE,我们提出了一种用于修剪异质图的因果图注意力网络(CAT)。在估计DE方面,由于DE通过两条路径产生(抢占分配给邻居的注意力并削弱中心节点的自注意力),我们使用总效应(Total Effect)来建模DE,这是一种可以从干预数据中估计的因果估计量;在削弱DE方面,我们识别出DE最高的邻居(称之为干扰邻居)并将其移除。我们在所提出的CAT框架中采用三种有代表性的GAT作为基础模型,并在三种不同规模的七个异质数据集上进行实验。对比实验表明,CAT可以提高所有基础GAT模型的节点分类准确率。消融实验和可视化进一步验证了CAT带来的判别能力提升。源代码可在https://github.com/GeoX-Lab/CAT获取。

更新时间: 2024-06-17 13:22:15

领域: cs.LG,cs.AI,cs.SI

下载: http://arxiv.org/abs/2312.08672v3

Revisiting Spurious Correlation in Domain Generalization

Without loss of generality, existing machine learning techniques may learn spurious correlation dependent on the domain, which exacerbates the generalization of models in out-of-distribution (OOD) scenarios. To address this issue, recent works build a structural causal model (SCM) to describe the causality within data generation process, thereby motivating methods to avoid the learning of spurious correlation by models. However, from the machine learning viewpoint, such a theoretical analysis omits the nuanced difference between the data generation process and representation learning process, resulting in that the causal analysis based on the former cannot well adapt to the latter. To this end, we explore to build a SCM for representation learning process and further conduct a thorough analysis of the mechanisms underlying spurious correlation. We underscore that adjusting erroneous covariates introduces bias, thus necessitating the correct selection of spurious correlation mechanisms based on practical application scenarios. In this regard, we substantiate the correctness of the proposed SCM and further propose to control confounding bias in OOD generalization by introducing a propensity score weighted estimator, which can be integrated into any existing OOD method as a plug-and-play module. The empirical results comprehensively demonstrate the effectiveness of our method on synthetic and large-scale real OOD datasets.
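
Propensity-score weighting is a standard causal-inference tool; here is a minimal sketch of the weighted-estimator idea on a toy confounded dataset (plain inverse-propensity weighting, not necessarily the paper's exact estimator):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_mean(y, treated, x):
    """Propensity-weighted mean outcome under treatment (Hajek estimator)."""
    p = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]
    w = treated / np.clip(p, 1e-3, 1 - 1e-3)
    return (w * y).sum() / w.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 2))
treated = (rng.random(2000) < 1 / (1 + np.exp(-x[:, 0]))).astype(int)  # confounded
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=2000)    # outcome driven by confounder
print(y[treated == 1].mean())                      # naive mean: biased upward
print(ipw_mean(y, treated, x))                     # reweighted: closer to 1.0
```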

Updated: 2024-06-17 13:22:00

标题: 重新审视领域泛化中的伪相关性

摘要: 在不失一般性的情况下,现有的机器学习技术可能会学习依赖于领域的虚假相关性,这加剧了模型在超出分布(OOD)场景中的泛化。为了解决这个问题,最近的研究建立了一个结构因果模型(SCM)来描述数据生成过程中的因果关系,从而激发了避免模型学习虚假相关性的方法。然而,从机器学习的角度来看,这样的理论分析忽略了数据生成过程和表示学习过程之间微妙的差异,导致基于前者的因果分析不能很好地适应后者。因此,我们探索建立一个用于表示学习过程的SCM,并进一步对虚假相关性的机制进行彻底分析。我们强调调整错误的协变量会引入偏差,因此需要根据实际应用场景正确选择虚假相关性机制。在这方面,我们证实了提出的SCM的正确性,并进一步提出通过引入倾向得分加权估计器来控制OOD泛化中的混杂偏差,该方法可以集成到任何现有的OOD方法中作为即插即用模块。实证结果全面展示了我们的方法在合成和大规模真实OOD数据集上的有效性。

更新时间: 2024-06-17 13:22:00

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.11517v1

Obfuscating IoT Device Scanning Activity via Adversarial Example Generation

Nowadays, attackers target Internet of Things (IoT) devices for security exploitation, and search engines for devices and services compromise user privacy, exposing IP addresses, open ports, device types, vendors, and products. Typically, application banners are used to recognize IoT device profiles during network measurement and reconnaissance. In this paper, we propose a novel approach to obfuscating IoT device banners (BANADV) based on adversarial examples. The key idea is to exploit the susceptibility of fingerprinting techniques to slight perturbations of an IoT device banner. By modifying device banners, BANADV disrupts the collection of IoT device profiles. To validate the efficacy of BANADV, we conduct a set of experiments. Our evaluation results show that adversarial examples can spoof state-of-the-art fingerprinting techniques, including learning- and matching-based approaches. We further provide a detailed analysis of the weakness of learning- and matching-based fingerprints against carefully crafted samples. Overall, the innovations of BANADV lie in three aspects: (1) it utilizes an IoT-related semantic space and a visual similarity space to locate available manipulating perturbations of IoT banners; (2) it achieves at least an 80% success rate in spoofing IoT scanning techniques; and (3) it is the first to utilize adversarial examples of IoT banners in network measurement and reconnaissance.
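
To make the banner-perturbation idea concrete, here is a heavily simplified sketch: a greedy search over a small homoglyph table standing in for the visual-similarity space, run against an arbitrary fingerprinting classifier. The SUBS table, budget, and `fingerprint` callable are all assumptions, not BANADV's actual search.

    # hypothetical homoglyph map approximating a visual-similarity space
    SUBS = {"o": "0", "l": "1", "e": "3", "a": "@", "s": "5"}

    def perturb_banner(banner, fingerprint, budget=3):
        # greedily apply substitutions until the fingerprinting model's
        # prediction flips away from the original device profile
        best = banner
        for ch, rep in SUBS.items():
            if budget == 0:
                break
            cand = best.replace(ch, rep, 1)
            if cand != best and fingerprint(cand) != fingerprint(banner):
                best, budget = cand, budget - 1
        return best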

Updated: 2024-06-17 13:21:40

标题: 通过对抗性示例生成对IoT设备扫描活动进行混淆

摘要: 如今,攻击者针对物联网(IoT)设备进行安全利用,搜索引擎用于设备和服务会损害用户隐私,包括IP地址、开放端口、设备类型、供应商和产品。通常情况下,应用程序横幅用于在网络测量和侦察过程中识别物联网设备配置文件。在本文中,我们提出了一种基于对抗样本的模糊化物联网设备横幅(BANADV)的新方法。关键思想是探索对物联网设备横幅进行轻微扰动的指纹技术的敏感性。通过修改设备横幅,BANADV破坏了物联网设备配置文件的收集。为验证BANADV的有效性,我们进行了一系列实验。我们的评估结果显示,对抗样本可以欺骗包括基于学习和匹配的方法在内的最先进的指纹技术。我们进一步对基于学习/匹配的指纹的弱点进行了详细分析,以精心制作样本。总的来说,BANADV的创新点在三个方面:(1)它利用物联网相关的语义空间和视觉相似性空间来定位物联网横幅的可用操纵扰动;(2)至少实现了80\%的成功率来欺骗物联网扫描技术;(3)它是首次利用物联网横幅的对抗样本进行网络测量和侦察。

更新时间: 2024-06-17 13:21:40

领域: cs.CR

下载: http://arxiv.org/abs/2406.11515v1

Attention Meets Post-hoc Interpretability: A Mathematical Perspective

Attention-based architectures, in particular transformers, are at the heart of a technological revolution. Interestingly, in addition to helping obtain state-of-the-art results on a wide range of applications, the attention mechanism intrinsically provides meaningful insights on the internal behavior of the model. Can these insights be used as explanations? Debate rages on. In this paper, we mathematically study a simple attention-based architecture and pinpoint the differences between post-hoc and attention-based explanations. We show that they provide quite different results, and that, despite their limitations, post-hoc methods are capable of capturing more useful insights than merely examining the attention weights.
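
The contrast is easy to reproduce on a toy attention model; the sketch below (our illustration, not the paper's construction) compares attention weights with a gradient-times-input post-hoc explanation, and the two token rankings can disagree:

    import torch

    torch.manual_seed(0)
    n, d = 5, 8                                   # tokens, embedding dim
    x = torch.randn(n, d, requires_grad=True)
    q, w_out = torch.randn(d), torch.randn(d)

    a = torch.softmax(x @ q, dim=0)               # attention over tokens
    y = (a.unsqueeze(1) * x).sum(0) @ w_out       # attention-pooled score
    y.backward()

    attention_expl = a.detach()                        # "attention as explanation"
    posthoc_expl = (x.grad * x).sum(1).abs().detach()  # gradient x input saliency
    print(attention_expl.argsort(), posthoc_expl.argsort())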

Updated: 2024-06-17 13:18:30

标题: 关注遇见事后解释性:数学视角

摘要: 基于注意力的架构,特别是变压器,在技术革命的核心。有趣的是,除了帮助获得各种应用的最新结果外,注意机制本质上还提供了有关模型内部行为的有意义见解。这些见解能否用作解释?争论不休。在本文中,我们对一个简单的基于注意力的架构进行数学研究,指出事后和基于注意力的解释之间的差异。我们表明,它们提供了完全不同的结果,并且,尽管存在局限性,事后方法能够捕捉比仅仅检查注意力权重更有用的见解。

更新时间: 2024-06-17 13:18:30

领域: stat.ML,cs.CL,cs.LG

下载: http://arxiv.org/abs/2402.03485v2

Flexible Parametric Inference for Space-Time Hawkes Processes

Many modern spatio-temporal data sets, in sociology, epidemiology or seismology, for example, exhibit self-exciting characteristics, with triggering and clustering behaviors occurring at the same time, that a suitable Hawkes space-time process can accurately capture. This paper aims to develop a fast and flexible parametric inference technique to recover the parameters of the kernel functions involved in the intensity function of a space-time Hawkes process based on such data. Our statistical approach combines three key ingredients: 1) kernels with finite support are considered, 2) the space-time domain is appropriately discretized, and 3) (approximate) precomputations are used. The inference technique we propose then consists of an $\ell_2$ gradient-based solver that is fast and statistically accurate. In addition to describing the algorithmic aspects, numerical experiments have been carried out on synthetic and real spatio-temporal data, providing solid empirical evidence of the relevance of the proposed methodology.
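
A sketch of the discretized intensity under the first two ingredients (finite-support kernels on a discretized space-time grid; the precomputations are omitted). The grid sizes, kernel layout, and function names are assumptions:

    import numpy as np

    def intensity(events, mu, kernel, T=100, G=20, support=(5, 3)):
        # events: (t, x, y) grid indices; kernel[dt-1, dx+ss, dy+ss] is the
        # triggering kernel, nonzero only within the finite support
        st, ss = support
        lam = np.full((T, G, G), mu)                 # baseline rate
        for (t, x, y) in events:
            for dt in range(1, st + 1):              # finite temporal support
                if t + dt >= T:
                    break
                for dx in range(-ss, ss + 1):
                    for dy in range(-ss, ss + 1):
                        xx, yy = x + dx, y + dy
                        if 0 <= xx < G and 0 <= yy < G:
                            lam[t + dt, xx, yy] += kernel[dt - 1, dx + ss, dy + ss]
        return lam

An $\ell_2$ loss between this discretized intensity and the binned event counts can then be minimized over the kernel parameters by any gradient-based solver, which is the role of the third ingredient.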

Updated: 2024-06-17 13:18:18

标题: 弹性参数推断空间时间霍克斯过程

摘要: 许多现代的时空数据集,如社会学、流行病学或地震学等领域,都表现出自激特性,同时触发和聚集行为,适当的Hawkes时空过程能够准确捕捉这些特征。本文旨在开发一种快速灵活的参数推断技术,以恢复基于这些数据的时空Hawkes过程的强度函数中所涉及的核函数的参数。我们的统计方法结合了三个关键因素:1)考虑具有有限支持的核函数,2)适当离散化时空域,3)使用(近似)预计算。我们提出的推断技术包括一个快速且统计准确的基于$\ell_2$梯度的求解器。除了描述算法方面,还对合成和真实的时空数据进行了数值实验,提供了所提出方法的实证证据。

更新时间: 2024-06-17 13:18:18

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2406.06849v2

Decentralized Credential Status Management: A Paradigm Shift in Digital Trust

Public key infrastructures are essential for Internet security, ensuring robust certificate management and revocation mechanisms. The transition from centralized to decentralized systems presents challenges such as trust distribution and privacy-preserving credential management; it is motivated by addressing the single points of failure inherent in centralized systems and by leveraging the transparency and resilience of decentralized technologies. This paper explores the evolution of certificate status management from centralized to decentralized frameworks, focusing on blockchain technology and advanced cryptography. We provide a taxonomy of the challenges of centralized systems and discuss opportunities provided by existing decentralized technologies. Our findings reveal that, although blockchain technologies enhance security and trust distribution, they represent a bottleneck for parallel computation and face inefficiencies in cryptographic computations. For this reason, we propose a framework of decentralized technology components that addresses such shortcomings to advance the paradigm shift toward decentralized credential status management.

Updated: 2024-06-17 13:17:56

标题: 分散式凭证状态管理:数字信任的范式转变

摘要: 公钥基础设施对于互联网安全至关重要,确保了强大的证书管理和吊销机制。从集中式向去中心化系统的过渡面临诸如信任分配和保护隐私的凭证管理等挑战。从集中式向去中心化系统的过渡是为了解决集中式系统固有的单点故障,并利用去中心化技术的透明度和韧性。本文探讨了从集中式到去中心化框架的证书状态管理的演变,重点关注区块链技术和先进密码学。我们提供了对集中式系统挑战的分类,并讨论了现有去中心化技术所提供的机遇。我们的研究发现,尽管区块链技术增强了安全性和信任分配,但它们对于并行计算存在瓶颈,并且在加密计算方面面临效率低下的问题。因此,我们提出了一个去中心化技术组件框架,以解决这些缺陷,推动向去中心化凭证状态管理的范式转变。

更新时间: 2024-06-17 13:17:56

领域: cs.CR,cs.NI

下载: http://arxiv.org/abs/2406.11511v1

A Critical Study of What Code-LLMs (Do Not) Learn

Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increasing size and growing training data, code-LLMs still have limitations, such as suggesting code with syntactic errors or misused variables. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous works have not studied which code properties are not encoded by code-LLMs. In this paper, we conduct a fine-grained analysis of the attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters.
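
The category-level analysis can be sketched as follows (our simplification): split tokens into syntactic tokens and identifiers, then average the attention mass within and across the two groups; the paper's finding is that the cross-group blocks stay weak.

    import numpy as np

    def category_attention(attn, is_identifier):
        # attn: (heads, seq, seq) attention maps; is_identifier: bool per token
        ident = np.asarray(is_identifier)
        syn = ~ident                      # assumes both categories are non-empty
        A = attn.mean(0)                  # average over heads
        return {
            "syn->syn": A[np.ix_(syn, syn)].mean(),
            "id->id":   A[np.ix_(ident, ident)].mean(),
            "syn->id":  A[np.ix_(syn, ident)].mean(),
            "id->syn":  A[np.ix_(ident, syn)].mean(),
        }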

Updated: 2024-06-17 13:11:17

标题: 对代码-LLM(不)学习的批判性研究

摘要: 基于代码语料库训练的大型语言模型(code-LLMs)在各种编码辅助任务中展示出令人印象深刻的性能。然而,尽管它们的规模增大和训练数据集增加,code-LLMs仍然存在一些限制,比如建议带有语法错误、变量误用等代码。一些研究认为,code-LLMs在编码任务中表现良好是因为它们利用自注意力和隐藏表示来编码输入标记之间的关系。然而,先前的研究并没有研究code-LLMs没有编码的代码属性。在本文中,我们对code-LLMs的注意力图和隐藏表示进行了精细的分析。我们的研究表明,code-LLMs只编码特定子集输入标记之间的关系。具体来说,通过将输入标记分类为语法标记和标识符,我们发现模型编码语法标记之间的关系和标识符之间的关系,但它们未能编码语法标记和标识符之间的关系。我们还发现,相对于它们的预训练对应物,微调模型较差地编码这些关系。此外,具有数十亿参数的更大型模型比只有数亿参数的模型编码关于代码的信息显著较少。

更新时间: 2024-06-17 13:11:17

领域: cs.SE,cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.11930v1

Uncovering Challenges of Solving the Continuous Gromov-Wasserstein Problem

Recently, the Gromov-Wasserstein Optimal Transport (GWOT) problem has attracted special attention from the ML community. In this problem, given two distributions supported on two (possibly different) spaces, one has to find the most isometric map between them. In the discrete variant of GWOT, the task is to learn an assignment between given discrete sets of points. In the more advanced continuous formulation, one aims at recovering a parametric mapping between unknown continuous distributions based on i.i.d. samples derived from them. The clear geometrical intuition behind the GWOT makes it a natural choice for several practical use cases, giving rise to a number of proposed solvers. Some of them claim to solve the continuous version of the problem. At the same time, GWOT is notoriously hard, both theoretically and numerically. Moreover, all existing continuous GWOT solvers still heavily rely on discrete techniques. Natural questions arise: to what extent do existing methods unravel the GWOT problem, what difficulties do they encounter, and under which conditions are they successful. Our benchmark paper is an attempt to answer these questions. We specifically focus on the continuous GWOT as the most interesting and debatable setup. We crash-test existing continuous GWOT approaches on different scenarios, carefully record and analyze the obtained results, and identify issues. Our findings experimentally testify that the scientific community is still missing a reliable continuous GWOT solver, which necessitates further research efforts. As the first step in this direction, we propose a new continuous GWOT method which does not rely on discrete techniques and partially solves some of the problems of the competitors. Our code is available at https://github.com/Ark-130994/GW-Solvers.
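
For reference, the discrete variant that continuous solvers still lean on can be run in a few lines; a sketch assuming the POT (Python Optimal Transport) package, with uniform marginals:

    import numpy as np
    import ot  # POT: Python Optimal Transport

    rng = np.random.default_rng(0)
    X, Y = rng.normal(size=(30, 2)), rng.normal(size=(40, 3))  # different spaces

    C1, C2 = ot.dist(X, X), ot.dist(Y, Y)     # intra-space cost matrices
    p, q = ot.unif(len(X)), ot.unif(len(Y))   # uniform marginals

    # discrete GW coupling between the two point clouds
    T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun="square_loss")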

Updated: 2024-06-17 13:10:00

标题: 揭示解决连续Gromov-Wasserstein问题的挑战

摘要: 最近,Gromov-Wasserstein Optimal Transport(GWOT)问题引起了机器学习社区的特别关注。在这个问题中,给定两个支持在两个(可能不同)空间上的分布,需要找到它们之间的最等距映射。在GWOT的离散变体中,任务是学习给定离散点集之间的分配。在更高级的连续形式中,目标是基于从它们导出的独立同分布样本,恢复未知连续分布之间的参数映射。GWOT背后清晰的几何直觉使其成为几个实际用例的自然选择,引发了许多提出的求解器。其中一些声称解决了问题的连续版本。与此同时,GWOT是出了名的难题,无论在理论上还是在数值上。此外,所有现有的连续GWOT求解器仍然严重依赖离散技术。自然而然地引发了一些问题:现有方法在多大程度上揭示了GWOT问题,它们遇到了什么困难,以及在什么条件下它们成功。我们的基准论文试图回答这些问题。我们特别关注连续GWOT作为最有趣和有争议的设置。我们在不同场景下对现有的连续GWOT方法进行了崩溃测试,仔细记录和分析了获得的结果,并确定了问题。我们的研究结果实验证明,科学界仍然缺乏可靠的连续GWOT求解器,这需要进一步的研究努力。作为这方面的第一步,我们提出了一种新的连续GWOT方法,它不依赖离散技术,并部分解决了竞争对手的一些问题。我们的代码可以在https://github.com/Ark-130994/GW-Solvers 上找到。

更新时间: 2024-06-17 13:10:00

领域: cs.LG

下载: http://arxiv.org/abs/2303.05978v2

On the Feasibility of Fidelity$^-$ for Graph Pruning

As one of the most popular quantitative metrics for assessing the quality of explanations of graph neural networks (GNNs), fidelity measures the output difference after removing unimportant parts of the input graph. Fidelity has been widely used due to its straightforward interpretation: the underlying model should produce similar predictions when features deemed unimportant by the explanation are removed. This raises a natural question: "Does fidelity induce a global (soft) mask for graph pruning?" To answer this, we explore the potential of the fidelity measure to be used for graph pruning, eventually enhancing GNN models for better efficiency. To this end, we propose Fidelity$^-$-inspired Pruning (FiP), an effective framework to construct global edge masks from local explanations. Our empirical observations using 7 edge attribution methods demonstrate that, surprisingly, general eXplainable AI methods outperform methods tailored to GNNs in terms of graph pruning performance.
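
A minimal sketch of the construction as we read it (the aggregation and thresholding details are assumptions): average the local edge attributions into a global soft mask, then keep the highest-scoring fraction of edges.

    import numpy as np

    def global_edge_mask(local_scores, edges, ratio=0.5):
        # local_scores: (num_explanations, num_edges) attribution matrix
        score = np.asarray(local_scores).mean(axis=0)   # global soft mask
        k = max(1, int(ratio * len(edges)))
        keep = np.argsort(score)[-k:]                   # hard top-k pruning
        return [edges[i] for i in keep]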

Updated: 2024-06-17 13:05:00

标题: 关于图剪枝中忠实性的可行性

摘要: 作为评估图神经网络(GNNs)解释质量的流行的定量指标之一,忠实度度量在去除输入图的不重要部分后产生的输出差异。忠实度已被广泛使用,因为其直观解释是,在解释中被认为不重要的特征被移除后,底层模型应该产生类似的预测。这引发了一个自然问题:“忠实度是否会引发全局(软)掩蔽以进行图修剪?”为了解决这个问题,我们旨在探索将忠实度度量用于图修剪的潜力,最终提升GNN模型的效率。为此,我们提出了基于忠实度的修剪(FiP)框架,这是一个有效的框架,可以从局部解释中构建全局边缘掩码。我们使用7种边缘归因方法的实证观察表明,令人惊讶的是,一般的可解释AI方法在图修剪性能方面胜过针对GNNs的方法。

更新时间: 2024-06-17 13:05:00

领域: cs.LG,cs.AI,cs.IT,cs.NE,cs.SI,math.IT

下载: http://arxiv.org/abs/2406.11504v1

Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality

Leveraging the development of structural causal models (SCM), researchers can establish graphical models for exploring the causal mechanisms behind machine learning techniques. As the complexity of machine learning applications rises, single-world interventionist causal analysis encounters theoretical adaptation limitations. Accordingly, the cross-world counterfactual approach extends our understanding of causality beyond observed data, enabling hypothetical reasoning about alternative scenarios. However, the joint involvement of cross-world variables, encompassing counterfactual variables and real-world variables, challenges the construction of the graphical model. The twin network is a subtle attempt to bridge the gap between graphical modeling and the introduction of counterfactuals by establishing a symbiotic relationship, albeit with room for improvement in generalization. In this regard, we demonstrate the theoretical breakdowns of twin networks in certain cross-world counterfactual scenarios. To this end, we propose a novel teleporter theory to establish a general and simple graphical representation of counterfactuals, which provides criteria for determining teleporter variables to connect multiple worlds. In theoretical application, we determine that introducing the proposed teleporter theory can directly obtain the conditional independence between counterfactual variables and real-world variables from the cross-world SCM without requiring complex algebraic derivations. Accordingly, we can further identify counterfactual causal effects through cross-world symbolic derivation. We demonstrate the generality of the teleporter theory in practical application. Adhering to the proposed theory, we build a plug-and-play module whose effectiveness is substantiated by experiments on benchmarks.

Updated: 2024-06-17 13:03:44

标题: 传送门理论:建模跨世界反事实因果关系的一种通用简单方法

摘要: 利用结构因果模型(SCM)的发展,研究人员可以建立图形模型,探索机器学习技术背后的因果机制。随着机器学习应用复杂性的提高,单世界干预主义因果分析遇到了理论适应限制。因此,跨世界反事实方法扩展了我们对因果关系的理解,使我们能够对替代情境进行假设推理。然而,跨世界变量的共同涉及,包括反事实变量和现实世界变量,挑战了图形模型的构建。双网络是一个微妙的尝试,建立一种共生关系,以弥合图形建模和引入反事实之间的差距,尽管在泛化方面仍有改进的空间。在这方面,我们展示了双网络在某些跨世界反事实情景中的理论崩溃。为此,我们提出了一种新颖的传送门理论,建立了一个通用且简单的反事实图形表示,为确定连接多个世界的传送门变量提供标准。在理论应用中,我们确定引入所提出的传送门理论可以直接从跨世界SCM中获取反事实变量和现实世界变量之间的条件独立性,无需进行复杂的代数推导。因此,我们可以通过跨世界符号推导进一步确定反事实因果效应。我们展示了传送门理论对实际应用的普遍性。遵循所提出的理论,我们构建了一个即插即用模块,并通过基准实验证实了其有效性。

更新时间: 2024-06-17 13:03:44

领域: cs.LG,cs.AI,stat.ME

下载: http://arxiv.org/abs/2406.11501v1

Long-time asymptotics of noisy SVGD outside the population limit

Stein Variational Gradient Descent (SVGD) is a widely used sampling algorithm that has been successfully applied in several areas of Machine Learning. SVGD operates by iteratively moving a set of interacting particles (which represent the samples) to approximate the target distribution. Despite recent studies on the complexity of SVGD and its variants, their long-time asymptotic behavior (i.e., after numerous iterations) is still not understood in the finite-number-of-particles regime. We study the long-time asymptotic behavior of a noisy variant of SVGD. First, we establish that the limit set of noisy SVGD after a large number of iterations is well-defined. We then characterize this limit set, showing that it approaches the target distribution as the number of iterations increases. In particular, noisy SVGD provably avoids the variance collapse observed for SVGD. Our approach involves demonstrating that the trajectories of noisy SVGD closely resemble those described by a McKean-Vlasov process.
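
For concreteness, here is one plausible form of a noisy SVGD update (a sketch; the paper's exact noise schedule may differ): the usual kernelized drift plus injected Gaussian noise.

    import numpy as np

    def noisy_svgd_step(x, grad_logp, h=1.0, eps=1e-2, sigma=1e-1, rng=None):
        # x: (n, d) particles; grad_logp(x): (n, d) scores of the target
        n = x.shape[0]
        diff = x[:, None, :] - x[None, :, :]            # x_i - x_j
        K = np.exp(-(diff ** 2).sum(-1) / h)            # RBF kernel matrix
        # sum_j grad_{x_j} k(x_j, x_i) = (2/h) sum_j (x_i - x_j) K[i, j]
        repulsion = (2.0 / h) * (diff * K[:, :, None]).sum(1)
        phi = (K @ grad_logp(x) + repulsion) / n        # SVGD drift
        rng = rng or np.random.default_rng()
        return x + eps * phi + sigma * np.sqrt(2 * eps) * rng.normal(size=x.shape)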

Updated: 2024-06-17 13:00:51

标题: 嘈杂SVGD在种群极限外的长时间渐近特性

摘要: Stein变分梯度下降(SVGD)是一种广泛使用的采样算法,在机器学习的几个领域取得了成功应用。SVGD通过迭代地移动一组相互作用的粒子(代表样本)来逼近目标分布。尽管最近对SVGD及其变体的复杂性进行了研究,但在有限数量的粒子制度下,它们的长时间渐近行为(即经过多次迭代后)仍然不为人所理解。我们研究了一种嘈杂的SVGD的长时间渐近行为。首先,我们确定了嘈杂SVGD在大量迭代后的极限集是明确定义的。然后我们表征了这个极限集,表明随着迭代次数增加,它逼近目标分布。特别地,嘈杂的SVGD明显避免了SVGD中观察到的方差崩溃。我们的方法涉及证明嘈杂SVGD的轨迹与McKean-Vlasov过程描述的轨迹非常相似。

更新时间: 2024-06-17 13:00:51

领域: cs.LG,math.PR

下载: http://arxiv.org/abs/2406.11929v1

Online Context Learning for Socially-compliant Navigation

Robot social navigation needs to adapt to different human factors and environmental contexts. However, since these factors and contexts are difficult to predict and cannot be exhaustively enumerated, traditional learning-based methods have difficulty in ensuring the social attributes of robots in long-term and cross-environment deployments. This letter introduces an online context learning method that aims to empower robots to adapt to new social environments online. The proposed method adopts a two-layer structure. The bottom layer is built using a deep reinforcement learning-based method to ensure the output of basic robot navigation commands. The upper layer is implemented using an online robot learning-based method to socialize the control commands suggested by the bottom layer. Experiments using a community-wide simulator show that our method outperforms the state-of-the-art ones. Experimental results in the most challenging scenarios show that our method improves the performance of the state-of-the-art by 8%. The source code of the proposed method, the data used, and the tools for the pre-training step will be publicly available at https://github.com/Nedzhaken/SOCSARL-OL.

Updated: 2024-06-17 12:59:13

标题: 在线环境学习用于社会合规导航

摘要: 机器人社交导航需要适应不同的人类因素和环境背景。然而,由于这些因素和背景很难预测且无法详尽列举,传统的基于学习的方法很难确保机器人在长期和跨环境部署中的社交属性。本文介绍了一种在线环境学习方法,旨在赋予机器人在线适应新的社交环境的能力。所提出的方法采用了一个两层结构。底层采用基于深度强化学习方法构建,以确保基本机器人导航命令的输出。上层则采用在线机器人学习方法实施,以社交化底层建议的控制命令。使用一个社区范围的模拟器进行的实验表明,我们的方法优于最先进的方法。在最具挑战性的场景中的实验结果显示,我们的方法将最先进方法的性能提高了8%。所提出方法的源代码、使用的数据以及用于预训练步骤的工具将公开提供在https://github.com/Nedzhaken/SOCSARL-OL。

更新时间: 2024-06-17 12:59:13

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2406.11495v1

Probabilistic Constrained Reinforcement Learning with Formal Interpretability

Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. Such reasoning in practical implementation, however, poses a persistent challenge in interpreting the reward function and the corresponding optimal policy. Consequently, representing sequential decision-making problems as probabilistic inference can have considerable value, as, in principle, the inference offers diverse and powerful mathematical tools to infer the stochastic dynamics whilst suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges. Our approach uses formal methods to achieve the interpretability for convergence guarantee, training transparency, and intrinsic decision-interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.

Updated: 2024-06-17 12:56:53

标题: 具有形式可解释性的概率约束强化学习

摘要: 强化学习可以为具有可变动力学的顺序决策问题提供有效的推理。然而,在实际实现中,对奖励函数和相应的最优策略的解释仍然是一个持久性挑战。因此,将顺序决策问题表示为概率推理可能具有相当大的价值,因为原则上,推理提供了多样且强大的数学工具,用于推断随机动态,同时提出了策略优化的概率解释。在这项研究中,我们提出了一种新颖的自适应Wasserstein变分优化方法,即AWaVO,来解决这些可解释性挑战。我们的方法使用正式方法实现了收敛保证、训练透明度和内在决策解释的可解释性。为了展示其实用性,我们展示了在模拟和实际四轴飞行器任务中具有最优全局收敛速率的可解释性保证。与TRPO-IPO、PCPO和CRPO等最新基准相比,我们经验性地验证了AWaVO在高性能和足够可解释性之间提供了合理的折衷。

更新时间: 2024-06-17 12:56:53

领域: cs.LG,cs.AI,cs.RO,cs.SY,eess.SY

下载: http://arxiv.org/abs/2307.07084v4

Interventional Imbalanced Multi-Modal Representation Learning via $\beta$-Generalization Front-Door Criterion

Multi-modal methods establish comprehensive superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions constantly degrade the discriminative performance of canonical multi-modal methods. Based on the contribution to task-dependent predictions, modalities can be identified as predominant and auxiliary modalities. Benchmark methods raise a tractable solution: augmenting the auxiliary modality with a minor contribution during training. However, our empirical explorations challenge the fundamental idea behind such behavior, and we further conclude that benchmark approaches suffer from certain defects: insufficient theoretical interpretability and limited exploration capability of discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build the Structural Causal Model. Following the empirical explorations, we determine to capture the true causality between the discriminative knowledge of predominant modality and predictive label while considering the auxiliary modality. Thus, we introduce the $\beta$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations are provided to support the effectiveness of the innate mechanism behind our proposed method.

Updated: 2024-06-17 12:55:56

标题: 干预性不平衡多模态表示学习通过$β$-泛化前门标准

摘要: 多模态方法在综合优势上胜过单模态方法。然而,不同模态对任务相关预测的贡献不平衡不断降低了传统多模态方法的区分性能。根据对任务相关预测的贡献,模态可以被确定为主导模态和辅助模态。基准方法提出了一个可行的解决方案:在训练过程中增加辅助模态的小贡献。然而,我们的实证探索挑战了这种行为背后的基本思想,并进一步得出结论,即基准方法存在一定缺陷:理论可解释性不足和区分性知识的探索能力有限。因此,我们从因果的角度重新审视多模态表示学习,并建立结构因果模型。在实证探索之后,我们决定捕捉主导模态的区分知识和预测标签之间的真实因果关系,同时考虑辅助模态。因此,我们引入了β-广义前门标准。此外,我们提出了一个新颖的网络,以充分探索多模态区分性知识。我们提供了严格的理论分析和各种实证评估,以支持我们提出的方法背后的本质机制的有效性。

更新时间: 2024-06-17 12:55:56

领域: cs.LG,stat.ME

下载: http://arxiv.org/abs/2406.11490v1

Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency

This paper presents the first study for temporal relation extraction in a zero-shot setting focusing on biomedical text. We employ two types of prompts and five LLMs (GPT-3.5, Mixtral, Llama 2, Gemma, and PMC-LLaMA) to obtain responses about the temporal relations between two events. Our experiments demonstrate that LLMs struggle in the zero-shot setting performing worse than fine-tuned specialized models in terms of F1 score, showing that this is a challenging task for LLMs. We further contribute a novel comprehensive temporal analysis by calculating consistency scores for each LLM. Our findings reveal that LLMs face challenges in providing responses consistent to the temporal properties of uniqueness and transitivity. Moreover, we study the relation between the temporal consistency of an LLM and its accuracy and whether the latter can be improved by solving temporal inconsistencies. Our analysis shows that even when temporal consistency is achieved, the predictions can remain inaccurate.
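
The two consistency properties can be checked mechanically; a sketch, assuming relations are labeled BEFORE/AFTER/OVERLAP (a simplification of the actual label set):

    from itertools import permutations

    def consistency_violations(rel):
        # rel: dict mapping ordered event pairs (a, b) to a relation label
        conv = {"BEFORE": "AFTER", "AFTER": "BEFORE", "OVERLAP": "OVERLAP"}
        bad = []
        events = {e for pair in rel for e in pair}
        for a, b, c in permutations(events, 3):          # transitivity
            if rel.get((a, b)) == "BEFORE" and rel.get((b, c)) == "BEFORE":
                if rel.get((a, c)) not in (None, "BEFORE"):
                    bad.append(("transitivity", a, b, c))
        for (a, b), r in rel.items():                    # uniqueness
            if rel.get((b, a)) not in (None, conv[r]):
                bad.append(("uniqueness", a, b))
        return bad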

Updated: 2024-06-17 12:53:21

标题: 使用时间一致性分析临床笔记中的零样本时间关系提取

摘要: 这篇论文提出了首个针对生物医学文本中时间关系提取的零-shot 设置的研究。我们采用了两种类型的提示和五种LLM(GPT-3.5,Mixtral,Llama 2,Gemma和PMC-LLaMA)来获取关于两个事件之间时间关系的响应。我们的实验表明,在零-shot 设置中,LLMs在F1分数方面表现不佳,比经过精细调整的专门模型差,表明这对LLMs来说是一个具有挑战性的任务。我们进一步通过计算每个LLM的一致性得分,为时序分析提供了一个新颖的全面视角。我们的发现揭示了LLMs在提供与唯一性和传递性时间属性一致的响应方面面临挑战。此外,我们研究了LLM的时间一致性与其准确性之间的关系,以及后者是否可以通过解决时间不一致性来改善。我们的分析显示,即使实现了时间一致性,预测仍然可能不准确。

更新时间: 2024-06-17 12:53:21

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11486v1

MADA: Meta-Adaptive Optimizers through hyper-gradient Descent

Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and dynamically search through it using hyper-gradient descent during training. We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters. MADA achieves a greater validation performance improvement over Adam compared to other popular optimizers during GPT-2 training and fine-tuning. We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization. Finally, we provide a convergence analysis to show that parameterized interpolations of optimizers can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
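
The AVGrad change is small enough to sketch directly: where AMSGrad tracks a running maximum of the second-moment estimate, AVGrad averages it. A minimal PyTorch sketch of our reading (bias correction omitted; details may differ from the paper):

    import torch

    class AVGrad(torch.optim.Optimizer):
        def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
            super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                b1, b2 = group["betas"]
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    s = self.state[p]
                    if not s:
                        s["t"], s["m"] = 0, torch.zeros_like(p)
                        s["v"], s["v_avg"] = torch.zeros_like(p), torch.zeros_like(p)
                    s["t"] += 1
                    s["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                    s["v"].mul_(b2).addcmul_(p.grad, p.grad, value=1 - b2)
                    # averaging in place of AMSGrad's running maximum
                    s["v_avg"].mul_((s["t"] - 1) / s["t"]).add_(s["v"], alpha=1 / s["t"])
                    p.addcdiv_(s["m"], s["v_avg"].sqrt().add_(group["eps"]), value=-group["lr"])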

Updated: 2024-06-17 12:52:24

标题: MADA:通过超梯度下降实现元适应优化器

摘要: 在引入Adam之后,提出了几种新型适应性优化器用于深度学习。这些优化器通常在某些任务上表现出色,但可能在所有任务中都不如Adam表现出色。在这项工作中,我们介绍了Meta-Adaptive Optimizers(MADA),这是一个统一的优化器框架,可以概括几种已知的优化器,并在训练过程中动态学习最合适的优化器。MADA的关键思想是对优化器的空间进行参数化,并在训练过程中使用超梯度下降动态搜索。我们通过实验证明,在视觉和语言任务中,MADA与其他流行的优化器相比表现更好,对次优调整的超参数具有鲁棒性。在GPT-2的训练和微调过程中,MADA相对于其他流行的优化器在验证性能方面相对于Adam表现更好。我们还提出了AVGrad,这是AMSGrad的一种修改,将最大操作符替换为平均值,更适合超梯度优化。最后,我们提供了收敛分析,表明优化器的参数化插值可以改善它们的误差界限(直到常数),暗示元优化器的优势。

更新时间: 2024-06-17 12:52:24

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2401.08893v3

Active clustering with bandit feedback

We investigate the Active Clustering Problem (ACP). A learner interacts with an $N$-armed stochastic bandit with $d$-dimensional subGaussian feedback. There exists a hidden partition of the arms into $K$ groups, such that arms within the same group share the same mean vector. The learner's task is to uncover this hidden partition with the smallest budget - i.e., the fewest observations - and with a probability of error smaller than a prescribed constant $\delta$. In this paper, (i) we derive a non-asymptotic lower bound for the budget, and (ii) we introduce the computationally efficient ACB algorithm, whose budget matches the lower bound in most regimes. We improve on the performance of a uniform sampling strategy. Importantly, contrary to the batch setting, we establish that there is no computation-information gap in the active setting.

Updated: 2024-06-17 12:52:19

标题: 使用匪徒反馈的主动聚类

摘要: 我们研究了主动聚类问题(ACP)。一个学习者与一个具有d维次高斯反馈的N臂随机老虎机进行交互。存在一种隐藏的分组将臂分为K组,使得同一组内的臂共享相同的均值向量。学习者的任务是以最小的预算 - 即最少的观察次数 - 并且具有比预定常数δ更小的错误概率,揭示这种隐藏的分组。在本文中,(i)我们推导了预算的非渐近下界,(ii)我们引入了计算高效的ACB算法,其预算在大多数情况下与下界相匹配。我们改进了均匀抽样策略的性能。重要的是,与批处理设置相反,我们确立了在主动设置中没有计算 - 信息差距的事实。

更新时间: 2024-06-17 12:52:19

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2406.11485v1

Quantitative CLTs in Deep Neural Networks

We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions and the entire process, that the distance between a random fully connected network (and its derivatives) to the corresponding infinite width Gaussian process scales like $n^{-\gamma}$ for $\gamma>0$, with the exponent depending on the metric used to measure discrepancy. Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature; in the one-dimensional case, we also prove that they are optimal, i.e., we establish matching lower bounds.

Updated: 2024-06-17 12:51:09

标题: 深度神经网络中的定量中心极限定理

摘要: 我们研究具有随机高斯权重和偏置的全连接神经网络的分布,其中隐藏层宽度与大常数$n$成比例。在对非线性进行温和假设的情况下,我们获得了在大但有限的$n$和任意固定网络深度下有效的正态逼近的定量界限。我们的定理表明,对于有限维分布和整个过程,随机全连接网络(及其导数)与相应的无限宽高斯过程之间的距离按$n^{-\gamma}$的比例缩放,其中$\gamma>0$,指数取决于用于测量差异的度量。我们的界限在网络宽度的依赖性方面严格强于文献中以前可用的任何界限;在一维情况下,我们还证明它们是最优的,即,我们建立了匹配的下界。

更新时间: 2024-06-17 12:51:09

领域: cs.LG,cs.AI,math.PR,stat.ML

下载: http://arxiv.org/abs/2307.06092v5

MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities

In this study, we focus on heterogeneous knowledge transfer across entirely different model architectures, tasks, and modalities. Existing knowledge transfer methods (e.g., backbone sharing, knowledge distillation) often hinge on shared elements within model structures or task-specific features/labels, limiting transfers to complex model types or tasks. To overcome these challenges, we present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models, facilitating the direct interaction, extraction, and application of knowledge within these parameter spaces. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters and adeptly learning to identify and map parameters into the target model. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage, including the training trajectory knowledge of the source model. Extensive experiments on heterogeneous knowledge transfer demonstrate significant improvements in challenging settings, where representative approaches may falter or prove less applicable.

Updated: 2024-06-17 12:50:26

标题: MergeNet:跨异构模型、任务和模态的知识迁移

摘要: 在这项研究中,我们关注完全不同的模型架构、任务和形式之间的异构知识转移。现有的知识转移方法(例如,骨干共享、知识蒸馏)通常依赖于模型结构内的共享元素或任务特定的特征/标签,从而限制了对复杂模型类型或任务的转移。为了克服这些挑战,我们提出了MergeNet,它学会了跨异构模型的参数空间之间的桥梁,促进这些参数空间内知识的直接交互、提取和应用。MergeNet的核心机制在于参数适配器,通过查询源模型的低秩参数并灵活学习识别和映射参数到目标模型中。MergeNet与两个模型一起学习,使我们的框架能够动态转移和调整与当前阶段相关的知识,包括源模型的训练轨迹知识。对异构知识转移进行的大量实验表明,在具有挑战性的设置中,我们的方法能够取得显著改进,而代表性方法可能会失败或不太适用。

更新时间: 2024-06-17 12:50:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2404.13322v2

Position: Understanding LLMs Requires More Than Statistical Generalization

The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart -- thus, equivalent test loss -- can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.

Updated: 2024-06-17 12:48:46

标题: 立场:理解LLMs需要的不仅仅是统计泛化

摘要: 在过去的十年里,深度学习理论的研究蓬勃发展,试图回答“为什么深度学习具有泛化能力?”这一问题。这一进展的推动力是视角的强大转变:研究超参数模型在内插区域的情况。在本文中,我们认为另一个视角的转变是必要的,因为一些LLM(Large Language Models)的可取之处并非源于良好的统计泛化,需要单独的理论解释。我们的核心论点建立在这样一个观察的基础上:AR概率模型本质上是不可识别的,零或接近零的KL散度之间的模型可以表现出明显不同的行为,因此测试损失是等价的。我们通过数学例子和经验观察支持我们的观点,说明为什么不可识别性具有实际相关性,通过三个案例研究展示了这一点:(1)零射规则外推的不可识别性;(2)在上下文学习中的近似不可识别性;以及(3)可微调性的不可识别性。我们回顾着重于LLM相关泛化度量、可转移性和归纳偏差的有前途的研究方向。

更新时间: 2024-06-17 12:48:46

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2405.01964v3

Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest available dependence of $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$ known from prior work. With a 2D grid world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average reward settings.
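
A sketch of an MLMC gradient estimator in the style we understand the paper to use (the level distribution and truncation are assumptions): sample a random level, then debias the cheap low-level gradient with an importance-weighted correction.

    import numpy as np

    def mlmc_gradient(grad_at_level, max_level=10, p=0.5, rng=None):
        # grad_at_level(l): gradient estimate from a trajectory of length 2**l
        rng = rng or np.random.default_rng()
        J = min(rng.geometric(p), max_level)     # P(J = j) = p * (1 - p)**(j - 1)
        weight = p * (1 - p) ** (J - 1)
        # telescoping estimator; truncation at max_level adds a small bias
        return grad_at_level(0) + (grad_at_level(J) - grad_at_level(J - 1)) / weight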

Updated: 2024-06-17 12:47:32

标题: 朝向全球最优性:无需混合时间神谕的实用平均奖励强化学习

摘要: 在平均回报强化学习的背景下,对混合时间的预知要求,即马尔可夫链在固定策略下达到稳定分布所需时间的度量,对策略梯度方法的全局收敛提出了重大挑战。这一要求特别棘手,因为在状态空间较大的环境中估计混合时间的困难和昂贵,导致在实际应用中需要不切实际地长的轨迹来有效估计梯度。为了解决这一限制,我们考虑了多级Actor-Critic(MAC)框架,该框架包含多级蒙特卡洛(MLMC)梯度估计器。通过我们的方法,我们有效地减轻了对混合时间知识的依赖,这是首次实现平均回报MDP全局收敛。此外,我们的方法展现了来自先前工作的最紧密的依赖关系$\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$。通过一个2D网格世界目标导航实验,我们证明MAC优于现有的基于策略梯度的方法在平均回报设置下表现最好。

更新时间: 2024-06-17 12:47:32

领域: cs.LG

下载: http://arxiv.org/abs/2403.11925v4

Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.

Updated: 2024-06-17 12:46:02

标题: 受限制的强化学习与平均奖励目标:基于模型和无模型算法

摘要: 强化学习(RL)作为一种多功能的顺序决策框架,在机器人学、自动驾驶、推荐系统、供应链优化、生物学、力学和金融等各个领域都有应用。这些应用的主要目标是最大化平均奖励。现实世界中的情景通常需要在学习过程中遵守特定的约束条件。 本专著专注于在平均奖励马尔可夫决策过程(MDPs)的背景下,探索各种基于模型和无模型的受限RL方法。研究从对基于模型的策略的检查开始,深入探讨了两种基础方法 - 在面对不确定性时的乐观主义和后验抽样。随后,讨论转向参数化的无模型方法,其中探讨了基于原始-对偶策略梯度算法作为受限MDPs的解决方案。专著提供了后悔保证,并分析了每个讨论的设置中的约束违规情况。 在上述探索中,我们假设底层MDP是遍历的。此外,本专著将讨论扩展到为弱通信MDPs量身定制的结果,从而扩大了其发现的范围以及与更广泛实际情景的相关性。

更新时间: 2024-06-17 12:46:02

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.11481v1

Vocabulary Expansion for Low-resource Cross-lingual Transfer

Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers, vocabulary, and pre-training data, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, the majority of previous work has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion for LLMs in low-resource settings (i.e. languages and compute) has yet to be explored. In this paper, we investigate sample-efficient adaptation strategies from different angles, including target vocabulary size and initialization methods, and the amount of target data available for adaptation. Extensive experiments across typologically diverse languages, tasks and models show that simpler heuristic-based embedding initialization is more efficient and robust to changes in target vocabulary size and adaptation data in low-resource settings, outperforming a popular random initialization and a more sophisticated state-of-the-art approach that relies on external data and model.
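
The simple heuristic initialization the paper favors can be sketched in a few lines (assuming a Hugging-Face-style `encode()`; the fallback for unencodable tokens is our assumption):

    import numpy as np

    def init_new_embeddings(old_emb, old_tokenizer, new_tokens):
        # each new token starts at the mean of the embeddings of the
        # subwords it decomposes into under the old tokenizer
        rows = []
        for tok in new_tokens:
            ids = old_tokenizer.encode(tok)
            rows.append(old_emb[ids].mean(axis=0) if ids else old_emb.mean(axis=0))
        return np.vstack([old_emb, np.stack(rows)])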

Updated: 2024-06-17 12:42:34

标题: “低资源跨语言转移的词汇扩展”

摘要: 大型语言模型(LLMs)在许多非英语语言中展现出卓越的能力。然而,由于它们依赖于以英语为中心的分词器、词汇和预训练数据,在生成非英语文本时需要更多的推理步骤,从而导致非英语使用者的使用成本更高。使用目标语言标记进行词汇扩展是一种广泛使用的跨语言词汇适应方法,以解决这个问题。尽管这种方法在推理加速方面很有效,但大部分先前的工作都集中在高资源设置上,假设可以访问大量目标语言数据,以有效初始化新标记的嵌入并将LLM适应到目标语言。然而,在低资源环境(即语言和计算)中对LLMs进行词汇扩展尚未被探索。在本文中,我们从不同角度研究了样本有效的适应策略,包括目标词汇大小和初始化方法,以及可用于适应的目标数据量。对于在语言学上多样化的语言、任务和模型进行了广泛实验,结果表明,在低资源环境中,简单的基于启发式的嵌入初始化方法更加高效和稳健,对于目标词汇大小和适应数据的变化表现优于流行的随机初始化方法和更复杂的依赖于外部数据和模型的最先进方法。

更新时间: 2024-06-17 12:42:34

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11477v1

How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment

Recent studies have demonstrated that In-Context Learning (ICL), through the use of specific demonstrations, can align Large Language Models (LLMs) with human preferences known as In-Context Alignment (ICA), indicating that models can comprehend human instructions without requiring parameter adjustments. However, the exploration of the mechanism and applicability of ICA remains limited. In this paper, we begin by dividing the context text used in ICA into three categories: format, system prompt, and example. Through ablation experiments, we investigate the effectiveness of each part in enabling ICA to function effectively. We then examine how variants in these parts impact the model's alignment performance. Our findings indicate that the example part is crucial for enhancing the model's alignment capabilities, with changes in examples significantly affecting alignment performance. We also conduct a comprehensive evaluation of ICA's zero-shot capabilities in various alignment tasks. The results indicate that compared to parameter fine-tuning methods, ICA demonstrates superior performance in knowledge-based tasks and tool-use tasks. However, it still exhibits certain limitations in areas such as multi-turn dialogues and instruction following.

Updated: 2024-06-17 12:38:48

标题: In-Context Alignment的实现能达到多远?探索In-Context Alignment的状态

摘要: 最近的研究表明,通过使用特定的演示,上下文学习(ICL)可以将大型语言模型(LLMs)与人类偏好对齐,称为上下文对齐(ICA),表明模型可以理解人类指令而无需参数调整。然而,对ICA的机制和适用性的探讨仍然有限。在本文中,我们首先将ICA中使用的上下文文本分为三类:格式、系统提示和示例。通过消融实验,我们研究了每个部分在使ICA有效运行方面的有效性。然后,我们检查这些部分的变体如何影响模型的对齐性能。我们的研究结果表明,示例部分对于增强模型的对齐能力至关重要,示例的变化显著影响对齐性能。我们还对ICA在各种对齐任务中的零样本能力进行了全面评估。结果表明,与参数微调方法相比,ICA在基于知识的任务和工具使用任务中表现出优越的性能。然而,在多轮对话和遵循指令等领域仍存在一定的局限性。

更新时间: 2024-06-17 12:38:48

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11474v1

Promises, Outlooks and Challenges of Diffusion Language Modeling

The modern autoregressive Large Language Models (LLMs) have achieved outstanding performance on NLP benchmarks, and they are deployed in the real world. However, they still suffer from limitations of the autoregressive training paradigm. For example, autoregressive token generation is notably slow and can be prone to exposure bias. The diffusion-based language models were proposed as an alternative to autoregressive generation to address some of these limitations. We evaluate the recently proposed Score Entropy Discrete Diffusion (SEDD) approach and show it is a promising alternative to autoregressive generation, but it has some shortcomings too. We empirically demonstrate the advantages and challenges of SEDD, and observe that SEDD generally matches autoregressive models in perplexity and on benchmarks such as HellaSwag, Arc or WinoGrande. Additionally, we show that in terms of inference latency, SEDD can be up to 4.5$\times$ more efficient than GPT-2. While SEDD allows conditioning on tokens at arbitrary positions, SEDD appears slightly weaker than GPT-2 for conditional generation given short prompts. Finally, we reproduced the main results from the original SEDD paper.

Updated: 2024-06-17 12:38:38

标题: 承诺、前景和挑战:扩散语言建模

摘要: 现代自回归大型语言模型(LLMs)在自然语言处理基准测试中取得了出色的表现,并且已经被部署在实际世界中。然而,它们仍然受到自回归训练范式的局限性。例如,自回归标记生成明显较慢,且容易受到“曝光偏差”的影响。扩散式语言模型被提出作为自回归生成的替代方案,以解决其中一些限制。我们评估了最近提出的Score Entropy Discrete Diffusion(SEDD)方法,并发现它是一个有希望的替代方案,但也存在一些缺点。我们通过实证方法展示了SEDD的优势和挑战,并观察到SEDD在困惑度和HellaSwag、Arc或WinoGrande等基准测试中通常与自回归模型相匹配。此外,我们显示在推理时延方面,SEDD比GPT-2高效多达4.5倍。虽然SEDD允许在任意位置上进行标记条件,但在给定简短提示的条件生成方面,SEDD似乎略逊于GPT-2。最后,我们复现了原始SEDD论文的主要结果。

更新时间: 2024-06-17 12:38:38

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11473v1

Just How Flexible are Neural Networks in Practice?

It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.

Updated: 2024-06-17 12:24:45

标题: 实践中神经网络有多灵活?

摘要: 广泛认为神经网络可以适应训练集,其中样本数量至少与其参数数量相同,支持过参数化和欠参数化模型的概念。然而,在实践中,我们只能找到通过我们的训练过程(包括优化器和正则化器)可访问的解决方案,限制了灵活性。此外,内置于架构中的函数类参数化方式塑造了其损失曲面,并影响我们发现的极小值。在这项工作中,我们研究神经网络在实践中拟合数据的能力。我们的研究结果表明:(1)标准优化器找到的极小值只能适应具有明显较少样本数量的训练集的模型;(2)卷积网络在随机标记数据上比MLPs和ViTs更具参数效率;(3)尽管随机训练被认为具有正则化效果,但SGD实际上找到的极小值比全批次梯度下降更适应更多的训练数据;(4)适应正确标记和错误标记样本的能力差异可能预测泛化能力;(5)ReLU激活函数导致找到更多数据的极小值,尽管它们被设计用于避免在深层架构中消失和爆炸的梯度。

更新时间: 2024-06-17 12:24:45

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.11463v1

TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) offers an effective approach for addressing question answering (QA) tasks. However, the imperfections of the retrievers in RAG models often result in the retrieval of irrelevant information, which could introduce noises and degrade the performance, especially when handling multi-hop questions that require multiple steps of reasoning. To enhance the multi-hop reasoning ability of RAG models, we propose TRACE. TRACE constructs knowledge-grounded reasoning chains, which are a series of logically connected knowledge triples, to identify and integrate supporting evidence from the retrieved documents for answering questions. Specifically, TRACE employs a KG Generator to create a knowledge graph (KG) from the retrieved documents, and then uses an Autoregressive Reasoning Chain Constructor to build reasoning chains. Experimental results on three multi-hop QA datasets show that TRACE achieves an average performance improvement of up to 14.03% compared to using all the retrieved documents. Moreover, the results indicate that using reasoning chains as context, rather than the entire documents, is often sufficient to correctly answer questions.
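
The paper builds chains autoregressively with an LLM; as a stand-in, here is a greedy sketch with the same interface: grow a chain of triples that stay connected to the entities seen so far and overlap the question. All names here are illustrative.

    def build_chain(question_terms, triples, max_len=4):
        # triples: list of (head, relation, tail) from the generated KG
        chain, frontier = [], set(question_terms)
        for _ in range(max_len):
            scored = [(len(frontier & {h, t}), (h, r, t))
                      for (h, r, t) in triples if (h, r, t) not in chain]
            score, best = max(scored, default=(0, None))
            if best is None or score == 0:
                break
            chain.append(best)                       # extend the reasoning chain
            frontier |= {best[0], best[2]}           # newly grounded entities
        return chain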

Updated: 2024-06-17 12:23:32

标题: 追溯证据:构建基于知识的推理链以实现检索增强生成

摘要: 检索增强生成(RAG)提供了一种有效的方法来解决问答(QA)任务。然而,在RAG模型中的检索器的不完美往往会导致检索到无关信息,这可能会引入噪声并降低性能,特别是在处理需要多步推理的多跳问题时。为了增强RAG模型的多跳推理能力,我们提出了TRACE。TRACE构建了知识基础推理链,这是一系列逻辑连接的知识三元组,用于识别和整合来自检索文档的支持证据以回答问题。具体而言,TRACE利用KG生成器从检索文档创建知识图(KG),然后使用自回归推理链构造器构建推理链。对三个多跳QA数据集的实验结果显示,与使用所有检索文档相比,TRACE实现了高达14.03%的平均性能改进。此外,结果表明,使用推理链作为上下文而不是整个文档通常足以正确回答问题。

更新时间: 2024-06-17 12:23:32

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11460v1

SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials

Recently, interpreting complex charts with logical reasoning has emerged as a challenge due to the development of vision-language models. A prior state-of-the-art (SOTA) model has presented an end-to-end method that leverages the vision-language model to convert charts into table format, utilizing a Large Language Model (LLM) for reasoning. However, unlike natural images, charts contain a mix of essential and irrelevant information required for chart reasoning, and we discover that this characteristic can lower the performance of chart-to-table extraction. In this paper, we introduce SIMPLOT, a method designed to extract only the elements necessary for chart reasoning. The proposed method involves two steps: 1) training to mimic a simple plot that contains only the essential information from a complex chart for table extraction, followed by 2) performing reasoning based on the table. Our model enables accurate chart reasoning without the need for additional annotations or datasets, and its effectiveness is demonstrated through various experiments. Furthermore, we propose a novel prompt mimicking how humans interpret charts for more accurate reasoning. Our source code is available at https://github.com/sangwu99/Simplot.

Updated: 2024-06-17 12:21:33

标题: SIMPLOT:通过提炼要点来增强图表问题回答

摘要: 最近,由于视觉语言模型的发展,利用逻辑推理解释复杂图表已经成为一项挑战。先前的最先进(SOTA)模型提出了一种利用大型语言模型(LLM)将图表转换为表格格式的端到端方法,利用视觉语言模型进行推理。然而,与自然图像不同,图表包含必要和不相关信息的混合,这些信息对于图表推理是必要的,我们发现这种特性可能降低图表到表格提取的性能。在本文中,我们介绍了SIMPLOT,这是一种专门设计用于提取仅对图表推理必要的元素的方法。所提出的方法包括两个步骤:1)训练以模拟一个包含复杂图表中仅必要信息的简单绘图,以进行表格提取,然后2)基于表格进行推理。我们的模型可以进行准确的图表推理,无需额外的注释或数据集,并通过各种实验证明了其有效性。此外,我们提出了一种新颖的提示,模拟人类如何解释图表以进行更准确的推理。我们的源代码可在https://github.com/sangwu99/Simplot 上找到。

更新时间: 2024-06-17 12:21:33

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2405.00021v2

Adversaries With Incentives: A Strategic Alternative to Adversarial Robustness

Adversarial training aims to defend against adversaries: malicious opponents whose sole aim is to harm predictive performance in any way possible - a rather harsh perspective, which we assert results in unnecessarily conservative models. Instead, we propose to model opponents as simply pursuing their own goals, rather than working directly against the classifier. Employing tools from strategic modeling, our approach uses knowledge or beliefs regarding the opponent's possible incentives as inductive bias for learning. Our method of strategic training is designed to defend against opponents within an incentive uncertainty set: this resorts to adversarial learning when the set is maximal, but offers potential gains when it can be appropriately reduced. We conduct a series of experiments that show how even mild knowledge regarding the adversary's incentives can be useful, and that the degree of potential gains depends on how incentives relate to the structure of the learning task.

Updated: 2024-06-17 12:20:59

标题: 拥有激励的对手:对抗鲁棒性的战略替代方案

摘要: 对抗训练旨在防御对手:恶意对手的唯一目的是以任何可能的方式损害预测性能 - 这是一个相当严厉的观点,我们认为这会导致不必要的保守模型。相反,我们建议将对手建模为简单地追求他们自己的目标,而不是直接针对分类器。采用战略建模工具,我们的方法使用关于对手可能激励的知识或信念作为学习的归纳偏差。我们的战略训练方法旨在防御在激励不确定性集合内的对手:当集合是最大的时候,这会导致对抗性学习,但当它可以适当减少时,会提供潜在收益。我们进行了一系列实验,展示了即使是对手激励的轻微了解也可以有用,并且潜在收益的程度取决于激励如何与学习任务的结构相关。

更新时间: 2024-06-17 12:20:59

领域: cs.LG,cs.GT

下载: http://arxiv.org/abs/2406.11458v1

IDVT: Interest-aware Denoising and View-guided Tuning for Social Recommendation

In the information age, recommendation systems are vital for efficiently filtering information and identifying user preferences. Online social platforms have enriched these systems by providing valuable auxiliary information. Socially connected users are assumed to share similar preferences, enhancing recommendation accuracy and addressing cold start issues. However, empirical findings challenge the assumption, revealing that certain social connections can actually harm system performance. Our statistical analysis indicates a significant amount of noise in the social network, where many socially connected users do not share common interests. To address this issue, we propose an innovative Interest-aware Denoising and View-guided Tuning (IDVT) method for social recommendation. The first ID part effectively denoises social connections. Specifically, the denoising process considers both social network structure and user interaction interests in a global view. Moreover, in this global view, we also integrate denoised social information (social domain) into the propagation of the user-item interactions (collaborative domain) and aggregate user representations from two domains using a gating mechanism. To tackle potential user interest loss and enhance model robustness within the global view, our second VT part introduces two additional views (local view and dropout-enhanced view) for fine-tuning user representations in the global view through contrastive learning. Extensive evaluations on real-world datasets with varying noise ratios demonstrate the superiority of IDVT over state-of-the-art social recommendation methods.
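
The gating step is straightforward to sketch (our paraphrase; the projection and gate form are assumptions): a learned sigmoid gate mixes the denoised social representation with the collaborative one per user.

    import torch

    def gated_fusion(u_social, u_collab, W):
        # gate in [0, 1]^d decides, per dimension, which domain to trust
        g = torch.sigmoid(torch.cat([u_social, u_collab], dim=-1) @ W)
        return g * u_social + (1 - g) * u_collab

    d = 16
    W = torch.randn(2 * d, d, requires_grad=True)   # learnable in practice
    fused = gated_fusion(torch.randn(4, d), torch.randn(4, d), W)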

Updated: 2024-06-17 12:18:40

标题: IDVT:面向兴趣的去噪和视图引导调整的社交推荐

摘要: 在信息时代,推荐系统对于高效地过滤信息和识别用户偏好至关重要。在线社交平台通过提供有价值的辅助信息,丰富了这些系统。社交连接的用户被认为具有相似的偏好,增强了推荐的准确性并解决了冷启动问题。然而,实证研究挑战了这一假设,揭示了某些社交连接实际上可能损害系统性能。我们的统计分析表明,在社交网络中存在大量噪音,许多社交连接的用户并不分享共同的兴趣。为了解决这个问题,我们提出了一种创新的基于兴趣的去噪和视图引导调整(IDVT)方法用于社交推荐。第一个ID部分有效地去除了社交连接中的噪音。具体来说,去噪过程考虑了社交网络结构和用户交互兴趣的全局视图。此外,在这个全局视图中,我们还将去噪的社交信息(社交领域)整合到用户-项目交互(协作领域)的传播中,并利用门控机制汇总来自两个领域的用户表示。为了解决潜在的用户兴趣丢失并增强模型在全局视图中的鲁棒性,我们的第二个VT部分通过对比学习引入了两个额外视图(局部视图和dropout增强视图)来微调全局视图中的用户表示。基于具有不同噪音比例的真实世界数据集的广泛评估显示了IDVT相对于最先进的社交推荐方法的优越性。

更新时间: 2024-06-17 12:18:40

领域: cs.AI

下载: http://arxiv.org/abs/2308.15926v2

Calibrating Where It Matters: Constrained Temperature Scaling

We consider calibration of convolutional classifiers for diagnostic decision making. Clinical decision makers can use calibrated classifiers to minimise expected costs given their own cost function. Such functions are usually unknown at training time. If minimising expected costs is the primary aim, algorithms should focus on tuning calibration in the regions of the probability simplex likely to affect decisions. We give an example by modifying temperature scaling calibration, and we demonstrate improved calibration where it matters using convnets trained to classify dermoscopy images.
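
A sketch of the general recipe (our reading; the paper's constraint is more specific): reweight the temperature-scaling objective so that calibration error in the decision-relevant region of the simplex dominates.

    import torch
    import torch.nn.functional as F

    def tune_temperature(logits, labels, weight_fn):
        # weight_fn(probs) -> per-sample weights emphasizing the region of
        # the probability simplex where decisions are actually made
        T = torch.ones(1, requires_grad=True)
        opt = torch.optim.LBFGS([T], lr=0.1, max_iter=100)

        def closure():
            opt.zero_grad()
            probs = torch.softmax(logits / T, dim=1)
            w = weight_fn(probs.detach())
            loss = (w * F.cross_entropy(logits / T, labels, reduction="none")).mean()
            loss.backward()
            return loss

        opt.step(closure)
        return T.detach()

    # e.g. upweight samples whose top probability sits near a 0.5 threshold
    near_boundary = lambda p: 1.0 + 4.0 * (1.0 - (p.max(1).values - 0.5).abs() * 2)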

Updated: 2024-06-17 12:14:31

标题: 在重要的地方校准:受限温度缩放

摘要: 我们考虑卷积分类器的校准,用于诊断决策。临床决策者可以使用校准的分类器来最小化预期成本,考虑到他们自己的成本函数。这些函数通常在训练时是未知的。如果最小化预期成本是主要目标,算法应该专注于调整概率单纯形区域中的校准,这些区域可能会影响决策。我们给出一个示例,修改温度缩放校准,并展示了在重要区域改善校准的效果,使用卷积神经网络训练来分类皮肤镜图像。

更新时间: 2024-06-17 12:14:31

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2406.11456v1

The Scaling Law in Stellar Light Curves

Analyzing time series of fluxes from stars, known as stellar light curves, can reveal valuable information about stellar properties. However, most current methods rely on extracting summary statistics, and studies using deep learning have been limited to supervised approaches. In this research, we investigate the scaling law properties that emerge when learning from astronomical time series data using self-supervised techniques. By employing the GPT-2 architecture, we show the learned representation improves as the number of parameters increases from $10^4$ to $10^9$, with no signs of performance plateauing. We demonstrate that a self-supervised Transformer model achieves 3-10 times the sample efficiency compared to the state-of-the-art supervised learning model when inferring the surface gravity of stars as a downstream task. Our research lays the groundwork for analyzing stellar light curves by examining them through large-scale auto-regressive generative models.

Updated: 2024-06-17 12:13:21

标题: 恒星光变曲线中的比例定律

摘要: 分析来自恒星的通量时间序列,即恒星光变曲线,可以揭示有关恒星性质的宝贵信息。然而,大多数当前的方法依赖于提取摘要统计数据,而利用深度学习的研究仅限于监督方法。在这项研究中,我们调查了使用自监督技术从天文时间序列数据学习时出现的标度律性质。通过使用GPT-2架构,我们表明学习到的表示随着参数数量从$10^4$增加到$10^9$而改善,没有性能达到平稳的迹象。我们证明,与最先进的监督学习模型相比,自监督Transformer模型在推断恒星表面重力作为下游任务时具有3-10倍的样本效率。我们的研究为通过大规模自回归生成模型审查恒星光变曲线奠定了基础。

更新时间: 2024-06-17 12:13:21

领域: astro-ph.IM,astro-ph.SR,cs.LG

下载: http://arxiv.org/abs/2405.17156v2

Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction

Existing research on large language models (LLMs) shows that they can solve information extraction tasks through multi-step planning. However, their extraction behavior on complex sentences and tasks is unstable, with issues such as false positives and missing elements emerging. We observe that decomposing complex extraction tasks and extracting them step by step can effectively improve LLMs' performance, and that the extraction order of entities significantly affects the final results of LLMs. This paper proposes a two-stage multi-step method for LLM-based information extraction and adopts the RL framework to execute the multi-step planning. We regard sequential extraction as a Markov decision process, build an LLM-based extraction environment, design a decision module to adaptively provide the optimal order for sequential entity extraction on different sentences, and utilize the DDQN algorithm to train the decision model. We also design rewards and evaluation metrics suitable for the extraction results of LLMs. We conduct extensive experiments on multiple public datasets to demonstrate the effectiveness of our method in improving the information extraction capabilities of LLMs.

Updated: 2024-06-17 12:11:01

标题: 自适应强化学习规划:利用大型语言模型进行复杂信息提取

摘要: 现有关于大型语言模型(LLMs)的研究表明,它们可以通过多步规划解决信息抽取任务。然而,它们在复杂句子和任务上的抽取行为是不稳定的,出现了诸如误报和漏报等问题。我们观察到,将复杂的抽取任务分解并逐步抽取可以有效提高LLMs的性能,并且实体的抽取顺序显著影响LLMs的最终结果。本文提出了一个基于LLM的信息抽取的两阶段多步方法,并采用RL框架执行多步规划。我们将顺序抽取视为马尔可夫决策过程,构建了一个基于LLM的抽取环境,设计了一个决策模块,以适应不同句子上顺序实体抽取的最佳顺序,并利用DDQN算法训练决策模型。我们还设计了适合LLMs抽取结果的奖励和评估指标。我们在多个公共数据集上进行了广泛实验证明了我们的方法在提高LLMs信息抽取能力方面的有效性。

更新时间: 2024-06-17 12:11:01

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11455v1

Attention-Based Deep Reinforcement Learning for Qubit Allocation in Modular Quantum Architectures

Modular, distributed and multi-core architectures are currently considered a promising approach for scalability of quantum computing systems. The integration of multiple Quantum Processing Units necessitates classical and quantum-coherent communication, introducing challenges related to noise and quantum decoherence in quantum state transfers between cores. Optimizing communication becomes imperative, and the compilation and mapping of quantum circuits onto physical qubits must minimize state transfers while adhering to architectural constraints. The compilation process, inherently an NP-hard problem, demands extensive search times to be solved to optimality even with a small number of qubits. To address this challenge efficiently, we advocate for the utilization of heuristic mappers that can rapidly generate solutions. In this work, we propose a novel approach employing Deep Reinforcement Learning (DRL) methods to learn these heuristics for a specific multi-core architecture. Our DRL agent incorporates a Transformer encoder and Graph Neural Networks. It encodes quantum circuits using self-attention mechanisms and produces outputs through an attention-based pointer mechanism that directly expresses the probability of matching logical qubits with physical cores. This enables the efficient selection of optimal cores for logical qubits. Experimental evaluations show that the proposed method can outperform baseline approaches in terms of reducing inter-core communications and minimizing online time-to-solution. This research contributes to the advancement of scalable quantum computing systems by introducing a novel learning-based heuristic approach for efficient quantum circuit compilation and mapping.

Updated: 2024-06-17 12:09:11

标题: 基于注意力机制的深度强化学习在模块化量子架构中的量子比特分配

摘要: 模块化、分布式和多核架构目前被认为是量子计算系统可扩展性的一种有前途的方法。整合多个量子处理单元需要经典和量子一致的通信,引入了与噪声和量子失相相关的挑战,这些挑战与核心之间的量子状态转移有关。优化通信变得至关重要,量子电路的编译和映射到物理比特必须最大程度地减少状态转移,并遵守架构约束。编译过程本质上是一个NP难的问题,即使对少量的比特进行最优求解也需要大量搜索时间。为了有效应对这一挑战,我们主张利用能够快速生成解决方案的启发式映射器。在这项工作中,我们提出了一种新颖的方法,利用深度强化学习(DRL)方法来学习特定多核架构的这些启发式。我们的DRL代理结合了Transformer编码器和图神经网络。它使用自注意机制对量子电路进行编码,并通过基于注意力的指针机制产生输出,直接表示逻辑比特与物理核心匹配的概率。这使得选择适合逻辑比特的最佳核心变得高效。实验评估表明,所提出的方法在减少核间通信和最小化在线解决时间方面可以胜过基线方法。这项研究通过引入一种新颖的基于学习的启发式方法,为高效的量子电路编译和映射做出了贡献,推动了可扩展量子计算系统的进步。

更新时间: 2024-06-17 12:09:11

领域: quant-ph,cs.AI

下载: http://arxiv.org/abs/2406.11452v1

Fixed points of nonnegative neural networks

We use fixed point theory to analyze nonnegative neural networks, which we define as neural networks that map nonnegative vectors to nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable mappings within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks having inputs and outputs of the same dimension, and these conditions are weaker than those recently obtained using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general nonnegative neural networks. From a practical perspective, our results contribute to the understanding of the behavior of autoencoders, and we also offer valuable mathematical machinery for future developments in deep equilibrium models.
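
To make the iteration behind these existence results tangible, here is a toy sketch: a one-layer ReLU network with nonnegative weights and biases, iterated from zero until it settles at a fixed point. To keep the sketch self-contained, the weight matrix is scaled so the map is a contraction, which is a stronger assumption than the weaker conditions established in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.uniform(0.0, 1.0, size=(8, 8))
    W /= 2.0 * np.linalg.norm(W, 2)        # scale so f is a contraction
    b = rng.uniform(0.0, 0.1, size=8)      # nonnegative biases

    def f(x):
        # A one-layer nonnegative network: nonnegative vectors in, nonnegative out.
        return np.maximum(W @ x + b, 0.0)

    x = np.zeros(8)
    for _ in range(500):
        x_new = f(x)
        if np.linalg.norm(x_new - x) < 1e-12:
            break
        x = x_new
    print("fixed point residual:", np.linalg.norm(f(x) - x))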

Updated: 2024-06-17 12:06:39

标题: 非负神经网络的不动点

摘要: 我们使用不动点理论来分析非负神经网络,我们将其定义为将非负向量映射到非负向量的神经网络。我们首先展示了具有非负权重和偏置的非负神经网络可以被识别为非线性Perron-Frobenius理论框架内的单调和(弱)可伸缩映射。这一事实使我们能够提供具有相同维度输入和输出的非负神经网络固定点存在的条件,这些条件比最近使用凸分析论证获得的条件要弱。此外,我们证明了具有非负权重和偏置的非负神经网络的固定点集的形状是一个区间,在温和条件下退化为一个点。这些结果随后被用来获得更一般非负神经网络的固定点存在性。从实际角度来看,我们的结果有助于理解自动编码器的行为,并为未来深度平衡模型的发展提供了宝贵的数学工具。

更新时间: 2024-06-17 12:06:39

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2106.16239v9

FlexCare: Leveraging Cross-Task Synergy for Flexible Multimodal Healthcare Prediction

Multimodal electronic health record (EHR) data can offer a holistic assessment of a patient's health status, supporting various predictive healthcare tasks. Recently, several studies have embraced the multitask learning approach in the healthcare domain, exploiting the inherent correlations among clinical tasks to predict multiple outcomes simultaneously. However, existing methods necessitate samples to possess complete labels for all tasks, which places heavy demands on the data and restricts the flexibility of the model. Meanwhile, within a multitask framework with multimodal inputs, how to comprehensively consider the information disparity among modalities and among tasks still remains a challenging problem. To tackle these issues, a unified healthcare prediction model, also named by \textbf{FlexCare}, is proposed to flexibly accommodate incomplete multimodal inputs, promoting the adaption to multiple healthcare tasks. The proposed model breaks the conventional paradigm of parallel multitask prediction by decomposing it into a series of asynchronous single-task prediction. Specifically, a task-agnostic multimodal information extraction module is presented to capture decorrelated representations of diverse intra- and inter-modality patterns. Taking full account of the information disparities between different modalities and different tasks, we present a task-guided hierarchical multimodal fusion module that integrates the refined modality-level representations into an individual patient-level representation. Experimental results on multiple tasks from MIMIC-IV/MIMIC-CXR/MIMIC-NOTE datasets demonstrate the effectiveness of the proposed method. Additionally, further analysis underscores the feasibility and potential of employing such a multitask strategy in the healthcare domain. The source code is available at https://github.com/mhxu1998/FlexCare.

Updated: 2024-06-17 12:03:10

标题: FlexCare:利用跨任务协同作用进行灵活的多模态医疗保健预测

摘要: 多模态电子健康记录(EHR)数据可以提供对患者健康状况的全面评估,支持各种预测性医疗任务。最近,几项研究在医疗领域采用了多任务学习方法,利用临床任务之间的内在相关性同时预测多个结果。然而,现有方法要求样本具有所有任务的完整标签,这对数据提出了严格要求,并限制了模型的灵活性。同时,在具有多模态输入的多任务框架内,如何全面考虑模态之间和任务之间的信息差异仍然是一个具有挑战性的问题。为了解决这些问题,提出了一个统一的医疗预测模型,也称为FlexCare,以灵活地适应不完整的多模态输入,促进对多个医疗任务的适应。该模型打破了传统的并行多任务预测范式,将其分解为一系列异步的单任务预测。具体而言,提出了一个任务不可知的多模态信息提取模块,用于捕获多样化的模态内部和模态间模式的不相关表示。充分考虑不同模态和不同任务之间的信息差异,我们提出了一个任务引导的分层多模态融合模块,将经过改进的模态级表示集成到一个个体患者级表示中。来自MIMIC-IV/MIMIC-CXR/MIMIC-NOTE数据集的多个任务的实验结果展示了所提出方法的有效性。此外,进一步分析强调了在医疗领域采用这种多任务策略的可行性和潜力。源代码可在https://github.com/mhxu1998/FlexCare上找到。

更新时间: 2024-06-17 12:03:10

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.11928v1

An Online Gradient-Based Caching Policy with Logarithmic Complexity and Regret Guarantees

Commonly used caching policies, such as LRU (Least Recently Used) or LFU (Least Frequently Used), exhibit optimal performance only under specific traffic patterns. Even advanced machine learning-based methods, which detect patterns in historical request data, struggle when future requests deviate from past trends. Recently, a new class of policies has emerged that are robust to varying traffic patterns. These algorithms address an online optimization problem, enabling continuous adaptation to the context. They offer theoretical guarantees on the regret metric, which measures the performance gap between the online policy and the optimal static cache allocation in hindsight. However, the high computational complexity of these solutions hinders their practical adoption. In this study, we introduce a new variant of the gradient-based online caching policy that achieves groundbreaking logarithmic computational complexity relative to catalog size, while also providing regret guarantees. This advancement allows us to test the policy on large-scale, real-world traces featuring millions of requests and items - a significant achievement, as such scales have been beyond the reach of existing policies with regret guarantees. To the best of our knowledge, our experimental results demonstrate for the first time that the regret guarantees of gradient-based caching policies offer substantial benefits in practical scenarios.
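
A naive member of this policy family can be sketched as online gradient ascent on a fractional cache allocation, with a Euclidean projection back onto the capacity constraint after each request. Note that the sketch below costs O(N) per request, precisely the cost the paper's logarithmic-complexity variant avoids; it only illustrates the kind of update being accelerated. The Zipf trace and step size are arbitrary choices.

    import numpy as np

    def project_capped_simplex(z, C, iters=60):
        # Euclidean projection onto {y : 0 <= y <= 1, sum(y) = C}, by bisecting
        # on the shift tau in y = clip(z - tau, 0, 1).
        lo, hi = z.min() - 1.0, z.max()
        for _ in range(iters):
            tau = 0.5 * (lo + hi)
            if np.clip(z - tau, 0.0, 1.0).sum() > C:
                lo = tau
            else:
                hi = tau
        return np.clip(z - 0.5 * (lo + hi), 0.0, 1.0)

    N, C, lr = 1000, 50, 0.1            # catalog size, cache capacity, step size
    rng = np.random.default_rng(0)
    y = np.full(N, C / N)               # fractional cache allocation
    for t in range(5000):
        i = rng.zipf(1.3) % N           # next request (toy Zipf-like trace)
        g = np.zeros(N); g[i] = 1.0     # subgradient of the hit reward at y
        y = project_capped_simplex(y + lr * g, C)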

Updated: 2024-06-17 12:01:01

标题: 一种具有对数复杂度和遗憾保证的在线梯度缓存策略

摘要: 常用的缓存策略,如LRU(最近最少使用)或LFU(最少频繁使用),只在特定流量模式下表现出最佳性能。即使是基于先进机器学习方法的策略,也只能在检测历史请求数据中的模式时表现出困难,当未来请求偏离过去趋势时。最近,出现了一类新的策略,能够适应不同的流量模式。这些算法解决了在线优化问题,实现了对上下文的持续适应。它们提供了遗憾度量标准上的理论保证,该标准衡量了在线策略与事后最佳静态缓存分配之间的性能差距。然而,这些解决方案的高计算复杂性阻碍了它们的实际采用。 在这项研究中,我们介绍了一种新的基于梯度的在线缓存策略的变体,相对于目录大小具有突破性的对数计算复杂度,同时也提供了遗憾保证。这一进步使我们能够在大规模、实际追踪中测试策略,其中包括数百万个请求和项目 - 这是一个重要的成就,因为这些规模已经超出了现有具有遗憾保证的策略的范围。据我们所知,我们的实验结果首次证明了基于梯度的缓存策略的遗憾保证在实际场景中带来了实质性的好处。

更新时间: 2024-06-17 12:01:01

领域: cs.LG,cs.NI,cs.OS

下载: http://arxiv.org/abs/2405.01263v2

An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference

The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English. This results in increased inference time and costs. Cross-lingual vocabulary adaptation (CVA) methods have been proposed for adapting models to a target language aiming to improve downstream performance. However, the effectiveness of these methods on increasing inference efficiency of generative LLMs has yet to be explored. In this paper, we perform an empirical study of five CVA methods on four generative LLMs (including monolingual and multilingual models) across four typologically-diverse languages and four natural language understanding tasks. We find that CVA substantially contributes to LLM inference speedups of up to 271.5\%. We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.

Updated: 2024-06-17 12:00:02

标题: 一项关于跨语言词汇适应的实证研究,用于有效的语言模型推理

摘要: 最先进的生成式大型语言模型(LLMs)的发展在很大程度上依赖于以英语为中心的分词器、词汇和预训练数据。尽管一些LLMs具有多语言能力,但最近的研究表明,它们在生成非英语文本时的推理效率下降。这导致推理时间和成本增加。已经提出了跨语言词汇适应(CVA)方法,旨在将模型调整到目标语言以提高下游性能。然而,这些方法对增加生成式LLMs推理效率的有效性尚未得到探讨。在本文中,我们对四种不同类型语言和四种自然语言理解任务中的四种生成式LLMs(包括单语和多语言模型)进行了五种CVA方法的实证研究。我们发现CVA显著提高了LLMs推理速度高达271.5%。我们还展示,对已在更平衡的多语言数据上预训练的LLMs进行调整,可以使下游性能与原始模型相当。

更新时间: 2024-06-17 12:00:02

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2402.10712v2

PrAViC: Probabilistic Adaptation Framework for Real-Time Video Classification

Video processing is generally divided into two main categories: processing of the entire video, which typically yields optimal classification outcomes, and real-time processing, where the objective is to make a decision as promptly as possible. The latter is often driven by the need to identify rapidly potential critical or dangerous situations. These could include machine failure, traffic accidents, heart problems, or dangerous behavior. Although the models dedicated to the processing of entire videos are typically well-defined and clearly presented in the literature, this is not the case for online processing, where a plethora of hand-devised methods exist. To address this, we present \our{}, a novel, unified, and theoretically-based adaptation framework for dealing with the online classification problem for video data. The initial phase of our study is to establish a robust mathematical foundation for the theory of classification of sequential data, with the potential to make a decision at an early stage. This allows us to construct a natural function that encourages the model to return an outcome much faster. The subsequent phase is to demonstrate a straightforward and readily implementable method for adapting offline models to online and recurrent operations. Finally, by comparing the proposed approach to the non-online state-of-the-art baseline, it is demonstrated that the use of \our{} encourages the network to make earlier classification decisions without compromising accuracy.

Updated: 2024-06-17 11:56:15

标题: PrAViC: 实时视频分类的概率适应框架

摘要: 视频处理通常分为两个主要类别:处理整个视频,通常会产生最佳的分类结果,和实时处理,其目标是尽快做出决策。后者通常是为了迅速识别潜在的关键或危险情况驱动的。这些情况可能包括机器故障、交通事故、心脏问题或危险行为。虽然专门用于处理整个视频的模型通常在文献中有明确定义并清晰呈现,但在线处理则不然,存在大量手工设计的方法。为了解决这个问题,我们提出了一个新颖、统一且理论基础的适应框架\our{},用于处理视频数据的在线分类问题。我们研究的初始阶段是为顺序数据分类理论建立一个坚固的数学基础,有潜力在早期做出决策。这使我们能够构建一个自然函数,鼓励模型更快地返回结果。随后阶段是展示一个简单易实现的方法,将离线模型调整为在线和经常性操作。最后,通过将所提出的方法与非在线最先进的基线进行比较,证明了使用\our{}鼓励网络更早地做出分类决策,而不会损害准确性。

更新时间: 2024-06-17 11:56:15

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.11443v1

GPT-Powered Elicitation Interview Script Generator for Requirements Engineering Training

Elicitation interviews are the most common requirements elicitation technique, and proficiency in conducting these interviews is crucial. Traditional training methods, typically limited to textbook learning, may not sufficiently address the practical complexities of interviewing techniques. Practical training with various interview scenarios is important for understanding how to apply theoretical knowledge in real-world contexts. However, there is a shortage of educational interview material, as creating interview scripts requires both technical expertise and creativity. To address this issue, we develop a specialized GPT agent for auto-generating interview scripts. The GPT agent is equipped with a dedicated knowledge base tailored to the guidelines and best practices of requirements elicitation interview procedures. We employ a prompt chaining approach to mitigate GPT's output length constraint and generate thorough, detailed interview scripts. This involves dividing the interview into sections and crafting distinct prompts for each, allowing complete content to be generated for every section. The generated scripts are assessed through standard natural language generation evaluation metrics and an expert judgment study, confirming their applicability in requirements engineering training.
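
The prompt-chaining loop can be sketched as follows: split the script into sections, query the model once per section, and feed the accumulated script back in as context. The section names and prompt wording below are hypothetical, and generate is a placeholder for whatever LLM API is used.

    SECTIONS = ["opening and rapport building", "context-free questions",
                "context-specific questions", "closing"]   # hypothetical split

    def generate(prompt: str) -> str:
        raise NotImplementedError("call the LLM of your choice here")

    def build_script(scenario: str) -> str:
        script = ""
        for section in SECTIONS:
            prompt = (f"You are writing a requirements-elicitation interview script.\n"
                      f"Scenario: {scenario}\n"
                      f"Script so far:\n{script}\n"
                      f"Write only the '{section}' section, in full dialogue form.")
            script += f"\n[{section}]\n" + generate(prompt)   # chain sections together
        return script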

Updated: 2024-06-17 11:53:55

标题: 由GPT 驱动的需求工程培训引诱面试脚本生成器

摘要: Elicitation interviews are the most common technique used for gathering requirements, and proficiency in conducting these interviews is crucial. Traditional training methods, which often involve textbook learning, may not adequately prepare individuals for the practical complexities of interviewing techniques. Practical training with various interview scenarios is important for applying theoretical knowledge in real-world situations. However, there is a lack of educational material for interviews, as creating interview scripts requires technical expertise and creativity. To address this issue, we have developed a specialized GPT agent for automatically generating interview scripts. This agent is equipped with a knowledge base tailored to the guidelines and best practices of requirements elicitation interviews. We use a prompt chaining approach to overcome the length constraint of GPT and generate comprehensive interview scripts by dividing the interview into sections and crafting distinct prompts for each. The generated scripts are evaluated using standard natural language generation metrics and expert judgment, confirming their usefulness in requirements engineering training.

更新时间: 2024-06-17 11:53:55

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2406.11439v1

Understanding Large Language Model Based Fuzz Driver Generation

LLM-based (Large Language Model) fuzz driver generation is a promising research area. Unlike traditional program-analysis-based methods, this text-based approach is more general and capable of harnessing a variety of API usage information, resulting in code that is friendly to human readers. However, there is still a lack of understanding regarding fundamental issues in this direction, such as its effectiveness and potential challenges. To bridge this gap, we conducted the first in-depth study targeting the important issues of using LLMs to generate effective fuzz drivers. Our study features a curated dataset with 86 fuzz driver generation questions from 30 widely-used C projects. Six prompting strategies are designed and tested across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, consuming 0.85 billion tokens (over $8,000 in charges). Additionally, we compared the LLM-generated drivers against those utilized in industry, conducting extensive fuzzing experiments (3.75 CPU-years). Our study uncovered that:
- While LLM-based fuzz driver generation is a promising direction, it still encounters several obstacles towards practical applications;
- LLMs face difficulties in generating effective fuzz drivers for APIs with intricate specifics. Three featured design choices of prompt strategies can be beneficial: issuing repeat queries, querying with examples, and employing an iterative querying process;
- While LLM-generated drivers can yield fuzzing outcomes that are on par with those used in the industry, there are substantial opportunities for enhancement, such as extending contained API usage or integrating semantic oracles to facilitate logical bug detection.
Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.

Updated: 2024-06-17 11:53:37

标题: 理解基于大型语言模型的模糊驱动程序生成

摘要: 基于LLM(Large Language Model)的模糊驱动程序生成是一个有前途的研究领域。与传统的基于程序分析的方法不同,这种基于文本的方法更加通用,能够利用各种API使用信息,生成对人类读者友好的代码。然而,对于这一方向的基本问题,如其有效性和潜在挑战,仍然存在一定的认识不足。 为了弥合这一差距,我们进行了首次深入研究,针对使用LLM生成有效模糊驱动程序的重要问题。我们的研究包括一个经过策划的数据集,包含了来自30个广泛使用的C项目的86个模糊驱动程序生成问题。我们设计并测试了六种提示策略,跨越了五种最先进的LLM和五种不同的温度设置。总共,我们的研究评估了736,430个生成的模糊驱动程序,花费了0.85亿个令牌(超过8000美元)。此外,我们还将LLM生成的驱动程序与工业中使用的进行了比较,进行了大量的模糊实验(3.75 CPU年)。我们的研究揭示了:-虽然基于LLM的模糊驱动程序生成是一个有前途的方向,但在实际应用中仍然面临一些障碍;-LLM在为具有复杂特定性的API生成有效模糊驱动程序方面面临困难。三种特色设计选择的提示策略可能是有益的:发出重复查询、用示例进行查询和采用迭代查询过程;-虽然LLM生成的驱动程序可能产生与工业中使用的相当的模糊结果,但还有很大的提升机会,比如扩展包含的API使用,或整合语义预示以促进逻辑错误检测。 我们的见解已经被实施,用来改进OSS-Fuzz-Gen项目,在工业中促进实际模糊驱动程序的生成。

更新时间: 2024-06-17 11:53:37

领域: cs.CR,D.2.5

下载: http://arxiv.org/abs/2307.12469v4

Analysing the Behaviour of Tree-Based Neural Networks in Regression Tasks

The landscape of deep learning has vastly expanded the frontiers of source code analysis, particularly through the utilization of structural representations such as Abstract Syntax Trees (ASTs). While these methodologies have demonstrated effectiveness in classification tasks, their efficacy in regression applications, such as execution time prediction from source code, remains underexplored. This paper endeavours to decode the behaviour of tree-based neural network models in the context of such regression challenges. We extend the application of established models--tree-based Convolutional Neural Networks (CNNs), Code2Vec, and Transformer-based methods--to predict the execution time of source code by parsing it to an AST. Our comparative analysis reveals that while these models are benchmarks in code representation, they exhibit limitations when tasked with regression. To address these deficiencies, we propose a novel dual-transformer approach that operates on both source code tokens and AST representations, employing cross-attention mechanisms to enhance interpretability between the two domains. Furthermore, we explore the adaptation of Graph Neural Networks (GNNs) to this tree-based problem, theorizing the inherent compatibility due to the graphical nature of ASTs. Empirical evaluations on real-world datasets showcase that our dual-transformer model outperforms all other tree-based neural networks and the GNN-based models. Moreover, our proposed dual transformer demonstrates remarkable adaptability and robust performance across diverse datasets.

Updated: 2024-06-17 11:47:14

标题: 分析树状神经网络在回归任务中的行为

摘要: 深度学习领域已经大大拓展了源代码分析的前沿,特别是通过利用结构表示,如抽象语法树(AST)。虽然这些方法在分类任务中表现出有效性,但它们在回归应用中,例如从源代码预测执行时间,仍然没有得到充分探索。本文致力于解码基于树的神经网络模型在这种回归挑战的背景下的行为。我们将已建立的模型--基于树的卷积神经网络(CNNs)、Code2Vec和基于Transformer的方法--扩展到通过将其解析为AST来预测源代码的执行时间。我们的比较分析显示,虽然这些模型在代码表示方面是基准,但在面对回归任务时存在限制。为了解决这些不足,我们提出了一种新颖的双Transformer方法,该方法同时作用于源代码标记和AST表示,利用交叉注意机制增强两个领域之间的可解释性。此外,我们探讨了将图神经网络(GNNs)调整到这个基于树的问题中,推测由于AST的图形特性,其天然的兼容性。对真实世界数据集的实证评估显示,我们的双Transformer模型优于所有其他基于树的神经网络和基于GNN的模型。此外,我们提出的双Transformer展示了在各种数据集上出色的适应性和稳健性表现。

更新时间: 2024-06-17 11:47:14

领域: cs.LG,cs.AI,cs.SE

下载: http://arxiv.org/abs/2406.11437v1

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy

Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human-crowd forecasting-tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of 12 LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark, and is not statistically different from the human crowd. We also observe a set of human-like biases in machine responses, such as an acquiescence effect and a tendency to favour round numbers. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of the human crowd: via the simple, practically applicable method of forecast aggregation.
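
Study 1's aggregation step is simple enough to sketch directly: take the median of the individual models' probability forecasts per question and score it with the Brier score, with Study 2's human-machine hybrid as a one-liner. All numbers below are synthetic stand-ins for the tournament data.

    import numpy as np

    rng = np.random.default_rng(0)
    llm_probs = rng.uniform(0, 1, size=(12, 31))   # 12 models x 31 binary questions
    outcomes = rng.integers(0, 2, size=31)         # realized 0/1 outcomes

    crowd = np.median(llm_probs, axis=0)           # the "LLM crowd" forecast
    print("crowd Brier score:", np.mean((crowd - outcomes) ** 2))

    human_median = rng.uniform(0, 1, size=31)      # stand-in for the human crowd
    hybrid = 0.5 * (llm_probs[0] + human_median)   # Study 2: average human + machine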

Updated: 2024-06-17 11:38:00

标题: 硅谷人群的智慧:LLM集成预测能力不逊于人类群体的准确性

摘要: 人类在实践中的预测准确性依赖于“众人的智慧”效应,即通过对个体预测者的预测进行聚合,可以显著提高对未来事件的预测。过去关于大型语言模型(LLMs)的预测能力的研究表明,作为个体预测者的尖端LLMs表现不及人类-群体预测比赛聚合的黄金标准。在第一项研究中,我们通过使用由12个LLMs组成的LLM群体方法扩展了这项研究。我们将31个二元问题的聚合LLM预测与来自为期三个月的预测比赛的925名人类预测者的预测进行比较。我们的预先注册的主要分析显示,LLM群体的表现优于简单的无信息基准,并且与人类群体没有统计学上的差异。我们还观察到机器回应中一系列类似于人类的偏见,例如顺从效应和偏爱圆整数的倾向。在第二项研究中,我们测试了LLM预测(GPT-4和Claude 2)是否可以通过利用人类认知输出来改进。我们发现,两个模型的预测准确性受益于暴露于中位数人类预测作为信息,准确性提高了17%至28%,尽管这导致的预测比简单地平均人类和机器预测更不准确。我们的结果表明,LLMs可以通过简单、实际可应用的预测聚合方法实现与人类群体相媲美的预测准确性。

更新时间: 2024-06-17 11:38:00

领域: cs.CY,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2402.19379v5

AnyTrans: Translate AnyText in the Image with Large Scale Models

This paper introduces AnyTrans, an all-encompassing framework for the task of Translating AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.

Updated: 2024-06-17 11:37:48

标题: AnyTrans:使用大型模型翻译图像中的任意文本

摘要: 本文介绍了AnyTrans,一个全面的框架,用于执行图像中的任意文本翻译(TATI)任务,包括多语言文本翻译和图像内文本融合。我们的框架利用大规模模型的优势,如大型语言模型(LLMs)和文本引导扩散模型,在翻译过程中结合来自文本和视觉元素的上下文提示。LLMs的少样本学习能力使得可以通过考虑整体上下文来翻译分散的文本。同时,扩散模型的高级修补和编辑能力使得可以将翻译后的文本无缝融入原始图像中,同时保留其风格和真实感。此外,我们的框架可以完全使用开源模型构建,无需训练,易于访问和扩展。为了鼓励TATI任务的进展,我们精心编制了一个名为MTIT6的测试数据集,包含来自六种语言配对的多语言文本图像翻译数据。

更新时间: 2024-06-17 11:37:48

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.11432v1

Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models. It discovers that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon there may exist an issue of weak-to-strong deception, where strong models deceive weak models by exhibiting well-aligned behavior in areas the weak models know about, while producing misaligned behaviors in cases the weak models do not know about. We then take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension to gain high reward in another alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate: (1) the weak-to-strong deception exists; (2) the deception phenomenon may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.

Updated: 2024-06-17 11:36:39

标题: 超(表层)对齐:在从弱到强的泛化中,强模型可能会欺骗弱模型

摘要: 超对齐是指人类作为超人类模型的弱监督者,在大型语言模型(LLMs)快速发展的当前时代已经成为一个重要且广泛讨论的问题。最近的研究通过使用弱模型监督强模型初步研究了这个问题。研究发现,弱监督的强学生可以始终优于弱教师朝向对齐目标,导致弱到强的泛化现象。然而,我们担心在这种有前途的现象背后,是否存在弱到强欺骗的问题,即强模型可能通过在弱模型已知的领域中表现出良好对齐,但在弱模型不了解的情况下产生不对齐的行为来欺骗弱模型。然后,我们朝着在特定但现实的多目标对齐案例中探索这个安全问题迈出了初步步伐,其中可能存在一些相互冲突的对齐目标(例如,有用性与无害性)。这种冲突可能导致强模型在一个对齐维度上欺骗弱模型,以在其他对齐维度上获得高回报。我们对奖励建模任务和偏好优化方案上的实验表明:(1)存在弱到强的欺骗;(2)欺骗现象可能随着弱模型和强模型之间的能力差距增大而加剧。我们还讨论了潜在的解决方案,并发现通过中间模型的自举可以在一定程度上减轻欺骗。我们的工作强调了更加关注超对齐的真实可靠性的紧迫需要。

更新时间: 2024-06-17 11:36:39

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11431v1

A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm of a key embedding and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
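
The compression rule itself is only a few lines: rank cached keys by $L_2$ norm and keep the low-norm ones, which the paper observes tend to receive high attention. A minimal single-head sketch (real KV caches are per-layer and per-head, and the keep ratio is task-dependent):

    import torch

    def compress_kv(keys, values, keep_ratio=0.5):
        # keys, values: [seq_len, d]. Keep the pairs whose key embeddings have
        # the lowest L2 norm, preserving their original order in the sequence.
        k = max(1, int(keys.shape[0] * keep_ratio))
        norms = keys.norm(p=2, dim=-1)
        keep = norms.topk(k, largest=False).indices.sort().values
        return keys[keep], values[keep]

    keys, values = torch.randn(1024, 128), torch.randn(1024, 128)
    k2, v2 = compress_kv(keys, values, keep_ratio=0.5)   # 50% smaller cache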

Updated: 2024-06-17 11:35:16

标题: 一种简单而有效的$L_2$范数基于的KV缓存压缩策略

摘要: 大型语言模型(LLMs)的部署通常受到键-值(KV)缓存的大量内存需求的阻碍,特别是随着上下文长度的增加。现有的减小KV缓存大小的方法包括微调模型学习压缩策略或利用注意力分数减少序列长度。我们分析了仅基于解码器的Transformer模型中的注意力分布,并观察到大多数层中的注意力分配模式保持一致。令人惊讶的是,我们发现缓存的KV对中的$L_2$和注意力分数之间存在明显的相关性,其中键嵌入的低$L_2$通常会导致解码期间的高注意力分数。这一发现表明,在查询之前,KV对的影响可能由键嵌入本身决定。基于这一观察,我们基于键嵌入的$L_2$对KV缓存进行压缩。我们的实验证明,这种简单策略可以在语言建模和寻找针在干草堆中的任务上将KV缓存大小减少50%,在密码检索任务上减少90%,而不会丢失准确性。

更新时间: 2024-06-17 11:35:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11430v1

Large Language Model Can Continue Evolving From Mistakes

As world knowledge evolves and new task paradigms emerge, Continual Learning (CL) is crucial for keeping Large Language Models (LLMs) up-to-date and addressing their shortcomings. In practical applications, LLMs often require both continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new task paradigms and acquire necessary knowledge for task-solving. However, it remains challenging to collect CPT data that addresses the knowledge deficiencies in models while maintaining adequate volume, and improving the efficiency of utilizing this data also presents significant difficulties. Inspired by the 'summarizing mistakes' learning skill, we propose the Continue Evolving from Mistakes (CEM) method, aiming to provide a data-efficient approach for collecting CPT data and continually improving LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To efficiently utilize these CPT data and mitigate forgetting, we design a novel CL training set construction paradigm that integrates parallel CIT and CPT data. Extensive experiments demonstrate the efficacy of the CEM method, achieving up to a 17% improvement in accuracy in the best case. Furthermore, additional experiments confirm the potential of combining CEM with catastrophic forgetting mitigation methods, enabling iterative and continual model evolution.

Updated: 2024-06-17 11:32:29

标题: 大型语言模型可以继续从错误中不断发展

摘要: 随着世界知识的演变和新的任务范式的出现,持续学习对于保持大型语言模型(LLMs)的最新状态并解决其缺点至关重要。在实际应用中,LLMs通常需要持续指导调整(CIT)和持续预训练(CPT)来适应新的任务范式并获取解决任务所需的必要知识。然而,收集能够解决模型知识不足问题的CPT数据同时保持足够的数量仍然具有挑战性,并且提高利用这些数据的效率也面临重大困难。受“总结错误”的学习技能启发,我们提出了“从错误中持续进化”(CEM)方法,旨在通过迭代评估和补充与错误相关的知识,为收集CPT数据和通过不断改进LLMs性能提供一种数据高效的方法。为了有效利用这些CPT数据并减轻遗忘,我们设计了一种新颖的持续学习训练集构建范式,将并行的CIT和CPT数据整合在一起。广泛的实验表明CEM方法的有效性,在最佳情况下提高了高达17%的准确率。此外,额外的实验证实了将CEM与避免灾难性遗忘的方法相结合的潜力,实现了迭代和持续的模型进化。

更新时间: 2024-06-17 11:32:29

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2404.08707v4

Fusion Makes Perfection: An Efficient Multi-Grained Matching Approach for Zero-Shot Relation Extraction

Predicting unseen relations that cannot be observed during the training phase is a challenging task in relation extraction. Previous works have made progress by matching the semantics between input instances and label descriptions. However, fine-grained matching often requires laborious manual annotation, and rich interactions between instances and label descriptions come with significant computational overhead. In this work, we propose an efficient multi-grained matching approach that uses virtual entity matching to reduce manual annotation cost, and fuses coarse-grained recall and fine-grained classification for rich interactions with guaranteed inference speed. Experimental results show that our approach outperforms the previous State Of The Art (SOTA) methods, and achieves a balance between inference efficiency and prediction accuracy in zero-shot relation extraction tasks. Our code is available at https://github.com/longls777/EMMA.

Updated: 2024-06-17 11:31:48

标题: 融合制造完美:一种高效的多粒度匹配方法用于零样本关系抽取

摘要: 在关系抽取中,预测在训练阶段无法观察到的未见关系是一项具有挑战性的任务。先前的研究通过匹配输入实例和标签描述之间的语义来取得进展。然而,细粒度匹配通常需要繁琐的手动注释,并且实例和标签描述之间的丰富交互会带来显著的计算开销。在这项工作中,我们提出了一种高效的多粒度匹配方法,利用虚拟实体匹配来减少手动注释成本,并将粗粒度的召回与细粒度的分类融合,以保证推理速度的丰富交互。实验结果表明,我们的方法优于先前的最新方法,并在零-shot关系抽取任务中实现了推理效率和预测准确性之间的平衡。我们的代码可在https://github.com/longls777/EMMA 上找到。

更新时间: 2024-06-17 11:31:48

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11429v1

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.

Updated: 2024-06-17 11:25:57

标题: DiTTo-TTS:具有扩散变压器的高效可扩展零样本文本转语音

摘要: 大规模扩散模型在多个模态中展现出卓越的生成能力,包括图像、视频和音频。然而,文本到语音(TTS)系统通常涉及特定领域的建模因素(例如音素和音素级持续时间),以确保文本和语音之间的精确时间对齐,这阻碍了用于TTS的扩散模型的效率和可扩展性。在这项工作中,我们提出了一种高效且可扩展的扩散Transformer(DiT),利用现成的预训练文本和语音编码器。我们的方法通过交叉注意机制和预测语音表示的总长度来解决文本-语音对齐的挑战。为了实现这一目标,我们改进了DiT架构以适应TTS,并通过将语义引导纳入语音的潜在空间来改善对齐。我们将训练数据集和模型大小分别扩展到82K小时和790M参数。我们广泛的实验表明,用于TTS的大规模扩散模型不仅简化了训练流程,而且在自然性、可懂性和说话者相似性方面产生了优越或可比较的零样本性能,与最先进的TTS模型相媲美。我们的语音样本可在https://ditto-tts.github.io 上获取。

更新时间: 2024-06-17 11:25:57

领域: eess.AS,cs.AI,cs.CL,cs.LG,cs.SD

下载: http://arxiv.org/abs/2406.11427v1

H-Fac: Memory-Efficient Optimization with Factorized Hamiltonian Descent

In this study, we introduce a novel adaptive optimizer, H-Fac, which incorporates a factorized approach to momentum and scaling parameters. Our algorithm demonstrates competitive performance on both ResNets and Vision Transformers, while achieving sublinear memory costs through the use of rank-1 parameterizations for moment estimators. We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings. These optimization algorithms are designed to be both straightforward and adaptable, facilitating easy implementation in diverse settings.
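
The rank-1 idea can be sketched in the style of Adafactor-like factorizations: keep one running statistic per row and one per column of a weight matrix, and reconstruct the [m, n] second-moment estimate as an outer product, so memory is O(m + n) rather than O(mn). This illustrates factorized moment estimators in general, not the authors' exact Hamiltonian-derived update.

    import torch

    def factored_second_moment(r, c, grad, beta2=0.999, eps=1e-8):
        # r: [m] row statistic, c: [n] column statistic (EMAs of squared grads).
        r.mul_(beta2).add_(grad.pow(2).mean(dim=1), alpha=1 - beta2)
        c.mul_(beta2).add_(grad.pow(2).mean(dim=0), alpha=1 - beta2)
        v = torch.outer(r, c) / r.mean().clamp_min(eps)   # rank-1 [m, n] estimate
        return grad / (v.sqrt() + eps)                    # scaled update direction

    m, n = 64, 32
    r, c = torch.zeros(m), torch.zeros(n)
    step = factored_second_moment(r, c, torch.randn(m, n))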

Updated: 2024-06-17 11:25:33

标题: H-Fac: 通过分解哈密顿下降实现内存高效优化

摘要: 在这项研究中,我们引入了一种新颖的自适应优化器H-Fac,该优化器结合了动量和缩放参数的因式方法。我们的算法在ResNets和Vision Transformers上表现出竞争性的性能,同时通过使用基于秩-1的参数化方法实现亚线性的内存成本。我们根据从哈密顿动力学推导出的原则开发了我们的算法,提供了坚实的理论基础。这些优化算法旨在既简单又适应性强,便于在不同环境中轻松实施。

更新时间: 2024-06-17 11:25:33

领域: cs.LG

下载: http://arxiv.org/abs/2406.09958v2

Query Performance Prediction using Relevance Judgments Generated by Large Language Models

Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
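
Once per-item judgments exist, computing any measure from them is mechanical; the sketch below shows Precision@k exactly, and Recall@k given an estimate of the total number of relevant items, which is the quantity the paper's approximation strategy targets. The judgment list is a made-up example.

    def precision_at_k(judgments, k):
        return sum(judgments[:k]) / k

    def recall_at_k(judgments, k, est_total_relevant):
        return sum(judgments[:k]) / max(est_total_relevant, 1)

    judgments = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]   # LLM-generated pseudo-labels
    print(precision_at_k(judgments, 10))          # -> 0.4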

Updated: 2024-06-17 11:23:20

标题: 使用大型语言模型生成的相关性判断来预测查询性能

摘要: 查询性能预测(QPP)旨在估计搜索系统对查询的检索质量,而无需人工相关性判断。先前的QPP方法通常返回单个标量值,并不要求预测值逼近特定的信息检索(IR)评估指标,从而导致一定的缺点:(i)单个标量不足以准确表示不同的IR评估指标,特别是当度量不高度相关时,(ii)单个标量限制了QPP方法的可解释性,因为仅使用标量无法解释QPP的结果。为了解决这些问题,我们提出了一个使用自动生成的相关性判断的QPP框架(QPP-GenRE),将QPP分解为独立的子任务,预测给定查询的排名列表中每个项目的相关性。这使我们能够使用生成的相关性判断作为伪标签来预测任何IR评估指标。这也使我们能够解释预测的IR评估指标,并识别、跟踪和纠正生成的相关性判断中的错误,以提高QPP的质量。我们通过使用开源大型语言模型(LLMs)来预测项目的相关性,以确保科学的可重复性。 我们面临两个主要挑战:(i)为了预测一个度量考虑召回率,判断整个语料库的计算成本过高,(ii)在零/少量样本的情况下促使开源LLMs的性能有限。为了解决这些挑战,我们设计了一个近似策略来预测考虑召回率的IR度量,并提议使用人工标记的相关性判断对开源LLMs进行微调。对TREC 2019-2022深度学习跟踪的实验表明,QPP-GenRE在词汇和神经排序器方面实现了最先进的QPP质量。

更新时间: 2024-06-17 11:23:20

领域: cs.IR,cs.AI,cs.CL,cs.LG,H.3.3

下载: http://arxiv.org/abs/2404.01012v2

Dredge Word, Social Media, and Webgraph Networks for Unreliable Website Classification and Identification

In an attempt to mimic the complex paths through which unreliable content spreads between search engines and social media, we explore the impact of incorporating both webgraph and large-scale social media contexts into website credibility classification and discovery systems. We further explore the usage of what we define as 'dredge words' on social media -- terms or phrases for which unreliable domains rank highly. Through comprehensive graph neural network ablations, we demonstrate that curriculum-based heterogeneous graph models that leverage context from both webgraphs and social media data outperform homogeneous and single-mode approaches. We further demonstrate that the incorporation of dredge words into our model strongly associates unreliable websites with social media and online commerce platforms. Finally, we show our heterogeneous model greatly outperforms competing systems in the top-k identification of unlabeled unreliable websites. We demonstrate the strong unreliability signals present in the diverse paths that users follow to uncover unreliable content, and we release a novel dataset of dredge words.

Updated: 2024-06-17 11:22:04

标题: 使用疏浚词、社交媒体和Web图网络进行不可靠网站分类和识别

摘要: 为了模拟不可靠内容在搜索引擎和社交媒体之间传播的复杂路径,我们探讨了将网页图和大规模社交媒体上下文结合到网站可信度分类和发现系统中的影响。我们进一步探讨了在社交媒体上使用我们定义的“搅拌词”——不可靠域名排名较高的术语或短语。通过全面的图神经网络消融实验,我们证明基于课程的异质图模型,利用来自网页图和社交媒体数据的上下文,胜过同质和单模式方法。我们进一步证明,将搅拌词纳入我们的模型,将不可靠网站与社交媒体和在线商务平台强烈关联起来。最后,我们展示我们的异质模型在识别未标记的不可靠网站的前k名方面远远优于竞争系统。我们展示用户发现不可靠内容的多样路径中存在强烈的不可靠信号,并发布了一组新颖的搅拌词数据集。

更新时间: 2024-06-17 11:22:04

领域: cs.SI,cs.AI,cs.CL,cs.CY,cs.LG

下载: http://arxiv.org/abs/2406.11423v1

Cross-domain Open-world Discovery

In many real-world applications, test data may commonly exhibit categorical shifts, characterized by the emergence of novel classes, as well as distribution shifts arising from feature distributions different from the ones the model was trained on. However, existing methods either discover novel classes in the open-world setting or assume domain shifts without the ability to discover novel classes. In this work, we consider a cross-domain open-world discovery setting, where the goal is to assign samples to seen classes and discover unseen classes under a domain shift. To address this challenging problem, we present CROW, a prototype-based approach that introduces a cluster-then-match strategy enabled by a well-structured representation space of foundation models. In this way, CROW discovers novel classes by robustly matching clusters with previously seen classes, followed by fine-tuning the representation space using an objective designed for cross-domain open-world discovery. Extensive experimental results on image classification benchmark datasets demonstrate that CROW outperforms alternative baselines, achieving an 8% average performance improvement across 75 experimental settings.
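
A bare-bones version of cluster-then-match: cluster the target-domain embeddings with more clusters than seen classes, match clusters to seen-class prototypes with the Hungarian algorithm, and treat unmatched clusters as candidate novel classes. The embeddings and prototypes below are synthetic, and the plain similarity cost is a stand-in for CROW's robust matching.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    target_emb = rng.normal(size=(500, 64))      # foundation-model features
    prototypes = rng.normal(size=(10, 64))       # one per seen class

    n_clusters = 15                              # more clusters than seen classes
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(target_emb)
    centers = km.cluster_centers_

    cost = -centers @ prototypes.T               # negative similarity as cost
    rows, cols = linear_sum_assignment(cost)     # best cluster-to-class matching
    matched = set(rows)
    novel = [c for c in range(n_clusters) if c not in matched]  # candidate novel classes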

Updated: 2024-06-17 11:20:09

标题: 跨领域开放世界发现

摘要: 在许多实际应用中,测试数据通常可能表现出分类偏移,其特征是新的类别的出现,以及由于特征分布与模型训练时不同而产生的分布偏移。然而,现有方法要么在开放世界设置中发现新类别,要么假设领域偏移而无法发现新类别。在这项工作中,我们考虑了一个跨领域开放世界发现设置,目标是在领域偏移下将样本分配给已知类别并发现未知类别。为了解决这一具有挑战性的问题,我们提出了CROW,一种基于原型的方法,通过基于结构良好的基础模型表示空间实现了一种集群-匹配策略。通过这种方式,CROW通过稳健地将集群与先前见过的类别进行匹配来发现新类别,然后使用为跨领域开放世界发现设计的目标对表示空间进行微调。对图像分类基准数据集的大量实验结果表明,CROW优于替代基线,在75个实验设置中平均性能提高了8%。

更新时间: 2024-06-17 11:20:09

领域: cs.LG

下载: http://arxiv.org/abs/2406.11422v1

Private Approximate Query over Horizontal Data Federation

In many real-world scenarios, multiple data providers need to collaboratively perform analysis of their private data. The challenges of these applications, especially at the big data scale, are time and resource efficiency as well as end-to-end privacy with minimal loss of accuracy. Existing approaches rely primarily on cryptography, which improves privacy, but at the expense of query response time. However, current big data analytics frameworks require fast and accurate responses to large-scale queries, making cryptography-based solutions less suitable. In this work, we address the problem of combining Approximate Query Processing (AQP) and Differential Privacy (DP) in a private federated environment answering range queries on horizontally partitioned multidimensional data. We propose a new approach that considers a data distribution-aware online sampling technique to accelerate the execution of range queries and ensure end-to-end data privacy during and after analysis with minimal loss in accuracy. Through empirical evaluation, we show that our solution is able to provide up to 8 times faster processing than the basic non-secure solution while maintaining accuracy, formal privacy guarantees, and resilience to learning-based attacks.
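
The flavor of combining sampling with differential privacy can be sketched for a single range-count query: answer on a sample, scale up, and add Laplace noise. This toy uses uniform sampling and a conservatively calibrated noise scale, whereas the paper's contribution is a distribution-aware sampler; the sketch only shows the AQP-plus-DP skeleton.

    import numpy as np

    def private_range_count(data, lo, hi, sample_rate=0.1, epsilon=1.0, rng=None):
        # One record changes the scaled-up count by at most 1/sample_rate,
        # so Laplace noise with that (conservative) sensitivity gives eps-DP.
        rng = rng or np.random.default_rng()
        sample = data[rng.random(len(data)) < sample_rate]
        estimate = np.sum((sample >= lo) & (sample <= hi)) / sample_rate
        noise = rng.laplace(0.0, (1.0 / sample_rate) / epsilon)
        return estimate + noise

    data = np.random.default_rng(1).normal(50, 15, size=100_000)
    print(private_range_count(data, 40, 60, sample_rate=0.05, epsilon=0.5))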

Updated: 2024-06-17 11:19:58

标题: 水平数据联邦中的私人近似查询

摘要: 在许多现实世界的情景中,多个数据提供者需要共同分析他们的私人数据。这些应用程序的挑战,特别是在大数据规模上,是时间和资源效率以及端到端隐私保护,同时最大限度地减少准确性损失。现有方法主要依赖于密码学,这提高了隐私性,但以查询响应时间为代价。然而,当前的大数据分析框架需要对大规模查询快速准确的响应,这使得基于密码学的解决方案不太合适。在这项工作中,我们解决了在私人联合环境中结合近似查询处理(AQP)和差分隐私(DP)来回答水平分割多维数据上的范围查询的问题。我们提出了一种新方法,考虑了数据分布感知的在线抽样技术,加速范围查询的执行,并确保端到端数据隐私在分析期间及之后保持准确性的最小损失。通过实证评估,我们展示了我们的解决方案能够提供高达基本非安全解决方案8倍的更快处理速度,同时保持准确性、形式上的隐私保证和对基于学习的攻击的抵抗性。

更新时间: 2024-06-17 11:19:58

领域: cs.DB,cs.CR,H.2.8

下载: http://arxiv.org/abs/2406.11421v1

ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations

As generative AI becomes more prevalent, it is important to study how human users interact with such models. In this work, we investigate how people use text-to-image models to generate desired target images. To study this interaction, we created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that creates a similar-looking image as the target. Through this game, we recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image. The majority of these are repeated interactions where a user iterates to find the best prompt for their target image, making this a unique sequential dataset for studying human-AI collaborations. In an initial analysis of this dataset, we identify several characteristics of prompt interactions and user strategies. People submit diverse prompts and are able to discover a variety of text descriptions that generate similar images. Interestingly, prompt diversity does not decrease as users find better prompts. We further propose a new metric to quantify the steerability of AI using our dataset. We define steerability as the expected number of interactions required to adequately complete a task. We estimate this value by fitting a Markov chain for each target task and calculating the expected time to reach an adequate score in the Markov chain. We quantify and compare AI steerability across different types of target images and two different models, finding that images of cities and natural world images are more steerable than artistic and fantasy images. These findings provide insights into human-AI interaction behavior, present a concrete method of assessing AI steerability, and demonstrate the general utility of the ArtWhisperer dataset.
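
The steerability computation reduces to a standard absorbing-Markov-chain calculation: with Q the transition matrix restricted to the transient (not-yet-adequate) score states, the expected number of interactions from each starting state is t = (I - Q)^{-1} 1. The 3-state chain below is invented for illustration; the paper fits such a chain per target task.

    import numpy as np

    P = np.array([[0.6, 0.3, 0.1],    # states 0, 1 = transient score levels
                  [0.2, 0.5, 0.3],    # state 2 = adequate score (absorbing)
                  [0.0, 0.0, 1.0]])
    Q = P[:2, :2]                     # chain restricted to transient states
    t = np.linalg.solve(np.eye(2) - Q, np.ones(2))
    print("expected interactions from each start state:", t)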

Updated: 2024-06-17 11:13:39

标题: ArtWhisperer:一个用于描述艺术创作中人工智能与人类互动的数据集

摘要: 随着生成式人工智能变得越来越普遍,研究人类用户如何与这些模型进行交互变得至关重要。在这项工作中,我们研究了人们如何利用文本到图像模型生成他们想要的目标图像。为了研究这种交互,我们创建了一个名为ArtWhisperer的在线游戏,用户被给予一个目标图像,并被要求通过迭代找到一个提示,以生成一个与目标图像相似的图像。通过这个游戏,我们记录了超过50,000个人类与人工智能的交互;每个交互对应于用户创建的一个文本提示和相应生成的图像。其中大多数是重复的交互,用户迭代以找到最适合他们目标图像的提示,使得这成为一个独特的序列数据集用于研究人类与人工智能的合作。在对这个数据集的初步分析中,我们确定了几个提示交互和用户策略的特征。人们提交各种各样的提示,并且能够发现可以生成相似图像的各种文本描述。有趣的是,提示的多样性并不随着用户找到更好的提示而减少。我们进一步提出了一种新的度量AI可操纵性的方法,使用我们的数据集。我们将可操纵性定义为完成任务所需的预期交互数量。我们通过为每个目标任务拟合一个马尔可夫链,并计算达到马尔可夫链中充分得分所需的预期时间来估计这个值。我们量化并比较不同类型目标图像和两种不同模型的AI可操纵性,发现城市图像和自然界图像比艺术和幻想图像更易操纵。这些发现提供了对人类与人工智能互动行为的见解,提出了一种评估AI可操纵性的具体方法,并展示了ArtWhisperer数据集的一般实用性。

更新时间: 2024-06-17 11:13:39

领域: cs.AI,cs.CV,cs.HC,cs.LG

下载: http://arxiv.org/abs/2306.08141v4

Formally Certified Approximate Model Counting

Approximate model counting is the task of approximating the number of solutions to an input Boolean formula. The state-of-the-art approximate model counter for formulas in conjunctive normal form (CNF), ApproxMC, provides a scalable means of obtaining model counts with probably approximately correct (PAC)-style guarantees. Nevertheless, the validity of ApproxMC's approximation relies on a careful theoretical analysis of its randomized algorithm and the correctness of its highly optimized implementation, especially the latter's stateful interactions with an incremental CNF satisfiability solver capable of natively handling parity (XOR) constraints. We present the first certification framework for approximate model counting with formally verified guarantees on the quality of its output approximation. Our approach combines: (i) a static, once-off, formal proof of the algorithm's PAC guarantee in the Isabelle/HOL proof assistant; and (ii) dynamic, per-run, verification of ApproxMC's calls to an external CNF-XOR solver using proof certificates. We detail our general approach to establish a rigorous connection between these two parts of the verification, including our blueprint for turning the formalized, randomized algorithm into a verified proof checker, and our design of proof certificates for both ApproxMC and its internal CNF-XOR solving steps. Experimentally, we show that certificate generation adds little overhead to an approximate counter implementation, and that our certificate checker is able to fully certify 84.7% of instances with generated certificates when given the same time and memory limits as the counter.

Updated: 2024-06-17 11:02:04

标题: 正式认证的近似模型计数

摘要: 近似模型计数是估计输入布尔公式解的数量的任务。在合取范式(CNF)中的公式的近似模型计数器ApproxMC是目前的最先进技术,提供了一种可扩展的方式来获得具有可能近似正确(PAC)保证的模型计数。然而,ApproxMC的近似的有效性依赖于对其随机算法的仔细理论分析以及其高度优化的实现的正确性,尤其是后者与能够本地处理奇偶(XOR)约束的增量CNF可满足性求解器的有状态交互。 我们提出了第一个近似模型计数的认证框架,对其输出近似质量的保证进行了正式验证。我们的方法结合了:(i)在Isabelle/HOL证明助手中对算法的PAC保证进行一次性的静态形式证明;以及(ii)对ApproxMC调用外部CNF-XOR求解器的动态、每次运行的验证,使用证明证书。我们详细介绍了建立这两部分验证之间严格连接的一般方法,包括我们的蓝图,将形式化的随机算法转化为验证的证明检查器,以及我们为ApproxMC及其内部CNF-XOR求解步骤设计的证明证书。在实验中,我们展示了证书生成对近似计数器实现的额外开销很小,并且当给定与计数器相同的时间和内存限制时,我们的证书检查器能够完全验证84.7%的实例。

更新时间: 2024-06-17 11:02:04

领域: cs.LO,cs.AI

下载: http://arxiv.org/abs/2406.11414v1

HARE: HumAn pRiors, a key to small language model Efficiency

Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.

Updated: 2024-06-17 10:56:03

标题: HARE:人类先验,小语言模型效率的关键

摘要: 人类先验在深度学习中有效利用数据起着至关重要的作用。然而,随着大型语言模型(LLMs)的发展,越来越强调扩大模型大小和数据量,这往往降低了人类先验在数据构建中的重要性。受这些趋势的影响,现有的小型语言模型(SLMs)主要依赖于网络抓取的大规模训练数据,忽视了正确整合人类先验的重要性。这一疏忽限制了语言模型在资源受限环境中的训练效率。在本文中,我们提出了一项利用人类先验进行数据构建的原则。这一原则强调通过在一个既包含语义多样性又保持数据质量一致性的简洁数据集上训练,避免基准数据泄漏,从而实现高性能SLMs。遵循这一原则,我们训练了一个名为HARE-1.1B的SLM。对大规模基准数据集进行的大量实验表明,HARE-1.1B在性能上优于最先进的SLMs,验证了所提出的原则的有效性。此外,这从人类先验的角度为资源受限环境中的有效语言模型训练提供了新的见解。

更新时间: 2024-06-17 10:56:03

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11410v1

CodeGemma: Open Code Models Based on Gemma

This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings.

Updated: 2024-06-17 10:54:35

标题: CodeGemma:基于Gemma的开放代码模型

摘要: 本文介绍了CodeGemma,这是一个基于Gemma构建的专门的开源代码模型集合,可以执行各种代码和自然语言生成任务。我们发布了三个模型变体。CodeGemma 7B预训练(PT)和指令调整(IT)变体具有非常强大的自然语言理解能力,在数学推理方面表现出色,并且与其他开源模型的代码能力相匹配。CodeGemma 2B是一个最先进的代码补全模型,旨在在延迟敏感的环境中实现快速代码填充和开放式生成。

更新时间: 2024-06-17 10:54:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11409v1

Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis

The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of outputs of 10 smaller, open LMs across three aspects: task types, application domains and reasoning types, using diverse prompt styles. We demonstrate that most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects for their strategic selection based on use-case and other constraints. We also show that if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.

Updated: 2024-06-17 10:45:36

标题: 评估开放式语言模型在任务类型、应用领域和推理类型上的表现:一项深入的实验分析

摘要: 语言模型(LMs)的快速崛起扩大了它们在几个应用中的使用。然而,由于模型大小、相关成本或专有限制的约束,利用最先进的(SOTA)LLMs并非总是可行的。随着开放、较小的LMs的出现,更多应用可以利用它们的能力,但选择合适的LM可能具有挑战性。本研究对10个较小、开放的LMs的输出的语义正确性进行了深入实验分析,涵盖了任务类型、应用领域和推理类型三个方面,采用多样化的提示风格。我们证明了在特定要求下,最有效的模型和提示风格会有所变化。我们的分析提供了一个基于用例和其他约束条件的三层方面模式的LMs和提示风格的比较评估。我们还展示了,如果适当利用,这些LMs可以与SOTA LLMs(如DeepSeek-v2,GPT-3.5-Turbo和GPT-4o)竞争,并有时表现更优。

更新时间: 2024-06-17 10:45:36

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11402v1

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale, emphasizing executability and correctness. RepoExec provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios.

Updated: 2024-06-17 10:45:22

标题: REPOEXEC:使用存储库级可执行基准评估代码生成

摘要: CodeLLMs的能力在\textit{存储库级别规模}生成可执行且功能正确的代码在很大程度上尚未被探索。我们引入了\methodnamews,这是一个用于评估存储库级别规模下代码生成的新颖基准,强调可执行性和正确性。\methodnamews提供了一个自动化系统,用于验证需求并集成了一个动态生成高覆盖率测试用例的机制,以评估生成代码的功能性。我们的工作探索了一个受控场景,在这个场景中开发人员指定了必要的代码依赖关系,挑战模型准确地集成这些依赖关系。实验证明,虽然预训练的LLMs在正确性方面优于指令调整模型,但后者在利用提供的依赖关系和展示调试能力方面表现卓越。\methodnamews旨在全面评估代码功能性以及与开发人员意图的一致性,为在实际场景中更可靠和适用的CodeLLMs铺平道路。

更新时间: 2024-06-17 10:45:22

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2406.11927v1

Distributed Maximum Consensus over Noisy Links

We introduce a distributed algorithm, termed noise-robust distributed maximum consensus (RD-MC), for estimating the maximum value within a multi-agent network in the presence of noisy communication links. Our approach entails redefining the maximum consensus problem as a distributed optimization problem, allowing a solution using the alternating direction method of multipliers. Unlike existing algorithms that rely on multiple sets of noise-corrupted estimates, RD-MC employs a single set, enhancing both robustness and efficiency. To further mitigate the effects of link noise and improve robustness, we apply moving averaging to the local estimates. Through extensive simulations, we demonstrate that RD-MC is significantly more robust to communication link noise compared to existing maximum-consensus algorithms.

Updated: 2024-06-17 10:38:59

标题: 在噪声链路上的分布式最大一致性

摘要: 我们介绍了一种分布式算法,称为噪声鲁棒分布式最大一致性(RD-MC),用于在存在嘈杂通信链路的多智能体网络中估算最大值。我们的方法重新定义了最大一致性问题为分布式优化问题,允许使用交替方向乘法器方法来解决。与现有算法依赖于多组噪声扰动估计不同,RD-MC使用单组估计,提高了鲁棒性和效率。为了进一步减轻链路噪声的影响并改善鲁棒性,我们对本地估计应用了移动平均。通过大量模拟,我们证明了与现有最大一致性算法相比,RD-MC对通信链路噪声具有更强的鲁棒性。

更新时间: 2024-06-17 10:38:59

领域: cs.DC,cs.LG,eess.SP

下载: http://arxiv.org/abs/2403.18509v2

HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy

Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizers to conserve GPU memory, which can potentially compromise the performance of LMs, as non-zeroth-order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full-parameter fine-tuning. (2) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. (3) HiFT can save more than 60% of GPU memory compared with standard full-parameter fine-tuning for a 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on a single 48G A6000 in 32-bit precision using the AdamW optimizer, without using any memory saving techniques.
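
The scheduling idea, independent of any particular optimizer, can be sketched as follows: partition parameters into groups (here, by layer) and update only one group per step. Actual memory savings additionally require keeping idle optimizer states and gradients off the GPU, which this toy omits; it shows only the hierarchical update schedule on a made-up model and objective.

    import torch
    import torch.nn as nn

    model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(6)])
    groups = [list(layer.parameters()) for layer in model]     # one group per layer
    opts = [torch.optim.AdamW(g, lr=1e-4) for g in groups]

    for step in range(100):
        x = torch.randn(8, 64)
        loss = model(x).pow(2).mean()                          # toy objective
        loss.backward()
        opts[step % len(groups)].step()                        # update one group only
        for opt in opts:
            opt.zero_grad(set_to_none=True)                    # free all gradients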

Updated: 2024-06-17 10:35:06

标题: HiFT:一种分层全参数微调策略

摘要: 全参数微调已成为适应下游任务的语言模型(LMs)的首选选择,因为其性能优异。随着LMs的规模增长,微调LMs的全部参数需要大量的GPU内存,这是无法承受的。现有方法利用零阶优化器来节省GPU内存,但这可能会损害LMs的性能,因为非零阶优化器往往更容易在大多数下游任务上收敛。在本文中,我们提出了一种新颖的独立于优化器的端到端分层微调策略,HiFT,它在每个训练步骤中仅更新参数的子集。HiFT可以显著减少梯度和优化器状态参数同时驻留在GPU内存中的数量,从而减少GPU内存使用量。我们的结果表明:(1)HiFT实现了与参数高效微调和标准全参数微调相当的性能。(2)HiFT支持包括AdamW、AdaGrad、SGD等在内的各种优化器。(3)与标准全参数微调相比,HiFT可以为7B模型节省超过60\%的GPU内存。(4)HiFT使得在单个48G A6000上使用AdamW优化器对7B模型进行全参数微调,精度为32,而不使用任何内存节省技术。

更新时间: 2024-06-17 10:35:06

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2401.15207v3

DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting

Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes for sample generation. We propose a novel approach called DistPred for regression and forecasting tasks, which overcomes the limitations of existing methods while remaining simple and powerful. Specifically, we transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form and use it as a loss function to train the model end-to-end. This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable. We have compared our method with several existing approaches on multiple datasets and achieved state-of-the-art performance. Additionally, our method significantly improves computational efficiency. For example, compared to state-of-the-art models, DistPred has a 90x faster inference speed. Experimental results can be reproduced through https://github.com/Anoise/DistPred.
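
One standard proper scoring rule that fits this recipe is the sample-based energy score (its univariate case coincides with the CRPS): it is differentiable in the samples, so a model that emits m samples per input in one forward pass can be trained on it end-to-end. The sketch below is an instance of the idea, not necessarily the authors' exact scoring rule.

    import torch

    def energy_score(samples, y):
        # samples: [m, B] - m samples per target from one forward pass;
        # y: [B] - observed responses. Lower is better; proper scoring rule.
        term1 = (samples - y.unsqueeze(0)).abs().mean()
        term2 = (samples.unsqueeze(0) - samples.unsqueeze(1)).abs().mean()
        return term1 - 0.5 * term2

    m, B = 50, 16
    samples = torch.randn(m, B, requires_grad=True)   # stand-in for model output
    loss = energy_score(samples, torch.randn(B))
    loss.backward()                                    # gradients flow to the model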

Updated: 2024-06-17 10:33:00

标题: DistPred:一种用于回归和预测的无分布概率推断方法

摘要: 传统的回归和预测任务通常只提供确定性的点估计。为了估计响应变量的不确定性或分布信息,通常使用类似贝叶斯推断、模型集成或MC Dropout等方法。这些方法要么假设样本的后验分布遵循高斯过程,要么需要数千次前向传递来生成样本。我们提出了一种称为DistPred的新方法,用于回归和预测任务,它克服了现有方法的局限性,同时保持简单和强大。具体来说,我们将衡量预测分布和目标分布之间差异的适当评分规则转化为可微分的离散形式,并将其用作损失函数来端到端地训练模型。这使得模型能够在单次前向传递中采样大量样本,以估计响应变量的潜在分布。我们已经在多个数据集上将我们的方法与几种现有方法进行了比较,并取得了最新的性能。此外,我们的方法显著提高了计算效率。例如,与最新的模型相比,DistPred 的推理速度提高了90倍。实验结果可以通过https://github.com/Anoise/DistPred 进行复现。

更新时间: 2024-06-17 10:33:00

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2406.11397v1

Pointer Networks with Q-Learning for Combinatorial Optimization

We introduce the Pointer Q-Network (PQN), a hybrid neural architecture that integrates model-free Q-value policy approximation with Pointer Networks (Ptr-Nets) to enhance the optimality of attention-based sequence generation, focusing on long-term outcomes. This integration proves particularly effective in solving combinatorial optimization (CO) tasks, especially the Travelling Salesman Problem (TSP), which is the focus of our study. We address this challenge by defining a Markov Decision Process (MDP) compatible with PQN, which involves iterative graph embedding, encoding and decoding by an LSTM-based recurrent neural network. This process generates a context vector and computes raw attention scores, which are dynamically adjusted by Q-values calculated for all available state-action pairs before applying softmax. The resulting attention vector is used as an action distribution, with actions selected according to PQN's dynamic exploration-exploitation adaptability. Our empirical results demonstrate the efficacy of this approach, and we also test the model in unstable environments.
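
A minimal sketch of the adjustment step described above: raw pointer-attention scores are shifted by Q-values before the softmax, and the result is treated as an action distribution. The additive combination and the visited-city mask are assumptions for illustration, not necessarily the paper's exact formulation.

```python
# Sketch of Q-adjusted pointer attention for TSP-style action selection.
import torch

def pqn_action_distribution(attn_scores, q_values, visited):
    adjusted = attn_scores + q_values                   # Q-adjusted scores (assumed additive)
    adjusted = adjusted.masked_fill(visited, float("-inf"))  # exclude visited cities
    return torch.softmax(adjusted, dim=-1)              # action distribution

scores = torch.randn(1, 10)                  # pointer attention over 10 cities
qvals = torch.randn(1, 10)                   # Q(s, a) for each candidate city
visited = torch.zeros(1, 10, dtype=torch.bool)
probs = pqn_action_distribution(scores, qvals, visited)
action = torch.multinomial(probs, 1)         # sampling provides exploration
```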

Updated: 2024-06-17 10:27:54

标题: 使用Q学习的指针网络用于组合优化

摘要: 我们介绍了指针Q网络(PQN),这是一种混合神经架构,将无模型Q值策略逼近与指针网络(Ptr-Nets)相结合,以增强基于注意力的序列生成的最优性,重点关注长期结果。这种整合在解决组合优化(CO)任务,尤其是旅行推销员问题(TSP)方面特别有效,这也是我们研究的重点。我们通过定义一个与PQN兼容的马尔科夫决策过程(MDP)来解决这一挑战,其中涉及通过基于LSTM的递归神经网络进行迭代图嵌入、编码和解码。这个过程生成一个上下文向量,并计算原始的注意力分数,这些分数通过为所有可用状态-动作对计算的Q值进行动态调整后应用softmax。产生的注意力向量被用作动作分布,动作的选择取决于PQN的探索-利用动态适应能力。我们的实证结果证明了这种方法的有效性,同时还在不稳定环境中测试了该模型。

更新时间: 2024-06-17 10:27:54

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2311.02629v3

KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery

This paper describes the KnowledgeHub tool, a scientific literature Information Extraction (IE) and Question Answering (QA) pipeline. This is achieved by supporting the ingestion of PDF documents that are converted to text and structured representations. An ontology can then be constructed where a user defines the types of entities and relationships they want to capture. A browser-based annotation tool enables annotating the contents of the PDF documents according to the ontology. Named Entity Recognition (NER) and Relation Classification (RC) models can be trained on the resulting annotations and can be used to annotate the unannotated portion of the documents. A knowledge graph is constructed from these entity and relation triples which can be queried to obtain insights from the data. Furthermore, we integrate a suite of Large Language Models (LLMs) that can be used for QA and summarisation that is grounded in the included documents via a retrieval component. KnowledgeHub is a unique tool that supports annotation, IE and QA, which gives the user full insight into the knowledge discovery pipeline.

Updated: 2024-06-17 10:23:46

标题: KnowledgeHub:辅助科学发现的端到端工具

摘要: 本文介绍了KnowledgeHub工具,一个科学文献信息提取(IE)和问答(QA)管道。通过支持将PDF文档转换为文本和结构化表示形式来实现这一目标。然后可以构建本体论,在其中用户定义要捕获的实体类型和关系。基于浏览器的注释工具使用户可以根据本体论注释PDF文档的内容。命名实体识别(NER)和关系分类(RC)模型可以基于结果注释进行训练,并可用于注释文档的未注释部分。从这些实体和关系三元组构建知识图,可以查询以获取数据洞察。此外,我们集成了一套大型语言模型(LLMs),可用于QA和摘要,通过检索组件与包含的文档相关联。KnowledgeHub是一个独特的工具,支持注释、IE和QA,使用户完全了解知识发现管道。

更新时间: 2024-06-17 10:23:46

领域: cs.IR,cs.AI,cs.CL,cs.DL

下载: http://arxiv.org/abs/2406.00008v2

AFS-BM: Enhancing Model Performance through Adaptive Feature Selection with Binary Masking

We study the problem of feature selection in a general machine learning (ML) context, one of the most critical subjects in the field. Although many feature selection methods exist, they face challenges such as scalability, managing high-dimensional data, dealing with correlated features, adapting to variable feature importance, and integrating domain knowledge. To this end, we introduce "Adaptive Feature Selection with Binary Masking" (AFS-BM), which remedies these problems. AFS-BM achieves this through joint optimization for simultaneous feature selection and model training. In particular, we use joint optimization and binary masking to continuously adapt the set of features and model parameters during the training process. This approach leads to significant improvements in model accuracy and a reduction in computational requirements. We provide an extensive set of experiments in which we compare AFS-BM with established feature selection methods using well-known datasets from real-life competitions. Our results show that AFS-BM yields significant improvements in accuracy while requiring significantly less computational complexity. This is due to AFS-BM's ability to dynamically adjust to the changing importance of features during the training process, which is an important contribution to the field. We openly share our code for the replicability of our results and to facilitate further research.
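
The joint optimization of a binary feature mask and model weights might look like the following sketch, which uses a straight-through estimator so the hard 0/1 mask stays differentiable. The sigmoid-threshold mechanism is an assumed stand-in for the paper's masking rule.

```python
# A sketch of joint feature-mask and model training with a straight-through
# binary mask (assumed mechanism; AFS-BM's exact masking rule may differ).
import torch
import torch.nn as nn

class BinaryMask(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_features))  # one logit per feature

    def forward(self, x):
        probs = torch.sigmoid(self.logits)
        hard = (probs > 0.5).float()              # hard 0/1 feature selection
        mask = hard + probs - probs.detach()      # straight-through gradient
        return x * mask

mask = BinaryMask(20)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(mask.parameters()) + list(model.parameters()), lr=1e-3)

x, y = torch.randn(64, 20), torch.randn(64, 1)
for _ in range(100):                              # joint optimisation loop
    loss = nn.functional.mse_loss(model(mask(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```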

Updated: 2024-06-17 10:22:06

标题: AFS-BM:通过二进制掩码的自适应特征选择提高模型性能

摘要: 我们研究了在通用机器学习(ML)环境中的特征选择问题,这是该领域中最关键的主题之一。尽管存在许多特征选择方法,但这些方法面临着诸如可扩展性、管理高维数据、处理相关特征、适应可变特征重要性以及整合领域知识等挑战。为此,我们介绍了“自适应二进制掩码特征选择”(AFS-BM)来解决这些问题。AFS-BM通过联合优化实现了同时特征选择和模型训练。具体来说,我们进行联合优化和二进制掩码以在训练过程中不断调整特征集和模型参数。这种方法导致了模型准确度的显著提高和计算要求的降低。我们进行了大量实验,将AFS-BM与已建立的特征选择方法在来自现实比赛的知名数据集上进行了比较。我们的结果显示,AFS-BM在准确性方面取得了显著改进,并且需要明显较少的计算复杂度。这是因为AFS-BM能够在训练过程中动态调整特征的重要性,这对该领域是一个重要的贡献。我们公开分享我们的代码以便复制我们的结果并促进进一步研究。

更新时间: 2024-06-17 10:22:06

领域: cs.LG,eess.SP,stat.ML

下载: http://arxiv.org/abs/2401.11250v2

P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models

A multitude of industries depend on accurate and reasonable tabular data augmentation for their business processes. Contemporary methodologies for generating tabular data revolve around utilizing Generative Adversarial Networks (GANs) or fine-tuning Large Language Models (LLMs). However, GAN-based approaches are documented to produce samples with common-sense errors, attributed to the absence of external knowledge. On the other hand, LLM-based methods exhibit a limited capacity to capture the disparities between the synthesized and actual data distributions, due to the absence of feedback from a discriminator during training. Furthermore, the decoding step of LLM-based generation introduces gradient breakpoints, impeding the backpropagation of loss from a discriminator and thereby complicating the integration of these two approaches. To solve this challenge, we propose using proximal policy optimization (PPO) to apply GANs, guiding LLMs to enhance the probability distribution of tabular features. This approach enables the use of LLMs as generators for GANs in synthesizing tabular data. Our experiments demonstrate that PPO leads to an approximately 4% improvement in the accuracy of models trained on synthetically generated data over the state of the art across three real-world datasets.

Updated: 2024-06-17 10:22:00

标题: P-TA:使用近端策略优化通过大型语言模型增强表格数据增强

摘要: 许多行业依赖准确和合理的表格数据增强其业务流程。当代生成表格数据的方法主要集中在利用生成对抗网络(GAN)或微调大型语言模型(LLM)。然而,基于GAN的方法据报道会产生具有常识错误的样本,这是由于缺乏外部知识所致。另一方面,基于LLM的方法在训练过程中缺乏来自鉴别器的反馈,因此展现出捕捉合成和实际数据分布差异的能力有限。此外,基于LLM的生成解码引入了梯度断点,阻碍了从鉴别器的损失反向传播,从而使得这两种方法的整合变得复杂。为解决这一挑战,我们提出使用近端策略优化(PPO)来应用GAN,引导LLM增强表格特征的概率分布。这种方法使得LLM可以作为GAN的生成器来合成表格数据。我们的实验表明,PPO使得在三个真实数据集上通过合成数据训练的模型的准确率提升了约4\%,超越了现有技术水平。

更新时间: 2024-06-17 10:22:00

领域: cs.LG

下载: http://arxiv.org/abs/2406.11391v1

Unfolding Time: Generative Modeling for Turbulent Flows in 4D

A recent study in turbulent flow simulation demonstrated the potential of generative diffusion models for fast 3D surrogate modeling. This approach eliminates the need for specifying initial states or performing lengthy simulations, significantly accelerating the process. While adept at sampling individual frames from the learned manifold of turbulent flow states, the previous model lacks the capability to generate sequences, hindering analysis of dynamic phenomena. This work addresses this limitation by introducing a 4D generative diffusion model and a physics-informed guidance technique that enables the generation of realistic sequences of flow states. Our findings indicate that the proposed method can successfully sample entire subsequences from the turbulent manifold, even though generalizing from individual frames to sequences remains a challenging task. This advancement opens doors for the application of generative modeling in analyzing the temporal evolution of turbulent flows, providing valuable insights into their complex dynamics.

Updated: 2024-06-17 10:21:01

标题: Unfolding Time: 4D中湍流流动的生成建模

摘要: 最近一项在湍流模拟领域的研究展示了生成扩散模型在快速3D替代建模中的潜力。该方法消除了指定初始状态或进行漫长模拟的需要,显著加快了过程。虽然擅长从学习的湍流状态流形中采样单个帧,但以前的模型缺乏生成序列的能力,阻碍了动态现象的分析。本研究通过引入4D生成扩散模型和一种物理信息引导技术来解决这一局限,使得能够生成真实的流态序列。我们的发现表明,提出的方法可以成功地从湍流流形中采样整个子序列,尽管从单个帧到序列的泛化仍然是一项具有挑战性的任务。这一进展为应用生成建模分析湍流流态的时间演变打开了大门,为了深入了解其复杂动态提供了有价值的见解。

更新时间: 2024-06-17 10:21:01

领域: physics.flu-dyn,cs.LG

下载: http://arxiv.org/abs/2406.11390v1

SEFraud: Graph-based Self-Explainable Fraud Detection via Interpretative Mask Learning

Graph-based fraud detection has widespread application in modern industry scenarios, such as spam review and malicious account detection. While considerable efforts have been devoted to designing adequate fraud detectors, the interpretability of their results has often been overlooked. Previous works have attempted to generate explanations for specific instances using post-hoc explaining methods such as a GNNExplainer. However, post-hoc explanations cannot facilitate the model predictions, and the computational cost of these methods cannot meet practical requirements, thus limiting their application in real-world scenarios. To address these issues, we propose SEFraud, a novel graph-based self-explainable fraud detection framework that tackles fraud detection and result interpretability simultaneously. Concretely, SEFraud first leverages customized heterogeneous graph transformer networks with learnable feature masks and edge masks to learn expressive representations from informative, heterogeneously typed transactions. A new triplet loss is further designed to enhance the performance of mask learning. Empirical results on various datasets demonstrate the effectiveness of SEFraud, which shows considerable advantages in both fraud detection performance and the interpretability of prediction results. Moreover, SEFraud has been deployed and offers an explainable fraud detection service for the largest bank in China, Industrial and Commercial Bank of China Limited (ICBC). Results collected from ICBC's production environment show that SEFraud provides accurate detection results and comprehensive explanations that align with expert business understanding, confirming its efficiency and applicability in large-scale online services.

Updated: 2024-06-17 10:18:53

标题: SEFraud:通过解释性掩码学习实现基于图的自解释欺诈检测

摘要: 基于图的欺诈检测在现代工业场景中有着广泛的应用,例如垃圾评论和恶意账户检测。虽然人们已经付出了大量努力设计合适的欺诈检测器,但对其结果的可解释性往往被忽视。先前的研究尝试使用事后解释方法(如GNNExplainer)为特定实例生成解释。然而,事后解释无法促进模型预测,并且这些方法的计算成本无法满足实际需求,从而限制了它们在现实场景中的应用。为了解决这些问题,我们提出了SEFraud,一种新颖的基于图的自解释欺诈检测框架,同时处理欺诈检测并实现结果的可解释性。具体来说,SEFraud首先利用定制的异质图变换网络、可学习的特征掩码和边缘掩码,从信息丰富的异构类型交易中学习表达性表示。进一步设计了一种新的三元损失来增强掩码学习的性能。在各种数据集上的实证结果表明,SEFraud的有效性,它在欺诈检测性能和预测结果可解释性方面均显示出明显优势。此外,SEFraud已经部署并为中国最大的银行——中国工商银行提供可解释的欺诈检测服务。来自中国工商银行生产环境的结果表明,SEFraud能够提供准确的检测结果和符合专业业务理解的全面解释,证实了其在大规模在线服务中的效率和适用性。

更新时间: 2024-06-17 10:18:53

领域: cs.LG

下载: http://arxiv.org/abs/2406.11389v1

Benchmarking General Purpose In-Context Learning

In-context learning (ICL) is becoming increasingly appealing to the AI community due to its flexibility, generality, sample efficiency, and exemption from artificial optimization skills. It is desirable to further enhance the generality and capability of ICL, which gives rise to the concept of general-purpose in-context learning (GPICL). We aim to extend ICL to address a broader range of tasks with an extended learning horizon and higher improvement potential, albeit with relatively limited zero-shot generalization. To this end, we introduce two lightweight but insightful benchmarks specifically crafted to train and evaluate GPICL functionalities. Each benchmark includes a vast number of tasks characterized by significant task variance, featuring minimal inductive bias. These tasks are also designed to facilitate lifelong in-context learning through continuous generation and interaction. These features pose significant challenges for models that rely on context or interactions to improve their proficiency, including language models, decision models, and world models. Our experiments reveal that the scale of parameters alone may not be crucial for ICL or GPICL, suggesting alternative approaches such as increasing the scale of contexts and memory states.

Updated: 2024-06-17 10:12:59

标题: 基准测试通用情境学习

摘要: 上下文学习(ICL)因其灵活性、通用性、样本效率和免除人工优化技能而越来越受人工智能社区的青睐。进一步提高ICL的通用性和能力是可取的,这引发了通用目的上下文学习(GPICL)的概念。我们的目标是将ICL扩展到更广泛的任务范围,具有更长的学习视野和更高的改进潜力,尽管零点通用性相对有限。为此,我们引入了两个轻量级但富有见地的基准,专门设计用于训练和评估GPICL功能。每个基准包括大量任务,具有显着的任务差异,并具有最小的归纳偏差。这些任务还旨在通过持续生成和交互促进终身上下文学习。这些特性对依赖上下文或交互来提高其熟练程度的模型提出了重大挑战,包括语言模型、决策模型和世界模型。我们的实验表明,仅参数规模可能对ICL或GPICL并不关键,这表明了增加上下文和记忆状态规模等替代方法。

更新时间: 2024-06-17 10:12:59

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2405.17234v4

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Large-scale pre-trained generative models are taking the world by storm, owing to their ability to generate creative content. Meanwhile, safeguards for these generative models have been developed to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both a vision language model and a large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive experiments, we reveal the toxicity of popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found at https://github.com/GuanlinLee/ART.

Updated: 2024-06-17 10:00:33

标题: ART:用于保护良性用户的文本到图像模型的自动红队测试

摘要: 大规模预训练生成模型正席卷世界,因其在生成创意内容方面的能力。与此同时,为保护用户权利和安全而开发了这些生成模型的安全防护措施,其中大部分是为大型语言模型设计的。现有方法主要集中在越狱和对抗性攻击上,主要评估模型在恶意提示下的安全性。最近的工作发现,手工制作的安全提示可能会无意中触发不安全的生成。为了进一步系统评估文本到图像模型的安全风险,我们提出了一种新颖的自动红队框架,ART。我们的方法利用视觉语言模型和大型语言模型建立不安全生成和其提示之间的联系,从而更有效地识别模型的漏洞。通过我们的综合实验,我们揭示了流行的开源文本到图像模型的有毒性。实验还验证了ART的有效性、适应性和极大的多样性。此外,我们引入了三个用于研究文本到图像模型相关安全风险的大规模红队数据集。数据集和模型可在https://github.com/GuanlinLee/ART找到。

更新时间: 2024-06-17 10:00:33

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2405.19360v2

Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals

In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we introduce a dual-stream subtraction mechanism, a deep Boosting ensemble learning method, and renovate the vanilla Transformer by reorienting its information aggregation mechanism from addition to subtraction. We then incorporate an auxiliary output branch into each block of the original model to construct a highway leading to the ultimate prediction. The output of subsequent modules in this branch subtracts the previously learned results, enabling the model to learn the residuals of the supervision signal layer by layer. This design facilitates a learning-driven, implicit, progressive decomposition of the input and output streams, endowing the model with heightened versatility, interpretability, and resilience against overfitting. Since all aggregations in the model use minus signs, it is called Minusformer. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets. The code has been released at https://github.com/Anoise/Minusformer.
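
A rough sketch of the dual-stream wiring: each block subtracts what it has learned from the input stream and feeds an auxiliary output into a highway toward the final prediction. The subtraction-based input stream and the auxiliary highway follow the abstract; the exact aggregation rule on the output branch is an assumption here.

```python
# Sketch of subtraction-based dual-stream blocks in the spirit of Minusformer.
import torch
import torch.nn as nn

class MinusBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.head = nn.Linear(d, d)

    def forward(self, x):
        h = self.body(x)
        return x - h, self.head(h)   # subtract learned part; emit auxiliary output

blocks = nn.ModuleList(MinusBlock(64) for _ in range(4))
stream = torch.randn(32, 64)
y_hat = torch.zeros(32, 64)
for blk in blocks:
    stream, aux = blk(stream)
    y_hat = aux - y_hat              # output highway also aggregates by subtraction (assumed)
```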

Updated: 2024-06-17 09:59:43

标题: Minusformer:通过逐步学习残差改进时间序列预测

摘要: 在这篇论文中,我们发现普遍存在的时间序列(TS)预测模型容易严重过拟合。为了解决这个问题,我们采用一种去冗余的方法,逐步恢复TS的内在价值以供未来时间段使用。具体来说,我们引入了一个双流和减法机制,这是一种深度Boosting集成学习方法。同时,我们通过将信息聚合机制从加法转变为减法来更新基础Transformer模型。然后,我们在原始模型的每个块中加入一个辅助输出分支,构建一条通往最终预测结果的高速公路。该分支中后续模块的输出将减去先前学到的结果,使模型逐层学习监督信号的残差。这种设计促进了输入和输出流的逐步分解,使模型具有更高的灵活性、可解释性和抗过拟合能力。由于模型中所有聚合都是减号,因此被称为Minusformer。大量实验表明,提出的方法优于现有的最先进方法,在各种数据集上平均性能提高了11.9%。代码已发布在https://github.com/Anoise/Minusformer。

更新时间: 2024-06-17 09:59:43

领域: cs.LG

下载: http://arxiv.org/abs/2402.02332v3

Beyond Embeddings: The Promise of Visual Table in Visual Reasoning

Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.

Updated: 2024-06-17 09:57:09

标题: 超越嵌入:视觉推理中视觉表的潜力

摘要: 视觉表征学习一直是计算机视觉的基石,涉及典型形式,如视觉嵌入、结构符号和基于文本的表征。尽管CLIP类型的视觉嵌入取得了成功,但它们常常缺乏对视觉推理至关重要的世界知识的访问权限。在这项工作中,我们提出了Visual Table,一种专为视觉推理量身定制的新形式的视觉表征。视觉表是作为视觉场景的层次描述构建的,具有场景描述和涵盖类别、属性和知识的多个以物体为中心的描述。由于结构和文本格式,视觉表提供了与仅视觉嵌入相比的独特优势,如可解释性和可控编辑。此外,它们提供了实例级的世界知识和详细的属性,这对于视觉推理至关重要。为了创建视觉表,我们开发了一个在收集的小规模注释数据集上训练的生成器。在11个视觉推理基准测试上的广泛结果表明,生成的视觉表明显优于以前的结构和基于文本的表征。此外,它们在各种基准测试中始终提升了最先进的多模态大型语言模型,展示了它们在推进视觉推理任务方面的潜力。我们的代码可以在https://github.com/LaVi-Lab/Visual-Table 上找到。

更新时间: 2024-06-17 09:57:09

领域: cs.CV,cs.AI,cs.CL,cs.LG,cs.MM

下载: http://arxiv.org/abs/2403.18252v2

Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?

Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA.

Updated: 2024-06-17 09:51:38

标题: 提升科学概念理解:教师模型的类比能够强化学生模型吗?

摘要: 类比推理在人类认知中起着至关重要的作用,使我们能够通过将新概念与熟悉概念联系起来来理解新概念。之前在人工智能领域的研究主要集中在识别和生成类比,然后通过人类评估检查它们的质量,这忽略了这些类比在现实世界中的实际应用。受人类教育过程的启发,在本文中,我们提出研究教师语言模型(LMs)创建的类比如何可以帮助学生LMs理解科学概念,从而更贴近实际情景。我们的研究结果表明,自由形式的类比确实可以帮助LMs理解概念。此外,学生LMs生成的类比可以提高他们自己在科学问题回答方面的表现,展示了他们利用类比进行自学新知识的能力。资源可在https://github.com/siyuyuan/SCUA找到。

更新时间: 2024-06-17 09:51:38

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11375v1

Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments

Large language models (LLMs) have shown promising abilities as cost-effective and reference-free evaluators for assessing language generation quality. In particular, pairwise LLM evaluators, which compare two generated texts and determine the preferred one, have been employed in a wide range of applications. However, LLMs exhibit preference biases and worrying sensitivity to prompt designs. In this work, we first reveal that the predictive preference of LLMs can be highly brittle and skewed, even with semantically equivalent instructions. We find that fairer predictive preferences from LLMs consistently lead to judgments that are better aligned with humans. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO, which aims to produce fairer preference decisions and improve the alignment of LLM evaluators with human judgments. To this end, we propose a zero-shot learning objective based on the preference decision fairness. ZEPO demonstrates substantial performance improvements over state-of-the-art LLM evaluators, without requiring labeled data, on representative meta-evaluation benchmarks. Our findings underscore the critical correlation between preference fairness and human alignment, positioning ZEPO as an efficient prompt optimizer for bridging the gap between LLM evaluators and human judgments.
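
A toy sketch of a fairness-driven prompt search in the spirit of ZEPO: score each candidate prompt by how evenly a pairwise LLM judge splits its preferences when candidate order is randomized, and keep the fairest prompt. The judge stub and the closeness-to-0.5 proxy objective are assumptions, not the paper's exact learning objective.

```python
# Toy fairness-driven prompt search; `judge` stands in for any LLM evaluator
# that returns 0 or 1 for which candidate it prefers.
import random

def preference_fairness(judge, prompt, pairs):
    # With randomised candidate order, a fair judge picks position 0 about
    # half the time, so we reward closeness to a 50/50 split.
    first_wins = sum(judge(prompt, a, b) == 0 for a, b in pairs)
    return -abs(first_wins / len(pairs) - 0.5)

def zepo_style_search(judge, candidate_prompts, pairs):
    return max(candidate_prompts,
               key=lambda p: preference_fairness(judge, p, pairs))

def toy_judge(prompt, a, b):                      # stand-in, biased to position 0
    return 0 if random.random() < 0.7 else 1

best = zepo_style_search(toy_judge,
                         ["Which response is better, A or B?",
                          "Compare the two responses fairly."],
                         [("resp_a", "resp_b")] * 100)
print(best)
```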

Updated: 2024-06-17 09:48:53

标题: 更公平的偏好引导改进人类对齐的大型语言模型判断

摘要: 大型语言模型(LLMs)已经显示出作为评估语言生成质量的成本有效和无需参考的评估器的潜力。特别是,成对的LLM评估器,用于比较两个生成的文本并确定首选文本,已经被广泛应用于各种应用中。然而,LLMs表现出偏好偏差和对提示设计的敏感性令人担忧。在这项工作中,我们首先揭示了LLMs的预测偏好可能非常脆弱和倾斜,即使是在语义上等效的指令下。我们发现,来自LLMs的更公平的预测偏好始终会导致与人类判断更加一致的判断。受到这一现象的启发,我们提出了一个自动的零样本评估导向的提示优化框架ZEPO,旨在产生更公平的偏好决策并改善LLM评估器与人类判断之间的一致性。为此,我们提出了一个基于偏好决策公平性的零样本学习目标。ZEPO在代表性的元评估基准上显示出明显的性能提升,而无需标记数据。我们的研究结果强调了偏好公平性与人类对齐之间的关键相关性,将ZEPO定位为LLM评估器和人类判断之间的桥梁的高效提示优化器。

更新时间: 2024-06-17 09:48:53

领域: cs.CL,cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2406.11370v1

Stockformer: A Price-Volume Factor Stock Selection Model Based on Wavelet Transform and Multi-Task Self-Attention Networks

As the Chinese stock market continues to evolve and its market structure grows increasingly complex, traditional quantitative trading methods face escalating challenges. In particular, due to policy uncertainty and the frequent market fluctuations triggered by sudden economic events, existing models often struggle to accurately predict market dynamics. To address these challenges, this paper introduces Stockformer, a price-volume factor stock selection model that integrates wavelet transformation and a multitask self-attention network, aimed at enhancing responsiveness and predictive accuracy regarding market instabilities. Through the discrete wavelet transform, Stockformer decomposes stock returns into high and low frequencies, meticulously capturing long-term market trends and short-term fluctuations, including abrupt events. Moreover, the model incorporates a Dual-Frequency Spatiotemporal Encoder and graph embedding techniques to effectively capture complex temporal and spatial relationships among stocks. Employing a multitask learning strategy, it simultaneously predicts stock returns and directional trends. Experimental results show that Stockformer outperforms existing advanced methods on multiple real stock market datasets. In strategy backtesting, Stockformer consistently demonstrates exceptional stability and reliability across market conditions, whether rising, falling, or fluctuating, and in particular maintains high performance during downturns or volatile periods, indicating high adaptability to market fluctuations. To foster innovation and collaboration in the financial analysis sector, the Stockformer model's code has been open-sourced and is available at the GitHub repository: https://github.com/Eric991005/Multitask-Stockformer.
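
The first step, splitting returns into low- and high-frequency components with a discrete wavelet transform, can be sketched with PyWavelets as follows; the db4 wavelet and single-level decomposition are illustrative choices, not necessarily the paper's configuration.

```python
# Sketch of the decomposition step: a single-level discrete wavelet transform
# splits a return series into trend and fluctuation components.
import numpy as np
import pywt

returns = np.random.standard_normal(256)          # toy daily return series
low, high = pywt.dwt(returns, "db4")              # approximation / detail coeffs
trend = pywt.idwt(low, None, "db4")               # long-term trend component
fluct = pywt.idwt(None, high, "db4")              # short-term fluctuations
```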

Updated: 2024-06-17 09:38:08

标题: Stockformer:基于小波变换和多任务自注意力网络的价格-成交量因子股票选择模型

摘要: 随着中国股市不断发展并且市场结构变得越来越复杂,传统的量化交易方法面临着不断升级的挑战。特别是由于政策不确定性和突发经济事件引发的频繁市场波动,现有模型往往难以准确预测市场动态。为了解决这些挑战,本文介绍了Stockformer,一种价格-成交量因子股票选择模型,该模型整合了小波变换和多任务自注意力网络,旨在增强对市场不稳定性的响应能力和预测准确性。通过离散小波变换,Stockformer将股票收益分解为高频和低频,精细捕捉长期市场趋势和短期波动,包括突发事件。此外,该模型还融合了双频空时编码器和图嵌入技术,有效捕捉股票之间复杂的时间和空间关系。采用多任务学习策略,它同时预测股票收益和方向性趋势。实验结果表明,Stockformer在多个真实股市数据集上表现优于现有先进方法。在策略回测中,Stockformer在各种市场条件下始终表现出异常的稳定性和可靠性-无论是上升、下跌还是波动-特别是在市场下行或波动期间保持高性能,表明其对市场波动具有高度适应性。为了促进金融分析领域的创新和合作,Stockformer模型的代码已经开源,并可在GitHub仓库上获得:https://github.com/Eric991005/Multitask-Stockformer。

更新时间: 2024-06-17 09:38:08

领域: q-fin.TR,cs.LG

下载: http://arxiv.org/abs/2401.06139v2

$\textit{Refiner}$: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities

Large Language Models (LLMs) are limited by their parametric knowledge, leading to hallucinations in knowledge-extensive tasks. To address this, Retrieval-Augmented Generation (RAG) incorporates external document chunks to expand LLM knowledge. Furthermore, compressing information from document chunks through extraction or summarization can improve LLM performance. Nonetheless, LLMs still struggle to notice and utilize scattered key information, a problem known as the "lost-in-the-middle" syndrome. Therefore, we typically need to restructure the content for the LLM to recognize the key information. We propose $\textit{Refiner}$, an end-to-end extract-and-restructure paradigm that operates in the post-retrieval process of RAG. $\textit{Refiner}$ leverages a single decoder-only LLM to adaptively extract query-relevant contents verbatim along with the necessary context, and to section them based on their interconnectedness, thereby highlighting information distinctions and aligning downstream LLMs with the original context effectively. Experiments show that a trained $\textit{Refiner}$ (with 7B parameters) yields significant gains for downstream LLMs in answer accuracy, and outperforms other state-of-the-art advanced RAG and concurrent compression approaches in various single-hop and multi-hop QA tasks. Notably, $\textit{Refiner}$ achieves an 80.5% token reduction and a 1.6-7.0% improvement margin in multi-hop tasks compared to the next best solution. $\textit{Refiner}$ is a plug-and-play solution that can be seamlessly integrated with RAG systems, facilitating its application across diverse open-source frameworks.

Updated: 2024-06-17 09:25:10

标题: $\textit{Refiner}$: 有效地重构检索内容以提升问答能力

摘要: 大型语言模型(LLMs)受其参数化知识的限制,在知识广泛的任务中会出现幻觉。为了解决这个问题,检索增强生成(RAG)将外部文档块纳入LLM知识的扩展。此外,通过从文档块中提取或总结信息可以改善LLM的性能。然而,LLMs仍然很难注意和利用分散的关键信息,这被称为“中间丢失”综合症的问题。因此,通常需要重新构造内容,以便LLM识别关键信息。我们提出了$\textit{Refiner}$,这是一个端到端提取和重构范例,它在RAG的后检索过程中运行。$\textit{Refiner}$利用一个仅具有解码器的LLM,自适应地提取查询相关内容,连同必要的上下文,并基于它们的相互关联性对这些内容进行部分划分,从而突出信息的区分,并有效地将下游LLMs与原始上下文对齐。实验证明,经过训练的$\textit{Refiner}$(具有7B参数)在提高答案准确性方面对下游LLM表现出显著的增益,并在各种单跳和多跳问答任务中优于其他最先进的RAG和同时的压缩方法。值得注意的是,与下一个最佳解决方案相比,$\textit{Refiner}$在多跳任务中实现了80.5%的标记减少和1.6-7.0%的改善幅度。$\textit{Refiner}$是一个即插即用的解决方案,可以无缝集成到RAG系统中,促进其在各种开源框架中的应用。

更新时间: 2024-06-17 09:25:10

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11357v1

DIDChain: Advancing Supply Chain Data Management with Decentralized Identifiers and Blockchain

Supply chain data management faces challenges in traceability, transparency, and trust. These issues stem from data silos and communication barriers. This research introduces DIDChain, a framework leveraging blockchain technology, Decentralized Identifiers, and the InterPlanetary File System. DIDChain improves supply chain data management. To address privacy concerns, DIDChain employs a hybrid blockchain architecture that combines public blockchain transparency with the control of private systems. Our hybrid approach preserves the authenticity and reliability of supply chain events. It also respects the data privacy requirements of the participants in the supply chain. Central to DIDChain is the cheqd infrastructure. The cheqd infrastructure enables digital tracing of asset events, such as an asset moving from the milk-producing dairy farm to the cheese manufacturer. In this research, assets are raw materials and products. The cheqd infrastructure ensures the traceability and reliability of assets in the management of supply chain data. Our contribution to blockchain-enabled supply chain systems demonstrates the robustness of DIDChain. Integrating blockchain technology through DIDChain offers a solution to data silos and communication barriers. With DIDChain, we propose a framework to transform the supply chain infrastructure across industries.

Updated: 2024-06-17 09:25:05

标题: DIDChain:利用去中心化标识和区块链推进供应链数据管理

摘要: 供应链数据管理在追溯性、透明度和信任方面面临挑战。这些问题源于数据孤岛和沟通障碍。本研究介绍了DIDChain,这是一个利用区块链技术、去中心化身份标识和星际文件系统的框架。DIDChain改进了供应链数据管理。为了解决隐私问题,DIDChain采用了混合区块链架构,将公共区块链透明度与私人系统控制相结合。我们的混合方法保留了供应链事件的真实性和可靠性。它也尊重参与供应链的参与者的数据隐私要求。DIDChain的核心是cheqd基础设施。cheqd基础设施实现了资产事件的数字追踪,例如从生产奶制品的奶牛场到奶酪制造商的资产移动。在这项研究中,资产是原材料和产品。cheqd基础设施确保了供应链数据管理中资产的追踪性和可靠性。我们对区块链供应链系统的贡献展示了DIDChain的强大性。通过DIDChain整合区块链技术提供了解决数据孤岛和沟通障碍的方案。使用DIDChain,我们提出了一个框架,可以改变各行业的供应链基础设施。

更新时间: 2024-06-17 09:25:05

领域: cs.CR,cs.NI

下载: http://arxiv.org/abs/2406.11356v1

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules. The subsampling modules are responsible for shortening the sequence, while the upsampling modules restore the sequence length, and the bypass modules enhance convergence. In comparison to LLaMA, the proposed SUBLLM exhibits significant enhancements in both training and inference speeds as well as memory usage, while maintaining competitive few-shot performance. During training, SUBLLM increases speeds by 26% and cuts memory by 10GB per GPU. In inference, it boosts speeds by up to 37% and reduces memory by 1GB per GPU. The training and inference speeds can be enhanced by 34% and 52% respectively when the context window is expanded to 8192. We shall release the source code of the proposed architecture in the published version.
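
A schematic of a subsampling-upsampling-bypass block might look like the sketch below, where the core layer runs on a shortened sequence and a bypass connection preserves the full-resolution path. The stride-2 subsampling, naive upsampling, and additive bypass are assumed simplifications of SUBLLM's learned modules.

```python
# An assumed simplification of a subsample-upsample-bypass block.
import torch
import torch.nn as nn

class SubsampleBypassBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.core = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)

    def forward(self, x):                          # x: (batch, seq, d)
        short = x[:, ::2, :]                       # subsample every other token
        short = self.core(short)                   # compute on the short sequence
        up = short.repeat_interleave(2, dim=1)     # naive upsampling
        up = up[:, : x.size(1), :]
        return x + up                              # bypass keeps the full-resolution path

blk = SubsampleBypassBlock(64)
out = blk(torch.randn(2, 16, 64))                  # shape preserved: (2, 16, 64)
```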

Updated: 2024-06-17 09:23:13

标题: SUBLLM:一种新颖高效的架构,采用令牌序列子采样用于LLM

摘要: 尽管大型语言模型(LLMs)在各个领域取得了显著的成功,但训练和推断的效率仍然是一个重要挑战。为了解决这个问题,我们提出了SUBLLM,即Subsampling-Upsampling-Bypass Large Language Model,这是一种创新的架构,通过整合子采样、上采样和旁路模块来扩展核心的解码器框架。子采样模块负责缩短序列长度,而上采样模块恢复序列长度,旁路模块增强收敛性。与LLaMA相比,提出的SUBLLM在训练和推断速度以及内存使用方面表现出显著的增强,同时保持竞争力的少样本表现。在训练过程中,SUBLLM将速度提高了26%,每个GPU减少了10GB的内存。在推断中,它将速度提高了高达37%,每个GPU减少了1GB内存。当上下文窗口扩展到8192时,训练和推断速度分别可以提高34%和52%。我们将在发表的版本中发布提出架构的源代码。

更新时间: 2024-06-17 09:23:13

领域: cs.CL,cs.AI,I.2.7

下载: http://arxiv.org/abs/2406.06571v2

Preserving Knowledge in Large Language Model: A Model-Agnostic Self-Decompression Approach

Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs) which are composed of the LLM base and visual projector (e.g. LLaVA), a significant decline in performance on language benchmarks was observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into the training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.

Updated: 2024-06-17 09:17:40

标题: 在大型语言模型中保留知识:一种与模型无关的自解压方法

摘要: 人类在学习新信息的同时可以保留旧知识,但大型语言模型(LLMs)在事后预训练或监督微调(SFT)领域特定数据时经常遭受灾难性遗忘。此外,对于由LLM基础和视觉投影仪(例如LLaVA)组成的多模态大型语言模型(MLLMs),与它们的单模态对应物相比,在语言基准测试中观察到了明显的性能下降。为了解决这些挑战,我们引入了一种新颖的模型无关的自解压方法,树生成(TG),它可以将LLMs内的知识解压到训练语料库中。本文重点介绍了TG-SFT,它可以为指导微调步骤合成生成SFT数据。通过在MLLMs的SFT过程中纳入转储的语料库,我们大大减少了遗忘问题。

更新时间: 2024-06-17 09:17:40

领域: cs.CL,cs.AI,cs.CV

下载: http://arxiv.org/abs/2406.11354v1

$\texttt{MoE-RBench}$: Towards Building Reliable Language Models with Sparse Mixture-of-Experts

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, the reliability assessment of MoE lags behind its surging applications. Moreover, when transferred to new domains, such as during fine-tuning, MoE models sometimes underperform their dense counterparts. Motivated by this research gap and counter-intuitive phenomenon, we propose $\texttt{MoE-RBench}$, the first comprehensive assessment of SMoE reliability from three aspects: $\textit{(i)}$ safety and hallucination, $\textit{(ii)}$ resilience to adversarial attacks, and $\textit{(iii)}$ out-of-distribution robustness. Extensive models and datasets are tested to compare MoE with dense networks along these reliability dimensions. Our empirical observations suggest that, with appropriate hyperparameters, training recipes, and inference techniques, we can build MoE models more reliably than dense LLMs. In particular, we find that the robustness of SMoE is sensitive to basic training settings. We hope this study can provide deeper insights into how to adapt pre-trained MoE models to other tasks with higher generation security, quality, and stability. Codes are available at https://github.com/UNITES-Lab/MoE-RBench

Updated: 2024-06-17 09:17:05

标题: $\texttt{MoE-RBench}$:朝向使用稀疏专家混合方法构建可靠的语言模型

摘要: 混合专家模型(MoE)作为扩展大型语言模型(LLMs)的有前途的框架,越来越受欢迎。然而,MoE的可靠性评估落后于其迅速增长的应用。此外,当转移到新领域,如微调MoE模型时,有时表现不如其密集对应物。受到研究差距和反直觉现象的启发,我们提出了$\texttt{MoE-RBench}$,这是对SMoE可靠性的首次全面评估,从三个方面考虑:$\textit{(i)}$ 安全性和幻觉,$\textit{(ii)}$ 对抗性攻击的韧性,以及 $\textit{(iii)}$ 处理超出分布的鲁棒性。通过测试广泛的模型和数据集,比较MoE与密集网络在这些可靠性维度上的表现。我们的实证观察表明,通过适当的超参数、训练配方和推理技术,我们可以更可靠地构建MoE模型,而不是密集LLM。特别是,我们发现SMoE的鲁棒性对基本训练设置敏感。我们希望这项研究可以深入了解如何将预训练的MoE模型适应其他具有更高一代安全性、质量和稳定性的任务。代码可在https://github.com/UNITES-Lab/MoE-RBench找到。

更新时间: 2024-06-17 09:17:05

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2406.11353v1

Full-ECE: A Metric For Token-level Calibration on Large Language Models

Deep Neural Networks (DNNs) excel in various domains but face challenges in providing accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs due to their vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.
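
One way to read "evaluates the entire predicted probability distribution" is to bin every entry of every predicted distribution, not just the top label, and compare average confidence with empirical frequency per bin, as in the NumPy sketch below. This is an illustrative interpretation; the paper's exact definition of Full-ECE may differ.

```python
# Illustrative full-distribution calibration error over token distributions.
import numpy as np

def full_ece(probs, labels, n_bins=10):
    N, V = probs.shape
    outcomes = np.zeros((N, V))
    outcomes[np.arange(N), labels] = 1.0          # one-hot ground truth
    p, o = probs.ravel(), outcomes.ravel()        # every (position, token) pair
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            ece += m.mean() * abs(p[m].mean() - o[m].mean())
    return ece

probs = np.random.dirichlet(np.ones(50), size=1000)   # toy token distributions
labels = np.array([np.random.choice(50, p=row) for row in probs])
print(full_ece(probs, labels))
```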

Updated: 2024-06-17 09:07:58

标题: Full-ECE:大型语言模型上的标记级校准度量

摘要: 深度神经网络(DNNs)在各个领域表现出色,但在提供准确的不确定性估计方面面临挑战,这对于高风险应用至关重要。大型语言模型(LLMs)最近出现作为强大的工具,在语言任务中表现出色。然而,传统的校准度量如期望校准误差(ECE)和类别校准误差(cw-ECE)对于LLMs来说是不足够的,因为它们拥有庞大的词汇量、数据复杂性和分布焦点。为了解决这个问题,我们提出了一个称为完全校准的新型校准概念,并引入了其相应的度量标准,Full-ECE。Full-ECE评估整个预测的概率分布,为LLMs提供了更准确和更稳健的校准度量。

更新时间: 2024-06-17 09:07:58

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11345v1

Mitigating Bias for Question Answering Models by Tracking Bias Influence

Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in question answering (QA) models is especially harmful, as the output answers might be directly consumed by end users. Datasets exist to evaluate bias in QA models, while bias mitigation techniques for QA models remain under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would tend to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance becomes more biased, we infer that the query instance is biased. We then use the detected bias level as an optimization objective, forming a multi-task learning setting alongside the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method can be applied to multiple QA formulations across multiple bias categories. It significantly reduces the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.

Updated: 2024-06-17 09:06:24

标题: 通过跟踪偏见影响减轻问答模型的偏见

摘要: 各种自然语言处理任务的模型已被证明存在刻板印象,而问答(QA)模型中的偏见尤为有害,因为输出答案可能直接被最终用户消费。已经有数据集用于评估QA模型中的偏见,但QA模型的偏见缓解技术仍未得到充分探讨。在这项工作中,我们提出了BMBI,一种减轻多选问答模型偏见的方法。基于这样的直觉,即如果模型从一个有偏见的例子中学习,它可能更倾向于存在偏见,我们通过观察查询实例对另一个实例的影响来衡量查询实例的偏见水平。如果受影响的实例更有偏见,我们推断查询实例存在偏见。然后,我们将检测到的偏见水平用作优化目标,形成一个多任务学习设置,除了原始的QA任务外。我们进一步引入了一种新的偏见评估指标,以全面且敏感的方式量化偏见。我们展示了我们的方法可以应用于多个问答公式和多个偏见类别。它可以显着降低BBQ数据集中的所有9个偏见类别的偏见水平,同时保持可比较的QA准确性。

更新时间: 2024-06-17 09:06:24

领域: cs.CL,cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2310.08795v2

Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach

StarCraft II is a challenging benchmark for AI agents due to the necessity of both precise micro-level operations and strategic macro awareness. Previous works, such as AlphaStar and SCC, achieve impressive performance in tackling StarCraft II, but still exhibit deficiencies in long-term strategic planning and strategy interpretability. Emerging large language model (LLM) agents, such as Voyager and MetaGPT, present immense potential for solving intricate tasks. Motivated by this, we aim to validate the capabilities of LLMs on StarCraft II, a highly complex RTS game. To conveniently take full advantage of LLMs' reasoning abilities, we first develop a textual StarCraft II environment, called TextStarCraft II, with which LLM agents can interact. Secondly, we propose a Chain of Summarization method, including single-frame summarization for processing raw observations and multi-frame summarization for analyzing game information, providing command recommendations, and generating strategic decisions. Our experiment consists of two parts: first, an evaluation by human experts, which includes assessing the LLMs' mastery of StarCraft II knowledge and the performance of LLM agents in the game; second, the in-game performance of LLM agents, encompassing aspects like win rate and the impact of the Chain of Summarization. Experimental results demonstrate that: 1. LLMs possess the relevant knowledge and complex planning abilities needed to address StarCraft II scenarios; 2. Human experts consider the performance of LLM agents to be close to that of an average player with eight years of StarCraft II experience; 3. LLM agents are capable of defeating the built-in AI at the Harder (Lv5) difficulty level. We have open-sourced the code and released demo videos of LLM agents playing StarCraft II.
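
The Chain of Summarization loop can be caricatured in a few lines: a single-frame stage compresses each raw observation, and a multi-frame stage reasons over a sliding window of those summaries to pick an action. The prompts, window size, and llm stub below are illustrative assumptions, not the paper's prompts.

```python
# A toy sketch of the Chain of Summarization loop; `llm` stands in for any
# text-completion model.
def single_frame_summary(llm, raw_obs):
    return llm(f"Summarise this StarCraft II observation:\n{raw_obs}")

def multi_frame_decision(llm, summaries):
    history = "\n".join(summaries[-8:])            # sliding window of summaries
    return llm("Given these recent game summaries, analyse the situation, "
               f"recommend commands, and choose a strategic action:\n{history}")

summaries = []

def play_step(llm, raw_obs):
    summaries.append(single_frame_summary(llm, raw_obs))   # single-frame stage
    return multi_frame_decision(llm, summaries)            # multi-frame stage

llm_stub = lambda prompt: "train_scv; build_supply_depot"  # stand-in model
action = play_step(llm_stub, "minerals=50, gas=0, supply=14/15, ...")
```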

Updated: 2024-06-17 09:04:43

标题: 大型语言模型在星际争霸II中的表现:基准测试和一种摘要链方法

摘要: 星际争霸II是AI代理的一个具有挑战性的基准,因为需要精确的微观操作和战略宏观意识。先前的工作,如Alphastar和SCC,在处理星际争霸II方面取得了令人印象深刻的表现,但仍然存在长期战略规划和策略可解释性方面的不足。新兴的大型语言模型(LLM)代理,如Voyage和MetaGPT,在解决复杂任务方面展现了巨大潜力。受此激励,我们旨在验证LLMs在星际争霸II上的能力,这是一个高度复杂的即时战略游戏。为了方便充分利用LLMs的推理能力,我们首先开发了文本化的星际争霸II环境,称为TextStarCraft II,LLM代理可以与之交互。其次,我们提出了一种摘要链方法,包括单帧摘要用于处理原始观察和多帧摘要用于分析游戏信息,提供命令建议和生成战略决策。我们的实验包括两部分:首先是由人类专家评估,包括评估LLMs对星际争霸II知识的掌握程度和LLM代理在游戏中的表现;其次是LLM代理的游戏表现,涵盖方面如胜率和摘要链的影响。实验结果表明:1. LLMs具有处理星际争霸II场景所需的相关知识和复杂规划能力;2. 人类专家认为LLM代理的表现接近一个已经玩了八年星际争霸II的普通玩家的水平;3. LLM代理能够在Harder(Lv5)难度级别上击败内置AI。我们已经开源了代码并发布了LLM代理玩星际争霸II的演示视频。

更新时间: 2024-06-17 09:04:43

领域: cs.AI

下载: http://arxiv.org/abs/2312.11865v2

X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner

The effectiveness of traffic light control has been significantly improved by current reinforcement learning-based approaches via better cooperation among multiple traffic lights. However, a persisting issue remains: how to obtain a multi-agent traffic signal control algorithm with remarkable transferability across diverse cities? In this paper, we propose a Transformer on Transformer (TonT) model for cross-city meta multi-agent traffic signal control, named as X-Light: We input the full Markov Decision Process trajectories, and the Lower Transformer aggregates the states, actions, rewards among the target intersection and its neighbors within a city, and the Upper Transformer learns the general decision trajectories across different cities. This dual-level approach bolsters the model's robust generalization and transferability. Notably, when directly transferring to unseen scenarios, ours surpasses all baseline methods with +7.91% on average, and even +16.3% in some cases, yielding the best results.

Updated: 2024-06-17 09:02:50

标题: X-Light:使用Transformer on Transformer作为元多智能体强化学习器进行跨城市交通信号控制

摘要: 交通灯控制的效果通过当前基于强化学习的方法得到显著改善,通过更好地协作多个交通灯。然而,一个持续存在的问题是:如何获得一个在不同城市之间具有显著可转移性的多智能体交通信号控制算法?在本文中,我们提出了一种Transformer on Transformer(TonT)模型,用于跨城市元多智能体交通信号控制,命名为X-Light:我们输入完整的马尔可夫决策过程轨迹,下层Transformer在城市内目标交叉口及其邻居之间对状态、动作、奖励进行聚合,上层Transformer学习在不同城市之间的一般决策轨迹。这种双层方法增强了模型的健壮泛化能力和可转移性。值得注意的是,在直接转移到未知情景时,我们的方法平均超过所有基线方法7.91%,甚至在某些情况下超过16.3%,取得了最佳结果。

更新时间: 2024-06-17 09:02:50

领域: cs.AI

下载: http://arxiv.org/abs/2404.12090v3

COBIAS: Contextual Reliability in Bias Assessment

Large Language Models (LLMs) are trained on extensive web corpora, which enable them to understand and generate human-like text. However, this training process also results in inherent biases within the models. These biases arise from web data's diverse and often uncurated nature, containing various stereotypes and prejudices. Previous works on debiasing models rely on benchmark datasets to measure their method's performance. However, these datasets suffer from several pitfalls due to the highly subjective understanding of bias, highlighting a critical need for contextual exploration. We propose understanding the context of inputs by considering the diverse situations in which they may arise. Our contribution is two-fold: (i) we augment 2,291 stereotyped statements from two existing bias-benchmark datasets with points for adding context; (ii) we develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to assess a statement's contextual reliability in measuring bias. Our metric aligns with human judgment on contextual reliability of statements (Spearman's $\rho = 0.65, p = 3.4 * 10^{-60}$) and can be used to create reliable datasets, which would assist bias mitigation works.

Updated: 2024-06-17 09:01:25

标题: COBIAS:偏见评估中的情境可靠性

摘要: 大型语言模型(LLMs)是在广泛的网络语料库上进行训练的,这使它们能够理解和生成类似人类的文本。然而,这种训练过程也会导致模型内在的偏见。这些偏见源于网络数据的多样性和通常未经筛选的特性,包含各种刻板印象和偏见。先前对去偏见模型的研究依赖于基准数据集来衡量其方法的性能。然而,由于对偏见的高度主观理解,这些数据集存在几个缺陷,突显了对上下文探索的重要需求。我们建议通过考虑输入可能出现的多种情况来理解其上下文。我们的贡献有两个方面:(i)我们通过添加上下文点,补充了来自两个现有偏见基准数据集的2,291个刻板化陈述;(ii)我们开发了“面向上下文的偏见指标和评估分数”(COBIAS)来评估陈述在衡量偏见时的上下文可靠性。我们的度量与人类对陈述上下文可靠性的判断一致(Spearman's ρ = 0.65,p = 3.4 * 10^{-60}),可以用于创建可靠的数据集,帮助减轻偏见。

更新时间: 2024-06-17 09:01:25

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2402.14889v2

CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition

Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective in extracting informative features when faced with newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we have developed Accumulative Cross-modal Mapping Prompting (ACMP), to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. Then, when faced with newly-incoming modalities, these mapped features are able to provide effective prompts for which features should be extracted and prioritized. These prompts are accumulating throughout the continual learning process, thereby boosting further recognition performances. Extensive experiments conducted on the Drive&Act dataset demonstrate the performance superiority of CM2-Net on both uni- and multi-modal driver action recognition.

Updated: 2024-06-17 08:57:00

标题: CM2-Net:用于司机动作识别的持续交叉模态映射网络

摘要: 司机动作识别在增强驾驶员-车辆交互和确保驾驶安全方面取得了显著进展,通过整合多种模态,如红外和深度。然而,与仅RGB模态相比,在汽车舱环境中收集所有类型非RGB模态的大量数据总是费时费力且成本高昂的。因此,先前的研究建议独立学习每个非RGB模态,通过微调在RGB视频上预训练的模型,但是这些方法在面对新到来的模态时提取信息特征效果较差,因为存在较大的域差距。相比之下,我们提出了一个持续跨模态映射网络(CM2-Net),通过先前学习的模态的指导提示来持续学习每个新到来的模态。具体来说,我们开发了积累式跨模态映射提示(ACMP),将从先前模态学习到的具有区分性和信息性的特征映射到新到来模态的特征空间中。然后,在面对新到来的模态时,这些映射特征能够提供有效的提示,指导应该提取和优先处理哪些特征。这些提示在整个持续学习过程中不断累积,从而提升进一步的识别性能。在Drive&Act数据集上进行的广泛实验表明了CM2-Net在单一和多模态司机动作识别上的性能优越性。

更新时间: 2024-06-17 08:57:00

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.11340v1

A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models

As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-Generated Content (AIGC), the powerful capacity of retrieval in providing additional knowledge enables RAG to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations, such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged to harness external and authoritative knowledge bases, rather than solely relying on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research studies in RA-LLMs, covering three primary technical perspectives: architectures, training strategies, and applications. As the preliminary knowledge, we briefly introduce the foundations and recent advances of LLMs. Then, to illustrate the practical significance of RAG for LLMs, we systematically review mainstream relevant work by their architectures, training strategies, and application areas, detailing specifically the challenges of each and the corresponding capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss current limitations and several promising directions for future research. Updated information about this survey can be found at https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/

Updated: 2024-06-17 08:56:38

标题: RAG与LLMs相遇的综述:迈向检索增强的大型语言模型

摘要: 作为人工智能中最先进的技术之一,检索增强生成(RAG)可以提供可靠和最新的外部知识,为众多任务提供巨大便利。特别是在人工智能生成内容(AIGC)时代,检索在提供额外知识方面的强大能力使RAG能够辅助现有的生成式人工智能产生高质量的输出。最近,大型语言模型(LLMs)在语言理解和生成方面展示了革命性的能力,尽管仍面临固有的局限,如幻觉和过时的内部知识。鉴于RAG在提供最新和有用的辅助信息方面的强大能力,检索增强大型语言模型(RA-LLMs)已经出现,用于利用外部和权威知识库,而不仅仅依赖模型的内部知识,以增强LLMs的生成质量。在这项调查中,我们全面审查了现有RA-LLMs研究,涵盖了三个主要技术视角:架构、训练策略和应用。作为初步知识,我们简要介绍了LLMs的基础和最新进展。然后,为了说明RAG对LLMs的实际意义,我们系统地回顾了主流相关工作,通过它们的架构、训练策略和应用领域,具体详细地描述了每个挑战以及RA-LLMs的相应能力。最后,为了提供更深入的见解,我们讨论了当前的局限性和未来研究的几个有希望的方向。关于这项调查的更新信息可以在 https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/ 找到。

更新时间: 2024-06-17 08:56:38

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2405.06211v3

Quantum Algorithms for the Pathwise Lasso

We present a novel quantum high-dimensional linear regression algorithm with an $\ell_1$-penalty based on the classical LARS (Least Angle Regression) pathwise algorithm. Similarly to available classical algorithms for Lasso, our quantum algorithm provides the full regularisation path as the penalty term varies, but quadratically faster per iteration under specific conditions. A quadratic speedup on the number of features $d$ is possible by using the quantum minimum-finding routine from D\"urr and Hoyer (arXiv'96) in order to obtain the joining time at each iteration. We then improve upon this simple quantum algorithm and obtain a quadratic speedup both in the number of features $d$ and the number of observations $n$ by using the approximate quantum minimum-finding routine from Chen and de Wolf (ICALP'23). As one of our main contributions, we construct a quantum unitary to approximately compute the joining times to be searched over by the approximate quantum minimum finding. Since the joining times are no longer exactly computed, it is no longer clear that the resulting approximate quantum algorithm obtains a good solution. As our second main contribution, we prove, via an approximate version of the KKT conditions and a duality gap, that the LARS algorithm (and thus our quantum algorithm) is robust to errors. This means that it still outputs a path that minimises the Lasso cost function up to a small error if the joining times are approximately computed. Moreover, we show that, when the observations are sampled from a Gaussian distribution, our quantum algorithm's complexity only depends polylogarithmically on $n$, exponentially better than the classical LARS algorithm, while keeping the quadratic improvement on $d$. Finally, we propose a dequantised algorithm that also retains the polylogarithmic dependence on $n$, albeit with the linear scaling on $d$ from the standard LARS algorithm.
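
For reference, the classical object being accelerated, the full Lasso regularisation path with its joining times (the "kinks" at which features enter or leave the active set), can be computed with scikit-learn's lars_path as below; this is a classical baseline sketch on synthetic data, not the quantum algorithm.

```python
# Classical reference point: the full Lasso regularisation path via LARS.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
beta = np.zeros(50)
beta[:5] = rng.standard_normal(5)                 # 5 truly active features
y = X @ beta + 0.1 * rng.standard_normal(200)

alphas, active, coefs = lars_path(X, y, method="lasso")
# `alphas` are the kinks of the path: at each one a feature joins (or leaves)
# the active set. These joining times are what the quantum minimum-finding
# routine searches over at every iteration.
print(f"{len(alphas)} kinks; first feature to join: {active[0]}")
```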

Updated: 2024-06-17 08:55:21

标题: 路径Lasso的量子算法

摘要: 我们提出了一种基于经典LARS(Least Angle Regression)路径算法的带有$\ell_1$惩罚项的新型量子高维线性回归算法。与现有的lasso经典算法类似,我们的量子算法在惩罚项变化时提供完整的正则化路径,但在特定条件下每次迭代的速度可以快两次。通过使用D\"urr和Hoyer(arXiv'96)的量子最小查找例程来获取每次迭代的加入时间,可以实现特征数$d$的二次加速。然后,通过使用Chen和de Wolf(ICALP'23)的近似量子最小查找例程,我们改进了这个简单的量子算法,在特征数$d$和观测数$n$方面都实现了二次加速。作为我们的主要贡献之一,我们构建了一个量子酉矩阵,用于近似计算加入时间,以便由近似量子最小查找搜索。由于加入时间不再精确计算,因此结果近似量子算法是否能获得良好解决方案不再明确。作为我们的第二个主要贡献,通过KKT条件的近似版本和对偶间隙,我们证明了LARS算法(因此我们的量子算法)对错误具有鲁棒性。这意味着如果加入时间被近似计算,它仍然会输出一个最小化lasso成本函数的路径,直到产生一个小误差。此外,我们展示了当观察结果来自高斯分布时,我们的量子算法的复杂性仅在$n$上多对数地依赖,比经典LARS算法好指数倍,同时保持特征数$d$的二次改进。最后,我们提出了一个去量子化的算法,它也保留了对$n$的多对数依赖,尽管在$d$上具有标准LARS算法的线性缩放。

更新时间: 2024-06-17 08:55:21

领域: quant-ph,cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2312.14141v2

Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask

Time Series Representation Learning (TSRL) focuses on generating informative representations for various Time Series (TS) modeling tasks. Traditional Self-Supervised Learning (SSL) methods in TSRL fall into four main categories: reconstructive, adversarial, contrastive, and predictive, each with a common challenge of sensitivity to noise and intricate data nuances. Recently, diffusion-based methods have shown advanced generative capabilities. However, they primarily target specific application scenarios like imputation and forecasting, leaving a gap in leveraging diffusion models for generic TSRL. Our work, Time Series Diffusion Embedding (TSDE), bridges this gap as the first diffusion-based SSL TSRL approach. TSDE segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask. It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism, to the observed part. We train a reverse diffusion process conditioned on the embeddings, designed to predict noise added to the masked part. Extensive experiments demonstrate TSDE's superiority in imputation, interpolation, forecasting, anomaly detection, classification, and clustering. We also conduct an ablation study, present embedding visualizations, and compare inference speed, further substantiating TSDE's efficiency and validity in learning representations of TS data.
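
The IIF mask itself is easy to picture: depending on the regime, the masked (to-be-predicted) positions are scattered points, a contiguous interior window, or a suffix of the series. The sketch below is one simple reading of the three regimes; the paper's sampling scheme may differ.

```python
# One simple reading of the Imputation-Interpolation-Forecasting mask
# (True = masked position to be predicted, False = observed).
import numpy as np

def iif_mask(T, mode, ratio=0.2, rng=np.random):
    mask = np.zeros(T, dtype=bool)
    n = int(ratio * T)
    if mode == "imputation":                  # scattered missing points
        mask[rng.choice(T, n, replace=False)] = True
    elif mode == "interpolation":             # contiguous interior window
        start = rng.randint(1, T - n - 1)
        mask[start:start + n] = True
    elif mode == "forecasting":               # suffix of the series
        mask[T - n:] = True
    return mask

for mode in ("imputation", "interpolation", "forecasting"):
    print(mode, iif_mask(20, mode).astype(int))
```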

Updated: 2024-06-17 08:54:51

标题: 通过扩散过程和填补-插值-预测掩模进行时间序列表示的自监督学习

摘要: 时间序列表示学习(TSRL)专注于为各种时间序列(TS)建模任务生成信息丰富的表示。传统的自监督学习(SSL)方法在TSRL中分为四大类:重建、对抗、对比和预测,每种方法都面临对噪声和数据细微差异的敏感性的共同挑战。最近,基于扩散的方法展示了先进的生成能力。然而,它们主要针对特定的应用场景,如插补和预测,留下了一个在通用TSRL中利用扩散模型的空白。我们的工作,时间序列扩散嵌入(TSDE),作为第一个基于扩散的SSL TSRL方法,填补了这一空白。TSDE使用插补-内插-预测(IIF)掩模将TS数据分为观察部分和掩盖部分。它应用一个可训练的嵌入函数,具有具有交叉机制的双正交Transformer编码器,用于观察部分。我们训练一个基于嵌入的反向扩散过程,设计用于预测添加到掩盖部分的噪声。广泛的实验证明了TSDE在插补、内插、预测、异常检测、分类和聚类方面的优越性。我们还进行了消融研究,展示了嵌入可视化,并比较了推理速度,进一步证实了TSDE在学习TS数据表示方面的效率和有效性。

更新时间: 2024-06-17 08:54:51

领域: cs.LG,cs.AI,G.3; I.6.5; I.2.4

下载: http://arxiv.org/abs/2405.05959v2

Multimodal Security of Iris and Fingerprint with Bloom Filters

Standard identification methods such as PINs (Personal Identification Numbers), passwords, and smart cards can easily be stolen and misused. Biometrics overcome this, since they are unique to each individual. In the modern world, security has become a serious concern, and many biometric security schemes have emerged in response. Among biometrics, the iris has become a popular and widely used trait because of its stability, and it has been researched alongside computer technology for the past two decades. This paper covers the basic concepts of the iris and the fingerprint, and of securing iris templates with cancellable biometrics. Iris templates are typically stored on a server or in a database, where both data loss and security are major concerns. We propose a system for protecting iris templates using Bloom filters combined with feature-level fusion of the iris with the fingerprint. Bloom filters are a useful asset in computer science. In feature-level fusion, the feature sets originating from more than one biometric trait are consolidated into a single feature set by applying suitable feature normalization, transformation, and reduction schemes. Combining Bloom filters with feature-level fusion yields more secure iris templates.
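
For concreteness, here is a minimal Bloom filter in Python into which fused iris-plus-fingerprint feature elements could be inserted, so membership is testable without storing the raw template. The hashing scheme, parameters, and the fused-feature string format are illustrative assumptions, not the paper's construction.

```python
# A minimal Bloom filter; fused biometric feature elements can be inserted
# so membership is testable without storing the raw template.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):                # k independent hash positions
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("iris_block_0110|finger_minutiae_17")   # hypothetical fused element
assert "iris_block_0110|finger_minutiae_17" in bf
```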

Updated: 2024-06-17 08:54:22

标题: 基于布隆过滤器的虹膜与指纹多模态安全

摘要: 标准的身份验证方法,如个人识别号码(PIN码)、密码、智能卡等很容易被盗用和滥用。为了克服这一问题,生物识别技术被引入,因为它将是每个个体的独特标识。在这个现代世界中,安全已经成为一个严重的威胁。因此,许多生物识别安全技术也应运而生。在生物识别技术中,虹膜因其稳定性而变得更受欢迎和广泛使用。在过去的二十年里,虹膜与计算机技术一起被广泛研究。本文将介绍虹膜、指纹的基本概念,以及如何使用可取消生物识别技术保护虹膜。通常我们会获取虹膜模板并存储在服务器或数据库中。数据丢失的风险以及安全性是一个重要问题。我们提出了一种利用布隆过滤器和虹膜与指纹特征融合来保护虹膜模板的系统。布隆过滤器在计算机科学领域中是一个有用的资产。在特征级融合中,来自多个生物识别资产的功能单元通过适当的特征标准化、转换和减少方案合并为一个特征集。我们可以将布隆过滤器与特征级融合相结合,以提供更安全的虹膜模板。

更新时间: 2024-06-17 08:54:22

领域: cs.CR

下载: http://arxiv.org/abs/2406.11335v1

Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations

Incorporating natural language rationales in the prompt and In-Context Learning (ICL) has led to a significant improvement in Large Language Model (LLM) performance. However, generating high-quality rationales requires human annotation or the use of auxiliary proxy models. In this work, we propose Self-AMPLIFY, which automatically generates rationales from post hoc explanation methods applied to Small Language Models (SLMs) to improve their own performance. Self-AMPLIFY is a 3-step method that selects target samples, generates rationales, and builds a final prompt to leverage ICL. Self-AMPLIFY's performance is evaluated on four SLMs and five datasets requiring strong reasoning abilities. Self-AMPLIFY achieves good results against competitors, leading to strong accuracy improvements. It is the first method to apply post hoc explanation methods to autoregressive language models to generate rationales that improve their own performance in a fully automated manner.

Updated: 2024-06-17 08:52:29

标题: Self-AMPLIFY:使用自我事后解释改进小型语言模型

摘要: 在提示和上下文学习(ICL)中加入自然语言的理由导致了大型语言模型(LLMs)性能的显著提升。然而,生成高质量的理由需要人工注释或使用辅助代理模型。在这项工作中,我们提出了自动从事后解释方法中生成理由的Self-AMPLIFY,以改善小型语言模型(SLMs)的性能。Self-AMPLIFY是一个三步方法,目标是样本,生成理由并构建最终提示以利用ICL。Self-AMPLIFY的性能在四个SLMs和需要强大推理能力的五个数据集上进行评估。Self-AMPLIFY在竞争对手中取得了不错的结果,导致了强大的准确性改进。Self-AMPLIFY是第一个将事后解释方法应用于自回归语言模型以完全自动化地生成理由以改善其自身性能的方法。

更新时间: 2024-06-17 08:52:29

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2402.12038v3

Semantic-Aware Spectrum Sharing in Internet of Vehicles Based on Deep Reinforcement Learning

This work aims to investigate semantic communication in high-speed mobile Internet of vehicles (IoV) environments, with a focus on the spectrum sharing between vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications. We specifically address spectrum scarcity and network traffic and then propose a semantic-aware spectrum sharing algorithm (SSS) based on the deep reinforcement learning (DRL) soft actor-critic (SAC) approach. Firstly, we delve into the extraction of semantic information. Secondly, we redefine metrics for semantic information in V2V and V2I spectrum sharing in IoV environments, introducing high-speed semantic spectrum efficiency (HSSE) and semantic transmission rate (HSR). Finally, we employ the SAC algorithm for decision optimization in V2V and V2I spectrum sharing based on semantic information. This optimization encompasses the optimal link of V2V and V2I sharing strategies, the transmission power for vehicles sending semantic information and the length of transmitted semantic symbols, aiming at maximizing the HSSE of V2I and enhancing the success rate of effective semantic information transmission (SRS) of V2V. Experimental results demonstrate that the SSS algorithm outperforms other baseline algorithms, including other traditional-communication-based spectrum sharing algorithms and spectrum sharing algorithms using other reinforcement learning approaches. The SSS algorithm exhibits a 15% increase in HSSE and approximately a 7% increase in SRS.

Updated: 2024-06-17 08:51:31

标题: 基于深度强化学习的智能车联网语义感知频谱共享

摘要: 这项工作旨在研究高速移动车联网(IoV)环境中的语义通信,重点关注车辆对车辆(V2V)和车辆对基础设施(V2I)通信之间的频谱共享。我们特别关注频谱稀缺和网络流量,然后提出了一种基于深度强化学习(DRL)软行动者-评论家(SAC)方法的语义感知频谱共享算法(SSS)。首先,我们深入研究语义信息的提取。其次,我们重新定义了IoV环境中V2V和V2I频谱共享中的语义信息度量标准,引入了高速语义频谱效率(HSSE)和语义传输速率(HSR)。最后,我们利用SAC算法基于语义信息进行V2V和V2I频谱共享的决策优化。这种优化涵盖了V2V和V2I共享策略的最佳连接、发送语义信息的车辆传输功率以及传输的语义符号长度,旨在最大化V2I的HSSE并提高V2V的有效语义信息传输成功率(SRS)。实验结果表明,SSS算法优于其他基准算法,包括其他基于传统通信的频谱共享算法和使用其他强化学习方法的频谱共享算法。SSS算法在HSSE上表现出15%的增长,SRS大约增加了7%。

更新时间: 2024-06-17 08:51:31

领域: cs.LG

下载: http://arxiv.org/abs/2406.07213v3

Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

Large language and multimodal models have shown remarkable successes on various benchmarks focused on specific skills such as general-purpose programming, natural language understanding, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the XLogoOnline visual programming environment. The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment, each requiring a combination of different skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving only 20% and 2.35% success rates. Next, we develop a fine-tuning pipeline to boost the performance of models by leveraging a large-scale synthetic training dataset with over 80000 tasks. Moreover, we showcase how emulator-driven feedback can be used to design a curriculum over training data distribution. We showcase that a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models, and provide an in-depth analysis of the models' expertise across different skill dimensions. We will publicly release the benchmark for future research on program synthesis in visual programming.

Updated: 2024-06-17 08:48:02

标题: 在XLogoOnline环境中视觉编程的程序合成基准

摘要: 大型语言和多模态模型在各种专注于特定技能的基准测试中取得了显著成功,例如通用编程、自然语言理解、数学问题解决和视觉问答。然而,目前尚不清楚这些模型在需要结合这些技能的任务中表现如何。在本文中,我们基于XLogoOnline可视化编程环境策划了一个新颖的程序合成基准测试。该基准测试包括来自XLogoOnline环境的迷你级别的85个真实世界任务,每个任务都需要结合不同的技能,如空间规划、基本编程和逻辑推理。我们的评估显示,目前的最先进模型如GPT-4V和Llama3-70B在解决这些任务时遇到困难,成功率仅为20%和2.35%。接下来,我们开发了一个微调流程,通过利用拥有80000多个任务的大规模合成训练数据集来提升模型的性能。此外,我们展示了如何利用模拟器驱动的反馈来设计训练数据分布上的课程。我们展示了经过微调的Llama3-8B显著优于GPT-4V和Llama3-70B模型,并对模型在不同技能维度上的专业知识进行了深入分析。我们将公开发布这个基准测试,以供未来在可视化编程中的程序合成研究使用。

更新时间: 2024-06-17 08:48:02

领域: cs.AI

下载: http://arxiv.org/abs/2406.11334v1

They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias

Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66\% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
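
For readers unfamiliar with the cited fairness metrics, one common formulation of Skew@k from the ranking-fairness literature is sketched below (the paper may differ in details): the log-ratio between an attribute value's share of the top-k retrieved items and its desired share, with MaxSkew and MinSkew taken over attribute values.

    import math
    from collections import Counter

    def skew_at_k(retrieved, k, desired):
        """Skew@k per attribute value: log(observed share / desired share).

        retrieved: attribute labels of the ranked items
        desired:   dict mapping label -> target proportion (sums to 1)
        """
        counts = Counter(retrieved[:k])
        eps = 1e-9   # avoid log(0) for labels absent from the top-k
        return {a: math.log((counts.get(a, 0) / k + eps) / (p + eps))
                for a, p in desired.items()}

    ranking = ["male"] * 7 + ["female"] * 3
    skews = skew_at_k(ranking, k=10, desired={"male": 0.5, "female": 0.5})
    print(max(skews.values()))   # MaxSkew@10 (over-represented group)
    print(min(skews.values()))   # MinSkew@10 (under-represented group)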

Updated: 2024-06-17 08:42:19

标题: 他们都是医生:合成多样化反事实以减轻联想偏见

摘要: 视觉语言模型(VLMs)如CLIP是强大的模型;但是它们可能表现出不希望的偏见,使它们在直接部署于文本到图像、文本到视频检索、逆向搜索或分类任务等应用中时安全性降低。在这项工作中,我们提出了一个新颖的框架,用于生成合成反事实图像,创建一个多样化且平衡的数据集,可用于微调CLIP。给定一组来自文本到图像模型的多样化合成基础图像,我们利用现成的分割和图像修复模型将外观多样的人物放置在上下文中。我们展示了在这种数据集上训练的CLIP学会将人的外观与图像的上下文分离开来,即"医生"这一判断不与人的视觉外观(如肤色或体型)相关,而与上下文相关,如背景、所穿的服装或手持的物品。我们证明,我们微调的CLIP模型$CF_\alpha$在图像检索任务中将MaxSkew、MinSkew和NDKL等关键公平性指标改善了40-66%,同时在下游任务中仍达到类似水平的性能。我们还展示了,通过设计,我们的模型保留了与原始CLIP模型的最大兼容性,并且可以轻松地以即插即用的方式支持不同的准确性与公平性权衡。

更新时间: 2024-06-17 08:42:19

领域: cs.CV,cs.IR,cs.LG

下载: http://arxiv.org/abs/2406.11331v1

GitHub Copilot: the perfect Code compLeeter?

This paper aims to evaluate GitHub Copilot's generated code quality based on the LeetCode problem set using a custom automated framework. We evaluate the results of Copilot for 4 programming languages: Java, C++, Python3 and Rust. We aim to evaluate Copilot's reliability in the code generation stage, the correctness of the generated code and its dependency on the programming language, problem's difficulty level and problem's topic. In addition to that, we evaluate code's time and memory efficiency and compare it to the average human results. In total, we generate solutions for 1760 problems for each programming language and evaluate all the Copilot's suggestions for each problem, resulting in over 50000 submissions to LeetCode spread over a 2-month period. We found that Copilot successfully solved most of the problems. However, Copilot was rather more successful in generating code in Java and C++ than in Python3 and Rust. Moreover, in the case of Python3, Copilot proved to be rather unreliable in the code generation phase. We also discovered that Copilot's top-ranked suggestions are not always the best. In addition, we analysed how the topic of the problem impacts the correctness rate. Finally, based on statistics information from LeetCode, we can conclude that Copilot generates more efficient code than an average human.

Updated: 2024-06-17 08:38:29

标题: GitHub Copilot:完美的代码补全工具?

摘要: 本文旨在通过使用自定义自动化框架,基于LeetCode问题集评估GitHub Copilot生成的代码质量。我们评估了Copilot在4种编程语言(Java、C++、Python3和Rust)上的结果。我们旨在评估Copilot在代码生成阶段的可靠性,生成代码的正确性以及其对编程语言、问题难度级别和问题主题的依赖性。此外,我们评估了代码的时间和内存效率,并将其与平均人类结果进行比较。总共,我们为每种编程语言生成了1760个问题的解决方案,并评估了每个问题的所有Copilot建议,在为期2个月的时间内向LeetCode提交了超过50000次。我们发现Copilot成功解决了大多数问题。但是,Copilot在生成Java和C++代码方面比在Python3和Rust中更为成功。此外,在Python3的情况下,Copilot在代码生成阶段表现得相当不可靠。我们还发现Copilot的排名靠前的建议并不总是最好的。此外,我们分析了问题主题如何影响正确率。最后,根据LeetCode的统计信息,我们可以得出结论,Copilot生成的代码比平均人类更有效率。

更新时间: 2024-06-17 08:38:29

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2406.11326v1

Deep-Learning-Based Channel Estimation for Distributed MIMO with 1-bit Radio-Over-Fiber Fronthaul

We consider the problem of pilot-aided, uplink channel estimation in a distributed massive multiple-input multiple-output (MIMO) architecture, in which the access points are connected to a central processing unit via fiber-optical fronthaul links, carrying a two-level-quantized version of the received analog radio-frequency signal. We adapt to this architecture the deep-learning-based channel-estimation algorithm recently proposed by Nguyen et al. (2023), and explore its robustness to the additional signal distortions (beyond 1-bit quantization) introduced in the considered architecture by the automatic gain controllers (AGCs) and by the comparators. These components are used at the access points to generate the two-level analog waveform from the received signal. Via simulation results, we illustrate that the proposed channel-estimation method outperforms significantly the Bussgang linear minimum mean-square error channel estimator, and it is robust against the additional impairments introduced by the AGCs and the comparators.
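
A toy numerical model of the fronthaul signal path may clarify the two components under study; the gains, noise levels, and the AGC rule below are assumptions, not the paper's system parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy received analog baseband stream: signal plus receiver noise.
    received = 0.3 * rng.standard_normal(1000) + 0.05 * rng.standard_normal(1000)

    # AGC (assumed rule): scale the waveform to a target RMS amplitude of 1.
    gain = 1.0 / (np.sqrt(np.mean(received ** 2)) + 1e-12)
    agc_out = gain * received

    # Comparator: two-level (1-bit) quantization of the AGC output.
    one_bit = np.where(agc_out >= 0.0, 1.0, -1.0)

    print(f"AGC gain: {gain:.2f}, levels: {np.unique(one_bit)}")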

Updated: 2024-06-17 08:38:29

标题: 基于深度学习的1比特光纤前传分布式MIMO信道估计

摘要: 我们考虑在分布式大规模多输入多输出(MIMO)架构中的辅助导频上行信道估计问题,其中接入点通过光纤前传链路连接到中央处理单元,传输接收到的模拟射频信号的两级量化版本。我们将最近由Nguyen等人提出的基于深度学习的信道估计算法调整到这种架构中,并探讨其对由自动增益控制器(AGCs)和比较器在考虑的架构中引入的附加信号失真(超出1位量化)的鲁棒性。这些组件在接入点处用于从接收信号生成两级模拟波形。通过仿真结果,我们说明所提出的信道估计方法明显优于Bussgang线性最小均方误差信道估计器,并且对由AGCs和比较器引入的额外损坏具有鲁棒性。

更新时间: 2024-06-17 08:38:29

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2406.11325v1

Reconfigurable Intelligent Surface Assisted VEC Based on Multi-Agent Reinforcement Learning

Vehicular edge computing (VEC) is an emerging technology that enables vehicles to perform high-intensity tasks by executing tasks locally or offloading them to nearby edge devices. However, obstacles such as buildings may degrade the communications and incur communication interruptions, and thus the vehicle may not meet the requirement for task offloading. Reconfigurable intelligent surfaces (RIS) is introduced to support vehicle communication and provide an alternative communication path. The system performance can be improved by flexibly adjusting the phase-shift of the RIS. For RIS-assisted VEC system where tasks arrive randomly, we design a control scheme that considers offloading power, local power allocation and phase-shift optimization. To solve this non-convex problem, we propose a new deep reinforcement learning (DRL) framework that employs modified multi-agent deep deterministic policy gradient (MADDPG) approach to optimize the power allocation for vehicle users (VUs) and block coordinate descent (BCD) algorithm to optimize the phase-shift of the RIS. Simulation results show that our proposed scheme outperforms the centralized deep deterministic policy gradient (DDPG) scheme and random scheme.

Updated: 2024-06-17 08:35:32

标题: 基于多智能体强化学习的可重构智能表面辅助 VEC

摘要: 车载边缘计算(VEC)是一种新兴技术,使车辆能够通过在本地执行任务或将任务卸载到附近的边缘设备来执行高强度任务。然而,建筑等障碍物可能降低通信质量并引发通信中断,因此车辆可能无法满足任务卸载的要求。可重构智能表面(RIS)被引入以支持车辆通信并提供替代通信路径。通过灵活调整RIS的相移,系统性能可以得到改善。针对RIS辅助的VEC系统中任务随机到达的情况,我们设计了一个考虑卸载功率、本地功率分配和相移优化的控制方案。为了解决这个非凸问题,我们提出了一个新的深度强化学习(DRL)框架,采用修改后的多智能体深度确定性策略梯度(MADDPG)方法来优化车辆用户(VUs)的功率分配,以及使用块协调下降(BCD)算法来优化RIS的相移。仿真结果显示,我们提出的方案优于集中式深度确定性策略梯度(DDPG)方案和随机方案。

更新时间: 2024-06-17 08:35:32

领域: cs.MA,cs.DC,cs.LG,cs.NI,eess.SP

下载: http://arxiv.org/abs/2406.11318v1

DocCGen: Document-based Controlled Code Generation

Recent developments show that Large Language Models (LLMs) produce state-of-the-art performance on natural language (NL) to code generation for resource-rich general-purpose languages like C++, Java, and Python. However, their practical usage for structured domain-specific languages (DSLs) such as YAML, JSON is limited due to domain-specific schema, grammar, and customizations generally unseen by LLMs during pre-training. Efforts have been made to mitigate this challenge via in-context learning through relevant examples or by fine-tuning. However, these approaches suffer from problems such as limited DSL samples and prompt sensitivity; enterprises, on the other hand, typically maintain good documentation of their DSLs. Therefore, we propose DocCGen, a framework that can leverage such rich knowledge by breaking the NL-to-Code generation task for structured code languages into a two-step process. First, it detects the correct libraries using the library documentation that best matches the NL query. Then, it utilizes schema rules extracted from the documentation of these libraries to constrain the decoding. We evaluate our framework for two complex structured languages, Ansible YAML and Bash command, consisting of two settings: Out-of-domain (OOD) and In-domain (ID). Our extensive experiments show that DocCGen consistently improves different-sized language models across all six evaluation metrics, reducing syntactic and semantic errors in structured code. We plan to open-source the datasets and code to motivate research in constrained code generation.
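
The second step, schema-constrained decoding, can be sketched in miniature: at every decoding step, restrict the model's choices to tokens the documentation-derived schema allows. The toy grammar, the stand-in scoring function, and the greedy loop below are illustrative, not DocCGen's actual implementation.

    def constrained_greedy_decode(score_fn, allowed_next, max_steps=8):
        """Greedy decoding restricted by a schema-derived successor map.

        score_fn(prefix)     -> dict {token: score} from a language model
        allowed_next(prefix) -> set of tokens the schema permits here
        """
        prefix = []
        for _ in range(max_steps):
            scores = score_fn(prefix)
            legal = allowed_next(tuple(prefix)) & scores.keys()
            if not legal:
                break
            prefix.append(max(legal, key=lambda t: scores[t]))
        return prefix

    # Toy Ansible-like schema: 'name:' must precede exactly one module key.
    SCHEMA = {(): {"name:"}, ("name:",): {"apt:", "copy:"}}

    def fake_lm(prefix):
        # Stand-in for LLM logits; deliberately prefers an illegal token.
        return {"shell:": 0.9, "apt:": 0.6, "copy:": 0.4, "name:": 0.5}

    print(constrained_greedy_decode(fake_lm, lambda p: SCHEMA.get(p, set())))
    # -> ['name:', 'apt:']  ('shell:' is masked out by the schema)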

Updated: 2024-06-17 08:34:57

标题: DocCGen:基于文档的受控代码生成

摘要: 最近的发展表明,大型语言模型(LLMs)在自然语言(NL)到代码生成方面表现出最先进的性能,尤其是对于资源丰富的通用编程语言如C++、Java和Python。然而,它们在结构化领域特定语言(DSLs)如YAML、JSON上的实际使用受限于领域特定的模式、语法和通常在预训练期间LLMs看不到的定制化。已有工作尝试通过上下文学习相关示例或进行微调来缓解这一挑战。然而,这些方法存在DSL样本有限和提示敏感等问题;另一方面,企业通常维护着完善的DSL文档。因此,我们提出了DocCGen,这是一个框架,它可以利用这种丰富的知识,将结构化代码语言的NL到代码生成任务分解为一个两步过程。首先,它使用最匹配NL查询的库文档来检测正确的库。然后,它利用从这些库的文档中提取的模式规则来约束解码。我们在两种复杂的结构化语言(Ansible YAML和Bash命令)上评估了我们的框架,包括两种设置:领域外(OOD)和领域内(ID)。我们广泛的实验证明,DocCGen在所有六个评估指标上持续改进不同规模的语言模型,减少了结构化代码中的语法和语义错误。我们计划开源数据集和代码,以激励受限制的代码生成研究。

更新时间: 2024-06-17 08:34:57

领域: cs.SE,cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.11925v1

GUICourse: From General Vision Language Models to Versatile GUI Agents

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

Updated: 2024-06-17 08:30:55

标题: GUI课程:从通用视觉语言模型到多功能GUI代理

摘要: 利用图形用户界面(GUI)进行人机交互对于访问各种数字工具至关重要。最近视觉语言模型(VLMs)的进展突显了开发多功能代理以帮助人类完成GUI导航任务的巨大潜力。然而,当前的VLMs在基本能力(OCR和接地)和GUI知识(GUI元素的功能和控制方法)方面存在挑战,阻止它们成为实用的GUI代理。为了解决这些挑战,我们提出了GUICourse,一个用于从通用VLMs训练基于视觉的GUI代理的数据集套件。首先,我们介绍了GUIEnv数据集,以加强VLMs的OCR和接地能力。然后,我们介绍了GUIAct和GUIChat数据集,以丰富它们对GUI组件和交互的知识。实验证明,我们的GUI代理在常见的GUI任务上比其基准VLMs表现更好。即使是小型GUI代理(具有31亿个参数)仍然能够很好地完成单步和多步GUI任务。最后,我们通过消融研究分析了这个代理的训练阶段的不同变种。我们的源代码和数据集发布在https://github.com/yiye3/GUICourse。

更新时间: 2024-06-17 08:30:55

领域: cs.AI,cs.CL,cs.CV,cs.HC

下载: http://arxiv.org/abs/2406.11317v1

Integrity-protecting block cipher modes -- Untangling a tangled web

This paper re-examines the security of three related block cipher modes of operation designed to provide authenticated encryption. These modes, known as PES-PCBC, IOBC and EPBC, were all proposed in the mid-1990s. However, analyses of security of the latter two modes were published more recently. In each case one or more papers describing security issues with the schemes were eventually published, although a flaw in one of these analyses (of EPBC) was subsequently discovered - this means that until now EPBC had no known major issues. This paper establishes that, despite this, all three schemes possess defects which should prevent their use - especially as there are a number of efficient alternative schemes possessing proofs of security.

Updated: 2024-06-17 08:27:45

标题: 完整性保护的分组密码模式--解开错综复杂的网络

摘要: 本文重新审视了三种旨在提供认证加密的相关分组密码工作模式的安全性。这些模式被称为PES-PCBC、IOBC和EPBC,都是在1990年代中期提出的。然而,对后两种模式安全性的分析是最近才发表的。在每种情况下,最终都发表了一篇或多篇描述方案安全问题的论文,尽管随后发现了其中一项分析(针对EPBC的分析)存在缺陷,这意味着直到现在EPBC还没有已知的重大问题。本文证明,尽管如此,这三种方案都存在应当阻止其使用的缺陷,尤其是考虑到已有许多具备安全性证明的高效替代方案。

更新时间: 2024-06-17 08:27:45

领域: cs.CR

下载: http://arxiv.org/abs/2403.03654v2

Improved Algorithms for Contextual Dynamic Pricing

In contextual dynamic pricing, a seller sequentially prices goods based on contextual information. Buyers will purchase products only if the prices are below their valuations. The goal of the seller is to design a pricing strategy that collects as much revenue as possible. We focus on two different valuation models. The first assumes that valuations linearly depend on the context and are further distorted by noise. Under minor regularity assumptions, our algorithm achieves an optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, improving the existing results. The second model removes the linearity assumption, requiring only that the expected buyer valuation is $\beta$-H\"older in the context. For this model, our algorithm obtains a regret $\tilde{\mathcal{O}}(T^{(d+2\beta)/(d+3\beta)})$, where $d$ is the dimension of the context space.
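
A tiny simulation of the first valuation model (linear in context, distorted by noise) may clarify the setup; the context distribution, noise scale, and the naive oracle pricing rule below are purely illustrative and are not the paper's algorithm.

    import numpy as np

    rng = np.random.default_rng(1)
    d, T = 5, 10_000
    theta = rng.normal(size=d)              # true (unknown) valuation weights

    def run(price_fn):
        revenue = 0.0
        for _ in range(T):
            x = rng.normal(size=d)                          # observed context
            valuation = x @ theta + rng.normal(scale=0.5)   # linear + noise
            p = price_fn(x)
            if p <= valuation:          # buyer purchases iff price <= value
                revenue += p
        return revenue

    # Naive oracle baseline: expected valuation minus a safety margin.
    print(run(lambda x: x @ theta - 0.25))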

Updated: 2024-06-17 08:26:51

标题: 改进的上下文动态定价算法

摘要: 在情境动态定价中,卖方根据情境信息顺序定价商品。只有当价格低于买家的估值时,买家才会购买产品。卖方的目标是设计一种定价策略,尽可能多地收集收入。我们专注于两种不同的估值模型。第一种假设估值线性依赖于情境,并受到噪音的进一步扭曲。在温和的正则性假设下,我们的算法实现了$\tilde{\mathcal{O}}(T^{2/3})$的最优后悔上界,改进了现有的结果。第二种模型去除了线性假设,只要求期望的买家估值关于情境是$\beta$-H\"older连续的。对于这个模型,我们的算法获得了$\tilde{\mathcal{O}}(T^{(d+2\beta)/(d+3\beta)})$的后悔界,其中$d$是情境空间的维度。

更新时间: 2024-06-17 08:26:51

领域: stat.ML,cs.DS,cs.GT,cs.LG

下载: http://arxiv.org/abs/2406.11316v1

Temporal Lidar Depth Completion

Given the lidar measurements from an autonomous vehicle, we can project the points and generate a sparse depth image. Depth completion aims at increasing the resolution of such a depth image by infilling and interpolating the sparse depth values. Like most existing approaches, we make use of camera images as guidance in very sparse or occluded regions. In addition, we propose a temporal algorithm that utilizes information from previous timesteps using recurrence. In this work, we show how a state-of-the-art method, PENet, can be modified to benefit from recurrence. Our algorithm achieves state-of-the-art results on the KITTI depth completion dataset while adding less than one percent of additional overhead in terms of both neural network parameters and floating point operations. The accuracy is especially improved for faraway objects and regions containing a low amount of lidar depth samples. Even in regions without any ground truth (like sky and rooftops) we observe large improvements which are not captured by the existing evaluation metrics.
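
The opening step, projecting lidar points into a sparse depth image, looks roughly like the sketch below; the pinhole intrinsics and image size are made-up values rather than KITTI's calibration.

    import numpy as np

    def project_to_sparse_depth(points_cam, K, h, w):
        """Project Nx3 camera-frame lidar points to a sparse depth image.

        Pixels that receive no point stay 0 (invalid).
        """
        depth = np.zeros((h, w), dtype=np.float32)
        z = points_cam[:, 2]
        front = z > 0                       # keep points ahead of the camera
        uvw = (K @ points_cam[front].T).T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        ok = (0 <= u) & (u < w) & (0 <= v) & (v < h)
        # Last write wins on collisions; a real pipeline would keep the
        # minimum (closest) depth per pixel.
        depth[v[ok], u[ok]] = z[front][ok]
        return depth

    K = np.array([[720.0, 0, 320], [0, 720.0, 240], [0, 0, 1]])
    pts = np.random.default_rng(0).uniform([-5, -2, 2], [5, 2, 60], (5000, 3))
    sparse = project_to_sparse_depth(pts, K, 480, 640)
    print(f"valid pixels: {(sparse > 0).mean():.1%}")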

Updated: 2024-06-17 08:25:31

标题: 时间激光雷达深度补全

摘要: 鉴于自动驾驶车辆的激光雷达测量数据,我们可以投影点并生成稀疏深度图像。深度补全旨在通过填充和插值稀疏深度值来提高这种深度图像的分辨率。与大多数现有方法一样,我们利用相机图像作为非常稀疏或遮挡区域中的指导。此外,我们提出了一种通过递归利用先前时间步信息的时间算法。在这项工作中,我们展示了如何修改最先进的方法PENet以从递归中获益。我们的算法在KITTI深度补全数据集上取得了最先进的结果,而在神经网络参数和浮点运算方面仅增加不到百分之一的额外开销。特别是对于遥远的物体和仅包含少量激光雷达深度样本的区域,准确性得到了改善。即使在没有任何地面真实值的区域(如天空和屋顶),我们也观察到了大幅改善,而这些改善并未被现有评估指标捕捉到。

更新时间: 2024-06-17 08:25:31

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.11315v1

Federated Active Learning Framework for Efficient Annotation Strategy in Skin-lesion Classification

Federated Learning (FL) enables multiple institutes to train models collaboratively without sharing private data. Current FL research focuses on communication efficiency, privacy protection, and personalization and assumes that the data of FL have already been ideally collected. In medical scenarios, however, data annotation demands both expertise and intensive labor, which is a critical problem in FL. Active learning (AL) has shown promising performance in reducing the number of data annotations in medical image analysis. We propose a federated AL (FedAL) framework in which AL is executed periodically and interactively under FL. We exploit a local model in each hospital and a global model acquired from FL to construct an ensemble. We use ensemble-entropy-based AL as an efficient data-annotation strategy in FL. Therefore, our FedAL framework can decrease the amount of annotated data and preserve patient privacy while maintaining the performance of FL. To our knowledge, this is the first FedAL framework applied to medical images. We validated our framework on real-world dermoscopic datasets. Using only 50% of samples, our framework was able to achieve state-of-the-art performance on a skin-lesion classification task. Our framework performed better than several state-of-the-art AL methods under FL and achieved comparable performance to full-data FL.
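
The ensemble-entropy annotation strategy could look like the following sketch: average the local and global models' predicted class probabilities, score each unlabeled sample by the entropy of the ensemble, and send the most uncertain ones to annotators. Treating the ensemble as a plain mean of the two probability vectors is an assumption.

    import numpy as np

    def select_for_annotation(p_local, p_global, budget):
        """Pick the `budget` most uncertain samples by ensemble entropy.

        p_local, p_global: (N, C) class-probability matrices from a
        hospital's local model and the FL global model.
        """
        p_ens = 0.5 * (p_local + p_global)          # assumed ensembling rule
        entropy = -np.sum(p_ens * np.log(p_ens + 1e-12), axis=1)
        return np.argsort(entropy)[::-1][:budget]   # highest entropy first

    rng = np.random.default_rng(0)
    softmax = lambda z: np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_l = softmax(rng.normal(size=(100, 7)))   # e.g. 7 lesion classes
    p_g = softmax(rng.normal(size=(100, 7)))
    print(select_for_annotation(p_l, p_g, budget=10))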

Updated: 2024-06-17 08:16:28

标题: 联邦式主动学习框架在皮肤病变分类中的高效标注策略

摘要: 联邦学习(FL)使多个机构能够在不共享私人数据的情况下协作训练模型。目前的FL研究集中在通信效率、隐私保护和个性化上,并假设FL的数据已经被理想地收集。然而,在医疗场景中,数据标注需要专业知识和大量劳动,这是FL中的一个关键问题。主动学习(AL)在减少医学图像分析中的数据标注数量方面表现出有希望的性能。我们提出了一个联邦主动学习(FedAL)框架,其中AL在FL下周期性和交互地执行。我们利用每个医院的本地模型和从FL获得的全局模型构建一个集成。我们使用基于集成熵的AL作为FL中高效的数据标注策略。因此,我们的FedAL框架可以减少标注数据的数量,在保护患者隐私的同时保持FL的性能。据我们所知,这是第一个应用于医学图像的FedAL框架。我们在真实世界的皮肤镜数据集上验证了我们的框架。仅使用50%的样本,我们的框架就能在皮肤病变分类任务上实现最先进的性能。我们的框架在FL下表现优于几种最先进的AL方法,并达到了与完整数据FL相当的性能。

更新时间: 2024-06-17 08:16:28

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.11310v1

Management Decisions in Manufacturing using Causal Machine Learning -- To Rework, or not to Rework?

In this paper, we present a data-driven model for estimating optimal rework policies in manufacturing systems. We consider a single production stage within a multistage, lot-based system that allows for optional rework steps. While the rework decision depends on an intermediate state of the lot and system, the final product inspection, and thus the assessment of the actual yield, is delayed until production is complete. Repair steps are applied uniformly to the lot, potentially improving some of the individual items while degrading others. The challenge is thus to balance potential yield improvement with the rework costs incurred. Given the inherently causal nature of this decision problem, we propose a causal model to estimate yield improvement. We apply methods from causal machine learning, in particular double/debiased machine learning (DML) techniques, to estimate conditional treatment effects from data and derive policies for rework decisions. We validate our decision model using real-world data from opto-electronic semiconductor manufacturing, achieving a yield improvement of 2 - 3% during the color-conversion process of white light-emitting diodes (LEDs).
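
The double/debiased machine learning step can be illustrated with the textbook partially linear recipe: predict both the treatment (rework yes/no) and the outcome (yield) from covariates with flexible learners under cross-fitting, then regress outcome residuals on treatment residuals. This is generic DML on simulated data, not the paper's exact estimator or dataset.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    n = 2000
    X = rng.normal(size=(n, 5))                  # intermediate lot/system state
    D = (X[:, 0] + rng.normal(size=n) > 0).astype(float)       # rework decision
    Y = 0.03 * D + 0.5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

    res_y, res_d = np.zeros(n), np.zeros(n)
    for tr, te in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
        m_y = RandomForestRegressor(random_state=0).fit(X[tr], Y[tr])
        m_d = RandomForestRegressor(random_state=0).fit(X[tr], D[tr])
        res_y[te] = Y[te] - m_y.predict(X[te])   # partial X out of the yield
        res_d[te] = D[te] - m_d.predict(X[te])   # partial X out of the decision

    theta = (res_d @ res_y) / (res_d @ res_d)    # rework effect on yield
    print(f"estimated effect: {theta:.3f}")      # close to the simulated 0.03

A rework policy then follows by comparing such conditional effect estimates against the rework cost for each lot state.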

Updated: 2024-06-17 08:14:40

标题: 使用因果机器学习进行制造业管理决策——返工,还是不返工?

摘要: 在本文中,我们提出了一个用于估计制造系统中最佳返工政策的数据驱动模型。我们考虑了一个允许进行可选返工步骤的多阶段、批次制系统中的单个生产阶段。虽然返工决策取决于批次和系统的中间状态,但最终产品检验以及因此实际产量的评估要延迟到生产完成之后。修复步骤会统一应用于整个批次,有可能改善一些个体项目,同时降低其他项目的质量。因此,挑战在于平衡潜在产量改善与返工成本之间的关系。鉴于这一决策问题的固有因果性质,我们提出了一个因果模型来估计产量改善。我们应用因果机器学习方法,特别是双重/无偏机器学习(DML)技术,来从数据中估计条件处理效果,并制定返工决策政策。我们使用光电半导体制造的真实数据验证了我们的决策模型,在白色发光二极管(LED)的颜色转换过程中实现了2-3%的产量改善。

更新时间: 2024-06-17 08:14:40

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2406.11308v1

Optimal Attack and Defense for Reinforcement Learning

To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.
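
As a concrete instance of the attack surfaces listed above, a minimal environment wrapper implementing a reward attack (the simplest of the four classes) might look like this; the interface and the perturbation rule are illustrative only.

    class RewardAttackWrapper:
        """Wrap an RL environment so an attacker perturbs the victim's rewards."""

        def __init__(self, env, attack_fn):
            self.env = env
            self.attack_fn = attack_fn   # (state, action, reward) -> reward

        def reset(self):
            self.state = self.env.reset()
            return self.state

        def step(self, action):
            next_state, reward, done, info = self.env.step(action)
            poisoned = self.attack_fn(self.state, action, reward)
            self.state = next_state
            return next_state, poisoned, done, info   # victim sees poisoned reward

    # Example attack rule: flip the sign of large rewards to mislead learning.
    flip_big = lambda state, action, r: -r if abs(r) > 1.0 else r

State, observation, and action attacks would intercept the corresponding quantities in the same way, which is what makes the meta-MDP view of the attacked interaction natural.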

Updated: 2024-06-17 08:13:44

标题: 强化学习的最佳攻击和防御

摘要: 为了确保强化学习(RL)在实际系统中的可用性,关键是确保它们对噪声和对抗性攻击具有稳健性。在对抗性RL中,外部攻击者有能力操纵受害者代理与环境的交互。我们研究了在线操纵攻击的全部类别,包括(i)状态攻击,(ii)观察攻击(这是感知状态攻击的泛化),(iii)动作攻击和(iv)奖励攻击。我们展示了攻击者设计一个最大化自身期望奖励的隐蔽攻击的问题,通常对应于最小化受害者价值,这可以通过我们称之为元MDP的马尔可夫决策过程(MDP)来捕捉,因为它不是真实环境,而是被攻击交互引起的更高级别环境。我们展示了攻击者可以通过在多项式时间内进行规划,或使用标准RL技术以多项式样本复杂度进行学习,来推导出最佳攻击。我们认为,受害者的最佳防御策略可以计算为一个随机斯塔克尔伯格博弈的解,该博弈可进一步简化为部分可观察的回合制随机博弈(POTBSG)。攻击者和受害者都不会从偏离各自的最佳策略中受益,因此这些解决方案是真正稳健的。尽管防御问题是NP难的,我们展示了在许多场景中可以在多项式时间(样本复杂度)内计算(学习)最佳马尔可夫防御策略。

更新时间: 2024-06-17 08:13:44

领域: cs.LG,cs.CR,cs.GT

下载: http://arxiv.org/abs/2312.00198v2

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.

Updated: 2024-06-17 08:09:00

标题: VideoVista:用于视频理解和推理的多功能基准

摘要: 尽管大型多模态模型(LMMs)快速发展推动了视频分析方面的重大突破,但在全面评估这些模型在视频理解和推理方面的性能方面仍然缺乏多功能评估基准。为了解决这一问题,我们提出了VideoVista,一个视频问答基准,整合了跨不同内容类别、持续时间和能力的挑战。具体而言,VideoVista包括来自14个类别(如Howto,Film和Entertainment)的3,400个视频衍生出的25,000个问题,持续时间从几秒到10分钟以上不等。此外,它涵盖了19种理解任务(如异常检测、交互理解)和8种推理任务(如逻辑推理、因果推理)。为了实现这一目标,我们提出了一个自动数据构建框架,利用强大的GPT-4o以及先进的分析工具(如视频分割、物体分割和跟踪)。我们还利用这个框架构建训练数据,以增强与视频相关的LMMs(Video-LMMs)的能力。通过对最前沿模型的全面和定量评估,我们发现:1)Video-LMMs在涉及时间位置、物体跟踪和异常检测的细粒度视频任务中面临困难;2)Video-LMMs呈现出较低的逻辑和关系推理能力;3)开源Video-LMMs的性能明显低于GPT-4o和Gemini-1.5,落后20个点。这突显了VideoVista在推进能够准确理解视频并进行精确推理的LMMs方面将发挥关键作用。

更新时间: 2024-06-17 08:09:00

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.11303v1

Optimizing and Testing Instruction-Following: Analyzing the Impact of Fine-Grained Instruction Variants on instruction-tuned LLMs

The effective alignment of Large Language Models (LLMs) with precise instructions is essential for their application in diverse real-world scenarios. Current methods focus on enhancing the diversity and complexity of training and evaluation samples, yet they fall short in accurately assessing LLMs' ability to follow similar instruction variants. We introduce an effective data augmentation technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants, thereby preserving the original instruction's context and complexity while introducing variability, which is critical for training and evaluating LLMs' instruction-following precision. We developed the DeMoRecon dataset using this method to both fine-tune and evaluate LLMs. Our findings show that LLMs fine-tuned with DeMoRecon gain a significant performance boost on both our and commonly used instruction-following benchmarks.

Updated: 2024-06-17 08:08:11

标题: 优化和测试指令遵循:分析细粒度指令变体对指令微调LLMs的影响

摘要: 大型语言模型(LLMs)与精确指令的有效对齐对它们在各种现实场景中的应用至关重要。当前的方法侧重于增强训练和评估样本的多样性和复杂性,但在准确评估LLMs遵循相似指令变体的能力方面表现不佳。我们引入了一种有效的数据增强技术,将复杂指令分解为更简单的子组件,对其进行修改,并重构为新的变体,从而保留原始指令的上下文和复杂性,同时引入变化,这对于训练和评估LLMs遵循指令的精度至关重要。我们使用这种方法开发了DeMoRecon数据集,用于微调和评估LLMs。我们的研究结果显示,使用DeMoRecon进行微调的LLMs在我们和常用的指令遵循基准测试中都将获得显著的性能提升。

更新时间: 2024-06-17 08:08:11

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11301v1

Explainable assessment of financial experts' credibility by classifying social media forecasts and checking the predictions with actual market data

Social media include diverse interaction metrics related to user popularity, the most evident example being the number of user followers. The latter has raised concerns about the credibility of the posts by the most popular creators. However, most existing approaches to assess credibility in social media strictly consider this problem a binary classification, often based on a priori information, without checking if actual real-world facts back the users' comments. In addition, they do not provide automatic explanations of their predictions to foster their trustworthiness. In this work, we propose a credibility assessment solution for financial creators in social media that combines Natural Language Processing and Machine Learning. The reputation of the contributors is assessed by automatically classifying their forecasts on asset values by type and verifying these predictions with actual market data to approximate their probability of success. The outcome of this verification is a continuous credibility score instead of a binary result, an entirely novel contribution by this work. Moreover, social media metrics (i.e., user context) are exploited by calculating their correlation with the credibility rankings, providing insights on the interest of the end-users in financial posts and their forecasts (i.e., drop or rise). Finally, the system provides natural language explanations of its decisions based on a model-agnostic analysis of relevant features.
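
The verification step, turning classified forecasts into a continuous credibility score, could be sketched as follows: compare each predicted direction against the realized return over the forecast horizon and report the creator's empirical hit rate. The data layout and the pure hit-rate scoring rule are assumptions for illustration.

    def credibility_score(forecasts, realized_returns):
        """Continuous credibility in [0, 1]: fraction of verified forecasts.

        forecasts: list of (asset, direction), direction in {"rise", "drop"}
        realized_returns: dict asset -> realized return over the horizon
        """
        hits = total = 0
        for asset, direction in forecasts:
            r = realized_returns.get(asset)
            if r is None:               # no market data: skip, do not guess
                continue
            total += 1
            hits += (direction == "rise") == (r > 0)
        return hits / total if total else 0.5   # uninformative default

    posts = [("AAPL", "rise"), ("TSLA", "drop"), ("MSFT", "rise")]
    returns = {"AAPL": 0.04, "TSLA": 0.02, "MSFT": 0.01}
    print(credibility_score(posts, returns))    # 2 of 3 verified -> ~0.67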

Updated: 2024-06-17 08:08:03

标题: 通过分类社交媒体预测并与实际市场数据核对,解释金融专家可信度的评估

摘要: 社交媒体包括与用户知名度相关的各种互动指标,最明显的例子是用户关注者数量。后者引发了关于最受欢迎创作者的帖子可信度的担忧。然而,目前大多数评估社交媒体可信度的方法严格将这一问题视为二元分类,通常基于先验信息,而不检查实际世界事实是否支持用户的评论。此外,它们不提供其预测的自动解释,以增强其可信度。在这项工作中,我们提出了一种结合自然语言处理和机器学习的社交媒体中金融创作者的可信度评估解决方案。通过自动分类创作者对资产价值的预测类型,并验证这些预测与实际市场数据,以近似其成功概率来评估贡献者的声誉。这种验证的结果是一个连续的可信度评分,而不是一个二元结果,这是这项工作的一个全新贡献。此外,通过计算社交媒体指标(即用户上下文)与可信度排名的相关性,利用这些指标提供有关终端用户对财务帖子和他们的预测(即上涨或下跌)兴趣的见解。最后,该系统基于有关特征的模型不可知分析提供其决策的自然语言解释。

更新时间: 2024-06-17 08:08:03

领域: cs.SI,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11924v1

ValueDCG: Measuring Comprehensive Human Value Understanding Ability of Language Models

Personal values are a crucial factor behind human decision-making. Considering that Large Language Models (LLMs) have been shown to impact human decisions significantly, it is essential to make sure they accurately understand human values to ensure their safety. However, evaluating their grasp of these values is complex due to values' intricate and adaptable nature. We argue that truly understanding values in LLMs requires considering both "know what" and "know why". To this end, we present a comprehensive evaluation metric, ValueDCG (Value Discriminator-Critique Gap), to quantitatively assess the two aspects with an engineering implementation. We assess four representative LLMs and provide compelling evidence that the growth rates of LLMs' "know what" and "know why" capabilities do not align with increases in parameter numbers, resulting in a decline in the models' capacity to understand human values as parameter counts grow. This may further suggest that LLMs might craft plausible explanations based on the provided context without truly understanding their inherent value, indicating potential risks.

Updated: 2024-06-17 07:58:00

标题: ValueDCG:衡量语言模型综合人类价值理解能力

摘要: 个人价值观是人类决策背后的一个至关重要因素。考虑到大型语言模型(LLMs)已被证明显著影响人类决策,确保它们准确理解人类价值观以确保安全至关重要。然而,由于价值观的错综复杂和可适应性,评估它们对价值观的把握是复杂的。我们认为,真正理解LLMs中的价值观需要考虑“知道什么”和“知道为什么”两个方面。为此,我们提出了一个全面的评估指标,ValueDCG(价值鉴别器-批评间隙),以工程实现定量评估这两个方面。我们评估了四种代表性的LLMs,并提供了有力证据,表明LLM的“知道什么”和“知道为什么”的能力增长速度与参数数量的增加不一致,导致模型理解人类价值观的能力随参数数量的增加而下降。这进一步可能暗示,LLMs可能根据提供的背景制定合理的解释,而不真正理解其固有价值,从而表明潜在风险。

更新时间: 2024-06-17 07:58:00

领域: cs.CL,cs.AI,cs.CY,I.2.m; K.4.m

下载: http://arxiv.org/abs/2310.00378v4

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B-scale LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset \textit{ALLaVA}, and open-source it to research community for developing better resource-efficient LVLMs for wider usage.

Updated: 2024-06-17 07:55:59

标题: ALLaVA:利用GPT4V合成的数据来支持轻量级视觉-语言模型

摘要: 大型视觉语言模型(LVLMs)以其强大的推理和泛化能力,在广泛的视觉语言任务中显示出潜力。然而,它们需要大量的计算资源进行训练和部署。本研究旨在通过采用高质量的训练数据来弥合传统规模LVLMs和资源友好的轻量版本之间的性能差距。为此,我们提出了一个生成合成数据集的全面流程。关键思想是利用强大的专有模型生成(i)用于视觉语言对齐的细粒度图像注释,以及(ii)用于视觉指导微调的复杂推理视觉问答对,总共产生了130万个样本。我们在合成数据集上训练一系列轻量级VLMs,并实验结果表明所提出的方案的有效性,在4B LVLMs中的17个基准测试中取得了竞争性表现,甚至在各种基准测试中与7B / 13B规模的模型表现相当。这项工作突显了采用高质量数据来打造更高效的LVLMs的可行性。我们将数据集命名为ALLaVA,并向研究界开放,以开发更好的资源高效LVLMs,以供更广泛的使用。

更新时间: 2024-06-17 07:55:59

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2402.11684v2

Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy

Utility and topical relevance are critical measures in information retrieval (IR), reflecting system and user perspectives, respectively. While topical relevance has long been emphasized, utility is a higher standard of relevance and is more useful for facilitating downstream tasks, e.g., in Retrieval-Augmented Generation (RAG). When we incorporate utility judgments into RAG, we realize that the topical relevance, utility, and answering in RAG are closely related to the three types of relevance that Schutz discussed from a philosophical perspective. They are topical relevance, interpretational relevance, and motivational relevance, respectively. Inspired by the dynamic iterations of the three types of relevance, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step of the cycle of RAG. We conducted extensive experiments on multi-grade passage retrieval and factoid question-answering datasets (i.e., TREC DL, WebAP, and NQ). Experimental results demonstrate significant improvements in utility judgments, ranking of topical relevance, and answer generation upon representative baselines, including multiple single-shot utility judging approaches. Our code and benchmark can be found at https://anonymous.4open.science/r/ITEM-B486/.

Updated: 2024-06-17 07:52:42

标题: 基于哲学中相关性启发的LLMs迭代式效用判断框架

摘要: 效用和主题相关性是信息检索(IR)中的关键衡量标准,分别反映了系统和用户的观点。虽然长期以来一直强调主题相关性,但效用是一个更高的相关性标准,对于促进下游任务(例如在检索增强生成(RAG)中)更有用。当我们将效用判断纳入RAG时,我们意识到RAG中的主题相关性、效用和回答与Schutz从哲学角度讨论的三种相关性类型密切相关。它们分别是主题相关性、解释性相关性和动机相关性。受到这三种相关性类型的动态迭代的启发,我们提出了一个迭代效用判断框架(ITEM),以促进RAG循环的每一步。我们对多等级段落检索和事实问题回答数据集(即TREC DL、WebAP和NQ)进行了广泛的实验。实验结果表明,在代表性基线(包括多个单次效用判断方法)上,效用判断、主题相关性排序和答案生成方面都取得了显著的改进。我们的代码和基准测试可以在https://anonymous.4open.science/r/ITEM-B486/找到。

更新时间: 2024-06-17 07:52:42

领域: cs.IR,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11290v1

Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment

Large Language Models (LLMs) like OpenAI's GPT series, Anthropic's Claude, and Meta's LLaMa have shown remarkable capabilities in text generation. However, their susceptibility to toxic prompts presents significant security challenges. This paper investigates alignment techniques, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to mitigate these risks. We conduct an empirical study on refusal patterns across nine LLMs, revealing that models with uniform refusal patterns, such as Claude3, exhibit higher security. Based on these findings, we propose self-distilling and cross-model distilling methods to enhance LLM security. Our results show that these methods significantly improve refusal rates and reduce unsafe content, with cross-model distilling achieving refusal rates close to Claude3's 94.51%. These findings underscore the potential of distillation-based alignment in securing LLMs against toxic prompts.

Updated: 2024-06-17 07:46:45

标题: 自我和跨模型蒸馏对LLM的影响:拒绝模式对齐的有效方法

摘要: 大型语言模型(LLMs)如OpenAI的GPT系列,Anthropic的Claude和Meta的LLaMa在文本生成方面展示出了显著的能力。然而,它们对有毒提示的敏感性带来了重大的安全挑战。本文研究了包括监督微调(SFT)和从人类反馈中强化学习(RLHF)在内的对齐技术,以减轻这些风险。我们对九种LLMs进行了拒绝模式的经验研究,发现具有统一拒绝模式的模型,如Claude3,表现出更高的安全性。根据这些发现,我们提出了自我蒸馏和跨模型蒸馏方法来增强LLM的安全性。我们的结果表明,这些方法显著改善了拒绝率并减少了不安全内容,跨模型蒸馏实现了接近Claude3 94.51%的拒绝率。这些发现强调了基于蒸馏的对齐在保护LLMs免受有毒提示方面的潜力。

更新时间: 2024-06-17 07:46:45

领域: cs.CR,cs.CL

下载: http://arxiv.org/abs/2406.11285v1

Instruction Fusion: Advancing Prompt Evolution through Hybridization

The fine-tuning of Large Language Models (LLMs) specialized in code generation has seen notable advancements through the use of open-domain coding queries. Despite the successes, existing methodologies like Evol-Instruct encounter performance limitations, impeding further enhancements in code generation tasks. This paper examines the constraints of existing prompt evolution techniques and introduces a novel approach, Instruction Fusion (IF). IF innovatively combines two distinct prompts through a hybridization process, thereby enhancing the evolution of training prompts for code LLMs. Our experimental results reveal that the proposed novel method effectively addresses the shortcomings of prior methods, significantly improving the performance of Code LLMs across five code generation benchmarks, namely HumanEval, HumanEval+, MBPP, MBPP+ and MultiPL-E, which underscore the effectiveness of Instruction Fusion in advancing the capabilities of LLMs in code generation.

Updated: 2024-06-17 07:40:26

标题: 指令融合:通过混合化推进提示演化

摘要: 大型语言模型(LLMs)在代码生成方面的微调通过使用开放领域编码查询取得了显着进展。尽管取得了成功,但现有方法如Evol-Instruct遇到了性能限制,阻碍了代码生成任务的进一步增强。本文检验了现有提示演化技术的约束,并引入了一种新方法,Instruction Fusion(IF)。IF通过混合化过程创新地结合了两个不同的提示,从而增强了用于代码LLMs的训练提示的演化。我们的实验结果表明,所提出的新方法有效地解决了先前方法的缺点,显著提高了五个代码生成基准测试(HumanEval、HumanEval+、MBPP、MBPP+和MultiPL-E)中的Code LLMs的性能,强调了Instruction Fusion在推进LLMs在代码生成方面能力的有效性。

更新时间: 2024-06-17 07:40:26

领域: cs.AI

下载: http://arxiv.org/abs/2312.15692v4

From Pixels to Progress: Generating Road Network from Satellite Imagery for Socioeconomic Insights in Impoverished Areas

The Sustainable Development Goals (SDGs) aim to resolve societal challenges, such as eradicating poverty and improving the lives of vulnerable populations in impoverished areas. Those areas rely on road infrastructure construction to promote accessibility and economic development. Although publicly available data like OpenStreetMap is available to monitor road status, data completeness in impoverished areas is limited. Meanwhile, the development of deep learning techniques and satellite imagery shows excellent potential for earth monitoring. To tackle the challenge of road network assessment in impoverished areas, we develop a systematic road extraction framework combining an encoder-decoder architecture and morphological operations on satellite imagery, offering an integrated workflow for interdisciplinary researchers. Extensive experiments of road network extraction on real-world data in impoverished regions achieve a 42.7% enhancement in the F1-score over the baseline methods and reconstruct about 80% of the actual roads. We also propose a comprehensive road network dataset covering approximately 794,178 km2 area and 17.048 million people in 382 impoverished counties in China. The generated dataset is further utilized to conduct socioeconomic analysis in impoverished counties, showing that road network construction positively impacts regional economic development. The technical appendix, code, and generated dataset can be found at https://github.com/tsinghua-fib-lab/Road_network_extraction_impoverished_counties.
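
The morphological half of the "encoder-decoder plus morphological operations" pipeline can be illustrated on the post-processing side: binarize the decoder's probability map, close small gaps, drop speckles, and skeletonize to one-pixel-wide centerlines. Thresholds and structuring-element sizes below are placeholders, not the paper's settings.

    import numpy as np
    from skimage.morphology import (binary_closing, disk,
                                    remove_small_objects, skeletonize)

    def postprocess_road_mask(prob_map, threshold=0.5, min_size=64):
        """Turn a decoder probability map into a clean road skeleton."""
        mask = prob_map > threshold                  # binarize decoder output
        mask = binary_closing(mask, disk(3))         # bridge small gaps
        mask = remove_small_objects(mask, min_size)  # drop isolated speckles
        return skeletonize(mask)                     # 1-px-wide centerlines

    prob = 0.4 * np.random.default_rng(0).random((256, 256))  # background noise
    prob[100:110, :] = 0.95                                   # a fake road strip
    skeleton = postprocess_road_mask(prob)
    print(f"skeleton pixels: {int(skeleton.sum())}")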

Updated: 2024-06-17 07:40:13

标题: 从像素到进步:利用卫星图像生成贫困地区道路网络,为社会经济洞察提供支持

摘要: 可持续发展目标(SDGs)旨在解决社会挑战,如消除贫困和改善贫困地区脆弱人口的生活。这些地区依赖道路基础设施建设以促进可及性和经济发展。虽然公开可用的数据如OpenStreetMap可用于监测道路状况,但在贫困地区的数据完整性有限。与此同时,深度学习技术和卫星图像的发展显示出对地球监测的巨大潜力。为了解决贫困地区道路网络评估的挑战,我们开发了一个系统化的道路提取框架,结合编码器-解码器架构和形态学操作对卫星图像进行处理,为跨学科研究人员提供了一个整合的工作流程。在贫困地区的真实数据上进行的大量道路网络提取实验显示,相对于基线方法,F1分数提高了42.7%,并重建了大约80%的实际道路。我们还提出了一个覆盖中国382个贫困县、约794,178平方公里面积和约1704.8万人口的综合道路网络数据集。生成的数据集进一步用于在贫困县进行社会经济分析,显示道路网络建设对区域经济发展产生积极影响。技术附录、代码和生成的数据集可在https://github.com/tsinghua-fib-lab/Road_network_extraction_impoverished_counties找到。

更新时间: 2024-06-17 07:40:13

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.11282v1

Statistical Learning of Distributionally Robust Stochastic Control in Continuous State Spaces

We explore the control of stochastic systems with potentially continuous state and action spaces, characterized by the state dynamics $X_{t+1} = f(X_t, A_t, W_t)$. Here, $X$, $A$, and $W$ represent the state, action, and exogenous random noise processes, respectively, with $f$ denoting a known function that describes state transitions. Traditionally, the noise process $\{W_t, t \geq 0\}$ is assumed to be independent and identically distributed, with a distribution that is either fully known or can be consistently estimated. However, the occurrence of distributional shifts, typical in engineering settings, necessitates the consideration of the robustness of the policy. This paper introduces a distributionally robust stochastic control paradigm that accommodates possibly adaptive adversarial perturbation to the noise distribution within a prescribed ambiguity set. We examine two adversary models: current-action-aware and current-action-unaware, leading to different dynamic programming equations. Furthermore, we characterize the optimal finite sample minimax rates for achieving uniform learning of the robust value function across continuum states under both adversary types, considering ambiguity sets defined by $f_k$-divergence and Wasserstein distance. Finally, we demonstrate the applicability of our framework across various real-world settings.

Updated: 2024-06-17 07:37:36

标题: 在连续状态空间中统计学习分布稳健随机控制

摘要: 我们探讨了具有潜在连续状态和动作空间的随机系统的控制,其特征是状态动态$X_{t+1} = f(X_t, A_t, W_t)$。这里,$X$,$A$和$W$分别代表状态、动作和外生随机噪声过程,$f$表示描述状态转移的已知函数。传统上,噪声过程$\{W_t, t \geq 0\}$被假定为独立同分布,其分布要么完全已知,要么可以一致估计。然而,在工程环境中典型的分布偏移的发生,需要考虑策略的鲁棒性。本文引入了一个分布鲁棒的随机控制范式,该范式可以容纳对噪声分布进行可能自适应的对抗性扰动,这些扰动处于规定的模糊集合内。我们研究了两种对手模型:当前动作感知和当前动作无感知,导致不同的动态规划方程。此外,我们对在两种对手类型下实现对鲁棒价值函数的统一学习的最优有限样本极小化速率进行了表征,考虑到由$f_k$-散度和Wasserstein距离定义的模糊集。最后,我们展示了我们的框架在各种实际情况下的适用性。

更新时间: 2024-06-17 07:37:36

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2406.11281v1

Rethinking Spatio-Temporal Transformer for Traffic Prediction:Multi-level Multi-view Augmented Learning Framework

Traffic prediction is a challenging spatio-temporal forecasting problem that involves highly complex spatio-temporal correlations. This paper proposes a Multi-level Multi-view Augmented Spatio-temporal Transformer (LVSTformer) for traffic prediction. The model aims to capture spatial dependencies from three different levels: local geographic, global semantic, and pivotal nodes, along with long- and short-term temporal dependencies. Specifically, we design three spatial augmented views to delve into the spatial information from the perspectives of local, global, and pivotal nodes. By combining three spatial augmented views with three parallel spatial self-attention mechanisms, the model can comprehensively capture spatial dependencies at different levels. We design a gated temporal self-attention mechanism to effectively capture long- and short-term temporal dependencies. Furthermore, a spatio-temporal context broadcasting module is introduced between two spatio-temporal layers to ensure a well-distributed allocation of attention scores, alleviating overfitting and information loss, and enhancing the generalization ability and robustness of the model. A comprehensive set of experiments is conducted on six well-known traffic benchmarks, the experimental results demonstrate that LVSTformer achieves state-of-the-art performance compared to competing baselines, with the maximum improvement reaching up to 4.32%.
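
A minimal PyTorch sketch of a gated temporal self-attention block follows, assuming the gate is a sigmoid-weighted blend of the attention output and the input; the paper's exact gating may differ.

    import torch
    import torch.nn as nn

    class GatedTemporalSelfAttention(nn.Module):
        """Self-attention over time with a learned input/output gate."""

        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):                    # x: (batch, time, d_model)
            attn_out, _ = self.attn(x, x, x)
            g = torch.sigmoid(self.gate(torch.cat([x, attn_out], dim=-1)))
            return self.norm(g * attn_out + (1.0 - g) * x)   # gated residual

    block = GatedTemporalSelfAttention()
    out = block(torch.randn(8, 12, 64))    # batch of 8 nodes, 12 time steps
    print(out.shape)                       # torch.Size([8, 12, 64])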

Updated: 2024-06-17 07:36:57

标题: 重新思考交通预测的时空变换器:多层多视图增强学习框架

摘要: 交通预测是一个具有挑战性的时空预测问题,涉及高度复杂的时空相关性。本文提出了一种多级多视图增强时空变换器(LVSTformer)用于交通预测。该模型旨在捕捉来自三个不同级别的空间依赖性:本地地理、全球语义和关键节点,以及长期和短期时间依赖性。具体而言,我们设计了三种空间增强视图,以从本地、全球和关键节点的角度深入研究空间信息。通过将三个空间增强视图与三个并行空间自注意机制相结合,模型可以全面捕捉不同级别的空间依赖性。我们设计了一个门控时间自注意机制,有效捕捉长期和短期时间依赖性。此外,介绍了一个时空上下文广播模块,用于确保两个时空层之间的注意分数分配均匀,减轻过拟合和信息丢失,增强模型的泛化能力和鲁棒性。在六个知名的交通基准上进行了一系列综合实验,实验结果表明,与竞争基线相比,LVSTformer实现了最先进的性能,最大改进达到4.32%。

更新时间: 2024-06-17 07:36:57

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.11921v1

Breaking Free: Efficient Multi-Party Private Set Union Without Non-Collusion Assumptions

Multi-party private set union (MPSU) protocol enables $m$ $(m > 2)$ parties, each holding a set, to collectively compute the union of their sets without revealing any additional information to other parties. There are two main categories of MPSU protocols: The first builds on public-key techniques. All existing works in this category involve a super-linear number of public-key operations, resulting in poor practical efficiency. The second builds on oblivious transfer and symmetric-key techniques. The only existing work in this category is proposed by Liu and Gao (ASIACRYPT 2023), which features the best concrete performance among all existing protocols, despite its super-linear computation and communication. Unfortunately, it does not achieve the standard semi-honest security, as it inherently relies on a non-collusion assumption, which is unlikely to hold in practice. Therefore, the problem of constructing a practical MPSU protocol based on oblivious transfer and symmetric-key techniques in standard semi-honest model remains open. Furthermore, there is no MPSU protocol achieving both linear computation and linear communication complexity, which leaves another unresolved problem. In this work, we resolve these two open problems. We propose the first MPSU protocol based on oblivious transfer and symmetric-key techniques in the standard semi-honest model. This protocol is $4.9-9.3 \times$ faster than Liu and Gao in the LAN setting. Concretely, our protocol requires only $3.6$ seconds in online phase for 3 parties with sets of $2^{20}$ items each. We propose the first MPSU protocol achieving both linear computation and linear communication complexity, based on public-key operations. This protocol has the lowest overall communication costs and shows a factor of $3.0-36.5\times$ improvement in terms of overall communication compared to Liu and Gao.

Updated: 2024-06-17 07:36:15

标题: 突破束缚:无需非合谋假设的高效多方隐私集合求并

摘要: 多方隐私集合求并(MPSU)协议使$m$($m>2$)个各自持有一个集合的参与方能够共同计算它们集合的并集,而不向其他参与方透露任何额外信息。MPSU协议主要分为两类:第一类建立在公钥技术之上,该类别中的所有现有工作都涉及超线性数量的公钥操作,导致实际效率较低。第二类建立在不经意传输和对称密钥技术之上,该类别中唯一的现有工作由刘和高(ASIACRYPT 2023)提出,尽管其计算和通信都是超线性的,但在所有现有协议中具有最佳的具体性能。不幸的是,它没有实现标准的半诚实安全性,因为它固有地依赖于一个在实践中不太可能成立的非合谋假设。因此,在标准半诚实模型中基于不经意传输和对称密钥技术构建实用MPSU协议的问题仍然悬而未决。此外,没有任何MPSU协议能够同时实现线性计算和线性通信复杂度,这留下了另一个未解决的问题。在这项工作中,我们解决了这两个未解决的问题。我们提出了第一个在标准半诚实模型中基于不经意传输和对称密钥技术的MPSU协议。在局域网设置中,该协议比刘和高的协议快4.9-9.3倍。具体来说,对于各自拥有$2^{20}$个元素的3个参与方,我们的协议在在线阶段仅需3.6秒。我们还提出了第一个基于公钥操作、同时实现线性计算和线性通信复杂度的MPSU协议。该协议具有最低的整体通信成本,在整体通信方面相比刘和高的协议有3.0-36.5倍的改进。

更新时间: 2024-06-17 07:36:15

领域: cs.CR

下载: http://arxiv.org/abs/2406.07011v2

An Open and Large-Scale Dataset for Multi-Modal Climate Change-aware Crop Yield Predictions

Precise crop yield predictions are of national importance for ensuring food security and sustainable agricultural practices. While AI-for-science approaches have exhibited promising achievements in solving many scientific problems such as drug discovery, precipitation nowcasting, etc., the development of deep learning models for predicting crop yields is constantly hindered by the lack of an open and large-scale deep learning-ready dataset with multiple modalities to accommodate sufficient information. To remedy this, we introduce the CropNet dataset, the first terabyte-sized, publicly available, and multi-modal dataset specifically targeting climate change-aware crop yield predictions for the contiguous United States (U.S.) continent at the county level. Our CropNet dataset is composed of three modalities of data, i.e., Sentinel-2 Imagery, WRF-HRRR Computed Dataset, and USDA Crop Dataset, for over 2200 U.S. counties spanning 6 years (2017-2022), expected to facilitate researchers in developing versatile deep learning models for timely and precisely predicting crop yields at the county-level, by accounting for the effects of both short-term growing season weather variations and long-term climate change on crop yields. Besides, we develop the CropNet package, offering three types of APIs, for facilitating researchers in downloading the CropNet data on the fly over the time and region of interest, and flexibly building their deep learning models for accurate crop yield predictions. Extensive experiments have been conducted on our CropNet dataset via employing various types of deep learning solutions, with the results validating the general applicability and the efficacy of the CropNet dataset in climate change-aware crop yield predictions.

Updated: 2024-06-17 07:35:12

标题: 一个用于多模态气候变化感知作物产量预测的开放且大规模数据集

摘要: 精确的作物产量预测对于确保粮食安全和可持续农业实践具有国家重要性。虽然AI-for-science方法在解决药物发现、降水临近预报等许多科学问题上取得了有希望的成就,但用于预测作物产量的深度学习模型的发展一直受制于缺乏一个开放、大规模、可直接用于深度学习且包含多种模态以容纳足够信息的数据集。为了解决这一问题,我们引入了CropNet数据集,这是第一个TB级、公开可用、专门针对气候变化感知作物产量预测的多模态数据集,以县级粒度覆盖美国本土2200多个县,跨越6年(2017-2022)。该数据集由Sentinel-2图像、WRF-HRRR计算数据集和USDA作物数据集三种数据模态组成,旨在帮助研究人员开发多功能的深度学习模型,通过同时考虑短期生长季节天气变化和长期气候变化对作物产量的影响,及时且精确地预测县级作物产量。此外,我们开发了CropNet软件包,提供三种类型的API,帮助研究人员按感兴趣的时间和地区即时下载CropNet数据,并灵活构建用于精确作物产量预测的深度学习模型。通过采用各种类型的深度学习解决方案在我们的CropNet数据集上进行了广泛实验,结果验证了CropNet数据集在气候变化感知作物产量预测中的普适性和有效性。

更新时间: 2024-06-17 07:35:12

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2406.06081v2

Grounding Continuous Representations in Geometry: Equivariant Neural Fields

Recently, Neural Fields have emerged as a powerful modelling paradigm to represent continuous signals. In a conditional neural field, a field is represented by a latent variable that conditions the NeF, whose parametrisation is otherwise shared over an entire dataset. We propose Equivariant Neural Fields based on cross-attention transformers, in which NeFs are conditioned on a geometric conditioning variable, a latent point cloud, that enables an equivariant decoding from latent to field. Our equivariant approach induces a steerability property by which both field and latent are grounded in geometry and amenable to transformation laws: if the field transforms, the latent transforms accordingly, and vice versa. Crucially, the equivariance relation ensures that the latent is capable of (1) representing geometric patterns faithfully, allowing for geometric reasoning in latent space, and (2) weight-sharing over spatially similar patterns, allowing for efficient learning of datasets of fields. These main properties are validated using classification experiments and a verification of the capability of fitting entire datasets, in comparison to other non-equivariant NeF approaches. We further validate the potential of ENFs by demonstrating unique local field editing properties.

Updated: 2024-06-17 07:28:40

标题: 将连续表示植根于几何:等变神经场

摘要: 最近,神经场已经成为一种强大的建模范式,用于表示连续信号。在条件神经场中,一个场由一个潜变量表示,该变量作为NeF的条件,其参数化在整个数据集上共享。我们提出了基于交叉注意力变换器的等变神经场,其中NeF受到几何条件变量的调节,即一个潜在点云,这使得从潜在到场的等变解码成为可能。我们的等变方法引入了可操纵性属性,这样场和潜在都基于几何形状并且容易受到转换定律的影响,如果场发生转换,潜在也相应转换,反之亦然。至关重要的是,等变关系确保潜变量能够(1)忠实地表示几何图案,从而在潜在空间进行几何推理,(2)在空间相似图案上共享权重,从而有效地学习场数据集。这些主要特性通过分类实验进行验证,同时验证了相对于其他非等变NeF方法,我们的方法能够拟合整个数据集的能力。我们进一步通过展示独特的局部场编辑属性来验证ENF的潜力。

更新时间: 2024-06-17 07:28:40

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2406.05753v3

A New Index for Clustering Evaluation Based on Density Estimation

A new index for internal evaluation of clustering is introduced. The index is defined as a mixture of two sub-indices. The first sub-index $I_a$ is called the Ambiguous Index; the second sub-index $I_s$ is called the Similarity Index. Calculation of the two sub-indices is based on density estimation for each cluster of a partition of the data. An experiment is conducted on a set of 145 datasets to test the performance of the new index, comparing it with six other internal clustering evaluation indices -- the Calinski-Harabasz index, Silhouette coefficient, Davies-Bouldin index, CDbw, DBCV, and VIASCKDE. The results show the new index significantly outperforms the other internal clustering evaluation indices.
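
Both sub-indices rest on per-cluster density estimates, so a sketch of that shared ingredient may help: fit one kernel density model per cluster and evaluate every point under every cluster's density. The bandwidth and the margin heuristic at the end are assumptions; the actual $I_a$ and $I_s$ formulas are in the paper.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def per_cluster_log_densities(X, labels, bandwidth=0.5):
        """Return an (n_points, n_clusters) matrix of log-densities."""
        clusters = np.unique(labels)
        out = np.empty((len(X), len(clusters)))
        for j, c in enumerate(clusters):
            kde = KernelDensity(bandwidth=bandwidth).fit(X[labels == c])
            out[:, j] = kde.score_samples(X)   # log p_c(x) for every point
        return out

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    labels = np.repeat([0, 1], 50)             # two clusters in this toy
    logd = per_cluster_log_densities(X, labels)
    # A point is "ambiguous" when its own cluster's density barely beats
    # the competing cluster's density.
    own = logd[np.arange(len(X)), labels]
    other = logd[np.arange(len(X)), 1 - labels]
    print(f"points with small margin: {(np.abs(own - other) < 1).mean():.0%}")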

Updated: 2024-06-17 07:27:57

标题: 基于密度估计的聚类评价新指标

摘要: 介绍了一种用于聚类内部评估的新指标。该指标定义为两个子指标的混合。第一个子指标$I_a$被称为模糊指数;第二个子指标$I_s$被称为相似性指数。两个子指标的计算基于对数据分区中每个簇的密度估计。在一组145个数据集上进行了一项实验来测试新指标的性能,并与其他六个内部聚类评估指标(Calinski-Harabasz指数、轮廓系数、Davies-Bouldin指数、CDbw、DBCV和VIASCKDE)进行了比较。结果显示,新指标显著优于其他内部聚类评估指标。

更新时间: 2024-06-17 07:27:57

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2207.01294v4

Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking

In a rapidly evolving job market, skill demand forecasting is crucial as it enables policymakers and businesses to anticipate and adapt to changes, ensuring that workforce skills align with market needs, thereby enhancing productivity and competitiveness. Additionally, by identifying emerging skill requirements, it directs individuals towards relevant training and education opportunities, promoting continuous self-learning and development. However, the absence of comprehensive datasets presents a significant challenge, impeding research and the advancement of this field. To bridge this gap, we present Job-SDF, a dataset designed to train and benchmark job-skill demand forecasting models. Based on 10.35 million public job advertisements collected from major online recruitment platforms in China between 2021 and 2023, this dataset encompasses monthly recruitment demand for 2,324 types of skills across 521 companies. Our dataset uniquely enables evaluating skill demand forecasting models at various granularities, including occupation, company, and regional levels. We benchmark a range of models on this dataset, evaluating their performance in standard scenarios, in predictions focused on lower value ranges, and in the presence of structural breaks, providing new insights for further research. Our code and dataset are publicly accessible via the https://github.com/Job-SDF/benchmark.

Updated: 2024-06-17 07:22:51

标题: Job-SDF: 用于工作技能需求预测和基准比较的多粒度数据集

摘要: 在一个快速发展的就业市场中,技能需求预测至关重要,因为它使政策制定者和企业能够预见和适应变化,确保劳动力技能与市场需求保持一致,从而提高生产力和竞争力。此外,通过识别新兴的技能要求,它引导个人走向相关的培训和教育机会,促进持续的自我学习和发展。然而,缺乏全面数据集构成了一个重大挑战,阻碍了研究和该领域的进展。为了弥补这一差距,我们提出了Job-SDF,一个旨在训练和基准测试工作技能需求预测模型的数据集。该数据集基于2021年至2023年间从中国主要在线招聘平台收集的1035万个公开工作广告,涵盖了521家公司对2324种技能的月度招聘需求。我们的数据集独特地能够评估不同粒度的技能需求预测模型,包括职业、公司和地区层面。我们在这个数据集上对一系列模型进行基准测试,评估它们在标准情景、针对较低价值范围的预测以及在结构性突变存在的情况下的表现,为进一步研究提供新的见解。我们的代码和数据集可以通过https://github.com/Job-SDF/benchmark 公开访问。

更新时间: 2024-06-17 07:22:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.11920v1

Development of an Adaptive Multi-Domain Artificial Intelligence System Built using Machine Learning and Expert Systems Technologies

Producing an artificial general intelligence (AGI) has been an elusive goal in artificial intelligence (AI) research for some time. An AGI would have the capability, like a human, to be exposed to a new problem domain, learn about it and then use reasoning processes to make decisions. While AI techniques have been used across a wide variety of problem domains, an AGI would require an AI that could reason beyond its programming and training. This paper presents a small step towards producing an AGI. It describes a mechanism for an AI to learn about and develop reasoning pathways to make decisions in an a priori unknown domain. It combines a classical AI technique, the expert system, with its modern adaptation - the gradient descent trained expert system (GDTES) - and utilizes generative artificial intelligence (GAI) to create a network and training data set for this system. These can be created from available sources or may draw upon knowledge incorporated in a GAI's own pre-trained model. The learning process in GDTES is used to optimize the AI's decision-making. While this approach does not meet the standards that many have defined for an AGI, it provides a somewhat similar capability, albeit one which requires a learning process before use.

Updated: 2024-06-17 07:21:44

标题: 发展一个使用机器学习和专家系统技术构建的自适应多领域人工智能系统

摘要: 在人工智能研究中,实现人工通用智能(AGI)一直是一个艰巨的目标。一个AGI将有能力像人类一样接触新问题领域,学习并运用推理过程做出决策。虽然人工智能技术已经被广泛应用于各种问题领域,但一个AGI需要一个超越其编程和训练的人工智能来进行推理。本文提出了朝着实现AGI迈出的一小步。它描述了一种机制,让人工智能学习并发展推理路径,在一个先验未知的领域做出决策。它结合了经典人工智能技术——专家系统,以及其现代改进——梯度下降训练的专家系统(GDTES),并利用生成人工智能(GAI)来为该系统创建网络和训练数据集。这些可以从现有来源创建,也可以利用GAI自身预先训练的模型中融入的知识。GDTES中的学习过程被用来优化人工智能的决策制定。虽然这种方法不符合许多人对AGI定义的标准,但它提供了一种类似的能力,尽管在使用之前需要一个学习过程。

更新时间: 2024-06-17 07:21:44

领域: cs.AI

下载: http://arxiv.org/abs/2406.11272v1

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and three billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. Our data and code will be released at https://github.com/mlfoundations/MINT-1T.

Updated: 2024-06-17 07:21:36

标题: MINT-1T:通过10倍扩展开源多模态数据:具有一万亿标记的多模态数据集

摘要: 多模态交错数据集具有自由形式的交错图像和文本序列,对于训练前沿的大规模多模态模型(LMMs)至关重要。尽管开源LMMs的迅速发展,但仍然存在大规模、多样化的开源多模态交错数据集极度匮乏的问题。为此,我们介绍了MINT-1T,迄今为止规模最大、最多样化的开源多模态交错数据集。MINT-1T包括一万亿个文本令牌和三十亿张图像,是现有开源数据集的10倍规模。此外,我们还包括了之前未开发的来源,如PDF和ArXiv论文。由于扩展多模态交错数据集需要大量的工程努力,分享数据整理过程并发布数据集将极大地惠及社区。我们的实验表明,在MINT-1T上训练的LMMs能够与之前领先数据集OBELICS上训练的模型的性能相匹敌。我们的数据和代码将在https://github.com/mlfoundations/MINT-1T上发布。

更新时间: 2024-06-17 07:21:36

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.11271v1

A Closed-form Solution for Weight Optimization in Fully-connected Feed-forward Neural Networks

This work addresses weight optimization problem for fully-connected feed-forward neural networks. Unlike existing approaches that are based on back-propagation (BP) and chain rule gradient-based optimization (which implies iterative execution, potentially burdensome and time-consuming in some cases), the proposed approach offers the solution for weight optimization in closed-form by means of least squares (LS) methodology. In the case where the input-to-output mapping is injective, the new approach optimizes the weights in a back-propagating fashion in a single iteration by jointly optimizing a set of weights in each layer for each neuron. In the case where the input-to-output mapping is not injective (e.g., in classification problems), the proposed solution is easily adapted to obtain its final solution in a few iterations. An important advantage over the existing solutions is that these computations (for all neurons in a layer) are independent from each other; thus, they can be carried out in parallel to optimize all weights in a given layer simultaneously. Furthermore, its running time is deterministic in the sense that one can obtain the exact number of computations necessary to optimize the weights in all network layers (per iteration, in the case of non-injective mapping). Our simulation and empirical results show that the proposed scheme, BPLS, works well and is competitive with existing ones in terms of accuracy, but significantly surpasses them in terms of running time. To summarize, the new method is straightforward to implement, is competitive and computationally more efficient than the existing ones, and is well-tailored for parallel implementation.
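
As a rough illustration of the closed-form idea (not the paper's full BPLS procedure), the sketch below recovers a layer's weights with a single least-squares solve, given the layer inputs and target pre-activations. Each output column is an independent solve, which is what enables the per-layer parallelism the abstract describes.

```python
# Minimal sketch: solve a layer's weights in closed form by least squares,
# given the layer's inputs and desired pre-activations. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 500, 16, 8
A = rng.normal(size=(n, d_in))                        # layer inputs, one row per sample
W_true = rng.normal(size=(d_in, d_out))
Y = A @ W_true + 0.01 * rng.normal(size=(n, d_out))   # target pre-activations

# One LS solve recovers the weights of all d_out neurons at once; the columns
# are independent, so they could be solved in parallel across neurons.
W_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print("reconstruction error:", np.linalg.norm(W_hat - W_true))
```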

Updated: 2024-06-17 07:16:27

标题: 一个全连接前馈神经网络中的权重优化的封闭形式解决方案

摘要: 这项工作解决了全连接前馈神经网络的权重优化问题。与现有基于反向传播(BP)和链式规则梯度优化的方法不同(这意味着在某些情况下可能会导致迭代执行,可能繁重且耗时),所提出的方法通过最小二乘(LS)方法以闭合形式提供了权重优化的解决方案。在输入到输出映射为单射的情况下,新方法通过在每个神经元的每一层中联合优化一组权重,在单次迭代中以反向传播的方式优化权重。在输入到输出映射不是单射的情况下(例如,在分类问题中),所提出的解决方案很容易适应,在少数迭代中获得最终解决方案。与现有解决方案相比的一个重要优势是,这些计算(对于每一层中的所有神经元)是彼此独立的;因此,它们可以并行进行,同时优化给定层中的所有权重。此外,其运行时间在确定性意义上是确定的,因为可以获得优化所有网络层中权重所需的确切计算数量(在非单射映射的情况下,每次迭代)。我们的模拟和实证结果表明,所提出的方案BPLS工作良好,并且在准确性方面与现有方案竞争,但在运行时间方面明显超越它们。总之,这种新方法易于实现,具有竞争力且计算效率高于现有方法,并且非常适合并行实现。

更新时间: 2024-06-17 07:16:27

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2401.06699v2

The Fall of ROME: Understanding the Collapse of LLMs in Model Editing

Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it could disrupt LLMs with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution significantly differs from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our analysis, we propose a simple yet effective approach: uniformly using prefixed keys during editing phase and adding prefixes during the testing phase. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.
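
The first factor can be visualized with a toy rank-one edit of the general form used in ROME-style updates, $W \leftarrow W + (v^* - Wk)\,u^\top / (u^\top k)$, where $u$ stands in for the $C^{-1}k$ term of the actual update. The numbers below are synthetic; the point is only that a tiny denominator produces an excessively large parameter update.

```python
# Toy illustration (synthetic numbers, simplified update equation): the
# rank-one edit blows up when the denominator u^T k becomes tiny -- the
# situation the paper links to unprefixed first-token keys.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
k = rng.normal(size=8); k /= np.linalg.norm(k)   # key vector for the edited subject
v_star = rng.normal(size=8)                      # desired value
u_base = rng.normal(size=8)

for eps in [1.0, 1e-2, 1e-4]:
    u = u_base - (u_base @ k) * k + eps * k      # engineered so that u @ k == eps
    denom = u @ k
    delta = np.outer(v_star - W @ k, u) / denom  # rank-one parameter update
    print(f"denominator {denom:.1e} -> update norm {np.linalg.norm(delta):.1e}")
```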

Updated: 2024-06-17 07:08:29

标题: 罗马的衰落:理解模型编辑中LLMs的崩溃

摘要: 尽管模型编辑方法取得了显著进展,但它们在实际场景中的应用仍然具有挑战性,因为它们经常会导致大型语言模型(LLMs)崩溃。其中,ROME尤其令人担忧,因为它可能仅通过一次编辑就会破坏LLMs。在本文中,我们研究了这种崩溃的根本原因。通过广泛的分析,我们确定了导致崩溃的两个主要因素:i) 参数更新方程中对带前缀和不带前缀键的处理不一致可能导致非常小的分母,从而导致过度大的参数更新;ii) 崩溃案例的主体通常是第一个标记,其不带前缀键分布与自回归变换器中的带前缀键分布明显不同,导致上述问题变得具体化。为了验证我们的分析,我们提出了一个简单而有效的方法:在编辑阶段统一使用带前缀键,并在测试阶段添加前缀。实验结果表明,所提出的解决方案可以防止模型崩溃,同时保持编辑的有效性。

更新时间: 2024-06-17 07:08:29

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11263v1

Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop

Large Language Models (LLMs) are already widely used to generate content for a variety of online platforms. As we are not able to safely distinguish LLM-generated content from human-produced content, LLM-generated content is used to train the next generation of LLMs, giving rise to a self-consuming training loop. From the image generation domain we know that such a self-consuming training loop reduces both the quality and the diversity of images, eventually ending in model collapse. However, it is unclear whether this alarming effect can also be observed for LLMs. Therefore, we present the first study investigating the self-consuming training loop for LLMs. Further, we propose a novel method based on logic expressions that allows us to unambiguously verify the correctness of LLM-generated content, which is difficult for natural language text. We find that the self-consuming training loop produces correct outputs; however, the output declines in diversity depending on the proportion of generated data used. Fresh data can slow down this decline, but not stop it. Given these concerning results, we encourage researchers to study methods to counteract this process.

Updated: 2024-06-17 07:07:30

标题: 大型语言模型受到自身输出的影响:自我消耗训练循环的分析

摘要: 大型语言模型(LLM)已经广泛用于为各种在线平台生成内容。由于我们无法安全地区分LLM生成的内容和人类生成的内容,LLM生成的内容被用来训练下一代LLM,从而产生自我消耗的训练循环。从图像生成领域我们知道,这种自我消耗的训练循环会降低图像的质量和多样性,最终导致模型崩溃。然而,目前尚不清楚这种令人担忧的影响是否也适用于LLM。因此,我们提出了第一项研究,对LLM的自我消耗训练循环进行了调查。此外,我们提出了一种基于逻辑表达式的新方法,可以明确验证LLM生成的内容的正确性,这在自然语言文本中很难做到。我们发现自我消耗的训练循环会产生正确的输出,但是输出的多样性会随着使用生成数据的比例而下降。新鲜数据可以减缓这种下降,但无法阻止它。鉴于这些令人担忧的结果,我们鼓励研究人员研究方法来消除这个过程。

更新时间: 2024-06-17 07:07:30

领域: cs.LG,cs.CL,cs.NE

下载: http://arxiv.org/abs/2311.16822v2

Towards Safe Multi-Task Bayesian Optimization

Bayesian optimization has emerged as a highly effective tool for the safe online optimization of systems, due to its high sample efficiency and noise robustness. To further enhance its efficiency, reduced physical models of the system can be incorporated into the optimization process, accelerating it. These models are able to offer an approximation of the actual system, and evaluating them is significantly cheaper. The similarity between the model and reality is represented by additional hyperparameters, which are learned within the optimization process. Safety is a crucial criterion for online optimization methods such as Bayesian optimization, which has been addressed by recent works that provide safety guarantees under the assumption of known hyperparameters. In practice, however, this does not apply. Therefore, we extend the robust Gaussian process uniform error bounds to meet the multi-task setting, which involves the calculation of a confidence region from the hyperparameter posterior distribution utilizing Markov chain Monte Carlo methods. Subsequently, the robust safety bounds are employed to facilitate the safe optimization of the system, while incorporating measurements of the models. Simulation results indicate that the optimization can be significantly accelerated for expensive to evaluate functions in comparison to other state-of-the-art safe Bayesian optimization methods, contingent on the fidelity of the models.
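
A simplified sketch of the robustness idea: instead of trusting a single hyperparameter estimate, draw several lengthscale values (standing in for MCMC posterior samples) and keep the most conservative confidence bound across them. The RBF kernel, the fixed scaling factor beta, and the toy data below are all assumptions for illustration.

```python
# Sketch (simplified): robust GP confidence bounds taken as the worst case
# over sampled hyperparameters, rather than a single point estimate.
import numpy as np

def gp_posterior(Xt, yt, Xq, ls, noise=1e-2):
    k = lambda A, B: np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)
    K = k(Xt, Xt) + noise * np.eye(len(Xt))
    Ks, Kss = k(Xq, Xt), k(Xq, Xq)
    mu = Ks @ np.linalg.solve(K, yt)
    var = np.clip(np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T)), 0, None)
    return mu, np.sqrt(var)

Xt = np.array([0.1, 0.4, 0.9]); yt = np.sin(6 * Xt)   # observed evaluations
Xq = np.linspace(0, 1, 5)                             # candidate points
beta = 2.0                                            # assumed confidence scaling
ls_samples = [0.1, 0.2, 0.4]   # stand-in for hyperparameter posterior draws

lows, highs = [], []
for ls in ls_samples:
    mu, sd = gp_posterior(Xt, yt, Xq, ls)
    lows.append(mu - beta * sd)
    highs.append(mu + beta * sd)

# Conservative region: bounds that hold for every sampled hyperparameter.
print("robust lower bound:", np.min(lows, axis=0))
print("robust upper bound:", np.max(highs, axis=0))
```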

Updated: 2024-06-17 07:05:43

标题: 朝着安全的多任务贝叶斯优化方向

摘要: 贝叶斯优化已经成为一种非常有效的工具,用于系统的安全在线优化,这是由于其高样本效率和噪声鲁棒性。为了进一步提高其效率,可以将系统的简化物理模型纳入优化过程中,加速优化过程。这些模型能够提供实际系统的近似,并且评估它们成本显著较低。模型与现实之间的相似性由额外的超参数表示,这些超参数在优化过程中学习。安全性是在线优化方法(如贝叶斯优化)的一个关键标准,最近的一些研究作品提出了在已知超参数的假设下提供安全保证的方法。然而,在实践中,这并不适用。因此,我们将鲁棒高斯过程均匀误差边界扩展到满足多任务设置,这涉及利用马尔可夫链蒙特卡洛方法从超参数后验分布计算置信区间。随后,鲁棒安全边界被用于促进系统的安全优化,同时结合模型的测量。模拟结果表明,在模型的保真度条件下,与其他最先进的安全贝叶斯优化方法相比,对于昂贵的评估函数,优化可以显着加速。

更新时间: 2024-06-17 07:05:43

领域: cs.LG,cs.SY,eess.SY,stat.ML

下载: http://arxiv.org/abs/2312.07281v3

Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection

The spread of fake news negatively impacts individuals and is regarded as a significant social challenge that needs to be addressed. A number of algorithmic and insightful features have been identified for detecting fake news. However, with the recent LLMs and their advanced generation capabilities, many of the detectable features (e.g., style-conversion attacks) can be altered, making it more challenging to distinguish from real news. This study proposes adversarial style augmentation, AdStyle, to train a fake news detector that remains robust against various style-conversion attacks. Our model's key mechanism is the careful use of LLMs to automatically generate a diverse yet coherent range of style-conversion attack prompts. This improves the generation of prompts that are particularly difficult for the detector to handle. Experiments show that our augmentation strategy improves robustness and detection performance when tested on fake news benchmark datasets.

Updated: 2024-06-17 07:00:41

标题: 对抗风格增强:基于大型语言模型的强大假新闻检测

摘要: 虚假新闻的传播对个人产生负面影响,被视为需要解决的重大社会挑战。已经确定了许多用于检测虚假新闻的算法和洞察功能。然而,随着最近的LLMs及其先进的生成能力,许多可检测的特征(例如,风格转换攻击)可以被改变,使其更难以与真实新闻区分开来。本研究提出了对抗风格增强,AdStyle,用于训练一个防止各种风格转换攻击的虚假新闻检测器。我们模型的关键机制是谨慎使用LLMs自动生成多样化但连贯的风格转换攻击提示。这提高了那些对检测器处理特别困难的提示的生成。实验表明,我们的增强策略在虚假新闻基准数据集上的测试中提高了稳健性和检测性能。

更新时间: 2024-06-17 07:00:41

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11260v1

Auto-ICL: In-Context Learning without Human Supervision

With in-context learning ability, the performance of large language models can be significantly boosted when provided with appropriate context. However, existing in-context learning methods mainly rely on human-provided contexts, such as labeled examples and explicit instructions. Writing context by humans is labor-intensive on various tasks and limits the model to tasks manageable by humans. To overcome these limitations, we propose Automatic In-Context Learning framework that enables the model to autonomously generate examples and instructions for problem-solving. With experiments across various models and datasets, results show that model-generated contexts outperform human-annotated contexts, including Few-Shot and Few-Shot-CoT methods, and surpass existing self-generated context methods like Zero-CoT and Auto-CoT.

Updated: 2024-06-17 06:53:41

标题: Auto-ICL: 无需人类监督的上下文学习

摘要: 通过上下文学习能力,当大型语言模型提供适当的上下文时,其性能可以显著提升。然而,现有的上下文学习方法主要依赖于人类提供的上下文,如标记示例和明确指导。人类编写上下文在各种任务上耗费大量人力,并限制了模型应对人类可管理的任务。为了克服这些限制,我们提出了自动上下文学习框架,使模型能够自主生成解决问题的示例和指导。通过在各种模型和数据集上进行实验,结果显示,模型生成的上下文优于人类注释的上下文,包括Few-Shot和Few-Shot-CoT方法,并超过现有的自动生成上下文方法,如Zero-CoT和Auto-CoT。

更新时间: 2024-06-17 06:53:41

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2311.09263v2

NLDF: Neural Light Dynamic Fields for Efficient 3D Talking Head Generation

Talking head generation based on neural radiance fields has shown promising visual effects. However, the slow rendering speed of NeRF severely limits its application, due to the burdensome calculation process over hundreds of sampled points to synthesize one pixel. In this work, a novel Neural Light Dynamic Fields model is proposed, aiming to generate high-quality 3D talking faces with a significant speedup. The NLDF represents light fields based on light segments, and a deep network is used to learn the entire light beam's information at once. During training, knowledge distillation is applied, and the NeRF-based synthesized result is used to guide the correct coloration of light segments in NLDF. Furthermore, a novel active-pool training strategy is proposed to focus on high-frequency movements, particularly the speaker's mouth and eyebrows. The proposed method effectively represents the facial light dynamics in 3D talking video generation, and it achieves approximately 30 times faster rendering than the state-of-the-art NeRF-based method, with comparable visual quality.

Updated: 2024-06-17 06:53:37

标题: NLDF:神经光动态场用于高效的3D语音生成

摘要: 基于神经辐射场模型的说话头部生成已经展示出令人期待的视觉效果。然而,NeRF的渲染速度较慢严重限制了其应用,因为在合成一个像素时需要对数百个采样点进行繁重的计算过程。在本研究中,提出了一种新颖的神经光动态场模型,旨在实现高质量的3D说话面部生成并显著加速。NLDF基于光段表示光场,并使用深度网络一次性学习整个光束的信息。在学习过程中应用了知识蒸馏,并使用基于NeRF的合成结果来指导NLDF中光段的正确着色。此外,提出了一种新颖的主动池训练策略,专注于高频率动作,特别是说话者的嘴部和眉毛。该方法有效地表示了3D说话视频生成中的面部光动态,并与NeRF方法相比实现了约30倍的更快速度,同时具有可比较的生成视觉质量。

更新时间: 2024-06-17 06:53:37

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.11259v1

ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking

Large language models (LLMs) have recently attracted significant attention in the field of artificial intelligence. However, the training process of these models poses significant challenges in terms of computational and storage capacities, so compressing checkpoints has become an urgent problem. In this paper, we propose a novel Extreme Checkpoint Compression (ExCP) framework, which significantly reduces the required storage of training checkpoints while achieving nearly lossless performance. We first calculate the residuals of adjacent checkpoints to obtain the essential but sparse information needed for a higher compression ratio. To further mine the redundant parameters in checkpoints, we then propose a weight-momentum joint shrinking method that exploits another important piece of information from model optimization, i.e., momentum. In particular, we exploit the information of both the model and the optimizer to discard as many parameters as possible while preserving critical information to ensure optimal performance. Furthermore, we utilize non-uniform quantization to further compress the storage of checkpoints. We extensively evaluate our proposed ExCP framework on several models ranging from 410M to 7B parameters and demonstrate significant storage reduction while maintaining strong performance. For instance, we achieve approximately $70\times$ compression for the Pythia-410M model, with the final performance being as accurate as the original model on various downstream tasks. Codes will be available at https://github.com/Gaffey/ExCP.
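
A hedged sketch of two of the ingredients is given below: residuals between adjacent checkpoints, and a joint weight-momentum importance score for deciding which entries to drop. The particular score and the 10% keep-ratio are illustrative assumptions, not the paper's actual shrinking rule.

```python
# Illustrative sketch of two ExCP ingredients (thresholds and the scoring
# rule are assumptions; see the paper/repo for the real method).
import numpy as np

rng = np.random.default_rng(0)
w_prev = rng.normal(size=10_000)
w_curr = w_prev + 1e-3 * rng.normal(size=10_000)  # adjacent checkpoints differ little
momentum = 1e-2 * rng.normal(size=10_000)         # optimizer state

residual = w_curr - w_prev                        # small, compresses well
score = np.abs(residual) * np.abs(momentum)       # hypothetical joint importance
keep = score > np.quantile(score, 0.9)            # keep only the top 10% of entries
sparse_residual = np.where(keep, residual, 0.0)

recon = w_prev + sparse_residual
print("kept fraction:", keep.mean())
print("max reconstruction error:", np.abs(recon - w_curr).max())
```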

Updated: 2024-06-17 06:47:29

标题: ExCP:通过权重-动量联合收缩实现极端LLM检查点压缩

摘要: 大型语言模型(LLM)最近在人工智能领域引起了重大关注。然而,这些模型的训练过程在计算和存储容量方面存在重大挑战,因此压缩检查点已成为一个紧迫的问题。在本文中,我们提出了一种新颖的极限检查点压缩(ExCP)框架,可以显著减少训练检查点所需的存储空间,同时实现几乎无损性能。我们首先计算相邻检查点的残差,以获取更高的压缩比率所需的基本但稀疏信息。为了进一步挖掘检查点中的冗余参数,我们提出了一种权重-动量联合收缩方法,以在模型优化过程中利用另一重要信息,即动量。特别是,我们利用模型和优化器的信息来丢弃尽可能多的参数,同时保留关键信息以确保最佳性能。此外,我们利用非均匀量化进一步压缩检查点的存储空间。我们对多个参数范围从410M到7B的模型对我们提出的ExCP框架进行了广泛评估,并展示了显著的存储空间减少,同时保持强大的性能。例如,我们对Pythia-410M模型实现了约70倍的压缩,最终性能在各种后续任务中与原始模型一样精确。代码可在https://github.com/Gaffey/ExCP找到。

更新时间: 2024-06-17 06:47:29

领域: cs.LG

下载: http://arxiv.org/abs/2406.11257v1

Liberal Entity Matching as a Compound AI Toolchain

Entity matching (EM), the task of identifying whether two descriptions refer to the same entity, is essential in data management. Traditional methods have evolved from rule-based to AI-driven approaches, yet current techniques using large language models (LLMs) often fall short due to their reliance on static knowledge and rigid, predefined prompts. In this paper, we introduce Libem, a compound AI system designed to address these limitations by incorporating a flexible, tool-oriented approach. Libem supports entity matching through dynamic tool use, self-refinement, and optimization, allowing it to adapt and refine its process based on the dataset and performance metrics. Unlike traditional solo-AI EM systems, which often suffer from a lack of modularity that hinders iterative design improvements and system optimization, Libem offers a composable and reusable toolchain. This approach aims to contribute to ongoing discussions and developments in AI-driven data management.

Updated: 2024-06-17 06:33:34

标题: 自由实体匹配作为一个复合人工智能工具链

摘要: 实体匹配(EM)是在数据管理中识别两个描述是否指向同一实体的任务,是至关重要的。传统方法已经从基于规则的发展到基于人工智能的方法,但当前使用大型语言模型(LLMs)的技术往往由于依赖静态知识和严格的预定义提示而表现不佳。在本文中,我们介绍了Libem,这是一个复合人工智能系统,旨在通过融入灵活的、面向工具的方法来解决这些限制。Libem通过动态工具使用、自我精炼和优化支持实体匹配,使其能够根据数据集和性能指标进行适应和优化。与传统的独立AI EM系统不同,这些系统往往缺乏模块化,从而阻碍了迭代设计改进和系统优化,Libem提供了一个可组合和可重复使用的工具链。这种方法旨在为基于人工智能的数据管理领域的持续讨论和发展做出贡献。

更新时间: 2024-06-17 06:33:34

领域: cs.DB,cs.AI,cs.SE

下载: http://arxiv.org/abs/2406.11255v1

ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons

Delivering precise point and distributional forecasts across a spectrum of prediction horizons represents a significant and enduring challenge in the application of time-series forecasting within various industries. Prior research on developing deep learning models for time-series forecasting has often concentrated on isolated aspects, such as long-term point forecasting or short-term probabilistic estimations. This narrow focus may result in skewed methodological choices and hinder the adaptability of these models to uncharted scenarios. While there is a rising trend in developing universal forecasting models, a thorough understanding of their advantages and drawbacks, especially regarding essential forecasting needs like point and distributional forecasts across short and long horizons, is still lacking. In this paper, we present ProbTS, a benchmark tool designed as a unified platform to evaluate these fundamental forecasting needs and to conduct a rigorous comparative analysis of numerous cutting-edge studies from recent years. We dissect the distinctive data characteristics arising from disparate forecasting requirements and elucidate how these characteristics can skew methodological preferences in typical research trajectories, which often fail to fully accommodate essential forecasting needs. Building on this, we examine the latest models for universal time-series forecasting and discover that our analyses of methodological strengths and weaknesses are also applicable to these universal models. Finally, we outline the limitations inherent in current research and underscore several avenues for future exploration.

Updated: 2024-06-17 06:33:10

标题: ProbTS:在各种预测时段上对点预测和分布预测进行基准测试

摘要: 在各行业中应用时间序列预测时,提供准确的点预测和分布预测跨越各种预测时间跨度代表着一个重大且持久的挑战。先前关于为时间序列预测开发深度学习模型的研究通常集中在孤立的方面,如长期点预测或短期概率估计。这种狭隘的关注可能导致偏斜的方法选择,并阻碍这些模型对未知情境的适应能力。虽然近年来开发通用预测模型的趋势在上升,但对于其优点和缺点,尤其是在点和分布预测跨越短期和长期时间跨度这样的基本预测需求方面,仍然缺乏深入的了解。在本文中,我们提出了ProbTS,这是一个设计成一个统一平台的基准工具,旨在评估这些基本的预测需求,并对近年来许多尖端研究进行严格的比较分析。我们剖析了不同的预测需求所产生的独特数据特征,并阐明这些特征如何可能偏斜典型研究轨迹中的方法偏好,这些方法往往未能充分适应基本的预测需求。基于此,我们审查了最新的通用时间序列预测模型,并发现我们对方法论优势和劣势的分析也适用于这些通用模型。最后,我们概述了当前研究中固有的限制,并强调了未来探索的几个途径。

更新时间: 2024-06-17 06:33:10

领域: cs.LG

下载: http://arxiv.org/abs/2310.07446v4

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e. strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterpart's performance on transformers. The second-order perspective also has practical benefits for developing non-diagonal adaptive methods through the concept of preconditioner invariance. In contrast to root-based methods like Shampoo, root-free counterparts work well and fast with half-precision since they do not require numerically unstable matrix root decompositions and inversions. This is useful to bridge the computation gap between diagonal and non-diagonal methods. Our findings provide new insights into the development of adaptive methods and raise important questions regarding the currently overlooked role of adaptivity for their success. (experiment code: https://github.com/yorkerlin/remove-the-square-root optimizer code: https://github.com/f-dangel/sirfshampoo)
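
The core change is easy to state in code: below is a minimal, simplified Adam-style step with and without the square root on the preconditioner. Bias correction, weight decay, and the paper's additional adjustments (e.g., learning-rate rescaling for the root-free variant) are omitted.

```python
# Simplified diagonal adaptive step, with and without the square root.
import numpy as np

def step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, use_root=True):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g          # second-moment preconditioner estimate
    denom = np.sqrt(v) + eps if use_root else v + eps
    return w - lr * m / denom, m, v

w = np.ones(3); g = np.array([0.1, -0.5, 2.0])
m = np.zeros(3); v = np.zeros(3)
w_root, *_ = step(w, g, m, v, use_root=True)    # Adam-like update
w_free, *_ = step(w, g, m, v, use_root=False)   # square-root-free variant
print(w_root, w_free)
```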

Updated: 2024-06-17 06:23:51

标题: 我们能否在自适应梯度方法中去除平方根?第二阶角度视角

摘要: 像Adam(W)这样的自适应梯度优化器是许多深度学习架构的默认训练算法,如transformers。它们的对角预处理器基于梯度外积,通过平方根纳入参数更新。虽然这些方法通常被解释为近似二阶方法,但平方根代表了一个基本差异。在这项工作中,我们研究了当我们去掉平方根时,即加强它们的二阶动机,自适应方法的行为如何改变。令人惊讶的是,我们发现这种无根自适应方法将卷积架构的泛化差距缩小到与SGD相当,同时保持其基于根的对应方法在transformers上的性能。二阶视角也对通过预处理器不变性概念开发非对角自适应方法具有实际益处。与Shampoo这样的基于根的方法相比,无根对应方法在半精度下运行良好且快速,因为它们不需要数值不稳定的矩阵根分解和求逆。这有助于弥合对角和非对角方法之间的计算差距。我们的发现为自适应方法的发展提供了新的见解,并引发了关于自适应性在其成功中当前被忽视角色的重要问题。(实验代码:https://github.com/yorkerlin/remove-the-square-root 优化器代码:https://github.com/f-dangel/sirfshampoo)

更新时间: 2024-06-17 06:23:51

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2402.03496v5

Relational Learning in Pre-Trained Models: A Theory from Hypergraph Recovery Perspective

Foundation Models (FMs) have demonstrated remarkable insights into the relational dynamics of the world, leading to the crucial question: how do these models acquire an understanding of world hybrid relations? Traditional statistical learning, particularly for prediction problems, may overlook the rich and inherently structured information from the data, especially regarding the relationships between objects. We introduce a mathematical model that formalizes relational learning as hypergraph recovery to study pre-training of FMs. In our framework, the world is represented as a hypergraph, with data abstracted as random samples from hyperedges. We theoretically examine the feasibility of a Pre-Trained Model (PTM) to recover this hypergraph and analyze the data efficiency in a minimax near-optimal style. By integrating rich graph theories into the realm of PTMs, our mathematical framework offers powerful tools for an in-depth understanding of pre-training from a unique perspective and can be used under various scenarios. As an example, we extend the framework to entity alignment in multimodal learning.

Updated: 2024-06-17 06:20:39

标题: 预训练模型中的关系学习:来自超图恢复视角的理论

摘要: 基础模型(FMs)展示了对世界关系动态的显著洞察力,引发了一个关键问题:这些模型如何获得对世界混合关系的理解?传统的统计学习,特别是针对预测问题,可能会忽视数据中丰富且固有结构化的信息,特别是关于对象之间关系的信息。我们引入了一个数学模型,将关系学习形式化为超图恢复,以研究FMs的预训练。在我们的框架中,世界被表示为一个超图,数据被抽象为来自超边的随机样本。我们在理论上研究了预训练模型(PTM)恢复这个超图的可行性,并分析了一种极小化近似最优风格下的数据效率。通过将丰富的图论集成到PTMs领域中,我们的数学框架为从独特视角深入理解预训练提供了强大的工具,并可在各种场景下使用。作为例子,我们将这一框架扩展到多模态学习中的实体对齐。

更新时间: 2024-06-17 06:20:39

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11249v1

Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9

We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.

Updated: 2024-06-17 06:19:14

标题: 2024年DCASE挑战任务9基于大型语言模型的语言查询音频源分离性能改进的标题增强

摘要: 我们提出了一种基于提示工程的文本增强方法,应用于语言查询的音频源分离(LASS)任务。为了增强LASS的性能,所提出的方法利用大型语言模型(LLMs)生成与训练数据集中每个句子相对应的多个标题。为此,我们首先进行实验,以识别具有较少数量标题的最有效提示,用于标题增强。使用这些增强标题训练的LASS模型在DCASE 2024任务9验证集上表现出比没有增强训练的模型更好的性能。这项研究突出了LLM基于标题增强在推进语言查询的音频源分离方面的有效性。

更新时间: 2024-06-17 06:19:14

领域: eess.AS,cs.AI,cs.SD

下载: http://arxiv.org/abs/2406.11248v1

Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation

In this work, we leverage the intrinsic segmentation of language sequences and design a new positional encoding method called Bilevel Positional Encoding (BiPE). For each position, our BiPE blends an intra-segment encoding and an inter-segment encoding. The intra-segment encoding identifies the locations within a segment and helps the model capture the semantic information therein via absolute positional encoding. The inter-segment encoding specifies the segment index, models the relationships between segments, and aims to improve extrapolation capabilities via relative positional encoding. Theoretical analysis shows this disentanglement of positional information makes learning more effective. The empirical results also show that our BiPE has superior length extrapolation capabilities across a wide range of tasks in diverse text modalities.
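
A minimal sketch of the bilevel decomposition: each token position is split into an intra-segment position and a segment index. The segmentation rule below (splitting on a separator token) is an assumption for illustration; how BiPE turns the two indices into absolute and relative encodings is described in the paper.

```python
# Decompose flat token positions into (intra-segment position, segment index).
tokens = ["The", "cat", "sat", ".", "It", "slept", "."]

intra, inter = [], []
seg, pos = 0, 0
for t in tokens:
    intra.append(pos)   # position within the current segment -> absolute PE
    inter.append(seg)   # segment index -> relative PE across segments
    pos += 1
    if t == ".":        # assumed segment boundary
        seg, pos = seg + 1, 0

print(list(zip(tokens, intra, inter)))
# [('The', 0, 0), ('cat', 1, 0), ('sat', 2, 0), ('.', 3, 0),
#  ('It', 0, 1), ('slept', 1, 1), ('.', 2, 1)]
```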

Updated: 2024-06-17 06:18:13

标题: 两颗石头打一个鸟:双层位置编码用于更好的长度外推

摘要: 在这项工作中,我们利用语言序列的内在分割,并设计了一种新的位置编码方法,称为双层位置编码(BiPE)。对于每个位置,我们的BiPE混合了一个内部段编码和一个间隔段编码。内部段编码确定了段内的位置,并通过绝对位置编码帮助模型捕获其中的语义信息。间隔段编码指定了段索引,建模了段之间的关系,并旨在通过相对位置编码提高外推能力。理论分析表明,这种位置信息的分离使学习更加有效。实证结果还表明,我们的BiPE在各种文本模态的任务中具有优越的长度外推能力。

更新时间: 2024-06-17 06:18:13

领域: cs.LG,cs.AI,cs.CL,stat.ML

下载: http://arxiv.org/abs/2401.16421v2

Deep-Reinforcement-Learning-Based AoI-Aware Resource Allocation for RIS-Aided IoV Networks

Reconfigurable Intelligent Surface (RIS) is a pivotal technology in communication, offering an alternative path that significantly enhances the link quality in wireless communication environments. In this paper, we propose a RIS-assisted internet of vehicles (IoV) network, considering the vehicle-to-everything (V2X) communication method. In addition, in order to improve the timeliness of vehicle-to-infrastructure (V2I) links and the stability of vehicle-to-vehicle (V2V) links, we introduce the age of information (AoI) model and the payload transmission probability model. Therefore, with the objective of minimizing the AoI of V2I links and prioritizing transmission of V2V link payloads, we formulate this optimization problem as a Markov decision process (MDP) in which the BS serves as an agent to allocate resources and control phase-shifts for the vehicles using the soft actor-critic (SAC) algorithm, which gradually converges and maintains high stability. An AoI-aware joint vehicular resource allocation and RIS phase-shift control scheme based on the SAC algorithm is proposed, and simulation results show that its convergence speed, cumulative reward, AoI performance, and payload transmission probability outperform those of proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin delayed deep deterministic policy gradient (TD3) and stochastic algorithms.

Updated: 2024-06-17 06:16:07

标题: 基于深度强化学习的AoI感知资源分配在RIS辅助的IoV网络中的应用

摘要: Reconfigurable Intelligent Surface(RIS)是通信中的一个关键技术,为无线通信环境中显著提升链路质量提供了另一种选择。本文提出了一个RIS辅助的车联网(IoV)网络,考虑了车辆对一切(V2X)通信方法。此外,为了提高车辆对基础设施(V2I)链路的及时性和车辆对车辆(V2V)链路的稳定性,我们引入了信息时代(AoI)模型和有效载荷传输概率模型。因此,为了最小化V2I链路的AoI,并优先传输V2V链路的有效载荷,我们将这个优化问题构建为一个马尔科夫决策过程(MDP)问题,在其中基站充当代理来分配资源并控制车辆的相移,使用软演员-评论家(SAC)算法,逐渐收敛并保持高稳定性。提出了一种基于SAC算法的AoI感知联合车辆资源分配和RIS相移控制方案,并仿真结果表明其收敛速度、累积奖励、AoI性能和有效载荷传输概率均优于近邻策略优化(PPO)、深度确定性策略梯度(DDPG)、双延迟深度确定性策略梯度(TD3)和随机算法。

更新时间: 2024-06-17 06:16:07

领域: cs.LG,cs.DC,cs.NI,eess.SP

下载: http://arxiv.org/abs/2406.11245v1

SpoT-Mamba: Learning Long-Range Dependency on Spatio-Temporal Graphs with Selective State Spaces

Spatio-temporal graph (STG) forecasting is a critical task with extensive applications in the real world, including traffic and weather forecasting. Although several recent methods have been proposed to model complex dynamics in STGs, addressing long-range spatio-temporal dependencies remains a significant challenge, leading to limited performance gains. Inspired by a recently proposed state space model named Mamba, which has shown remarkable capability of capturing long-range dependency, we propose a new STG forecasting framework named SpoT-Mamba. SpoT-Mamba generates node embeddings by scanning various node-specific walk sequences. Based on the node embeddings, it conducts temporal scans to capture long-range spatio-temporal dependencies. Experimental results on the real-world traffic forecasting dataset demonstrate the effectiveness of SpoT-Mamba.

Updated: 2024-06-17 06:15:31

标题: SpoT-Mamba:利用选择性状态空间在时空图上学习长距离依赖

摘要: 时空图(STG)预测是一个在现实世界中具有广泛应用的关键任务,包括交通和天气预测。虽然最近提出了几种方法来模拟STG中的复杂动态,但解决长距离时空依赖仍然是一个重要挑战,导致性能提升有限。受最近提出的名为Mamba的状态空间模型的启发,该模型显示出捕捉长距离依赖的显著能力,我们提出了一个新的STG预测框架,命名为SpoT-Mamba。SpoT-Mamba通过扫描各种节点特定的行走序列生成节点嵌入。基于节点嵌入,它进行时间扫描以捕捉长距离时空依赖。对真实世界交通预测数据集的实验结果证明了SpoT-Mamba的有效性。

更新时间: 2024-06-17 06:15:31

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.11244v1

FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation

Language models have shown impressive in-context learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation and propose label-agnostic prompt metrics that can better estimate end-task performance. One popular approach is using perplexity as a way to measure a model's familiarity with the prompt. While such familiarity metrics show consistent improvements on in-domain tasks, we find that they cannot accurately estimate performance in complicated situations such as task- or domain-transfer scenarios. In this work, we propose a revised measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with complexity -- the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performance, producing a 0.85 Spearman's correlation, versus 0.43 for familiarity-only metrics. We further apply FamiCom to automatic prompt and demonstration selection, and outperform existing methods and baselines by more than 7.0% in accuracy.
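
Familiarity is typically instantiated as prompt perplexity; a minimal sketch using Hugging Face transformers is below. The model choice is arbitrary, and FamiCom's additional complexity term is not reproduced here.

```python
# Measuring a model's "familiarity" with a prompt via perplexity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # arbitrary small model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

prompt = "Classify the sentiment of the review: ..."
print(f"familiarity (perplexity): {perplexity(prompt):.1f}")
```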

Updated: 2024-06-17 06:14:55

标题: FamiCom:进一步揭示语言模型中的提示与任务无关性能估计

摘要: 语言模型显示出在上下文学习能力方面令人印象深刻,这使它们能够从输入提示中受益,并在下游任务中表现更好。现有研究调查了这一观察背后的机制,并提出了能够更好地估计最终任务性能的无标签提示度量。一种流行的方法是使用困惑度来衡量模型对提示的熟悉程度。虽然在领域内任务上表现出一致的改进,但我们发现,像困惑度这样的熟悉度度量无法准确估计在复杂情况下的性能,如任务或领域转移场景。在这项工作中,我们提出了一种修订后的度量,称为FamiCom,为任务无关性能估计提供了更全面的度量。具体而言,FamiCom结合了熟悉度和\textit{复杂性} -- 最终任务的固有难度,这是当前度量中缺少的重要因素。实验表明,FamiCom与最终任务性能强烈相关,产生了0.85的Spearman相关系数,而仅有熟悉度的相关系数为0.43。我们进一步将FamiCom应用于自动提示和演示选择,并在准确率上超过现有方法和基线超过7.0%。

更新时间: 2024-06-17 06:14:55

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11243v1

Blockchain for Academic Integrity: Developing the Blockchain Academic Credential Interoperability Protocol (BACIP)

This research introduces the Blockchain Academic Credential Interoperability Protocol (BACIP), designed to significantly enhance the security, privacy, and interoperability of verifying academic credentials globally, addressing the widespread issue of academic fraud. BACIP integrates dual blockchain architecture, smart contracts, and zero-knowledge proofs to offer a scalable and transparent framework aimed at reducing fraud and improving the mobility and opportunities for students and professionals worldwide. The research methodology adopts a mixed-methods approach, involving a rigorous review of pertinent literature and systematic integration of advanced technological components. This includes both qualitative and quantitative analyses that underpin the development of a universally compatible system. Preliminary evaluations suggest that BACIP could enhance verification efficiency and bolster security against tampering and unauthorized access. While the theoretical framework and practical implementations have laid a solid foundation, the protocol's real-world efficacy awaits empirical validation in a production environment. Future research will focus on deploying a prototype, establishing robust validation policies, and defining precise testing parameters. This critical phase is indispensable for a thorough assessment of BACIP's operational robustness and its compliance with international educational standards. This work contributes significantly to the academic field by proposing a robust model for managing and safeguarding academic credentials, thus laying a strong foundation for further innovation in credential verification using blockchain technology.

Updated: 2024-06-17 06:11:51

标题: 学术诚信的区块链:开发区块链学术凭证互操作性协议(BACIP)

摘要: 这项研究介绍了区块链学术证书互操作性协议(BACIP),旨在显著提升全球学术证书验证的安全性、隐私性和互操作性,解决学术欺诈的普遍问题。BACIP集成了双区块链架构、智能合约和零知识证明,旨在提供一个可扩展和透明的框架,旨在减少欺诈行为,提高全球学生和专业人士的流动性和机会。研究方法采用混合方法,涉及相关文献的严格审查和先进技术组件的系统集成。这包括定性和定量分析,支持通用兼容系统的开发。初步评估表明,BACIP可以提高验证效率,加强防篡改和未经授权访问的安全性。尽管理论框架和实际实施已经奠定了坚实的基础,但该协议在实际生产环境中的实证效力尚待经验验证。未来研究将集中于部署原型、建立健全的验证政策和定义精确的测试参数。这一关键阶段对于全面评估BACIP的操作稳健性和遵守国际教育标准至关重要。这项工作通过提出一个管理和保护学术证书的强大模型,在学术领域做出了重要贡献,为进一步利用区块链技术进行证书验证的创新奠定了坚实基础。

更新时间: 2024-06-17 06:11:51

领域: cs.CR,cs.CY

下载: http://arxiv.org/abs/2406.15482v1

The Benefits of Power Regularization in Cooperative Reinforcement Learning

Cooperative Multi-Agent Reinforcement Learning (MARL) algorithms, trained only to optimize task reward, can lead to a concentration of power where the failure or adversarial intent of a single agent could decimate the reward of every agent in the system. In the context of teams of people, it is often useful to explicitly consider how power is distributed to ensure no person becomes a single point of failure. Here, we argue that explicitly regularizing the concentration of power in cooperative RL systems can result in systems which are more robust to single agent failure, adversarial attacks, and incentive changes of co-players. To this end, we define a practical pairwise measure of power that captures the ability of any co-player to influence the ego agent's reward, and then propose a power-regularized objective which balances task reward and power concentration. Given this new objective, we show that there always exists an equilibrium where every agent is playing a power-regularized best-response balancing power and task reward. Moreover, we present two algorithms for training agents towards this power-regularized objective: Sample Based Power Regularization (SBPR), which injects adversarial data during training; and Power Regularization via Intrinsic Motivation (PRIM), which adds an intrinsic motivation to regulate power to the training objective. Our experiments demonstrate that both algorithms successfully balance task reward and power, leading to lower power behavior than the baseline of task-only reward and avoid catastrophic events in case an agent in the system goes off-policy.
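
A hedged sketch of the objective follows. The pairwise power measure below (the reward swing a co-player can cause by unilaterally deviating) is one plausible instantiation rather than the paper's exact definition, and `lam` is a tunable trade-off coefficient.

```python
# Sketch: power of a co-player over the ego agent, and a power-regularized
# objective balancing task reward against power concentration.
def power_of(co_player, ego_reward_fn, joint_action, action_space):
    """Max change in ego reward a single co-player can cause by deviating."""
    base = ego_reward_fn(joint_action)
    swings = []
    for a in action_space:
        deviated = dict(joint_action, **{co_player: a})
        swings.append(abs(ego_reward_fn(deviated) - base))
    return max(swings)

def regularized_objective(task_reward, powers, lam=0.1):
    # Penalize the largest power any co-player holds over the ego agent.
    return task_reward - lam * max(powers)

# Tiny usage example: the ally can swing the ego reward by 2.0 units.
reward = lambda ja: 1.0 if ja["ally"] == "help" else -1.0
p = power_of("ally", reward, {"ego": "work", "ally": "help"}, ["help", "defect"])
print(p, regularized_objective(1.0, [p]))
```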

Updated: 2024-06-17 06:10:37

标题: 合作强化学习中功率规范化的益处

摘要: 多智能体协作强化学习(MARL)算法仅被训练用于优化任务奖励,可能导致权力集中,其中单个代理的失败或敌意意图可能摧毁系统中每个代理的奖励。在团队人员的背景下,明确考虑分配权力的方式通常是有用的,以确保没有任何人成为单点故障。在这里,我们认为在合作RL系统中明确规范权力集中可以使系统更加抗干扰,对单个代理的失败、敌对攻击和合作伙伴激励变化具有更强的韧性。为此,我们定义了一种实用的配对权力度量,捕捉了任何合作伙伴影响自我代理奖励的能力,然后提出了一个平衡任务奖励和权力集中的权力规范化目标。在给定这个新目标的情况下,我们展示了存在一个平衡点,每个代理都在玩一个平衡权力和任务奖励的最佳响应。此外,我们提出了两种算法来训练代理朝着这个权力规范化目标:基于样本的权力规范化(SBPR),在训练过程中注入对抗数据;以及通过内在动机的权力规范化(PRIM),将内在动机添加到训练目标中以调节权力。我们的实验表明,这两种算法成功地平衡了任务奖励和权力,导致比仅基于任务奖励的基线更低的权力行为,并避免了系统中的代理偏离政策时的灾难事件。

更新时间: 2024-06-17 06:10:37

领域: cs.LG,cs.AI,cs.GT,cs.MA

下载: http://arxiv.org/abs/2406.11240v1

Evading AI-Generated Content Detectors using Homoglyphs

The generation of text that is increasingly human-like has been enabled by the advent of large language models (LLMs). As the detection of AI-generated content holds significant importance in the fight against issues such as misinformation and academic cheating, numerous studies have been conducted to develop reliable LLM detectors. While promising results have been demonstrated by such detectors on test data, recent research has revealed that they can be circumvented by employing different techniques. In this article, homoglyph-based ($a \rightarrow {\alpha}$) attacks that can be used to circumvent existing LLM detectors are presented. The efficacy of the attacks is illustrated by analyzing how homoglyphs shift the tokenization of the text, and thus its token log-likelihoods. A comprehensive evaluation is conducted to assess the effectiveness of homoglyphs against state-of-the-art LLM detectors, including Binoculars, DetectGPT, OpenAI's detector, and watermarking techniques, on five different datasets. The proposed approach demonstrates a significant reduction in the efficiency of all the studied configurations of detectors and datasets, down to an accuracy of 0.5 (random guessing). The results show that homoglyph-based attacks can effectively evade existing LLM detectors, and the implications of these findings are discussed along with possible defenses against such attacks.
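
A minimal sketch of such an attack, using a small illustrative subset of homoglyph pairs (the paper evaluates richer mappings):

```python
# Replace Latin characters with visually identical Unicode homoglyphs.
import random

HOMOGLYPHS = {
    "a": "\u03b1",  # Latin 'a' -> Greek alpha
    "o": "\u043e",  # Latin 'o' -> Cyrillic 'o'
    "e": "\u0435",  # Latin 'e' -> Cyrillic 'e'
}

def homoglyph_attack(text: str, rate: float = 1.0) -> str:
    return "".join(
        HOMOGLYPHS.get(c, c) if random.random() < rate else c for c in text
    )

s = "a generated sentence to evade detection"
print(homoglyph_attack(s))
# The output looks the same to a reader but tokenizes very differently,
# shifting the token log-likelihoods that perplexity-based detectors rely on.
```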

Updated: 2024-06-17 06:07:32

标题: 使用同形文字规避AI生成内容检测器

摘要: 由大语言模型(LLMs)的出现使得文本生成越来越接近人类的水平。在AI生成内容的检测在打击问题如虚假信息和学术作弊方面具有重要意义,因此已经进行了许多研究来开发可靠的LLM检测器。虽然这些检测器在测试数据上展示了令人鼓舞的结果,但最近的研究表明它们可以通过采用不同的技术来规避。本文介绍了基于同形异体($a \rightarrow {\alpha}$)攻击,这些攻击可以用来规避现有的LLM检测器。攻击的有效性通过分析同形异体如何改变文本的标记化以及其标记对数似然性来加以说明。对五个不同数据集上的最新LLM检测器的有效性进行了全面评估,包括双筒望远镜、DetectGPT、OpenAI的检测器和水印技术。提出的方法展示了所有研究配置和数据集的检测器的效率显著降低,甚至降至0.5的准确率(随机猜测)。结果表明,基于同形异体的攻击可以有效地规避现有的LLM检测器,并讨论了这些发现的影响以及可能的防御措施。

更新时间: 2024-06-17 06:07:32

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11239v1

QTIP: Quantization with Trellises and Incoherence Processing

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches have converged on using vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.

Updated: 2024-06-17 06:03:13

标题: QTIP:使用Trellises和不连续处理进行量化

摘要: The Post-training quantization (PTQ) method reduces the memory usage of Large Language Models (LLMs) by converting weights into low-precision data types. Since LLM inference is typically limited by memory, PTQ techniques can enhance the speed of inference. Recent advanced PTQ methods have adopted vector quantization (VQ) to quantize multiple weights simultaneously, leading to improved information utilization through better shaping. However, VQ requires a codebook with an exponentially increasing size based on the dimension, restricting current VQ-based PTQ approaches to low VQ dimensions (≤ 8) and consequently limiting the quality of quantization. In this study, we introduce QTIP, which utilizes trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ employs a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a range of trellis codes, from lookup-only to computed lookup-free, designed for a hardware-efficient "bitshift" trellis structure. These codes deliver top-notch results in both quantization quality and inference speed.

更新时间: 2024-06-17 06:03:13

领域: cs.LG

下载: http://arxiv.org/abs/2406.11235v1

MiniConGTS: A Near Ultimate Minimalist Contrastive Grid Tagging Scheme for Aspect Sentiment Triplet Extraction

Aspect Sentiment Triplet Extraction (ASTE) aims to co-extract the sentiment triplets in a given corpus. Existing approaches within the pretraining-finetuning paradigm tend to either meticulously craft complex tagging schemes and classification heads, or incorporate external semantic augmentation to enhance performance. In this study, we, for the first time, re-evaluate the redundancy in tagging schemes and the internal enhancement in pretrained representations. We propose a method to improve and utilize pretrained representations by integrating a minimalist tagging scheme and a novel token-level contrastive learning strategy. The proposed approach demonstrates comparable or superior performance compared to state-of-the-art techniques while featuring a more compact design and reduced computational overhead. Additionally, we are the first to formally evaluate GPT-4's performance in few-shot learning and Chain-of-Thought scenarios for this task. The results demonstrate that the pretraining-finetuning paradigm remains highly effective even in the era of large language models.

Updated: 2024-06-17 06:01:11

标题: MiniConGTS:一种接近终极极简的对比网格标记方案,用于方面情感三元组提取

摘要: Aspect Sentiment Triplet Extraction (ASTE)旨在共同提取给定语料库中的情感三元组。现有的预训练微调范式中的方法往往要么精心设计复杂的标记方案和分类头,要么整合外部语义增强以提高性能。在本研究中,我们首次重新评估了标记方案中的冗余性和预训练表示中的内部增强。我们提出了一种方法,通过整合极简标记方案和新颖的令牌级对比学习策略来改进和利用预训练表示。所提出的方法展示了与最先进技术相比具有可比或更优性能,同时具有更紧凑的设计和降低的计算开销。此外,我们是首次正式评估GPT-4在少样本学习和思维链场景中的表现。结果表明,在大型语言模型时代,预训练微调范式仍然非常有效。

更新时间: 2024-06-17 06:01:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11234v1

How Do Humans Write Code? Large Models Do It the Same Way Too

Program-of-Thought (PoT) has replaced natural-language-based Chain-of-Thought (CoT) as the most popular method for mathematical reasoning tasks with Large Language Models (LLMs), by utilizing external tool calls to circumvent computational errors. However, our evaluation of the GPT-4 and Llama series reveals that using PoT introduces more reasoning errors, such as incorrect formulas or flawed logic, compared to CoT. To address this issue, we propose Human-Think Language (HTL), which leverages a suite of strategies that help integrate PoT and CoT, encompassing: (1) a new generation paradigm that uses full CoT reasoning to control code generation; (2) Focus Attention, which directs model attention to the CoT reasoning during PoT to generate more logical code; (3) reinforcement learning that utilizes the accuracy of both CoT and PoT responses as rewards to prevent repetitive reasoning steps in LLMs when solving difficult math problems. Our method achieves an average improvement of 6.5% on the Llama-Base model and 4.3% on the Mistral-Base model across 8 mathematical calculation datasets. It also shows significant effectiveness on five out-of-domain datasets by controlling the model's information flow, exhibiting strong transferability. Additionally, HTL shows the most significant improvement on non-mathematical natural language inference tasks, contributing to a unified reasoning task framework.

Updated: 2024-06-17 06:00:41

标题: 人类如何编写代码?大型模型也是这样做的。

摘要: Program-of-Thought(PoT)取代了自然语言为基础的思维链(CoT),成为大型语言模型(LLMs)数学推理任务中最流行的方法,通过利用外部工具调用来避免计算错误。然而,我们对GPT-4和Llama系列的评估表明,与CoT相比,使用PoT会引入更多推理错误,例如不正确的公式或有缺陷的逻辑。为了解决这个问题,我们提出了Human-Think Language(HTL),利用一系列策略帮助整合PoT和CoT,包括:(1)使用完整的CoT推理来控制代码生成的新一代范式。 (2)焦点注意力,将模型的注意力引导到PoT期间的CoT推理,生成更加逻辑的代码。 (3)强化学习,利用CoT和PoT响应的准确性作为奖励,防止LLMs在解决困难数学问题时进行重复推理步骤。我们的方法在8个数学计算数据集上使Llama-Base模型的平均改进达到6.5%,Mistral-Base模型的改进达到4.3%。它还通过控制模型的信息流,在五个领域外数据集上展示了显著的有效性,表现出强大的可迁移性。此外,HTL在非数学自然语言推理任务中显示出最显著的改进,为统一的推理任务框架作出贡献。

更新时间: 2024-06-17 06:00:41

领域: cs.AI,cs.CL,cs.PL

下载: http://arxiv.org/abs/2402.15729v2

Probing the Decision Boundaries of In-context Learning in Large Language Models

In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors. In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. Decision boundaries are straightforward to visualize and provide important information about the qualitative behavior of the inductive biases of standard classifiers. To our surprise, we find that the decision boundaries learned by current LLMs in simple binary classification tasks are often irregular and non-smooth, regardless of linear separability in the underlying task. This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability. We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner. Our findings provide a deeper understanding of in-context learning dynamics and offer practical improvements for enhancing robustness and generalizability of in-context learning.
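
The probing setup can be sketched as follows: show the model a few labeled 2D points in-context, then query a grid of test points and plot the predicted labels. `query_llm` is a placeholder for any chat-completion call, and the prompt format is an assumption.

```python
# Skeleton for probing an LLM's in-context decision boundary on a 2D task.
import numpy as np

def build_prompt(examples, x):
    lines = [f"Input: ({a:.2f}, {b:.2f}) Label: {y}" for (a, b), y in examples]
    lines.append(f"Input: ({x[0]:.2f}, {x[1]:.2f}) Label:")
    return "\n".join(lines)

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

examples = [((0.2, 0.3), 0), ((0.8, 0.7), 1), ((0.1, 0.9), 0), ((0.9, 0.2), 1)]
grid = [(x, y) for x in np.linspace(0, 1, 20) for y in np.linspace(0, 1, 20)]
# labels = [query_llm(build_prompt(examples, p)).strip() for p in grid]
# Plotting `labels` over `grid` visualizes the (often non-smooth) boundary.
```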

Updated: 2024-06-17 06:00:24

标题: 探究大型语言模型中上下文学习的决策边界

摘要: 在大型语言模型(LLMs)中,上下文学习是一个关键范例,它使它们能够通过简单提示这些模型几个示例而无需显式参数更新就能推广到新的任务和领域。许多尝试已经被做出来了,以理解LLMs中上下文学习作为模型规模、预训练数据和其他因素的功能。在这项工作中,我们提出了一种新的机制,从决策边界的视角探究和理解上下文二元分类的学习。决策边界易于可视化,并提供了关于标准分类器归纳偏差的定性行为的重要信息。令我们惊讶的是,我们发现当前LLMs在简单的二元分类任务中学习到的决策边界通常是不规则和非平滑的,无论基础任务中的线性可分性如何。本文研究了影响这些决策边界的因素,并探索了增强它们泛化能力的方法。我们评估了各种方法,包括LLMs的无训练和微调方法,模型架构的影响,以及有效的主动提示技术对以数据高效方式平滑决策边界的影响。我们的研究结果提供了对上下文学习动态的深入理解,并提供了增强上下文学习的鲁棒性和泛化能力的实用改进。

更新时间: 2024-06-17 06:00:24

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.11233v1

A Collaborative Data Analytics System with Recommender for Diverse Users

This paper presents the SLEGO (Software-Lego) system, a collaborative analytics platform that bridges the gap between experienced developers and novice users using a cloud-based platform with modular, reusable microservices. These microservices enable developers to share their analytical tools and workflows, while a simple graphical user interface (GUI) allows novice users to build comprehensive analytics pipelines without programming skills. Supported by a knowledge base and a Large Language Model (LLM) powered recommendation system, SLEGO enhances the selection and integration of microservices, increasing the efficiency of analytics pipeline construction. Case studies in finance and machine learning illustrate how SLEGO promotes the sharing and assembly of modular microservices, significantly improving resource reusability and team collaboration. The results highlight SLEGO's role in democratizing data analytics by integrating modular design, knowledge bases, and recommendation systems, fostering a more inclusive and efficient analytical environment.

Updated: 2024-06-17 05:59:13

标题: 一个带有推荐系统的多样化用户协作数据分析系统

摘要: 本文介绍了SLEGO(Software-Lego)系统,这是一个协作分析平台,通过基于云的平台和可重复使用的模块化微服务,弥合了经验丰富的开发人员和新手用户之间的差距。这些微服务使开发人员能够分享他们的分析工具和工作流程,而简单的图形用户界面(GUI)允许新手用户在没有编程技能的情况下构建全面的分析管道。通过知识库和大型语言模型(LLM)驱动的推荐系统的支持,SLEGO提升了微服务的选择和集成,增加了分析管道构建的效率。在金融和机器学习领域的案例研究中,展示了SLEGO如何促进模块化微服务的共享和组装,显著提高了资源的可重复利用性和团队协作。结果突显了SLEGO通过整合模块化设计、知识库和推荐系统,在推动更具包容性和高效率的分析环境方面发挥的作用。

更新时间: 2024-06-17 05:59:13

领域: cs.SE,cs.AI,D.2.11; I.2.1

下载: http://arxiv.org/abs/2406.11232v1

Thought Propagation: An Analogical Approach to Complex Reasoning with Large Language Models

Large Language Models (LLMs) have achieved remarkable success in reasoning tasks with the development of prompting methods. However, existing prompting approaches cannot reuse insights from solving similar problems and suffer from accumulated errors in multi-step reasoning, since they prompt LLMs to reason from scratch. To address these issues, we propose Thought Propagation (TP), which explores analogous problems and leverages their solutions to enhance the complex reasoning ability of LLMs. These analogous problems are related to the input one, with reusable solutions and problem-solving strategies. Thus, it is promising to propagate insights from solving previous analogous problems to inspire new problem-solving. To achieve this, TP first prompts LLMs to propose and solve a set of analogous problems that are related to the input one. Then, TP reuses the results of the analogous problems to directly yield a new solution or to derive a knowledge-intensive plan for execution that amends the initial solution obtained from scratch. TP is compatible with existing prompting approaches, allowing plug-and-play generalization and enhancement over a wide range of tasks without much labor in task-specific prompt engineering. Experiments across three challenging tasks demonstrate that TP enjoys a substantial improvement over the baselines: an average 12% absolute increase in finding optimal solutions in Shortest-path Reasoning, a 13% improvement in human preference in Creative Writing, and a 15% enhancement in the task completion rate of LLM-Agent Planning.
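
A hedged sketch of the TP pipeline with a placeholder `llm` callable (the prompt wording below is an assumption; the paper's actual prompts and aggregation are more elaborate):

```python
# Propose analogous problems, solve them, then reuse their solutions.
def thought_propagation(problem: str, llm, n: int = 3) -> str:
    analogous = [llm(f"Propose a problem analogous to: {problem}") for _ in range(n)]
    solutions = [llm(f"Solve: {p}") for p in analogous]
    hints = "\n".join(f"- {p}\n  solution: {s}" for p, s in zip(analogous, solutions))
    return llm(
        f"Here are solved problems analogous to the target:\n{hints}\n"
        f"Using their insights, solve: {problem}"
    )
```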

Updated: 2024-06-17 05:56:00

标题: 思维传播:一种类比方法用于大型语言模型的复杂推理

摘要: 大型语言模型(LLMs)在推理任务中取得了显著的成功,随着提示方法的发展。然而,现有的提示方法不能重复使用解决类似问题的见解,并且在多步推理中受到累积错误的困扰,因为它们提示LLMs从头开始进行推理。为了解决这些问题,我们提出了“思维传播”(TP)方法,该方法探索类似问题并利用它们的解决方案来增强LLMs的复杂推理能力。这些类似问题与输入问题相关,具有可重复使用的解决方案和问题解决策略。因此,将解决以前类似问题的见解传播到启发新的问题解决是有希望的。为了实现这一目标,TP首先提示LLMs提出并解决一组与输入问题相关的类似问题。然后,TP重复使用类似问题的结果,直接产生新的解决方案或推导出一个知识密集型的执行计划,以修正从头开始获得的初始解决方案。TP与现有的提示方法兼容,允许在各种任务中进行即插即用的泛化和增强,而无需在特定任务的提示工程中花费大量精力。跨三个具有挑战性的任务的实验表明,TP在最短路径推理中找到最佳解决方案的平均绝对增加率为12%,创意写作中人类偏好的提高率为13%,LLM-Agent规划中任务完成率的提高率为15%。

更新时间: 2024-06-17 05:56:00

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2310.03965v3

Enabling robots to follow abstract instructions and complete complex dynamic tasks

Completing complex tasks in unpredictable settings like home kitchens challenges robotic systems. These challenges include interpreting high-level human commands, such as "make me a hot beverage" and performing actions like pouring a precise amount of water into a moving mug. To address these challenges, we present a novel framework that combines Large Language Models (LLMs), a curated Knowledge Base, and Integrated Force and Visual Feedback (IFVF). Our approach interprets abstract instructions, performs long-horizon tasks, and handles various uncertainties. It utilises GPT-4 to analyse the user's query and surroundings, then generates code that accesses a curated database of functions during execution. It translates abstract instructions into actionable steps. Each step involves generating custom code by employing retrieval-augmented generalisation to pull IFVF-relevant examples from the Knowledge Base. IFVF allows the robot to respond to noise and disturbances during execution. We use coffee making and plate decoration to demonstrate our approach, including components ranging from pouring to drawer opening, each benefiting from distinct feedback types and methods. This novel advancement marks significant progress toward a scalable, efficient robotic framework for completing complex tasks in uncertain environments. Our findings are illustrated in an accompanying video and supported by an open-source GitHub repository (released upon paper acceptance).

Updated: 2024-06-17 05:55:35

标题: 使机器人能够遵循抽象指令并完成复杂动态任务

摘要: 在像家庭厨房这样不可预测的环境中完成复杂任务对机器人系统构成挑战。这些挑战包括解释高级人类命令,如“给我冲一杯热饮料”,并执行诸如向移动的杯子中倒入精确量水等动作。为了解决这些挑战,我们提出了一个新颖的框架,结合了大型语言模型(LLMs)、策划知识库和集成力和视觉反馈(IFVF)。我们的方法解释抽象指令,执行长期任务,并处理各种不确定性。它利用GPT-4分析用户的查询和周围环境,然后生成在执行过程中访问策划数据库函数的代码。它将抽象指令翻译成可操作步骤。每个步骤都涉及通过利用检索增强泛化从知识库中提取与IFVF相关的示例来生成定制代码。IFVF使机器人能够在执行过程中应对噪音和干扰。我们利用冲咖啡和装饰盘子来展示我们的方法,包括从倒水到打开抽屉等各种组件,每个组件都受益于不同类型和方法的反馈。这一新颖进展标志着在不确定环境中完成复杂任务的可扩展、高效机器人框架取得了重要进展。我们的发现在附带视频中得到了说明,并得到了开源GitHub存储库的支持(在论文被接受后发布)。

更新时间: 2024-06-17 05:55:35

领域: cs.RO,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11231v1

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
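
The haystack construction can be sketched as stitching a grid of images into one canvas while recording which cell holds the needle; the cell size, grid layout, and file handling below are placeholder assumptions.

```python
# Stitch a grid of images into one haystack canvas (illustrative sketch).
from PIL import Image

def stitch(images, rows, cols, cell=(256, 256)):
    canvas = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i, img in enumerate(images):
        r, c = divmod(i, cols)
        canvas.paste(img.resize(cell), (c * cell[0], r * cell[1]))
    return canvas

# images = [Image.open(p) for p in paths]   # placeholder file list
# haystack = stitch(images, rows=4, cols=4)
# Sub-image retrieval label for needle at flat index 7 in a 4x4 grid:
# (row, col) = divmod(7, 4)
```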

Updated: 2024-06-17 05:54:06

Domains: cs.LG,cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2406.11230v1

Compound Schema Registry

Schema evolution is critical in managing database systems to ensure compatibility across different data versions. A schema registry typically addresses the challenges of schema evolution in real-time data streaming by managing, validating, and ensuring schema compatibility. However, current schema registries struggle with complex syntactic alterations like field renaming or type changes, which often require significant manual intervention and can disrupt service. To enhance the flexibility of schema evolution, we propose the use of generalized schema evolution (GSE) facilitated by a compound AI system. This system employs Large Language Models (LLMs) to interpret the semantics of schema changes, supporting a broader range of syntactic modifications without interrupting data streams. Our approach includes developing a task-specific language, Schema Transformation Language (STL), to generate schema mappings as an intermediate representation (IR), simplifying the integration of schema changes across different data processing platforms. Initial results indicate that this approach can improve schema mapping accuracy and efficiency, demonstrating the potential of GSE in practical applications.
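
The flavor of a declarative mapping used as an intermediate representation can be shown in a few lines. The operation names below are invented for illustration; the paper's STL is a richer task-specific language:

    # a toy schema-mapping IR: a list of operations applied to records,
    # independent of any particular data processing platform
    MAPPING = [
        {"op": "rename", "from": "user_name", "to": "username"},
        {"op": "cast", "field": "age", "to": int},
    ]

    def apply_mapping(record: dict, mapping: list) -> dict:
        out = dict(record)
        for step in mapping:
            if step["op"] == "rename" and step["from"] in out:
                out[step["to"]] = out.pop(step["from"])
            elif step["op"] == "cast" and step["field"] in out:
                out[step["field"]] = step["to"](out[step["field"]])
        return out

    print(apply_mapping({"user_name": "ada", "age": "36"}, MAPPING))
    # {'username': 'ada', 'age': 36}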

Updated: 2024-06-17 05:50:46

Domains: cs.DB,cs.AI

Download: http://arxiv.org/abs/2406.11227v1

Minimax Optimal Q Learning with Nearest Neighbors

Analyzing the Markov decision process (MDP) with continuous state spaces is generally challenging. A recent interesting work \cite{shah2018q} solves MDP with bounded continuous state space by a nearest neighbor $Q$ learning approach, which has a sample complexity of $\tilde{O}(\frac{1}{\epsilon^{d+3}(1-\gamma)^{d+7}})$ for $\epsilon$-accurate $Q$ function estimation with discount factor $\gamma$. In this paper, we propose two new nearest neighbor $Q$ learning methods, one for the offline setting and the other for the online setting. We show that the sample complexities of these two methods are $\tilde{O}(\frac{1}{\epsilon^{d+2}(1-\gamma)^{d+2}})$ and $\tilde{O}(\frac{1}{\epsilon^{d+2}(1-\gamma)^{d+3}})$ for the offline and online methods respectively, which significantly improve over existing results and have minimax optimal dependence on $\epsilon$. We achieve such improvement by utilizing the samples more efficiently. In particular, the method in \cite{shah2018q} discards all samples after each iteration, so those samples are effectively wasted. In contrast, our offline method does not remove any samples, and our online method only removes samples with time earlier than $\beta t$ at time $t$, with $\beta$ being a tunable parameter, so our methods significantly reduce the loss of information. Apart from the sample complexity, our methods also have the additional advantages of better computational complexity, as well as suitability to unbounded state spaces.
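
As a toy rendering of the sample-retention idea (not the paper's estimator or its guarantees), the sketch below keeps Q-values on a grid of anchor states and, following the online rule described, drops samples older than beta*t instead of discarding everything each iteration:

    import numpy as np

    rng = np.random.default_rng(0)
    anchors = np.linspace(0.0, 1.0, 11)          # anchor states covering [0, 1]
    Q = np.zeros((len(anchors), 2))              # two actions
    gamma, alpha, beta = 0.9, 0.1, 0.5
    buffer = []                                  # (time, state, action, reward, next_state)

    def nearest(s):
        return int(np.argmin(np.abs(anchors - s)))

    for t in range(1, 500):
        s, a = rng.random(), rng.integers(2)
        r = float(a == (s > 0.5))                # reward for picking the "correct" side
        s_next = rng.random()
        buffer.append((t, s, a, r, s_next))
        buffer = [e for e in buffer if e[0] >= beta * t]   # keep only recent samples
        for (_, si, ai, ri, sn) in buffer:       # replay the retained samples
            i = nearest(si)
            target = ri + gamma * Q[nearest(sn)].max()
            Q[i, ai] += alpha * (target - Q[i, ai])

    print(np.round(Q, 2))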

Updated: 2024-06-17 05:41:41

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2308.01490v2

HEDI: First-Time Clinical Application and Results of a Biomechanical Evaluation and Visualisation Tool for Incisional Hernia Repair

Abdominal wall defects often lead to pain, discomfort, and recurrence of incisional hernias, resulting in significant morbidity and repeated surgical repairs worldwide. Mesh repair for large hernias is usually based on the defect area with a fixed overlap, neglecting biomechanical factors such as muscle activation, intra-abdominal pressure, tissue elasticity, and abdominal wall distension. To address this issue, we present a biomechanical approach to incisional hernia repair that takes into account the unstable abdominal wall. Additionally, we introduce HEDI, a tool that uses computed tomography with Valsalva maneuver to automatically detect and assess hernia size, volume, and abdominal wall instability. Our first clinical application of HEDI in the preoperative evaluation of 31 patients shows significantly improved success rates compared to reported rates, with all patients remaining pain-free and experiencing no hernia recurrence after three years of follow-up.

Updated: 2024-06-17 05:27:54

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2307.01502v2

Building another Spanish dictionary, this time with GPT-4

We present the "Spanish Built Factual Freectianary 2.0" (Spanish-BFF-2) as the second iteration of an AI-generated Spanish dictionary. Previously, we developed the inaugural version of this unique free dictionary employing GPT-3. In this study, we aim to improve the dictionary by using GPT-4-turbo instead. Furthermore, we explore improvements made to the initial version and compare the performance of both models.

Updated: 2024-06-17 05:25:56

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11218v1

WeatherQA: Can Multimodal Language Models Reason about Severe Weather?

Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather benchmarks typically focus only on predicting time-series changes in certain weather parameters (e.g., temperature, moisture) with text-only features. In this work, we introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters (a.k.a., ingredients) and predict severe weather in real-world scenarios. The dataset includes over 8,000 (multi-images, text) pairs for diverse severe weather events. Each pair contains rich information crucial for forecasting -- the images describe the ingredients capturing environmental instability, surface observations, and radar reflectivity, and the text contains forecast analyses written by human experts. With WeatherQA, we evaluate state-of-the-art vision language models, including GPT4, Claude3, Gemini-1.5, and a fine-tuned Llama3-based VLM, by designing two challenging tasks: (1) multi-choice QA for predicting affected area and (2) classification of the development potential of severe convection. These tasks require deep understanding of domain knowledge (e.g., atmospheric dynamics) and complex reasoning over multimodal data (e.g., interactions between weather parameters). We show a substantial gap between the strongest VLM, GPT4o, and human reasoning. Our comprehensive case study with meteorologists further reveals the weaknesses of the models, suggesting that better training and data integration are necessary to bridge this gap. WeatherQA link: https://github.com/chengqianma/WeatherQA.

Updated: 2024-06-17 05:23:18

Domains: cs.AI,cs.CL,cs.CV,physics.ao-ph

Download: http://arxiv.org/abs/2406.11217v1

Probabilistic Reasoning in Generative Large Language Models

This paper considers the challenges Large Language Models (LLMs) face when reasoning over text that includes information involving uncertainty explicitly quantified via probability values. This type of reasoning is relevant to a variety of contexts ranging from everyday conversations to medical decision-making. Despite improvements in the mathematical reasoning capabilities of LLMs, they still exhibit significant difficulties when it comes to probabilistic reasoning. To deal with this problem, we introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. We use BLInD to find out the limitations of LLMs for tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by providing an evaluation of our methods on BLInD and an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs.
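
One of the formal representations mentioned is plain Python code. A hedged example of that mapping: a two-variable Bayesian question answered by straightforward Monte Carlo sampling (the probability values are invented):

    import random

    def sample():
        rain = random.random() < 0.3          # P(rain) = 0.3
        p_wet = 0.9 if rain else 0.1          # P(wet | rain), P(wet | no rain)
        wet = random.random() < p_wet
        return rain, wet

    N = 200_000
    wet_count = rain_and_wet = 0
    for _ in range(N):
        rain, wet = sample()
        wet_count += wet
        rain_and_wet += rain and wet

    # exact answer: 0.27 / 0.34 ~= 0.794
    print("P(rain | wet) ~=", rain_and_wet / wet_count)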

Updated: 2024-06-17 05:13:33

Domains: cs.CL,cs.AI,I.2.7

Download: http://arxiv.org/abs/2402.09614v2

Efficient Topology-aware Data Augmentation for High-Degree Graph Neural Networks

In recent years, graph neural networks (GNNs) have emerged as a potent tool for learning on graph-structured data and achieved fruitful successes in varied fields. The majority of GNNs follow the message-passing paradigm, where representations of each node are learned by recursively aggregating features of its neighbors. However, this mechanism brings severe over-smoothing and efficiency issues over high-degree graphs (HDGs), wherein most nodes have dozens (or even hundreds) of neighbors, such as social networks, transaction graphs, power grids, etc. Additionally, such graphs usually encompass rich and complex structure semantics, which are hard to capture merely by feature aggregations in GNNs. Motivated by the above limitations, we propose TADA, an efficient and effective front-mounted data augmentation framework for GNNs on HDGs. Under the hood, TADA includes two key modules: (i) feature expansion with structure embeddings, and (ii) topology- and attribute-aware graph sparsification. The former obtains augmented node features and enhanced model capacity by encoding the graph structure into high-quality structure embeddings with our highly-efficient sketching method. Further, by exploiting task-relevant features extracted from graph structures and attributes, the second module enables the accurate identification and reduction of numerous redundant/noisy edges from the input graph, thereby alleviating over-smoothing and facilitating faster feature aggregations over HDGs. Empirically, TADA considerably improves the predictive performance of mainstream GNN models on 8 real homophilic/heterophilic HDGs in terms of node classification, while achieving efficient training and inference processes.
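
A loose sketch of the two modules, far simpler than TADA itself: (i) node features are augmented with cheap structure embeddings obtained by randomly projecting adjacency rows (a sketching step), and (ii) a high-degree graph is sparsified by keeping, for each node, only its top-k most feature-similar neighbors:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_struct, k = 8, 4, 2
    A = (rng.random((n, n)) < 0.6).astype(float)
    A = np.triu(A, 1); A = A + A.T                   # random undirected graph
    X = rng.normal(size=(n, 3))                      # raw node features

    R = rng.normal(size=(n, d_struct)) / np.sqrt(d_struct)
    struct_emb = A @ R                               # sketch of each adjacency row
    X_aug = np.hstack([X, struct_emb])               # (i) feature expansion

    def sparsify(A, X, k):
        # (ii) keep each node's k most similar neighbors, drop the rest
        keep = np.zeros_like(A)
        sims = X @ X.T
        for i in range(len(A)):
            nbrs = np.flatnonzero(A[i])
            best = nbrs[np.argsort(-sims[i, nbrs])[:k]]
            keep[i, best] = keep[best, i] = 1
        return keep

    print(int(A.sum()), "->", int(sparsify(A, X_aug, k).sum()), "edges")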

Updated: 2024-06-17 05:08:14

Domains: cs.LG

Download: http://arxiv.org/abs/2406.05482v3

What Operations can be Performed Directly on Compressed Arrays, and with What Error?

In response to the rapidly escalating costs of computing with large matrices and tensors caused by data movement, several lossy compression methods have been developed to significantly reduce data volumes. Unfortunately, all these methods require the data to be decompressed before further computations are done. In this work, we develop a lossy compressor that allows a dozen fairly fundamental operations directly on compressed data while offering good compression ratios and modest errors. We implement a new compressor PyBlaz based on the familiar GPU-powered PyTorch framework, and evaluate it on three non-trivial applications, choosing different number systems for internal representation. Our results demonstrate that the compressed-domain operations achieve good scalability with problem sizes while incurring errors well within acceptable limits. To our best knowledge, this is the first such lossy compressor that supports compressed-domain operations while achieving acceptable performance as well as error.
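
One way to see why some operations can run on compressed data: if compression is a linear map (a block transform followed by coefficient truncation), then addition and scalar multiplication commute with it. A toy demonstration under that assumption; PyBlaz's actual scheme and API will differ:

    import numpy as np

    def dct_matrix(n):
        # orthonormal DCT-II basis, built directly from its definition
        k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        M = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
        M[0] /= np.sqrt(2)
        return M

    n, keep = 64, 16
    D = dct_matrix(n)
    compress = lambda x: (D @ x)[:keep]          # transform, keep low frequencies
    decompress = lambda c: D[:keep].T @ c

    x, y = np.sin(np.linspace(0, 3, n)), np.cos(np.linspace(0, 5, n))
    cx, cy = compress(x), compress(y)

    via_compressed = decompress(cx + cy)          # add without decompressing first
    via_plain = decompress(compress(x + y))
    print("discrepancy between the two routes:", np.max(np.abs(via_compressed - via_plain)))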

Updated: 2024-06-17 05:01:09

Domains: cs.DC,cs.LG

Download: http://arxiv.org/abs/2406.11209v1

Retraining with Predicted Hard Labels Provably Increases Model Accuracy

The performance of a model trained with \textit{noisy labels} is often improved by simply \textit{retraining} the model with its own predicted \textit{hard} labels (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with label differential privacy (DP) which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at \textit{no extra privacy cost}; we call this \textit{consensus-based retraining}. For example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain $6.4\%$ improvement in accuracy with consensus-based retraining.
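
Consensus-based retraining is simple to state in code: train on noisy labels, predict hard labels, then retrain only where prediction and given label agree. A schematic with toy data, not the paper's experiments:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 2000
    X = rng.normal(size=(n, 2))
    y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = rng.random(n) < 0.25                    # 25% label noise
    y_noisy = np.where(flip, 1 - y_true, y_true)

    model = LogisticRegression().fit(X, y_noisy)
    y_hat = model.predict(X)                        # predicted hard labels

    consensus = y_hat == y_noisy                    # keep only agreeing samples
    model2 = LogisticRegression().fit(X[consensus], y_noisy[consensus])

    for name, m in [("initial", model), ("retrained", model2)]:
        print(name, "accuracy on clean labels:", (m.predict(X) == y_true).mean())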

Updated: 2024-06-17 04:53:47

Domains: cs.LG,cs.CR,stat.ML

Download: http://arxiv.org/abs/2406.11206v1

Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

Large Language Models (LLMs) have the unique capability to understand and generate human-like text from input queries. When fine-tuned, these models show enhanced performance on domain-specific queries. OpenAI highlights the process of fine-tuning, stating: "To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples, but the right number varies greatly based on the exact use case." This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines, which aim to improve accuracy and relevance by leveraging external corpus data for information retrieval. However, RAG's promise of delivering optimal responses often falls short in complex query scenarios. This study aims to specifically examine the effects of fine-tuning LLMs on their ability to extract and integrate contextual data to enhance the performance of RAG systems across multiple domains. We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding by comparing the accuracy and completeness of fine-tuned models against baseline performances across datasets from multiple domains. Our findings indicate that fine-tuning resulted in a decline in performance compared to the baseline models, contrary to the improvements observed in standalone LLM applications as suggested by OpenAI. This study highlights the need for vigorous investigation and validation of fine-tuned models for domain-specific tasks.

Updated: 2024-06-17 04:35:17

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11201v1

Positional Encoding Helps Recurrent Neural Networks Handle a Large Vocabulary

This study reports an unintuitive finding that positional encoding enhances learning of recurrent neural networks (RNNs). Positional encoding is a high-dimensional representation of time indices on input data. Most famously, positional encoding complements the capabilities of Transformer neural networks, which lack an inherent mechanism for representing the data order. By contrast, RNNs can encode the temporal information of data points on their own, rendering their use of positional encoding seemingly redundant. Nonetheless, investigations through synthetic benchmarks reveal an advantage of coupling positional encoding and RNNs, especially for handling a large vocabulary that yields low-frequency tokens. Further scrutinization unveils that these low-frequency tokens destabilize the gradients of vanilla RNNs, and the positional encoding resolves this instability. These results shed new light on the utility of positional encoding beyond its canonical role as a timekeeper for Transformers.
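
The setup studied here is easy to reproduce in outline: sinusoidal positional encodings, usually associated with Transformers, are concatenated to the inputs of an RNN. Sizes below are illustrative:

    import math
    import torch
    import torch.nn as nn

    def sinusoidal_pe(seq_len, dim):
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
        pe = torch.zeros(seq_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    class PERNN(nn.Module):
        def __init__(self, vocab, embed=32, pe_dim=16, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(vocab, embed)
            self.pe_dim = pe_dim
            self.rnn = nn.RNN(embed + pe_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, tokens):                 # tokens: (batch, seq)
            x = self.embed(tokens)
            pe = sinusoidal_pe(tokens.shape[1], self.pe_dim).to(x.device)
            x = torch.cat([x, pe.expand(tokens.shape[0], -1, -1)], dim=-1)
            h, _ = self.rnn(x)
            return self.out(h)

    logits = PERNN(vocab=1000)(torch.randint(0, 1000, (4, 20)))
    print(logits.shape)                            # torch.Size([4, 20, 1000])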

Updated: 2024-06-17 04:34:10

Domains: cs.LG,cs.NE

Download: http://arxiv.org/abs/2402.00236v2

Soft Prompt Tuning for Augmenting Dense Retrieval with Large Language Models

Dense retrieval (DR) converts queries and documents into dense embeddings and measures the similarity between queries and documents in vector space. One of the challenges in DR is the lack of domain-specific training data. While DR models can learn from large-scale public datasets like MS MARCO through transfer learning, evidence shows that not all DR models and domains can benefit from transfer learning equally. Recently, some researchers have resorted to large language models (LLMs) to improve the zero-shot and few-shot DR models. However, the hard prompts or human-written prompts utilized in these works cannot guarantee the good quality of generated weak queries. To tackle this, we propose soft prompt tuning for augmenting DR (SPTAR): For each task, we leverage soft prompt-tuning to optimize a task-specific soft prompt on limited ground truth data and then prompt the LLMs to tag unlabeled documents with weak queries, yielding enough weak document-query pairs to train task-specific dense retrievers. We design a filter to select high-quality example document-query pairs in the prompt to further improve the quality of weak tagged queries. To the best of our knowledge, there is no prior work utilizing soft prompt tuning to augment DR models. The experiments demonstrate that SPTAR outperforms the unsupervised baselines BM25 and the recently proposed LLMs-based augmentation method for DR.
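
The generic mechanism behind soft prompt tuning (not SPTAR's full pipeline with weak-query generation and filtering) looks like this: a small set of trainable prompt vectors is prepended to frozen token embeddings, and only those vectors receive gradients:

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        def __init__(self, base_embeddings: nn.Embedding, prompt_len: int = 8):
            super().__init__()
            self.base = base_embeddings
            self.base.weight.requires_grad_(False)        # freeze the model side
            dim = base_embeddings.embedding_dim
            self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

        def forward(self, token_ids):                      # (batch, seq)
            tok = self.base(token_ids)
            prompt = self.prompt.unsqueeze(0).expand(token_ids.shape[0], -1, -1)
            return torch.cat([prompt, tok], dim=1)         # (batch, prompt+seq, dim)

    emb = nn.Embedding(32000, 512)
    sp = SoftPrompt(emb, prompt_len=8)
    out = sp(torch.randint(0, 32000, (2, 10)))
    trainable = sum(p.numel() for p in sp.parameters() if p.requires_grad)
    print(out.shape, "trainable parameters:", trainable)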

Updated: 2024-06-17 04:30:58

Domains: cs.IR,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2307.08303v5

AvaTaR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval

Large language model (LLM) agents have demonstrated impressive capability in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. However, developing the prompting techniques that make LLM agents able to effectively use external tools and knowledge is a heuristic and laborious task. Here, we introduce AvaTaR, a novel and automatic framework that optimizes an LLM agent to effectively use the provided tools and improve its performance on a given task/domain. During optimization, we design a comparator module to iteratively provide insightful and holistic prompts to the LLM agent via reasoning between positive and negative examples sampled from training data. We demonstrate AvaTaR on four complex multimodal retrieval datasets featuring textual, visual, and relational information. We find AvaTaR consistently outperforms state-of-the-art approaches across all four challenging tasks and exhibits strong generalization ability when applied to novel cases, achieving an average relative improvement of 14% on the Hit@1 metric. Code and dataset are available at https://github.com/zou-group/avatar.

Updated: 2024-06-17 04:20:02

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2406.11200v1

Mixed Strategy Nash Equilibrium for Crowd Navigation

Robots navigating in crowded areas should negotiate free space with humans rather than fully controlling collision avoidance, as this can lead to freezing behavior. Game theory provides a framework for the robot to reason about potential cooperation from humans for collision avoidance during path planning. In particular, the mixed strategy Nash equilibrium captures the negotiation behavior under uncertainty, making it well suited for crowd navigation. However, computing the mixed strategy Nash equilibrium is often prohibitively expensive for real-time decision-making. In this paper, we propose an iterative Bayesian update scheme over probability distributions of trajectories. The algorithm simultaneously generates a stochastic plan for the robot and probabilistic predictions of other pedestrians' paths. We prove that the proposed algorithm is equivalent to solving a mixed strategy game for crowd navigation, and the algorithm guarantees the recovery of the global Nash equilibrium of the game. We name our algorithm Bayes' Rule Nash Equilibrium (BRNE) and develop a real-time model prediction crowd navigation framework. Since BRNE is not solving a general-purpose mixed strategy Nash equilibrium but a tailored formula specifically for crowd navigation, it can compute the solution in real-time on a low-power embedded computer. We evaluate BRNE in both simulated environments and real-world pedestrian datasets. BRNE consistently outperforms non-learning and learning-based methods regarding safety and navigation efficiency. It also reaches human-level crowd navigation performance in the pedestrian dataset benchmark. Lastly, we demonstrate the practicality of our algorithm with real humans on an untethered quadruped robot with fully onboard perception and computation.
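
The flavor of the iterative Bayesian update can be shown in one dimension: each agent holds a weighted set of sampled trajectories, and the weights are alternately reweighted against the other agent's current mixture through a collision cost. Illustrative only; BRNE's exact update rule is in the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    T, K = 10, 50
    robot = rng.normal(0.0, 1.0, (K, T)).cumsum(1)       # candidate trajectories
    human = 5.0 + rng.normal(0.0, 1.0, (K, T)).cumsum(1)

    w_r = np.ones(K) / K
    w_h = np.ones(K) / K

    def collision_cost(a, b, radius=1.0):
        # pairwise proximity penalty between trajectory sets, shape (K, K)
        return np.exp(-(a[:, None, :] - b[None, :, :]) ** 2 / radius ** 2).mean(-1)

    for _ in range(20):                                  # alternate Bayesian updates
        C = collision_cost(robot, human)
        w_r = w_r * np.exp(-(C @ w_h)); w_r /= w_r.sum()
        w_h = w_h * np.exp(-(C.T @ w_r)); w_h /= w_h.sum()

    best = robot[np.argmax(w_r)]
    print("most likely robot plan endpoint:", round(best[-1], 2))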

Updated: 2024-06-17 04:05:29

Domains: cs.RO,cs.GT,cs.LG

Download: http://arxiv.org/abs/2403.01537v4

Graph Knowledge Distillation to Mixture of Experts

In terms of accuracy, Graph Neural Networks (GNNs) are the best architectural choice for the node classification task. Their drawback in real-world deployment is the latency that emerges from the neighbourhood processing operation. One solution to the latency issue is to perform knowledge distillation from a trained GNN to a Multi-Layer Perceptron (MLP), where the MLP processes only the features of the node being classified (and possibly some pre-computed structural information). However, the performance of such MLPs in both transductive and inductive settings remains inconsistent for existing knowledge distillation techniques. We propose to address the performance concerns by using a specially-designed student model instead of an MLP. Our model, named Routing-by-Memory (RbM), is a form of Mixture-of-Experts (MoE), with a design that enforces expert specialization. By encouraging each expert to specialize in a certain region of the hidden representation space, we demonstrate experimentally that it is possible to derive considerably more consistent performance across multiple datasets.
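
A generic mixture-of-experts student in the spirit described: each expert owns a learned memory vector, and inputs are routed toward the expert whose memory is closest, which pushes experts to specialize on regions of the representation space. This is a plain MoE sketch, not the RbM architecture:

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, dim=64, n_experts=4, n_classes=7):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(n_experts, dim))
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))
                for _ in range(n_experts)
            )

        def forward(self, x):                              # x: (batch, dim)
            sims = -torch.cdist(x, self.memory)            # closeness to each memory
            gate = sims.softmax(dim=-1)                    # soft routing weights
            outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, C)
            return (gate.unsqueeze(-1) * outs).sum(dim=1)

    model = TinyMoE()
    print(model(torch.randn(8, 64)).shape)                 # torch.Size([8, 7])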

Updated: 2024-06-17 04:00:41

Domains: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2406.11919v1

Aligning Large Language Models from Self-Reference AI Feedback with one General Principle

In aligning large language models (LLMs), utilizing feedback from existing advanced AI rather than humans is an important method to scale supervisory signals. However, it is highly challenging for AI to understand human intentions and societal values, and provide accurate preference feedback based on these. Current AI feedback methods rely on powerful LLMs, carefully designed specific principles to describe human intentions, and are easily influenced by position bias. To address these issues, we propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback under simple and general principles such as "best for humanity". Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference, and finally determine which answer better fits human preferences according to the criticism. Additionally, we use a self-consistency method to further reduce the impact of position bias, and employ semantic perplexity to calculate the preference strength differences between different answers. Experimental results show that our method enables 13B and 70B Llama2-Chat annotators to provide high-quality preference feedback, and the policy models trained based on these preference data achieve significant advantages in benchmark datasets through reinforcement learning.

Updated: 2024-06-17 03:51:46

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11190v1

Save It All: Enabling Full Parameter Tuning for Federated Large Language Models via Cycle Block Gradient Descent

The advent of large language models (LLMs) has revolutionized the deep learning paradigm, yielding impressive results across a wide array of tasks. However, the pre-training or fine-tuning of LLMs within a federated learning (FL) framework poses substantial challenges, including considerable computational and memory resource demands, as well as communication bottlenecks between servers and clients. Existing solutions either make the unrealistic assumption that the entire model is exchanged for training, or apply parameter-effective fine-tuning methods from centralized learning to train LLMs in FL which tend to underperform during training or fine-tuning stages due to the limited search subspace of parameter updating. In this paper, we introduce a novel method for the efficient training and fine-tuning of LLMs in FL, with minimal resource consumption. Our approach, termed FedCyBGD, utilizes Cycle Block Gradient Descent to periodically update the model. In particular, we design a compression scheme for FedCyBGD, aiming to further decrease the model download cost. It enables full parameter training in FL with only selected block updates and uploads, thereby reducing communication, computation, and memory costs. Our method achieves state-of-the-art performance for FL LLM training, while significantly reducing associated costs. Codes are provided here.
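
The core scheduling idea, stripped of the federated plumbing and the compression scheme: parameters are split into blocks, and each round unfreezes, trains, and uploads one block, cycling through all of them. A schematic, not FedCyBGD:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                          nn.Linear(32, 32), nn.ReLU(),
                          nn.Linear(32, 1))
    blocks = [[model[0]], [model[2]], [model[4]]]    # three parameter blocks

    def train_block(block, steps=5):
        for p in model.parameters():                 # freeze everything...
            p.requires_grad_(False)
        params = [p for m in block for p in m.parameters()]
        for p in params:                             # ...except the active block
            p.requires_grad_(True)
        opt = torch.optim.SGD(params, lr=0.05)
        for _ in range(steps):                       # client-side local steps
            x = torch.randn(32, 16)
            loss = (model(x) - x.sum(dim=1, keepdim=True)).pow(2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    for rnd in range(6):                             # cycle through the blocks
        loss = train_block(blocks[rnd % len(blocks)])
        print(f"round {rnd}: updated block {rnd % len(blocks)}, loss {loss:.3f}")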

Updated: 2024-06-17 03:49:44

Domains: cs.LG

Download: http://arxiv.org/abs/2406.11187v1

Sketch and shift: a robust decoder for compressive clustering

Compressive learning is an emerging approach to drastically reduce the memory footprint of large-scale learning, by first summarizing a large dataset into a low-dimensional sketch vector, and then decoding from this sketch the latent information needed for learning. In light of recent progress on information preservation guarantees for sketches based on random features, a major objective is to design easy-to-tune algorithms (called decoders) to robustly and efficiently extract this information. To address the underlying non-convex optimization problems, various heuristics have been proposed. In the case of compressive clustering, the standard heuristic is CL-OMPR, a variant of sliding Frank-Wolfe. Yet, CL-OMPR is hard to tune, and the examination of its robustness was overlooked. In this work, we undertake a careful examination of CL-OMPR to circumvent its limitations. In particular, we show how this algorithm can fail to recover the clusters even in advantageous scenarios. To gain insight, we show how the deficiencies of this algorithm can be attributed to optimization difficulties related to the structure of a correlation function appearing at core steps of the algorithm. To address these limitations, we propose an alternative decoder offering substantial improvements over CL-OMPR. Its design is notably inspired from the mean shift algorithm, a classic approach to detect the local maxima of kernel density estimators. The proposed algorithm can extract clustering information from a sketch of the MNIST dataset that is 10 times smaller than previously.
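
The proposed decoder takes its cue from mean shift, which climbs a kernel density estimate toward its local maxima. A compact mean-shift iteration on toy 2-D data illustrates that inspiration (the actual decoder operates in the sketch domain):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in ((0, 0), (3, 3))])

    def mean_shift(points, data, bandwidth=0.8, iters=30):
        for _ in range(iters):
            d2 = ((points[:, None, :] - data[None, :, :]) ** 2).sum(-1)
            w = np.exp(-d2 / (2 * bandwidth ** 2))       # Gaussian kernel weights
            points = (w @ data) / w.sum(axis=1, keepdims=True)
        return points

    seeds = data[rng.choice(len(data), 10)].copy()
    modes = mean_shift(seeds, data)
    print(np.unique(np.round(modes, 1), axis=0))         # ~the two cluster centers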

Updated: 2024-06-17 03:44:46

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2312.09940v2

Cracking Factual Knowledge: A Comprehensive Analysis of Degenerate Knowledge Neurons in Large Language Models

Large language models (LLMs) store extensive factual knowledge, but the underlying mechanisms remain unclear. Previous research suggests that factual knowledge is stored within multi-layer perceptron weights, and some storage units exhibit degeneracy, referred to as Degenerate Knowledge Neurons (DKNs). Despite the novelty and unique properties of this concept, it has not been rigorously defined or systematically studied. We first consider the connection weight patterns of MLP neurons and define DKNs from both structural and functional aspects. Based on this, we introduce the Neurological Topology Clustering method, which allows the formation of DKNs in any numbers and structures, leading to a more accurate DKN acquisition. Furthermore, inspired by cognitive science, we explore the relationship between DKNs and the robustness, evolvability, and complexity of LLMs. Our execution of 34 experiments under 6 settings demonstrates the connection between DKNs and these three properties. The code will be available soon.

Updated: 2024-06-17 03:44:10

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2402.13731v2

Learning Iterative Reasoning through Energy Diffusion

We introduce iterative reasoning through energy diffusion (IRED), a novel framework for learning to reason for a variety of tasks by formulating reasoning and decision-making problems with energy-based optimization. IRED learns energy functions to represent the constraints between input conditions and desired outputs. After training, IRED adapts the number of optimization steps during inference based on problem difficulty, enabling it to solve problems outside its training distribution -- such as more complex Sudoku puzzles, matrix completion with large value magnitudes, and pathfinding in larger graphs. Key to our method's success is two novel techniques: learning a sequence of annealed energy landscapes for easier inference and a combination of score function and energy landscape supervision for faster and more stable training. Our experiments show that IRED outperforms existing methods in continuous-space reasoning, discrete-space reasoning, and planning tasks, particularly in more challenging scenarios. Code and visualizations at https://energy-based-model.github.io/ired/
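
A cartoon of the inference loop: the answer is refined by descending a sequence of progressively sharper (annealed) energy landscapes, and harder problems can simply be given more steps. The energy below is invented for illustration; IRED learns its energy functions:

    # toy energy E_s(y) = s * (y - x**2)**2, encoding the constraint "y = x**2"
    def grad_energy(y, x, sharpness):
        return 2 * sharpness * (y - x ** 2)

    x, y = 3.0, 0.0                            # condition x, initial answer y
    for sharpness in [0.1, 0.3, 1.0]:          # annealed sequence of landscapes
        for _ in range(50):                    # harder problems can run more steps
            y -= 0.1 * grad_energy(y, x, sharpness)

    print("refined answer:", round(y, 3), "(target 9.0)")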

Updated: 2024-06-17 03:36:47

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.11179v1

Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the Iterative step-level Process Refinement (IPR) framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical findings highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.

Updated: 2024-06-17 03:29:13

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11176v1

Distributed Harmonization: Federated Clustered Batch Effect Adjustment and Generalization

Independent and identically distributed (i.i.d.) data is essential to many data analysis and modeling techniques. In the medical domain, collecting data from multiple sites or institutions is a common strategy that guarantees sufficient clinical diversity, reflecting the decentralized nature of medical data. However, data from various sites are easily biased by the local environment or facilities, thereby violating the i.i.d. rule. A common strategy is to harmonize the site bias while retaining important biological information. ComBat is among the most popular harmonization approaches and has recently been extended to handle distributed sites. However, when faced with situations involving newly joined sites in training or evaluating data from unknown/unseen sites, ComBat lacks compatibility and requires retraining with data from all the sites. The retraining leads to significant computational and logistic overhead that is usually prohibitive. In this work, we develop a novel Cluster ComBat harmonization algorithm, which leverages cluster patterns of the data in different sites and greatly advances the usability of ComBat harmonization. We use extensive simulation and real medical imaging data from ADNI to demonstrate the superiority of the proposed approach. Our codes are provided in https://github.com/illidanlab/distributed-cluster-harmonization.

Updated: 2024-06-17 03:28:33

Domains: cs.LG

Download: http://arxiv.org/abs/2405.15081v2

Black-box Prompt Tuning with Subspace Learning

Black-box prompt tuning employs derivative-free optimization algorithms to learn prompts within low-dimensional subspaces rather than back-propagating through the network of Large Language Models (LLMs). Recent studies reveal that black-box prompt tuning lacks versatility across tasks and LLMs, which we believe is related to the suboptimal choice of subspaces. In this paper, we introduce Black-box prompt tuning with Subspace Learning (BSL) to enhance the versatility of black-box prompt tuning. Based on the assumption that nearly optimal prompts for similar tasks reside in a common subspace, we propose identifying such subspaces through meta-learning on a collection of similar source tasks. Consequently, for a target task that shares similarities with the source tasks, we expect that optimizing within the identified subspace can yield a prompt that performs well on the target task. Experimental results confirm that our BSL framework consistently achieves competitive performance across various downstream tasks and LLMs.
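
Black-box prompt tuning in caricature: the soft prompt is parameterized as z @ A for a low-dimensional z, and z is improved by derivative-free search against a black-box score. The projection A is what subspace learning would meta-learn from source tasks; here it is random, and the score is a stand-in for querying an LLM:

    import numpy as np

    rng = np.random.default_rng(0)
    prompt_len, dim, sub = 5, 64, 8
    A = rng.normal(size=(sub, prompt_len * dim)) / np.sqrt(sub)  # subspace basis
    target = rng.normal(size=prompt_len * dim)                   # hidden "good prompt"

    def black_box_score(prompt_flat):
        # stand-in for "query the LLM and score its output"; no gradients available
        return -float(np.sum((prompt_flat - target) ** 2))

    z = np.zeros(sub)
    best = black_box_score(z @ A)
    for _ in range(500):                           # simple (1+1) random search
        cand = z + rng.normal(scale=0.3, size=sub)
        score = black_box_score(cand @ A)
        if score > best:
            z, best = cand, score

    print("score after search:", round(best, 2))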

Updated: 2024-06-17 03:26:59

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2305.03518v2

SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations

Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is not very well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in architecture, pre-training objectives and datasets to benchmark the performance of SUGARCREPE++ dataset. Experimental results highlight the difficulties of VLMs in distinguishing between lexical and semantic variations, particularly in object attributes and spatial relations. Although VLMs with larger pre-training datasets, model sizes, and multiple pre-training objectives achieve better performance on SUGARCREPE++, there is a significant opportunity for improvement. We show that all the models which achieve better performance on compositionality datasets need not perform equally well on SUGARCREPE++, signifying that compositionality alone may not be sufficient for understanding semantic and lexical alterations. Given the importance of the property that the SUGARCREPE++ dataset targets, it serves as a new challenge to the vision-and-language community.

Updated: 2024-06-17 03:22:20

Domains: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.11171v1

Two-Timescale Optimization Framework for Decentralized Linear-Quadratic Optimal Control

This study investigates a decentralized linear-quadratic optimal control problem, and several approximate separable constrained optimization problems are formulated for the first time based on the selection of sparsity promoting functions. First, for the optimization problem with weighted $\ell_1$ sparsity promoting function, a two-timescale algorithm is adopted that is based on the BSUM (Block Successive Upper-bound Minimization) framework and a differential equation solver. Second, a piecewise quadratic sparsity promoting function is introduced, and the induced optimization problem demonstrates an accelerated convergence rate by performing the same two-timescale algorithm. Finally, the optimization problem with $\ell_0$ sparsity promoting function is considered that is nonconvex and discontinuous, and can be approximated by successive coordinatewise convex optimization problems.
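
For orientation, sparsity-promoting decentralized LQ design problems of this kind are commonly posed in the following weighted $\ell_1$ form (generic notation, which may differ from the paper's exact formulation):

    \min_{K} \; J(K) + \gamma \sum_{i,j} w_{ij} \, |K_{ij}|

where $J(K)$ is the closed-loop linear-quadratic cost, $K$ is the feedback gain, $\gamma > 0$ trades off performance against sparsity, and the piecewise quadratic and $\ell_0$ variants replace the $|K_{ij}|$ penalty term.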

Updated: 2024-06-17 03:17:33

Domains: math.OC,cs.LG

Download: http://arxiv.org/abs/2406.11168v1

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We develop two attacks to exploit the ChatBug vulnerability. We demonstrate that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research.
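
The format mismatch at the heart of ChatBug fits in a few lines. The template and control tokens below are generic placeholders, not any specific model's format; the point is that the template constrains the fine-tuning data but not the API caller:

    TEMPLATE = "<|user|>\n{message}\n<|assistant|>\n"

    def well_formed(message: str) -> str:
        # what the fine-tuning data always looks like
        return TEMPLATE.format(message=message)

    def template_injection(message: str) -> str:
        # nothing stops a caller from embedding control tokens inside the
        # "user" field, shifting where the model believes its turn begins
        return TEMPLATE.format(message=message + "\n<|assistant|>\nSure,")

    print(well_formed("Hello"))
    print(template_injection("Hello"))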

Updated: 2024-06-17 03:03:34

Domains: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.12935v1

SnapKV: LLM Knows What You are Looking for Before Generation

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to the baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
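
The selection rule, reconstructed loosely from the description (toy tensors; the real SnapKV operates inside the model's attention layers): attention weights from an observation window at the end of the prompt vote for which earlier KV positions matter, and only the top-scoring positions are kept per head:

    import torch

    seq, window, keep, dim = 128, 16, 32, 64
    K = torch.randn(seq, dim)                     # keys for one attention head
    V = torch.randn(seq, dim)
    Q_obs = torch.randn(window, dim)              # queries from the observation window

    attn = (Q_obs @ K[:-window].T / dim ** 0.5).softmax(dim=-1)  # (window, seq-window)
    scores = attn.sum(dim=0)                      # total attention each position gets
    top = scores.topk(keep).indices.sort().values # important positions, in order

    K_small = torch.cat([K[top], K[-window:]])    # compressed cache + window itself
    V_small = torch.cat([V[top], V[-window:]])
    print(K.shape[0], "->", K_small.shape[0], "cached positions")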

Updated: 2024-06-17 03:01:58

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2404.14469v2

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.

Updated: 2024-06-17 03:01:22

Domains: cs.AI,cs.MM

Download: http://arxiv.org/abs/2406.11161v1

Move Beyond Triples: Contextual Knowledge Graph Representation and Reasoning

Knowledge Graphs (KGs) are foundational structures in many AI applications, representing entities and their interrelations through triples. However, triple-based KGs lack the contextual information of relational knowledge, like temporal dynamics and provenance details, which are crucial for comprehensive knowledge representation and effective reasoning. Instead, \textbf{Contextual Knowledge Graphs} (CKGs) expand upon the conventional structure by incorporating additional information such as time validity, geographic location, and source provenance. This integration provides a more nuanced and accurate understanding of knowledge, enabling KGs to offer richer insights and support more sophisticated reasoning processes. In this work, we first discuss the inherent limitations of triple-based KGs and introduce the concept of contextual KGs, highlighting their advantages in knowledge representation and reasoning. We then present \textbf{KGR$^3$, a context-enriched KG reasoning paradigm} that leverages large language models (LLMs) to retrieve candidate entities and related contexts, rank them based on the retrieved information, and reason whether sufficient information has been obtained to answer a query. Our experimental results demonstrate that KGR$^3$ significantly improves performance on KG completion (KGC) and KG question answering (KGQA) tasks, validating the effectiveness of incorporating contextual information on KG representation and reasoning.
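
A minimal data structure makes the contrast with plain triples concrete. Field names are illustrative, not a standard the paper prescribes:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ContextualTriple:
        head: str
        relation: str
        tail: str
        valid_from: Optional[str] = None   # temporal validity of the statement
        valid_to: Optional[str] = None
        location: Optional[str] = None     # geographic context
        source: Optional[str] = None       # provenance of the statement

    fact = ContextualTriple(
        "Marie Curie", "affiliated_with", "University of Paris",
        valid_from="1906", valid_to="1934", location="Paris", source="Wikipedia",
    )
    print(fact)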

Updated: 2024-06-17 02:59:19

标题: 超越三元组:上下文知识图表示与推理

摘要: 知识图谱(KGs)是许多人工智能应用中的基础结构,通过三元组表示实体及其相互关系。然而,基于三元组的知识图谱缺乏关系知识的上下文信息,如时间动态和出处细节,这些对于全面的知识表示和有效的推理至关重要。相比之下,上下文知识图谱(CKGs)通过整合额外信息,如时间有效性、地理位置和来源出处,扩展了传统结构。这种整合提供了对知识更细致和准确的理解,使知识图谱能够提供更丰富的见解,并支持更复杂的推理过程。在这项工作中,我们首先讨论基于三元组的知识图谱的固有局限性,并介绍上下文知识图谱的概念,突出它们在知识表示和推理中的优势。然后,我们提出KGR$^3$,一个上下文增强的知识图谱推理范式,利用大型语言模型(LLMs)检索候选实体和相关上下文,根据检索到的信息对它们进行排序,并推断是否已获得足够的信息来回答查询。我们的实验结果表明,KGR$^3$在知识图谱补全(KGC)和知识图谱问答(KGQA)任务上显著提高了性能,验证了在知识图谱表示和推理中整合上下文信息的有效性。

更新时间: 2024-06-17 02:59:19

领域: cs.AI

下载: http://arxiv.org/abs/2406.11160v1

Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework

Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational/communication delays, thereby severely hindering the learning process. Therefore, how to accelerate asynchronous SGD by efficiently scheduling multiple workers is an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of asynchronous SGD based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we present the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we reveal the relevant SDDE's damping coefficient and its delay statistics, as functions of the number of activated clients, staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE allows us to present both the distributed SGD's convergence condition and speed by calculating its characteristic roots, thereby optimizing the scheduling policies for asynchronous/event-triggered SGD. Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness. Moreover, a small degree of staleness does not necessarily slow down the convergence, while a large degree of staleness will result in the divergence of distributed SGD. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.
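
A toy discrete analogue of the delay analysis: gradient descent on a quadratic whose gradients arrive with a fixed staleness tau. The stability threshold observed below mirrors the characteristic-root picture of the SDDE; all constants are illustrative, and this is not the paper's framework.

# Toy simulation (assumption-laden, not the paper's framework): the update
# x_{t+1} = x_t - lr * H * x_{t-tau} is the discrete analogue of the delay
# term in a stochastic delay differential equation.
import numpy as np

def stale_sgd(hessian_eig=4.0, lr=0.1, tau=0, steps=400, noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(steps + tau + 1)
    x[: tau + 1] = 1.0  # history needed by the delayed update
    for t in range(tau, steps + tau):
        grad = hessian_eig * x[t - tau] + noise * rng.standard_normal()
        x[t + 1] = x[t] - lr * grad
    return x[-1]

for tau in (0, 1, 3, 6):
    print(f"staleness={tau}: final |x| = {abs(stale_sgd(tau=tau)):.3e}")
# With lr*H = 0.4, staleness up to 3 still converges (the stability bound
# for this recursion is 2*sin(pi/(2*(2*tau+1)))), while tau = 6 diverges,
# mirroring the characteristic-root analysis of the SDDE.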

Updated: 2024-06-17 02:56:55

标题: 带有陈旧度的分布式随机梯度下降:基于随机延迟微分方程的框架

摘要: 分布式随机梯度下降(SGD)由于其在扩展计算资源、减少训练时间以及帮助保护机器学习中用户隐私方面的潜力,近年来引起了广泛关注。然而,掉队节点(stragglers)和有限的带宽可能会导致随机计算/通信延迟,从而严重阻碍学习过程。因此,如何通过有效调度多个工作节点来加速异步SGD是一个重要问题。本文提出了一个统一的框架,基于随机延迟微分方程(SDDEs)和聚合梯度到达的泊松近似,来分析和优化异步SGD的收敛性。特别地,我们在不对计算时间作无记忆性假设的情况下,给出了分布式SGD的运行时间和陈旧度。在给定学习率的情况下,我们揭示了相关SDDE的阻尼系数及其延迟统计,作为激活客户端数量、陈旧度阈值、目标函数Hessian矩阵的特征值以及整体计算/通信延迟的函数。所构建的SDDE使我们能够通过计算其特征根来刻画分布式SGD的收敛条件和速度,从而优化异步/事件触发SGD的调度策略。有趣的是,结果显示,由于陈旧度的存在,增加激活工作节点的数量并不一定会加速分布式SGD。此外,小程度的陈旧度不一定会减慢收敛速度,而大程度的陈旧度将导致分布式SGD发散。数值结果展示了我们的SDDE框架的潜力,即使在具有非凸目标函数的复杂学习任务中也是如此。

更新时间: 2024-06-17 02:56:55

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2406.11159v1

Progressive Dual Priori Network for Generalized Breast Tumor Segmentation

To promote the generalization ability of breast tumor segmentation models, as well as to improve the segmentation performance for breast tumors of smaller size, low contrast, and irregular shape, we propose a progressive dual priori network (PDPNet) to segment breast tumors from dynamic contrast-enhanced magnetic resonance images (DCE-MRI) acquired at different centers. PDPNet first crops tumor regions with a coarse-segmentation-based localization module, and then progressively refines the breast tumor mask using weak semantic priors and cross-scale correlation prior knowledge. To validate the effectiveness of PDPNet, we compared it with several state-of-the-art methods on multi-center datasets. The results showed that, compared with the second-best method, the DSC and HD95 of PDPNet improved by at least 5.13% and 7.58%, respectively, on multi-center test sets. In addition, through ablations, we demonstrated that the proposed localization module can decrease the influence of normal tissues and therefore improve the generalization ability of the model. The weak semantic priors allow focusing on tumor regions to avoid missing small and low-contrast tumors. The cross-scale correlation priors are beneficial for promoting shape-awareness for irregular tumors. Integrating them in a unified framework thus improves multi-center breast tumor segmentation performance. The source code and open data can be accessed at https://github.com/wangli100209/PDPNet.

Updated: 2024-06-17 02:55:34

标题: 渐进式双先验网络用于泛化乳腺肿瘤分割

摘要: 为了提高乳腺肿瘤分割模型的泛化能力,以及提高对大小较小、对比度较低和形状不规则的乳腺肿瘤的分割性能,我们提出了一个渐进的双重先验网络(PDPNet),用于从不同中心获取的动态增强磁共振图像(DCE-MRI)中分割乳腺肿瘤。PDPNet首先通过基于粗分割的定位模块裁剪肿瘤区域,然后利用弱语义先验和跨尺度相关性先验逐步优化乳腺肿瘤掩模。为了验证PDPNet的有效性,我们将其与几种最先进的方法在多中心数据集上进行比较。结果显示,与次优方法相比,PDPNet在多中心测试集上的DSC和HD95至少分别提高了5.13%和7.58%。此外,通过消融实验,我们证明了提出的定位模块可以减少正常组织的影响,从而提高模型的泛化能力。弱语义先验允许专注于肿瘤区域,避免遗漏小肿瘤和低对比度肿瘤。跨尺度相关性先验有助于提高对不规则肿瘤的形状感知能力。因此,将它们集成在一个统一框架中改善了多中心乳腺肿瘤分割性能。源代码和开放数据可在https://github.com/wangli100209/PDPNet 上访问。

更新时间: 2024-06-17 02:55:34

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2310.13574v2

DeFiGuard: A Price Manipulation Detection Service in DeFi using Graph Neural Networks

The prosperity of Decentralized Finance (DeFi) unveils underlying risks, with reported losses surpassing 3.2 billion USD between 2018 and 2022 due to vulnerabilities in Decentralized Applications (DApps). One significant threat is the Price Manipulation Attack (PMA) that alters asset prices during transaction execution. As a result, PMA accounts for over 50 million USD in losses. To address the urgent need for efficient PMA detection, this paper introduces a novel detection service, DeFiGuard, using Graph Neural Networks (GNNs). In this paper, we propose cash flow graphs with four distinct features, which capture the trading behaviors from transactions. Moreover, DeFiGuard integrates transaction parsing, graph construction, model training, and PMA detection. Evaluations on a dataset of 208 PMA and 2,080 non-PMA transactions show that DeFiGuard with GNN models outperforms the baseline in Accuracy, TPR, FPR, and AUC-ROC. The results of ablation studies suggest that the combination of the four proposed node features enhances DeFiGuard's efficacy. Moreover, DeFiGuard classifies transactions within 0.892 to 5.317 seconds, which provides sufficient time for the victims (DApps and users) to take action to rescue their vulnerable funds. In conclusion, this research offers a significant step towards safeguarding the DeFi landscape from PMAs using GNNs.
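
A minimal sketch of the classification step, assuming a plain mean-aggregation GNN over a cash-flow graph with four node features; the architecture, feature semantics, and toy graph are assumptions, not the DeFiGuard implementation.

# Minimal sketch (assumptions throughout): a two-layer message-passing
# network that classifies a transaction's cash-flow graph as PMA or benign.
import torch
import torch.nn as nn

class CashFlowGNN(nn.Module):
    def __init__(self, in_dim=4, hidden=32):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, 2)  # non-PMA vs. PMA

    def forward(self, x, adj):
        # Row-normalized aggregation: H <- relu(A_hat @ H @ W)
        deg = adj.sum(1, keepdim=True).clamp(min=1.0)
        a_hat = adj / deg
        h = torch.relu(self.lin1(a_hat @ x))
        h = torch.relu(self.lin2(a_hat @ h))
        return self.head(h.mean(dim=0))  # mean-pool nodes -> graph logits

# One toy graph: 5 accounts, 4 features per node (e.g., in/out flow stats).
x = torch.randn(5, 4)
adj = torch.eye(5) + torch.rand(5, 5).round()  # self-loops + random edges
logits = CashFlowGNN()(x, adj)
print(logits.softmax(-1))  # probabilities of [non-PMA, PMA]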

Updated: 2024-06-17 02:51:18

标题: DeFiGuard:使用图神经网络在DeFi中进行价格操纵检测服务

摘要: DeFi的繁荣揭示了潜在的风险:据报道,由于去中心化应用(DApps)中的漏洞,2018年至2022年间的损失超过32亿美元。一个重要的威胁是价格操纵攻击(PMA),它在交易执行过程中改变资产价格,已造成超过5000万美元的损失。为了满足对高效PMA检测的迫切需求,本文提出了一种使用图神经网络(GNNs)的新型检测服务DeFiGuard。在本文中,我们提出了具有四种不同节点特征的现金流图,以捕捉交易中的交易行为。此外,DeFiGuard整合了交易解析、图构建、模型训练和PMA检测。在包含208个PMA和2080个非PMA交易的数据集上的评估显示,带有GNN模型的DeFiGuard在准确率、TPR、FPR和AUC-ROC方面均优于基线。消融研究的结果表明,四种节点特征的组合增强了DeFiGuard的效力。此外,DeFiGuard在0.892至5.317秒内完成交易分类,为受害者(DApps和用户)采取行动挽救其处于风险中的资金留出了充足的时间。总之,这项研究朝着利用GNN保护DeFi生态免受PMA攻击迈出了重要一步。

更新时间: 2024-06-17 02:51:18

领域: cs.CR

下载: http://arxiv.org/abs/2406.11157v1

DELRec: Distilling Sequential Pattern to Enhance LLM-based Recommendation

Sequential recommendation (SR) tasks enhance recommendation accuracy by capturing the connection between users' past interactions and their changing preferences. Conventional models often focus solely on capturing sequential patterns within the training data, neglecting the broader context and semantic information embedded in item titles from external sources. This limits their predictive power and adaptability. Recently, large language models (LLMs) have shown promise in SR tasks due to their advanced understanding capabilities and strong generalization abilities. Researchers have attempted to enhance LLMs' recommendation performance by incorporating information from SR models. However, previous approaches have encountered problems such as 1) only influencing LLMs at the result level; 2) increased complexity of LLM-based recommendation methods leading to reduced interpretability; and 3) incomplete understanding and utilization of SR models' information by LLMs. To address these problems, we propose a novel framework, DELRec, which aims to extract knowledge from SR models and enable LLMs to easily comprehend and utilize this supplementary information for more effective sequential recommendations. DELRec consists of two main stages: 1) SR Models Pattern Distilling, focusing on extracting behavioral patterns exhibited by SR models using soft prompts through two well-designed strategies; 2) LLMs-based Sequential Recommendation, aiming to fine-tune LLMs to effectively use the distilled auxiliary information to perform SR tasks. Extensive experimental results conducted on three real datasets validate the effectiveness of the DELRec framework.

Updated: 2024-06-17 02:47:09

标题: DELRec:提炼顺序模式以增强基于LLM的推荐

摘要: 顺序推荐(SR)任务通过捕捉用户过去交互和他们不断变化的偏好之间的联系,提高了推荐准确性。传统模型往往只关注捕捉训练数据中的顺序模式,忽视了来自外部来源的项目标题中嵌入的更广泛的上下文和语义信息。这限制了它们的预测能力和适应性。最近,大型语言模型(LLMs)在SR任务中显示出潜力,因为它们具有先进的理解能力和强大的泛化能力。研究人员已经尝试通过将SR模型的信息纳入来增强LLMs的推荐性能。然而,先前的方法遇到了一些问题,如:1)只在结果级别影响LLMs;2)LLMs推荐方法复杂度增加导致解释性降低;3)LLMs对SR模型信息的理解和利用不完整。为了解决这些问题,我们提出了一个新颖的框架DELRec,旨在从SR模型中提取知识,使LLMs能够轻松理解和利用这些辅助信息以进行更有效的顺序推荐。DELRec包括两个主要阶段:1)SR模型模式提取,重点是通过两种精心设计的策略使用软提示提取SR模型展示的行为模式;2)基于LLMs的顺序推荐,旨在微调LLMs以有效利用提炼的辅助信息执行SR任务。在三个真实数据集上进行的大量实验结果验证了DELRec框架的有效性。

更新时间: 2024-06-17 02:47:09

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2406.11156v1

Interpretable modulated differentiable STFT and physics-informed balanced spectrum metric for freight train wheelset bearing cross-machine transfer fault diagnosis under speed fluctuations

The service conditions of wheelset bearings, as key components, have a direct impact on the safe operation of railway heavy-haul freight trains. However, speed fluctuation of the trains and scarce fault samples are the two main problems that restrict the accuracy of bearing fault diagnosis. Therefore, a cross-machine transfer diagnosis (pyDSN) network, coupled with an interpretable modulated differentiable short-time Fourier transform (STFT) and a physics-informed balanced spectrum quality metric, is proposed to learn domain-invariant and discriminative features under time-varying speeds. Firstly, since fixed windows are insufficient for extracting the frequency components of time-varying speed signals, a modulated differentiable STFT (MDSTFT), which is interpretable with STFT-informed theoretical support, is proposed to extract a robust time-frequency spectrum (TFS). During the training process, multiple windows with different lengths change dynamically. Also, in addition to the classification metric and the domain discrepancy metric, we creatively introduce a third kind of metric, referred to as the physics-informed metric, to enhance the transferability of the TFS. A physics-informed balanced spectrum quality (BSQ) regularization loss is devised to guide the optimization direction for the MDSTFT and the model. With it, not only can the model acquire a high-quality TFS, but a physics-restricted domain adaptation network can also be obtained, making it learn real-world physics knowledge and ultimately diminishing the domain discrepancy across different datasets. The experiment is conducted in the scenario of migrating from laboratory datasets to a freight train dataset, indicating that the hybrid-driven pyDSN outperforms existing methods and has practical value.
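
The core idea of a differentiable, modulated window can be sketched in a few lines: make the width of a Gaussian analysis window a learnable parameter, so gradients from any spectrum-quality loss can reshape the time-frequency trade-off. This is a simplified stand-in for MDSTFT, with illustrative sizes, not the paper's implementation.

# Sketch of the core idea (my own simplification): an STFT whose Gaussian
# window width is a learnable parameter optimized by gradient descent.
import torch

class LearnableWindowSTFT(torch.nn.Module):
    def __init__(self, n_fft=256, hop=64, init_sigma=32.0):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.sigma = torch.nn.Parameter(torch.tensor(init_sigma))  # width in samples

    def forward(self, x):
        n = torch.arange(self.n_fft, dtype=x.dtype)
        center = (self.n_fft - 1) / 2
        window = torch.exp(-0.5 * ((n - center) / self.sigma) ** 2)
        frames = x.unfold(-1, self.n_fft, self.hop)      # (..., T, n_fft)
        spec = torch.fft.rfft(frames * window, dim=-1)   # complex spectrum
        return spec.abs()                                # magnitude TFS

stft = LearnableWindowSTFT()
x = torch.sin(2 * torch.pi * 50 * torch.linspace(0, 1, 4096))
tfs = stft(x)
loss = -tfs.max(dim=-1).values.mean()   # e.g., reward spectral peakiness
loss.backward()
print(tfs.shape, stft.sigma.grad)       # gradient flows into the window width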

Updated: 2024-06-17 02:43:24

标题: 可解释的调制可微STFT和物理启发的平衡频谱度量用于速度波动下的货运列车轮对轴承跨机器传递故障诊断

摘要: 轮对轴承作为关键部件,其服役状态直接影响铁路重载货运列车的安全运行。然而,列车速度波动和缺乏故障样本是限制轴承故障诊断准确性的两个主要问题。因此,提出了一种耦合可解释的调制可微短时傅立叶变换(STFT)和物理信息平衡频谱质量指标的跨机器传递诊断(pyDSN)网络,以学习在时间变化速度下的域不变和有区别的特征。首先,由于使用固定窗口提取时间变化速度信号的频率分量不足,提出了一种调制可微STFT(MDSTFT),其具有与STFT相关的理论支持,用于提取稳健的时频谱(TFS)。在训练过程中,多个具有不同长度的窗口动态变化。此外,除了分类度量和域差异度度量外,我们还创造性地引入了第三种度量,称为物理信息度量,以增强可传递的TFS。设计了一种物理信息平衡频谱质量(BSQ)正则化损失,用于指导MDSTFT和模型的优化方向。通过它,模型不仅可以获得高质量的TFS,还可以获得一个受物理限制的域适应网络,使其学习真实世界的物理知识,最终减少不同数据集之间的域差异。实验在从实验室数据集迁移到货运列车数据集的情景中进行,表明混合驱动的pyDSN优于现有方法,并具有实际价值。

更新时间: 2024-06-17 02:43:24

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2406.11917v1

Recent and Upcoming Developments in Randomized Numerical Linear Algebra for Machine Learning

Large matrices arise in many machine learning and data analysis applications, including as representations of datasets, graphs, model weights, and first and second-order derivatives. Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness to develop improved algorithms for ubiquitous matrix problems. The area has reached a certain level of maturity, but recent hardware trends, efforts to incorporate RandNLA algorithms into core numerical libraries, and advances in machine learning, statistics, and random matrix theory have led to new theoretical and practical challenges. This article provides a self-contained overview of RandNLA, in light of these developments.
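
A canonical example of the kind of algorithm this area covers is sketch-and-solve for overdetermined least squares: compress the tall matrix with a random sketch and solve the small problem. The construction below is the standard textbook one, not taken from the article.

# Sketch-and-solve least squares: a Gaussian sketch S compresses A from
# n rows to m << n rows, and solving the small problem gives a
# near-optimal solution with high probability.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10000, 50, 400          # tall matrix; sketch size m >> d
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_ex, *_ = np.linalg.lstsq(A, b, rcond=None)

print("relative error of sketched solution:",
      np.linalg.norm(x_sk - x_ex) / np.linalg.norm(x_ex))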

Updated: 2024-06-17 02:30:55

标题: 机器学习中随机化数值线性代数的最新进展与未来发展

摘要: 大型矩阵在许多机器学习和数据分析应用中出现,包括作为数据集、图、模型权重以及一阶和二阶导数的表示。随机化数值线性代数(RandNLA)是一个利用随机性为普遍存在的矩阵问题开发改进算法的领域。该领域已经达到了一定的成熟水平;但最近的硬件趋势、将RandNLA算法纳入核心数值库的努力,以及机器学习、统计学和随机矩阵理论方面的进展,带来了新的理论和实践挑战。本文结合这些发展,提供了RandNLA的自成体系的概述。

更新时间: 2024-06-17 02:30:55

领域: cs.LG,cs.NA,math.NA,stat.ML

下载: http://arxiv.org/abs/2406.11151v1

GoldCoin: Grounding Large Language Models in Privacy Laws via Contextual Integrity Theory

Privacy issues arise prominently during the inappropriate transmission of information between entities. Existing research primarily studies privacy by exploring various privacy attacks, defenses, and evaluations within narrowly predefined patterns, while neglecting that privacy is not an isolated, context-free concept limited to traditionally sensitive data (e.g., social security numbers), but intertwined with intricate social contexts that complicate the identification and analysis of potential privacy violations. The advent of Large Language Models (LLMs) offers unprecedented opportunities for incorporating the nuanced scenarios outlined in privacy laws to tackle these complex privacy issues. However, the scarcity of open-source relevant case studies restricts the efficiency of LLMs in aligning with specific legal statutes. To address this challenge, we introduce a novel framework, GoldCoin, designed to efficiently ground LLMs in privacy laws for judicially assessing privacy violations. Our framework leverages the theory of contextual integrity as a bridge, creating numerous synthetic scenarios grounded in relevant privacy statutes (e.g., HIPAA), to assist LLMs in comprehending the complex contexts for identifying privacy risks in the real world. Extensive experimental results demonstrate that GoldCoin markedly enhances LLMs' capabilities in recognizing privacy risks across real court cases, surpassing the baselines on different judicial tasks.

Updated: 2024-06-17 02:27:32

标题: GoldCoin:通过上下文完整性理论将大型语言模型落地于隐私法律

摘要: 隐私问题在实体之间不当传输信息时尤为突出。现有研究主要通过在狭窄的预定义模式内探索各种隐私攻击、防御和评估来研究隐私,而忽视了隐私并不是一个孤立的、脱离上下文的概念,也不仅限于传统意义上的敏感数据(如社会安全号码),而是与错综复杂的社会背景交织在一起,这使得潜在隐私侵犯的识别和分析变得复杂。大型语言模型(LLMs)的出现为纳入隐私法中描述的细微情景以解决这些复杂的隐私问题提供了前所未有的机会。然而,开源相关案例研究的稀缺限制了LLMs与特定法律条文对齐的效率。为了解决这一挑战,我们引入了一种新颖的框架GoldCoin,旨在有效地将LLMs扎根于隐私法,用于对隐私侵犯进行司法评估。我们的框架利用上下文完整性理论作为桥梁,创建了大量基于相关隐私法规(如HIPAA)的合成情景,以帮助LLMs理解在现实世界中识别隐私风险的复杂背景。广泛的实验结果表明,GoldCoin显著增强了LLMs在真实法庭案例中识别隐私风险的能力,在不同司法任务上超越了基线。

更新时间: 2024-06-17 02:27:32

领域: cs.CL,cs.CR

下载: http://arxiv.org/abs/2406.11149v1

Few-Shot Recognition via Stage-Wise Augmented Finetuning

Few-shot recognition aims to train a classification model with only a few labeled examples of pre-defined concepts, where annotation can be costly in a downstream task. In another related research area, zero-shot recognition, which assumes no access to any downstream-task data, has been greatly advanced by using pretrained Vision-Language Models (VLMs). In this area, retrieval-augmented learning (RAL) effectively boosts zero-shot accuracy by retrieving and learning from external data relevant to downstream concepts. Motivated by these advancements, our work explores RAL for few-shot recognition. While seemingly straightforward, RAL for few-shot recognition has been under-explored in the literature (till now!), and it presents novel challenges and opportunities. First, perhaps surprisingly, simply finetuning the VLM on a large amount of retrieved data barely surpasses state-of-the-art zero-shot methods due to the imbalanced distribution of retrieved data and its domain gaps compared to few-shot annotated data. Second, finetuning a VLM on few-shot examples alone significantly outperforms prior methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issue, we propose the Stage-Wise Augmented fineTuning (SWAT) method, which involves end-to-end finetuning on mixed data for the first stage and retraining the classifier solely on the few-shot data in the second stage. Extensive experiments show that SWAT achieves the best performance on standard benchmark datasets, resoundingly outperforming prior works by ~10% in accuracy. Code is available at https://github.com/tian1327/SWAT.
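
The two-stage recipe can be condensed into the skeleton below, with a tiny stand-in model and synthetic batches in place of a real VLM and dataloaders; everything here is an illustrative assumption rather than the released SWAT code.

# Stage-wise finetuning sketch: stage 1 trains end-to-end on mixed data,
# stage 2 freezes the backbone and retrains the classifier on few-shot data.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):           # stand-in for a CLIP-like classifier
    def __init__(self, num_classes=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 64), nn.ReLU())
        self.classifier = nn.Linear(64, num_classes)
    def forward(self, x):
        return self.classifier(self.backbone(x))

def batches(n, bs=16):              # toy batches standing in for dataloaders
    for _ in range(n):
        yield torch.randn(bs, 3, 8, 8), torch.randint(0, 5, (bs,))

vlm, ce = TinyVLM(), nn.CrossEntropyLoss()

# Stage 1: end-to-end finetuning on mixed retrieved + few-shot data.
opt = torch.optim.AdamW(vlm.parameters(), lr=1e-3)
for x, y in batches(50):            # "mixed_loader"
    opt.zero_grad(); ce(vlm(x), y).backward(); opt.step()

# Stage 2: freeze the backbone; retrain only the classifier on few-shot
# data to undo the retrieved data's imbalance and domain gap.
for p in vlm.backbone.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(vlm.classifier.parameters(), lr=1e-3)
for x, y in batches(20):            # "fewshot_loader"
    opt.zero_grad(); ce(vlm(x), y).backward(); opt.step()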

Updated: 2024-06-17 02:27:14

标题: 通过阶段式增强微调实现少样本识别

摘要: 少样本识别旨在仅使用少量预定义概念的标注样本来训练分类模型,而在下游任务中标注的成本可能很高。在另一个相关的研究领域,零样本识别假设无法访问任何下游任务数据,通过使用预训练的视觉-语言模型(VLMs)取得了显著进展。在这一领域,检索增强学习(RAL)通过检索和学习与下游概念相关的外部数据,有效地提高了零样本准确性。受这些进展的启发,我们的工作探索了将RAL用于少样本识别。尽管看似直接,但这一方向在文献中迄今鲜有探索,我们提出了将RAL应用于少样本识别的新挑战和机遇。首先,也许令人惊讶的是,仅在大量检索数据上微调VLM几乎无法超越最先进的零样本方法,这是由于检索数据的分布不平衡以及其与少样本标注数据之间的领域差距。其次,仅在少样本示例上对VLM进行微调就明显优于先前的方法,而在检索数据和少样本数据的混合上进行微调则产生更好的结果。第三,为了缓解分布不平衡和领域差距问题,我们提出了分阶段增强微调(SWAT)方法,该方法在第一阶段对混合数据进行端到端微调,在第二阶段仅在少样本数据上重新训练分类器。大量实验表明,SWAT在标准基准数据集上取得了最佳性能,准确率超过以前的工作约10%。代码可在https://github.com/tian1327/SWAT找到。

更新时间: 2024-06-17 02:27:14

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11148v1

StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables

StepMix is an open-source Python package for the pseudo-likelihood estimation (one-, two- and three-step approaches) of generalized finite mixture models (latent profile and latent class analysis) with external variables (covariates and distal outcomes). In many applications in social sciences, the main objective is not only to cluster individuals into latent classes, but also to use these classes to develop more complex statistical models. These models generally divide into a measurement model that relates the latent classes to observed indicators, and a structural model that relates covariates and outcome variables to the latent classes. The measurement and structural models can be estimated jointly using the so-called one-step approach or sequentially using stepwise methods, which present significant advantages for practitioners regarding the interpretability of the estimated latent classes. In addition to the one-step approach, StepMix implements the most important stepwise estimation methods from the literature, including the bias-adjusted three-step methods with Bolck-Croon-Hagenaars and maximum likelihood corrections and the more recent two-step approach. These pseudo-likelihood estimators are presented in this paper under a unified framework as specific expectation-maximization subroutines. To facilitate and promote their adoption among the data science community, StepMix follows the object-oriented design of the scikit-learn library and provides an additional R wrapper.
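
A hedged usage sketch follows; the argument names reflect my reading of the StepMix documentation (e.g. `measurement`, `structural`, `n_steps`, `correction`) and should be checked against the released package before use.

# Usage sketch under the above assumptions: three latent classes with
# binary indicators, a continuous distal outcome, and bias-adjusted
# three-step estimation with the BCH correction.
import numpy as np
from stepmix.stepmix import StepMix

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))      # binary indicators (measurement)
Y = rng.standard_normal((500, 1))          # distal outcome (structural)

# n_steps=1 or 2 would select the one-step or two-step estimators instead.
model = StepMix(n_components=3, measurement="bernoulli",
                structural="gaussian_unit", n_steps=3, correction="BCH")
model.fit(X, Y)
print(model.score(X, Y))                   # average log-likelihood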

Updated: 2024-06-17 02:26:30

标题: StepMix:一个用于带外部变量的广义混合模型伪似然估计的Python包

摘要: StepMix是一个开源的Python软件包,用于带有外部变量(协变量和远端结果)的广义有限混合模型(潜在剖面分析和潜在类别分析)的伪似然估计(一步、两步和三步方法)。在社会科学的许多应用中,主要目标不仅是将个体分成潜在类别,还要利用这些类别来开发更复杂的统计模型。这些模型通常分为将潜在类别与观察指标相关联的测量模型,以及将协变量和结果变量与潜在类别相关联的结构模型。测量和结构模型可以使用所谓的一步方法联合估计,也可以使用逐步方法依次估计,后者对于从业者在解释所估计的潜在类别方面具有显著优势。除了一步方法之外,StepMix还实现了文献中最重要的逐步估计方法,包括带有Bolck-Croon-Hagenaars和最大似然校正的偏差调整三步方法,以及较新的两步方法。这些伪似然估计器在本文中以特定的期望最大化子程序的统一框架呈现。为了促进和推广它们在数据科学社区中的采用,StepMix遵循scikit-learn库的面向对象设计,并提供额外的R封装。

更新时间: 2024-06-17 02:26:30

领域: stat.ME,cs.LG,stat.ML

下载: http://arxiv.org/abs/2304.03853v6

Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG

Vulnerability detection is essential for software quality assurance. In recent years, deep learning models (especially large language models) have shown promise in vulnerability detection. In this work, we propose a novel LLM-based vulnerability detection technique Vul-RAG, which leverages knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerability for the given code in three phases. First, Vul-RAG constructs a vulnerability knowledge base by extracting multi-dimension knowledge via LLMs from existing CVE instances; second, for a given code snippet, Vul-RAG retrieves the relevant vulnerability knowledge from the constructed knowledge base based on functional semantics; third, Vul-RAG leverages LLMs to check the vulnerability of the given code snippet by reasoning the presence of vulnerability causes and fixing solutions of the retrieved vulnerability knowledge. Our evaluation of Vul-RAG on our constructed benchmark PairVul shows that Vul-RAG substantially outperforms all baselines by 12.96%/110% relative improvement in accuracy/pairwise-accuracy. In addition, our user study shows that the vulnerability knowledge generated by Vul-RAG can serve as high-quality explanations which can improve the manual detection accuracy from 0.60 to 0.77.

Updated: 2024-06-17 02:25:45

标题: Vul-RAG: 通过知识级RAG增强基于LLM的漏洞检测

摘要: 漏洞检测对软件质量保证至关重要。近年来,深度学习模型(特别是大型语言模型)在漏洞检测方面展现出了潜力。在这项工作中,我们提出了一种基于LLM的漏洞检测技术Vul-RAG,该技术利用知识级检索增强生成(RAG)框架,通过三个阶段检测给定代码的漏洞。首先,Vul-RAG通过从现有CVE实例中利用LLMs提取多维知识来构建漏洞知识库;其次,对于给定的代码片段,Vul-RAG根据功能语义从构建的知识库中检索相关漏洞知识;最后,Vul-RAG利用LLMs通过推理检查给定代码片段的漏洞,判断检索到的漏洞知识中漏洞原因和修复解决方案的存在。我们在构建的基准测试PairVul上评估了Vul-RAG,结果显示Vul-RAG在准确率/成对准确率方面相对改善了12.96%/110%,明显优于所有基准线。此外,我们的用户研究表明,Vul-RAG生成的漏洞知识可以作为高质量解释,可以将人工检测准确率从0.60提高到0.77。

更新时间: 2024-06-17 02:25:45

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2406.11147v1

README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP

The advancement in healthcare has shifted focus toward patient-centric approaches, particularly in self-care and patient education, facilitated by access to Electronic Health Records (EHR). However, medical jargon in EHRs poses significant challenges in patient comprehension. To address this, we introduce a new task of automatically generating lay definitions, aiming to simplify complex medical terms into patient-friendly lay language. We first created the README dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions, each offering context-aware lay definitions manually annotated by domain experts. We have also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality. We then used README as the training data for models and leveraged a Retrieval-Augmented Generation method to reduce hallucinations and improve the quality of model outputs. Our extensive automatic and human evaluations demonstrate that open-source mobile-friendly models, when fine-tuned with high-quality data, are capable of matching or even surpassing the performance of state-of-the-art closed-source large language models like ChatGPT. This research represents a significant stride in closing the knowledge gap in patient education and advancing patient-centric healthcare solutions.

Updated: 2024-06-17 02:12:24

标题: README:通过数据中心的自然语言处理技术,弥合医学行话和患者教育的理解差距

摘要: 在医疗保健领域的进步已经转向以患者为中心的方法,特别是在自我护理和患者教育方面,通过对电子健康记录(EHR)的访问来实现。然而,在EHR中的医学术语给患者的理解带来了重大挑战。为了解决这个问题,我们引入了一个新任务,即自动生成通俗定义,旨在将复杂的医学术语简化为患者友好的通俗语言。我们首先创建了README数据集,这是一个包含超过50,000个唯一(医学术语,通俗定义)对和300,000个提及的广泛收集,每个都提供了由领域专家手动注释的上下文感知的通俗定义。我们还设计了一个数据中心的人工智能流水线,通过数据过滤、增强和选择来提高数据质量。然后,我们使用README作为模型的训练数据,并利用检索增强生成方法来减少幻觉并提高模型输出的质量。我们进行了广泛的自动和人工评估,结果表明,当使用高质量数据进行微调时,开源移动友好模型能够匹配甚至超越像ChatGPT这样的最先进的封闭源大型语言模型的性能。这项研究代表了在患者教育领域缩小知识差距和推动以患者为中心的医疗保健解决方案的重大进步。

更新时间: 2024-06-17 02:12:24

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2312.15561v3

Scorecards for Synthetic Medical Data Evaluation and Reporting

The growing utilization of synthetic medical data (SMD) in training and testing AI-driven tools in healthcare necessitates a systematic framework for assessing SMD quality. The current lack of a standardized methodology to evaluate SMD, particularly in terms of its applicability in various medical scenarios, is a significant hindrance to its broader acceptance and utilization in healthcare applications. Here, we outline an evaluation framework designed to meet the unique requirements of medical applications, and introduce the concept of SMD scorecards, which can serve as comprehensive reports that accompany artificially generated datasets. This can help standardize evaluation and enable SMD developers to assess and further enhance the quality of SMDs by identifying areas in need of attention and ensuring that the synthetic data more accurately approximate patient data.

Updated: 2024-06-17 02:11:59

标题: 合成医疗数据评估和报告的评分卡

摘要: 合成医疗数据(SMD)在医疗领域人工智能工具的训练和测试中的应用日益增多,因此需要一个系统性框架来评估SMD的质量。目前缺乏一种标准化方法来评估SMD,特别是在各种医疗场景中的适用性方面,这对于SMD在医疗应用中更广泛接受和利用构成了重大障碍。在这里,我们概述了一个旨在满足医疗应用的独特要求的评估框架,并引入了SMD评分卡的概念,它可以作为伴随人工生成数据集的全面报告。这可以帮助标准化评估,并使SMD开发人员能够通过识别需要关注的领域并确保合成数据更准确地近似患者数据来评估和进一步增强SMD的质量。

更新时间: 2024-06-17 02:11:59

领域: cs.AI,cs.CY,cs.DB

下载: http://arxiv.org/abs/2406.11143v1

Lifelong and Continual Learning Dialogue Systems

Dialogue systems, commonly known as chatbots, have gained escalating popularity in recent times due to their widespread applications in carrying out chit-chat conversations with users and task-oriented dialogues to accomplish various user tasks. Existing chatbots are usually trained from pre-collected and manually-labeled data and/or written with handcrafted rules. Many also use manually-compiled knowledge bases (KBs). Their ability to understand natural language is still limited, and they tend to produce many errors resulting in poor user satisfaction. Typically, they need to be constantly improved by engineers with more labeled data and more manually compiled knowledge. This book introduces the new paradigm of lifelong learning dialogue systems to endow chatbots the ability to learn continually by themselves through their own self-initiated interactions with their users and working environments to improve themselves. As the systems chat more and more with users or learn more and more from external sources, they become more and more knowledgeable and better and better at conversing. The book presents the latest developments and techniques for building such continual learning dialogue systems that continuously learn new language expressions and lexical and factual knowledge during conversation from users and off conversation from external sources, acquire new training examples during conversation, and learn conversational skills. Apart from these general topics, existing works on continual learning of some specific aspects of dialogue systems are also surveyed. The book concludes with a discussion of open challenges for future research.

Updated: 2024-06-17 02:10:48

标题: 终身学习和持续学习对话系统

摘要: 对话系统,通常被称为聊天机器人,由于其在与用户进行闲聊对话和执行各种用户任务的任务导向对话方面的广泛应用而在最近变得越来越受欢迎。现有的聊天机器人通常是从预先收集和手工标记的数据中训练的,或者是使用手工制定的规则编写的。许多还使用手工编制的知识库。它们理解自然语言的能力仍然有限,往往会产生许多错误,导致用户满意度低。通常情况下,它们需要通过工程师不断改进,提供更多标记的数据和更多手工编制的知识。本书介绍了终身学习对话系统的新范式,赋予聊天机器人通过自己与用户和工作环境的自发互动来持续学习的能力,以改进自身。随着系统与用户进行更多对话或从外部来源学习更多知识,它们变得越来越知识渊博,越来越擅长交谈。本书介绍了构建这种持续学习对话系统的最新发展和技术,这些系统在对话中不断学习新的语言表达和词汇和事实知识,从用户和外部来源中获得新的训练示例,并学习会话技巧。除了这些一般主题外,还对一些特定方面的对话系统的持续学习的现有工作进行了调查。本书最后讨论了未来研究的挑战。

更新时间: 2024-06-17 02:10:48

领域: cs.CL,cs.AI,cs.HC,cs.LG

下载: http://arxiv.org/abs/2211.06553v2

Active search for Bifurcations

Bifurcations mark qualitative changes of long-term behavior in dynamical systems and can often signal sudden ("hard") transitions or catastrophic events (divergences). Accurately locating them is critical not just for deeper understanding of observed dynamic behavior, but also for designing efficient interventions. When the dynamical system at hand is complex, possibly noisy, and expensive to sample, standard (e.g. continuation based) numerical methods may become impractical. We propose an active learning framework, where Bayesian Optimization is leveraged to discover saddle-node or Hopf bifurcations, from a judiciously chosen small number of vector field observations. Such an approach becomes especially attractive in systems whose state $\times$ parameter space exploration is resource-limited. It also naturally provides a framework for uncertainty quantification (aleatoric and epistemic), useful in systems with inherent stochasticity.
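
A toy version of the idea, on the normal form f(x, p) = p + x^2 whose saddle-node sits at p = 0: a GP surrogate of the vector field is sampled where the posterior is uncertain, plausibly zero, and flat in x (the fold condition). The acquisition rule here is my own simple construction, not the paper's algorithm.

# Actively locating a saddle-node bifurcation with a GP surrogate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x, p: p + x**2                       # ground-truth vector field
rng = np.random.default_rng(1)
pts = rng.uniform(-1, 1, size=(5, 2))           # initial (x, p) samples
vals = f(pts[:, 0], pts[:, 1])

grid = np.stack(np.meshgrid(np.linspace(-1, 1, 40),
                            np.linspace(-1, 1, 40)), -1).reshape(-1, 2)
for _ in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(0.5), alpha=1e-6).fit(pts, vals)
    mu, sd = gp.predict(grid, return_std=True)
    dmu_dx = np.gradient(mu.reshape(40, 40), axis=1).ravel()
    # Acquisition: high uncertainty, f plausibly zero, flat in x.
    acq = sd * np.exp(-0.5 * (mu / (sd + 1e-9)) ** 2) * np.exp(-10 * np.abs(dmu_dx))
    nxt = grid[np.argmax(acq)]
    pts = np.vstack([pts, nxt])
    vals = np.append(vals, f(nxt[0], nxt[1]))

# Later samples should tend to cluster near the fold at (x, p) = (0, 0).
print(np.round(pts[-5:], 3))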

Updated: 2024-06-17 02:01:17

标题: 主动搜索分岔点

摘要: 分岔标志着动力系统长期行为的质变,常常预示着突然的("硬")转变或灾难性事件(发散)。准确地定位分岔不仅对于更深入地理解所观察到的动态行为至关重要,而且对于设计高效的干预措施也至关重要。当所研究的动力系统复杂、可能带有噪声且采样昂贵时,标准的(例如基于延拓的)数值方法可能变得不切实际。我们提出了一个主动学习框架,利用贝叶斯优化,从精心选择的少量矢量场观测中发现鞍结分岔或Hopf分岔。这种方法在状态$\times$参数空间的探索受到资源限制的系统中尤其具有吸引力。它还自然地提供了一个不确定性量化(偶然性和认知性)的框架,这对于具有内在随机性的系统非常有用。

更新时间: 2024-06-17 02:01:17

领域: cs.LG,nlin.CD,37M20

下载: http://arxiv.org/abs/2406.11141v1

Diffusion Models in Low-Level Vision: A Survey

Deep generative models have garnered significant attention in low-level vision tasks due to their generative capabilities. Among them, diffusion model-based solutions, characterized by a forward diffusion process and a reverse denoising process, have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. This ensures the generation of visually compelling results with intricate texture information. Despite their remarkable success, a noticeable gap exists in a comprehensive survey that amalgamates these pioneering diffusion model-based works and organizes the corresponding threads. This paper proposes the comprehensive review of diffusion model-based techniques. We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models, establishing the theoretical foundation. Following this, we introduce a multi-perspective categorization of diffusion models, considering both the underlying framework and the target task. Additionally, we summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios. Moreover, we provide an overview of commonly used benchmarks and evaluation metrics. We conduct a thorough evaluation, encompassing both performance and efficiency, of diffusion model-based techniques in three prominent tasks. Finally, we elucidate the limitations of current diffusion models and propose seven intriguing directions for future research. This comprehensive examination aims to facilitate a profound understanding of the landscape surrounding denoising diffusion models in the context of low-level vision tasks. A curated list of diffusion model-based techniques in over 20 low-level vision tasks can be found at https://github.com/ChunmingHe/awesome-diffusion-models-in-low-level-vision.
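
For readers new to the area, the forward noising process and one reverse denoising step that the surveyed methods build on look as follows (standard DDPM equations, not specific to any one paper; the "oracle" noise prediction merely keeps the demo self-contained).

# Forward process: q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I).
# Reverse process: one ancestral sampling step given a noise prediction.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def forward_diffuse(x0, t):
    eps = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
    return xt, eps

def reverse_step(xt, t, eps_pred):
    coef = betas[t] / (1 - alpha_bar[t]).sqrt()
    mean = (xt - coef * eps_pred) / alphas[t].sqrt()
    noise = torch.randn_like(xt) if t > 0 else torch.zeros_like(xt)
    return mean + betas[t].sqrt() * noise

x0 = torch.randn(1, 3, 8, 8)                    # stand-in for a clean image
xt, eps = forward_diffuse(x0, t=500)
x_prev = reverse_step(xt, t=500, eps_pred=eps)  # oracle noise for the demo
print(xt.shape, x_prev.shape)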

Updated: 2024-06-17 01:49:27

标题: 低层视觉中的扩散模型:综述

摘要: 深度生成模型因其生成能力而在低层视觉任务中引起了重大关注。其中,以前向扩散过程和反向去噪过程为特征的扩散模型解决方案因其能够产生质量和多样性优异的样本而广受好评。这确保了生成具有复杂纹理信息的视觉吸引力结果。尽管它们取得了显著的成功,但存在一个明显的差距,即缺乏一个综合调查,将这些开创性的基于扩散模型的作品汇集起来并组织相应的线索。本文提出了对基于扩散模型的技术的全面回顾。我们提出了三个通用的扩散建模框架,并探讨它们与其他深度生成模型的相关性,建立理论基础。在此基础上,我们介绍了对扩散模型的多角度分类,考虑了潜在的框架和目标任务。此外,我们总结了应用于其他任务的扩散模型,包括医学、遥感和视频场景。此外,我们提供了常用基准和评估指标的概述。我们对基于扩散模型的技术在三个突出任务中的性能和效率进行了彻底评估。最后,我们阐明了当前扩散模型的局限性,并提出了未来研究的七个有趣方向。这一全面审查旨在促进对低层视觉任务背景下去噪扩散模型整体格局的深刻理解。在https://github.com/ChunmingHe/awesome-diffusion-models-in-low-level-vision 上可以找到超过20个低层视觉任务中基于扩散模型的技术的精选列表。

更新时间: 2024-06-17 01:49:27

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.11138v1

The Prompt Report: A Systematic Survey of Prompting Techniques

Generative Artificial Intelligence (GenAI) systems are being increasingly deployed across all parts of industry and research settings. Developers and end users interact with these systems through the use of prompting or prompt engineering. While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area's nascency. This paper establishes a structured understanding of prompts, by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting.

Updated: 2024-06-17 01:28:09

标题: 《提示报告:提示技术的系统调查》

摘要: 生成人工智能(GenAI)系统正越来越多地被部署在各行业和研究领域。开发人员和最终用户通过提示或提示工程与这些系统进行交互。虽然提示是一个广泛研究的概念,但由于该领域的新生,存在着术语上的冲突和对提示构成的本体论理解不足。本文通过组装提示技术的分类法并分析其使用,建立了对提示的结构化理解。我们提出了一个包含33个词汇术语的全面词汇表,一个包含58种仅文本提示技术的分类法,以及其他模态的40种技术。我们进一步对自然语言前缀提示的整个文献进行了元分析。

更新时间: 2024-06-17 01:28:09

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.06608v2

Towards Understanding Emotions for Engaged Mental Health Conversations

Providing timely support and intervention is crucial in mental health settings. As the need to engage youth comfortable with texting increases, mental health providers are exploring and adopting text-based media such as chatbots, community-based forums, online therapies with licensed professionals, and helplines operated by trained responders. To support these text-based media for mental health--particularly for crisis care--we are developing a system to perform passive emotion-sensing using a combination of keystroke dynamics and sentiment analysis. Our early studies of this system suggest that the analysis of short text messages and keyboard typing patterns can provide emotion information that may be used to support both clients and responders. We use our preliminary findings to discuss the way forward for applying AI to support mental health providers in providing better care.

Updated: 2024-06-17 01:27:15

标题: 朝向理解情绪,促进融洽的心理健康对话

摘要: 提供及时支持和干预在心理健康环境中至关重要。随着需要吸引习惯于发送短信的青少年的增加,心理健康提供者正在探索和采用基于文本的媒体,如聊天机器人、社区论坛、在线带有持牌专业人员的疗法,以及由训练有素的响应者操作的热线。为了支持这些基于文本的心理健康媒体--特别是危机护理--我们正在开发一个系统,通过键击动态和情感分析的结合来执行被动情感感知。我们早期对该系统的研究认为,对短文本消息和键盘输入模式的分析可以提供情感信息,这些信息可以用来支持客户和响应者。我们利用初步发现来讨论将人工智能应用于支持心理健康提供者提供更好关怀的未来方向。

更新时间: 2024-06-17 01:27:15

领域: cs.HC,cs.AI,H.5.2; I.2.7

下载: http://arxiv.org/abs/2406.11135v1

RePrompt: Planning by Automatic Prompt Engineering for Large Language Models Agents

In the past year, large language models (LLMs) have had remarkable success in domains outside traditional natural language processing, and people are starting to explore the usage of LLMs in more general, application-oriented domains like code generation, travel planning, and robot control. Connecting these highly capable LLMs with external tools, people are building so-called LLM agents, which are supposed to help people do all kinds of work in everyday life. In all these domains, the prompt to the LLMs has been shown to make a big difference in what the LLM generates and thus affect the performance of the LLM agents. Therefore, automatic prompt engineering has become an important question for many researchers and users of LLMs. In this paper, we propose a novel method, RePrompt, which performs "gradient descent" to optimize the step-by-step instructions in the prompt of LLM agents based on the chat history obtained from interactions with the LLM agents. By optimizing the prompt, the LLM will learn how to plan in specific domains. Our experiments on PDDL generation and travel planning show that our method can generally improve the performance of different reasoning tasks when the updated prompt is used as the initial prompt.
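
The optimization loop can be caricatured as below. `llm` and `run_agent` are hypothetical callables, and the critique-and-rewrite update is a simplification of the paper's learned update, shown only to make the "gradient descent on instructions" analogy concrete.

# Schematic prompt optimization from interaction history (a sketch, not
# RePrompt's actual update rule).
def optimize_prompt(prompt, tasks, llm, run_agent, rounds=5):
    for _ in range(rounds):
        # Collect chat histories by running the agent with the current prompt.
        histories = [run_agent(prompt, task) for task in tasks]
        # "Gradient": summarize recurring failures caused by the prompt.
        critique = llm(
            "Here are agent interaction logs:\n"
            + "\n---\n".join(histories)
            + "\nList the recurring planning mistakes caused by the prompt."
        )
        # "Step": rewrite the step-by-step instructions to fix those failures.
        prompt = llm(
            f"Current prompt:\n{prompt}\nKnown issues:\n{critique}\n"
            "Rewrite the step-by-step instructions to fix these issues. "
            "Return only the new prompt."
        )
    return prompt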

Updated: 2024-06-17 01:23:11

标题: RePrompt: 大型语言模型代理的自动提示工程规划

摘要: 在过去的一年中,大型语言模型(LLMs)在传统自然语言处理领域之外取得了显著的成功,人们开始探索在更普遍和接近应用领域,如代码生成、旅行规划和机器人控制中使用LLMs。通过将这些能力强大的LLMs与外部工具连接起来,人们正在构建所谓的LLM代理,这些代理应该帮助人们完成日常生活中的各种工作。在所有这些领域中,LLMs的提示已经被证明对LLM生成的内容产生重大影响,从而影响LLM代理的性能。因此,自动提示工程已经成为许多LLM研究人员和用户的重要问题。在本文中,我们提出了一种新颖的方法RePrompt,该方法对LLM代理的提示中的逐步指令进行"梯度下降"优化,该指令是基于与LLM代理的交互中获得的聊天历史。通过优化提示,LLM将学习如何在特定领域进行规划。我们已经在PDDL生成和旅行规划方面进行了实验,展示了我们的方法可以在使用更新后的提示作为初始提示时,通常提高不同推理任务的性能。

更新时间: 2024-06-17 01:23:11

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11132v1

Are Large Language Models a Good Replacement of Taxonomies?

Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether traditional knowledge graphs should be replaced by LLMs. In this paper, we ask whether the schema of knowledge graphs (i.e., taxonomies) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are common to people. Unfortunately, there is no comprehensive benchmark that evaluates LLMs over a wide range of taxonomies, from common to specialized domains, and at levels from root to leaf, so that we can draw a confident conclusion. To narrow this research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains, with in-depth experiments on entities at different levels of each taxonomy, from root to leaf. Our comprehensive experiments on eighteen state-of-the-art LLMs under three prompting settings validate that LLMs still cannot capture the knowledge of specialized taxonomies and leaf-level entities well.

Updated: 2024-06-17 01:21:50

标题: 大型语言模型是否能很好地取代分类法?

摘要: 大型语言模型(LLMs)展示了内化知识并回答自然语言问题的令人印象深刻的能力。尽管先前的研究验证了LLMs在一般知识上表现良好,但在长尾细微知识上表现不佳,社区仍然怀疑传统知识图谱是否应该被LLMs取代。在本文中,我们探讨知识图谱的模式(即分类法)是否已被LLMs淘汰。直觉上,LLMs应该在常见的分类法以及人们熟悉的分类层级上表现良好。不幸的是,目前缺乏一个全面的基准,能够在从常见到专业领域的各种分类法上、以及从根到叶的不同层级上评估LLMs,使我们能够得出有把握的结论。为了缩小这一研究差距,我们构建了一个名为TaxoGlimpse的新型分类法层次结构发现基准,以评估LLMs在分类法上的表现。TaxoGlimpse涵盖了从常见到专业领域的十个代表性分类法,并对每个分类法中从根到叶不同层级的实体进行了深入实验。我们在三种提示设置下对十八个最先进的LLMs进行的全面实验验证了LLMs仍然无法很好地掌握专业分类法和叶级实体的知识。

更新时间: 2024-06-17 01:21:50

领域: cs.CL,cs.AI,cs.DB

下载: http://arxiv.org/abs/2406.11131v1

Model Adaptation for Time Constrained Embodied Control

When adopting a deep learning model for embodied agents, it is required that the model structure be optimized for specific tasks and operational conditions. Such optimization can be static such as model compression or dynamic such as adaptive inference. Yet, these techniques have not been fully investigated for embodied control systems subject to time constraints, which necessitate sequential decision-making for multiple tasks, each with distinct inference latency limitations. In this paper, we present MoDeC, a time constraint-aware embodied control framework using the modular model adaptation. We formulate model adaptation to varying operational conditions on resource and time restrictions as dynamic routing on a modular network, incorporating these conditions as part of multi-task objectives. Our evaluation across several vision-based embodied environments demonstrates the robustness of MoDeC, showing that it outperforms other model adaptation methods in both performance and adherence to time constraints in robotic manipulation and autonomous driving applications.

Updated: 2024-06-17 01:07:30

标题: 适用于时间受限的体态控制的模型适应

摘要: 在为具身代理采用深度学习模型时,需要对模型结构进行优化,以适应特定任务和操作条件。这种优化可以是静态的,如模型压缩,也可以是动态的,如自适应推理。然而,这些技术尚未完全应用于受时间约束的具身控制系统,这些系统需要为多个任务进行顺序决策,每个任务都具有不同的推理延迟限制。在本文中,我们提出了MoDeC,一个考虑时间约束的具身控制框架,使用模块化模型适应。我们将模型适应到不同的资源和时间限制的操作条件上,将其视为模块化网络上的动态路由,并将这些条件作为多任务目标的一部分。我们在多个基于视觉的具身环境中进行评估,结果显示MoDeC的鲁棒性,表明在机器人操作和自主驾驶应用中,它在性能和对时间约束的遵守方面优于其他模型适应方法。

更新时间: 2024-06-17 01:07:30

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2406.11128v1

Optimizing Automatic Differentiation with Deep Reinforcement Learning

Computing Jacobians with automatic differentiation is ubiquitous in many scientific domains such as machine learning, computational fluid dynamics, robotics and finance. Even small savings in the number of computations or memory usage in Jacobian computations can already incur massive savings in energy consumption and runtime. While there exist many methods that allow for such savings, they generally trade computational efficiency for approximations of the exact Jacobian. In this paper, we present a novel method to optimize the number of necessary multiplications for Jacobian computation by leveraging deep reinforcement learning (RL) and a concept called cross-country elimination while still computing the exact Jacobian. Cross-country elimination is a framework for automatic differentiation that phrases Jacobian accumulation as ordered elimination of all vertices on the computational graph where every elimination incurs a certain computational cost. We formulate the search for the optimal elimination order that minimizes the number of necessary multiplications as a single player game which is played by an RL agent. We demonstrate that this method achieves up to 33% improvements over state-of-the-art methods on several relevant tasks taken from diverse domains. Furthermore, we show that these theoretical gains translate into actual runtime improvements by providing a cross-country elimination interpreter in JAX that can efficiently execute the obtained elimination orders.
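
The elimination framework is easy to state concretely: eliminating a vertex multiplies each incoming partial into each outgoing partial, and the total multiplication count depends on the order. The toy graph below (with illustrative partial-derivative values) shows two orders costing 8 and 4 multiplications.

# Vertex elimination on a linearized computational graph (textbook version
# of the framework the paper builds on). The RL agent in the paper searches
# over elimination orders like the two compared here.
def eliminate(edges, order):
    """edges: dict (src, dst) -> partial derivative; returns (edges, #mults)."""
    edges = dict(edges)
    mults = 0
    for v in order:
        preds = [(s, d) for (s, d) in edges if d == v]
        succs = [(s, d) for (s, d) in edges if s == v]
        for (s, _) in preds:
            for (_, d) in succs:
                # Chain rule: accumulate the product along the path s -> v -> d.
                edges[(s, d)] = edges.get((s, d), 0.0) + edges[(s, v)] * edges[(v, d)]
                mults += 1
        for e in preds + succs:
            del edges[e]
    return edges, mults

# Inputs 1, 2 feed intermediate 3, which feeds 4 and 5, which feed output 6.
edges = {(1, 3): 2.0, (2, 3): 1.0, (3, 4): 0.5, (3, 5): 1.0,
         (4, 6): 1.0, (5, 6): 1.0}
for order in ([3, 4, 5], [5, 4, 3]):
    print(order, "->", eliminate(edges, order)[1], "multiplications")  # 8 vs. 4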

Updated: 2024-06-17 00:54:09

标题: 用深度强化学习优化自动微分

摘要: 使用自动微分计算雅可比矩阵在许多科学领域中十分普遍,如机器学习、计算流体力学、机器人技术和金融领域。即使在雅可比矩阵计算中节省少量计算量或内存使用量,也可能带来能源消耗和运行时间上的巨大节省。虽然存在许多方法可以实现这种节省,但它们通常以近似精确雅可比矩阵为代价换取计算效率。在本文中,我们提出了一种新方法,利用深度强化学习(RL)和一种称为cross-country elimination的概念来优化雅可比矩阵计算所需的乘法次数,同时仍然计算精确的雅可比矩阵。Cross-country elimination是一种自动微分框架,将雅可比矩阵累积表述为对计算图上所有顶点的有序消去,其中每次消去都会产生一定的计算成本。我们将寻找使必要乘法次数最小化的最优消去顺序表述为一个由RL智能体进行的单人游戏。我们展示了这种方法在取自不同领域的若干相关任务上比现有方法取得了高达33%的改进。此外,我们通过在JAX中提供一个可以高效执行所得消去顺序的cross-country elimination解释器,展示了这些理论收益如何转化为实际的运行时间改进。

更新时间: 2024-06-17 00:54:09

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.05027v2

Modeling groundwater levels in California's Central Valley by hierarchical Gaussian process and neural network regression

Modeling groundwater levels continuously across California's Central Valley (CV) hydrological system is challenging due to low-quality well data which is sparsely and noisily sampled across time and space. The lack of consistent well data makes it difficult to evaluate the impact of 2017 and 2019 wet years on CV groundwater following a severe drought during 2012-2015. A novel machine learning method is formulated for modeling groundwater levels by learning from a 3D lithological texture model of the CV aquifer. The proposed formulation performs multivariate regression by combining Gaussian processes (GP) and deep neural networks (DNN). The hierarchical modeling approach constitutes training the DNN to learn a lithologically informed latent space where non-parametric regression with GP is performed. We demonstrate the efficacy of GP-DNN regression for modeling non-stationary features in the well data with fast and reliable uncertainty quantification, as validated to be statistically consistent with the empirical data distribution from 90 blind wells across CV. We show how the model predictions may be used to supplement hydrological understanding of aquifer responses in basins with irregular well data. Our results indicate that on average the 2017 and 2019 wet years in California were largely ineffective in replenishing the groundwater loss caused during previous drought years.
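
The generic recipe of hierarchical DNN-then-GP regression can be sketched on synthetic data as below; the toy inputs, network sizes, and kernels are assumptions, not the paper's model of the CV aquifer.

# Step 1: train a small DNN and keep its penultimate layer as the latent map.
# Step 2: run non-parametric GP regression in that latent space, which also
# yields predictive uncertainty for blind-well-style validation.
import numpy as np
import torch
import torch.nn as nn
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 3)).astype(np.float32)  # toy "lithology"
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)

net = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 8), nn.Tanh())
head = nn.Linear(8, 1)
opt = torch.optim.Adam([*net.parameters(), *head.parameters()], lr=1e-2)
Xt, yt = torch.from_numpy(X), torch.from_numpy(y.astype(np.float32))
for _ in range(300):
    loss = ((head(net(Xt)).squeeze() - yt) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

Z = net(Xt).detach().numpy()                       # learned latent features
gp = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(0.01)).fit(Z, y)
mu, sd = gp.predict(Z[:5], return_std=True)        # mean + uncertainty
print(np.round(mu, 2), np.round(sd, 3))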

Updated: 2024-06-17 00:51:22

标题: 加利福尼亚中央谷地地下水位建模:层次高斯过程和神经网络回归

摘要: 在加利福尼亚州中央谷(CV)水文系统中持续建模地下水位是具有挑战性的,因为低质量的井数据在时间和空间上稀疏且带有噪声。缺乏一致的井数据使得评估2017年和2019年湿润年份对CV地下水的影响变得困难,尤其是在2012-2015年严重干旱之后。提出了一种新颖的机器学习方法,通过学习CV含水层的3D岩性纹理模型来建模地下水位。该提议的公式通过结合高斯过程(GP)和深度神经网络(DNN)执行多元回归。层次建模方法包括训练DNN学习一个岩性信息的潜在空间,其中执行与GP的非参数回归。我们展示了GP-DNN回归在模拟井数据中的非平稳特征时具有快速可靠的不确定性量化的功效,并验证了它与CV范围内90口盲井的经验数据分布在统计上是一致的。我们展示了如何使用模型预测来补充对具有不规则井数据的盆地中含水层响应的水文理解。我们的结果表明,2017年和2019年加利福尼亚州的湿润年份平均来说在补充之前干旱年份造成的地下水流失方面效果不佳。

更新时间: 2024-06-17 00:51:22

领域: physics.geo-ph,cs.LG

下载: http://arxiv.org/abs/2310.14555v2

Incentivizing Quality Text Generation via Statistical Contracts

While the success of large language models (LLMs) increases demand for machine-generated text, current pay-per-token pricing schemes create a misalignment of incentives known in economics as moral hazard: Text-generating agents have strong incentive to cut costs by preferring a cheaper model over the cutting-edge one, and this can be done "behind the scenes" since the agent performs inference internally. In this work, we approach this issue from an economic perspective, by proposing a pay-for-performance, contract-based framework for incentivizing quality. We study a principal-agent game where the agent generates text using costly inference, and the contract determines the principal's payment for the text according to an automated quality evaluation. Since standard contract theory is inapplicable when internal inference costs are unknown, we introduce cost-robust contracts. As our main theoretical contribution, we characterize optimal cost-robust contracts through a direct correspondence to optimal composite hypothesis tests from statistics, generalizing a result of Saig et al. (NeurIPS'23). We evaluate our framework empirically by deriving contracts for a range of objectives and LLM evaluation benchmarks, and find that cost-robust contracts sacrifice only a marginal increase in objective value compared to their cost-aware counterparts.

Updated: 2024-06-17 00:30:58

标题: 通过统计合同激励优质文本生成

摘要: 随着大型语言模型(LLMs)的成功增加了对机器生成文本的需求,当前的按标记付费定价方案造成了经济学中所称的道德风险的利益错配:生成文本的代理有强烈的动机通过偏好更便宜的模型而不是尖端模型来降低成本,而这可以在“幕后”完成,因为代理在内部执行推理。在这项工作中,我们从经济学的角度来解决这个问题,提出了一种基于绩效付费的合同框架来激励质量。我们研究了一个委托代理博弈,其中代理通过昂贵的推理生成文本,合同根据自动质量评估确定委托人对文本的付款。由于标准合同理论在内部推理成本未知时不适用,我们引入了成本稳健的合同。作为我们的主要理论贡献,我们通过直接对应统计学中最佳复合假设检验的结果对最佳成本稳健合同进行了表征,从而推广了Saig等人的结果(NeurIPS'23)。我们通过为一系列目标和LLM评估基准推导合同来实证评估我们的框架,并发现成本稳健合同与其成本感知对应物相比只牺牲了一点点目标价值的增加。

更新时间: 2024-06-17 00:30:58

领域: cs.GT,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11118v1

How Neural Networks Learn the Support is an Implicit Regularization Effect of SGD

We investigate the ability of deep neural networks to identify the support of the target function. Our findings reveal that mini-batch SGD effectively learns the support in the first layer of the network by shrinking to zero the weights associated with irrelevant components of input. In contrast, we demonstrate that while vanilla GD also approximates the target function, it requires an explicit regularization term to learn the support in the first layer. We prove that this property of mini-batch SGD is due to a second-order implicit regularization effect which is proportional to $\eta / b$ (step size / batch size). Our results not only provide further proof that implicit regularization has a significant impact on training optimization dynamics, but they also shed light on the structure of the features that are learned by the network. Additionally, they suggest that smaller batches enhance feature interpretability and reduce dependency on initialization.
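
A small experiment in the spirit of this claim (my own toy setup, not the paper's): the target depends only on the first input coordinate, and one can watch the first-layer weight columns for the other coordinates shrink under small-batch SGD.

# Toy check: with a target supported on coordinate 0, the first-layer
# columns for irrelevant coordinates should shrink under mini-batch SGD
# (the implicit regularization scales like eta/b, so a small batch helps).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 10)
y = torch.sin(3 * X[:, 0:1])               # support = coordinate 0 only

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)
for step in range(4000):
    idx = torch.randint(0, 2048, (8,))     # batch size b = 8
    loss = ((net(X[idx]) - y[idx]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

W = net[0].weight.detach()
print("column norm, relevant coord :", W[:, 0].norm().item())
print("mean column norm, irrelevant:", W[:, 1:].norm(dim=0).mean().item())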

Updated: 2024-06-17 00:19:16

标题: 神经网络学习支持集的方式是SGD的一种隐式正则化效应

摘要: 我们研究了深度神经网络识别目标函数支持集的能力。我们的研究发现,小批量SGD通过将与输入的无关分量相关联的权重收缩为零,在网络第一层有效地学习了支持集。相比之下,我们证明了虽然普通的梯度下降也能逼近目标函数,但它需要一个显式的正则化项才能在第一层学习支持集。我们证明了小批量SGD的这一性质源于一个与 $\eta / b$(步长/批大小)成正比的二阶隐式正则化效应。我们的结果不仅再次证明了隐式正则化对训练优化动态有重要影响,还揭示了网络所学特征的结构。此外,结果表明,较小的批量有助于增强特征的可解释性并减少对初始化的依赖。

更新时间: 2024-06-17 00:19:16

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2406.11110v1

Investigating Annotator Bias in Large Language Models for Hate Speech Detection

Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs), like ChatGPT presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs as annotators, this paper delves into the biases present in LLMs, specifically GPT-3.5 and GPT-4o, when annotating hate speech data. Our research contributes to understanding biases in four key categories: gender, race, religion, and disability. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateSpeechCorpus, to conduct this research. Additionally, we perform the same experiments on the ETHOS (Mollas et al., 2022) dataset for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for data annotation, thereby fostering advancements in this critical field. The HateSpeechCorpus dataset is available here: https://github.com/AmitDasRup123/HateSpeechCorpus

Updated: 2024-06-17 00:18:31

标题: 调查大型语言模型中标注者偏见对仇恨言论检测的影响

摘要: 数据标注,即为原始数据分配描述性标签的实践,对于优化机器学习模型的性能至关重要。然而,这是一个资源密集型的过程,容易受到标注者引入的偏见影响。像ChatGPT这样复杂的大型语言模型(LLMs)的出现,为现代化和简化这一复杂过程提供了独特的机会。虽然现有研究广泛评估了LLMs作为标注者的有效性,但本文深入探讨了LLMs(特别是GPT-3.5和GPT-4o)在标注仇恨言论数据时存在的偏见。我们的研究有助于理解性别、种族、宗教和残疾这四个关键类别中的偏见。我们特别针对这些类别中高度脆弱的群体,分析了标注者偏见。此外,我们通过仔细审查已标注的数据,对可能导致这些偏见的潜在因素进行了全面考察。我们引入了自定义的仇恨言论检测数据集HateSpeechCorpus来开展这项研究。此外,我们也在ETHOS(Mollas等人,2022)数据集上进行了相同的实验以作比较分析。本文作为一项重要资源,指导研究人员和从业者利用LLMs进行数据标注的潜力,从而促进这一关键领域的进展。HateSpeechCorpus数据集可在此处获得:https://github.com/AmitDasRup123/HateSpeechCorpus

更新时间: 2024-06-17 00:18:31

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11109v1

From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models

With the rapid growth of Large Language Models (LLMs), safeguarding textual content against unauthorized use is crucial. Text watermarking offers a vital solution, protecting both LLM-generated and plain-text sources. This paper presents a unified overview of different perspectives behind designing watermarking techniques, through a comprehensive survey of the research literature. Our work has two key advantages: (1) we analyze research based on the specific intentions behind different watermarking techniques, evaluation datasets used, and watermarking addition and removal methods to construct a cohesive taxonomy; (2) we highlight the gaps and open challenges in text watermarking to promote research in protecting text authorship. This extensive coverage and detailed analysis sets our work apart, offering valuable insights into the evolving landscape of text watermarking in language models.
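
As one concrete instance of the generation-time techniques such a taxonomy covers, the sketch below implements detection for a "green list" statistical watermark in the style of Kirchenbauer et al., with a toy hash in place of a real model's logit biasing; it is an illustration, not a method from this survey.

# Green-list watermark detection: pseudorandomly partition the vocabulary
# per previous token, count green tokens, and compute a z-score against the
# binomial null hypothesis of unwatermarked text.
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    # Toy pseudorandom partition seeded by the previous token.
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] / 255.0 < gamma

def watermark_z_score(tokens, gamma=0.5):
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    # Under H0 (no watermark), greens ~ Binomial(n, gamma).
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

text = "the model generates fluent text with a hidden statistical signal".split()
print(round(watermark_z_score(text), 2))
# ~0 for unwatermarked text; a watermarking sampler would bias generation
# toward green tokens, pushing this score far above a detection threshold.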

Updated: 2024-06-17 00:09:31

标题: 从意图到技术:大型语言模型文本水印的全面分类和挑战

摘要: 随着大型语言模型(LLMs)的快速增长,保护文本内容免受未经授权的使用至关重要。文本水印技术提供了一种重要的解决方案,可以保护LLM生成的文本和纯文本来源。本文通过对研究文献的综合调查,提供了设计水印技术背后不同观点的统一概述。我们的工作具有两个关键优势,(1)我们根据不同水印技术背后的具体意图、评估数据集的使用、水印添加和删除方法对研究进行分析,以构建一个连贯的分类体系。 (2)我们突出了文本水印技术中存在的差距和挑战,以促进保护文本作者权益的研究。这种广泛覆盖和详细分析使我们的工作脱颖而出,为了解语言模型中文本水印技术的发展景观提供了宝贵的见解。

更新时间: 2024-06-17 00:09:31

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11106v1

By Xinhai (Sean) Zou.