Arxiv Day: Article

Exploiting Diffusion Prior for Out-of-Distribution Detection

Out-of-distribution (OOD) detection is crucial for deploying robust machine learning models, especially in areas where security is critical. However, traditional OOD detection methods often fail to capture complex data distributions from large-scale data. In this paper, we present a novel approach for OOD detection that leverages the generative ability of diffusion models and the powerful feature extraction capabilities of CLIP. By using these features as conditional inputs to a diffusion model, we can reconstruct the images after encoding them with CLIP. The difference between the original and reconstructed images is used as a signal for OOD identification. Unlike many other methods, ours does not require class-specific labeled in-distribution (ID) data, which makes it more practical and scalable. Extensive experiments on several benchmark datasets demonstrate the robustness and effectiveness of our method, which significantly improves detection accuracy.
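The scoring step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `reconstruct` is a hypothetical stand-in for the CLIP-conditioned diffusion pipeline, and the mean-squared-error distance and quantile-based threshold are choices made only for this example.

```python
from typing import Callable, List, Sequence

Image = Sequence[float]  # a flattened image; real inputs would be tensors


def reconstruction_score(image: Image, reconstruct: Callable[[Image], Image]) -> float:
    """Mean squared error between an image and its reconstruction.

    A higher score means the model failed to reproduce the input,
    which is treated as evidence that the input is OOD.
    """
    recon = reconstruct(image)
    return sum((a - b) ** 2 for a, b in zip(image, recon)) / len(image)


def calibrate_threshold(id_scores: List[float], quantile: float = 0.95) -> float:
    """Pick a threshold so that roughly `quantile` of ID scores fall at or below it."""
    ordered = sorted(id_scores)
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


def is_ood(image: Image, reconstruct: Callable[[Image], Image], threshold: float) -> bool:
    """Flag an input whose reconstruction error exceeds the calibrated threshold."""
    return reconstruction_score(image, reconstruct) > threshold


# Toy stand-in (assumption): a "model" that can only produce pixel values in [0, 1].
def toy_reconstruct(image: Image) -> List[float]:
    return [min(max(v, 0.0), 1.0) for v in image]


id_images = [[0.1, 0.5, 0.9], [0.0, 0.3, 1.0], [0.7, 0.2, 0.4]]
threshold = calibrate_threshold(
    [reconstruction_score(x, toy_reconstruct) for x in id_images]
)
```

The threshold is calibrated on unlabeled ID samples only, which mirrors the abstract's point that no class-specific labels are needed.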

Updated: 2024-06-16 23:55:25

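Title: Exploiting Diffusion Prior for Out-of-Distribution Detection

Abstract: Out-of-distribution (OOD) detection is essential for deploying robust machine learning models, especially in safety-critical domains. However, traditional OOD detection methods often fail to capture the complex distributions of large-scale data. In this paper, we propose a new OOD detection method that exploits the generative ability of diffusion models and the strong feature extraction ability of CLIP. By using these features as the conditional input of a diffusion model, we can reconstruct an image after encoding it with CLIP. The difference between the original and reconstructed images serves as the signal for identifying OOD inputs. Our method is more practical and scalable because, unlike many other methods, it does not require class-specific labeled ID data. Extensive experiments on several benchmark datasets demonstrate the robustness and effectiveness of our method, which significantly improves detection accuracy.

Updated: 2024-06-16 23:55:25

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2406.11105v1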

Speech language models lack important brain-relevant semantics

Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we systematically remove specific low-level stimulus features (textual, speech, and visual) from language model representations to assess their impact on alignment with fMRI brain recordings during reading and listening. Comparing these findings with speech-based language models reveals starkly different effects of low-level features on brain alignment. While text-based models show reduced alignment in early sensory regions post-removal, they retain significant predictive power in late language regions. In contrast, speech-based models maintain strong alignment in early auditory regions even after feature removal but lose all predictive power in late language regions. These results suggest that speech-based models provide insights into additional information processed by early auditory regions, but caution is needed when using them to model processing in late language regions. We make our code publicly available. [https://github.com/subbareddy248/speech-llm-brain]

Updated: 2024-06-16 23:52:21

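Title: Speech language models lack important brain-relevant semantics

Abstract: Although reading and listening are known to differ in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity impressively well. This raises the question of what types of information language models actually predict in the brain. We study this question directly by systematically removing specific low-level stimulus features (textual, speech, and visual) from language model representations and assessing the effect on alignment with fMRI brain recordings during reading and listening. Comparing these findings with speech-based language models reveals markedly different effects of low-level features on brain alignment. Text-based models show reduced alignment in early sensory regions after feature removal, yet they retain significant predictive power in late language regions. In contrast, speech-based models keep strong alignment in early auditory regions even after feature removal but lose all predictive power in late language regions. These results suggest that speech-based models offer insight into the additional information processed by early auditory regions, but that caution is needed when using them to model processing in late language regions. We make our code publicly available. [https://github.com/subbareddy248/speech-llm-brain]

Updated: 2024-06-16 23:52:21

Categories: cs.CL,cs.LG,eess.AS,q-bio.NC

Download: http://arxiv.org/abs/2311.04664v2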

Grading Massive Open Online Courses Using Large Language Models

Massive open online courses (MOOCs) offer free education globally to anyone with a computer and internet access. Despite this democratization of learning, the massive enrollment in these courses makes it impractical for one instructor to assess every student's writing assignment. As a result, peer grading, often guided by a straightforward rubric, is the method of choice. While convenient, peer grading often falls short in terms of reliability and validity. In this study, we explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs. Specifically, we use two LLMs, GPT-4 and GPT-3.5, across three MOOCs: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. To instruct LLMs, we use three different prompts based on the zero-shot chain-of-thought (ZCoT) prompting technique: (1) ZCoT with instructor-provided correct answers, (2) ZCoT with both instructor-provided correct answers and rubrics, and (3) ZCoT with instructor-provided correct answers and LLM-generated rubrics. Tested on 18 settings, our results show that ZCoT, when augmented with instructor-provided correct answers and rubrics, produces grades that are more aligned with those assigned by instructors compared to peer grading. Finally, our findings indicate a promising potential for automated grading systems in MOOCs, especially in subjects with well-defined rubrics, to improve the learning experience for millions of online learners worldwide.

Updated: 2024-06-16 23:42:11

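Title: Grading Massive Open Online Courses Using Large Language Models

Abstract: Massive open online courses (MOOCs) provide free education to anyone in the world with a computer and internet access. Despite this democratization of learning, the massive enrollment in these courses makes it impractical for a single instructor to assess every student's writing assignment. As a result, peer grading, usually guided by a simple rubric, is the method of choice. Although convenient, peer grading often falls short in reliability and validity. In this study, we explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs. Specifically, we use two LLMs, GPT-4 and GPT-3.5, across three MOOCs: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. To instruct the LLMs, we use three different prompts based on the zero-shot chain-of-thought (ZCoT) prompting technique: (1) ZCoT with instructor-provided correct answers, (2) ZCoT with instructor-provided correct answers and rubrics, and (3) ZCoT with instructor-provided correct answers and LLM-generated rubrics. Tested on 18 settings, our results show that ZCoT combined with instructor-provided correct answers and rubrics produces grades that agree more closely with instructor-assigned grades than peer grading does. Finally, our findings indicate that automated grading systems hold great promise in MOOCs, especially in subjects with well-defined rubrics, and can improve the learning experience of millions of online learners worldwide.

Updated: 2024-06-16 23:42:11

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11102v1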

Diffusion World Model: Future Modeling Beyond Step-by-Step Rollout for Offline Reinforcement Learning

We introduce Diffusion World Model (DWM), a conditional diffusion model capable of predicting multistep future states and rewards concurrently. As opposed to traditional one-step dynamics models, DWM offers long-horizon predictions in a single forward pass, eliminating the need for recursive queries. We integrate DWM into model-based value estimation, where the short-term return is simulated by future trajectories sampled from DWM. In the context of offline reinforcement learning, DWM can be viewed as a conservative value regularization through generative modeling. Alternatively, it can be seen as a data source that enables offline Q-learning with synthetic data. Our experiments on the D4RL dataset confirm the robustness of DWM to long-horizon simulation. In terms of absolute performance, DWM significantly surpasses one-step dynamics models with a $44\%$ performance gain, and is comparable to, or slightly surpasses, its model-free counterparts.
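As an illustration of the model-based value estimate mentioned above, the sketch below computes an H-step target from one sampled trajectory: the discounted sum of predicted rewards plus a discounted bootstrap value at the final predicted state. The function name, its inputs, and the bootstrap term are assumptions made for this example; the abstract only states that the short-term return is simulated from trajectories sampled from DWM.

```python
from typing import Sequence


def h_step_value_target(predicted_rewards: Sequence[float],
                        terminal_value: float,
                        gamma: float) -> float:
    """Discounted sum of H predicted rewards plus gamma**H times a bootstrap value.

    `predicted_rewards` stands in for the rewards of one trajectory sampled
    from the world model in a single forward pass (hypothetical input).
    `terminal_value` stands in for a learned value estimate at the last
    predicted state (hypothetical input).
    """
    target = 0.0
    for step, reward in enumerate(predicted_rewards):
        target += (gamma ** step) * reward
    horizon = len(predicted_rewards)
    return target + (gamma ** horizon) * terminal_value


# Three predicted rewards of 1.0, discount 0.5, bootstrap value 8.0:
# 1 + 0.5 + 0.25 + 0.125 * 8 = 2.75
target = h_step_value_target([1.0, 1.0, 1.0], terminal_value=8.0, gamma=0.5)
```

Because all H rewards come from one sampled trajectory, no recursive one-step queries appear in the loop; it only accumulates values that were already predicted.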

Updated: 2024-06-16 23:35:37

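Title: Diffusion World Model: Future Modeling Beyond Step-by-Step Rollout for Offline Reinforcement Learning

Abstract: We introduce the Diffusion World Model (DWM), a conditional diffusion model that can predict multistep future states and rewards simultaneously. Compared with traditional one-step dynamics models, DWM provides long-horizon predictions in a single forward pass, removing the need for recursive queries. We integrate DWM into model-based value estimation, where the short-term return is simulated with future trajectories sampled from DWM. In the context of offline reinforcement learning, DWM can be viewed as conservative value regularization achieved through generative modeling. Alternatively, it can be viewed as a data source that enables offline Q-learning with synthetic data. Our experiments on the D4RL dataset confirm the robustness of DWM to long-horizon simulation. In terms of absolute performance, DWM significantly surpasses one-step dynamics models with a 44% performance gain, and it matches or slightly surpasses its model-free counterparts.

Updated: 2024-06-16 23:35:37

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2402.03570v3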

Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference

Multiparty computation approaches to secure neural network inference commonly rely on garbled circuits for securely executing nonlinear activation functions. However, garbled circuits require excessive communication between server and client, impose significant storage overheads, and incur large runtime penalties. To reduce these costs, we propose an alternative to garbled circuits: Tabula, an algorithm based on secure lookup tables. Our approach precomputes, during an offline phase, lookup tables that contain the results of all possible nonlinear function calls. Because these tables incur exponential storage costs in the number of operands and the precision of the input values, we use quantization to reduce these storage costs and make this approach practical. This enables an online phase where securely computing the result of a nonlinear function requires just a single round of communication, with communication cost equal to twice the number of bits of the input to the nonlinear function. In practice, our approach costs 2 bytes of communication per nonlinear function call in the online phase. Compared to garbled circuits with 8-bit quantized inputs, when computing individual nonlinear functions during the online phase, experiments show Tabula with 8-bit activations uses between $280$-$560 \times$ less communication, is over $100\times$ faster, and uses a comparable (within a factor of 2) amount of storage; compared against other state-of-the-art protocols Tabula achieves greater than $40\times$ communication reduction. This leads to significant performance gains over garbled circuits with quantized inputs during the online phase of secure inference of neural networks: Tabula reduces end-to-end inference communication by up to $9 \times$ and achieves an end-to-end inference speedup of up to $50 \times$, while imposing comparable storage and offline preprocessing costs.
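The idea of a precomputed, secret-shared lookup table can be sketched as below. This is a simplified illustration under assumptions, not the Tabula protocol itself: it uses two parties with additive shares modulo 256, a table shifted by a random mask, and a toy 8-bit ReLU. It omits who generates the offline material and the requirement that each table be used only once. All function names are hypothetical.

```python
import random

BITS = 8
N = 1 << BITS  # 256 possible 8-bit inputs


def relu8(v: int) -> int:
    """Toy nonlinearity: ReLU on a signed 8-bit value stored as 0..255."""
    signed = v - N if v >= N // 2 else v
    return max(signed, 0)


def offline_phase(rng: random.Random):
    """Precompute a shifted table, then split table and shift into additive shares."""
    shift = rng.randrange(N)
    table = [relu8((i - shift) % N) for i in range(N)]
    table0 = [rng.randrange(N) for _ in range(N)]            # party 0's share
    table1 = [(t - t0) % N for t, t0 in zip(table, table0)]  # party 1's share
    shift0 = rng.randrange(N)
    shift1 = (shift - shift0) % N
    return (table0, shift0), (table1, shift1)


def online_phase(x0: int, x1: int, party0, party1):
    """One round: each party sends its masked input share (8 bits each)."""
    table0, shift0 = party0
    table1, shift1 = party1
    msg0 = (x0 + shift0) % N
    msg1 = (x1 + shift1) % N
    index = (msg0 + msg1) % N  # equals (x + shift) mod N, which hides x
    return table0[index], table1[index]  # additive shares of relu8(x)


def secure_relu(x: int, rng: random.Random) -> int:
    """Share x, run both phases, and reconstruct the output for checking."""
    x0 = rng.randrange(N)
    x1 = (x - x0) % N
    party0, party1 = offline_phase(rng)
    y0, y1 = online_phase(x0, x1, party0, party1)
    return (y0 + y1) % N
```

In this sketch the online phase exchanges two 8-bit messages, which is consistent with the 2 bytes per call quoted in the abstract. The table has 256 entries per operand, which shows why quantization is what keeps storage manageable.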

Updated: 2024-06-16 23:24:01

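Title: Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference

Abstract: Multiparty computation approaches to secure neural network inference usually rely on garbled circuits to execute nonlinear activation functions securely. However, garbled circuits require excessive communication between server and client, cause significant storage overhead, and incur large runtime penalties. To lower these costs, we propose an alternative to garbled circuits: Tabula, an algorithm based on secure lookup tables. Our approach precomputes lookup tables in an offline phase; the tables contain the results of all possible nonlinear function calls. Because these tables incur storage costs that are exponential in the number of operands and the precision of the input values, we use quantization to reduce the storage costs and make the approach practical. This enables an online phase in which securely computing the result of a nonlinear function needs only a single round of communication, with a communication cost equal to twice the number of bits of the input to the nonlinear function. In practice, our approach costs 2 bytes of communication per nonlinear function call in the online phase. Compared with garbled circuits with 8-bit quantized inputs, experiments show that, when computing individual nonlinear functions in the online phase, Tabula with 8-bit activations uses $280$-$560 \times$ less communication, is more than $100\times$ faster, and uses a comparable amount of storage (within a factor of 2); compared with other state-of-the-art protocols, Tabula achieves a communication reduction of more than $40\times$. This yields significant performance gains over garbled circuits with quantized inputs in the online phase of secure neural network inference: Tabula reduces end-to-end inference communication by up to $9\times$ and achieves an end-to-end inference speedup of up to $50\times$, while imposing comparable storage and offline preprocessing costs.

Updated: 2024-06-16 23:24:01

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2203.02833v2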

InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language Models

Extractive summarization can produce faithful summaries but often requires additional constraints such as a desired summary length. Traditional sentence compression models do not typically consider the constraints because of their restricted model abilities, which require model modifications for coping with them. To bridge this gap, we propose Instruction-based Compression (InstructCMP), an approach to the sentence compression task that can consider the length constraint through instructions by leveraging the zero-shot task-solving abilities of Large Language Models (LLMs). For this purpose, we created new evaluation datasets by transforming traditional sentence compression datasets into an instruction format. By using the datasets, we first reveal that current LLMs still face challenges in accurately controlling the length of a compressed text. To address this issue, we propose an approach named "length priming," which incorporates additional length information into the instructions without external resources. While length priming works effectively in a zero-shot setting, a training dataset with the instructions would further improve the ability of length control. Thus, we additionally created a training dataset in an instruction format to fine-tune the model on it. Experimental results and analysis show that applying length priming significantly improves the performance of InstructCMP in both zero-shot and fine-tuning settings without the need for any model modifications.
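A length-primed instruction can be pictured as a prompt that states the lengths involved explicitly. The sketch below is a hypothetical illustration: the template wording, the function name, and the choice to state the source length, target length, and number of words to delete are assumptions, not the prompt used in the paper.

```python
def build_length_primed_instruction(sentence: str, target_length: int) -> str:
    """Build a compression instruction that states the source length,
    the target length, and the number of words to delete.

    The wording of the template is an illustrative assumption.
    """
    words = sentence.split()
    source_length = len(words)
    if not 0 < target_length <= source_length:
        raise ValueError("target_length must be between 1 and the source length")
    to_delete = source_length - target_length
    return (
        f"Sentence that consists of {source_length} words:\n{sentence}\n"
        f"Delete {to_delete} less important words so that the compressed "
        f"sentence consists of {target_length} words:"
    )


prompt = build_length_primed_instruction(
    "The quick brown fox jumps over the lazy dog near the river", 6
)
```

All of the added length information is derived from the input sentence itself, which matches the abstract's statement that no external resources are needed.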

Updated: 2024-06-16 23:00:47

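Title: InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language Models

Abstract: Extractive summarization can produce faithful summaries but usually needs additional constraints, such as a desired summary length. Traditional sentence compression models typically do not consider these constraints because their model abilities are limited, and model modifications are required to handle them. To bridge this gap, we propose Instruction-based Compression (InstructCMP), an approach to sentence compression that can take the length constraint into account through instructions by exploiting the zero-shot task-solving abilities of large language models (LLMs). To this end, we converted traditional sentence compression datasets into an instruction format to create new evaluation datasets. Using these datasets, we first find that current LLMs still struggle to control the length of compressed text accurately. To address this issue, we propose a method called "length priming," which incorporates additional length information into the instructions without external resources. Although length priming works effectively in a zero-shot setting, a training dataset with instructions would further improve length control. We therefore also created a training dataset in an instruction format and fine-tuned the model on it. Experimental results and analysis show that applying length priming significantly improves the performance of InstructCMP in both zero-shot and fine-tuning settings, without any model modifications.

Updated: 2024-06-16 23:00:47

Categories: cs.CL,cs.AI,I.2.7

Download: http://arxiv.org/abs/2406.11097v1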

Prompt-based Learning for Text Readability Assessment

We propose a novel adaptation of a pre-trained seq2seq model for readability assessment. We prove that a seq2seq model - T5 or BART - can be adapted to discern which text is more difficult from two given texts (pairwise). As an exploratory study to prompt-learn a neural network for text readability in a text-to-text manner, we report useful tips for future work in seq2seq training and ranking-based approaches to readability assessment. Specifically, we test nine input-output formats/prefixes and show that they can significantly influence the final model performance. Also, we argue that the combination of text-to-text training and pairwise ranking setup 1) enables leveraging multiple parallel text simplification data for teaching readability and 2) trains a neural model for the general concept of readability (therefore, better cross-domain generalization). Finally, we report a 99.6% pairwise classification accuracy on Newsela and 98.7% on OneStopEnglish, through a joint training approach.

Updated: 2024-06-16 22:50:50

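Title: Prompt-based Learning for Text Readability Assessment

Abstract: We propose a novel adaptation of a pre-trained seq2seq model for readability assessment. We prove that a seq2seq model - T5 or BART - can be adapted to tell which of two given texts is harder to understand (pairwise). As an exploratory study of prompt-learning a neural network for text readability in a text-to-text manner, we offer useful tips for future work on seq2seq training and ranking-based approaches to readability assessment. Specifically, we test nine input-output formats/prefixes and show that they can significantly affect the final model performance. In addition, we argue that combining text-to-text training with a pairwise ranking setup 1) makes it possible to use multiple parallel text simplification datasets to teach readability and 2) trains a neural model on the general concept of readability (and therefore yields better cross-domain generalization). Finally, through a joint training approach, we report a pairwise classification accuracy of 99.6% on Newsela and 98.7% on OneStopEnglish.

Updated: 2024-06-16 22:50:50

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2302.13139v2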

Instruction Tuning with Human Curriculum

In this work, we (1) introduce Curriculum Instruction Tuning, (2) explore the potential advantages of employing diverse curriculum strategies, and (3) delineate a synthetic instruction-response generation framework that complements our theoretical approach. Distinct from the existing instruction tuning dataset, our generation pipeline is systematically structured to emulate the sequential and orderly characteristic of human learning. Additionally, we describe a methodology for generating instruction-response datasets that extensively span the various stages of human education, from middle school through the graduate level, utilizing educational subject catalogs. Before training, we meticulously organize the instruction data to ensure that questions escalate in difficulty regarding (A) the subject matter and (B) the intricacy of the instructions. The findings of our study reveal that substantial improvements in performance can be achieved through the mere application of curriculum ordering to instruction data (achieving gains of +4.76 on TruthfulQA, +2.98 on MMLU, +2.8 on OpenbookQA, and +1.28 on ARC-hard) compared to random shuffling. This enhancement is achieved without incurring additional computational expenses. Through comprehensive experimentation, we observe that the advantages of our proposed method are consistently evident across nine benchmarks.
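The ordering step described above can be sketched as a sort over two difficulty keys, with random shuffling as the baseline. The `Example` fields, the integer encodings of education stage and instruction complexity, and the function names are hypothetical; the abstract does not specify how these quantities are represented.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    instruction: str
    stage: int        # hypothetical: 0 = middle school ... 3 = graduate level
    complexity: int   # hypothetical: higher means a more intricate instruction


def curriculum_order(data: List[Example]) -> List[Example]:
    """Order examples from easy to hard: by education stage, then by complexity."""
    return sorted(data, key=lambda ex: (ex.stage, ex.complexity))


def random_order(data: List[Example], seed: int = 0) -> List[Example]:
    """Baseline: the same examples in a shuffled order."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return shuffled


data = [
    Example("Prove a convergence bound", stage=3, complexity=2),
    Example("Define a fraction", stage=0, complexity=0),
    Example("Explain photosynthesis", stage=1, complexity=1),
    Example("Compare two fractions", stage=0, complexity=1),
]
ordered = curriculum_order(data)
```

Both orderings contain exactly the same examples, so training cost is unchanged. Only the sequence in which the model sees the data differs, consistent with the abstract's remark about no additional computational expense.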

Updated: 2024-06-16 22:46:38

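Title: Instruction Tuning with Human Curriculum

Abstract: In this work, we (1) introduce Curriculum Instruction Tuning, (2) explore the potential advantages of using diverse curriculum strategies, and (3) describe a synthetic instruction-response generation framework that complements our theoretical approach. Unlike existing instruction tuning datasets, our generation pipeline is systematically structured to mimic the sequential and orderly nature of human learning. In addition, we describe a method for generating instruction-response datasets that broadly cover the stages of human education, from middle school to the graduate level, using educational subject catalogs. Before training, we carefully organize the instruction data so that questions increase in difficulty with respect to (A) the subject matter and (B) the intricacy of the instructions. Our findings show that merely applying curriculum ordering to instruction data yields substantial performance improvements over random shuffling (gains of +4.76 on TruthfulQA, +2.98 on MMLU, +2.8 on OpenbookQA, and +1.28 on ARC-hard). This improvement comes without additional computational cost. Through comprehensive experiments, we observe that the advantages of our proposed method are consistently evident across nine benchmarks.

Updated: 2024-06-16 22:46:38

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2310.09518v4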

Guaranteed Sampling Flexibility for Low-tubal-rank Tensor Completion

While Bernoulli sampling is extensively studied in tensor completion, t-CUR sampling approximates low-tubal-rank tensors via lateral and horizontal subtensors. However, both methods lack sufficient flexibility for diverse practical applications. To address this, we introduce Tensor Cross-Concentrated Sampling (t-CCS), a novel and straightforward sampling model that advances the matrix cross-concentrated sampling concept within a tensor framework. t-CCS effectively bridges the gap between Bernoulli and t-CUR sampling, offering additional flexibility that can lead to computational savings in various contexts. A key aspect of our work is the comprehensive theoretical analysis provided. We establish a sufficient condition for the successful recovery of a low-rank tensor from its t-CCS samples. In support of this, we also develop a theoretical framework validating the feasibility of t-CUR via uniform random sampling and conduct a detailed theoretical sampling complexity analysis for tensor completion problems utilizing the general Bernoulli sampling model. Moreover, we introduce an efficient non-convex algorithm, the Iterative t-CUR Tensor Completion (ITCURTC) algorithm, specifically designed to tackle the t-CCS-based tensor completion. We have intensively tested and validated the effectiveness of the t-CCS model and the ITCURTC algorithm across both synthetic and real-world datasets.
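One way to picture a cross-concentrated sampling pattern is as Bernoulli sampling restricted to selected horizontal and lateral slices. The sketch below is an assumption-laden illustration of that picture, not the paper's definition: it uses a single sampling rate `p` for both slice families, and the function name and index-set representation are hypothetical.

```python
import random
from typing import List, Set, Tuple


def cross_concentrated_mask(shape: Tuple[int, int, int],
                            horizontal: List[int],
                            lateral: List[int],
                            p: float,
                            rng: random.Random) -> Set[Tuple[int, int, int]]:
    """Return the set of observed index triples (i, j, k).

    Entries are sampled independently with probability p, but only inside
    the chosen horizontal slices T[i, :, :] and lateral slices T[:, j, :].
    """
    n1, n2, n3 = shape
    rows, cols = set(horizontal), set(lateral)
    observed: Set[Tuple[int, int, int]] = set()
    for i in range(n1):
        for j in range(n2):
            if i not in rows and j not in cols:
                continue  # outside the concentrated region: never observed
            for k in range(n3):
                if rng.random() < p:
                    observed.add((i, j, k))
    return observed
```

In this simplified picture the two extremes are easy to see. Selecting every slice reduces the scheme to plain Bernoulli sampling, while setting `p = 1.0` observes the chosen subtensors in full, as in t-CUR.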

Updated: 2024-06-16 22:45:56

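Title: Guaranteed Sampling Flexibility for Low-tubal-rank Tensor Completion

Abstract: While Bernoulli sampling has been studied extensively in tensor completion, t-CUR sampling approximates low-tubal-rank tensors through lateral and horizontal subtensors. However, both methods lack the flexibility needed for diverse practical applications. To address this, we introduce Tensor Cross-Concentrated Sampling (t-CCS), a novel and straightforward sampling model that brings the matrix cross-concentrated sampling concept into the tensor framework. t-CCS effectively bridges the gap between Bernoulli and t-CUR sampling and offers additional flexibility that can save computational cost in various settings. A key aspect of our work is the comprehensive theoretical analysis we provide. We establish a sufficient condition for successfully recovering a low-rank tensor from its t-CCS samples. In support of this, we also develop a theoretical framework that validates the feasibility of t-CUR via uniform random sampling, and we carry out a detailed theoretical sampling complexity analysis for tensor completion problems under the general Bernoulli sampling model. Moreover, we introduce an efficient non-convex algorithm, the Iterative t-CUR Tensor Completion (ITCURTC) algorithm, designed specifically for t-CCS-based tensor completion. We have extensively tested and validated the effectiveness of the t-CCS model and the ITCURTC algorithm on both synthetic and real-world datasets.

Updated: 2024-06-16 22:45:56

Categories: cs.LG,cs.NA,math.NA,stat.ML

Download: http://arxiv.org/abs/2406.11092v1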

Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features

We report two essential improvements in readability assessment: 1. three novel features in advanced semantics and 2. the timely evidence that traditional ML models (e.g. Random Forest, using handcrafted features) can combine with transformers (e.g. RoBERTa) to augment model performance. First, we explore suitable transformers and traditional ML models. Then, we extract 255 handcrafted linguistic features using self-developed extraction software. Finally, we assemble those to create several hybrid models, achieving state-of-the-art (SOTA) accuracy on popular datasets in readability assessment. The use of handcrafted features helps model performance on smaller datasets. Notably, our RoBERTA-RF-T1 hybrid achieves the near-perfect classification accuracy of 99%, a 20.3% increase from the previous SOTA.

Updated: 2024-06-16 22:39:03

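Title: Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features

Abstract: We report two important improvements in readability assessment: 1. three novel features in advanced semantics and 2. timely evidence that traditional ML models (e.g. Random Forest with handcrafted features) can be combined with transformers (e.g. RoBERTa) to boost model performance. First, we explore suitable transformers and traditional ML models. Then, we extract 255 handcrafted linguistic features using extraction software that we developed ourselves. Finally, we combine these to create several hybrid models that achieve state-of-the-art accuracy on popular readability assessment datasets. The use of handcrafted features helps model performance on smaller datasets. Notably, our RoBERTA-RF-T1 hybrid model achieves a near-perfect classification accuracy of 99%, a 20.3% increase over the previous state of the art.

Updated: 2024-06-16 22:39:03

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2109.12258v2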

Breaking Symmetry When Training Transformers

As we show in this paper, the prediction for output token $n+1$ of Transformer architectures without one of the mechanisms of positional encodings and causal attention is invariant to permutations of input tokens $1, 2, ..., n-1$. Usually, both mechanisms are employed and the symmetry with respect to the input tokens is broken. Recently, it has been shown that one can train Transformers without positional encodings. This must be enabled by the causal attention mechanism. In this paper, we elaborate on the argument that the causal connection mechanism must be responsible for the fact that Transformers are able to model input sequences where the order is important. Vertical "slices" of Transformers are all encouraged to represent the same location $k$ in the input sequence. We hypothesize that residual connections contribute to this phenomenon, and demonstrate evidence for this.
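The invariance claim can be checked numerically on a toy model. The sketch below is an illustration under assumptions, not the paper's experiment: it uses two single-head attention layers with residual connections, random weights, and no positional encodings or MLP blocks. It compares the last-position output before and after permuting the earlier tokens. Two layers are used because in a single layer the last position attends to every token with or without the causal mask.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def attention_layer(X, Wq, Wk, Wv, causal):
    """Single-head self-attention with a residual connection, no positional encoding."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    width = len(V[0])
    out = []
    for i, q in enumerate(Q):
        limit = i + 1 if causal else len(X)  # causal: attend to positions <= i only
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(len(q))
                  for j in range(limit)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * V[j][c] for j, w in enumerate(weights)) / total
                 for c in range(width)]
        out.append([x + m for x, m in zip(X[i], mixed)])
    return out


def last_token_output(X, params, causal):
    """Run all layers and return the representation used to predict token n+1."""
    for Wq, Wk, Wv in params:
        X = attention_layer(X, Wq, Wk, Wv, causal)
    return X[-1]


def random_matrix(rng, rows, cols, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n = 4, 5
params = [tuple(random_matrix(rng, d, d, 0.5) for _ in range(3)) for _ in range(2)]
tokens = random_matrix(rng, n, d, 1.0)
permuted = tokens[:-1][::-1] + tokens[-1:]  # reverse every token except the last


def max_diff(causal):
    """Largest change in the last-position output caused by the permutation."""
    a = last_token_output(tokens, params, causal)
    b = last_token_output(permuted, params, causal)
    return max(abs(x - y) for x, y in zip(a, b))
```

Calling `max_diff(False)` should give a value at the level of floating-point rounding, while `max_diff(True)` should be clearly nonzero. The causal mask alone makes the intermediate representations depend on token order, which is the symmetry breaking the abstract attributes to causal attention.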

Updated: 2024-06-16 22:18:36

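Title: Breaking Symmetry When Training Transformers

Abstract: As we show in this paper, the prediction for output token $n+1$ of Transformer architectures without one of the mechanisms of positional encodings and causal attention is invariant to permutations of input tokens $1, 2, ..., n-1$. Usually both mechanisms are used, and the symmetry with respect to the input tokens is broken. It has recently been shown that Transformers can be trained without positional encodings. This must be made possible by the causal attention mechanism. In this paper, we elaborate on the argument that the causal connection mechanism must be responsible for the ability of Transformers to model input sequences in which order matters. Vertical "slices" of a Transformer are all encouraged to represent the same position $k$ in the input sequence. We hypothesize that residual connections contribute to this phenomenon, and we present evidence supporting this view.

Updated: 2024-06-16 22:18:36

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2402.05969v2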

Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions

Despite extensive research on training generative adversarial networks (GANs) with limited training data, learning to generate images from long-tailed training distributions remains fairly unexplored. In the presence of imbalanced multi-class training data, GANs tend to favor classes with more samples, leading to the generation of low-quality and less diverse samples in tail classes. In this study, we aim to improve the training of class-conditional GANs with long-tailed data. We propose a straightforward yet effective method for knowledge sharing, allowing tail classes to borrow from the rich information from classes with more abundant training data. More concretely, we propose modifications to existing class-conditional GAN architectures to ensure that the lower-resolution layers of the generator are trained entirely unconditionally while reserving class-conditional generation for the higher-resolution layers. Experiments on several long-tail benchmarks and GAN architectures demonstrate a significant improvement over existing methods in both the diversity and fidelity of the generated images. The code is available at https://github.com/khorrams/utlo.

Updated: 2024-06-16 22:11:56

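Title: Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions

Abstract: Although training generative adversarial networks (GANs) with limited training data has been studied extensively, learning to generate images from long-tailed training distributions remains relatively unexplored. With imbalanced multi-class training data, GANs tend to favor classes with more samples, which leads to low-quality and less diverse samples for tail classes. In this study, we aim to improve the training of class-conditional GANs on long-tailed data. We propose a simple yet effective knowledge-sharing method that lets tail classes borrow information from classes with richer training data. More concretely, we propose modifications to existing class-conditional GAN architectures so that the lower-resolution layers of the generator are trained entirely unconditionally, while class-conditional generation is reserved for the higher-resolution layers. Experiments on several long-tail benchmarks and GAN architectures show a significant improvement over existing methods in both the diversity and the fidelity of the generated images. The code is available at https://github.com/khorrams/utlo.

Updated: 2024-06-16 22:11:56

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2402.17065v2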

MemDPT: Differential Privacy for Memory Efficient Language Models

Large language models have consistently demonstrated remarkable performance across a wide spectrum of applications. Nonetheless, the deployment of these models can inadvertently expose user privacy to potential risks. The substantial memory demands of these models during training represent a significant resource consumption challenge. The sheer size of these models imposes a considerable burden on memory resources, which is a matter of significant concern in practice. In this paper, we present an innovative training framework MemDPT that not only reduces the memory cost of large language models but also places a strong emphasis on safeguarding user data privacy. MemDPT provides edge network and reverse network designs to accommodate various differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves $2 \sim 3 \times$ memory optimization but also provides robust privacy protection, ensuring that user data remains secure and confidential. Extensive experiments have demonstrated that MemDPT can effectively provide differential privacy efficient fine-tuning across various task scenarios.

Updated: 2024-06-16 22:11:41

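Title: MemDPT: Differential Privacy for Memory Efficient Language Models

Abstract: Large language models have consistently shown remarkable performance across a wide range of applications. However, deploying these models can inadvertently expose user privacy to potential risks. The large memory demands of these models during training are a significant resource consumption challenge. The sheer size of these models places a heavy burden on memory resources, which is a major concern in practice. In this paper, we present MemDPT, an innovative training framework that not only reduces the memory cost of large language models but also places strong emphasis on protecting user data privacy. MemDPT provides edge network and reverse network designs to accommodate various differentially private, memory-efficient fine-tuning schemes. Our approach not only achieves a 2 to 3 times memory optimization but also provides strong privacy protection, ensuring that user data remains secure and confidential. Extensive experiments demonstrate that MemDPT can effectively provide differentially private, efficient fine-tuning across a variety of task scenarios.

Updated: 2024-06-16 22:11:41

Categories: cs.CR,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.11087v1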

Increasing Trust in Language Models through the Reuse of Verified Circuits

Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this standard if built using mathematically and logically specified frameworks. In this paper, we fully verify an auto-regressive transformer model for n-digit integer addition. To exhibit the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction. We find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model. We discuss how inserting verified task modules into LMs can leverage model reuse to improve verifiability and trustworthiness of language models built using them. The reuse of verified circuits reduces the effort to verify more complex composite models which we believe to be a significant step towards safety of language models.

Updated: 2024-06-16 22:09:57

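Title: Increasing Trust in Language Models through the Reuse of Verified Circuits

Abstract: Language models (LMs) are increasingly used for a wide range of prediction tasks, but their training often neglects rare edge cases, which reduces their reliability. Here, we define a stringent standard of trustworthiness under which the task algorithm and circuit implementation must be verified, with edge cases accounted for and no known failure modes. We show that a model can be trained to meet this standard if it is built using mathematically and logically specified frameworks. In this paper, we fully verify an auto-regressive transformer model for n-digit integer addition. To demonstrate the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction. We find that the addition circuits are reused extensively in both tasks, which eases verification of the more complex subtraction model. We discuss how inserting verified task modules into LMs can use model reuse to improve the verifiability and trustworthiness of the language models built with them. The reuse of verified circuits reduces the effort needed to verify more complex composite models, which we believe is an important step towards the safety of language models.

Updated: 2024-06-16 22:09:57

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2402.02619v7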

Current state of LLM Risks and AI Guardrails

Large language models (LLMs) have become increasingly sophisticated, leading to widespread deployment in sensitive applications where safety and reliability are paramount. However, LLMs have inherent risks accompanying them, including bias, potential for unsafe actions, dataset poisoning, lack of explainability, hallucinations, and non-reproducibility. These risks necessitate the development of "guardrails" to align LLMs with desired behaviors and mitigate potential harm. This work explores the risks associated with deploying LLMs and evaluates current approaches to implementing guardrails and model alignment techniques. We examine intrinsic and extrinsic bias evaluation methods and discuss the importance of fairness metrics for responsible AI development. The safety and reliability of agentic LLMs (those capable of real-world actions) are explored, emphasizing the need for testability, fail-safes, and situational awareness. Technical strategies for securing LLMs are presented, including a layered protection model operating at external, secondary, and internal levels. System prompts, Retrieval-Augmented Generation (RAG) architectures, and techniques to minimize bias and protect privacy are highlighted. Effective guardrail design requires a deep understanding of the LLM's intended use case, relevant regulations, and ethical considerations. Striking a balance between competing requirements, such as accuracy and privacy, remains an ongoing challenge. This work underscores the importance of continuous research and development to ensure the safe and responsible use of LLMs in real-world applications.

Updated: 2024-06-16 22:04:10

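Title: Current state of LLM Risks and AI Guardrails

Abstract: Large language models (LLMs) have become increasingly sophisticated, leading to widespread deployment in sensitive applications where safety and reliability are paramount. However, LLMs carry inherent risks, including bias, the potential for unsafe actions, dataset poisoning, lack of explainability, hallucinations, and non-reproducibility. These risks call for the development of "guardrails" that align LLMs with desired behaviors and mitigate potential harm. This work explores the risks of deploying LLMs and evaluates current approaches to implementing guardrails and model alignment techniques. We examine intrinsic and extrinsic bias evaluation methods and discuss the importance of fairness metrics for responsible AI development. We explore the safety and reliability of agentic LLMs (those capable of real-world actions), emphasizing the need for testability, fail-safes, and situational awareness. We present technical strategies for securing LLMs, including a layered protection model that operates at external, secondary, and internal levels. System prompts, Retrieval-Augmented Generation (RAG) architectures, and techniques for minimizing bias and protecting privacy are highlighted. Effective guardrail design requires a deep understanding of the LLM's intended use case, relevant regulations, and ethical considerations. Striking a balance between competing requirements, such as accuracy and privacy, remains an ongoing challenge. This work underscores the importance of continued research and development to ensure the safe and responsible use of LLMs in real-world applications.

Updated: 2024-06-16 22:04:10

Categories: cs.CR,cs.AI,cs.HC

Download: http://arxiv.org/abs/2406.12934v1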

A Tutorial on the Non-Asymptotic Theory of System Identification

This tutorial serves as an introduction to recently developed non-asymptotic methods in the theory of -- mainly linear -- system identification. We emphasize tools we deem particularly useful for a range of problems in this domain, such as the covering technique, the Hanson-Wright Inequality and the method of self-normalized martingales. We then employ these tools to give streamlined proofs of the performance of various least-squares based estimators for identifying the parameters in autoregressive models. We conclude by sketching out how the ideas presented herein can be extended to certain nonlinear identification problems.
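As a concrete instance of the least-squares estimators discussed above, the sketch below identifies the parameter of a scalar autoregressive model from a single trajectory. The scalar setting, the Gaussian noise, the parameter values, and the function names are assumptions chosen for brevity; the tutorial itself treats more general models.

```python
import random
from typing import List


def simulate_ar1(a: float, steps: int, rng: random.Random,
                 noise_std: float = 1.0) -> List[float]:
    """Simulate x[t+1] = a * x[t] + w[t] with Gaussian noise w[t], starting at 0."""
    xs = [0.0]
    for _ in range(steps):
        xs.append(a * xs[-1] + rng.gauss(0.0, noise_std))
    return xs


def least_squares_ar1(xs: List[float]) -> float:
    """Ordinary least-squares estimate of a from one trajectory.

    Minimizes the sum over t of (x[t+1] - a * x[t]) ** 2 in closed form.
    """
    numerator = sum(x_next * x for x, x_next in zip(xs[:-1], xs[1:]))
    denominator = sum(x * x for x in xs[:-1])
    return numerator / denominator


rng = random.Random(0)
true_a = 0.8
estimate = least_squares_ar1(simulate_ar1(true_a, steps=20000, rng=rng))
```

Substituting the model into the estimator shows that the error equals the sum of `w[t] * x[t]` divided by the sum of `x[t] ** 2`. This ratio has the self-normalized form that the martingale tools named in the abstract are designed to bound.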

Updated: 2024-06-16 21:50:01

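Title: A Tutorial on the Non-Asymptotic Theory of System Identification

Abstract: This tutorial is an introduction to recently developed non-asymptotic methods in the theory of (mainly linear) system identification. We emphasize tools that we consider especially useful for a range of problems in this area, such as the covering technique, the Hanson-Wright inequality, and the method of self-normalized martingales. We then use these tools to give streamlined proofs of the performance of various least-squares-based estimators for identifying the parameters of autoregressive models. We conclude by outlining how the ideas presented here can be extended to certain nonlinear identification problems.

Updated: 2024-06-16 21:50:01

Categories: eess.SY,cs.LG,cs.SY,stat.ML

Download: http://arxiv.org/abs/2309.03873v2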

Enhanced Elephant Herding Optimization for Large Scale Information Access on Social Media

In this article, we present a novel information access approach inspired by the information foraging theory (IFT) and elephant herding optimization (EHO). First, we propose a model for information access on social media based on the IFT. We then elaborate an adaptation of the original EHO algorithm to apply it to the information access problem. The combination of the IFT and EHO constitutes a good opportunity to find relevant information on social media. However, when dealing with voluminous data, the performance undergoes a sharp drop. To overcome this issue, we developed an enhanced version of EHO for large scale information access. We introduce new operators to the algorithm, including territories delimitation and clan migration using clustering. To validate our work, we created a dataset of more than 1.4 million tweets, on which we carried out extensive experiments. The outcomes reveal the ability of our approach to find relevant information in an effective and efficient way. They also highlight the advantages of the improved version of EHO over the original algorithm regarding different aspects. Furthermore, we undertook a comparative study with two other metaheuristic-based information foraging approaches, namely ant colony system and particle swarm optimization. Overall, the results are very promising.

Updated: 2024-06-16 21:48:41

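Title: Enhanced Elephant Herding Optimization for Large Scale Information Access on Social Media

Abstract: In this article, we present a novel information access approach inspired by information foraging theory (IFT) and elephant herding optimization (EHO). First, we propose an IFT-based model for information access on social media. We then describe in detail an adaptation of the original EHO algorithm that applies it to the information access problem. The combination of IFT and EHO offers a good opportunity to find relevant information on social media. However, performance drops sharply when large volumes of data are involved. To overcome this problem, we developed an enhanced version of EHO for large-scale information access. We introduce new operators into the algorithm, including territory delimitation and clan migration using clustering. To validate our work, we created a dataset of more than 1.4 million tweets and carried out extensive experiments on it. The results show that our approach can find relevant information effectively and efficiently. They also highlight the advantages of the improved EHO over the original algorithm in several respects. In addition, we carried out a comparative study with two other metaheuristic-based information foraging approaches, namely ant colony system and particle swarm optimization. Overall, the results are very promising.

Updated: 2024-06-16 21:48:41

Categories: cs.NE,cs.AI

Download: http://arxiv.org/abs/2406.11916v1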

PICL: Physics Informed Contrastive Learning for Partial Differential Equations

Neural operators have recently grown in popularity as Partial Differential Equation (PDE) surrogate models. Learning solution functionals, rather than functions, has proven to be a powerful approach to calculate fast, accurate solutions to complex PDEs. While much work has been done evaluating neural operator performance on a wide variety of surrogate modeling tasks, these works normally evaluate performance on a single equation at a time. In this work, we develop a novel contrastive pretraining framework utilizing Generalized Contrastive Loss that improves neural operator generalization across multiple governing equations simultaneously. Governing equation coefficients are used to measure ground-truth similarity between systems. A combination of physics-informed system evolution and latent-space model output are anchored to input data and used in our distance function. We find that physics-informed contrastive pretraining improves accuracy for the Fourier Neural Operator in fixed-future and autoregressive rollout tasks for the 1D and 2D Heat, Burgers', and linear advection equations.

Updated: 2024-06-16 21:37:26

Categories: cs.LG,cs.NA,math.NA,physics.comp-ph

Download: http://arxiv.org/abs/2401.16327v3

Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics

In this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications. Recent works focus on using LLMs and VLMs to improve the performance of robotics tasks, such as manipulation and navigation. Despite these improvements, the safety analysis of such systems remains underexplored yet extremely critical. LLMs and VLMs are highly susceptible to adversarial inputs, prompting a significant inquiry into the safety of robotic systems. This concern is important because robots operate in the physical world, where erroneous actions can result in severe consequences. This paper explores this issue thoroughly, presenting a mathematical formulation of potential attacks on LLM/VLM-based robotic systems and offering experimental evidence of the safety challenges. Our empirical findings highlight a significant vulnerability: simple modifications to the input can drastically reduce system effectiveness. Specifically, our results demonstrate an average performance deterioration of 19.4% under minor input prompt modifications and a more alarming 29.1% under slight perceptual changes. These findings underscore the urgent need for robust countermeasures to ensure the safe and reliable deployment of advanced LLM/VLM-based robotic systems.

Updated: 2024-06-16 21:31:55

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2402.10340v4

Reward Generalization in RLHF: A Topological Perspective

Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks as a theory of reward generalization in RLHF, introducing fine-grained dataset topologies into generalization bounds. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that our tree-based reward model achieves an average win rate of 65% against baseline methods, thus improving reward generalization for free via topology design.
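To get a feel for the scale of the factor quoted above, the short Python sketch below evaluates log n / log log n for a few dataset sizes. It is only arithmetic on the asymptotic expression: the Θ notation hides constant factors, so these values are not predicted uncertainty reductions. The function name `growth_factor` and the sample sizes are illustrative choices, not taken from the paper.

```python
import math


def growth_factor(n: int) -> float:
    """Evaluate log(n) / log(log(n)), the rate inside the Theta bound (constants ignored)."""
    if n < 16:
        # The expression only grows monotonically once n exceeds e**e (about 15.2).
        raise ValueError("n must be at least 16 for the asymptotic regime")
    return math.log(n) / math.log(math.log(n))


# Illustrative dataset sizes (assumed, not from the paper).
for size in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {size:>8}: log n / log log n = {growth_factor(size):.2f}")
```

The ratio grows slowly, which is the practical reading of the bound: the relative advantage of the tree-structured design increases with dataset size, but only logarithmically.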

Updated: 2024-06-16 21:25:50

Categories: cs.LG,cs.AI,cs.CL,cs.DM

Download: http://arxiv.org/abs/2402.10184v5

Making Task-Oriented Dialogue Datasets More Natural by Synthetically Generating Indirect User Requests

Indirect User Requests (IURs), such as "It's cold in here" instead of "Could you please increase the temperature?" are common in human-human task-oriented dialogue and require world knowledge and pragmatic reasoning from the listener. While large language models (LLMs) can handle these requests effectively, smaller models deployed on virtual assistants often struggle due to resource constraints. Moreover, existing task-oriented dialogue benchmarks lack sufficient examples of complex discourse phenomena such as indirectness. To address this, we propose a set of linguistic criteria along with an LLM-based pipeline for generating realistic IURs to test natural language understanding (NLU) and dialogue state tracking (DST) models before deployment in a new domain. We also release IndirectRequests, a dataset of IURs based on the Schema Guided Dialog (SGD) corpus, as a comparative testbed for evaluating the performance of smaller models in handling indirect requests.

Updated: 2024-06-16 21:20:34

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.07794v2

Phased Instruction Fine-Tuning for Large Language Models

Instruction Fine-Tuning enhances pre-trained language models from basic next-word prediction to complex instruction-following. However, the existing One-off Instruction Fine-Tuning (One-off IFT) method, applied to a diverse instruction set, may not effectively boost models' adherence to instructions because it handles varying instruction complexities simultaneously. To improve this, Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process. It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets. Experiments with Llama-2 7B/13B/70B, Llama3 8B/70B, and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT, supporting the progressive alignment hypothesis and providing a simple and efficient way to enhance large language models. Code and datasets from our experiments are freely available at https://github.com/xubuvd/PhasedSFT.
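The data-splitting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes each example already carries a numeric `difficulty` score (in the paper this comes from GPT-4), and it cuts the sorted data into equally sized phases, which is an assumption made here for simplicity. The function names and the toy examples are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def split_into_phases(scored: List[Dict], num_phases: int) -> List[List[Dict]]:
    """Sort examples by difficulty and cut them into contiguous, near-equal phases."""
    if num_phases < 1:
        raise ValueError("num_phases must be at least 1")
    ordered = sorted(scored, key=lambda ex: ex["difficulty"])
    base, extra = divmod(len(ordered), num_phases)
    phases, start = [], 0
    for i in range(num_phases):
        size = base + (1 if i < extra else 0)  # spread the remainder over early phases
        phases.append(ordered[start:start + size])
        start += size
    return phases


def phased_training_order(scored: List[Dict], num_phases: int) -> Iterator[Tuple[int, List[Dict]]]:
    """Yield (phase_index, examples) from easiest to hardest; a trainer would fine-tune on each in turn."""
    for idx, phase in enumerate(split_into_phases(scored, num_phases)):
        yield idx, phase


# Toy instruction data with made-up difficulty scores.
examples = [
    {"instruction": "Summarize a legal contract clause by clause.", "difficulty": 5},
    {"instruction": "Translate 'cat' to French.", "difficulty": 1},
    {"instruction": "Write a haiku about rain.", "difficulty": 2},
    {"instruction": "Explain quicksort with a worked example.", "difficulty": 3},
    {"instruction": "Name a primary color.", "difficulty": 1},
    {"instruction": "Prove that sqrt(2) is irrational.", "difficulty": 4},
    {"instruction": "Design a database schema for a library.", "difficulty": 4},
]

for phase_idx, phase in phased_training_order(examples, num_phases=3):
    print(phase_idx, [ex["difficulty"] for ex in phase])
```

In a real run, the loop body would call the fine-tuning routine on each phase, carrying the model weights from one phase into the next.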

Updated: 2024-06-16 21:20:29

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.04371v2

miniCodeProps: a Minimal Benchmark for Proving Code Properties

Neural networks have shown initial promise in automating mathematical theorem proving in proof assistants such as Lean. The same proof assistants can be used to verify the correctness of code by pairing code with specifications and proofs that the specifications hold. Automating the writing of code, specifications, and proofs could lower the cost of verification, or, ambitiously, enable a machine learning system to output provably correct code. However, it remains unclear whether current neural theorem provers can automatically verify even relatively simple programs. We present miniCodeProps, a benchmark of 177 program specifications in the Lean proof assistant, aimed at the subproblem of automatically generating a proof for a provided program and specification. miniCodeProps contains specifications about simple, self-contained programs (e.g., lists, natural numbers, binary trees) with varied proof difficulty. Despite its simplicity, miniCodeProps is challenging for current LLM-based provers, which succeed in proving about 25 percent of the specifications. We publicly release miniCodeProps as a benchmark for furthering automated theorem proving in the context of formally verified code.
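To make the idea of pairing code with a specification and a proof concrete, here is a small Lean 4 sketch. It is not taken from miniCodeProps: the function `double` and the theorem `double_eq_two_mul` are hypothetical, chosen only to show the shape of a program, a property about it, and a machine-checked proof. It assumes a recent Lean 4 toolchain where `Nat.two_mul` is available in core.

```lean
-- A tiny self-contained program.
def double (n : Nat) : Nat := n + n

-- The specification: `double` agrees with multiplication by two.
-- The proof unfolds the definition and rewrites with the core lemma
-- `Nat.two_mul : 2 * n = n + n`, after which both sides coincide.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  rw [Nat.two_mul]
```

In the benchmark setting, the program and the theorem statement would be given, and the prover's task would be to produce the proof script after `by`.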

Updated: 2024-06-16 21:11:23

Categories: cs.SE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.11915v1

A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles

This paper presents a dataset containing data on 4011 videos about the ongoing measles outbreak, published on 264 websites between January 1, 2024, and May 31, 2024. The dataset is available at https://dx.doi.org/10.21227/40s8-xf63. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remaining websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each video, the URL of the video, the title of the post, the description of the post, and the date of publication are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grained sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes, i.e., positive, negative, or neutral, (ii) one of the subjectivity classes, i.e., highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grained sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. Finally, this paper also presents a list of open research questions that may be investigated using this dataset.
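The three-way sentiment labelling mentioned above can be illustrated with a small Python sketch. VADER reports a compound score in [-1, 1], and its documentation suggests cutoffs of ±0.05 for mapping that score to positive, negative, or neutral. Whether the dataset authors used exactly these cutoffs is an assumption here, and the function name `sentiment_class` is hypothetical. The sketch takes a precomputed score so that it runs without installing the VADER package.

```python
def sentiment_class(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment class.

    The default threshold follows the cutoff suggested in VADER's documentation;
    the dataset's actual cutoff is assumed, not confirmed.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores standing in for scored video titles.
for score in (0.62, 0.0, -0.41):
    print(score, sentiment_class(score))
```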

Updated: 2024-06-16 21:10:55

Categories: cs.CY,cs.AI,cs.CL,cs.LG,cs.SI,I.2.7; I.2.8; I.5.4; K.4.2; H.2.8; I.2.6

Download: http://arxiv.org/abs/2406.07693v2

Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations

Large language and vision-language models are rapidly being deployed in practice thanks to their impressive capabilities in instruction following, in-context learning, and so on. This raises an urgent need to carefully analyse their robustness so that stakeholders can understand if and when such models are trustworthy enough to be relied upon in any given application. In this paper, we highlight a specific vulnerability in popular models, namely permutation sensitivity in multiple-choice question answering (MCQA). Specifically, we show empirically that popular models are vulnerable to adversarial permutation in answer sets for multiple-choice prompting, which is surprising as models should ideally be as invariant to prompt permutation as humans are. These vulnerabilities persist across various model sizes, and exist in very recent language and vision-language models. Code is available at https://github.com/ys-zong/FoolyourVLLMs.
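A permutation-sensitivity check of the kind described above can be sketched as follows. This is an illustrative harness, not the code from the linked repository: it treats a question as robustly answered only if the model picks the correct option under every ordering of the answer set. The `model` argument is any callable that maps a prompt string to an option letter; the two toy models, the prompt format, and all names are hypothetical.

```python
from itertools import permutations
from typing import Callable, Sequence

LETTERS = "ABCDEFGH"


def format_prompt(question: str, options: Sequence[str]) -> str:
    """Render a question and its lettered answer options as one prompt string."""
    lines = [question] + [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def is_permutation_robust(model: Callable[[str], str], question: str,
                          options: Sequence[str], answer: str) -> bool:
    """Return True only if the model selects `answer` under every option ordering."""
    for perm in permutations(options):
        letter = model(format_prompt(question, perm))
        idx = LETTERS.find(letter) if len(letter) == 1 else -1
        if not 0 <= idx < len(perm) or perm[idx] != answer:
            return False
    return True


# Two toy stand-ins for a real model (hypothetical, for illustration only).
def always_first(prompt: str) -> str:
    """Position-biased model: always answers 'A'."""
    return "A"


def reads_options(prompt: str) -> str:
    """Content-aware model: returns the letter whose option text is 'Paris'."""
    for line in prompt.splitlines()[1:]:
        letter, text = line.split(". ", 1)
        if text == "Paris":
            return letter
    return "A"


QUESTION = "What is the capital of France?"
OPTIONS = ["Paris", "Rome", "Madrid"]

print(is_permutation_robust(always_first, QUESTION, OPTIONS, "Paris"))
print(is_permutation_robust(reads_options, QUESTION, OPTIONS, "Paris"))
```

The position-biased model is correct on the original ordering, where the right answer happens to be listed first, yet fails once the options are shuffled. That gap between ordinary accuracy and worst-case accuracy over permutations is what the check is meant to expose.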

Updated: 2024-06-16 21:10:40

Categories: cs.LG

Download: http://arxiv.org/abs/2310.01651v2

Intelligent Energy Management with IoT Framework in Smart Cities Using Intelligent Analysis: An Application of Machine Learning Methods for Complex Networks and Systems

This study confronts the growing challenges of energy consumption and the depletion of energy resources, particularly in the context of smart buildings. As the demand for energy increases alongside the necessity for efficient building maintenance, it becomes imperative to explore innovative energy management solutions. We present a comprehensive review of Internet of Things (IoT)-based frameworks aimed at smart city energy management, highlighting the pivotal role of IoT devices in addressing these issues due to their compactness, sensing, measurement, and computing capabilities. Our review methodology encompasses a thorough analysis of existing literature on IoT architectures and frameworks for intelligent energy management applications. We focus on systems that not only collect and store data but also support intelligent analysis for monitoring, controlling, and enhancing system efficiency. Additionally, we examine the potential for these frameworks to serve as platforms for the development of third-party applications, thereby extending their utility and adaptability. The findings from our review indicate that IoT-based frameworks offer significant potential to reduce energy consumption and environmental impact in smart buildings. Through the adoption of intelligent mechanisms and solutions, these frameworks facilitate effective energy management, leading to improved system efficiency and sustainability. Considering these findings, we recommend further exploration and adoption of IoT-based wireless sensing systems in smart buildings as a strategic approach to energy management. Our review underscores the importance of incorporating intelligent analysis and enabling the development of third-party applications within the IoT framework to efficiently meet the evolving energy demands and maintenance challenges.

Updated: 2024-06-16 21:04:33

Categories: cs.LG,cs.CY,cs.SY,eess.SY

Download: http://arxiv.org/abs/2306.05567v2

HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

The widespread use of social media necessitates reliable and efficient detection of offensive content to mitigate harmful effects. Although sophisticated models perform well on individual datasets, they often fail to generalize due to varying definitions and labeling of "offensive content." In this paper, we introduce HateCOT, an English dataset with over 52,000 samples from diverse sources, featuring explanations generated by GPT-3.5Turbo and curated by humans. We demonstrate that pretraining on HateCOT significantly enhances the performance of open-source Large Language Models on three benchmark datasets for offensive content detection in both zero-shot and few-shot settings, despite differences in domain and task. Additionally, HateCOT facilitates effective K-shot fine-tuning of LLMs with limited data and improves the quality of their explanations, as confirmed by our human evaluation.

Updated: 2024-06-16 20:55:25

Categories: cs.CL,cs.AI,cs.SI

Download: http://arxiv.org/abs/2403.11456v3

Fine-grained Classes and How to Find Them

In many practical applications, coarse-grained labels are readily available compared to fine-grained labels that reflect subtle differences between classes. However, existing methods cannot leverage coarse labels to infer fine-grained labels in an unsupervised manner. To bridge this gap, we propose FALCON, a method that discovers fine-grained classes from coarsely labeled data without any supervision at the fine-grained level. FALCON simultaneously infers unknown fine-grained classes and underlying relationships between coarse and fine-grained classes. Moreover, FALCON is a modular method that can effectively learn from multiple datasets labeled with different strategies. We evaluate FALCON on eight image classification tasks and a single-cell classification task. FALCON outperforms baselines by a large margin, achieving 22% improvement over the best baseline on the tieredImageNet dataset with over 600 fine-grained classes.

Updated: 2024-06-16 20:55:19

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2406.11070v1

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar. Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs. For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge. Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.
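The agreement figure quoted above is a Spearman rank correlation between two model rankings. The Python sketch below shows how such a number is computed from a benchmark score and an arena Elo rating per model. The scores are made up for illustration and are not WildVision data; the helper names are hypothetical. Ties are handled with average ranks, and the correlation is the Pearson correlation of the two rank vectors.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five models (illustrative only).
bench_scores = [71.2, 65.0, 58.3, 52.1, 40.7]
arena_elo = [1210, 1150, 1165, 1040, 980]

print(round(spearman(bench_scores, arena_elo), 3))
```

In this toy data the two rankings differ only by one swap of adjacent models, so the correlation stays high; a benchmark that ordered the models very differently from the arena would score much lower.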

Updated: 2024-06-16 20:53:25

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.11069v1

A Unified View of Abstract Visual Reasoning Problems

The field of Abstract Visual Reasoning (AVR) encompasses a wide range of problems, many of which are inspired by human IQ tests. The variety of AVR tasks has resulted in state-of-the-art AVR methods being task-specific approaches. Furthermore, contemporary methods consider each AVR problem instance not as a whole, but in the form of a set of individual panels with particular locations and roles (context vs. answer panels) pre-assigned according to the task-specific arrangements. While these highly specialized approaches have recently led to significant progress in solving particular AVR tasks, considering each task in isolation hinders the development of universal learning systems in this domain. In this paper, we introduce a unified view of AVR tasks, where each problem instance is rendered as a single image, with no a priori assumptions about the number of panels, their location, or role. The main advantage of the proposed unified view is the ability to develop universal learning models applicable to various AVR tasks. What is more, the proposed approach inherently facilitates transfer learning in the AVR domain, as various types of problems share a common representation. The experiments conducted on four AVR datasets with Raven's Progressive Matrices and Visual Analogy Problems, and one real-world visual analogy dataset show that the proposed unified representation of AVR tasks poses a challenge to state-of-the-art Deep Learning (DL) AVR models and, more broadly, contemporary DL image recognition methods. In order to address this challenge, we introduce the Unified Model for Abstract Visual Reasoning (UMAVR) capable of dealing with various types of AVR problems in a unified manner. UMAVR outperforms existing AVR methods in selected single-task learning experiments, and demonstrates effective knowledge reuse in transfer learning and curriculum learning setups.

Updated: 2024-06-16 20:52:44

Categories: cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2406.11068v1

Improving GFlowNets for Text-to-Image Diffusion Alignment

Diffusion models have become the de-facto approach for generating visual data, which are trained to match the distribution of the training dataset. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion models to achieve this goal through reinforcement learning-based algorithms. Nonetheless, they suffer from issues including slow credit assignment as well as low quality in their generated samples. In this work, we explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability -- a natural scenario for the framework of generative flow networks (GFlowNets). To this end, we propose the Diffusion Alignment with GFlowNet (DAG) algorithm to post-train diffusion models with black-box property functions. Extensive experiments on Stable Diffusion and various reward specifications corroborate that our method could effectively align large-scale text-to-image diffusion models with given reward information.

Updated: 2024-06-16 20:45:19

Categories: cs.LG,cs.AI,cs.CV,stat.ML

Download: http://arxiv.org/abs/2406.00633v2

A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks

Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer can lead to feature learning, characterized by the appearance of a separated rank-one component (a spike) in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function, and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional, large-sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.

Updated: 2024-06-16 20:44:54

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2310.07891v3

Generalization and Knowledge Transfer in Abstract Visual Reasoning Models

We study generalization and knowledge reuse capabilities of deep neural networks in the domain of abstract visual reasoning (AVR), employing Raven's Progressive Matrices (RPMs), a recognized benchmark task for assessing AVR abilities. Two knowledge transfer scenarios referring to the I-RAVEN dataset are investigated. Firstly, inspired by the generalization assessment capabilities of the PGM dataset and the popularity of I-RAVEN, we introduce Attributeless-I-RAVEN, a benchmark with four generalization regimes that allow testing the generalization of abstract rules applied to held-out attributes. Secondly, we construct I-RAVEN-Mesh, a dataset that enriches RPMs with a novel component structure comprising line-based patterns, facilitating assessment of progressive knowledge acquisition in a transfer learning setting. The developed benchmarks reveal shortcomings of contemporary deep learning models, which we partly address with the Pathways of Normalized Group Convolution (PoNG) model, a novel neural architecture for solving AVR tasks. PoNG excels in both presented challenges, as well as in the standard I-RAVEN and PGM setups.

Updated: 2024-06-16 20:26:38

Categories: cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2406.11061v1

Element-wise Multiplication Based Physics-informed Neural Networks

As a promising framework for resolving partial differential equations (PDEs), physics-informed neural networks (PINNs) have received widespread attention from industrial and scientific fields. However, a lack of expressive ability and initialization pathologies have been found to hinder the application of PINNs to complex PDEs. In this work, we propose Element-wise Multiplication Based Physics-informed Neural Networks (EM-PINNs) to resolve these issues. The element-wise multiplication operation is adopted to transform features into high-dimensional, non-linear spaces, which effectively enhances the expressive capability of PINNs. Benefiting from the element-wise multiplication operation, EM-PINNs can eliminate the initialization pathologies of PINNs. The proposed structure is verified on various benchmarks. The results show that EM-PINNs have strong expressive ability.
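The role of element-wise multiplication can be illustrated with a toy block written in plain Python. The abstract does not spell out the EM-PINN architecture, so the form used here, the element-wise product of two activated affine maps of the same input, is an assumption made purely for illustration. The names `affine`, `em_block`, and `init_block` are hypothetical.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)


def affine(weights: List[List[float]], bias: List[float], x: List[float]) -> List[float]:
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def em_block(params: Tuple[Layer, Layer], x: List[float]) -> List[float]:
    """Assumed block form: tanh(W1 x + b1) multiplied element-wise by tanh(W2 x + b2)."""
    (w1, b1), (w2, b2) = params
    left = [math.tanh(v) for v in affine(w1, b1, x)]
    right = [math.tanh(v) for v in affine(w2, b2, x)]
    return [l * r for l, r in zip(left, right)]


def init_block(in_dim: int, out_dim: int, rng: random.Random) -> Tuple[Layer, Layer]:
    """Draw two independent layers with weights and biases uniform in [-1, 1]."""
    def layer() -> Layer:
        w = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
        b = [rng.uniform(-1, 1) for _ in range(out_dim)]
        return w, b
    return layer(), layer()


rng = random.Random(0)
block = init_block(in_dim=2, out_dim=4, rng=rng)
print(em_block(block, [0.5, -1.0]))
```

Multiplying the two branches makes each output depend on products of input coordinates, so a single block already carries cross terms that a plain affine-plus-activation layer of the same width would need extra depth to represent.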

Updated: 2024-06-16 20:15:40

Categories: cs.LG,cs.AI,cs.NE

Download: http://arxiv.org/abs/2406.04170v2

Interpretable Temporal Class Activation Representation for Audio Spoofing Detection

Explaining the decisions made by audio spoofing detection models is crucial for fostering trust in detection outcomes. However, current research on the interpretability of detection models is limited to applying XAI tools to post-trained models. In this paper, we utilize the wav2vec 2.0 model and attentive utterance-level features to integrate interpretability directly into the model's architecture, thereby enhancing transparency of the decision-making process. Specifically, we propose a class activation representation to localize the discriminative frames contributing to detection. Furthermore, we demonstrate that multi-label training based on spoofing types, rather than binary labels as bonafide and spoofed, enables the model to learn distinct characteristics of different attacks, significantly improving detection performance. Our model achieves state-of-the-art results, with an EER of 0.51% and a min t-DCF of 0.0165 on the ASVspoof2019-LA set.
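For readers unfamiliar with the equal error rate (EER) reported above, the Python sketch below computes it from detection scores. It is a simple threshold sweep, not the official ASVspoof evaluation tooling: higher scores are assumed to mean "more likely bonafide", and the EER is approximated at the threshold where the false acceptance and false rejection rates are closest. The scores are made up for illustration.

```python
from typing import Sequence


def equal_error_rate(bonafide_scores: Sequence[float], spoof_scores: Sequence[float]) -> float:
    """Approximate the EER by sweeping every observed score as a decision threshold."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for thr in thresholds:
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        # False rejection: bonafide audio scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up detector scores (illustrative only).
bonafide = [0.9, 0.8, 0.75, 0.6, 0.3]
spoof = [0.7, 0.4, 0.35, 0.2, 0.1]

print(f"EER = {equal_error_rate(bonafide, spoof):.2%}")
```

With only a handful of scores the two error rates may never be exactly equal, which is why the sketch averages them at the closest threshold; evaluation toolkits typically interpolate between thresholds instead.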

Updated: 2024-06-16 20:01:29

Categories: cs.SD,cs.CR,eess.AS

Download: http://arxiv.org/abs/2406.08825v2

Event-Triggered Islanding in Inverter-Based Grids

The decentralization of modern power systems challenges the hierarchical structure of the electric grid and necessitates automated schemes to manage adverse conditions. This work proposes an adaptive isolation methodology that can divide a grid into autonomous islands, ensuring stable and economical operation amid deliberate (e.g., cyberattacks) or unintentional abnormal events. The adaptive isolation logic is event-triggered to prevent false positives, enhance detection accuracy, and reduce computational overhead. A measurement-based stable kernel representation (SKR) triggering mechanism initially inspects distributed generation controllers for abnormal behavior. The SKR then alerts a machine learning (ML) ensemble classifier to assess whether the system behavior remains within acceptable operational limits. The event-triggered adaptive isolation framework is evaluated using the IEEE RTS-24 and 118-bus systems. Simulation results demonstrate that the proposed framework detects anomalous behavior with 100% accuracy in real time, i.e., within 22 ms. Supply-adequate partitions are identified, outperforming traditional islanding detection and formation techniques while minimizing operating costs.

Updated: 2024-06-16 19:55:39

标题: 逆变器型电网中的事件触发式孤岛模式

摘要: 现代电力系统的分散化挑战了电网的层次结构，并需要自动化方案来管理不利条件。本文提出了一种自适应隔离方法，可以将电网分割成自治岛屿，确保在有意的（例如网络攻击）或无意的异常事件发生时保持稳定和经济运行。自适应隔离逻辑是事件触发的，以防止误报，提高检测准确性，并减少计算开销。一种基于测量的稳定核心表示（SKR）触发机制最初检查分布式发电控制器是否存在异常行为。然后，SKR会提醒一个机器学习（ML）集成分类器来评估系统行为是否保持在可接受的运行限制内。该事件触发的自适应隔离框架使用IEEE RTS-24和118母线系统进行评估。模拟结果表明，所提出的框架可以在实时环境中以100%的准确率检测异常行为，即在22毫秒内。识别出供应充足的分区，优于传统的隔离检测和形成技术，同时最大限度地降低运营成本。

更新时间: 2024-06-16 19:55:39

领域: eess.SY,cs.CR,cs.SY

下载: http://arxiv.org/abs/2306.15454v3

How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark

The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancement of LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations. In this work, we develop ENAMEL (EfficeNcy AutoMatic EvaLuator), a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code. Firstly, we propose a new efficiency metric called eff@k, which generalizes the pass@k metric from correctness to efficiency and appropriately handles right-censored execution time. Furthermore, we derive an unbiased and variance-reduced estimator of eff@k via Rao-Blackwellization; we also provide a numerically stable implementation for the new estimator. Secondly, to set a high standard for efficiency evaluation, we employ a human expert to design the best algorithms and implementations as our reference solutions of efficiency, many of which are much more efficient than existing canonical solutions in HumanEval and HumanEval+. Moreover, to ensure a rigorous evaluation, we employ a human expert to curate strong test case generators to filter out wrong code and differentiate suboptimal algorithms. An extensive study across 30 popular LLMs using our benchmark ENAMEL shows that LLMs still fall short of generating expert-level efficient code. Using two subsets of our problem set, we demonstrate that this deficiency arises because current LLMs struggle to design advanced algorithms and are barely aware of implementation optimization. Our benchmark is publicly available at https://github.com/q-rz/enamel.
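As a rough illustration of the estimator idea, the sketch below computes the expected best efficiency score among k samples drawn without replacement from n scored samples, which is one standard unbiased, lower-variance way to estimate an "expected maximum of k" quantity. It assumes each sample already has an efficiency score in [0, 1], with incorrect or timed-out code scored 0; the function name and scoring convention are illustrative assumptions and not the ENAMEL implementation. The weights are built by a recurrence instead of large binomial coefficients, which is one simple way to stay numerically stable.

```python
def expected_max_of_k(scores, k):
    """Expected maximum score over a random size-k subset of `scores`.

    Hypothetical helper: with sorted scores e_(1) <= ... <= e_(n), the
    r-th smallest score is the subset maximum with probability
    C(r-1, k-1) / C(n, k). These weights are computed by a recurrence
    to avoid forming large binomial coefficients.
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(scores)")
    ordered = sorted(scores)
    weight = k / n  # weight of the largest score (r = n)
    total = 0.0
    for r in range(n, k - 1, -1):
        total += weight * ordered[r - 1]
        if r > k:
            # C(r-2, k-1) / C(r-1, k-1) = (r - k) / (r - 1)
            weight *= (r - k) / (r - 1)
    return total


if __name__ == "__main__":
    # With 0/1 scores this reduces to the usual pass@k estimator.
    print(expected_max_of_k([0, 1, 0, 1, 0], k=2))
```

With binary scores the result equals 1 - C(n-c, k)/C(n, k), where c is the number of correct samples, so the sketch is consistent with pass@k as the special case the abstract describes.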

Updated: 2024-06-16 19:34:04

标题: LLM生成的代码有多高效？一个严谨且高标准的基准测试

摘要: 大型语言模型(LLMs)的出现显著推动了程序合成的前沿。基于LLM的程序合成的进展需要对LLM生成的代码进行彻底评估。大多数评估框架侧重于生成代码的(功能)正确性；效率作为代码质量的重要衡量标准，在现有评估中被忽视了。在这项工作中，我们开发了ENAMEL（EfficeNcy AutoMatic EvaLuator），这是一个严格和高标准的基准，用于评估LLMs在生成高效代码方面的能力。首先，我们提出了一个新的效率度量标准，称为eff@k，它将正确性度量标准pass@k从正确性推广到效率，并适当处理了右截尾的执行时间。此外，我们通过Rao-Blackwell化推导出了eff@k的无偏和方差减少的估计量；我们还为新估计量提供了一个数值稳定的实现。其次，为了为效率评估设定高标准，我们雇用了一位人类专家设计最佳算法和实现作为我们效率的参考解决方案，其中许多比HumanEval和HumanEval+中现有的标准解决方案更有效率。此外，为了确保严格的评估，我们雇用了一位人类专家策划了强大的测试用例生成器，以过滤掉错误代码并区分次优算法。通过使用我们的基准ENAMEL对30种流行的LLMs进行广泛研究表明，LLMs仍然无法生成专家级别的高效代码。使用我们问题集的两个子集，我们证明这种不足是因为当前的LLMs在设计先进算法方面有困难，并且几乎没有意识到实现优化。我们的基准公开可用于https://github.com/q-rz/enamel。

更新时间: 2024-06-16 19:34:04

领域: cs.SE,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.06647v2

A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.
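To make the hypothesis-testing idea concrete, here is a minimal sketch of one test that fits the setup described above. It assumes a paired design in which each problem is answered once in its original form and once after a superficial token change that leaves the logic intact; the function name and the data layout are hypothetical, and the tests actually used in the study may differ. Under the null hypothesis of genuine reasoning, a problem is as likely to flip from correct to wrong as from wrong to correct, so the discordant pairs follow a Binomial(n, 0.5) distribution (the exact form of McNemar's test).

```python
from math import comb


def one_sided_sign_test(correct_original, correct_perturbed):
    """P-value for 'accuracy drops after perturbation' on paired results.

    Both arguments are equal-length sequences of booleans, one entry per
    problem (hypothetical layout). Only discordant pairs carry evidence.
    """
    pairs = list(zip(correct_original, correct_perturbed))
    lost = sum(1 for before, after in pairs if before and not after)
    gained = sum(1 for before, after in pairs if after and not before)
    n = lost + gained
    if n == 0:
        return 1.0  # no discordant pairs, no evidence against the null
    # P(X >= lost) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(lost, n + 1))
    return tail / 2 ** n


if __name__ == "__main__":
    original = [True] * 10
    perturbed = [True] * 2 + [False] * 8
    print(one_sided_sign_test(original, perturbed))
```

A small p-value rejects the null of genuine reasoning for that perturbation, which matches the framing in which every null hypothesis assumes the model reasons rather than pattern-matches.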

Updated: 2024-06-16 19:22:53

标题: 一窥令牌偏见：大型语言模型尚未成为真正的推理者

摘要: 这项研究引入了一个假设检验框架，以评估大型语言模型（LLMs）是否具有真正的推理能力，或者主要依赖于记号偏见。我们超越了对准确性的评估；相反，我们的目标是调查它们在解决逻辑推理任务时的记号偏见。具体来说，我们开发了精心控制的合成数据集，其中包含连词谬误和三段论问题。我们的框架概述了一系列假设，其中记号偏见很容易识别，所有零假设都假设LLMs具有真正的推理能力。本研究的发现表明，以统计保证，大多数LLMs仍然在逻辑推理方面有困难。虽然它们在经典问题上表现良好，但它们的成功在很大程度上取决于识别具有强烈记号偏见的表面模式，因此引发了对它们实际推理和泛化能力的担忧。

更新时间: 2024-06-16 19:22:53

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.11050v1

Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality

Federated learning (FL) has made tremendous progress in providing collaborative training solutions for distributed data silos with privacy guarantees. However, few existing works explore a more realistic scenario where the clients hold multiple data modalities. In this paper, we aim to solve a novel challenge in multi-modal federated learning (MFL), missing modalities, in which clients may lack some of the modalities in their local datasets. To tackle this problem, we propose a novel multi-modal federated learning method, Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP), which integrates large-scale pre-trained models to enhance the federated training. In the proposed FedMVP framework, each client deploys a large-scale pre-trained model with frozen parameters for modality completion and representation knowledge transfer, enabling efficient and robust local training. On the server side, we utilize generated data to uniformly measure the representation similarity among the uploaded client models and construct a graph perspective to aggregate them according to their importance in the system. We demonstrate that the model achieves superior performance on two real-world image-text classification datasets and is robust to the performance degradation caused by missing modalities.

Updated: 2024-06-16 19:18:06

标题: 利用基础模型进行多模态联邦学习，处理不完整的模态

摘要: 联邦学习（FL）在为具有隐私保障的分布式数据孤岛提供协作训练解决方案方面取得了巨大进展。然而，很少有现有研究探索了一个更为现实的场景，即客户端持有多个数据模态的情况。本文旨在解决多模态联邦学习（MFL）中的一个新挑战——模态缺失——即客户端可能在其本地数据集中丢失部分模态。为了解决这些问题，我们提出了一种新颖的多模态联邦学习方法，名为Federated Multi-modal contrastiVe training with Pre-trained completion（FedMVP），该方法集成了大规模预训练模型以增强联邦训练。在提出的FedMVP框架中，每个客户端部署一个带有冻结参数的大规模预训练模型，用于模态完成和表示知识转移，从而实现高效和稳健的本地训练。在服务器端，我们利用生成的数据均匀地衡量上传客户端模型之间的表示相似性，并从图的角度构建聚合它们的视角，根据它们在系统中的重要性进行聚合。我们证明该模型在两个真实世界的图像-文本分类数据集上实现了优越性能，并且对由模态缺失引起的性能下降具有鲁棒性。

更新时间: 2024-06-16 19:18:06

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2406.11048v1

Enhancing Supermarket Robot Interaction: A Multi-Level LLM Conversational Interface for Handling Diverse Customer Intents

This paper presents the design and evaluation of a novel multi-level LLM interface for supermarket robots to assist customers. The proposed interface allows customers to convey their needs through both generic and specific queries. While state-of-the-art systems like OpenAI's GPTs are highly adaptable and easy to build and deploy, they still face challenges such as increased response times and limitations in strategic control of the underlying model for tailored use-case and cost optimization. Driven by the goal of developing faster and more efficient conversational agents, this paper advocates for using multiple smaller, specialized LLMs fine-tuned to handle different user queries based on their specificity and user intent. We compare this approach to a specialized GPT model powered by GPT-4 Turbo, using the Artificial Social Agent Questionnaire (ASAQ) and qualitative participant feedback in a counterbalanced within-subjects experiment. Our findings show that our multi-LLM chatbot architecture outperformed the benchmarked GPT model across all 13 measured criteria, with statistically significant improvements in four key areas: performance, user satisfaction, user-agent partnership, and self-image enhancement. The paper also presents a method for supermarket robot navigation by mapping the final chatbot response to correct shelf numbers, enabling the robot to sequentially navigate towards the respective products, after which lower-level robot perception, control, and planning can be used for automated object retrieval. We hope this work encourages more efforts into using multiple, specialized smaller models instead of relying on a single powerful, but more expensive and slower model.

Updated: 2024-06-16 19:13:01

标题: 提升超市机器人交互：用于处理多样客户意图的多级LLM对话接口

摘要: 本文介绍了一种新颖的多层次LLM接口设计和评估，用于超市机器人协助顾客。所提出的接口允许顾客通过通用和具体的查询传达他们的需求。虽然像OpenAI的GPTs这样的最新系统非常适应性强，易于构建和部署，但它们仍面临挑战，如响应时间增加和在定制用例和成本优化方面对底层模型的战略控制有限。本文的目标是开发更快、更高效的对话代理，提倡使用多个较小、专门的LLMs进行微调，以处理不同用户查询基于其特定性和用户意图。我们将此方法与由GPT-4 Turbo支持的专门GPT模型进行比较，使用人工社交代理问卷（ASAQ）和定性参与者反馈在一项反平衡的被试实验中。我们的发现显示，我们的多LLM聊天机器人架构在所有13个衡量标准上优于基准GPT模型，并在四个关键领域中有统计显著的改进：性能、用户满意度、用户代理合作和自我形象增强。本文还提出了一种超市机器人导航方法，通过将最终聊天机器人响应映射到正确的货架编号，使机器人能够顺序导航到相应的产品，之后可以使用较低级别的机器人感知、控制和规划进行自动物体检索。我们希望这项工作鼓励更多的努力，使用多个专门的较小模型，而不是依赖于单一强大但更昂贵和更慢的模型。

更新时间: 2024-06-16 19:13:01

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2406.11047v1

TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision

Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy. Most earlier works focus on fully or semi-supervised methods that require a large amount of human-annotated data, which is costly and time-consuming to acquire. To reduce human effort, in this paper, we work on hierarchical text classification with a minimal amount of supervision: using the class name of each node as the only supervision. Recently, large language models (LLMs) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting, because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which (1) automatically enriches the label taxonomy with class-indicative terms to facilitate classifier training and (2) utilizes LLMs for both data annotation and creation tailored for the hierarchical label space. Experiments show that TELEClass can outperform previous weakly-supervised methods and LLM-based zero-shot prompting methods on two public datasets.

Updated: 2024-06-16 19:10:39

标题: TELEClass：具有最小监督的分类学丰富和LLM增强的层次文本分类

摘要: 分层文本分类旨在将每个文档分类为标签分类法中的一组类别。大多数早期作品侧重于需要大量人工注释数据的全面或半监督方法，这种方法获取成本高且耗时。为了减轻人力劳动，本文致力于使用最少的监督来进行分层文本分类：仅使用每个节点的唯一类名作为监督。最近，大型语言模型(LLM)通过零-shot提示在各种任务上表现出竞争性能，但该方法在层次设置中表现不佳，因为将大型和结构化的标签空间包含在提示中是无效的。另一方面，先前的弱监督分层文本分类方法仅利用原始分类法骨架，忽略了隐藏在文本语料库中的丰富信息，这些信息可以作为额外的类别指示特征。为了解决上述挑战，我们提出了TELEClass，即分类法丰富和LLM增强的弱监督分层文本分类，该方法(1)自动使用类别指示术语丰富标签分类法，以促进分类器训练，(2)利用LLM对数据进行注释并为层次标签空间定制创建。实验证明，TELEClass在两个公共数据集上可以胜过先前的弱监督方法和基于LLM的零-shot提示方法。

更新时间: 2024-06-16 19:10:39

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2403.00165v2

Kolmogorov Arnold Informed neural network: A physics-informed deep learning framework for solving PDEs based on Kolmogorov Arnold Networks

AI for partial differential equations (PDEs) has garnered significant attention, particularly with the emergence of Physics-informed neural networks (PINNs). The recent advent of Kolmogorov-Arnold Network (KAN) indicates that there is potential to revisit and enhance the previously MLP-based PINNs. Compared to MLPs, KANs offer interpretability and require fewer parameters. PDEs can be described in various forms, such as strong form, energy form, and inverse form. While mathematically equivalent, these forms are not computationally equivalent, making the exploration of different PDE formulations significant in computational physics. Thus, we propose different PDE forms based on KAN instead of MLP, termed Kolmogorov-Arnold-Informed Neural Network (KINN). We systematically compare MLP and KAN in various numerical examples of PDEs, including multi-scale, singularity, stress concentration, nonlinear hyperelasticity, heterogeneous, and complex geometry problems. Our results demonstrate that KINN significantly outperforms MLP in terms of accuracy and convergence speed for numerous PDEs in computational solid mechanics, except for the complex geometry problem. This highlights KINN's potential for more efficient and accurate PDE solutions in AI for PDEs.

Updated: 2024-06-16 19:07:06

标题: 科尔莫哥洛夫阿诺德通知神经网络：基于科尔莫哥洛夫阿诺德网络的求解偏微分方程的物理通知深度学习框架

摘要: 人工智能对偏微分方程（PDEs）的研究引起了人们的极大关注，尤其是随着物理信息神经网络（PINNs）的兴起。最近出现的科尔莫戈洛夫-阿诺德网络（KAN）表明有重新审视和增强先前基于多层感知器（MLP）的PINNs的潜力。与MLPs相比，KANs提供了可解释性并且需要更少的参数。PDEs可以用不同形式描述，例如强形式、能量形式和逆形式。尽管在数学上是等价的，但这些形式在计算上并不等价，这使得探索不同的PDE形式在计算物理学中具有重要意义。因此，我们提出基于KAN而不是MLP的不同PDE形式，称为科尔莫戈洛夫-阿诺德信息神经网络（KINN）。我们在多个PDE的各种数值示例中系统比较了MLP和KAN，包括多尺度、奇异性、应力集中、非线性高弹性、异质性和复杂几何问题。我们的结果表明，在计算固体力学中，KINN在准确性和收敛速度方面明显优于MLP，除了复杂的几何问题。这突出了KINN在人工智能PDE解决方案中更高效和准确的潜力。

更新时间: 2024-06-16 19:07:06

领域: cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2406.11045v1

Evaluating the Performance of Large Language Models via Debates

Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications where tasks are not always from a single domain, or rely on human input, making them unscalable. We propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as problem definition and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.

Updated: 2024-06-16 19:02:31

标题: 通过辩论评估大型语言模型的性能

摘要: 大型语言模型（LLMs）正在迅速发展，并影响着各个领域，这要求开发有效的方法来评估和比较它们的性能。目前大多数性能评估方法要么基于固定的、特定领域的问题，缺乏许多真实世界应用中所需的灵活性，其中任务并非总是来自单一领域，要么依赖于人类输入，使它们无法扩展。我们提出了一个基于LLMs之间辩论的自动基准测试框架，由另一个LLM进行评判。这种方法不仅评估领域知识，还评估问题定义和矛盾识别等技能。我们使用辩论框架评估各种最先进的LLMs的性能，并获得与基于人类输入的热门排名密切一致的排名，消除了昂贵的人类众包的需求。

更新时间: 2024-06-16 19:02:31

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.11044v1

Visual Hallucinations of Multi-modal Large Language Models

Visual hallucination (VH) means that a multi-modal LLM (MLLM) imagines incorrect details about an image in visual question answering. Existing studies find VH instances only in existing image datasets, which results in biased understanding of MLLMs' performance under VH due to limited diversity of such VH instances. In this work, we propose a tool called VHTest to generate a diverse set of VH instances. Specifically, VHTest finds some initial VH instances in existing image datasets (e.g., COCO), generates a text description for each VH mode, and uses a text-to-image generative model (e.g., DALL-E-3) to generate VH images based on the text descriptions. We collect a benchmark dataset with 1,200 VH instances in 8 VH modes using VHTest. We find that existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances in our benchmark. Moreover, we find that fine-tuning an MLLM using our benchmark dataset reduces its likelihood to hallucinate without sacrificing its performance on other benchmarks. Our benchmarks are publicly available: https://github.com/wenhuang2000/VHTest.

Updated: 2024-06-16 18:43:50

标题: 多模式大型语言模型的视觉幻觉

摘要: 视觉幻觉（VH）指的是多模态LLM（MLLM）在视觉问题回答中对图像中的错误细节进行想象。现有研究发现VH实例仅存在于现有图像数据集中，这导致对MLLM在VH下的表现理解存在偏见，因为此类VH实例的多样性有限。在这项工作中，我们提出了一种名为VHTest的工具，用于生成多样化的VH实例。具体而言，VHTest在现有图像数据集（例如COCO）中找到一些初始的VH实例，为每个VH模式生成文本描述，并使用文本到图像生成模型（例如DALL-E-3）基于文本描述生成VH图像。我们使用VHTest收集了包含8种VH模式的1,200个VH实例的基准数据集。我们发现，现有的MLLM，如GPT-4V、LLaVA-1.5和MiniGPT-v2，在我们的基准数据集中对大部分实例进行了幻觉。此外，我们发现，使用我们的基准数据集对MLLM进行微调可以减少其产生幻觉的可能性，而不影响其在其他基准测试中的表现。我们的基准测试数据可公开获取：https://github.com/wenhuang2000/VHTest。

更新时间: 2024-06-16 18:43:50

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2402.14683v2

Dynamic Normativity: Necessary and Sufficient Conditions for Value Alignment

The critical inquiry pervading the realm of Philosophy, and perhaps extending its influence across all Humanities disciplines, revolves around the intricacies of morality and normativity. Surprisingly, in recent years, this thematic thread has woven its way into an unexpected domain, one not conventionally associated with pondering "what ought to be": the field of artificial intelligence (AI) research. Central to morality and AI, we find "alignment", a problem related to the challenges of expressing human goals and values in a manner that artificial systems can follow without leading to unwanted adversarial effects. More explicitly and with our current paradigm of AI development in mind, we can think of alignment as teaching human values to non-anthropomorphic entities trained through opaque, gradient-based learning techniques. This work addresses alignment as a technical-philosophical problem that requires solid philosophical foundations and practical implementations that bring normative theory to AI system development. To accomplish this, we propose two sets of necessary and sufficient conditions that, we argue, should be considered in any alignment process. While necessary conditions serve as metaphysical and metaethical roots that pertain to the permissibility of alignment, sufficient conditions establish a blueprint for aligning AI systems under a learning-based paradigm. After laying such foundations, we present implementations of this approach by using state-of-the-art techniques and methods for aligning general-purpose language systems. We call this framework Dynamic Normativity. Its central thesis is that any alignment process under a learning paradigm that cannot fulfill its necessary and sufficient conditions will fail in producing aligned systems.

Updated: 2024-06-16 18:37:31

标题: 动态规范性：价值对齐的必要和充分条件

摘要: 贯穿哲学领域的重要探讨，或许延伸至所有人文学科，围绕着道德和规范性的复杂性。令人惊讶的是，近年来，这一主题线索已经融入了一个意想不到的领域，一个与思考“应该是什么”传统上不相关的领域：人工智能（AI）研究领域。在道德与人工智能之间，我们发现“对齐”是一个与表达人类目标和价值的挑战相关的问题，以一种人工系统可以遵循而不会导致意外对抗效应的方式。更明确地说，根据我们当前的人工智能发展范式，我们可以将对齐视为通过不透明的基于梯度的学习技术训练非人类实体人类价值观的过程。这项工作将对齐视为一个需要坚实的哲学基础和将规范理论引入人工智能系统开发的技术哲学问题。为了实现这一目标，我们提出了两组必要和充分条件，我们认为这些条件应该在任何对齐过程中被考虑。虽然必要条件作为关于对齐的许可性的形而上学和元伦理根源，充分条件为在基于学习的范式下对齐人工智能系统建立了一个蓝图。在奠定这样的基础之后，我们通过使用最先进的技术和方法来对齐通用语言系统的实施。我们将这一框架称为动态规范性。其核心论点是，在学习范式下的任何对齐过程如果不能满足其必要和充分条件，将无法产生对齐的系统。

更新时间: 2024-06-16 18:37:31

领域: cs.AI,cs.CY

下载: http://arxiv.org/abs/2406.11039v1

garak: A Framework for Security Probing Large Language Models

As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output and are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natural language. Further, what constitutes a security weakness in one context may not be an issue in a different context; one-size-fits-all guardrails remain theoretical. In this paper, we argue that it is time to rethink what constitutes "LLM security", and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper introduces garak (Generative AI Red-teaming and Assessment Kit), a framework which can be used to discover and identify vulnerabilities in a target LLM or dialog system. garak probes an LLM in a structured fashion to discover potential vulnerabilities. The outputs of the framework describe a target model's weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts, and can inform alignment and policy discussions for LLM deployment.

Updated: 2024-06-16 18:18:43

标题: garak：用于安全探测大型语言模型的框架

摘要: 随着大型语言模型（LLMs）被部署并集成到数千个应用程序中，对模型如何应对对抗性攻击的可扩展评估的需求迅速增长。然而，LLM安全是一个不断变化的目标：模型产生不可预测的输出，不断更新，并且潜在的对手非常多样化：任何具有互联网访问权限和良好自然语言掌握能力的人。此外，在一个上下文中构成安全弱点的问题在另一个上下文中可能不是问题；一刀切的防护栏仍然是理论性的。在本文中，我们认为是时候重新思考什么构成“LLM安全”，并追求一种全面的LLM安全评估方法，其中探索和发现问题是核心。为此，本文介绍了garak（生成式人工智能红队和评估工具包），这是一个框架，可用于发现和识别目标LLM或对话系统中的漏洞。garak以结构化的方式探索LLM，以发现潜在的漏洞。该框架的输出描述了目标模型的弱点，有助于对在独特上下文中构成漏洞的问题进行知情讨论，并可以为LLM部署的调整和政策讨论提供信息。

更新时间: 2024-06-16 18:18:43

领域: cs.CL,cs.CR

下载: http://arxiv.org/abs/2406.11036v1

HAIChart: Human and AI Paired Visualization System

The growing importance of data visualization in business intelligence and data science emphasizes the need for tools that can efficiently generate meaningful visualizations from large datasets. Existing tools fall into two main categories: human-powered tools (e.g., Tableau and PowerBI), which require intensive expert involvement, and AI-powered automated tools (e.g., Draco and Table2Charts), which often fall short of guessing specific user needs. In this paper, we aim to achieve the best of both worlds. Our key idea is to initially auto-generate a set of high-quality visualizations to minimize manual effort, then refine this process iteratively with user feedback to more closely align with their needs. To this end, we present HAIChart, a reinforcement learning-based framework designed to iteratively recommend good visualizations for a given dataset by incorporating user feedback. Specifically, we propose a Monte Carlo Graph Search-based visualization generation algorithm paired with a composite reward function to efficiently explore the visualization space and automatically generate good visualizations. We devise a visualization hints mechanism to actively incorporate user feedback, thus progressively refining the visualization generation module. We further prove that the top-k visualization hints selection problem is NP-hard and design an efficient algorithm. We conduct both quantitative evaluations and user studies, showing that HAIChart significantly outperforms state-of-the-art human-powered tools (21% better at Recall and 1.8 times faster) and AI-powered automatic tools (25.1% and 14.9% better in terms of Hit@3 and R10@30, respectively).

Updated: 2024-06-16 18:04:47

标题: HAIChart：人工智能与人类配对可视化系统

摘要: 数据可视化在商业智能和数据科学中的日益重要性强调了需要能够高效地从大型数据集中生成有意义的可视化的工具。现有工具可以分为两大类：人力驱动工具（如Tableau和PowerBI），需要专家大量参与，以及AI驱动的自动化工具（如Draco和Table2Charts），通常无法准确猜测特定用户需求。本文旨在实现两者的最佳结合。我们的关键想法是首先自动生成一组高质量的可视化，以减少人工努力，然后通过用户反馈迭代地完善这个过程，更贴近他们的需求。为此，我们提出了HAIChart，一个基于强化学习的框架，旨在通过整合用户反馈，为给定数据集迭代建议良好的可视化。具体来说，我们提出了一个基于蒙特卡罗图搜索的可视化生成算法，配以一个复合奖励函数，以高效地探索可视化空间并自动生成良好的可视化。我们设计了一个可视化提示机制，积极融入用户反馈，逐渐完善可视化生成模块。我们进一步证明了前k个可视化提示选择问题是NP难题，并设计了一个高效的算法。我们进行了定量评估和用户研究，结果显示HAIChart在召回率方面显著优于最先进的人力驱动工具（召回率提高21%，速度提高1.8倍）和AI驱动的自动化工具（在Hit@3和R10@30方面分别提高了25.1%和14.9%）。

更新时间: 2024-06-16 18:04:47

领域: cs.DB,cs.AI

下载: http://arxiv.org/abs/2406.11033v1

PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling

Prompt optimization aims to find the best prompt to a large language model (LLM) for a given task. LLMs have been successfully used to help find and improve prompt candidates for single-step tasks. However, realistic tasks for agents are multi-step and introduce new challenges: (1) Prompt content is likely to be more extensive and complex, making it more difficult for LLMs to analyze errors, (2) the impact of an individual step is difficult to evaluate, and (3) different people may have varied preferences about task execution. While humans struggle to optimize prompts, they are good at providing feedback about LLM outputs; we therefore introduce a new LLM-driven discrete prompt optimization framework PROMST that incorporates human-designed feedback rules to automatically offer direct suggestions for improvement. We also use an extra learned heuristic model that predicts prompt performance to efficiently sample from prompt candidates. This approach significantly outperforms both human-engineered prompts and several other prompt optimization methods across 11 representative multi-step tasks (an average 10.6%-29.3% improvement to current best methods on five LLMs respectively). We believe our work can serve as a benchmark for automatic prompt optimization for LLM-driven multi-step tasks. Datasets and code are available at https://github.com/yongchao98/PROMST. Project Page is available at https://yongchao98.github.io/MIT-REALM-PROMST/.

Updated: 2024-06-16 18:01:06

标题: 多步任务中的PRompt优化（PROMST）：集成人类反馈和基于启发式抽样

摘要: 快速优化旨在为给定任务找到大型语言模型（LLM）的最佳提示。LLMs已成功用于帮助找到和改进单步任务的提示候选。然而，对于代理人的现实任务是多步的，引入了新的挑战：（1）提示内容可能更加广泛和复杂，使LLMs更难分析错误，（2）评估单个步骤的影响很困难，（3）不同的人可能对任务执行有不同的偏好。虽然人类很难优化提示，但他们擅长提供有关LLM输出的反馈；因此，我们引入了一个新的LLM驱动的离散提示优化框架PROMST，该框架结合了人类设计的反馈规则，自动提供改进的直接建议。我们还使用了一个额外学习的启发式模型，用于预测提示性能，以有效地从提示候选中进行采样。这种方法在11个代表性的多步任务中明显优于人工设计的提示和其他几种提示优化方法（分别对五个LLM方法的当前最佳方法平均提高了10.6％-29.3％）。我们相信我们的工作可以作为LLM驱动的多步任务的自动提示优化的基准。数据集和代码可在https://github.com/yongchao98/PROMST找到。项目页面可在https://yongchao98.github.io/MIT-REALM-PROMST/找到。

更新时间: 2024-06-16 18:01:06

领域: cs.CL,cs.AI,cs.HC,cs.RO

下载: http://arxiv.org/abs/2402.08702v3

Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval

Stopwords are commonly used words in a language that are often considered to be of little value in determining the meaning or significance of a document. These words occur frequently in most texts and don't provide much useful information for tasks like sentiment analysis and text classification. English, which is a high-resource language, takes advantage of the availability of stopwords, whereas for low-resource Indian languages like Marathi, standardized stopword lists that can be used in available packages are very limited, and the number of words in those packages is low. Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences. We make use of the TF-IDF approach coupled with human evaluation to curate a strong stopword list of 400 words. We apply stopword removal to the text classification task and show its efficacy. The work also presents a simple recipe for stopword curation in a low-resource language. The stopwords are integrated into the mahaNLP library and publicly available at https://github.com/l3cube-pune/MarathiNLP.
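The TF-IDF step of such a recipe can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes whitespace tokenization, scores each term by its mean TF-IDF over the documents that contain it, and treats the lowest-scoring terms as stopword candidates for later human review. The function name and the toy English corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def stopword_candidates(documents, top_n):
    """Return the top_n terms with the lowest mean TF-IDF score.

    Hypothetical helper: terms that appear in nearly every document get
    an IDF close to zero, so they surface first as stopword candidates.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Sum TF-IDF per term over the documents that contain it.
    tfidf_sum = defaultdict(float)
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            tfidf_sum[term] += tf * idf

    mean_tfidf = {term: tfidf_sum[term] / doc_freq[term] for term in doc_freq}
    # Ties are broken by higher document frequency, then alphabetically.
    ranked = sorted(mean_tfidf, key=lambda t: (mean_tfidf[t], -doc_freq[t], t))
    return ranked[:top_n]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "the bird sang",
    ]
    print(stopword_candidates(corpus, top_n=1))
```

The ranked list is only a starting point; the abstract pairs the TF-IDF scores with human evaluation to arrive at the final list.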

Updated: 2024-06-16 17:59:05

标题: 在马拉地语中筛选停用词：一种TF-IDF方法用于改进文本分析和信息检索

摘要: 停用词是语言中常用的词语，通常被认为在确定文档的含义或重要性方面价值不大。这些词在大多数文本中频繁出现，对于情感分析和文本分类等任务并没有提供太多有用的信息。英语作为一种高资源语言，利用了停用词的可用性，而像马拉地语这样的低资源印度语言非常有限，标准化，并且可以在可用包中使用，但这些包中可用词汇的数量很少。我们的工作旨在使用MahaCorpus对马拉地语中的停用词进行筛选，该语料库包含2480万个句子。我们利用TF-IDF方法结合人工评估来筛选出一个包含400个词的强大的停用词列表。我们将停用词去除应用于文本分类任务，并展示其有效性。该工作还提供了一种低资源语言中停用词筛选的简单方法。这些停用词已经整合到mahaNLP库中，并可以在https://github.com/l3cube-pune/MarathiNLP上公开获取。

更新时间: 2024-06-16 17:59:05

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11029v1

Universal Cross-Lingual Text Classification

Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge. Unlocking the language potential of low-resource languages requires robust datasets with supervised labels. However, such datasets are scarce, and the label space is often limited. In our pursuit to address this gap, we aim to optimize existing labels/datasets in different languages. This research proposes a novel perspective on Universal Cross-Lingual Text Classification, leveraging a unified model across languages. Our approach involves blending supervised data from different languages during training to create a universal model. The supervised data for a target classification task might come from different languages covering different labels. The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages. We propose the usage of a strong multilingual SBERT as our base model, making our novel training strategy feasible. This strategy contributes to the adaptability and effectiveness of the model in cross-lingual language transfer scenarios, where it can categorize text in languages not encountered during training. Thus, the paper delves into the intricacies of cross-lingual text classification, with a particular focus on its application for low-resource languages, exploring methodologies and implications for the development of a robust and adaptable universal cross-lingual model.

Updated: 2024-06-16 17:58:29

标题: 通用跨语言文本分类

摘要: 文本分类是自然语言处理中的一个重要任务，涉及将文本自动分类到预定义的类别中。为低资源语言创建监督标记数据集面临着重大挑战。释放低资源语言的语言潜力需要具有监督标签的强大数据集。然而，这样的数据集很少，标签空间通常有限。在我们致力于填补这一差距的过程中，我们旨在优化不同语言中现有的标签/数据集。本研究提出了一种新颖的通用跨语言文本分类视角，利用跨语言统一模型。我们的方法涉及在训练过程中混合不同语言的监督数据，以创建一个通用模型。目标分类任务的监督数据可能来自涵盖不同标签的不同语言。主要目标是增强标签和语言覆盖率，旨在实现一个代表来自各种语言的标签并集的标签集。我们建议使用强大的多语言SBERT作为我们的基础模型，使我们的新颖训练策略成为可能。这种策略有助于模型在跨语言语言转移场景中的适应性和有效性，可以对在训练期间未遇到的语言进行文本分类。因此，本文深入探讨跨语言文本分类的复杂性，特别关注其在低资源语言中的应用，探索发展强大且适应性强的通用跨语言模型的方法和影响。

更新时间: 2024-06-16 17:58:29

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.11028v1

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works oversimplify software development workflows by following the waterfall model. Thus, we propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles such as Product Manager, Developer, and Tester to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work into sprints and developing software incrementally from one sprint to the next. Additionally, we introduce Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase. This allows agents to better comprehend the codebase, leading to more precise code generation and modifications throughout the software development process. AgileCoder surpasses existing systems such as ChatDev and MetaGPT, establishing a new standard and showcasing the capabilities of multi-agent systems in advanced software engineering environments. Our source code can be found at https://github.com/FSoft-AI4Code/AgileCoder.
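The idea of a code dependency graph can be illustrated with a small sketch. It assumes a Python codebase held in memory as a mapping from module name to source text, and it records only import relationships between those modules; the function name and input layout are hypothetical, and the actual Dynamic Code Graph Generator may track other kinds of dependencies. Rebuilding the graph after each change is the simplest way to keep it current.

```python
import ast


def build_dependency_graph(sources):
    """Map each module name to the set of project modules it imports.

    `sources` is a dict of module name -> source code (hypothetical
    layout). Imports of modules outside the project are ignored.
    """
    graph = {name: set() for name in sources}
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                root = target.split(".")[0]
                if root in sources and root != name:
                    graph[name].add(root)
    return graph


if __name__ == "__main__":
    project = {
        "app": "import utils\nfrom models import User\nimport os\n",
        "models": "import utils\n",
        "utils": "",
    }
    print(build_dependency_graph(project))
```

An agent that is about to modify `utils` could look up which modules point to it in this graph and include them in its context before generating changes.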

Updated: 2024-06-16 17:57:48

标题: AgileCoder：基于敏捷方法论的软件开发动态协作代理

摘要: 软件代理已经成为解决复杂软件工程任务的有希望的工具。然而，现有的作品通过遵循瀑布模型过于简化软件开发工作流程。因此，我们提出了AgileCoder，这是一个将敏捷方法论（AM）整合到框架中的多代理系统。该系统将特定的AM角色（如产品经理、开发人员和测试人员）分配给不同的代理，然后这些代理根据用户输入进行协作开发软件。AgileCoder通过将工作组织成冲刺，并专注于通过冲刺逐步开发软件来提高开发效率。此外，我们引入了动态代码图生成器，这是一个模块，随着对代码库的更新而动态创建代码依赖图。这使代理能够更好地理解代码库，从而在整个软件开发过程中进行更精确的代码生成和修改。AgileCoder超越了现有的基准，如ChatDev和MetaGPT，建立了新的标准，并展示了在先进的软件工程环境中多代理系统的能力。我们的源代码可以在https://github.com/FSoft-AI4Code/AgileCoder找到。

更新时间: 2024-06-16 17:57:48

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2406.11912v1

Boosting Medical Image Classification with Segmentation Foundation Model

The Segment Anything Model (SAM) exhibits impressive capabilities in zero-shot segmentation for natural images. Recently, SAM has gained a great deal of attention for its applications in medical image segmentation. However, to the best of our knowledge, no studies have shown how to harness the power of SAM for medical image classification. To fill this gap and make SAM a true "foundation model" for medical image analysis, it is highly desirable to customize SAM specifically for medical image classification. In this paper, we introduce SAMAug-C, an innovative augmentation method based on SAM for augmenting classification datasets by generating variants of the original images. The augmented datasets can be used to train a deep learning classification model, thereby boosting the classification performance. Furthermore, we propose a novel framework that simultaneously processes raw and SAMAug-C augmented image input, capitalizing on the complementary information that is offered by both. Experiments on three public datasets validate the effectiveness of our new approach.

Updated: 2024-06-16 17:54:49

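Title: Boosting Medical Image Classification with a Segmentation Foundation Model

Abstract: The Segment Anything Model (SAM) shows impressive zero-shot segmentation capabilities on natural images. Recently, SAM has attracted considerable attention for its applications in medical image segmentation. To the best of our knowledge, however, no study has shown how to harness SAM for medical image classification. To fill this gap and make SAM a true "foundation model" for medical image analysis, it is highly desirable to customize SAM specifically for medical image classification. In this paper, we introduce SAMAug-C, an innovative SAM-based augmentation method that augments classification datasets by generating variants of the original images. The augmented datasets can be used to train a deep learning classification model, thereby improving classification performance. Furthermore, we propose a new framework that processes raw and SAMAug-C augmented image inputs simultaneously, making full use of the complementary information the two provide. Experiments on three public datasets validate the effectiveness of our new approach.

Updated: 2024-06-16 17:54:49

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2406.11026v1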

Physics-Informed Deep Learning and Partial Transfer Learning for Bearing Fault Diagnosis in the Presence of Highly Missing Data

One of the most significant obstacles in bearing fault diagnosis is a lack of labeled data for various fault types. Also, sensor-acquired data frequently lack labels and have a large amount of missing data. This paper tackles these issues by presenting the PTPAI method, which uses a physics-informed deep learning-based technique to generate synthetic labeled data. Labeled synthetic data makes up the source domain, whereas unlabeled data with missing data is present in the target domain. Consequently, imbalanced class problems and partial-set fault diagnosis hurdles emerge. To address these challenges, the RF-Mixup approach is used to handle imbalanced classes. As domain adaptation strategies, the MK-MMSD and CDAN are employed to mitigate the disparity in distribution between synthetic and actual data. Furthermore, the partial-set challenge is tackled by applying weighting methods at the class and instance levels. Experimental outcomes on the CWRU and JNU datasets indicate that the proposed approach effectively addresses these problems.

Updated: 2024-06-16 17:36:53

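Title: Physics-Informed Deep Learning and Partial Transfer Learning for Bearing Fault Diagnosis with Highly Missing Data

Abstract: One of the most significant obstacles in bearing fault diagnosis is the lack of labeled data for the various fault types. In addition, sensor-acquired data often lack labels and contain a large amount of missing data. This paper addresses these issues by proposing the PTPAI method, which uses a physics-informed deep learning technique to generate synthetic labeled data. The labeled synthetic data form the source domain, while unlabeled data with missing values make up the target domain. As a result, class imbalance and partial-set fault diagnosis difficulties arise. To address these challenges, the RF-Mixup approach is used to handle the imbalanced classes. As domain adaptation strategies, MK-MMSD and CDAN are employed to reduce the distribution gap between synthetic and real data. Furthermore, the partial-set challenge is tackled by applying weighting methods at the class and instance levels. Experimental results on the CWRU and JNU datasets show that the proposed approach effectively addresses these problems.

Updated: 2024-06-16 17:36:53

Categories: eess.SP,cs.AI,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2406.11023v1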

Benign overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Inductive Bias

Recent advances in machine learning have inspired a surge of research into reconstructing specific quantities of interest from measurements that comply with certain physical laws. These efforts focus on inverse problems that are governed by partial differential equations (PDEs). In this work, we develop an asymptotic Sobolev norm learning curve for kernel ridge(less) regression when addressing (elliptical) linear inverse problems. Our results show that the PDE operators in the inverse problem can stabilize the variance and even lead to benign overfitting for fixed-dimensional problems, a behavior that differs from that of regression problems. In addition, our investigation demonstrates the impact of various inductive biases introduced by minimizing different Sobolev norms as a form of implicit regularization. For the regularized least squares estimator, we find that all considered inductive biases can achieve the optimal convergence rate, provided the regularization parameter is appropriately chosen. The convergence rate is in fact independent of the choice of (smooth enough) inductive bias for both ridge and ridgeless regression. Surprisingly, our smoothness requirement recovers the condition found in the Bayesian setting and extends the conclusion to minimum norm interpolation estimators.

Updated: 2024-06-16 17:34:27

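Title: Benign Overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Inductive Bias

Abstract: Recent advances in machine learning have inspired a surge of research into reconstructing specific quantities of interest from measurements that obey certain physical laws. These efforts focus on inverse problems governed by partial differential equations (PDEs). In this work, we develop an asymptotic Sobolev norm learning curve for kernel ridge(less) regression applied to (elliptical) linear inverse problems. Our results show that the PDE operators in the inverse problem can stabilize the variance and even lead to benign overfitting for fixed-dimensional problems, a behavior that differs from that of regression problems. In addition, our study shows the impact of the various inductive biases introduced by minimizing different Sobolev norms, which act as a form of implicit regularization. For the regularized least squares estimator, we find that all of the inductive biases considered can achieve the optimal convergence rate, provided the regularization parameter is chosen appropriately. For both ridge and ridgeless regression, the convergence rate is in fact independent of the choice of (sufficiently smooth) inductive bias. Surprisingly, our smoothness requirement recovers the condition found in the Bayesian setting and extends the conclusion to minimum norm interpolation estimators.

Updated: 2024-06-16 17:34:27

Categories: stat.ML,cs.IT,cs.LG,cs.NA,math.IT,math.NA,math.ST,stat.TH

Download: http://arxiv.org/abs/2406.09194v2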

Invariant Probabilistic Prediction

In recent years, there has been a growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable given covariates. Within a causality-inspired framework, we investigate the invariance and robustness of probabilistic predictions with respect to proper scoring rules. We show that arbitrary distribution shifts do not, in general, admit invariant and robust probabilistic predictions, in contrast to the setting of point prediction. We illustrate how to choose evaluation metrics and restrict the class of distribution shifts to allow for identifiability and invariance in the prototypical Gaussian heteroscedastic linear model. Motivated by these findings, we propose a method to yield invariant probabilistic predictions, called IPP, and study the consistency of the underlying parameters. Finally, we demonstrate the empirical performance of our proposed procedure on simulated as well as on single-cell data.
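
To make the notion of a proper scoring rule concrete, the sketch below evaluates a Gaussian predictive distribution with two standard rules, the logarithmic score and the closed-form continuous ranked probability score (CRPS). The `predict` function and its parameters `beta` and `gamma` are hypothetical stand-ins for a heteroscedastic linear model and are not taken from the paper; the IPP method itself is not implemented here.

```python
import math


def gaussian_log_score(y, mu, sigma):
    """Negative log-density of N(mu, sigma^2) at y (lower is better)."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5 * z * z


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) at y (lower is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


def predict(x, beta=1.0, gamma=0.5):
    """Hypothetical heteroscedastic linear model: mean and scale both depend on x."""
    mu = beta * x
    sigma = math.exp(gamma * x)  # log-linear scale keeps sigma positive
    return mu, sigma


# Score one prediction against an observed outcome.
mu, sigma = predict(0.0)
log_score = gaussian_log_score(0.0, mu, sigma)
crps = gaussian_crps(0.0, mu, sigma)
```

Both rules reward a forecast for placing probability mass near the outcome while penalizing over- or under-dispersed scales, which is why the choice of rule matters when the scale is modeled together with the mean.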

Updated: 2024-06-16 17:32:09

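Title: Invariant Probabilistic Prediction

Abstract: In recent years, statistical methods that perform robustly under distribution changes between training and test data have attracted growing interest. While most related research focuses on point predictions with the squared error loss, this article turns to probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable given covariates. Within a causality-inspired framework, we study the invariance and robustness of probabilistic predictions with respect to proper scoring rules. We show that arbitrary distribution shifts do not, in general, admit invariant and robust probabilistic predictions, in contrast to the point prediction setting. We illustrate how to choose evaluation metrics and restrict the class of distribution shifts so as to obtain identifiability and invariance in the prototypical Gaussian heteroscedastic linear model. Motivated by these findings, we propose a method that yields invariant probabilistic predictions, called IPP, and study the consistency of the underlying parameters. Finally, we demonstrate the empirical performance of the proposed procedure on simulated data as well as on single-cell data.

Updated: 2024-06-16 17:32:09

Categories: stat.ME,cs.LG,stat.ML

Download: http://arxiv.org/abs/2309.10083v2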

Optimized Speculative Sampling for GPU Hardware Accelerators

In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. Additionally, we use fast on-chip memory to store intermediate results, thereby minimizing the frequency of slow read and write operations across different types of memory. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a slight decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
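
For readers unfamiliar with the technique being optimized, here is a minimal CPU sketch of the accept/reject step in speculative sampling as it is commonly formulated in the literature: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized positive part of p - q. The function names are illustrative, and this is not the paper's GPU kernel or its sigmoid approximation. The element-wise operations over the vocabulary are the kind of work that can be split across threads.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p and q coincide, so a rejection cannot occur; fall back to p.
        return list(p)
    return [r / total for r in raw]


def speculative_step(p, q, drafted, rng):
    """Accept the drafted token with probability min(1, p/q), else resample.

    p: target-model probabilities, q: draft-model probabilities,
    drafted: index of the token sampled from q (so q[drafted] > 0).
    """
    accept_prob = min(1.0, p[drafted] / q[drafted])
    if rng.random() < accept_prob:
        return drafted, True
    residual = residual_distribution(p, q)
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False


# Toy three-token vocabulary; the probabilities are made up for illustration.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
token, accepted = speculative_step(p, q, drafted=1, rng=random.Random(0))
```

In this toy case the residual distribution puts all of its mass on token 0, so a rejected draft of token 1 is always replaced by token 0.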

Updated: 2024-06-16 17:19:23

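Title: Optimized Speculative Sampling for GPU Hardware Accelerators

Abstract: In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We observe that large portions of the intermediate matrices needed for speculative sampling can be computed concurrently. This lets us distribute the workload across multiple GPU threads, so that matrix segments within thread blocks are processed simultaneously. In addition, we use fast on-chip memory to store intermediate results, minimizing the frequency of slow read and write operations across different types of memory. This improves profiling time by 6% to 13% relative to the baseline implementation without affecting accuracy. To accelerate speculative sampling further, probability distributions parameterized by softmax are approximated by sigmoid. This approximation yields considerably larger relative improvements in profiling time, ranging from 37% to 94%, with a slight decline in accuracy. We conduct extensive experiments on automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.

Updated: 2024-06-16 17:19:23

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2406.11016v1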

Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

To render each generated token in real-time for users, the Large Language Model (LLM) server generates tokens one by one and streams each token (or group of a few tokens) through the network to the user right after generation, which we refer to as LLM token streaming. However, under unstable network conditions, the LLM token streaming experience could suffer greatly from stalls since one packet loss could block the rendering of later tokens even if the packets containing them arrive on time. With a measurement study, we show that current applications suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM Chatbots that differs from previous multimedia and text applications, we propose a novel transmission scheme, called Eloquent, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and, in the meantime, is independently rendered when received, avoiding the aforementioned stalls caused by missing packets. Through simulation under various networks, we show Eloquent reduces stall ratio (proportion of token rendering wait time) by 71.0% compared to the retransmission method commonly used by real chatbot applications and by 31.6% compared to the baseline packet duplication scheme. By tailoring Eloquent to fit the token-by-token generation of LLM, we enable the Chatbots to respond like an eloquent speaker for users to better enjoy pervasive AI.
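
The packet-building idea described above can be sketched as follows. This is a toy model under stated assumptions, not the authors' implementation: the class names, the cumulative acknowledgement in `on_ack`, and the in-memory "network" are all hypothetical. It only shows why a packet that carries every unacknowledged token can be rendered on arrival even when an earlier packet was lost.

```python
class Sender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = {}  # seq -> token, kept until acknowledged

    def make_packet(self, new_tokens):
        """Return a packet holding every unacknowledged token plus the new ones."""
        for tok in new_tokens:
            self.unacked[self.next_seq] = tok
            self.next_seq += 1
        return sorted(self.unacked.items())

    def on_ack(self, highest_contiguous_seq):
        """Drop tokens the receiver has confirmed (cumulative acknowledgement)."""
        for seq in [s for s in self.unacked if s <= highest_contiguous_seq]:
            del self.unacked[seq]


class Receiver:
    def __init__(self):
        self.tokens = {}          # seq -> token
        self.rendered_upto = -1   # highest sequence number rendered so far

    def receive(self, packet):
        """Store the tokens, then render the longest contiguous prefix."""
        for seq, tok in packet:
            self.tokens.setdefault(seq, tok)
        while self.rendered_upto + 1 in self.tokens:
            self.rendered_upto += 1
        return "".join(self.tokens[s] for s in range(self.rendered_upto + 1))


sender, receiver = Sender(), Receiver()
p1 = sender.make_packet(["Hello"])
first = receiver.receive(p1)
sender.on_ack(receiver.rendered_upto)
p2 = sender.make_packet([",", " world"])  # assume this packet is lost in transit
p3 = sender.make_packet(["!"])            # still carries the tokens from p2
text = receiver.receive(p3)
print(text)
```

The cost of this design is redundancy: every packet grows with the number of outstanding tokens, which is the trade-off against waiting for a retransmission.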

Updated: 2024-06-16 17:17:41

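Title: Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

Abstract: To render each generated token for users in real time, the Large Language Model (LLM) server generates tokens one by one and streams each token (or small group of tokens) over the network to the user, a process referred to as LLM token streaming. Under unstable network conditions, however, the LLM token streaming experience can suffer badly from stalls, because a single lost packet can block the rendering of later tokens even if the packets containing them arrive on time. Through a measurement study, we show that current applications suffer more stalls under unstable networks. For this emerging token streaming problem in LLM chatbots, which differs from earlier multimedia and text applications, we propose a novel transmission scheme called Eloquent, which places newly generated tokens together with currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and can be rendered independently on receipt, avoiding the stalls caused by missing packets. Through simulation under various network conditions, we show that Eloquent reduces the stall ratio (the proportion of token rendering wait time) by 71.0% compared with the retransmission method commonly used by real chatbot applications and by 31.6% compared with the baseline packet duplication scheme. By tailoring Eloquent to the token-by-token generation of LLMs, we enable chatbots to respond like an eloquent speaker, so that users can better enjoy pervasive AI.

Updated: 2024-06-16 17:17:41

Categories: cs.NI,cs.LG

Download: http://arxiv.org/abs/2401.12961v2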

Latent Communication in Artificial Neural Networks

As NNs permeate various scientific and industrial domains, understanding the universality and reusability of their representations becomes crucial. At their core, these networks create intermediate neural representations of the input data, referred to as latent spaces, and subsequently leverage them to perform specific downstream tasks. This dissertation focuses on the universality and reusability of neural representations. Do the latent representations crafted by a NN remain exclusive to a particular trained instance, or can they generalize across models, adapting to factors such as randomness during training, model architecture, or even data domain? This adaptive quality introduces the notion of Latent Communication -- a phenomenon that describes when representations can be unified or reused across neural spaces. A salient observation from our research is the emergence of similarities in latent representations, even when these originate from distinct or seemingly unrelated NNs. By exploiting a partial correspondence between the two data distributions that establishes a semantic link, we found that these representations can either be projected into a universal representation, coined as Relative Representation, or be directly translated from one space to another. Latent Communication allows for a bridge between independently trained NNs, irrespective of their training regimen, architecture, or the data modality they were trained on -- as long as the semantic content of the data stays the same (e.g., images and their captions). This holds true for generation, classification, and retrieval downstream tasks; in supervised, weakly supervised, and unsupervised settings; and spans various data modalities including images, text, audio, and graphs -- showcasing the universality of the Latent Communication phenomenon. [...]
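
The abstract names Relative Representation without defining it. One common formulation, assumed here, describes each sample by its cosine similarity to a fixed set of anchor samples. The sketch below uses a 2-D rotation as a stand-in for "a second model whose latent space is oriented differently" and shows that the relative coordinates agree across the two spaces. The anchors, the sample, and the rotation angle are all illustrative values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(x, anchors):
    """Describe x by its cosine similarity to each anchor embedding."""
    return [cosine(x, a) for a in anchors]


def rotate(v, theta):
    """Rotate a 2-D vector; stands in for another model's latent space."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


# Latent space A: three anchor embeddings and one sample.
anchors_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_a = [2.0, 1.0]

# Latent space B: the same points seen through an angle-preserving transform.
theta = 0.7
anchors_b = [rotate(a, theta) for a in anchors_a]
x_b = rotate(x_a, theta)

rel_a = relative_representation(x_a, anchors_a)
rel_b = relative_representation(x_b, anchors_b)
```

The absolute coordinates `x_a` and `x_b` differ, but `rel_a` and `rel_b` match up to floating-point error, because cosine similarity is unchanged by rotations. Real networks are only approximately related by such transforms, so the match in practice is approximate.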

Updated: 2024-06-16 17:13:58

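Title: Latent Communication in Artificial Neural Networks

Abstract: As neural networks permeate various scientific and industrial domains, understanding the universality and reusability of their representations becomes crucial. At their core, these networks create intermediate neural representations of the input data, referred to as latent spaces, and then use them to perform specific downstream tasks. This dissertation focuses on the universality and reusability of neural representations. Are the latent representations produced by a neural network exclusive to a particular trained instance, or can they generalize across models, adapting to factors such as randomness during training, model architecture, or even data domain? This adaptive quality introduces the notion of Latent Communication -- a phenomenon describing when representations can be unified or reused across neural spaces. A salient observation from our research is that similarities emerge in latent representations even when they come from distinct or seemingly unrelated neural networks. By exploiting a partial correspondence between the two data distributions that establishes a semantic link, we found that these representations can either be projected into a universal representation, called Relative Representation, or be translated directly from one space to another. Latent Communication builds a bridge between independently trained neural networks, regardless of their training regimen, architecture, or the data modality they were trained on -- as long as the semantic content of the data stays the same (e.g., images and their captions). This holds for generation, classification, and retrieval downstream tasks; in supervised, weakly supervised, and unsupervised settings; and across data modalities including images, text, audio, and graphs -- showcasing the universality of the Latent Communication phenomenon. [...]

Updated: 2024-06-16 17:13:58

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.11014v1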

Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 8% of the games. Compared to GPT-4o, novice and expert players perform better, with expert human players significantly outperforming GPT-4o. To deepen our understanding we create a taxonomy of the knowledge types required to successfully categorize words in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.

Updated: 2024-06-16 17:10:32

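Title: Connecting the Dots: Evaluating the Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

Abstract: The New York Times Connections game has become a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can fully solve only 8% of the games. Novice and expert players both perform better than GPT-4o, with expert human players significantly outperforming it. To deepen our understanding, we create a taxonomy of the knowledge types required to categorize words successfully in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.

Updated: 2024-06-16 17:10:32

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.11012v1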

Data Shapley in One Training Run

Generative artificial intelligence (AI) systems are trained on large data corpora to generate new pieces of text, images, videos, and other media. There is growing concern that such systems may infringe on the copyright interests of training data contributors. To address the copyright challenges of generative AI, we propose a framework that compensates copyright owners proportionally to their contributions to the creation of AI-generated content. The metric for contributions is quantitatively determined by leveraging the probabilistic nature of modern generative AI models and using techniques from cooperative game theory in economics. This framework enables a platform where AI developers benefit from access to high-quality training data, thus improving model performance. Meanwhile, copyright owners receive fair compensation, driving the continued provision of relevant data for generative model training. Experiments demonstrate that our framework successfully identifies the most relevant data sources used in artwork generation, ensuring a fair and interpretable distribution of revenues among copyright owners.

Updated: 2024-06-16 17:09:24

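Title: Data Shapley in One Training Run

Abstract: Generative artificial intelligence (AI) systems are trained on large data corpora to generate new text, images, videos, and other media. There is growing concern that such systems may infringe on the copyright interests of training data contributors. To address the copyright challenges of generative AI, we propose a framework that compensates copyright owners in proportion to their contributions to the creation of AI-generated content. The contribution metric is determined quantitatively by leveraging the probabilistic nature of modern generative AI models and using techniques from cooperative game theory in economics. The framework enables a platform in which AI developers benefit from access to high-quality training data, thereby improving model performance. Meanwhile, copyright owners receive fair compensation, which encourages the continued provision of relevant data for generative model training. Experiments show that our framework successfully identifies the most relevant data sources used in artwork generation, ensuring a fair and interpretable distribution of revenues among copyright owners.

Updated: 2024-06-16 17:09:24

Categories: cs.LG,cs.CL,stat.ML

Download: http://arxiv.org/abs/2406.11011v1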

chainBoost: A Secure Performance Booster for Blockchain-based Resource Markets

Cryptocurrencies and blockchain technology provide an innovative model for reshaping digital services. Driven by the movement toward Web 3.0, recent systems started to provide distributed services, such as computation outsourcing or file storage, on top of the currency exchange medium. By allowing anyone to join and collect payments for serving others, these systems create decentralized markets for trading digital resources. Yet, there is still a big gap between the promise of these markets and their practical viability. Existing initiatives are still early-stage and have already encountered security and efficiency obstacles. At the same time, existing work around promising ideas, specifically sidechains, falls short of exploiting their full potential in addressing these problems. To bridge this gap, we propose chainBoost, a secure performance booster for decentralized resource markets. It expedites service-related operations, reduces the blockchain size, and supports flexible service-payment exchange modalities at low overhead. At its core, chainBoost employs a sidechain that has a (security and semantic) mutual dependence with the mainchain, to which the system offloads heavy/frequent operations. To enable it, we develop a novel sidechain architecture composed of temporary and permanent blocks, a block suppression mechanism to prune the sidechain, a syncing protocol to permit arbitrary data exchange between the two chains, and an autorecovery protocol to support robustness and resilience. We analyze the security of chainBoost, and implement a proof-of-concept prototype for a distributed file storage market as a use case. For a market handling around 2000 transactions per round, our experiments show up to 11x improvement in throughput and 94% reduction in confirmation time. They also show that chainBoost can reduce the main blockchain size by around 90%.

Updated: 2024-06-16 17:02:53

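Title: chainBoost: A Secure Performance Booster for Blockchain-based Resource Markets

Abstract: Cryptocurrencies and blockchain technology provide an innovative model for reshaping digital services. Driven by the movement toward Web 3.0, recent systems have started to provide distributed services, such as computation outsourcing or file storage, on top of the currency exchange medium. By allowing anyone to join and collect payments for serving others, these systems create decentralized markets for trading digital resources. Yet a large gap remains between the promise of these markets and their practical viability. Existing initiatives are still at an early stage and have already run into security and efficiency obstacles. At the same time, existing work on promising ideas, sidechains in particular, falls short of exploiting their full potential for solving these problems. To bridge this gap, we propose chainBoost, a secure performance booster for decentralized resource markets. It speeds up service-related operations, reduces the blockchain size, and supports flexible service-payment exchange modalities at low overhead. At its core, chainBoost employs a sidechain that has a (security and semantic) mutual dependence with the mainchain, and the system offloads heavy or frequent operations to it. To make this possible, we develop a novel sidechain architecture composed of temporary and permanent blocks, a block suppression mechanism that prunes the sidechain, a syncing protocol that permits arbitrary data exchange between the two chains, and an autorecovery protocol that supports robustness and resilience. We analyze the security of chainBoost and implement a proof-of-concept prototype for a distributed file storage market as a use case. For a market handling around 2000 transactions per round, our experiments show up to an 11x improvement in throughput and a 94% reduction in confirmation time. They also show that chainBoost can reduce the main blockchain size by around 90%.

Updated: 2024-06-16 17:02:53

Categories: cs.CR

Download: http://arxiv.org/abs/2402.16095v3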

WeShap: Weak Supervision Source Evaluation with Shapley Values

Efficient data annotation stands as a significant bottleneck in training contemporary machine learning models. The Programmatic Weak Supervision (PWS) pipeline presents a solution by utilizing multiple weak supervision sources to automatically label data, thereby expediting the annotation process. Given the varied contributions of these weak supervision sources to the accuracy of PWS, it is imperative to employ a robust and efficient metric for their evaluation. This is crucial not only for understanding the behavior and performance of the PWS pipeline but also for facilitating corrective measures. In our study, we introduce WeShap values as an evaluation metric, which quantifies the average contribution of weak supervision sources within a proxy PWS pipeline, leveraging the theoretical underpinnings of Shapley values. We demonstrate efficient computation of WeShap values using dynamic programming, achieving quadratic computational complexity relative to the number of weak supervision sources. Our experiments demonstrate the versatility of WeShap values across various applications, including the identification of beneficial or detrimental labeling functions, refinement of the PWS pipeline, and rectification of mislabeled data. Furthermore, WeShap values aid in comprehending the behavior of the PWS pipeline and scrutinizing specific instances of mislabeled data. Although initially derived from a specific proxy PWS pipeline, we empirically demonstrate the generalizability of WeShap values to other PWS pipeline configurations. Our findings indicate a noteworthy average improvement of 4.8 points in downstream model accuracy through the revision of the PWS pipeline compared to previous state-of-the-art methods, underscoring the efficacy of WeShap values in enhancing data quality for training machine learning models.
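
As background for the Shapley-based metric, the sketch below computes exact Shapley values for three weak supervision sources by brute force over all orderings. Everything in it is illustrative: the votes, the labels, and the majority-vote `utility` are made-up stand-ins, not the paper's proxy pipeline, and the factorial-time enumeration is precisely what the dynamic programming approach described above avoids.

```python
from itertools import permutations

# Toy data: each labeling function votes +1, -1, or 0 (abstain) on four examples.
VOTES = {
    "lf_a": [1, 1, 0, -1],
    "lf_b": [1, -1, -1, 0],
    "lf_c": [0, 1, -1, -1],
}
TRUE_LABELS = [1, 1, -1, -1]


def utility(sources):
    """Accuracy of an unweighted majority vote; ties and abstentions count as wrong."""
    correct = 0
    for i, y in enumerate(TRUE_LABELS):
        total = sum(VOTES[s][i] for s in sources)
        if total * y > 0:
            correct += 1
    return correct / len(TRUE_LABELS)


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = []
        for p in order:
            before = value(coalition)
            coalition.append(p)
            totals[p] += value(coalition) - before
    return {p: t / len(orderings) for p, t in totals.items()}


values = shapley_values(list(VOTES), utility)
```

A useful sanity check is the efficiency property: the values sum to the utility of the full set minus the utility of the empty set. A source with a negative value would be a candidate for removal or revision.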

Updated: 2024-06-16 17:02:27

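Title: WeShap: Weak Supervision Source Evaluation with Shapley Values

Abstract: Efficient data annotation is a significant bottleneck in training contemporary machine learning models. The Programmatic Weak Supervision (PWS) pipeline offers a solution by using multiple weak supervision sources to label data automatically, thereby speeding up annotation. Because these weak supervision sources contribute differently to the accuracy of PWS, a robust and efficient metric is needed to evaluate them. This is crucial not only for understanding the behavior and performance of the PWS pipeline but also for enabling corrective measures. In our study, we introduce WeShap values as an evaluation metric that quantifies the average contribution of weak supervision sources within a proxy PWS pipeline, building on the theoretical foundations of Shapley values. We compute WeShap values efficiently with dynamic programming, achieving computational complexity that is quadratic in the number of weak supervision sources. Our experiments demonstrate the versatility of WeShap values across applications, including identifying beneficial or detrimental labeling functions, refining the PWS pipeline, and correcting mislabeled data. WeShap values also help in understanding the behavior of the PWS pipeline and in examining specific instances of mislabeled data. Although they were originally derived from a specific proxy PWS pipeline, we show empirically that WeShap values generalize to other PWS pipeline configurations. Our findings show that revising the PWS pipeline yields a notable average improvement of 4.8 points in downstream model accuracy over previous state-of-the-art methods, underscoring the effectiveness of WeShap values in improving data quality for training machine learning models.

Updated: 2024-06-16 17:02:27

Categories: cs.LG,cs.GT

Download: http://arxiv.org/abs/2406.11010v1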

Unclonable Secret Sharing

Unclonable cryptography utilizes the principles of quantum mechanics to address cryptographic tasks that are impossible classically. We introduce a novel unclonable primitive in the context of secret sharing, called unclonable secret sharing (USS). In a USS scheme, there are $n$ shareholders, each holding a share of a classical secret represented as a quantum state. They can recover the secret once all parties (or at least $t$ parties) come together with their shares. Importantly, it should be infeasible for the shareholders to copy their own shares and send the copies to two non-communicating parties in a way that enables both of them to recover the secret. Our work initiates a formal investigation into the realm of unclonable secret sharing, shedding light on its implications, constructions, and inherent limitations. Connections: We explore the connections between USS and other quantum cryptographic primitives such as unclonable encryption and position verification, showing the difficulties of achieving USS in different scenarios. Limited entanglement: In the case where the adversarial shareholders share no entanglement or only limited entanglement, we demonstrate information-theoretic constructions for USS. Large entanglement: If we allow the adversarial shareholders to have unbounded entanglement resources (and unbounded computation), we prove that unclonable secret sharing is impossible. On the other hand, in the quantum random oracle model where the adversary can only make a bounded polynomial number of queries, we show a construction that is secure even with unbounded entanglement. Furthermore, even when these adversaries possess only a polynomial amount of entanglement resources, we establish that any unclonable secret sharing scheme with a reconstruction function implementable using Cliffords and logarithmically many T-gates is also unattainable.

Updated: 2024-06-16 16:50:15

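Title: Unclonable Secret Sharing

Abstract: Unclonable cryptography uses the principles of quantum mechanics to address cryptographic tasks that are impossible classically. We introduce a novel unclonable primitive in the context of secret sharing, called unclonable secret sharing (USS). In a USS scheme, there are $n$ shareholders, each holding a share of a classical secret represented as a quantum state. They can recover the secret once all parties (or at least $t$ parties) come together with their shares. Importantly, it should be infeasible for them to copy their own shares and send the copies to two non-communicating parties in a way that lets both recover the secret. Our work begins a formal investigation of unclonable secret sharing, clarifying its implications, constructions, and inherent limitations. Connections: We explore the connections between USS and other quantum cryptographic primitives, such as unclonable encryption and position verification, showing the difficulties of achieving USS in different scenarios. Limited entanglement: When the adversarial shareholders share no entanglement or only limited entanglement, we give information-theoretic constructions for USS. Large entanglement: If the adversarial shareholders are allowed unbounded entanglement resources (and unbounded computation), we prove that unclonable secret sharing is impossible. On the other hand, in the quantum random oracle model, where the adversary can make only a bounded polynomial number of queries, we show a construction that is secure even with unbounded entanglement. Furthermore, even when these adversaries possess only a polynomial amount of entanglement resources, we establish that any unclonable secret sharing scheme whose reconstruction function can be implemented using Cliffords and logarithmically many T-gates is also unattainable.

Updated: 2024-06-16 16:50:15

Categories: quant-ph,cs.CR

Download: http://arxiv.org/abs/2406.11008v1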

Data Science Education in Undergraduate Physics: Lessons Learned from a Community of Practice

It is becoming increasingly important that physics educators equip their students with the skills to work with data effectively. However, many educators may lack the necessary training and expertise in data science to teach these skills. To address this gap, we created the Data Science Education Community of Practice (DSECOP), bringing together graduate students and physics educators from different institutions and backgrounds to share best practices and lessons learned from integrating data science into undergraduate physics education. In this article we present insights and experiences from this community of practice, highlighting key strategies and challenges in incorporating data science into the introductory physics curriculum. Our goal is to provide guidance and inspiration to educators who seek to integrate data science into their teaching, helping to prepare the next generation of physicists for a data-driven world.

Updated: 2024-06-16 16:47:56

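Title: Data Science Education in Undergraduate Physics: Lessons Learned from a Community of Practice

Abstract: It is increasingly important for physics educators to equip their students with the skills to work with data effectively. However, many educators may lack the data science training and expertise needed to teach these skills. To close this gap, we created the Data Science Education Community of Practice (DSECOP), which brings together graduate students and physics educators from different institutions and backgrounds to share best practices and lessons learned from integrating data science into undergraduate physics education. In this article, we present insights and experiences from this community of practice, highlighting key strategies and challenges in incorporating data science into the introductory physics curriculum. Our goal is to provide guidance and inspiration to educators who want to integrate data science into their teaching, helping to prepare the next generation of physicists for a data-driven world.

Updated: 2024-06-16 16:47:56

Categories: physics.ed-ph,cs.LG,physics.data-an

Download: http://arxiv.org/abs/2403.00961v2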

A Notion of Complexity for Theory of Mind via Discrete World Models

Theory of Mind (ToM) can be used to assess the capabilities of Large Language Models (LLMs) in complex scenarios where social reasoning is required. While the research community has proposed many ToM benchmarks, their hardness varies greatly, and their complexity is not well defined. This work proposes a framework to measure the complexity of ToM tasks. We quantify a problem's complexity as the number of states necessary to solve it correctly. Our complexity measure also accounts for spurious states of a ToM problem designed to make it apparently harder. We use our method to assess the complexity of five widely adopted ToM benchmarks. On top of this framework, we design a prompting technique that augments the information available to a model with a description of how the environment changes with the agents' interactions. We name this technique Discrete World Models (DWM) and show how it elicits superior performance on ToM tasks.

Updated: 2024-06-16 16:46:55

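Title: A Notion of Complexity for Theory of Mind via Discrete World Models

Abstract: Theory of Mind (ToM) can be used to assess the capabilities of Large Language Models (LLMs) in complex scenarios that require social reasoning. Although the research community has proposed many ToM benchmarks, their difficulty varies greatly and their complexity is not well defined. This work proposes a framework for measuring the complexity of ToM tasks. We quantify a problem's complexity as the number of states needed to solve it correctly. Our complexity measure also accounts for the spurious states that are built into a ToM problem to make it appear harder. We use our method to assess the complexity of five widely adopted ToM benchmarks. On top of this framework, we design a prompting technique that augments the information available to a model with a description of how the environment changes as the agents interact. We name this technique Discrete World Models (DWM) and show that it elicits superior performance on ToM tasks.

Updated: 2024-06-16 16:46:55

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.11911v1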

Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications

The advent of Large Language Models (LLMs) has revolutionized various applications by providing advanced natural language processing capabilities. However, this innovation introduces new cybersecurity challenges. This paper explores the threat modeling and risk analysis specifically tailored for LLM-powered applications. Focusing on potential attacks like data poisoning, prompt injection, SQL injection, jailbreaking, and compositional injection, we assess their impact on security and propose mitigation strategies. We introduce a framework combining STRIDE and DREAD methodologies for proactive threat identification and risk assessment. Furthermore, we examine the feasibility of an end-to-end threat model through a case study of a custom-built LLM-powered application. This model follows Shostack's Four Question Framework, adjusted for the unique threats LLMs present. Our goal is to propose measures that enhance the security of these powerful AI tools, thwarting attacks, and ensuring the reliability and integrity of LLM-integrated systems.
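
To illustrate how STRIDE categories and DREAD ratings can be combined into a ranked threat list, here is a minimal sketch. It assumes the common convention of rating each DREAD factor from 1 to 10 and averaging the five ratings; the paper's exact scale is not stated in the abstract. The ratings and the STRIDE category assigned to each attack are made-up placeholders, not results from the case study.

```python
from dataclasses import dataclass
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of service"
    ELEVATION_OF_PRIVILEGE = "Elevation of privilege"


@dataclass
class Threat:
    name: str
    stride: Stride
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def dread_score(self):
        """Average of the five DREAD ratings (assumed 1-10 scale)."""
        parts = [self.damage, self.reproducibility, self.exploitability,
                 self.affected_users, self.discoverability]
        if not all(1 <= p <= 10 for p in parts):
            raise ValueError("each DREAD rating must be between 1 and 10")
        return sum(parts) / len(parts)


# Placeholder ratings for two of the attacks named in the abstract.
threats = [
    Threat("data poisoning", Stride.TAMPERING, 9, 4, 4, 9, 3),
    Threat("prompt injection", Stride.TAMPERING, 8, 9, 7, 8, 9),
]
ranked = sorted(threats, key=lambda t: t.dread_score(), reverse=True)
```

Sorting by score gives a simple mitigation priority order, and keeping the STRIDE category alongside each entry records which security property the threat violates.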

Updated: 2024-06-16 16:43:58

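Title: Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications

Abstract: The advent of Large Language Models (LLMs) has transformed a wide range of applications by providing advanced natural language processing capabilities. However, this innovation introduces new cybersecurity challenges. This paper explores threat modeling and risk analysis tailored specifically to LLM-powered applications. Focusing on potential attacks such as data poisoning, prompt injection, SQL injection, jailbreaking, and compositional injection, we assess their impact on security and propose mitigation strategies. We introduce a framework that combines the STRIDE and DREAD methodologies for proactive threat identification and risk assessment. In addition, we examine the feasibility of an end-to-end threat model through a case study of a custom-built LLM-powered application. The model follows Shostack's Four Question Framework, adjusted for the unique threats that LLMs present. Our goal is to propose measures that strengthen the security of these powerful AI tools, thwart attacks, and ensure the reliability and integrity of LLM-integrated systems.

Updated: 2024-06-16 16:43:58

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2406.11007v1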

SPEAR: Receiver-to-Receiver Acoustic Neural Warping Field

We present SPEAR, a continuous receiver-to-receiver acoustic neural warping field for spatial acoustic effects prediction in an acoustic 3D space with a single stationary audio source. Unlike traditional source-to-receiver modelling methods that require prior knowledge of the space's acoustic properties to rigorously model audio propagation from source to receiver, we propose to predict by warping the spatial acoustic effects from one reference receiver position to another target receiver position, so that the warped audio essentially accommodates all spatial acoustic effects belonging to the target position. SPEAR can be trained with data that is much more readily accessible: we simply ask two robots to independently record spatial audio at different positions. We further theoretically prove that the warping field universally exists if and only if a single audio source is present. Three physical principles are incorporated to guide the SPEAR network design, making the learned warping field physically meaningful. We demonstrate SPEAR's superiority on synthetic, photo-realistic, and real-world datasets, showing the huge potential of SPEAR for various downstream robotic tasks.

Updated: 2024-06-16 16:40:26

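Title: SPEAR: Receiver-to-Receiver Acoustic Neural Warping Field

Abstract: We present SPEAR, a continuous receiver-to-receiver acoustic neural warping field for predicting spatial acoustic effects in an acoustic 3D space with a single stationary audio source. Unlike traditional source-to-receiver modelling methods, which need prior knowledge of the space's acoustic properties to rigorously model audio propagation from source to receiver, we propose to predict by warping the spatial acoustic effects from one reference receiver position to another target receiver position, so that the warped audio essentially contains all spatial acoustic effects belonging to the target position. SPEAR can be trained with data that is much easier to obtain: we simply ask two robots to record spatial audio independently at different positions. We further prove theoretically that the warping field universally exists if and only if a single audio source is present. Three physical principles are incorporated to guide the SPEAR network design, making the learned warping field physically meaningful. We demonstrate SPEAR's superiority on synthetic, photo-realistic, and real-world datasets, showing its great potential for various downstream robotic tasks.

Updated: 2024-06-16 16:40:26

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2406.11006v1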

On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Learning

The emergence of unsupervised word embeddings, pre-trained on very large monolingual text corpora, is at the core of the ongoing neural revolution in Natural Language Processing (NLP). Initially introduced for English, such pre-trained word embeddings quickly emerged for a number of other languages. Subsequently, there have been a number of attempts to align the embedding spaces across languages, which could enable a number of cross-language NLP applications. Performing the alignment using unsupervised cross-lingual learning (UCL) is especially attractive as it requires little data and often rivals supervised and semi-supervised approaches. Here, we analyze popular methods for UCL and we find that often their objectives are, intrinsically, versions of the Wasserstein-Procrustes problem. Hence, we devise an approach to solve Wasserstein-Procrustes in a direct way, which can be used to refine and to improve popular UCL methods such as iterative closest point (ICP), multilingual unsupervised and supervised embeddings (MUSE) and supervised Procrustes methods. Our evaluation experiments on standard datasets show sizable improvements over these approaches. We believe that our rethinking of the Wasserstein-Procrustes problem could enable further research, thus helping to develop better algorithms for aligning word embeddings across languages. Our code and instructions to reproduce the experiments are available at https://github.com/guillemram97/wp-hungarian.
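
The Wasserstein-Procrustes problem asks for a matching between two point sets and an orthogonal map that jointly minimize the squared distance between matched points. The sketch below solves a tiny 2-D instance exactly by trying every matching and computing the best rotation for each in closed form. It is an illustration under simplifying assumptions (rotations only, no reflections, and factorial-time enumeration), not the authors' algorithm, and the point sets are made up.

```python
import math
from itertools import permutations


def rotate(v, theta):
    """Rotate a 2-D point counter-clockwise by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def best_rotation(xs, ys):
    """Closed-form 2-D Procrustes: the angle that best rotates xs[i] onto ys[i]."""
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(xs, ys))
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(xs, ys))
    return math.atan2(cross, dot)


def cost(xs, ys, theta):
    """Sum of squared distances between rotated xs and ys."""
    total = 0.0
    for x, y in zip(xs, ys):
        rx, ry = rotate(x, theta)
        total += (rx - y[0]) ** 2 + (ry - y[1]) ** 2
    return total


def wasserstein_procrustes_bruteforce(xs, ys):
    """Return (cost, matching, angle) with the lowest squared error."""
    best = None
    for perm in permutations(range(len(ys))):
        matched = [ys[j] for j in perm]
        theta = best_rotation(xs, matched)
        c = cost(xs, matched, theta)
        if best is None or c < best[0]:
            best = (c, perm, theta)
    return best


# "Source" embeddings, and a rotated, shuffled copy standing in for a second language.
xs = [(1.0, 0.0), (0.0, 2.0), (-3.0, 1.0), (2.0, 2.0)]
order = [2, 0, 3, 1]
ys = [rotate(xs[i], 0.9) for i in order]

err, matching, angle = wasserstein_procrustes_bruteforce(xs, ys)
```

Here `matching[i]` is the index in `ys` assigned to `xs[i]`, and the recovered angle is the rotation that was applied. For realistic vocabulary sizes the enumeration is infeasible, which is why practical methods alternate between an assignment step and an orthogonal Procrustes step or use other relaxations.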

Updated: 2024-06-16 16:37:44

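Title: On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Learning

Abstract: The emergence of unsupervised word embeddings, pre-trained on very large monolingual text corpora, is at the core of the ongoing neural revolution in Natural Language Processing (NLP). Initially introduced for English, such pre-trained word embeddings quickly appeared for many other languages. Since then, there have been many attempts to align the embedding spaces across languages, which could enable a range of cross-language NLP applications. Performing the alignment with unsupervised cross-lingual learning (UCL) is especially attractive because it requires little data and often rivals supervised and semi-supervised approaches. Here, we analyze popular UCL methods and find that their objectives are often, in essence, versions of the Wasserstein-Procrustes problem. We therefore devise an approach that solves Wasserstein-Procrustes directly and can be used to refine and improve popular UCL methods such as iterative closest point (ICP), multilingual unsupervised and supervised embeddings (MUSE), and supervised Procrustes methods. Our evaluation experiments on standard datasets show sizable improvements over these approaches. We believe that our rethinking of the Wasserstein-Procrustes problem could enable further research and thus help develop better algorithms for aligning word embeddings across languages. Our code and instructions for reproducing the experiments are available at https://github.com/guillemram97/wp-hungarian.

Updated: 2024-06-16 16:37:44

Categories: cs.CL,cs.LG,stat.ML

Download: http://arxiv.org/abs/2007.09456v2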

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

Biases in LLMs can harm user experience and societal outcomes. Current bias mitigation methods such as RLHF usually rely on costly human feedback, lack transferability to other topics, and show poor performance. We find that informing the LLMs that their generated content is not generated by them and querying about potential biases greatly boosts their awareness and ability to mitigate biases. Based on this, we propose RLDF (Reinforcement Learning from Multi-role Debates as Feedback), replacing human feedback with AI for bias mitigation. RLDF engages LLMs in multi-role debates to expose biases and gradually reduces biases in each iteration using a ranking scoring mechanism. The dialogues are then used to create a dataset composed of both high-bias and low-bias instances to train the reward model in reinforcement learning. This dataset can be generated by the same LLM for self-reflection or by a superior LLM, accessed for example through an API, which guides the former in a teacher-student mode. Experimental results across different LLMs and types of bias show the effectiveness of our approach in bias mitigation.

Updated: 2024-06-16 16:34:42

标题: 从多角色辩论中强化学习作为LLMs偏见减轻的反馈

摘要: LLMs中的偏见可能会损害用户体验和社会结果。目前的偏见缓解方法，如RLHF通常依赖于昂贵的人类反馈，缺乏对其他主题的可转移性，并表现出较差的性能。我们发现，告知LLMs他们生成的内容并非由他们生成，并询问可能的偏见，极大地提高了他们的意识和减轻偏见的能力。基于此，我们提出了RLDF（从多角色辩论中获得反馈的强化学习），用AI取代人类反馈来缓解偏见。RLDF让LLMs参与多角色辩论，揭示偏见，并使用排名评分机制逐步减少每次迭代中的偏见。然后，对话被用来创建一个数据集，由高偏见和低偏见实例组成，用于训练强化学习中的奖励模型。这个数据集可以由相同的LLM生成用于自我反思，或者由一个像API这样的更高级的LLM指导前者以教师-学生模式。跨不同LLMs和偏见类型的实验结果显示了我们方法在偏见缓解方面的有效性。

更新时间: 2024-06-16 16:34:42

Categories: cs.AI

Download: http://arxiv.org/abs/2404.10160v4

Schrödinger Bridge with Quadratic State Cost is Exactly Solvable

The Schrödinger bridge is a diffusion process that steers a given distribution to another in a prescribed time while minimizing the effort to do so. It can be seen as the stochastic dynamical version of the optimal mass transport, and has growing applications in generative diffusion models and stochastic optimal control. In this work, we propose a regularized variant of the Schrödinger bridge with a quadratic state cost-to-go that incentivizes the optimal sample paths to stay close to a nominal level. Unlike the conventional Schrödinger bridge, the regularization induces a state-dependent rate of killing and creation of probability mass, and its solution requires determining the Markov kernel of a reaction-diffusion partial differential equation. We derive this Markov kernel in closed form. Our solution recovers the heat kernel in the vanishing regularization (i.e., diffusion without reaction) limit, thereby recovering the solution of the conventional Schrödinger bridge. Our results enable the use of dynamic Sinkhorn recursion for computing the Schrödinger bridge with a quadratic state cost-to-go, which would otherwise be challenging to use in this setting. We deduce properties of the new kernel and explain its connections with certain exactly solvable models in quantum mechanics.
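
To make the role of the Markov kernel in the Sinkhorn recursion concrete, here is a minimal discrete sketch. It is an illustration under stated assumptions, not the paper's method: it works on a small 1-D grid, uses the plain heat kernel (the vanishing-regularization limit mentioned above) for a diffusion with generator equal to the Laplacian, and the function names are hypothetical. The closed-form kernel derived in the paper is not reproduced here; it would take the place of `heat_kernel`.

```python
import math


def heat_kernel(xs, ys, t=1.0):
    """1-D heat kernel for d/dt = Laplacian: Gaussian with variance 2t."""
    norm = math.sqrt(4.0 * math.pi * t)
    return [[math.exp(-(x - y) ** 2 / (4.0 * t)) / norm for y in ys] for x in xs]


def sinkhorn(kernel, mu, nu, iters=500):
    """Fixed-point recursion for factors u, v with diag(u) K diag(v) matching both marginals."""
    n, m = len(mu), len(nu)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Rescale rows so the coupling's row sums equal mu.
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the coupling's column sums equal nu.
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

The recursion only needs the kernel as a positive matrix, which is why a closed-form kernel is what makes this style of computation available in a new setting.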

Updated: 2024-06-16 16:33:43

标题: 薛定谔桥与二次状态成本的确切可解形式

摘要: Schr\"odinger桥是一种扩散过程，可以在规定的时间内将给定的分布引导到另一个分布，同时最大限度地减少这种努力。它可以被看作是最优质量传输的随机动力学版本，并在生成式扩散模型和随机最优控制中有着越来越广泛的应用。在这项工作中，我们提出了Schr\"odinger桥的一个正则化变种，其中包含一个二次状态成本函数，鼓励最优样本路径保持接近一个名义水平。与传统的Schr\"odinger桥不同，正则化引起了一个依赖于状态的杀伤和概率质量的创造速率，并且其解决方案需要确定反应扩散偏微分方程的马尔可夫核。我们以闭合形式推导了这个马尔可夫核。我们的解决方案在逐渐消减正则化（即没有反应的扩散）的极限下恢复了热核，从而恢复了传统Schr\"odinger桥的解决方案。我们的结果使得能够使用动态Sinkhorn递归来计算具有二次状态成本函数的Schr\"odinger桥，否则在这种情况下将会很具有挑战性。我们推导了新核的性质，并解释了它与量子力学中某些确切可解模型的联系。

更新时间: 2024-06-16 16:33:43

Categories: math.OC,cs.LG,cs.SY,eess.SY,math-ph,math.MP,stat.ML

Download: http://arxiv.org/abs/2406.00503v3

3D Gaze Tracking for Studying Collaborative Interactions in Mixed-Reality Environments

This study presents a novel framework for 3D gaze tracking tailored for mixed-reality settings, aimed at enhancing joint attention and collaborative efforts in team-based scenarios. Conventional gaze tracking, often limited by monocular cameras and traditional eye-tracking apparatus, struggles with simultaneous data synchronization and analysis from multiple participants in group contexts. Our proposed framework leverages state-of-the-art computer vision and machine learning techniques to overcome these obstacles, enabling precise 3D gaze estimation without dependence on specialized hardware or complex data fusion. Utilizing facial recognition and deep learning, the framework achieves real-time tracking of gaze patterns across several individuals, addressing common depth estimation errors and ensuring spatial and identity consistency within the dataset. Empirical results demonstrate the accuracy and reliability of our method in group environments. This provides mechanisms for significant advances in behavior and interaction analysis in educational and professional training applications in dynamic and unstructured environments.

Updated: 2024-06-16 16:30:56

标题: 3D凝视追踪用于研究混合现实环境中的协作互动

摘要: 这项研究提出了一个新颖的框架，用于定制混合现实环境下的3D凝视追踪，旨在增强团队合作场景中的共同关注和协作努力。传统的凝视追踪通常受限于单眼摄像头和传统眼动仪器，往往在团体环境中难以同时同步和分析多个参与者的数据。我们提出的框架利用最先进的计算机视觉和机器学习技术来克服这些障碍，实现了精确的3D凝视估计，而不依赖于专门的硬件或复杂的数据融合。利用面部识别和深度学习，该框架实现了对多个个体凝视模式的实时跟踪，解决了常见的深度估计错误，并确保了数据集中的空间和身份一致性。实证结果证明了我们的方法在团体环境中的准确性和可靠性。这为在动态和非结构化环境中的教育和专业培训应用中的行为和互动分析提供了重要进展的机制。

更新时间: 2024-06-16 16:30:56

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2406.11003v1

Not All Bias is Bad: Balancing Rational Deviations and Cognitive Biases in Large Language Model Reasoning

This paper investigates the nuanced role of biases in the decision-making processes of large language models (LLMs). While conventional research typically aims to eliminate all biases, our study reveals that not all biases are detrimental. By examining rational deviations, involving heuristic shortcuts that enhance decision-making efficiency, we highlight their potential benefits when properly balanced. We introduce the concepts of heuristic moderation and an abstention option, allowing LLMs to abstain from answering when uncertain, thereby reducing error rates and improving decision accuracy. Using our newly developed BRD (Balance Rational Deviations) dataset, our findings demonstrate that appropriately scaled bias inspection enhances model performance and aligns LLM decision-making more closely with human reasoning. This balance improves the reliability and trustworthiness of LLMs and suggests new strategies for future enhancements. Our work offers a fresh perspective on leveraging biases constructively to enhance the practical applications of LLMs, from conversational agents to decision support systems and beyond.

Updated: 2024-06-16 16:25:22

标题: 并不是所有的偏见都是坏的：在大型语言模型推理中平衡理性偏差和认知偏见

摘要: 这篇论文调查了大型语言模型（LLMs）在决策过程中偏见的微妙作用。传统研究通常旨在消除所有偏见，但我们的研究显示，并非所有偏见都是有害的。通过研究理性偏差，涉及强化决策效率的启发式快捷方式，我们强调了在适当平衡时它们的潜在好处。我们引入了启发性调节和弃权选项的概念，允许LLMs在不确定时放弃回答，从而降低错误率并提高决策准确性。利用我们新开发的BRD（平衡理性偏差）数据集，我们的发现表明，适当缩放的偏见检查提升了模型性能，并使LLMs的决策更贴近人类推理。这种平衡提高了LLMs的可靠性和可信度，为未来增强提供了新的策略。我们的工作提供了一个新的视角，利用偏见有益地增强LLMs的实际应用，从对话代理到决策支持系统等等。

更新时间: 2024-06-16 16:25:22

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.10999v1

Honesty is the Best Policy: On the Accuracy of Apple Privacy Labels Compared to Apps' Privacy Policies

Apple introduced privacy labels in Dec. 2020 as a way for developers to report the privacy behaviors of their apps. While Apple does not validate labels, they also require developers to provide a privacy policy, which offers an important comparison point. In this paper, we fine-tuned BERT-based language models to extract privacy policy features for 474,669 apps on the iOS App Store, comparing the output to the privacy labels. We identify discrepancies between the policies and the labels, particularly as they relate to data collection linked to users. We find that 228K apps' privacy policies may indicate more data collection linked to users than what is reported in the privacy labels. More alarming, a large number (97%) of the apps with a Data Not Collected privacy label have a privacy policy indicating otherwise. We provide insights into potential sources for discrepancies, including the use of templates and confusion around Apple's definitions and requirements. These results suggest that significant work is still needed to help developers more accurately label their apps. Our system can be incorporated as a first-order check to inform developers when privacy labels are possibly misapplied.
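
The comparison step described above can be pictured as a set difference between what a policy implies and what a label declares. The sketch below is a hypothetical illustration only: it assumes the policy features have already been extracted (the BERT-based extraction is not shown), and the data-type names and the function name are invented for the example.

```python
def compare_label_to_policy(policy_linked_types, label_linked_types):
    """Flag data types that the policy implies are linked to users but the label omits.

    Both arguments are iterables of data-type names (hypothetical strings here).
    An empty label stands in for a "Data Not Collected" privacy label.
    """
    policy = set(policy_linked_types)
    label = set(label_linked_types)
    return {
        # Types mentioned in the policy but missing from the label.
        "under_reported": sorted(policy - label),
        # The label claims nothing is collected, yet the policy says otherwise.
        "contradicts_not_collected": not label and bool(policy),
    }
```

A check of this shape is cheap enough to run at submission time, which is the kind of first-order check the abstract has in mind.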

Updated: 2024-06-16 16:24:27

标题: 诚实是最好的策略：苹果隐私标签的准确性与应用程序隐私政策的比较

摘要: 苹果于2020年12月推出了隐私标签，作为开发人员报告其应用程序隐私行为的一种方式。虽然苹果不验证标签，但他们也要求开发人员提供隐私政策，这提供了一个重要的比较点。本文中，我们对iOS应用商店上474,669款应用程序进行了基于BERT的语言模型的微调，以提取隐私政策特征，并将其与隐私标签进行了比较。我们发现了政策和标签之间的差异，特别是与用户相关的数据收集方面。我们发现，有228,000款应用程序的隐私政策可能表明与用户有关的数据收集超出了隐私标签中报告的内容。更令人担忧的是，有大量（97%）带有“数据未收集”隐私标签的应用程序的隐私政策表明情况并非如此。我们提供了关于差异的潜在来源的见解，包括使用模板和对苹果的定义和要求的混淆。这些结果表明，仍然需要大量工作来帮助开发人员更准确地标记其应用程序。我们的系统可以作为一个第一级检查，通知开发人员当隐私标签可能被错误应用时。

更新时间: 2024-06-16 16:24:27

Categories: cs.CR

Download: http://arxiv.org/abs/2306.17063v2

FENet: Focusing Enhanced Network for Lane Detection

Inspired by human driving focus, this research pioneers networks augmented with Focusing Sampling, Partial Field of View Evaluation, Enhanced FPN architecture and Directional IoU Loss: targeted innovations addressing obstacles to precise lane detection for autonomous driving. Experiments demonstrate that our Focusing Sampling strategy, which emphasizes vital distant details unlike uniform approaches, significantly boosts both benchmark and practical curved/distant lane recognition accuracy essential for safety. While FENetV1 achieves state-of-the-art conventional metric performance via enhancements isolating perspective-aware contexts mimicking driver vision, FENetV2 proves most reliable on the proposed Partial Field analysis. Hence we specifically recommend V2 for practical lane navigation despite fractional degradation on standard entire-image measures. Future directions include collecting on-road data and integrating complementary dual frameworks to achieve further breakthroughs guided by human perception principles. The code is available at https://github.com/HanyangZhong/FENet.

Updated: 2024-06-16 16:23:47

标题: FENet: 用于车道检测的聚焦增强网络

摘要: 受人类驾驶关注的启发，这项研究开创了采用聚焦采样、部分视野评估、增强FPN架构和方向IoU损失的网络，针对自动驾驶精确车道检测的障碍进行了有针对性的创新。实验证明，我们的聚焦采样策略，强调关键的远距离细节，与均匀方法不同，显著提升了基准和实际曲线/远距离车道识别准确性，这对安全至关重要。虽然FENetV1通过增强隔离透视感知上下文模拟驾驶员视野而实现了最先进的传统度量性能，但FENetV2在提出的部分视野分析上最可靠。因此，尽管在标准整体图像测量上存在部分下降，我们特别推荐V2用于实际车道导航。未来的方向包括收集路面数据，并集成互补的双重框架，以进一步突破，遵循人类感知原则。代码可在https://github.com/HanyangZhong/FENet找到。

更新时间: 2024-06-16 16:23:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2312.17163v6

Two-level overlapping additive Schwarz preconditioner for training scientific machine learning applications

We introduce a novel two-level overlapping additive Schwarz preconditioner for accelerating the training of scientific machine learning applications. The design of the proposed preconditioner is motivated by the nonlinear two-level overlapping additive Schwarz preconditioner. The neural network parameters are decomposed into groups (subdomains) with overlapping regions. In addition, the network's feed-forward structure is indirectly imposed through a novel subdomain-wise synchronization strategy and a coarse-level training step. Through a series of numerical experiments, which consider physics-informed neural networks and operator learning approaches, we demonstrate that the proposed two-level preconditioner significantly speeds up the convergence of the standard (LBFGS) optimizer while also yielding more accurate machine learning models. Moreover, the devised preconditioner is designed to take advantage of model-parallel computations, which can further reduce the training time.
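
As a small illustration of what decomposing parameters into overlapping groups can look like, the sketch below splits a flat parameter index range into contiguous subdomains and extends each one by a fixed overlap. This is an assumption-laden toy: the paper's actual grouping of network parameters, its synchronization strategy, and the coarse-level step are not shown, and `overlapping_subdomains` is a hypothetical name.

```python
def overlapping_subdomains(n_params, n_groups, overlap):
    """Split indices 0..n_params-1 into n_groups contiguous blocks, each widened by `overlap`.

    Assumes n_params >= n_groups so that every block is non-empty.
    """
    groups = []
    for g in range(n_groups):
        lo = g * n_params // n_groups          # start of the non-overlapping block
        hi = (g + 1) * n_params // n_groups    # end (exclusive) of the block
        start = max(0, lo - overlap)           # extend left, clipped at 0
        stop = min(n_params, hi + overlap)     # extend right, clipped at n_params
        groups.append(list(range(start, stop)))
    return groups
```

Indices in the overlap belong to more than one group, so local updates computed per group have to be combined for those indices, for example by summing them as an additive Schwarz method does.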

Updated: 2024-06-16 16:18:45

标题: 两级重叠加性Schwarz预处理器用于训练科学机器学习应用程序

摘要: 我们引入了一种新颖的两级重叠的加法Schwarz预处理器，用于加速科学机器学习应用的训练。所提出的预处理器的设计受非线性两级重叠的加法Schwarz预处理器的启发。神经网络参数被分解为具有重叠区域的组（子域）。此外，网络的前馈结构通过一种新颖的子域同步策略和粗级训练步骤间接强加。通过一系列考虑物理信息神经网络和运算符学习方法的数值实验，我们证明了所提出的两级预处理器显著加快了标准（LBFGS）优化器的收敛速度，同时也产生了更准确的机器学习模型。此外，所设计的预处理器旨在利用模型并行计算，进一步缩短训练时间。

更新时间: 2024-06-16 16:18:45

Categories: math.NA,cs.LG,cs.NA,math.OC,90C30, 90C26, 90C06, 65M55, 68T07

Download: http://arxiv.org/abs/2406.10997v1

Concept-skill Transferability-based Data Selection for Large Vision-Language Models

Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, or the ability to transfer well to other concept-skill compositions. This approach ensures the diversity of these compositions, which is vital for LVLM generalization. Extensive experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines on two distinct datasets: LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE achieves performance comparable to the LVLM finetuned on the whole dataset, with 70% reduction of the wall-clock running time. On the Vision-Flan dataset, our method achieves superior results with only 16.7% of the training data.

Updated: 2024-06-16 16:15:20

标题: 基于概念技能可转移性的大规模视觉语言模型数据选择

摘要: 指导调优，或在大规模视觉语言模型（LVLMs）上进行监督微调以获得广泛任务特定数据，对于LVLMs在广泛的视觉语言（VL）任务中良好泛化是必要的。然而，对大型VL数据集进行训练可能变得昂贵。在这项工作中，我们介绍了COINCIDE，一种有效且可扩展的数据选择技术，该技术使用一个小型模型作为参考模型，为目标LVLM选择视觉指导调优数据，以便有效微调目标LVLM，重点放在多样性和可转移性上。具体来说，我们使用小型模型的内部激活对训练数据进行聚类，该聚类识别了目标LVLM所需的VL概念-技能组合。然后，我们通过考虑这些不同簇的密度和可转移性，或者是能够很好地转移到其他概念-技能组合的能力，从这些不同簇中抽取数据。这种方法确保了这些组合的多样性，这对于LVLM的泛化至关重要。广泛的实验表明，COINCIDE在两个不同数据集LLaVA-1.5和Vision-Flan上，与8个强基线相比，实现了卓越的性能和数据选择效率。仅使用LLaVA-1.5数据集的20％，COINCIDE实现了与整个数据集微调后的LVLM性能相当的表现，同时减少了70％的挂钟运行时间。在Vision-Flan数据集上，我们的方法仅使用16.7％的训练数据就实现了优秀的结果。

更新时间: 2024-06-16 16:15:20

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2406.10995v1

CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving

Code-switching is a widely prevalent linguistic phenomenon in multilingual societies like India. Building speech-to-text models for code-switched speech is challenging due to limited availability of datasets. In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules (that are more widely available for many languages). Speech and ASR text representations are fused using an aligned interleaving scheme and are fed further as input to a pretrained MT module; the whole pipeline is then trained end-to-end for spoken translation using synthetically created ST data. We also release a new evaluation benchmark for code-switched Bengali-English, Hindi-English, Marathi-English and Telugu-English speech to English text. COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.

Updated: 2024-06-16 16:10:51

标题: CoSTA: 使用对齐的语音文本交错进行混合语言语音翻译

摘要: 代码切换是印度等多语言社会普遍存在的语言现象。为混合语音构建语音到文本模型具有挑战性，因为数据集的可用性有限。在这项工作中，我们专注于将印度语言的混合语音翻译成英文文本的口语翻译（ST）问题。我们提出了一种新的端到端模型架构COSTA，它依赖于预训练的自动语音识别（ASR）和机器翻译（MT）模块（对于许多语言更为普遍可用）。语音和ASR文本表示使用对齐的交错方案融合，并进一步作为输入传递给预训练的MT模块；整个流水线然后使用合成创建的ST数据进行端到端的口语翻译训练。我们还发布了一个新的评估基准，用于将混合孟加拉语-英语、印地语-英语、马拉地语-英语和泰卢固语-英语的语音转换成英文文本。COSTA在BLEU分数上显着优于许多具有竞争力的级联和端到端多模态基线，最高可达3.5个BLEU分。

更新时间: 2024-06-16 16:10:51

Categories: cs.CL,cs.LG,cs.SD,eess.AS

Download: http://arxiv.org/abs/2406.10993v1

LLM-SAP: Large Language Models Situational Awareness Based Planning

This study explores integrating large language models (LLMs) with situational awareness-based planning (SAP) to enhance the decision-making capabilities of AI agents in dynamic and uncertain environments. We employ a multi-agent reasoning framework to develop a methodology that anticipates and actively mitigates potential risks through iterative feedback and evaluation processes. Our approach diverges from traditional automata theory by incorporating the complexity of human-centric interactions into the planning process, thereby expanding the planning scope of LLMs beyond structured and predictable scenarios. The results demonstrate significant improvements in the model's ability to provide comparatively safe actions within hazard interactions, offering a perspective on proactive and reactive planning strategies. This research highlights the potential of LLMs to perform human-like action planning, thereby paving the way for more sophisticated, reliable, and safe AI systems in unpredictable real-world applications.

Updated: 2024-06-16 16:00:55

标题: LLM-SAP：基于情境感知的大型语言模型规划

摘要: 本研究探讨了将大型语言模型（LLMs）与基于情境意识的规划（SAP）相结合，以增强AI代理在动态和不确定环境中的决策能力。我们采用多代理推理框架来开发一种方法论，通过迭代反馈和评估过程来预测和积极减轻潜在风险。我们的方法与传统的自动机理论不同，将人类中心交互的复杂性纳入规划过程中，从而将LLMs的规划范围扩展到非结构化和不可预测的情景。结果表明，模型在危险交互中提供比较安全的行动能力显著提高，为主动和被动规划策略提供了视角。这项研究突出了LLMs在执行类似人类的行动规划方面的潜力，从而为在不可预测的真实世界应用中建立更复杂、可靠和安全的AI系统铺平了道路。

更新时间: 2024-06-16 16:00:55

Categories: cs.AI

Download: http://arxiv.org/abs/2312.16127v5

RATSF: Empowering Customer Service Volume Management through Retrieval-Augmented Time-Series Forecasting

An efficient customer service management system hinges on precise forecasting of service volume. In this scenario, where data non-stationarity is pronounced, successful forecasting heavily relies on identifying and leveraging similar historical data rather than merely summarizing periodic patterns. Existing models based on RNN or Transformer architectures may struggle with this flexible and effective utilization. To tackle this challenge, we initially developed the Time Series Knowledge Base (TSKB) with an advanced indexing system for efficient historical data retrieval. We also developed the Retrieval Augmented Cross-Attention (RACA) module, a variant of the cross-attention mechanism within Transformer's decoder layers, designed to be seamlessly integrated into the vanilla Transformer architecture to assimilate key historical data segments. The synergy between TSKB and RACA forms the backbone of our Retrieval-Augmented Time Series Forecasting (RATSF) framework. Based on the above two components, RATSF not only significantly enhances performance in the context of Fliggy hotel service volume forecasting but also adapts flexibly to various scenarios and integrates with a multitude of Transformer variants for time-series forecasting. Extensive experimentation has validated the effectiveness and generalizability of this system design across multiple diverse contexts.
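
The retrieval idea above, looking up similar historical segments rather than summarizing periodic patterns, can be sketched with a tiny in-memory knowledge base. This is a hypothetical illustration: it uses a brute-force linear scan with squared Euclidean distance in place of the indexing system described in the abstract, and it does not show the RACA cross-attention module. All names are invented for the example.

```python
def build_knowledge_base(series, window, horizon):
    """Store (key, value) pairs: a history window and the values that followed it."""
    entries = []
    for start in range(len(series) - window - horizon + 1):
        key = series[start:start + window]
        value = series[start + window:start + window + horizon]
        entries.append((key, value))
    return entries


def retrieve(knowledge_base, query, k=2):
    """Return the k entries whose key window is closest to the query window."""
    def distance(entry):
        key = entry[0]
        return sum((a - b) ** 2 for a, b in zip(key, query))

    # sorted() is stable, so ties keep their chronological order.
    return sorted(knowledge_base, key=distance)[:k]
```

In a retrieval-augmented forecaster, the retrieved `(key, value)` segments would be passed to the model as extra context alongside the recent history.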

Updated: 2024-06-16 15:59:13

标题: RATSF: 通过检索增强时间序列预测强化客户服务量管理

摘要: 一个有效的客户服务管理系统取决于对服务量的精确预测。在这种情况下，数据的非平稳性非常明显，成功的预测在很大程度上依赖于识别和利用类似的历史数据，而不仅仅是总结周期性模式。基于RNN或Transformer架构的现有模型可能在灵活和有效利用方面遇到困难。为了解决这一挑战，我们最初开发了时间序列知识库（TSKB），具有用于高效检索历史数据的先进索引系统。我们还开发了检索增强交叉注意（RACA）模块，这是Transformer解码器层内的交叉注意机制的一种变体，旨在无缝集成到基本Transformer架构中，以吸收关键的历史数据段。TSKB和RACA之间的协同作用构成了我们的检索增强时间序列预测（RATSF）框架的核心。基于以上两个组件，RATSF不仅显著提高了在飞猪酒店服务量预测方面的性能，而且灵活适应各种情景，并与各种Transformer变体集成用于时间序列预测。大量实验验证了这一系统设计在多种不同情境下的有效性和普适性。

更新时间: 2024-06-16 15:59:13

Categories: cs.LG,cs.IR

Download: http://arxiv.org/abs/2403.04180v2

Aligners: Decoupling LLMs and Alignment

Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for a given criterion on an as-needed basis, thus also reducing the potential negative impacts of alignment on performance. Our recipe for training the aligner models solely relies on synthetic data generated with a (prompted) LLM and can be easily adjusted for a variety of alignment criteria. We use the same synthetic data to train inspectors, binary misalignment classification models that guide a "squad" of multiple aligners. Our empirical results demonstrate consistent improvements when applying the aligner squad to various LLMs, including chat-aligned models, across several instruction-following and red-teaming datasets.
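
One way to picture inspectors guiding a squad of aligners is a pipeline in which each aligner runs only when its paired inspector flags the response. The sketch below is a guess at the control flow for illustration, not the paper's implementation: inspectors and aligners are plain Python callables here, whereas in the paper they are trained models, and `apply_squad` is a hypothetical name.

```python
def apply_squad(response, squad):
    """Run each (inspector, aligner) pair in order.

    `inspector(text)` returns True when it judges the text misaligned for its
    criterion; only then is `aligner(text)` asked to rewrite it.
    """
    for inspector, aligner in squad:
        if inspector(response):
            response = aligner(response)
    return response
```

Gating on the inspector means responses that already satisfy a criterion pass through unchanged, which matches the stated goal of limiting the side effects of alignment on performance.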

Updated: 2024-06-16 15:59:11

标题: 对齐器：解耦LLMs和对齐

摘要: 大型语言模型（LLMs）需要与人类期望保持一致，以确保它们在大多数应用中的安全性和效用。对齐是具有挑战性、昂贵的，并且需要针对每个LLM和对齐标准重复进行。我们建议通过训练对齐器模型来解耦LLMs和对齐，这些模型可以根据需要对任何LLM进行对齐，从而还可以减少对性能的潜在负面影响。我们的训练对齐器模型的配方仅依赖于使用（提示）LLM生成的合成数据，并且可以轻松调整以适应各种对齐标准。我们使用相同的合成数据来训练检查员，二元失对齐分类模型来指导多个对齐器的“小组”。我们的实证结果表明，在应用对齐器小组到各种LLMs时，包括聊天对齐模型，在几个遵循指令和红队数据集上都表现出一致的改进。

更新时间: 2024-06-16 15:59:11

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2403.04224v3

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques does not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.

Updated: 2024-06-16 15:58:44

标题: 越狱基准测试：针对越狱大型语言模型的开放性鲁棒性基准

摘要: 越狱攻击导致大型语言模型（LLMs）生成有害、不道德或其他令人反感的内容。评估这些攻击面临许多挑战，目前的基准和评估技术并未充分解决。首先，关于越狱评估没有明确的标准实践。其次，现有作品以不可比较的方式计算成本和成功率。第三，许多作品不可重复，因为它们隐瞒了对抗性提示，涉及闭源代码，或依赖于不断演变的专有API。为了解决这些挑战，我们引入了JailbreakBench，一个开源基准测试，包括以下组件：（1）一系列最新的对抗性提示的不断发展的存储库，我们称之为越狱工件；（2）一个包含100种行为的越狱数据集，既包括原创内容，也包括来源于先前工作的内容，这些内容符合OpenAI的使用政策；（3）一个标准化的评估框架，网址为https://github.com/JailbreakBench/jailbreakbench，包括明确定义的威胁模型，系统提示，聊天模板和评分函数；（4）一个排行榜，网址为https://jailbreakbench.github.io/，跟踪各种LLMs的攻击和防御性能。我们已经认真考虑了发布这一基准测试的潜在道德影响，并相信这将对社区产生积极影响。

更新时间: 2024-06-16 15:58:44

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2404.01318v3

Predicting the Understandability of Computational Notebooks through Code Metrics Analysis

Computational notebooks have become the primary coding environment for data scientists. However, research on their code quality is still emerging, and the code shared is often of poor quality. Given the importance of maintenance and reusability, understanding the metrics that affect notebook code comprehensibility is crucial. Code understandability, a qualitative variable, is closely tied to user opinions. Traditional approaches to measuring it either use limited questionnaires to review a few code pieces or rely on metadata such as likes and votes in software repositories. Our approach enhances the measurement of Jupyter notebook understandability by leveraging user comments related to code understandability. As a case study, we used 542,051 Kaggle Jupyter notebooks from our previous research, named DistilKaggle. We employed a fine-tuned DistilBERT transformer to identify user comments associated with code understandability. We established a criterion called User Opinion Code Understandability (UOCU), which considers the number of relevant comments, upvotes on those comments, total notebook views, and total notebook upvotes. UOCU proved to be more effective than previous methods. Furthermore, we trained machine learning models to predict notebook code understandability based solely on their metrics. We collected 34 metrics for 132,723 final notebooks as features in our dataset, using UOCU as the label. Our predictive model, using the Random Forest classifier, achieved 89% accuracy in predicting the understandability levels of computational notebooks.

Updated: 2024-06-16 15:58:40

标题: 预测计算笔记本可读性的方法：通过代码指标分析

摘要: 计算笔记本已成为数据科学家的主要编码环境。然而，关于其代码质量的研究仍在兴起，共享的代码往往质量较差。鉴于维护和可重用性的重要性，理解影响笔记本代码可理解性的度量标准至关重要。代码可理解性是一种定性变量，与用户意见密切相关。传统的衡量方法要么使用有限的问卷调查来审查一些代码片段，要么依赖于软件存储库中的元数据，如点赞和投票。我们的方法通过利用与代码可理解性相关的用户评论来增强Jupyter笔记本的可理解性测量。作为案例研究，我们使用了我们先前研究中的542,051个Kaggle Jupyter笔记本，名为DistilKaggle。我们使用经过调整的DistilBERT transformer来识别与代码可理解性相关的用户评论。我们建立了一个称为用户意见代码可理解性（UOCU）的标准，考虑了相关评论的数量、这些评论的赞数、总笔记本查看次数和总笔记本点赞数。UOCU证明比先前的方法更有效。此外，我们训练了机器学习模型，仅基于它们的度量标准来预测笔记本代码的可理解性。我们为132,723个最终笔记本收集了34个度量标准作为我们数据集中的特征，使用UOCU作为标签。我们的预测模型，使用随机森林分类器，在预测计算笔记本的可理解性水平时达到了89％的准确率。

更新时间: 2024-06-16 15:58:40

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2406.10989v1

Complexity Matters: Dynamics of Feature Learning in the Presence of Spurious Correlations

Existing research often posits spurious features as easier to learn than core features in neural network optimization, but the impact of their relative simplicity remains under-explored. Moreover, studies mainly focus on end performance rather than the learning dynamics of feature learning. In this paper, we propose a theoretical framework and an associated synthetic dataset grounded in boolean function analysis. This setup allows for fine-grained control over the relative complexity (compared to core features) and correlation strength (with respect to the label) of spurious features to study the dynamics of feature learning under spurious correlations. Our findings uncover several interesting phenomena: (1) stronger spurious correlations or simpler spurious features slow down the learning rate of the core features, (2) two distinct subnetworks are formed to learn core and spurious features separately, (3) learning phases of spurious and core features are not always separable, (4) spurious features are not forgotten even after core features are fully learned. We demonstrate that our findings justify the success of retraining the last layer to remove spurious correlation and also identify limitations of popular debiasing algorithms that exploit early learning of spurious features. We support our empirical findings with theoretical analyses for the case of learning XOR features with a one-hidden-layer ReLU network.
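
A synthetic boolean dataset with a controllable spurious correlation can be sketched in a few lines. The generator below is an assumed, simplified construction for illustration rather than the paper's dataset: the label is the parity (XOR in a ±1 encoding) of the core bits, and a single spurious bit agrees with the label with probability `spurious_strength`. The function and parameter names are hypothetical.

```python
import random


def make_example(rng, n_core=2, spurious_strength=0.9):
    """Return (features, label) with features = core bits followed by one spurious bit."""
    core_bits = [rng.choice([-1, 1]) for _ in range(n_core)]
    label = 1
    for bit in core_bits:
        label *= bit  # product of ±1 bits is their parity (XOR)
    # The spurious bit copies the label with the given probability, else flips it.
    spurious = label if rng.random() < spurious_strength else -label
    return core_bits + [spurious], label
```

Here `n_core` controls how complex the core feature is relative to the one-bit spurious feature, and `spurious_strength` controls how strongly the spurious bit correlates with the label; a value of 0.5 makes it uninformative.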

Updated: 2024-06-16 15:43:33

标题: 复杂性重要性：在伪相关性存在的情况下特征学习的动态性

摘要: 现有研究常常认为在神经网络优化中，虚假特征比核心特征更容易学习，但它们相对简单性的影响仍未被充分探讨。此外，研究主要关注最终性能，而不是特征学习的学习动态。在本文中，我们提出了一个理论框架和一个基于布尔函数分析的相关合成数据集。这种设置允许对虚假特征的相对复杂性（与核心特征相比）和与标签的相关性强度进行精细控制，以研究在虚假相关性下特征学习的动态。我们的研究结果揭示了几个有趣的现象：（1）更强的虚假相关性或更简单的虚假特征会减缓核心特征的学习速度，（2）形成了两个不同的子网络来分别学习核心和虚假特征，（3）虚假特征和核心特征的学习阶段并非总是可分离的，（4）即使核心特征完全学习后，虚假特征也不会被遗忘。我们证明了我们的研究结果支持重新训练最后一层以消除虚假相关性的成功，并且确定了利用虚假特征的早期学习的流行去偏算法的局限性。我们通过理论分析支持我们的实证研究结果，针对使用一个隐藏层ReLU网络学习XOR特征的情况。

更新时间: 2024-06-16 15:43:33

Categories: cs.LG

Download: http://arxiv.org/abs/2403.03375v2

Adversarial Illusions in Multi-Modal Embeddings

Multi-modal embeddings encode texts, images, thermal images, sounds, and videos into a single embedding space, aligning representations across different modalities (e.g., associate an image of a dog with a barking sound). In this paper, we show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an image or a sound, an adversary can perturb it to make its embedding close to an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal and targeted: the adversary can align any image or sound with any target of his choice. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks and modalities, enabling a wholesale compromise of current and future tasks, as well as modalities not available to the adversary. Using ImageBind and AudioCLIP embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, zero-shot classification, and audio retrieval. We investigate transferability of illusions across different embeddings and develop a black-box version of our method that we use to demonstrate the first adversarial alignment attack on Amazon's commercial, proprietary Titan embedding. Finally, we analyze countermeasures and evasion attacks.
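
The core mechanism, perturbing an input so its embedding moves toward a chosen target embedding, can be shown on a toy model. The sketch below is a generic projected-gradient illustration under strong assumptions, not the paper's attack: the encoder is a linear map given as a weight matrix, the objective is the squared L2 distance to the target embedding, and the perturbation is clipped to an L-infinity ball of radius `eps`. Real encoders such as ImageBind or AudioCLIP are not used, and all names are hypothetical.

```python
def embed(weights, x):
    """Toy linear encoder: one output coordinate per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def illusion(weights, x, target_emb, eps=0.5, step=0.05, iters=200):
    """Projected gradient descent on ||embed(x + delta) - target_emb||^2 with |delta_i| <= eps."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        emb = embed(weights, [xi + di for xi, di in zip(x, delta)])
        residual = [e - t for e, t in zip(emb, target_emb)]
        # Gradient of the squared distance with respect to delta: 2 * W^T * residual.
        grad = [2.0 * sum(weights[r][c] * residual[r] for r in range(len(weights)))
                for c in range(len(x))]
        # Gradient step followed by projection onto the L-infinity ball.
        delta = [max(-eps, min(eps, d - step * g)) for d, g in zip(delta, grad)]
    return [xi + di for xi, di in zip(x, delta)]
```

Because the objective only involves distances in the embedding space, nothing in the loop refers to a downstream task, which mirrors the task-agnostic property the abstract emphasizes.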

Updated: 2024-06-16 15:34:37

标题: 多模态嵌入中的对抗性幻觉

摘要: 多模态嵌入将文本、图像、热像图、声音和视频编码到一个单一的嵌入空间中，对不同模态的表示进行对齐（例如，将一张狗的图片与狗叫声关联起来）。在本文中，我们展示了多模态嵌入可能会受到一种我们称之为“对抗幻觉”的攻击。给定一张图片或一个声音，对手可以扰动它使其嵌入接近于另一种模态中任意对手选择的输入。这些攻击是跨模态和有针对性的：对手可以将任何图片或声音与任何他选择的目标对齐。对抗幻觉利用嵌入空间中的接近性，因此对下游任务和模态是不可知的，从而使当前和未来任务以及对手无法访问的模态都受到全面妥协。利用ImageBind和AudioCLIP嵌入，我们展示了如何通过对抗对齐输入（在不知道具体下游任务的情况下生成），误导图像生成、文本生成、零样本分类和音频检索。我们调查了幻觉在不同嵌入之间的可转移性，并开发了我们方法的黑盒版本，用于展示对亚马逊商业专有的Titan嵌入进行的第一个对抗对齐攻击。最后，我们分析了对抗措施和逃避攻击。

更新时间: 2024-06-16 15:34:37

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2308.11804v4

Daisy Bloom Filters

A filter is a widely used data structure for storing an approximation of a given set $S$ of elements from some universe $U$ (a countable set). It represents a superset $S'\supseteq S$ that is "close to $S$" in the sense that for $x\not\in S$, the probability that $x\in S'$ is bounded by some $\varepsilon > 0$. The advantage of using a Bloom filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store $S$ exactly. Though filters are well-understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in $S$ with probability close to 1. Then it would make sense to always include them in $S'$, saving space by not having to represent these elements in the filter. Questions like this have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Kraska, ICLR 2021). In this paper, we present a lower bound for the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate at most $\varepsilon$ with high probability over input sets drawn from a product distribution. We also present a Bloom filter alternative, which we call the $\textit{Daisy Bloom filter}$, that executes operations faster and uses significantly less space than the standard Bloom filter.
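
For readers unfamiliar with the baseline, here is a minimal sketch of a standard Bloom filter, the structure the paper compares against. It is not the Daisy Bloom filter. The sketch assumes string items, derives its hash functions from SHA-256 with a per-function seed prefix, and stores one byte per bit for readability rather than space efficiency.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive n_hashes positions by hashing the item with different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        # Reports membership only if every probed bit is set.
        return all(self.bits[position] for position in self._positions(item))
```

Every inserted item is always reported as present, while an item that was never inserted is reported as present only if all of its probed bits happen to have been set by other items; this is the false-positive event whose probability the abstract bounds by $\varepsilon$.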

Updated: 2024-06-16 15:29:00

标题: 雏菊布隆过滤器

摘要: 一个过滤器是一种广泛使用的数据结构，用于存储来自某个有限集合$U$（可数集合）的元素集合$S$的近似。它表示一个超集$S'\supseteq S$，在某种意义上与$S$“接近”，即对于$x\not\in S$，$x\in S'$的概率受到某个$\varepsilon > 0$的限制。使用Bloom过滤器的优势在于，当一些误报是可以接受的时，空间利用率比精确存储$S$所需的空间更小。尽管从最坏情况的角度来看，过滤器是被很好理解的，但很明显，最先进的构造可能并不是针对特定数据和查询分布最优的。举个例子，假设某些元素以接近1的概率存在于$S$中。那么始终将它们包含在$S'$中是有意义的，通过不必在过滤器中表示这些元素来节省空间。类似的问题在加权Bloom过滤器（Bruck，Gao和Jiang，ISIT 2006）和利用对学习组件的访问的Bloom过滤器实现（Vaidya，Knorr，Mitzenmacher和Krask，ICLR 2021）的背景下被提出。在本文中，我们提出了这样一个过滤器所需的预期空间的下界。我们还展示了这个下界在渐近意义上是紧密的，通过展示一个在最坏情况下执行查询和插入的过滤器构造，并且具有最多$\varepsilon$的误报率，对于从产品分布中抽取的输入集合，有很高概率成立。我们还提出了一种Bloom过滤器的替代方案，称为“雏菊Bloom过滤器”，它执行操作更快，使用的空间显著少于标准Bloom过滤器。

更新时间: 2024-06-16 15:29:00

Categories: cs.DS,cs.DB,cs.LG

Download: http://arxiv.org/abs/2205.14894v2

Biologically-Motivated Learning Model for Instructed Visual Processing

As part of understanding how the brain learns, ongoing work seeks to combine biological knowledge and current artificial intelligence (AI) modeling in an attempt to find an efficient biologically plausible learning scheme. Current models of biologically plausible learning often use a cortical-like combination of bottom-up (BU) and top-down (TD) processing, where the TD part carries feedback signals used for learning. However, in the visual cortex, the TD pathway plays a second major role of visual attention, by guiding the visual process to locations and tasks of interest. A biological model should therefore combine the two tasks, and learn to guide the visual process. We introduce a model that uses a cortical-like combination of BU and TD processing that naturally integrates the two major functions of the TD stream. The integrated model is obtained by an appropriate connectivity pattern between the BU and TD streams, a novel processing cycle that uses the TD part twice, and the use of 'Counter-Hebb' learning that operates across the streams. We show that the 'Counter-Hebb' mechanism can provide an exact backpropagation synaptic modification. We further demonstrate the model's ability to guide the visual stream to perform a task of interest, achieving competitive performance compared with AI models on standard multi-task learning benchmarks. The successful combination of learning and visual guidance could provide a new view on combining BU and TD processing in human vision, and suggests possible directions for both biologically plausible models and artificial instructed models, such as vision-language models (VLMs).

Updated: 2024-06-16 15:24:53

标题: 受生物启发的指导视觉处理学习模型

摘要: 作为理解大脑学习过程的一部分，正在进行的工作旨在结合生物知识和当前的人工智能（AI）建模，试图找到一种高效的生物可信的学习方案。目前的生物可信学习模型通常使用类似皮层的自下而上（BU）和自上而下（TD）处理的组合，其中TD部分携带用于学习的反馈信号。然而，在视觉皮层中，TD途径还扮演视觉注意的第二个主要角色，通过引导视觉过程到感兴趣的位置和任务。因此，生物模型应该结合这两个任务，并学会引导视觉过程。我们介绍了一个模型，该模型使用类似皮层的BU和TD处理的组合，自然地整合了TD流的两个主要功能。通过BU和TD流之间的适当连接模式、一个使用TD部分两次的新颖处理循环以及跨流操作的“Counter-Hebb”学习，我们获得了集成模型。我们展示了“Counter-Hebb”机制可以提供准确的反向传播突触修改。我们进一步展示了该模型引导视觉流执行感兴趣任务的能力，与标准多任务学习基准上的AI模型相比，取得了竞争性表现。学习和视觉引导的成功结合可能为人类视觉中的BU和TD处理提供了一种新视角，并为生物可信模型和人工指导模型，如视觉语言模型（VLMs），提供了可能的方向。

更新时间: 2024-06-16 15:24:53

领域: cs.AI

下载: http://arxiv.org/abs/2306.02415v3

Toward Optimal LLM Alignments Using Two-Player Games

The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include scenarios that LLMs need to improve on the most. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weakness of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it struggled with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.

Updated: 2024-06-16 15:24:50

标题: 朝着使用双人游戏实现最佳LLM对齐

摘要: 标准的来自人类反馈的强化学习（RLHF）框架主要关注优化大型语言模型的性能，使用预先收集的提示。然而，收集提供全面覆盖的提示既繁琐又具挑战性，通常无法包含LLMs最需要改进的情景。在本文中，我们通过两个代理游戏的视角研究对齐，涉及敌对代理和防御代理之间的迭代交互。每一步中，敌对代理的任务是生成暴露防御代理弱点的提示。作为回报，防御代理试图改进对这些新识别的提示的响应，基于奖励模型的反馈。我们在理论上证明，这种迭代强化学习优化收敛到由代理引发的游戏的纳什均衡。在安全场景中的实验结果表明，在这样一个竞争环境中学习不仅完全训练了代理，还导致对抗性和防御性代理都具有增强泛化能力的策略。

更新时间: 2024-06-16 15:24:50

领域: cs.CL,cs.AI,68

下载: http://arxiv.org/abs/2406.10977v1

Promoting Data and Model Privacy in Federated Learning through Quantized LoRA

Conventional federated learning primarily aims to secure the privacy of data distributed across multiple edge devices, with the global model dispatched to edge devices for parameter updates during the learning process. However, the development of large language models (LLMs) requires substantial data and computational resources, rendering them valuable intellectual property for their developers and owners. To establish a mechanism that protects both data and model privacy in a federated learning context, we introduce a method that only needs to distribute a quantized version of the model's parameters during training. This method enables accurate gradient estimations for parameter updates while preventing clients from accessing a model whose performance is comparable to the centrally hosted one. Moreover, we combine this quantization strategy with LoRA, a popular and parameter-efficient fine-tuning method, to significantly reduce communication costs in federated learning. The proposed framework, named FedLPP, successfully ensures both data and model privacy in the federated learning context. Additionally, the learned central model exhibits good generalization and can be trained in a resource-efficient manner.
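To make the idea of "distributing a quantized version of the model's parameters" concrete, here is a minimal sketch of uniform quantization. The abstract does not specify FedLPP's quantizer, so the function `quantize` and its min-max scheme are assumptions chosen only for illustration.

```python
def quantize(weights, bits):
    """Uniformly quantize floats to at most 2**bits levels.

    Returns the dequantized values a client would see, plus the step size.
    This min-max scheme is a hypothetical stand-in, not FedLPP's quantizer.
    """
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    if hi == lo or levels == 0:
        # Degenerate case: every weight collapses to a single value.
        return [lo for _ in weights], 0.0
    step = (hi - lo) / levels
    # Snap each weight to the nearest grid point lo + k * step.
    return [lo + round((w - lo) / step) * step for w in weights], step
```

With fewer bits the client-side copy drifts further from the server's full-precision weights, which is the lever such a scheme would use to keep the distributed model weaker than the centrally hosted one.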

Updated: 2024-06-16 15:23:07

标题: 通过量化LoRA在联邦学习中促进数据和模型隐私

摘要: 传统的联邦学习主要旨在保护分布在多个边缘设备上的数据的隐私，全局模型在学习过程中被分发到边缘设备进行参数更新。然而，大型语言模型（LLMs）的发展需要大量的数据和计算资源，使它们成为开发者和所有者宝贵的知识产权。为了在联邦学习环境中建立既保护数据又保护模型隐私的机制，我们引入了一种方法，只需要在训练过程中分发模型参数的量化版本。这种方法可以实现准确的梯度估计以进行参数更新，同时防止客户端访问性能与中心托管模型相当的模型。此外，我们将这种量化策略与LoRA（一种流行且参数高效的微调方法）结合起来，以显著降低联邦学习中的通信成本。所提出的框架，命名为FedLPP，成功确保了联邦学习环境中的数据和模型隐私。此外，学习到的中心模型具有良好的泛化能力，并且可以以资源高效的方式进行训练。

更新时间: 2024-06-16 15:23:07

领域: cs.LG,cs.CL,cs.CR

下载: http://arxiv.org/abs/2406.10976v1

Towards Supporting Legal Argumentation with NLP: Is More Data Really All You Need?

Modeling legal reasoning and argumentation justifying decisions in cases has always been central to AI & Law, yet contemporary developments in legal NLP have increasingly focused on statistically classifying legal conclusions from text. While conceptually simpler, these approaches often fall short in providing usable justifications connecting to appropriate legal concepts. This paper reviews both traditional symbolic works in AI & Law and recent advances in legal NLP, and distills possibilities of integrating expert-informed knowledge to strike a balance between scalability and explanation in symbolic vs. data-driven approaches. We identify open challenges and discuss the potential of modern NLP models and methods that integrate

Updated: 2024-06-16 15:15:44

标题: 朝向支持法律论证的自然语言处理：仅仅增加数据就足够了吗？

摘要: 法律推理和辩论建立在案例决定的合理性上一直是人工智能与法律领域的核心内容，然而，当代法律自然语言处理的发展越来越集中于从文本中对法律结论进行统计分类。尽管在概念上更简单，但这些方法通常在提供连接到适当法律概念的可用理由方面存在不足。本文回顾了人工智能与法律领域中传统的符号作品和最近法律自然语言处理的进展，并概括了整合专家知识以在符号与数据驱动方法之间取得可扩展性和解释性平衡的可能性。我们识别了开放挑战，并讨论了现代自然语言处理模型和方法整合的潜力。

更新时间: 2024-06-16 15:15:44

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.10974v1

ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question in PEFT is whether the pre-training phase can be extended without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this new domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with weights pre-trained on large, natural-image datasets, such as those from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain. In this extended pre-training phase, ExPLoRA only unfreezes 1-2 pre-trained ViT blocks and all normalization layers, and then tunes all other layers with LoRA. Finally, we fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-training and fine-tuning ViTs. Using the DinoV2 training objective, we demonstrate up to 7% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and simply unfreezing more transformer blocks.
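The "small fraction of trainable weights" claim can be checked with a short sketch of the usual LoRA update, $W' = W + \frac{\alpha}{r} BA$. The helper names, the plain-list matrices, and the example dimensions below are illustrative assumptions, not ExPLoRA's code.

```python
def lora_param_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters (B: d_out x r, A: r x d_in) over the full d_out x d_in matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(W, B, A, alpha, rank):
    """Return W + (alpha / rank) * B @ A, leaving the frozen weight W unmodified."""
    delta = matmul(B, A)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
```

For a 768 x 768 projection with rank 8, `lora_param_fraction(768, 768, 8)` is 1/48, roughly 2%, which falls inside the 0.1%-10% range quoted in the abstract.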

Updated: 2024-06-16 15:14:56

标题: ExPLoRA：参数高效的扩展预训练，在领域转移下调整视觉Transformer

摘要: 参数高效微调（PEFT）技术，如低秩适应（LoRA），可以有效地将大型预训练基础模型适应到下游任务中，只使用原始可训练权重的很小一部分（0.1%-10%）。PEFT中一个尚未探索的问题是如何在没有监督标签的情况下延长预训练阶段；也就是说，我们能否通过在新领域上进行高效的自监督预训练来使预训练的基础模型适应新领域？在本研究中，我们引入了ExPLoRA，这是一种极为有效的技术，用于改善基于领域转移的预训练视觉变换器（ViTs）的迁移学习。 ExPLoRA在新领域继续无监督预训练目标，初始化一个ViT，并使用来自DinoV2或MAE等大型自然图像数据集的预训练权重。在这个扩展的预训练阶段，ExPLoRA只解冻1-2个预训练的ViT块和所有归一化层，然后使用LoRA调整所有其他层。最后，我们仅使用LoRA在新领域对结果模型进行微调以进行监督学习。我们的实验在卫星图像上展示了最先进的结果，甚至超过了完全预训练和微调ViTs。使用DinoV2的训练目标，我们在下游任务的线性探测top-1准确率上展示了高达7%的改进，同时使用的参数数量<10%，少于先前完全调整的最先进方法中使用的参数数量。我们的消融研究证实了我们的方法相对于其他基线的有效性，包括PEFT和简单地解冻更多的变换器块。

更新时间: 2024-06-16 15:14:56

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2406.10973v1

Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and Challenges

In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of LLMs on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From both data and learning perspectives, we examine various strategies that utilize LLMs for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for diverse forms of further training. Additionally, this paper highlights the primary open challenges faced in this domain, ranging from controllable data augmentation to multi-modal data augmentation. This survey highlights a paradigm shift introduced by LLMs in DA, and aims to serve as a comprehensive guide for researchers and practitioners.

Updated: 2024-06-16 14:50:50

标题: 使用LLMs进行数据增强：数据视角、学习范式和挑战

摘要: 在快速发展的大型语言模型（LLMs）领域，数据增强（DA）已经成为一种关键技术，通过使训练示例多样化而无需额外数据收集来增强模型性能。本调查探讨了LLMs在DA上的转变性影响，特别是在自然语言处理（NLP）等领域中所面临的独特挑战和机遇。从数据和学习角度出发，我们研究了利用LLMs进行数据增强的各种策略，包括一种新颖的学习范式探索，其中LLM生成的数据用于多种形式的进一步训练。此外，本文重点介绍了该领域面临的主要开放挑战，从可控数据增强到多模态数据增强。这项调查突出了LLMs在DA中引入的范式转变，并旨在成为研究人员和实践者的全面指南。

更新时间: 2024-06-16 14:50:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2403.02990v2

Ontology Embedding: A Survey of Methods, Applications and Resources

Ontologies are widely used for representing domain knowledge and metadata, playing an increasingly important role in Information Systems, the Semantic Web, Bioinformatics and many other domains. However, the logical reasoning that ontologies can directly support is quite limited in learning, approximation and prediction. One straightforward solution is to integrate statistical analysis and machine learning. To this end, automatically learning vector representations for the knowledge of an ontology, i.e., ontology embedding, has been widely investigated in recent years. Numerous papers have been published on ontology embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field. To bridge this gap, we write this survey paper, which first introduces different kinds of semantics of ontologies, and formally defines ontology embedding from the perspectives of both mathematics and machine learning, as well as its property of faithfulness. Based on this, it systematically categorises and analyses a relatively complete set of over 80 papers, according to the ontologies and semantics that they target, and their technical solutions including geometric modeling, sequence modeling and graph propagation. This survey also introduces the applications of ontology embedding in ontology engineering, machine learning augmentation and life sciences, presents a new library mOWL, and discusses the challenges and future directions.

Updated: 2024-06-16 14:49:19

标题: 本体嵌入：方法、应用和资源综述

摘要: 本文摘要介绍了本体论在表示领域知识和元数据方面的广泛应用，它在信息系统、语义网络、生物信息学等领域中发挥着越来越重要的作用。然而，本体论能直接支持的逻辑推理在学习、逼近和预测方面相当有限。一个直接的解决方案是整合统计分析和机器学习。为此，近年来广泛研究了自动学习本体知识的向量表示，即本体嵌入。许多论文已经发表关于本体嵌入，但缺乏系统性的综述阻碍了研究人员对这一领域的全面理解。为了弥补这一差距，我们撰写了这篇调查论文，首先介绍了不同种类的本体语义，并从数学和机器学习的角度正式定义了本体嵌入，以及其忠实性属性。基于此，它根据它们的目标本体和语义，以及它们的技术解决方案，包括几何建模、序列建模和图传播，系统地对超过80篇论文进行了分类和分析。本调查还介绍了本体嵌入在本体工程、机器学习增强和生命科学中的应用，展示了一个新的库mOWL，并讨论了挑战和未来方向。

更新时间: 2024-06-16 14:49:19

领域: cs.AI

下载: http://arxiv.org/abs/2406.10964v1

Open-Vocabulary X-ray Prohibited Item Detection via Fine-tuning CLIP

X-ray prohibited item detection is an essential component of security checks, and the categories of prohibited items are continuously increasing in accordance with the latest laws. Previous works all focus on closed-set scenarios, which can only recognize the known categories used for training and often require time-consuming and labor-intensive annotation when learning novel categories, resulting in limited real-world applicability. Although the success of vision-language models (e.g., CLIP) provides a new perspective for open-set X-ray prohibited item detection, directly applying CLIP to the X-ray domain leads to a sharp performance drop due to the domain shift between X-ray data and the general data used for pre-training CLIP. To address the aforementioned challenges, in this paper we introduce the distillation-based open-vocabulary object detection (OVOD) task into the X-ray security inspection domain by extending CLIP to learn visual representations in our specific X-ray domain, aiming to detect novel prohibited item categories beyond the base categories on which the detector is trained. Specifically, we propose an X-ray feature adapter and apply it to CLIP within the OVOD framework to develop the OVXD model. The X-ray feature adapter contains three adapter submodules with a bottleneck architecture; it is simple but can efficiently integrate new knowledge of the X-ray domain with the original knowledge, further bridging the domain gap and promoting alignment between X-ray images and textual concepts. Extensive experiments conducted on the PIXray and PIDray datasets demonstrate that the proposed method performs favorably against other baseline OVOD methods in detecting novel categories in the X-ray scenario. It outperforms the previous best results by 15.2 AP50 and 1.5 AP50 on PIXray and PIDray, achieving 21.0 AP50 and 27.8 AP50, respectively.

Updated: 2024-06-16 14:42:52

标题: 通过微调CLIP进行开放词汇X射线禁止物品检测

摘要: X射线禁止物品检测是安全检查的一个重要组成部分，禁止物品的类别随着最新法律的不断增加而不断增加。先前的研究都集中在密集场景中，只能识别用于训练的已知类别，并且在学习新类别时往往需要耗时且劳动密集的注释，导致现实世界应用有限。虽然视觉语言模型（例如CLIP）的成功为开放集X射线禁止物品检测提供了新的视角，但直接将CLIP应用于X射线领域会导致性能急剧下降，因为X射线数据与用于预训练CLIP的一般数据之间存在领域转移。为了解决上述挑战，在本文中，我们通过将CLIP扩展到学习我们特定的X射线领域中的视觉表示，引入了基于蒸馏的开放词汇目标检测（OVOD）任务到X射线安全检查领域，旨在检测超出检测器训练的基本类别的新禁止物品类别。具体来说，我们提出了X射线特征适配器，并将其应用于OVOD框架内的CLIP，以开发OVXD模型。X射线特征适配器包含三个瓶颈架构的适配器子模块，这种简单但有效地将X射线领域的新知识与原始知识整合在一起，进一步弥合领域差距，促进X射线图像和文本概念之间的对齐。在PIXray和PIDray数据集上进行的大量实验表明，所提出的方法在X射线场景中检测新类别方面表现优异。它在PIXray和PIDray上的表现优于先前最佳结果，分别提高了15.2 AP50和1.5 AP50，分别达到21.0 AP50和27.8 AP50。

更新时间: 2024-06-16 14:42:52

领域: cs.CV,cs.AI,cs.CY

下载: http://arxiv.org/abs/2406.10961v1

PhyloLM : Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks

This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs' output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time- and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
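As a rough illustration of a "distance based on the similarity of LLMs' output", the sketch below builds a pairwise distance table from completions on shared prompts. The abstract does not define PhyloLM's metric, so the exact-match distance and all function names here are hypothetical stand-ins.

```python
from itertools import combinations


def output_distance(outputs_a, outputs_b):
    """1 minus the fraction of shared prompts on which two models gave the same completion."""
    matches = sum(a == b for a, b in zip(outputs_a, outputs_b))
    return 1.0 - matches / len(outputs_a)


def distance_matrix(outputs_by_model):
    """Symmetric distance table keyed by (model, model) pairs."""
    names = sorted(outputs_by_model)
    dist = {(m, m): 0.0 for m in names}
    for m1, m2 in combinations(names, 2):
        d = output_distance(outputs_by_model[m1], outputs_by_model[m2])
        dist[(m1, m2)] = dist[(m2, m1)] = d
    return dist


def closest_pair(dist):
    """The pair an agglomerative dendrogram would merge first."""
    return min((pair for pair in dist if pair[0] < pair[1]), key=dist.get)
```

Repeating the "merge the closest pair" step while updating distances to the merged cluster is how an agglomerative dendrogram would be grown from such a table.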

Updated: 2024-06-16 14:39:20

标题: PhyloLM：推断大型语言模型的系统发育并预测它们在基准测试中的表现

摘要: 这篇论文介绍了PhyloLM，一种将系统发育算法应用于大型语言模型（LLMs）的方法，以探讨它们之间的关系以及如何预测它们的性能特征。我们的方法根据LLMs输出的相似性计算系统发育距离度量。然后利用这一度量构建树状图，可以满意地捕捉111个开源模型和45个闭源模型之间已知的关系。此外，我们的系统发育距离可以预测标准基准测试中的性能，从而证明了其功能有效性，并为评估LLM能力提供了一种节约时间和成本的方法。总之，通过将种群遗传学概念转化为机器学习，我们提出并验证了一种评估LLM发展、关系和能力的工具，即使在缺乏透明的训练信息的情况下也可以。

更新时间: 2024-06-16 14:39:20

领域: cs.CL,cs.LG,q-bio.PE

下载: http://arxiv.org/abs/2404.04671v3

Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part at the beginning epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
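The difference between the forward and reverse divergences discussed here is easiest to see numerically. The sketch below computes both between a teacher and a student distribution and mixes them with a fixed weight `alpha`. That fixed weight is an assumption for illustration: the abstract states that AKL allocates its weights adaptively but does not give the rule.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    """FKL: expectation taken under the teacher distribution."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """RKL: expectation taken under the student distribution."""
    return kl(student, teacher)


def mixed_kl(teacher, student, alpha):
    """Fixed-weight combination; a hypothetical stand-in for an adaptive weighting."""
    return alpha * forward_kl(teacher, student) + (1 - alpha) * reverse_kl(teacher, student)
```

Both divergences vanish only when the student matches the teacher, which is consistent with the abstract's point that they share the same optimum even though they weight the head and tail of the distribution differently along the way.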

Updated: 2024-06-16 14:32:48

标题: 重新思考在大型语言模型知识蒸馏中的Kullback-Leibler散度

摘要: Kullback-Leiber散度已被广泛用于知识蒸馏（KD）中压缩大型语言模型（LLMs）。与先前的断言相反，逆Kullback-Leibler（RKL）散度是寻找模式的，因此优于寻找平均值的正向Kullback-Leibler（FKL）散度，本研究在实证和理论上证明了在LLMs中KD中既不表现出寻找模式的特性也不表现出寻找平均值的特性。相反，发现RKL和FKL共享相同的优化目标，并且在足够数量的时代后都会收敛。然而，由于实际约束，LLMs很少被训练如此多的时代。同时，我们进一步发现RKL专注于分布的尾部，而FKL在开始时代专注于头部。因此，我们提出了一种简单而有效的自适应Kullback-Leiber（AKL）散度方法，该方法自适应地分配权重以结合FKL和RKL。基于度量和基于GPT-4的评估表明，所提出的AKL在各种任务中优于基线，并提高了生成响应的多样性和质量。

更新时间: 2024-06-16 14:32:48

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2404.02657v2

On Convergence and Rate of Convergence of Policy Improvement Algorithms

In this paper, we provide a simple proof from scratch for the convergence of the Policy Improvement Algorithm (PIA) for a continuous-time entropy-regularized stochastic control problem. Such convergence has been established by Huang-Wang-Zhou (2023) by using sophisticated PDE estimates for the iterative PDEs involved in the PIA. Our approach builds on some Feynman-Kac type probabilistic representation formulae for solutions of PDEs and their derivatives. Moreover, in the infinite horizon model with a large discount factor and in the finite horizon model, we obtain the exponential rate of convergence with similar arguments.
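The paper treats a continuous-time problem, but the shape of a policy improvement iteration can be sketched on a finite, discrete-time, entropy-regularized MDP. Everything below is an illustrative analog under stated assumptions (reward maximization, temperature `lam`, discount `gamma`, tabular transition array `P[s][a][t]`), not the algorithm analysed in the paper.

```python
import math


def evaluate(policy, P, R, gamma, lam, sweeps=2000):
    """Entropy-regularized value of a fixed policy, by repeated Bellman backups."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [
            sum(
                policy[s][a]
                * (R[s][a] - lam * math.log(policy[s][a])
                   + gamma * sum(P[s][a][t] * V[t] for t in range(n)))
                for a in range(len(R[s]))
            )
            for s in range(n)
        ]
    return V


def improve(V, P, R, gamma, lam):
    """Gibbs (softmax) policy update: pi(a|s) proportional to exp(Q(s, a) / lam)."""
    n = len(R)
    new_policy = []
    for s in range(n):
        q = [R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
             for a in range(len(R[s]))]
        top = max(q)  # subtract the maximum for numerical stability
        w = [math.exp((qa - top) / lam) for qa in q]
        z = sum(w)
        new_policy.append([wa / z for wa in w])
    return new_policy
```

Alternating `evaluate` and `improve` from a uniform policy gives a sequence of regularized values that does not decrease in this discrete setting; the paper's contribution concerns convergence, and its rate, for the continuous-time counterpart.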

Updated: 2024-06-16 14:31:26

标题: 关于策略改进算法的收敛性和收敛速率

摘要: 在这篇论文中，我们为连续时间熵正则化随机控制问题的策略改进算法（PIA）的收敛提供了一个简单的证明。黄王周（2023）通过对PIA中涉及的迭代PDE进行复杂的PDE估计，已经证明了这种收敛性。我们的方法基于一些Feynman-Kac类型的概率表示公式，用于PDE及其导数的解。此外，在具有大折现率的无限时间模型和有限时间模型中，我们通过类似的论证获得了指数收敛速度。

更新时间: 2024-06-16 14:31:26

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2406.10959v1

Robust Channel Learning for Large-Scale Radio Speaker Verification

Recent research in speaker verification has increasingly focused on achieving robust and reliable recognition under challenging channel conditions and noisy environments. Identifying speakers in radio communications is particularly difficult due to inherent limitations such as constrained bandwidth and pervasive noise interference. To address this issue, we present a Channel Robust Speaker Learning (CRSL) framework that enhances the robustness of the current speaker verification pipeline, considering data source, data augmentation, and the efficiency of model transfer processes. Our framework introduces an augmentation module that mitigates bandwidth variations in radio speech datasets by manipulating the bandwidth of training inputs. It also addresses unknown noise by introducing noise within the manifold space. Additionally, we propose an efficient fine-tuning method that reduces the need for extensive additional training time and large amounts of data. Moreover, we develop a toolkit for assembling a large-scale radio speech corpus and establish a benchmark specifically tailored for radio scenario speaker verification studies. Experimental results demonstrate that our proposed methodology effectively enhances performance and mitigates degradation caused by radio transmission in speaker verification tasks. The code will be available on GitHub.

Updated: 2024-06-16 14:17:57

标题: 大规模无线电扬声器验证的稳健信道学习

摘要: 最近在说话者验证方面的研究越来越集中在在具有挑战性的通道条件和嘈杂环境下实现稳健可靠的识别上。在无线电通信中识别说话者特别困难，这是由于固有的限制，如带宽受限和普遍的噪音干扰。为了解决这个问题，我们提出了一个名为通道鲁棒说话者学习（CRSL）框架，该框架增强了当前说话者验证流程的稳健性，考虑到数据源、数据增强和模型转移过程的效率。我们的框架引入了一个增强模块，通过操纵训练输入的带宽，缓解了无线电语音数据集中的带宽变化。它还通过在流形空间内引入噪声来解决未知噪声。此外，我们提出了一种高效的微调方法，减少了对大量额外训练时间和数据的需求。此外，我们开发了一个工具包，用于组建大规模的无线电语音语料库，并建立了一个专门为无线电场景的说话者验证研究量身定制的基准。实验结果表明，我们提出的方法有效地提高了性能，并减轻了无线电传输在说话者验证任务中引起的降级。该代码将在Github上提供。

更新时间: 2024-06-16 14:17:57

领域: cs.SD,cs.LG,eess.AS

下载: http://arxiv.org/abs/2406.10956v1

Towards Efficient Target-Level Machine Unlearning Based on Essential Graph

Machine unlearning is an emerging technology that has come to attract widespread attention. A number of factors, including regulations and laws, privacy, and usability concerns, have resulted in this need to allow a trained model to forget some of its training data. Existing studies of machine unlearning mainly focus on unlearning requests that forget a cluster of instances or all instances from one class. While these approaches are effective in removing instances, they do not scale to scenarios where partial targets within an instance need to be forgotten. For example, one would like to only unlearn a person from all instances that simultaneously contain the person and other targets. Directly migrating instance-level unlearning to target-level unlearning will reduce the performance of the model after the unlearning process, or fail to erase information completely. To address these concerns, we have proposed a more effective and efficient unlearning scheme that focuses on removing partial targets from the model, which we name "target unlearning". Specifically, we first construct an essential graph data structure to describe the relationships between all important parameters that are selected based on the model explanation method. After that, we simultaneously filter parameters that are also important for the remaining targets and use the pruning-based unlearning method, which is a simple but effective solution to remove information about the target that needs to be forgotten. Experiments with different training models on various datasets demonstrate the effectiveness of the proposed approach.

Updated: 2024-06-16 14:17:13

标题: 朝着基于关键图的高效目标级机器遗忘

摘要: 机器遗忘是一种新兴技术，已经引起了广泛的关注。一些因素，包括法规和法律、隐私和可用性问题，导致了需要允许一个经过训练的模型忘记部分训练数据。现有的机器遗忘研究主要集中在忘记一个类别的一组实例或所有实例的遗忘请求上。尽管这些方法能够有效地移除实例，但在需要忘记实例中的部分目标的情况下，这些方法并不适用。例如，有时候只想从同时包含某个人和其他目标的所有实例中遗忘这个人。将实例级的遗忘直接迁移到目标级的遗忘会导致模型在遗忘过程后性能下降，或者无法完全擦除信息。为了解决这些问题，我们提出了一种更有效和高效的遗忘方案，专注于从模型中移除部分目标，我们称之为“目标遗忘”。具体地，我们首先构建一个基本的图形数据结构来描述基于模型解释方法选择的所有重要参数之间的关系。之后，我们同时过滤对于剩余目标也重要的参数，并使用基于修剪的遗忘方法，这是一个简单但有效的解决方案，用于移除需要遗忘的目标的信息。在各种数据集上使用不同的训练模型进行实验，证明了所提出方法的有效性。

更新时间: 2024-06-16 14:17:13

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2406.10954v1

Large Language Models Can Better Understand Knowledge Graphs Than We Thought

As the parameter scale of large language models (LLMs) grows, jointly training knowledge graph (KG) embeddings with model parameters to enhance LLM capabilities becomes increasingly costly. Consequently, the community has shown interest in developing prompt strategies that effectively integrate KG information into LLMs. However, the format for incorporating KGs into LLMs lacks standardization; for instance, KGs can be transformed into linearized triples or natural language (NL) text. Current prompting methods often rely on a trial-and-error approach, leaving researchers with an incomplete understanding of which KG input format best facilitates LLM comprehension of KG content. To elucidate this, we design a series of experiments to explore LLMs' understanding of different KG input formats within the context of prompt engineering. Our analysis examines both literal and attention distribution levels. Through extensive experiments, we indicate a counter-intuitive phenomenon: when addressing fact-related questions, unordered linearized triples are more effective for LLMs' understanding of KGs compared to fluent NL text. Furthermore, noisy, incomplete, or marginally relevant subgraphs can still enhance LLM performance. Finally, different LLMs have distinct preferences for different formats of organizing unordered triples.
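The two KG input formats contrasted in this abstract, linearized triples and natural-language text, can be sketched as two small serializers. The exact string templates and the `build_prompt` layout are hypothetical choices for illustration; the abstract does not specify the formats used in the experiments.

```python
def linearize(triples):
    """Render (head, relation, tail) triples one per line, e.g. '(Paris, capital_of, France)'."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)


def verbalize(triples):
    """Render the same triples as rough natural-language sentences."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in triples)


def build_prompt(question, triples, fmt="triples"):
    """Place the serialized facts ahead of the question in a single prompt string."""
    facts = linearize(triples) if fmt == "triples" else verbalize(triples)
    return f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
```

Calling `build_prompt` twice on the same subgraph, once per format, gives the kind of paired inputs that a format comparison like the one described would need.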

Updated: 2024-06-16 14:16:56

标题: 大型语言模型可以比我们想象的更好地理解知识图谱

摘要: 随着大型语言模型（LLMs）的参数规模增长，联合训练知识图谱（KG）嵌入和模型参数以增强LLM能力变得越来越昂贵。因此，学术界对有效将KG信息整合到LLMs中的提示策略表现出兴趣。然而，将KG整合到LLMs中的格式缺乏标准化；例如，KG可以转换为线性化三元组或自然语言文本。当前的提示方法通常依赖于试错方法，使研究人员对哪种KG输入格式最有利于LLM理解KG内容没有完全理解。为了阐明这一点，我们设计了一系列实验来探索LLMs对提示工程背景下不同KG输入格式的理解。我们的分析考察了字面和注意力分布级别。通过大量实验，我们指出了一个违反直觉的现象：在回答与事实相关的问题时，无序线性化三元组比流畅的自然语言文本更有助于LLMs理解KG。此外，嘈杂、不完整或边缘相关的子图仍然可以提高LLM的性能。最后，不同的LLMs对组织无序三元组的不同格式有着明显的偏好。

更新时间: 2024-06-16 14:16:56

领域: cs.CL,cs.AI,I.2.4; I.2.7

下载: http://arxiv.org/abs/2402.11541v3

Really Unlearned? Verifying Machine Unlearning via Influential Sample Pairs

Machine unlearning enables pre-trained models to eliminate the effects of partial training samples. Previous research has mainly focused on proposing efficient unlearning strategies. However, the verification of machine unlearning, or in other words, how to guarantee that a sample has been successfully unlearned, has been overlooked for a long time. Existing verification schemes typically rely on machine learning attack techniques, such as backdoor or membership inference attacks. As these techniques are not formally designed for verification, they are easily bypassed when an untrustworthy MLaaS undergoes rapid fine-tuning to merely meet the verification conditions, rather than executing real unlearning. In this paper, we propose a formal verification scheme, IndirectVerify, to determine whether unlearning requests have been successfully executed. We design influential sample pairs: one referred to as trigger samples and the other as reaction samples. Users send unlearning requests regarding trigger samples and use reaction samples to verify if the unlearning operation has been successfully carried out. We propose a perturbation-based scheme to generate those influential sample pairs. The objective is to perturb only a small fraction of trigger samples, leading to the reclassification of reaction samples. This indirect influence will be used for our verification purposes. In contrast to existing schemes that employ the same samples for all processes, our scheme, IndirectVerify, provides enhanced robustness, making it less susceptible to bypassing processes.

Updated: 2024-06-16 14:14:05

标题: 真的没学到吗？通过影响样本对验证机器的去学习

摘要: 机器遗忘使预先训练的模型能够消除部分训练样本的影响。先前的研究主要集中在提出高效的遗忘策略上。然而，机器遗忘的验证，或者换句话说，如何保证样本已成功遗忘，长期以来一直被忽视。现有的验证方案通常依赖于机器学习攻击技术，例如后门或成员推断攻击。由于这些技术并非正式设计用于验证，当不可信的MLaaS经过快速微调以满足验证条件时，容易被绕过，而不是执行真正的遗忘。在本文中，我们提出了一个形式验证方案，称为IndirectVerify，用于确定遗忘请求是否已成功执行。我们设计了有影响力的样本对：一个称为触发样本，另一个称为反应样本。用户发送有关触发样本的遗忘请求，并使用反应样本来验证遗忘操作是否已成功执行。我们提出了一种基于扰动的方案来生成这些有影响力的样本对。目标是仅扰动触发样本的一小部分，导致反应样本的重新分类。这种间接影响将用于我们的验证目的。与现有方案使用相同样本进行所有过程相比，我们的方案IndirectVerify提供了增强的鲁棒性，使其不太容易受到绕过过程的影响。

更新时间: 2024-06-16 14:14:05

领域: cs.CR

下载: http://arxiv.org/abs/2406.10953v1

Don't Forget Too Much: Towards Machine Unlearning on Feature Level

Machine unlearning enables pre-trained models to remove the effect of certain portions of training data. Previous machine unlearning schemes have mainly focused on unlearning a cluster of instances or all instances belonging to a specific class. These types of unlearning might have a significant impact on model utility, and they may be inadequate for situations where we only need to unlearn features within instances, rather than whole instances. Due to the difference in granularity, current unlearning methods can hardly achieve feature-level unlearning. To address the challenges of utility and granularity, we propose a refined-granularity unlearning scheme referred to as "feature unlearning". We first explore two distinct scenarios based on whether annotation information about the features is given: feature unlearning with known annotations and feature unlearning without annotations. For unlearning with known annotations, we propose an adversarial learning approach to automatically remove the effects of features. For unlearning without annotations, we first use model interpretability techniques to enable the output of one of the model's layers to identify different pattern features. We then filter features from instances based on these outputs, so that the feature impact can be removed using the filtered instances and a fine-tuning process. The effectiveness of our proposed approach is demonstrated through experiments involving diverse models on various datasets in different scenarios.

Updated: 2024-06-16 14:08:46

标题: 不要忘记太多：朝向特征级别的机器遗忘

摘要: 机器遗忘使预训练模型能够消除部分训练数据的影响。先前的机器遗忘方案主要集中在遗忘一组实例或所有属于特定类的实例。这些类型的遗忘可能对模型效用产生重大影响；在只需要遗忘实例内特征而非整个实例的情况下，它们可能不足够。由于不同的粒度，当前的遗忘方法几乎无法实现特征级别的遗忘。为了解决效用和粒度的挑战，我们提出了一个称为“特征遗忘”的细粒度遗忘方案。我们首先探讨了两种不同的场景，基于是否提供有关特征的注释信息：具有已知注释的特征遗忘和没有注释的特征遗忘。关于具有已知注释的遗忘，我们提出了一种对抗学习方法，自动消除有关特征的影响。对于没有注释的遗忘，我们首先利用模型可解释性技术使一个模型层的输出能够识别不同的模式特征。然后，我们根据这些具有识别能力的输出从实例中筛选特征。这样，我们可以基于经过筛选的实例和微调过程来消除特征影响。我们提出的方法的有效性通过涉及不同场景的各种数据集上的多种模型的实验得到了证明。

更新时间: 2024-06-16 14:08:46

领域: cs.CR

下载: http://arxiv.org/abs/2406.10951v1

Exploring the Efficacy of Federated-Continual Learning Nodes with Attention-Based Classifier for Robust Web Phishing Detection: An Empirical Investigation

Web phishing poses a dynamic threat, requiring detection systems to quickly adapt to the latest tactics. Traditional approaches of accumulating data and periodically retraining models are outpaced. We propose a novel paradigm combining federated learning and continual learning, enabling distributed nodes to continually update models on streams of new phishing data, without accumulating data. These locally adapted models are then aggregated at a central server via federated learning. To enhance detection, we introduce a custom attention-based classifier model with residual connections, tailored for web phishing, leveraging attention mechanisms to capture intricate phishing patterns. We evaluate our hybrid learning paradigm across continual learning strategies (cumulative, replay, MIR, LwF) and model architectures through an empirical investigation. Our main contributions are: (1) a new hybrid federated-continual learning paradigm for robust web phishing detection, and (2) a novel attention + residual connections based model explicitly designed for this task, attaining 0.93 accuracy, 0.90 precision, 0.96 recall and 0.93 f1-score with the LwF strategy, outperforming traditional approaches in detecting emerging phishing threats while retaining past knowledge.
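The step in which "locally adapted models are then aggregated at a central server" is commonly done by weighted parameter averaging (FedAvg). The abstract does not name the aggregation rule, so the sketch below is an assumption, with each client's model flattened to a plain list of floats.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

In a federated-continual setup of the kind described, each node would first update its model on its newest stream of phishing data, for example with a continual learning strategy such as LwF, and only the resulting parameters would be sent for averaging.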

Updated: 2024-06-16 14:05:53

标题: 探索基于关注力分类器的联邦持续学习节点在鲁棒网络钓鱼检测中的有效性：实证调查

摘要: 网络钓鱼构成了一种动态威胁，要求检测系统能够迅速适应最新的策略。传统的数据积累和定期重新训练模型的方法已经落后。我们提出了一种结合联邦学习和持续学习的新范式，使分布式节点能够持续更新模型，无需积累数据，仅使用新的网络钓鱼数据流。这些经过本地调整的模型然后通过联邦学习汇总到中央服务器。为了增强检测能力，我们引入了一个基于注意力的自定义分类器模型，带有残差连接，专为网络钓鱼而设计，利用注意力机制捕获复杂的网络钓鱼模式。我们通过实证调查评估了我们的混合学习范式跨持续学习策略（累积、重放、MIR、LwF）和模型架构。我们的主要贡献是：（1）一种新的用于强大网络钓鱼检测的混合联邦-持续学习范式，以及（2）一种新颖的基于注意力和残差连接的模型，专门为这一任务设计，通过LwF策略获得了0.93的准确率、0.90的精确度、0.96的召回率和0.93的f1分数，在检测新兴网络钓鱼威胁时优于传统方法，同时保留过去的知识。

更新时间: 2024-06-16 14:05:53

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2405.03537v2

Incorporating uncertainty quantification into travel mode choice modeling: a Bayesian neural network (BNN) approach and an uncertainty-guided active survey framework

Existing deep learning approaches for travel mode choice modeling fail to inform modelers about their prediction uncertainty. Even when facing scenarios that are out of the distribution of training data, which implies high prediction uncertainty, these approaches still provide deterministic answers, potentially leading to misguidance. To address this limitation, this study introduces the concept of uncertainty from the field of explainable artificial intelligence into travel mode choice modeling. We propose a Bayesian neural network-based travel mode prediction model (BTMP) that quantifies the uncertainty of travel mode predictions, enabling the model itself to "know" and "tell" what it doesn't know. With BTMP, we further propose an uncertainty-guided active survey framework, which dynamically formulates survey questions representing travel mode choice scenarios with high prediction uncertainty. Through iterative collection of responses to these dynamically tailored survey questions, BTMP is iteratively trained to achieve the desired accuracy faster with fewer questions, thereby reducing survey costs. Experimental validation using synthetic datasets confirms the effectiveness of BTMP in quantifying prediction uncertainty. Furthermore, experiments, utilizing both synthetic and real-world data, demonstrate that the BTMP model, trained with the uncertainty-guided active survey framework, requires 20% to 50% fewer survey responses to match the performance of the model trained on randomly collected survey data. Overall, the proposed BTMP model and active survey framework innovatively incorporate uncertainty quantification into travel mode choice modeling, providing model users with essential insights into prediction reliability while optimizing data collection for deep learning model training in a cost-efficient manner.
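One common way to turn a Bayesian neural network's sampled predictions into an uncertainty score is predictive entropy, which can then rank candidate survey scenarios. The abstract does not state which uncertainty measure BTMP uses, so the measure and the function names below are assumptions for illustration.

```python
import math


def predictive_entropy(sampled_probs):
    """Entropy of the mean class distribution over several stochastic forward passes."""
    n = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / n for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)


def most_uncertain(scenarios):
    """Index of the scenario whose sampled predictions have the highest predictive entropy."""
    return max(range(len(scenarios)), key=lambda i: predictive_entropy(scenarios[i]))
```

An uncertainty-guided survey loop would ask respondents about the scenario returned by `most_uncertain`, retrain on the answers, and repeat.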

Updated: 2024-06-16 14:05:47

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.10948v1

ROME: Memorization Insights from Text, Logits and Representation

Previous works have evaluated memorization by comparing model outputs with training corpora, examining how factors such as data duplication, model size, and prompt length influence memorization. However, analyzing these extensive training corpora is highly time-consuming. To address this challenge, this paper proposes an innovative approach named ROME that bypasses direct processing of the training data. Specifically, we select datasets categorized into three distinct types -- context-independent, conventional, and factual -- and redefine memorization as the ability to produce correct answers under these conditions. Our analysis then focuses on disparities between memorized and non-memorized samples by examining the logits and representations of generated texts. Experimental findings reveal that longer words are less likely to be memorized, higher confidence correlates with greater memorization, and representations of the same concepts are more similar across different contexts. Our code and data will be publicly available when the paper is accepted.

Updated: 2024-06-16 13:53:44

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2403.00510v3

Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries

State-of-the-art large language models are sometimes distributed as open-source software but are also increasingly provided as a closed-source service. These closed-source large language models typically see the widest usage by the public; however, they often do not provide an estimate of their uncertainty when responding to queries. As even the best models are prone to "hallucinating" false information with high confidence, the lack of a reliable estimate of uncertainty limits the applicability of these models in critical settings. We explore estimating the uncertainty of closed-source LLMs via multiple rephrasings of an original base query. Specifically, we ask the model multiple rephrased questions and use the similarity of the answers as an estimate of uncertainty. We diverge from previous work in i) providing rules for rephrasing that are simple to memorize and use in practice, and ii) proposing a theoretical framework for why multiple rephrased queries obtain calibrated uncertainty estimates. Our method demonstrates significant improvements in the calibration of uncertainty estimates compared to the baseline and provides intuition as to how query strategies should be designed for optimal test calibration.
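
The core idea, querying once per rephrasing and reading agreement among the answers as confidence, can be sketched as follows. Exact-match voting stands in for the answer-similarity measure, and `fake_model` is a hypothetical placeholder for a closed-source LLM call; neither is taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def rephrase_confidence(
    ask: Callable[[str], str], rephrasings: List[str]
) -> Tuple[str, float]:
    """Query once per rephrasing and score agreement among the answers."""
    answers = [ask(q).strip().lower() for q in rephrasings]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)


def fake_model(query: str) -> str:
    """Stand-in for a closed-source LLM call (hypothetical)."""
    return "Canberra" if "capital" in query.lower() else "Sydney"


queries = [
    "What is the capital of Australia?",
    "Which city is Australia's capital?",
    "Name the capital city of Australia.",
    "Australia's seat of government is in which city?",
]
answer, confidence = rephrase_confidence(fake_model, queries)
```

Here three of the four rephrasings agree, so the majority answer is returned with an agreement score of 0.75; one minus that score can serve as a simple uncertainty estimate.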

Updated: 2024-06-16 13:49:53

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2405.13907v2

Effective Generative AI: The Human-Algorithm Centaur

Advanced analytics science methods have enabled combining the power of artificial and human intelligence, creating centaurs that allow superior decision-making. Centaurs are hybrid human-algorithm AI models that combine both formal analytics and human intuition in a symbiotic manner within their learning and reasoning process. We argue that the future of AI development and use in many domains needs to focus on centaurs as opposed to traditional AI approaches. This paradigm shift from traditional AI methods to centaur-based AI methods raises some fundamental questions: How are centaurs different from traditional human-in-the-loop methods? What are the most effective methods for creating centaurs? When should centaurs be used, and when should the lead be given to traditional AI models? Doesn't the incorporation of human intuition -- which at times can be misleading -- in centaurs' decision-making process degrade their performance compared to traditional AI methods? This work aims to address these fundamental questions, focusing on recent advancements in generative AI, and especially in Large Language Models (LLMs), as a main case study to illustrate the critical importance of centaurs to future AI endeavors.

Updated: 2024-06-16 13:44:41

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.10942v1

Towards augmented data quality management: Automation of Data Quality Rule Definition in Data Warehouses

In the contemporary data-driven landscape, ensuring data quality (DQ) is crucial for deriving actionable insights from vast data repositories. The objective of this study is to explore the potential for automating data quality management within data warehouses, the data repositories commonly used by large organizations. By conducting a systematic review of existing DQ tools available in the market and academic literature, the study assesses their capability to automatically detect and enforce data quality rules. The review encompassed 151 tools from various sources, revealing that most current tools focus on data cleansing and fixing in domain-specific databases rather than data warehouses. Only a limited number of tools, specifically ten, demonstrated the capability to detect DQ rules, let alone to implement this in data warehouses. The findings underscore a significant gap in the market and academic research regarding AI-augmented DQ rule detection in data warehouses. This paper advocates for further development in this area to enhance the efficiency of DQ management processes, reduce human workload, and lower costs. The study highlights the necessity of advanced tools for automated DQ rule detection, paving the way for improved practices in data quality management tailored to data warehouse environments. The study can guide organizations in selecting the data quality tool that best meets their requirements.

Updated: 2024-06-16 13:43:04

Categories: cs.DB,cs.AI,cs.CE,cs.ET

Download: http://arxiv.org/abs/2406.10940v1

Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning

Many real-world optimization problems contain parameters that are unknown before deployment time, either due to stochasticity or to lack of information (e.g., demand or travel times in delivery problems). A common strategy in such cases is to estimate said parameters via machine learning (ML) models trained to minimize the prediction error, which, however, is not necessarily aligned with the downstream task-level error. The decision-focused learning (DFL) paradigm overcomes this limitation by training to directly minimize a task loss, e.g. regret. Since the latter has non-informative gradients for combinatorial problems, state-of-the-art DFL methods introduce surrogates and approximations that enable training. But these methods exploit specific assumptions about the problem structures (e.g., convex or linear problems, unknown parameters only in the objective function). We propose an alternative method that makes no such assumptions: it combines stochastic smoothing with score function gradient estimation, which works on any task loss. This opens up the use of DFL methods to nonlinear objectives, uncertain parameters in the problem constraints, and even two-stage stochastic optimization. Experiments show that it typically requires more epochs, but that it is on par with specialized methods and performs especially well for the difficult case of problems with uncertainty in the constraints, in terms of solution quality, scalability, or both.
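
Score function gradient estimation under stochastic smoothing can be illustrated on a one-dimensional toy case. The sketch below assumes Gaussian smoothing of a single predicted parameter and a mean-loss baseline for variance reduction; the function name and these choices are assumptions for illustration, not the paper's exact estimator.

```python
import random
from typing import Callable


def score_function_gradient(
    task_loss: Callable[[float], float],
    mu: float,
    sigma: float,
    n_samples: int,
    rng: random.Random,
) -> float:
    """Estimate d/dmu E_{z ~ N(mu, sigma^2)}[task_loss(z)].

    Uses the identity grad = E[task_loss(z) * (z - mu) / sigma^2]; the loss is
    only evaluated, never differentiated.
    """
    zs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    losses = [task_loss(z) for z in zs]
    baseline = sum(losses) / n_samples  # mean-loss baseline to reduce variance
    total = 0.0
    for z, loss in zip(zs, losses):
        total += (loss - baseline) * (z - mu) / sigma**2
    return total / n_samples


# Sanity check on a smooth loss whose smoothed gradient is known: for
# task_loss(z) = z^2, E[z^2] = mu^2 + sigma^2, so the gradient in mu is 2 * mu.
estimate = score_function_gradient(
    lambda z: z * z, mu=1.0, sigma=0.5, n_samples=20000, rng=random.Random(0)
)
```

Because the estimator only calls `task_loss`, the same routine applies when the loss is the regret of a combinatorial solver whose output is piecewise constant in its inputs.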

Updated: 2024-06-16 13:38:00

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2307.05213v2

Understanding Understanding: A Pragmatic Framework Motivated by Large Language Models

Motivated by the rapid ascent of Large Language Models (LLMs) and debates about the extent to which they possess human-level qualities, we propose a framework for testing whether any agent (be it a machine or a human) understands a subject matter. In Turing-test fashion, the framework is based solely on the agent's performance, and specifically on how well it answers questions. Elements of the framework include circumscribing the set of questions (the "scope of understanding"), requiring general competence ("passing grade"), and avoiding "ridiculous answers", while still allowing wrong and "I don't know" answers to some questions. Reaching certainty about these conditions requires exhaustive testing of the questions, which is impossible for nontrivial scopes, but we show how high confidence can be achieved via random sampling and the application of probabilistic confidence bounds. We also show that accompanying answers with explanations can improve the sample complexity required to achieve acceptable bounds, because an explanation of an answer implies the ability to answer many similar questions. According to our framework, current LLMs cannot be said to understand nontrivial domains, but as the framework provides a practical recipe for testing understanding, it also constitutes a tool for building AI agents that do understand.
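
How random sampling yields high confidence can be shown with a standard bound. The sketch below uses the one-sided Hoeffding inequality as an assumed choice of "probabilistic confidence bound"; the abstract does not say which bound the authors apply, and the function names are illustrative.

```python
import math


def accuracy_lower_bound(correct: int, sampled: int, delta: float) -> float:
    """One-sided Hoeffding lower bound on the true fraction of acceptable answers.

    Holds with probability at least 1 - delta when questions are drawn
    independently and uniformly from the scope.
    """
    observed = correct / sampled
    return observed - math.sqrt(math.log(1.0 / delta) / (2.0 * sampled))


def questions_needed(epsilon: float, delta: float) -> int:
    """Smallest sample size whose Hoeffding margin is at most epsilon."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * epsilon**2))


bound = accuracy_lower_bound(correct=90, sampled=100, delta=0.05)
n = questions_needed(epsilon=0.05, delta=0.05)
```

With 90 acceptable answers out of 100 sampled questions, the 95% lower bound on the true rate is roughly 0.78, and tightening the margin to 0.05 at the same confidence takes 600 sampled questions regardless of how large the scope is.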

Updated: 2024-06-16 13:37:08

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.10937v1

Knowledge Base Enabled Semantic Communication: A Generative Perspective

Semantic communication is widely touted as a key technology for propelling the sixth-generation (6G) wireless networks. However, providing effective semantic representation is quite challenging in practice. To address this issue, this article takes a crack at exploiting a semantic knowledge base (KB) to usher in a new era of generative semantic communication. Via a semantic KB, source messages can be characterized in low-dimensional subspaces without compromising their desired meanings, thus significantly enhancing the communication efficiency. The fundamental principle of the semantic KB is first introduced, and a generative semantic communication architecture is developed by presenting three sub-KBs, namely source, task, and channel KBs. Then, the detailed construction approaches for each sub-KB are described, followed by their utilization in terms of semantic coding and transmission. A case study is also provided to showcase the superiority of generative semantic communication over conventional syntactic communication and classical semantic communication. In a nutshell, this article establishes a scientific foundation for the exciting uncharted frontier of generative semantic communication.

Updated: 2024-06-16 13:35:45

Categories: cs.IT,cs.AI,cs.NI,math.IT

Download: http://arxiv.org/abs/2311.12443v2

KAN: Kolmogorov-Arnold Networks

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
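
The structural difference from an MLP, learnable univariate functions on edges and plain summation at nodes, can be sketched in a few lines. For brevity the sketch swaps in piecewise-linear interpolation on a fixed grid where the abstract describes spline parametrization, and the class and function names are hypothetical.

```python
from typing import List


class EdgeFunction:
    """A univariate function stored as values on a uniform grid over [lo, hi].

    The grid values are the learnable parameters; evaluation interpolates
    linearly between neighbouring grid points.
    """

    def __init__(self, values: List[float], lo: float = -1.0, hi: float = 1.0):
        self.values, self.lo, self.hi = values, lo, hi

    def __call__(self, x: float) -> float:
        x = min(max(x, self.lo), self.hi)  # clamp to the grid range
        pos = (x - self.lo) / (self.hi - self.lo) * (len(self.values) - 1)
        i = min(int(pos), len(self.values) - 2)
        frac = pos - i
        return (1.0 - frac) * self.values[i] + frac * self.values[i + 1]


def kan_layer(edges: List[List[EdgeFunction]], x: List[float]) -> List[float]:
    """edges[j][i] maps input i to output j; each output node only sums."""
    return [sum(phi(xi) for phi, xi in zip(row, x)) for row in edges]


# Two inputs, one output: phi_1(x) = |x| and phi_2(x) = x on a 3-point grid.
layer = [[EdgeFunction([1.0, 0.0, 1.0]), EdgeFunction([-1.0, 0.0, 1.0])]]
y = kan_layer(layer, [-0.5, 0.5])
```

There is no weight matrix anywhere in the layer: all trainable parameters live in the grid values of the edge functions, which is also what makes each edge easy to plot and inspect.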

Updated: 2024-06-16 13:34:56

Categories: cs.LG,cond-mat.dis-nn,cs.AI,stat.ML

Download: http://arxiv.org/abs/2404.19756v4

Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition

Speech recognition is an essential starting point of human-computer interaction, and recently, deep learning models have achieved excellent success in this task. However, when model training and private data providers are separated, security threats that make deep neural networks (DNNs) behave abnormally deserve to be researched. In recent years, typical backdoor attacks have been studied in speech recognition systems. The existing backdoor methods are based on data poisoning. The attacker adds some changes to benign speech spectrograms or changes speech components such as pitch and timbre. As a result, the poisoned data can be detected by human hearing or automatic deep algorithms. To improve the stealthiness of data poisoning, we propose a non-neural and fast algorithm called Random Spectrogram Rhythm Transformation (RSRT) in this paper. The algorithm combines four steps to generate stealthy poisoned utterances. From the perspective of rhythm component transformation, our proposed trigger stretches or squeezes the mel spectrograms and converts them back to signals. The operation keeps timbre and content unchanged for good stealthiness. Our experiments are conducted on two kinds of speech recognition tasks, including testing the stealthiness of poisoned samples by speaker verification and automatic speech recognition. The results show that our method has excellent effectiveness and stealthiness. The rhythm trigger needs a low poisoning rate and achieves a very high attack success rate.

Updated: 2024-06-16 13:29:21

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2406.10932v1

RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation

The Retrieval Augmented Generation (RAG) framework utilizes a combination of parametric knowledge and external knowledge to demonstrate state-of-the-art performance on open-domain question answering tasks. However, the RAG framework suffers from performance degradation when the query is accompanied by irrelevant contexts. In this work, we propose the RE-RAG framework, which introduces a relevance estimator (RE) that not only provides relative relevance between contexts as previous rerankers did, but also provides confidence, which can be used to classify whether a given context is useful for answering the given question. We propose a weakly supervised method for training the RE simply utilizing question-answer data without any labels for correct contexts. We show that RE trained with a small generator (sLM) can not only improve the sLM fine-tuned together with RE but also improve previously unreferenced large language models (LLMs). Furthermore, we investigate new decoding strategies that utilize the proposed confidence measured by RE, such as choosing to tell the user that the question is "unanswerable" given the retrieved contexts, or choosing to rely on the LLM's parametric knowledge rather than unrelated contexts.
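
The confidence-based decoding strategies can be sketched as a simple gate. The threshold value, the function name, and the example scores below are assumptions for illustration; the abstract does not specify how the RE confidence is thresholded.

```python
from typing import List, Tuple


def choose_strategy(
    scored_contexts: List[Tuple[str, float]], threshold: float = 0.5
) -> Tuple[str, List[str]]:
    """Pick a decoding strategy from relevance-estimator confidences.

    Returns ("use_contexts", kept) when at least one context passes the
    threshold, otherwise ("fallback", []) so the caller can either answer
    "unanswerable" or rely on the generator's parametric knowledge.
    """
    kept = [(ctx, conf) for ctx, conf in scored_contexts if conf >= threshold]
    if not kept:
        return "fallback", []
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return "use_contexts", [ctx for ctx, _ in kept]


# Hypothetical relevance-estimator scores for two retrieved passages.
scored = [
    ("Canberra is the capital of Australia.", 0.93),
    ("Sydney hosted the 2000 Olympics.", 0.12),
]
strategy, contexts = choose_strategy(scored)
low_strategy, low_contexts = choose_strategy([("Unrelated passage.", 0.08)])
```

A reranker that only orders contexts could not trigger the fallback branch, since relative scores alone never indicate that every retrieved passage is irrelevant.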

Updated: 2024-06-16 13:28:24

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.05794v2

Make Your Home Safe: Time-aware Unsupervised User Behavior Anomaly Detection in Smart Homes via Loss-guided Mask

Smart homes, powered by the Internet of Things, offer great convenience but also pose security concerns due to abnormal behaviors, such as improper operations of users and potential attacks from malicious attackers. Several behavior modeling methods have been proposed to identify abnormal behaviors and mitigate potential risks. However, their performance often falls short because they do not effectively learn less frequent behaviors, consider temporal context, or account for the impact of noise in human behaviors. In this paper, we propose SmartGuard, an autoencoder-based unsupervised user behavior anomaly detection framework. First, we design a Loss-guided Dynamic Mask Strategy (LDMS) to encourage the model to learn less frequent behaviors, which are often overlooked during learning. Second, we propose a Three-level Time-aware Position Embedding (TTPE) to incorporate temporal information into positional embedding to detect temporal context anomalies. Third, we propose a Noise-aware Weighted Reconstruction Loss (NWRL) that assigns different weights to routine behaviors and noise behaviors to mitigate the interference of noise behaviors during inference. Comprehensive experiments on three datasets with ten types of anomalous behaviors demonstrate that SmartGuard consistently outperforms state-of-the-art baselines and also offers highly interpretable results.

Updated: 2024-06-16 13:23:21

Categories: cs.CR,cs.AI,cs.NI

Download: http://arxiv.org/abs/2406.10928v1

Exact Mean Square Linear Stability Analysis for SGD

The dynamical stability of optimization methods in the vicinity of minima of the loss has recently attracted significant attention. For gradient descent (GD), stable convergence is possible only to minima that are sufficiently flat w.r.t. the step size, and those have been linked with favorable properties of the trained model. However, while the stability threshold of GD is well-known, to date, no explicit expression has been derived for the exact threshold of stochastic GD (SGD). In this paper, we derive such a closed-form expression. Specifically, we provide an explicit condition on the step size that is both necessary and sufficient for the linear stability of SGD in the mean square sense. Our analysis sheds light on the precise role of the batch size $B$. In particular, we show that the stability threshold is monotonically non-decreasing in the batch size, which means that reducing the batch size can only decrease stability. Furthermore, we show that SGD's stability threshold is equivalent to that of a mixture process which takes in each iteration a full batch gradient step w.p. $1-p$, and a single sample gradient step w.p. $p$, where $p \approx 1/B$. This indicates that even with moderate batch sizes, SGD's stability threshold is very close to that of GD. We also prove simple necessary conditions for linear stability, which depend on the batch size, and are easier to compute than the precise threshold. Finally, we derive the asymptotic covariance of the dynamics around the minimum, and discuss its dependence on the learning rate. We validate our theoretical findings through experiments on the MNIST dataset.
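
The batch-size monotonicity can be checked by hand on a one-dimensional toy problem. The sketch assumes per-sample losses $0.5\,h_i x^2$ sharing a minimum at $x = 0$ and batches drawn uniformly without replacement; it is a simplified illustration in which the mean-square condition is easy to derive, not the paper's general closed-form expression.

```python
from itertools import combinations
from typing import List


def ms_stability_threshold(curvatures: List[float], batch_size: int) -> float:
    """Largest step size keeping E[x_t^2] from growing in a 1-D toy problem.

    Per-sample losses are 0.5 * h_i * x^2, so one SGD step multiplies x by
    (1 - eta * h_B), where h_B is the mean curvature of the batch. Requiring
    E[(1 - eta * h_B)^2] < 1 gives eta < 2 * E[h_B] / E[h_B^2].
    """
    batch_means = [sum(b) / batch_size for b in combinations(curvatures, batch_size)]
    first = sum(batch_means) / len(batch_means)
    second = sum(m * m for m in batch_means) / len(batch_means)
    return 2.0 * first / second


# Hypothetical per-sample curvatures; thresholds for B = 1, 2, 3, 4.
h = [1.0, 3.0, 2.0, 6.0]
thresholds = [ms_stability_threshold(h, b) for b in range(1, len(h) + 1)]
```

For these curvatures the threshold rises from 0.48 at $B = 1$ to the GD value $2/\bar{h} = 2/3$ at full batch, consistent with the claim that shrinking the batch can only reduce stability.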

Updated: 2024-06-16 13:21:53

Categories: cs.LG

Download: http://arxiv.org/abs/2306.07850v3

Convergence Acceleration in Wireless Federated Learning: A Stackelberg Game Approach

This paper studies issues that arise with respect to the joint optimization for convergence time in federated learning over wireless networks (FLOWN). We consider the criterion and protocol for selection of participating devices in FLOWN under the energy constraint and derive its impact on device selection. In order to improve the training efficiency, age-of-information (AoI) enables FLOWN to assess the freshness of gradient updates among participants. Aiming to speed up convergence, we jointly investigate global loss minimization and latency minimization in a Stackelberg game based framework. Specifically, we formulate global loss minimization as a leader-level problem for reducing the number of required rounds, and latency minimization as a follower-level problem to reduce time consumption of each round. By decoupling the follower-level problem into two sub-problems, including resource allocation and sub-channel assignment, we achieve an optimal strategy of the follower through monotonic optimization and matching theory. At the leader-level, we derive an upper bound of convergence rate and subsequently reformulate the global loss minimization problem and propose a new age-of-update (AoU) based device selection algorithm. Simulation results indicate the superior performance of the proposed AoU based device selection scheme in terms of the convergence rate, as well as efficient utilization of available sub-channels.

Updated: 2024-06-16 13:12:26

Categories: cs.LG

Download: http://arxiv.org/abs/2209.06623v2

Decoupling the Class Label and the Target Concept in Machine Unlearning

Machine unlearning, an emerging research topic for data regulations, aims to adjust a trained model to approximate a retrained one that excludes a portion of training data. Previous studies showed that class-wise unlearning is successful in forgetting the knowledge of a target class, through gradient ascent on the forgetting data or fine-tuning with the remaining data. However, while these methods are useful, they are insufficient as the class label and the target concept are often considered to coincide. In this work, we decouple them by considering the label domain mismatch and investigate three problems beyond the conventional all matched forgetting, namely target mismatch, model mismatch, and data mismatch forgetting. We systematically analyze the new challenges in restrictively forgetting the target concept and also reveal crucial forgetting dynamics at the representation level to realize these tasks. Based on that, we propose a general framework, namely, TARget-aware Forgetting (TARF). It enables the additional tasks to actively forget the target concept while maintaining the rest, by simultaneously conducting annealed gradient ascent on the forgetting data and selected gradient descent on the hard-to-affect remaining data. Empirically, various experiments under the newly introduced settings are conducted to demonstrate the effectiveness of our TARF.

Updated: 2024-06-16 13:07:49

Categories: cs.LG

Download: http://arxiv.org/abs/2406.08288v2

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role interaction awareness and progressively refining movie contexts and trope queries during reasoning processes, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate the necessity of Abstract Perception and Long-range Compositional Reasoning for task resolution. This is done by analyzing the code generated through Visual Programming using an Abstract Syntax Tree (AST), thereby confirming the increased complexity of TiM. The dataset and code are available at: https://ander1119.github.io/TiM
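
Analyzing generated programs through their Abstract Syntax Tree can be sketched with Python's standard `ast` module. The two metrics below (call count and nesting depth) and the sample program with its `find` and `query` helpers are hypothetical; the abstract does not state which AST statistics the authors' protocol uses.

```python
import ast


def program_complexity(source: str) -> dict:
    """Count function calls and measure nesting depth of a generated program."""
    tree = ast.parse(source)

    def depth(node: ast.AST) -> int:
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))
    return {"calls": calls, "depth": depth(tree)}


# Hypothetical Visual Programming output; `find` and `query` are made-up helpers.
generated = """
faces = find(video, "person")
for face in faces:
    if query(face, "is this person betrayed?"):
        result = True
"""
stats = program_complexity(generated)
```

Since `ast.parse` only parses and never executes the source, the generated program can be inspected safely even when its helper functions are undefined.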

Updated: 2024-06-16 12:58:31

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.10923v1

Generating Tables from the Parametric Knowledge of Language Models

We explore generating factual and accurate tables from the parametric knowledge of large language models (LLMs). While LLMs have demonstrated impressive capabilities in recreating knowledge bases and generating free-form text, we focus on generating structured tabular data, which is crucial in domains like finance and healthcare. We examine the table generation abilities of four state-of-the-art LLMs: GPT-3.5, GPT-4, Llama2-13B, and Llama2-70B, using three prompting methods for table generation: (a) full-table, (b) row-by-row, and (c) cell-by-cell. For evaluation, we introduce a novel benchmark, WikiTabGen, which contains 100 curated Wikipedia tables. Tables are further processed to ensure their factual correctness and manually annotated with short natural language descriptions. Our findings reveal that table generation remains a challenge, with GPT-4 reaching the highest accuracy at 19.6%. Our detailed analysis sheds light on how various table properties, such as size, table popularity, and numerical content, influence generation performance. This work highlights the unique challenges in LLM-based table generation and provides a solid evaluation framework for future research. Our code, prompts and data are all publicly available: https://github.com/analysis-bots/WikiTabGen

Updated: 2024-06-16 12:55:55

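Title: Generating Tables from the Parametric Knowledge of Language Models

Abstract: We explore generating factual and accurate tables from the parametric knowledge of large language models (LLMs). While LLMs have shown impressive capabilities in recreating knowledge bases and generating free-form text, we focus on generating structured tabular data, which is crucial in domains such as finance and healthcare. We study the table generation abilities of four state-of-the-art LLMs (GPT-3.5, GPT-4, Llama2-13B, and Llama2-70B) using three prompting methods: (a) full-table, (b) row-by-row, and (c) cell-by-cell. For evaluation, we introduce a new benchmark, WikiTabGen, which contains 100 curated Wikipedia tables. The tables are further processed to ensure their factual correctness and manually annotated with short natural language descriptions. Our results show that table generation remains a challenge, with GPT-4 reaching the highest accuracy at 19.6%. Our detailed analysis reveals how table properties such as size, table popularity, and numerical content affect generation performance. This work highlights the unique challenges of LLM-based table generation and provides a solid evaluation framework for future research. Our code, prompts, and data are all publicly available: https://github.com/analysis-bots/WikiTabGen

Updated: 2024-06-16 12:55:55

Categories: cs.CL,cs.AI,cs.DB

Download: http://arxiv.org/abs/2406.10922v1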

Hamilton-Jacobi Based Policy-Iteration via Deep Operator Learning

The framework of deep operator network (DeepONet) has been widely exploited thanks to its capability of solving high dimensional partial differential equations. In this paper, we incorporate DeepONet with a recently developed policy iteration scheme to numerically solve optimal control problems and the corresponding Hamilton-Jacobi-Bellman (HJB) equations. A notable feature of our approach is that once the neural network is trained, the solution to the optimal control problem and HJB equations with different terminal functions can be inferred quickly thanks to the unique feature of operator learning. Furthermore, a quantitative analysis of the accuracy of the algorithm is carried out via comparison principles of viscosity solutions. The effectiveness of the method is verified with various examples, including 10-dimensional linear quadratic regulator problems (LQRs).

Updated: 2024-06-16 12:53:17

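Title: Hamilton-Jacobi Based Policy Iteration via Deep Operator Learning

Abstract: The deep operator network (DeepONet) framework has been widely used because of its ability to solve high-dimensional partial differential equations. In this paper, we combine DeepONet with a recently developed policy iteration scheme to numerically solve optimal control problems and the corresponding Hamilton-Jacobi-Bellman (HJB) equations. A notable feature of our approach is that, once the neural network is trained, solutions to the optimal control problem and the HJB equations with different terminal functions can be inferred quickly, thanks to the distinctive nature of operator learning. In addition, the accuracy of the algorithm is analyzed quantitatively via comparison principles for viscosity solutions. The effectiveness of the method is verified on various examples, including 10-dimensional linear quadratic regulator problems (LQRs).

Updated: 2024-06-16 12:53:17

Categories: math.OC,cs.AI,cs.LG,cs.NA,math.NA,68T20, 68U07, 35F21, 49L12, 49L25

Download: http://arxiv.org/abs/2406.10920v1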

Efficient and Generalized end-to-end Autonomous Driving System with Latent Deep Reinforcement Learning and Demonstrations

An intelligent driving system should dynamically formulate appropriate driving strategies based on the current environment and vehicle status while ensuring system security and reliability. However, methods based on reinforcement learning and imitation learning often suffer from high sample complexity, poor generalization, and low safety. To address these challenges, this paper introduces an Efficient and Generalized end-to-end Autonomous Driving System (EGADS) for complex and varied scenarios. The RL agent in our EGADS combines variational inference with normalizing flows, which are independent of distribution assumptions. This combination allows the agent to capture historical information relevant to driving in latent space effectively, thereby significantly reducing sample complexity. Additionally, we enhance safety by formulating robust safety constraints and improve generalization and performance by integrating RL with expert demonstrations. Experimental results demonstrate that, compared to existing methods, EGADS significantly reduces sample complexity, greatly improves safety performance, and exhibits strong generalization capabilities in complex urban scenarios. Particularly, we contributed an expert dataset collected through human expert steering wheel control, specifically using the G29 steering wheel.

Updated: 2024-06-16 12:48:53

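Title: Efficient and Generalized End-to-End Autonomous Driving System with Latent Deep Reinforcement Learning and Demonstrations

Abstract: An intelligent driving system should dynamically formulate appropriate driving strategies based on the current environment and vehicle status while ensuring system security and reliability. However, methods based on reinforcement learning and imitation learning often suffer from high sample complexity, poor generalization, and low safety. To address these challenges, this paper introduces an Efficient and Generalized end-to-end Autonomous Driving System (EGADS) for complex and varied scenarios. The RL agent in EGADS combines variational inference with normalizing flows, which do not depend on distributional assumptions. This combination lets the agent effectively capture driving-relevant historical information in latent space, significantly reducing sample complexity. In addition, we enhance safety by formulating robust safety constraints, and we improve generalization and performance by integrating RL with expert demonstrations. Experimental results show that, compared with existing methods, EGADS significantly reduces sample complexity, greatly improves safety performance, and generalizes well in complex urban scenarios. In particular, we contribute an expert dataset collected through human expert steering wheel control, specifically using the G29 steering wheel.

Updated: 2024-06-16 12:48:53

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2401.11792v6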

Embodied Question Answering via Multi-LLM Systems

Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple large language model (LLM) based agents independently answering queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates responses for a robust answer. Using CAM, we observe a 50% higher EQA accuracy when compared against aggregation methods for ensemble LLMs, such as voting schemes and debates. CAM does not require any form of agent communication, which spares it the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAM's reliance on each independent agent and query context.

Updated: 2024-06-16 12:46:40

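Title: Embodied Question Answering via Multi-LLM Systems

Abstract: Embodied Question Answering (EQA) is an important problem in which an agent explores the environment to answer user queries. In the existing literature, EQA has been studied only in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework in which multiple large language model (LLM) based agents independently answer queries about a household environment. To produce one answer per query, we use the individual responses to train a Central Answer Model (CAM) that aggregates them into a robust answer. With CAM, we observe 50% higher EQA accuracy compared with aggregation methods for ensemble LLMs, such as voting schemes and debates. CAM requires no form of agent communication, which avoids the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAM's reliance on each independent agent and on the query context.

Updated: 2024-06-16 12:46:40

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.10918v1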

Bayesian Intervention Optimization for Causal Discovery

Causal discovery is crucial for understanding complex systems and informing decisions. While observational data can uncover causal relationships under certain assumptions, it often falls short, making active interventions necessary. Current methods, such as Bayesian and graph-theoretical approaches, do not prioritize decision-making and often rely on ideal conditions or information gain, which is not directly related to hypothesis testing. We propose a novel Bayesian optimization-based method inspired by Bayes factors that aims to maximize the probability of obtaining decisive and correct evidence. Our approach uses observational data to estimate causal models under different hypotheses, evaluates potential interventions pre-experimentally, and iteratively updates priors to refine interventions. We demonstrate the effectiveness of our method through various experiments. Our contributions provide a robust framework for efficient causal discovery through active interventions, enhancing the practical application of theoretical advancements.

Updated: 2024-06-16 12:45:44

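Title: Bayesian Intervention Optimization for Causal Discovery

Abstract: Causal discovery is crucial for understanding complex systems and informing decisions. While observational data can reveal causal relationships under certain assumptions, it often falls short, making active interventions necessary. Current methods, such as Bayesian and graph-theoretical approaches, do not prioritize decision-making and often rely on ideal conditions or on information gain, which is not directly related to hypothesis testing. We propose a novel Bayesian optimization-based method, inspired by Bayes factors, that aims to maximize the probability of obtaining decisive and correct evidence. Our approach uses observational data to estimate causal models under different hypotheses, evaluates potential interventions before any experiment is run, and iteratively updates priors to refine the interventions. We demonstrate the effectiveness of our method through various experiments. Our contributions provide a robust framework for efficient causal discovery through active interventions, enhancing the practical application of theoretical advances.

Updated: 2024-06-16 12:45:44

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2406.10917v1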

To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO

The temperature parameter plays a profound role during training and/or inference with large foundation models (LFMs) such as large language models (LLMs) and CLIP models. Particularly, it adjusts the logits in the softmax function in LLMs, which is crucial for next token generation, and it scales the similarities in the contrastive loss for training CLIP models. A significant question remains: "Is it viable to learn a neural network to predict a personalized temperature of any input data for enhancing LFMs?" In this paper, we present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs. Our solution is composed of a novel learning framework with a robust loss underpinned by constrained distributionally robust optimization (DRO), and a properly designed TempNet with theoretical inspiration. TempNet can be trained together with a large foundation model from scratch or learned separately given a pretrained foundation model. It is not only useful for predicting personalized temperature to promote the training of LFMs but also generalizable and transferable to new tasks. Our experiments on LLMs and CLIP models demonstrate that TempNet greatly improves the performance of existing solutions or models, e.g. Table 1. The code to reproduce the experimental results in this paper can be found at https://github.com/zhqiu/TempNet.

Updated: 2024-06-16 12:43:39

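Title: To Cool or Not to Cool? Temperature Network Meets Large Foundation Models via DRO

Abstract: The temperature parameter plays an important role during training and/or inference with large foundation models (LFMs) such as large language models (LLMs) and CLIP models. In particular, it adjusts the logits in the softmax function of LLMs, which is crucial for next-token generation, and it scales the similarities in the contrastive loss used to train CLIP models. An important question remains: is it viable to learn a neural network that predicts a personalized temperature for any input data in order to enhance LFMs? In this paper, we present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs. Our solution consists of a novel learning framework with a robust loss underpinned by constrained distributionally robust optimization (DRO), together with a properly designed, theoretically inspired TempNet. TempNet can be trained from scratch together with a large foundation model, or learned separately given a pretrained foundation model. It is useful for predicting personalized temperatures that promote the training of LFMs, and it is also generalizable and transferable to new tasks. Our experiments on LLMs and CLIP models show that TempNet greatly improves the performance of existing solutions or models, e.g. Table 1. The code to reproduce the experimental results in this paper can be found at https://github.com/zhqiu/TempNet.

Updated: 2024-06-16 12:43:39

Categories: cs.LG,cs.AI,math.OC

Download: http://arxiv.org/abs/2404.04575v3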
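
To make the role of the temperature in the softmax concrete, here is a minimal sketch. It assumes a fixed temperature passed in by hand; TempNet instead predicts the temperature per input, which is not modeled here. The function name and the example logits are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter
print(sharp, flat)
```

A lower temperature concentrates probability on the largest logit, while a higher one spreads it out, which is why a per-input temperature can matter for next-token generation.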

Enhanced Classification of Heart Sounds Using Mel Frequency Cepstral Coefficients: A Comparative Study of Single and Ensemble Classifier Strategies

This paper explores the efficacy of Mel Frequency Cepstral Coefficients (MFCCs) in detecting abnormal heart sounds using two classification strategies: a single classifier and an ensemble classifier approach. Heart sounds were first pre-processed to remove noise and then segmented into S1, systole, S2, and diastole intervals, with thirteen MFCCs estimated from each segment, yielding 52 MFCCs per beat. Finally, MFCCs were used for heart sound classification. For that purpose, in the single classifier strategy, the MFCCs from nine consecutive beats were averaged to classify heart sounds by a single classifier (either a support vector machine (SVM), the k nearest neighbors (kNN), or a decision tree (DT)). Conversely, the ensemble classifier strategy employed nine classifiers (either nine SVMs, nine kNN classifiers, or nine DTs) to individually assess beats as normal or abnormal, with the overall classification based on the majority vote. Both methods were tested on a publicly available phonocardiogram database. The heart sound classification accuracy was 91.95% for the SVM, 91.9% for the kNN, and 87.33% for the DT in the single classifier strategy. Also, the accuracy was 93.59% for the SVM, 91.84% for the kNN, and 92.22% for the DT in the ensemble classifier strategy. Overall, the results demonstrated that the ensemble classifier strategy improved the accuracies of the DT and the SVM by 4.89% and 1.64%, establishing MFCCs as more effective than other features, including time, time-frequency, and statistical features, evaluated in similar studies.

Updated: 2024-06-16 12:43:23

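Title: Enhanced Classification of Heart Sounds Using Mel Frequency Cepstral Coefficients: A Comparative Study of Single and Ensemble Classifier Strategies

Abstract: This paper explores the efficacy of Mel Frequency Cepstral Coefficients (MFCCs) in detecting abnormal heart sounds using two classification strategies: a single classifier and an ensemble classifier approach. Heart sounds were first pre-processed to remove noise and then segmented into S1, systole, S2, and diastole intervals, with thirteen MFCCs estimated from each segment, yielding 52 MFCCs per beat. Finally, the MFCCs were used for heart sound classification. In the single classifier strategy, the MFCCs from nine consecutive beats were averaged and the heart sound was classified by a single classifier (a support vector machine (SVM), k nearest neighbors (kNN), or a decision tree (DT)). In contrast, the ensemble classifier strategy used nine classifiers (nine SVMs, nine kNN classifiers, or nine DTs) to assess each beat individually, with the overall classification decided by majority vote. Both methods were tested on a publicly available phonocardiogram database. In the single classifier strategy, the heart sound classification accuracy was 91.95% for the SVM, 91.9% for the kNN, and 87.33% for the DT. In the ensemble classifier strategy, the accuracy was 93.59% for the SVM, 91.84% for the kNN, and 92.22% for the DT. Overall, the results show that the ensemble classifier strategy improved the accuracies of the DT and the SVM by 4.89% and 1.64%, establishing MFCCs as more effective than other features, including time, time-frequency, and statistical features, evaluated in similar studies.

Updated: 2024-06-16 12:43:23

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2406.00702v3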
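
The two strategies described above differ only in where the nine beats are combined: before classification (averaging features) or after it (voting on labels). The sketch below illustrates both combination steps under that reading. The classifiers themselves are left out, and the function names are hypothetical.

```python
from collections import Counter

SEGMENTS = ("S1", "systole", "S2", "diastole")
MFCCS_PER_SEGMENT = 13
# Four segments with thirteen coefficients each give 52 features per beat.
FEATURES_PER_BEAT = len(SEGMENTS) * MFCCS_PER_SEGMENT


def average_beats(beat_features):
    """Single-classifier strategy: average feature vectors across beats."""
    n = len(beat_features)
    return [sum(column) / n for column in zip(*beat_features)]


def majority_vote(beat_labels):
    """Ensemble strategy: combine per-beat labels by majority vote."""
    if len(beat_labels) % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    return Counter(beat_labels).most_common(1)[0][0]


# Nine per-beat decisions, as produced by nine hypothetical classifiers.
votes = ["abnormal"] * 5 + ["normal"] * 4
print(FEATURES_PER_BEAT, majority_vote(votes))
```

With two classes and an odd number of votes, the majority is always well defined, which is one practical reason to use nine beats rather than an even number.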

GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to mitigate this gap. Specifically, GMP-TL initially uses the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage frame-level GMPs and utterance-level emotion labels, a two-stage model fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that our GMP-TL attains a WAR of 80.0% and an UAR of 82.0%, achieving superior performance compared to state-of-the-art unimodal SER methods while also yielding comparable results to multimodal SER approaches.

Updated: 2024-06-16 12:35:23

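Title: GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

Abstract: The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, which do not adequately capture the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that uses transfer learning based on gender-augmented multi-scale pseudo-labels (GMPs) to close this gap. Specifically, GMP-TL first uses the pre-trained HuBERT, applying multi-task learning and multi-scale k-means clustering to obtain frame-level GMPs. Then, to make full use of both frame-level GMPs and utterance-level emotion labels, a two-stage fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that GMP-TL attains a WAR of 80.0% and a UAR of 82.0%, outperforming state-of-the-art unimodal SER methods while yielding results comparable to multimodal SER approaches.

Updated: 2024-06-16 12:35:23

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2405.02151v2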

First-Order Manifold Data Augmentation for Regression Learning

Data augmentation (DA) methods tailored to specific domains generate synthetic samples by applying transformations that are appropriate for the characteristics of the underlying data domain, such as rotations on images and time warping on time series data. In contrast, domain-independent approaches, e.g. mixup, are applicable to various data modalities, and as such they are general and versatile. While regularizing classification tasks via DA is a well-explored research topic, the effect of DA on regression problems has received less attention. To bridge this gap, we study the problem of domain-independent augmentation for regression, and we introduce FOMA: a new data-driven domain-independent data augmentation method. Essentially, our approach samples new examples from the tangent planes of the train distribution. Augmenting data in this way aligns with the network tendency towards capturing the dominant features of its input signals. We evaluate FOMA on in-distribution generalization and out-of-distribution robustness benchmarks, and we show that it improves the generalization of several neural architectures. We also find that strong baselines based on mixup are less effective in comparison to our approach. Our code is publicly available at https://github.com/azencot-group/FOMA.

Updated: 2024-06-16 12:35:05

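Title: First-Order Manifold Data Augmentation for Regression Learning

Abstract: Data augmentation (DA) methods tailored to specific domains generate synthetic samples by applying transformations suited to the characteristics of the underlying data domain, such as rotations on images and time warping on time series data. In contrast, domain-independent approaches, e.g. mixup, apply to various data modalities and are therefore general and versatile. While regularizing classification tasks via DA is a well-explored research topic, the effect of DA on regression problems has received less attention. To bridge this gap, we study domain-independent augmentation for regression and introduce FOMA, a new data-driven, domain-independent data augmentation method. Essentially, our approach samples new examples from the tangent planes of the training distribution. Augmenting data in this way is consistent with the tendency of networks to capture the dominant features of their input signals. We evaluate FOMA on in-distribution generalization and out-of-distribution robustness benchmarks and show that it improves the generalization of several neural architectures. We also find that strong mixup-based baselines are less effective than our approach. Our code is publicly available at https://github.com/azencot-group/FOMA.

Updated: 2024-06-16 12:35:05

Categories: cs.LG

Download: http://arxiv.org/abs/2406.10914v1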
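
For context on the domain-independent baseline named above, this is a minimal sketch of mixup applied to a regression pair: inputs and targets are blended with the same coefficient. It illustrates mixup only, not FOMA. The Beta parameter `alpha=0.2` and the function names are assumptions chosen for the example.

```python
import random


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing coefficient in [0, 1] from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Blend two (input, target) examples with coefficient lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y


# A fixed coefficient keeps the example deterministic.
x_new, y_new = mixup_pair([0.0, 2.0], 1.0, [4.0, 6.0], 3.0, lam=0.25)
print(x_new, y_new)
```

Because the blend is linear in both inputs and targets, mixup implicitly assumes the regression function is roughly linear between the two examples, which is one commonly cited limitation for regression.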

A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability

Graph pooling has gained attention for its ability to obtain effective node and graph representations for various downstream tasks. Despite the recent surge in graph pooling approaches, there is a lack of standardized experimental settings and fair benchmarks to evaluate their performance. To address this issue, we have constructed a comprehensive benchmark that includes 15 graph pooling methods and 21 different graph datasets. This benchmark systematically assesses the performance of graph pooling methods in three dimensions, i.e., effectiveness, robustness, and generalizability. We first evaluate the performance of these graph pooling approaches across different tasks including graph classification, graph regression and node classification. Then, we investigate their performance under potential noise attacks and out-of-distribution shifts in real-world scenarios. We also involve detailed efficiency analysis and parameter analysis. Extensive experiments validate the strong capability and applicability of graph pooling approaches in various scenarios, which can provide valuable insights and guidance for deep geometric learning research. The source code of our benchmark is available at https://github.com/goose315/Graph_Pooling_Benchmark.

Updated: 2024-06-16 12:32:17

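Title: A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability

Abstract: Graph pooling has gained attention for its ability to obtain effective node and graph representations for various downstream tasks. Despite the recent surge in graph pooling approaches, standardized experimental settings and fair benchmarks for evaluating them are lacking. To address this issue, we have built a comprehensive benchmark that includes 15 graph pooling methods and 21 different graph datasets. The benchmark systematically assesses graph pooling methods along three dimensions: effectiveness, robustness, and generalizability. We first evaluate these approaches on different tasks, including graph classification, graph regression, and node classification. We then study their performance under potential noise attacks and out-of-distribution shifts in real-world scenarios. We also include detailed efficiency and parameter analyses. Extensive experiments validate the strong capability and applicability of graph pooling approaches in various scenarios, providing valuable insights and guidance for deep geometric learning research. The source code of our benchmark is available at https://github.com/goose315/Graph_Pooling_Benchmark.

Updated: 2024-06-16 12:32:17

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2406.09031v2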

ULDP-FL: Federated Learning with Across Silo User-Level Differential Privacy

Differentially Private Federated Learning (DP-FL) has garnered attention as a collaborative machine learning approach that ensures formal privacy. Most DP-FL approaches ensure DP at the record-level within each silo for cross-silo FL. However, a single user's data may extend across multiple silos, and the desired user-level DP guarantee for such a setting remains unknown. In this study, we present Uldp-FL, a novel FL framework designed to guarantee user-level DP in cross-silo FL where a single user's data may belong to multiple silos. Our proposed algorithm directly ensures user-level DP through per-user weighted clipping, departing from group-privacy approaches. We provide a theoretical analysis of the algorithm's privacy and utility. Additionally, we enhance the utility of the proposed algorithm with an enhanced weighting strategy based on user record distribution and design a novel private protocol that ensures no additional information is revealed to the silos and the server. Experiments on real-world datasets show substantial improvements in our methods in privacy-utility trade-offs under user-level DP compared to baseline methods. To the best of our knowledge, our work is the first FL framework that effectively provides user-level DP in the general cross-silo FL setting.

Updated: 2024-06-16 12:23:56

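Title: ULDP-FL: Federated Learning with Across-Silo User-Level Differential Privacy

Abstract: Differentially Private Federated Learning (DP-FL) has attracted attention as a collaborative machine learning approach that ensures formal privacy. Most DP-FL approaches ensure DP at the record level within each silo for cross-silo FL. However, a single user's data may extend across multiple silos, and the desired user-level DP guarantee for such a setting remains unknown. In this study, we present Uldp-FL, a novel FL framework designed to guarantee user-level DP in cross-silo FL where a single user's data may belong to multiple silos. Our proposed algorithm ensures user-level DP directly through per-user weighted clipping, rather than through group-privacy approaches. We provide a theoretical analysis of the algorithm's privacy and utility. In addition, we improve the utility of the proposed algorithm with an enhanced weighting strategy based on the user record distribution, and we design a novel private protocol that ensures no additional information is revealed to the silos or the server. Experiments on real-world datasets show that our methods substantially improve the privacy-utility trade-off under user-level DP compared with baseline methods. To the best of our knowledge, our work is the first FL framework that effectively provides user-level DP in the general cross-silo FL setting.

Updated: 2024-06-16 12:23:56

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2308.12210v3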
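
The idea of per-user weighted clipping can be sketched as follows, under assumptions that go beyond the abstract: each silo clips every user's update to a norm bound and scales it by a per-user weight, and a user's weights across silos sum to at most one, so the user's total contribution stays within the bound. The noise that a DP mechanism would add is omitted, and all names here are hypothetical rather than the paper's algorithm.

```python
import math


def clip(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def silo_contribution(user_updates, weights, max_norm):
    """Sum of weight * clipped update over the users present in one silo."""
    if not user_updates:
        return []
    dim = len(next(iter(user_updates.values())))
    total = [0.0] * dim
    for user, update in user_updates.items():
        clipped = clip(update, max_norm)
        for k in range(dim):
            total[k] += weights[user] * clipped[k]
    return total


C = 1.0
# User "u1" has data in two silos; the weights 0.5 + 0.5 sum to one.
part_a = silo_contribution({"u1": [3.0, 4.0]}, {"u1": 0.5}, C)
part_b = silo_contribution({"u1": [0.0, 10.0]}, {"u1": 0.5}, C)
u1_total = [a + b for a, b in zip(part_a, part_b)]
u1_norm = math.sqrt(sum(v * v for v in u1_total))
print(u1_total, u1_norm)
```

By the triangle inequality, the norm of the user's combined contribution is at most the sum of the weights times the bound, which is what lets the sensitivity be reasoned about per user rather than per record.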

Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

In this research, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology utilises the Wav2Vec 2.0 architecture, which has been pre-trained on an extensive podcast dataset, to capture a wide array of audio features that include both linguistic and paralinguistic components. We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector, thereby embedding a broader contextual understanding into our analysis. A key aspect of our approach is the multi-task fusion strategy that not only leverages these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model. This integration is designed to refine emotion intensity prediction by concurrently processing multiple emotional dimensions, thereby embedding a richer contextual understanding into our framework. For the temporal analysis of audio data, our feature fusion process utilises a Long Short-Term Memory (LSTM) network. This approach, which relies solely on the provided audio data, shows marked advancements over the existing baseline, offering a more comprehensive understanding of emotional mimicry in naturalistic settings, achieving the second place in the EMI challenge.

Updated: 2024-06-16 12:21:39

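Title: Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

Abstract: In this research, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology uses the Wav2Vec 2.0 architecture, pre-trained on an extensive podcast dataset, to capture a wide range of audio features that include both linguistic and paralinguistic components. We refine our feature extraction process with a fusion technique that combines individual features with a global mean vector, embedding a broader contextual understanding into our analysis. A key aspect of our approach is a multi-task fusion strategy that not only uses these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model. This integration is designed to refine emotion intensity prediction by processing multiple emotional dimensions at once, embedding a richer contextual understanding into our framework. For the temporal analysis of audio data, our feature fusion process uses a Long Short-Term Memory (LSTM) network. This approach, which relies solely on the provided audio data, shows marked improvements over the existing baseline and offers a more comprehensive understanding of emotional mimicry in naturalistic settings, achieving second place in the EMI challenge.

Updated: 2024-06-16 12:21:39

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2403.11879v4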

Breaking the Attention Bottleneck

Attention-based transformers have become the standard architecture in many deep learning fields, primarily due to their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism with its quadratic complexity is a significant bottleneck in the transformer architecture. This algorithm is only uni-directional in the decoder and converges to a static pattern in over-parametrized decoder-only models. I address this issue by developing a generative function as attention or activation replacement. It still has the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT this yields a smaller loss while having a smaller model. The loss further drops by incorporating an average context vector. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.

Updated: 2024-06-16 12:06:58

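Title: Breaking the Attention Bottleneck

Abstract: Attention-based transformers have become the standard architecture in many deep learning fields, mainly because of their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. This algorithm is only uni-directional in the decoder and converges to a static pattern in over-parametrized decoder-only models. I address this issue by developing a generative function as a replacement for attention or activation. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT, this yields a smaller loss with a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.

Updated: 2024-06-16 12:06:58

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2406.10906v1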

Mini Honor of Kings: A Lightweight Environment for Multi-Agent Reinforcement Learning

Games are widely used as research environments for multi-agent reinforcement learning (MARL), but they pose three significant challenges: limited customization, high computational demands, and oversimplification. To address these issues, we introduce the first publicly available map editor for the popular mobile game Honor of Kings and design a lightweight environment, Mini Honor of Kings (Mini HoK), for researchers to conduct experiments. Mini HoK is highly efficient, allowing experiments to be run on personal PCs or laptops while still presenting sufficient challenges for existing MARL algorithms. We have tested our environment on common MARL algorithms and demonstrated that these algorithms have yet to find optimal solutions within this environment. This facilitates the dissemination and advancement of MARL methods within the research community. Additionally, we hope that more researchers will leverage the Honor of Kings map editor to develop innovative and scientifically valuable new maps. Our code and user manual are available at: https://github.com/tencent-ailab/mini-hok.

Updated: 2024-06-16 12:01:11

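Title: Mini Honor of Kings: A Lightweight Environment for Multi-Agent Reinforcement Learning

Abstract: Games are widely used as research environments for multi-agent reinforcement learning (MARL), but they pose three significant challenges: limited customization, high computational demands, and oversimplification. To address these issues, we introduce the first publicly available map editor for the popular mobile game Honor of Kings and design a lightweight environment, Mini Honor of Kings (Mini HoK), for researchers to run experiments. Mini HoK is highly efficient: experiments can be run on personal PCs or laptops while still posing sufficient challenges for existing MARL algorithms. We have tested our environment with common MARL algorithms and shown that these algorithms have yet to find optimal solutions in it. This facilitates the dissemination and advancement of MARL methods within the research community. In addition, we hope that more researchers will use the Honor of Kings map editor to develop innovative and scientifically valuable new maps. Our code and user manual are available at: https://github.com/tencent-ailab/mini-hok.

Updated: 2024-06-16 12:01:11

Categories: cs.MA,cs.LG

Download: http://arxiv.org/abs/2406.03978v2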

New Solutions on LLM Acceleration, Optimization, and Application

Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present significant challenges in both training and deployment, leading to substantial computational and storage costs as well as heightened energy consumption. In this paper, we provide a review of recent advancements and research directions aimed at addressing these challenges and enhancing the efficiency of LLM-based systems. We begin by discussing algorithm-level acceleration techniques focused on optimizing LLM inference speed and resource utilization. We also explore LLM-hardware co-design strategies with a vision to improve system efficiency by tailoring hardware architectures to LLM requirements. Further, we delve into LLM-to-accelerator compilation approaches, which involve customizing hardware accelerators for efficient LLM deployment. Finally, as a case study to leverage LLMs for assisting circuit design, we examine LLM-aided design methodologies for an important task: High-Level Synthesis (HLS) functional verification, by creating a new dataset that contains a large number of buggy and bug-free codes, which can be essential for training LLMs to specialize on HLS verification and debugging. For each aspect mentioned above, we begin with a detailed background study, followed by the presentation of several novel solutions proposed to overcome specific challenges. We then outline future research directions to drive further advancements. Through these efforts, we aim to pave the way for more efficient and scalable deployment of LLMs across a diverse range of applications.

Updated: 2024-06-16 11:56:50

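Title: New Solutions on LLM Acceleration, Optimization, and Application

Abstract: Large Language Models (LLMs) have become extremely powerful tools with exceptional capacities for understanding and producing human-like text in a wide range of applications. However, the growing size and complexity of LLMs pose significant challenges in both training and deployment, leading to substantial computational and storage costs as well as increased energy consumption. In this paper, we review recent advances and research directions aimed at addressing these challenges and improving the efficiency of LLM-based systems. We begin by discussing algorithm-level acceleration techniques focused on optimizing LLM inference speed and resource utilization. We also explore LLM-hardware co-design strategies, which aim to improve system efficiency by tailoring hardware architectures to LLM requirements. Further, we examine LLM-to-accelerator compilation approaches, which involve customizing hardware accelerators for efficient LLM deployment. Finally, as a case study in using LLMs to assist circuit design, we examine LLM-aided design methodologies for an important task, High-Level Synthesis (HLS) functional verification, by creating a new dataset that contains a large number of buggy and bug-free codes, which can be essential for training LLMs to specialize in HLS verification and debugging. For each aspect above, we begin with a detailed background study and then present several novel solutions proposed to overcome specific challenges. We then outline future research directions to drive further progress. Through these efforts, we aim to pave the way for more efficient and scalable deployment of LLMs across a diverse range of applications.

Updated: 2024-06-16 11:56:50

Categories: cs.LG,cs.CL,cs.SE

Download: http://arxiv.org/abs/2406.10903v1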

PPFlow: Target-aware Peptide Design with Torsional Flow Matching

Therapeutic peptides have proven to have great pharmaceutical value and potential in recent decades. However, methods of AI-assisted peptide drug discovery are not fully explored. To fill the gap, we propose a target-aware peptide design method called PPFlow, based on conditional flow matching on torus manifolds, to model the internal geometries of torsion angles for the peptide structure design. Besides, we establish a protein-peptide binding dataset named PPBench2024 to fill the void of massive data for the task of structure-based peptide drug design and to allow the training of deep learning methods. Extensive experiments show that PPFlow reaches state-of-the-art performance in tasks of peptide drug generation and optimization in comparison with baseline models, and can be generalized to other tasks including docking and side-chain packing.

Updated: 2024-06-16 11:33:54

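Title: PPFlow: Target-aware Peptide Design with Torsional Flow Matching

Abstract: Therapeutic peptides have proven to have great pharmaceutical value and potential in recent decades. However, methods for AI-assisted peptide drug discovery have not been fully explored. To fill this gap, we propose a target-aware peptide design method called PPFlow, based on conditional flow matching on torus manifolds, to model the internal geometries of torsion angles for peptide structure design. In addition, we establish a protein-peptide binding dataset named PPBench2024 to address the lack of large-scale data for structure-based peptide drug design and to enable the training of deep learning methods. Extensive experiments show that PPFlow reaches state-of-the-art performance in peptide drug generation and optimization compared with baseline models, and that it generalizes to other tasks, including docking and side-chain packing.

Updated: 2024-06-16 11:33:54

Categories: q-bio.BM,cs.AI,cs.LG

Download: http://arxiv.org/abs/2405.06642v3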

ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology

The critical field of psychology necessitates a comprehensive benchmark to enhance the evaluation and development of domain-specific Large Language Models (LLMs). Existing MMLU-type benchmarks, such as C-EVAL and CMMLU, include psychology-related subjects, but their limited number of questions and lack of systematic concept sampling strategies mean they cannot cover the concepts required in psychology. Consequently, despite their broad subject coverage, these benchmarks lack the necessary depth in the psychology domain, making them inadequate as psychology-specific evaluation suite. To address this issue, this paper presents ConceptPsy, designed to evaluate Chinese complex reasoning and knowledge abilities in psychology. ConceptPsy includes 12 core subjects and 1383 manually collected concepts. Specifically, we prompt GPT-4 to generate questions for each concept using carefully designed diverse prompts and hire professional psychologists to review these questions. To help to understand the fine-grained performances and enhance the weaknesses, we annotate each question with a chapter label and provide chapter-wise accuracy. Based on ConceptPsy, we evaluate a broad range of LLMs. We observe that, although some LLMs achieve similar accuracies on overall performances, they exhibit significant performance variations across different psychology concepts, even when they are models from the same series. We hope our work can facilitate the development of LLMs in the field of psychology.

Updated: 2024-06-16 11:33:03

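Title: ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology

Abstract: The critical field of psychology needs a comprehensive benchmark to improve the evaluation and development of domain-specific Large Language Models (LLMs). Existing MMLU-type benchmarks, such as C-EVAL and CMMLU, include psychology-related subjects, but their limited number of questions and lack of systematic concept sampling strategies mean they cannot cover the concepts required in psychology. Consequently, despite their broad subject coverage, these benchmarks lack the necessary depth in the psychology domain, making them inadequate as a psychology-specific evaluation suite. To address this issue, this paper presents ConceptPsy, designed to evaluate Chinese complex reasoning and knowledge abilities in psychology. ConceptPsy includes 12 core subjects and 1383 manually collected concepts. Specifically, we prompt GPT-4 to generate questions for each concept using carefully designed, diverse prompts, and we hire professional psychologists to review these questions. To help understand fine-grained performance and address weaknesses, we annotate each question with a chapter label and report chapter-wise accuracy. Based on ConceptPsy, we evaluate a broad range of LLMs. We observe that, although some LLMs achieve similar overall accuracy, their performance varies significantly across different psychology concepts, even among models from the same series. We hope our work can facilitate the development of LLMs in the field of psychology.

Updated: 2024-06-16 11:33:03

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2311.09861v4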

PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

In this work, we introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling, a novel approach that leverages preference-based learning to learn a reward model, and subsequently uses this reward model to relabel higher-level replay buffers. Since this reward is unaffected by lower primitive behavior, our relabeling-based approach is able to mitigate non-stationarity, which is common in existing hierarchical approaches, and demonstrates impressive performance across a range of challenging sparse-reward tasks. Since obtaining human feedback is typically impractical, we propose to replace the human-in-the-loop approach with our primitive-in-the-loop approach, which generates feedback using sparse rewards provided by the environment. Moreover, in order to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization that conditions higher-level policies to generate feasible subgoals for lower-level policies. We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves greater than 50% success rates in challenging, sparse-reward robotic environments, where most other baselines fail to achieve any significant progress.

Updated: 2024-06-16 11:12:34

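Title: PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

Abstract: In this work, we introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling, a novel approach that uses preference-based learning to learn a reward model and then uses this reward model to relabel higher-level replay buffers. Since this reward is unaffected by lower-level primitive behavior, our relabeling-based approach can mitigate the non-stationarity that is common in existing hierarchical approaches, and it shows impressive performance across a range of challenging sparse-reward tasks. Since obtaining human feedback is typically impractical, we propose replacing the human-in-the-loop approach with our primitive-in-the-loop approach, which generates feedback using sparse rewards provided by the environment. Moreover, to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization, which conditions higher-level policies to generate subgoals that are feasible for lower-level policies. We perform extensive experiments showing that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves success rates above 50% in challenging sparse-reward robotic environments, where most other baselines fail to make any significant progress.

Updated: 2024-06-16 11:12:34

Categories: cs.LG

Download: http://arxiv.org/abs/2404.13423v2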

Development and Validation of Fully Automatic Deep Learning-Based Algorithms for Immunohistochemistry Reporting of Invasive Breast Ductal Carcinoma

Immunohistochemistry (IHC) analysis is a well-accepted and widely used method for molecular subtyping, a procedure for prognosis and targeted therapy of breast carcinoma, the most common type of tumor affecting women. Four molecular biomarkers, namely progesterone receptor (PR), estrogen receptor (ER), antigen Ki67, and human epidermal growth factor receptor 2 (HER2), must be assessed under the IHC procedure to decide prognosis and to predict response to therapy. However, IHC scoring is based on subjective microscopic examination of tumor morphology and suffers from poor reproducibility, high subjectivity, and often incorrect scoring in low-score cases. In this paper, we present a deep learning-based, semi-supervised, fully automatic decision support system (DSS) for IHC scoring of invasive ductal carcinoma. Our system automatically detects the tumor region, removes artifacts, and scores based on the Allred standard. The system is developed using 3 million pathologist-annotated image patches from 300 slides, fifty thousand in-house cell annotations, and forty thousand marked pixels of HER2 membrane. We conducted multicentric trials at four centers with three different types of digital scanners and measured percentage agreement with doctors, achieving agreements of 95, 92, 88, and 82 percent for the Ki67, HER2, ER, and PR stain categories, respectively. In addition to overall accuracy, we found that in 5 percent of cases pathologists changed their score in favor of the algorithm's score after reviewing the detailed algorithmic analysis. Our approach could improve the accuracy of IHC scoring and subsequent therapy decisions, particularly where specialist expertise is unavailable. Our system is highly modular. The proposed algorithm modules can be used to develop DSS for other cancer types.
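The results above are reported as percentage agreement with doctors. A minimal sketch of such a metric is given below, assuming that agreement means an exact match between the algorithm's score and the pathologist's score for each case; the abstract does not state the exact definition used in the trials, and the function name `percent_agreement` is hypothetical.

```python
from typing import Sequence


def percent_agreement(
    algorithm_scores: Sequence[int],
    pathologist_scores: Sequence[int],
) -> float:
    """Percentage of cases in which both scores are identical.

    Assumption: one score per case from each source, compared by
    exact equality. Other definitions (e.g. within one score
    category) are possible and are not covered here.
    """
    if len(algorithm_scores) != len(pathologist_scores):
        raise ValueError("both score lists must have the same length")
    if not algorithm_scores:
        raise ValueError("at least one case is required")
    matches = sum(
        1 for a, p in zip(algorithm_scores, pathologist_scores) if a == p
    )
    return 100.0 * matches / len(algorithm_scores)
```

For example, `percent_agreement([3, 2, 0, 1], [3, 2, 1, 1])` compares four cases, three of which match.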

Updated: 2024-06-16 10:52:38

标题: 《针对浸润性乳腺导管癌的免疫组化报告的全自动深度学习算法的开发与验证》

摘要: 免疫组化（IHC）分析是一种被广泛接受和广泛使用的方法，用于对影响女性最常见的肿瘤——乳腺癌进行分子亚型分类、预后和靶向治疗。有四种分子生物标志物，即孕激素受体（PR）、雌激素受体（ER）、抗原Ki67和人类表皮生长因子受体2（HER2），在IHC程序下需要进行评估以决定预后以及对治疗反应的预测。然而，IHC评分是基于对肿瘤形态的主观显微镜检查，并且存在着评分的重复性差、主观性高以及在低分数情况下常常出现错误评分的问题。在本文中，我们提出了一个基于深度学习的半监督训练、完全自动的决策支持系统（DSS），用于对浸润性导管癌进行IHC评分。我们的系统自动检测肿瘤区域，去除伪影，并根据Allred标准进行评分。该系统使用了来自300张幻灯片的300万个病理学标注的图像块，五万个内部细胞标注和四万个HER2膜标记像素。我们在四个中心进行了多中心试验，使用了三种不同类型的数字扫描仪，与医生的百分比一致性方面取得了95％，92％，88％和82％的协议，分别针对Ki67，HER2，ER和PR染色类别。除了总体准确性外，我们发现有5％的病例，在详细的算法分析中，病理学家已经改变了他们的评分，以支持算法评分。我们的方法可以提高IHC评分的准确性和随后的治疗决策，特别是在专业知识不可用的情况下。我们的系统是高度模块化的。所提议的算法模块可以用于开发其他癌症类型的DSS。

更新时间: 2024-06-16 10:52:38

领域: eess.IV,cs.AI,cs.CV,q-bio.QM,q-bio.TO

下载: http://arxiv.org/abs/2406.10893v1

DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning

Learning control policies to perform complex robotics tasks from human preference data presents significant challenges. On the one hand, the complexity of such tasks typically requires learning policies to perform a variety of subtasks, then combining them to achieve the overall goal. At the same time, comprehensive, well-engineered reward functions are typically unavailable in such problems, while limited human preference data often is; making efficient use of such data to guide learning is therefore essential. Methods for learning to perform complex robotics tasks from human preference data must overcome both these challenges simultaneously. In this work, we introduce DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning, an efficient hierarchical approach that leverages direct preference optimization to learn a higher-level policy and reinforcement learning to learn a lower-level policy. DIPPER enjoys improved computational efficiency due to its use of direct preference optimization instead of standard preference-based approaches such as reinforcement learning from human feedback, while it also mitigates the well-known hierarchical reinforcement learning issues of non-stationarity and infeasible subgoal generation due to our use of primitive-informed regularization inspired by a novel bi-level optimization formulation of the hierarchical reinforcement learning problem. To validate our approach, we perform extensive experimental analysis on a variety of challenging robotics tasks, demonstrating that DIPPER outperforms hierarchical and non-hierarchical baselines, while ameliorating the non-stationarity and infeasible subgoal generation issues of hierarchical reinforcement learning.
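For readers unfamiliar with direct preference optimization, the generic per-pair DPO loss can be sketched as below. This is the standard DPO objective for a single preference pair, not DIPPER's full hierarchical objective: the primitive-informed regularization and the bi-level formulation mentioned above are not shown, and the function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Generic DPO loss for one (chosen, rejected) pair.

    Inputs are log-probabilities of the two candidates under the policy
    being trained and under a fixed reference policy. The loss is
    -log(sigmoid(margin)), where the margin is the scaled difference of
    the policy-vs-reference log-ratios.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the trained policy raises the chosen candidate's likelihood relative to the reference more than it raises the rejected one, which is why no separate reward model needs to be fitted.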

Updated: 2024-06-16 10:49:41

标题: DIPPER: 直接偏好优化以加速启用原始层次强化学习

摘要: 从人类偏好数据中学习控制策略以执行复杂的机器人任务面临着重大挑战。一方面，这些任务的复杂性通常需要学习执行各种子任务的策略，然后将它们结合起来实现整体目标。与此同时，在这些问题中通常缺乏全面、精心设计的奖励函数，而人类偏好数据通常是有限的；因此，有效利用这些数据来指导学习是至关重要的。从人类偏好数据中学习执行复杂的机器人任务的方法必须同时克服这两个挑战。在这项工作中，我们介绍了DIPPER：直接偏好优化以加速基础启用的分层强化学习，这是一种高效的分层方法，利用直接偏好优化来学习高级策略，利用强化学习来学习低级策略。由于使用直接偏好优化而不是标准的基于偏好的方法（如从人类反馈中学习强化学习），DIPPER具有改进的计算效率，同时也通过我们使用基于原始信息的正则化来缓解众所周知的分层强化学习问题的非稳态性和不可行的子目标生成问题，受到新颖的分层强化学习问题的双层优化公式的启发。为验证我们的方法，我们在各种具有挑战性的机器人任务上进行了广泛的实验分析，结果表明DIPPER优于分层和非分层基线，同时改善了分层强化学习的非稳态性和不可行的子目标生成问题。

更新时间: 2024-06-16 10:49:41

领域: cs.LG

下载: http://arxiv.org/abs/2406.10892v1

Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters

Obtaining accurate labels for instance segmentation is particularly challenging due to the complex nature of the task. Each image necessitates multiple annotations, encompassing not only the object's class but also its precise spatial boundaries. These requirements elevate the likelihood of errors and inconsistencies in both manual and automated annotation processes. By simulating different noise conditions, we provide a realistic scenario for assessing the robustness and generalization capabilities of instance segmentation models in different segmentation tasks, introducing COCO-N and Cityscapes-N. We also propose a benchmark for weak annotation noise, dubbed COCO-WAN, which utilizes foundation models and weak annotations to simulate semi-automated annotation tools and their noisy labels. This study sheds light on the quality of segmentation masks produced by various models and challenges the efficacy of popular methods designed to address learning with label noise.

Updated: 2024-06-16 10:49:23

标题: 在实例分割中基准标签噪声：空间噪声很重要

摘要: 由于任务的复杂性，获得准确的实例分割标签特别具有挑战性。每个图像都需要多个注释，不仅包括对象的类别，还包括其精确的空间边界。这些要求增加了手动和自动注释过程中错误和不一致性的可能性。通过模拟不同的噪声条件，我们为评估不同分割任务中实例分割模型的鲁棒性和泛化能力提供了一个现实情景，引入了COCO-N和Cityscapes-N。我们还提出了一个弱注释噪声的基准，名为COCO-WAN，该基准利用基础模型和弱注释来模拟半自动注释工具及其嘈杂的标签。这项研究揭示了各种模型产生的分割蒙版的质量，并挑战了旨在解决带有标签噪声的学习的流行方法的有效性。

更新时间: 2024-06-16 10:49:23

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2406.10891v1

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models

Large language models (LLMs) inevitably memorize sensitive, copyrighted, and harmful knowledge from the training corpus; therefore, it is crucial to erase this knowledge from the models. Machine unlearning is a promising solution for efficiently removing specific knowledge by modifying models post hoc. In this paper, we propose a Real-World Knowledge Unlearning benchmark (RWKU) for LLM unlearning. RWKU is designed based on the following three key factors: (1) For the task setting, we consider a more practical and challenging unlearning setting, where neither the forget corpus nor the retain corpus is accessible. (2) For the knowledge source, we choose 200 real-world famous people as the unlearning targets and show that such popular knowledge is widely present in various LLMs. (3) For the evaluation framework, we design the forget set and the retain set to evaluate the model's capabilities across various real-world applications. Regarding the forget set, we provide four membership inference attack (MIA) methods and nine kinds of adversarial attack probes to rigorously test unlearning efficacy. Regarding the retain set, we assess locality and utility in terms of neighbor perturbation, general ability, reasoning ability, truthfulness, factuality, and fluency. We conduct extensive experiments across two unlearning scenarios, two models and six baseline methods and obtain some meaningful findings. We release our benchmark and code publicly at http://rwku-bench.github.io for future work.

Updated: 2024-06-16 10:47:21

标题: RWKU：用于大型语言模型的实际知识遗忘基准测试

摘要: 大型语言模型（LLMs）不可避免地会记忆训练语料库中的敏感、受版权保护和有害知识；因此，从模型中删除这些知识至关重要。机器遗忘是一种有效地通过事后修改模型来移除特定知识的解决方案。在本文中，我们提出了一个面向LLM遗忘的真实世界知识遗忘基准（RWKU）。RWKU基于以下三个关键因素设计：（1）对于任务设置，我们考虑了一个更实用和具有挑战性的遗忘设置，在这种设置中，忘记语料库和保留语料库都无法访问。（2）对于知识来源，我们选择了200位真实世界知名人士作为遗忘目标，并展示这些热门知识在各种LLMs中广泛存在。（3）对于评估框架，我们设计了忘记集和保留集来评估模型在各种真实世界应用中的能力。关于忘记集，我们提供了四种成员推断攻击（MIA）方法和九种敌对攻击探针，以严格测试遗忘效果。关于保留集，我们评估了邻域扰动、一般能力、推理能力、真实性、事实性和流畅性方面的局部性和实用性。我们在两个遗忘场景、两个模型和六种基准方法上进行了广泛实验，并得出了一些有意义的发现。我们在http://rwku-bench.github.io上公开发布我们的基准和代码，以供未来工作使用。

更新时间: 2024-06-16 10:47:21

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.10890v1

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

Updated: 2024-06-16 10:42:21

标题: VELOCITI：视频-语言模型是否能通过时间绑定语义概念？

更新时间: 2024-06-16 10:42:21

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.10889v1

Distilling Opinions at Scale: Incremental Opinion Summarization using XL-OPSUMM

Opinion summarization in e-commerce encapsulates the collective views of numerous users about a product based on their reviews. Typically, a product on an e-commerce platform has thousands of reviews, each review comprising around 10-15 words. While Large Language Models (LLMs) have shown proficiency in summarization tasks, they struggle to handle such a large volume of reviews due to context limitations. To mitigate this, we propose a scalable framework called Xl-OpSumm that generates summaries incrementally. However, the existing test set, AMASUM, has only 560 reviews per product on average. Due to the lack of a test set with thousands of reviews, we created a new test set called Xl-Flipkart by gathering data from the Flipkart website and generating summaries using GPT-4. Through various automatic evaluations and extensive analysis, we evaluated the framework's efficiency on two datasets, AMASUM and Xl-Flipkart. Experimental results show that our framework, Xl-OpSumm powered by Llama-3-8B-8k, achieves an average ROUGE-1 F1 gain of 4.38% and a ROUGE-L F1 gain of 3.70% over the next best-performing model.
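Incremental summarization of the kind described above can be pictured as a loop that feeds reviews to a model in context-sized batches while carrying a running summary forward. The sketch below is a generic pattern under assumptions, not the Xl-OpSumm algorithm itself: the helper names `chunked` and `incremental_summary`, the `summarize(previous_summary, batch)` callable, and the default batch size are all hypothetical, and the abstract does not say how the framework updates its summary.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def incremental_summary(
    reviews: List[str],
    summarize: Callable[[str, List[str]], str],
    batch_size: int = 100,
) -> str:
    """Fold batches of reviews into a running summary.

    `summarize` is a placeholder for a model call that takes the summary
    so far plus a new batch of reviews and returns an updated summary.
    """
    summary = ""
    for batch in chunked(reviews, batch_size):
        summary = summarize(summary, batch)
    return summary
```

Because each call sees only the current summary and one batch, the prompt size stays bounded regardless of how many reviews a product has.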

Updated: 2024-06-16 10:36:41

标题: 在规模上提炼意见：使用XL-OPSUMM进行增量意见总结

摘要: 在电子商务中，观点总结概括了许多用户关于产品的集体观点，基于他们的评论。通常，在电子商务平台上，一个产品有成千上万的评论，每个评论大约包含10-15个单词。虽然大型语言模型（LLMs）在总结任务中表现出色，但由于上下文限制，它们难以处理如此大量的评论。为了缓解这一问题，我们提出了一个可扩展的框架，称为Xl-OpSumm，它可以逐步生成总结。然而，现有的测试集AMASUM每个产品平均只有560个评论。由于缺乏包含成千上万评论的测试集，我们从Flipkart网站收集数据，并使用GPT-4生成摘要，创建了一个名为Xl-Flipkart的新测试集。通过各种自动评估和广泛分析，我们评估了该框架在两个数据集AMASUM和Xl-Flipkart上的效率。实验结果表明，我们的框架Xl-OpSumm搭载Llama-3-8B-8k，实现了平均ROUGE-1 F1增益4.38％和ROUGE-L F1增益3.70％，超过了下一个表现最佳的模型。

更新时间: 2024-06-16 10:36:41

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.10886v1

CASE: Efficient Curricular Data Pre-training for Building Assistive Psychology Expert Models

The limited availability of psychologists necessitates efficient identification of individuals requiring urgent mental healthcare. This study explores the use of Natural Language Processing (NLP) pipelines to analyze text data from online mental health forums used for consultations. By analyzing forum posts, these pipelines can flag users who may require immediate professional attention. A crucial challenge in this domain is data privacy and scarcity. To address this, we propose utilizing readily available curricular texts used in institutes specializing in mental health for pre-training the NLP pipelines. This helps us mimic the training process of a psychologist. Our work presents CASE-BERT, which flags potential mental health disorders based on forum text. CASE-BERT demonstrates superior performance compared to existing methods, achieving an F1 score of 0.91 for Depression and 0.88 for Anxiety, two of the most commonly reported mental health disorders. Our code is publicly available.

Updated: 2024-06-16 10:33:34

标题: CASE：用于构建辅助心理学专家模型的高效课程数据预训练

摘要: 心理学家的有限供应需要有效地识别需要紧急心理保健的个体。本研究探讨了利用自然语言处理（NLP）管道分析在线心理健康论坛中用于咨询的文本数据的方法。通过分析论坛帖子，这些管道可以标记可能需要立即专业关注的用户。在这一领域的一个关键挑战是数据隐私和稀缺性。为了解决这个问题，我们建议利用专门从事心理健康的机构使用的即时可用的课程文本来预先训练NLP管道。这有助于我们模拟心理学家的培训过程。我们的工作提出了基于论坛文本标记潜在心理健康障碍的CASE-BERT。与现有方法相比，CASE-BERT表现出优越的性能，为抑郁症和焦虑症两种最常报告的心理健康障碍实现了0.91和0.88的F1分数。我们的代码是公开可用的。

更新时间: 2024-06-16 10:33:34

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.00314v2

Linkage on Security, Privacy and Fairness in Federated Learning: New Balances and New Perspectives

Federated learning is fast becoming a popular paradigm for applications involving mobile devices, banking systems, healthcare, and IoT systems. Hence, over the past five years, researchers have undertaken extensive studies on the privacy leaks, security threats, and fairness associated with these emerging models. For the most part, these three critical concepts have been studied in isolation; however, recent research has revealed that there may be an intricate interplay between them. For instance, some researchers have discovered that pursuing fairness may compromise privacy, or that efforts to enhance security can impact fairness. These emerging insights shed light on the fundamental connections between privacy, security, and fairness within federated learning, and, by delving deeper into these interconnections, we may be able to significantly augment research and development across the field. Consequently, the aim of this survey is to offer comprehensive descriptions of the privacy, security, and fairness issues in federated learning. Moreover, we analyze the complex relationships between these three dimensions of cyber safety and pinpoint the fundamental elements that influence each of them. We contend that there exists a trade-off between privacy and fairness and between security and gradient sharing. On this basis, fairness can function as a bridge between privacy and security to build models that are either more secure or more private. Building upon our observations, we identify the trade-offs between privacy and fairness and between security and fairness within the context of federated learning. The survey then concludes with promising directions for future research in this vanguard field.

Updated: 2024-06-16 10:31:45

标题: 在联邦学习中的安全、隐私和公平性联动：新的平衡和新观点

摘要: 联邦学习正迅速成为涉及移动设备、银行系统、医疗保健和物联网系统的应用程序中流行的范例。因此，在过去五年里，研究人员已经进行了广泛的研究，探讨了与这些新兴模型相关的隐私泄漏、安全威胁和公平性。在很大程度上，这三个关键概念一直被孤立地研究；然而，最近的研究揭示了它们之间可能存在复杂的相互作用。例如，一些研究人员发现，追求公平可能会损害隐私，或者努力提高安全性可能会影响公平性。这些新兴的见解揭示了在联邦学习中隐私、安全和公平之间的基本联系，并且通过深入探讨这些相互关系，我们可能能够显著增强整个领域的研究和发展。因此，本调查的目的是提供对联邦学习中的隐私、安全和公平问题进行全面描述。此外，我们分析了这三个网络安全维度之间的复杂关系，并指出了影响它们各自的基本元素。我们认为隐私和公平之间存在一种权衡，安全性和梯度共享之间也存在一种权衡。基于此，公平性可以作为隐私和安全之间的桥梁，建立更安全或更隐私的模型。基于我们的观察，我们确定了在联邦学习背景下隐私与公平、安全与公平之间的权衡。调查最后总结了在这一前沿领域未来研究的有希望的方向。

更新时间: 2024-06-16 10:31:45

领域: cs.LG,cs.CR,cs.DC

下载: http://arxiv.org/abs/2406.10884v1

On the Implicit Bias of Adam

In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
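To make the role of the hyperparameters concrete, the sketch below implements one scalar step of the standard Adam update and shows an extreme setting, beta1 = beta2 = 0, in which the step reduces to approximately `lr * sign(grad)`. A sign-based step of size `lr` decreases a linearized loss by `lr` times the one-norm of the gradient, which gives some informal intuition for why a one-norm rather than a two-norm might appear. This is only an illustration and does not reproduce the paper's backward error analysis; the function name `adam_step` and the default values are assumptions chosen here.

```python
import math
from typing import Tuple


def adam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> Tuple[float, float, float]:
    """One Adam update for a single scalar parameter.

    `m` and `v` are the running first and second moment estimates and
    `t` is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With beta1 = beta2 = 0 the step size no longer depends on the gradient's
# magnitude, only on its sign (up to the small eps perturbation).
large, _, _ = adam_step(1.0, 250.0, 0.0, 0.0, t=1, lr=0.1, beta1=0.0, beta2=0.0)
small, _, _ = adam_step(1.0, 0.004, 0.0, 0.0, t=1, lr=0.1, beta1=0.0, beta2=0.0)
```

Both `large` and `small` end up close to 0.9 even though the two gradients differ by several orders of magnitude.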

Updated: 2024-06-16 10:29:01

标题: 论Adam的隐式偏差

摘要: 在先前的文献中，使用反向误差分析来找到近似梯度下降轨迹的普通微分方程（ODEs）。发现有限步长隐式地对解进行正则化，因为出现在ODEs中的项惩罚了损失梯度的二范数。我们证明RMSProp和Adam中存在类似的隐式正则化取决于它们的超参数和训练阶段，但涉及不同的“范数”：对应的ODE项要么惩罚（扰动的）损失梯度的一范数，要么反之，阻碍其减少（后一种情况更加典型）。我们还进行数值实验并讨论已证实的事实如何影响泛化。

更新时间: 2024-06-16 10:29:01

领域: cs.LG,cs.AI,math.OC,stat.CO,stat.ML

下载: http://arxiv.org/abs/2309.00079v4

LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning

Developing interactive systems that leverage natural language instructions to solve complex robotic control tasks has been a long-desired goal in the robotics community. Large Language Models (LLMs) have demonstrated exceptional abilities in handling complex tasks, including logical reasoning, in-context learning, and code generation. However, predicting low-level robotic actions using LLMs poses significant challenges. Additionally, the complexity of such tasks usually demands the acquisition of policies to execute diverse subtasks and combine them to attain the ultimate objective. Hierarchical Reinforcement Learning (HRL) is an elegant approach for solving such tasks, which provides the intuitive benefits of temporal abstraction and improved exploration. However, HRL faces the recurring issue of non-stationarity due to unstable lower primitive behaviour. In this work, we propose LGR2, a novel HRL framework that leverages language instructions to generate a stationary reward function for the higher-level policy. Since the language-guided reward is unaffected by the lower primitive behaviour, LGR2 mitigates non-stationarity and is thus an elegant method for leveraging language instructions to solve robotic control tasks. To analyze the efficacy of our approach, we perform empirical analysis and demonstrate that LGR2 effectively alleviates non-stationarity in HRL. Our approach attains success rates exceeding 70% in challenging, sparse-reward robotic navigation and manipulation environments where the baselines fail to achieve any significant progress. Additionally, we conduct real-world robotic manipulation experiments and demonstrate that CRISP shows impressive generalization in real-world scenarios.

Updated: 2024-06-16 10:28:45

标题: LGR2：用于加速分层强化学习的语言引导奖励重标记

摘要: 开发利用自然语言指令解决复杂机器人控制任务的交互式系统一直是机器人学界长期以来的目标。大型语言模型(LLMs)在处理复杂任务，包括逻辑推理、上下文学习和代码生成方面表现出色。然而，使用LLMs预测低级机器人动作存在重大挑战。此外，这类任务的复杂性通常要求获取执行各种子任务并将它们组合以达到最终目标的策略。分层强化学习(HRL)是解决这类任务的一种优雅方法，它提供了时间抽象和改进探索的直观优势。然而，由于下层原始行为不稳定，HRL面临着非稳态的反复问题。在这项工作中，我们提出了LGR2，一个利用语言指令生成高级策略的稳态奖励函数的新型HRL框架。由于语言引导奖励不受下层原始行为的影响，LGR2减轻了非稳态性，因此是利用语言指令解决机器人控制任务的一种优雅方法。为了分析我们方法的有效性，我们进行了实证分析，并展示LGR2在HRL中有效缓解了非稳态性。我们的方法在具有挑战性、稀疏奖励的机器人导航和操作环境中取得了超过70%的成功率，在那里基线未能取得任何显著进展。此外，我们进行了现实世界的机器人操作实验，并展示CRISP在现实场景中展现出令人印象深刻的泛化能力。

更新时间: 2024-06-16 10:28:45

领域: cs.LG,cs.CL,cs.RO

下载: http://arxiv.org/abs/2406.05881v2

LABCAT: Locally adaptive Bayesian optimization using principal-component-aligned trust regions

Bayesian optimization (BO) is a popular method for optimizing expensive black-box functions. BO has several well-documented shortcomings, including computational slowdown with longer optimization runs, poor suitability for non-stationary or ill-conditioned objective functions, and poor convergence characteristics. Several algorithms have been proposed that incorporate local strategies, such as trust regions, into BO to mitigate these limitations; however, none address all of them satisfactorily. To address these shortcomings, we propose the LABCAT algorithm, which extends trust-region-based BO by adding a rotation aligning the trust region with the weighted principal components and an adaptive rescaling strategy based on the length-scales of a local Gaussian process surrogate model with automatic relevance determination. Through extensive numerical experiments using a set of synthetic test functions and the well-known COCO benchmarking software, we show that the LABCAT algorithm outperforms several state-of-the-art BO and other black-box optimization algorithms.

Updated: 2024-06-16 10:22:52

标题: LABCAT：使用主成分对齐信任区域的局部自适应贝叶斯优化

摘要: 贝叶斯优化（BO）是一种用于优化昂贵黑盒函数的流行方法。BO有一些众所周知的缺点，包括随着优化运行时间的延长而出现的计算速度下降，对非平稳或病态目标函数的适应性差，以及收敛特性差。已经提出了几种算法，将局部策略（如信任区域）纳入BO以减轻这些限制；然而，没有一个能令人满意地解决所有问题。为了解决这些缺点，我们提出了LABCAT算法，该算法通过添加将信任区域与加权主成分对齐的旋转和基于局部高斯过程替代模型的长度尺度的自适应重新缩放策略来扩展基于信任区域的BO。通过使用一组合成测试函数和著名的COCO基准软件进行大量数值实验，我们展示了LABCAT算法优于几种最先进的BO和其他黑盒优化算法。

更新时间: 2024-06-16 10:22:52

领域: cs.LG

下载: http://arxiv.org/abs/2311.11328v2

Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for sustainable development by harnessing the power of cross-domain data fusion from diverse sources (e.g., geographical, traffic, social media, and environmental data) and modalities (e.g., spatio-temporal, visual, and textual modalities). Recently, there has been a rising trend of using various deep-learning methods to facilitate cross-domain data fusion in smart cities. To this end, we propose the first survey that systematically reviews the latest advancements in deep learning-based data fusion methods tailored for urban computing. Specifically, we first delve into the data perspective to comprehend the role of each modality and data source. Secondly, we classify the methodology into four primary categories: feature-based, alignment-based, contrast-based, and generation-based fusion methods. Thirdly, we further categorize multi-modal urban applications into seven types: urban planning, transportation, economy, public safety, society, environment, and energy. Compared with previous surveys, we focus more on the synergy of deep learning methods with urban computing applications. Furthermore, we shed light on the interplay between Large Language Models (LLMs) and urban computing, postulating future research directions that could revolutionize the field. We firmly believe that the taxonomy, progress, and prospects delineated in our survey stand poised to significantly enrich the research community. The summary of the comprehensive and up-to-date paper list can be found at https://github.com/yoshall/Awesome-Multimodal-Urban-Computing.

Updated: 2024-06-16 10:16:00

标题: 深度学习在城市计算中的跨领域数据融合：分类、进展和展望

摘要: 随着城市不断蓬勃发展，城市计算作为可持续发展的关键学科，通过利用来自各种来源（如地理、交通、社交媒体和环境数据）和模态（如时空、视觉和文本模态）的跨领域数据融合的力量，开始崭露头角。最近，我们正在见证一种新兴趋势，即利用各种深度学习方法来促进智能城市中的跨领域数据融合。为此，我们提出了第一份系统性审查，系统地回顾了专为城市计算定制的基于深度学习的数据融合方法的最新进展。具体来说，我们首先深入研究数据视角，以理解每种模态和数据来源的作用。其次，我们将方法学分类为四个主要类别：基于特征、基于对齐、基于对比和基于生成的融合方法。第三，我们进一步将多模态城市应用程序分类为七种类型：城市规划、交通、经济、公共安全、社会、环境和能源。与先前的调查相比，我们更注重深度学习方法与城市计算应用之间的协同作用。此外，我们还探讨了大型语言模型（LLM）与城市计算之间的互动，提出了可能颠覆该领域的未来研究方向。我们坚信，在我们的调查中勾画的分类法、进展和前景将极大丰富研究社区。详细和最新的论文列表摘要可在https://github.com/yoshall/Awesome-Multimodal-Urban-Computing找到。

更新时间: 2024-06-16 10:16:00

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2402.19348v2

Demonstration Notebook: Finding the Most Suited In-Context Learning Example from Interactions

Large language models (LLMs) benefit greatly from prompt engineering, with in-context learning standing as a pivotal technique. While former approaches have provided various ways to construct the demonstrations used for in-context learning, they often ignore the inherent heterogeneity within datasets, applying the same demonstrations to all reasoning questions. We observed that the effectiveness of demonstrations varies depending on the specific question. This motivates our exploration of using prompt engineering to select appropriate demonstrations. To address the challenge of automatically creating and choosing demonstrations tailored to each question, we propose a novel prompt engineering workflow built around a novel object called the "demonstration notebook." This notebook helps identify the most suitable in-context learning example for a question by gathering and reusing information from the LLM's past interactions. Our experiments show that this approach outperforms all existing methods for automatic demonstration construction and selection (as far as we know), achieving state-of-the-art results on several reasoning benchmarks. The method's versatility is further demonstrated by its success in text summarization and prompt compression tasks. Additionally, we contribute a rigorous analysis method to reveal the "demonstrative regime" of a demonstration, providing valuable insights into how demonstrations relate to different question types within a dataset.

Updated: 2024-06-16 10:02:20

标题: 演示笔记本：从互动中找到最适合的上下文学习示例

摘要: 大型语言模型（LLMs）在提示工程方面受益匪浅，其中上下文学习作为一个关键技术。虽然以往的方法提供了各种构建用于上下文学习的演示的方式，但它们经常忽视数据集内固有的异质性，将相同的演示应用于所有推理问题。我们观察到，演示的有效性取决于具体的问题。这促使我们探索使用提示工程来选择适当的演示。为了解决自动创建并选择针对每个问题量身定制的演示的挑战，我们提出了一种围绕“演示笔记本”这一新对象构建的新颖提示工程工作流程。这个笔记本通过收集和重复利用LLM过去的交互信息，帮助确定问题的最合适的上下文学习示例。我们的实验表明，这种方法优于所有现有的自动演示构建和选择方法（据我们所知），在几个推理基准测试上取得了最先进的结果。该方法的多功能性进一步通过其在文本摘要和提示压缩任务中的成功得到证实。此外，我们贡献了一种严格的分析方法，揭示了演示的“示范制度”，为演示如何与数据集内不同类型的问题相关提供了有价值的见解。

更新时间: 2024-06-16 10:02:20

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.10878v1

Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for space-time solutions of semilinear partial differential equations

Solving high-dimensional nonlinear partial differential equations (PDEs) is a challenging topic in applied mathematics. Standard approximation methods for nonlinear PDEs suffer from the curse of dimensionality (COD) in the sense that the number of computational operations of the approximation method grows at least exponentially in the PDE dimension; with such methods it is essentially impossible to approximately solve high-dimensional PDEs even when the fastest currently available computers are used. However, in recent years great progress has been made in this area of research through suitable deep learning (DL) based methods for PDEs in which deep neural networks (DNNs) are used to approximate solutions of PDEs. Despite the remarkable success of such DL methods in simulations, it remains a fundamental open problem of research to prove (or disprove) that such methods can overcome the COD in the approximation of PDEs. However, there are nowadays several partial error analysis results for DL methods for high-dimensional nonlinear PDEs in the literature which prove that DNNs can overcome the COD in the sense that the number of parameters of the approximating DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon>0$ and the PDE dimension $d\in\mathbb{N}$. In the main result of this article we prove that for all $T,p\in(0,\infty)$ it holds that solutions $u_d\colon[0,T]\times\mathbb{R}^d\to\mathbb{R}$, $d\in\mathbb{N}$, of semilinear heat equations with Lipschitz continuous nonlinearities can be approximated in the $L^p$-sense on space-time regions without the COD by DNNs with the rectified linear unit (ReLU), the leaky ReLU, or the softplus activation function. In previous articles similar results have been established not for space-time regions but for the solutions $u_d(T,\cdot)$, $d\in\mathbb{N}$, at the terminal time $T$.
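The notion of overcoming the COD used above, polynomial growth of the parameter count in both $d$ and $1/\varepsilon$, can be written schematically as follows. This is a paraphrase under assumed notation rather than the article's theorem: $\mathcal{P}(\Phi)$ for the number of parameters of a DNN $\Phi$, $\mathcal{R}(\Phi)$ for the function it realizes, and the measure $\mu_d$ describing the space-time region are placeholders introduced here, and the abstract does not specify them.

```latex
\exists\, c \in (0,\infty)\;\;
\forall\, d \in \mathbb{N},\ \varepsilon \in (0,1]\;\;
\exists\, \Phi_{d,\varepsilon} \colon
\qquad
\mathcal{P}(\Phi_{d,\varepsilon}) \le c\, d^{c}\, \varepsilon^{-c}
\quad\text{and}\quad
\left(
  \int_{[0,T]\times\mathbb{R}^d}
  \bigl| u_d(t,x) - \bigl(\mathcal{R}(\Phi_{d,\varepsilon})\bigr)(t,x) \bigr|^{p}
  \, \mu_d(\mathrm{d}t,\mathrm{d}x)
\right)^{1/p} \le \varepsilon .
```

The key point is that the same constant $c$ works for every dimension $d$ and every accuracy $\varepsilon$, so the parameter count never grows exponentially in $d$.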

Updated: 2024-06-16 09:59:29

标题: 带有ReLU、leaky ReLU和softplus激活函数的深度神经网络可证明地克服半线性偏微分方程时空解的维度灾难

摘要: 解决高维非线性偏微分方程（PDEs）是应用数学中的一个挑战性课题。非线性PDEs的标准逼近方法受到维度诅咒的影响，即逼近方法的计算操作数量至少呈指数增长，并且即使使用目前最快的计算机，也基本上不可能近似解决高维PDEs。然而，近年来在这一研究领域取得了巨大进展，通过适用于PDEs的深度学习（DL）方法，其中使用深度神经网络（DNNs）来逼近PDEs的解。尽管这种DL方法在模拟中取得了显著成功，但证明（或否定）这种方法能否克服维度诅咒在PDE逼近中仍然是一个基本开放的研究问题。然而，现在文献中有几个关于高维非线性PDEs的DL方法的部分误差分析结果，证明了DNNs能够克服维度诅咒，即逼近DNN的参数数量最多在逆预设逼近精度ε>0和PDE维度d∈N的多项式中增长。在本文的主要结果中，我们证明对于所有T，p∈(0，∞)，在没有维度诅咒的空间时间区域上，具有Lipschitz连续非线性的半线性热方程的解$u_d\colon[0,T]\times\mathbb{R}^d\to\mathbb{R}$，d∈N，可以通过具有修正线性单元（ReLU）、泄漏ReLU或softplus激活函数的DNNs在L^p意义上逼近。先前的文章已经建立了类似的结果，但不是针对空间时间区域，而是针对在终端时间T时的解$u_d(T,\cdot)$，d∈N。

更新时间: 2024-06-16 09:59:29

领域: cs.LG,cs.NA,math.NA,math.PR,65M15, 65C05, 68T07 (Primary) 60H35 (Secondary)

下载: http://arxiv.org/abs/2406.10876v1

Optimizing Automatic Speech Assessment: W-RankSim Regularization and Hybrid Feature Fusion Strategies

Automatic Speech Assessment (ASA) has seen notable advancements with the utilization of self-supervised features (SSL) in recent research. However, a key challenge in ASA lies in the imbalanced distribution of data, particularly evident in English test datasets. To address this challenge, we approach ASA as an ordinal classification task, introducing Weighted Vectors Ranking Similarity (W-RankSim) as a novel regularization technique. W-RankSim encourages closer proximity of weighted vectors in the output layer for similar classes, implying that feature vectors with similar labels would be gradually nudged closer to each other as they converge towards corresponding weighted vectors. Extensive experimental evaluations confirm the effectiveness of our approach in improving ordinal classification performance for ASA. Furthermore, we propose a hybrid model that combines SSL and handcrafted features, showcasing how the inclusion of handcrafted features enhances performance in an ASA system.

Updated: 2024-06-16 09:55:21

标题: 优化自动语音评估：W-RankSim正则化和混合特征融合策略

摘要: 自动语音评估（ASA）在最近的研究中利用自监督特征（SSL）取得了显著进展。然而，ASA中的一个关键挑战在于数据的不平衡分布，特别是在英语测试数据集中特别明显。为了解决这一挑战，我们将ASA视为一个有序分类任务，引入了加权向量排序相似性（W-RankSim）作为一种新颖的正则化技术。W-RankSim鼓励输出层中具有相似类别的加权向量更接近，这意味着具有相似标签的特征向量会逐渐靠近彼此，因为它们趋向于对应的加权向量。广泛的实验评估证实了我们的方法在提高ASA的有序分类性能方面的有效性。此外，我们提出了一个将SSL和手工特征结合的混合模型，展示了如何在ASA系统中包含手工特征可以提高性能。

更新时间: 2024-06-16 09:55:21

领域: cs.SD,cs.AI,cs.CL,eess.AS

下载: http://arxiv.org/abs/2406.10873v1

Graph Neural Reaction Diffusion Models

The integration of Graph Neural Networks (GNNs) and Neural Ordinary and Partial Differential Equations has been extensively studied in recent years. GNN architectures powered by neural differential equations allow us to reason about their behavior, and develop GNNs with desired properties such as controlled smoothing or energy conservation. In this paper we take inspiration from Turing instabilities in a Reaction Diffusion (RD) system of partial differential equations, and propose a novel family of GNNs based on neural RD systems. We demonstrate that our RDGNN is powerful for the modeling of various data types, from homophilic, to heterophilic, and spatio-temporal datasets. We discuss the theoretical properties of our RDGNN, its implementation, and show that it improves on or is competitive with state-of-the-art methods.
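As a rough picture of what a reaction-diffusion system on a graph looks like, the sketch below performs one explicit Euler step of a scalar RD equation: each node's value moves toward its neighbours' values (diffusion) and is also changed by a pointwise reaction term. This is a textbook discretization under assumptions, not the RDGNN architecture: the function name `rd_step`, the scalar node features, the fixed `diffusivity`, and the user-supplied `reaction` function are hypothetical, whereas the paper's model would use learned, multi-channel terms not described in the abstract.

```python
from typing import Callable, List, Sequence, Tuple


def rd_step(
    x: Sequence[float],
    edges: Sequence[Tuple[int, int]],
    reaction: Callable[[float], float],
    diffusivity: float = 1.0,
    step: float = 0.1,
) -> List[float]:
    """One explicit Euler step of a reaction-diffusion system on a graph.

    `x` holds one scalar feature per node and `edges` lists undirected
    edges as (i, j) index pairs. The diffusion term at node i is the sum
    of (x[j] - x[i]) over its neighbours j, i.e. the negative graph
    Laplacian applied to x.
    """
    diffusion = [0.0] * len(x)
    for i, j in edges:
        diffusion[i] += x[j] - x[i]
        diffusion[j] += x[i] - x[j]
    return [
        x[i] + step * (diffusivity * diffusion[i] + reaction(x[i]))
        for i in range(len(x))
    ]
```

With a zero reaction term the step is pure smoothing and conserves the total of the node values; a nonzero reaction term is what allows the dynamics to counteract that smoothing.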

Updated: 2024-06-16 09:46:58

Categories: cs.LG

Download: http://arxiv.org/abs/2406.10871v1
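
To make the reaction-diffusion idea in the abstract above concrete, here is a minimal single-channel sketch of one explicit Euler step of dx/dt = -d L x + r(x) on a graph, where L is the combinatorial Laplacian. The logistic reaction term, the step size, and all function names are assumptions chosen for illustration; the paper's RDGNN learns its reaction and diffusion terms, and this sketch does not reproduce that.

```python
def laplacian_apply(adj, x):
    """Return L x for the combinatorial graph Laplacian L = D - A."""
    n = len(x)
    return [sum(adj[i]) * x[i] - sum(adj[i][j] * x[j] for j in range(n))
            for i in range(n)]


def reaction(value):
    """Hypothetical pointwise reaction term (logistic growth), standing in for a learned one."""
    return value * (1.0 - value)


def rd_step(adj, x, step=0.1, diffusivity=1.0):
    """One explicit Euler step of dx/dt = -diffusivity * L x + reaction(x)."""
    lx = laplacian_apply(adj, x)
    return [x[i] + step * (-diffusivity * lx[i] + reaction(x[i]))
            for i in range(len(x))]


# Path graph 0 - 1 - 2 with one scalar feature per node.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
x = [1.0, 0.0, 0.0]
x_next = rd_step(adj, x)
```

The diffusion term moves mass from node 0 to its neighbor, while the reaction term acts on each node independently. Stacking such steps is one way to read a layer update in this style of model.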

Geometric-informed GFlowNets for Structure-Based Drug Design

The rising cost of drug discovery and the current speed at which drugs are discovered underscore the need for more efficient structure-based drug design (SBDD) methods. We employ Generative Flow Networks (GFlowNets) to effectively explore the vast combinatorial space of drug-like molecules, which traditional virtual screening methods fail to cover. We introduce a novel modification to the GFlowNet framework by incorporating trigonometrically consistent embeddings, previously utilized in tasks involving protein conformation and protein-ligand interactions, to enhance the model's ability to generate molecules tailored to specific protein pockets. We have modified the existing protein conditioning used by GFlowNets, blending geometric information from both protein and ligand embeddings to achieve more geometrically consistent embeddings. Experiments conducted using CrossDocked2020 demonstrated an improvement in the binding affinity between generated molecules and protein pockets for both single and multi-objective tasks, compared to previous work. Additionally, we propose future work aimed at further increasing the geometric information captured in protein-ligand interactions.

Updated: 2024-06-16 09:32:19

Categories: cs.LG,q-bio.BM

Download: http://arxiv.org/abs/2406.10867v1

SUB-PLAY: Adversarial Policies against Partially Observed Multi-Agent Reinforcement Learning Systems

Recent advancements in multi-agent reinforcement learning (MARL) have opened up vast application prospects, such as swarm control of drones, collaborative manipulation by robotic arms, and multi-target encirclement. However, potential security threats during MARL deployment need more attention and thorough investigation. Recent research reveals that attackers can rapidly exploit the victim's vulnerabilities, generating adversarial policies that result in the failure of specific tasks. For instance, such policies can reduce the winning rate of a superhuman-level Go AI to around 20%. Existing studies predominantly focus on two-player competitive environments, assuming attackers possess complete global state observation. In this study, we unveil, for the first time, the capability of attackers to generate adversarial policies even when restricted to partial observations of the victims in multi-agent competitive environments. Specifically, we propose a novel black-box attack (SUB-PLAY) that incorporates the concept of constructing multiple subgames to mitigate the impact of partial observability and suggests sharing transitions among subpolicies to improve attackers' exploitative ability. Extensive evaluations demonstrate the effectiveness of SUB-PLAY under three typical partial observability limitations. Visualization results indicate that adversarial policies induce significantly different activations of the victims' policy networks. Furthermore, we evaluate three potential defenses aimed at exploring ways to mitigate security threats posed by adversarial policies, providing constructive recommendations for deploying MARL in competitive environments.

Updated: 2024-06-16 09:27:17

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2402.03741v2

Global-Local Graph Neural Networks for Node-Classification

The task of graph node classification is often approached by utilizing a local Graph Neural Network (GNN) that learns only local information from the node input features and their adjacency. In this paper, we propose to improve the performance of node classification GNNs by utilizing both global and local information, specifically by learning label and node features. We therefore call our method Global-Local-GNN (GLGNN). To learn proper label features, for each label, we maximize the similarity between its features and the features of nodes that belong to the label, while maximizing the distance to the features of nodes that do not belong to the considered label. We then use the learnt label features to predict the node classification map. We demonstrate our GLGNN using three different GNN backbones, and show that our approach improves baseline performance, revealing the importance of global information utilization for node classification.

Updated: 2024-06-16 09:13:30

Categories: cs.LG

Download: http://arxiv.org/abs/2406.10863v1
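
The prediction step described in the GLGNN abstract above, where learnt label features are compared with node features, might look roughly like the sketch below. It assumes cosine similarity as the similarity measure and uses fixed toy vectors; the function names and the choice of cosine are illustrative assumptions, and in the paper both feature sets are learned by a GNN backbone.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def classify(node_features, label_features):
    """Assign each node the index of the label whose feature vector is most similar."""
    predictions = []
    for node in node_features:
        scores = [cosine(node, label) for label in label_features]
        predictions.append(max(range(len(scores)), key=scores.__getitem__))
    return predictions


# Toy data: two labels and three nodes in a 2-dimensional feature space.
label_features = [[1.0, 0.0], [0.0, 1.0]]
node_features = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
predictions = classify(node_features, label_features)
```

Training would then push each label vector toward the nodes carrying that label and away from the rest, which is the objective the abstract describes.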

Knowledge Distillation in Federated Learning: a Survey on Long Lasting Challenges and New Solutions

Federated Learning (FL) is a distributed and privacy-preserving machine learning paradigm that coordinates multiple clients to train a model while keeping the raw data localized. However, this traditional FL poses some challenges, including privacy risks, data heterogeneity, communication bottlenecks, and system heterogeneity issues. To tackle these challenges, knowledge distillation (KD) has been widely applied in FL since 2020. KD is a validated and efficacious model compression and enhancement algorithm. The core concept of KD involves facilitating knowledge transfer between models by exchanging logits at intermediate or output layers. These properties make KD an excellent solution for the long-lasting challenges in FL. Up to now, there have been few reviews that summarize and analyze the current trend and methods for how KD can be applied in FL efficiently. This article aims to provide a comprehensive survey of KD-based FL, focusing on addressing the above challenges. First, we provide an overview of KD-based FL, including its motivation, basics, taxonomy, and a comparison with traditional FL and where KD should execute. We also analyze the critical factors in KD-based FL in the appendix, including teachers, knowledge, data, and methods. We discuss how KD can address the challenges in FL, including privacy protection, data heterogeneity, communication efficiency, and personalization. Finally, we discuss the challenges facing KD-based FL algorithms and future research directions. We hope this survey can provide insights and guidance for researchers and practitioners in the FL area.

Updated: 2024-06-16 09:12:16

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2406.10861v1
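
As a reminder of what "exchanging logits" means in the survey abstract above, the sketch below computes the commonly used distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names and the default temperature are assumptions for illustration; individual KD-based FL methods covered by the survey may use different losses.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return temperature ** 2 * kl


teacher_logits = [2.0, 1.0, 0.1]
loss_same = distillation_loss(teacher_logits, teacher_logits)
loss_diff = distillation_loss([1.0, 1.0, 1.0], teacher_logits)
```

In a federated setting, clients or the server would exchange such logits on shared or synthetic inputs instead of model weights or raw data.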

Step-level Value Preference Optimization for Mathematical Reasoning

Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference-aligned large language models (LLMs). However, the overall preference annotations of responses do not fully capture the fine-grained quality of model outputs in complex multi-step reasoning tasks, such as mathematical reasoning. To address this limitation, we introduce a novel algorithm called Step-level Value Preference Optimization (SVPO). Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning. Furthermore, from the perspective of learning-to-rank, we train an explicit value model to replicate the behavior of the implicit reward model, complementing standard preference optimization. This value model enables the LLM to generate higher-reward responses with minimal cost during inference. Experimental results demonstrate that our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.

Updated: 2024-06-16 09:06:17

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.10858v1
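
For readers unfamiliar with the DPO objective that the abstract above builds on, here is a minimal sketch of the standard per-pair DPO loss, -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]). The function name, argument names, and the default beta are assumptions for illustration. Applying the same loss to pairs of reasoning steps rather than whole responses is only a guess at how step-level preferences could be used; SVPO's actual objective is not given in the abstract.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference model, the loss is log(2).
neutral = dpo_loss(-5.0, -6.0, -5.0, -6.0)
# Raising the chosen response's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -6.0)
```

The implicit reward in DPO is the scaled log-ratio between the policy and the reference model, which is why no separate reward model appears in the loss.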

ALPS: An Auto-Labeling and Pre-training Scheme for Remote Sensing Segmentation With Segment Anything Model

In the fast-growing field of Remote Sensing (RS) image analysis, the gap between massive unlabeled datasets and the ability to fully utilize these datasets for advanced RS analytics presents a significant challenge. To fill the gap, our work introduces an innovative auto-labeling framework named ALPS (Automatic Labeling for Pre-training in Segmentation), leveraging the Segment Anything Model (SAM) to predict precise pseudo-labels for RS images without necessitating prior annotations or additional prompts. The proposed pipeline significantly reduces the labor and resource demands traditionally associated with annotating RS datasets. By constructing two comprehensive pseudo-labeled RS datasets via ALPS for pre-training purposes, our approach enhances the performance of downstream tasks across various benchmarks, including iSAID and ISPRS Potsdam. Experiments demonstrate the effectiveness of our framework, showcasing its ability to generalize well across multiple tasks even under the scarcity of extensively annotated datasets, offering a scalable solution to automatic segmentation and annotation challenges in the field. In addition, the proposed pipeline is flexible and can be applied to medical image segmentation, remarkably boosting the performance. Note that ALPS utilizes pre-trained SAM to semi-automatically annotate RS images without additional manual annotations. Though every component in the pipeline has been well explored, integrating clustering algorithms with SAM and novel pseudo-label alignment significantly enhances RS segmentation, as an off-the-shelf tool for pre-training data preparation. Our source code is available at: https://github.com/StriveZs/ALPS.

Updated: 2024-06-16 09:02:01

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2406.10855v1

Outdated Issue Aware Decoding for Reasoning Questions on Edited Knowledge

Recently, Knowledge Editing has received increasing attention, since it can replace specific outdated knowledge in pretrained models with updated knowledge without re-training. However, as pointed out by recent studies, existing related methods tend to merely memorize the superficial word composition of the edited knowledge, rather than truly learning and absorbing it. Consequently, on reasoning questions, we discover that existing methods struggle to utilize the edited knowledge to reason out the new answer, and tend to retain outdated responses, which are generated by the original models utilizing original knowledge. These outdated responses no longer match the correct answers to reasoning questions; we call this the outdated issue. To alleviate this issue, in this paper, we propose a simple yet effective decoding strategy, i.e., outDated ISsue aware deCOding (DISCO), to enhance the performance of edited models on reasoning questions. Specifically, we capture the difference in the probability distribution between the original and edited models. Further, we amplify the difference of the token prediction in the edited model to alleviate the outdated issue, and thus enhance the model performance w.r.t. the edited knowledge. Experimental results suggest that applying DISCO could enhance the reasoning of edited models, e.g., on reasoning questions, DISCO outperforms the prior SOTA method by 12.99 F1 points, and reduces the ratio of the outdated issue to 5.78% on the zsRE dataset.

Updated: 2024-06-16 08:50:34

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.02882v3
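
The abstract above says DISCO captures the difference between the original and edited models' token distributions and amplifies it. One plausible way to do that, sketched below, is to add a scaled log-probability difference to the edited model's log-probabilities and renormalize. The exact formula, the `alpha` parameter, and the function name are assumptions made for illustration and should not be read as the paper's definition.

```python
import math


def amplify(edited_probs, original_probs, alpha=1.0):
    """Rescore tokens as log p_edit + alpha * (log p_edit - log p_orig), then renormalize.

    Assumes every probability is strictly positive.
    """
    scores = [math.log(pe) + alpha * (math.log(pe) - math.log(po))
              for pe, po in zip(edited_probs, original_probs)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy vocabulary: token 0 is the outdated answer, token 1 is the edited answer.
original = [0.70, 0.25, 0.05]
edited = [0.50, 0.45, 0.05]
adjusted = amplify(edited, original)
```

In this toy case the edited model still ranks the outdated token first, but the token whose probability grew most after editing wins once the difference is amplified.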

Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on test input. Following the assumption that accurate inverse inference probability (likelihood) will result in accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-tasking and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method on various models, tasks and modalities.

Updated: 2024-06-16 08:49:00

Categories: cs.CL,cs.AI,cs.CV,cs.SD,eess.AS

Download: http://arxiv.org/abs/2404.14716v2

IG2: Integrated Gradient on Iterative Gradient Path for Feature Attribution

Feature attribution explains Artificial Intelligence (AI) at the instance level by providing importance scores of input features' contributions to model prediction. Integrated Gradients (IG) is a prominent path attribution method for deep neural networks, involving the integration of gradients along a path from the explained input (explicand) to a counterfactual instance (baseline). Current IG variants primarily focus on the gradient of the explicand's output. However, our research indicates that the gradient of the counterfactual output significantly affects feature attribution as well. To account for this, we propose Iterative Gradient path Integrated Gradients (IG2), considering both gradients. IG2 incorporates the counterfactual gradient iteratively into the integration path, generating a novel path (GradPath) and a novel baseline (GradCF). These two novel IG components effectively address the issues of attribution noise and arbitrary baseline choice in earlier IG methods. IG2, as a path method, satisfies many desirable axioms, which are theoretically justified in the paper. Experimental results on an XAI benchmark, ImageNet, MNIST, TREC question answering, wafer-map failure patterns, and CelebA face attributes validate that IG2 delivers superior feature attributions compared to the state-of-the-art techniques. The code is released at: https://github.com/JoeZhuo-ZY/IG2.

Updated: 2024-06-16 08:48:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2406.10852v1
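
The baseline method that the IG2 abstract above starts from, standard Integrated Gradients along a straight-line path, can be sketched in a few lines. The toy function `f`, its hand-written gradient, and the midpoint Riemann sum are assumptions chosen so the example runs without an autodiff library. This is plain IG, not the GradPath or GradCF construction of IG2.

```python
def f(x):
    """Toy model: f(x) = x0^2 + 3 * x1."""
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    """Analytic gradient of the toy model."""
    return [2.0 * x[0], 3.0]


def integrated_gradients(grad, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over alpha in [0, 1] of df/dx_i(b + alpha * (x - b))."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to f(x) - f(baseline), the completeness axiom that path methods such as IG satisfy.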

TorchOpera: A Compound AI System for LLM Safety

We introduce TorchOpera, a compound AI system for enhancing the safety and quality of prompts and responses for Large Language Models. TorchOpera ensures that all user prompts are safe, contextually grounded, and effectively processed, while enhancing LLM responses to be relevant and high quality. TorchOpera utilizes the vector database for contextual grounding, rule-based wrappers for flexible modifications, and specialized mechanisms for detecting and adjusting unsafe or incorrect content. We also provide a view of the compound AI system to reduce the computational cost. Extensive experiments show that TorchOpera ensures the safety, reliability, and applicability of LLMs in real-world settings while maintaining the efficiency of LLM responses.

Updated: 2024-06-16 08:39:19

Categories: cs.AI,cs.CE,cs.CL,cs.MA

Download: http://arxiv.org/abs/2406.10847v1

NBA: defensive distillation for backdoor removal via neural behavior alignment

Recently, deep neural networks have been shown to be vulnerable to backdoor attacks. A backdoor is inserted into neural networks via this attack paradigm, thus compromising the integrity of the network. As soon as an attacker presents a trigger during the testing phase, the backdoor in the model is activated, causing the network to make specific wrong predictions. It is extremely important to defend against backdoor attacks since they are very stealthy and dangerous. In this paper, we propose a novel defense mechanism, Neural Behavioral Alignment (NBA), for backdoor removal. NBA optimizes the distillation process in terms of knowledge form and distillation samples to improve defense performance according to the characteristics of backdoor defense. NBA builds high-level representations of neural behavior within networks in order to facilitate the transfer of knowledge. Additionally, NBA crafts pseudo samples to induce student models to exhibit backdoor neural behavior. By aligning the backdoor neural behavior from the student network with the benign neural behavior from the teacher network, NBA enables the proactive removal of backdoors. Extensive experiments show that NBA can effectively defend against six different backdoor attacks and outperform five state-of-the-art defenses.

Updated: 2024-06-16 08:39:15

Categories: cs.CR

Download: http://arxiv.org/abs/2406.10846v1

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs). While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM makes a foundational step towards scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting, including a lightweight method that facilitates backpropagation for the early-exit training objective with pipeline parallelism, techniques of leveraging idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches of early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead compared to standard LLM training, as well as outstanding inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.

Updated: 2024-06-16 08:37:25

Categories: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2312.04916v3
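
To illustrate what early-exit inference means in the abstract above, the sketch below picks the shallowest exit whose top-token probability clears a confidence threshold. The threshold rule, the function name, and the precomputed per-exit distributions are assumptions for illustration. A real system would skip the deeper layers once an exit fires, and the abstract does not say which exit criterion EE-LLM uses.

```python
def early_exit_predict(exit_probs, threshold=0.9):
    """Return (exit_index, token_index) for the first sufficiently confident exit.

    exit_probs lists one probability distribution per exit, ordered shallow to deep.
    The last exit is always accepted as a fallback.
    """
    if not exit_probs:
        raise ValueError("at least one exit is required")
    last = len(exit_probs) - 1
    for depth, probs in enumerate(exit_probs):
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold or depth == last:
            return depth, top


# Three exits over a three-token vocabulary; confidence grows with depth.
exits = [
    [0.50, 0.30, 0.20],
    [0.05, 0.92, 0.03],
    [0.01, 0.98, 0.01],
]
chosen = early_exit_predict(exits)
strict = early_exit_predict(exits, threshold=0.99)
```

A lower threshold saves more computation per token at the risk of emitting a token the full model would not have chosen.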

Enriching the Machine Learning Workloads in BigBench

In the era of Big Data and the growing support for Machine Learning, Deep Learning and Artificial Intelligence algorithms in current software systems, there is an urgent need for standardized application benchmarks that stress test and evaluate these new technologies. Relying on the standardized BigBench (TPCx-BB) benchmark, this work enriches the improved BigBench V2 with three new workloads and expands the coverage of machine learning algorithms. Our workloads utilize multiple algorithms and compare different implementations of the same algorithm across several popular libraries like MLlib, SystemML, Scikit-learn and Pandas, demonstrating the relevance and usability of our benchmark extension.

Updated: 2024-06-16 08:32:28

Categories: cs.LG

Download: http://arxiv.org/abs/2406.10843v1

Large Language Models for Automatic Milestone Detection in Group Discussions

Large language models like GPT have proven widely successful on natural language understanding tasks based on written text documents. In this paper, we investigate an LLM's performance on recordings of a group oral communication task in which utterances are often truncated or not well-formed. We propose a new group task experiment involving a puzzle with several milestones that can be achieved in any order. We investigate methods for processing transcripts to detect if, when, and by whom a milestone has been completed. We demonstrate that iteratively prompting GPT with transcription chunks outperforms semantic similarity search methods using text embeddings, and further discuss the quality and randomness of GPT responses under different context window sizes.

Updated: 2024-06-16 08:32:22

Categories: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2406.10842v1

Quantifying Multilingual Performance of Large Language Models Across Languages

The development of Large Language Models (LLMs) relies on extensive text corpora, which are often unevenly distributed across languages. This imbalance results in LLMs performing significantly better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. Currently, there is a lack of quantitative methods to evaluate the performance of LLMs in these low-resource languages. To address this gap, we propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. By comparing the LLM's internal representation of various languages against a baseline derived from English, we can assess the model's multilingual capabilities in a robust and language-agnostic manner. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores, underscoring the effectiveness of our metric in assessing language-specific capabilities. Besides, the experiments show that there is a strong correlation between the LLM's performance in different languages and the proportion of those languages in its pre-training corpus. These insights underscore the efficacy of the Language Ranker as a tool for evaluating LLM performance across different languages, particularly those with limited resources.

Updated: 2024-06-16 08:24:32

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2404.11553v2

CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph

Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of de novo molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at https://github.com/Edapinenut/CBGBench.

Updated: 2024-06-16 08:20:24

Categories: cs.LG,cs.AI,q-bio.BM

Download: http://arxiv.org/abs/2406.10840v1

Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impacts, revolutionizing how these complex problems are approached and solved in various domains including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models, such as GPT-4o, GPT-4, GPT-3.5Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, impacting the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorous evaluation of reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.

Updated: 2024-06-16 08:06:05

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.10834v1

Design and Optimization of Hierarchical Gradient Coding for Distributed Learning at Edge Devices

Edge computing has recently emerged as a promising paradigm to boost the performance of distributed learning by leveraging the distributed resources at edge nodes. Architecturally, the introduction of edge nodes adds an intermediate layer between the master and workers of the original distributed learning systems, potentially leading to a more severe straggler effect. Recently, coding theory-based approaches have been proposed for straggler mitigation in distributed learning, but the majority focus on the conventional workers-master architecture. In this paper, along a different line, we investigate the problem of mitigating the straggler effect in hierarchical distributed learning systems with an additional layer composed of edge nodes. Technically, we first derive the fundamental trade-off between the computational loads of workers and the straggler tolerance. Then, we propose a hierarchical gradient coding framework, which provides better straggler mitigation, to achieve the derived computational trade-off. To further improve the performance of our framework in heterogeneous scenarios, we formulate an optimization problem with the objective of minimizing the expected execution time for each iteration in the learning process. We develop an efficient algorithm that solves the problem and outputs the optimal strategy. Extensive simulation results demonstrate the superiority of our schemes compared with conventional solutions.

Updated: 2024-06-16 07:52:12

标题: 边缘设备上分布式学习的分层梯度编码设计与优化

摘要: 边缘计算最近作为一种有前途的范式出现，通过利用边缘节点的分布式资源来提高分布式学习的性能。从架构上讲，引入边缘节点在原始分布式学习系统中的主节点和工作者之间增加了一个额外的中间层，可能导致更严重的阻塞效应。最近，基于编码理论的方法已被提出用于在分布式学习中减轻阻塞效应，但大多数集中在传统的工作者-主节点架构上。在这篇论文中，我们沿着不同的方向，研究了在由边缘节点组成的额外层次中减轻阻塞效应的问题。从技术上讲，我们首先推导了工作者的计算负载和阻塞者容忍度之间的基本权衡。然后，我们提出了一个层次梯度编码框架，提供更好的阻塞者减轻，以实现所得到的计算权衡。为了进一步提高我们的框架在异构场景中的性能，我们制定了一个优化问题，目标是最小化学习过程中每次迭代的预期执行时间。我们开发了一个高效的算法来通过输出最佳策略来数学解决这个问题。广泛的模拟结果证明了我们的方案相对于传统解决方案的优越性。

更新时间: 2024-06-16 07:52:12

领域: cs.NI,cs.AI,cs.DC

下载: http://arxiv.org/abs/2406.10831v1
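
As a rough illustration of the straggler-tolerance idea behind the gradient coding entry above, the sketch below uses the classic flat (single-layer) three-worker code that tolerates one straggler. It is not the hierarchical scheme from the paper, whose construction the abstract does not give. The matrix `B`, the `DECODE` table, and all function names are illustrative assumptions.

```python
from itertools import combinations

# Encoding matrix B (assumed, classic 3-worker example): row w holds the
# coefficients worker w applies to the partial gradients g1, g2, g3.
B = [
    [0.5, 1.0, 0.0],   # worker 0 sends g1/2 + g2
    [0.0, 1.0, -1.0],  # worker 1 sends g2 - g3
    [0.5, 0.0, 1.0],   # worker 2 sends g1/2 + g3
]

# For every pair of surviving workers, the weights that recover g1 + g2 + g3.
DECODE = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}


def encode(partials):
    """Return the coded vector each worker would send to the master."""
    dim = len(partials[0])
    return [
        [sum(B[w][k] * partials[k][d] for k in range(3)) for d in range(dim)]
        for w in range(3)
    ]


def decode(coded, alive):
    """Recover the full gradient from the two workers listed in `alive`."""
    a, b = DECODE[alive]
    i, j = alive
    return [a * x + b * y for x, y in zip(coded[i], coded[j])]


# Toy partial gradients, one per data partition.
partials = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
full = [sum(g[d] for g in partials) for d in range(2)]
coded = encode(partials)

# Whichever single worker straggles, the other two suffice.
recovered = {alive: decode(coded, alive) for alive in combinations(range(3), 2)}
```

Each worker processes two of the three partitions, which is the price of tolerating one straggler. This is the kind of trade-off between computational load and straggler tolerance that the abstract refers to.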

Algorithm Selection for Optimal Multi-Agent Path Finding via Graph Embedding

Multi-agent path finding (MAPF) is the problem of finding paths for multiple agents such that they do not collide. This problem manifests in numerous real-world applications such as controlling transportation robots in automated warehouses, moving characters in video games, and coordinating self-driving cars at intersections. Finding optimal solutions to MAPF is NP-Hard, yet modern optimal solvers can scale to hundreds of agents and even thousands in some cases. Different solvers employ different approaches, and there is no single state-of-the-art approach for all problems. Furthermore, there are no clear, provable guidelines for choosing when to use each optimal MAPF solver. Prior work employed Algorithm Selection (AS) techniques to learn such guidelines from past data. A major challenge when employing AS for choosing an optimal MAPF algorithm is how to encode the given MAPF problem. Prior work used either hand-crafted features or an image representation of the problem. We explore graph-based encodings of the MAPF problem and show how they can be used on the fly with a modern graph embedding algorithm called FEATHER. Then, we show how this encoding can be effectively joined with existing encodings, resulting in a novel AS method we call MAPF Algorithm selection via Graph embedding (MAG). An extensive experimental evaluation of MAG on several MAPF algorithm selection tasks reveals that it is either on par with or significantly better than existing methods.

Updated: 2024-06-16 07:41:58

标题: 基于图嵌入的最优多智能体路径规划算法选择

摘要: 多智能体路径规划（MAPF）是找到多个智能体的路径，使它们不发生碰撞的问题。这个问题在许多现实世界的应用中都存在，比如控制自动仓库中的运输机器人、移动视频游戏中的角色以及在交叉口协调自动驾驶汽车。找到MAPF的最优解是NP难的，然而现代最优求解器可以扩展到数百个智能体，甚至在某些情况下可以扩展到数千个智能体。不同的求解器采用不同的方法，没有一种单一的最先进方法适用于所有问题。此外，目前没有清晰的、可证明的准则来选择何时使用每个最优的MAPF求解器。以前的研究采用算法选择（AS）技术来从过去的数据中学习这样的准则。在选择最优MAPF算法时使用AS的一个主要挑战是如何对给定的MAPF问题进行编码。以前的工作要么使用手工制作的特征，要么使用问题的图像表示。我们探索了MAPF问题的基于图的编码，并展示了如何使用现代图嵌入算法FEATHER来实时使用它们。然后，我们展示了如何有效地将这种编码与现有编码结合，从而产生一种新颖的AS方法，我们称之为通过图嵌入的MAPF算法选择（MAG）。对MAG在几个MAPF算法选择任务上的广泛实验评估表明，它要么与现有方法持平，要么明显优于现有方法。

更新时间: 2024-06-16 07:41:58

领域: cs.AI,68T20,I.2.8

下载: http://arxiv.org/abs/2406.10827v1
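
To make the "do not collide" requirement in the MAPF entry above concrete, here is a minimal sketch of a conflict checker under common MAPF conventions. These conventions are assumed here, not taken from the abstract: time is discrete, agents wait at their goal after arriving, and both vertex conflicts and swap (edge) conflicts count as collisions. The function names are hypothetical.

```python
def position(path, t):
    """Location of an agent at time t; the agent waits at its goal afterwards."""
    return path[t] if t < len(path) else path[-1]


def find_conflicts(paths):
    """Return a list of (kind, agent_i, agent_j, time) tuples."""
    conflicts = []
    horizon = max(len(p) for p in paths)
    for t in range(horizon):
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                a_now, b_now = position(paths[i], t), position(paths[j], t)
                if a_now == b_now:
                    # Two agents occupy the same vertex at the same time.
                    conflicts.append(("vertex", i, j, t))
                    continue
                if t + 1 < horizon:
                    a_next = position(paths[i], t + 1)
                    b_next = position(paths[j], t + 1)
                    if a_next == b_now and b_next == a_now:
                        # The agents swap places along the same edge.
                        conflicts.append(("edge", i, j, t))
    return conflicts


# Two agents trying to swap cells A and B collide on the connecting edge.
swap_conflicts = find_conflicts([["A", "B"], ["B", "A"]])
```

A solution is valid when `find_conflicts` returns an empty list. Optimal solvers search for the lowest-cost set of paths with that property.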

Mitigating Accuracy-Robustness Trade-off via Balanced Multi-Teacher Adversarial Distillation

Adversarial Training is a practical approach for improving the robustness of deep neural networks against adversarial attacks. Although it brings reliable robustness, Adversarial Training degrades performance on clean examples, which means a trade-off exists between accuracy and robustness. Recently, some studies have tried to use knowledge distillation methods in Adversarial Training, achieving competitive robustness, but the accuracy on clean samples remains limited. In this paper, to mitigate the accuracy-robustness trade-off, we introduce the Balanced Multi-Teacher Adversarial Robustness Distillation (B-MTARD) to guide the model's Adversarial Training process by applying a strong clean teacher and a strong robust teacher to handle the clean examples and adversarial examples, respectively. During the optimization process, to ensure that different teachers show similar knowledge scales, we design the Entropy-Based Balance algorithm to adjust the teachers' temperatures and keep the teachers' information entropy consistent. In addition, to ensure that the student has a relatively consistent learning speed from multiple teachers, we propose the Normalization Loss Balance algorithm to adjust the learning weights of different types of knowledge. A series of experiments conducted on three public datasets demonstrates that B-MTARD outperforms the state-of-the-art methods against various adversarial attacks.

Updated: 2024-06-16 07:14:23

标题: 通过平衡多教师对抗性蒸馏来减轻精度-鲁棒性权衡

摘要: 对抗训练是一种改善深度神经网络对抗性攻击鲁棒性的实用方法。尽管带来了可靠的鲁棒性，但对抗训练后对干净样本的性能受到了负面影响，这意味着准确性和鲁棒性之间存在一种权衡。最近，一些研究尝试在对抗训练中使用知识蒸馏方法，取得了竞争性的性能，提高了鲁棒性，但对干净样本的准确性仍然有限。在本文中，为了缓解准确性和鲁棒性之间的权衡，我们引入了平衡多教师对抗鲁棒性蒸馏（B-MTARD）来引导模型的对抗训练过程，通过应用一个强大的干净教师和一个强大的鲁棒教师来处理干净样本和对抗样本。在优化过程中，为了确保不同教师展示相似的知识规模，我们设计了基于熵的平衡算法来调整教师的温度，并保持教师的信息熵一致。此外，为了确保学生从多个教师中有相对一致的学习速度，我们提出了归一化损失平衡算法来调整不同类型知识的学习权重。在三个公共数据集上进行的一系列实验表明，B-MTARD在抵御各种对抗性攻击方面优于最先进的方法。

更新时间: 2024-06-16 07:14:23

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2306.16170v3
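
The B-MTARD entry above says teacher temperatures are adjusted so that the teachers' information entropy stays consistent. The sketch below shows only the underlying relationship: the entropy of a temperature-scaled softmax grows with temperature, so a temperature can be found for each teacher that hits a shared target entropy. Bisection is used here as a simple stand-in. The abstract does not specify the paper's actual update rule, and the logits and function names are made up for illustration.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def temperature_for_entropy(logits, target, lo=0.05, hi=50.0, iters=60):
    """Bisection search (an assumed stand-in) for the temperature whose
    softmax entropy equals `target`; relies on entropy rising with temperature."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical logits: a confident clean teacher and a flatter robust teacher.
clean_teacher_logits = [6.0, 1.0, 0.5, 0.0]
robust_teacher_logits = [2.0, 1.5, 1.0, 0.5]
target = 1.0  # shared target entropy in nats (the maximum for 4 classes is ln 4)

t_clean = temperature_for_entropy(clean_teacher_logits, target)
t_robust = temperature_for_entropy(robust_teacher_logits, target)
```

The more confident teacher needs the higher temperature to reach the same entropy, which is the sense in which temperature balances the "knowledge scale" of the two teachers.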

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the sparse GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.

Updated: 2024-06-16 06:56:53

标题: GUI-WORLD：用于基于GUI导向的多模态LLM代理的数据集

摘要: 最近，多模态大型语言模型（MLLMs）已被用作代理来直接感知图形用户界面（GUI）并生成相应的代码以控制键盘和鼠标输入。然而，当前的代理主要表现出在静态环境中具有出色的理解能力，并且主要应用于相对简单的领域，如Web或移动界面。我们认为，一个强大的GUI代理应该能够感知GUI上的时间信息，包括动态Web内容和多步任务。此外，它应该对各种GUI场景有全面的理解，包括桌面软件和多窗口交互。为此，本文介绍了一个新数据集，称为GUI-World，其中包含精心制作的Human-MLLM注释，广泛涵盖六种GUI场景和三种格式的八种GUI导向问题。我们评估了当前最先进的MLLMs（包括ImageLLMs和VideoLLMs）在理解各种类型的GUI内容，特别是动态和顺序内容方面的能力。我们的研究结果显示，ImageLLMs在没有手动注释的关键帧或操作历史的情况下很难处理动态GUI内容。另一方面，由于GUI视频数据集稀缺，VideoLLMs在所有GUI导向任务中表现不佳。基于GUI-World，我们采取了利用经过精调的VideoLLM作为GUI代理的初始步骤，展示了对各种GUI任务的改进理解。然而，由于基本LLMs性能的限制，我们得出结论称，将VideoLLMs用作GUI代理仍然是一个重大挑战。我们相信我们的工作为未来研究动态GUI内容理解提供了有价值的见解。代码和数据集可在我们的项目主页公开获取：https://gui-world.github.io/。

更新时间: 2024-06-16 06:56:53

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.10819v1

Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp

This article optimizes the inference performance of the Qwen-1.8B model by performing Int8 quantization, vectorizing some operators in llama.cpp, and modifying the compilation script to improve the compiler optimization level. On the Yitian 710 experimental platform, the prefill performance is increased by 1.6 times, the decoding performance is increased by 24 times, the memory usage is reduced to 1/5 of the original, and the accuracy loss is almost negligible.

Updated: 2024-06-16 06:46:25

标题: 基于Llama.cpp的Armv9架构通用大型语言模型推理性能优化

摘要: 本文通过执行Int8量化、向llama.cpp中的一些运算符进行向量化以及修改编译脚本以提高编译器优化级别，优化了Qwen-1.8B模型的推理性能。在Yitian 710实验平台上，填充性能提高了1.6倍，解码性能提高了24倍，内存使用量减少到原始的1/5，并且准确性损失几乎可以忽略不计。

更新时间: 2024-06-16 06:46:25

领域: cs.PL,cs.AI,cs.AR,cs.PF

下载: http://arxiv.org/abs/2406.10816v1
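
For readers unfamiliar with the Int8 quantization mentioned in the Armv9 entry above, the following is a minimal sketch of generic symmetric abs-max quantization. It is an assumption for illustration only. The abstract does not say which quantization format the authors use, and this is not presented as llama.cpp's internal scheme. Function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original floats."""
    return [q * scale for q in quantized]


weights = [0.12, -0.98, 0.33, 0.0, 0.71, -0.05]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest level bounds the per-value error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing one byte per weight plus a shared scale is what shrinks memory, and integer arithmetic on the quantized values is what vectorized kernels can exploit.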

DEEP-ICL: Definition-Enriched Experts for Language Model In-Context Learning

It has long been assumed that the sheer number of parameters in large language models (LLMs) drives in-context learning (ICL) capabilities, enabling remarkable performance improvements by leveraging task-specific demonstrations. Challenging this hypothesis, we introduce DEEP-ICL, a novel task Definition Enriched ExPert Ensembling methodology for ICL. DEEP-ICL explicitly extracts task definitions from given demonstrations and generates responses through learning task-specific examples. We argue that improvement from ICL does not directly rely on model size, but essentially stems from understanding task definitions and task-guided learning. Inspired by this, DEEP-ICL combines two 3B models with distinct roles (one for concluding task definitions and the other for learning task demonstrations) and achieves comparable performance to LLaMA2-13B. Furthermore, our framework outperforms conventional ICL by supporting unlimited demonstrations, thereby overcoming pretraining sequence length limitations. We contend that DEEP-ICL presents a novel alternative for achieving efficient few-shot learning, extending beyond the conventional ICL.

Updated: 2024-06-16 06:44:50

标题: DEEP-ICL：用于语言模型上下文学习的定义丰富的专家

摘要: 长期以来，人们一直认为大型语言模型（LLMs）中的参数数量是推动上下文学习（ICL）能力的关键，通过利用特定任务的演示实现了显著的性能提升。挑战这一假设，我们引入了DEEP-ICL，一种新颖的任务定义丰富的专家组合方法，用于ICL。DEEP-ICL明确从给定的演示中提取任务定义，并通过学习任务特定示例生成响应。我们认为，ICL的改进并不直接依赖于模型大小，而基本上源于对任务定义和任务引导学习的理解。受此启发，DEEP-ICL结合了两个具有不同作用的3B模型（一个用于总结任务定义，另一个用于学习任务演示），并实现了与LLaMA2-13B相当的性能。此外，我们的框架通过克服预训练序列长度限制，支持无限演示，优于传统的ICL。我们认为，DEEP-ICL提供了一种新颖的替代方案，实现了高效的少样本学习，超越了传统的ICL。

更新时间: 2024-06-16 06:44:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2403.04233v2

SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs

Large language models hold significant potential for integrating various data types, such as text documents and database records, for advanced analytics. However, blending text and numerical data presents substantial challenges. LLMs need to process and cross-reference entities and numbers, handle data inconsistencies and redundancies, and develop planning capabilities such as building a working memory for managing complex data queries. In this paper, we introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs. These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios such as new game rules, longer durations, and scrambled narratives, and asking them to analyze key statistics in game summaries. We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks. Our benchmark, SportsMetrics, introduces a new mechanism for assessing LLMs' numerical reasoning and fusion skills.

Updated: 2024-06-16 06:43:50

标题: SportsMetrics：将文本和数字数据融合以理解LLMs中的信息融合

摘要: 大型语言模型具有将各种数据类型（如文本文档和数据库记录）集成到高级分析中的重要潜力。然而，将文本和数字数据融合在一起面临着重大挑战。LLMs需要处理和交叉引用实体和数字，处理数据不一致性和冗余，并开发规划能力，如构建用于管理复杂数据查询的工作记忆。在本文中，我们介绍了围绕体育数据分析的四项新颖任务，以评估LLMs的数值推理和信息融合能力。这些任务涉及向LLMs提供详细的一场一场的体育比赛描述，然后挑战它们以对抗性情景，如新的比赛规则，更长的持续时间，混乱的叙述，并分析比赛摘要中的关键统计数据。我们对NBA和NFL比赛进行了大量实验，以评估LLMs在这些任务上的表现。我们的基准，SportsMetrics，引入了一种新机制来评估LLMs的数值推理和融合技能。

更新时间: 2024-06-16 06:43:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2402.10979v2

On the Effectiveness of Supervision in Asymmetric Non-Contrastive Learning

Supervised contrastive representation learning has been shown to be effective in various transfer learning scenarios. However, while asymmetric non-contrastive learning (ANCL) often outperforms its contrastive learning counterpart in self-supervised representation learning, the extension of ANCL to supervised scenarios is less explored. To bridge the gap, we study ANCL for supervised representation learning, coined SupSiam and SupBYOL, leveraging labels in ANCL to achieve better representations. The proposed supervised ANCL framework improves representation learning while avoiding collapse. Our analysis reveals that providing supervision to ANCL reduces intra-class variance, and the contribution of supervision should be adjusted to achieve the best performance. Experiments demonstrate the superiority of supervised ANCL across various datasets and tasks. The code is available at: https://github.com/JH-Oh-23/Sup-ANCL.

Updated: 2024-06-16 06:43:15

标题: 关于非对称非对比学习中监督有效性的研究

摘要: 监督对比表示学习已被证明在各种迁移学习场景中是有效的。然而，虽然非对称非对比学习（ANCL）在自监督表示学习中通常优于其对比学习对应物，但将ANCL扩展到监督场景的研究较少。为了弥合这一差距，我们研究了用于监督表示学习的ANCL，命名为SupSiam和SupBYOL，利用ANCL中的标签来实现更好的表示。所提出的监督ANCL框架改进了表示学习，同时避免了崩溃。我们的分析表明，对ANCL提供监督减少了类内方差，监督的贡献应该调整以达到最佳性能。实验表明，监督ANCL在各种数据集和任务中具有优越性。代码可在以下网址获取：https://github.com/JH-Oh-23/Sup-ANCL。

更新时间: 2024-06-16 06:43:15

领域: cs.LG,cs.AI,cs.CV,stat.ML

下载: http://arxiv.org/abs/2406.10815v1

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers, progressively consolidates within specific contexts, and ultimately focuses on critical tokens (a.k.a. massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC.

Updated: 2024-06-16 06:41:08

标题: 金字塔KV：基于金字塔信息导流的动态KV缓存压缩

摘要: 在这项研究中，我们调查了大型语言模型（LLMs）内部基于注意力的信息流是否通过显著模式来聚合，以进行长文本处理。我们的观察揭示了LLMs通过金字塔信息漏斗聚合信息的方式，其中注意力在较低层中广泛分散，逐渐在特定上下文中巩固，并最终集中在高层的关键标记（即大规模激活或关注点）。在这些见解的启发下，我们开发了PyramidKV，一种新颖有效的KV缓存压缩方法。该方法动态调整不同层的KV缓存大小，在较低层分配更多缓存，而在较高层分配更少，与保持统一KV缓存大小的传统方法有所不同。我们利用LongBench基准测试进行的实验评估显示，PyramidKV在仅保留12%的KV缓存的情况下与具有完整KV缓存的模型性能相匹配，从而显著减少了内存使用。在强调内存效率的场景中，仅保留0.7%的KV缓存时，PyramidKV超越了其他KV缓存压缩技术，在TREC上实现了高达20.5个绝对准确度的提高。

更新时间: 2024-06-16 06:41:08

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.02069v2
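
The layer-wise allocation described in the PyramidKV entry above (more cache in lower layers, less in higher ones) can be sketched as a simple budget split. The linear schedule, the `ratio` parameter, and the function name are assumptions made for illustration. The abstract does not state the exact allocation rule or how tokens are chosen within each layer.

```python
def pyramid_budgets(num_layers, total_budget, ratio=4.0):
    """Split `total_budget` cached tokens across layers (num_layers >= 2) so the
    lowest layer gets about `ratio` times the share of the highest layer,
    decreasing linearly in between (an assumed schedule)."""
    weights = [
        ratio - (ratio - 1.0) * layer / (num_layers - 1)
        for layer in range(num_layers)
    ]
    weight_sum = sum(weights)
    budgets = [int(total_budget * w / weight_sum) for w in weights]
    # Flooring leaves a few tokens unassigned; give them to the lowest layers.
    leftover = total_budget - sum(budgets)
    for layer in range(leftover):
        budgets[layer] += 1
    return budgets


# Same total as a uniform 64 tokens per layer over 32 layers, reshaped as a pyramid.
budgets = pyramid_budgets(num_layers=32, total_budget=32 * 64)
```

The total memory matches the uniform baseline. Only the distribution across layers changes, which is the contrast with uniform-size methods that the abstract draws.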

LLMFactor: Extracting Profitable Factors through Prompts for Explainable Stock Movement Prediction

Recently, Large Language Models (LLMs) have attracted significant attention for their exceptional performance across a broad range of tasks, particularly in text analysis. However, the finance sector presents a distinct challenge due to its dependence on time-series data for complex forecasting tasks. In this study, we introduce a novel framework called LLMFactor, which employs Sequential Knowledge-Guided Prompting (SKGP) to identify factors that influence stock movements using LLMs. Unlike previous methods that relied on keyphrases or sentiment analysis, this approach focuses on extracting factors more directly related to stock market dynamics, providing clear explanations for complex temporal changes. Our framework directs the LLMs to create background knowledge through a fill-in-the-blank strategy and then discerns potential factors affecting stock prices from related news. Guided by background knowledge and identified factors, we leverage historical stock prices in textual format to predict stock movement. An extensive evaluation of the LLMFactor framework across four benchmark datasets from both the U.S. and Chinese stock markets demonstrates its superiority over existing state-of-the-art methods and its effectiveness in financial time-series forecasting.

Updated: 2024-06-16 06:20:50

标题: LLM因子：通过提示提取有利因素，以解释股票运动预测

摘要: 最近，大型语言模型（LLMs）因其在广泛任务上的出色表现而受到了重视，特别是在文本分析领域。然而，金融领域由于对时间序列数据进行复杂预测任务的依赖性而面临着独特挑战。在本研究中，我们引入了一个名为LLMFactor的新框架，采用了顺序知识引导提示（SKGP）来识别影响股票走势的因素，利用LLMs。与先前依赖关键词或情感分析的方法不同，这种方法专注于提取与股市动态更直接相关的因素，为复杂的时间变化提供清晰的解释。我们的框架指导LLMs通过填空策略创建背景知识，然后从相关新闻中辨别可能影响股价的因素。在背景知识和确定的因素的指导下，我们利用文本格式的历史股价来预测股票走势。通过对来自美国和中国股市的四个基准数据集对LLMFactor框架进行广泛评估，结果显示其优于现有最先进的方法，并在金融时间序列预测中的有效性。

更新时间: 2024-06-16 06:20:50

领域: cs.CL,cs.AI,cs.CE

下载: http://arxiv.org/abs/2406.10811v1

Post-hoc Utterance Refining Method by Entity Mining for Faithful Knowledge Grounded Conversations

Despite the striking advances in recent language generation performance, model-generated responses have suffered from the chronic problem of hallucinations that are either untrue or unfaithful to a given source. Especially in the task of knowledge grounded conversation, the models are required to generate informative responses, but hallucinated utterances lead to miscommunication. In particular, entity-level hallucination that causes critical misinformation and undesirable conversation is one of the major concerns. To address this issue, we propose a post-hoc refinement method called REM. It aims to enhance the quality and faithfulness of hallucinated utterances by refining them based on the source knowledge. If the generated utterance has a low source-faithfulness score with the given knowledge, REM mines the key entities in the knowledge and implicitly uses them for refining the utterances. We verify that our method reduces entity hallucination in the utterance. Also, we show the adaptability and efficacy of REM with extensive experiments and generative results. Our code is available at https://github.com/YOONNAJANG/REM.

Updated: 2024-06-16 06:12:47

标题: 实体挖掘的后续话语细化方法用于基于忠实知识的对话

摘要: 尽管近年来语言生成性能取得了显著进展，模型生成的回应仍然存在慢性幻觉问题，这些幻觉要么不真实，要么与给定来源不忠实。特别是在基于知识的对话任务中，模型需要生成信息丰富的回应，但幻觉言论会导致误解。特别是导致重要错误信息和不良对话的实体级幻觉是一个主要问题。为了解决这个问题，我们提出了一种名为REM的后期精化方法。它旨在通过根据源知识对其进行精化来提高幻觉言论的质量和忠实度。如果生成的言论与给定知识的来源忠实度得分较低，REM会挖掘知识中的关键实体，并隐式地使用它们来精化言论。我们验证了我们的方法减少了言论中的实体幻觉。此外，我们通过大量实验和生成结果展示了REM的适应性和有效性。我们的代码可在https://github.com/YOONNAJANG/REM上找到。

更新时间: 2024-06-16 06:12:47

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.10809v1

Diffusion Model With Optimal Covariance Matching

The probabilistic diffusion model has become highly effective across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed full covariance moment matching technique and introduce a novel method for learning covariances. Unlike traditional data-driven covariance approximation approaches, our method involves directly regressing the optimal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This approach can significantly reduce the approximation error in covariance prediction. We demonstrate how our method can substantially enhance the sampling efficiency of both Markovian (DDPM) and non-Markovian (DDIM) diffusion model families.

Updated: 2024-06-16 05:47:12

标题: 具有最佳协方差匹配的扩散模型

摘要: 概率扩散模型在各个领域已经变得非常有效。通常，从扩散模型中抽样涉及使用一个以学习到的均值和固定或学习到的协方差为特征的高斯去噪分布。在本文中，我们利用最近提出的全协方差矩匹配技术，并引入一种学习协方差的新方法。与传统的数据驱动协方差逼近方法不同，我们的方法涉及直接回归最优解析协方差，使用一种名为Optimal Covariance Matching（OCM）的新的、无偏的目标。这种方法可以显著减少协方差预测中的近似误差。我们展示了我们的方法如何显著增强马尔科夫（DDPM）和非马尔科夫（DDIM）扩散模型族的抽样效率。

更新时间: 2024-06-16 05:47:12

领域: cs.LG

下载: http://arxiv.org/abs/2406.10808v1

Bayesian Networks and Machine Learning for COVID-19 Severity Explanation and Demographic Symptom Classification

Despite the prevailing efforts to combat the coronavirus disease 2019 (COVID-19) pandemic, there are still uncertainties about its spread, future impact, and resurgence. In this paper, we present a three-stage data-driven approach to distill the hidden information about COVID-19. The first stage employs a Bayesian network structure learning method to identify the causal relationships among COVID-19 symptoms and their intrinsic demographic variables. In the second stage, the output from the Bayesian network structure learning serves as a useful guide to train an unsupervised machine learning (ML) algorithm that uncovers the similarities in patients' symptoms through clustering. The final stage then leverages the labels obtained from clustering to train a demographic symptom identification (DSID) model which predicts a patient's symptom class and the corresponding demographic probability distribution. We applied our method to the COVID-19 dataset obtained from the Centers for Disease Control and Prevention (CDC) in the United States. Results from the experiments show a testing accuracy of 99.99%, compared with the 41.15% accuracy of a heuristic ML method. This strongly indicates the viability of our Bayesian network and ML approach in understanding the relationship between the virus symptoms, and in providing insights on patient stratification towards reducing the severity of the virus.

Updated: 2024-06-16 05:43:24

标题: 贝叶斯网络和机器学习用于COVID-19严重程度解释和人口症状分类

摘要: 随着当前努力应对新型冠状病毒病（COVID-19）大流行，关于其传播、未来影响和复发仍存在一些尚未发现的不确定性。本文介绍了一种三阶段的数据驱动方法，用于挖掘关于COVID-19的潜在信息。第一阶段采用贝叶斯网络结构学习方法，识别COVID-19症状与其固有人口统计变量之间的因果关系。作为第二阶段，来自贝叶斯网络结构学习的输出作为指导，训练一个无监督的机器学习（ML）算法，通过聚类揭示患者症状之间的相似性。最终阶段利用聚类获得的标签训练一个人口统计症状识别（DSID）模型，预测患者的症状类别和相应的人口统计概率分布。我们将该方法应用于美国疾病控制与预防中心（CDC）获取的COVID-19数据集。实验结果显示，我们的方法测试准确率为99.99％，而启发式ML方法的准确率为41.15％。这充分显示了我们的贝叶斯网络和ML方法在理解病毒症状之间的关系，并提供有关患者分层以减轻病毒严重程度的见解的可行性。

更新时间: 2024-06-16 05:43:24

领域: stat.ML,cs.AI,cs.LG,stat.AP

下载: http://arxiv.org/abs/2406.10807v1

Model Free Prediction with Uncertainty Assessment

Deep nonparametric regression, characterized by the utilization of deep neural networks to learn target functions, has emerged as a focus of research attention in recent years. Despite considerable progress in understanding convergence rates, the absence of asymptotic properties hinders rigorous statistical inference. To address this gap, we propose a novel framework that transforms the deep estimation paradigm into a platform conducive to conditional mean estimation, leveraging the conditional diffusion model. Theoretically, we develop an end-to-end convergence rate for the conditional diffusion model and establish the asymptotic normality of the generated samples. Consequently, we are equipped to construct confidence regions, facilitating robust statistical inference. Furthermore, through numerical experiments, we empirically validate the efficacy of our proposed methodology.

Updated: 2024-06-16 05:30:25

标题: 无模型预测及不确定性评估

摘要: 深度非参数回归，以利用深度神经网络学习目标函数为特征，近年来已成为研究关注的焦点。尽管在理解收敛速度方面取得了相当大的进展，但缺乏渐近性质阻碍了严格的统计推断。为了填补这一空白，我们提出了一个新颖的框架，将深度估计范式转化为有利于条件均值估计的平台，利用条件扩散模型。在理论上，我们为条件扩散模型开发了一个端到端的收敛速度，并建立了生成样本的渐近正态性。因此，我们能够构建置信区间，促进强大的统计推断。此外，通过数值实验，我们经验性地验证了我们提出的方法的有效性。

更新时间: 2024-06-16 05:30:25

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2405.12684v3
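
The model-free prediction entry above builds confidence regions from samples produced by a conditional diffusion model. As a loose illustration of that last step only, the sketch below forms a normal-approximation interval for a conditional mean from generated samples. `sample_conditional` is a hypothetical stand-in for a trained generator (it just adds Gaussian noise to sin(x)), and the interval formula is the textbook one rather than the paper's construction.

```python
import math
import random
import statistics


def sample_conditional(x, n, rng):
    """Hypothetical stand-in for a trained conditional generator: n draws of Y given x."""
    return [math.sin(x) + rng.gauss(0.0, 0.3) for _ in range(n)]


def mean_confidence_interval(samples, z=1.96):
    """Normal-approximation interval for the mean (z = 1.96 gives about 95%)."""
    n = len(samples)
    center = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return center - half_width, center + half_width


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = sample_conditional(1.0, 2000, rng)
low, high = mean_confidence_interval(samples)
```

The interval is only as trustworthy as the normality of the sample mean, which is why an asymptotic normality result of the kind the abstract mentions is needed before such intervals can be used for inference.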

ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language

Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including quality filters, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$ pretrained checkpoints and the finetuned MonoT5 rerankers on HuggingFace at https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0 and https://huggingface.co/collections/unicamp-dl/monoptt5-66653981877df3ea727f720d.

Updated: 2024-06-16 05:17:56

标题: ptt5-v2: 探究葡萄牙语T5模型持续预训练

摘要: 尽管自然语言处理（NLP）的进展和预训练模型的日益普及，英语仍然是模型开发的主要焦点。在语言特定语料库上继续预训练为将模型适应其他语言提供了实用解决方案。然而，不同预训练设置对下游任务的影响仍未得到充分探讨。本文介绍了$\texttt{ptt5-v2}$，研究了T5模型在葡萄牙语上的持续预训练。我们首先制定了一组基准设置，并使用高达3B参数的模型进行预训练。在三个葡萄牙语下游任务（assin2 STS、assin2 RTE和TweetSentBR）上微调，取得了后两项的SOTA结果。然后，我们探讨了不同预训练配置的影响，包括质量过滤器、优化策略和多轮预训练。令人惊讶的是，与我们的基准相比，它们的影响仍然微妙。我们在HuggingFace上发布了$\texttt{ptt5-v2}$预训练检查点和微调的MonoT5 rerankers，网址为https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0和https://huggingface.co/collections/unicamp-dl/monoptt5-66653981877df3ea727f720d。

更新时间: 2024-06-16 05:17:56

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2406.10806v1

Dynamic Byzantine-Robust Learning: Adapting to Switching Byzantine Workers

Byzantine-robust learning has emerged as a prominent fault-tolerant distributed machine learning framework. However, most techniques focus on the static setting, wherein the identity of Byzantine workers remains unchanged throughout the learning process. This assumption fails to capture real-world dynamic Byzantine behaviors, which may include intermittent malfunctions or targeted, time-limited attacks. Addressing this limitation, we propose DynaBRO -- a new method capable of withstanding any sub-linear number of identity changes across rounds. Specifically, when the number of such changes is $\mathcal{O}(\sqrt{T})$ (where $T$ is the total number of training rounds), DynaBRO nearly matches the state-of-the-art asymptotic convergence rate of the static setting. Our method utilizes a multi-level Monte Carlo (MLMC) gradient estimation technique applied at the server to robustly aggregated worker updates. By additionally leveraging an adaptive learning rate, we circumvent the need for prior knowledge of the fraction of Byzantine workers.

Updated: 2024-06-16 05:08:35

标题: 动态拜占庭鲁棒学习：适应切换的拜占庭工作者

摘要: 拜占庭强健学习已经成为一个突出的容错分布式机器学习框架。然而，大多数技术集中在静态设置上，即在学习过程中拜占庭工作者的身份保持不变。这种假设未能捕捉到现实世界中动态拜占庭行为，其中可能包括间歇性故障或有针对性的、时间有限的攻击。为了解决这一限制，我们提出了DynaBRO——一种新的方法，能够在每轮中承受任何次线性数量的身份变化。具体来说，当这些变化的数量为$\mathcal{O}(\sqrt{T})$（其中$T$是总训练轮数）时，DynaBRO几乎与静态设置的最先进渐近收敛速度相匹配。我们的方法利用了一种在服务器上应用的多级蒙特卡洛（MLMC）梯度估计技术，以强健地聚合工作者的更新。通过另外利用自适应学习率，我们规避了对拜占庭工作者比例的先验知识的需求。

更新时间: 2024-06-16 05:08:35

领域: cs.LG,cs.DC,stat.ML

下载: http://arxiv.org/abs/2402.02951v2

HiddenTables & PyQTax: A Cooperative Game and Dataset For TableQA to Ensure Scale and Data Privacy Across a Myriad of Taxonomies

A myriad of different Large Language Models (LLMs) face a common challenge in contextually analyzing table question-answering tasks. This challenge stems from (1) finite context windows for large tables, (2) multi-faceted discrepancies amongst tokenization patterns against cell boundaries, and (3) various limitations stemming from data confidentiality in the process of using external models such as gpt-3.5-turbo. We propose a cooperative game dubbed "HiddenTables" as a potential resolution to this challenge. In essence, "HiddenTables" is played between the code-generating LLM "Solver" and the "Oracle", which evaluates the ability of the LLM agents to solve Table QA tasks. This game is based on natural language schemas and, importantly, ensures the security of the underlying data. We provide evidential experiments on a diverse set of tables that demonstrate LLMs' collective inability to generalize and perform on complex queries, handle compositional dependencies, and align natural language to programmatic commands when concrete table schemas are provided. Unlike encoder-based models, "HiddenTables" is not limited by the number of rows, and we therefore exhibit improved efficiency in prompt and completion tokens. Our infrastructure has spawned a new dataset, "PyQTax", that spans 116,671 question-table-answer triplets and provides additional fine-grained breakdowns and labels for varying question taxonomies. Therefore, in tandem with our academic contributions regarding LLMs' deficiency in TableQA tasks, "HiddenTables" is a tangible manifestation of how LLMs can interact with massive datasets while ensuring data security and minimizing generation costs.

Updated: 2024-06-16 04:53:29

标题: HiddenTables & PyQTax：一个合作式游戏和数据集，用于确保跨多种分类体系的规模和数据隐私的TableQA

摘要: 大量不同的大型语言模型（LLMs）在上下文分析表格问答任务时面临着一个共同的挑战。这些挑战源自于（1）大表格的有限上下文窗口，（2）在单元格边界之间的多方面标记化模式之间的差异，以及（3）在使用外部模型（如gpt-3.5-turbo）过程中由数据保密性带来的各种限制。我们提出了一个名为“HiddenTables”的合作游戏，作为应对这一挑战的潜在解决方案。实质上，“HiddenTables”是在生成代码的LLM“Solver”和评估LLM代理解决表格问答任务能力的“Oracle”之间进行的游戏。这个游戏基于自然语言模式，并且重要的是确保底层数据的安全性。我们在各种表格上进行了实验，证明了LLM对于泛化和执行复杂查询，处理组合依赖关系以及将自然语言与程序命令对齐的共同无能。与基于编码器的模型不同，我们已经推动了“HiddenTables”的边界，不受行数限制 - 因此我们在提示和完成标记的效率方面有所改善。我们的基础设施产生了一个新的数据集“PyQTax”，涵盖了116,671个问题-表格-答案三元组，并为不同的问题分类提供了额外的细分和标签。因此，与我们关于LLMs在表格问答任务中不足的学术贡献相呼应，“HiddenTables”是LLMs如何与大型数据集互动，同时确保数据安全性和最小化生成成本的有形表现。

更新时间: 2024-06-16 04:53:29

领域: cs.AI,cs.CL,cs.IR

下载: http://arxiv.org/abs/2406.10803v1

KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs

Existing frameworks for assessing robustness of large language models (LLMs) overly depend on specific benchmarks, increasing costs and failing to evaluate performance of LLMs in professional domains due to dataset limitations. This paper proposes a framework that systematically evaluates the robustness of LLMs under adversarial attack scenarios by leveraging knowledge graphs (KGs). Our framework generates original prompts from the triplets of knowledge graphs and creates adversarial prompts by poisoning, assessing the robustness of LLMs through the results of these adversarial attacks. We systematically evaluate the effectiveness of this framework and its modules. Experiments show that adversarial robustness of the ChatGPT family ranks as GPT-4-turbo > GPT-4o > GPT-3.5-turbo, and the robustness of large language models is influenced by the professional domains in which they operate.

Updated: 2024-06-16 04:48:43

标题: KGPA：通过跨领域知识图对大型语言模型进行鲁棒性评估

摘要: 现有的用于评估大型语言模型（LLMs）鲁棒性的框架过度依赖特定的基准测试，增加了成本并未能评估LLMs在专业领域中的表现，这是由于数据集的限制。本文提出了一个框架，通过利用知识图谱（KGs）系统地评估LLMs在对抗攻击场景下的鲁棒性。我们的框架从知识图谱的三元组中生成原始提示，并通过毒化生成对抗性提示，通过这些对抗性攻击的结果来评估LLMs的鲁棒性。我们系统地评估了这个框架及其模块的有效性。实验证明，ChatGPT系列的对抗性鲁棒性排名为GPT-4-turbo > GPT-4o > GPT-3.5-turbo，大型语言模型的鲁棒性受其运行的专业领域的影响。

更新时间: 2024-06-16 04:48:43

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.10802v1

DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering

Retrieval-Augmented Generation (RAG) has recently demonstrated strong performance for Large Language Models (LLMs) in knowledge-intensive tasks such as Question-Answering (QA). RAG expands the query context by incorporating external knowledge bases to enhance response accuracy. However, it is inefficient to access LLMs multiple times for each query, and unreliable to retrieve all the relevant documents with a single query. We have found that even when some critical documents have low relevance to the query, the remaining documents can be retrieved by combining parts of the already retrieved documents with the query. To mine this relevance, we propose a two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG), which improves document retrieval recall and answer accuracy while maintaining efficiency. Additionally, a compact classifier is applied to two different selection strategies to determine the contribution of the retrieved documents to answering the query and to retrieve the relatively relevant documents. Meanwhile, DR-RAG calls the LLM only once, which significantly improves efficiency. Experimental results on multi-hop QA datasets show that DR-RAG significantly improves the accuracy of the answers and achieves new progress in QA systems.

Updated: 2024-06-16 04:33:17

标题: DR-RAG：将动态文档相关性应用于检索增强生成，用于问答

摘要: 检索增强生成（RAG）最近展示了大型语言模型（LLMs）在知识密集型任务（如问答）中的性能。RAG通过整合外部知识库扩展查询上下文，以提高响应准确性。然而，对于每个查询多次访问LLMs会效率低下，而通过单个查询检索所有相关文档也不可靠。我们发现，即使一些关键文档与查询之间的相关性较低，仍然可以通过将文档的部分内容与查询结合来检索其余文档。为了挖掘相关性，提出了一个名为动态相关检索增强生成（DR-RAG）的两阶段检索框架，以提高文档检索召回率和答案准确性，同时保持效率。此外，应用一个紧凑的分类器到两种不同的选择策略中，以确定检索文档对回答查询的贡献，并检索相对相关的文档。与此同时，DR-RAG仅调用LLMs一次，显著提高了实验的效率。在多跳QA数据集上的实验结果表明，DR-RAG可以显著提高答案的准确性，并在QA系统中取得新的进展。

更新时间: 2024-06-16 04:33:17

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2406.07348v3
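
The two-stage idea in the DR-RAG abstract above (retrieve with the query, then retrieve again with the query combined with each first-stage document) can be sketched as follows. This is a toy illustration under stated assumptions: the token-overlap `score`, the stopword list, and the sample documents are hypothetical, and the paper's classifier-based selection step is not modeled.

```python
STOPWORDS = {"the", "of", "was", "is", "in", "by"}


def tokens(text):
    """Lowercase content words of a text."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def score(query, doc):
    """Toy relevance: number of shared content words (an assumption)."""
    return len(tokens(query) & tokens(doc))


def retrieve(query, docs, k, exclude=()):
    """Return the k highest-scoring documents not listed in `exclude`."""
    candidates = [d for d in docs if d not in exclude]
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:k]


def two_stage_retrieve(query, docs, k1=1, k2=1):
    """Stage 1 uses the query alone; stage 2 uses query + each stage-1 document."""
    found = retrieve(query, docs, k1)
    for doc in list(found):
        found.extend(retrieve(query + " " + doc, docs, k2, exclude=found))
    return found


DOCS = [
    "Paris is the capital of France",
    "Inception was directed by Christopher Nolan",
    "Christopher Nolan was born in London",
]
QUERY = "Which city is the birthplace of the director of Inception"

single = retrieve(QUERY, DOCS, 2)      # query-only retrieval
hits = two_stage_retrieve(QUERY, DOCS)  # query, then query + document
print(single)
print(hits)
```

In this toy setup the birthplace document shares no content word with the query, so query-only retrieval misses it, while the second stage finds it through the names carried over from the first document.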

NOD-TAMP: Generalizable Long-Horizon Planning with Neural Object Descriptors

Solving complex manipulation tasks in household and factory settings remains challenging due to long-horizon reasoning, fine-grained interactions, and broad object and scene diversity. Learning skills from demonstrations can be an effective strategy, but such methods often have limited generalizability beyond training data and struggle to solve long-horizon tasks. To overcome this, we propose to synergistically combine two paradigms: Neural Object Descriptors (NODs) that produce generalizable object-centric features and Task and Motion Planning (TAMP) frameworks that chain short-horizon skills to solve multi-step tasks. We introduce NOD-TAMP, a TAMP-based framework that extracts short manipulation trajectories from a handful of human demonstrations, adapts these trajectories using NOD features, and composes them to solve broad long-horizon, contact-rich tasks. NOD-TAMP solves existing manipulation benchmarks with a handful of demonstrations and significantly outperforms prior NOD-based approaches on new tabletop manipulation tasks that require diverse generalization. Finally, we deploy NOD-TAMP on a number of real-world tasks, including tool-use and high-precision insertion. For more details, please visit https://sites.google.com/view/nod-tamp/.

Updated: 2024-06-16 04:25:46

标题: NOD-TAMP：具有神经对象描述符的通用长视程规划

摘要: 在家庭和工厂环境中解决复杂的操作任务仍然具有挑战性，原因是需要长期推理、精细交互以及广泛的物体和场景多样性。从演示中学习技能可以是一种有效的策略，但这种方法通常在训练数据之外具有有限的泛化能力，并且难以解决长期任务。为了克服这一问题，我们提出了将两种方法结合起来的方法：产生可泛化物体中心特征的神经目标描述符（NODs）和链式短期技能解决多步任务的任务和动作规划（TAMP）框架。我们介绍了NOD-TAMP，这是一个基于TAMP的框架，从少数人类演示中提取短期操纵轨迹，利用NOD特征调整这些轨迹，并将它们组合起来解决广泛的长期、接触丰富的任务。NOD-TAMP可以通过少数演示解决现有的操作基准，并且在需要多样化泛化的新桌面操作任务上明显优于先前基于NOD的方法。最后，我们将NOD-TAMP部署到多个现实世界任务中，包括工具使用和高精度插入。有关更多详细信息，请访问https://sites.google.com/view/nod-tamp/。

更新时间: 2024-06-16 04:25:46

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2311.01530v2

How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?

Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite the empirical success, the mechanics of how to train a Transformer to achieve ICL and the corresponding ICL capacity is mostly elusive due to the technical challenges of analyzing the nonconvex training problems resulting from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments.

Updated: 2024-06-16 04:02:28

标题: 非线性变压器如何在上下文学习中学习和泛化？

摘要: 基于Transformer的大型语言模型展现出令人印象深刻的上下文学习能力，预训练模型可以处理新任务，而无需通过简单地将查询与该任务的一些输入-输出示例相结合进行微调。尽管在实证方面取得成功，但由于分析由Transformer中的非线性自注意力和非线性激活导致的非凸训练问题的技术挑战，如何训练Transformer实现ICL以及相应的ICL容量大多仍然难以解释。据我们所知，本文首次提供了对具有非线性自注意力和非线性MLP的Transformer训练动态以及生成模型的ICL概括能力的理论分析。我们专注于一组二元分类任务，使用这些任务的子集数据训练Transformer，并量化各种因素对剩余未见任务上的ICL概括性能的影响，包括数据分布转移和不转移的情况。我们还分析了学习Transformer中不同组件如何影响ICL性能。此外，我们首次提供了有关模型修剪如何影响ICL性能的理论分析，并证明适当基于幅度的修剪可以最小化对ICL的影响，同时降低推理成本。这些理论发现通过数值实验得到了证明。

更新时间: 2024-06-16 04:02:28

领域: cs.LG

下载: http://arxiv.org/abs/2402.15607v3
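
For readers unfamiliar with the magnitude-based pruning mentioned in the abstract above, a minimal sketch follows. It shows the generic technique only (zeroing the weights with the smallest absolute values); the function name, the flat weight list, and the pruning fraction are assumptions for illustration, and nothing here reproduces the paper's analysis or its specific pruning rule.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


pruned = magnitude_prune([0.9, -0.05, 0.3, -0.7, 0.01, 0.2], fraction=0.5)
print(pruned)
```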

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can lead to major memory bottlenecks as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing KV cache size by a factor of 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv

Updated: 2024-06-16 03:57:51

标题: MLKV：用于内存高效Transformer解码的多层键值头

摘要: Transformers的自回归推断极大受益于键-值（KV）缓存，但随着模型尺寸、批量大小和序列长度的增长，可能会导致重大内存瓶颈。我们引入了多层键-值（MLKV）共享，这是一种新颖的方法，将KV共享扩展到变压器层，以减少内存使用量，超出了多查询注意力（MQA）和分组查询注意力（GQA）所能实现的范围。在各种自然语言处理基准和推断指标上进行评估，使用经过训练的Pythia-160M变种，证明MLKV显著减少了内存使用量，同时性能损失最小，将KV缓存大小降至MQA的6倍。这些结果突显了MLKV在大规模部署变压器模型方面的潜力。我们在https://github.com/zaydzuhri/pythia-mlkv提供代码。

更新时间: 2024-06-16 03:57:51

领域: cs.LG

下载: http://arxiv.org/abs/2406.09297v2
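
The memory argument in the MLKV abstract above comes down to simple arithmetic on the KV cache shape. The sketch below assumes a Pythia-160M-like configuration (12 layers, 12 attention heads, head dimension 64), a 2048-token sequence, 2-byte values, and an MLKV variant that keeps 2 layers of single KV heads. That last setting is a hypothetical choice made here because it reproduces a 6x ratio against MQA; the abstract does not state which configuration yields that figure.

```python
def kv_cache_bytes(kv_layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Size of the KV cache; the leading factor 2 covers keys and values."""
    return 2 * kv_layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


shape = dict(head_dim=64, seq_len=2048, batch=1)  # assumed shape

mha = kv_cache_bytes(kv_layers=12, kv_heads=12, **shape)  # one KV head per query head
mqa = kv_cache_bytes(kv_layers=12, kv_heads=1, **shape)   # one KV head per layer
mlkv = kv_cache_bytes(kv_layers=2, kv_heads=1, **shape)   # KV heads shared across layers

print(mha, mqa, mlkv)
print(mha // mqa, mqa // mlkv)
```

MQA can only shrink the per-layer head count to one, so sharing across layers is the remaining axis along which the cache can get smaller.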

Federated Learning Optimization: A Comparative Study of Data and Model Exchange Strategies in Dynamic Networks

The promise and proliferation of large-scale dynamic federated learning gives rise to a prominent open question: is it prudent to share data or models across nodes if efficiency of transmission and fast knowledge transfer are the prime objectives? This work investigates exactly that. Specifically, we study the choices of exchanging raw data, synthetic data, or (partial) model updates among devices. The implications of these strategies in the context of foundational models are also examined in detail. Accordingly, we obtain key insights about optimal data and model exchange mechanisms considering various environments with different data distributions and dynamic device and network connections. Across the various scenarios that we considered, time-limited knowledge transfer efficiency can differ by up to 9.08%, thus highlighting the importance of this work.

Updated: 2024-06-16 03:46:23

标题: 《联邦学习优化：动态网络中数据和模型交换策略的比较研究》

摘要: 大规模动态联邦学习的承诺和普及引发了一个突出的开放问题 - 如果传输效率和快速知识传递是主要目标，那么是否明智地在节点之间共享数据或模型。这项工作正是在研究这个问题。具体来说，我们研究了在设备之间交换原始数据、合成数据或（部分）模型更新的选择。我们还详细研究了在基础模型的背景下这些策略的影响。因此，我们获得了关于在考虑不同数据分布和动态设备和网络连接的各种环境下考虑最佳数据和模型交换机制的关键见解。在我们考虑的各种场景中，有时间限制的知识传递效率可以相差高达9.08%，因此突显了这项工作的重要性。

更新时间: 2024-06-16 03:46:23

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2406.10798v1

Diffusion Models Are Promising for Ab Initio Structure Solutions from Nanocrystalline Powder Diffraction Data

A major challenge in materials science is the determination of the structure of nanometer-sized objects. Here we present a novel approach that uses a generative machine learning model based on a diffusion model trained on 45,229 known structures. The model factors in both the measured diffraction pattern and relevant statistical priors on the unit cell of atomic cluster structures. Conditioned only on the chemical formula and the information-scarce finite-size broadened powder diffraction pattern, we find that our model, PXRDnet, can successfully solve simulated nanocrystals as small as 10 angstroms across 200 materials of varying symmetry and complexity, including structures from all seven crystal systems. We show that our model can determine structural solutions with up to 81.5% accuracy, as measured by structural correlation. Furthermore, PXRDnet is capable of solving structures from noisy diffraction patterns gathered in real-world experiments. We suggest that data-driven approaches, bootstrapped from theoretical simulation, will ultimately provide a path towards determining the structure of previously unsolved nano-materials.

Updated: 2024-06-16 03:45:03

标题: 扩散模型对纳米晶粉末衍射数据的从头算结构解决具有潜力

摘要: 材料科学中的一个主要挑战是确定纳米级物体的结构。在这里，我们提出了一种新颖的方法，该方法使用基于扩散模型的生成式机器学习模型，该模型在45229个已知结构上进行了训练。该模型同时考虑了测量的衍射图案以及原子簇结构的晶胞相关统计先验。仅根据化学式和信息稀缺的有限尺寸加宽的粉末衍射图案，我们发现我们的模型PXRDnet可以成功解决模拟的晶体，尺寸小至10埃跨越200种不同对称性和复杂性的材料，包括来自所有七个晶体系的结构。我们展示了我们的模型可以以高达81.5%的准确度确定结构解决方案，通过结构相关性测量。此外，PXRDnet能够从现实实验中收集的噪声衍射图案中解决结构。我们建议，从理论模拟中启发的数据驱动方法最终将提供一条路径，以确定以前未解决的纳米材料的结构。

更新时间: 2024-06-16 03:45:03

领域: physics.comp-ph,cond-mat.mes-hall,cs.AI,cs.LG

下载: http://arxiv.org/abs/2406.10796v1

Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions

Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning. Compared with policy gradient methods, policy learning in RCPs is simpler since it is based on supervised learning, and unlike value-based methods, it does not require optimization in the action space to take actions. However, for multi-armed bandit (MAB) problems, we find that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods such as the upper confidence bound and Thompson sampling. In this work, we show that the performance of RCPs can be enhanced by constructing policies through the marginalization of rewards using normalized weight functions, whose sum or integral equals 1, although the function values may be negative. We refer to this technique as generalized marginalization, whose advantage is that negative weights for policies conditioned on low rewards can make the resulting policies more distinct from them. Strategies to perform generalized marginalization in MAB with discrete action spaces are studied. Through simulations, we demonstrate that the proposed technique improves RCPs and makes them competitive with classic methods, showing superior performance on challenging MABs with large action spaces and sparse reward signals.

Updated: 2024-06-16 03:43:55

标题: 使用归一化权重函数改进多臂老虎机的奖励条件策略

摘要: 最近提出的奖励条件策略（RCPs）在强化学习中提供了一种有吸引力的选择。与策略梯度方法相比，RCPs中的策略学习更简单，因为它是基于监督学习的，而不像基于值的方法那样需要在动作空间中进行优化以采取行动。然而，对于多臂老虎机（MAB）问题，我们发现与经典方法（如上置信边界和汤普森抽样）相比，RCPs收敛速度较慢，并且在收敛时具有较低的期望奖励。在本文中，我们展示了通过使用归一化权重函数通过奖励边际化来构建策略，其总和或积分等于1，尽管函数值可能为负。我们将这种技术称为广义边际化，其优势在于对低奖励条件策略的负权重可以使生成的策略与它们更加不同。研究了在具有离散动作空间的MAB中执行广义边际化的策略。通过模拟，我们证明了提出的技术改进了RCPs，并使它们与经典方法竞争，并在具有大动作空间和稀疏奖励信号的具有挑战性的MAB上表现出卓越的性能。

更新时间: 2024-06-16 03:43:55

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.10795v1

Diffusion Actor-Critic with Entropy Regulator

Reinforcement learning (RL) has proven highly effective in addressing complex decision-making and control tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution with learned mean and variance, which constrains their capability to acquire complex policies. In response to this problem, we propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER). This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function and leverages the capability of the diffusion model to fit multimodal distributions, thereby enhancing the representational capacity of the policy. Since the distribution of the diffusion policy lacks an analytical expression, its entropy cannot be determined analytically. To mitigate this, we propose a method to estimate the entropy of the diffusion policy using a Gaussian mixture model. Building on the estimated entropy, we can learn a parameter α that modulates the degree of exploration and exploitation. The parameter α is used to adaptively regulate the variance of the noise added to the action output by the diffusion model. Experimental trials on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting a stronger representational capacity of the diffusion policy.

Updated: 2024-06-16 03:14:47

标题: 扩散演员-评论家与熵调节器

摘要: 强化学习（RL）在解决复杂的决策和控制任务方面已被证明非常有效。然而，在大多数传统的RL算法中，策略通常被参数化为具有学习均值和方差的对角高斯分布，这限制了它们获取复杂策略的能力。针对这个问题，我们提出了一种名为带熵调节器的扩散演员-评论家在线RL算法（DACER）。该算法将扩散模型的反向过程概念化为一种新颖的策略函数，并利用扩散模型拟合多模态分布的能力，从而增强了策略的表征能力。由于扩散策略的分布缺乏解析表达式，其熵无法通过解析确定。为了缓解这一问题，我们提出了一种利用高斯混合模型估计扩散策略熵的方法。基于估计的熵，我们可以学习一个调节探索和利用程度的参数α。参数α将被用于自适应调节添加到扩散模型输出动作的噪声的方差。在MuJoCo基准测试和一个多模态任务上的实验试验表明，DACER算法在大多数MuJoCo控制任务中取得了最先进的性能，同时展示了扩散策略更强的表征能力。

更新时间: 2024-06-16 03:14:47

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2405.15177v3

Evidential Uncertainty Sets in Deep Classifiers Using Conformal Prediction

In this paper, we propose the Evidential Conformal Prediction (ECP) method for image classifiers to generate conformal prediction sets. Our method is designed around a non-conformity score function that has its roots in Evidential Deep Learning (EDL) as a method of quantifying model (epistemic) uncertainty in DNN classifiers. We use evidence that is derived from the logit values of target labels to compute the components of our non-conformity score function: the heuristic notion of uncertainty in CP, uncertainty surprisal, and expected utility. Our extensive experimental evaluation demonstrates that ECP outperforms three state-of-the-art methods for generating CP sets in terms of set size and adaptivity, while maintaining the coverage of true labels.

Updated: 2024-06-16 03:00:16

标题: 深度分类器中使用符合预测的证据不确定性集合

摘要: 在本文中，我们提出了一种证据符合预测（ECP）方法，用于生成图像分类器的符合预测集。我们的方法基于一种非符合分数函数设计，该函数源自证据深度学习（EDL），作为量化DNN分类器中模型（认识论）不确定性的方法。我们使用从目标标签的logit值导出的证据来计算我们的非符合分数函数的组成部分：符合预测中的不确定性的启发式概念，不确定性惊讶和期望效用。我们进行了广泛的实验评估，结果表明ECP在生成符合预测集方面优于三种最先进的方法，同时保持真实标签的覆盖范围。

更新时间: 2024-06-16 03:00:16

领域: cs.LG,cs.AI,cs.CV,stat.ML

下载: http://arxiv.org/abs/2406.10787v1
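
As background for the ECP abstract above, the sketch below shows generic split conformal prediction: calibrate a threshold on held-out non-conformity scores, then include every label whose score falls under it. The score used here, one minus the predicted probability, is a common textbook choice and is an assumption of this sketch; it is not the evidential score the paper proposes. The calibration values and probabilities are made up for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration scores for target coverage 1 - alpha."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Labels whose non-conformity score (1 - probability) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical scores of the true label on nine calibration examples.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
q = conformal_threshold(cal_scores, alpha=0.25)
labels = prediction_set([0.55, 0.35, 0.10], q)
print(q, labels)
```

Swapping in a different score function changes the size and adaptivity of the sets while the coverage argument stays the same, which is the axis along which the abstract compares methods.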

Generative AI and Digital Neocolonialism in Global Education: Towards an Equitable Framework

This paper critically discusses how generative artificial intelligence (GenAI) might impose Western ideologies on non-Western societies, perpetuating digital neocolonialism in education through its inherent biases. It further suggests strategies for local and global stakeholders to mitigate these effects. Our discussions demonstrated that GenAI can foster cultural imperialism by generating content that primarily incorporates cultural references and examples relevant to Western students, thereby alienating students from non-Western backgrounds. Also, the predominant use of Western languages by GenAI can marginalize non-dominant languages, making educational content less accessible to speakers of indigenous languages and potentially impacting their ability to learn in their first language. Additionally, GenAI often generates content and curricula that reflect the perspectives of technologically dominant countries, overshadowing marginalized indigenous knowledge and practices. Moreover, the cost of access to GenAI intensifies educational inequality and the control of GenAI data could lead to commercial exploitation without benefiting local students and their communities. We propose human-centric reforms to prioritize cultural diversity and equity in GenAI development; a liberatory design to empower educators and students to identify and dismantle the oppressive structures within GenAI applications; foresight by design to create an adjustable GenAI system to meet future educational needs; and finally, effective prompting skills to reduce the retrieval of neocolonial outputs.

Updated: 2024-06-16 02:57:15

标题: 生成式人工智能和数字新殖民主义在全球教育中：走向公平框架

摘要: 这篇论文批判性地讨论了生成式人工智能（GenAI）如何可能将西方意识形态强加给非西方社会，通过其固有的偏见在教育领域延续数字新殖民主义。它进一步提出了本地和全球利益相关者减轻这些影响的策略。我们的讨论表明，GenAI可能通过生成主要包含西方学生相关文化参考和例子的内容来促进文化帝国主义，从而使非西方背景的学生感到疏远。此外，GenAI主要使用西方语言可能边缘化非主流语言，使母语为土著语言的人难以获得教育内容，可能影响他们在第一语言中学习的能力。此外，GenAI经常生成反映技术主导国家观点的内容和课程，掩盖了边缘化的土著知识和实践。此外，GenAI的获取成本加剧了教育不平等，GenAI数据的控制可能导致商业利用，而不会使当地学生及其社区受益。我们提出了以人为中心的改革，优先考虑文化多样性和公平性在GenAI的发展中；解放性设计，赋予教育者和学生识别和解体GenAI应用中压迫结构的能力；设计前瞻性，创建一个可调整的GenAI系统以满足未来教育需求；最后，有效促使技能以减少新殖民主义产出的检索。

更新时间: 2024-06-16 02:57:15

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2406.02966v3

Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

Current LLM evaluation predominantly performs evaluation with prompts comprising single problems. We propose multi-problem evaluation as an additional approach to study the multiple problem handling capabilities of LLMs. We present a systematic study in this regard by comprehensively examining 7 LLMs on 4 related types of tasks constructed from 6 classification benchmarks. The 4 task types include traditional single-problem tasks, homogeneous multi-problem tasks, and two index selection tasks that embed the multi-problem tasks. We find that LLMs are competent multi-problem solvers: they generally perform (nearly) as well on multi-problem tasks as on single-problem tasks. Furthermore, contrary to common expectation, they often do not suffer from a positional bias with long inputs. This makes multi-problem prompting a simple and cost-efficient prompting method of practical significance. However, our results also strongly indicate that LLMs lack true understanding: they perform significantly worse in the two index selection tasks than in the multi-problem task under various evaluation settings, although they can indeed do index selection in general.

Updated: 2024-06-16 02:52:32

标题: 一次评估具有多个问题的LLMs：探究LLM能力的新范式

摘要: 目前，大多数LLM评估主要通过包含单个问题的提示来进行评估。我们提出了多问题评估作为研究LLM多问题处理能力的另一种方法。我们通过全面检查7个LLM在构建自6个分类基准的4种相关任务上的表现来进行这方面的系统研究。这4种任务类型包括传统的单问题任务、同质多问题任务，以及嵌入多问题任务的两个指标选择任务。我们发现LLM在解决多问题时表现出色：它们通常在多问题任务上表现得几乎和在单问题任务上一样好。此外，与常见期望相反，它们通常不会在长输入上受到位置偏见。这使得多问题提示成为一种简单且具有实际意义的成本效益高的提示方法。然而，我们的结果也强烈表明LLM缺乏真正的理解：在各种评估设置下，它们在两个指标选择任务中的表现明显不如在多问题任务中，尽管它们确实可以进行指标选择。

更新时间: 2024-06-16 02:52:32

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.10786v1

ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation

This study introduces an approach to optimize Parameter Efficient Fine Tuning (PEFT) for Pretrained Language Models (PLMs) by implementing a Shared Low Rank Adaptation (ShareLoRA). By strategically deploying ShareLoRA across different layers and adapting it for the Query, Key, and Value components of self-attention layers, we achieve a substantial reduction in the number of training parameters and memory usage. Importantly, ShareLoRA not only maintains model performance but also exhibits robustness in both classification and generation tasks across a variety of models, including RoBERTa, GPT-2, LLaMA and LLaMA2. It demonstrates superior transfer learning capabilities compared to standard LoRA applications and mitigates overfitting by sharing weights across layers. Our findings affirm that ShareLoRA effectively boosts parameter efficiency while ensuring scalable and high-quality performance across different language model architectures.

Updated: 2024-06-16 02:52:28

标题: ShareLoRA：通过共享低秩适应实现参数高效和稳健的大型语言模型微调

摘要: 这项研究介绍了一种优化预训练语言模型（PLMs）的参数高效微调（PEFT）的方法，通过实施共享低秩适应（ShareLoRA）。通过在不同层中策略性地部署ShareLoRA，并对自注意力层的查询、键和值组件进行调整，我们实现了训练参数和内存使用量的大幅减少。重要的是，ShareLoRA不仅保持了模型性能，在分类和生成任务中表现出鲁棒性，适用于各种模型，包括RoBERTa、GPT-2、LLaMA和LLaMA2。与标准LoRA应用相比，它展现出卓越的迁移学习能力，并通过在层之间共享权重来减轻过拟合问题。我们的研究结果证实，ShareLoRA有效地提升了参数效率，同时确保在不同语言模型架构中实现可扩展和高质量的性能。

更新时间: 2024-06-16 02:52:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.10785v1
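
To make the parameter saving in the ShareLoRA abstract above concrete, here is a rough count under stated assumptions. The abstract does not say which low-rank matrices are shared, so this sketch assumes one possible arrangement: a single down-projection A (rank x d_in) shared by all layers and a separate up-projection B (d_out x rank) per layer. The dimensions (768, rank 8, 12 layers) are illustrative, and both function names are hypothetical.

```python
def lora_params(d_in, d_out, rank, layers):
    """Standard LoRA: every layer owns its own A (rank x d_in) and B (d_out x rank)."""
    return layers * rank * (d_in + d_out)


def shared_a_params(d_in, d_out, rank, layers):
    """Assumed sharing scheme: one A for all layers, one B per layer."""
    return rank * d_in + layers * rank * d_out


standard = lora_params(d_in=768, d_out=768, rank=8, layers=12)
shared = shared_a_params(d_in=768, d_out=768, rank=8, layers=12)
print(standard, shared, round(shared / standard, 3))
```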

Enhancing Generative Networks for Chest Anomaly Localization through Automatic Registration-Based Unpaired-to-Pseudo-Paired Training Data Translation

Image translation based on a generative adversarial network (GAN-IT) is a promising method for the precise localization of abnormal regions in chest X-ray images (AL-CXR), even without pixel-level annotation. However, heterogeneous unpaired datasets undermine the ability of existing methods to extract key features and distinguish normal from abnormal cases, resulting in inaccurate and unstable AL-CXR. To address this problem, we propose an improved two-stage GAN-IT involving registration and data augmentation. For the first stage, we introduce an advanced deep-learning-based registration technique that virtually and reasonably converts unpaired data into paired data for learning registration maps, by sequentially applying a linear global and uniform coordinate transformation and an AI-based non-linear coordinate fine-tuning. This approach enables independent and complex coordinate transformation of each detailed location of the lung while recognizing the entire lung structure, thereby achieving higher registration performance while resolving inherent artifacts caused by unpaired conditions. For the second stage, we apply data augmentation to diversify anomaly locations by swapping the left and right lung regions on the uniformly registered frames, further improving performance by alleviating the imbalance in the distribution of left and right lung lesions. The proposed method is model-agnostic and shows consistent AL-CXR performance improvement in representative AI models. Therefore, we believe GAN-IT for AL-CXR can be clinically implemented using our basis framework, even if training data are scarce or pixel-level disease annotation is difficult.

Updated: 2024-06-16 02:52:15

标题: 通过基于自动配准的非配对到伪配对训练数据转换增强胸部异常定位的生成网络

摘要: 基于生成对抗网络（GAN-IT）的图像翻译是一种有前途的方法，可以在胸部X射线图像（AL-CXR）中精确定位异常区域，即使没有像素级别的注释。然而，异构的不成对数据集削弱了现有方法提取关键特征和区分正常与异常情况，导致AL-CXR不准确且不稳定。为解决这一问题，我们提出了一种改进的两阶段GAN-IT，包括注册和数据增强。在第一阶段，我们引入了一种先进的基于深度学习的注册技术，通过顺序利用基于线性的全局和均匀坐标变换以及基于人工智能的非线性坐标微调，将不成对数据虚拟和合理地转换为成对数据，用于学习注册映射。这种方法能够独立和复杂地转换肺部每个详细位置的坐标，同时识别整个肺部结构，从而实现更高的注册性能，解决了不成对条件导致的固有伪影问题。在第二阶段，我们应用数据增强来通过在均匀注册帧上交换左右肺区域来使异常位置多样化，进一步改善性能，减轻显示左右肺部病变的数据分布不平衡。该方法是模型不可知的，并在代表性AI模型中显示出一致的AL-CXR性能改善。因此，我们相信通过使用我们的基础框架，即使学习数据稀缺或对像素级别疾病注释困难，也可以通过GAN-IT实现AL-CXR的临床应用。

更新时间: 2024-06-16 02:52:15

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2207.10324v3

Ultrafast-and-Ultralight ConvNet-Based Intelligent Monitoring System for Diagnosing Early-Stage Mpox Anytime and Anywhere

Due to the absence of more efficient diagnostic tools, the spread of mpox continues to be unchecked. Although related studies have demonstrated the high efficiency of deep learning models in diagnosing mpox, key aspects such as model inference speed and parameter size have always been overlooked. Herein, an ultrafast and ultralight network named Fast-MpoxNet is proposed. Fast-MpoxNet, with only 0.27M parameters, can process input images at 68 frames per second (FPS) on the CPU. To detect subtle image differences and optimize model parameters better, Fast-MpoxNet incorporates an attention-based feature fusion module and a multiple auxiliary losses enhancement strategy. Experimental results indicate that Fast-MpoxNet, utilizing transfer learning and data augmentation, produces 98.40% classification accuracy for four classes on the mpox dataset. Furthermore, its Recall for early-stage mpox is 93.65%. Most importantly, an application system named Mpox-AISM V2 is developed, suitable for both personal computers and smartphones. Mpox-AISM V2 can rapidly and accurately diagnose mpox and can be easily deployed in various scenarios to offer the public real-time mpox diagnosis services. This work has the potential to mitigate future mpox outbreaks and pave the way for developing real-time diagnostic tools in the healthcare field.

Updated: 2024-06-16 02:48:40

标题: 随时随地诊断早期Mpox的超快超轻ConvNet智能监测系统

摘要: 由于缺乏更有效的诊断工具，mpox的传播继续未受控制。尽管相关研究已经证明深度学习模型在诊断mpox方面的高效性，但模型推理速度和参数大小等关键方面一直被忽视。在这里，提出了一种名为Fast-MpoxNet的超快速和超轻量级网络。Fast-MpoxNet仅具有0.27M个参数，可以在CPU上以每秒68帧的速度处理输入图像。为了检测微小的图像差异并更好地优化模型参数，Fast-MpoxNet融入了基于注意力的特征融合模块和多个辅助损失增强策略。实验结果表明，Fast-MpoxNet利用迁移学习和数据增强，在mpox数据集上对四个类别的分类准确率达到98.40%。此外，其对早期mpox的召回率为93.65%。最重要的是，开发了一个名为Mpox-AISM V2的应用系统，适用于个人电脑和智能手机。Mpox-AISM V2可以快速准确地诊断mpox，并可以轻松部署在各种场景中，为公众提供实时mpox诊断服务。这项工作有望减轻未来mpox爆发的风险，并为在医疗领域开发实时诊断工具铺平道路。

更新时间: 2024-06-16 02:48:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2308.13492v3

Mpox-AISM: AI-Mediated Super Monitoring for Mpox and Like-Mpox

Swift and accurate diagnosis of earlier-stage monkeypox (mpox) patients is crucial to avoiding its spread. However, the similarities between common skin disorders and mpox, together with the need for professional diagnosis, unavoidably impaired the diagnosis of earlier-stage mpox patients and contributed to the mpox outbreak. To address the challenge, we proposed "Super Monitoring", a real-time visualization technique employing artificial intelligence (AI) and Internet technology to diagnose earlier-stage mpox cheaply, conveniently, and quickly. Concretely, AI-mediated "Super Monitoring" (mpox-AISM) integrates deep learning models, data augmentation, self-supervised learning, and cloud services. On publicly accessible datasets, mpox-AISM's Precision, Recall, Specificity, and F1-score in diagnosing mpox reach 99.3%, 94.1%, 99.9%, and 96.6%, respectively, and it achieves 94.51% accuracy in diagnosing mpox, six like-mpox skin disorders, and normal skin. With the Internet and a communication terminal, mpox-AISM has the potential to perform real-time and accurate diagnosis of earlier-stage mpox in real-world scenarios, thereby preventing mpox outbreaks.

Updated: 2024-06-16 02:24:15

标题: Mpox-AISM：Mpox及类似疾病的AI介导超级监测

摘要: 迅速准确地诊断早期猴痘（mpox）患者对避免其传播至关重要。然而，普通皮肤疾病与mpox之间的相似之处以及对专业诊断的需求不可避免地影响了早期mpox患者的诊断，并导致mpox爆发。为了解决这一挑战，我们提出了“超级监测”（Super Monitoring）技术，这是一种利用人工智能（AI）和互联网技术实时诊断早期mpox的便捷、快速和经济的可视化技术。具体地，AI中介的“超级监测”（mpox-AISM）集成了深度学习模型、数据增强、自监督学习和云服务。根据公开可访问的数据集，mpox-AISM在诊断mpox方面的精度、召回率、特异性和F1分数分别达到99.3％、94.1％、99.9％和96.6％，并在诊断mpox、六种类似mpox的皮肤疾病和正常皮肤方面实现了94.51％的准确性。通过互联网和通信终端，mpox-AISM有潜力在现实场景中实时准确地诊断早期mpox，从而预防mpox爆发。

更新时间: 2024-06-16 02:24:15

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2303.09780v4
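
The F1-score reported in the Mpox-AISM abstract above can be cross-checked from the stated Precision and Recall, since F1 is their harmonic mean. The short check below uses only the two figures quoted in the abstract; the function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.993, 0.941)   # Precision 99.3%, Recall 94.1% from the abstract
print(round(f1 * 100, 1))     # consistent with the reported 96.6%
```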

SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Controllable Adversaries

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail safety-critical traffic scenarios. However, traditional methods for generating such scenarios often fall short in terms of controllability and realism and neglect the dynamics of agent interactions. To mitigate these limitations, we introduce SAFE-SIM, a novel diffusion-based controllable closed-loop safety-critical simulation framework. Our approach yields two distinct advantages: 1) the generation of realistic long-tail safety-critical scenarios that closely emulate real-world conditions, and 2) enhanced controllability, enabling more comprehensive and interactive evaluations. We develop a novel approach to simulate safety-critical scenarios through an adversarial term in the denoising process, which allows an adversarial agent to challenge a planner with plausible maneuvers while all agents in the scene exhibit reactive and realistic behaviors. Furthermore, we propose novel guidance objectives and a partial diffusion process that enables a user to control key aspects of the generated scenarios, such as the collision type and aggressiveness of the adversarial driver, while maintaining the realism of the behavior. We validate our framework empirically using the NuScenes dataset, demonstrating improvements in both realism and controllability. These findings affirm that diffusion models provide a robust and versatile foundation for safety-critical, interactive traffic simulation, extending their utility across the broader landscape of autonomous driving. For supplementary videos, visit our project at https://safe-sim.github.io/.

Updated: 2024-06-16 02:23:52

标题: SAFE-SIM: 具有可控对手的安全关键闭环交通仿真

摘要: 评估自主车辆规划算法的性能需要模拟长尾安全关键交通场景。然而，传统的生成这种场景的方法在可控性和真实性方面通常存在不足，并且忽略了代理之间的动态交互。为了缓解这些限制，我们引入了SAFE-SIM，这是一个基于扩散的新颖可控闭环安全关键仿真框架。我们的方法具有两个明显优势：1）生成与真实世界条件密切模拟的长尾安全关键场景，2）增强的可控性，使得评估更加全面和交互式。我们通过去噪过程中的对抗项提出了一种新颖的方法来模拟安全关键场景，这允许对抗性代理挑战规划者进行合理的操纵，而场景中所有代理展示出反应灵敏和真实的行为。此外，我们提出了新颖的指导目标和部分扩散过程，使用户能够控制生成场景的关键方面，如碰撞类型和对抗性驾驶员的攻击性，同时保持行为的真实性。我们通过NuScenes数据集在实验上验证了我们的框架，展示了在真实性和可控性方面的改进。这些发现证实了扩散模型为安全关键的互动交通仿真提供了坚实而多才多艺的基础，扩展了它们在自动驾驶广泛领域的实用性。有关补充视频，请访问我们的项目网站https://safe-sim.github.io/。

更新时间: 2024-06-16 02:23:52

领域: cs.RO,cs.AI,cs.CV,cs.LG,I.2.9; I.2.6

下载: http://arxiv.org/abs/2401.00391v2

Toward Enhanced Reinforcement Learning-Based Resource Management via Digital Twin: Opportunities, Applications, and Challenges

This article presents a digital twin (DT)-enhanced reinforcement learning (RL) framework aimed at optimizing performance and reliability in network resource management. Traditional RL methods face several common challenges when applied to physical networks, including limited exploration efficiency, slow convergence, poor long-term performance, and safety concerns during the exploration phase. To deal with these challenges, a comprehensive DT-based framework is proposed to enhance the convergence speed and performance of unified RL-based resource management. The proposed framework provides safe action exploration, more accurate estimates of long-term returns, faster training convergence, higher convergence performance, and real-time adaptation to varying network conditions. Then, two case studies on ultra-reliable and low-latency communication (URLLC) services and a multiple unmanned aerial vehicle (UAV) network are presented, demonstrating the improvements of the proposed framework in performance, convergence speed, and training cost reduction for both traditional RL and neural-network-based Deep RL (DRL). Finally, the article identifies and explores some of the research challenges and open issues in this rapidly evolving field.

Updated: 2024-06-16 01:46:06

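Title: Toward Enhanced Reinforcement Learning-Based Resource Management via Digital Twin: Opportunities, Applications, and Challenges

Abstract: This article proposes a digital twin (DT)-enhanced reinforcement learning (RL) framework that aims to optimize performance and reliability in network resource management, because traditional RL methods face several common challenges when applied to physical networks, including limited exploration efficiency, slow convergence, poor long-term performance, and safety problems during the exploration phase. To address these challenges, a comprehensive DT-based framework is proposed to improve the convergence speed and performance of unified RL-based resource management. The proposed framework provides safe action exploration, more accurate estimates of long-term returns, faster training convergence, higher convergence performance, and real-time adaptation to changing network conditions. Two case studies, on ultra-reliable and low-latency communication (URLLC) services and on multiple unmanned aerial vehicle (UAV) networks, are then presented, showing the framework's improvements in performance, convergence speed, and training cost reduction for both traditional RL and neural network-based deep RL (DRL). Finally, the article identifies and discusses some research challenges and open issues in this rapidly evolving field.

Updated: 2024-06-16 01:46:06

Categories: eess.SY,cs.LG,cs.NI,cs.SY

Download: http://arxiv.org/abs/2406.07857v2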
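
The "safe action exploration" mentioned in the digital twin abstract above can be illustrated with a small sketch. This is not the authors' framework: the twin model (`twin_predict_latency`), the bandwidth-allocation action set, and the latency bound are all hypothetical, chosen only to show the idea of screening exploratory actions on a twin before they reach the physical network.

```python
import random

def twin_predict_latency(load, bandwidth):
    """Hypothetical digital-twin model: predicted latency for an allocation."""
    return load / bandwidth

def safe_actions(load, actions, max_latency):
    """Keep only allocations the twin predicts will meet the latency bound."""
    return [a for a in actions if twin_predict_latency(load, a) <= max_latency]

def explore(load, actions, max_latency, rng):
    """Explore randomly among pre-screened actions.

    If the twin rejects every action, fall back to the largest allocation
    (an assumed conservative default, not something the abstract specifies).
    """
    candidates = safe_actions(load, actions, max_latency)
    return rng.choice(candidates) if candidates else max(actions)

rng = random.Random(0)
action = explore(load=8.0, actions=[1.0, 2.0, 4.0, 8.0], max_latency=2.0, rng=rng)
```

Here only the allocations 4.0 and 8.0 pass the twin's check, so exploration never tries an allocation the twin predicts would violate the bound.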

A Rate-Distortion View of Uncertainty Quantification

While powerful probabilistic models such as Gaussian Processes are naturally distance-aware, meaning their uncertainty reflects how far an input lies from the training data, deep neural networks often lack this property. In this paper, we introduce Distance Aware Bottleneck (DAB), a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that stores a compressed representation of all inputs seen during training. The distance of a new example from this codebook can serve as an uncertainty estimate for the example. The resulting model is simple to train and provides deterministic uncertainty estimates in a single forward pass. Finally, our method achieves better out-of-distribution (OOD) detection and misclassification prediction than prior methods, including expensive ensemble methods, deep kernel Gaussian Processes, and approaches based on the standard information bottleneck.

Updated: 2024-06-16 01:33:22

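Title: A Rate-Distortion View of Uncertainty Quantification

Abstract: Although powerful probabilistic models such as Gaussian Processes naturally have this property (awareness of how far an input lies from the training data), deep neural networks often lack it. In this paper, we introduce Distance Aware Bottleneck (DAB), a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that stores a compressed representation of all inputs seen during training. The distance of a new example from this codebook can serve as an uncertainty estimate for that example. Finally, our method provides deterministic uncertainty estimates in a single forward pass and achieves better out-of-distribution (OOD) detection and misclassification prediction than prior methods, including expensive ensemble methods, deep kernel Gaussian Processes, and approaches based on the standard information bottleneck.

Updated: 2024-06-16 01:33:22

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2406.10775v1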
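
The codebook-distance idea in the DAB abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: the codebook is a hand-written list of 2-D centroids rather than a learned one, and plain Euclidean distance stands in for whatever distance the paper uses. The names `codebook` and `uncertainty` are hypothetical.

```python
import math

# Hypothetical codebook: compressed summaries (centroids) of training inputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

def uncertainty(x, codebook):
    """Distance from x to its nearest codebook entry (larger = less familiar)."""
    return min(math.dist(x, c) for c in codebook)

in_dist = uncertainty((0.9, 1.1), codebook)     # close to a stored centroid
far_away = uncertainty((10.0, 10.0), codebook)  # far from every centroid
```

An input near a stored centroid receives a small score and a distant input a large one, which is the behaviour one would threshold for OOD detection.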

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .

Updated: 2024-06-16 01:33:02

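Title: Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Abstract: As demand for long-context large language models (LLMs) grows, models with context windows of up to 128K or 1M tokens are becoming increasingly common. However, long-context LLM inference is challenging because inference speed drops significantly as the sequence length grows. This slowdown is caused mainly by loading a large KV cache during self-attention. Previous work has shown that a small portion of critical tokens dominates the attention outcomes. However, we observe that the criticality of a token depends heavily on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By loading only the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x, while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .

Updated: 2024-06-16 01:33:02

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2406.10774v1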
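
The page-selection step described in the Quest abstract above can be sketched as follows. This is an illustrative sketch, not the released implementation: it assumes the criticality estimate is an upper bound on the query-key dot product computed from per-channel min/max Key values, and the function names (`page_metadata`, `page_score`, `select_pages`) and the toy pages are hypothetical.

```python
def page_metadata(keys):
    """Element-wise min and max over the key vectors stored in one page."""
    dims = range(len(keys[0]))
    mins = [min(k[d] for k in keys) for d in dims]
    maxs = [max(k[d] for k in keys) for d in dims]
    return mins, maxs

def page_score(query, mins, maxs):
    """Upper bound on query . key for any key inside the page's min/max box."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))

def select_pages(query, pages, top_k):
    """Indices of the top_k pages with the highest estimated criticality."""
    scored = [(page_score(query, *page_metadata(p)), i) for i, p in enumerate(pages)]
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:top_k])

pages = [
    [(1.0, 0.0), (0.5, 0.2)],    # page 0
    [(-1.0, 3.0), (0.0, 2.0)],   # page 1
    [(0.1, -0.1), (0.0, 0.0)],   # page 2
]
query = (0.0, 1.0)
chosen = select_pages(query, pages, top_k=2)
```

Because the score bounds every true dot product in the page from above, a page containing a high-scoring key cannot receive a low estimate, and attention then only needs to load the selected pages.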

Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles

Large language models (LLMs) are increasingly being utilised across a range of tasks and domains, with a burgeoning interest in their application within the field of journalism. This trend raises concerns due to our limited understanding of LLM behaviour in this domain, especially with respect to political bias. Existing studies predominantly focus on LLMs undertaking political questionnaires, which offers only limited insights into their biases and operational nuances. To address this gap, our study establishes a new curated dataset that contains 2,100 human-written articles and utilises their descriptions to generate 56,700 synthetic articles using nine LLMs. This enables us to analyse shifts in properties between human-authored and machine-generated articles, with this study focusing on political bias, detecting it using both supervised models and LLMs. Our findings reveal significant disparities between base and instruction-tuned LLMs, with instruction-tuned models exhibiting consistent political bias. Furthermore, we are able to study how LLMs behave as classifiers, observing their display of political bias even in this role. Overall, for the first time within the journalistic domain, this study outlines a framework and provides a structured dataset for quantifiable experiments, serving as a foundation for further research into LLM political bias and its implications.

Updated: 2024-06-16 01:32:04

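Title: Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles

Abstract: Large language models (LLMs) are increasingly used across a range of tasks and domains, and interest in applying them in journalism is growing. This trend raises concerns because our understanding of LLM behaviour in this domain is limited, especially with respect to political bias. Existing studies focus mainly on LLMs answering political questionnaires, which offers only limited insight into their biases and operational nuances. To address this gap, our study establishes a new curated dataset containing 2,100 human-written articles and uses their descriptions to generate 56,700 synthetic articles with nine LLMs. This enables us to analyse shifts in properties between human-authored and machine-generated articles; this study focuses on political bias, detecting it using both supervised models and LLMs. Our findings reveal significant disparities between base and instruction-tuned LLMs, with instruction-tuned models exhibiting consistent political bias. Furthermore, we are able to study how LLMs behave as classifiers, observing that they display political bias even in this role. Overall, for the first time within the journalistic domain, this study outlines a framework and provides a structured dataset for quantifiable experiments, laying a foundation for further research into LLM political bias and its implications.

Updated: 2024-06-16 01:32:04

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2406.10773v1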

Explore Spurious Correlations at the Concept Level in Language Models for Text Classification

Language models (LMs) have achieved notable success in numerous NLP tasks, employing both fine-tuning and in-context learning (ICL) methods. While language models demonstrate exceptional performance, they face robustness challenges due to spurious correlations arising from imbalanced label distributions in training data or ICL exemplars. Previous research has primarily concentrated on word, phrase, and syntax features, neglecting the concept level, often due to the absence of concept labels and difficulty in identifying conceptual content in input texts. This paper introduces two main contributions. First, we employ ChatGPT to assign concept labels to texts, assessing concept bias in models during fine-tuning or ICL on test data. We find that LMs, when encountering spurious correlations between a concept and a label in training or prompts, resort to shortcuts for predictions. Second, we introduce a data rebalancing technique that incorporates ChatGPT-generated counterfactual data, thereby balancing label distribution and mitigating spurious correlations. Our method's efficacy, surpassing traditional token removal approaches, is validated through extensive testing.

Updated: 2024-06-16 01:28:49

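Title: Explore Spurious Correlations at the Concept Level in Language Models for Text Classification

Abstract: Language models (LMs) have achieved notable success in many NLP tasks, using both fine-tuning and in-context learning (ICL) methods. Although language models perform exceptionally well, they face robustness challenges due to spurious correlations that arise from imbalanced label distributions in training data or ICL exemplars. Previous research has concentrated mainly on word, phrase, and syntax features and has neglected the concept level, often because concept labels are missing and conceptual content in input texts is hard to identify. This paper makes two main contributions. First, we use ChatGPT to assign concept labels to texts and assess concept bias in models during fine-tuning or ICL on test data. We find that when LMs encounter spurious correlations between a concept and a label in training or prompts, they resort to shortcuts for predictions. Second, we introduce a data rebalancing technique that incorporates ChatGPT-generated counterfactual data, thereby balancing the label distribution and mitigating spurious correlations. Extensive testing validates the efficacy of our method, which surpasses traditional token removal approaches.

Updated: 2024-06-16 01:28:49

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2311.08648v4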
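
The rebalancing idea in the spurious-correlation abstract above can be made concrete with a counting sketch. It covers only the bookkeeping step, under assumptions: the toy dataset, the concept tags, and the rule "top up every label to the most frequent label's count" are illustrative choices, and the generation of counterfactual texts (done with ChatGPT in the paper) is not shown.

```python
from collections import Counter

def label_counts_for_concept(dataset, concept):
    """Count labels among examples tagged with the given concept."""
    return Counter(label for _text, c, label in dataset if c == concept)

def counterfactuals_needed(counts, labels):
    """Examples to add per label so every label matches the most frequent one."""
    target = max(counts.get(label, 0) for label in labels)
    return {label: target - counts.get(label, 0) for label in labels}

# Toy (text, concept, label) triples; the concept "food" co-occurs mostly with "pos".
dataset = [
    ("great pasta", "food", "pos"),
    ("tasty soup", "food", "pos"),
    ("lovely dessert", "food", "pos"),
    ("cold fries", "food", "neg"),
    ("slow checkout", "service", "neg"),
]
counts = label_counts_for_concept(dataset, "food")
to_add = counterfactuals_needed(counts, ["pos", "neg"])
```

In this toy case the concept "food" would need two additional negative examples before its label distribution is balanced.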

Fuzzy Convolution Neural Networks for Tabular Data Classification

Recently, convolution neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. Non-image data are prevalent in many fields, such as bioinformatics, finance, and medicine, and adapting CNNs to classify such data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel framework, the fuzzy convolution neural network (FCNN), tailored specifically to tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is then used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We used a randomly selected seventy percent of the samples from each data set for training and the remaining thirty percent for testing. The data sets were also classified using state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results demonstrate that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our findings suggest that the proposed FCNN model holds promise as a viable alternative for tabular data classification tasks, offering a fresh perspective and potentially unlocking new opportunities for leveraging deep learning in structured data analysis.

Updated: 2024-06-16 01:18:18

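Title: Fuzzy Convolution Neural Networks for Tabular Data Classification

Abstract: Recently, convolution neural networks (CNNs) have attracted a great deal of attention because of their strong performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. In many fields, such as bioinformatics, finance, and medicine, non-image data are prevalent. Adapting CNNs to classify non-image data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel fuzzy convolution neural network (FCNN) designed specifically for tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is then used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We randomly selected seventy percent of the samples from each data set for training and thirty percent for testing. The data sets were also classified using state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results show that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our findings suggest that the proposed FCNN model holds promise for tabular data classification tasks, offering a fresh perspective and potentially opening new opportunities for applying deep learning to structured data analysis.

Updated: 2024-06-16 01:18:18

Categories: cs.LG,cs.AI,I.2.10,I.4.6

Download: http://arxiv.org/abs/2406.03506v2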
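
The feature-to-image step in the FCNN abstract above (feature values mapped to fuzzy memberships, then arranged as an image) might look like the sketch below. The specifics are assumptions, since the abstract does not give them: features are taken to be normalized to [0, 1], three triangular fuzzy sets ("low", "medium", "high") are used, and the "image" is simply a grid with one row per feature and one column per fuzzy set.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy sets over features normalized to [0, 1].
FUZZY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}

def to_membership_image(features):
    """One row per feature, one column per fuzzy set (low, medium, high)."""
    order = ("low", "medium", "high")
    return [[triangular(x, *FUZZY_SETS[name]) for name in order] for x in features]

image = to_membership_image([0.0, 0.25, 1.0])
```

A grid like this could then be fed to any 2-D CNN; the actual image layout used in the paper may differ.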

Predicting Exoplanetary Features with a Residual Model for Uniform and Gaussian Distributions

The advancement of technology has led to rampant growth in data collection across almost every field, including astrophysics, with researchers turning to machine learning to process and analyze this data. One prominent example of this data in astrophysics is the atmospheric retrievals of exoplanets. In order to help bridge the gap between machine learning and astrophysics domain experts, the 2023 Ariel Data Challenge was hosted to predict posterior distributions of 7 exoplanetary features. The procedure outlined in this paper leveraged a combination of two deep learning models to address this challenge: a Multivariate Gaussian model that generates the mean and covariance matrix of a multivariate Gaussian distribution, and a Uniform Quantile model that predicts quantiles for use as the upper and lower bounds of a uniform distribution. Training of the Multivariate Gaussian model was found to be unstable, while training of the Uniform Quantile model was stable. An ensemble of uniform distributions was found to have competitive results during testing (posterior score of 696.43), and when combined with a multivariate Gaussian distribution achieved a final rank of third in the 2023 Ariel Data Challenge (final score of 681.57).

Updated: 2024-06-16 01:07:15

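Title: Predicting Exoplanetary Features with a Residual Model for Uniform and Gaussian Distributions

Abstract: The advancement of technology has led to rapid growth in data collection across almost every field, including astrophysics, where researchers are turning to machine learning to process and analyze the data. One prominent example in astrophysics is the atmospheric retrieval of exoplanets. To help bridge the gap between machine learning and astrophysics domain experts, the 2023 Ariel Data Challenge was held to predict posterior distributions of 7 exoplanetary features. The procedure outlined in this paper uses a combination of two deep learning models to address this challenge: a Multivariate Gaussian model that generates the mean and covariance matrix of a multivariate Gaussian distribution, and a Uniform Quantile model that predicts quantiles for use as the upper and lower bounds of a uniform distribution. Training of the Multivariate Gaussian model was found to be unstable, while training of the Uniform Quantile model was stable. An ensemble of uniform distributions achieved competitive results during testing (posterior score of 696.43) and, when combined with a multivariate Gaussian distribution, achieved a final rank of third in the 2023 Ariel Data Challenge (final score of 681.57).

Updated: 2024-06-16 01:07:15

Categories: astro-ph.EP,astro-ph.IM,cs.LG,physics.data-an

Download: http://arxiv.org/abs/2406.10771v1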
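
The Uniform Quantile idea in the abstract above (predict a lower and an upper quantile, then treat them as the bounds of a uniform distribution) can be sketched as follows. The abstract does not state the training loss, so the pinball (quantile) loss here is an assumption based on common practice for quantile regression, and the bounds 0.2 and 0.8 are made-up values standing in for model outputs.

```python
import random

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for one prediction at quantile level q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)

def sample_uniform_posterior(lower, upper, n, seed=0):
    """Draw n posterior samples uniformly between the predicted quantile bounds."""
    rng = random.Random(seed)
    return [rng.uniform(lower, upper) for _ in range(n)]

# Hypothetical predicted lower/upper quantiles for one exoplanetary feature.
samples = sample_uniform_posterior(lower=0.2, upper=0.8, n=5)
under = pinball_loss(y_true=1.0, y_pred=0.0, q=0.9)  # under-prediction, weighted by q
over = pinball_loss(y_true=0.0, y_pred=1.0, q=0.9)   # over-prediction, weighted by 1 - q
```

At a high quantile level such as 0.9, under-prediction costs more than over-prediction, which pushes the predicted value toward the upper end of the target distribution.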

A Survey on LLM-Based Agents: Common Workflows and Reusable LLM-Profiled Components

Recent advancements in Large Language Models (LLMs) have catalyzed the development of sophisticated frameworks for building LLM-based agents. However, the complexity of these frameworks poses a hurdle for nuanced differentiation at a granular level, a critical aspect for enabling efficient implementations across different frameworks and fostering future research. Hence, the primary purpose of this survey is to facilitate a cohesive understanding of diverse recently proposed frameworks by identifying common workflows and reusable LLM-Profiled Components (LMPCs).

Updated: 2024-06-16 00:59:27

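Title: A Survey on LLM-Based Agents: Common Workflows and Reusable LLM-Profiled Components

Abstract: Recent advances in Large Language Models (LLMs) have catalyzed the development of sophisticated frameworks for building LLM-based agents. However, the complexity of these frameworks poses a hurdle for differentiation at a granular level, which is critical for enabling efficient implementations across different frameworks and fostering future research. Hence, the primary purpose of this survey is to facilitate a cohesive understanding of diverse recently proposed frameworks by identifying common workflows and reusable LLM-Profiled Components (LMPCs).

Updated: 2024-06-16 00:59:27

Categories: cs.AI,cs.CL,cs.SE

Download: http://arxiv.org/abs/2406.05804v2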

VideoPrism: A Foundational Visual Encoder for Video Understanding

We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.

Updated: 2024-06-16 00:56:08

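Title: VideoPrism: A Foundational Visual Encoder for Video Understanding

Abstract: We introduce VideoPrism, a general-purpose video encoder that handles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding through global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the valuable text associated with videos. We test VideoPrism extensively on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.

Updated: 2024-06-16 00:56:08

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2402.13217v2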

Comparing Hallucination Detection Metrics for Multilingual Generation

While many hallucination detection techniques have been evaluated on English text, their effectiveness in multilingual contexts remains unknown. This paper assesses how well various factual hallucination detection metrics (lexical metrics like ROUGE and Named Entity Overlap, and Natural Language Inference (NLI)-based metrics) identify hallucinations in generated biographical summaries across languages. We compare how well automatic metrics correlate with each other and whether they agree with human judgments of factuality. Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models. However, NLI metrics are still limited, as they do not detect single-fact hallucinations well and fail for lower-resource languages. Therefore, our findings highlight the gaps in existing hallucination detection methods for non-English languages and motivate future research to develop more robust multilingual detection methods for LLM hallucinations.

Updated: 2024-06-16 00:44:28

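Title: Comparing Hallucination Detection Metrics for Multilingual Generation

Abstract: Although many hallucination detection techniques have been evaluated on English text, their effectiveness in multilingual contexts remains unknown. This paper assesses how well various factual hallucination detection metrics (lexical metrics such as ROUGE and Named Entity Overlap, and Natural Language Inference (NLI)-based metrics) identify hallucinations in generated biographical summaries across languages. We compare how well automatic metrics correlate with each other and whether they agree with human judgments of factuality. Our analysis shows that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models. However, NLI metrics are still limited: they do not detect single-fact hallucinations well and fail for lower-resource languages. Our findings therefore highlight the gaps in existing hallucination detection methods for non-English languages and motivate future research on more robust multilingual detection methods for LLM hallucinations.

Updated: 2024-06-16 00:44:28

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2402.10496v2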
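
A toy example helps show why a lexical metric of the kind named in the abstract above can miss a single-fact hallucination. The sketch below computes a simplified unigram recall in the spirit of ROUGE-1; it is not the official ROUGE implementation (no stemming or tokenization rules), and the two sentences are invented for illustration.

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = "Marie Curie was born in Warsaw"
candidate = "Marie Curie was born in Paris"  # one wrong fact, five matching words
score = rouge1_recall(reference, candidate)
```

Five of the six reference words match, so the score stays high even though the only fact that differs is wrong. An entailment-based check operates on meaning rather than word overlap, which is one plausible reason the NLI metrics fare better in the study.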

InteraRec: Screenshot Based Recommendations Using Multimodal Large Language Models

Weblogs, which consist of records detailing user activities on a website, offer valuable insights into user preferences, behavior, and interests. Numerous recommendation algorithms, employing strategies such as collaborative filtering, content-based filtering, and hybrid methods, leverage the data mined from these weblogs to provide personalized recommendations to users. Despite the abundance of information available in weblogs, identifying and extracting pertinent information and key features from them requires extensive engineering effort. The intricate nature of the data also poses a challenge for interpretation, especially for non-experts. In this study, we introduce a sophisticated and interactive recommendation framework, InteraRec, which diverges from conventional approaches that depend exclusively on weblogs for recommendation generation. The InteraRec framework captures high-frequency screenshots of web pages as users navigate through a website. Leveraging state-of-the-art multimodal large language models (MLLMs), it extracts valuable insights into user preferences from these screenshots by generating a textual summary based on predefined keywords. Subsequently, an LLM-integrated optimization setup uses this summary to generate tailored recommendations. Through our experiments, we demonstrate the effectiveness of InteraRec in providing users with valuable and personalized offerings. Furthermore, we explore the integration of session-based recommendation systems into the InteraRec framework, aiming to enhance its overall performance. Finally, we curate a new dataset comprising screenshots of product web pages on the Amazon website to validate the InteraRec framework. Detailed experiments demonstrate the efficacy of the InteraRec framework in delivering valuable and personalized recommendations tailored to individual user preferences.

Updated: 2024-06-16 00:40:15

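Title: InteraRec: Screenshot Based Recommendations Using Multimodal Large Language Models

Abstract: Weblogs, which consist of records of user activities on a website, offer valuable insights into user preferences, behavior, and interests. Many recommendation algorithms use the data mined from these weblogs, employing strategies such as collaborative filtering, content-based filtering, and hybrid methods, to provide personalized recommendations to users. Although a great deal of information is available in these weblogs, identifying and extracting relevant information and key features from them requires extensive engineering effort. The complexity of the data also makes interpretation challenging, especially for non-experts. In this study, we introduce a sophisticated and interactive recommendation framework called InteraRec, which differs from conventional approaches that rely solely on weblogs to generate recommendations. The InteraRec framework captures high-frequency screenshots of web pages as users browse a website. Using state-of-the-art multimodal large language models (MLLMs), it extracts valuable insights into user preferences from these screenshots by generating a textual summary based on predefined keywords. An LLM-integrated optimization setup then uses this summary to generate tailored recommendations. Through our experiments, we demonstrate the effectiveness of InteraRec in providing users with valuable and personalized recommendations. In addition, we explore integrating session-based recommendation systems into the InteraRec framework to improve its overall performance. Finally, we curate a new dataset of screenshots from product web pages on the Amazon website to validate the InteraRec framework. Detailed experiments demonstrate the effectiveness of the InteraRec framework in delivering valuable and personalized recommendations suited to individual user preferences.

Updated: 2024-06-16 00:40:15

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2403.00822v2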

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Many real-world tasks require an agent to reason jointly over text and visual objects (e.g., navigating in public spaces), which we refer to as context-sensitive text-rich visual reasoning. Specifically, these tasks require an understanding of the context in which the text interacts with visual elements within an image. However, existing datasets are insufficient for benchmarking the capability of state-of-the-art multimodal models on context-sensitive text-rich visual reasoning. In this paper, we introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models (including GPT-4V, Gemini-Pro-Vision, and LLaVA-Next) and establish a human performance baseline. Further, we perform human evaluations of the model responses and observe a significant performance gap of 30.8% between GPT-4V (the current best-performing Large Multimodal Model) and human performance. Our fine-grained analysis reveals that GPT-4V encounters difficulties interpreting time-related data and infographics. However, it demonstrates proficiency in comprehending abstract visual contexts such as memes and quotes. Finally, our qualitative analysis uncovers various factors contributing to poor performance, including lack of precise visual perception and hallucinations. Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/

Updated: 2024-06-16 00:38:24

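Title: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Abstract: Many real-world tasks require an agent to reason jointly over text and visual objects (e.g., navigating in public spaces), which we call context-sensitive text-rich visual reasoning. Specifically, these tasks require understanding the context in which text interacts with visual elements in an image. However, existing datasets are insufficient for evaluating the capability of state-of-the-art multimodal models on context-sensitive text-rich visual reasoning. In this paper, we introduce ConTextual, a new dataset of human-crafted instructions that require context-sensitive reasoning over text-rich images. We run experiments to assess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision, LLaVA-Next) and establish a human performance baseline. In addition, we perform human evaluations of the model responses and observe a significant performance gap of 30.8% between GPT-4V (the current best-performing large multimodal model) and human performance. Our fine-grained analysis shows that GPT-4V has difficulty interpreting time-related data and infographics. However, it is proficient at understanding abstract visual contexts such as memes and quotes. Finally, our qualitative analysis uncovers various factors that contribute to poor performance, including lack of precise visual perception and hallucinations. Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/.

Updated: 2024-06-16 00:38:24

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2401.13311v2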