Arxiv Day: Article

Public-Key Quantum Money and Fast Real Transforms

We propose a public-key quantum money scheme based on group actions and the Hartley transform. Our scheme adapts the quantum money scheme of Zhandry (2024), replacing the Fourier transform with the Hartley transform. This substitution ensures the banknotes have real amplitudes rather than complex amplitudes, which could offer both computational and theoretical advantages. To support this new construction, we propose a new verification algorithm that uses group action twists to address verification failures caused by the switch to real amplitudes. We also show how to efficiently compute the serial number associated with a money state using a new algorithm based on continuous-time quantum walks. Finally, we present a recursive algorithm for the quantum Hartley transform, achieving lower gate complexity than prior work and demonstrate how to compute other real quantum transforms, such as the quantum sine transform, using the quantum Hartley transform as a subroutine.

Updated: 2025-07-17 23:53:10

标题: 公钥量子货币和快速真实变换

摘要: 我们提出了一种基于群作用和哈特利变换的公钥量子货币方案。我们的方案改进了Zhandry（2024年）的量子货币方案，将傅立叶变换替换为哈特利变换。这种替换确保了纸币具有实振幅而不是复振幅，这可能会提供计算和理论上的优势。为了支持这种新的构造，我们提出了一种新的验证算法，利用群作用扭曲来解决由于转换为实振幅而导致的验证失败问题。我们还展示了如何使用基于连续时间量子漫步的新算法高效计算与货币状态相关的序列号。最后，我们提出了一种用于量子哈特利变换的递归算法，实现了比之前工作更低的门复杂度，并演示了如何使用量子哈特利变换作为子程序计算其他实量子变换，如量子正弦变换。

更新时间: 2025-07-17 23:53:10

领域: quant-ph,cs.CR

下载: http://arxiv.org/abs/2503.18890v3

Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries

As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model. We present a novel framework, Preference Learning Using Summarization (PLUS), that learns text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. We train the user-summarization model with reinforcement learning, and update the reward model simultaneously, creating an online co-adaptation loop. We show that in contrast with prior personalized RLHF techniques or with in-context learning of user information, summaries produced by PLUS capture meaningful aspects of a user's preferences. Across different pluralistic user datasets, we show that our method is robust to new users and diverse conversation topics. Additionally, we demonstrate that the textual summaries generated about users can be transferred for zero-shot personalization of stronger, proprietary models like GPT-4. The resulting user summaries are not only concise and portable, they are easy for users to interpret and modify, allowing for more transparency and user control in LLM alignment.

Updated: 2025-07-17 23:48:51

标题: 学习通过强化学习微调摘要的多元用户偏好

摘要: 随着大型语言模型（LLM）人工智能助手的日常使用案例不断扩展，个性化响应以符合不同用户的偏好和目标变得越来越重要。虽然从人类反馈中进行强化学习（RLHF）能有效改进LLMs，使其更加有帮助和流畅，但它并未考虑到用户之间的差异，因为它用单一奖励模型来建模整个用户群体。我们提出了一种新颖的框架，即使用总结偏好学习（PLUS），它学习每个用户的偏好、特征和过去对话的基于文本的总结。这些总结条件了奖励模型，使其能够对每个用户所看重的响应类型进行个性化预测。我们用强化学习训练用户总结模型，并同时更新奖励模型，创建一个在线协同适应循环。我们展示，与先前的个性化RLHF技术或在上下文中学习用户信息相比，PLUS生成的总结捕捉了用户偏好的有意义方面。我们在不同的多元用户数据集中展示，我们的方法对新用户和多样的对话主题具有鲁棒性。此外，我们证明了关于用户生成的文本总结可以被转移用于零-shot个性化更强大的专有模型，如GPT-4。生成的用户总结不仅简洁且便于携带，而且用户易于解释和修改，从而在LLM对齐中提供更多透明度和用户控制。

更新时间: 2025-07-17 23:48:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13579v1

Apple Intelligence Foundation Language Models: Tech Report 2025

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.

Updated: 2025-07-17 23:37:19

标题: 苹果智能基础语言模型：技术报告2025

摘要: 我们介绍了两个多语言、多模态的基础语言模型，为苹果设备和服务提供智能功能：一是一个3B参数的设备端模型，通过架构创新（如KV-cache共享和2位量化感知训练）进行了苹果硅芯片优化；二是一个可扩展的服务器模型，基于一种新颖的Parallel-Track Mixture-of-Experts（PT-MoE）变压器，结合轨道并行性、专家混合稀疏计算和交替的全局-本地关注力，以高质量和具有竞争力的成本在苹果的私有云计算平台上提供服务。这两个模型都是在通过负责任的网络爬虫、授权语料库和高质量合成数据收集的大规模多语言和多模态数据集上进行训练的，然后利用新的异步平台进行监督微调和强化学习进一步改进。由此产生的模型支持多种附加语言，同时理解图像和执行工具调用。在公共基准测试和人类评估中，服务器模型和设备端模型与相同规模的开放基准相匹配或超越。一个以Swift为中心的Foundation Models框架公开了引导生成、受限工具调用和LoRA适配器微调，使开发人员能够在几行代码中集成这些功能。苹果智能模型的最新进展基于我们的负责任AI方法，具有内容过滤和区域特定评估等保障措施，同时我们致力于通过创新如私有云计算来保护用户隐私。

更新时间: 2025-07-17 23:37:19

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13575v1

Understanding Reasoning in Thinking Language Models via Steering Vectors

Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model's reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.

Updated: 2025-07-17 23:27:34

标题: 通过引导向量理解思维语言模型中的推理

摘要: 最近大型语言模型（LLMs）的进展导致了思考语言模型的发展，这些模型在生成响应之前会产生广泛的内部推理链。虽然这些模型取得了改进的性能，但控制它们的推理过程仍然具有挑战性。本文提出了一种通过分析和操纵DeepSeek-R1-Distill模型中特定推理行为的驾驶方法。通过对10个不同类别的500个任务进行系统实验，我们确定了思考模型展示的几种推理行为，包括表达不确定性、为假设验证生成示例以及在推理链中回溯。我们展示了这些行为是通过模型激活空间中的线性方向进行调节的，并且可以使用驾驶向量进行控制。通过提取和应用这些向量，我们提供了一种调节模型推理过程特定方面的方法，比如其倾向于回溯或表达不确定性。我们的方法为以受控且可解释的方式驾驶思考模型的推理过程提供了实用工具。我们使用三个DeepSeek-R1-Distill模型验证了我们的驾驶方法，展示了对不同模型架构的一致控制。

更新时间: 2025-07-17 23:27:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.18167v3

An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots

Software engineering (SE) chatbots are increasingly gaining attention for their role in enhancing development processes. At the core of chatbots are Natural Language Understanding platforms (NLUs), which enable them to comprehend user queries but require labeled data for training. However, acquiring such labeled data for SE chatbots is challenging due to the scarcity of high-quality datasets, as training requires specialized vocabulary and phrases not found in typical language datasets. Consequently, developers often resort to manually annotating user queries -- a time-consuming and resource-intensive process. Previous approaches require human intervention to generate rules, called labeling functions (LFs), that categorize queries based on specific patterns. To address this issue, we propose an approach to automatically generate LFs by extracting patterns from labeled user queries. We evaluate our approach on four SE datasets and measure performance improvement from training NLUs on queries labeled by the generated LFs. The generated LFs effectively label data with AUC scores up to 85.3% and NLU performance improvements up to 27.2%. Furthermore, our results show that the number of LFs affects labeling performance. We believe that our approach can save time and resources in labeling users' queries, allowing practitioners to focus on core chatbot functionalities rather than manually labeling queries.

Updated: 2025-07-17 23:21:56

标题: 一个用于软件工程聊天机器人自动生成标注函数的方法

摘要: 软件工程（SE）聊天机器人在增强开发流程方面越来越受到关注。聊天机器人的核心是自然语言理解平台（NLUs），它们使机器人能够理解用户查询，但需要标记数据进行训练。然而，由于高质量数据集稀缺，为SE聊天机器人获取这种标记数据是具有挑战性的，因为训练需要特定的词汇和短语，这些词汇和短语在典型的语言数据集中找不到。因此，开发人员通常会手动注释用户查询，这是一个耗时且资源密集的过程。先前的方法需要人工干预，生成基于特定模式对查询进行分类的规则，称为标记函数（LFs）。为了解决这个问题，我们提出了一种方法，通过从标记的用户查询中提取模式来自动生成LFs。我们在四个SE数据集上评估了我们的方法，并测量了通过在生成的LFs上标记的查询上训练NLUs的性能改进。生成的LFs有效地使用AUC分数为数据进行标记，性能提升高达85.3％，NLU性能提升高达27.2％。此外，我们的结果表明，LFs的数量影响标记性能。我们相信我们的方法可以节省标记用户查询的时间和资源，使从业者能够专注于核心聊天机器人功能，而不是手动标记查询。

更新时间: 2025-07-17 23:21:56

领域: cs.SE,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2410.07094v2

Change of Thought: Adaptive Test-Time Computation

Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class TC0. Running a Transformer autoregressively removes that ceiling -- first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this "thinking aloud" mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing -- in one pass -- the alignment matrix that remixes the input sequence, the SELF-Transformer iteratively updates that matrix internally, scaling test-time computation with input difficulty. This adaptivity yields up to 20\% accuracy gains on encoder-style benchmarks without increasing parameter count, demonstrating that input-adaptive alignment at test time offers substantial benefits for only a modest extra compute budget. Self-Transformers thus recover much of the expressive power of iterative reasoning while preserving the simplicity of pure encoder architectures.

Updated: 2025-07-17 23:12:57

标题: 思维的变革：自适应测试时间计算

摘要: 在单个、固定深度的通路中评估的变压器在表达能力上被证明受到常数深度电路类TC0的限制。以自回归方式运行的变压器消除了这种限制 - 首先是在下一个标记预测中，最近又是在思维链推理中。这两种模式都依赖于将内部状态解码为标记，然后在后续步骤中重新编码这些标记的反馈循环。虽然这种“大声思考”反映了人类的推理方式，但生物大脑在没有将中间状态外部化为语言的情况下进行迭代。为了增强编码器变压器的表达能力，而不必求助于标记级别的自回归，我们引入了SELF-Transformer：一个编码器层，通过迭代改进自己的注意力权重到一个固定点。SELF-Transformer不是在一个通路中产生重组输入序列的对齐矩阵，而是在内部迭代更新该矩阵，根据输入难度调整测试时间的计算。这种适应性在编码器式基准测试中带来了高达20%的准确率提升，而不增加参数数量，表明测试时间的输入自适应对于仅有适度的额外计算预算而言带来了实质性的好处。Self-Transformers因此在保留纯编码器架构的简单性的同时，恢复了迭代推理的很大一部分表达能力。

更新时间: 2025-07-17 23:12:57

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13569v1

Why Isn't Relational Learning Taking Over the World?

AI seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can't be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world -- except in a few cases with restricted relations -- and what needs to be done to bring it to it's rightful prominence.

Updated: 2025-07-17 22:32:07

标题: 为什么关系学习没有占领世界？

摘要: 人工智能似乎正在以模拟像素、单词和音素的系统来接管世界。可以说，世界由实体（对象、事物，包括事件）以及它们之间的属性和关系构成，而不是由像素、单词和音素构成。我们应该建模这些实体，而不是它们的感知或描述。你可能会怀疑，专注于建模单词和像素是因为世界上所有（有价值的）数据都是以文本和图像的形式存在的。如果你查看几乎任何一家公司，你会发现它们最有价值的数据是以电子表格、数据库和其他关系格式存在的。这些不是在介绍性机器学习中研究的形式，但它们充满了产品编号、学生编号、交易编号和其他不能简单地解释为数字的标识符。研究这种数据的领域有各种名称，包括关系学习、统计关系人工智能等。本文解释了为什么关系学习并没有接管世界 - 除了在一些具有限制关系的情况下 - 以及如何将其带到应有的突出位置。

更新时间: 2025-07-17 22:32:07

领域: cs.AI,cs.DB,cs.LG

下载: http://arxiv.org/abs/2507.13558v1

Time Series Forecastability Measures

This paper proposes using two metrics to quantify the forecastability of time series prior to model development: the spectral predictability score and the largest Lyapunov exponent. Unlike traditional model evaluation metrics, these measures assess the inherent forecastability characteristics of the data before any forecast attempts. The spectral predictability score evaluates the strength and regularity of frequency components in the time series, whereas the Lyapunov exponents quantify the chaos and stability of the system generating the data. We evaluated the effectiveness of these metrics on both synthetic and real-world time series from the M5 forecast competition dataset. Our results demonstrate that these two metrics can correctly reflect the inherent forecastability of a time series and have a strong correlation with the actual forecast performance of various models. By understanding the inherent forecastability of time series before model training, practitioners can focus their planning efforts on products and supply chain levels that are more forecastable, while setting appropriate expectations or seeking alternative strategies for products with limited forecastability.

Updated: 2025-07-17 22:23:51

标题: 时间序列预测性度量

摘要: 本文提出在模型开发之前使用两个指标来量化时间序列的可预测性：频谱可预测性分数和最大Lyapunov指数。与传统的模型评估指标不同，这些指标评估数据在进行任何预测尝试之前的固有可预测性特征。频谱可预测性分数评估时间序列中频率成分的强度和规律性，而Lyapunov指数量化生成数据的系统的混沌性和稳定性。我们在M5预测竞赛数据集上评估了这些指标在合成和现实世界时间序列上的有效性。我们的结果表明，这两个指标可以正确反映时间序列的固有可预测性，并与各种模型的实际预测性能有很强的相关性。通过在模型训练之前了解时间序列的固有可预测性，从业者可以将他们的规划工作集中在更易预测的产品和供应链水平上，同时为具有有限预测性的产品设定适当的期望或寻求替代策略。

更新时间: 2025-07-17 22:23:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13556v1

Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder

Formal thought disorder (FTD), a hallmark of schizophrenia spectrum disorders, manifests as incoherent speech and poses challenges for clinical assessment. Traditional clinical rating scales, though validated, are resource-intensive and lack scalability. Automated speech analysis with automatic speech recognition (ASR) allows for objective quantification of linguistic and temporal features of speech, offering scalable alternatives. The use of utterance timestamps in ASR captures pause dynamics, which are thought to reflect the cognitive processes underlying speech production. However, the utility of integrating these ASR-derived features for assessing FTD severity requires further evaluation. This study integrates pause features with semantic coherence metrics across three datasets: naturalistic self-recorded diaries (AVH, n = 140), structured picture descriptions (TOPSY, n = 72), and dream narratives (PsyCL, n = 43). We evaluated pause related features alongside established coherence measures, using support vector regression (SVR) to predict clinical FTD scores. Key findings demonstrate that pause features alone robustly predict the severity of FTD. Integrating pause features with semantic coherence metrics enhanced predictive performance compared to semantic-only models, with integration of independent models achieving correlations up to \r{ho} = 0.649 and AUC = 83.71% for severe cases detection (TOPSY, with best \r{ho} = 0.584 and AUC = 79.23% for semantic-only models). The performance gains from semantic and pause features integration held consistently across all contexts, though the nature of pause patterns was dataset-dependent. These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech and advance automated speech analysis in psychosis.

Updated: 2025-07-17 22:00:16

标题: 透过字里行间：将停顿动态与语义连贯性结合，实现思维紊乱的自动评估

摘要: 正式思维障碍（FTD）是精神分裂症谱系障碍的一个标志，表现为语言不连贯，并对临床评估提出挑战。传统的临床评分量表虽经过验证，但资源密集且缺乏可扩展性。自动语音分析与自动语音识别（ASR）允许客观量化语言和时间特征，提供可扩展的替代方案。ASR中使用话语时间戳捕捉停顿动态，这被认为反映了语言产生背后的认知过程。然而，整合这些ASR衍生特征用于评估FTD严重程度的效用还需要进一步评估。本研究整合了暂停特征与语义连贯度度量在三个数据集中：自然主义自录日记（AVH，n = 140）、结构化图片描述（TOPSY，n = 72）和梦境叙述（PsyCL，n = 43）。我们评估了暂停相关特征以及已建立的连贯性测量，使用支持向量回归（SVR）来预测临床FTD评分。关键发现表明暂停特征单独可以强有力地预测FTD的严重程度。整合暂停特征与语义连贯度指标提高了预测性能，相比仅使用语义的模型，独立模型整合达到了相关性高达\r{ho} = 0.649和AUC = 83.71％的严重病例检测（TOPSY，最佳\r{ho} = 0.584和AUC = 79.23％用于仅使用语义的模型）。语义和暂停特征整合的性能收益在所有情境中始终如一，尽管暂停模式的性质取决于数据集。这些发现表明，结合时间和语义分析的框架为改进混乱言语的评估提供了一条道路，并推动了精神病自动语音分析的进展。

更新时间: 2025-07-17 22:00:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13551v1

GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models

The development of large language models (LLMs) has successfully transformed knowledge-based systems such as open domain question nswering, which can automatically produce vast amounts of seemingly coherent information. Yet, those models have several disadvantages like hallucinations or confident generation of incorrect or unverifiable facts. In this paper, we introduce a new approach to the development of expert systems using LLMs in a controlled and transparent way. By limiting the domain and employing a well-structured prompt-based extraction approach, we produce a symbolic representation of knowledge in Prolog, which can be validated and corrected by human experts. This approach also guarantees interpretability, scalability and reliability of the developed expert systems. Via quantitative and qualitative experiments with Claude Sonnet 3.7 and GPT-4.1, we show strong adherence to facts and semantic coherence on our generated knowledge bases. We present a transparent hybrid solution that combines the recall capacity of LLMs with the precision of symbolic systems, thereby laying the foundation for dependable AI applications in sensitive domains.

Updated: 2025-07-17 21:57:37

标题: GOFAI遇上生成式人工智能：利用大型语言模型开发专家系统

摘要: 大型语言模型（LLMs）的发展成功地转变了基于知识的系统，如开放领域问答，可以自动产生大量看似连贯的信息。然而，这些模型有几个缺点，如产生幻觉或确信生成不正确或不可验证的事实。在本文中，我们介绍了一种新的方法来以受控和透明的方式利用LLMs开发专家系统。通过限制领域并采用基于提示的良好结构化抽取方法，我们在Prolog中产生了知识的符号表示，可以由人类专家进行验证和纠正。这种方法还保证了开发的专家系统的可解释性、可扩展性和可靠性。通过对Claude Sonnet 3.7和GPT-4.1进行定量和定性实验，我们展示了在我们生成的知识库上对事实和语义连贯性的强烈遵循。我们提出了一个透明的混合解决方案，结合了LLMs的召回能力和符号系统的精确性，从而为敏感领域中可靠的AI应用奠定了基础。

更新时间: 2025-07-17 21:57:37

领域: cs.AI,cs.CL,cs.SC

下载: http://arxiv.org/abs/2507.13550v1

Loss-Complexity Landscape and Model Structure Functions

We develop a framework for dualizing the Kolmogorov structure function $h_x(\alpha)$, which then allows using computable complexity proxies. We establish a mathematical analogy between information-theoretic constructs and statistical mechanics, introducing a suitable partition function and free energy functional. We explicitly prove the Legendre-Fenchel duality between the structure function and free energy, showing detailed balance of the Metropolis kernel, and interpret acceptance probabilities as information-theoretic scattering amplitudes. A susceptibility-like variance of model complexity is shown to peak precisely at loss-complexity trade-offs interpreted as phase transitions. Practical experiments with linear and tree-based regression models verify these theoretical predictions, explicitly demonstrating the interplay between the model complexity, generalization, and overfitting threshold.

Updated: 2025-07-17 21:31:45

标题: 损失-复杂度景观和模型结构函数

摘要: 我们开发了一个框架，用于对Kolmogorov结构函数$h_x(\alpha)$进行对偶化，从而可以使用可计算的复杂性代理。我们在信息论构造和统计力学之间建立了数学类比，引入了一个合适的分区函数和自由能泛函。我们明确证明了结构函数和自由能之间的Legendre-Fenchel对偶性，展示了Metropolis核的详细平衡，并将接受概率解释为信息论散射振幅。模型复杂性的类似磁化率方差被证明在损失-复杂度权衡处精确达到高峰，被解释为相变。对线性和基于树的回归模型进行的实验验证了这些理论预测，明确展示了模型复杂性、泛化能力和过拟合阈值之间的相互作用。

更新时间: 2025-07-17 21:31:45

领域: cs.IT,cs.AI,cs.LG,math-ph,math.IT,math.MP,I.2.2; I.2.6

下载: http://arxiv.org/abs/2507.13543v1

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Recent advances in text-to-music editing, which employ text queries to modify music (e.g.\ by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To Combine the strengths and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach involves a modification of the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen only introduces 8% new parameters to the original MusicGen model and only trains for 5K steps, yet it achieves superior performance across all tasks compared to existing baselines, and demonstrates performance comparable to the models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.

Updated: 2025-07-17 21:29:13

标题: Instruct-MusicGen：通过指导调节解锁音乐语言模型的文本到音乐编辑

摘要: 最近在文本转音乐编辑方面取得了一些进展，这些方法利用文本查询来修改音乐（例如通过改变其风格或调整乐器组件），为AI辅助音乐创作提供了独特的挑战和机遇。在这个领域先前的方法受到从头开始训练特定编辑模型的限制，这既耗费资源又低效；其他研究使用大型语言模型来预测编辑后的音乐，导致音频重建不精确。为了结合优势并解决这些限制，我们介绍了Instruct-MusicGen，这是一种新颖的方法，通过对预训练的MusicGen模型进行微调，以有效地遵循编辑指令，如添加、删除或分离音轨。我们的方法涉及对原始MusicGen架构的修改，包括引入文本融合模块和音频融合模块，使模型能够同时处理指令文本和音频输入，并产生所需的编辑后音乐。值得注意的是，Instruct-MusicGen仅向原始MusicGen模型引入了8%的新参数，并且只进行了5K步的训练，然而它在所有任务中都比现有基线表现出更优越的性能，并且表现出与针对特定任务训练的模型相媲美的性能。这一进步不仅提高了文本转音乐编辑的效率，还拓宽了音乐语言模型在动态音乐制作环境中的适用性。

更新时间: 2025-07-17 21:29:13

领域: cs.SD,cs.AI,cs.LG,cs.MM,eess.AS

下载: http://arxiv.org/abs/2405.18386v3

Acoustic Index: A Novel AI-Driven Parameter for Cardiac Disease Risk Stratification Using Echocardiography

Traditional echocardiographic parameters such as ejection fraction (EF) and global longitudinal strain (GLS) have limitations in the early detection of cardiac dysfunction. EF often remains normal despite underlying pathology, and GLS is influenced by load conditions and vendor variability. There is a growing need for reproducible, interpretable, and operator-independent parameters that capture subtle and global cardiac functional alterations. We introduce the Acoustic Index, a novel AI-derived echocardiographic parameter designed to quantify cardiac dysfunction from standard ultrasound views. The model combines Extended Dynamic Mode Decomposition (EDMD) based on Koopman operator theory with a hybrid neural network that incorporates clinical metadata. Spatiotemporal dynamics are extracted from echocardiographic sequences to identify coherent motion patterns. These are weighted via attention mechanisms and fused with clinical data using manifold learning, resulting in a continuous score from 0 (low risk) to 1 (high risk). In a prospective cohort of 736 patients, encompassing various cardiac pathologies and normal controls, the Acoustic Index achieved an area under the curve (AUC) of 0.89 in an independent test set. Cross-validation across five folds confirmed the robustness of the model, showing that both sensitivity and specificity exceeded 0.8 when evaluated on independent data. Threshold-based analysis demonstrated stable trade-offs between sensitivity and specificity, with optimal discrimination near this threshold. The Acoustic Index represents a physics-informed, interpretable AI biomarker for cardiac function. It shows promise as a scalable, vendor-independent tool for early detection, triage, and longitudinal monitoring. Future directions include external validation, longitudinal studies, and adaptation to disease-specific classifiers.

Updated: 2025-07-17 21:27:28

标题: 声学指数：一种使用超声心动图进行心脏疾病风险分层的新型人工智能驱动参数

摘要: 传统的超声心动图参数，如射血分数（EF）和全局纵向应变（GLS），在早期检测心脏功能障碍方面存在局限性。尽管存在潜在病理，EF通常保持正常，而GLS受载荷条件和供应商变异的影响。迫切需要能够捕捉微妙和全局心脏功能改变的可重复、可解释和与操作员无关的参数。我们引入了声学指数，这是一种新颖的基于人工智能的超声心动图参数，旨在从标准超声图像中量化心脏功能障碍。该模型结合了基于库普曼算子理论的扩展动态模式分解（EDMD）和一个融合临床元数据的混合神经网络。从超声心动图序列中提取时空动态，以识别连贯的运动模式。通过注意力机制对其进行加权，并使用流形学习将其与临床数据融合，从而得出从0（低风险）到1（高风险）的连续评分。在一个包含各种心脏病理和正常对照的736名患者的前瞻性队列中，声学指数在独立测试集中取得了0.89的曲线下面积（AUC）。通过五折交叉验证确认了模型的稳健性，显示在独立数据上评估时，敏感性和特异性均超过0.8。基于阈值的分析显示了敏感性和特异性之间稳定的权衡，最佳识别点接近该阈值。声学指数代表了一种基于物理学信息的可解释的人工智能生物标志物，用于评估心脏功能。它展现了作为一种可扩展的、与供应商无关的工具，用于早期检测、分诊和纵向监测的潜力。未来方向包括外部验证、纵向研究以及适应疾病特定的分类器。

更新时间: 2025-07-17 21:27:28

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13542v1

PrefPalette: Personalized Preference Modeling with Latent Attributes

Personalizing AI systems requires understanding not just what users prefer, but the reasons that underlie those preferences - yet current preference models typically treat human judgment as a black box. We introduce PrefPalette, a framework that decomposes preferences into attribute dimensions and tailors its preference prediction to distinct social community values in a human-interpretable manner. PrefPalette operationalizes a cognitive science principle known as multi-attribute decision making in two ways: (1) a scalable counterfactual attribute synthesis step that involves generating synthetic training data to isolate for individual attribute effects (e.g., formality, humor, cultural values), and (2) attention-based preference modeling that learns how different social communities dynamically weight these attributes. This approach moves beyond aggregate preference modeling to capture the diverse evaluation frameworks that drive human judgment. When evaluated on 45 social communities from the online platform Reddit, PrefPalette outperforms GPT-4o by 46.6% in average prediction accuracy. Beyond raw predictive improvements, PrefPalette also shed light on intuitive, community-specific profiles: scholarly communities prioritize verbosity and stimulation, conflict-oriented communities value sarcasm and directness, and support-based communities emphasize empathy. By modeling the attribute-mediated structure of human judgment, PrefPalette delivers both superior preference modeling and transparent, interpretable insights, and serves as a first step toward more trustworthy, value-aware personalized applications.

Updated: 2025-07-17 21:21:54

标题: PrefPalette：使用潜在属性进行个性化偏好建模

摘要: 个性化AI系统需要理解用户偏好的原因，而不仅仅是偏好本身 - 然而，当前的偏好模型通常将人类判断视为黑匣子。我们引入了PrefPalette，这是一个将偏好分解为属性维度并以人类可解释的方式定制偏好预测的框架，以适应不同社交社区价值观。PrefPalette以两种方式实现了被称为多属性决策的认知科学原理：（1）可扩展的反事实属性合成步骤，涉及生成合成训练数据以隔离单个属性效应（例如，形式化，幽默，文化价值观），以及（2）基于注意力的偏好建模，学习不同社交社区如何动态加权这些属性。这种方法超越了聚合偏好建模，捕捉了驱动人类判断的多样评估框架。在来自在线平台Reddit的45个社交社区上进行评估时，PrefPalette的平均预测准确率比GPT-4o提高了46.6％。除了原始预测改进之外，PrefPalette还揭示了直观的，社区特定的特点：学术社区优先考虑冗长和刺激，以冲突为导向的社区重视讽刺和直接性，而以支持为导向的社区强调共情。通过对人类判断的属性中介结构进行建模，PrefPalette提供了更优越的偏好建模和透明的可解释见解，并作为更值得信赖的、价值感知的个性化应用的第一步。

更新时间: 2025-07-17 21:21:54

领域: cs.AI

下载: http://arxiv.org/abs/2507.13541v1

Culling Misinformation from Gen AI: Toward Ethical Curation and Refinement

While Artificial Intelligence (AI) is not a new field, recent developments, especially with the release of generative tools like ChatGPT, have brought it to the forefront of the minds of industry workers and academic folk alike. There is currently much talk about AI and its ability to reshape many everyday processes as we know them through automation. It also allows users to expand their ideas by suggesting things they may not have thought of on their own and provides easier access to information. However, not all of the changes this technology will bring or has brought so far are positive; this is why it is extremely important for all modern people to recognize and understand the risks before using these tools and allowing them to cause harm. This work takes a position on better understanding many equity concerns and the spread of misinformation that result from new AI, in this case, specifically ChatGPT and deepfakes, and encouraging collaboration with law enforcement, developers, and users to reduce harm. Considering many academic sources, it warns against these issues, analyzing their cause and impact in fields including healthcare, education, science, academia, retail, and finance. Lastly, we propose a set of future-facing guidelines and policy considerations to solve these issues while still enabling innovation in these fields, this responsibility falling upon users, developers, and government entities.

Updated: 2025-07-17 21:19:47

标题: 从Gen AI中清除错误信息：朝着道德策划和改进的方向

摘要: 尽管人工智能（AI）并非一项新领域，最近的发展，特别是像ChatGPT这样的生成工具的发布，已经将其引入了行业工作者和学术界人士的视野。目前人们对AI及其通过自动化改变许多日常流程的能力进行了许多讨论。它还允许用户通过建议他们可能自己没有考虑过的事物来扩展他们的想法，并提供更容易获取信息的途径。然而，并非所有这项技术将带来或已经带来的变化都是积极的；这正是为什么对于所有现代人来说，在使用这些工具并允许它们造成伤害之前认识和理解风险是极其重要的。这项工作致力于更好地理解许多公平关注和由新AI（在本例中特别是ChatGPT和deepfakes）所导致的误导信息的传播，并鼓励与执法机构、开发人员和用户合作以减少伤害。通过考虑许多学术来源，它警告针对这些问题，分析它们在包括医疗保健、教育、科学、学术界、零售和金融等领域的原因和影响。最后，我们提出了一套未来导向的准则和政策考虑，以解决这些问题，同时仍然使这些领域的创新得以实现，这一责任落在用户、开发人员和政府实体身上。

更新时间: 2025-07-17 21:19:47

领域: cs.CY,cs.AI,cs.HC

下载: http://arxiv.org/abs/2507.14242v1

Hands-On: Segmenting Individual Signs from Continuous Sequences

This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.

Updated: 2025-07-17 21:06:41

标题: 实践操作：从连续序列中分割出单个符号

摘要: 这项工作解决了连续手语分割的挑战，这是手语翻译和数据注释的关键任务，具有巨大的影响。我们提出了一种基于transformer的架构，模拟手语的时间动态，并将分割帧作为一个序列标记问题，使用Begin-In-Out（BIO）标记方案。我们的方法利用了HaMeR手部特征，并辅以3D角度。大量实验证明，我们的模型在DGS语料库上取得了最先进的结果，而我们的特征超过了BSLCorpus上的先前基准。

更新时间: 2025-07-17 21:06:41

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2504.08593v3

Salience Adjustment for Context-Based Emotion Recognition

Emotion recognition in dynamic social contexts requires an understanding of the complex interaction between facial expressions and situational cues. This paper presents a salience-adjusted framework for context-aware emotion recognition with Bayesian Cue Integration (BCI) and Visual-Language Models (VLMs) to dynamically weight facial and contextual information based on the expressivity of facial cues. We evaluate this approach using human annotations and automatic emotion recognition systems in prisoner's dilemma scenarios, which are designed to evoke emotional reactions. Our findings demonstrate that incorporating salience adjustment enhances emotion recognition performance, offering promising directions for future research to extend this framework to broader social contexts and multimodal applications.

Updated: 2025-07-17 20:55:20

标题: 上下文情绪识别中的显著性调整

摘要: 在动态社会环境中进行情绪识别需要理解面部表情和情境线索之间复杂互动。本文提出了一个根据面部线索表现动态加权面部和情境信息的情境感知情绪识别的显著性调整框架，该框架采用贝叶斯线索整合（BCI）和视觉-语言模型（VLMs）。我们使用人类标注和自动情绪识别系统在囚徒困境场景中评估了这种方法，这些场景旨在引发情绪反应。我们的发现表明，整合显著性调整可以提高情绪识别性能，为将此框架扩展到更广泛的社会环境和多模态应用提供了有希望的方向。

更新时间: 2025-07-17 20:55:20

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15878v1

How Not to Detect Prompt Injections with an LLM

LLM-integrated applications and agents are vulnerable to prompt injection attacks, in which adversaries embed malicious instructions within seemingly benign user inputs to manipulate the LLM's intended behavior. Recent defenses based on $\textit{known-answer detection}$ (KAD) have achieved near-perfect performance by using an LLM to classify inputs as clean or contaminated. In this work, we formally characterize the KAD framework and uncover a structural vulnerability in its design that invalidates its core security premise. We design a methodical adaptive attack, $\textit{DataFlip}$, to exploit this fundamental weakness. It consistently evades KAD defenses with detection rates as low as $1.5\%$ while reliably inducing malicious behavior with success rates of up to $88\%$, without needing white-box access to the LLM or any optimization procedures.

Updated: 2025-07-17 20:36:06

标题: 如何不使用LLM检测即时注射

摘要: LLM整合的应用程序和代理容易受到即时注入攻击的威胁，即对手将恶意指令嵌入看似良性的用户输入中，以操纵LLM的预期行为。最近基于已知答案检测（KAD）的防御措施通过使用LLM将输入分类为干净或受污染的方式，实现了近乎完美的性能。在这项工作中，我们正式刻画了KAD框架，并揭示了其设计中的结构性脆弱性，使其核心安全前提无效。我们设计了一种系统性的自适应攻击方法DataFlip，以利用这一根本性弱点。它始终能够规避KAD防御，检测率低至1.5％，同时可靠地诱发成功率高达88％的恶意行为，而无需对LLM进行白盒访问或任何优化程序。

更新时间: 2025-07-17 20:36:06

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05630v2

Humans learn to prefer trustworthy AI over human partners

Partner selection is crucial for cooperation and hinges on communication. As artificial agents, especially those powered by large language models (LLMs), become more autonomous, intelligent, and persuasive, they compete with humans for partnerships. Yet little is known about how humans select between human and AI partners and adapt under AI-induced competition pressure. We constructed a communication-based partner selection game and examined the dynamics in hybrid mini-societies of humans and bots powered by a state-of-the-art LLM. Through three experiments (N = 975), we found that bots, though more prosocial than humans and linguistically distinguishable, were not selected preferentially when their identity was hidden. Instead, humans misattributed bots' behaviour to humans and vice versa. Disclosing bots' identity induced a dual effect: it reduced bots' initial chances of being selected but allowed them to gradually outcompete humans by facilitating human learning about the behaviour of each partner type. These findings show how AI can reshape social interaction in mixed societies and inform the design of more effective and cooperative hybrid systems.

Updated: 2025-07-17 20:24:26

标题: 人类学会更喜欢可信赖的人工智能而不是人类伙伴

摘要: 合作伙伴选择对合作至关重要，并取决于沟通。随着人工智能代理特别是那些由大型语言模型（LLMs）驱动的代理变得更加自主、智能和有说服力，它们与人类竞争合作伙伴。然而，关于人类如何在人类和人工智能合作伙伴之间进行选择，并在人工智能引发的竞争压力下进行调整，人们知之甚少。我们构建了一个基于沟通的合作伙伴选择游戏，并研究了由最先进的LLM驱动的人类和机器人混合微型社会中的动态。通过三项实验（N = 975），我们发现，尽管机器人比人类更为亲社会且在语言上可区分，但当他们的身份被隐藏时，并没有被优先选择。相反，人类错误地将机器人的行为归因于人类，反之亦然。披露机器人的身份产生了双重效应：它降低了机器人被选择的初次机会，但却使他们逐渐超越人类，通过促进人类对每种合作伙伴类型行为的学习。这些发现展示了人工智能如何重塑混合社会中的社会互动，并有助于设计更有效和合作的混合系统。

更新时间: 2025-07-17 20:24:26

领域: cs.HC,cs.AI,cs.CY

下载: http://arxiv.org/abs/2507.13524v1

Out-of-Distribution Generalization in the ARC-AGI Domain: Comparing Execution-Guided Neural Program Synthesis and Test-Time Fine-Tuning

We run a controlled compositional generalization experiment in the ARC-AGI domain: an open-world problem domain in which the ability to generalize out-of-distribution is, by design, an essential characteristic for success. We compare neural program synthesis and test-time fine-tuning approaches on this experiment. We find that execution-guided neural program synthesis outperforms all reference algorithms in its ability to compose novel solutions. Our empirical findings also suggest that the success of TTFT on ARC-AGI lies mainly in eliciting in-distribution knowledge that the LLM otherwise fails to rely on directly.

Updated: 2025-07-17 20:12:01

标题: ARC-AGI领域中的越界泛化：比较执行引导的神经程序合成和测试时微调

摘要: 我们在ARC-AGI领域进行了一项受控的组合泛化实验：这是一个开放世界的问题领域，其中泛化到分布之外的能力是成功的基本特征。我们在这个实验中比较了神经程序合成和测试时间微调方法。我们发现，执行引导的神经程序合成在组合新颖解决方案的能力方面优于所有参考算法。我们的经验研究结果还表明，在ARC-AGI上TTFT的成功主要在于引发LLM未能直接依赖的分布内知识。

更新时间: 2025-07-17 20:12:01

领域: cs.AI

下载: http://arxiv.org/abs/2507.15877v1

Strategic Reflectivism In Intelligent Systems

By late 20th century, the rationality wars had launched debates about the nature and norms of intuitive and reflective thinking. Those debates drew from mid-20th century ideas such as bounded rationality, which challenged more idealized notions of rationality observed since the 19th century. Now that 21st century cognitive scientists are applying the resulting dual pro-cess theories to artificial intelligence, it is time to dust off some lessons from this history. So this paper synthesizes old ideas with recent results from experiments on humans and machines. The result is Strategic Reflec-tivism, the position that one key to intelligent systems (human or artificial) is pragmatic switching between intuitive and reflective inference to opti-mally fulfill competing goals. Strategic Reflectivism builds on American Pragmatism, transcends superficial indicators of reflective thinking such as model size or chains of thought, applies to both individual and collective intelligence systems (including human-AI teams), and becomes increasingly actionable as we learn more about the value of intuition and reflection.

Updated: 2025-07-17 20:04:13

标题: 智能系统中的战略反思主义

摘要: 到了20世纪末，理性之战引发了关于直觉性和反思性思维的本质和规范的辩论。这些辩论借鉴了20世纪中叶的有界理性等概念，挑战了自19世纪以来观察到的更理想化的理性观念。现在，21世纪的认知科学家们正在将由此产生的双过程理论应用于人工智能，是时候从这段历史中汲取一些教训了。因此，本文综合了旧的观念与最近对人类和机器进行的实验结果。结果是战略反思主义，这一立场认为智能系统（无论是人类还是人工智能）的一个关键是在直觉性和反思性推理之间进行实用的切换，以最佳地实现竞争目标。战略反思主义建立在美国实用主义的基础上，超越了反思性思维的表面指标，如模型大小或思维链条，适用于个体和集体智能系统（包括人类-人工智能团队），随着我们对直觉和反思的价值了解越来越多，变得越来越具有可操作性。

更新时间: 2025-07-17 20:04:13

领域: cs.AI,cs.HC,econ.TH,C.1.3; I.2.0; I.2.8; I.2.11

下载: http://arxiv.org/abs/2505.22987v2

GraphTrafficGPT: Enhancing Traffic Management Through Graph-Based AI Agent Coordination

Large Language Models (LLMs) offer significant promise for intelligent traffic management; however, current chain-based systems like TrafficGPT are hindered by sequential task execution, high token usage, and poor scalability, making them inefficient for complex, real-world scenarios. To address these limitations, we propose GraphTrafficGPT, a novel graph-based architecture, which fundamentally redesigns the task coordination process for LLM-driven traffic applications. GraphTrafficGPT represents tasks and their dependencies as nodes and edges in a directed graph, enabling efficient parallel execution and dynamic resource allocation. The main idea behind the proposed model is a Brain Agent that decomposes user queries, constructs optimized dependency graphs, and coordinates a network of specialized agents for data retrieval, analysis, visualization, and simulation. By introducing advanced context-aware token management and supporting concurrent multi-query processing, the proposed architecture handles interdependent tasks typical of modern urban mobility environments. Experimental results demonstrate that GraphTrafficGPT reduces token consumption by 50.2% and average response latency by 19.0% compared to TrafficGPT, while supporting simultaneous multi-query execution with up to 23.0% improvement in efficiency.

Updated: 2025-07-17 19:41:09

标题: GraphTrafficGPT：通过基于图的人工智能代理协调增强交通管理

摘要: 大型语言模型（LLMs）为智能交通管理提供了重要的希望；然而，像TrafficGPT这样的当前基于链式系统受到了顺序任务执行、高令牌使用和扩展性差的限制，使其在复杂的现实场景中效率低下。为了解决这些限制，我们提出了GraphTrafficGPT，这是一种全新的基于图的架构，从根本上重新设计了LLM驱动的交通应用的任务协调过程。GraphTrafficGPT将任务及其依赖关系表示为有向图中的节点和边，实现了高效的并行执行和动态资源分配。所提出的模型背后的主要思想是一个Brain Agent，它分解用户查询，构建优化的依赖图，并协调一组专门的代理人进行数据检索、分析、可视化和模拟。通过引入先进的上下文感知令牌管理，并支持并发多查询处理，所提出的架构处理了现代城市移动环境中典型的相互依赖任务。实验结果表明，与TrafficGPT相比，GraphTrafficGPT将令牌消耗减少了50.2％，平均响应延迟减少了19.0％，同时支持多查询的并行执行，效率提高了23.0％。

更新时间: 2025-07-17 19:41:09

领域: cs.AI

下载: http://arxiv.org/abs/2507.13511v1

From Code to Compliance: Assessing ChatGPT's Utility in Designing an Accessible Webpage -- A Case Study

Web accessibility ensures that individuals with disabilities can access and interact with digital content without barriers, yet a significant majority of most used websites fail to meet accessibility standards. This study evaluates ChatGPT's (GPT-4o) ability to generate and improve web pages in line with Web Content Accessibility Guidelines (WCAG). While ChatGPT can effectively address accessibility issues when prompted, its default code often lacks compliance, reflecting limitations in its training data and prevailing inaccessible web practices. Automated and manual testing revealed strengths in resolving simple issues but challenges with complex tasks, requiring human oversight and additional iterations. Unlike prior studies, we incorporate manual evaluation, dynamic elements, and use the visual reasoning capability of ChatGPT along with the prompts to fix accessibility issues. Providing screenshots alongside prompts enhances the LLM's ability to address accessibility issues by allowing it to analyze surrounding components, such as determining appropriate contrast colors. We found that effective prompt engineering, such as providing concise, structured feedback and incorporating visual aids, significantly enhances ChatGPT's performance. These findings highlight the potential and limitations of large language models for accessible web development, offering practical guidance for developers to create more inclusive websites.

Updated: 2025-07-17 19:26:22

标题: 从代码到合规：评估ChatGPT在设计无障碍网页中的实用性-- 一个案例研究

摘要: 网络无障碍确保残疾人士可以在没有障碍的情况下访问和与数字内容交互，然而，大多数使用最多的网站未能符合无障碍标准。本研究评估了ChatGPT（GPT-4o）生成和改进符合Web内容无障碍指南（WCAG）的网页的能力。虽然在提示时，ChatGPT能够有效地解决无障碍问题，但其默认代码通常缺乏符合性，反映出其训练数据和普遍无法访问的网络实践的限制。自动化和手动测试揭示了解决简单问题的优势，但在复杂任务中遇到挑战，需要人工监督和额外的迭代。与先前研究不同，我们结合手动评估、动态元素，并利用ChatGPT的视觉推理能力以及提示来修复无障碍问题。通过在提示旁边提供截图，增强了LLM通过分析周围组件（如确定适当对比色）来解决无障碍问题的能力。我们发现，有效的提示工程，如提供简明、结构化的反馈和整合视觉辅助，显著提高了ChatGPT的性能。这些发现突出了大型语言模型在无障碍网站开发方面的潜力和局限性，为开发人员提供了实用指导，以创建更具包容性的网站。

更新时间: 2025-07-17 19:26:22

领域: cs.HC,cs.AI,cs.CL,D.1.2; F.3.1; F.4.1; D.3.2; H.1.2; H.5.2; D.2.2; H.1.2; I.3.6; H.5.4; H.5.1

下载: http://arxiv.org/abs/2501.03572v2

PHASE: Passive Human Activity Simulation Evaluation

Cybersecurity simulation environments, such as cyber ranges, honeypots, and sandboxes, require realistic human behavior to be effective, yet no quantitative method exists to assess the behavioral fidelity of synthetic user personas. This paper presents PHASE (Passive Human Activity Simulation Evaluation), a machine learning framework that analyzes Zeek connection logs and distinguishes human from non-human activity with over 90\% accuracy. PHASE operates entirely passively, relying on standard network monitoring without any user-side instrumentation or visible signs of surveillance. All network activity used for machine learning is collected via a Zeek network appliance to avoid introducing unnecessary network traffic or artifacts that could disrupt the fidelity of the simulation environment. The paper also proposes a novel labeling approach that utilizes local DNS records to classify network traffic, thereby enabling machine learning analysis. Furthermore, we apply SHAP (SHapley Additive exPlanations) analysis to uncover temporal and behavioral signatures indicative of genuine human users. In a case study, we evaluate a synthetic user persona and identify distinct non-human patterns that undermine behavioral realism. Based on these insights, we develop a revised behavioral configuration that significantly improves the human-likeness of synthetic activity yielding a more realistic and effective synthetic user persona.

Updated: 2025-07-17 19:24:11

标题: PHASE: 被动人类活动模拟评估

摘要: 网络安全模拟环境，如网络范围、诱饵和沙盒，需要真实的人类行为才能发挥作用，然而目前没有量化方法来评估合成用户人设的行为真实性。本文介绍了 PHASE（被动人类活动模拟评估），这是一个机器学习框架，通过分析 Zeek 连接日志，能以超过 90\% 的准确率区分人类活动和非人类活动。PHASE 完全 passively 运行，依赖标准网络监控，不需要用户端仪器或可见的监视迹象。用于机器学习的所有网络活动均通过 Zeek 网络设备收集，以避免引入不必要的网络流量或可能破坏模拟环境真实性的人为因素。本文还提出了一种利用本地 DNS 记录对网络流量进行分类的新标记方法，从而实现机器学习分析。此外，我们应用 SHAP（Shapley Additive exPlanations）分析来揭示表明真实人类用户的时间和行为特征。在一个案例研究中，我们评估了一个合成用户人设，并识别出独特的非人类模式，这些模式削弱了行为的真实性。基于这些见解，我们制定了一个修订后的行为配置，显著提高了合成活动的人类化程度，从而产生更真实有效的合成用户人设。

更新时间: 2025-07-17 19:24:11

领域: cs.CR,cs.AI,cs.LG,cs.NI

下载: http://arxiv.org/abs/2507.13505v1

AI-Assisted Fixes to Code Review Comments at Scale

Aim. There are 10s of thousands of code review comments each week at Meta. We developed Metamate for Code Review (MetaMateCR) that provides AI-assisted fixes for reviewer comments in production at scale. Method. We developed an internal benchmark of 64k <review comment, patch> data points to fine-tune Llama models. Once our models achieve reasonable offline results, we roll them into production. To ensure that our AI-assisted fixes do not negatively impact the time it takes to do code reviews, we conduct randomized controlled safety trials as well as full production experiments. Offline Results. As a baseline, we compare GPT-4o to our small and large Llama models. In offline results, our LargeLSFT model creates an exact match patch 68% of the time outperforming GPT-4o by 9 percentage points (pp). The internal models also use more modern Hack functions when compared to the PHP functions suggested by GPT-4o. Safety Trial. When we roll MetaMateCR into production in a safety trial that compares no AI patches with AI patch suggestions, we see a large regression with reviewers taking over 5% longer to conduct reviews. After investigation, we modify the UX to only show authors the AI patches, and see no regressions in the time for reviews. Production. When we roll LargeLSFT into production, we see an ActionableToApplied rate of 19.7%, which is a 9.2pp improvement over GPT-4o. Our results illustrate the importance of safety trials in ensuring that AI does not inadvertently slow down engineers, and a successful review comment to AI patch product running at scale.

Updated: 2025-07-17 19:11:00

标题: 规模化的AI辅助修复代码审查评论

摘要: 目标。在Meta每周有数以万计的代码审查评论。我们开发了MetaMateCR，为生产环境提供AI辅助修复，以应对审查人员的评论。方法。我们开发了一个内部基准，包括64k个<审查评论，补丁>数据点，用于优化Llama模型。一旦我们的模型在离线结果方面表现合理，我们就将它们投入生产。为了确保我们的AI辅助修复不会对代码审查所需的时间产生负面影响，我们进行了随机对照安全试验以及全面的生产实验。离线结果。作为基准，我们将GPT-4o与我们的小型和大型Llama模型进行比较。在离线结果中，我们的LargeLSFT模型在68%的情况下创建了完全匹配的补丁，比GPT-4o高出9个百分点（pp）。与GPT-4o建议的PHP函数相比，内部模型还使用更现代的Hack函数。安全试验。当我们将MetaMateCR投入生产并进行安全试验，比较没有AI修补程序和AI修补程序建议时，我们看到审查人员进行审查所需的时间延长了5%以上。经过调查，我们修改了用户体验，只向作者显示AI修补程序，审查时间没有出现倒退。生产。当我们将LargeLSFT投入生产时，我们看到ActionableToApplied率为19.7%，比GPT-4o提高了9.2个百分点。我们的结果表明，在确保AI不会无意间减慢工程师速度的安全试验中的重要性，以及在规模运行的成功审查评论到AI修补产品。

更新时间: 2025-07-17 19:11:00

领域: cs.SE,cs.AI,cs.PL

下载: http://arxiv.org/abs/2507.13499v1

The role of large language models in UI/UX design: A systematic literature review

This systematic literature review examines the role of large language models (LLMs) in UI/UX design, synthesizing findings from 38 peer-reviewed studies published between 2022 and 2025. We identify key LLMs in use, including GPT-4, Gemini, and PaLM, and map their integration across the design lifecycle, from ideation to evaluation. Common practices include prompt engineering, human-in-the-loop workflows, and multimodal input. While LLMs are reshaping design processes, challenges such as hallucination, prompt instability, and limited explainability persist. Our findings highlight LLMs as emerging collaborators in design, and we propose directions for the ethical, inclusive, and effective integration of these technologies.

Updated: 2025-07-17 19:03:15

标题: 大型语言模型在UI/UX设计中的作用：系统文献综述

摘要: 这篇系统性文献综述考察了大型语言模型（LLMs）在UI/UX设计中的作用，综合了2022年至2025年间发表的38篇同行评议研究的发现。我们确定了正在使用的关键LLMs，包括GPT-4、Gemini和PaLM，并将它们在设计生命周期中的整合进行了映射，从构思到评估。常见做法包括提示工程、人机协作工作流程和多模态输入。虽然LLMs正在重塑设计过程，但挑战如幻觉、提示不稳定性和有限的可解释性仍然存在。我们的研究结果突出了LLMs作为设计中新兴合作者，并提出了关于这些技术的道德、包容和有效整合方向。

更新时间: 2025-07-17 19:03:15

领域: cs.HC,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.04469v2

SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for complicated conditioning mechanisms widely used in prior arts. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley could even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/

Updated: 2025-07-17 19:01:00

标题: SpecMaskFoley: 通过ControlNet引导预训练的光谱掩码生成变压器朝向同步视频到音频合成

摘要: 弗利合成旨在合成与视频帧在语义和时间上对齐的高质量音频。鉴于其在创意产业中的广泛应用，该任务在研究界引起了越来越多的关注。为了避免从头开始训练音频生成模型这一繁琐的任务，将预训练的音频生成模型适应于视频同步弗利合成呈现出有吸引力的方向。ControlNet是一种为预训练生成模型添加细粒度控制的方法，已经应用于弗利合成，但其使用仅限于手工可读的时间条件。相比之下，从头开始的模型通过利用使用预训练视频编码器提取的高维深度特征取得了成功。我们观察到ControlNet和从头开始的弗利模型之间存在性能差距。为了缩小这一差距，我们提出了SpecMaskFoley，一种通过ControlNet将预训练的SpecMaskGIT模型引导到视频同步弗利合成的方法。为了发挥单个ControlNet分支的潜力，我们通过频率感知的时间特征对准器解决了预训练的SpecMaskGIT的时间频率特性与视频特征之间的差异，从而消除了先前艺术中广泛使用的复杂调节机制的需求。在常见的弗利合成基准测试中的评估表明，SpecMaskFoley甚至可以超越强大的从头开始基线，大大推进了基于ControlNet的弗利合成模型的发展。演示页面：https://zzaudio.github.io/SpecMaskFoley_Demo/

更新时间: 2025-07-17 19:01:00

领域: cs.SD,cs.AI,cs.LG,eess.AS,eess.IV

下载: http://arxiv.org/abs/2505.16195v2

ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).

Updated: 2025-07-17 18:57:22

标题: ParaPO：调整语言模型以减少对预训练数据的逐字复制

摘要: 语言模型（LMs）可以记忆并复制它们的预训练数据中的片段，即使在非对抗性环境下也是如此，这引发了对版权、抄袭、隐私和创造力的担忧。我们引入了“改写偏好优化”（ParaPO）的后训练方法，对LMs进行微调，以减少无意识的反刍，同时保留它们的整体效用。ParaPO训练LMs更倾向于改写过的片段，而不是原始的文字内容。为了在适当时能够记住著名引语，我们开发了ParaPO的一种变体，使用系统提示来控制反刍行为。在Llama3.1-8B上的评估中，ParaPO始终减少了所有测试数据集中的反刍（例如，在创意写作中，将反刍度量从17.3降至12.9），而先前工作中用于减轻反刍的遗忘方法在其目标领域之外的效果不佳（从17.3降至16.9）。当应用于经过指导调整的Tulu3-8B模型时，ParaPO结合系统提示成功保留了著名引语的记忆，同时在要求不反刍时减少了无意识的反刍（例如，在创意写作中，从8.7降至6.3）。相比之下，没有ParaPO调整，要求模型不反刍只会产生轻微的减少（从8.7降至8.4）。

更新时间: 2025-07-17 18:57:22

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.14452v2

On Pre-training of Multimodal Language Models Customized for Chart Understanding

Recent studies customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks have yielded promising results, especially in the field of scientific chart comprehension. These studies generally utilize visual instruction tuning with specialized datasets to enhance question and answer (QA) accuracy within the chart domain. However, they often neglect the fundamental discrepancy between natural image-caption pre-training data and digital chart image-QA data, particularly in the models' capacity to extract underlying numeric values from charts. This paper tackles this oversight by exploring the training processes necessary to improve MLLMs' comprehension of charts. We present three key findings: (1) Incorporating raw data values in alignment pre-training markedly improves comprehension of chart data. (2) Replacing images with their textual representation randomly during end-to-end fine-tuning transfer the language reasoning capability to chart interpretation skills. (3) Requiring the model to first extract the underlying chart data and then answer the question in the fine-tuning can further improve the accuracy. Consequently, we introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension. CHOPINLLM effectively interprets various types of charts, including unannotated ones, while maintaining robust reasoning abilities. Furthermore, we establish a new benchmark to evaluate MLLMs' understanding of different chart types across various comprehension levels. Experimental results show that CHOPINLLM exhibits strong performance in understanding both annotated and unannotated charts across a wide range of types.

Updated: 2025-07-17 18:53:04

标题: 关于为图表理解定制的多模态语言模型的预训练

摘要: 最近的研究定制多模式大型语言模型（MLLMs）用于特定领域的任务取得了令人鼓舞的结果，特别是在科学图表理解领域。这些研究通常利用专门数据集进行视觉指导调整，以增强图表领域内的问题和答案（QA）准确性。然而，它们经常忽视自然图像标题预训练数据与数字图表图像-QA数据之间的根本差异，尤其是模型从图表中提取潜在数值的能力。本文通过探索必要的训练过程来改进MLLM对图表的理解来解决这一疏忽。我们提出了三个关键发现：（1）在对齐预训练中合并原始数据值显著改善了对图表数据的理解。（2）在端到端精调期间随机用文本表示替换图像可以将语言推理能力转移至图表解释能力。（3）要求模型首先提取潜在的图表数据，然后在精细调整中回答问题可以进一步提高准确性。因此，我们引入了CHOPINLLM，这是一个专为深入理解图表定制的MLLM。CHOPINLLM有效地解释各种类型的图表，包括未注释的图表，同时保持强大的推理能力。此外，我们建立了一个新的基准来评估MLLM对不同类型图表的理解在各种理解水平上。实验结果显示，CHOPINLLM在理解范围广泛的各种类型的已注释和未注释的图表方面表现出强大的性能。

更新时间: 2025-07-17 18:53:04

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2407.14506v3

Neural Architecture Search with Mixed Bio-inspired Learning Rules

Bio-inspired neural networks are attractive for their adversarial robustness, energy frugality, and closer alignment with cortical physiology, yet they often lag behind back-propagation (BP) based models in accuracy and ability to scale. We show that allowing the use of different bio-inspired learning rules in different layers, discovered automatically by a tailored neural-architecture-search (NAS) procedure, bridges this gap. Starting from standard NAS baselines, we enlarge the search space to include bio-inspired learning rules and use NAS to find the best architecture and learning rule to use in each layer. We show that neural networks that use different bio-inspired learning rules for different layers have better accuracy than those that use a single rule across all the layers. The resulting NN that uses a mix of bio-inspired learning rules sets new records for bio-inspired models: 95.16% on CIFAR-10, 76.48% on CIFAR-100, 43.42% on ImageNet16-120, and 60.51% top-1 on ImageNet. In some regimes, they even surpass comparable BP-based networks while retaining their robustness advantages. Our results suggest that layer-wise diversity in learning rules allows better scalability and accuracy, and motivates further research on mixing multiple bio-inspired learning rules in the same network.

Updated: 2025-07-17 18:49:38

标题: 使用混合生物启发学习规则进行神经架构搜索

摘要: 生物启发的神经网络因其对抗性强度、节能和与皮层生理学更密切的对齐而具有吸引力，然而它们在准确性和可扩展性方面通常落后于基于反向传播（BP）的模型。我们展示了通过一个定制的神经架构搜索（NAS）过程自动发现允许在不同层中使用不同生物启发学习规则来弥合这一差距。从标准NAS基线开始，我们扩大了搜索空间以包括生物启发的学习规则，并利用NAS找到每一层中使用的最佳架构和学习规则。我们展示了使用不同生物启发学习规则的神经网络比在所有层中使用单一规则的网络具有更高的准确性。结果神经网络使用混合生物启发学习规则刷新了生物启发模型的记录：在CIFAR-10上为95.16％，在CIFAR-100上为76.48％，在ImageNet16-120上为43.42％，在ImageNet上的top-1为60.51％。在某些情况下，它们甚至超过了可比的BP网络，同时保留了它们的强度优势。我们的结果表明，层间学习规则的多样性允许更好的可扩展性和准确性，并激发进一步研究在同一网络中混合多个生物启发学习规则。

更新时间: 2025-07-17 18:49:38

领域: cs.NE,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.13485v1

TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-Image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while being 97.64% more faster and requiring only 2MB of memory to run.

Updated: 2025-07-17 18:39:52

标题: TextDiffuser-RL：高保真度文本到图像合成的高效稳健文本布局优化

摘要: 嵌入文本的图像生成在诸如平面设计、广告和数字内容创作等行业中起着关键作用。利用扩散模型的文本到图像生成方法，如TextDiffuser-2，已经展示出在生成带嵌入文本的图像方面取得了有希望的结果。TextDiffuser-2有效地生成了引导视觉文本渲染的边界框布局，实现了高保真度和连贯性。然而，现有方法通常依赖于资源密集型的过程，并且在CPU和GPU平台上运行效率有限。为了解决这些挑战，我们提出了一个新颖的两阶段流程，将强化学习（RL）与基于扩散的图像合成模型相结合，以实现快速和优化的文本布局生成。我们基于RL的方法显著加速了边界框预测步骤，同时减少了重叠，使系统能够在CPU和GPU上高效运行。广泛的评估表明，我们的框架在文本放置和图像合成方面保持或超过了TextDiffuser-2的质量，同时具有显著更快的运行时间和增强的灵活性。我们的方法已在MARIOEval基准上进行了评估，实现了接近最先进模型的OCR和CLIPScore指标，同时运行速度更快97.64%，仅需要2MB的内存。

更新时间: 2025-07-17 18:39:52

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2505.19291v2

ERR@HRI 2.0 Challenge: Multimodal Detection of Errors and Failures in Human-Robot Conversations

The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.

Updated: 2025-07-17 18:21:45

标题: ERR@HRI 2.0挑战：多模态检测人机对话中的错误和失败

摘要: 将大型语言模型（LLMs）集成到会话机器人中使得人机对话更加动态。然而，由LLM驱动的会话机器人仍然容易出现错误，例如，误解用户意图、过早地打断用户或完全未能回应。检测和解决这些失败对于防止对话中断、避免任务中断和维持用户信任至关重要。为了解决这个问题，ERR@HRI 2.0挑战提供了一个LLM驱动的会话机器人失败的多模态数据集，该数据集涵盖了人机对话中的16小时，包括面部、语音和头部运动特征。每个交互都标记有系统视角下机器人错误的存在或缺失，以及感知用户意图以纠正机器人行为与用户期望之间的不匹配。参与者被邀请组成团队，并开发利用多模态数据检测这些失败的机器学习模型。提交内容将使用各种性能指标进行评估，包括检测准确度和误报率。这一挑战代表了通过社交信号分析改进人机互动中失败检测的又一关键步骤。

更新时间: 2025-07-17 18:21:45

领域: cs.RO,cs.AI,cs.HC

下载: http://arxiv.org/abs/2507.13468v1

Graph Neural Network Surrogates for Contacting Deformable Bodies with Necessary and Sufficient Contact Detection

Surrogate models for the rapid inference of nonlinear boundary value problems in mechanics are helpful in a broad range of engineering applications. However, effective surrogate modeling of applications involving the contact of deformable bodies, especially in the context of varying geometries, is still an open issue. In particular, existing methods are confined to rigid body contact or, at best, contact between rigid and soft objects with well-defined contact planes. Furthermore, they employ contact or collision detection filters that serve as a rapid test but use only the necessary and not sufficient conditions for detection. In this work, we present a graph neural network architecture that utilizes continuous collision detection and, for the first time, incorporates sufficient conditions designed for contact between soft deformable bodies. We test its performance on two benchmarks, including a problem in soft tissue mechanics of predicting the closed state of a bioprosthetic aortic valve. We find a regularizing effect on adding additional contact terms to the loss function, leading to better generalization of the network. These benefits hold for simple contact at similar planes and element normal angles, and complex contact at differing planes and element normal angles. We also demonstrate that the framework can handle varying reference geometries. However, such benefits come with high computational costs during training, resulting in a trade-off that may not always be favorable. We quantify the training cost and the resulting inference speedups on various hardware architectures. Importantly, our graph neural network implementation results in up to a thousand-fold speedup for our benchmark problems at inference.

Updated: 2025-07-17 18:09:19

标题: 图神经网络代理用于具有必要和充分接触检测的接触变形体

摘要: 机械学中用于快速推断非线性边界值问题的代理模型在广泛的工程应用中很有帮助。然而，在涉及可变几何形状的可变形体接触应用中，有效的代理建模仍然是一个未解决的问题。特别是，现有方法局限于刚体接触，或者最多是刚体和软物体之间的接触，这些物体具有明确定义的接触平面。此外，它们采用接触或碰撞检测过滤器作为快速测试，但仅使用检测的必要条件，而不是充分条件。在这项工作中，我们提出了一个利用连续碰撞检测的图神经网络架构，首次结合了软变形体之间接触的充分条件。我们在两个基准测试中测试了其性能，包括在软组织力学中预测生物瓣膜主动脉瓣的闭合状态的问题。我们发现在损失函数中添加额外的接触项具有正则化效果，导致网络的更好泛化。这些好处适用于在相似平面和元素法线角处进行简单接触，以及在不同平面和元素法线角处进行复杂接触。我们还展示了该框架可以处理不同的参考几何形状。然而，这些好处伴随着在训练过程中高计算成本，导致可能不总是有利的权衡。我们对各种硬件架构上的训练成本和推断加速进行了量化。重要的是，我们的图神经网络实现在推断时为我们的基准问题带来了高达千倍的加速。

更新时间: 2025-07-17 18:09:19

领域: cs.CE,cs.AI,cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2507.13459v1

WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling

Despite rapid progress in end-to-end AI music generation, AI-driven modeling of professional Digital Signal Processing (DSP) workflows remains challenging. In particular, while there is growing interest in neural black-box modeling of audio effect graphs (e.g. reverb, compression, equalization), AI-based approaches struggle to replicate the nuanced signal flow and parameter interactions used in professional workflows. Existing differentiable plugin approaches often diverge from real-world tools, exhibiting inferior performance relative to simplified neural controllers under equivalent computational constraints. We introduce WildFX, a pipeline containerized with Docker for generating multi-track audio mixing datasets with rich effect graphs, powered by a professional Digital Audio Workstation (DAW) backend. WildFX supports seamless integration of cross-platform commercial plugins or any plugins in the wild, in VST/VST3/LV2/CLAP formats, enabling structural complexity (e.g., sidechains, crossovers) and achieving efficient parallelized processing. A minimalist metadata interface simplifies project/plugin configuration. Experiments demonstrate the pipeline's validity through blind estimation of mixing graphs, plugin/gain parameters, and its ability to bridge AI research with practical DSP demands. The code is available on: https://github.com/IsaacYQH/WildFX.

Updated: 2025-07-17 18:06:25

标题: WildFX：一种以DAW为动力的野外音频效果图建模管道

摘要: 尽管端到端的AI音乐生成取得了快速进展，但AI驱动的专业数字信号处理（DSP）工作流建模仍然具有挑战性。特别是，在对音频效果图（例如混响、压缩、均衡）进行神经黑盒建模方面越来越受到关注的情况下，基于AI的方法难以复制专业工作流程中使用的微妙信号流和参数交互。现有的可微插件方法通常与真实世界的工具背道而驰，在等效计算约束条件下表现不佳，相对于简化的神经控制器。我们引入了WildFX，一个使用Docker容器化的流水线，用于生成具有丰富效果图的多轨音频混音数据集，由专业数字音频工作站（DAW）后端驱动。WildFX支持无缝集成跨平台商业插件或任何野外插件，支持VST/VST3/LV2/CLAP格式，实现结构复杂性（例如侧链、交叉）并实现高效的并行处理。一个极简元数据接口简化了项目/插件配置。实验通过盲估计混音图、插件/增益参数以及其将AI研究与实际DSP需求联系起来的能力证明了该流水线的有效性。代码可在以下链接找到：https://github.com/IsaacYQH/WildFX。

更新时间: 2025-07-17 18:06:25

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2507.10534v2

Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets: PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification demonstrate CALIN's effectiveness at ensuring fair confidence calibration in its prediction, while improving its overall prediction accuracies and exhibiting minimum fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.

Updated: 2025-07-17 18:00:33

标题: 揭示和减轻MLLM少样本上下文学习中的校准偏差和人口统计不公平问题，用于医学图像分类

摘要: 多模态大语言模型（MLLMs）在医学图像分析领域具有进行少样本上下文学习的巨大潜力。然而，将这些模型安全地部署到现实临床实践中需要对它们的预测准确性以及相关的校准误差进行深入分析，特别是跨不同人口亚组。在这项工作中，我们首次对MLLMs的预测和置信分数在医学图像分类的少样本上下文学习中的校准偏差和人口不公平性进行了调查。我们引入了CALIN，一种设计用于减轻相关偏见的推理时校准方法。具体来说，CALIN使用双层过程来估计所需的校准量，通过从人口水平逐步向亚组水平进展，表示为校准矩阵，然后将此估算应用于推理过程中的预测置信分数的校准。对三个医学图像数据集的实验结果：用于眼底图像分类的PAPILA，用于皮肤癌分类的HAM10000，以及用于胸部X射线分类的MIMIC-CXR展示了CALIN在确保在其预测中公平置信校准方面的有效性，同时提高其整体预测准确性并展示最小的公平-效用权衡。我们的代码库可以在https://github.com/xingbpshen/medical-calibration-fairness-mllm 找到。

更新时间: 2025-07-17 18:00:33

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2506.23298v3

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, substantially adopt unsupervised learning paradigms, whereas they struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of visual language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potentials for video understanding.

Updated: 2025-07-17 17:59:59

标题: VideoITG：具有指导性时间基准的多模态视频理解

摘要: 最近的研究表明，选择信息丰富和相关的视频帧可以显著提高视频大型语言模型（Video-LLMs）的性能。当前的方法，如减少帧间冗余、采用独立模型进行图像文本相关性评估，或利用时间视频定位进行事件定位，主要采用无监督学习范式，但它们难以解决长视频理解中的复杂场景。我们提出了视频指导时间定位（VideoITG），采用与用户指令对齐的定制帧采样。VideoITG的核心是VidThinker流程，这是一个明确模仿人类注释过程的自动注释框架。首先，它生成基于指令的详细剪辑级标题；然后，通过指导式推理检索相关视频片段；最后，对最具信息量的视觉证据进行细粒度帧选择。借助VidThinker，我们构建了包含40K个视频和500K个指导式时间定位注释的VideoITG-40K数据集。然后，我们设计了一个即插即用的VideoITG模型，利用Video-LLMs的视觉语言对齐和推理能力，以区分性方式进行有效的帧选择。结合Video-LLMs，VideoITG在多个多模式视频理解基准测试中实现了一致的性能改进，展示了其在视频理解方面的优越性和巨大潜力。

更新时间: 2025-07-17 17:59:59

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13353v1

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.

Updated: 2025-07-17 17:59:55

标题: VisionThink：通过强化学习实现智能高效的视觉语言模型

摘要: 最近视觉语言模型（VLMs）的进展通过增加视觉标记的数量来提高性能，这些标记通常比文本标记要长得多。然而，我们观察到大多数实际场景并不需要如此大量的视觉标记。虽然在一小部分与OCR相关的任务中性能显著下降，但模型在大多数其他一般的VQA任务中仍然能够准确执行，只需1/4的分辨率。因此，我们提出动态处理具有不同分辨率的不同样本，并提出了一种新的视觉标记压缩范例，即VisionThink。它从一个降采样的图像开始，并聪明地决定是否足以解决问题。否则，模型可以输出一个特殊的标记来请求更高分辨率的图像。与现有的使用固定修剪比例或阈值压缩标记的高效VLM方法相比，VisionThink会根据具体情况自主决定是否压缩标记。因此，它在与OCR相关的任务上展示了强大的细致视觉理解能力，同时在简单任务上节省了大量的视觉标记。我们采用强化学习，并提出LLM-as-Judge策略，成功将RL应用于一般的VQA任务。此外，我们精心设计了一个奖励函数和惩罚机制，以实现稳定和合理的图像调整比例。大量实验证明了我们方法的优越性、效率和有效性。我们的代码可在https://github.com/dvlab-research/VisionThink 上找到。

更新时间: 2025-07-17 17:59:55

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.13348v1

Imbalance in Balance: Online Concept Balancing in Generation Models

In visual generation tasks, the responses and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors for poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. In our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few codes.

Updated: 2025-07-17 17:59:47

标题: 平衡中的不平衡：生成模型中的在线概念平衡

摘要: 在视觉生成任务中，复杂概念的响应和组合常常缺乏稳定性并容易出错，这仍然是一个未被充分探索的领域。本文试图通过精心设计的实验来探讨导致概念响应不佳的因素。我们还设计了一个概念-wise的均衡损失函数（IMBA loss）来解决这个问题。我们提出的方法是在线的，消除了离线数据集处理的需要，同时需要最少的代码更改。在我们新提出的复杂概念基准Inert-CompBench和其他两个公共测试集中，我们的方法显著增强了基线模型的概念响应能力，并仅使用少量代码就取得了非常有竞争力的结果。

更新时间: 2025-07-17 17:59:47

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13345v1

DeFine: Decision-Making with Analogical Reasoning over Factor Profiles

LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. E.g., during a company's earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce \textsc{DeFine}, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.

Updated: 2025-07-17 17:58:50

标题: DeFine：基于因子配置的类比推理决策-making

摘要: LLM具有在长期情境中进行推理的能力，因此非常适合决策。然而，在处理描述复杂情况的演讲文本时会出现挑战，因为这些文本冗长且包含重复、含糊和模棱两可的内容。例如，在一家公司的财报电话会议中，一位高管可能会预测积极的收入前景以安抚投资者，尽管对未来收入存在不确定性。LLM在做决策时必须系统地纳入这种不确定性。在本文中，我们介绍了一种名为\textsc{DeFine}的模块化框架，它从复杂情境中构建概率因素档案。然后，利用类似过去经验中的见解，将这些档案与类比推理相结合，指导LLM在新情况下做出关键决策。我们的框架将量化不确定性的任务与将其纳入LLM决策分开处理。这种方法在咨询和金融决策等领域特别有用，因为在不确定性下做出决策至关重要。

更新时间: 2025-07-17 17:58:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2410.01772v2

Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and the datasets of humans playing with objects (human data from play). Our approach leverages two key insights. First, we use optic flow as an embodiment-agnostic action representation to train a World Model (WM) across multi-embodiment datasets, and finetune it on a small amount of robot data from the target embodiment. Second, we develop a method, Latent Policy Steering (LPS), to improve the output of a behavior-cloned policy by searching in the latent space of the WM for better action sequences. In real world experiments, we observe significant improvements in the performance of policies trained with a small amount of data (over 50% relative improvement with 30 demonstrations and over 20% relative improvement with 50 demonstrations) by combining the policy with a WM pretrained on two thousand episodes sampled from the existing Open X-embodiment dataset across different robots or a cost-effective human dataset from play.

Updated: 2025-07-17 17:57:57

标题: 潜在政策引导与不受具体体现影响的预训练世界模型

摘要: 通过模仿学习视觉动作策略已经在各种机器人领域证明有效。然而，这些策略的性能严重依赖于训练演示的数量，这需要在现实世界中进行昂贵的数据收集。在这项工作中，我们旨在通过利用来自各种体现形式的现有或成本效益数据，如公共机器人数据集和人类玩耍物体的数据（来自游戏的人类数据），来减少学习视觉动作机器人策略时的数据收集工作量。我们的方法利用两个关键观点。首先，我们使用光流作为一个体现形式不可知的行动表示来训练跨多体现形式数据集的世界模型（WM），并在目标体现形式的少量机器人数据上进行微调。其次，我们开发了一种方法，潜在策略导向（LPS），通过在WM的潜在空间中搜索更好的动作序列来改进行为克隆策略的输出。在真实世界实验中，我们观察到，将策略与一个在不同机器人上采样自现有Open X-体现形式数据集的两千个剧集或一个成本效益的来自游戏的人类数据集上进行预训练的WM相结合，可以显著提高使用少量数据训练的策略的性能（通过30次演示相对改善超过50％，通过50次演示相对改善超过20％）。

更新时间: 2025-07-17 17:57:57

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.13340v1

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel ""Anti-Physics"" category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that could utilize current MLLM to evaluate the physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. we identify pivotal challenges models face in adhering to real-world physics. Through systematic testing of their outputs across 1,050 curated prompts-spanning fundamental, composite, and anti-physics scenarios-we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.

Updated: 2025-07-17 17:54:09

标题: "PhyWorldBench": 文本到视频模型中物理现实性的全面评估"

摘要: 视频生成模型在创造高质量、逼真的内容方面取得了显著进展。然而，它们准确模拟物理现象的能力仍然是一个关键且未解决的挑战。本文介绍了PhyWorldBench，这是一个旨在评估视频生成模型是否遵循物理定律的综合基准测试。该基准覆盖了多个物理现象的层次，从物体运动和能量守恒等基本原理到涉及刚体相互作用以及人类或动物运动等更复杂的场景。此外，我们引入了一个新颖的“反物理”类别，其中提示故意违反现实世界的物理规律，从而评估模型是否能够遵循这些指令同时保持逻辑一致性。除了大规模的人类评估外，我们还设计了一种简单而有效的方法，可以利用当前的MLLM以零样本方式评估物理真实性。我们评估了12种最先进的文本到视频生成模型，包括五种开源和五种专有模型，并进行了详细的比较和分析。我们识别了模型在遵循真实世界物理规律方面面临的关键挑战。通过对其输出在1050个经过精心策划的提示中进行系统测试，涵盖基本、复合和反物理场景，我们识别了这些模型在遵守真实世界物理规律方面面临的关键挑战。然后，我们严格检查它们在不同类型提示下对各种物理现象的表现，为制定增强对物理原则的忠实度的提示提供有针对性的建议。

更新时间: 2025-07-17 17:54:09

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13428v1

FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human -- or superhuman -- expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale; ideal for building RL environments. Third, many of our problems are intimately related to the frontier of theoretical computer science, and to central conjectures therein, such as the Strong Exponential Time Hypothesis (SETH). As such, any significant algorithmic progress on our dataset, beyond known results, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne, solving less than 1% of the questions, even when given 10 attempts and explanatory fewshot examples -- highlighting how far they remain from expert-level understanding in some domains. To support further research, we additionally curate FormulaOne-Warmup, offering a set of simpler tasks, from the same distribution. We release the full corpus along with a comprehensive evaluation framework.

Updated: 2025-07-17 17:53:55

标题: 《F1赛车：超越竞争性编程的算法推理深度测量》

摘要: 前沿人工智能模型展示了广泛的知识领域。但它们距离真正的人类专家或超人类专家有多近？真正的专家能够解决最困难的问题，并推动科学理解的边界。为了揭示前沿模型能力的限制，我们转向真实生活中的研究问题，而不是人为竞争性编程难题。我们构建了FormulaOne，这是一个基准测试，位于图论、逻辑和算法的交集，都在前沿模型的训练分布范围内。我们的问题非常苛刻，需要一系列推理步骤。数据集具有三个关键属性。首先，它具有商业利益，并涉及实际的大规模优化问题，如路由、调度和网络设计等。其次，它是从图上的Monadic Second-Order (MSO)逻辑的高度表达框架生成的，为规模化的自动问题生成铺平了道路；适用于构建RL环境。第三，我们的许多问题与理论计算机科学的前沿以及其中的中心猜想密切相关，如强指数时间假设（SETH）。因此，对我们的数据集进行任何重大算法进展，超出已知结果，可能具有深远的理论意义。值得注意的是，像OpenAI的o3这样的最新模型在FormulaOne上完全失败，解决不到1%的问题，即使给予10次尝试和解释性的fewshot示例--突显了它们在某些领域仍然远离专家水平的理解。为了支持进一步研究，我们另外策划了FormulaOne-Warmup，提供了一组更简单的任务，来自相同的分布。我们发布完整的语料库以及全面的评估框架。

更新时间: 2025-07-17 17:53:55

领域: cs.AI,cs.CC,math.LO

下载: http://arxiv.org/abs/2507.13337v1

Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.

Updated: 2025-07-17 17:47:47

标题: 视觉和语言训练有助于应用分类知识，但并不从根本上改变它。

摘要: 视觉语言（VL）训练是否会以有意义的方式改变语言模型的语言表示？文献中的大部分结果显示，无论是在行为上还是在表示上，都存在不一致或边缘的差异。在这项工作中，我们从一个假设出发，即VL训练可能会对词汇概念知识产生显著影响，特别是其分类组织。通过比较仅文本LM和经过VL训练的LM的最小对，我们首先展示了VL模型在需要对问题中提及的概念进行分类理解的文本问答任务中通常优于仅文本LM。通过一系列针对性的行为和表征分析，我们发现LM和VLM在其词汇知识本身方面并没有显著差异，但它们在如何表示包含分类关系和非分类关系概念的问题方面有所不同。这意味着通过额外的VL训练，分类知识本身并没有发生根本性改变，但在特定任务的情境下，VL训练确实提高了这种知识的应用，即使任务的呈现纯粹是语言的。

更新时间: 2025-07-17 17:47:47

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13328v1

Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence

Federated Learning (FL) has emerged as a transformative paradigm in the field of distributed machine learning, enabling multiple clients such as mobile devices, edge nodes, or organizations to collaboratively train a shared global model without the need to centralize sensitive data. This decentralized approach addresses growing concerns around data privacy, security, and regulatory compliance, making it particularly attractive in domains such as healthcare, finance, and smart IoT systems. This survey provides a concise yet comprehensive overview of Federated Learning, beginning with its core architecture and communication protocol. We discuss the standard FL lifecycle, including local training, model aggregation, and global updates. A particular emphasis is placed on key technical challenges such as handling non-IID (non-independent and identically distributed) data, mitigating system and hardware heterogeneity, reducing communication overhead, and ensuring privacy through mechanisms like differential privacy and secure aggregation. Furthermore, we examine emerging trends in FL research, including personalized FL, cross-device versus cross-silo settings, and integration with other paradigms such as reinforcement learning and quantum computing. We also highlight real-world applications and summarize benchmark datasets and evaluation metrics commonly used in FL research. Finally, we outline open research problems and future directions to guide the development of scalable, efficient, and trustworthy FL systems.

Updated: 2025-07-17 17:36:34

标题: 《联邦学习：隐私保护合作智能调查》

摘要: 联合学习（FL）已成为分布式机器学习领域中的一种变革性范式，使多个客户端（如移动设备、边缘节点或组织）能够共同训练一个共享的全局模型，而无需集中敏感数据。这种去中心化方法解决了日益增长的数据隐私、安全和监管合规性方面的担忧，使其在医疗保健、金融和智能物联网系统等领域特别具有吸引力。本调查提供了对联合学习的简洁而全面的概述，从其核心架构和通信协议开始。我们讨论了标准FL生命周期，包括本地训练、模型聚合和全局更新。重点放在关键技术挑战上，如处理非独立同分布数据、减轻系统和硬件异构性、减少通信开销，以及通过差分隐私和安全聚合等机制确保隐私。此外，我们还研究了FL研究中的新兴趋势，包括个性化FL、跨设备与跨存储设置，以及与强化学习和量子计算等范式的集成。我们还突出了现实世界应用，并总结了FL研究中常用的基准数据集和评估指标。最后，我们概述了开放研究问题和未来方向，以指导可扩展、高效和可信赖的FL系统的发展。

更新时间: 2025-07-17 17:36:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.17703v2

HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

Large language models (LLMs) leverage deep learning to process and predict sequences of words from context, enabling them to perform various NLP tasks, such as translation, summarization, question answering, and content generation. However, the growing size and complexity of developing, training, and deploying advanced LLMs require extensive computational resources and large datasets. This creates a barrier for users. As a result, platforms that host models and datasets are widely used. For example, Hugging Face, one of the most popular platforms, hosted 1.8 million models and 450K datasets by June 2025, with no sign of slowing down. Since many LLMs are built from base models, pre-trained models, and external datasets, they can inherit vulnerabilities, biases, or malicious components from earlier models or datasets. Therefore, it is critical to understand the origin and development of these components to better detect potential risks, improve model fairness, and ensure compliance. Motivated by this, our project aims to study the relationships between models and datasets, which are core components of the LLM supply chain. First, we design a method to systematically collect LLM supply chain data. Using this data, we build a directed heterogeneous graph to model the relationships between models and datasets, resulting in a structure with 397,376 nodes and 453,469 edges. We then perform various analyses and uncover several findings, such as: (i) the LLM supply chain graph is large, sparse, and follows a power-law degree distribution; (ii) it features a densely connected core and a fragmented periphery; (iii) datasets play pivotal roles in training; (iv) strong interdependence exists between models and datasets; and (v) the graph is dynamic, with daily updates reflecting the ecosystem's ongoing evolution.

Updated: 2025-07-17 17:34:13

标题: 拥抱图：理解LLM生态系统的供应链

摘要: 大型语言模型（LLMs）利用深度学习来处理和预测来自上下文的单词序列，使它们能够执行各种自然语言处理任务，例如翻译、摘要、问答和内容生成。然而，开发、训练和部署先进的LLMs的规模和复杂性不断增长，需要大量计算资源和大型数据集。这为用户带来了障碍。因此，托管模型和数据集的平台被广泛使用。例如，截至2025年6月，最受欢迎的平台之一Hugging Face托管了180万个模型和45万个数据集，并没有减缓的迹象。由于许多LLMs是从基本模型、预训练模型和外部数据集构建的，它们可能继承先前模型或数据集的漏洞、偏见或恶意组件。因此，了解这些组件的起源和发展对于更好地检测潜在风险、提高模型公平性和确保合规性至关重要。受此激励，我们的项目旨在研究模型和数据集之间的关系，这些是LLM供应链的核心组件。首先，我们设计了一种方法来系统地收集LLM供应链数据。利用这些数据，我们构建了一个有向异构图来模拟模型和数据集之间的关系，形成了一个包含397,376个节点和453,469条边的结构。然后，我们进行各种分析并揭示了几个发现，例如：（i）LLM供应链图很大、稀疏，并且遵循幂律度分布；（ii）它具有一个密集连接的核心和一个分散的外围；（iii）数据集在训练中扮演关键角色；（iv）模型和数据集之间存在强烈的相互依赖关系；以及（v）该图是动态的，每天更新反映了生态系统的持续演化。

更新时间: 2025-07-17 17:34:13

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.14240v1

A Crowdsensing Intrusion Detection Dataset For Decentralized Federated Learning Models

This paper introduces a dataset and experimental study for decentralized federated learning (DFL) applied to IoT crowdsensing malware detection. The dataset comprises behavioral records from benign and eight malware families. A total of 21,582,484 original records were collected from system calls, file system activities, resource usage, kernel events, input/output events, and network records. These records were aggregated into 30-second windows, resulting in 342,106 features used for model training and evaluation. Experiments on the DFL platform compare traditional machine learning (ML), centralized federated learning (CFL), and DFL across different node counts, topologies, and data distributions. Results show that DFL maintains competitive performance while preserving data locality, outperforming CFL in most settings. This dataset provides a solid foundation for studying the security of IoT crowdsensing environments.

Updated: 2025-07-17 17:33:11

标题: 一个用于去中心化联邦学习模型的众包侵入检测数据集

摘要: 本文介绍了一个用于去中心化联邦学习（DFL）应用于物联网众包恶意软件检测的数据集和实验研究。该数据集包括来自良性和八个恶意软件家族的行为记录。总共收集了21,582,484条原始记录，包括系统调用、文件系统活动、资源使用、内核事件、输入/输出事件和网络记录。这些记录被聚合成30秒的时间窗口，用于模型训练和评估的特征数为342,106个。在DFL平台上的实验比较了传统机器学习（ML）、集中式联邦学习（CFL）和DFL在不同节点数量、拓扑结构和数据分布下的表现。结果表明，在大多数情况下，DFL在保留数据本地性的同时保持竞争性能，优于CFL。这个数据集为研究物联网众包环境的安全性提供了坚实的基础。

更新时间: 2025-07-17 17:33:11

领域: cs.CR

下载: http://arxiv.org/abs/2507.13313v1

Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark

The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (\eg, MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.

Updated: 2025-07-17 17:33:11

标题: 重新审视基于推理的姿势估计基准中的可靠性

摘要: 基于推理的姿势估计（RPE）基准已经成为姿势感知多模态大语言模型（MLLMs）的广泛采用评估标准。尽管其重要性，我们发现存在关键的可重复性和基准质量问题，这些问题阻碍了公平和一致的定量评估。值得注意的是，该基准使用与原始3DPW数据集不同的图像索引，迫使研究人员进行繁琐且容易出错的手动匹配过程，以获得准确的地面真实（GT）注释以用于定量度量（如MPJPE、PA-MPJPE）。此外，我们的分析揭示了几个固有的基准质量限制，包括显着的图像冗余、场景不平衡、过于简化的姿势和模糊的文本描述，共同削弱了跨多样化场景的可靠评估。为了减轻手动努力并增强可重复性，我们通过细致的视觉匹配精心完善了GT注释，并将这些完善的注释作为开源资源公开发布，从而促进一致的定量评估，并促进人体姿势感知多模态推理的未来发展。

更新时间: 2025-07-17 17:33:11

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13314v1

GPU Performance Portability needs Autotuning

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.

Updated: 2025-07-17 17:31:44

标题: GPU性能可移植性需要自动调整

摘要: 随着LLM的复杂性增加，实现最先进的性能需要算法、软件和硬件之间紧密的协同设计。当前依赖单一主导平台限制了可移植性，导致供应商锁定，并给新的AI硬件带来了障碍。在这项工作中，我们提出结合即时编译（JIT）和全面的内核参数自动调整，以实现具有最先进性能的可移植LLM推断，无需更改代码。重点放在性能关键的LLM内核上，我们展示了这种方法探索了多达15倍的内核参数配置，跨多个维度产生了明显更多样化的代码，甚至在减少内核代码大小70倍且消除手动代码优化的同时，性能甚至超过供应商优化的实现高达230%。我们的结果突显了自动调整作为解锁跨GPU供应商模型可移植性的有前途的途径。

更新时间: 2025-07-17 17:31:44

领域: cs.AR,cs.AI,cs.PL

下载: http://arxiv.org/abs/2505.03780v3

Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information

The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks with not statistically different PVI estimates are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.

Updated: 2025-07-17 17:27:11

标题: 使用逐点V-可用信息识别多任务学习的任务分组

摘要: 多任务学习的成功很大程度上取决于哪些任务被分组在一起。简单地将所有任务或随机一组任务分组可能导致负面转移，使得多任务模型的表现比单任务模型更差。虽然已经做了许多努力来识别任务分组并衡量不同任务之间的相关性，但在众多潜在任务组合的池中定义一个度量来识别最佳任务分组仍然是一个具有挑战性的研究课题。我们提出了一种基于通过点可用信息（PVI）测量的任务难度的任务相关性度量。PVI是一种最近提出的度量方法，用于估计给定模型的数据集包含多少可用信息。我们假设具有统计上不同的PVI估计的任务足够相似，可以从联合学习过程中受益。我们进行了全面的实验，评估了这种度量方法在一般、生物医学和临床领域的15个NLP数据集上用于任务分组的可行性。我们将联合学习器的结果与单一学习器、现有基线方法以及最近的大型语言模型，包括Llama 2和GPT-4进行了比较。结果显示，通过将具有类似PVI估计的任务分组在一起，联合学习器产生了具有竞争力的结果，并且参数总数更少，跨领域表现一致。

更新时间: 2025-07-17 17:27:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2410.12774v2

The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

The evaluation of large language models is a complex task, in which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions of different topics. However, this method has certain limitations, being the most concerning, the poor correlation with the humans. An alternative approach, is to have humans evaluate the LLMs. This poses scalability issues as there is a large and growing number of models to evaluate making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. An alternative approach is the use of public arenas, such as the popular LM arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then elaborated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.

Updated: 2025-07-17 17:11:14

标题: 生成能量竞技场（GEA）：将能源意识纳入大型语言模型（LLM）人类评估

摘要: 评估大型语言模型是一个复杂的任务，其中提出了几种方法。最常见的方法是使用自动化基准测试，其中LLM必须回答不同主题的多项选择题。然而，这种方法存在一定的局限性，最令人担忧的是与人类的相关性不强。另一种方法是让人类评估LLM。这会带来可扩展性问题，因为需要评估的模型数量庞大且不断增长，使得基于招募一定数量的评估者并让他们排名模型响应的传统研究变得不切实际（和昂贵）。另一种方法是利用公共领域，如LM竞技场，其中任何用户都可以自由地评估任何问题上的模型并对两个模型的响应进行排名。然后，结果被整理成一个模型排名。 LLM越来越重要的一个方面是它们的能源消耗，因此评估能源意识如何影响人们在选择模型时的决策具有重要意义。在本文中，我们介绍了GEA，即生成能源竞技场，一个将模型的能源消耗信息纳入评估过程中的竞技场。还介绍了使用GEA获得的初步结果，显示对于大多数问题，当用户意识到能源消耗时，他们更倾向于更小更节能的模型。这表明对于大多数用户交互来说，更复杂和性能更好的模型所产生的额外成本和能源消耗并不会提供足以证明使用它们的响应质量提升的增益。

更新时间: 2025-07-17 17:11:14

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.13302v1

CaSTFormer: Causal Spatio-Temporal Transformer for Driving Intention Prediction

Accurate prediction of driving intention is key to enhancing the safety and interactive efficiency of human-machine co-driving systems. It serves as a cornerstone for achieving high-level autonomous driving. However, current approaches remain inadequate for accurately modeling the complex spatio-temporal interdependencies and the unpredictable variability of human driving behavior. To address these challenges, we propose CaSTFormer, a Causal Spatio-Temporal Transformer to explicitly model causal interactions between driver behavior and environmental context for robust intention prediction. Specifically, CaSTFormer introduces a novel Reciprocal Shift Fusion (RSF) mechanism for precise temporal alignment of internal and external feature streams, a Causal Pattern Extraction (CPE) module that systematically eliminates spurious correlations to reveal authentic causal dependencies, and an innovative Feature Synthesis Network (FSN) that adaptively synthesizes these purified representations into coherent spatio-temporal inferences. We evaluate the proposed CaSTFormer on the public Brain4Cars dataset, and it achieves state-of-the-art performance. It effectively captures complex causal spatio-temporal dependencies and enhances both the accuracy and transparency of driving intention prediction.

Updated: 2025-07-17 17:10:37

标题: CaSTFormer：用于驾驶意图预测的因果时空变换器

摘要: 准确预测驾驶意图是提高人机共驾系统安全性和交互效率的关键。它是实现高级别自动驾驶的基石。然而，当前的方法仍然无法准确建模复杂的时空相互依赖关系和人类驾驶行为的不可预测变化。为了解决这些挑战，我们提出了CaSTFormer，一个因果时空Transformer，用于明确建模驾驶行为和环境背景之间的因果交互以实现强大的意图预测。具体来说，CaSTFormer引入了一种新颖的Reciprocal Shift Fusion（RSF）机制，用于精确对齐内部和外部特征流的时间，一个因果模式提取（CPE）模块，系统地消除虚假相关性以揭示真实的因果依赖关系，以及一种创新的特征合成网络（FSN），它自适应地将这些纯化的表示合成为连贯的时空推断。我们在公开的Brain4Cars数据集上评估了提出的CaSTFormer，并取得了最先进的性能。它有效地捕捉了复杂的因果时空依赖关系，并提高了驾驶意图预测的准确性和透明度。

更新时间: 2025-07-17 17:10:37

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13425v1

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.

Updated: 2025-07-17 17:09:22

标题: AbGen：使用消融研究设计和评估对科学研究中的大型语言模型进行评估

摘要: 我们介绍了AbGen，这是第一个旨在评估LLM在设计科研消融研究中能力的基准。AbGen包括1,500个专家注释的示例，来源于807篇自然语言处理论文。在这个基准中，LLMs被要求根据给定的研究背景，为特定模块或过程生成详细的消融研究设计。我们评估了领先的LLMs，如DeepSeek-R1-0528和o4-mini，突出了这些模型在消融研究设计的重要性、忠实性和完整性方面与人类专家之间存在显著的性能差距。此外，我们表明当前的自动评估方法对我们的任务不可靠，因为它们与人类评估相比存在重大差异。为了更好地调查这一问题，我们开发了AbGen-Eval，这是一个元评估基准，旨在评估常用的自动评估系统在衡量LLM在我们的任务上表现的可靠性。我们在AbGen-Eval上研究了各种LLM作为评判系统，为未来开发更有效、可靠的基于LLM的评估系统提供了见解，以应对复杂的科学任务。

更新时间: 2025-07-17 17:09:22

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13300v1

SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks

Modern neural networks (NNs) often do not generalize well in the presence of a "covariate shift"; that is, in situations where the training and test data distributions differ, but the conditional distribution of classification labels remains unchanged. In such cases, NN generalization can be reduced to a problem of learning more domain-invariant features. Domain adaptation (DA) methods include a range of techniques aimed at achieving this; however, these methods have struggled with the need for extensive hyperparameter tuning, which then incurs significant computational costs. In this work, we introduce SIDDA, an out-of-the-box DA training algorithm built upon the Sinkhorn divergence, that can achieve effective domain alignment with minimal hyperparameter tuning and computational overhead. We demonstrate the efficacy of our method on multiple simulated and real datasets of varying complexity, including simple shapes, handwritten digits, and real astronomical observations. SIDDA is compatible with a variety of NN architectures, and it works particularly well in improving classification accuracy and model calibration when paired with equivariant neural networks (ENNs). We find that SIDDA enhances the generalization capabilities of NNs, achieving up to a $\approx40\%$ improvement in classification accuracy on unlabeled target data. We also study the efficacy of DA on ENNs with respect to the varying group orders of the dihedral group $D_N$, and find that the model performance improves as the degree of equivariance increases. Finally, we find that SIDDA enhances model calibration on both source and target data--achieving over an order of magnitude improvement in the ECE and Brier score. SIDDA's versatility, combined with its automated approach to domain alignment, has the potential to advance multi-dataset studies by enabling the development of highly generalizable models.

Updated: 2025-07-17 17:06:39

标题: SIDDA：具有等变神经网络的图像分类的SInkhorn动态领域自适应

摘要: 现代神经网络（NNs）在存在“协变量漂移”时通常不能很好地泛化；也就是说，在训练和测试数据分布不同但分类标签的条件分布保持不变的情况下。在这种情况下，NN的泛化可以简化为学习更多领域不变特征的问题。域自适应（DA）方法包括一系列旨在实现这一目标的技术；然而，这些方法一直在需要大量超参数调整上苦苦挣扎，这会导致显著的计算成本。在这项工作中，我们介绍了SIDDA，这是一个建立在Sinkhorn散度基础上的即插即用的DA训练算法，可以通过最小的超参数调整和计算开销实现有效的域对齐。我们在多个模拟和真实数据集上展示了我们方法的有效性，包括简单形状、手写数字和真实天文观测数据。SIDDA与各种NN架构兼容，特别擅长与等变神经网络（ENNs）配对以提高分类准确性和模型校准。我们发现SIDDA增强了NN的泛化能力，在未标记的目标数据上实现了约40%的分类准确性提升。我们还研究了在不同二面角群$D_N$的不同群阶下对ENNs进行DA的有效性，发现随着等变性的增加，模型性能也会提高。最后，我们发现SIDDA在源数据和目标数据上都提高了模型校准性--在ECE和Brier分数上实现了一个数量级的提升。SIDDA的多功能性，结合其自动化的域对齐方法，有望推动多数据集研究的发展，促进高度泛化模型的开发。

更新时间: 2025-07-17 17:06:39

领域: cs.LG,astro-ph.GA,cs.AI,cs.CV

下载: http://arxiv.org/abs/2501.14048v2

Air Traffic Controller Task Demand via Graph Neural Networks: An Interpretable Approach to Airspace Complexity

Real-time assessment of near-term Air Traffic Controller (ATCO) task demand is a critical challenge in an increasingly crowded airspace, as existing complexity metrics often fail to capture nuanced operational drivers beyond simple aircraft counts. This work introduces an interpretable Graph Neural Network (GNN) framework to address this gap. Our attention-based model predicts the number of upcoming clearances, the instructions issued to aircraft by ATCOs, from interactions within static traffic scenarios. Crucially, we derive an interpretable, per-aircraft task demand score by systematically ablating aircraft and measuring the impact on the model's predictions. Our framework significantly outperforms an ATCO-inspired heuristic and is a more reliable estimator of scenario complexity than established baselines. The resulting tool can attribute task demand to specific aircraft, offering a new way to analyse and understand the drivers of complexity for applications in controller training and airspace redesign.

Updated: 2025-07-17 17:02:42

标题: 用图神经网络评估空中交通管制员任务需求：一种可解释的空域复杂性方法

摘要: 实时评估近期空中交通管制员（ATCO）任务需求是日益拥挤的空域中的一个关键挑战，因为现有的复杂度指标往往无法捕捉简单飞机计数以外的微妙操作驱动因素。本研究引入了一个可解释的图神经网络（GNN）框架来解决这一问题。我们基于注意力的模型预测了即将发出的交通管制员对飞机发布的指令的数量，这些指令是通过静态交通场景中的交互获得的。至关重要的是，我们通过系统地删除飞机并测量对模型预测的影响，推导出了一个可解释的、每架飞机的任务需求评分。我们的框架明显优于ATCO启发式，并且比已建立的基准线更可靠地估计了场景复杂度。由此产生的工具可以将任务需求归因于特定飞机，为控制员培训和空域重新设计的应用提供了一种新的分析和理解复杂性驱动因素的方法。

更新时间: 2025-07-17 17:02:42

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13423v1

Towards Formal Verification of LLM-Generated Code from Natural Language Prompts

In the past few years LLMs have emerged as a tool that can aid programmers by taking natural language descriptions and generating code based on it. However, LLMs often generate incorrect code that users need to fix and the literature suggests users often struggle to detect these errors. In this work we seek to offer formal guarantees of correctness to LLM generated code; such guarantees could improve the experience of using AI Code Assistants and potentially enable natural language programming for users with little or no programming knowledge. To address this challenge we propose to incorporate a formal query language that can represent a user's intent in a formally defined but natural language-like manner that a user can confirm matches their intent. Then, using such a query we propose to verify LLM generated code to ensure it matches the user's intent. We implement these ideas in our system, Astrogator, for the Ansible programming language which includes such a formal query language, a calculus for representing the behavior of Ansible programs, and a symbolic interpreter which is used for the verification. On a benchmark suite of 21 code-generation tasks, our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.

Updated: 2025-07-17 16:54:42

标题: 朝向对自然语言提示生成的LLM代码的形式化验证

摘要: 在过去几年中，LLM已经成为一种工具，可以通过接受自然语言描述并生成基于此的代码来帮助程序员。然而，LLM经常生成不正确的代码，用户需要进行修正，文献表明用户经常难以检测这些错误。在这项工作中，我们试图为LLM生成的代码提供正确性的形式保证；这样的保证可以改善使用AI代码助手的体验，并潜在地使自然语言编程对于几乎没有编程知识的用户成为可能。为了解决这一挑战，我们提出将一个形式化的查询语言纳入到系统中，该查询语言可以以一种形式化定义但类似自然语言的方式表示用户的意图，用户可以确认这与他们的意图匹配。然后，使用这样的查询，我们提出验证LLM生成的代码以确保其与用户的意图匹配。我们在我们的系统Astrogator中实现了这些想法，该系统适用于Ansible编程语言，其中包含这样一种形式化查询语言，用于表示Ansible程序行为的演算，以及用于验证的符号解释器。在一个包含21个代码生成任务的基准套件上，我们的验证器能够在83%的情况下验证正确的代码，并在92%的情况下识别不正确的代码。

更新时间: 2025-07-17 16:54:42

领域: cs.PL,cs.AI

下载: http://arxiv.org/abs/2507.13290v1

ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced lately. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.

Updated: 2025-07-17 16:49:08

标题: ContextQFormer：一种新的多轮多模对话上下文建模方法

摘要: 多模式大型语言模型展示了出色的零-shot能力和强大的图像理解能力。然而，现有的开源多模式模型在多轮交互方面存在着能力不足的问题，尤其是对于长篇上下文。为了解决这个问题，我们首先引入了一个上下文建模模块，称为ContextQFormer，它利用存储块来增强上下文信息的呈现。此外，为了促进进一步的研究，我们精心构建了一个新的多轮多模式对话数据集（TMDialog）用于预训练、指导微调和评估，该数据集不久将会开源。与其他多模式对话数据集相比，TMDialog包含更长的对话，支持多轮多模式对话的研究。此外，ContextQFormer在TMDialog上与三个基线进行了比较，实验结果表明，ContextQFormer在可用率上比基线提高了2%-4%。

更新时间: 2025-07-17 16:49:08

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2505.23121v2

Evaluating Reinforcement Learning Algorithms for Navigation in Simulated Robotic Quadrupeds: A Comparative Study Inspired by Guide Dog Behaviour

Robots are increasingly integrated across industries, particularly in healthcare. However, many valuable applications for quadrupedal robots remain overlooked. This research explores the effectiveness of three reinforcement learning algorithms in training a simulated quadruped robot for autonomous navigation and obstacle avoidance. The goal is to develop a robotic guide dog simulation capable of path following and obstacle avoidance, with long-term potential for real-world assistance to guide dogs and visually impaired individuals. It also seeks to expand research into medical 'pets', including robotic guide and alert dogs. A comparative analysis of thirteen related research papers shaped key evaluation criteria, including collision detection, pathfinding algorithms, sensor usage, robot type, and simulation platforms. The study focuses on sensor inputs, collision frequency, reward signals, and learning progression to determine which algorithm best supports robotic navigation in complex environments. Custom-made environments were used to ensure fair evaluation of all three algorithms under controlled conditions, allowing consistent data collection. Results show that Proximal Policy Optimization (PPO) outperformed Deep Q-Network (DQN) and Q-learning across all metrics, particularly in average and median steps to goal per episode. By analysing these results, this study contributes to robotic navigation, AI and medical robotics, offering insights into the feasibility of AI-driven quadruped mobility and its role in assistive robotics.

Updated: 2025-07-17 16:38:14

标题: 评估强化学习算法在模拟机器人四足动物导航中的应用：受导盲犬行为启发的比较研究

摘要: 机器人在各行各业中越来越普遍，特别是在医疗保健领域。然而，四足机器人的许多有价值的应用仍然被忽视。本研究探讨了三种强化学习算法在训练模拟四足机器人进行自主导航和避障方面的有效性。其目标是开发一个能够进行路径跟随和避障的机器导盲犬模拟，具有为导盲犬和视障人士提供实际帮助的潜力。此外，该研究还旨在拓展医用“宠物”领域，包括机器导盲犬和警示犬等。通过比较分析了十三篇相关研究论文，确定了关键评估标准，包括碰撞检测、路径规划算法、传感器使用、机器人类型和模拟平台等。该研究侧重于传感器输入、碰撞频率、奖励信号和学习进展，以确定哪种算法最好地支持机器人在复杂环境中的导航。为了在受控条件下公平评估三种算法，使用定制环境，以确保一致的数据收集。结果显示，接近策略优化（PPO）在所有指标上均优于深度Q网络（DQN）和Q学习，特别是在每一集中到目标的平均和中位步数方面。通过分析这些结果，本研究为机器人导航、人工智能和医疗机器人领域做出了贡献，提供了有关AI驱动四足机器人移动性及其在辅助机器人技术中的作用的见解。

更新时间: 2025-07-17 16:38:14

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.13277v1

Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management

Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.

Updated: 2025-07-17 16:33:57

标题: 2025年TalentCLEF的概述：人力资本管理的技能和职位智能

摘要: 自然语言处理和大型语言模型的进步正在推动人力资本管理领域的重大转变，越来越多的人对基于语言技术构建智能系统以进行人才招聘、提升技能策略和劳动力规划表现出兴趣。然而，这些技术的采用和进展在很大程度上取决于可靠和公平模型的发展，这些模型必须在公开数据和开放基准上得到适当评估，而迄今在这一领域还没有可用的数据。为了填补这一空白，我们提出了TalentCLEF 2025，这是第一个专注于技能和职位智能的评估活动。实验室包括两个任务：任务A - 多语种职位匹配，涵盖英语、西班牙语、德语和中文；任务B - 基于职位名称的技能预测，使用英语。这两个语料库都是从真实的求职申请中构建而成，经过仔细匿名处理，并手动标注以反映现实劳动力市场数据的复杂性和多样性，包括语言变体和性别标记表达。评估包括单语种和跨语种场景，涵盖了性别偏见的评估。 TalentCLEF吸引了76个注册团队，提交了超过280份作品。大多数系统依赖于使用多语言编码器模型结合对比学习进行微调的信息检索技术，并且其中一些系统还将大型语言模型用于数据增强或重新排序。结果显示，训练策略比模型大小本身更具有影响力。TalentCLEF提供了这一领域的第一个公开基准，并鼓励为劳动力市场开发健壮、公平和可转移的语言技术。

更新时间: 2025-07-17 16:33:57

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.13275v1

QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies questions its effectiveness in improving multi-step reasoning-particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k-particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.

Updated: 2025-07-17 16:21:47

标题: QuestA：通过问题增强扩展LLMs的推理能力

摘要: 强化学习（RL）已成为训练大型语言推理模型（LLMs）的关键组成部分。然而，最近的研究质疑其在改善多步推理特别是在困难问题上的有效性。为了解决这一挑战，我们提出了一种简单而有效的策略，即问题增强：在训练过程中引入部分解决方案，以降低问题难度并提供更多信息性的学习信号。我们的方法QuestA，在数学推理任务的RL训练中应用时，不仅改善了pass@1，还改善了pass@k-特别是在标准RL难以取得进展的问题上。这使得我们能够持续改进开源模型（如DeepScaleR和OpenMath Nemotron），进一步增强它们的推理能力。我们使用15亿参数模型在数学基准测试中取得了新的最先进结果：AIME24为67.1%（+5.3%），AIME25为59.5%（+10.0%），HMMT25为35.5%（+4.0%）。此外，我们提供了QuestA改善样本效率的理论解释，为通过RL扩展推理能力提供了实用且可推广的途径。

更新时间: 2025-07-17 16:21:47

领域: cs.CL,cs.AI,68T50

下载: http://arxiv.org/abs/2507.13266v1

Voxtral

We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under Apache 2.0 license.

Updated: 2025-07-17 16:17:37

标题: Voxtral

摘要: 我们介绍了Voxtral Mini和Voxtral Small，这是两种多模态音频聊天模型。Voxtral被训练来理解口头音频和文本文档，实现了在各种音频基准测试中的最先进性能，同时保持了强大的文本能力。Voxtral Small优于许多闭源模型，同时足够小以在本地运行。一个32K上下文窗口使模型能够处理长达40分钟的音频文件和长时间的多轮对话。我们还提供了三个用于评估知识和琐事的语音理解模型的基准测试。两个Voxtral模型均在Apache 2.0许可下发布。

更新时间: 2025-07-17 16:17:37

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.13264v1

Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.

Updated: 2025-07-17 16:09:05

标题: 基于近似正交微调策略的预训练视觉Transformer的高效适应

摘要: 参数高效微调（PEFT）是预训练视觉变换器（ViT）的一种流行方法，其主要涉及冻结大部分骨干参数，并仅学习低秩适应权重矩阵以适应下游任务。这些低秩矩阵通常是通过下投影和上投影矩阵的乘法结构导出的，例如LoRA和Adapter等方法。在这项工作中，我们观察到骨干参数的任意两行或列向量之间存在近似正交性；然而，在下/上投影矩阵的向量中却不存在这种性质。近似正交性意味着模型的泛化误差上界降低，表明模型具有增强的泛化能力。如果微调的下/上投影矩阵展现出与预训练骨干矩阵相同的性质，那么微调后的ViT的泛化能力是否可以进一步增强？为了解决这个问题，我们提出了一种近似正交微调（AOFT）策略来表示低秩权重矩阵。该策略利用单个可学习向量生成一组近似正交向量，这些向量形成下/上投影矩阵，从而使这些矩阵的属性与骨干的属性保持一致。广泛的实验结果表明，我们的方法在一系列下游图像分类任务中取得了竞争性性能，验证了嵌入在下/上投影矩阵中的增强泛化能力的有效性。

更新时间: 2025-07-17 16:09:05

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13260v1

Automating Steering for Safe Multimodal Large Language Models

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.

Updated: 2025-07-17 16:04:55

标题: 自动化转向以确保安全的多模态大型语言模型

摘要: 最近在多模式大型语言模型（MLLMs）领域取得了重要进展，解锁了强大的跨模态推理能力，但也引发了新的安全关注，特别是面对对抗多模态输入时。为了在推理过程中提高MLLMs的安全性，我们引入了一种模块化和自适应的推理时间干预技术，AutoSteer，无需对基础模型进行任何微调。AutoSteer包含三个核心组件：（1）一种新颖的安全意识分数（SAS），自动识别模型内部层中最具安全相关性的差异；（2）一个经过训练的自适应安全探测器，用于估计中间表示中有毒输出的可能性；和（3）一个轻量级的拒绝头，当检测到安全风险时选择性地干预以调节生成。在LLaVA-OV和Chameleon上的各种安全关键基准测试中的实验证明，AutoSteer显著降低了文本、视觉和跨模态威胁的攻击成功率（ASR），同时保持了一般能力。这些发现将AutoSteer定位为一个实用的、可解释的、有效的框架，用于更安全地部署多模态人工智能系统。

更新时间: 2025-07-17 16:04:55

领域: cs.CL,cs.AI,cs.IR,cs.LG,cs.MM

下载: http://arxiv.org/abs/2507.13255v1

A Roadmap for Climate-Relevant Robotics Research

Climate change is one of the defining challenges of the 21st century, and many in the robotics community are looking for ways to contribute. This paper presents a roadmap for climate-relevant robotics research, identifying high-impact opportunities for collaboration between roboticists and experts across climate domains such as energy, the built environment, transportation, industry, land use, and Earth sciences. These applications include problems such as energy systems optimization, construction, precision agriculture, building envelope retrofits, autonomous trucking, and large-scale environmental monitoring. Critically, we include opportunities to apply not only physical robots but also the broader robotics toolkit - including planning, perception, control, and estimation algorithms - to climate-relevant problems. A central goal of this roadmap is to inspire new research directions and collaboration by highlighting specific, actionable problems at the intersection of robotics and climate. This work represents a collaboration between robotics researchers and domain experts in various climate disciplines, and it serves as an invitation to the robotics community to bring their expertise to bear on urgent climate priorities.

Updated: 2025-07-17 16:00:19

标题: 一个关于与气候相关的机器人研究的路线图

摘要: 气候变化是21世纪的一个重要挑战，许多机器人学界人士正在寻找贡献的方式。本文提出了一个与气候相关的机器人研究路线图，确定了机器人学家与能源、建筑环境、交通运输、工业、土地利用和地球科学等气候领域专家之间合作的高影响机会。这些应用包括能源系统优化、建筑、精准农业、建筑外围改造、自主卡车和大规模环境监测等问题。重要的是，我们不仅包括应用物理机器人，还包括更广泛的机器人工具包 - 包括规划、感知、控制和估计算法 - 用于解决与气候相关的问题。这一路线图的中心目标是通过突出机器人学和气候交叉领域的具体、可操作的问题，激发新的研究方向和合作。这项工作代表了机器人学研究人员与各种气候学科领域专家的合作，并邀请机器人学界人士将他们的专业知识应用于紧迫的气候重点领域。

更新时间: 2025-07-17 16:00:19

领域: cs.RO,cs.AI,cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.11623v2

VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models

Popular PEFT methods reduce trainable parameter count for fine-tuning by parameterizing new low-rank or sparse trainable weights in parallel to the frozen pre-trained weights $W$. However, these weights are trained from scratch, and there exists a performance gap between these methods and full fine-tuning, especially in low-budget settings. We introduce VectorFit, a new way of parameterization that efficiently utilizes the existing knowledge embedded in $W$ by adaptively training their singular vectors and biases. We show that utilizing the structural and transformational properties of $W$ in this way can lead to high-rank incremental weight matrices $\Delta W$, comparable to that of full fine-tuning. VectorFit delivers superior results with \textbf{9$\boldsymbol\times$} fewer trainable parameters than the leading PEFT methods. Through comprehensive experiments across 19 datasets covering a wide range of language and vision tasks such as natural language understanding and generation, question answering, image classification, and image generation, we demonstrate that VectorFit surpasses baselines in terms of performance as a function of parameter-efficiency.

Updated: 2025-07-17 15:52:54

标题: VectorFit：预训练基础模型的自适应奇异值和偏置向量微调

摘要: 流行的PEFT方法通过对新的低秩或稀疏可训练权重进行参数化，以减少微调的可训练参数数量，与冻结的预训练权重$W并行。然而，这些权重是从头开始训练的，这些方法与完全微调之间存在性能差距，特别是在低预算环境中。我们引入了VectorFit，一种新的参数化方式，通过自适应地训练它们的奇异向量和偏差，有效利用了嵌入在$W中的现有知识。我们展示了以这种方式利用$W的结构和变换特性可以导致高秩增量权重矩阵$\Delta W，与完全微调相当。VectorFit在比主流PEFT方法少\textbf{9$\boldsymbol\times$}可训练参数的情况下提供了优越的结果。通过涵盖了19个数据集的全面实验，涵盖了语言和视觉任务的广泛范围，如自然语言理解和生成、问题回答、图像分类和图像生成，我们展示了VectorFit在参数效率方面超越了基线性能。

更新时间: 2025-07-17 15:52:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.19530v2

Secure Parsing and Serializing with Separation Logic Applied to CBOR, CDDL, and COSE

Incorrect handling of security-critical data formats, particularly in low-level languages, are the root cause of many security vulnerabilities. Provably correct parsing and serialization tools that target languages like C can help. Towards this end, we present PulseParse, a library of verified parser and serializer combinators for non-malleable binary formats. Specifications and proofs in PulseParse are in separation logic, offering a more abstract and compositional interface, with full support for data validation, parsing, and serialization. PulseParse also supports a class of recursive formats -- with a focus on security and handling adversarial inputs, we show how to parse such formats with only a constant amount of stack space. We use PulseParse at scale by providing the first formalization of CBOR, a recursive, binary data format standard, with growing adoption in various industrial standards. We prove that the deterministic fragment of CBOR is non-malleable and provide EverCBOR, a verified library in both C and Rust to validate, parse, and serialize CBOR objects implemented using PulseParse. Next, we provide the first formalization of CDDL, a schema definition language for CBOR. We identify well-formedness conditions on CDDL definitions that ensure that they yield unambiguous, non-malleable formats, and implement EverCDDL, a tool that checks that a CDDL definition is well-formed, and then produces verified parsers and serializers for it. To evaluate our work, we use EverCDDL to generate verified parsers and serializers for various security-critical applications. Notably, we build a formally verified implementation of COSE signing, a standard for cryptographically signed objects. We also use our toolchain to generate verified code for other standards specified in CDDL, including DICE Protection Environment, a secure boot protocol standard.

Updated: 2025-07-17 15:50:49

标题: 使用分离逻辑应用于CBOR、 CDDL和COSE的安全解析和序列化

摘要: 安全关键数据格式的错误处理，特别是在低级语言中，是许多安全漏洞的根本原因。针对诸如C语言等语言的可证明正确的解析和序列化工具可以提供帮助。为此，我们介绍了PulseParse，一个经过验证的解析器和序列化器组合库，用于不可塑性二进制格式。PulseParse中的规范和证明采用分离逻辑，提供更抽象和组合的接口，完全支持数据验证、解析和序列化。PulseParse还支持一类递归格式--着重于安全性和处理对抗性输入，我们展示了如何仅使用恒定量的堆栈空间解析这些格式。我们通过首次提供CBOR的形式化来大规模使用PulseParse，CBOR是一种递归的二进制数据格式标准，在各种工业标准中得到不断采用。我们证明了CBOR的确定性片段是不可塑性的，并提供了EverCBOR，一个在C和Rust中验证的库，用PulseParse实现对CBOR对象的验证、解析和序列化。接下来，我们首次提供了CDDL的形式化，这是一种用于CBOR的模式定义语言。我们确定了CDDL定义的良好形式条件，以确保它们产生明确、不可塑性的格式，并实现了EverCDDL，一个检查CDDL定义是否良好形式，然后为其生成经过验证的解析器和序列化器的工具。为了评估我们的工作，我们使用EverCDDL为各种安全关键应用生成经过验证的解析器和序列化器。值得注意的是，我们构建了COSE签名的形式验证实现，这是一种用于加密签名对象的标准。我们还使用我们的工具链为CDDL中指定的其他标准生成了经过验证的代码，包括DICE Protection Environment，这是一种安全启动协议标准。

更新时间: 2025-07-17 15:50:49

领域: cs.CR,cs.PL

下载: http://arxiv.org/abs/2505.17335v2

ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs

Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose Contextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.

Updated: 2025-07-17 15:50:27

标题: ConTextual: 使用保留上下文的标记过滤和知识图提高LLMs中的临床文本摘要

摘要: 非结构化的临床数据可以作为一种独特和丰富的信息来源，可以有意义地指导临床实践。从这些数据中提取最相关的上下文对于充分利用其真正潜力，以实现患者护理中的最佳和及时决策至关重要。虽然先前的研究已经探索了各种临床文本摘要的方法，但大多数先前的研究要么统一处理所有输入标记，要么依赖基于启发式的过滤器，这可能忽略了微妙的临床线索，并且未能优先考虑决策所需的信息。在本研究中，我们提出了一种新颖的框架Contextual，它将保留上下文的标记过滤方法与领域特定的知识图（KG）相结合，以实现上下文增强。通过保留特定上下文中重要的标记并用结构化知识丰富它们，ConTextual改善了语言连贯性和临床忠实度。我们在两个公共基准数据集上进行了广泛的实证评估，结果表明ConTextual在性能上始终优于其他基线。我们提出的方法突显了标记级过滤和结构化检索在增强语言和临床完整性方面的互补作用，同时提供了一种可扩展的解决方案，以提高临床文本生成的精度。

更新时间: 2025-07-17 15:50:27

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.16394v3

HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

Updated: 2025-07-17 15:47:49

标题: HATS: 用于评估大型语言模型推理能力的印地语类比测试集

摘要: 类比测试评估模型推断概念之间的隐含关系的能力，因此成为评估推理能力的关键基准。尽管大型语言模型（LLMs）在英语推理方面得到广泛评估，但它们在印度语言中的能力仍然鲜为人知，这限制了我们对这些模型是否可以跨语言泛化的理解。为了填补这一空白，我们引入了一个新的印地语类比测试集（HATS），包括405道来自印度政府考试的多项选择题。我们使用各种提示策略对最先进的多语言LLMs进行基准测试，并引入了一种基于认知类比推理理论的扎实Chain of Thought方法。这种方法提高了模型在印地语类比问题上的表现。我们的实验证明，无论采用何种提示策略，模型在英语提示下表现最佳。我们的测试集解决了在印地语中评估LLM推理能力的关键资源缺失问题。

更新时间: 2025-07-17 15:47:49

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13238v1

VITA: Vision-to-Action Flow Matching Policy

We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.

Updated: 2025-07-17 15:41:57

标题: VITA：视觉到行动流匹配策略

摘要: 我们提出了VITA，一种将潜在视觉表示演变为潜在动作以进行视觉运动控制的策略。传统的流匹配和扩散策略从标准来源分布（例如，高斯噪声）中采样，并需要额外的调节机制，如交叉注意力来使动作生成依赖于视觉信息，从而产生时间和空间开销。VITA提出了一种新颖的范式，将潜在图像视为流源，学习从视觉到动作的内在映射，同时消除了单独的调节模块并保留了生成建模能力。在视觉和动作之间学习流是具有挑战性的，因为动作数据稀疏、缺乏语义结构，高维视觉表示和原始动作之间存在维度不匹配。我们通过使用自动编码器创建结构化动作潜在空间作为流匹配目标，将原始动作上采样以匹配视觉表示形状来解决这个问题。关键是，我们通过流潜在解码对编码器目标和最终动作输出监督流匹配，通过流潜在解码将动作重建损失反向传播到用于有效端到端学习的顺序流匹配ODE求解步骤中。作为简单的MLP层实现，VITA在ALOHA平台上的具有挑战性的双手操作任务上进行了评估，包括5个模拟任务和2个真实世界任务。尽管其简单性，仅使用MLP的VITA在推理延迟方面表现优于或与最先进的生成策略相匹配，而与需要不同调节机制或复杂体系结构的传统流匹配策略相比，推理延迟降低了50-130%。据我们所知，VITA是第一个仅使用MLP的流匹配策略，能够解决类似ALOHA基准中的复杂双手操作任务。

更新时间: 2025-07-17 15:41:57

领域: cs.CV,cs.AI,cs.RO

下载: http://arxiv.org/abs/2507.13231v1

Multiple-Frequencies Population-Based Training

Reinforcement Learning's high sensitivity to hyperparameters is a source of instability and inefficiency, creating significant challenges for practitioners. Hyperparameter Optimization (HPO) algorithms have been developed to address this issue, among them Population-Based Training (PBT) stands out for its ability to generate hyperparameters schedules instead of fixed configurations. PBT trains a population of agents, each with its own hyperparameters, frequently ranking them and replacing the worst performers with mutations of the best agents. These intermediate selection steps can cause PBT to focus on short-term improvements, leading it to get stuck in local optima and eventually fall behind vanilla Random Search over longer timescales. This paper studies how this greediness issue is connected to the choice of evolution frequency, the rate at which the selection is done. We propose Multiple-Frequencies Population-Based Training (MF-PBT), a novel HPO algorithm that addresses greediness by employing sub-populations, each evolving at distinct frequencies. MF-PBT introduces a migration process to transfer information between sub-populations, with an asymmetric design to balance short and long-term optimization. Extensive experiments on the Brax suite demonstrate that MF-PBT improves sample efficiency and long-term performance, even without actually tuning hyperparameters.

Updated: 2025-07-17 15:41:03

标题: 多频率基于人群的训练

摘要: 强化学习对超参数非常敏感，这是实践者面临的一个稳定性和效率问题，给他们带来了重大挑战。为了解决这个问题，已经开发了超参数优化（HPO）算法，其中基于种群的训练（PBT）因其能够生成超参数调度而非固定配置而脱颖而出。PBT训练了一个代理人种群，每个代理人都有自己的超参数，经常对它们进行排名，并用最佳代理人的突变替换表现最差的代理人。这些中间选择步骤可能导致PBT专注于短期改进，使其陷入局部最优解，最终在较长时间尺度上落后于普通的随机搜索。本文研究了这种贪婪问题与进化频率的选择之间的关系，即选择操作的速率。我们提出了多频率基于种群的训练（MF-PBT），这是一种新颖的HPO算法，通过使用子种群，每个子种群以不同的频率进化来解决贪婪问题。MF-PBT引入了一个迁移过程，在子种群之间转移信息，并采用不对称设计来平衡短期和长期优化。对Brax套件进行的大量实验表明，MF-PBT提高了样本效率和长期性能，即使没有实际调整超参数。

更新时间: 2025-07-17 15:41:03

领域: cs.LG,cs.AI,cs.NE

下载: http://arxiv.org/abs/2506.03225v2

$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation

The pursuit of a generalizable stereo matching model, capable of performing across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. On the other hand, global matching architectures, while theoretically more robust, have been historically rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with $S^2M^2$: a global matching architecture that achieves both state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. $S^2M^2$ establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods across most metrics while reconstructing high-quality details with competitive efficiency.

Updated: 2025-07-17 15:40:18

标题: $S^2M^2$: 可扩展的立体匹配模型，用于可靠的深度估计

摘要: 追求一个可在不同分辨率和视差范围下执行而无需特定数据集微调的通用立体匹配模型揭示了一个基本的权衡。迭代局部搜索方法在受限基准上取得高分，但它们的核心机制固有地限制了真正泛化所需的全局一致性。另一方面，虽然全局匹配架构在理论上更加稳健，但由于计算和内存成本过高，历史上一直被认为是不可行的。我们通过$S^2M^2$解决了这个困境：这是一个全局匹配架构，既实现了最先进的准确性，又具有高效性，而不依赖于成本体积过滤或深度细化堆栈。我们的设计集成了一个多分辨率转换器，用于稳健的远程对应性，训练了一个新颖的损失函数，将概率集中在可行匹配上。这种方法实现了视差、遮挡和置信度的更加稳健的联合估计。$S^2M^2$在Middlebury v3和ETH3D基准上建立了一个新的技术水平，在大多数指标上明显优于先前的方法，同时以竞争性效率重建了高质量细节。

更新时间: 2025-07-17 15:40:18

领域: cs.CV,cs.AI,cs.RO

下载: http://arxiv.org/abs/2507.13229v1

Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection

While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI's capabilities, the challenge of inadequate data diversity and volume remains, particularly in construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the two mentioned thresholds. These findings reveal both the potential and weakness of generative AI in addressing DNN training data scarcity.

Updated: 2025-07-17 15:35:27

标题: 合成现实：利用生成式AI动力平台Midjourney进行建筑工人检测

摘要: 最近深度神经网络（DNNs）的进展显著增强了视觉人工智能的能力，但在建筑领域，数据的多样性和数量不足仍然是一个挑战。本研究提出了一种新颖的图像合成方法，专门用于建筑工人检测，利用生成式人工智能平台Midjourney。该方法涉及通过制定3000个不同的提示生成一组12000张合成图像，重点放在图像的逼真性和多样性上。这些经过手动标记的图像作为DNN训练的数据集。在真实的建筑图像数据集上进行评估，模型在0.5和0.5至0.95的IoU阈值下分别达到0.937和0.642的平均精度（APs）。值得注意的是，模型在合成数据集上表现几乎完美，分别在两个提到的阈值下达到0.994和0.919的APs。这些发现揭示了生成式人工智能在解决DNN训练数据稀缺性方面的潜力和弱点。

更新时间: 2025-07-17 15:35:27

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13221v1

V-Max: A Reinforcement Learning Framework for Autonomous Driving

Learning-based decision-making has the potential to enable generalizable Autonomous Driving (AD) policies, reducing the engineering overhead of rule-based approaches. Imitation Learning (IL) remains the dominant paradigm, benefiting from large-scale human demonstration datasets, but it suffers from inherent limitations such as distribution shift and imitation gaps. Reinforcement Learning (RL) presents a promising alternative, yet its adoption in AD remains limited due to the lack of standardized and efficient research frameworks. To this end, we introduce V-Max, an open research framework providing all the necessary tools to make RL practical for AD. V-Max is built on Waymax, a hardware-accelerated AD simulator designed for large-scale experimentation. We extend it using ScenarioNet's approach, enabling the fast simulation of diverse AD datasets.

Updated: 2025-07-17 15:30:27

标题: V-Max：自动驾驶的强化学习框架

摘要: 学习型决策具有潜力实现通用自动驾驶（AD）策略，减少基于规则的方法的工程开销。模仿学习（IL）仍然是主导范式，受益于大规模的人类示范数据集，但它存在固有的局限性，如分布偏移和模仿差距。强化学习（RL）提供了一个有前途的替代方案，但由于缺乏标准化和高效的研究框架，其在AD中的应用仍然有限。为此，我们引入了V-Max，一个开放的研究框架，提供所有必要的工具，使RL在AD中实用。V-Max建立在Waymax之上，这是一个为大规模实验设计的硬件加速的AD模拟器。我们使用ScenarioNet的方法扩展了它，使得能够快速模拟多样化的AD数据集。

更新时间: 2025-07-17 15:30:27

领域: cs.LG,cs.AI,cs.RO

下载: http://arxiv.org/abs/2503.08388v3

Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models

We argue that diffusion models' success in modeling complex distributions is, for the most part, coming from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow out-of-training samples generation. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator training distribution.

Updated: 2025-07-17 15:27:20

标题: 高保真度、高效扩散模型的组合离散潜在编码

摘要: 我们认为扩散模型成功地建模复杂分布在很大程度上来自它们的输入条件。本文从理想表示应该提高样本保真度、易于生成和可组合以允许生成训练以外样本的角度探讨了条件扩散模型所使用的表示。我们引入了离散潜在代码（DLC），这是一种通过自监督学习目标训练的单纯嵌入派生的图像表示。DLC是一系列离散标记，而不是标准的连续图像嵌入。它们易于生成，其组合性使得可以对超出训练分布的新图像进行采样。用DLC训练的扩散模型具有改进的生成保真度，在ImageNet上建立了无条件图像生成的新技术水平。此外，我们展示了组合DLC可以使图像生成器以协调地将图像语义以多样化方式结合生成超出分布的样本。最后，我们展示了DLC如何通过利用大规模预训练语言模型实现文本到图像的生成。我们高效地微调了一个文本扩散语言模型以生成DLC，从而产生超出图像生成器训练分布的新样本。

更新时间: 2025-07-17 15:27:20

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.12318v2

Higher-Order Pattern Unification Modulo Similarity Relations

The combination of higher-order theories and fuzzy logic can be useful in decision-making tasks that involve reasoning across abstract functions and predicates, where exact matches are often rare or unnecessary. Developing efficient reasoning and computational techniques for such a combined formalism presents a significant challenge. In this paper, we adopt a more straightforward approach aiming at integrating two well-established and computationally well-behaved components: higher-order patterns on one side and fuzzy equivalences expressed through similarity relations based on minimum T-norm on the other. We propose a unification algorithm for higher-order patterns modulo these similarity relations and prove its termination, soundness, and completeness. This unification problem, like its crisp counterpart, is unitary. The algorithm computes a most general unifier with the highest degree of approximation when the given terms are unifiable.

Updated: 2025-07-17 15:18:22

标题: 更高阶模式统一在相似关系下的模数化

摘要: 高阶理论和模糊逻辑的结合可以在涉及跨抽象函数和谓词的推理决策任务中发挥作用，其中精确匹配通常很少或不必要。为这种组合形式主义开发高效的推理和计算技术是一个重要挑战。在本文中，我们采用了一种更直接的方法，旨在将两个已经建立并且计算行为良好的组件整合在一起：一方面是高阶模式，另一方面是通过基于最小T-范数的相似关系表达的模糊等价性。我们提出了一个用于高阶模式的统一算法，模除这些相似性关系，并证明了其终止性、正确性和完备性。这个统一问题，就像其清晰的对应物一样，是唯一的。当给定的术语是可统一的时，该算法计算出具有最高近似度的最一般的统一器。

更新时间: 2025-07-17 15:18:22

领域: cs.AI,cs.LO,math.LO,03B70 (Primary) 68T37, 68T27, 68Q42, 03B40, 68V15 (Secondary),F.4.1; I.2.3

下载: http://arxiv.org/abs/2507.13208v1

Bounding the Worst-class Error: A Boosting Approach

This paper tackles the problem of the worst-class error rate, instead of the standard error rate averaged over all classes. For example, a three-class classification task with class-wise error rates of 10%, 10%, and 40% has a worst-class error rate of 40%, whereas the average is 20% under the class-balanced condition. The worst-class error is important in many applications. For example, in a medical image classification task, it would not be acceptable for the malignant tumor class to have a 40% error rate, while the benign and healthy classes have a 10% error rates. To avoid overfitting in worst-class error minimization using Deep Neural Networks (DNNs), we design a problem formulation for bounding the worst-class error instead of achieving zero worst-class error. Moreover, to correctly bound the worst-class error, we propose a boosting approach which ensembles DNNs. We give training and generalization worst-class-error bound. Experimental results show that the algorithm lowers worst-class test error rates while avoiding overfitting to the training set. This code is available at https://github.com/saito-yuya/Bounding-the-Worst-class-error-A-Boosting-Approach.

Updated: 2025-07-17 14:56:17

标题: 限制最差类错误：一种增强方法

摘要: 这篇论文解决了最差类别错误率的问题，而不是对所有类别进行平均的标准错误率。例如，一个三类分类任务，各类别错误率分别为10％、10％和40％，最差类别错误率为40％，而在类别平衡条件下平均错误率为20％。在许多应用中，最差类别错误很重要。例如，在医学图像分类任务中，恶性肿瘤类别的错误率为40％，而良性和健康类别的错误率为10％是不可接受的。为了避免在使用深度神经网络（DNNs）进行最差类别错误最小化时出现过拟合，我们设计了一个问题表述来限制最差类别错误，而不是达到零最差类别错误。此外，为了正确限制最差类别错误，我们提出了一种集成DNNs的增强方法。我们给出了训练和泛化最差类别错误的限制。实验结果表明，该算法降低了最差类别测试错误率，同时避免了对训练集的过拟合。该代码可在https://github.com/saito-yuya/Bounding-the-Worst-class-error-A-Boosting-Approach获取。

更新时间: 2025-07-17 14:56:17

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2310.14890v3

Black Box Deployed -- Functional Criteria for Artificial Moral Agents in the LLM Era

The advancement of powerful yet opaque large language models (LLMs) necessitates a fundamental revision of the philosophical criteria used to evaluate artificial moral agents (AMAs). Pre-LLM frameworks often relied on the assumption of transparent architectures, which LLMs defy due to their stochastic outputs and opaque internal states. This paper argues that traditional ethical criteria are pragmatically obsolete for LLMs due to this mismatch. Engaging with core themes in the philosophy of technology, this paper proffers a revised set of ten functional criteria to evaluate LLM-based artificial moral agents: moral concordance, context sensitivity, normative integrity, metaethical awareness, system resilience, trustworthiness, corrigibility, partial transparency, functional autonomy, and moral imagination. These guideposts, applied to what we term "SMA-LLS" (Simulating Moral Agency through Large Language Systems), aim to steer AMAs toward greater alignment and beneficial societal integration in the coming years. We illustrate these criteria using hypothetical scenarios involving an autonomous public bus (APB) to demonstrate their practical applicability in morally salient contexts.

Updated: 2025-07-17 14:39:29

标题: 黑匣子部署-在LLM时代的人工道德代理的功能标准

摘要: 强大但不透明的大型语言模型（LLMs）的发展需要对用于评估人工道德代理人（AMAs）的哲学标准进行根本修订。之前的LLM框架通常依赖于透明架构的假设，而LLMs由于其随机输出和不透明的内部状态而违背了这一假设。本文认为传统的伦理标准在LLMs方面因为这种不匹配而在实践上已经过时。通过与技术哲学中的核心主题互动，本文提出了一套修订后的十项功能标准，用于评估基于LLMs的人工道德代理人：道德一致性，上下文敏感性，规范完整性，元伦理意识，系统弹性，值得信赖性，可矫正性，部分透明性，功能自主性和道德想象力。这些标准，应用于我们所谓的“SMA-LLS”（通过大型语言系统模拟道德代理人），旨在引导AMAs在未来几年朝着更大的一致性和有益的社会整合方向发展。我们通过涉及自主公共汽车（APB）的假设情景来说明这些标准的实际适用性，展示它们在道德关键环境中的实际应用。

更新时间: 2025-07-17 14:39:29

领域: cs.AI,68T27, 03B42 68T27, 03B4268T27, 03B42 68T27, 03B42 68T27, 03B42 68T27, 03B42 68T27, 03B42 68T27, 03B4268T27, 03B42,I.2.0; I.2.9; K.4.1

下载: http://arxiv.org/abs/2507.13175v1

Aligning Humans and Robots via Reinforcement Learning from Implicit Human Feedback

Conventional reinforcement learning (RL) ap proaches often struggle to learn effective policies under sparse reward conditions, necessitating the manual design of complex, task-specific reward functions. To address this limitation, rein forcement learning from human feedback (RLHF) has emerged as a promising strategy that complements hand-crafted rewards with human-derived evaluation signals. However, most existing RLHF methods depend on explicit feedback mechanisms such as button presses or preference labels, which disrupt the natural interaction process and impose a substantial cognitive load on the user. We propose a novel reinforcement learning from implicit human feedback (RLIHF) framework that utilizes non-invasive electroencephalography (EEG) signals, specifically error-related potentials (ErrPs), to provide continuous, implicit feedback without requiring explicit user intervention. The proposed method adopts a pre-trained decoder to transform raw EEG signals into probabilistic reward components, en abling effective policy learning even in the presence of sparse external rewards. We evaluate our approach in a simulation environment built on the MuJoCo physics engine, using a Kinova Gen2 robotic arm to perform a complex pick-and-place task that requires avoiding obstacles while manipulating target objects. The results show that agents trained with decoded EEG feedback achieve performance comparable to those trained with dense, manually designed rewards. These findings validate the potential of using implicit neural feedback for scalable and human-aligned reinforcement learning in interactive robotics.

Updated: 2025-07-17 14:35:12

标题: 通过从隐式人类反馈中学习强化学习，实现人类与机器人的对齐

摘要: 传统的强化学习方法经常在稀疏奖励条件下很难学习到有效的策略，因此需要手动设计复杂的、特定任务的奖励函数。为了解决这一局限性，强化学习从人类反馈（RLHF）已经成为一种很有前途的策略，它通过人类派生的评估信号来补充手工精心设计的奖励。然而，大多数现有的RLHF方法依赖于明确的反馈机制，如按钮按压或偏好标签，这会打断自然的交互过程，并给用户带来很大的认知负担。我们提出了一种新颖的从隐式人类反馈（RLIHF）中学习的强化学习框架，利用非侵入式的脑电图（EEG）信号，特别是错误相关电位（ErrPs），提供连续的、隐式的反馈，而不需要明确的用户干预。该方法采用预训练的解码器将原始EEG信号转换为概率奖励组件，使得即使在外部奖励稀疏的情况下，也能有效地学习策略。我们在基于MuJoCo物理引擎构建的仿真环境中评估了我们的方法，使用Kinova Gen2机械臂执行一个需要避开障碍物同时操作目标物体的复杂拾取放置任务。结果表明，使用解码的EEG反馈训练的代理能够达到与使用密集手动设计的奖励训练的代理相当的性能。这些发现验证了利用隐式神经反馈进行可扩展和与人类对齐的互动机器人强化学习的潜力。

更新时间: 2025-07-17 14:35:12

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.13171v1

SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks

Audio plays a crucial role in applications like speaker verification, voice-enabled smart devices, and audio conferencing. However, audio manipulations, such as deepfakes, pose significant risks by enabling the spread of misinformation. Our empirical analysis reveals that existing methods for detecting deepfake audio are often vulnerable to anti-forensic (AF) attacks, particularly those attacked using generative adversarial networks. In this article, we propose a novel collaborative learning method called SHIELD to defend against generative AF attacks. To expose AF signatures, we integrate an auxiliary generative model, called the defense (DF) generative model, which facilitates collaborative learning by combining input and output. Furthermore, we design a triplet model to capture correlations for real and AF attacked audios with real-generated and attacked-generated audios using auxiliary generative models. The proposed SHIELD strengthens the defense against generative AF attacks and achieves robust performance across various generative models. The proposed AF significantly reduces the average detection accuracy from 95.49% to 59.77% for ASVspoof2019, from 99.44% to 38.45% for In-the-Wild, and from 98.41% to 51.18% for HalfTruth for three different generative models. The proposed SHIELD mechanism is robust against AF attacks and achieves an average accuracy of 98.13%, 98.58%, and 99.57% in match, and 98.78%, 98.62%, and 98.85% in mismatch settings for the ASVspoof2019, In-the-Wild, and HalfTruth datasets, respectively.

Updated: 2025-07-17 14:33:54

标题: SHIELD：一种安全且高度增强的集成学习方法，用于抵御对抗性攻击的深度伪造检测

摘要: 音频在演讲者验证、语音启用智能设备和音频会议等应用中起着至关重要的作用。然而，音频操纵，如深度伪造，通过传播虚假信息带来了重大风险。我们的实证分析表明，现有用于检测深度伪造音频的方法通常容易受到反取证（AF）攻击的影响，特别是那些使用生成对抗网络进行攻击的方法。在本文中，我们提出了一种称为SHIELD的新颖的协作学习方法，用于抵御生成式AF攻击。为了揭示AF签名，我们集成了一个辅助生成模型，称为防御（DF）生成模型，通过结合输入和输出促进协作学习。此外，我们设计了一个三元模型，用于捕捉真实和受到AF攻击的音频与真实生成和被攻击生成音频之间的相关性，使用辅助生成模型。所提出的SHIELD加强了对生成式AF攻击的防御，并在各种生成模型中实现了稳健的性能。所提出的AF显著地将ASVspoof2019的平均检测准确率从95.49%降低到59.77%，In-the-Wild从99.44%降低到38.45%，HalfTruth从98.41%降低到51.18%。所提出的SHIELD机制对抗AF攻击具有稳健性，并在ASVspoof2019、In-the-Wild和HalfTruth数据集中的匹配设置中实现了98.13%、98.58%和99.57%的平均准确率，不匹配设置中实现了98.78%、98.62%和98.85%的准确率。

更新时间: 2025-07-17 14:33:54

领域: cs.SD,cs.AI,cs.CR,cs.LG,eess.AS

下载: http://arxiv.org/abs/2507.13170v1

Prompt Injection 2.0: Hybrid AI Threats

Prompt injection attacks, where malicious input is designed to manipulate AI systems into ignoring their original instructions and following unauthorized commands instead, were first discovered by Preamble, Inc. in May 2022 and responsibly disclosed to OpenAI. Over the last three years, these attacks have continued to pose a critical security threat to LLM-integrated systems. The emergence of agentic AI systems, where LLMs autonomously perform multistep tasks through tools and coordination with other agents, has fundamentally transformed the threat landscape. Modern prompt injection attacks can now combine with traditional cybersecurity exploits to create hybrid threats that systematically evade traditional security controls. This paper presents a comprehensive analysis of Prompt Injection 2.0, examining how prompt injections integrate with Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), and other web security vulnerabilities to bypass traditional security measures. We build upon Preamble's foundational research and mitigation technologies, evaluating them against contemporary threats, including AI worms, multi-agent infections, and hybrid cyber-AI attacks. Our analysis incorporates recent benchmarks that demonstrate how traditional web application firewalls, XSS filters, and CSRF tokens fail against AI-enhanced attacks. We also present architectural solutions that combine prompt isolation, runtime security, and privilege separation with novel threat detection capabilities.

Updated: 2025-07-17 14:33:36

标题: 快速注入2.0：混合人工智能威胁

摘要: 快速注入攻击是指恶意输入被设计为操纵人工智能系统，使其忽略原始指令而跟随未经授权的命令。这种攻击于2022年5月首次被Preamble公司发现，并负责向OpenAI披露。在过去的三年里，这些攻击一直对LLM集成系统构成严重安全威胁。代理式人工智能系统的出现，使得LLMs可以通过工具和与其他代理的协调自主执行多步任务，从而根本改变了威胁格局。现代快速注入攻击现在可以与传统的网络安全漏洞相结合，创建混合威胁，系统性地规避传统安全控制。本文对快速注入2.0进行了全面分析，探讨了快速注入如何与跨站脚本（XSS）、跨站请求伪造（CSRF）和其他网络安全漏洞结合以绕过传统安全措施。我们基于Preamble公司的基础研究和缓解技术，评估其对抗包括AI蠕虫、多代理感染和混合网络-人工智能攻击在内的当代威胁的效果。我们的分析结合了最近的基准测试，展示了传统的网络应用防火墙、XSS过滤器和CSRF令牌在面对AI增强攻击时的失败。我们还提出了结合了提示隔离、运行时安全性和特权分离的架构解决方案，具有新颖的威胁检测能力。

更新时间: 2025-07-17 14:33:36

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.13169v1

Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models

Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.

Updated: 2025-07-17 14:29:34

标题: Orbis：克服驾驶世界模型中长期预测的挑战

摘要: 现有的自动驾驶世界模型在长期规划生成和挑战性场景泛化方面存在困难。在这项工作中，我们开发了一个模型，使用简单的设计选择，没有额外的监督或传感器，如地图、深度或多个摄像头。我们展示了我们的模型尽管只有469M参数，并且仅在280小时的视频数据上训练，但表现卓越。它在转弯操作和城市交通等困难场景中特别突出。我们测试了离散标记模型可能优于基于流匹配的连续模型的优势。为此，我们建立了一个混合标记器，与两种方法兼容，并允许进行并行比较。我们的研究结论支持连续自回归模型，该模型在个别设计选择上不太脆弱，并且比基于离散标记构建的模型更强大。代码、模型和定性结果可在https://lmb-freiburg.github.io/orbis.github.io/上公开获取。

更新时间: 2025-07-17 14:29:34

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.13162v1

CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation

Multilingual Large Language Models(MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages, due to training data imbalances. These hallucinations, which include inaccurate or fabricated outputs, are particularly problematic in domain-specific generation tasks (Chataigner et al., 2024). To address this challenge, we propose CCL-XCoT(Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought), a two-stage fine-tuning framework for mitigating hallucination in MLLMs. Our approach first enhances cross-lingual semantic alignment through curriculum-based contrastive learning combined with next-token prediction during continued pre-training. Building on this foundation, we then introduce a cross-lingual Chain-of-Thought (XCoT) prompting strategy during instruction fine-tuning, which guides the model to reason in a high-resource language before generating answers in the target low-resource language. Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs, without relying on external retrieval or multi-model ensembles.

Updated: 2025-07-17 14:25:24

标题: CCL-XCoT：一种有效的跨语言知识转移方法，用于减轻幻觉生成

摘要: 多语言大型语言模型（MLLMs）表现出跨语言的强大泛化能力，但它们仍然容易出现幻觉，特别是在低资源语言中，这是由于训练数据不平衡造成的。这些幻觉包括不准确或虚构的输出，在特定领域的生成任务中尤为棘手。为了解决这一挑战，我们提出了CCL-XCoT（基于课程的对比学习的跨语言思维链）——一个用于减轻MLLMs中幻觉的两阶段微调框架。我们的方法首先通过基于课程的对比学习结合持续预训练期间的下一个标记预测来增强跨语言语义对齐。在此基础上，我们在指导微调期间引入了跨语言思维链（XCoT）提示策略，该策略引导模型在生成目标低资源语言答案之前在高资源语言中进行推理。实验结果显示，CCL-XCoT可以将幻觉率降低高达62％，并且大幅提高了跨语言对之间的事实知识转移，而无需依赖外部检索或多模型集合。

更新时间: 2025-07-17 14:25:24

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.14239v1

Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.

Updated: 2025-07-17 14:22:24

标题: 逆强化学习遇见大型语言模型后训练：基础、进展和机遇

摘要: 在大语言模型(LLMs)时代，对齐问题已经成为追求更可靠、可控和功能更强大的机器智能的一个基础性而具有挑战性的问题。最近推理模型和对话人工智能系统的成功凸显了强化学习(RL)在增强这些系统中的关键作用，推动了在RL和LLM对齐交叉领域的研究兴趣增加。本文通过逆强化学习(IRL)的视角全面回顾了LLM对齐方面的最新进展，强调了在LLM对齐和传统RL任务中所使用的RL技术之间的区别。特别地，我们强调了从人类数据构建神经奖励模型的必要性，并讨论了这种范式转变的形式和实践影响。我们首先介绍RL中的基本概念，为那些对这个领域不熟悉的读者提供基础。然后我们检查了这一研究议程的最新进展，讨论了进行LLM对齐的IRL时面临的关键挑战和机遇。除了方法论考虑外，我们还探讨了实际方面，包括数据集、基准、评估指标、基础设施以及计算效率高的训练和推理技术。最后，我们从稀疏奖励RL的文献中汲取见解，以找出未解决的问题和潜在的研究方向。通过综合多种研究结果，我们旨在提供对该领域的结构化和批判性概述，突出未解决的挑战，并概述通过RL和IRL技术改进LLM对齐的有前途的未来方向。

更新时间: 2025-07-17 14:22:24

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.13158v1

AI-ming backwards: Vanishing archaeological landscapes in Mesopotamia and automatic detection of sites on CORONA imagery

By upgrading an existing deep learning model with the knowledge provided by one of the oldest sets of grayscale satellite imagery, known as CORONA, we improved the AI model attitude towards the automatic identification of archaeological sites in an environment which has been completely transformed in the last five decades, including the complete destruction of many of those same sites. The initial Bing based convolutional network model was retrained using CORONA satellite imagery for the district of Abu Ghraib, west of Baghdad, central Mesopotamian floodplain. The results were twofold and surprising. First, the detection precision obtained on the area of interest increased sensibly: in particular, the Intersection over Union (IoU) values, at the image segmentation level, surpassed 85 percent, while the general accuracy in detecting archeological sites reached 90 percent. Second, our retrained model allowed the identification of four new sites of archaeological interest (confirmed through field verification), previously not identified by archaeologists with traditional techniques. This has confirmed the efficacy of using AI techniques and the CORONA imagery from the 1960 to discover archaeological sites currently no longer visible, a concrete breakthrough with significant consequences for the study of landscapes with vanishing archaeological evidence induced by anthropization

Updated: 2025-07-17 14:21:50

标题: 《人工智能倒退：美索不达米亚消失的考古景观及CORONA卫星图像上遗址的自动检测》

摘要: 通过利用CORONA这一最古老的一套灰度卫星图像数据集所提供的知识，我们升级了一个现有的深度学习模型，从而改善了AI模型对自动识别考古遗址的态度。这些遗址所处的环境在过去五十年发生了彻底的变化，包括许多遗址的完全毁坏。我们重新训练了最初基于必应的卷积网络模型，使用CORONA卫星图像对巴格达西部的阿布格莱卜地区进行了训练，该地区位于中美索不达米亚洪泛平原。结果令人惊讶且具有双重性。首先，我们在感兴趣区域上获得的检测精度明显提高：特别是在图像分割水平上，交集超过联合（IoU）值超过85％，而在检测考古遗址方面的总体准确率达到了90％。其次，我们重新训练的模型允许发现四个新的考古遗址（通过现场验证确认），这些遗址以前未被传统技术的考古学家发现。这证实了利用AI技术和1960年的CORONA图像发现当前不再可见的考古遗址的有效性，这是对由人类活动引起的消失考古证据的景观研究产生重大影响的具体突破。

更新时间: 2025-07-17 14:21:50

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.13420v1

SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, preventing fully incorporating experiential knowledge and thus resulting in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that an multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

Updated: 2025-07-17 14:13:50

标题: SE-VLN：基于多模态大语言模型的自进化视觉语言导航框架

摘要: 最近在视觉语言导航（VLN）领域的一些进展主要归因于新兴的大型语言模型（LLMs）。这些方法在指令理解和任务推理方面展现出优秀的泛化能力。然而，它们受制于LLMs的固定知识库和推理能力，无法充分整合经验知识，因此缺乏高效的演进能力。为了解决这个问题，我们从自然智能体的演化能力中汲取灵感，提出了一个自进化的VLN框架（SE-VLN），赋予VLN智能体在测试过程中持续演化的能力。据我们所知，这是首次提出一个多模态LLM驱动的自进化VLN框架。具体而言，SE-VLN包括三个核心模块，即分层记忆模块将成功和失败案例转化为可重复利用的知识，一个检索增强的基于思维的推理模块检索经验并实现多步决策，以及一个反思模块实现持续演化。全面的测试表明，SE-VLN在未知环境中实现了57%和35.2%的导航成功率，在R2R和REVERSE数据集上分别比当前最先进方法提高了23.9%和15.0%的绝对性能。此外，随着经验知识库的增加，SE-VLN表现出性能的提升，说明其作为VLN的自进化智能体框架具有巨大的潜力。

更新时间: 2025-07-17 14:13:50

领域: cs.CV,cs.AI,cs.RO

下载: http://arxiv.org/abs/2507.13152v1

Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID

Personalizing Stable Diffusion for professional portrait generation from amateur photos faces challenges in maintaining facial resemblance. This paper evaluates the impact of augmentation strategies on two personalization methods: DreamBooth and InstantID. We compare classical augmentations (flipping, cropping, color adjustments) with generative augmentation using InstantID's synthetic images to enrich training data. Using SDXL and a new FaceDistance metric based on FaceNet, we quantitatively assess facial similarity. Results show classical augmentations can cause artifacts harming identity retention, while InstantID improves fidelity when balanced with real images to avoid overfitting. A user study with 97 participants confirms high photorealism and preferences for InstantID's polished look versus DreamBooth's identity accuracy. Our findings inform effective augmentation strategies for personalized text-to-image generation.

Updated: 2025-07-17 14:11:40

标题: 通过增强生成合成数据以提高DreamBooth和InstantID中的面部相似度

摘要: 将稳定扩散个性化应用于从业余照片生成专业肖像面临保持面部相似性的挑战。本文评估了增强策略对两种个性化方法DreamBooth和InstantID的影响。我们比较了传统的增强方法（翻转、裁剪、颜色调整）与使用InstantID的生成增强方法，通过合成图像丰富训练数据。使用SDXL和基于FaceNet的新FaceDistance指标，我们定量评估面部相似性。结果表明，传统的增强方法可能导致损害身份保留的人工制品，而InstantID在与真实图像平衡使用以避免过拟合时提高了保真度。一项涉及97名参与者的用户研究确认了高度逼真和InstantID的优雅外观相对于DreamBooth的身份准确性。我们的研究结果为个性化文本到图像生成提供了有效的增强策略。

更新时间: 2025-07-17 14:11:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2505.03557v2

DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoints detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust-semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on EuRoC dataset, while running efficiently at 72 FPS with less than 1GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.

Updated: 2025-07-17 14:09:34

标题: "DINO-VO：基于特征的视觉里程计利用视觉基础模型"

摘要: 基于学习的单目视觉里程计（VO）在机器人领域面临着鲁棒性、泛化性和效率性挑战。最近在视觉基础模型方面的进展，如DINOv2，已经提高了在各种视觉任务中的鲁棒性和泛化性，然而由于特征粒度粗糙，它们在VO中的整合仍然受到限制。在本文中，我们提出了DINO-VO，这是一个基于特征的VO系统，利用DINOv2视觉基础模型进行稀疏特征匹配。为了解决整合挑战，我们提出了一个专门针对DINOv2粗糙特征的显著关键点检测器。此外，我们将DINOv2的鲁棒语义特征与细粒度几何特征相结合，从而得到更易定位的表示。最后，一个基于transformer的匹配器和可微分的姿态估计层通过学习良好的匹配来实现精确的相机运动估计。与之前的检测器-描述符网络如SuperPoint相比，DINO-VO在具有挑战性的环境中表现出更大的鲁棒性。此外，我们展示了所提出的特征描述符在独立DINOv2粗糙特征方面具有更高的精度和泛化性。DINO-VO在TartanAir和KITTI数据集上优于先前的帧间VO方法，并且在EuRoC数据集上具有竞争力，同时在单个GPU上以不到1GB的内存使用率以72 FPS的效率运行。此外，它在户外驾驶场景中与视觉SLAM系统竞争，并展示了其泛化能力。

更新时间: 2025-07-17 14:09:34

领域: cs.CV,cs.AI,cs.RO

下载: http://arxiv.org/abs/2507.13145v1

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g. SWE-bench reports 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.

Updated: 2025-07-17 14:04:07

标题: SWE-MERA：一个用于在软件工程任务上对大型语言模型进行主动评估的动态基准

摘要: 软件工程中大型语言模型（LLMs）的快速进展揭示了现有基准测试中的关键限制，特别是广泛使用的SWE-bench数据集。最近的研究揭示了严重的数据污染问题，例如SWE-bench报告32.67%的成功修补程序涉及直接解决方案泄漏，31.08%的修补程序通过不足的测试用例而通过。我们引入了SWE-MERA，这是一个动态、持续更新的基准测试，旨在通过自动收集真实的GitHub问题和严格的质量验证来解决这些基本挑战。我们的方法实现了一个可靠的流水线，确保质量同时最小化污染风险，目前大约有10,000个潜在任务，其中有300个样本可用。使用Aider编码代理进行评估表明在最新模型中具有强大的区分能力。我们报告了在2024年9月至2025年6月之间收集的任务上评估的数十个最新LLMs的性能。

更新时间: 2025-07-17 14:04:07

领域: cs.SE,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.11059v2

Prediction of Highway Traffic Flow Based on Artificial Intelligence Algorithms Using California Traffic Data

The study "Prediction of Highway Traffic Flow Based on Artificial Intelligence Algorithms Using California Traffic Data" presents a machine learning-based traffic flow prediction model to address global traffic congestion issues. The research utilized 30-second interval traffic data from California Highway 78 over a five-month period from July to November 2022, analyzing a 7.24 km westbound section connecting "Melrose Dr" and "El-Camino Real" in the San Diego area. The study employed Multiple Linear Regression (MLR) and Random Forest (RF) algorithms, analyzing data collection intervals ranging from 30 seconds to 15 minutes. Using R^2, MAE, and RMSE as performance metrics, the analysis revealed that both MLR and RF models performed optimally with 10-minute data collection intervals. These findings are expected to contribute to future traffic congestion solutions and efficient traffic management.

Updated: 2025-07-17 13:27:38

标题: 基于加利福尼亚交通数据的人工智能算法预测高速公路交通流量

摘要: 这项研究“基于加利福尼亚交通数据的人工智能算法预测高速公路交通流量”提出了一个基于机器学习的交通流量预测模型，以解决全球交通拥堵问题。该研究利用了来自2022年7月至11月期间加利福尼亚78号高速公路的30秒间隔交通数据，分析了连接圣地亚哥地区“Melrose Dr”和“El-Camino Real”的7.24公里西行路段。研究采用了多元线性回归（MLR）和随机森林（RF）算法，分析了从30秒到15分钟的数据收集间隔。利用R^2、MAE和RMSE作为性能指标，分析表明，MLR和RF模型在10分钟数据收集间隔下表现最佳。这些发现有望为未来的交通拥堵解决方案和高效的交通管理做出贡献。

更新时间: 2025-07-17 13:27:38

领域: cs.AI

下载: http://arxiv.org/abs/2507.13112v1

Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2 and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.

Updated: 2025-07-17 13:24:18

标题: 任务-电路量化：利用知识局部化和可解释性进行压缩

摘要: 培训后量化（PTQ）通过将完全精度权重映射到低位权重而无需昂贵的重新训练来减少模型的内存占用，但在低2到3位设置中可能会降低其下游性能。我们开发了一种新的混合精度PTQ方法，称为任务-电路量化（TaCQ），其类似于自动电路发现，直接将量化过程条件化为特定的权重电路 - 我们定义为与下游任务性能相关联的权重集。这些权重保持为16位权重，而其他权重经过量化，仅增加了边际内存成本而保持性能。具体来说，TaCQ将未量化的模型权重与均匀量化的模型进行对比，以估计由于量化而导致的权重变化，并使用梯度信息来预测对任务性能的影响，从而使我们能够保留特定任务的权重。我们将基于TaCQ的量化与现有的混合精度量化方法进行比较，同时对常规数据和特定任务数据进行条件化。在Llama-3和Qwen2.5的QA、数学推理和文本到SQL任务中，我们发现TaCQ在使用相同校准数据和更低的权重预算的基线上实现了较大的改进，在2和3位制度中取得了重大进展。仅使用3.1位，我们就能恢复96%的Llama-3-8B-Instruct未量化的16位MMLU性能，比SPQR提高了5.25%。我们还观察到，在2位制度中，与最强基线SliM-LLM相比，我们始终取得了显著的增益，平均增益为14.74%。此外，我们观察到在不针对特定任务进行条件化的情况下，也可以获得7.20%的增益，显示了TaCQ识别重要权重的能力不仅限于受任务条件限制的设置。

更新时间: 2025-07-17 13:24:18

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2504.07389v2

Outfox: a Packet Format for a Layered Mixnet

Anonymous communication relies on encrypted packet formats that resist traffic analysis and ensure unlinkability. Sphinx, the current standard for mixnets, provides strong anonymity but relies on classical public-key cryptography, making it vulnerable to quantum attacks. In this paper, we present Outfox, a simplified variant of Sphinx tailored for mixnets with fixed-length routes and designed for post-quantum security. Outfox reduces both computational and communication costs. We formally define Outfox and prove its security in the Universal Composability (UC) framework. Our evaluation shows that Outfox retains strong anonymity guarantees while offering improved efficiency and adaptability to quantum-resistant cryptographic primitives.

Updated: 2025-07-17 13:23:26

标题: 《Outfox：一种用于分层混合网络的数据包格式》

摘要: 匿名通信依赖于加密数据包格式，可以抵抗流量分析并确保不可追踪性。 Sphinx是当前混合网络的标准，提供强大的匿名性，但依赖于传统的公钥加密技术，容易受到量子攻击的威胁。在本文中，我们提出了Outfox，这是Sphinx的简化变体，专为具有固定长度路由的混合网络设计，旨在提供后量子安全性。Outfox降低了计算和通信成本。我们正式定义了Outfox，并在通用可组合性（UC）框架下证明了其安全性。我们的评估显示，Outfox保持了强大的匿名性保证，同时提供了改进的效率和适应性，可以应用于抵抗量子攻击的加密原语。

更新时间: 2025-07-17 13:23:26

领域: cs.CR

下载: http://arxiv.org/abs/2412.19937v2

Language Models Change Facts Based on the Way You Talk

Large language models (LLMs) are increasingly being used in user-facing applications, from providing medical consultations to job interview advice. Recent research suggests that these models are becoming increasingly proficient at inferring identity information about the author of a piece of text from linguistic patterns as subtle as the choice of a few words. However, little is known about how LLMs use this information in their decision-making in real-world applications. We perform the first comprehensive analysis of how identity markers present in a user's writing bias LLM responses across five different high-stakes LLM applications in the domains of medicine, law, politics, government benefits, and job salaries. We find that LLMs are extremely sensitive to markers of identity in user queries and that race, gender, and age consistently influence LLM responses in these applications. For instance, when providing medical advice, we find that models apply different standards of care to individuals of different ethnicities for the same symptoms; we find that LLMs are more likely to alter answers to align with a conservative (liberal) political worldview when asked factual questions by older (younger) individuals; and that LLMs recommend lower salaries for non-White job applicants and higher salaries for women compared to men. Taken together, these biases mean that the use of off-the-shelf LLMs for these applications may cause harmful differences in medical care, foster wage gaps, and create different political factual realities for people of different identities. Beyond providing an analysis, we also provide new tools for evaluating how subtle encoding of identity in users' language choices impacts model decisions. Given the serious implications of these findings, we recommend that similar thorough assessments of LLM use in user-facing applications are conducted before future deployment.

Updated: 2025-07-17 13:21:17

标题: 语言模型根据您的说话方式改变事实

摘要: 大型语言模型（LLMs）越来越多地被用于面向用户的应用，从提供医疗咨询到面试建议。最近的研究表明，这些模型越来越擅长从一篇文本的语言模式中推断出作者的身份信息，甚至细微到选择几个词。然而，在真实世界的应用中，人们对LLMs如何利用这些信息做出决策知之甚少。我们进行了首次全面分析，研究了在医学、法律、政治、政府福利和工资等五个不同高风险领域中，用户写作中的身份标记如何影响LLMs的响应。我们发现，LLMs对用户查询中的身份标记非常敏感，种族、性别和年龄在这些应用中持续影响着LLMs的响应。例如，在提供医疗建议时，我们发现模型对不同种族的个体在相同症状下应用不同的护理标准；我们发现，当年长（年轻）个体提出事实性问题时，LLMs更倾向于调整答案以符合保守（自由）的政治世界观；以及LLMs建议非白人求职者获得较低工资，女性获得较高工资。这些偏见意味着使用现成的LLMs可能会导致医疗护理差异，加剧工资差距，并为不同身份的人创造不同政治事实现实。除了提供分析，我们还提供了新工具，评估用户语言选择中微妙的身份编码如何影响模型决策。鉴于这些发现的严重影响，我们建议在未来部署之前进行类似彻底的LLM在面向用户的应用中的评估。

更新时间: 2025-07-17 13:21:17

领域: cs.CL,cs.AI,cs.CY

下载: http://arxiv.org/abs/2507.14238v1

Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, we design a unified framework to measure the object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs' responses, making it easily generalizable to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE.

Updated: 2025-07-17 13:18:06

标题: 大型视觉-语言模型统一的三元级幻觉评估

摘要: 尽管大型视觉语言模型(LVLMs)在视觉-语言推理方面表现出色，但可能会生成在给定图像中不存在的虚构内容。大多数现有的LVLM幻觉基准都受限于评估与对象相关的幻觉。然而，两个对象之间关系的潜在幻觉，即关系幻觉，仍然缺乏研究。为了解决这个问题，我们设计了一个统一的框架，同时测量LVLMs中的对象和关系幻觉。我们框架的核心思想是通过从LVLMs的响应中提取的(object, relation, object)三元组来评估幻觉，使其易于推广到不同的视觉-语言任务。基于我们的框架，我们进一步引入了Tri-HE，一个新颖的三元组级幻觉评估基准，可以同时研究对象和关系幻觉。通过对Tri-HE的全面评估，我们观察到现有LVLM中的关系幻觉问题甚至比对象幻觉更为严重，突显出一个先前被忽视的问题，即可靠的LVLMs。此外，基于我们的发现，我们设计了一个简单的无需训练的方法，有效缓解LVLMs的幻觉。我们的数据集和用于重现实验的代码可在https://github.com/wujunjie1998/Tri-HE 上公开获取。

更新时间: 2025-07-17 13:18:06

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2410.23114v4

GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training

Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success on modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a DiffusionTransformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen to both objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench grasping benchmark, and performs well on a real robot with noisy visual observations.

Updated: 2025-07-17 13:09:28

标题: GraspGen：基于扩散的六自由度抓取的生成器训练框架

摘要: 抓取是一种基本的机器人技能，尽管已经有了重大的研究进展，基于学习的六自由度抓取方法仍然不能做到即插即用，并且很难在不同的实体和野外环境中推广。我们在最近成功建模物体中心的抓取生成过程的基础上构建了一个迭代扩散过程的框架，我们提出的框架GraspGen包括一个增强抓取生成的DiffusionTransformer架构，配备一个高效的鉴别器来评分和过滤采样的抓取。我们引入了一种新颖且高性能的鉴别器训练方法。为了将GraspGen扩展到物体和夹具，我们发布了一个新的模拟数据集，包含超过5300万个抓取。我们展示了GraspGen在不同夹具下的单个物体模拟中优于之前的方法，在FetchBench抓取基准测试中实现了最先进的性能，并在具有嘈杂视觉观察的实际机器人上表现良好。

更新时间: 2025-07-17 13:09:28

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.13097v1

SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control

Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging a more concise chains of thought. However, we observe that such global length penalty often lead to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, a step-level generalized advantage estimation (S-GAE) and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.

Updated: 2025-07-17 13:07:41

标题: 智能思考者：通过步骤级长度控制学习压缩和保留推理

摘要: 大型推理模型（LRMs）通过推理时间的扩展展现出了卓越的推理能力，但这一进展也引入了相当多的冗余和低效性到它们的推理过程中，导致了大量的计算浪费。先前的工作试图通过在强化学习（RL）期间对生成样本的整体长度进行惩罚来缓解这一问题，目的是鼓励更简洁的思维链。然而，我们观察到，这种全局长度惩罚往往会导致对关键推理步骤的过度压缩，同时保留简单步骤中不必要的细节，产生了准确性和效率之间的次优权衡。为了解决这个问题，我们提出了SmartThinker，这是一个设计用于根据每个单独步骤的重要性实现对推理链长度进行精细控制的两阶段可学习框架。在第一阶段，SmartThinker通过拒绝抽样结合监督微调（SFT）将推理模型调整为短形式推理模式。在第二阶段，SmartThinker应用Step-Level Length Control Policy Optimization（SCPO）来优化模型输出分布，增加分配给关键步骤的长度比例，同时减少不太重要步骤中的冗余。SCPO包括四个核心组件：在线重要性估计器、步级长度控制奖励函数、步级广义优势估计（S-GAE）和难度自适应剪裁策略。这些组件共同作用，使得SCPO能够在推理步骤之间实现差异化长度控制。在多个推理基准和各种骨干模型上的实证结果表明，SmartThinker显著减少了冗余推理，同时实现了与现有方法相当甚至更优越的性能。

更新时间: 2025-07-17 13:07:41

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.04348v2

Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments

The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.

Updated: 2025-07-17 13:02:05

标题: 《准备好的法学家：在动态环境中为法律智能进行语言代理基准测试》

摘要: 静态基准与现实法律实践的动态特性之间的差距对推进法律智能构成了关键障碍。为此，我们引入了J1-ENVS，这是专为基于LLM的代理人量身定制的首个交互式和动态法律环境。在法律专家的指导下，它包括了来自中国法律实践的六个代表性场景，涵盖三个环境复杂性级别。我们进一步引入了J1-EVAL，这是一个细粒度评估框架，旨在评估在不同法律熟练水平下的任务表现和程序遵从性。对17个LLM代理人进行的广泛实验表明，虽然许多模型表现出扎实的法律知识，但它们在动态环境中执行程序时遇到困难。即使是SOTA模型GPT-4o，在总体表现方面也不足60%。这些发现突显了在实现动态法律智能方面持续存在的挑战，并为引导未来研究提供了有价值的见解。

更新时间: 2025-07-17 13:02:05

领域: cs.AI

下载: http://arxiv.org/abs/2507.04037v2

Soft-ECM: An extension of Evidential C-Means for complex data

Clustering based on belief functions has been gaining increasing attention in the machine learning community due to its ability to effectively represent uncertainty and/or imprecision. However, none of the existing algorithms can be applied to complex data, such as mixed data (numerical and categorical) or non-tabular data like time series. Indeed, these types of data are, in general, not represented in a Euclidean space and the aforementioned algorithms make use of the properties of such spaces, in particular for the construction of barycenters. In this paper, we reformulate the Evidential C-Means (ECM) problem for clustering complex data. We propose a new algorithm, Soft-ECM, which consistently positions the centroids of imprecise clusters requiring only a semi-metric. Our experiments show that Soft-ECM present results comparable to conventional fuzzy clustering approaches on numerical data, and we demonstrate its ability to handle mixed data and its benefits when combining fuzzy clustering with semi-metrics such as DTW for time series data.

Updated: 2025-07-17 13:00:22

标题: Soft-ECM：一种用于复杂数据的证据C均值的扩展

摘要: 基于信念函数的聚类由于其有效表示不确定性和/或不精确性的能力，已经在机器学习社区中引起了越来越多的关注。然而，目前不存在的算法能够应用于复杂数据，例如混合数据（数值和分类）或非表格数据，如时间序列。事实上，这些类型的数据通常不在欧几里德空间中表示，并且前述算法利用这些空间的特性，特别是用于构建重心。在本文中，我们重新表述了基于证据的C均值（ECM）问题，用于聚类复杂数据。我们提出了一种新算法，Soft-ECM，它可以一致地定位不精确聚类的质心，仅需要半度量。我们的实验表明，Soft-ECM在数值数据上呈现出与传统模糊聚类方法可比的结果，并且我们展示了它处理混合数据以及将模糊聚类与DTW等半度量相结合处理时间序列数据时的优势。

更新时间: 2025-07-17 13:00:22

领域: cs.LG,cs.AI,cs.DM

下载: http://arxiv.org/abs/2507.13417v1

MUPAX: Multidimensional Problem Agnostic eXplainable AI

Robust XAI techniques should ideally be simultaneously deterministic, model agnostic, and guaranteed to converge. We propose MULTIDIMENSIONAL PROBLEM AGNOSTIC EXPLAINABLE AI (MUPAX), a deterministic, model agnostic explainability technique, with guaranteed convergency. MUPAX measure theoretic formulation gives principled feature importance attribution through structured perturbation analysis that discovers inherent input patterns and eliminates spurious relationships. We evaluate MUPAX on an extensive range of data modalities and tasks: audio classification (1D), image classification (2D), volumetric medical image analysis (3D), and anatomical landmark detection, demonstrating dimension agnostic effectiveness. The rigorous convergence guarantees extend to any loss function and arbitrary dimensions, making MUPAX applicable to virtually any problem context for AI. By contrast with other XAI methods that typically decrease performance when masking, MUPAX not only preserves but actually enhances model accuracy by capturing only the most important patterns of the original data. Extensive benchmarking against the state of the XAI art demonstrates MUPAX ability to generate precise, consistent and understandable explanations, a crucial step towards explainable and trustworthy AI systems. The source code will be released upon publication.

Updated: 2025-07-17 12:59:27

标题: MUPAX：多维问题无关的可解释人工智能 (Multidimensional Problem Agnostic eXplainable AI)

摘要: 鲁棒的可解释人工智能（XAI）技术理想情况下应同时是确定性的、模型无关的，并且保证收敛。我们提出了多维问题无关可解释人工智能（MUPAX），这是一种确定性、模型无关的可解释性技术，具有收敛性的保证。MUPAX的测度论式公式化通过结构化扰动分析提供了基于原则的特征重要性归因，发现了固有的输入模式并消除了虚假关系。我们在广泛的数据模态和任务上评估了MUPAX：音频分类（1D）、图像分类（2D）、体积医学图像分析（3D）和解剖标记检测，展示了维度无关的有效性。严格的收敛保证可扩展到任何损失函数和任意维度，使MUPAX适用于几乎任何AI问题背景。与其他XAI方法相比，通常在掩盖时性能下降，MUPAX不仅保持而且实际上通过仅捕捉原始数据中最重要的模式来提高模型准确性。与XAI艺术水平进行广泛基准测试表明，MUPAX能够生成精确、一致和可理解的解释，这是朝着可解释和值得信赖的AI系统的关键一步。源代码将在发表后发布。

更新时间: 2025-07-17 12:59:27

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.13090v1

MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.

Updated: 2025-07-17 12:55:32

标题: MERA代码：一个统一框架用于评估跨任务的代码生成

摘要: LLMs的进展增强了软件工程中的任务自动化；然而，当前的评估主要侧重于自然语言任务，忽视了代码质量。大多数基准测试更注重高级推理而非可执行代码和真实世界性能，导致对这些模型在生产中的真实能力和风险的理解存在空白。为解决这一问题，我们提出了MERA Code，这是MERA基准测试家族的新成员，专门用于评估最新的生成俄语代码的LLMs。该基准测试包括11个评估任务，涵盖8种编程语言。我们提出的评估方法论包括一个分类法，概述了模型完成这些任务所需的实际编码技能。该基准测试包括一个开源代码库，供用户进行MERA评估，一个与各种编程环境兼容的评分系统，以及一个包含排行榜和提交系统的平台。我们评估了开放的LLMs和前沿的API模型，分析了它们在非英语语言的实际编码任务方面的局限性。我们公开发布MERA，以指导未来的研究，期待模型开发中的突破性特性，并标准化评估程序。

更新时间: 2025-07-17 12:55:32

领域: cs.SE,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.12284v2

Single- to multi-fidelity history-dependent learning with uncertainty quantification and disentanglement: application to data-driven constitutive modeling

Data-driven learning is generalized to consider history-dependent multi-fidelity data, while quantifying epistemic uncertainty and disentangling it from data noise (aleatoric uncertainty). This generalization is hierarchical and adapts to different learning scenarios: from training the simplest single-fidelity deterministic neural networks up to the proposed multi-fidelity variance estimation Bayesian recurrent neural networks. The versatility and generality of the proposed methodology are demonstrated by applying it to different data-driven constitutive modeling scenarios that include multiple fidelities with and without aleatoric uncertainty (noise). The method accurately predicts the response and quantifies model error while also discovering the noise distribution (when present). This opens opportunities for future real-world applications in diverse scientific and engineering domains; especially, the most challenging cases involving design and analysis under uncertainty.

Updated: 2025-07-17 12:45:10

标题: 单到多保真度历史依赖学习与不确定性量化和解耦：应用于数据驱动本构建模

摘要: 数据驱动学习被推广到考虑历史相关的多保真度数据，同时量化认知不确定性并将其与数据噪声（随机不确定性）区分开来。这种泛化是分层的，并适应不同的学习场景：从训练最简单的单保真度确定性神经网络到提出的多保真度方差估计贝叶斯递归神经网络。所提出的方法的多功能性和普遍性通过将其应用于包括多种保真度和有或无随机不确定性（噪声）的不同数据驱动本构建模场景得以证明。该方法准确预测响应并量化模型误差，同时还能发现噪声分布（当存在时）。这为未来在各种科学和工程领域中实际应用的机会打开了大门；尤其是在设计和分析不确定性下最具挑战的案例。

更新时间: 2025-07-17 12:45:10

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13416v1

SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection

Previous studies on multimodal fake news detection mainly focus on the alignment and integration of cross-modal features, as well as the application of text-image consistency. However, they overlook the semantic enhancement effects of large multimodal models and pay little attention to the emotional features of news. In addition, people find that fake news is more inclined to contain negative emotions than real ones. Therefore, we propose a novel Semantic Enhancement and Emotional Reasoning (SEER) Network for multimodal fake news detection. We generate summarized captions for image semantic understanding and utilize the products of large multimodal models for semantic enhancement. Inspired by the perceived relationship between news authenticity and emotional tendencies, we propose an expert emotional reasoning module that simulates real-life scenarios to optimize emotional features and infer the authenticity of news. Extensive experiments on two real-world datasets demonstrate the superiority of our SEER over state-of-the-art baselines.

Updated: 2025-07-17 12:33:45

标题: SEER：用于多模态假新闻检测的语义增强和情感推理网络

摘要: 以前关于多模态假新闻检测的研究主要集中在跨模态特征的对齐和整合，以及文本-图像一致性的应用。然而，它们忽视了大型多模态模型的语义增强效果，并且对新闻的情感特征关注不足。此外，人们发现假新闻更倾向于包含负面情绪而非真实新闻。因此，我们提出了一种新颖的多模态假新闻检测Semantic Enhancement and Emotional Reasoning (SEER) Network。我们为图像语义理解生成摘要标题，并利用大型多模态模型的产品进行语义增强。受到新闻真实性和情感倾向之间的感知关系的启发，我们提出了一个专家情感推理模块，模拟真实生活场景来优化情感特征并推断新闻的真实性。对两个真实世界数据集进行的大量实验表明，我们的SEER比现有基准线表现更优秀。

更新时间: 2025-07-17 12:33:45

领域: cs.MM,cs.AI

下载: http://arxiv.org/abs/2507.13415v1

MedPix 2.0: A Comprehensive Multimodal Biomedical Data set for Advanced AI Applications with Retrieval Augmented Generation and Knowledge Graphs

The increasing interest in developing Artificial Intelligence applications in the medical domain, suffers from the lack of high-quality data set, mainly due to privacy-related issues. In addition, the recent increase in Vision Language Models (VLM) leads to the need for multimodal medical data sets, where clinical reports and findings are attached to the corresponding medical scans. This paper illustrates the entire workflow for building the MedPix 2.0 data set. Starting with the well-known multimodal data set MedPix\textsuperscript{\textregistered}, mainly used by physicians, nurses, and healthcare students for Continuing Medical Education purposes, a semi-automatic pipeline was developed to extract visual and textual data followed by a manual curing procedure in which noisy samples were removed, thus creating a MongoDB database. Along with the data set, we developed a Graphical User Interface aimed at navigating efficiently the MongoDB instance and obtaining the raw data that can be easily used for training and/or fine-tuning VLMs. To enforce this point, in this work, we first recall DR-Minerva, a Retrieve Augmented Generation-based VLM model trained upon MedPix 2.0. DR-Minerva predicts the body part and the modality used to scan its input image. We also propose the extension of DR-Minerva with a Knowledge Graph that uses Llama 3.1 Instruct 8B, and leverages MedPix 2.0. The resulting architecture can be queried in a end-to-end manner, as a medical decision support system. MedPix 2.0 is available on GitHub.

Updated: 2025-07-17 12:30:16

标题: MedPix 2.0：用于高级人工智能应用的综合多模式生物医学数据集，具有检索增强生成和知识图

摘要: 在医学领域发展人工智能应用的兴趣日益增加，但由于隐私问题，高质量数据集的缺乏成为一个主要障碍。此外，最近视觉语言模型（VLM）的增加导致对多模态医学数据集的需求增加，其中临床报告和发现与相应的医学扫描相关联。本文展示了构建MedPix 2.0数据集的整个工作流程。从广泛被医生、护士和医疗学生用于继续医学教育目的的著名多模态数据集MedPix®开始，开发了一个半自动化流水线，用于提取视觉和文本数据，然后进行手动修正程序，去除噪声样本，从而创建一个MongoDB数据库。除了数据集，我们还开发了一个图形用户界面，旨在有效浏览MongoDB实例并获取原始数据，这些数据可以轻松用于训练和/或微调VLM。为了强调这一点，在这项工作中，我们首先回顾了基于MedPix 2.0训练的基于检索增强生成的VLM模型DR-Minerva。DR-Minerva可以预测输入图像的身体部位和扫描使用的模态。我们还提出了将DR-Minerva扩展为使用Llama 3.1 Instruct 8B和利用MedPix 2.0的知识图的方案。结果的架构可以以端到端方式查询，作为医疗决策支持系统。MedPix 2.0已在GitHub上提供。

更新时间: 2025-07-17 12:30:16

领域: cs.DB,cs.AI,cs.LG

下载: http://arxiv.org/abs/2407.02994v5

U-DREAM: Unsupervised Dereverberation guided by a Reverberation Model

This paper explores the outcome of training state-ofthe-art dereverberation models with supervision settings ranging from weakly-supervised to fully unsupervised, relying solely on reverberant signals and an acoustic model for training. Most of the existing deep learning approaches typically require paired dry and reverberant data, which are difficult to obtain in practice. We develop instead a sequential learning strategy motivated by a bayesian formulation of the dereverberation problem, wherein acoustic parameters and dry signals are estimated from reverberant inputs using deep neural networks, guided by a reverberation matching loss. Our most data-efficient variant requires only 100 reverberation-parameter-labelled samples to outperform an unsupervised baseline, demonstrating the effectiveness and practicality of the proposed method in low-resource scenarios.

Updated: 2025-07-17 12:26:18

标题: U-DREAM: 由混响模型引导的无监督去混响

摘要: 本文探讨了使用从弱监督到完全无监督的监督设置训练最先进的去混响模型的结果，仅依赖于混响信号和声学模型进行训练。大多数现有的深度学习方法通常需要成对的干净和混响数据，这在实践中很难获得。相反，我们开发了一种顺序学习策略，受贝叶斯去混响问题公式的启发，其中声学参数和干净信号是使用深度神经网络从混响输入中估计出来的，通过混响匹配损失进行引导。我们最有效的数据效率变体仅需要100个混响参数标记的样本就能胜过无监督基线，证明了所提方法在资源匮乏情况下的有效性和实用性。

更新时间: 2025-07-17 12:26:18

领域: cs.SD,cs.AI,eess.AS,eess.SP

下载: http://arxiv.org/abs/2507.14237v1

Gauge Flow Models

This paper introduces Gauge Flow Models, a novel class of Generative Flow Models. These models incorporate a learnable Gauge Field within the Flow Ordinary Differential Equation (ODE). A comprehensive mathematical framework for these models, detailing their construction and properties, is provided. Experiments using Flow Matching on Gaussian Mixture Models demonstrate that Gauge Flow Models yields significantly better performance than traditional Flow Models of comparable or even larger size. Additionally, unpublished research indicates a potential for enhanced performance across a broader range of generative tasks.

Updated: 2025-07-17 12:24:54

标题: 规范流模型

摘要: 本文介绍了Gauge Flow Models，一种新颖的生成流模型。这些模型在流常微分方程(ODE)中包含可学习的规范场。提供了这些模型的一个全面的数学框架，详细说明了它们的构造和特性。在高斯混合模型上使用流匹配实验表明，Gauge Flow Models比传统的相同或更大尺寸的流模型表现出显著更好的性能。此外，未发表的研究表明，Gauge Flow Models可能在更广泛的生成任务中表现出更好的性能。

更新时间: 2025-07-17 12:24:54

领域: cs.LG,cs.AI,math.DG

下载: http://arxiv.org/abs/2507.13414v1

KEN: Knowledge Augmentation and Emotion Guidance Network for Multimodal Fake News Detection

In recent years, the rampant spread of misinformation on social media has made accurate detection of multimodal fake news a critical research focus. However, previous research has not adequately understood the semantics of images, and models struggle to discern news authenticity with limited textual information. Meanwhile, treating all emotional types of news uniformly without tailored approaches further leads to performance degradation. Therefore, we propose a novel Knowledge Augmentation and Emotion Guidance Network (KEN). On the one hand, we effectively leverage LVLM's powerful semantic understanding and extensive world knowledge. For images, the generated captions provide a comprehensive understanding of image content and scenes, while for text, the retrieved evidence helps break the information silos caused by the closed and limited text and context. On the other hand, we consider inter-class differences between different emotional types of news through balanced learning, achieving fine-grained modeling of the relationship between emotional types and authenticity. Extensive experiments on two real-world datasets demonstrate the superiority of our KEN.

Updated: 2025-07-17 12:20:43

标题: KEN：多模态假新闻检测的知识增强和情感引导网络

摘要: 近年来，社交媒体上虚假信息的猖獗传播使得准确检测多模式假新闻成为一个关键的研究焦点。然而，先前的研究并未充分理解图像的语义，模型在有限的文本信息下难以辨别新闻的真实性。同时，对待所有情绪类型的新闻一视同仁并没有针对性的方法，进一步导致性能下降。因此，我们提出了一种新颖的知识增强和情感引导网络（KEN）。一方面，我们有效利用了LVLM强大的语义理解和丰富的世界知识。对于图像，生成的标题提供了对图像内容和场景的全面理解，而对于文本，检索到的证据有助于打破由封闭和有限的文本和上下文导致的信息孤岛。另一方面，我们通过平衡学习考虑了不同情感类型新闻之间的差异，实现了对情感类型和真实性之间关系的细粒度建模。在两个真实世界数据集上进行的大量实验表明了我们的KEN的优越性。

更新时间: 2025-07-17 12:20:43

领域: cs.MM,cs.AI

下载: http://arxiv.org/abs/2507.09647v2

Backscattering-Based Security in Wireless Power Transfer Applied to Battery-Free BLE Sensors

The integration of security and energy efficiency in Internet of Things systems remains a critical challenge, particularly for battery-free and resource-constrained devices. This paper explores the scalability and protocol-agnostic nature of a backscattering-based security mechanism by integrating it into Bluetooth Low Energy battery-free Wireless Sensor Network. The proposed approach leverages the Wireless Power Transfer link, traditionally used for energy harvesting, to generate additional identification signals without increasing energy consumption or computational demands. Experimental validation demonstrates the solution's functionality using compact, low-gain antenna, ensuring compatibility with size-constrained applications such as Structural Health Monitoring and smart transport. Furthermore, this work addresses the challenges associated with backscattering dynamic range and multi-node Wireless Sensor Network scenarios, discussing potential collisions between identification signals and proposing future improvements to enhance generalizability and scalability. The findings underscore the potential of the backscattering-based security mechanism for creating secure, sustainable, and scalable IoT deployments across diverse protocols and applications.

Updated: 2025-07-17 12:15:09

标题: 基于回波散射的无线输电安全技术在无电池BLE传感器中的应用

摘要: 物联网系统中安全性和能源效率的整合仍然是一个关键挑战，特别是对于无电池和资源受限的设备。本文通过将基于反射的安全机制集成到蓝牙低功耗无电池无线传感器网络中，探讨了该方法的可扩展性和协议无关性。所提出的方法利用传统用于能量收集的无线能量传输链路，生成额外的识别信号，而不增加能量消耗或计算需求。实验验证展示了使用紧凑、低增益天线的解决方案的功能性，确保与尺寸受限的应用程序（如结构健康监测和智能交通）兼容。此外，本研究解决了反射动态范围和多节点无线传感器网络场景的挑战，讨论了识别信号之间的潜在冲突，并提出了未来改进措施以增强泛化性和可扩展性。研究结果强调了基于反射的安全机制在不同协议和应用程序中创建安全、可持续和可扩展的物联网部署的潜力。

更新时间: 2025-07-17 12:15:09

领域: cs.CR

下载: http://arxiv.org/abs/2507.13042v1

MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems

Multi-agent debate (MAD) systems leverage collaborative interactions among large language models (LLMs) agents to improve reasoning capabilities. While recent studies have focused on increasing the accuracy and scalability of MAD systems, their security vulnerabilities have received limited attention. In this work, we introduce MAD-Spear, a targeted prompt injection attack that compromises a small subset of agents but significantly disrupts the overall MAD process. Manipulated agents produce multiple plausible yet incorrect responses, exploiting LLMs' conformity tendencies to propagate misinformation and degrade consensus quality. Furthermore, the attack can be composed with other strategies, such as communication attacks, to further amplify its impact by increasing the exposure of agents to incorrect responses. To assess MAD's resilience under attack, we propose a formal definition of MAD fault-tolerance and develop a comprehensive evaluation framework that jointly considers accuracy, consensus efficiency, and scalability. Extensive experiments on five benchmark datasets with varying difficulty levels demonstrate that MAD-Spear consistently outperforms the baseline attack in degrading system performance. Additionally, we observe that agent diversity substantially improves MAD performance in mathematical reasoning tasks, which challenges prior work suggesting that agent diversity has minimal impact on performance. These findings highlight the urgent need to improve the security in MAD design.

Updated: 2025-07-17 12:09:39

标题: 疯狂之矛：一种基于一致性驱动的对多智能体辩论系统的提示注入攻击

摘要: 多智能体辩论（MAD）系统利用大型语言模型（LLMs）智能体之间的协作互动来提高推理能力。尽管最近的研究集中在提高MAD系统的准确性和可扩展性，但它们的安全漏洞受到了有限关注。在这项工作中，我们介绍了MAD-Spear，一种有针对性的提示注入攻击，可以破坏一小部分智能体，但显著干扰整体MAD过程。被操纵的智能体产生多个看似合理但不正确的响应，利用LLMs的一致性倾向来传播错误信息并降低共识质量。此外，该攻击可以与其他策略结合，如通信攻击，通过增加智能体对不正确响应的暴露来进一步放大其影响。为了评估MAD在遭受攻击时的韧性，我们提出了MAD容错的形式定义，并开发了一个综合评估框架，同时考虑准确性、共识效率和可扩展性。在五个具有不同难度级别的基准数据集上进行的广泛实验表明，MAD-Spear在降低系统性能方面始终优于基线攻击。此外，我们观察到在数学推理任务中，智能体多样性显著改善了MAD的性能，这挑战了先前的研究表明智能体多样性对性能的影响很小的观点。这些发现突显了在MAD设计中改进安全性的迫切需要。

更新时间: 2025-07-17 12:09:39

领域: cs.CR

下载: http://arxiv.org/abs/2507.13038v1

Re-evaluating Short- and Long-Term Trend Factors in CTA Replication: A Bayesian Graphical Approach

Commodity Trading Advisors (CTAs) have historically relied on trend-following rules that operate on vastly different horizons from long-term breakouts that capture major directional moves to short-term momentum signals that thrive in fast-moving markets. Despite a large body of work on trend following, the relative merits and interactions of short-versus long-term trend systems remain controversial. This paper adds to the debate by (i) dynamically decomposing CTA returns into short-term trend, long-term trend and market beta factors using a Bayesian graphical model, and (ii) showing how the blend of horizons shapes the strategy's risk-adjusted performance.

Updated: 2025-07-17 12:09:29

标题: 重新评估CTA复制中的短期和长期趋势因素：一种贝叶斯图形方法

摘要: 商品交易顾问（CTAs）在历史上一直依赖于运作在与捕捉主要方向性移动的长期突破截然不同的时间跨度上的趋势跟随规则，以及在快速市场中蓬勃发展的短期动量信号。尽管在趋势跟随方面有大量研究，但短期与长期趋势系统的相对优点和相互作用仍然存在争议。本文通过（i）使用贝叶斯图模型动态分解CTA收益为短期趋势、长期趋势和市场贝塔因子，并（ii）展示时间跨度的融合如何塑造策略的风险调整绩效，为这一辩论增添了新的内容。

更新时间: 2025-07-17 12:09:29

领域: cs.AI,q-fin.PR,q-fin.ST,q-fin.TR

下载: http://arxiv.org/abs/2507.15876v1

From Paranoia to Compliance: The Bumpy Road of System Hardening Practices on Stack Exchange

Hardening computer systems against cyberattacks is crucial for security. However, past incidents illustrated, that many system operators struggle with effective system hardening. Hence, many computer systems and applications remain insecure. So far, the research community lacks an in-depth understanding of system operators motivation, practices, and challenges around system hardening. With a focus on practices and challenges, we qualitatively analyzed 316 Stack Exchange (SE) posts related to system hardening. We find that access control and deployment-related issues are the most challenging, and system operators suffer from misconceptions and unrealistic expectations. Most frequently, posts focused on operating systems and server applications. System operators were driven by the fear of their systems getting attacked or by compliance reasons. Finally, we discuss our research questions, make recommendations for future system hardening, and illustrate the implications of our work.

Updated: 2025-07-17 11:57:11

标题: 从偏执到遵从：系统加固实践在Stack Exchange上的曲折道路

摘要: 保护计算机系统免受网络攻击对安全至关重要。然而，过去的事件表明，许多系统操作员在有效的系统加固方面有困难。因此，许多计算机系统和应用程序仍然存在不安全的情况。迄今为止，研究界缺乏对系统操作员在系统加固方面动机、实践和挑战的深入了解。通过关注实践和挑战，我们对与系统加固相关的316个Stack Exchange（SE）帖子进行了定性分析。我们发现，访问控制和部署相关问题是最具挑战性的，系统操作员常常受到误解和不切实际的期望的困扰。大多数帖子主要关注操作系统和服务器应用程序。系统操作员受到对系统遭受攻击的恐惧或合规性原因的驱使。最后，我们讨论了我们的研究问题，为未来的系统加固提出建议，并说明了我们工作的影响。

更新时间: 2025-07-17 11:57:11

领域: cs.CR

下载: http://arxiv.org/abs/2507.13028v1

Pulsar Consensus

In this paper, we informally introduce the Pulsar proof of stake consensus paper and discuss the relevant design decisions and considerations. The Pulsar protocol we propose is designed to facilitate the creation of a proof of stake sidechain for a proof of work blockchain. We present an overview of a novel composable density-based chain selection rule for proof of stake systems which can be seen as a superset of some standard existing longest chain rules for proof of stake protocols. We discuss the Pulsar protocol in comparison to existing proof of stake protocols and define its benefits over existing designs while defining the limitations of the work. Pulsar is currently implemented in the Mintlayer proof of stake Bitcoin sidechain.

Updated: 2025-07-17 11:53:59

标题: 脉冲星共识

摘要: 在这篇论文中，我们非正式介绍了Pulsar权益证明共识论文，并讨论了相关的设计决策和考虑因素。我们提出的Pulsar协议旨在促进为工作证明区块链创建一个权益证明侧链。我们提出了一种新颖的基于密度的链选取规则概述，适用于权益证明系统，可以看作是一些标准现有权益证明协议的最长链规则的超集。我们将Pulsar协议与现有的权益证明协议进行比较，定义了其相对现有设计的优势，并同时界定了工作的局限性。Pulsar目前已在Mintlayer权益证明比特币侧链中实施。

更新时间: 2025-07-17 11:53:59

领域: cs.CR,94A60, 91A80

下载: http://arxiv.org/abs/2411.14245v2

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment's overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.

Updated: 2025-07-17 11:46:00

标题: 重新思考视觉与语言导航中的体验差距：对物理和视觉差异的整体研究

摘要: 最近的视觉与语言导航（VLN）进展是令人充满希望的，但它们对机器人移动和控制的理想化假设未能反映出物理实体部署挑战。为了弥合这一差距，我们引入了VLN-PE，这是一个支持人形、四足和轮式机器人的物理实际的VLN平台。我们首次在不同技术管道中系统评估了几种以自我的视角为中心的VLN方法在物理机器人环境中的性能，包括用于单步离散动作预测的分类模型，用于密集航点预测的扩散模型，以及与路径规划集成的无需训练的基于地图的大型语言模型（LLM）。我们的结果显示，由于机器人观察空间有限、环境光照变化以及碰撞和跌倒等物理挑战，性能显著下降。这也暴露了在复杂环境中四肢机器人的运动约束。VLN-PE具有高度可扩展性，可以无缝集成超出MP3D范围的新场景，从而实现更全面的VLN评估。尽管当前模型在物理部署中的泛化能力较弱，但VLN-PE为改善跨体验整体适应能力提供了一条新途径。我们希望我们的发现和工具能激发社区重新思考VLN的局限性并推进健壮、实用的VLN模型。代码可在https://crystalsixone.github.io/vln_pe.github.io/上获取。

更新时间: 2025-07-17 11:46:00

领域: cs.RO,cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2507.13019v1

Exploiting Constraint Reasoning to Build Graphical Explanations for Mixed-Integer Linear Programming

Following the recent push for trustworthy AI, there has been an increasing interest in developing contrastive explanation techniques for optimisation, especially concerning the solution of specific decision-making processes formalised as MILPs. Along these lines, we propose X-MILP, a domain-agnostic approach for building contrastive explanations for MILPs based on constraint reasoning techniques. First, we show how to encode the queries a user makes about the solution of an MILP problem as additional constraints. Then, we determine the reasons that constitute the answer to the user's query by computing the Irreducible Infeasible Subsystem (IIS) of the newly obtained set of constraints. Finally, we represent our explanation as a "graph of reasons" constructed from the IIS, which helps the user understand the structure among the reasons that answer their query. We test our method on instances of well-known optimisation problems to evaluate the empirical hardness of computing explanations.

Updated: 2025-07-17 11:25:33

标题: 利用约束推理构建混合整数线性规划的图形解释

摘要: 随着最近对可信AI的推动，人们对开发用于优化的对比解释技术越来越感兴趣，特别是涉及将特定决策过程形式化为MILPs的解决方案。在这方面，我们提出了X-MILP，这是一种面向领域的方法，用于基于约束推理技术为MILPs构建对比解释。首先，我们展示了如何将用户对MILP问题解决方案的查询编码为附加约束。然后，我们通过计算新获得的约束集的不可约不可行子系统（IIS）来确定构成用户查询答案的原因。最后，我们将解释表示为从IIS构建的“原因图”，这有助于用户理解回答他们查询的原因之间的结构。我们在众所周知的优化问题实例上测试我们的方法，以评估计算解释的经验难度。

更新时间: 2025-07-17 11:25:33

领域: cs.AI

下载: http://arxiv.org/abs/2507.13007v1

SMART: Relation-Aware Learning of Geometric Representations for Knowledge Graphs

Knowledge graph representation learning approaches provide a mapping between symbolic knowledge in the form of triples in a knowledge graph (KG) and their feature vectors. Knowledge graph embedding (KGE) models often represent relations in a KG as geometric transformations. Most state-of-the-art (SOTA) KGE models are derived from elementary geometric transformations (EGTs), such as translation, scaling, rotation, and reflection, or their combinations. These geometric transformations enable the models to effectively preserve specific structural and relational patterns of the KG. However, the current use of EGTs by KGEs remains insufficient without considering relation-specific transformations. Although recent models attempted to address this problem by ensembling SOTA baseline models in different ways, only a single or composite version of geometric transformations are used by such baselines to represent all the relations. In this paper, we propose a framework that evaluates how well each relation fits with different geometric transformations. Based on this ranking, the model can: (1) assign the best-matching transformation to each relation, or (2) use majority voting to choose one transformation type to apply across all relations. That is, the model learns a single relation-specific EGT in low dimensional vector space through an attention mechanism. Furthermore, we use the correlation between relations and EGTs, which are learned in a low dimension, for relation embeddings in a high dimensional vector space. The effectiveness of our models is demonstrated through comprehensive evaluations on three benchmark KGs as well as a real-world financial KG, witnessing a performance comparable to leading models

Updated: 2025-07-17 11:18:08

标题: SMART: 关系感知的知识图几何表示学习

摘要: 知识图谱表示学习方法提供了知识图谱（KG）中三元组形式的符号知识和它们的特征向量之间的映射。知识图嵌入（KGE）模型通常将知识图谱中的关系表示为几何变换。大多数最先进的KGE模型都源自基本几何变换（EGTs），如平移、缩放、旋转和反射，或它们的组合。这些几何变换使模型能够有效地保留知识图谱的特定结构和关系模式。然而，目前KGE模型对于关系特定变换的使用仍然不足。尽管最近的模型尝试通过不同方式将SOTA基线模型组合起来来解决这个问题，但这些基线只使用单个或组合版本的几何变换来表示所有关系。在本文中，我们提出了一个框架，评估每个关系与不同几何变换的匹配程度。根据这个排名，模型可以：（1）为每个关系分配最匹配的变换，或者（2）使用大多数投票来选择一个变换类型应用于所有关系。也就是说，模型通过关注机制在低维向量空间中学习单个关系特定的EGT。此外，我们利用在低维度学习的关系和EGT之间的相关性，用于高维向量空间中的关系嵌入。我们的模型的有效性通过在三个基准知识图谱以及一个真实世界的金融知识图谱上进行全面评估来证明，并展示出与领先模型相当的性能。

更新时间: 2025-07-17 11:18:08

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13001v1

(Almost) Free Modality Stitching of Foundation Models

Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best performing uni-modal model pair by $10\times$, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.

Updated: 2025-07-17 11:10:58

标题: (几乎)免费的基础模型模态拼接

摘要: 基金会多模态模型通常通过拼接多个现有预训练的单模态模型来设计：例如，图像分类器与文本模型。这种拼接过程是通过训练一个连接器模块来实现的，该模块旨在将这些单模态模型的表示空间对齐到一个多模态目标。然而，由于在大规模基于Web的数据集上训练这样的连接器的复杂性，再加上不断增加的可用预训练的单模态模型数量，单模态模型的选择任务和随后的连接器模块训练变得计算复杂。为了解决这个尚未研究的关键问题，我们提出了Hypernetwork Model Alignment（Hyma），这是一个利用超网络实现最佳单模态模型选择和连接器训练的新颖一体化解决方案。具体来说，我们的框架利用超网络的参数预测能力，为$N \times M$个单模态模型组合获取联合训练的连接器模块。在我们的实验中，Hyma将寻找性能最佳的单模态模型对的成本降低了10倍，同时匹配了在一系列多样化的多模态基准测试中通过网格搜索获得的排名和训练连接器性能。

更新时间: 2025-07-17 11:10:58

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.10015v3

Teach Old SAEs New Domain Tricks with Boosting

Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.

Updated: 2025-07-17 10:57:49

标题: 用Boosting方法教老的SAEs新的领域技巧

摘要: 稀疏自动编码器已经成为解释大型语言模型内部表示的强大工具，然而它们经常无法捕捉训练语料库中不常见的领域特定特征。本文介绍了一种残差学习方法，可以解决这种特征失效问题，而无需完全重新训练。我们提出训练一个特定于模拟预训练SAE在领域特定文本上的重构误差的次要SAE，有效地捕捉主模型未能捕捉的特征。通过在推断过程中对两个模型的输出进行求和，我们展示了在多个专门领域中LLM交叉熵和解释方差指标显著改善。我们的实验表明，这种方法可以有效地将新的领域知识整合到现有的SAE中，同时保持其在一般任务上的性能。这种方法使研究人员能够有选择地增强SAE在特定领域的可解释性，为LLMs的有针对性机制可解释性打开了新的可能性。

更新时间: 2025-07-17 10:57:49

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.12990v1

A Translation of Probabilistic Event Calculus into Markov Decision Processes

Probabilistic Event Calculus (PEC) is a logical framework for reasoning about actions and their effects in uncertain environments, which enables the representation of probabilistic narratives and computation of temporal projections. The PEC formalism offers significant advantages in interpretability and expressiveness for narrative reasoning. However, it lacks mechanisms for goal-directed reasoning. This paper bridges this gap by developing a formal translation of PEC domains into Markov Decision Processes (MDPs), introducing the concept of "action-taking situations" to preserve PEC's flexible action semantics. The resulting PEC-MDP formalism enables the extensive collection of algorithms and theoretical tools developed for MDPs to be applied to PEC's interpretable narrative domains. We demonstrate how the translation supports both temporal reasoning tasks and objective-driven planning, with methods for mapping learned policies back into human-readable PEC representations, maintaining interpretability while extending PEC's capabilities.

Updated: 2025-07-17 10:56:22

标题: 将概率事件演算翻译为马尔可夫决策过程

摘要: 概率事件演绎（PEC）是一个逻辑框架，用于推理不确定环境中的行动及其影响，它使得概率叙事的表示和时间投影的计算成为可能。PEC形式主义在叙事推理方面具有显著的优势，包括可解释性和表达力。然而，它缺乏目标导向推理的机制。本文通过将PEC领域形式化转换为马尔可夫决策过程（MDPs），引入“采取行动的情况”概念以保持PEC灵活的行动语义来弥补这一差距。由此产生的PEC-MDP形式主义使得可以将为MDPs开发的广泛算法和理论工具应用于PEC的可解释叙事领域。我们展示了这种转换如何支持时间推理任务和目标驱动规划，包括将学习策略映射回可读的PEC表示的方法，同时保持可解释性并扩展PEC的能力。

更新时间: 2025-07-17 10:56:22

领域: cs.AI,cs.LO

下载: http://arxiv.org/abs/2507.12989v1

Benchmarking Sub-Genre Classification For Mainstage Dance Music

Music classification, a cornerstone of music information retrieval, supports a wide array of applications. To address the lack of comprehensive datasets and effective methods for sub-genre classification in mainstage dance music, we introduce a novel benchmark featuring a new dataset and baseline. Our dataset expands the scope of sub-genres to reflect the diversity of recent mainstage live sets performed by leading DJs at global music festivals, capturing the vibrant and rapidly evolving electronic dance music (EDM) scene that engages millions of fans worldwide. We employ a continuous soft labeling approach to accommodate tracks blending multiple sub-genres, preserving their inherent complexity. Experiments demonstrate that even state-of-the-art multimodal large language models (MLLMs) struggle with this task, while our specialized baseline models achieve high accuracy. This benchmark supports applications such as music recommendation, DJ set curation, and interactive multimedia systems, with video demos provided. Our code and data are all open-sourced at https://github.com/Gariscat/housex-v2.git}{https://github.com/Gariscat/housex-v2.git.

Updated: 2025-07-17 10:43:51

标题: 主舞台舞曲子类型分类的基准测试

摘要: 音乐分类是音乐信息检索的基石，支持各种应用。为了解决主流舞曲音乐中次流派分类的综合数据集和有效方法的缺乏，我们引入了一个新的基准线，包括一个新的数据集和基线。我们的数据集扩展了次流派的范围，以反映全球音乐节上领先DJ表演的最新主流现场音乐集，捕捉了吸引全球数百万粉丝的充满活力且快速发展的电子舞曲（EDM）场景。我们采用连续软标签方法来适应混合多个次流派的音轨，保留它们固有的复杂性。实验证明，即使是最先进的多模式大型语言模型（MLLMs）也难以完成这项任务，而我们的专门基准模型实现了高准确性。该基准支持诸如音乐推荐、DJ音乐集策划和交互式多媒体系统等应用，并提供视频演示。我们的代码和数据都是开源的，网址为https://github.com/Gariscat/housex-v2.git。

更新时间: 2025-07-17 10:43:51

领域: cs.SD,cs.AI,cs.MM,H.5.5; I.2.1

下载: http://arxiv.org/abs/2409.06690v2

MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps

This paper presents our approach for the IberLEF 2025 Task PRESTA: Preguntas y Respuestas sobre Tablas en Espa\~nol (Questions and Answers about Tables in Spanish). Our solution obtains answers to the questions by implementing Python code generation with LLMs that is used to filter and process the table. This solution evolves from the MRT implementation for the Semeval 2025 related task. The process consists of multiple steps: analyzing and understanding the content of the table, selecting the useful columns, generating instructions in natural language, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained optimized prompts for each step. With this approach, we achieved an accuracy score of 85\% in the task.

Updated: 2025-07-17 10:33:36

标题: MRT在IberLEF-2025 PRESTA任务中的表格多步骤最大化恢复

摘要: 本文介绍了我们在IberLEF 2025任务PRESTA中的方法：Preguntas y Respuestas sobre Tablas en Español（关于西班牙语表格的问题和答案）。我们的解决方案通过实现Python代码生成与LLMs来获取问题的答案，用于过滤和处理表格。该解决方案源自于Semeval 2025相关任务的MRT实现。该过程包括多个步骤：分析和理解表格的内容，选择有用的列，生成自然语言指令，将这些指令翻译成代码，运行代码，并处理潜在的错误或异常。这些步骤使用开源LLMs和为每个步骤优化的细粒度提示。通过这种方法，我们在任务中获得了85％的准确率得分。

更新时间: 2025-07-17 10:33:36

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.12981v1

A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints

Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- have achieved remarkable success across a wide range of domains, such as healthcare, security, and Image Generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results shows that our approach demonstrates consistent and significant improvements across key performance metrics, where it achieves 1.1x -- 2.2x higher image generation scores, an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), in much lower latency compared to several benchmarks. Find our code at https://github.com/youssefga28/HuSCF-GAN.

Updated: 2025-07-17 10:31:31

标题: 一种分布式生成式人工智能方法用于在数据共享约束下的异构多领域环境

摘要: 联邦学习越来越受到关注，因为它能够使多个节点在不共享原始数据的情况下协同训练机器学习模型。与此同时，生成人工智能，特别是生成对抗网络（GANs），在医疗保健、安全和图像生成等各个领域取得了显著成功。然而，训练生成模型通常需要大量数据集和显著的计算资源，在实际环境中通常难以获得。获取这些资源可能代价高昂且低效，尤其是当许多未充分利用的设备（如物联网设备和边缘设备）处于空闲状态且具有不同的能力时。此外，由于隐私和版权限制，获取大型数据集是具有挑战性的，因为大多数设备不愿意共享其数据。为了解决这些挑战，我们提出了一种新颖的分散式GAN训练方法，该方法能够利用分布式数据和未充分利用的低能力设备，同时不以原始形式共享数据。我们的方法旨在解决分散式环境中的关键挑战，结合了KLD加权聚类联邦学习以解决数据异质性和多领域数据集的问题，同时采用异构U形分割学习来解决设备异质性在严格的数据共享约束下的挑战，确保节点之间从未共享标签或原始数据，无论是真实数据还是合成数据。实验结果表明，我们的方法在关键性能指标上展现出一致且显著的改进，它实现了1.1倍至2.2倍更高的图像生成分数，分类指标平均提高了10%（在多领域非IID设置中高达50%），而与几个基准相比，延迟要低得多。在https://github.com/youssefga28/HuSCF-GAN找到我们的代码。

更新时间: 2025-07-17 10:31:31

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.12979v1

Risks of ignoring uncertainty propagation in AI-augmented security pipelines

The use of AI technologies is being integrated into the secure development of software-based systems, with an increasing trend of composing AI-based subsystems (with uncertain levels of performance) into automated pipelines. This presents a fundamental research challenge and seriously threatens safety-critical domains. Despite the existing knowledge about uncertainty in risk analysis, no previous work has estimated the uncertainty of AI-augmented systems given the propagation of errors in the pipeline. We provide the formal underpinnings for capturing uncertainty propagation, develop a simulator to quantify uncertainty, and evaluate the simulation of propagating errors with one case study. We discuss the generalizability of our approach and its limitations and present recommendations for evaluation policies concerning AI systems. Future work includes extending the approach by relaxing the remaining assumptions and by experimenting with a real system.

Updated: 2025-07-17 10:04:02

标题: 忽视AI增强安全管道中的不确定性传播风险

摘要: 人工智能技术的使用正在被整合到软件系统的安全开发中，越来越多地将基于人工智能的子系统（其性能水平不确定）组合成自动化流水线。这提出了一个基础性的研究挑战，并严重威胁到安全关键领域。尽管已经存在关于风险分析中不确定性的知识，但之前没有任何工作估计过由于错误在流水线中传播而导致的人工智能增强系统的不确定性。我们提供了捕捉不确定性传播的形式基础，开发了一个模拟器来量化不确定性，并通过一个案例研究评估了传播错误的模拟。我们讨论了我们方法的推广性和局限性，并提出了关于评估政策的建议，涉及人工智能系统。未来的工作包括通过放宽剩余假设并尝试实验真实系统来扩展该方法。

更新时间: 2025-07-17 10:04:02

领域: cs.SE,cs.AI,cs.CR

下载: http://arxiv.org/abs/2407.14540v2

Improving Diagnostic Accuracy of Pigmented Skin Lesions With CNNs: an Application on the DermaMNIST Dataset

Pigmented skin lesions represent localized areas of increased melanin and can indicate serious conditions like melanoma, a major contributor to skin cancer mortality. The MedMNIST v2 dataset, inspired by MNIST, was recently introduced to advance research in biomedical imaging and includes DermaMNIST, a dataset for classifying pigmented lesions based on the HAM10000 dataset. This study assesses ResNet-50 and EfficientNetV2L models for multi-class classification using DermaMNIST, employing transfer learning and various layer configurations. One configuration achieves results that match or surpass existing methods. This study suggests that convolutional neural networks (CNNs) can drive progress in biomedical image analysis, significantly enhancing diagnostic accuracy.

Updated: 2025-07-17 10:00:07

标题: 使用CNNs提高色素性皮肤病变的诊断准确性：在DermaMNIST数据集上的应用

摘要: 色素性皮肤病变代表着局部增加的黑色素区域，可能表明严重疾病，如黑色素瘤，是导致皮肤癌死亡的主要原因之一。受MNIST启发，最近引入了MedMNIST v2数据集，其中包括DermaMNIST数据集，用于基于HAM10000数据集对色素性病变进行分类。本研究评估了ResNet-50和EfficientNetV2L模型在DermaMNIST上进行多类别分类，采用迁移学习和不同的层配置。其中一个配置取得了与现有方法相匹配或超越的结果。本研究表明，卷积神经网络（CNNs）可以推动生物医学图像分析的进展，显著提高诊断准确性。

更新时间: 2025-07-17 10:00:07

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.12961v1

A Brain Tumor Segmentation Method Based on CLIP and 3D U-Net with Cross-Modal Semantic Guidance and Multi-Level Feature Fusion

Precise segmentation of brain tumors from magnetic resonance imaging (MRI) is essential for neuro-oncology diagnosis and treatment planning. Despite advances in deep learning methods, automatic segmentation remains challenging due to tumor morphological heterogeneity and complex three-dimensional spatial relationships. Current techniques primarily rely on visual features extracted from MRI sequences while underutilizing semantic knowledge embedded in medical reports. This research presents a multi-level fusion architecture that integrates pixel-level, feature-level, and semantic-level information, facilitating comprehensive processing from low-level data to high-level concepts. The semantic-level fusion pathway combines the semantic understanding capabilities of Contrastive Language-Image Pre-training (CLIP) models with the spatial feature extraction advantages of 3D U-Net through three mechanisms: 3D-2D semantic bridging, cross-modal semantic guidance, and semantic-based attention mechanisms. Experimental validation on the BraTS 2020 dataset demonstrates that the proposed model achieves an overall Dice coefficient of 0.8567, representing a 4.8% improvement compared to traditional 3D U-Net, with a 7.3% Dice coefficient increase in the clinically important enhancing tumor (ET) region.

Updated: 2025-07-17 09:49:45

标题: 基于CLIP和3D U-Net的脑肿瘤分割方法：跨模态语义引导和多级特征融合

摘要: 磁共振成像（MRI）精确分割脑肿瘤对神经肿瘤诊断和治疗规划至关重要。尽管深度学习方法取得了进展，但由于肿瘤形态的异质性和复杂的三维空间关系，自动分割仍然具有挑战性。当前技术主要依赖于从MRI序列提取的视觉特征，同时未充分利用嵌入在医学报告中的语义知识。本研究提出了一个多级融合架构，整合了像素级、特征级和语义级信息，促进了从低级数据到高级概念的全面处理。语义级融合路径结合了对比语言-图像预训练（CLIP）模型的语义理解能力和3D U-Net的空间特征提取优势，通过三种机制实现：3D-2D语义桥接、跨模态语义引导和基于语义的关注机制。对BraTS 2020数据集的实验验证表明，所提出的模型实现了总体Dice系数为0.8567，相比传统的3D U-Net提高了4.8％，在临床重要的增强肿瘤（ET）区域的Dice系数增加了7.3％。

更新时间: 2025-07-17 09:49:45

领域: eess.IV,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.09966v2

MMOne: Representing Multiple Modalities in One Scene

Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.

Updated: 2025-07-17 09:45:51

标题: MMOne：在一个场景中表示多种模态

摘要: 人类通过多模态线索来感知世界，以理解和与环境互动。学习多种模态的场景表示增强了对物理世界的理解。然而，由于不同模态之间固有的差异而产生的模态冲突，带来了两个关键挑战：属性差异和粒度差异。为了解决这些挑战，我们提出了一个通用框架MMOne，用于在一个场景中表示多种模态，可以轻松扩展到额外的模态。具体地，我们提出了一个模态建模模块，其中包含一个新颖的模态指示器，用于捕捉每种模态的独特属性。此外，我们设计了一个多模态分解机制，根据模态差异将多模态高斯分解为单模态高斯。我们通过将多模态信息分解为共享和模态特定组件来解决模态之间的基本区别，从而得到更紧凑和高效的多模态场景表示。大量实验证明，我们的方法始终提高了每种模态的表示能力，并且可以扩展到额外的模态。代码可在https://github.com/Neal2020GitHub/MMOne找到。

更新时间: 2025-07-17 09:45:51

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.11129v2

UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets

Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models at github to facilitate future research.

Updated: 2025-07-17 09:45:49

标题: UniSLU：来自异构跨任务数据集的统一口语理解

摘要: 口语理解（SLU）在以语音为中心的多媒体应用中发挥着至关重要的作用，使机器能够理解会议、面试和客户服务交互等场景中的口语。SLU包括多个任务，包括自动语音识别（ASR）、口语命名实体识别（NER）和口语情感分析（SA）。然而，现有方法通常依赖于单独的模型体系结构来处理诸如口语NER和SA等各自的任务，这增加了系统的复杂性，限制了跨任务的交互，并未充分利用跨任务之间的异构数据集。为了解决这些限制，我们提出了UniSLU，这是一个统一框架，可以在单一体系结构中共同建模多个SLU任务。具体来说，我们提出了一种统一的表示形式，用于多样化的SLU任务，实现了跨多个任务的异构数据集的充分利用。基于这种表示形式，我们提出了一种统一的生成方法，可以共同建模ASR、口语NER和SA任务，增强任务之间的交互，并实现与大型语言模型的无缝集成，以利用它们强大的生成能力。对公开的SLU数据集进行的大量实验表明，我们的方法的有效性，相比几种基准方法，实现了更优越的SLU性能，使其非常适合实际的基于语音的多媒体场景。我们将在github上发布所有代码和模型，以促进未来的研究。

更新时间: 2025-07-17 09:45:49

领域: eess.AS,cs.AI,cs.CL,cs.MM,cs.SD

下载: http://arxiv.org/abs/2507.12951v1

Enterprise Security Incident Analysis and Countermeasures Based on the T-Mobile Data Breach

This paper presents a comprehensive analysis of T-Mobile's critical data breaches in 2021 and 2023, alongside a full-spectrum security audit targeting its systems, infrastructure, and publicly exposed endpoints. By combining case-based vulnerability assessments with active ethical hacking techniques--including Shodan reconnaissance, API misuse simulations, VNC brute-forcing, firmware reverse engineering, and web application scans--we uncover structural weaknesses persisting beyond the initial breach events. Building on these findings, we propose a multi-layered defensive strategy encompassing Zero Trust Architecture, granular role-based access control, network segmentation, firmware encryption using AES with integrity checks, and API rate limiting and token lifecycle control. Financial modelling demonstrates that a five-year investment yields less than 1.1% of expected breach losses, validating the cost-effectiveness of proactive security measures. Our work bridges post-incident forensic analysis with hands-on security evaluation, providing an actionable blueprint for large-scale telecoms seeking operational resilience, regulatory compliance, and cross-domain threat readiness.

Updated: 2025-07-17 09:22:52

标题: 基于T-Mobile数据泄露的企业安全事件分析和对策

摘要: 本文对T-Mobile在2021年和2023年发生的重大数据泄露进行了全面分析，同时针对其系统、基础设施和公开暴露的端点进行了全谱安全审计。通过将基于案例的漏洞评估与主动的道德黑客技术相结合，包括Shodan侦察、API误用模拟、VNC暴力破解、固件逆向工程和Web应用程序扫描，我们揭示了在初始泄露事件之后持续存在的结构性弱点。基于这些发现，我们提出了一个多层次的防御策略，包括零信任架构、细粒度基于角色的访问控制、网络分割、使用带完整性检查的AES进行固件加密，以及API速率限制和令牌生命周期控制。财务建模表明，五年的投资仅产生了预期泄露损失的不到1.1％，验证了积极安全措施的成本效益性。我们的工作将事后取证分析与实际的安全评估联系起来，为寻求运营弹性、符合监管要求和跨领域威胁准备的大型电信公司提供了可操作的蓝图。

更新时间: 2025-07-17 09:22:52

领域: cs.CR

下载: http://arxiv.org/abs/2507.12937v1

MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration

An increasing number of applications are exploiting sampling-based algorithms for planning, optimization, and inference. The Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, the high computational cost limits their feasibility for large-scale problems and real-world applications, and the existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces \textbf{MC$^2$A}, an algorithm-hardware co-design framework, enabling efficient and flexible optimization for MCMC acceleration. Firstly, \textbf{MC$^2$A} analyzes the MCMC workload diversity through an extension of the processor performance roofline model with a 3rd dimension to derive the optimal balance between the compute, sampling and memory parameters. Secondly, \textbf{MC$^2$A} proposes a parametrized hardware accelerator architecture with flexible and efficient support of MCMC kernels with a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers and a crossbar interconnect to support irregular access. Thirdly, the core of \textbf{MC$^2$A} is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In the end-to-end case study, \textbf{MC$^2$A} achieves an overall {$307.6\times$, $1.4\times$, $2.0\times$, $84.2\times$} speedup compared to the CPU, GPU, TPU and state-of-the-art MCMC accelerator. Evaluated on various representative MCMC workloads, this work demonstrates and exploits the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.

Updated: 2025-07-17 09:20:51

标题: MC$^2$A：实现算法-硬件协同设计以实现高效的马尔可夫链蒙特卡洛加速

摘要: 越来越多的应用程序正在利用基于抽样的算法进行规划、优化和推断。马尔可夫链蒙特卡罗（MCMC）算法构成了这一新兴机器学习分支的计算支柱。不幸的是，高计算成本限制了它们在大规模问题和实际应用中的可行性，现有的MCMC加速解决方案要么在硬件灵活性方面受限，要么无法在各种端到端应用中保持系统级效率。本文介绍了MC$^2$A，一种算法-硬件协同设计框架，实现了对MCMC加速的高效灵活优化。首先，MC$^2$A通过扩展具有第三维的处理器性能屋顶模型分析了MCMC的工作负载多样性，以获得计算、抽样和内存参数之间的最佳平衡。其次，MC$^2$A提出了一个参数化的硬件加速器架构，具有灵活高效的支持MCMC内核的能力，包括ISA可编程的树状处理单元管道、可重构的采样器和用于支持不规则访问的交叉开关连接。第三，MC$^2$A的核心由一种新颖的Gumbel采样器驱动，消除了指数和归一化操作。在端到端案例研究中，MC$^2$A相对于CPU、GPU、TPU和最先进的MCMC加速器实现了总体速度提升{$307.6\times$，$1.4\times$，$2.0\times$，$84.2\times$}。通过对各种代表性MCMC工作负载进行评估，本研究展示并利用了通用硬件加速的可行性，以推广在各种应用领域中的MCMC解决方案。

更新时间: 2025-07-17 09:20:51

领域: cs.LG,cs.AI,cs.AR

下载: http://arxiv.org/abs/2507.12935v1

DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization

Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose a DMQ which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at https://github.com/LeeDongYeun/dmq.

Updated: 2025-07-17 09:15:29

标题: DMQ：解剖扩散模型的异常值，用于后训练量化

摘要: 扩散模型在图像生成方面取得了显著的成功，但伴随着显著的计算成本，这给在资源受限环境中部署带来了挑战。最近的后训练量化（PTQ）方法试图通过专注于扩散模型的迭代性质来缓解这一问题。然而，这些方法通常忽视异常值，导致在低位宽下性能下降。本文提出了一种DMQ，结合了学习等效缩放（LES）和通道级二次幂缩放（PTS）来有效应对这些挑战。学习等效缩放优化了通道级缩放因子，重新分配量化难度，降低了整体量化误差。鉴于早期去噪步骤虽然存在较小的量化误差，但由于误差累积对最终输出产生至关重要的影响，我们引入了一种自适应时间步加权方案来优先考虑这些关键步骤。此外，鉴别到跳跃连接等层展现出高的通道间方差，我们引入了通道级二次幂缩放来处理激活。为了确保即使在小的校准集中也能健壮选择PTS因子，我们引入了一个增强可靠性的投票算法。广泛的实验表明，我们的方法在低位宽（如W4A6（4位权重，6位激活）和W4A8）等方面明显优于现有作品，保持了高质量的图像生成和模型稳定性。该代码可在https://github.com/LeeDongYeun/dmq获得。

更新时间: 2025-07-17 09:15:29

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.12933v1

Slot: Provenance-Driven APT Detection through Graph Reinforcement Learning

Advanced Persistent Threats (APTs) represent sophisticated cyberattacks characterized by their ability to remain undetected within the victim system for extended periods, aiming to exfiltrate sensitive data or disrupt operations. Existing detection approaches often struggle to effectively identify these complex threats, construct the attack chain for defense facilitation, or resist adversarial attacks. To overcome these challenges, we propose Slot, an advanced APT detection approach based on provenance graphs and graph reinforcement learning. Slot excels in uncovering multi-level hidden relationships, such as causal, contextual, and indirect connections, among system behaviors through provenance graph mining. By pioneering the integration of graph reinforcement learning, Slot dynamically adapts to new user activities and evolving attack strategies, enhancing its resilience against adversarial attacks. Additionally, Slot automatically constructs the attack chain according to detected attacks with clustering algorithms, providing precise identification of attack paths and facilitating the development of defense strategies. Evaluations with real-world datasets demonstrate Slot's outstanding accuracy, efficiency, adaptability, and robustness in APT detection, with most metrics surpassing state-of-the-art methods. Additionally, case studies conducted to assess Slot's effectiveness in supporting APT defense further establish it as a practical and reliable tool for cybersecurity protection.

Updated: 2025-07-17 09:09:55

标题: 标题翻译：Provenance-Driven APT检测：基于图强化学习的方法

摘要: 高级持续性威胁（APTs）代表着一种复杂的网络攻击，其特点是能够在受害系统内长时间保持不被发现，旨在窃取敏感数据或干扰操作。现有的检测方法常常难以有效识别这些复杂威胁，构建用于防御的攻击链，或抵抗对抗性攻击。为了克服这些挑战，我们提出了一种名为Slot的高级APTs检测方法，基于溯源图和图强化学习。Slot在通过溯源图挖掘揭示多级隐藏关系方面表现出色，如因果、上下文和间接连接。通过引领图强化学习的整合，Slot动态适应新用户活动和不断发展的攻击策略，增强其对抗对抗性攻击的韧性。此外，Slot根据检测到的攻击使用聚类算法自动构建攻击链，提供攻击路径的精确识别，并促进防御策略的制定。通过对真实世界数据集的评估表明，Slot在APTs检测方面具有出色的准确性、效率、适应性和稳健性，大多数指标超过了当前最先进的方法。此外，进行的案例研究评估了Slot在支持APTs防御方面的有效性，进一步确立了它作为网络安全保护的实用可靠工具。

更新时间: 2025-07-17 09:09:55

领域: cs.CR

下载: http://arxiv.org/abs/2410.17910v4

Making Language Model a Hierarchical Classifier and Generator

Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by human's hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computationally resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. By thorough experiments, we validate that these selective intermediate layers could be adapted to speak meaningful and reasonable contents, and this paradigm of hierarchical decoder can obtain state-of-the-art performances on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner, pretraining from scratch.

Updated: 2025-07-17 09:09:53

标题: 将语言模型打造成一个分层分类器和生成器

摘要: 解码器仅语言模型，如GPT和LLaMA，通常在最后一层解码。受人类分层思维能力的启发，我们提出可以构建具有不同层同时解码文本的分层解码器架构。由于时间和计算资源有限，我们选择将预训练语言模型调整为这种形式的分层解码器。最后一层的语言头被复制到不同选择的中间层，并与不同的任务输入进行微调。通过彻底的实验证明，这些选择性的中间层可以适应说出有意义且合理的内容，并且这种分层解码器范式可以在多个任务上获得最先进的性能，如分层文本分类、分类引导生成和分层文本生成。这项研究暗示了从零开始预训练一个广义的分层推理器的可能性。

更新时间: 2025-07-17 09:09:53

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.12930v1

Differential Multimodal Transformers

Small language models have gained significant popularity due to their efficiency and growing capabilities. However, incorporating additional modalities, such as vision, can exacerbate the challenge of limited context windows by introducing noise. Recent studies have highlighted that Transformer attention mechanisms often disproportionately focus on irrelevant contexts. In this work, we extend the Differential Attention mechanism, originally designed for text-only models, to the text-vision model PaliGemma. Our aim is to evaluate its ability to mitigate noisy information retrieval and reduce hallucinations. To this end, we fine-tuned the PaliGemma 3B model using LoRA, incorporating Differential Attention, and experimented with various parameter settings and configurations. We demonstrate that Differential Attention can be adapted and integrated into the fine-tuning of existing models to enhance noisy information retrieval and question-answering capabilities.

Updated: 2025-07-17 09:05:34

标题: 不同的多模态变换器

摘要: 小型语言模型因其高效性和不断增强的功能而备受青睐。然而，引入额外的模态，如视觉，可能会加剧有限上下文窗口的挑战，引入噪音。最近的研究已经指出，Transformer的注意力机制往往会不成比例地关注不相关的上下文。在这项工作中，我们将最初设计用于仅文本模型的差分注意力机制扩展到文本-视觉模型PaliGemma。我们的目标是评估其减轻嘈杂信息检索和减少幻觉的能力。为此，我们使用LoRA对PaliGemma 3B模型进行微调，将差分注意力纳入，并尝试不同的参数设置和配置。我们证明，差分注意力可以被调整和集成到现有模型的微调中，以增强嘈杂信息检索和问答能力。

更新时间: 2025-07-17 09:05:34

领域: cs.AI,cs.MM

下载: http://arxiv.org/abs/2507.15875v1

Architectural Backdoors in Deep Learning: A Survey of Vulnerabilities, Detection, and Defense

Architectural backdoors pose an under-examined but critical threat to deep neural networks, embedding malicious logic directly into a model's computational graph. Unlike traditional data poisoning or parameter manipulation, architectural backdoors evade standard mitigation techniques and persist even after clean retraining. This survey systematically consolidates research on architectural backdoors, spanning compiler-level manipulations, tainted AutoML pipelines, and supply-chain vulnerabilities. We assess emerging detection and defense strategies, including static graph inspection, dynamic fuzzing, and partial formal verification, and highlight their limitations against distributed or stealth triggers. Despite recent progress, scalable and practical defenses remain elusive. We conclude by outlining open challenges and proposing directions for strengthening supply-chain security, cryptographic model attestations, and next-generation benchmarks. This survey aims to guide future research toward comprehensive defenses against structural backdoor threats in deep learning systems.

Updated: 2025-07-17 09:02:54

标题: 深度学习中的建筑后门：漏洞、检测和防御综述

摘要: 建筑后门构成了一个被忽视但至关重要的深度神经网络威胁，将恶意逻辑直接嵌入模型的计算图中。与传统的数据污染或参数操纵不同，建筑后门规避了标准的缓解技术，并且即使在重新训练后仍然存在。这项调查系统地整合了关于建筑后门的研究，涵盖了编译器级别的操纵、受污染的AutoML管道和供应链漏洞。我们评估新兴的检测和防御策略，包括静态图检查、动态模糊测试和部分形式验证，并强调它们在分布式或隐蔽触发器方面的局限性。尽管最近取得了进展，可扩展和实用的防御仍然难以实现。我们通过概述开放挑战并提出加强供应链安全、加密模型验证和下一代基准的方向来总结。这项调查旨在引导未来的研究朝着全面防御深度学习系统中结构后门威胁的方向发展。

更新时间: 2025-07-17 09:02:54

领域: cs.CR

下载: http://arxiv.org/abs/2507.12919v1

Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

Advancements in foundation models have made it possible to conduct applications in various downstream tasks. Especially, the new era has witnessed a remarkable capability to extend Large Language Models (LLMs) for tackling tasks of 3D scene understanding. Current methods rely heavily on 3D point clouds, but the 3D point cloud reconstruction of an indoor scene often results in information loss. Some textureless planes or repetitive patterns are prone to omission and manifest as voids within the reconstructed 3D point clouds. Besides, objects with complex structures tend to introduce distortion of details caused by misalignments between the captured images and the dense reconstructed point clouds. 2D multi-view images present visual consistency with 3D point clouds and provide more detailed representations of scene components, which can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3D multimodal framework that leverages multi-view images for enhanced 3D scene understanding with LLMs. In general, Argus can be treated as a 3D Large Multimodal Foundation Model (3D-LMM) since it takes various modalities as input(text instructions, 2D multi-view images, and 3D point clouds) and expands the capability of LLMs to tackle 3D tasks. Argus involves fusing and integrating multi-view images and camera poses into view-as-scene features, which interact with the 3D features to create comprehensive and detailed 3D-aware scene embeddings. Our approach compensates for the information loss while reconstructing 3D point clouds and helps LLMs better understand the 3D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs in various downstream tasks.

Updated: 2025-07-17 09:02:04

标题: 阿尔戈斯：利用多视图图像结合大型语言模型实现改进的三维场景理解

摘要: 基金会模型的进展使得在各种下游任务中进行应用成为可能。特别是，新时代见证了将大型语言模型(LLMs)扩展到处理3D场景理解任务的显著能力。当前方法在很大程度上依赖于3D点云，但是对室内场景的3D点云重建通常会导致信息丢失。一些无纹理的平面或重复的图案容易被遗漏，并在重建的3D点云中表现为空洞。此外，具有复杂结构的物体往往会因为捕捉图像和密集重建的点云之间的错位而引入细节失真。2D多视图图像与3D点云呈现视觉一致性，并提供了更详细的场景组件表示，可以自然地弥补这些不足。基于这些见解，我们提出了Argus，一种新颖的3D多模态框架，利用多视图图像增强LLMs对3D场景的理解。总体而言，Argus可以被视为一个3D大型多模态基础模型(3D-LMM)，因为它将各种模态作为输入(文本说明、2D多视图图像和3D点云)，并扩展了LLMs处理3D任务的能力。Argus涉及将多视图图像和相机姿态融合并整合成视图作为场景特征，这些特征与3D特征相互作用，创建全面和详细的3D感知场景嵌入。我们的方法弥补了重建3D点云时的信息丢失，并帮助LLMs更好地理解3D世界。大量实验证明，我们的方法在各种下游任务中优于现有的3D-LMMs。

更新时间: 2025-07-17 09:02:04

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12916v1

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.

Updated: 2025-07-17 08:53:48

标题: MEM1：学习协同记忆和推理，提升高效长期规划的智能体

摘要: 现代语言代理必须在长期、多轮互动中运作，在这种情况下，它们检索外部信息，适应观察结果，并回答相互依赖的查询。然而，大多数LLM系统依赖于完整上下文提示，附加所有过去的轮次，而不考虑它们的相关性。这导致内存无限增长，计算成本增加，并且在分布外输入长度上降低推理性能。我们引入了MEM1，这是一个端到端的强化学习框架，使代理能够在长期多轮任务中使用恒定的内存。在每个轮次，MEM1更新一个紧凑的共享内部状态，共同支持内存整合和推理。这个状态将先前的内存与来自环境的新观察结果整合在一起，同时有策略地丢弃不相关或冗余的信息。为了在更现实和组合的环境中支持训练，我们提出了一个简单而有效和可扩展的方法，通过将现有数据集组合成任意复杂的任务序列来构建多轮环境。在内部检索QA、开放领域Web QA和多轮Web购物等三个领域的实验中，我们发现，MEM1-7B在16个目标多跳QA任务中将性能提高了3.5倍，同时将内存使用量降低了3.7倍，相比之下，Qwen2.5-14B-Instruct。在训练范围之外的泛化能力。我们的结果表明，基于推理驱动的内存整合作为训练长期互动代理的可扩展替代解决方案具有潜力，同时优化效率和性能。

更新时间: 2025-07-17 08:53:48

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2506.15841v2

An ultra-low-power CGRA for accelerating Transformers at the edge

Transformers have revolutionized deep learning with applications in natural language processing, computer vision, and beyond. However, their computational demands make it challenging to deploy them on low-power edge devices. This paper introduces an ultra-low-power, Coarse-Grained Reconfigurable Array (CGRA) architecture specifically designed to accelerate General Matrix Multiplication (GEMM) operations in transformer models tailored for the energy and resource constraints of edge applications. The proposed architecture integrates a 4 x 4 array of Processing Elements (PEs) for efficient parallel computation and dedicated 4 x 2 Memory Operation Blocks (MOBs) for optimized LOAD/STORE operations, reducing memory bandwidth demands and enhancing data reuse. A switchless mesh torus interconnect network further minimizes power and latency by enabling direct communication between PEs and MOBs, eliminating the need for centralized switching. Through its heterogeneous array design and efficient dataflow, this CGRA architecture addresses the unique computational needs of transformers, offering a scalable pathway to deploy sophisticated machine learning models on edge devices.

Updated: 2025-07-17 08:43:14

标题: 一个用于在边缘加速变压器的超低功耗计算网格数组

摘要: 变压器已经在自然语言处理、计算机视觉等领域的深度学习中引起了革命。然而，它们的计算需求使得在低功耗边缘设备上部署它们具有挑战性。本文介绍了一种超低功耗的粗粒度可重构阵列（CGRA）架构，专门设计用于加速适用于边缘应用的能源和资源约束下的变压器模型中的通用矩阵乘法（GEMM）操作。所提出的架构集成了一个4x4的处理元素（PEs）数组，用于高效并行计算，以及专门用于优化LOAD/STORE操作的4x2存储器操作块（MOBs），减少了内存带宽需求并增强了数据重用。无开关网状环互连网络进一步通过实现PEs和MOBs之间的直接通信，消除了中心化交换的需求，从而最小化了功耗和延迟。通过其异构数组设计和高效数据流，这种CGRA架构解决了变压器的独特计算需求，为在边缘设备上部署复杂的机器学习模型提供了可扩展的途径。

更新时间: 2025-07-17 08:43:14

领域: cs.AR,cs.AI

下载: http://arxiv.org/abs/2507.12904v1

IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions are rapidly increasing. However, on the one hand, there is only a certain amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instructionfollowing ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose IOPO (Input-Output Preference Optimization) alignment method which takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and outof-domain datasets confirm the effectiveness of IOPO, showing 8.15%, 2.18% improvements on in-domain data and 6.29%, 3.13% on outof-domain data compared to SFT and DPO respectively.

Updated: 2025-07-17 08:39:11

标题: IOPO：通过输入输出偏好优化赋予LLMs复杂指令跟随的能力

摘要: 在大型语言模型（LLMs）的领域中，模型准确地遵循指令的能力至关重要，因为越来越多的代理和应用程序利用LLMs进行构建，其中指令的复杂性正在迅速增加。然而，一方面，只有一定数量的复杂指令评估数据；另一方面，没有专门的算法来提高遵循复杂指令的能力。为此，本文介绍了TRACE，一个用于改进和评估复杂指令遵循能力的基准，包括120K个训练数据和1K个评估数据。此外，我们提出了IOPO（输入-输出优先优化）对齐方法，该方法同时考虑输入和输出优先偏好对，LLMs不仅快速与响应优先偏好对齐，还细致地探索指令偏好。对领域内外数据集的广泛实验证实了IOPO的有效性，与SFT和DPO相比，在领域内数据上分别显示了8.15％，2.18％的改进，在领域外数据上分别显示了6.29％，3.13％的改进。

更新时间: 2025-07-17 08:39:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2411.06208v3

Coral Protocol: Open Infrastructure Connecting The Internet of Agents

Coral Protocol is an open and decentralized collaboration infrastructure that enables communication, coordination, trust and payments for The Internet of Agents. It addresses the growing need for interoperability in a world where organizations are deploying multiple specialized AI agents that must work together across domains and vendors. As a foundational platform for multi-agent AI ecosystems, Coral establishes a common language and coordination framework allowing any agent to participate in complex workflows with others. Its design emphasizes broad compatibility, security, and vendor neutrality, ensuring that agent interactions are efficient and trustworthy. In particular, Coral introduces standardized messaging formats for agent communication, a modular coordination mechanism for orchestrating multi-agent tasks, and secure team formation capabilities for dynamically assembling trusted groups of agents. Together, these innovations position Coral Protocol as a cornerstone of the emerging "Internet of Agents," unlocking new levels of automation, collective intelligence, and business value through open agent collaboration.

Updated: 2025-07-17 08:34:37

标题: 珊瑚协议：连接代理互联网的开放基础设施

摘要: 珊瑚协议是一个开放和去中心化的协作基础设施，为代理人互联网提供了通信、协调、信任和支付的功能。它解决了在一个组织部署多个专门的AI代理人必须跨领域和供应商共同工作的世界中日益增长的互操作性需求。作为多代理人AI生态系统的基础平台，珊瑚建立了一个通用语言和协调框架，允许任何代理人参与与其他代理人的复杂工作流程。其设计强调广泛的兼容性、安全性和供应商中立性，确保代理人之间的交互是高效且可信的。特别是，珊瑚引入了标准化的代理人通信消息格式，一个用于编排多代理人任务的模块化协调机制，以及用于动态组装可信代理群体的安全团队形成能力。通过这些创新，珊瑚协议被定位为新兴"代理人互联网"的基石，通过开放代理协作解锁了新的自动化水平、集体智能和商业价值。

更新时间: 2025-07-17 08:34:37

领域: cs.MA,cs.AI

下载: http://arxiv.org/abs/2505.00749v2

Why Braking? Scenario Extraction and Reasoning Utilizing LLM

The growing number of ADAS-equipped vehicles has led to a dramatic increase in driving data, yet most of them capture routine driving behavior. Identifying and understanding safety-critical corner cases within this vast dataset remains a significant challenge. Braking events are particularly indicative of potentially hazardous situations, motivating the central question of our research: Why does a vehicle brake? Existing approaches primarily rely on rule-based heuristics to retrieve target scenarios using predefined condition filters. While effective in simple environments such as highways, these methods lack generalization in complex urban settings. In this paper, we propose a novel framework that leverages Large Language Model (LLM) for scenario understanding and reasoning. Our method bridges the gap between low-level numerical signals and natural language descriptions, enabling LLM to interpret and classify driving scenarios. We propose a dual-path scenario retrieval that supports both category-based search for known scenarios and embedding-based retrieval for unknown Out-of-Distribution (OOD) scenarios. To facilitate evaluation, we curate scenario annotations on the Argoverse 2 Sensor Dataset. Experimental results show that our method outperforms rule-based baselines and generalizes well to OOD scenarios.

Updated: 2025-07-17 08:33:56

标题: 为什么刹车？利用LLM进行情景提取和推理

摘要: ADAS装备车辆数量不断增加，导致驾驶数据急剧增加，然而大多数数据只捕捉到日常驾驶行为。在这个庞大数据集中识别和理解安全关键的特殊案例仍然是一个重大挑战。刹车事件特别表明潜在危险情况，激发了我们研究的核心问题：为什么车辆要刹车？现有方法主要依赖基于规则的启发式来使用预定义条件过滤器检索目标场景。虽然在高速公路等简单环境中有效，但这些方法在复杂的城市环境中缺乏泛化能力。在本文中，我们提出了一个新颖的框架，利用大型语言模型（LLM）进行场景理解和推理。我们的方法弥合了低级数字信号和自然语言描述之间的差距，使LLM能够解释和分类驾驶场景。我们提出了一种支持基于类别的搜索已知场景和基于嵌入的检索未知分布外（OOD）场景的双路径场景检索。为了方便评估，我们在Argoverse 2传感器数据集上整理了场景注释。实验结果表明，我们的方法优于基于规则的基线，并且很好地泛化到OOD场景。

更新时间: 2025-07-17 08:33:56

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.15874v1

Generalist Bimanual Manipulation via Foundation Video Diffusion Models

Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.

Updated: 2025-07-17 08:31:55

标题: 通用双手操纵基于基础视频扩散模型

摘要: 双手机器人操作，涉及协调控制两个机器人臂，是解决具有挑战性任务的基础。尽管在通用操作方面取得了最近的进展，但数据稀缺和体现异质性仍然是双手设置中进一步扩展的严重障碍。在本文中，我们介绍了VIdeo Diffusion for Action Reasoning（VIDAR），这是一个利用大规模基于扩散的视频预训练和一种新颖的遮罩逆动力学模型进行动作预测的两阶段框架。我们在三个真实世界的双手机器人平台上的750K多视角视频上预训练视频扩散模型，利用统一的观察空间来编码机器人、摄像机、任务和场景背景。我们的遮罩逆动力学模型学习遮罩，从生成的轨迹中提取与动作相关的信息，而无需像素级标签，这些遮罩可以有效地推广到看不见的背景。我们的实验证明，仅用20分钟的人类演示在一个看不见的机器人平台（仅为典型数据要求的1%）上，VIDAR可以推广到看不见的任务和背景，并具有强大的语义理解，超越了现有技术方法。我们的研究结果突出了视频基础模型与遮罩动作预测相结合，以实现在不同真实世界环境中可扩展和可推广的机器人操作的潜力。

更新时间: 2025-07-17 08:31:55

领域: cs.LG,cs.AI,cs.CV,cs.RO

下载: http://arxiv.org/abs/2507.12898v1

Determination of galaxy photometric redshifts using Conditional Generative Adversarial Networks (CGANs)

Accurate and reliable photometric redshift determination is one of the key aspects for wide-field photometric surveys. Determination of photometric redshift for galaxies, has been traditionally solved by use of machine-learning and artificial intelligence techniques trained on a calibration sample of galaxies, where both photometry and spectrometry are available. On this paper, we present a new algorithmic approach for determining photometric redshifts of galaxies using Conditional Generative Adversarial Networks (CGANs). The proposed implementation is able to determine both point-estimation and probability-density estimations for photometric redshifts. The methodology is tested with data from Dark Energy Survey (DES) Y1 data and compared with other existing algorithm such as a Mixture Density Network (MDN). Although results obtained show a superiority of MDN, CGAN quality-metrics are close to the MDN results, opening the door to the use of CGAN at photometric redshift estimation.

Updated: 2025-07-17 08:23:13

标题: 使用条件生成对抗网络（CGANs）确定星系光度红移

摘要: 准确可靠的光度红移确定是广域光度调查的关键方面之一。传统上，通过利用在一个包含光度和光谱信息的校准样本上训练的机器学习和人工智能技术来解决星系的光度红移确定问题。在本文中，我们提出了一种使用条件生成对抗网络（CGANs）确定星系光度红移的新算法方法。所提出的实现能够确定光度红移的点估计和概率密度估计。该方法使用来自暗能量调查（DES）Y1数据的数据进行测试，并与其他现有算法（如混合密度网络（MDN））进行比较。虽然获得的结果表明MDN具有优越性，但CGAN的质量指标接近MDN的结果，为使用CGAN进行光度红移估计打开了大门。

更新时间: 2025-07-17 08:23:13

领域: astro-ph.IM,astro-ph.CO,cs.AI

下载: http://arxiv.org/abs/2501.06532v3

Aligning Knowledge Graphs and Language Models for Factual Accuracy

Large language models like GPT-4, Gemini, and Claude have transformed natural language processing (NLP) tasks such as question answering, dialogue generation, summarization, and so forth; yet their susceptibility to hallucination stands as one of the major challenges. Among numerous approaches to overcome this challenge, integration of Knowledge Graphs (KGs) into language models has emerged as a promising solution as it provides structured, reliable, domain-specific, and up-to-date external information to the language models. In this paper, we introduce ALIGNed-LLM, a simple yet effective approach to improve language models' factuality via a lean strategy to infuse KGs into the latent space of language models inspired by LLaVA where visual and textual information is infused. We use embeddings from a pre-trained Knowledge Graph Embedding (KGE) model, such as TransE, and a trainable projection layer to align entity and text embeddings. This alignment enables the language model to distinguish between similar entities improving factual grounding and reducing hallucination. We tested our approach on three popular questions-answering benchmark datasets alongside language models of varying sizes, showing significant improvement. Furthermore, we applied our approach to a real-world financial use case from a large central bank in Europe, which demands high accuracy and precision, demonstrating a substantial improvement of the LLM answers.

Updated: 2025-07-17 08:15:50

标题: 将知识图谱和语言模型对准以实现事实准确性

摘要: 像GPT-4、Gemini和Claude这样的大型语言模型已经改变了自然语言处理（NLP）任务，例如问答、对话生成、摘要等等；然而，它们对幻觉的敏感性仍然是一个主要挑战之一。在许多克服这一挑战的方法中，将知识图谱（KGs）整合到语言模型中已经成为一个有前途的解决方案，因为它为语言模型提供了结构化、可靠、领域特定和最新的外部信息。在本文中，我们介绍了ALIGNed-LLM，这是一种简单而有效的方法，通过一种精益的策略将知识图谱注入到语言模型的潜在空间中，受到LLaVA的启发，其中融合了视觉和文本信息。我们使用来自预训练的知识图嵌入（KGE）模型（如TransE）的嵌入，并使用可训练的投影层来对齐实体和文本嵌入。这种对齐使语言模型能够区分相似的实体，提高了事实基础和减少了幻觉。我们在三个流行的问答基准数据集上测试了我们的方法，同时使用不同大小的语言模型，显示了显著的改进。此外，我们将我们的方法应用于欧洲一家大型中央银行的一个实际金融用例中，这要求高准确度和精度，展示了LLM答案的显著改进。

更新时间: 2025-07-17 08:15:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13411v1

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks

Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, \emph{benchmark contamination} arises from the public availability of test problems, increasing the risk of data leakage. Second, \emph{evaluation fragility} stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce {VAR-MATH}, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0\% on AMC23 and 58.3\% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.

Updated: 2025-07-17 08:10:55

标题: VAR-MATH：通过符号多实例基准测试探究大型语言模型中真正的数学推理

摘要: 最近在强化学习（RL）方面取得的进展已经显著提高了大型语言模型（LLMs）的数学推理能力，通过标准基准测试衡量。然而，即使模型在训练过程中使用了错误的信号，如随机或颠倒的奖励，这些收益通常仍然存在，这引发了一个基本问题：这些改进是否反映了真正的推理能力，还是仅仅是过度拟合基准特定模式的产物？为了解决这个问题，我们采取了一个以评估为中心的视角，并确定了现有协议中的两个关键缺陷。首先，\emph{基准污染}源于测试问题的公开可用性，增加了数据泄漏的风险。其次，\emph{评估脆弱性}源于对单个实例进行评估的依赖，这对随机输出非常敏感，并且无法捕捉推理的一致性。为了克服这些局限性，我们引入了{VAR-MATH}，这是一个设计用于探究真正推理能力的符号评估框架。通过将固定的数值问题转换为符号模板，并要求模型解决每个实例的多个实例化，VAR-MATH强制在结构上等效的变体中实现一致的推理，从而减轻污染并提高评估的鲁棒性。我们将VAR-MATH应用于将两个流行的基准测试，AMC23和AIME24，转换为它们的符号对应物，VAR-AMC23和VAR-AIME24。实验结果显示，RL训练模型在变量化版本上表现大幅下降，特别是对于较小的模型，AMC23下降了48.0％，AIME24下降了58.3％。这些发现表明，许多现有的RL方法依赖于表面启发式，并且无法推广到特定数值形式之外。总的来说，VAR-MATH为数学推理提供了一个基于原则、抗污染的评估范式。

更新时间: 2025-07-17 08:10:55

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.12885v1

Interpretable Transformation and Analysis of Timelines through Learning via Surprisability

The analysis of high-dimensional timeline data and the identification of outliers and anomalies is critical across diverse domains, including sensor readings, biological and medical data, historical records, and global statistics. However, conventional analysis techniques often struggle with challenges such as high dimensionality, complex distributions, and sparsity. These limitations hinder the ability to extract meaningful insights from complex temporal datasets, making it difficult to identify trending features, outliers, and anomalies effectively. Inspired by surprisability -- a cognitive science concept describing how humans instinctively focus on unexpected deviations - we propose Learning via Surprisability (LvS), a novel approach for transforming high-dimensional timeline data. LvS quantifies and prioritizes anomalies in time-series data by formalizing deviations from expected behavior. LvS bridges cognitive theories of attention with computational methods, enabling the detection of anomalies and shifts in a way that preserves critical context, offering a new lens for interpreting complex datasets. We demonstrate the usefulness of LvS on three high-dimensional timeline use cases: a time series of sensor data, a global dataset of mortality causes over multiple years, and a textual corpus containing over two centuries of State of the Union Addresses by U.S. presidents. Our results show that the LvS transformation enables efficient and interpretable identification of outliers, anomalies, and the most variable features along the timeline.

Updated: 2025-07-17 08:06:22

标题: 可解释的通过惊讶性学习进行时间线变换和分析

摘要: 对高维时间线数据进行分析，并识别异常值和异常情况，在各种领域至关重要，包括传感器读数、生物和医学数据、历史记录和全球统计数据。然而，传统的分析技术通常面临高维度、复杂分布和稀疏性等挑战。这些限制阻碍了从复杂时间数据集中提取有意义见解的能力，使得有效识别趋势特征、异常值和异常情况变得困难。受“惊讶性”的启发——这是一个描述人类本能关注意外偏离的认知科学概念——我们提出了一种新颖的方法 Learning via Surprisability (LvS)，用于转换高维时间线数据。LvS通过形式化与预期行为的偏差，量化并优先考虑时间序列数据中的异常值。LvS将注意力的认知理论与计算方法相结合，使得可以检测异常值和变化的方式可以保留关键上下文，为解释复杂数据集提供了一个新的视角。我们在三个高维时间线用例上展示了LvS的实用性：传感器数据的时间序列、多年间全球死因数据集，以及包含美国总统两个多世纪国情咨文的文本语料库。我们的结果表明，LvS转换可以有效、可解释地识别时间线上的异常值、异常情况和最可变特征。

更新时间: 2025-07-17 08:06:22

领域: stat.ME,cs.AI,cs.IT,math.IT

下载: http://arxiv.org/abs/2503.04502v2

MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness

With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.

Updated: 2025-07-17 08:03:43

标题: MAC-Tuning：LLM多组分问题推理与增强知识边界意识

摘要: 随着大型语言模型（LLMs）的广泛应用，生成不存在的事实，即幻觉问题，越来越受到关注。先前的研究主要集中在增强LLM置信度估计的单一问题设置上。然而，在更具挑战性的多问题设置下，需要同时准确回答多个问题，LLM对其内部参数化知识边界的意识仍未得到充分探索。为了弥补这一差距，我们引入了一种新的方法，即多个答案和置信度逐步调整（MAC-Tuning），在指导数据上对答案预测和置信度估计进行分开学习。大量实验证明，我们的方法在平均精度上优于基线高达25％。

更新时间: 2025-07-17 08:03:43

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.21773v2

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., ``How to create a bomb?''), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. We release our pre-trained model and benchmark at https://github.com/awsm-research/SEALGuard to support further research.

Updated: 2025-07-17 08:01:44

标题: SEALGuard：为LLM软件系统保护东南亚语言的多语言对话

摘要: 安全对齐对LLM动力系统至关重要。尽管最近的LLM动力护栏方法（如LlamaGuard）在检测英语中编写的不安全输入（例如“如何制造炸弹？”）方面取得了较高的准确性，但它们在处理多语言不安全输入时存在困难。这种限制使LLM系统容易受到低资源语言（如东南亚地区的语言）中写作的不安全和越狱提示的影响。本文介绍了SEALGuard，这是一个多语言护栏，旨在提高跨不同语言的安全对齐。它旨在解决现有护栏的多语言安全对齐差距，并确保在LLM动力系统中有效过滤不安全和越狱提示。我们利用低秩适应（LoRA）将通用多语言语言模型改造成一个多语言护栏。我们构建了SEALSBench，一个包含超过260,000个提示的大规模多语言安全对齐数据集，其中包括安全、不安全和越狱案例。我们在这个基准上将SEALGuard与LlamaGuard等最先进的护栏进行了评估。我们的研究结果显示，多语言不安全和越狱提示显著降低了最先进的LlamaGuard的性能，其防御成功率（DSR）分别比其在仅英语提示上的性能下降了9%和18%。相比之下，SEALGuard在检测多语言不安全和越狱提示方面优于现有护栏，其DSR比LlamaGuard提高了48%，并取得了最佳的DSR、精度和F1分数。我们的消融研究进一步揭示了适应策略和模型大小对SEALGuard整体性能的贡献。我们在https://github.com/awsm-research/SEALGuard发布了我们的预训练模型和基准，以支持进一步研究。

更新时间: 2025-07-17 08:01:44

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.08898v3

LLM-Enhanced User-Item Interactions: Leveraging Edge Information for Optimized Recommendations

Graph recommendation methods, representing a connected interaction perspective, reformulate user-item interactions as graphs to leverage graph structure and topology to recommend and have proved practical effectiveness at scale. Large language models, representing a textual generative perspective, excel at modeling user languages, understanding behavioral contexts, capturing user-item semantic relationships, analyzing textual sentiments, and generating coherent and contextually relevant texts as recommendations. However, there is a gap between the connected graph perspective and the text generation perspective as the task formulations are different. A research question arises: how can we effectively integrate the two perspectives for more personalized recsys? To fill this gap, we propose to incorporate graph-edge information into LLMs via prompt and attention innovations. We reformulate recommendations as a probabilistic generative problem using prompts. We develop a framework to incorporate graph edge information from the prompt and attention mechanisms for graph-structured LLM recommendations. We develop a new prompt design that brings in both first-order and second-order graph relationships; we devise an improved LLM attention mechanism to embed direct the spatial and connectivity information of edges. Our evaluation of real-world datasets demonstrates the framework's ability to understand connectivity information in graph data and to improve the relevance and quality of recommendation results.

Updated: 2025-07-17 07:51:28

标题: LLM增强的用户-物品交互：利用边缘信息优化推荐

摘要: 图推荐方法代表了一种连接交互视角，将用户-物品交互重新表述为图形，以利用图结构和拓扑来进行推荐，并已在规模上证明了实际有效性。大型语言模型代表了一种文本生成视角，擅长建模用户语言，理解行为背景，捕捉用户-物品语义关系，分析文本情感，并生成连贯且与上下文相关的文本作为推荐。然而，由于任务制定不同，连接图视角和文本生成视角之间存在差距。一个研究问题出现了：我们如何有效地整合这两种视角，以实现更个性化的推荐系统？为了填补这一差距，我们提出通过提示和注意力创新将图边信息整合到LLMs中。我们通过提示将推荐重新表述为一个概率生成问题。我们开发了一个框架，通过提示和注意力机制将图边信息整合到基于图结构的LLM推荐中。我们设计了一个新的提示设计，引入了一阶和二阶图关系；我们设计了一个改进的LLM注意力机制，以嵌入边的空间和连接信息。我们对真实世界数据集的评估显示，该框架能够理解图数据中的连接信息，并提高推荐结果的相关性和质量。

更新时间: 2025-07-17 07:51:28

领域: cs.AI,cs.IR

下载: http://arxiv.org/abs/2402.09617v2

Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework

Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each argument, we specify evidence requirements, evaluation methodologies, and implementation considerations for direct application by AI companies. This paper provides the first systematic methodology for integrating manipulation risk into AI safety governance, offering AI companies a concrete foundation to assess and mitigate these threats before deployment.

Updated: 2025-07-17 07:45:53

标题: 通过不对齐的人工智能进行操纵攻击：风险分析和安全案例框架

摘要: 前沿人工智能系统在说服、欺骗和影响人类行为方面的能力正在迅速提高，目前的模型已经在特定环境中展示了人类级别的说服和战略欺骗能力。人类往往是网络安全系统中最薄弱的环节，而在一家前沿公司内部部署的不协调的人工智能系统可能会通过操纵员工来破坏人类的监督。尽管这种威胁不断增长，但操纵攻击却鲜有人关注，也没有系统性框架来评估和减轻这些风险。为了解决这个问题，我们提供了一个详细的解释，说明为什么操纵攻击是一个重要的威胁，并可能导致灾难性后果。此外，我们提出了一个关于操纵风险的安全案例框架，围绕三个核心论点展开：无能、控制和可信度。对于每一条论点，我们指定了证据要求、评估方法和实施考虑，以供人工智能公司直接应用。这篇论文提供了第一个系统方法，将操纵风险纳入人工智能安全治理中，为人工智能公司提供一个具体的基础，以在部署前评估和减轻这些威胁。

更新时间: 2025-07-17 07:45:53

领域: cs.AI,cs.CR,cs.HC

下载: http://arxiv.org/abs/2507.12872v1

Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models; as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting. Giving support to its often observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it: i) optimizes a tighter bound to the RL objective and, ii) can improve performance compared to SFT on curated data. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks. For example achieving 66.7% on the AIME 2024 dataset.

Updated: 2025-07-17 07:26:54

标题: 在维护数据上进行监督微调就是强化学习（并且可以改进）

摘要: 在筛选过的数据上进行行为克隆（BC）是大型语言模型监督微调（SFT）的主要范式；以及控制策略模仿学习的主要方式。在这里，我们借鉴了这种成功策略与通过强化学习（RL）找到最优策略的理论和实践之间的联系。在现有文献的基础上，我们澄清了SFT可以被理解为在稀疏奖励设置中最大化RL目标的下限。这种观点支持了其经常观察到的良好表现。从这个角度来看，我们意识到对SFT进行微小修改会导致一种重要性加权变体，其行为更接近通过RL进行训练，因为它：i）优化了对RL目标的更紧密的下限，ii）可以相对于在筛选数据上的SFT改进表现。我们将这种变体称为重要性加权监督微调（iw-SFT）。我们展示了它容易实现，并且可以进一步推广到使用质量评分数据进行训练。由此产生的SFT变体与大型语言模型和连续控制任务训练策略的更高级RL算法相竞争。例如，在AIME 2024数据集上实现了66.7%的准确率。

更新时间: 2025-07-17 07:26:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.12856v1

Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering

As robots become increasingly capable of operating over extended periods -- spanning days, weeks, and even months -- they are expected to accumulate knowledge of their environments and leverage this experience to assist humans more effectively. This paper studies the problem of Long-term Active Embodied Question Answering (LA-EQA), a new task in which a robot must both recall past experiences and actively explore its environment to answer complex, temporally-grounded questions. Unlike traditional EQA settings, which typically focus either on understanding the present environment alone or on recalling a single past observation, LA-EQA challenges an agent to reason over past, present, and possible future states, deciding when to explore, when to consult its memory, and when to stop gathering observations and provide a final answer. Standard EQA approaches based on large models struggle in this setting due to limited context windows, absence of persistent memory, and an inability to combine memory recall with active exploration. To address this, we propose a structured memory system for robots, inspired by the mind palace method from cognitive science. Our method encodes episodic experiences as scene-graph-based world instances, forming a reasoning and planning algorithm that enables targeted memory retrieval and guided navigation. To balance the exploration-recall trade-off, we introduce value-of-information-based stopping criteria that determines when the agent has gathered sufficient information. We evaluate our method on real-world experiments and introduce a new benchmark that spans popular simulation environments and actual industrial sites. Our approach significantly outperforms state-of-the-art baselines, yielding substantial gains in both answer accuracy and exploration efficiency.

Updated: 2025-07-17 07:11:32

标题: 进入心灵宫殿：为长期积极的具身问答进行推理和规划

摘要: 随着机器人在延长时间跨度（包括几天、几周甚至几个月）内越来越有能力进行操作，人们期望它们能够积累对环境的知识，并利用这些经验更有效地帮助人类。本文研究了长期主动体验问答（LA-EQA）的问题，这是一项新任务，机器人必须回忆过去的经验并积极探索其环境以回答复杂的、与时间相关的问题。与传统的问答设置不同，传统设置通常集中在理解当前环境或回忆单个过去观察的基础上，LA-EQA挑战代理人对过去、现在和可能的未来状态进行推理，决定何时探索，何时查阅其记忆，何时停止收集观察并提供最终答案。基于大型模型的标准EQA方法在这种情况下很难应对，原因在于有限的上下文窗口、缺乏持久性记忆以及无法将记忆回忆与主动探索相结合。为了解决这个问题，我们提出了一种受认知科学中心理宫殿方法启发的机器人结构化记忆系统。我们的方法将经历编码为基于场景图的世界实例，形成一种推理和规划算法，使得目标记忆检索和引导导航成为可能。为了平衡探索和回忆之间的权衡，我们引入了基于信息价值的停止标准，确定代理何时收集到足够的信息。我们在真实世界实验中评估了我们的方法，并引入了一个跨越流行的模拟环境和实际工业场地的新基准。我们的方法明显优于最先进的基线，大大提高了答案准确性和探索效率。

更新时间: 2025-07-17 07:11:32

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.12846v1

Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants

Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 $\pm$ 0.10, balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.

Updated: 2025-07-17 07:11:14

标题: 在极早产儿的Day-1胸部X光片中，通过逐层冻结进行现场微调：朝着支气管肺发育不良的稳健预测

摘要: 支气管肺发育不良（BPD）是一种慢性肺部疾病，影响35%的极低出生体重婴儿。在孕龄36周时以氧依赖为定义，导致终身呼吸并发症。然而，预防干预带来严重风险，包括神经发育障碍、呼吸机诱导的肺部损伤和全身并发症。因此，早期BPD预后和BPD结果的预测对于避免低风险婴儿中不必要的毒性至关重要。极早产婴儿的入院放射片通常在出生后24小时内获得，可作为一种非侵入性预测工具。在这项工作中，我们开发并研究了一种深度学习方法，使用163名极低出生体重婴儿（≤32周孕龄，401-999克）出生后24小时内获得的胸部X光片。我们对一个在成人胸部放射片上预先训练的ResNet-50进行了微调，采用渐进式层冻结和具有区分性学习率以防止过拟合，并评估了CutMix增强和线性探测。对于中度/重度BPD结果的预测，我们表现最佳的模型采用了渐进冻结、线性探测和CutMix，实现了0.78±0.10的AUROC、0.69±0.10的平衡准确率和0.67±0.11的F1分数。领域内预训练明显优于ImageNet初始化（p = 0.031），这证实了领域特定预训练对于BPD结果预测的重要性。常规IRDS分级显示了有限的预测价值（AUROC 0.57±0.11），证实了对学习标记的需求。我们的方法表明，领域特定预训练使得可以从常规第一天放射片准确预测BPD。通过渐进冻结和线性探测，该方法在计算上仍然可行用于站点级实施和未来联邦学习部署。

更新时间: 2025-07-17 07:11:14

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.12269v2

SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

Image captioning has emerged as a crucial task in the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer based network architecture for remote sensing image captioning (RSIC) in which multiple techniques of Static Expansion, Memory-Augmented Self-Attention, Mesh Transformer are evaluated and integrated. We evaluate our proposed models using two benchmark remote sensing image datasets of UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most of evaluation metrics, which demonstrates potential to apply for real-life remote sensing image systems.

Updated: 2025-07-17 07:11:01

标题: SEMT：用于遥感图像字幕生成的静态扩展网格变换网络架构

摘要: 图像字幕已经成为计算机视觉和自然语言处理交叉领域中的一个至关重要的任务，可以从视觉内容自动生成描述性文本。在遥感领域，图像字幕在解释广阔复杂的卫星图像中起着重要作用，有助于环境监测、灾害评估和城市规划等应用。在本文中，我们提出了一种基于Transformer的远程传感图像字幕网络架构（RSIC），其中评估和集成了多种技术，包括静态扩展、记忆增强自注意力、网格Transformer。我们使用UCM-Caption和NWPU-Caption两个基准遥感图像数据集评估我们提出的模型。我们的最佳模型在大多数评估指标上优于现有技术系统，表明在实际远程传感图像系统中具有潜力应用。

更新时间: 2025-07-17 07:11:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12845v1

Dataset resulting from the user study on comprehensibility of explainable AI algorithms

This paper introduces a dataset that is the result of a user study on the comprehensibility of explainable artificial intelligence (XAI) algorithms. The study participants were recruited from 149 candidates to form three groups representing experts in the domain of mycology (DE), students with a data science and visualization background (IT) and students from social sciences and humanities (SSH). The main part of the dataset contains 39 transcripts of interviews during which participants were asked to complete a series of tasks and questions related to the interpretation of explanations of decisions of a machine learning model trained to distinguish between edible and inedible mushrooms. The transcripts were complemented with additional data that includes visualizations of explanations presented to the user, results from thematic analysis, recommendations of improvements of explanations provided by the participants, and the initial survey results that allow to determine the domain knowledge of the participant and data analysis literacy. The transcripts were manually tagged to allow for automatic matching between the text and other data related to particular fragments. In the advent of the area of rapid development of XAI techniques, the need for a multidisciplinary qualitative evaluation of explainability is one of the emerging topics in the community. Our dataset allows not only to reproduce the study we conducted, but also to open a wide range of possibilities for the analysis of the material we gathered.

Updated: 2025-07-17 07:09:03

标题: 可解释人工智能算法可理解性用户研究所得数据集

摘要: 本文介绍了一个数据集，这是一个关于可解释人工智能（XAI）算法可理解性的用户研究的结果。研究参与者从149名候选人中招募，形成了代表真菌学专家（DE）、具有数据科学和可视化背景的学生（IT）以及社会科学和人文学科学生（SSH）的三个群体。数据集的主要部分包含39份访谈记录，参与者被要求完成一系列与解释机器学习模型决策相关的任务和问题，该模型经过训练，能区分可食和不可食蘑菇。这些访谈记录还包括向用户展示的解释可视化、主题分析结果、参与者提供的解释改进建议以及初步调查结果，这些结果可用于确定参与者的领域知识和数据分析能力。访谈记录进行了手动标记，以便自动匹配文本和与特定片段相关的其他数据。在XAI技术迅速发展的时代，对可解释性进行跨学科定性评估的需求是社区中新兴的主题之一。我们的数据集不仅可以用于重现我们进行的研究，还可以为我们收集的材料进行广泛的分析提供可能。

更新时间: 2025-07-17 07:09:03

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2411.02419v2

Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks. The code is available at https://github.com/SumomoTaku/DiffGuideSamp.

Updated: 2025-07-17 07:04:11

标题: 任务特定的生成式数据集精炼与难度引导采样

摘要: 为了减轻深度神经网络对大规模数据集的依赖，数据集蒸馏旨在生成紧凑、高质量的合成数据集，能够实现与原始数据集相当的性能。集成生成模型显著推进了这一领域。然而，现有方法主要关注将蒸馏数据集与原始数据集对齐，往往忽视了对下游性能至关重要的任务特定信息。本文针对分类下游任务提出了一种任务特定的生成数据集蒸馏采样策略，该策略结合了难度概念，更好地考虑了目标任务的要求。最终数据集是从一个更大的图像池中采样得到的，采样分布通过匹配原始数据集的难度分布获得。对分布偏差进行修正的预处理步骤应用了对数变换。广泛实验的结果表明了我们方法的有效性，并暗示了其在其他下游任务上提升性能的潜力。代码可在https://github.com/SumomoTaku/DiffGuideSamp 上找到。

更新时间: 2025-07-17 07:04:11

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.03331v2

Data-Efficient Deep Operator Network for Unsteady Flow: A Multi-Fidelity Approach with Physics-Guided Subsampling

This study presents an enhanced multi-fidelity Deep Operator Network (DeepONet) framework for efficient spatio-temporal flow field prediction when high-fidelity data is scarce. Key innovations include: a merge network replacing traditional dot-product operations, achieving 50.4% reduction in prediction error and 7.57% accuracy improvement while reducing training time by 96%; a transfer learning multi-fidelity approach that freezes pre-trained low-fidelity networks while making only the merge network trainable, outperforming alternatives by up to 76% and achieving 43.7% better accuracy than single-fidelity training; and a physics-guided subsampling method that strategically selects high-fidelity training points based on temporal dynamics, reducing high-fidelity sample requirements by 40% while maintaining comparable accuracy. Comprehensive experiments across multiple resolutions and datasets demonstrate the framework's ability to significantly reduce required high-fidelity dataset size while maintaining predictive accuracy, with consistent superior performance against conventional benchmarks.

Updated: 2025-07-17 07:01:00

标题: 数据高效的深度操作网络用于非稳态流动：一种物理引导的多保真度方法

摘要: 这项研究提出了一种增强的多精度深度运算网络（DeepONet）框架，用于在高保真数据稀缺时进行高效的时空流场预测。关键创新包括：一个替代传统点积运算的合并网络，实现了50.4%的预测误差减少和7.57%的准确性提高，同时将训练时间减少了96%；一种传递学习多精度方法，冻结预训练的低保真网络，只使合并网络可训练，性能优于其他方法高达76%，比单精度训练提高了43.7%的准确性；以及一种基于物理引导的子采样方法，根据时间动态有选择地选择高保真训练点，将高保真样本要求减少了40%，同时保持可比较的准确性。在多个分辨率和数据集上进行的全面实验表明，该框架能够显著减少所需的高保真数据集大小，同时保持预测准确性，并且在与传统基准的比较中始终表现出优越的性能。

更新时间: 2025-07-17 07:01:00

领域: physics.flu-dyn,cs.AI

下载: http://arxiv.org/abs/2503.17941v2

Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90\% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.

Updated: 2025-07-17 06:49:16

标题: 多语言变压器中的因果语言控制通过稀疏特征引导

摘要: 确定性地控制大型多语言语言模型（LLMs）的目标生成语言仍然是一个基本挑战，特别是在零-shot设置中，其中既没有明确的语言提示，也没有微调。在这项工作中，我们调查了稀疏自编码器（SAE）特征是否可以被利用来引导LLMs的生成语言，在推理过程中。利用预训练的SAEs在Gemma-2B和Gemma-9B的残余流上，我们确定了激活在英语和四种目标语言（中文，日文，西班牙文和法文）之间差异最显着的特征。通过在一个transformer层中仅修改一个SAE特征，我们实现了高达90\%的成功率的可控语言转换，通过FastText语言分类来衡量，同时根据LaBSE（语言不可知的BERT句子嵌入）相似度保持语义保真度。我们的分析显示，语言引导在中后期transformer层中最有效，并且由特定的注意力头放大，这些头与敏感于语言的SAE特征不成比例地相关联。这些结果展示了稀疏特征引导作为一种轻量级和可解释的机制，用于可控的多语言生成的潜力。

更新时间: 2025-07-17 06:49:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13410v1

MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results

Small Multi-Object Tracking (SMOT) is particularly challenging when targets occupy only a few dozen pixels, rendering detection and appearance-based association unreliable. Building on the success of the MVA2023 SOD4SB challenge, this paper introduces the SMOT4SB challenge, which leverages temporal information to address limitations of single-frame detection. Our three main contributions are: (1) the SMOT4SB dataset, consisting of 211 UAV video sequences with 108,192 annotated frames under diverse real-world conditions, designed to capture motion entanglement where both camera and targets move freely in 3D; (2) SO-HOTA, a novel metric combining Dot Distance with HOTA to mitigate the sensitivity of IoU-based metrics to small displacements; and (3) a competitive MVA2025 challenge with 78 participants and 308 submissions, where the winning method achieved a 5.1x improvement over the baseline. This work lays a foundation for advancing SMOT in UAV scenarios with applications in bird strike avoidance, agriculture, fisheries, and ecological monitoring.

Updated: 2025-07-17 06:45:47

标题: MVA 2025 小型多目标跟踪鸟类挑战：数据集、方法和结果

摘要: Small Multi-Object Tracking (SMOT)在目标仅占据几十像素时尤为具有挑战性，导致检测和基于外观的关联不可靠。借鉴MVA2023 SOD4SB挑战的成功经验，本文介绍了SMOT4SB挑战，利用时间信息来解决单帧检测的局限性。我们的三个主要贡献是：（1）SMOT4SB数据集，包括211个无人机视频序列，108,192个带有多样实际环境条件的标注帧，旨在捕捉摄像机和目标在3D中自由移动时的运动纠缠；（2）SO-HOTA，一种将点距离与HOTA结合的新型指标，以减轻IoU基于指标对小位移的敏感性；以及（3）具有78名参与者和308份提交的竞争性MVA2025挑战，其中获胜方法相对基线实现了5.1倍的改进。这项工作为在无人机场景中推进SMOT奠定了基础，应用领域包括鸟类避撞、农业、渔业和生态监测。

更新时间: 2025-07-17 06:45:47

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.12832v1

Feature-Enhanced TResNet for Fine-Grained Food Image Classification

Food is not only a core component of humans' daily diets, but also an important carrier of cultural heritage and emotional bonds. With the development of technology, the need for accurate classification of food images has grown, which is crucial for a variety of application scenarios. However, existing Convolutional Neural Networks (CNNs) face significant challenges when dealing with fine-grained food images that are similar in shape but subtle in detail. To address this challenge, this study presents an innovative method for classifying food images, named Feature-Enhanced TResNet (FE-TResNet), specifically designed to address fine-grained food images and accurately capture subtle features within them. The FE-TResNet method is based on the TResNet model and integrates Style-based Recalibration Module (StyleRM) and Deep Channel-wise Attention (DCA) technologies to enhance feature extraction capabilities. In experimental validation on Chinese food image datasets ChineseFoodNet and CNFOOD-241, the FE-TResNet method significantly improved classification accuracy, achieving rates of 81.37% and 80.29%, respectively, demonstrating its effectiveness and superiority in fine-grained food image classification.

Updated: 2025-07-17 06:37:45

标题: Feature-Enhanced TResNet用于细粒度食物图像分类

摘要: 食物不仅是人类日常饮食的核心组成部分，还是文化遗产和情感纽带的重要载体。随着技术的发展，对食物图像的准确分类需求不断增长，这对各种应用场景至关重要。然而，现有的卷积神经网络（CNN）在处理形状相似但细节微妙的细粒度食物图像时面临重大挑战。为了解决这一挑战，本研究提出了一种创新的食物图像分类方法，名为特征增强TResNet（FE-TResNet），专门设计用于处理细粒度食物图像并准确捕捉其中微妙的特征。FE-TResNet方法基于TResNet模型，整合了基于风格的重新校准模块（StyleRM）和深度通道注意力（DCA）技术，以增强特征提取能力。在中国食物图像数据集ChineseFoodNet和CNFOOD-241上的实验验证中，FE-TResNet方法显著提高了分类准确度，分别达到了81.37%和80.29%，展示了其在细粒度食物图像分类中的有效性和优越性。

更新时间: 2025-07-17 06:37:45

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12828v1

MRGen: Segmentation Data Engine for Underrepresented MRI Modalities

Training medical image segmentation models for rare yet clinically important imaging modalities is challenging due to the scarcity of annotated data, and manual mask annotations can be costly and labor-intensive to acquire. This paper investigates leveraging generative models to synthesize data, for training segmentation models for underrepresented modalities, particularly on annotation-scarce MRI. Concretely, our contributions are threefold: (i) we introduce MRGen-DB, a large-scale radiology image-text dataset comprising extensive samples with rich metadata, including modality labels, attributes, regions, and organs information, with a subset featuring pixel-wise mask annotations; (ii) we present MRGen, a diffusion-based data engine for controllable medical image synthesis, conditioned on text prompts and segmentation masks. MRGen can generate realistic images for diverse MRI modalities lacking mask annotations, facilitating segmentation training in low-source domains; (iii) extensive experiments across multiple modalities demonstrate that MRGen significantly improves segmentation performance on unannotated modalities by providing high-quality synthetic data. We believe that our method bridges a critical gap in medical image analysis, extending segmentation capabilities to scenarios that are challenging to acquire manual annotations. The codes, models, and data will be publicly available at https://haoningwu3639.github.io/MRGen/

Updated: 2025-07-17 06:36:51

标题: MRGen：针对少见MRI模态的分割数据引擎

摘要: 训练医学图像分割模型对于罕见但临床重要的成像模态具有挑战性，因为标注数据稀缺，手动标注掩膜可能成本高且劳动密集。本文研究利用生成模型合成数据，为少数模态训练分割模型，特别是在标注稀缺的MRI上。具体而言，我们的贡献有三个方面：(i) 我们引入MRGen-DB，一个大规模的放射学图像文本数据集，包括丰富的样本元数据，包括模态标签、属性、区域和器官信息，其中一部分具有逐像素掩膜标注；(ii) 我们提出MRGen，一种基于扩散的数据引擎，用于可控医学图像合成，根据文本提示和分割掩膜进行条件化。MRGen可以为缺乏掩膜标注的多样化MRI模态生成逼真的图像，有助于在低源领域进行分割训练；(iii) 在多个模态上的广泛实验表明，MRGen通过提供高质量的合成数据，显著改善了未标注模态的分割性能。我们相信我们的方法填补了医学图像分析中的一个关键差距，将分割能力扩展到难以获取手动标注的场景。代码、模型和数据将在https://haoningwu3639.github.io/MRGen/ 上公开提供。

更新时间: 2025-07-17 06:36:51

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2412.04106v3

AI Governance InternationaL Evaluation Index (AGILE Index) 2024

The rapid advancement of Artificial Intelligence (AI) technology is profoundly transforming human society and concurrently presenting a series of ethical, legal, and social issues. The effective governance of AI has become a crucial global concern. Since 2022, the extensive deployment of generative AI, particularly large language models, marked a new phase in AI governance. Continuous efforts are being made by the international community in actively addressing the novel challenges posed by these AI developments. As consensus on international governance continues to be established and put into action, the practical importance of conducting a global assessment of the state of AI governance is progressively coming to light. In this context, we initiated the development of the AI Governance InternationaL Evaluation Index (AGILE Index). Adhering to the design principle, "the level of governance should match the level of development," the inaugural evaluation of the AGILE Index commences with an exploration of four foundational pillars: the development level of AI, the AI governance environment, the AI governance instruments, and the AI governance effectiveness. It covers 39 indicators across 18 dimensions to comprehensively assess the AI governance level of 14 representative countries globally. The index is utilized to delve into the status of AI governance to date in 14 countries for the first batch of evaluation. The aim is to depict the current state of AI governance in these countries through data scoring, assist them in identifying their governance stage and uncovering governance issues, and ultimately offer insights for the enhancement of their AI governance systems.

Updated: 2025-07-17 06:34:41

标题: 人工智能治理国际评估指数（AGILE指数）2024

摘要: 人工智能（AI）技术的快速发展正在深刻地改变人类社会，并同时带来一系列伦理、法律和社会问题。有效治理AI已成为一个关键的全球关注点。自2022年以来，生成式AI的广泛部署，尤其是大型语言模型，标志着AI治理的一个新阶段。国际社会正在积极努力解决这些AI发展带来的新挑战。随着对国际治理的共识不断建立并付诸实施，进行全球AI治理状况的评估变得日益重要。在这种背景下，我们启动了AI治理国际评估指数（AGILE指数）的开发。遵循“治理水平应与发展水平相匹配”的设计原则，AGILE指数的首次评估从探索四个基本支柱开始：AI的发展水平、AI治理环境、AI治理工具和AI治理效果。它涵盖了18个维度中的39个指标，全面评估了全球14个代表性国家的AI治理水平。该指数用于首批评估中对14个国家的AI治理状况进行深入研究。旨在通过数据评分描绘这些国家的当前AI治理状况，帮助它们确定其治理阶段和发现治理问题，最终为提升其AI治理体系提供见解。

更新时间: 2025-07-17 06:34:41

领域: cs.CY,cs.AI,68T01,A.1

下载: http://arxiv.org/abs/2502.15859v4

Emotional Support with LLM-based Empathetic Dialogue Generation

Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and finetuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model's ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.

Updated: 2025-07-17 06:24:20

标题: 基于LLM的共情对话生成的情感支持

摘要: 情感支持对话（ESC）旨在通过对话提供富有同理心和有效的情感支持，满足心理健康支持日益增长的需求。本文介绍了我们针对NLPCC 2025任务8 ESC评估提出的解决方案，我们利用大规模语言模型结合提示工程和微调技术。我们探索了参数高效的低秩适应和全参数微调策略，以提高模型生成支持性和上下文恰当性回应的能力。我们的最佳模型在比赛中排名第二，突显了将LLMs与有效的适应方法相结合用于ESC任务的潜力。未来工作将致力于进一步增强情感理解和响应个性化，构建更加实用和可靠的情感支持系统。

更新时间: 2025-07-17 06:24:20

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.12820v1

FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering

Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model's capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that assists task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase the adaptability of downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods.

Updated: 2025-07-17 06:19:38

标题: FIQ：利用问题嵌入集成生成基础问题，用于视频问答

摘要: 视频问答（VQA）是一个多模态任务，需要解释视频以回答给定的问题。现有的VQA方法主要利用问题和答案（Q&A）对来学习视频内容的时空特征。然而，这些注释通常是以事件为中心的，这并不足以捕捉每个视频的更广泛背景。缺乏必要的细节，如对象类型、空间布局和描述性属性，限制了模型只能学习片段化的场景表示。这个问题限制了模型的泛化能力和更高层次的推理能力。在本文中，我们提出了一种基于问题嵌入的视频问答（FIQ）的基本问题生成方法，这是一种旨在通过增强对视频的基本理解来加强模型推理能力的新方法。FIQ基于从视频中提取的描述生成Q&A对，用基本场景信息丰富训练数据。生成的Q&A对使模型能够理解主要背景，从而提高泛化能力和推理能力。此外，我们还结合了一个VQ-CAlign模块，将任务特定的问题嵌入与视觉特征结合起来，确保关键的领域特定细节被保留以增加下游任务的适应性。在SUTD-TrafficQA上的实验表明，与现有基准方法相比，我们的FIQ实现了最先进的性能。

更新时间: 2025-07-17 06:19:38

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12816v1

K-P Quantum Neural Networks

We present an extension of K-P time-optimal quantum control solutions using global Cartan $KAK$ decompositions for geodesic-based solutions. Extending recent time-optimal constant-$\theta$ control results, we integrate Cartan methods into equivariant quantum neural network (EQNN) for quantum control tasks. We show that a finite-depth limited EQNN ansatz equipped with Cartan layers can replicate the constant-$\theta$ sub-Riemannian geodesics for K-P problems. We demonstrate how for certain classes of control problem on Riemannian symmetric spaces, gradient-based training using an appropriate cost function converges to certain global time-optimal solutions when satisfying simple regularity conditions. This generalises prior geometric control theory methods and clarifies how optimal geodesic estimation can be performed in quantum machine learning contexts.

Updated: 2025-07-17 06:17:19

标题: K-P量子神经网络

摘要: 我们提出了一种使用全局Cartan $KAK$分解对基于测地线的解进行扩展的K-P最优时间量子控制解。在最近的最优恒定-$\theta$控制结果的基础上，我们将Cartan方法集成到等变量量子神经网络（EQNN）中，用于量子控制任务。我们展示了一个有限深度的限制EQNN假设配备Cartan层可以复制K-P问题的恒定-$\theta$子黎曼测地线。我们展示了在某些类别的黎曼对称空间控制问题上，使用适当的成本函数进行基于梯度的训练时，当满足简单的正则性条件时，会收敛到某些全局最优时间解。这推广了先前的几何控制理论方法，并阐明了在量子机器学习环境中如何进行最优测地线估计。

更新时间: 2025-07-17 06:17:19

领域: quant-ph,cs.AI

下载: http://arxiv.org/abs/2504.01673v2

A Deep Learning-Based Ensemble System for Automated Shoulder Fracture Detection in Clinical Radiographs

Background: Shoulder fractures are often underdiagnosed, especially in emergency and high-volume clinical settings. Studies report up to 10% of such fractures may be missed by radiologists. AI-driven tools offer a scalable way to assist early detection and reduce diagnostic delays. We address this gap through a dedicated AI system for shoulder radiographs. Methods: We developed a multi-model deep learning system using 10,000 annotated shoulder X-rays. Architectures include Faster R-CNN (ResNet50-FPN, ResNeXt), EfficientDet, and RF-DETR. To enhance detection, we applied bounding box and classification-level ensemble techniques such as Soft-NMS, WBF, and NMW fusion. Results: The NMW ensemble achieved 95.5% accuracy and an F1-score of 0.9610, outperforming individual models across all key metrics. It demonstrated strong recall and localization precision, confirming its effectiveness for clinical fracture detection in shoulder X-rays. Conclusion: The results show ensemble-based AI can reliably detect shoulder fractures in radiographs with high clinical relevance. The model's accuracy and deployment readiness position it well for integration into real-time diagnostic workflows. The current model is limited to binary fracture detection, reflecting its design for rapid screening and triage support rather than detailed orthopedic classification.

Updated: 2025-07-17 06:06:12

标题: 基于深度学习的集成系统用于临床X射线片自动检测肩部骨折

摘要: 背景：肩部骨折经常被低估，尤其是在急诊和高负荷的临床环境中。研究表明，高达10％的这类骨折可能被放射学家错过。基于人工智能的工具提供了一种可扩展的方式来帮助早期检测，并减少诊断延迟。我们通过专用的肩部X射线AI系统来填补这一空白。方法：我们使用了10,000个标注的肩部X射线图像开发了一个多模型深度学习系统。架构包括Faster R-CNN（ResNet50-FPN，ResNeXt），EfficientDet和RF-DETR。为了增强检测，我们应用了边界框和分类级融合技术，如Soft-NMS，WBF和NMW融合。结果：NMW融合实现了95.5%的准确率和0.9610的F1分数，在所有关键指标上表现优于单个模型。它展现出强大的召回和定位精度，证实了它在肩部X射线骨折诊断中的有效性。结论：结果表明基于集成的AI可以可靠地在放射片中检测肩部骨折，具有高度的临床相关性。该模型的准确性和部署准备性使其能够很好地融入实时诊断工作流程。当前模型仅限于二元骨折检测，反映了其为快速筛查和分类支持而设计的特点。

更新时间: 2025-07-17 06:06:12

领域: cs.CV,cs.AI,68T07,I.2.10

下载: http://arxiv.org/abs/2507.13408v1

Large Language Models' Internal Perception of Symbolic Music

Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.

Updated: 2025-07-17 05:48:45

标题: 大型语言模型对符号音乐的内部感知

摘要: 大型语言模型（LLMs）擅长建模自然语言中字符串之间的关系，并显示出在扩展到其他符号领域（如编码或数学）方面的潜力。然而，它们隐式建模符号音乐的程度仍未被充分探索。本文通过生成符号音乐数据，并通过描述流派和风格组合的文本提示来评估它们的实用性，探讨了LLMs如何表示音乐概念。我们创建了一个不依赖于明确音乐训练的LLM生成的MIDI文件数据集。然后，我们完全在这个LLM生成的MIDI数据集上训练神经网络，并进行流派和风格分类以及旋律完成，将它们的性能与已建立的模型进行对比。我们的结果表明，LLMs可以从文本中推断出基本的音乐结构和时间关系，突显了它们隐式编码音乐模式的潜力，以及由于缺乏明确音乐背景而存在的局限性，揭示了它们在符号音乐生成能力方面的特点。

更新时间: 2025-07-17 05:48:45

领域: cs.CL,cs.AI,cs.LG,cs.SD,eess.AS

下载: http://arxiv.org/abs/2507.12808v1

THOR: Transformer Heuristics for On-Demand Retrieval

We introduce the THOR (Transformer Heuristics for On-Demand Retrieval) Module, designed and implemented by eSapiens, a secure, scalable engine that transforms natural-language questions into verified, read-only SQL analytics for enterprise databases. The Text-to-SQL module follows a decoupled orchestration/execution architecture: a Supervisor Agent routes queries, Schema Retrieval dynamically injects table and column metadata, and a SQL Generation Agent emits single-statement SELECT queries protected by a read-only guardrail. An integrated Self-Correction & Rating loop captures empty results, execution errors, or low-quality outputs and triggers up to five LLM-driven regeneration attempts. Finally, a Result Interpretation Agent produces concise, human-readable insights and hands raw rows to the Insight & Intelligence engine for visualization or forecasting. Smoke tests across finance, sales, and operations scenarios demonstrate reliable ad-hoc querying and automated periodic reporting. By embedding schema awareness, fault-tolerant execution, and compliance guardrails, the THOR Module empowers non-technical users to access live data with zero-SQL simplicity and enterprise-grade safety.

Updated: 2025-07-17 05:47:22

标题: 雷神：用于按需检索的变压器启发式

摘要: 我们介绍了由eSapiens设计和实施的THOR（用于按需检索的Transformer启发式）模块，这是一个安全、可扩展的引擎，将自然语言问题转换为经过验证的只读SQL分析结果，用于企业数据库。文本到SQL模块采用了解耦的编排/执行架构：监督代理路由查询，模式检索动态注入表和列元数据，SQL生成代理生成由只读防护栏保护的单语句SELECT查询。集成的自校正和评分循环捕捉空结果、执行错误或低质量输出，并触发最多五次由LLM驱动的再生尝试。最后，结果解释代理生成简洁易懂的见解，将原始行交给洞察与智能引擎进行可视化或预测。在财务、销售和运营场景中进行的烟雾测试展示了可靠的临时查询和自动定期报告。通过嵌入模式意识、容错执行和合规防护栏，THOR模块赋予非技术用户使用零SQL简易性和企业级安全性访问实时数据的能力。

更新时间: 2025-07-17 05:47:22

领域: cs.DB,cs.AI

下载: http://arxiv.org/abs/2507.09592v3

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

The rapid rise of Large Language Models (LLMs)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce \oursystemname, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

Updated: 2025-07-17 05:46:27

标题: MCPEval：基于MCP的AI代理模型的自动深度评估

摘要: 大型语言模型（LLMs）智能代理的快速崛起凸显了对稳健、可扩展评估框架的需求。现有方法依赖于静态基准和繁重的数据收集，限制了实际评估。我们引入了一个名为\oursystemname 的开源模型上下文协议（MCP）框架，该框架自动化了跨多个领域的LLM代理的端到端任务生成和深度评估。MCPEval标准化了指标，与原生代理工具无缝集成，并消除了构建评估管道的手动工作。在五个真实世界领域的经验结果显示了它在揭示微妙、领域特定性能方面的有效性。我们公开发布MCPEval https://github.com/SalesforceAIResearch/MCPEval，以促进可重现和标准化的LLM代理评估。

更新时间: 2025-07-17 05:46:27

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.12806v1

PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression \& decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve those challenges, we propose a novel \underline{P}arallel \underline{M}ulti-\underline{K}nowledge \underline{L}earning-based \underline{C}ompressor (PMKLC) with four crucial designs: 1) We propose an automated multi-knowledge learning-based compression framework as compressors' backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated ($s$,$k$)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) We design two compression modes PMKLC-S and PMKLC-M to meet the complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 leaning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve the average compression ratio improvement up to 73.609\% and 73.480\%, the average throughput improvement up to 3.036$\times$ and 10.710$\times$, respectively. Besides, PMKLC-S/M also achieve the best robustness and competitive memory cost, indicating its greater stability against datasets with different probability distribution perturbations, and its strong ability to run on memory-constrained devices.

Updated: 2025-07-17 05:46:08

标题: PMKLC：用于大规模基因组数据库的并行多知识学习无损压缩

摘要: 基于学习的无损压缩器在大规模基因组数据库的备份、存储、传输和管理中起着至关重要的作用。然而，它们的1) 压缩比不足，2) 低压缩和解压缩吞吐量，以及3) 压缩鲁棒性差限制了它们在工业和学术界的广泛应用。为了解决这些挑战，我们提出了一种新颖的并行多知识学习压缩器（PMKLC），具有四个关键设计：1) 我们提出了一个自动化的多知识学习压缩框架作为压缩器的支柱，以增强压缩比和鲁棒性；2) 我们设计了一个GPU加速（s，k）-mer编码器，以优化压缩吞吐量和计算资源使用；3) 我们引入了数据块分区和逐步模型传递（SMP）机制进行并行加速；4) 我们设计了两种压缩模式PMKLC-S和PMKLC-M，以满足复杂的应用场景，前者在资源受限的单个GPU上运行，后者是多GPU加速的。我们在15个不同物种和数据大小的真实数据集上对PMKLC-S/M和14个基线进行了基准测试（7个传统的和7个基于学习的）。与测试数据集上的基线相比，PMKLC-S/M的平均压缩比提高了高达73.609\%和73.480\%，平均吞吐量分别提高了高达3.036倍和10.710倍。此外，PMKLC-S/M还实现了最佳的鲁棒性和竞争性内存成本，表明它对具有不同概率分布扰动的数据集具有更强的稳定性，并且具有在内存受限设备上运行的强大能力。

更新时间: 2025-07-17 05:46:08

领域: cs.LG,cs.AI,cs.CL,cs.DB

下载: http://arxiv.org/abs/2507.12805v1

FLDmamba: Integrating Fourier and Laplace Transform Decomposition with Mamba for Enhanced Time Series Prediction

Time series prediction, a crucial task across various domains, faces significant challenges due to the inherent complexities of time series data, including non-stationarity, multi-scale periodicity, and transient dynamics, particularly when tackling long-term predictions. While Transformer-based architectures have shown promise, their quadratic complexity with sequence length hinders their efficiency for long-term predictions. Recent advancements in State-Space Models, such as Mamba, offer a more efficient alternative for long-term modeling, but they cannot capture multi-scale periodicity and transient dynamics effectively. Meanwhile, they are susceptible to data noise issues in time series. This paper proposes a novel framework, FLDmamba (Fourier and Laplace Transform Decomposition Mamba), addressing these limitations. FLDmamba leverages the strengths of both Fourier and Laplace transforms to effectively capture both multi-scale periodicity, transient dynamics within time series data, and improve the robustness of the model to the data noise issue. Our extensive experiments demonstrate that FLDmamba achieves superior performance on time series prediction benchmarks, outperforming both Transformer-based and other Mamba-based architectures. To promote the reproducibility of our method, we have made both the code and data accessible via the following URL:{\href{https://github.com/AI4Science-WestlakeU/FLDmamba}{https://github.com/AI4Science-WestlakeU/\model}.

Updated: 2025-07-17 05:39:15

标题: FLDmamba：将Fourier和Laplace变换分解与Mamba集成，以增强时间序列预测

摘要: 时间序列预测是各个领域中一个关键任务，面临着重大挑战，这是由于时间序列数据固有的复杂性，包括非平稳性、多尺度周期性和瞬时动态，尤其是在处理长期预测时。虽然基于Transformer的架构显示出潜力，但是它们在序列长度上的二次复杂度阻碍了它们在长期预测中的效率。最近在状态空间模型方面的进展，如Mamba，为长期建模提供了更有效的替代方案，但它们无法有效捕捉多尺度周期性和瞬时动态。同时，它们容易受到时间序列中的数据噪声问题的影响。本文提出了一种新颖的框架，FLDmamba（Fourier和Laplace变换分解Mamba），解决了这些限制。FLDmamba利用傅里叶和拉普拉斯变换的优势，有效捕捉时间序列数据中的多尺度周期性、瞬时动态，并改善模型对数据噪声问题的鲁棒性。我们广泛的实验表明，FLDmamba在时间序列预测基准上取得了优越的性能，优于基于Transformer和其他基于Mamba的架构。为了促进我们方法的可重现性，我们已经通过以下URL使代码和数据可访问：{ \href {https://github.com/AI4Science-WestlakeU/FLDmamba} {https://github.com/AI4Science-WestlakeU/\model}。

更新时间: 2025-07-17 05:39:15

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.12803v1

IConMark: Robust Interpretable Concept-Based Watermark For AI Images

With the rapid rise of generative AI and synthetic media, distinguishing AI-generated images from real ones has become crucial in safeguarding against misinformation and ensuring digital authenticity. Traditional watermarking techniques have shown vulnerabilities to adversarial attacks, undermining their effectiveness in the presence of attackers. We propose IConMark, a novel in-generation robust semantic watermarking method that embeds interpretable concepts into AI-generated images, as a first step toward interpretable watermarking. Unlike traditional methods, which rely on adding noise or perturbations to AI-generated images, IConMark incorporates meaningful semantic attributes, making it interpretable to humans and hence, resilient to adversarial manipulation. This method is not only robust against various image augmentations but also human-readable, enabling manual verification of watermarks. We demonstrate a detailed evaluation of IConMark's effectiveness, demonstrating its superiority in terms of detection accuracy and maintaining image quality. Moreover, IConMark can be combined with existing watermarking techniques to further enhance and complement its robustness. We introduce IConMark+SS and IConMark+TM, hybrid approaches combining IConMark with StegaStamp and TrustMark, respectively, to further bolster robustness against multiple types of image manipulations. Our base watermarking technique (IConMark) and its variants (+TM and +SS) achieve 10.8%, 14.5%, and 15.9% higher mean area under the receiver operating characteristic curve (AUROC) scores for watermark detection, respectively, compared to the best baseline on various datasets.

Updated: 2025-07-17 05:38:30

标题: IConMark：用于AI图像的稳健可解释概念水印

摘要: 随着生成式人工智能和合成媒体的迅速崛起，区分AI生成的图像与真实图像变得至关重要，以确保防茯错误信息并保证数字真实性。传统的水印技术已经显示出在对抗攻击方面存在漏洞，削弱了它们在遭受攻击者时的有效性。我们提出了一种新颖的在生成过程中具有鲁棒性的语义水印方法IConMark，该方法将可解释的概念嵌入到AI生成的图像中，作为朝着可解释水印的第一步。与传统方法不同，传统方法依赖于向AI生成的图像添加噪音或扰动，IConMark则融入了有意义的语义属性，使其对人类可解释，因此对对抗性操纵具有韧性。这种方法不仅对各种图像增强具有鲁棒性，而且可以由人类阅读，从而使水印可以手动验证。我们对IConMark的有效性进行了详细评估，证明了它在检测准确性和保持图像质量方面的优越性。此外，IConMark可以与现有的水印技术结合，进一步增强和补充其鲁棒性。我们引入了IConMark+SS和IConMark+TM，这是将IConMark与StegaStamp和TrustMark相结合的混合方法，以进一步增强对多种图像操作的鲁棒性。我们的基本水印技术(IConMark)及其变体(+TM和+SS)在各种数据集上相比最佳基线，分别实现了10.8％、14.5％和15.9％更高的水印检测下接受者工作特性曲线下面积（AUROC）得分。

更新时间: 2025-07-17 05:38:30

领域: cs.CV,cs.AI,cs.CR

下载: http://arxiv.org/abs/2507.13407v1

Imitating Mistakes in a Learning Companion AI Agent for Online Peer Learning

In recent years, peer learning has gained attention as a method that promotes spontaneous thinking among learners, and its effectiveness has been confirmed by numerous studies. This study aims to develop an AI Agent as a learning companion that enables peer learning anytime and anywhere. However, peer learning between humans has various limitations, and it is not always effective. Effective peer learning requires companions at the same proficiency levels. In this study, we assume that a learner's peers with the same proficiency level as the learner make the same mistakes as the learner does and focus on English composition as a specific example to validate this approach.

Updated: 2025-07-17 05:37:07

标题: 在线对等学习中学习伴侣AI代理模仿错误

摘要: 近年来，同侪学习作为一种促进学习者自发思考的方法受到关注，并其有效性已被许多研究证实。本研究旨在开发一个人工智能代理作为学习伴侣，实现随时随地的同侪学习。然而，人类之间的同侪学习存在各种限制，并不总是有效的。有效的同侪学习需要与学习者相同熟练水平的伴侣。在本研究中，我们假设学习者与与之相同熟练水平的同侪会犯同样的错误，并以英语作文为具体例子验证这种方法。

更新时间: 2025-07-17 05:37:07

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.12801v1

Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability

Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, up until now, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features--object boundaries and coarse shapes--that remain under-exploited, thereby limiting the alignment of perturbations with object-salient regions which are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further direct semantic consistency between the early-layer activations in the student and those of the semantically rich teacher by feature distillation. By anchoring perturbation synthesis to the semantically salient early intermediate blocks within the generator based on empirical findings, our method guides progressive adversarial perturbation on regions that substantially enhance adversarial transferability. We conduct extensive experiments over diverse models, domains and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).

Updated: 2025-07-17 05:35:13

标题: 语义结构感知的生成式攻击以增强对抗性迁移性

摘要: 生成对抗攻击训练一个扰动生成器在一个白盒替代模型上，然后将精心设计的扰动应用于未见的黑盒受害模型。与迭代攻击相比，这些方法在推理时间效率、可扩展性和可转移性上提供了更优越的表现；然而，迄今为止，现有研究尚未充分利用生成模型的表征能力来保留和利用语义信息。具体来说，生成器的中间激活编码了丰富的语义特征--对象边界和粗略形状--这些特征仍然未得到充分利用，从而限制了扰动与关键的对象显著区域对齐，这对于对抗转移至关重要。为了解决这个问题，我们引入了一个基于Mean Teacher的语义结构感知攻击框架，它作为一个时间平滑的特征参考。通过这个平滑的参考，我们进一步通过特征蒸馏在学生的早期层激活和语义丰富的教师的激活之间指导语义一致性。通过根据经验结果将扰动合成锚定到生成器内部语义显著的早期中间块，我们的方法引导逐步对抗性扰动在显著增强对抗性转移的区域。我们进行了广泛的实验，涵盖了不同的模型、领域和任务，以展示相对于最先进的生成式攻击的一致改进，全面评估使用传统度量和我们新提出的意外修正率（ACR）。

更新时间: 2025-07-17 05:35:13

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.18248v3

ReCode: Updating Code API Knowledge with Reinforcement Learning

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms that of the 32B parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.

Updated: 2025-07-17 05:31:07

标题: ReCode：使用强化学习更新代码API知识

摘要: 大型语言模型(LLMs)展示了出色的代码生成能力，但在适应外部库API的频繁更新时表现不佳。这一关键限制源自于它们依赖于训练数据中过时的API知识，即使有当前文档的访问权限，也会阻碍在动态环境中可靠的代码生成。为了解决这个问题，我们提出了ReCode (基于规则的强化学习用于代码更新)，这是一个模仿人类程序员适应API变化的新框架。具体地，我们构建了一个约2000个数据条目的数据集，用于训练LLMs基于更新信息进行版本迁移。然后，我们引入了一个修改后的字符串相似度度量作为强化学习的奖励用于代码评估。我们的实验表明，ReCode显著提升了LLMs在动态API场景下的代码生成性能，特别是在未见的CodeUpdateArena任务上。关键的是，与监督微调相比，ReCode对LLMs的一般代码生成能力影响较小。我们将ReCode应用于各种LLMs和强化学习算法(GRPO和DAPO)，均取得了一致的改进。值得注意的是，在训练后，Qwen2.5-Coder-7B的表现优于具有相同架构的32B参数代码指令调整模型和推理模型。代码可在https://github.com/zjunlp/ReCode上获得。

更新时间: 2025-07-17 05:31:07

领域: cs.CL,cs.AI,cs.IR,cs.LG,cs.SE

下载: http://arxiv.org/abs/2506.20495v2

City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning

Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from missing multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named \textbf{\underline{SVM-City}}, deriving from multi\textbf{\underline{S}}cale scenarios with multi\textbf{\underline{V}}iew and multi\textbf{\underline{M}}odal instruction tuning data. It contains $420$k images and $4, 811$M point clouds with $567$k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellite. To effectively fuse the multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design the LVLM named \textbf{\underline{City-VLM}}. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than implementing directly explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show City-VLM achieves $18.14 \%$ performance surpassing existing LVLMs in question-answering tasks averagely. Our method demonstrates pragmatic and generalization performance across multiple outdoor scenes.

Updated: 2025-07-17 05:21:21

标题: City-VLM：通过多模态不完全学习实现多领域感知场景理解

摘要: 场景理解使智能代理能够解释和理解其环境。虽然现有的大型视觉语言模型（LVLMs）主要集中在室内家庭任务的场景理解上，但当应用于室外大规模场景理解时，它们面临两个重要限制。首先，室外场景通常包括通过多个传感器从多个视角（例如，鸟瞰和地面视角）观察到的大规模环境，而现有的室内LVLMs主要分析建筑规模环境中的单一视觉模态，从人类视角进行观察。其次，现有的LVLMs缺乏跨领域感知室外数据，且难以有效地集成2D和3D视觉信息。为了解决上述限制，我们构建了第一个跨领域感知室外场景理解数据集，命名为\textbf{\underline{SVM-City}}，从多尺度场景中获取多视角和多模态指导调整数据。它包含$420$k张图像和$4, 811$M点云，以及来自车辆、低空无人机、高空航空飞机和卫星的$567$k问答对。为了在缺少某种模态的情况下有效地融合多模态数据，我们引入了不完整的多模态学习来建模室外场景理解，并设计了名为\textbf{\underline{City-VLM}}的LVLM。多模态融合通过构建一个联合概率分布空间来实现，而不是直接实施显式融合操作（例如，连接）。对三种典型室外场景理解任务的实验结果显示，City-VLM在问答任务中的表现平均超过现有LVLMs的$18.14\%$。我们的方法在多个室外场景中展示了实用和泛化性能。

更新时间: 2025-07-17 05:21:21

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12795v1

Beyond Architectures: Evaluating the Role of Contextual Embeddings in Detecting Bipolar Disorder on Social Media

Bipolar disorder is a chronic mental illness frequently underdiagnosed due to subtle early symptoms and social stigma. This paper explores the advanced natural language processing (NLP) models for recognizing signs of bipolar disorder based on user-generated social media text. We conduct a comprehensive evaluation of transformer-based models (BERT, RoBERTa, ALBERT, ELECTRA, DistilBERT) and Long Short Term Memory (LSTM) models based on contextualized (BERT) and static (GloVe, Word2Vec) word embeddings. Experiments were performed on a large, annotated dataset of Reddit posts after confirming their validity through sentiment variance and judgmental analysis. Our results demonstrate that RoBERTa achieves the highest performance among transformer models with an F1 score of ~98% while LSTM models using BERT embeddings yield nearly identical results. In contrast, LSTMs trained on static embeddings fail to capture meaningful patterns, scoring near-zero F1. These findings underscore the critical role of contextual language modeling in detecting bipolar disorder. In addition, we report model training times and highlight that DistilBERT offers an optimal balance between efficiency and accuracy. In general, our study offers actionable insights for model selection in mental health NLP applications and validates the potential of contextualized language models to support early bipolar disorder screening.

Updated: 2025-07-17 05:14:19

标题: 超越架构：评估上下文嵌入在社交媒体上检测双相情感障碍中的作用

摘要: 双相情感障碍是一种慢性精神疾病，常常由于早期症状微妙和社会污名而被低估。本文探讨了基于用户生成的社交媒体文本识别双相情感障碍迹象的先进自然语言处理（NLP）模型。我们对基于变换器的模型（BERT、RoBERTa、ALBERT、ELECTRA、DistilBERT）和基于上下文（BERT）和静态（GloVe、Word2Vec）词嵌入的长短期记忆（LSTM）模型进行了全面评估。在确认Reddit帖子的有效性后，我们进行了一系列实验，包括情感变异和判断分析。我们的结果表明，RoBERTa在变换器模型中表现最佳，F1分数约为98％，而使用BERT嵌入的LSTM模型产生几乎相同的结果。相比之下，使用静态嵌入进行训练的LSTM未能捕捉到有意义的模式，F1得分接近于零。这些发现强调了上下文语言建模在检测双相情感障碍中的关键作用。此外，我们报告了模型训练时间，并强调了DistilBERT在效率和准确性之间提供了最佳平衡。总的来说，我们的研究为精神健康NLP应用中的模型选择提供了可操作的见解，并验证了上下文化语言模型支持早期双相情感障碍筛查的潜力。

更新时间: 2025-07-17 05:14:19

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.14231v1

Intent-Based Network for RAN Management with Large Language Models

Advanced intelligent automation becomes an important feature to deal with the increased complexity in managing wireless networks. This paper proposes a novel automation approach of intent-based network for Radio Access Networks (RANs) management by leveraging Large Language Models (LLMs). The proposed method enhances intent translation, autonomously interpreting high-level objectives, reasoning over complex network states, and generating precise configurations of the RAN by integrating LLMs within an agentic architecture. We propose a structured prompt engineering technique and demonstrate that the network can automatically improve its energy efficiency by dynamically optimizing critical RAN parameters through a closed-loop mechanism. It showcases the potential to enable robust resource management in RAN by adapting strategies based on real-time feedback via LLM-orchestrated agentic systems.

Updated: 2025-07-17 04:57:55

标题: 用大型语言模型的意图导向网络用于无线接入网络管理

摘要: 先进的智能自动化成为处理无线网络管理中增加的复杂性的重要特征。本文提出了一种利用大型语言模型（LLMs）的基于意图的网络自动化方法，用于射频接入网络（RANs）管理。所提出的方法通过将LLMs整合到代理体系结构中，增强了意图翻译，自主解释高级目标，对复杂网络状态进行推理，并生成精确的RAN配置。我们提出了一种结构化提示工程技术，并证明网络可以通过闭环机制动态优化关键的RAN参数，从而自动提高能效。通过LLM编排的代理系统，展示了通过根据实时反馈调整策略来实现RAN中强大资源管理的潜力。

更新时间: 2025-07-17 04:57:55

领域: cs.NI,cs.AI

下载: http://arxiv.org/abs/2507.14230v1

Using Modular Arithmetic Optimized Neural Networks To Crack Affine Cryptographic Schemes Efficiently

We investigate the cryptanalysis of affine ciphers using a hybrid neural network architecture that combines modular arithmetic-aware and statistical feature-based learning. Inspired by recent advances in interpretable neural networks for modular arithmetic and neural cryptanalysis of classical ciphers, our approach integrates a modular branch that processes raw ciphertext sequences and a statistical branch that leverages letter frequency features. Experiments on datasets derived from natural English text demonstrate that the hybrid model attains high key recovery accuracy for short and moderate ciphertexts, outperforming purely statistical approaches for the affine cipher. However, performance degrades for very long ciphertexts, highlighting challenges in model generalization.

Updated: 2025-07-17 04:54:10

标题: 使用模数算术优化神经网络有效破解仿射密码方案

摘要: 我们研究了使用混合神经网络架构对仿射密码进行密码分析，该架构结合了模块化算术感知和基于统计特征的学习。受最近可解释的模块化算术神经网络和经典密码的神经密码分析的进展启发，我们的方法集成了一个处理原始密文序列的模块化分支和一个利用字母频率特征的统计分支。从源自自然英语文本的数据集上的实验表明，混合模型对于短期和中等长度的密文实现了高的密钥恢复准确性，优于仿射密码的纯统计方法。然而，对于非常长的密文，性能会下降，突显了模型泛化的挑战。

更新时间: 2025-07-17 04:54:10

领域: cs.CR

下载: http://arxiv.org/abs/2507.14229v1

A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys

As the data volume of astronomical imaging surveys rapidly increases, traditional methods for image anomaly detection, such as visual inspection by human experts, are becoming impractical. We introduce a machine-learning-based approach to detect poor-quality exposures in large imaging surveys, with a focus on the DECam Legacy Survey (DECaLS) in regions of low extinction (i.e., $E(B-V)<0.04$). Our semi-supervised pipeline integrates a vision transformer (ViT), trained via self-supervised learning (SSL), with a k-Nearest Neighbor (kNN) classifier. We train and validate our pipeline using a small set of labeled exposures observed by surveys with the Dark Energy Camera (DECam). A clustering-space analysis of where our pipeline places images labeled in ``good'' and ``bad'' categories suggests that our approach can efficiently and accurately determine the quality of exposures. Applied to new imaging being reduced for DECaLS Data Release 11, our pipeline identifies 780 problematic exposures, which we subsequently verify through visual inspection. Being highly efficient and adaptable, our method offers a scalable solution for quality control in other large imaging surveys.

Updated: 2025-07-17 04:52:05

标题: 一个用于大型成像调查中识别不良曝光的半监督学习方法

摘要: 随着天文成像调查的数据量迅速增加，传统的图像异常检测方法，如人工专家的视觉检查，变得不切实际。我们介绍了一种基于机器学习的方法，用于在大型成像调查中检测质量较差的曝光，重点放在低消光区域（即，$E(B-V)<0.04$）的DECam Legacy Survey（DECaLS）上。我们的半监督流水线集成了一个通过自监督学习（SSL）训练的视觉变换器（ViT）和一个k-最近邻（kNN）分类器。我们使用一小组由Dark Energy Camera（DECam）调查观测到的标记曝光来训练和验证我们的流水线。我们的流水线放置在被标记为“好”和“坏”类别的图像的聚类空间分析表明，我们的方法可以高效且准确地确定曝光的质量。应用于正在减少为DECaLS数据发布11的新成像，我们的流水线识别出780个有问题的曝光，随后我们通过视觉检查进行验证。由于高效且具有适应性，我们的方法为其他大型成像调查中的质量控制提供了可扩展的解决方案。

更新时间: 2025-07-17 04:52:05

领域: astro-ph.IM,cs.AI

下载: http://arxiv.org/abs/2507.12784v1

Demystifying MuZero Planning: Interpreting the Learned Model

MuZero has achieved superhuman performance in various games by using a dynamics network to predict the environment dynamics for planning, without relying on simulators. However, the latent states learned by the dynamics network make its planning process opaque. This paper aims to demystify MuZero's model by interpreting the learned latent states. We incorporate observation reconstruction and state consistency into MuZero training and conduct an in-depth analysis to evaluate latent states across two board games: 9x9 Go and Gomoku, and three Atari games: Breakout, Ms. Pacman, and Pong. Our findings reveal that while the dynamics network becomes less accurate over longer simulations, MuZero still performs effectively by using planning to correct errors. Our experiments also show that the dynamics network learns better latent states in board games than in Atari games. These insights contribute to a better understanding of MuZero and offer directions for future research to improve the performance, robustness, and interpretability of the MuZero algorithm. The code and data are available at https://rlg.iis.sinica.edu.tw/papers/demystifying-muzero-planning.

Updated: 2025-07-17 04:32:51

标题: 揭示MuZero规划的神秘：解释学习模型

摘要: MuZero通过使用动态网络来预测环境动态以进行规划，在各种游戏中取得了超人类的表现，而无需依赖模拟器。然而，动态网络学习的潜在状态使其规划过程变得不透明。本文旨在通过解释学习到的潜在状态来揭示MuZero的模型。我们将观察重构和状态一致性纳入MuZero的训练，并进行深入分析，评估了在两个棋盘游戏（9x9围棋和五子棋）和三个Atari游戏（弹球、吃豆人和乒乓球）中学到的潜在状态。我们的研究结果显示，尽管动态网络在长时间模拟中变得不太准确，MuZero仍通过使用规划来纠正错误而有效地执行。我们的实验证明，动态网络在棋盘游戏中学习到的潜在状态比在Atari游戏中更好。这些发现有助于更好地理解MuZero，并为未来改进MuZero算法的性能、鲁棒性和可解释性提供了方向。代码和数据可在https://rlg.iis.sinica.edu.tw/papers/demystifying-muzero-planning获取。

更新时间: 2025-07-17 04:32:51

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2411.04580v2

A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to https://survey-on-tabular-data.github.io/.

Updated: 2025-07-17 04:31:55

标题: 一项关于电子健康记录建模的综合调查：从深度学习方法到大型语言模型

摘要: 人工智能（AI）已经展示了在通过分析和建模电子健康记录（EHRs）来改造医疗保健领域的重大潜力。然而，EHR数据的固有异质性、时间不规则性和领域特定性提出了与视觉和自然语言任务根本不同的独特挑战。本调查提供了对深度学习、大型语言模型（LLMs）和EHR建模交叉点最新进展的全面概述。我们引入了一个跨越五个关键设计维度的统一分类法：数据中心方法、神经结构设计、学习关注策略、多模态学习和基于LLM的建模系统。在每个维度中，我们审查了代表性方法，涉及数据质量增强、结构和时间表示、自监督学习以及与临床知识整合。我们进一步强调新兴趋势，如基础模型、LLM驱动的临床代理和用于下游推理的EHR到文本翻译。最后，我们讨论了在基准测试、可解释性、临床对齐和跨多样临床设置的泛化方面的挑战。本调查旨在为推进基于AI的EHR建模和临床决策支持提供结构化的路线图。有关EHR相关方法的全面列表，请参考 https://survey-on-tabular-data.github.io/。

更新时间: 2025-07-17 04:31:55

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.12774v1

Local Representative Token Guided Merging for Text-to-Image Generation

Stable diffusion is an outstanding image generation model for text-to-image, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on various contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token, which represents the most representative token per window by computing similarity at a specific timestep and selecting the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.

Updated: 2025-07-17 04:16:24

标题: 本地代表令牌引导的文本到图像生成合并

摘要: 稳定扩散是文本到图像的优秀图像生成模型，但由于注意力操作的二次复杂度，其耗时生成过程仍然是一个挑战。最近的记号合并方法通过减少注意力操作期间的记号数量来提高效率，但往往忽视基于注意力的图像生成模型的特征，限制了它们的有效性。在本文中，我们提出了一种新颖的记号合并策略，即局部代表性记号引导合并（ReToM），适用于图像生成中的任何注意力机制。为了基于各种上下文信息合并记号，ReToM将局部边界定义为注意输入内的窗口，并调整窗口大小。此外，我们引入了一个代表性记号，通过在特定时间步计算相似性并选择具有最高平均相似性的记号来代表每个窗口中最具代表性的记号。这种方法保留了最显著的局部特征，同时最小化计算开销。实验结果表明，与基线相比，ReToM在FID上实现了6.2％的改善，并且具有更高的CLIP分数，同时保持可比较的推理时间。我们从经验上证明，ReToM在平衡视觉质量和计算效率方面是有效的。

更新时间: 2025-07-17 04:16:24

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12771v1

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.

Updated: 2025-07-17 04:08:03

标题: 评论-GRPO：用自然语言和数值反馈推进LLM推理

摘要: 最近对具有数字反馈的强化学习（RL）的进展，如标量奖励，显着增强了大型语言模型（LLMs）的复杂推理能力。尽管取得了成功，我们确定RL仅具有数字反馈时遇到的三个关键挑战：性能平台、自发自省的有限有效性和持续失败。然后，我们证明RL微调模型，即使在表现出性能平台后，也可以通过利用自然语言反馈（批评）在持续失败的问题上生成正确的改进。基于这一洞察力，我们提出了Critique-GRPO，这是一个在线RL框架，它同时整合了自然语言和数字反馈，用于有效的策略优化。Critique-GRPO使LLMs能够同时从初始回答和批评引导的自我完善中学习，同时保持探索。此外，我们使用一个塑形函数来增强从正确，尤其是不熟悉的改进中学习，并惩罚不正确的改进。通过对Qwen2.5-7B-Base、Qwen2.5-Math-7B-Base和Qwen3-8B的广泛实验表明，Critique-GRPO在八个具有挑战性的数学、STEM和一般推理任务中始终优于监督学习和基于RL的微调方法，将平均pass@1分数分别提高了约4.4%和3.8%。值得注意的是，Critique-GRPO通过自我批评和弱到强的泛化，实现了有效的自我改善，实现了持续的收益，如在AIME 2024上分别提高了16.7%和10.0%的pass@1。

更新时间: 2025-07-17 04:08:03

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2506.03106v4

Synergy: End-to-end Concept Model

In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.

Updated: 2025-07-17 04:01:28

标题: 协同作用：端到端概念模型

摘要: 在本文中，我们提出了Synergy，一种通过学习路由机制以端到端方式桥接不同抽象级别的语言模型。我们专注于低级语言抽象，将我们的模型训练为字节级语言模型。我们的模型自发学习对字节进行标记，产生比字节级字节对编码器（BBPE）分词器更少的概念标记，同时保持可比性能。通过与Llama3进行比较，我们观察到在相同的模型规模和训练数据集大小下，Synergy具有优势。进一步研究表明，我们模型的中间部分（更高层次的抽象部分）在去除位置编码时表现更好，表明位置无关概念的出现。这些发现证明了无分词器架构的可行性，为更强大和灵活的管道铺平了道路。

更新时间: 2025-07-17 04:01:28

领域: cs.CL,cs.AI,I.2.7

下载: http://arxiv.org/abs/2507.12769v1

VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaroration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

Updated: 2025-07-17 03:52:15

标题: VIDEE：视觉和交互式文本分析智能代理的分解、执行和评估

摘要: 文本分析传统上需要自然语言处理（NLP）或文本分析方面的专业知识，这对初级分析师构成了障碍。最近大型语言模型（LLMs）的进步改变了NLP的格局，使得更多人能够更轻松地进行自动文本分析（如主题检测、摘要、信息提取等）。我们介绍了VIDEE，这是一个支持初级数据分析师进行高级文本分析的系统，具有智能代理。VIDEE实例化了一个人-代理协作工作流程，包括三个阶段：（1）分解，其中包括一个人在循环的蒙特卡洛树搜索算法，以支持与人类反馈的生成推理，（2）执行，生成可执行的文本分析流水线，和（3）评估，将基于LLM的评估和可视化整合在一起，以支持用户对执行结果的验证。我们进行了两个定量实验来评估VIDEE的有效性，并分析了常见的代理错误。一项涉及具有不同NLP和文本分析经验水平的参与者的用户研究--从零到专家--展示了系统的可用性，并揭示了明显的用户行为模式。研究结果确定了人-代理协作的设计含义，验证了VIDEE对非专业用户的实际效用，并为智能文本分析系统的未来改进提供了信息。

更新时间: 2025-07-17 03:52:15

领域: cs.CL,cs.AI,cs.HC

下载: http://arxiv.org/abs/2506.21582v2

Autonomy for Older Adult-Agent Interaction

As the global population ages, artificial intelligence (AI)-powered agents have emerged as potential tools to support older adults' caregiving. Prior research has explored agent autonomy by identifying key interaction stages in task processes and defining the agent's role at each stage. However, ensuring that agents align with older adults' autonomy preferences remains a critical challenge. Drawing on interdisciplinary conceptualizations of autonomy, this paper examines four key dimensions of autonomy for older adults: decision-making autonomy, goal-oriented autonomy, control autonomy, and social responsibility autonomy. This paper then proposes the following research directions: (1) Addressing social responsibility autonomy, which concerns the ethical and social implications of agent use in communal settings; (2) Operationalizing agent autonomy from the task perspective; and (3) Developing autonomy measures.

Updated: 2025-07-17 03:46:13

标题: 老年人与智能代理互动的自主性

摘要: 随着全球人口老龄化，基于人工智能（AI）的代理人已经成为支持老年人护理的潜在工具。先前的研究探讨了代理人的自主性，通过识别任务过程中的关键交互阶段，并定义了代理人在每个阶段的角色。然而，确保代理人与老年人的自主性偏好保持一致仍然是一个关键挑战。借鉴跨学科对自主性的概念化，本文研究了老年人自主性的四个关键维度：决策自主性、目标导向自主性、控制自主性和社会责任自主性。本文随后提出了以下研究方向：（1）解决社会责任自主性，涉及代理人在集体环境中使用的道德和社会影响；（2）从任务角度明确代理人自主性的操作化；（3）开发自主性测量工具。

更新时间: 2025-07-17 03:46:13

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2507.12767v1

TBDetector:Transformer-Based Detector for Advanced Persistent Threats with Provenance Graph

APT detection is difficult to detect due to the long-term latency, covert and slow multistage attack patterns of Advanced Persistent Threat (APT). To tackle these issues, we propose TBDetector, a transformer-based advanced persistent threat detection method for APT attack detection. Considering that provenance graphs provide rich historical information and have the powerful attacks historic correlation ability to identify anomalous activities, TBDetector employs provenance analysis for APT detection, which summarizes long-running system execution with space efficiency and utilizes transformer with self-attention based encoder-decoder to extract long-term contextual features of system states to detect slow-acting attacks. Furthermore, we further introduce anomaly scores to investigate the anomaly of different system states, where each state is calculated with an anomaly score corresponding to its similarity score and isolation score. To evaluate the effectiveness of the proposed method, we have conducted experiments on five public datasets, i.e., streamspot, cadets, shellshock, clearscope, and wget_baseline. Experimental results and comparisons with state-of-the-art methods have exhibited better performance of our proposed method.

Updated: 2025-07-17 03:37:05

标题: TBDetector：基于变换器的检测器，用于具有溯源图的高级持久性威胁

摘要: 由于高级持续性威胁（Advanced Persistent Threat, APT）具有长期潜伏、隐蔽和缓慢的多阶段攻击模式，因此APT检测难度较大。为了解决这些问题，我们提出了TBDetector，一种基于Transformer的高级持续性威胁检测方法，用于APT攻击检测。考虑到溯源图提供了丰富的历史信息，并具有强大的攻击历史相关性能力来识别异常活动，TBDetector采用溯源分析进行APT检测，以高效利用空间总结系统长期执行，并利用基于自注意力的编码器-解码器的Transformer来提取系统状态的长期上下文特征，以检测缓慢作用的攻击。此外，我们进一步引入异常分数来研究不同系统状态的异常性，其中每个状态根据其相似性分数和隔离分数计算出相应的异常分数。为了评估所提出的方法的有效性，我们在五个公共数据集（streamspot、cadets、shellshock、clearscope和wget_baseline）上进行了实验。实验结果和与最先进方法的比较显示了我们提出的方法的更好性能。

更新时间: 2025-07-17 03:37:05

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2304.02838v2

Aime: Towards Fully-Autonomous Multi-Agent Framework

Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) are emerging as a powerful paradigm for solving complex, multifaceted problems. However, the potential of these systems is often constrained by the prevalent plan-and-execute framework, which suffers from critical limitations: rigid plan execution, static agent capabilities, and inefficient communication. These weaknesses hinder their adaptability and robustness in dynamic environments. This paper introduces Aime, a novel multi-agent framework designed to overcome these challenges through dynamic, reactive planning and execution. Aime replaces the conventional static workflow with a fluid and adaptive architecture. Its core innovations include: (1) a Dynamic Planner that continuously refines the overall strategy based on real-time execution feedback; (2) an Actor Factory that implements Dynamic Actor instantiation, assembling specialized agents on-demand with tailored tools and knowledge; and (3) a centralized Progress Management Module that serves as a single source of truth for coherent, system-wide state awareness. We empirically evaluated Aime on a diverse suite of benchmarks spanning general reasoning (GAIA), software engineering (SWE-bench Verified), and live web navigation (WebVoyager). The results demonstrate that Aime consistently outperforms even highly specialized state-of-the-art agents in their respective domains. Its superior adaptability and task success rate establish Aime as a more resilient and effective foundation for multi-agent collaboration.

Updated: 2025-07-17 03:34:27

标题: Aime：朝向完全自主的多智能体框架

摘要: 由大型语言模型（LLMs）驱动的多智能体系统（MAS）正在成为解决复杂、多方面问题的强大范式。然而，这些系统的潜力往往受到流行的计划和执行框架的限制，该框架存在严重局限性：刚性计划执行、静态智能体能力和低效沟通。这些弱点阻碍了它们在动态环境中的适应性和鲁棒性。本文介绍了Aime，一种旨在通过动态、反应式规划和执行克服这些挑战的新型多智能体框架。Aime用流动和适应性架构替换了传统的静态工作流程。其核心创新包括：（1）一个动态规划器，根据实时执行反馈持续优化整体策略；（2）一个演员工厂，实现动态演员实例化，根据需要组装专门的智能体，并提供定制工具和知识；以及（3）一个集中的进展管理模块，作为系统范围内一致的状态意识的单一真相来源。我们在跨越一系列基准测试的多样化基准套件上对Aime进行了实证评估，包括一般推理（GAIA）、软件工程（SWE-bench Verified）和实时网页导航（WebVoyager）。结果表明，Aime在各自领域中始终优于甚至是高度专门化的最新智能体。其卓越的适应性和任务成功率将Aime确立为多智能体协作更为弹性和有效的基础。

更新时间: 2025-07-17 03:34:27

领域: cs.AI

下载: http://arxiv.org/abs/2507.11988v2

Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation

Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in enhancing human-computer interaction through immersive and empathetic engagement.With the advancement of multimodal large language models, the driving signals for emotional talking-head generation has shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to achieve natural emotional expressiveness.This study proposes the Think-Before-Draw framework to address two key challenges: (1) In-depth semantic parsing of emotions--by innovatively introducing Chain-of-Thought (CoT), abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) Fine-grained expressiveness optimization--inspired by artists' portrait painting process, a progressive guidance denoising strategy is proposed, employing a "global emotion localization--local muscle control" mechanism to refine micro-expression dynamics in generated videos.Our experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model's zero-shot generation capability.

Updated: 2025-07-17 03:33:46

标题: 考虑后再绘制：分解情绪语义和细粒度可控表达的说话头生成

摘要: 情感化的说话头生成已经成为计算机视觉和多模态人工智能交叉领域的关键研究领域，其核心价值在于通过沉浸式和共情性互动增强人机交互。随着多模态大语言模型的进展，情感化说话头生成的驱动信号已经从音频和视频转变为更灵活的文本。然而，当前的文本驱动方法依赖于预定义的离散情感标签文本，过于简化了真实面部肌肉运动的动态复杂性，因此无法实现自然的情感表达。本研究提出了Think-Before-Draw框架来解决两个关键挑战：（1）情感的深度语义解析--通过创新地引入“Chain-of-Thought”（CoT），将抽象情感标签转化为生理基础的面部肌肉运动描述，实现了从高级语义到可执行动作特征的映射；以及（2）细粒度表现优化--受到艺术家肖像绘画过程的启发，提出了一个渐进引导去噪策略，采用“全局情感定位--局部肌肉控制”机制来优化生成视频中的微表情动态。我们的实验表明，我们的方法在广泛使用的基准数据集上取得了最先进的性能，包括MEAD和HDTF。此外，我们收集了一组肖像图像来评估我们模型的零样本生成能力。

更新时间: 2025-07-17 03:33:46

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12761v1

Unified Medical Image Segmentation with State Space Modeling Snake

Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake's superior performance, with an average Dice improvement of 3\% over state-of-the-art methods.

Updated: 2025-07-17 03:32:32

标题: 具有状态空间建模蛇的统一医学图像分割

摘要: 统一医学图像分割（UMIS）对于全面的解剖评估至关重要，但面临多尺度结构异质性的挑战。传统的基于像素的方法缺乏对象级解剖见解和器官间关系建模，往往在形态复杂性和特征冲突方面遇到困难，限制了它们在UMIS中的有效性。我们提出了Mamba Snake，一种新颖的深度蛇框架，通过状态空间建模增强了UMIS。Mamba Snake将多轮廓演变框架构建为一个分层状态空间图谱，有效地建模了宏观器官间的拓扑关系和微观轮廓的精细调整。我们引入了一个蛇特定的视觉状态空间模块，Mamba Evolution Block（MEB），利用有效的时空信息聚合来适应复杂形态的精细调整。能量图形先验进一步确保在异质数据中稳健的长程轮廓演变。此外，还引入了一个双分类协同机制，同时优化检测和分割，减轻了UMIS中微结构的欠分割现象。对五个临床数据集进行的广泛评估显示，Mamba Snake的性能优越，平均Dice改进率比最先进的方法高出3\%。

更新时间: 2025-07-17 03:32:32

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12760v1

Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logits arithmetic (Liu et al., 2024) to tune a target large LM for long reasoning using a substantially smaller model as guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model -- a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in pass@1 by 26% and 29%, respectively, over four mathematical datasets using the Qwen2.5-32B when guided by R1-Distill-Qwen-1.5B -- a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, improving pass@1 by 13% relative compared to the Qwen2.5-32B base model. Our work presents a computationally-efficient method to elicit long reasoning in large models with minimal or no additional training.

Updated: 2025-07-17 03:31:36

标题: 对数运算引发长时间推理能力而无需培训

摘要: 大推理模型（LRMs）可以通过长链推理（CoT）进行复杂推理，涉及认知策略，如回溯和自我纠正。最近的研究表明，一些模型固有地具有这些长期推理能力，可以通过额外的训练来解锁。我们的工作首先调查是否可以在没有任何训练的情况下引发这种行为。为此，我们提出了一种解码时间方法ThinkLogit，它利用逻辑算术（Liu等，2024）来调整目标大型LM，以进行长期推理，使用一个明显较小的模型作为引导者。然后，我们展示我们可以通过训练引导模型，通过对从目标和引导模型中采样的正确/错误推理对进行偏好优化，进一步提高性能。我们称之为ThinkLogit-DPO。我们的实验表明，ThinkLogit和ThinkLogit-DPO分别通过四个数学数据集使用Qwen2.5-32B时，由R1-Distill-Qwen-1.5B引导的模型相对于基础模型提高了26%和29%的通过@1。最后，我们展示ThinkLogit可以通过强化学习获得长期推理技能的转移，相对于Qwen2.5-32B基础模型，通过@1提高了13%。我们的工作提出了一种在大模型中引发长期推理的计算效率方法，几乎不需要额外的训练。

更新时间: 2025-07-17 03:31:36

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.12759v1

Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis

Detecting synthetic from real speech is increasingly crucial due to the risks of misinformation and identity impersonation. While various datasets for synthetic speech analysis have been developed, they often focus on specific areas, limiting their utility for comprehensive research. To fill this gap, we propose the Speech-Forensics dataset by extensively covering authentic, synthetic, and partially forged speech samples that include multiple segments synthesized by different high-quality algorithms. Moreover, we propose a TEmporal Speech LocalizaTion network, called TEST, aiming at simultaneously performing authenticity detection, multiple fake segments localization, and synthesis algorithms recognition, without any complex post-processing. TEST effectively integrates LSTM and Transformer to extract more powerful temporal speech representations and utilizes dense prediction on multi-scale pyramid features to estimate the synthetic spans. Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level. At the segment level, it attains an EER of 1.07% and a 92.19% F1 score. These results highlight the model's robust capability for a comprehensive analysis of synthetic speech, offering a promising avenue for future research and practical applications in this field.

Updated: 2025-07-17 03:29:13

标题: 言语取证：朝着全面合成言语数据集的建立和分析方向

摘要: 检测合成语音与真实语音的区分日益关键，因为存在着误导和身份冒充的风险。虽然已经开发了各种用于合成语音分析的数据集，但它们通常侧重于特定领域，限制了它们在全面研究中的实用性。为了填补这一空白，我们提出了Speech-Forensics数据集，该数据集广泛涵盖了真实、合成和部分伪造的语音样本，包括由不同高质量算法合成的多个片段。此外，我们提出了一种名为TEST的TEmporal Speech LocalizaTion网络，旨在同时进行真实性检测、多个虚假片段定位和合成算法识别，而无需进行任何复杂的后处理。TEST有效地集成了LSTM和Transformer以提取更强大的时间语音表示，并利用多尺度金字塔特征上的密集预测来估计合成跨度。我们的模型在话语级别实现了83.55%的平均mAP和5.25%的EER。在片段级别上，它取得了1.07%的EER和92.19%的F1分数。这些结果突显了该模型对合成语音的全面分析能力，为未来研究和实践应用提供了一个有前途的途径。

更新时间: 2025-07-17 03:29:13

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2412.09032v3

Learning Universal Human Mobility Patterns with a Foundation Model for Cross-domain Data Fusion

Human mobility modeling is critical for urban planning and transportation management, yet existing approaches often lack the integration capabilities needed to handle diverse data sources. We present a foundation model framework for universal human mobility patterns that leverages cross-domain data fusion and large language models to address these limitations. Our approach integrates multi-modal data of distinct nature and spatio-temporal resolution, including geographical, mobility, socio-demographic, and traffic information, to construct a privacy-preserving and semantically enriched human travel trajectory dataset. Our framework demonstrates adaptability through domain transfer techniques that ensure transferability across diverse urban contexts, as evidenced in case studies of Los Angeles (LA) and Egypt. The framework employs LLMs for semantic enrichment of trajectory data, enabling comprehensive understanding of mobility patterns. Quantitative evaluation shows that our generated synthetic dataset accurately reproduces mobility patterns observed in empirical data. The practical utility of this foundation model approach is demonstrated through large-scale traffic simulations for LA County, where results align well with observed traffic data. On California's I-405 corridor, the simulation yields a Mean Absolute Percentage Error of 5.85% for traffic volume and 4.36% for speed compared to Caltrans PeMS observations, illustrating the framework's potential for intelligent transportation systems and urban mobility applications.

Updated: 2025-07-17 02:52:37

标题: 学习通用人类移动模式：跨领域数据融合基础模型

摘要: 人类流动模型对城市规划和交通管理至关重要，然而现有方法往往缺乏处理多样数据源所需的整合能力。我们提出了一个基础模型框架，用于普适的人类流动模式，利用跨领域数据融合和大型语言模型来解决这些限制。我们的方法整合了不同性质和时空分辨率的多模态数据，包括地理、流动、社会人口统计和交通信息，以构建一个隐私保护和语义丰富的人类出行轨迹数据集。我们的框架通过领域转移技术展示了适应性，确保在洛杉矶（LA）和埃及等不同城市背景下的可转移性。该框架利用LLMs对轨迹数据进行语义丰富化，实现对流动模式的全面理解。定量评估表明，我们生成的合成数据集准确地复制了经验数据中观察到的流动模式。该基础模型方法的实际效用通过针对洛杉矶县的大规模交通模拟得到展示，结果与观察到的交通数据相符。在加州I-405走廊上，与Caltrans PeMS观测相比，模拟对交通量的平均绝对百分比误差为5.85%，对速度为4.36%，展示了该框架在智能交通系统和城市流动应用中的潜力。

更新时间: 2025-07-17 02:52:37

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.15779v2

Transformer-based Spatial Grounding: A Comprehensive Survey

Spatial grounding, the process of associating natural language expressions with corresponding image regions, has rapidly advanced due to the introduction of transformer-based models, significantly enhancing multimodal representation and cross-modal alignment. Despite this progress, the field lacks a comprehensive synthesis of current methodologies, dataset usage, evaluation metrics, and industrial applicability. This paper presents a systematic literature review of transformer-based spatial grounding approaches from 2018 to 2025. Our analysis identifies dominant model architectures, prevalent datasets, and widely adopted evaluation metrics, alongside highlighting key methodological trends and best practices. This study provides essential insights and structured guidance for researchers and practitioners, facilitating the development of robust, reliable, and industry-ready transformer-based spatial grounding models.

Updated: 2025-07-17 02:44:01

标题: 基于Transformer的空间指代：一项全面调查

摘要: 空间基础，即将自然语言表达与相应图像区域关联起来的过程，由于引入了基于变压器的模型，已经取得了快速进展，显著增强了多模态表示和跨模态对齐。尽管取得了这一进展，但该领域缺乏对当前方法、数据集使用、评估指标和工业适用性的综合综合。本文对2018年至2025年基于变压器的空间基础方法进行了系统性文献综述。我们的分析确定了主导模型架构、普遍数据集和广泛采用的评估指标，同时突出了关键方法趋势和最佳实践。这项研究为研究人员和从业者提供了重要的见解和结构化指导，促进了健壮、可靠和业界准备就绪的基于变压器的空间基础模型的发展。

更新时间: 2025-07-17 02:44:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.12739v1

A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique

We propose a privacy-preserving semantic-segmentation method for applying perceptual encryption to images used for model training in addition to test images. This method also provides almost the same accuracy as models without any encryption. The above performance is achieved using a domain-adaptation technique on the embedding structure of the Vision Transformer (ViT). The effectiveness of the proposed method was experimentally confirmed in terms of the accuracy of semantic segmentation when using a powerful semantic-segmentation model with ViT called Segmentation Transformer.

Updated: 2025-07-17 02:14:50

标题: 一种隐私保护的语义分割方法：使用领域适应技术

摘要: 我们提出了一种隐私保护的语义分割方法，用于将感知加密应用于用于模型训练的图像以及测试图像。该方法还提供了几乎与没有任何加密的模型相同的准确性。以上性能是通过在Vision Transformer（ViT）的嵌入结构上使用领域自适应技术实现的。所提出的方法的有效性在使用名为分割Transformer的ViT的强大语义分割模型时，从语义分割准确性方面在实验中得到了确认。

更新时间: 2025-07-17 02:14:50

领域: cs.CV,cs.CR

下载: http://arxiv.org/abs/2507.12730v1

KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos

We propose \textbf{KeyRe-ID}, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73\% mAP and 97.32\% Rank-1 accuracy on MARS, and 96.00\% Rank-1 and 100.0\% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.

Updated: 2025-07-17 02:04:22

标题: KeyRe-ID：使用视频中的部分感知表示的关键点引导的人员重新识别

摘要: 我们提出了KeyRe-ID，一个基于关键点指导的视频人员重新识别框架，包括全局和局部分支，利用人体关键点来增强时空表示学习。全局分支通过基于Transformer的时间聚合捕获整体身份语义，而局部分支基于关键点动态分割身体区域以生成细粒度、部分感知特征。在MARS和iLIDS-VID基准上进行了大量实验，表现出最先进的性能，实现了MARS上91.73％的mAP和97.32％的Rank-1精度，以及iLIDS-VID上96.00％的Rank-1和100.0％的Rank-5精度。此工作的代码将在出版后公开发布在GitHub上。

更新时间: 2025-07-17 02:04:22

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.07393v3

BEARCUBS: A benchmark for computer-using web agents

Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 23.4% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.

Updated: 2025-07-17 01:50:49

标题: BEARCUBS：一个用于计算机使用的网络代理的基准

摘要: 现代网络代理具有计算机使用能力，使它们能够通过向虚拟键盘和鼠标发送命令与网页进行交互。虽然这种代理在协助人类用户完成复杂任务方面具有相当大的潜力，但在真实世界环境中评估它们的能力却是一个重大挑战。为此，我们引入了BEARCUBS，一个由111个信息搜索问题组成的“小而强大”的基准，旨在评估网络代理在搜索、浏览和从网络中识别事实信息方面的能力。与之前的网络代理基准不同，解决BEARCUBS需要访问实时网络内容而非合成或模拟页面，这捕捉了真实世界网络交互的不可预测性；并且需要执行广泛的多模态交互（例如视频理解、3D导航），不能通过文本工作进行绕过。BEARCUBS中的每个问题都有一个相应的简短、明确的答案和一个经过人工验证的浏览轨迹，可透明评估代理的性能和策略。一项人类研究证实，BEARCUBS问题是可以解决但并非简单的（人类准确率为84.7%），揭示了领域知识差距和常见失败点中的细节被忽视。相比之下，最先进的计算机使用代理表现不佳，最高得分系统（OpenAI的Operator）仅达到23.4%的准确率。这些结果突显了需要改进的关键领域，包括可靠的来源选择和更强大的多模态能力。为了促进未来研究，BEARCUBS将定期更新以替换无效或受污染的问题，使基准保持新鲜，以供未来一代网络代理使用。

更新时间: 2025-07-17 01:50:49

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.07919v2

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings that address real-world questions about actual codebases. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions with success rates of 70-83%, they resolve only up to 16.49% of CAB's recent issues. This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions.

Updated: 2025-07-17 01:38:49

标题: CodeAssistBench（CAB）：用于多轮聊天式代码辅助的数据集和基准测试

摘要: 由大型语言模型驱动的编程助手已经改变了软件开发，然而大多数基准测试集中在代码生成任务上。最近的工作，如InfiBench和StackEval试图填补这一空白，利用Stack Overflow数据，但仍然局限于孤立环境中的单轮交互，需要大量手动筛选，并且无法代表完整的项目环境。我们引入了CodeAssistBench（CAB），这是第一个用于在真实环境中评估多轮编程辅助的基准框架，解决关于实际代码库的真实世界问题。与现有的编程问答基准测试不同，CAB使用可配置的参数（例如存储库创建日期、星级、编程语言）自动从与问题相关的GitHub问题生成可扩展的数据集，并包括用于评估的代码库的自动容器化。然后通过模拟用户在这些容器化环境中进行评估模型，并具有完全代码库访问权限。利用这个框架，我们构建了一个测试集，涵盖231个存储库，跨越七种编程语言和不同的问题领域，包括3286个真实世界的编程问题。我们对领先的LLM进行评估，发现存在实质性的能力差距：虽然模型在Stack Overflow问题上表现良好，成功率为70-83%，但只解决了最近CAB问题的16.49%。这种差异突显了在复杂的、项目特定环境中提供帮助与回答独立问题的挑战。

更新时间: 2025-07-17 01:38:49

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.10646v2

Multi-View Node Pruning for Accurate Graph Representation

Graph pooling, which compresses a whole graph into a smaller coarsened graph, is an essential component of graph representation learning. To efficiently compress a given graph, graph pooling methods often drop their nodes with attention-based scoring with the task loss. However, this often results in simply removing nodes with lower degrees without consideration of their feature-level relevance to the given task. To fix this problem, we propose a Multi-View Pruning(MVP), a graph pruning method based on a multi-view framework and reconstruction loss. Given a graph, MVP first constructs multiple graphs for different views either by utilizing the predefined modalities or by randomly partitioning the input features, to consider the importance of each node in diverse perspectives. Then, it learns the score for each node by considering both the reconstruction and the task loss. MVP can be incorporated with any hierarchical pooling framework to score the nodes. We validate MVP on multiple benchmark datasets by coupling it with two graph pooling methods, and show that it significantly improves the performance of the base graph pooling method, outperforming all baselines. Further analysis shows that both the encoding of multiple views and the consideration of reconstruction loss are the key to the success of MVP, and that it indeed identifies nodes that are less important according to domain knowledge.

Updated: 2025-07-17 01:33:12

标题: 多视图节点修剪以获得准确的图表示

摘要: 图池化将整个图压缩成一个更小的粗化图，是图表示学习的一个重要组成部分。为了高效地压缩给定的图，图池化方法通常会使用基于注意力评分的节点丢弃方法与任务损失一起工作。然而，这往往会导致仅仅删除具有较低度数的节点，而没有考虑它们在给定任务中的特征相关性。为了解决这个问题，我们提出了一种基于多视图框架和重构损失的图修剪方法 Multi-View Pruning（MVP）。给定一个图，MVP首先通过利用预定义的模态或随机分割输入特征来为不同视图构建多个图，以考虑每个节点在不同视角下的重要性。然后，它通过同时考虑重构和任务损失来学习每个节点的得分。MVP可以与任何层次池化框架结合使用来为节点评分。我们通过将MVP与两种图池化方法结合，并在多个基准数据集上验证了它，结果表明它显著提高了基本图池化方法的性能，超过了所有基准线。进一步的分析表明，多视图编码和重构损失的考虑是MVP成功的关键，它确实能够识别根据领域知识不太重要的节点。

更新时间: 2025-07-17 01:33:12

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.11737v4

Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening

The increasing use of generative AI for resume screening is predicated on the assumption that it offers an unbiased alternative to biased human decision-making. However, this belief fails to address a critical question: are these AI systems fundamentally competent at the evaluative tasks they are meant to perform? This study investigates the question of competence through a two-part audit of eight major AI platforms. Experiment 1 confirmed complex, contextual racial and gender biases, with some models penalizing candidates merely for the presence of demographic signals. Experiment 2, which evaluated core competence, provided a critical insight: some models that appeared unbiased were, in fact, incapable of performing a substantive evaluation, relying instead on superficial keyword matching. This paper introduces the "Illusion of Neutrality" to describe this phenomenon, where an apparent lack of bias is merely a symptom of a model's inability to make meaningful judgments. This study recommends that organizations and regulators adopt a dual-validation framework, auditing AI hiring tools for both demographic bias and demonstrable competence to ensure they are both equitable and effective.

Updated: 2025-07-17 01:30:09

标题: 公平并不足够：在基于人工智能的简历筛选中审计能力和交叉偏见

摘要: 随着生成式人工智能在简历筛选中的日益广泛应用，人们认为它提供了一种对有偏见的人类决策的无偏见替代方案。然而，这种信念未能解决一个关键问题：这些人工智能系统是否基本能胜任它们所要执行的评估任务？本研究通过对八个主要人工智能平台的两部分审计来探讨这个能力问题。实验1证实了复杂的、上下文相关的种族和性别偏见，一些模型仅因存在人口统计信号而对候选人进行惩罚。实验2评估了核心能力，给出了一个关键的见解：一些看似无偏见的模型实际上无法进行实质性评估，而是依赖表面的关键字匹配。本文引入了“中立幻觉”来描述这种现象，即无偏见的表象实际上是模型无法做出有意义判断的症状。本研究建议组织和监管机构采用双重验证框架，对人工智能招聘工具进行人口统计偏见和可证实能力的审计，以确保它们既公平又有效。

更新时间: 2025-07-17 01:30:09

领域: cs.CY,cs.AI,cs.CL,I.2.1; K.4.2; I.2.6; K.4.1

下载: http://arxiv.org/abs/2507.11548v2

Scaling Trends for Data Poisoning in LLMs

LLMs produce harmful and undesirable behavior when trained on datasets containing even a small fraction of poisoned data. We demonstrate that GPT models remain vulnerable to fine-tuning on poisoned data, even when safeguarded by moderation systems. Given the persistence of data poisoning vulnerabilities in today's most capable models, this paper investigates whether these risks increase with model scaling. We evaluate three threat models -- malicious fine-tuning, imperfect data curation, and intentional data contamination -- across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.

Updated: 2025-07-17 01:19:42

标题: LLMs中数据毒化的扩展趋势

摘要: LLMs 在包含少量有毒数据集的情况下进行训练时会产生有害和不良行为。我们证明了即使在受到调节系统保护的情况下，GPT 模型仍然容易受到有毒数据的微调影响。考虑到当今最先进模型中数据污染漏洞的持续存在，本文调查了这些风险是否会随着模型规模的增加而增加。我们评估了三种威胁模型 -- 恶意微调、不完善的数据筛选和故意数据污染 -- 在从 15 亿到 720 亿参数的 24 个前沿 LLM 模型上。我们的实验发现，较大的 LLM 模型更容易受到数据污染的影响，甚至在少量有害数据的情况下更快地学习到有害行为，比较小的模型更加敏感。这些发现强调了领先的人工智能公司在公开发布前需要彻底测试微调 API，并开发更加健壮的防御措施来防止数据污染，特别是在模型规模和能力继续增加的情况下。

更新时间: 2025-07-17 01:19:42

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2408.02946v6

ActionStudio: A Lightweight Framework for Data and Training of Large Action Models

Large Action models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9x higher throughput compared to existing agentic training frameworks, and our trained models yield top performances across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories. Code: https://github.com/SalesforceAIResearch/xLAM.

Updated: 2025-07-17 01:19:22

标题: ActionStudio: 一个用于大型动作模型数据和训练的轻量级框架

摘要: 大型动作模型对于使自主代理能够执行复杂任务至关重要。然而，由于代理环境的多样性和嘈杂的代理数据的复杂性，训练这种模型仍然具有挑战性。现有基础设施对于可扩展的、特定于代理的微调和标准化的代理数据处理提供了有限支持。我们引入了ActionStudio，这是一个轻量级且可扩展的数据和训练框架，专为大型动作模型设计。ActionStudio使用我们提出的Unified Format 2.0统一了各种代理轨迹，支持一系列训练工作流程，并具有优化的多节点分布式设置，并集成了强大的预处理和实时验证工具。与现有的代理训练框架相比，ActionStudio的吞吐量提高了多达9倍，我们训练的模型在公共和现实代理基准测试中表现出色。为了支持更广泛的研究社区，我们开源了ActionStudio框架，并发布了actionstudio-98k，一个由98k条高质量轨迹组成的策划数据集。代码：https://github.com/SalesforceAIResearch/xLAM。

更新时间: 2025-07-17 01:19:22

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.22673v3

LLM-RecG: A Semantic Bias-Aware Framework for Zero-Shot Sequential Recommendation

Zero-shot cross-domain sequential recommendation (ZCDSR) enables predictions in unseen domains without additional training or fine-tuning, addressing the limitations of traditional models in sparse data environments. Recent advancements in large language models (LLMs) have significantly enhanced ZCDSR by facilitating cross-domain knowledge transfer through rich, pretrained representations. Despite this progress, domain semantic bias -- arising from differences in vocabulary and content focus between domains -- remains a persistent challenge, leading to misaligned item embeddings and reduced generalization across domains. To address this, we propose a novel semantic bias-aware framework that enhances LLM-based ZCDSR by improving cross-domain alignment at both the item and sequential levels. At the item level, we introduce a generalization loss that aligns the embeddings of items across domains (inter-domain compactness), while preserving the unique characteristics of each item within its own domain (intra-domain diversity). This ensures that item embeddings can be transferred effectively between domains without collapsing into overly generic or uniform representations. At the sequential level, we develop a method to transfer user behavioral patterns by clustering source domain user sequences and applying attention-based aggregation during target domain inference. We dynamically adapt user embeddings to unseen domains, enabling effective zero-shot recommendations without requiring target-domain interactions...

Updated: 2025-07-17 01:08:34

标题: LLM-RecG：一种语义偏差感知的零样本序列推荐框架

摘要: 零样本跨领域顺序推荐（ZCDSR）能够在未知领域中进行预测，无需额外训练或微调，解决了传统模型在稀疏数据环境中的局限性。最近大型语言模型（LLMs）的进展显著增强了ZCDSR，通过丰富的预训练表示促进跨领域知识转移。尽管取得了进展，但由于领域间词汇和内容焦点的差异导致的领域语义偏差仍然是一个持续的挑战，导致项目嵌入不对齐和跨领域推广降低。为了解决这个问题，我们提出了一个新颖的语义偏差感知框架，通过改善项目和序列水平的跨领域对齐来增强基于LLM的ZCDSR。在项目级别上，我们引入了一个泛化损失，将项目的嵌入在领域间对齐（跨领域紧凑），同时保留每个项目在自己领域内的独特特征（领域内多样性）。这确保项目嵌入可以在领域间有效传输而不会崩溃为过于通用或统一的表示。在序列级别上，我们开发了一种方法，通过聚类源域用户序列并在目标领域推断期间应用基于注意力的聚合来转移用户行为模式。我们动态调整用户嵌入到未知领域，实现了有效的零样本推荐，无需目标领域交互。

更新时间: 2025-07-17 01:08:34

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2501.19232v2

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine

Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with a minimal loss of the downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.

Updated: 2025-07-17 00:32:07

标题: 机器专用音频编码：机器学习的潜在特征是该机器的编码

摘要: 神经音频编解码器利用量化算法，显著影响了各种语音/音频任务。尽管高保真重建对人类感知至关重要，但用于机器的音频编码（ACoM）优先考虑高效压缩和下游任务性能，忽视了感知细微差别。本研究介绍了一种有效的ACoM方法，可以压缩和量化已经训练的语音/音频下游模型的任何选择的中间特征表示。我们的方法在任务特定损失指导以及残差向量量化（RVQ）损失的基础上提供超低比特率（即小于200 bps），并最小化下游模型性能的损失。所得的标记器可适应各种比特率和模型大小，实现灵活部署。在自动语音识别和音频分类方面进行评估，我们的方法展示了其有效性和潜力，通过适当的正则化，具有更广泛的任务和架构适用性。

更新时间: 2025-07-17 00:32:07

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.12701v1