    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        

Articles: 0

Last Updated: N/A (+00:00)

BARNN: A Bayesian Autoregressive and Recurrent Neural Network

Autoregressive and recurrent networks have achieved remarkable progress across various fields, from weather forecasting to molecular generation and Large Language Models. Despite their strong predictive capabilities, these models lack a rigorous framework for addressing uncertainty, which is key in scientific applications such as PDE solving, molecular generation, and Machine Learning Force Fields. To address this shortcoming, we present BARNN: a variational Bayesian Autoregressive and Recurrent Neural Network. BARNN aims to provide a principled way to turn any autoregressive or recurrent model into its Bayesian version. BARNN is based on the variational dropout method, which allows it to be applied to large recurrent neural networks as well. We also introduce a temporal version of the "Variational Mixtures of Posteriors" prior (tVAMP-prior) to make Bayesian inference efficient and well-calibrated. Extensive experiments on PDE modelling and molecular generation demonstrate that BARNN not only achieves comparable or superior accuracy compared to existing methods, but also excels in uncertainty quantification and in modelling long-range dependencies.

Updated: 2025-07-18 23:49:21

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2501.18665v2
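The paper's variational scheme is not reproduced here, but the core idea it builds on — keeping dropout masks stochastic at prediction time so that an ensemble of autoregressive rollouts yields predictive uncertainty — can be sketched on a toy model (the tanh feature layer, the sizes, and the keep probability below are illustrative assumptions, not BARNN's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy autoregressive predictor: hidden feature layer + linear readout.
W_in = rng.normal(size=(8, 1))
w_out = rng.normal(size=8) / 8

def step(y_prev, keep_prob=0.8):
    """One autoregressive step with dropout kept ACTIVE at prediction time."""
    h = np.tanh(W_in @ np.array([y_prev]))
    mask = rng.random(8) < keep_prob       # sample a fresh dropout mask
    h = h * mask / keep_prob               # inverted-dropout scaling
    return float(w_out @ h)

def rollout(y0, horizon):
    ys = [y0]
    for _ in range(horizon):
        ys.append(step(ys[-1]))
    return ys

# Many stochastic rollouts -> mean trajectory plus a predictive spread.
runs = np.array([rollout(0.5, horizon=10) for _ in range(200)])
mean, std = runs.mean(axis=0), runs.std(axis=0)
print(mean[-1], std[-1])
```

Because each rollout samples fresh masks at every step, the spread across the 200 trajectories grows with the horizon and serves as a crude uncertainty estimate; BARNN's actual posterior and tVAMP-prior go well beyond this.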

A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options

Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on the recently released models. Methods: A set of 100 multiple-choice radiation oncology physics questions, previously created by a well-experienced physicist, was used for this study. The answer options of the questions were randomly shuffled to create "new" exam sets. Five LLMs -- OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet -- with the versions released before September 30, 2024, were queried using these new exam sets. To evaluate their deductive reasoning ability, the correct answer options in the questions were replaced with "None of the above." Then, the explain-first and step-by-step instruction prompts were used to test if this strategy improved their reasoning ability. The performance of the LLMs was compared with the answers from medical physicists. Results: All models demonstrated expert-level performance on these questions, with o1-preview even surpassing medical physicists with a majority vote. When replacing the correct answer options with 'None of the above', all models exhibited a considerable decline in performance, suggesting room for improvement. The explain-first and step-by-step instruction prompts helped enhance the reasoning ability of the LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet models. Conclusion: These recently released LLMs demonstrated expert-level performance in answering radiation oncology physics questions, exhibiting great potential to assist in radiation oncology physics education and training.

Updated: 2025-07-18 23:41:43

Categories: physics.med-ph, cs.AI

Download: http://arxiv.org/abs/2412.10622v4
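The exam-perturbation protocol described above is straightforward to reproduce; a minimal sketch (the sample question and helper names are placeholders, not the study's actual items):

```python
import random

def shuffle_options(options, answer_idx, rng):
    """Randomly reorder the options and track where the answer key moves."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)

def replace_key_with_nota(options, answer_idx):
    """Deductive-reasoning variant: the correct option is replaced in place
    with 'None of the above', which then becomes the expected answer."""
    out = list(options)
    out[answer_idx] = "None of the above"
    return out, answer_idx

rng = random.Random(42)
opts, ans = shuffle_options(["photon", "gluon", "W boson", "graviton"], 0, rng)
nota_opts, nota_ans = replace_key_with_nota(opts, ans)
print(opts, ans)
print(nota_opts, nota_ans)
```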

Fail Fast, or Ask: Mitigating the Deficiencies of Reasoning LLMs with Human-in-the-Loop Systems Engineering

State-of-the-art reasoning LLMs are powerful problem solvers, but they still occasionally make mistakes. However, adopting AI models in risk-sensitive domains often requires error rates near 0%. To address this gap, we propose collaboration between a reasoning model and a human expert who resolves queries the model cannot confidently answer. We find that quantifying the uncertainty of a reasoning model through the length of its reasoning trace yields an effective basis for deferral to a human, e.g., cutting the error rate of Qwen3 235B-A22B on difficult MATH problems from 3% to less than 1% when deferring 7.5% of queries. However, the high latency of reasoning models still makes them challenging to deploy on use cases with high query volume. To address this challenge, we explore fronting a reasoning model with a large non-reasoning model. We call this modified human-in-the-loop system "Fail Fast, or Ask", since the non-reasoning model may defer difficult queries to the human expert directly ("failing fast"), without incurring the reasoning model's higher latency. We show that this approach yields around 40% latency reduction and about 50% cost savings for DeepSeek R1 while maintaining 90+% area under the accuracy-rejection curve. However, we observe that latency savings are lower than expected because of "latency drag", the phenomenon that processing easier queries with a non-reasoning model pushes the reasoning model's latency distribution towards longer latencies. Broadly, our results suggest that the deficiencies of state-of-the-art reasoning models -- nontrivial error rates and high latency -- can be substantially mitigated through black-box systems engineering, without requiring access to LLM internals.

Updated: 2025-07-18 23:25:26

Categories: cs.AI, cs.LG, cs.SY, eess.SY

Download: http://arxiv.org/abs/2507.14406v1
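The trace-length deferral rule can be sketched as follows, on synthetic data in which longer reasoning traces correlate with mistakes (the correlation strength and all numbers are assumptions for illustration, not the paper's measurements):

```python
import random

def defer_by_trace_length(records, defer_frac):
    """Keep the queries with the shortest reasoning traces; defer the rest.

    records: list of (trace_length, model_was_correct) pairs.
    Returns (deferral_rate, error_rate_on_retained_queries).
    """
    ranked = sorted(records, key=lambda r: r[0])   # short trace ~ confident
    n_keep = round(len(ranked) * (1 - defer_frac))
    kept = ranked[:n_keep]
    errors = sum(1 for _, ok in kept if not ok)
    return 1 - n_keep / len(ranked), errors / max(n_keep, 1)

# Synthetic data where long traces correlate with mistakes (illustrative).
rng = random.Random(0)
data = []
for _ in range(1000):
    length = rng.expovariate(1 / 500)
    p_err = min(0.5, length / 5000)                # longer -> more error-prone
    data.append((length, rng.random() > p_err))

full_err = sum(1 for _, ok in data if not ok) / len(data)
rate, kept_err = defer_by_trace_length(data, defer_frac=0.075)
print(full_err, rate, kept_err)
```

Deferring the longest-trace 7.5% of queries removes a disproportionate share of the errors, which is the mechanism behind the paper's 3% → <1% figure.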

SensorChat: Answering Qualitative and Quantitative Questions during Long-Term Multimodal Sensor Interactions

Natural language interaction with sensing systems is crucial for addressing users' personal concerns and providing health-related insights into their daily lives. When a user asks a question, the system automatically analyzes the full history of sensor data, extracts relevant information, and generates an appropriate response. However, existing systems are limited to short-duration (e.g., one minute) or low-frequency (e.g., daily step count) sensor data. In addition, they struggle with quantitative questions that require precise numerical answers. In this work, we introduce SensorChat, the first end-to-end QA system designed for daily life monitoring using long-duration, high-frequency time series data. Given raw sensor signals spanning multiple days and a user-defined natural language question, SensorChat generates semantically meaningful responses that directly address user concerns. SensorChat effectively handles both quantitative questions that require numerical precision and qualitative questions that require high-level reasoning to infer subjective insights. To achieve this, SensorChat uses an innovative three-stage pipeline including question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) to interpret human queries and generate responses. The intermediate querying stage extracts relevant information from the complete sensor data history. Real-world implementations demonstrate SensorChat's capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves 93% higher answer accuracy than the best performing state-of-the-art systems on quantitative questions. Furthermore, a user study with eight volunteers highlights SensorChat's effectiveness in answering qualitative questions.

Updated: 2025-07-18 23:00:52

Categories: cs.AI, cs.HC

Download: http://arxiv.org/abs/2502.02883v3
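The three-stage pipeline can be sketched end to end on toy data; in SensorChat the decomposition and assembly stages are LLM calls, which are stubbed here with rule-based placeholders (all names and the sensor log are illustrative assumptions):

```python
import statistics

# Toy multi-day sensor history standing in for the raw data lake.
SENSOR_LOG = {"heart_rate": [62, 71, 88, 95, 70, 64]}

def decompose(question):
    """Stage 1: map a natural-language question to structured sub-queries."""
    if "average heart rate" in question:
        return [("heart_rate", "mean")]
    raise ValueError("question not supported by this sketch")

def query(subqueries):
    """Stage 2: extract the requested statistics from the full history."""
    ops = {"mean": statistics.fmean, "max": max, "min": min}
    return {(sensor, op): ops[op](SENSOR_LOG[sensor]) for sensor, op in subqueries}

def assemble(results):
    """Stage 3: turn the numeric results back into a direct answer."""
    (sensor, op), value = next(iter(results.items()))
    return f"Your {op} {sensor.replace('_', ' ')} was {value:.1f}."

question = "What was my average heart rate this week?"
answer = assemble(query(decompose(question)))
print(answer)
```

Keeping stage 2 as deterministic computation over the complete history, rather than asking an LLM to do arithmetic, is what lets the system answer quantitative questions precisely.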

Adaptive Multi-Agent Reasoning via Automated Workflow Generation

The rise of Large Reasoning Models (LRMs) promises a significant leap forward in language model capabilities, aiming to tackle increasingly sophisticated tasks with unprecedented efficiency and accuracy. However, despite their impressive performance, recent studies have highlighted how current reasoning models frequently fail to generalize to novel, unseen problems, often resorting to memorized solutions rather than genuine inferential reasoning. Such behavior underscores a critical limitation in modern LRMs, i.e., their tendency toward overfitting, which in turn results in poor generalization in problem-solving capabilities. In this paper, we introduce Nexus Architect, an enhanced iteration of our multi-agent system framework, Nexus, equipped with a novel automated workflow synthesis mechanism. Given a user's prompt and a small set of representative examples, the Architect autonomously generates a tailored reasoning workflow by selecting suitable strategies, tool integrations, and adversarial techniques for a specific problem class. Furthermore, the Architect includes an iterative prompt refinement mechanism that fine-tunes agents' system prompts to maximize performance and improve the generalization capabilities of the system. We empirically evaluate Nexus Architect by employing an off-the-shelf, non-reasoning model on a custom dataset of challenging logical questions and compare its performance against state-of-the-art LRMs. Results show that Nexus Architect consistently outperforms existing solutions, achieving up to a 66% increase in pass rate over Gemini 2.5 Flash Preview, nearly 2.5× against Claude Sonnet 4 and DeepSeek-R1, and over 3× w.r.t. Llama 4 Scout.

Updated: 2025-07-18 22:46:27

Categories: cs.AI

Download: http://arxiv.org/abs/2507.14393v1

ADEPTS: A Capability Framework for Human-Centered Agent Design

Large language models have paved the way to powerful and flexible AI agents, assisting humans by increasingly integrating into their daily life. This flexibility, potential, and growing adoption demand a holistic and cross-disciplinary approach to developing, monitoring and discussing the capabilities required for agent-driven user experiences. However, current guidance on human-centered AI agent development is scattered: UX heuristics focus on interface behaviors, engineering taxonomies describe internal pipelines, and ethics checklists address high-level governance. There is no concise, user-facing vocabulary that tells teams what an agent should fundamentally be able to do. We introduce ADEPTS, a capability framework defining a set of core user-facing capabilities to provide unified guidance around the development of AI agents. ADEPTS is based on six principles for human-centered agent design, that express the minimal, user-facing capabilities an AI agent should demonstrate to be understandable, controllable and trustworthy in everyday use. ADEPTS complements existing frameworks and taxonomies; differently from them, it sits at the interface between technical and experience development. By presenting ADEPTS, we aim to condense complex AI-UX requirements into a compact framework that is actionable guidance for AI researchers, designers, engineers, and policy reviewers alike. We believe ADEPTS has the potential of accelerating the improvement of user-relevant agent capabilities, of easing the design of experiences that take advantage of those capabilities, and of providing a shared language to track and discuss progress around the development of AI agents.

Updated: 2025-07-18 22:27:40

Categories: cs.AI, cs.HC, cs.LG

Download: http://arxiv.org/abs/2507.15885v1

Incremental Causal Graph Learning for Online Cyberattack Detection in Cyber-Physical Infrastructures

The escalating threat of cyberattacks on real-time critical infrastructures poses serious risks to public safety, demanding detection methods that effectively capture complex system interdependencies and adapt to evolving attack patterns. Traditional real-time anomaly detection techniques often suffer from excessive false positives due to their statistical sensitivity to high data variance and class imbalance. To address these limitations, recent research has explored modeling causal relationships among system components. However, prior work mainly focuses on offline causal graph-based approaches that require static historical data and fail to generalize to real-time settings. These methods are fundamentally constrained by: (1) their inability to adapt to dynamic shifts in data distribution without retraining, and (2) the risk of catastrophic forgetting when lacking timely supervision in live systems. To overcome these challenges, we propose INCADET, a novel framework for incremental causal graph learning tailored to real-time cyberattack detection. INCADET dynamically captures evolving system behavior by incrementally updating causal graphs across streaming time windows. The framework comprises three modules: 1) Early Symptom Detection: Detects transitions in system status using divergence in edge-weight distributions across sequential causal graphs. 2) Incremental Causal Graph Learning: Leverages experience replay and edge reinforcement to continually refine causal structures while preserving prior knowledge. 3) Causal Graph Classification: Employs Graph Convolutional Networks (GCNs) to classify system status using the learned causal graphs. Extensive experiments on real-world critical infrastructure datasets demonstrate that INCADET achieves superior accuracy, robustness, and adaptability compared to both static causal and deep temporal baselines in evolving attack scenarios.

Updated: 2025-07-18 22:27:13

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2507.14387v1
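A minimal version of the early-symptom module — flagging a window when the edge-weight distribution of the current causal graph diverges from its predecessor — might look like this (the Jensen-Shannon divergence, the threshold, and the toy graphs are assumptions; the abstract does not commit to a specific divergence):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two edge-weight distributions."""
    keys = set(p) | set(q)
    def norm(d):
        total = sum(d.get(k, 0.0) for k in keys) or 1.0
        return {k: d.get(k, 0.0) / total for k in keys}
    p, q = norm(p), norm(q)
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log(a[k] / b[k]) for k in keys if a[k] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def detect_transition(graph_stream, threshold=0.1):
    """Flag the first window whose causal graph diverges from its predecessor."""
    for t in range(1, len(graph_stream)):
        if js_divergence(graph_stream[t - 1], graph_stream[t]) > threshold:
            return t
    return None

# Toy causal graphs as {edge: weight} dicts over system components.
normal = {("sensor", "pump"): 0.9, ("pump", "valve"): 0.8}
attack = {("sensor", "pump"): 0.1, ("attacker", "pump"): 0.9}
stream = [normal, normal, attack]
print(detect_transition(stream))  # flags window 2
```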

Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning

Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components including the complete software source code are openly released to support reproducibility, transparency, and community benchmarking at Github: https://github.com/Applied-AI-Research-Lab/Orchestrator-Agent-Trust

Updated: 2025-07-18 22:25:01

Categories: cs.AI, cs.CL

Download: http://arxiv.org/abs/2507.10571v2
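Of the calibration metrics mentioned, ECE is the most standard; a binned implementation of the kind an orchestrator could use to modulate trust (the bin count and toy data below are assumptions):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted gap between confidence and accuracy per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece

# Toy agent that is slightly overconfident in both confidence bands.
conf = [0.95] * 3 + [0.55] * 4
hits = [1, 1, 1, 1, 1, 0, 0]
ece_val = expected_calibration_error(conf, hits)
print(round(ece_val, 4))
```

A well-calibrated agent drives this toward zero; a high ECE is exactly the overconfidence signal that triggers re-evaluation in the trust-aware configuration.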

Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms

Large Language Models (LLMs) have shown notable potential in code generation for optimization algorithms, unlocking exciting new opportunities. This paper examines how LLMs, rather than creating algorithms from scratch, can improve existing ones without the need for specialized expertise. To explore this potential, we selected 10 baseline optimization algorithms from various domains (metaheuristics, reinforcement learning, deterministic, and exact methods) to solve the classic Travelling Salesman Problem. The results show that our simple methodology often results in LLM-generated algorithm variants that improve over the baseline algorithms in terms of solution quality, reduction in computational time, and simplification of code complexity, all without requiring specialized optimization knowledge or advanced algorithmic implementation skills.

Updated: 2025-07-18 21:55:15

Categories: cs.AI, cs.CL, cs.LG, cs.SE

Download: http://arxiv.org/abs/2503.10968v2
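A typical baseline of the kind handed to an LLM for improvement — a plain 2-opt local search for the TSP — fits in a few lines (purely illustrative; the paper's ten baselines span metaheuristic, RL, deterministic, and exact paradigms):

```python
import itertools, math, random

def tour_length(cities, tour):
    return sum(math.dist(cities[tour[i - 1]], cities[tour[i]])
               for i in range(len(tour)))

def two_opt(cities, tour):
    """Repeatedly reverse segments while doing so shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(len(tour)), 2):
            if j - i < 2:
                continue
            cand = tour[:i] + tour[i:j][::-1] + tour[j:]
            if tour_length(cities, cand) < tour_length(cities, tour) - 1e-12:
                tour, improved = cand, True
    return tour

rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(12)]
start = list(range(12))
best = two_opt(pts, start)
print(tour_length(pts, start), tour_length(pts, best))
```

Obvious inefficiencies (recomputing the full tour length per candidate instead of the local delta, for instance) are precisely the kind of thing the paper reports LLMs fixing without expert guidance.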

Schemora: schema matching via multi-stage recommendation and metadata enrichment using off-the-shelf LLMs

Schema matching is essential for integrating heterogeneous data sources and enhancing dataset discovery, yet it remains a complex and resource-intensive problem. We introduce SCHEMORA, a schema matching framework that combines large language models with hybrid retrieval techniques in a prompt-based approach, enabling efficient identification of candidate matches without relying on labeled training data or exhaustive pairwise comparisons. By enriching schema metadata and leveraging both vector-based and lexical retrieval, SCHEMORA improves matching accuracy and scalability. Evaluated on the MIMIC-OMOP benchmark, it establishes new state-of-the-art performance, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over previous best results. To our knowledge, this is the first LLM-based schema matching method with an open-source implementation, accompanied by analysis that underscores the critical role of retrieval and provides practical guidance on model selection.

Updated: 2025-07-18 21:50:36

Categories: cs.DB, cs.AI, cs.LG

Download: http://arxiv.org/abs/2507.14376v1
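The hybrid retrieval idea — blending lexical and vector similarity to rank candidate target columns — can be sketched with stand-ins: token Jaccard for the lexical side and a character-count cosine in place of learned embeddings (the blend weight and column names are assumptions, not SCHEMORA's actual retrievers):

```python
import math, re

def tokens(name):
    return set(re.findall(r"[a-z]+", name.lower().replace("_", " ")))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def char_vector(name):
    v = {}
    for ch in name.lower():
        if ch.isalpha():
            v[ch] = v.get(ch, 0) + 1
    return v

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(src, tgt, alpha=0.5):
    """Blend the lexical (Jaccard) and 'vector' (char-cosine stand-in) signals."""
    return (alpha * jaccard(tokens(src), tokens(tgt))
            + (1 - alpha) * cosine(char_vector(src), char_vector(tgt)))

source = "patient_birth_date"
candidates = ["person.birth_datetime", "visit.admit_date", "drug.quantity"]
ranked = sorted(candidates, key=lambda c: hybrid_score(source, c), reverse=True)
print(ranked)
```

The two signals are complementary: Jaccard alone ties the top two candidates here (each shares one token with the source), and the character-level signal breaks the tie in favor of the semantically closer column.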

Text-to-SQL for Enterprise Data Analytics

The introduction of large language models has brought rapid progress on Text-to-SQL benchmarks, but it is not yet easy to build a working enterprise solution. In this paper, we present insights from building an internal chatbot that enables LinkedIn's product managers, engineers, and operations teams to self-serve data insights from a large, dynamic data lake. Our approach features three components. First, we construct a knowledge graph that captures up-to-date semantics by indexing database metadata, historical query logs, wikis, and code. We apply clustering to identify relevant tables for each team or product area. Second, we build a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors. Third, we build an interactive chatbot that supports various user intents, from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Our chatbot has over 300 weekly users. Expert review shows that 53% of its responses are correct or close to correct on an internal benchmark set. Through ablation studies, we identify the most important knowledge graph and modeling components, offering a practical path for developing enterprise Text-to-SQL solutions.

Updated: 2025-07-18 21:39:17

Categories: cs.CL, cs.AI, cs.DB, cs.HC

Download: http://arxiv.org/abs/2507.14372v1
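The agent's automatic correction of hallucinated SQL can be sketched as a validate-and-retry loop; here the "retries" are a fixed candidate list rather than LLM rewrites, and the schema is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (member_id INT, ts TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)",
                 [(1, "2025-07-18", "feed"), (2, "2025-07-18", "jobs")])

def try_query(sql):
    """Validate a candidate query; return (ok, result-or-error-message)."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)

def agent_loop(candidates):
    """Walk candidate queries (as an LLM retry loop would) until one runs."""
    for sql in candidates:
        ok, out = try_query(sql)
        if ok:
            return sql, out
    return None, None

# First candidate hallucinates a column name; the corrected retry succeeds.
sql, rows = agent_loop([
    "SELECT COUNT(*) FROM page_views WHERE user_id = 1",   # wrong column
    "SELECT COUNT(*) FROM page_views WHERE member_id = 1",
])
print(sql, rows)
```

In the real system the error message would be fed back to the LLM to produce the next candidate; executing against the actual engine is what catches both syntax errors and hallucinated identifiers.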

Smarter Together: Combining Large Language Models and Small Models for Physiological Signals Visual Inspection

Large language models (LLMs) have shown promising capabilities in visually interpreting medical time-series data. However, their general-purpose design can limit domain-specific precision, and the proprietary nature of many models poses challenges for fine-tuning on specialized clinical datasets. Conversely, small specialized models (SSMs) offer strong performance on focused tasks but lack the broader reasoning needed for complex medical decision-making. To address these complementary limitations, we introduce ConMIL (Conformalized Multiple Instance Learning), a novel decision-support framework that distinctively synergizes three key components: (1) a new Multiple Instance Learning (MIL) mechanism, QTrans-Pooling, designed for per-class interpretability in identifying clinically relevant physiological signal segments; (2) conformal prediction, integrated with MIL to generate calibrated, set-valued outputs with statistical reliability guarantees; and (3) a structured approach for these interpretable and uncertainty-quantified SSM outputs to enhance the visual inspection capabilities of LLMs. Our experiments on arrhythmia detection and sleep stage classification demonstrate that ConMIL can enhance the accuracy of LLMs such as ChatGPT4.0, Qwen2-VL-7B, and MiMo-VL-7B-RL. For example, ConMIL-supported Qwen2-VL-7B and MiMo-VL-7B-RL achieve 94.92% and 96.82% precision on confident samples and (70.61% and 78.02%)/(78.10% and 71.98%) on uncertain samples for the two tasks, compared to 46.13% and 13.16% using the LLM alone. These results suggest that integrating task-specific models with LLMs may offer a promising pathway toward more interpretable and trustworthy AI-driven clinical decision support.

Updated: 2025-07-18 21:37:05

Categories: cs.AI, cs.LG, eess.SP

Download: http://arxiv.org/abs/2501.16215v2
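The conformal-prediction component admits a compact sketch: split conformal calibration turns per-class scores into set-valued outputs with a finite-sample coverage guarantee (the Beta-distributed calibration scores and the class probabilities below are synthetic, not ConMIL's outputs):

```python
import math, random

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: finite-sample quantile of calibration nonconformity."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))   # finite-sample correction
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(class_probs, qhat):
    """All classes whose nonconformity (1 - prob) clears the threshold."""
    return {c for c, p in class_probs.items() if 1 - p <= qhat}

rng = random.Random(0)
# Calibration scores: nonconformity of the true class on held-out examples.
cal = [1 - min(0.99, rng.betavariate(8, 2)) for _ in range(500)]
qhat = conformal_threshold(cal, alpha=0.1)

probs = {"AFib": 0.75, "Normal": 0.20, "Other": 0.05}
pred_set = prediction_set(probs, qhat)
print(round(qhat, 3), sorted(pred_set))
```

A singleton set marks a "confident sample" for the LLM to trust, while a larger (or empty) set flags the uncertain cases the abstract reports separately.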

Layerwise Recall and the Geometry of Interwoven Knowledge in LLMs

This study explores how large language models (LLMs) encode interwoven scientific knowledge, using chemical elements and LLaMA-series models as a case study. We identify a 3D spiral structure in the hidden states that aligns with the conceptual structure of the periodic table, suggesting that LLMs can reflect the geometric organization of scientific concepts learned from text. Linear probing reveals that middle layers encode continuous, overlapping attributes that enable indirect recall, while deeper layers sharpen categorical distinctions and incorporate linguistic context. These findings suggest that LLMs represent symbolic knowledge not as isolated facts, but as structured geometric manifolds that intertwine semantic information across layers. We hope this work inspires further exploration of how LLMs represent and reason about scientific knowledge, particularly in domains such as materials science.

Updated: 2025-07-18 21:24:29

Categories: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2502.10871v2
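Linear probing, as used here, fits a linear readout on hidden states to test whether an attribute is linearly decodable; a sketch on synthetic "hidden states" (all data below is generated, no actual LLaMA activations are involved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in hidden states: a continuous attribute (think electronegativity)
# linearly embedded along a random direction, plus noise.
n, d = 200, 16
attribute = rng.uniform(0.7, 4.0, size=n)
direction = rng.normal(size=d)
H = np.outer(attribute, direction) + 0.1 * rng.normal(size=(n, d))

# Linear probe: least-squares readout of the attribute from hidden states.
H1 = np.column_stack([H, np.ones(n)])      # add a bias column
w, *_ = np.linalg.lstsq(H1, attribute, rcond=None)
pred = H1 @ w
r2 = 1 - np.sum((attribute - pred) ** 2) / np.sum((attribute - attribute.mean()) ** 2)
print(round(float(r2), 3))
```

A high R² at some layer is evidence that the attribute is encoded as a (roughly) linear direction there; the paper runs this kind of probe layer by layer to localize where continuous attributes versus categorical distinctions live.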

Achieving Robust Channel Estimation Neural Networks by Designed Training Data

Channel estimation is crucial in wireless communications. However, in many papers neural networks are tested by training and testing on one example channel or on similar channels. This is because data-driven methods often degrade on new data which they are not trained on, as they cannot extrapolate their training knowledge. This is despite the fact that physical channels are often assumed to be time-variant. However, due to the low latency requirements and limited computing resources, neural networks may not have enough time and computing resources to execute online training to fine-tune the parameters. This motivates us to design offline-trained neural networks that can perform robustly over wireless channels, but without any actual channel information being known at design time. In this paper, we propose design criteria to generate synthetic training datasets for neural networks, which guarantee that after training the resulting networks achieve a certain mean squared error (MSE) on new and previously unseen channels. Therefore, trained neural networks require no prior channel information or parameters update for real-world implementations. Based on the proposed design criteria, we further propose a benchmark design which ensures intelligent operation for different channel profiles. To demonstrate general applicability, we use neural networks with different levels of complexity to show that the generalization achieved appears to be independent of neural network architecture. From simulations, neural networks achieve robust generalization to wireless channels with both fixed channel profiles and variable delay spreads.

Updated: 2025-07-18 21:16:40

Categories: eess.SP, cs.AI

Download: http://arxiv.org/abs/2507.12630v2
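The evaluation target — MSE of a channel estimate over channels with variable delay spreads — can be sketched with a least-squares pilot estimate on synthetic channels; the exponential power-delay profile and all-ones pilot are assumptions for illustration, not the paper's design criteria:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_channel(n_taps, max_taps=8):
    """Rayleigh taps with an exponential power-delay profile."""
    h = np.zeros(max_taps, dtype=complex)
    p = np.exp(-np.arange(n_taps))           # decaying tap powers
    p /= p.sum()                             # normalize total channel power
    h[:n_taps] = np.sqrt(p / 2) * (rng.normal(size=n_taps)
                                   + 1j * rng.normal(size=n_taps))
    return h

def ls_estimate_mse(snr_db, n_trials=2000):
    """LS estimate of the frequency response from one all-ones pilot symbol."""
    n_fft, noise_var = 64, 10 ** (-snr_db / 10)
    err = 0.0
    for _ in range(n_trials):
        n_taps = rng.integers(1, 9)          # delay spread varies per draw
        H = np.fft.fft(random_channel(n_taps), n_fft)
        noise = np.sqrt(noise_var / 2) * (rng.normal(size=n_fft)
                                          + 1j * rng.normal(size=n_fft))
        H_ls = H + noise                     # pilot = 1 on every subcarrier
        err += np.mean(np.abs(H_ls - H) ** 2)
    return err / n_trials

mse10, mse20 = ls_estimate_mse(10), ls_estimate_mse(20)
print(round(mse10, 3), round(mse20, 4))
```

Drawing a fresh delay spread for every training example, as this generator does, is the kind of dataset-design choice the paper formalizes so that a trained estimator meets an MSE target on channels it has never seen.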

Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports

This study introduces an LLM-powered multiagent ensemble method to address challenges in hallucination and data labeling, particularly in large-scale EHR datasets. Manual labeling of such datasets requires domain expertise and is labor-intensive, time-consuming, expensive, and error-prone. To overcome this bottleneck, we developed an ensemble LLMs method and demonstrated its effectiveness in two real-world tasks: (1) labeling a large-scale unlabeled ECG dataset in MIMIC-IV; (2) identifying social determinants of health (SDOH) from the clinical notes of EHR. Trading off benefits and cost, we selected a pool of diverse open source LLMs with satisfactory performance. We treat each LLM's prediction as a vote and apply a mechanism of majority voting with minimal winning threshold for ensemble. We implemented an ensemble LLMs application for EHR data labeling tasks. By using the ensemble LLMs and natural language processing, we labeled the MIMIC-IV ECG dataset of 623,566 ECG reports with an estimated accuracy of 98.2%. We applied the ensemble LLMs method to identify SDOH from social history sections of 1,405 EHR clinical notes, also achieving competitive performance. Our experiments show that the ensemble LLMs can outperform individual LLMs, even the best commercial one, and the method reduces hallucination errors. From the research, we found that (1) the ensemble LLMs method significantly reduces the time and effort required for labeling large-scale EHR data, automating the process with high accuracy and quality; (2) the method generalizes well to other text data labeling tasks, as shown by its application to SDOH identification; (3) the ensemble of a group of diverse LLMs can outperform or match the performance of the best individual LLM; and (4) the ensemble method substantially reduces hallucination errors. This approach provides a scalable and efficient solution to data-labeling challenges.

Updated: 2025-07-18 20:53:27

Categories: cs.AI,I.2

Download: http://arxiv.org/abs/2410.16543v3

Oversmoothing Alleviation in Graph Neural Networks: A Survey and Unified View

Oversmoothing is a common challenge in learning graph neural networks (GNNs): as layers increase, the embedding features learned by GNNs quickly become similar or indistinguishable, making them incapable of differentiating network proximity. A GNN with a shallow architecture can only learn short-range relations or localized structure information, limiting its ability to learn long-range connections, as evidenced by inferior performance on heterophilous graphs. Tackling oversmoothing is crucial for harnessing deep-layer architectures for GNNs. To date, many methods have been proposed to alleviate oversmoothing. The vast difference in their design principles, combined with graph complications, makes it difficult to understand, let alone compare, how different approaches tackle oversmoothing. In this paper, we propose ATNPA, a unified view with five key steps: Augmentation, Transformation, Normalization, Propagation, and Aggregation, to summarize GNN oversmoothing-alleviation approaches. We first propose a taxonomy for GNN oversmoothing alleviation that includes three themes for tackling oversmoothing. We then separate all methods into six categories, followed by detailed reviews of representative methods, including their relation to ATNPA and a discussion of their niches, strengths, and weaknesses. The review not only provides an in-depth understanding of existing methods in the field but also lays out a clear road map for future study.

Updated: 2025-07-18 20:52:22

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2405.01663v2

Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans

Modern large language models (LLMs) achieve impressive performance on some tasks, while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well the LLM's learned representations align with human representations. In this work, we introduce a novel approach to study representation alignment: we adopt a method from research on activation steering to identify neurons responsible for specific concepts (e.g., ''cat'') and then analyze the corresponding activation patterns. We find that LLM representations captured this way closely align with human representations inferred from behavioral data, matching inter-human alignment levels. Our approach significantly outperforms the alignment captured by word embeddings, which have been the focus of prior work on human-LLM alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts -- we show that LLMs organize concepts in a way that mirrors human concept organization.
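One common way to quantify such alignment is to correlate the model's and humans' similarity structures over the same concepts; the sketch below is our own minimal illustration with random data, not the authors' procedure:

```python
import numpy as np

def similarity_matrix(vectors):
    # cosine similarity between per-concept representation vectors
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return v @ v.T

def representational_alignment(model_vecs, human_vecs):
    """Correlate the two similarity structures (upper triangles only,
    so the trivial diagonal of self-similarities is excluded)."""
    m, h = similarity_matrix(model_vecs), similarity_matrix(human_vecs)
    iu = np.triu_indices_from(m, k=1)
    return np.corrcoef(m[iu], h[iu])[0, 1]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 16))  # 5 concepts, 16-dim activations
# identical representations align perfectly
print(round(representational_alignment(vecs, vecs), 6))  # 1.0
```

In practice the model vectors would be activations of concept-selective neurons and the human matrix would come from behavioral similarity judgments.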

Updated: 2025-07-18 20:48:40

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2502.15090v2

Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI

This report investigates the history and impact of Generative Models and Connected and Automated Vehicles (CAVs), two groundbreaking forces driving progress in technology and transportation. By focusing on the application of generative models within the context of CAVs, the study aims to unravel how this integration could enhance predictive modeling, simulation accuracy, and decision-making processes in autonomous vehicles. The report also discusses the benefits and challenges of integrating generative models and CAV technology in transportation, highlighting the progress made, the remaining obstacles, and the potential for advancements in safety and innovation.

Updated: 2025-07-18 20:14:08

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2403.10559v2

Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers

Parameter-efficient fine-tuning (PEFT) is a versatile and extensible approach for adapting a Large Language Model (LLM) to newer tasks. One of the most prominent PEFT approaches, Low-Rank Adaptation (LoRA), primarily focuses on adjusting the attention weight matrices within individual decoder blocks of a Generative Pre-trained Transformer (GPT2). In contrast, we introduce Solo Connection, a novel method that adapts the representation at the decoder-block level rather than modifying individual weight matrices. Not only does Solo Connection outperform LoRA on E2E natural language generation benchmarks, but it also reduces the number of trainable parameters by 59% relative to LoRA and by more than 99% compared to full fine-tuning of GPT2, an early version of Large Language Models (LLMs). Solo Connection is also motivated by homotopy theory: we introduce a trainable linear transformation that gradually interpolates between a zero vector and the task-specific representation, enabling smooth and stable adaptation over time. While skip connections in the original 12-layer GPT2 are typically confined to individual decoder blocks, subsequent GPT2 variants scale up to 48 layers, and even larger language models can include 128 or more decoder blocks. These expanded architectures underscore the need to revisit how skip connections are employed during fine-tuning. This paper focuses on long skip connections that link the outputs of different decoder blocks, potentially enhancing the model's ability to adapt to new tasks while leveraging pre-trained knowledge.
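The homotopy-style interpolation can be sketched as follows; this is an illustrative guess at the mechanism with our own names, whereas the real method operates on GPT2 decoder-block representations:

```python
import numpy as np

def solo_connection(h_early, h_late, W, alpha):
    """Homotopy-style long skip connection (illustrative sketch only).

    At alpha = 0 the connection contributes nothing, so pre-trained
    behavior is unchanged; as alpha grows, the output interpolates
    toward a task-specific linear transform of an earlier decoder
    block's representation.
    """
    return h_late + alpha * (W @ h_early)

d = 4
h_early, h_late = np.ones(d), np.zeros(d)
W = np.eye(d)
print(solo_connection(h_early, h_late, W, 0.0))  # all zeros: no change
print(solo_connection(h_early, h_late, W, 0.5))  # halfway interpolation
```

Only `W` and `alpha` would be trainable here, which is what keeps the parameter count far below full fine-tuning.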

Updated: 2025-07-18 20:11:50

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.14353v1

Still More Shades of Null: An Evaluation Suite for Responsible Missing Value Imputation

Data missingness is a practical challenge of sustained interest to the scientific community. In this paper, we present Shades-of-Null, an evaluation suite for responsible missing value imputation. Our work is novel in two ways: (i) we model realistic and socially-salient missingness scenarios that go beyond Rubin's classic Missing Completely at Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR) settings, to include multi-mechanism missingness (when different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between training and test); and (ii) we evaluate imputers holistically, based on imputation quality and imputation fairness, as well as on the predictive performance, fairness and stability of the models that are trained and tested on the data post-imputation. We use Shades-of-Null to conduct a large-scale empirical study involving 29,736 experimental pipelines, and find that while there is no single best-performing imputation approach for all missingness types, interesting trade-offs arise between predictive performance, fairness and stability, based on the combination of missingness scenario, imputer choice, and the architecture of the predictive model. We make Shades-of-Null publicly available to enable researchers to rigorously evaluate missing value imputation methods on a wide range of metrics in plausible and socially meaningful scenarios.
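The classic mechanisms the suite builds on can be simulated in a few lines (a hedged sketch with assumed parameter values, unrelated to the Shades-of-Null code itself):

```python
import numpy as np

rng = np.random.default_rng(42)

def mcar_mask(n, p_missing):
    # MCAR: every entry is missing with the same fixed probability
    return rng.random(n) < p_missing

def mar_mask(n, covariate, threshold, p_high, p_low):
    # MAR: missingness probability depends only on an *observed* covariate
    p = np.where(covariate > threshold, p_high, p_low)
    return rng.random(n) < p

age = rng.integers(18, 90, size=1000)
m1 = mcar_mask(1000, 0.2)
m2 = mar_mask(1000, age, threshold=60, p_high=0.5, p_low=0.1)
print(m1.mean(), m2.mean())  # empirical missing rates under each mechanism
```

MNAR would make `p` depend on the (unobserved) value being masked itself, and "missingness shift" corresponds to using different mask generators at train and test time.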

Updated: 2025-07-18 20:08:32

Categories: cs.AI,cs.CY,cs.LG

Download: http://arxiv.org/abs/2409.07510v6

A Reproducibility Study of Product-side Fairness in Bundle Recommendation

Recommender systems are known to exhibit fairness issues, particularly on the product side, where products and their associated suppliers receive unequal exposure in recommended results. While this problem has been widely studied in traditional recommendation settings, its implications for bundle recommendation (BR) remain largely unexplored. This emerging task introduces additional complexity: recommendations are generated at the bundle level, yet user satisfaction and product (or supplier) exposure depend on both the bundle and the individual items it contains. Existing fairness frameworks and metrics designed for traditional recommender systems may not directly translate to this multi-layered setting. In this paper, we conduct a comprehensive reproducibility study of product-side fairness in BR across three real-world datasets using four state-of-the-art BR methods. We analyze exposure disparities at both the bundle and item levels using multiple fairness metrics, uncovering important patterns. Our results show that exposure patterns differ notably between bundles and items, revealing the need for fairness interventions that go beyond bundle-level assumptions. We also find that fairness assessments vary considerably depending on the metric used, reinforcing the need for multi-faceted evaluation. Furthermore, user behavior plays a critical role: when users interact more frequently with bundles than with individual items, BR systems tend to yield fairer exposure distributions across both levels. Overall, our findings offer actionable insights for building fairer bundle recommender systems and establish a vital foundation for future research in this emerging domain.
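The bundle-versus-item exposure gap can be illustrated with a small sketch (our own toy example; the function and id names are hypothetical):

```python
from collections import Counter

def item_exposure(recommended_bundles, bundle_contents):
    """Item-level exposure implied by bundle-level recommendations.

    Each time a bundle is recommended, every item inside it receives
    one unit of exposure; counting only bundle appearances would hide
    this item-level (and hence supplier-level) disparity.
    """
    exposure = Counter()
    for b in recommended_bundles:
        exposure.update(bundle_contents[b])
    return exposure

bundles = {"b1": ["i1", "i2"], "b2": ["i2", "i3"]}
recs = ["b1", "b1", "b2"]
print(item_exposure(recs, bundles))
# Counter({'i2': 3, 'i1': 2, 'i3': 1}) -- i2 benefits from both bundles
```

Fairness metrics such as Gini or entropy can then be computed over these item counts rather than over bundle counts.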

Updated: 2025-07-18 20:06:39

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2507.14352v1

Influence Functions for Preference Dataset Pruning

Language models are commonly fine-tuned via reinforcement learning to alter their behavior or elicit new capabilities. Datasets used for these purposes, and particularly human preference datasets, are often noisy. The relatively small size of post-training datasets, combined with parameter-efficient fine-tuning methods, enables the use of influence-function approximations to detect and prune training examples that are harmful to performance on a validation set. In this work, we adapt the TL;DR dataset for reward model training to demonstrate how conjugate-gradient-approximated influence functions can be used to filter datasets. In our experiments, influence-function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples. We also show that gradient similarity outperforms influence functions for detecting helpful training examples. This suggests that local curvature is important for detecting harmful training examples, but less so for identifying helpful ones.
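Gradient-similarity scoring, the simpler of the two schemes compared above, can be sketched as follows (a toy illustration that treats per-example gradients as plain vectors; the names are ours):

```python
import numpy as np

def gradient_similarity_scores(train_grads, val_grad):
    """Score each training example by the cosine similarity between its
    gradient and the validation-set gradient.

    High scores mark likely-helpful examples; strongly negative scores
    mark candidates for pruning as likely harmful.
    """
    t = train_grads / np.linalg.norm(train_grads, axis=1, keepdims=True)
    v = val_grad / np.linalg.norm(val_grad)
    return t @ v

val_grad = np.array([1.0, 0.0])
train_grads = np.array([[1.0, 0.0],    # aligned with validation: helpful
                        [0.0, 1.0],    # orthogonal: neutral
                        [-1.0, 0.0]])  # opposed: harmful
scores = gradient_similarity_scores(train_grads, val_grad)
print(scores)  # [ 1.  0. -1.]
```

Influence functions additionally weight these dot products by the inverse Hessian, which is where the local-curvature information enters.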

Updated: 2025-07-18 19:43:36

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.14344v1

Fiduciary AI for the Future of Brain-Technology Interactions

Brain foundation models represent a new frontier in AI: instead of processing text or images, these models interpret real-time neural signals from EEG, fMRI, and other neurotechnologies. When integrated with brain-computer interfaces (BCIs), they may enable transformative applications, from thought-controlled devices to neuroprosthetics, by interpreting and acting on brain activity in milliseconds. However, these same systems pose unprecedented risks, including the exploitation of subconscious neural signals and the erosion of cognitive liberty. Users cannot easily observe or control how their brain signals are interpreted, creating power asymmetries that are vulnerable to manipulation. This paper proposes embedding fiduciary duties (loyalty, care, and confidentiality) directly into BCI-integrated brain foundation models through technical design. Drawing on legal traditions and recent advancements in AI alignment techniques, we outline implementable architectural and governance mechanisms to ensure these systems act in users' best interests. Placing brain foundation models on a fiduciary footing is essential to realizing their potential without compromising self-determination.

Updated: 2025-07-18 19:34:08

Categories: cs.CY,cs.AI,cs.HC,cs.LG,eess.SP,K.4.0; I.2.0; J.4

Download: http://arxiv.org/abs/2507.14339v1

Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image "needles" at various depths within the documents to challenge VLMs' retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.
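Needle insertion at a controlled depth can be sketched like this (a hypothetical helper, not the benchmark's actual construction code):

```python
def insert_needle(pages, needle, depth_fraction):
    """Insert a 'needle' page at a given relative depth of a document.

    depth_fraction = 0.0 places it at the front, 1.0 at the end, so a
    sweep over fractions probes retrieval at every document depth.
    """
    idx = round(depth_fraction * len(pages))
    return pages[:idx] + [needle] + pages[idx:]

doc = [f"page-{i}" for i in range(10)]
out = insert_needle(doc, "NEEDLE", 0.5)
print(out.index("NEEDLE"), len(out))  # 5 11
```

Evaluation then reduces to asking the VLM a question whose answer lives only in the needle and checking the response automatically.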

Updated: 2025-07-18 19:33:15

Categories: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.15882v1

ProofCompass: Enhancing Specialized Provers with LLM Guidance

Language models have become increasingly powerful tools for formal mathematical reasoning. However, most existing approaches rely exclusively on either large general-purpose models or smaller specialized models, each with distinct limitations, while training specialized large models still requires significant computational resources. This paper introduces ProofCompass, a novel hybrid methodology that achieves remarkable computational efficiency by strategically guiding existing specialized prover methods, such as DeepSeek-Prover-v1.5-RL (DSP-v1.5), with a Large Language Model (LLM), without requiring additional model training. The LLM provides natural language proof strategies and analyzes failed attempts to select intermediate lemmas, enabling effective problem decomposition. On the miniF2F benchmark, ProofCompass demonstrates substantial resource efficiency: it outperforms DSP-v1.5 ($54.9\% \rightarrow 55.3\%$) while using 25x fewer attempts ($3200 \rightarrow 128$). Our synergistic approach paves the way for simultaneously improving computational efficiency and accuracy in formal theorem proving.

Updated: 2025-07-18 19:28:01

Categories: cs.AI

Download: http://arxiv.org/abs/2507.14335v1

Language Models as Ontology Encoders

OWL (Web Ontology Language) ontologies, which can formally represent complex knowledge and support semantic reasoning, have been widely adopted across domains such as healthcare and bioinformatics. Recently, ontology embeddings have gained wide attention due to their potential to infer plausible new knowledge and approximate complex reasoning. However, existing methods face notable limitations: geometric-model-based embeddings typically overlook valuable textual information, resulting in suboptimal performance, while approaches that incorporate text, which are often based on language models, fail to preserve the logical structure. In this work, we propose a new ontology embedding method, OnT, which tunes a Pretrained Language Model (PLM) via geometric modeling in a hyperbolic space to effectively incorporate textual labels while simultaneously preserving the class hierarchies and other logical relationships of the description logic EL. Extensive experiments on four real-world ontologies show that OnT consistently outperforms the baselines, including the state of the art, on both axiom prediction and axiom inference tasks. OnT also demonstrates strong potential in real-world applications, indicated by its robust transfer-learning abilities and its effectiveness in a real case of constructing a new ontology from SNOMED CT. Data and code are available at https://github.com/HuiYang1997/OnT.
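For intuition, the Poincaré-ball distance commonly used in hyperbolic-embedding work looks like this (a generic formula sketch, not OnT's implementation):

```python
import numpy as np

def poincare_distance(u, v):
    """Distance in the Poincare ball model of hyperbolic space.

    Distances grow rapidly near the unit boundary, which gives
    hyperbolic space its tree-like capacity for class hierarchies.
    """
    diff = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * diff / denom)

origin = np.zeros(2)
near_boundary = np.array([0.95, 0.0])
# the same Euclidean step costs far more hyperbolic distance near the rim
print(poincare_distance(origin, np.array([0.5, 0.0])) <
      poincare_distance(origin, near_boundary))  # True
```

Embedding a class hierarchy then amounts to placing general classes near the origin and specific subclasses toward the boundary.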

Updated: 2025-07-18 19:26:16

Categories: cs.AI

Download: http://arxiv.org/abs/2507.14334v1

Blackbox Dataset Inference for LLM

Today, the training of large language models (LLMs) can involve personally identifiable information and copyrighted material, raising the risk of dataset misuse. To mitigate this problem, this paper explores \textit{dataset inference}, which aims to detect whether a suspect model $\mathcal{M}$ used a victim dataset $\mathcal{D}$ in training. Previous research tackles dataset inference by aggregating the results of membership inference attacks (MIAs), methods that determine whether individual samples are part of the training dataset. However, restricted by the low accuracy of MIAs, previous research mandates grey-box access to $\mathcal{M}$ to get intermediate outputs (probabilities, loss, perplexity, etc.) for obtaining satisfactory results. This reduces practicality, as LLMs, especially those deployed for profit, have limited incentives to return intermediate outputs. In this paper, we propose a new method of dataset inference with only black-box access to the target model (i.e., assuming only the text-based responses of the target model are available). Our method is enabled by two sets of locally built reference models, one set trained with $\mathcal{D}$ and the other without. By measuring which set of reference models $\mathcal{M}$ is closer to, we determine whether $\mathcal{M}$ used $\mathcal{D}$ for training. Evaluations of real-world LLMs in the wild show that our method offers high accuracy in all settings and presents robustness against bypassing attempts.
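The closer-reference-set decision can be sketched as follows (a toy sketch; the score vectors and the Euclidean-distance choice are our assumptions, while the real method derives features from text-only responses):

```python
import numpy as np

def dataset_inference(target_scores, member_refs, nonmember_refs):
    """Decide membership by which reference-model group the target is
    closer to, using the mean distance over per-prompt score vectors.

    member_refs were trained with the victim dataset D, nonmember_refs
    without; True means D was likely used to train the target.
    """
    d_member = np.mean([np.linalg.norm(target_scores - r) for r in member_refs])
    d_nonmember = np.mean([np.linalg.norm(target_scores - r) for r in nonmember_refs])
    return bool(d_member < d_nonmember)

member_refs = np.array([[0.9, 0.8], [0.85, 0.9]])
nonmember_refs = np.array([[0.2, 0.1], [0.15, 0.2]])
print(dataset_inference(np.array([0.88, 0.84]), member_refs, nonmember_refs))  # True
print(dataset_inference(np.array([0.18, 0.12]), member_refs, nonmember_refs))  # False
```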

Updated: 2025-07-18 19:19:10

Categories: cs.CR

Download: http://arxiv.org/abs/2507.03619v2

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness; however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without leveraging knowledge of what they are or using inputs that elicit them. LAT makes use of the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. Here, we use it to defend against failure modes without examples that elicit them. Specifically, we use LAT to remove backdoors and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.

Updated: 2025-07-18 19:13:18

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2403.05030v5

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on the joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.
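The dilated partitioning of positions can be sketched in one line (our reading of the scheme, not the authors' code):

```python
def dilated_groups(seq_len, num_groups):
    """Partition sequence positions into dilated (strided) groups.

    Positions unmasked in the same denoising step are spaced num_groups
    apart, so no two adjacent positions are ever unmasked together
    (when num_groups > 1), limiting ignored token interactions.
    """
    return [list(range(g, seq_len, num_groups)) for g in range(num_groups)]

print(dilated_groups(8, 4))
# [[0, 4], [1, 5], [2, 6], [3, 7]] -- each step touches far-apart positions
```

With `num_groups` denoising steps covering all `seq_len` positions, the group count directly controls the network-calls-versus-quality trade-off described above.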

Updated: 2025-07-18 19:07:54

Categories: cs.CL,cs.AI,cs.IT,cs.LG,cs.NE,math.IT

Download: http://arxiv.org/abs/2506.19037v2

The Elicitation Game: Evaluating Capability Elicitation Techniques

Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system's capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms -- language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in the MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.

Updated: 2025-07-18 19:01:58

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.02180v3

Quantum-Safe Identity Verification using Relativistic Zero-Knowledge Proof Systems

Identity verification is the process of confirming an individual's claimed identity, which is essential in sectors like finance, healthcare, and online services to ensure security and prevent fraud. However, current password/PIN-based identity solutions are susceptible to phishing or skimming attacks, where malicious intermediaries attempt to steal credentials using fake identification portals. Alikhani et al. [Nature, 2021] began exploring identity verification through graph-coloring-based relativistic zero-knowledge proofs (RZKPs), a key cryptographic primitive that enables a prover to demonstrate knowledge of secret credentials to a verifier without disclosing any information about the secret. Our work advances this field and addresses unresolved issues. From an engineering perspective, we further relax the relativistic constraints from 60 m to 30 m and significantly enhance the stability and scalability of the experimental demonstration of the 2-prover graph-coloring-based RZKP protocol for near-term use cases. At the same time, for long-term security against entangled malicious provers, we propose a modified protocol with comparable computation and communication costs, and we establish an upper bound on the soundness parameter of this modified protocol. Finally, we extend the two-prover, two-verifier setup to a three-prover configuration, demonstrating the security of such relativistic protocols against entangled malicious provers.

Updated: 2025-07-18 18:59:19

Categories: cs.CR,quant-ph

Download: http://arxiv.org/abs/2507.14324v1

FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning

Federated Learning (FL) offers a paradigm for privacy-preserving collaborative AI, but its decentralized nature creates significant vulnerabilities to model poisoning attacks. While numerous static defenses exist, their effectiveness is highly context-dependent, often failing against adaptive adversaries or in heterogeneous data environments. This paper introduces FedStrategist, a novel meta-learning framework that reframes robust aggregation as a real-time, cost-aware control problem. We design a lightweight contextual bandit agent that dynamically selects the optimal aggregation rule from an arsenal of defenses based on real-time diagnostic metrics. Through comprehensive experiments, we demonstrate that no single static rule is universally optimal. We show that our adaptive agent successfully learns superior policies across diverse scenarios, including a "Krum-favorable" environment and against a sophisticated "stealth" adversary designed to neutralize specific diagnostic signals. Critically, we analyze the paradoxical scenario where a non-robust baseline achieves high but compromised accuracy, and demonstrate that our agent learns a conservative policy to prioritize model integrity. Furthermore, we prove the agent's policy is controllable via a single "risk tolerance" parameter, allowing practitioners to explicitly manage the trade-off between performance and security. Our work provides a new, practical, and analyzable approach to creating resilient and intelligent decentralized AI systems.
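The rule-selection loop can be illustrated with a simple epsilon-greedy bandit (a simplified stand-in: the paper's agent is contextual and cost-aware, and all rule names and rewards below are made up):

```python
import random

class AggregatorBandit:
    """Epsilon-greedy selector over candidate aggregation rules.

    The 'context' is ignored here; only running reward averages drive
    selection, which is the bare skeleton of bandit-based rule choice.
    """
    def __init__(self, rules, epsilon=0.1, seed=0):
        self.rules = list(rules)
        self.epsilon = epsilon
        self.counts = {r: 0 for r in self.rules}
        self.values = {r: 0.0 for r in self.rules}
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.rules)          # explore
        return max(self.rules, key=lambda r: self.values[r])  # exploit

    def update(self, rule, reward):
        # incremental running average of observed rewards
        self.counts[rule] += 1
        self.values[rule] += (reward - self.values[rule]) / self.counts[rule]

bandit = AggregatorBandit(["fedavg", "krum", "trimmed_mean"])
rewards = {"fedavg": 0.2, "krum": 0.9, "trimmed_mean": 0.5}  # toy rewards
for rule in rewards:                 # warm start: try each rule once
    bandit.update(rule, rewards[rule])
for _ in range(50):                  # then learn from round feedback
    rule = bandit.select()
    bandit.update(rule, rewards[rule])
best = max(bandit.values, key=bandit.values.get)
print(best)  # krum
```

In the real framework the reward would fold in both model-quality diagnostics and the cost of running the heavier defenses.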

Updated: 2025-07-18 18:53:26

标题: FedStrategist:一种用于联邦学习中自适应和稳健聚合的元学习框架

摘要: 联邦学习(FL)为保护隐私的协作人工智能提供了一种范式,但其去中心化特性使其极易受到模型投毒攻击的影响。虽然存在许多静态防御方法,但它们的有效性高度依赖于上下文,通常无法抵御自适应对手或应对异构数据环境。本文介绍了FedStrategist,这是一个新颖的元学习框架,将稳健聚合重新构想为一个实时的、成本感知的控制问题。我们设计了一个轻量级的上下文赌博机(contextual bandit)代理,根据实时诊断指标,从一系列防御手段中动态选择最优的聚合规则。通过全面的实验,我们证明没有任何单一的静态规则是普遍最优的。我们展示了我们的自适应代理在各种场景下成功学到了更优的策略,包括一个"Krum有利"的环境,以及一个旨在抵消特定诊断信号的精密"隐身"对手。至关重要的是,我们分析了一个非稳健基线获得较高但已被破坏的准确率的矛盾场景,并证明我们的代理学到了一种优先保证模型完整性的保守策略。此外,我们证明了代理的策略可以通过单一的"风险容忍度"参数进行控制,使从业者能够明确地管理性能与安全之间的权衡。我们的工作为创建有弹性且智能的去中心化人工智能系统提供了一种新的、实用的、可分析的方法。

更新时间: 2025-07-18 18:53:26

领域: cs.LG,cs.CR,cs.DC,I.2.11; C.2.4; K.6.5

下载: http://arxiv.org/abs/2507.14322v1
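The contextual-bandit selection loop at the heart of FedStrategist can be sketched in miniature (an illustrative toy, not the authors' code): an epsilon-greedy agent picks one aggregation rule per round from a small arsenal and updates its value estimates from an observed reward. The rule names and scalar "updates" are hypothetical simplifications of real model-weight aggregation.

```python
import random
from statistics import mean, median

def aggregate(updates, rule):
    """Apply one of a small arsenal of aggregation rules to client updates."""
    if rule == "mean":
        return mean(updates)
    if rule == "median":          # robust to an outlier (poisoned) update
        return median(updates)
    if rule == "trimmed_mean":    # drop the extremes, then average
        return mean(sorted(updates)[1:-1])
    raise ValueError(rule)

class BanditSelector:
    """Epsilon-greedy bandit that learns which rule yields the best reward."""
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)   # explore
        return max(self.arms, key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # running mean
```

In a round with a poisoned client, a reward that penalizes distance from the honest consensus would steer such an agent toward the robust rules, mirroring the adaptive behaviour the paper describes.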

Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting task-level experts is often too coarse-grained, as heterogeneous tasks may require different expertise per instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE's instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we show that Symbolic-MoE beats strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute avg. gain of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE generalizes well to unseen tasks and removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.

Updated: 2025-07-18 18:50:23

标题: Symbolic Mixture-of-Experts: 自适应技能路由用于异构推理

摘要: 组合现有的预训练专家LLM是一条有前景的途径,可以可扩展地应对大规模、多样化的任务。然而,选择任务级专家通常过于粗粒度,因为异构任务的每个实例可能需要不同的专业知识。为了实现自适应的实例级预训练LLM专家混合,我们提出了Symbolic-MoE,一个符号化、基于文本且无梯度的专家混合框架。Symbolic-MoE采用细粒度的选择方法,强调技能,例如数学中的代数或生物医学推理中的分子生物学。我们提出了一种基于技能的招募策略,根据各专家LLM的优势,为不同的推理任务动态选择最相关的一组专家。每个被选中的专家随后生成自己的推理,得到来自k个专家的k个输出,再由一个根据整合多样推理输出能力而选出的聚合器将其综合成最终的高质量回答。我们表明,Symbolic-MoE的实例级专家选择大幅提升了性能,但在朴素实现时,由于需要不断加载和卸载模型,可能引入高昂的计算开销。为了解决这个问题,我们实现了一种批处理策略,根据实例被分配的专家对其进行分组,每个模型只加载一次。这使我们能够在1个GPU上集成16个专家模型,其时间成本与之前使用4个GPU的多智能体基线相当甚至更优。通过在多个基准测试(MMLU-Pro、GPQA、AIME和MedMCQA)上的广泛评估,我们展示了Symbolic-MoE击败了GPT4o-mini等强大的LLM以及多智能体方法,相对最佳多智能体基线取得了8.15%的绝对平均增益。此外,Symbolic-MoE能很好地泛化到未见过的任务,并消除了对昂贵的多轮讨论的需求,以更少的计算量优于讨论式基线。

更新时间: 2025-07-18 18:50:23

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.05641v3
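The batch strategy described above, grouping instances by their assigned experts so each model is loaded exactly once, can be sketched as follows (hypothetical function names, not the released Symbolic-MoE code; `load_model` and `generate` stand in for the real loading and inference calls):

```python
from collections import defaultdict

def plan_batches(assignments):
    """Invert an instance -> experts routing into expert -> instances batches,
    so that each expert model needs to be loaded exactly once."""
    batches = defaultdict(list)
    for instance_id, experts in assignments.items():
        for expert in experts:
            batches[expert].append(instance_id)
    return dict(batches)

def run_batched(assignments, load_model, generate):
    """Load each expert once and answer every instance routed to it."""
    outputs = defaultdict(list)   # instance_id -> list of k expert outputs
    for expert, instance_ids in plan_batches(assignments).items():
        model = load_model(expert)             # one load per expert
        for instance_id in instance_ids:
            outputs[instance_id].append(generate(model, instance_id))
    return dict(outputs)
```

With naive per-instance execution the number of loads grows with the number of routed (instance, expert) pairs; here it is bounded by the number of distinct experts, which is the source of the reported GPU-hour savings.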

Manimator: Transforming Research Papers into Visual Explanations

Understanding complex scientific and mathematical concepts, particularly those presented in dense research papers, poses a significant challenge for learners. Dynamic visualizations can greatly enhance comprehension, but creating them manually is time-consuming and requires specialized knowledge and skills. We introduce manimator, an open-source system that leverages Large Language Models to transform research papers and natural language prompts into explanatory animations using the Manim engine. Manimator employs a pipeline where an LLM interprets the input text or research paper PDF to generate a structured scene description outlining key concepts, mathematical formulas, and visual elements and another LLM translates this description into executable Manim Python code. We discuss its potential as an educational tool for rapidly creating engaging visual explanations for complex STEM topics, democratizing the creation of high-quality educational content.

Updated: 2025-07-18 18:28:26

标题: Manimator:将研究论文转化为视觉解释

摘要: 理解复杂的科学和数学概念,特别是那些在内容密集的研究论文中呈现的概念,对学习者来说是一个重大挑战。动态可视化可以极大地增强理解,但手动创建它们非常耗时,且需要专业知识和技能。我们介绍了manimator,这是一个开源系统,利用大型语言模型,通过Manim引擎将研究论文和自然语言提示转化为解释性动画。manimator采用流水线方式:一个LLM解析输入文本或研究论文PDF,生成一个结构化的场景描述,概述关键概念、数学公式和视觉元素;另一个LLM将该描述翻译成可执行的Manim Python代码。我们讨论了它作为教育工具的潜力,可为复杂的STEM主题快速创建有吸引力的视觉解释,使高质量教育内容的创作民主化。

更新时间: 2025-07-18 18:28:26

领域: cs.AI,cs.MM

下载: http://arxiv.org/abs/2507.14306v1

Age of Information Minimization in UAV-Enabled Integrated Sensing and Communication Systems

Unmanned aerial vehicles (UAVs) equipped with integrated sensing and communication (ISAC) capabilities are envisioned to play a pivotal role in future wireless networks due to their enhanced flexibility and efficiency. However, jointly optimizing UAV trajectory planning, multi-user communication, and target sensing under stringent resource constraints and time-critical conditions remains a significant challenge. To address this, we propose an Age of Information (AoI)-centric UAV-ISAC system that simultaneously performs target sensing and serves multiple ground users, emphasizing information freshness as the core performance metric. We formulate a long-term average AoI minimization problem that jointly optimizes the UAV's flight trajectory and beamforming. To tackle the high-dimensional, non-convexity of this problem, we develop a deep reinforcement learning (DRL)-based algorithm capable of providing real-time decisions on UAV movement and beamforming for both radar sensing and multi-user communication. Specifically, a Kalman filter is employed for accurate target state prediction, regularized zero-forcing is utilized to mitigate inter-user interference, and the Soft Actor-Critic algorithm is applied for training the DRL agent on continuous actions. The proposed framework adaptively balances the trade-offs between sensing accuracy and communication quality. Extensive simulation results demonstrate that our proposed method consistently achieves lower average AoI compared to baseline approaches.

Updated: 2025-07-18 18:17:09

标题: 无人机使能的集成感知与通信系统中的信息年龄最小化

摘要: 具备集成感知与通信(ISAC)功能的无人机(UAV)因其更高的灵活性和效率,被认为将在未来无线网络中发挥关键作用。然而,在严格的资源限制和时间关键条件下,共同优化UAV轨迹规划、多用户通信和目标感知仍然是一个重大挑战。为了解决这个问题,我们提出了一种以信息年龄(AoI)为中心的UAV-ISAC系统,在同时进行目标感知并为多个地面用户提供服务的过程中,将信息新鲜度作为核心性能指标。我们构建了一个长期平均AoI最小化问题,联合优化UAV的飞行轨迹和波束成形。为了解决该问题的高维、非凸特性,我们开发了一种基于深度强化学习(DRL)的算法,能够针对雷达感知和多用户通信,实时决策UAV的移动和波束成形。具体而言,采用卡尔曼滤波器进行准确的目标状态预测,采用正则化迫零来减轻用户间干扰,并采用Soft Actor-Critic算法在连续动作空间上训练DRL代理。所提出的框架自适应地平衡了感知精度与通信质量之间的权衡。大量仿真结果表明,与基线方法相比,我们提出的方法始终实现了更低的平均AoI。

更新时间: 2025-07-18 18:17:09

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.14299v1

In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding

Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations: first, they rely on paired data from only a few chart types, limiting generalization to a wide range of chart types; second, they lack targeted pre-training for chart-data alignment, which hampers the model's understanding of the underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark for evaluating not only question answering at different levels but also understanding of the underlying data. Experimental results demonstrate that ChartScope significantly enhances comprehension on a wide range of chart types. The code and data are available at https://davidhalladay.github.io/chartscope_demo.

Updated: 2025-07-18 18:15:09

标题: 深度和广度:为全面图表理解定制的预训练多模态语言模型

摘要: 最近一些为特定领域任务定制大型视觉语言模型(LVLMs)的方法在科学图表理解方面取得了令人期待的结果。然而,现有方法面临两个主要限制:首先,它们仅依赖少数几种图表类型的配对数据,限制了对广泛图表类型的泛化能力;其次,它们缺乏针对图表与数据对齐的有针对性预训练,这妨碍了模型对底层数据的理解。在本文中,我们介绍了ChartScope,一个针对多种图表类型的深入理解而优化的LVLM。我们提出了一个高效的数据生成流水线,为广泛的图表类型合成配对数据,并提出了一种新颖的双路径(Dual-Path)训练策略,通过引入对底层数据的推理,使模型既能简洁地捕捉关键数据细节,又能保持强大的推理能力。最后,我们建立了ChartDQA,一个新基准,不仅评估不同层次的问答能力,还评估对底层数据的理解。实验结果表明,ChartScope显著提升了对广泛图表类型的理解能力。代码和数据可在https://davidhalladay.github.io/chartscope_demo获取。

更新时间: 2025-07-18 18:15:09

领域: cs.CL,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.14298v1

A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO keeps single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback

Updated: 2025-07-18 18:07:38

标题: 一个简单的“再试一次”可以引发多轮LLM推理

摘要: 多轮问题解决要求大型推理模型(LRMs)反思自身的推理并根据反馈进行修正,这一能力至关重要但颇具挑战性。现有的强化学习(RL)方法以带可验证奖励的单轮范式训练大型推理模型。然而,我们观察到,使用现有RL范式训练的模型往往丧失跨多轮解决问题的能力,难以根据上下文反馈修改答案,导致重复的回应。我们提出一个问题:LRMs能否学会在多轮上下文中反思自己的答案?在这项工作中,我们发现,仅在错误答案后使用一元反馈(例如"让我们再试一次")进行多轮RL训练,可以同时提升单轮性能和多轮推理能力。我们引入了一元反馈作为观察(UFO)用于强化学习,在迭代问题求解过程中使用最少但常见的一元用户反馈。它可以轻松应用于现有的单轮RL训练设置。实验结果表明,使用UFO进行RL训练在保持单轮性能的同时,将多轮推理准确率提高了最多14%,使语言模型在多轮问题求解中能够更好地响应反馈。为了进一步减少得到正确答案所需的轮数,并在出错时鼓励多样化的推理,我们设计了奖励结构,引导模型在每一轮产生谨慎而深思熟虑的答案。代码:https://github.com/lichengliu03/unary-feedback

更新时间: 2025-07-18 18:07:38

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.14295v1
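The interaction pattern UFO trains on, reattempting after nothing more than a unary "Let's try again", can be sketched as a plain multi-turn driver (illustrative only; `model` and `check` are hypothetical stand-ins for the LLM and the answer verifier):

```python
def multi_turn_solve(model, question, check, max_turns=5, feedback="Let's try again"):
    """Iteratively query `model` on a growing transcript, appending only the
    unary feedback string after each wrong answer (no hints about why)."""
    transcript = [question]
    for turn in range(1, max_turns + 1):
        answer = model(transcript)
        transcript.append(answer)
        if check(answer):
            return answer, turn          # solved at this turn
        transcript.append(feedback)      # minimal unary signal
    return None, max_turns               # gave up
```

The point of the paper is that models trained single-turn tend to repeat themselves inside this loop, whereas training with the unary signal in the transcript teaches them to produce a genuinely revised answer each turn.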

WebGuard: Building a Generalizable Guardrail for Web Agents

The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in flagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.

Updated: 2025-07-18 18:06:27

标题: WebGuard:构建一个适用于Web代理的通用保护栏

摘要: 由大型语言模型(LLMs)驱动的自主网络代理的快速发展极大提高了效率,但也暴露出执行意外或有害操作的前沿风险。这种情况凸显了对有效安全措施的迫切需求,类似于对人类用户的访问控制。为了应对这一关键挑战,我们引入了WebGuard,这是第一个旨在支持网络代理操作风险评估、并促进面向真实在线环境的防护措施开发的综合数据集。为此,WebGuard专门关注预测改变状态的操作的结果,包含来自22个不同领域193个网站的4,939个人工标注操作,其中包括经常被忽视的长尾网站。这些操作使用一种新颖的三级风险模式进行分类:SAFE(安全)、LOW(低风险)和HIGH(高风险)。数据集包含指定的训练和测试划分,以支持在多样化泛化设置下的评估。我们的初步评估揭示了一个令人担忧的缺陷:即使是前沿LLM在预测操作结果方面的准确率也不到60%,标记高风险操作的召回率也不到60%,凸显了在没有专门防护措施的情况下部署当前一代代理的风险。因此,我们研究了使用WebGuard微调专门的防护模型。我们在多种泛化设置下进行了全面评估,发现经过微调的Qwen2.5VL-7B模型带来了显著的性能提升,将准确率从37%提高到80%,将高风险操作的召回率从20%提高到76%。尽管有这些改进,其性能仍未达到高风险部署所需的可靠性,在这种部署中,防护措施必须接近完美的准确率和召回率。

更新时间: 2025-07-18 18:06:27

领域: cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2507.14293v1
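The two headline numbers reported above, overall accuracy and HIGH-risk recall, can be computed from a guardrail's predictions as follows (a simple sketch assuming the paper's three-tier SAFE/LOW/HIGH labels):

```python
def guardrail_metrics(y_true, y_pred, high="HIGH"):
    """Overall accuracy plus recall on HIGH-risk actions, the class where
    a miss is most costly for a web-agent guardrail."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    high_total = sum(t == high for t in y_true)
    high_hit = sum(t == high and p == high for t, p in zip(y_true, y_pred))
    return {
        "accuracy": correct / len(y_true),
        "high_recall": high_hit / high_total if high_total else float("nan"),
    }
```

Tracking HIGH-risk recall separately from accuracy is what exposes the failure mode above: a model can look reasonably accurate overall while still letting most dangerous actions through.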

Toward Temporal Causal Representation Learning with Tensor Decomposition

Temporal causal representation learning is a powerful tool for uncovering complex patterns in observational studies, which are often represented as low-dimensional time series. However, in many real-world applications, data are high-dimensional with varying input lengths and naturally take the form of irregular tensors. To analyze such data, irregular tensor decomposition is critical for extracting meaningful clusters that capture essential information. In this paper, we focus on modeling causal representation learning based on the transformed information. First, we present a novel causal formulation for a set of latent clusters. We then propose CaRTeD, a joint learning framework that integrates temporal causal representation learning with irregular tensor decomposition. Notably, our framework provides a blueprint for downstream tasks using the learned tensor factors, such as modeling latent structures and extracting causal information, and offers a more flexible regularization design to enhance tensor decomposition. Theoretically, we show that our algorithm converges to a stationary point. More importantly, our results fill the gap in theoretical guarantees for the convergence of state-of-the-art irregular tensor decomposition. Experimental results on synthetic and real-world electronic health record (EHR) datasets (MIMIC-III), with extensive benchmarks from both phenotyping and network recovery perspectives, demonstrate that our proposed method outperforms state-of-the-art techniques and enhances the explainability of causal representations.

Updated: 2025-07-18 17:55:42

标题: 朝向使用张量分解进行时间因果表示学习

摘要: 时间因果表示学习是揭示观测研究中复杂模式的强大工具,这些数据通常表示为低维时间序列。然而,在许多真实世界的应用中,数据是高维的,输入长度各不相同,并且自然地呈现为不规则张量。为了分析此类数据,不规则张量分解对于提取能够捕获关键信息的有意义的簇至关重要。在本文中,我们专注于基于变换后信息的因果表示学习建模。首先,我们为一组潜在簇提出了一种新颖的因果表述。随后,我们提出了CaRTeD,一个将时间因果表示学习与不规则张量分解相结合的联合学习框架。值得注意的是,我们的框架为使用学到的张量因子进行下游任务(例如建模潜在结构和提取因果信息)提供了蓝图,并提供了更灵活的正则化设计来增强张量分解。在理论上,我们证明了我们的算法收敛到一个驻点。更重要的是,我们的结果填补了最先进的不规则张量分解在收敛性理论保证方面的空白。在合成和真实世界电子病历(EHR)数据集(MIMIC-III)上的实验结果,以及从表型分析和网络恢复两个角度进行的广泛基准测试,表明我们提出的方法优于最先进的技术,并增强了因果表示的可解释性。

更新时间: 2025-07-18 17:55:42

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2507.14126v1

Kolmogorov Arnold Networks (KANs) for Imbalanced Data -- An Empirical Perspective

Kolmogorov Arnold Networks (KANs) are a recent architectural advancement in neural computation that offers a mathematically grounded alternative to standard neural networks. This study presents an empirical evaluation of KANs in the context of class-imbalanced classification, using ten benchmark datasets. We observe that KANs can inherently perform well on raw imbalanced data, more effectively than Multi-Layer Perceptrons (MLPs), without any resampling strategy. However, conventional imbalance strategies fundamentally conflict with KANs' mathematical structure, as resampling and focal-loss implementations significantly degrade KANs' performance while marginally benefiting MLPs. Crucially, KANs suffer from prohibitive computational costs without proportional performance gains. Statistical validation confirms that MLPs with imbalance techniques achieve equivalence with KANs (|d| < 0.08 across metrics) at minimal resource cost. These findings reveal that KANs represent a specialized solution for raw imbalanced data where resources permit, but their severe performance-resource trade-offs and incompatibility with standard resampling techniques currently limit practical deployment. We identify critical research priorities: developing KAN-specific architectural modifications for imbalance learning, optimizing computational efficiency, and theoretically reconciling KANs' conflict with data augmentation. This work establishes foundational insights for next-generation KAN architectures in imbalanced classification scenarios.

Updated: 2025-07-18 17:50:51

标题: 科尔莫戈洛夫阿诺德网络(KANs)用于不平衡数据--实证视角

摘要: 科尔莫哥洛夫阿诺德网络(KANs)是神经计算中的最新架构进展,为标准神经网络提供了一个具有数学基础的替代方案。本研究在十个基准数据集上,对KANs在类别不平衡分类场景下进行了实证评估。我们观察到,KANs在原始不平衡数据上天然地比多层感知器(MLPs)表现更好,而无需任何重采样策略。然而,传统的不平衡策略与KANs的数学结构存在根本冲突:重采样和焦点损失的实现显著降低了KANs的性能,却对MLPs略有帮助。关键是,KANs面临着高昂的计算成本,却没有相应的性能增益。统计验证证实,采用不平衡技术的MLPs以极低的资源成本实现了与KANs的等效性(各项指标上|d| < 0.08)。这些发现表明,KANs是在资源允许时针对原始不平衡数据的一种专门解决方案,但其严重的性能-资源权衡以及与标准重采样技术的不兼容性,目前限制了实际部署。我们确定了关键的研究重点:为不平衡学习开发KAN特定的架构修改、优化计算效率,以及在理论上调和它们与数据增强之间的冲突。这项工作为不平衡分类场景中的下一代KAN架构奠定了基础性洞察。

更新时间: 2025-07-18 17:50:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.14121v1

Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning

Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://anonymous.4open.science/r/DiZO-E86D.

Updated: 2025-07-18 17:50:04

标题: 差异中的和谐:向快速、准确、且内存高效的零阶LLM微调迈进

摘要: 大型语言模型(LLMs)在各种任务中表现卓越,但标准的一阶(FO)微调需要大量内存,显著限制了实际部署。最近,零阶(ZO)优化作为一种有前途的节约内存的训练范式脱颖而出,避免了向后传递,仅依赖向前传递进行梯度估计,使其在资源受限的情况下备受青睐。然而,ZO方法在收敛速度和准确性方面远远落后于FO方法。为了填补这一差距,我们引入了一种新颖的基于层间散度分析的方法,揭示了FO和ZO优化的不同更新模式。为了从研究结果中模拟FO方法的学习能力,我们提出了基于散度驱动的零阶(DiZO)优化。DiZO通过将投影结合到ZO更新中进行散度驱动的层适应,生成准确缩放到层间个别优化需求的各种大小更新。我们的结果表明,DiZO显著减少了达到收敛所需的迭代次数,而不会牺牲吞吐量,在各种数据集上将训练GPU小时缩减了高达48%。此外,DiZO在下游任务中始终优于代表性的ZO基线,包括在RoBERTa-large、OPT系列和Llama系列的微调中,有些情况下甚至超越了内存密集型的FO微调。我们的代码发布在https://anonymous.4open.science/r/DiZO-E86D。

更新时间: 2025-07-18 17:50:04

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.03304v2
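The forward-pass-only gradient estimation that ZO methods such as DiZO build on can be sketched with an SPSA-style two-point estimator (a generic textbook sketch, not DiZO itself, which additionally applies layer-wise projections to these updates):

```python
import random

def zo_gradient(loss, params, eps=1e-3, seed=0):
    """Zeroth-order gradient estimate from two forward passes:
    perturb all parameters along a random +/-1 direction z, then estimate
    grad ~= [L(theta + eps*z) - L(theta - eps*z)] / (2*eps) * z."""
    rng = random.Random(seed)
    z = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = loss([p + eps * zi for p, zi in zip(params, z)])
    minus = loss([p - eps * zi for p, zi in zip(params, z)])
    scale = (plus - minus) / (2 * eps)
    return [scale * zi for zi in z]

def zo_step(loss, params, lr=0.1, **kw):
    """One SGD-like step using only the zeroth-order estimate (no backprop)."""
    grad = zo_gradient(loss, params, **kw)
    return [p - lr * gi for p, gi in zip(params, grad)]
```

Because only two forward passes are needed per step, no activations or optimizer state for backpropagation are kept in memory, which is the memory advantage the abstract refers to; the cost is a noisier gradient, the convergence gap DiZO targets.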

NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.

Updated: 2025-07-18 17:50:00

标题: 无需人工:自主高质量图像编辑三元组挖掘

摘要: 最近生成建模的进展使得图像编辑助手能够遵循自然语言指令,而无需额外的用户输入。它们的监督训练需要数百万个三元组:原始图像、指令、编辑后的图像。然而,挖掘像素级精确的示例很困难。每次编辑必须仅影响提示指定的区域,保持风格一致性,遵守物理合理性,并保持视觉吸引力。缺乏稳健的自动编辑质量指标阻碍了可靠的大规模自动化。我们提出了一个自动化、模块化的流水线,可以跨领域、分辨率、指令复杂度和风格挖掘高保真三元组。该系统建立在公开的生成模型之上,无需人工干预即可运行,并使用经过任务调优的Gemini验证器直接为指令遵循度和美感打分,消除了对分割或定位(grounding)模型的需求。反演和组合式自举将挖掘出的数据集扩大了约2.2倍,实现了大规模高保真训练数据。通过自动化最重复的标注步骤,该方法可以在无需人工标注的情况下实现新的训练规模。为了使这一资源密集型领域的研究民主化,我们发布了NHR-Edit:一个包含358k高质量三元组的开放数据集。在最大规模的跨数据集评估中,它超过了所有公开的替代方案。我们还发布了Bagel-NHR-Edit,一个开源的微调Bagel模型,它在我们的实验中达到了最先进的指标。

更新时间: 2025-07-18 17:50:00

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.14119v1

An Adversarial-Driven Experimental Study on Deep Learning for RF Fingerprinting

Radio frequency (RF) fingerprinting, which extracts unique hardware imperfections of radio devices, has emerged as a promising physical-layer device identification mechanism in zero trust architectures and beyond 5G networks. In particular, deep learning (DL) methods have demonstrated state-of-the-art performance in this domain. However, existing approaches have primarily focused on enhancing system robustness against temporal and spatial variations in wireless environments, while the security vulnerabilities of these DL-based approaches have often been overlooked. In this work, we systematically investigate the security risks of DL-based RF fingerprinting systems through an adversarial-driven experimental analysis. We observe a consistent misclassification behavior for DL models under domain shifts, where a device is frequently misclassified as another specific one. Our analysis based on extensive real-world experiments demonstrates that this behavior can be exploited as an effective backdoor to enable external attackers to intrude into the system. Furthermore, we show that training DL models on raw received signals causes the models to entangle RF fingerprints with environmental and signal-pattern features, creating additional attack vectors that cannot be mitigated solely through post-processing security methods such as confidence thresholds.

Updated: 2025-07-18 17:42:20

标题: 一个以对抗为驱动的关于深度学习在射频指纹识别方面的实验研究

摘要: 射频(RF)指纹技术提取无线设备独特的硬件缺陷,已经成为零信任架构和后5G(beyond 5G)网络中一种有前途的物理层设备识别机制。特别是,深度学习(DL)方法在这一领域展示了最先进的性能。然而,现有方法主要集中在提高系统对无线环境中时间和空间变化的鲁棒性,而这些基于DL的方法的安全漏洞往往被忽视。在本研究中,我们通过对抗驱动的实验分析,系统地调查了基于DL的RF指纹系统的安全风险。我们观察到DL模型在域偏移下表现出一致的错误分类行为:某个设备经常被错误地识别为另一个特定设备。我们基于大量真实世界实验的分析表明,这种行为可以被利用作为一种有效的后门,使外部攻击者能够侵入系统。此外,我们还展示了在原始接收信号上训练DL模型会使模型将RF指纹与环境和信号模式特征纠缠在一起,从而产生额外的攻击向量,这些攻击向量无法仅通过置信度阈值等后处理安全方法来缓解。

更新时间: 2025-07-18 17:42:20

领域: cs.CR,cs.LG,eess.SP

下载: http://arxiv.org/abs/2507.14109v1
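The "consistent misclassification" behaviour described above can be surfaced from raw predictions by checking, per true device, which identity it is most often mapped to (an illustrative analysis sketch; no real RF data or models are involved):

```python
from collections import Counter

def misclassification_pairs(y_true, y_pred):
    """For each true device, find the identity it is most often predicted as.
    A stable wrong pairing under domain shift is the backdoor-like behaviour
    described above: device A reliably accepted as device B."""
    by_device = {}
    for t, p in zip(y_true, y_pred):
        by_device.setdefault(t, Counter())[p] += 1
    pairs = {}
    for device, preds in by_device.items():
        top, _ = preds.most_common(1)[0]
        if top != device:
            pairs[device] = top   # consistently confused with another device
    return pairs
```

If such a stable pair exists, an attacker transmitting device A's signal in the shifted environment is predictably authenticated as device B, which is why a confidence threshold alone cannot close the hole.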

Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment

Bridge maintenance and safety are essential for transportation authorities, and Non-Destructive Evaluation (NDE) techniques are critical to assessing structural integrity. However, interpreting NDE data can be time-consuming and requires expertise, potentially delaying decision-making. Recent advancements in Large Language Models (LLMs) offer new ways to automate and improve this analysis. This pilot study introduces a holistic assessment of LLM capabilities for interpreting NDE contour maps and demonstrates the effectiveness of LLMs in providing detailed bridge condition analyses. It establishes a framework for integrating LLMs into bridge inspection workflows, indicating that LLM-assisted analysis can enhance efficiency without compromising accuracy. In this study, several LLMs are explored with prompts specifically designed to enhance the quality of image descriptions, which are applied to interpret five different NDE contour maps obtained through technologies for assessing bridge conditions. Each LLM model is evaluated based on its ability to produce detailed descriptions, identify defects, provide actionable recommendations, and demonstrate overall accuracy. The research indicates that four of the nine models provide better image descriptions, effectively covering a wide range of topics related to the bridge's condition. The outputs from these four models are summarized using five different LLMs to form a comprehensive overview of the bridge. Notably, LLMs ChatGPT-4 and Claude 3.5 Sonnet generate more effective summaries. The findings suggest that LLMs have the potential to significantly improve efficiency and accuracy. This pilot study presents an innovative approach that leverages LLMs for image captioning in parallel and summarization, enabling faster decision-making in bridge maintenance and enhancing infrastructure management and safety assessments.

Updated: 2025-07-18 17:39:03

标题: 大型语言模型用于桥梁状况评估的非破坏性评估轮廓图的自动解读

摘要: 桥梁的维护和安全对交通管理部门至关重要,非破坏性评估(NDE)技术对于评估结构完整性不可或缺。然而,解读NDE数据可能耗时且需要专业知识,可能会延迟决策。最近大型语言模型(LLMs)的进展为自动化和改进这种分析提供了新途径。这项试点研究对LLM解读NDE等值线图的能力进行了全面评估,并展示了LLM在提供详细桥梁状况分析方面的有效性。它建立了一个将LLM整合到桥梁检查工作流程中的框架,表明LLM辅助分析可以在不影响准确性的前提下提高效率。在这项研究中,我们探索了多种LLM,并使用专门设计的提示来提升图像描述的质量,将其应用于解读通过多种桥梁状况评估技术获得的五幅不同的NDE等值线图。每个LLM模型根据其生成详细描述、识别缺陷、提供可操作建议以及整体准确性的能力进行评估。研究表明,九个模型中有四个提供了更好的图像描述,有效覆盖了与桥梁状况相关的广泛主题。这四个模型的输出再由五种不同的LLM进行总结,形成对桥梁的综合概述。值得注意的是,ChatGPT-4 和 Claude 3.5 Sonnet 生成了更有效的摘要。研究结果表明,LLM有潜力显著提高效率和准确性。这项试点研究提出了一种创新方法,利用LLM并行进行图像描述和总结,从而加快桥梁维护中的决策,并增强基础设施管理和安全评估。

更新时间: 2025-07-18 17:39:03

领域: cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.14107v1

Generative AI-Driven High-Fidelity Human Motion Simulation

Human motion simulation (HMS) supports cost-effective evaluation of worker behavior, safety, and productivity in industrial tasks. However, existing methods often suffer from low motion fidelity. This study introduces Generative-AI-Enabled HMS (G-AI-HMS), which integrates text-to-text and text-to-motion models to enhance simulation quality for physical tasks. G-AI-HMS tackles two key challenges: (1) translating task descriptions into motion-aware language using Large Language Models aligned with MotionGPT's training vocabulary, and (2) validating AI-enhanced motions against real human movements using computer vision. Posture estimation algorithms are applied to real-time videos to extract joint landmarks, and motion similarity metrics are used to compare them with AI-enhanced sequences. In a case study involving eight tasks, the AI-enhanced motions showed lower error than human-created descriptions in most scenarios, performing better in six tasks based on spatial accuracy, four tasks based on alignment after pose normalization, and seven tasks based on overall temporal similarity. Statistical analysis showed that AI-enhanced prompts significantly (p < 0.0001) reduced joint error and temporal misalignment while retaining comparable posture accuracy.

Updated: 2025-07-18 17:24:50

标题: 生成式人工智能驱动的高保真人体动作模拟

摘要: 人体运动模拟(HMS)支持对工业任务中工人行为、安全和生产力进行成本效益评估。然而,现有方法往往受到运动保真度较低的影响。本研究介绍了生成式人工智能支持的HMS(G-AI-HMS),该方法整合了文本到文本和文本到运动模型,以提高物理任务的模拟质量。G-AI-HMS解决了两个关键挑战:(1)使用与MotionGPT的训练词汇对齐的大型语言模型将任务描述转换为运动感知语言,(2)使用计算机视觉验证AI增强的动作与真实人类运动之间的相似性。姿势估计算法应用于实时视频中,以提取关节标志点,并使用运动相似度指标将它们与AI增强序列进行比较。在涉及八项任务的案例研究中,AI增强的动作在大多数场景中显示出比人类创建描述更低的错误,在基于空间精度的六项任务中表现更好,在姿势归一化后对齐的四项任务中表现更好,并在整体时间相似性的七项任务中表现更好。统计分析显示,AI增强的提示明显(p <0.0001)降低了关节错误和时间不对齐,同时保持可比的姿势准确性。

更新时间: 2025-07-18 17:24:50

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.14097v1
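A minimal version of the joint-landmark comparison described above, a mean per-joint position error computed optionally after a simple root-centred pose normalization, might look like this (an illustrative sketch; the paper's exact similarity metrics are not specified here):

```python
import math

def mpjpe(seq_a, seq_b):
    """Mean per-joint position error between two aligned motion sequences.
    Each sequence is a list of frames; each frame a list of (x, y) joint
    landmarks, e.g. as produced by a pose-estimation model."""
    assert len(seq_a) == len(seq_b)
    total, count = 0.0, 0
    for frame_a, frame_b in zip(seq_a, seq_b):
        for (xa, ya), (xb, yb) in zip(frame_a, frame_b):
            total += math.hypot(xa - xb, ya - yb)
            count += 1
    return total / count

def center_on_root(frame, root=0):
    """Pose normalization: translate a frame so the root joint sits at the origin,
    removing global position before comparing joint configurations."""
    rx, ry = frame[root]
    return [(x - rx, y - ry) for x, y in frame]
```

Comparing errors before and after `center_on_root` mirrors the paper's distinction between raw spatial accuracy and alignment after pose normalization.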

Multi-Centre Validation of a Deep Learning Model for Scoliosis Assessment

Scoliosis affects roughly 2 to 4 percent of adolescents, and treatment decisions depend on precise Cobb angle measurement. Manual assessment is time-consuming and subject to inter-observer variation. We conducted a retrospective, multi-centre evaluation of a fully automated deep learning software (Carebot AI Bones, Spine Measurement functionality; Carebot s.r.o.) on 103 standing anteroposterior whole-spine radiographs collected from ten hospitals. Two musculoskeletal radiologists independently measured each study and served as reference readers. Agreement between the AI and each radiologist was assessed with Bland-Altman analysis, mean absolute error (MAE), root mean squared error (RMSE), Pearson correlation coefficient, and Cohen's kappa for four-grade severity classification. Against Radiologist 1 the AI achieved an MAE of 3.89 degrees (RMSE 4.77 degrees) with a bias of 0.70 degrees and limits of agreement from -8.59 to +9.99 degrees. Against Radiologist 2 the AI achieved an MAE of 3.90 degrees (RMSE 5.68 degrees) with a bias of 2.14 degrees and limits from -8.23 to +12.50 degrees. Pearson correlations were r = 0.906 and r = 0.880 (inter-reader r = 0.928), while Cohen's kappa for severity grading reached 0.51 and 0.64 (inter-reader kappa 0.59). These results demonstrate that the proposed software reproduces expert-level Cobb angle measurements and categorical grading across multiple centres, suggesting its utility for streamlining scoliosis reporting and triage in clinical workflows.

Updated: 2025-07-18 17:21:53

标题: 多中心验证脊柱侧弯评估深度学习模型

摘要: 脊柱侧弯大约影响2%至4%的青少年,治疗决策取决于准确的Cobb角测量。手动评估耗时且存在观察者间差异。我们对从十家医院收集的103张站立位前后位全脊柱X线片进行了一项回顾性、多中心评估,使用了完全自动化的深度学习软件(Carebot AI Bones,脊柱测量功能;Carebot s.r.o.)。两名肌肉骨骼放射科医师独立测量了每项检查并担任参考读者。通过Bland-Altman分析、平均绝对误差(MAE)、均方根误差(RMSE)、皮尔逊相关系数以及四级严重程度分类的Cohen卡帕系数,评估了AI与每位放射科医师之间的一致性。与放射科医师1相比,AI的MAE为3.89度(RMSE为4.77度),偏差为0.70度,一致性界限为-8.59至+9.99度。与放射科医师2相比,AI的MAE为3.90度(RMSE为5.68度),偏差为2.14度,一致性界限为-8.23至+12.50度。皮尔逊相关系数分别为r = 0.906和r = 0.880(读者间r = 0.928),严重程度分级的Cohen卡帕系数达到0.51和0.64(读者间卡帕0.59)。这些结果表明,该软件能够在多个中心复现专家级的Cobb角测量和分类分级,表明其可用于简化临床工作流程中的脊柱侧弯报告和分诊。

更新时间: 2025-07-18 17:21:53

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.14093v1
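The agreement statistics used in this evaluation (bias, 95% limits of agreement, MAE, RMSE) are standard and easy to reproduce; a hedged sketch on made-up angle pairs, not the study's data:

```python
import math

def agreement_stats(ai, reader):
    """Bland-Altman style agreement between AI and reference Cobb angles:
    bias (mean difference), 95% limits of agreement, MAE and RMSE."""
    diffs = [a - r for a, r in zip(ai, reader)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))  # sample SD
    return {
        "bias": bias,
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),   # 95% limits of agreement
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
    }
```

The reported figures above (e.g. bias 0.70 degrees with limits -8.59 to +9.99) are exactly this kind of output computed over the 103 radiographs.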

Learning to Reason at the Frontier of Learnability

Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts - meaning they are already learned - or by none - providing no meaningful training signal. To address this, we adapt a method from the reinforcement learning literature - sampling for learnability - and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning with LLMs.

Updated: 2025-07-18 17:21:27

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2502.12272v5

The Emotion-Memory Link: Do Memorability Annotations Matter for Intelligent Systems?

Humans have a selective memory, remembering relevant episodes and forgetting less relevant information. Awareness of which events are memorable to a user could help intelligent systems build more accurate user models, especially for applications such as meeting support systems, memory augmentation, and meeting summarisation. Emotion recognition has been widely studied, since emotions are thought to signal moments of high personal relevance to users. The emotional experience of situations and their memorability have traditionally been considered closely tied to one another: moments that are experienced as highly emotional are considered to also be highly memorable. This relationship suggests that emotional annotations could serve as proxies for memorability. However, existing emotion recognition systems rely heavily on third-party annotations, which may not accurately represent the first-person experience of emotional relevance and memorability. This is why, in this study, we empirically examine the relationship between perceived group emotions (Pleasure-Arousal) and group memorability in the context of conversational interactions. Our investigation involves continuous time-based annotations of both emotions and memorability in dynamic, unstructured group settings, approximating the conditions of real-world conversational AI applications such as online meeting support systems. Our results show that the observed relationship between affect and memorability annotations cannot be reliably distinguished from what would be expected under random chance. We discuss the implications of this surprising finding for the development and application of Affective Computing technology. In addition, we contextualise our findings within broader discourses in Affective Computing and point out important targets for future research efforts.

Updated: 2025-07-18 17:06:34

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2507.14084v1

DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits

Progress notes are among the most clinically meaningful artifacts in an Electronic Health Record (EHR), offering temporally grounded insights into a patient's evolving condition, treatments, and care decisions. Despite their importance, they are severely underrepresented in large-scale EHR datasets. For instance, in the widely used Medical Information Mart for Intensive Care III (MIMIC-III) dataset, only about $8.56\%$ of hospital visits include progress notes, leaving gaps in longitudinal patient narratives. In contrast, the dataset contains a diverse array of other note types, each capturing different aspects of care. We present DENSE (Documenting Evolving Progress Notes from Scattered Evidence), a system designed to align with clinical documentation workflows by simulating how physicians reference past encounters while drafting progress notes. The system introduces a fine-grained note categorization and a temporal alignment mechanism that organizes heterogeneous notes across visits into structured, chronological inputs. At its core, DENSE leverages a clinically informed retrieval strategy to identify temporally and semantically relevant content from both current and prior visits. This retrieved evidence is used to prompt a large language model (LLM) to generate clinically coherent and temporally aware progress notes. We evaluate DENSE on a curated cohort of patients with multiple visits and complete progress note documentation. The generated notes demonstrate strong longitudinal fidelity, achieving a temporal alignment ratio of $1.089$, surpassing the continuity observed in original notes. By restoring narrative coherence across fragmented documentation, our system supports improved downstream tasks such as summarization, predictive modeling, and clinical decision support, offering a scalable solution for LLM-driven note synthesis in real-world healthcare settings.

Updated: 2025-07-18 17:00:27

Domains: cs.CL,cs.AI,cs.IR,cs.LG

Download: http://arxiv.org/abs/2507.14079v1

Glucose-ML: A collection of longitudinal diabetes datasets for development of robust AI solutions

Artificial intelligence (AI) algorithms are a critical part of state-of-the-art digital health technology for diabetes management. Yet, limited access to large, high-quality datasets creates barriers that impede development of robust AI solutions. To accelerate development of transparent, reproducible, and robust AI solutions, we present Glucose-ML, a collection of 10 publicly available diabetes datasets released within the last 7 years (i.e., 2018 - 2025). The Glucose-ML collection comprises over 300,000 days of continuous glucose monitor (CGM) data, with a total of 38 million glucose samples collected from 2500+ people across 4 countries. Participants include persons living with type 1 diabetes, type 2 diabetes, prediabetes, and no diabetes. To support researchers and innovators in using this rich collection of diabetes datasets, we present a comparative analysis to guide algorithm developers with data selection. Additionally, we conduct a case study for the task of blood glucose prediction - one of the most common AI tasks within the field. Through this case study, we provide a benchmark for short-term blood glucose prediction across all 10 publicly available diabetes datasets within the Glucose-ML collection. We show that the same algorithm can have significantly different prediction results when developed/evaluated with different datasets. Findings from this study are then used to inform recommendations for developing robust AI solutions within the diabetes or broader health domain. We provide direct links to each longitudinal diabetes dataset in the Glucose-ML collection and openly provide our code.
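As a minimal illustration of the short-term blood glucose prediction task benchmarked above, a common naive baseline is last-value persistence: predict that glucose stays at its most recent reading. Any learned model should beat this. The snippet is a generic sketch with made-up values and does not use the Glucose-ML data loaders:

```python
def persistence_forecast(series, horizon):
    """Naive baseline: repeat the last observed value for `horizon` steps."""
    return [series[-1]] * horizon

def mae(pred, true):
    """Mean absolute error between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

# Toy CGM trace (mg/dL), one sample every 5 minutes
history = [110, 114, 119, 126, 134]
future = [141, 147, 152]                 # ground truth for the next 15 min
pred = persistence_forecast(history, 3)  # [134, 134, 134]
error = mae(pred, future)                # (7 + 13 + 18) / 3 mg/dL
```

Reporting such a baseline alongside learned models makes cross-dataset comparisons like those in the paper easier to interpret.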

Updated: 2025-07-18 16:53:05

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.14077v1

Quantum-Resilient Privacy Ledger (QRPL): A Sovereign Digital Currency for the Post-Quantum Era

The emergence of quantum computing presents profound challenges to existing cryptographic infrastructures, whilst the development of central bank digital currencies (CBDCs) has raised concerns regarding privacy preservation and excessive centralisation in digital payment systems. This paper proposes the Quantum-Resilient Privacy Ledger (QRPL) as an innovative token-based digital currency architecture that incorporates National Institute of Standards and Technology (NIST)-standardised post-quantum cryptography (PQC) with hash-based zero-knowledge proofs to ensure user sovereignty, scalability, and transaction confidentiality. Key contributions include adaptations of ephemeral proof chains for unlinkable transactions, a privacy-weighted Proof-of-Stake (PoS) consensus to promote equitable participation, and a novel zero-knowledge proof-based mechanism for privacy-preserving selective disclosure. QRPL aims to address critical shortcomings in prevailing CBDC designs, including risks of pervasive surveillance, with a 10-20 second block time to balance security and throughput in future monetary systems. While the design is currently conceptual, future work includes prototype development to validate these models empirically.

Updated: 2025-07-18 16:51:19

Domains: cs.ET,cs.CR

Download: http://arxiv.org/abs/2507.09067v2

Critiques of World Models

World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in the psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook toward a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

Updated: 2025-07-18 16:48:16

Domains: cs.LG,cs.AI,cs.CL,cs.CV,cs.RO

Download: http://arxiv.org/abs/2507.05169v2

Edge Intelligence with Spiking Neural Networks

The convergence of artificial intelligence and edge computing has spurred growing interest in enabling intelligent services directly on resource-constrained devices. While traditional deep learning models require significant computational resources and centralized data management, the resulting latency, bandwidth consumption, and privacy concerns have exposed critical limitations in cloud-centric paradigms. Brain-inspired computing, particularly Spiking Neural Networks (SNNs), offers a promising alternative by emulating biological neuronal dynamics to achieve low-power, event-driven computation. This survey provides a comprehensive overview of Edge Intelligence based on SNNs (EdgeSNNs), examining their potential to address the challenges of on-device learning, inference, and security in edge scenarios. We present a systematic taxonomy of EdgeSNN foundations, encompassing neuron models, learning algorithms, and supporting hardware platforms. Three representative practical considerations of EdgeSNN are discussed in depth: on-device inference using lightweight SNN models, resource-aware training and updating under non-stationary data conditions, and secure and privacy-preserving issues. Furthermore, we highlight the limitations of evaluating EdgeSNNs on conventional hardware and introduce a dual-track benchmarking strategy to support fair comparisons and hardware-aware optimization. Through this study, we aim to bridge the gap between brain-inspired learning and practical edge deployment, offering insights into current advancements, open challenges, and future research directions. To the best of our knowledge, this is the first dedicated and comprehensive survey on EdgeSNNs, providing an essential reference for researchers and practitioners working at the intersection of neuromorphic computing and edge intelligence.
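The low-power, event-driven computation the survey refers to stems from neuron models such as leaky integrate-and-fire (LIF): a membrane potential leaks toward rest, integrates input current, and emits a binary spike only when it crosses a threshold. This is a generic discrete-time LIF sketch with illustrative parameter values, not a model taken from the survey:

```python
def lif_run(inputs, decay=0.9, threshold=1.0, v_reset=0.0):
    """Discrete-time leaky integrate-and-fire neuron.

    v[t] = decay * v[t-1] + i[t]; emit a spike and reset v when the
    membrane potential reaches the threshold. Returns the spike train.
    """
    v, spikes = 0.0, []
    for i in inputs:
        v = decay * v + i        # leak, then integrate the input current
        if v >= threshold:
            spikes.append(1)     # event: output is a single binary spike
            v = v_reset
        else:
            spikes.append(0)     # no event, no downstream computation
    return spikes

spike_train = lif_run([0.4, 0.4, 0.4, 0.0, 0.9, 0.3])
```

Because downstream work happens only on the sparse spike events, hardware implementing such neurons can stay idle most of the time, which is the basis of the energy savings on edge devices.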

Updated: 2025-07-18 16:47:52

Domains: cs.DC,cs.AI,cs.ET,cs.NE

Download: http://arxiv.org/abs/2507.14069v1

VLA-Mark: A cross modal watermark for large vision-language alignment model

Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.
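The entropy-sensitive mechanism can be caricatured as: compute the entropy of the model's next-token distribution and scale the watermark strength with it, so low-uncertainty (visually grounded) steps receive little perturbation. The scaling rule below is a hypothetical sketch of that idea, not VLA-Mark's actual schedule:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def watermark_strength(probs, max_delta=2.0):
    """Scale a watermark logit bias by normalised entropy in [0, 1].

    Normalising by log2(vocab size) maps entropy to [0, 1], so the bias
    ranges from 0 (fully confident step) to max_delta (maximal entropy).
    """
    max_bits = math.log2(len(probs))
    return max_delta * entropy(probs) / max_bits

confident = [0.97, 0.01, 0.01, 0.01]  # low entropy -> near-zero watermark
uncertain = [0.25, 0.25, 0.25, 0.25]  # max entropy -> full strength
```

On confident steps the bias is tiny, leaving the semantically critical (often visually grounded) token choice essentially untouched.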

Updated: 2025-07-18 16:44:41

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.14067v1

Noradrenergic-inspired gain modulation attenuates the stability gap in joint training

Recent studies in continual learning have identified a transient drop in performance on mastered tasks when assimilating new ones, known as the stability gap. Such dynamics contradict the objectives of continual learning, revealing a lack of robustness in mitigating forgetting that, notably, persists even under an ideal joint-loss regime. Examining this gap within this idealized joint training context is critical to isolate it from other sources of forgetting. We argue that it reflects an imbalance between rapid adaptation and robust retention at task boundaries, underscoring the need to investigate mechanisms that reconcile plasticity and stability within continual learning frameworks. Biological brains navigate a similar dilemma by operating concurrently on multiple timescales, leveraging neuromodulatory signals to modulate synaptic plasticity. However, artificial networks lack native multi-timescale dynamics, and although optimizers like momentum-SGD and Adam introduce implicit timescale regularization, they still exhibit stability gaps. Inspired by locus coeruleus-mediated noradrenergic bursts, which transiently enhance neuronal gain under uncertainty to facilitate sensory assimilation, we propose uncertainty-modulated gain dynamics - an adaptive mechanism that approximates a two-timescale optimizer and dynamically balances integration of new knowledge with minimal interference on previously consolidated information. We evaluate our mechanism on domain-incremental and class-incremental variants of the MNIST and CIFAR benchmarks under joint training, demonstrating that uncertainty-modulated gain dynamics effectively attenuate the stability gap. Finally, our analysis elucidates how gain modulation replicates noradrenergic functions in cortical circuits, offering mechanistic insights into reducing stability gaps and enhancing performance in continual learning tasks.
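One way to picture uncertainty-modulated gain dynamics is as a multiplicative gain on the update step that jumps up transiently with uncertainty (the noradrenergic "burst") and relaxes back toward baseline on a slower timescale. This two-timescale toy rule is our own illustrative sketch, with made-up constants, and not the paper's exact mechanism:

```python
def gain_update(gain, uncertainty, burst=2.0, relax=0.1):
    """Fast rise toward (1 + burst*uncertainty), slow relaxation toward it."""
    target = 1.0 + burst * uncertainty
    if target > gain:
        return target                       # fast timescale: burst upward
    return gain + relax * (target - gain)   # slow timescale: decay back

gains, g = [], 1.0
for u in [0.0, 0.9, 0.0, 0.0, 0.0]:  # a single uncertainty spike at step 2
    g = gain_update(g, u)
    gains.append(g)
```

The gain spikes when uncertainty rises (e.g. at a task boundary) and then decays gradually, so adaptation is fast while previously consolidated weights are disturbed only briefly.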

Updated: 2025-07-18 16:34:06

标题: "Noradrenergic启发的增益调制减弱了联合训练中的稳定性差距"

摘要: 最近的连续学习研究已经确定,在吸收新任务时,掌握任务的表现会暂时下降,这被称为稳定性差距。这种动态与连续学习的目标相矛盾,揭示了在减轻遗忘方面缺乏稳健性,特别是在理想的联合损失制度下仍然存在。在这种理想的联合训练背景下研究这种差距对于将其与其他遗忘源隔离开来至关重要。我们认为这反映了在任务边界处快速适应和稳健保留之间的不平衡,强调了需要研究调和可塑性和稳定性的机制在连续学习框架内。生物大脑通过在多个时间尺度上同时运行,利用神经调节信号调节突触可塑性来应对类似的困境。然而,人工网络缺乏本地多时间尺度动态,尽管像动量-SGD和Adam这样的优化器引入了隐式时间尺度正则化,但它们仍然存在稳定性差距。受到蓝斑介导的去甲肾上腺素突发事件的启发,这些事件会在不确定性下瞬间增强神经元增益以促进感觉同化,我们提出了不确定性调制增益动态-一种适应性机制,近似于双时间尺度优化器,并动态平衡知识的整合与对先前巩固信息的最小干扰。我们在联合训练下的MNIST和CIFAR基准的域增量和类增量变体上评估了我们的机制,证明了不确定性调制增益动态有效地减轻了稳定性差距。最后,我们的分析阐明了增益调制如何在皮质回路中复制去甲肾上腺功能,提供了减少稳定性差距并增强连续学习任务表现的机制洞见。

更新时间: 2025-07-18 16:34:06

领域: cs.LG,cs.AI,q-bio.NC,68T05

下载: http://arxiv.org/abs/2507.14056v1

NuSeC: A Dataset for Nuclei Segmentation in Breast Cancer Histopathology Images

The NuSeC dataset was created by selecting 4 images of 1024x1024 pixels from the slides of each of 25 patients, for a total of 100 images. To enable a consistent comparative analysis between methods that researchers will develop on the NuSeC dataset in the future, we split it into a training set (75%) and a testing set (25%). In detail, one image is randomly selected from the 4 images of each patient to build the testing set, and the remaining images are reserved for the training set. The training set comprises 75 images with around 30000 nuclei structures, while the testing set comprises 25 images with around 6000 nuclei structures.
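The patient-wise split described above (one of each patient's four images held out for testing) can be sketched as follows; the file-naming scheme and random seed are hypothetical:

```python
import random

def patient_split(images_per_patient, seed=0):
    """Hold out one randomly chosen image per patient for testing."""
    rng = random.Random(seed)
    train, test = [], []
    for patient, images in images_per_patient.items():
        held_out = rng.choice(images)          # 1 of 4 goes to the test set
        test.append(held_out)
        train.extend(img for img in images if img != held_out)
    return train, test

# Hypothetical naming: patient ID + image index, 25 patients x 4 images
data = {p: [f"p{p:02d}_{i}.png" for i in range(4)] for p in range(1, 26)}
train, test = patient_split(data)  # 75 training and 25 testing images
```

Splitting by patient (rather than by image) keeps all of a patient's tissue out of the training set, avoiding leakage between the two sets.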

Updated: 2025-07-18 16:23:07

Domains: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.14272v1

Towards Practical Operation of Deep Reinforcement Learning Agents in Real-World Network Management at Open RAN Edges

Deep Reinforcement Learning (DRL) has emerged as a powerful solution for meeting the growing demands for connectivity, reliability, low latency and operational efficiency in advanced networks. However, most research has focused on theoretical analysis and simulations, with limited investigation into real-world deployment. To bridge the gap and support practical DRL deployment for network management, we first present an orchestration framework that integrates ETSI Multi-access Edge Computing (MEC) with Open RAN, enabling seamless adoption of DRL-based strategies across different time scales while enhancing agent lifecycle management. We then identify three critical challenges hindering DRL's real-world deployment, including (1) asynchronous requests from unpredictable or bursty traffic, (2) adaptability and generalization across heterogeneous topologies and evolving service demands, and (3) prolonged convergence and service interruptions due to exploration in live operational environments. To address these challenges, we propose a three-fold solution strategy: (a) advanced time-series integration for handling asynchronized traffic, (b) flexible architecture design such as multi-agent DRL and incremental learning to support heterogeneous scenarios, and (c) simulation-driven deployment with transfer learning to reduce convergence time and service disruptions. Lastly, the feasibility of the MEC-O-RAN architecture is validated on an urban-wide testing infrastructure, and two real-world use cases are presented, showcasing the three identified challenges and demonstrating the effectiveness of the proposed solutions.

Updated: 2025-07-18 16:19:31

Domains: cs.NI,cs.AI,cs.DC,cs.SY,eess.SY

Download: http://arxiv.org/abs/2410.23086v2

MiDeSeC: A Dataset for Mitosis Detection and Segmentation in Breast Cancer Histopathology Images

The MiDeSeC dataset was created from H&E-stained slides of invasive breast carcinoma of no special type (NST) from 25 different patients, captured at 40x magnification at the Department of Medical Pathology at Ankara University. The slides were scanned with a 3D Histech Panoramic p250 Flash-3 scanner and an Olympus BX50 microscope. As mitoses can take several possible shapes, it is crucial to have a large dataset that covers all the cases. Accordingly, a total of 50 regions, each of 1024x1024 pixels, were selected from the glass slides of the 25 patients. These 50 regions contain more than 500 mitoses in total. Two-thirds of the regions are reserved for training, the other third for testing.

Updated: 2025-07-18 16:19:05

Domains: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.14271v1

APTx Neuron: A Unified Trainable Neuron Architecture Integrating Activation and Computation

We propose the APTx Neuron, a novel, unified neural computation unit that integrates non-linear activation and linear transformation into a single trainable expression. The APTx Neuron is derived from the APTx activation function, thereby eliminating the need for separate activation layers and making the architecture both computationally efficient and elegant. The proposed neuron follows the functional form $y = \sum_{i=1}^{n} ((\alpha_i + \tanh(\beta_i x_i)) \cdot \gamma_i x_i) + \delta$, where all parameters $\alpha_i$, $\beta_i$, $\gamma_i$, and $\delta$ are trainable. We validate our APTx Neuron-based architecture on the MNIST dataset, achieving up to 96.69\% test accuracy in just 20 epochs using approximately 332K trainable parameters. The results highlight the superior expressiveness and computational efficiency of the APTx Neuron compared to traditional neurons, pointing toward a new paradigm in unified neuron design and the architectures built upon it.
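The stated functional form can be written directly as a small function. The parameter values below are random placeholders for illustration, not trained values from the paper:

```python
import math
import random

def aptx_neuron(x, alpha, beta, gamma, delta):
    """y = sum_i (alpha_i + tanh(beta_i * x_i)) * gamma_i * x_i + delta,
    with all of alpha, beta, gamma, delta trainable."""
    return sum((a + math.tanh(b * xi)) * g * xi
               for xi, a, b, g in zip(x, alpha, beta, gamma)) + delta

# Random placeholder parameters for an n=3 input
n = 3
rng = random.Random(42)
params = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(3)]
y = aptx_neuron([0.5, -1.0, 2.0], *params, delta=0.1)
```

Note that with beta_i = 0 the tanh term vanishes and each input contributes a plain linear term (alpha_i * gamma_i * x_i), which is one way to see how activation and linear transformation are fused in a single expression.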

Updated: 2025-07-18 16:17:40

Domains: cs.NE,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.14270v1

A multi-strategy improved snake optimizer for three-dimensional UAV path planning and engineering problems

Metaheuristic algorithms have gained widespread application across various fields owing to their ability to generate diverse solutions. One such algorithm is the Snake Optimizer (SO), a progressive optimization approach. However, SO suffers from slow convergence and susceptibility to local optima. In light of these shortcomings, we propose a novel Multi-strategy Improved Snake Optimizer (MISO). Firstly, we propose a new adaptive random disturbance strategy based on the sine function to alleviate the risk of getting trapped in a local optimum. Secondly, we introduce an adaptive Levy flight strategy based on a scale factor and the leader, endowing the male snake leader with flight capability, which makes it easier for the algorithm to leap out of a local optimum and find the global optimum. More importantly, we put forward a position update strategy combining elite leadership and Brownian motion, effectively accelerating convergence while ensuring precision. Finally, to demonstrate the performance of MISO, we utilize 30 CEC2017 test functions and the CEC2022 test suite, comparing it with 11 popular algorithms across different dimensions to validate its effectiveness. Moreover, Unmanned Aerial Vehicles (UAVs) have been widely used in various fields due to their advantages of low cost, high mobility and easy operation. However, the UAV path planning problem is crucial for flight safety and efficiency, and challenges remain in establishing and optimizing the path model. Therefore, we apply MISO to the UAV 3D path planning problem as well as 6 engineering design problems to assess its feasibility in practical applications. The experimental results demonstrate that MISO exceeds other competitive algorithms in terms of solution quality and stability, establishing its strong potential for application.
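Levy flight steps like those used in the proposed strategy are commonly drawn with Mantegna's algorithm, which generates heavy-tailed step lengths from two Gaussians. This generic sketch follows that standard recipe; beta = 1.5 is a conventional choice in the metaheuristics literature, not necessarily the paper's setting:

```python
import math
import random

def levy_step(beta=1.5, rng=random):
    """One Levy-distributed step length via Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.gauss(0, sigma_u)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)   # heavy-tailed: occasional long jumps

rng = random.Random(7)
steps = [levy_step(rng=rng) for _ in range(1000)]
```

Most steps are short (local exploitation), but the heavy tail occasionally produces a long jump, which is what lets a leader "leap out" of a local optimum.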

Updated: 2025-07-18 16:11:35

Domains: cs.RO,cs.AI,cs.CE

Download: http://arxiv.org/abs/2507.14043v1

SecurePose: Automated Face Blurring and Human Movement Kinematics Extraction from Videos Recorded in Clinical Settings

Movement disorder diagnosis often relies on expert evaluation of patient videos, but sharing these videos poses privacy risks. Current methods for de-identifying videos, such as blurring faces, are often manual, inconsistent, or inaccurate. Furthermore, these methods can compromise objective kinematic analysis - a crucial component of diagnosis. To address these challenges, we developed SecurePose, an open-source software that simultaneously provides reliable de-identification and automated kinematic extraction from videos recorded in clinic settings using smartphones/tablets. SecurePose utilizes pose estimation (using OpenPose) to extract full body kinematics, track individuals, identify the patient, and then accurately blur faces in the videos. We validated SecurePose on gait videos recorded in outpatient clinic visits of 116 children with cerebral palsy, assessing both the accuracy of its de-identification compared to the ground truth (manual blurring) and the reliability of the intermediate steps of kinematics extraction. Results demonstrate that SecurePose outperformed six existing methods in automated face detection and achieved comparable accuracy to robust manual blurring, but in significantly less time (91.08% faster). Ten experienced researchers also confirmed SecurePose's usability via System Usability Scale scores. These findings validate SecurePose as a practical and effective tool for protecting patient privacy while enabling accurate kinematics extraction in clinical settings.

Updated: 2025-07-18 16:01:33

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2402.14143v2

KROMA: Ontology Matching with Knowledge Retrieval and Large Language Models

Ontology Matching (OM) is a cornerstone task of semantic interoperability, yet existing systems often rely on handcrafted rules or specialized models with limited adaptability. We present KROMA, a novel OM framework that harnesses Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG) pipeline to dynamically enrich the semantic context of OM tasks with structural, lexical, and definitional knowledge. To optimize both performance and efficiency, KROMA integrates a bisimilarity-based concept matching and a lightweight ontology refinement step, which prune candidate concepts and substantially reduce the communication overhead from invoking LLMs. Through experiments on multiple benchmark datasets, we show that integrating knowledge retrieval with context-augmented LLMs significantly enhances ontology matching, outperforming both classic OM systems and cutting-edge LLM-based approaches while keeping communication overhead comparable. Our study highlights the feasibility and benefit of the proposed optimization techniques (targeted knowledge retrieval, prompt enrichment, and ontology refinement) for ontology matching at scale.

Updated: 2025-07-18 16:00:11

Domains: cs.AI

Download: http://arxiv.org/abs/2507.14032v1

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "inference-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.

Updated: 2025-07-18 15:57:54

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.09567v5

Kintsugi: Decentralized E2EE Key Recovery

Kintsugi is a protocol for key recovery, allowing a user to regain access to end-to-end encrypted data after they have lost their device, but still have their (potentially low-entropy) password. Existing E2EE key recovery methods, such as those deployed by Signal and WhatsApp, centralize trust by relying on servers administered by a single provider. Kintsugi is decentralized, distributing trust over multiple recovery nodes, which could be servers run by independent parties, or end user devices in a peer-to-peer setting. To recover a user's keys, a threshold $t+1$ of recovery nodes must assist the user in decrypting a shared backup. Kintsugi is password-authenticated and protects against offline brute-force password guessing without requiring any specialized secure hardware. Kintsugi can tolerate up to $t$ honest-but-curious colluding recovery nodes, as well as $n - t - 1$ offline nodes, and operates safely in an asynchronous network model where messages can be arbitrarily delayed.
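
The $t+1$-of-$n$ recovery described above is the classic threshold-secret-sharing setting. The abstract does not spell out Kintsugi's construction (which additionally involves password authentication and asynchronous-network handling); the following toy Shamir-style sketch illustrates only the threshold idea, over a prime field:

```python
# Toy Shamir secret sharing: any t+1 of n shares recover the secret,
# while t or fewer reveal nothing. Illustrative only; Kintsugi's real
# protocol layers password authentication on top of the shared backup.
import random

P = 2**127 - 1  # a Mersenne prime; real systems fix a large field

def make_shares(secret, t, n):
    """Split `secret` into n shares; any t+1 of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0 over GF(P)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse (Fermat's little theorem)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(secret=123456789, t=2, n=5)
assert recover(shares[:3]) == 123456789   # t+1 = 3 shares suffice
assert recover(shares[1:4]) == 123456789  # any 3 shares work
```

With $t=2$ and $n=5$, up to two colluding nodes learn nothing about the backup key, matching the honest-but-curious tolerance the abstract claims.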

Updated: 2025-07-18 15:29:51

Categories: cs.CR

Download: http://arxiv.org/abs/2507.21122v1

DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation

Materials discovery relies on high-throughput, high-fidelity simulation techniques such as Density Functional Theory (DFT), which require years of training, extensive parameter fine-tuning and systematic error handling. To address these challenges, we introduce the DFT-based Research Engine for Agentic Materials Screening (DREAMS), a hierarchical, multi-agent framework for DFT simulation that combines a central Large Language Model (LLM) planner agent with domain-specific LLM agents for atomistic structure generation, systematic DFT convergence testing, High-Performance Computing (HPC) scheduling, and error handling. In addition, a shared canvas helps the LLM agents to structure their discussions, preserve context and prevent hallucination. We validate DREAMS's capabilities on the Sol27LC lattice-constant benchmark, achieving average errors below 1\% compared to the results of human DFT experts. Furthermore, we apply DREAMS to the long-standing CO/Pt(111) adsorption puzzle, demonstrating its long-term and complex problem-solving capabilities. The framework again reproduces expert-level literature adsorption-energy differences. Finally, DREAMS is employed to quantify functional-driven uncertainties with Bayesian ensemble sampling, confirming the Face Centered Cubic (FCC)-site preference at the Generalized Gradient Approximation (GGA) DFT level. In conclusion, DREAMS approaches L3-level automation (autonomous exploration of a defined design space) and significantly reduces the reliance on human expertise and intervention, offering a scalable path toward democratized, high-throughput, high-fidelity computational materials discovery.

Updated: 2025-07-18 15:26:04

Categories: cs.AI,cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2507.14267v1

The CryptoNeo Threat Modelling Framework (CNTMF): Securing Neobanks and Fintech in Integrated Blockchain Ecosystems

The rapid integration of blockchain, cryptocurrency, and Web3 technologies into digital banks and fintech operations has created an integrated environment blending traditional financial systems with decentralised elements. This paper introduces the CryptoNeo Threat Modelling Framework (CNTMF), a proposed framework designed to address the risks in these ecosystems, such as oracle manipulation and cross-chain exploits. CNTMF extends established methodologies such as STRIDE, OWASP Top 10, the NIST frameworks, LINDDUN, and PASTA, while incorporating tailored components including Hybrid Layer Analysis, the CRYPTOQ mnemonic for cryptocurrency-specific risks, and an AI-Augmented Feedback Loop. Drawing on real-world data from 2025 incidents, CNTMF supports data-driven mitigation to reduce losses, which totalled approximately $2.47 billion in the first half of 2025 across 344 security events (CertiK via GlobeNewswire, 2025; Infosecurity Magazine, 2025). Its phases guide asset mapping, risk profiling, prioritisation, mitigation, and iterative feedback. This supports security against evolving risks like state-sponsored attacks.

Updated: 2025-07-18 15:19:08

Categories: cs.CR,cs.ET

Download: http://arxiv.org/abs/2507.14007v1

OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models

The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.

Updated: 2025-07-18 15:01:44

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.13993v1

Bridging MOOCs, Smart Teaching, and AI: A Decade of Evolution Toward a Unified Pedagogy

Over the past decade, higher education has evolved through three distinct paradigms: the emergence of Massive Open Online Courses (MOOCs), the integration of Smart Teaching technologies into classrooms, and the rise of AI-enhanced learning. Each paradigm is intended to address specific challenges in traditional education: MOOCs enable ubiquitous access to learning resources; Smart Teaching supports real-time interaction with data-driven insights; and generative AI offers personalized feedback and on-demand content generation. However, these paradigms are often implemented in isolation due to their disparate technological origins and policy-driven adoption. This paper examines the origins, strengths, and limitations of each paradigm, and advocates a unified pedagogical perspective that synthesizes their complementary affordances. We propose a three-layer instructional framework that combines the scalability of MOOCs, the responsiveness of Smart Teaching, and the adaptivity of AI. To demonstrate its feasibility, we present a curriculum design for a project-based course. The findings highlight the framework's potential to enhance learner engagement, support instructors, and enable personalized yet scalable learning.

Updated: 2025-07-18 14:57:20

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2507.14266v1

Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative "team" focused on a specific subtask. The Agentic Neural Network follows a two-phase optimization strategy: (1) Forward phase: drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward phase: mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables ANN to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across four benchmark datasets, ANN surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements. Our findings indicate that ANN provides a scalable, data-driven framework for multi-agent systems, combining the collaborative capabilities of LLMs with the efficiency and flexibility of neural network principles. We plan to open-source the entire framework.

Updated: 2025-07-18 14:52:19

Categories: cs.LG,cs.AI,cs.MA

Download: http://arxiv.org/abs/2506.09046v2

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored explicit content-style decomposition, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representation with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) an Augmented Key-Value (K-V) memory that enhances content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
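
To picture the rectification idea in (2): in the rank-1 case, removing content leakage from a style vector reduces to subtracting its projection onto a content direction. The sketch below shows only this simplified case (the actual method applies SVD to feature matrices at each scale):

```python
# Rank-1 sketch of "rectifying" a style vector: subtract its projection
# onto a (nonzero) content direction, so content information leaking
# into the style representation is suppressed. Toy stand-in for the
# paper's SVD-based step, kept dependency-free.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rectify(style, content):
    """Return style minus its projection onto the content direction."""
    scale = dot(style, content) / dot(content, content)
    return [s - scale * c for s, c in zip(style, content)]

style = [3.0, 1.0, 2.0]
content = [1.0, 0.0, 0.0]
clean = rectify(style, content)
assert clean == [0.0, 1.0, 2.0]
assert abs(dot(clean, content)) < 1e-9  # no residual content component
```

After rectification the style vector carries no component along the content direction, which is the leakage-suppression property the method targets.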

Updated: 2025-07-18 14:45:48

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.13984v1

From Roots to Rewards: Dynamic Tree Reasoning with RL

Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree's static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree's probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.
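
The confidence-weighted selection at each tree node can be pictured with a minimal sketch: each strategy (closed-book answering, retrieval, or aggregating child answers) proposes a candidate with a confidence, and the node keeps the best-scoring one. Strategy names and scores below are illustrative, not taken from the paper:

```python
# Minimal sketch of ProbTree-style answer selection at one node:
# candidates from different strategies compete by log-confidence.
# The RL extension described above would instead learn a policy over
# which strategies to evaluate at all, avoiding exhaustive scoring.
import math

def select_answer(candidates):
    """candidates: list of (strategy, answer, log_confidence) tuples."""
    return max(candidates, key=lambda c: c[2])

node = [
    ("closed_book", "Paris", math.log(0.55)),
    ("retrieval",   "Paris", math.log(0.90)),
    ("aggregate",   "Lyon",  math.log(0.20)),
]
strategy, answer, _ = select_answer(node)
assert (strategy, answer) == ("retrieval", "Paris")
```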

Updated: 2025-07-18 14:38:54

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.13142v2

Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking

Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce Ev2R, which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev2R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev2R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev2R consistently outperforms existing scoring approaches in accuracy and robustness. It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC. Code is available at https://github.com/mubasharaak/fc-evidence-evaluation.
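
The joint assessment can be pictured as blending two component scores, one for reference alignment and one for verdict support. The weighting and the component scorers below are placeholders (the actual system uses LLM-based evaluators for both signals):

```python
# Toy blend of the two signals Ev2R scores jointly. Evidence that
# matches the gold references but barely supports the verdict is
# penalized relative to evidence strong on both axes.

def ev2r_score(alignment: float, verdict_support: float, w: float = 0.5) -> float:
    """Blend reference alignment and verdict support, both in [0, 1]."""
    assert 0.0 <= alignment <= 1.0 and 0.0 <= verdict_support <= 1.0
    return w * alignment + (1 - w) * verdict_support

assert ev2r_score(0.9, 0.9) > ev2r_score(0.9, 0.2)
assert abs(ev2r_score(1.0, 1.0) - 1.0) < 1e-9
```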

Updated: 2025-07-18 14:38:50

Categories: cs.CL,cs.AI,cs.IR,cs.LG

Download: http://arxiv.org/abs/2411.05375v2

A segmented robot grasping perception neural network for edge AI

Robotic grasping, the ability of robots to reliably secure and manipulate objects of varying shapes, sizes and orientations, is a complex task that requires precise perception and control. Deep neural networks have shown remarkable success in grasp synthesis by learning rich and abstract representations of objects. When deployed at the edge, these models can enable low-latency, low-power inference, making real-time grasping feasible in resource-constrained environments. This work implements Heatmap-Guided Grasp Detection, an end-to-end framework for the detection of 6-DoF grasp poses, on the GAP9 RISC-V System-on-Chip. The model is optimised using hardware-aware techniques, including input dimensionality reduction, model partitioning, and quantisation. Experimental evaluation on the GraspNet-1Billion benchmark validates the feasibility of fully on-chip inference, highlighting the potential of low-power MCUs for real-time, autonomous manipulation.
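
Of the hardware-aware optimisations mentioned, quantisation is the most generic. A minimal sketch of post-training symmetric int8 quantisation (a generic illustration, not the GAP9 toolchain's actual quantiser):

```python
# Post-training symmetric int8 quantisation with one per-tensor scale:
# floats map to [-128, 127], and the reconstruction error is bounded
# by the scale. Per-channel scales and calibration are omitted.

def quantize_int8(weights):
    """Map float weights to int8 with a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
assert all(-128 <= v <= 127 for v in q)
assert all(abs(a - b) <= s for a, b in zip(w, w_hat))  # bounded error
```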

Updated: 2025-07-18 14:32:45

Categories: cs.RO,cs.AI,I.2; I.2.9; I.2.10

Download: http://arxiv.org/abs/2507.13970v1

Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need

Language models traditionally used for cross-domain generalization have recently demonstrated task-specific reasoning. However, their top-down training approach on general corpora is insufficient for acquiring abstractions needed for deep domain expertise. This may require a bottom-up approach that acquires expertise by learning to compose simple domain concepts into more complex ones. A knowledge graph (KG) provides this compositional structure, where domain primitives are represented as head-relation-tail edges and their paths encode higher-level concepts. We present a task generation pipeline that synthesizes tasks directly from KG primitives, enabling models to acquire and compose them for reasoning. We fine-tune language models on the resultant KG-grounded curriculum to demonstrate domain-specific superintelligence. While broadly applicable, we validate our approach in medicine, where reliable KGs exist. Using a medical KG, we curate 24,000 reasoning tasks paired with thinking traces derived from diverse medical primitives. We fine-tune the QwQ-32B model on this curriculum to obtain QwQ-Med-3 that takes a step towards medical superintelligence. We also introduce ICD-Bench, an evaluation suite to quantify reasoning abilities across 15 medical domains. Our experiments demonstrate that QwQ-Med-3 significantly outperforms state-of-the-art reasoning models on ICD-Bench categories. Further analysis reveals that QwQ-Med-3 utilizes acquired primitives to widen the performance gap on the hardest tasks of ICD-Bench. Finally, evaluation on medical question-answer benchmarks shows that QwQ-Med-3 transfers acquired expertise to enhance the base model's performance. While the industry's approach to artificial general intelligence (AGI) emphasizes broad expertise, we envision a future in which AGI emerges from the composable interaction of efficient domain-specific superintelligent agents.
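
The compositional step above, chaining head-relation-tail edges into paths that encode higher-level concepts, can be sketched directly. The tiny medical-flavoured graph and the question template are invented for illustration; the paper's pipeline works over a full medical KG:

```python
# Compose KG primitives (head, relation, tail) into a 2-hop reasoning
# task: chain edges sharing an intermediate entity, then render the
# path as a question whose answer is the path's endpoint.

edges = [
    ("aspirin", "treats", "headache"),
    ("headache", "symptom_of", "migraine"),
    ("ibuprofen", "treats", "fever"),
]

def two_hop_paths(edges):
    """Chain edges (a, r1, b) and (b, r2, c) into 2-hop paths."""
    return [
        (a, r1, b, r2, c)
        for (a, r1, b) in edges
        for (b2, r2, c) in edges
        if b == b2
    ]

def to_task(path):
    a, r1, b, r2, c = path
    question = (f"{a} {r1} {b}, and {b} is a {r2.replace('_', ' ')} {c}. "
                f"Which condition is {a} indirectly linked to?")
    return question, c

paths = two_hop_paths(edges)
assert paths == [("aspirin", "treats", "headache", "symptom_of", "migraine")]
question, answer = to_task(paths[0])
assert answer == "migraine"
```

Longer paths compose the same way, which is how the curriculum scales task difficulty from single primitives to multi-hop reasoning.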

Updated: 2025-07-18 14:30:08

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.13966v1

Towards Constraint Temporal Answer Set Programming

Reasoning about dynamic systems with a fine-grained temporal and numeric resolution presents significant challenges for logic-based approaches like Answer Set Programming (ASP). To address this, we introduce and elaborate upon a novel temporal and constraint-based extension of the logic of Here-and-There and its nonmonotonic equilibrium extension, representing, to the best of our knowledge, the first approach to nonmonotonic temporal reasoning with constraints specifically tailored for ASP. This expressive system is achieved by a synergistic combination of two foundational ASP extensions: the linear-time logic of Here-and-There, providing robust nonmonotonic temporal reasoning capabilities, and the logic of Here-and-There with constraints, enabling the direct integration and manipulation of numeric constraints, among others. This work establishes the foundational logical framework for tackling complex dynamic systems with high resolution within the ASP paradigm.

Updated: 2025-07-18 14:22:38

Categories: cs.AI,cs.LO

Download: http://arxiv.org/abs/2507.13958v1

DUALRec: A Hybrid Sequential and Language Model Framework for Context-Aware Movie Recommendation

Modern recommender systems face an increasing challenge in modelling and predicting dynamic, context-rich user preferences. Traditional collaborative filtering and content-based methods often struggle to capture temporal patterns and evolving user intentions. While Large Language Models (LLMs) have gained growing attention in recent years for their strong semantic understanding and reasoning abilities, they are not inherently designed to model chronologically evolving user preferences and intentions. Sequential models such as LSTM (Long Short-Term Memory), by contrast, are good at capturing the temporal dynamics of user behaviour and evolving user preferences over time, but lack the rich semantic understanding needed for comprehensive recommendation generation. In this study, we propose DUALRec (Dynamic User-Aware Language-based Recommender), a novel recommender that leverages the complementary strengths of both models, combining the temporal modelling abilities of LSTM networks with the semantic reasoning power of fine-tuned Large Language Models. The LSTM component captures users' evolving preferences through their viewing history, while the fine-tuned LLM variants leverage these temporal user insights to generate the next movies that users might enjoy. Experimental results on the MovieLens-1M dataset show that the DUALRec model outperforms a wide range of baseline models under comprehensive evaluation metrics, including Hit Rate (HR@k), Normalized Discounted Cumulative Gain (NDCG@k), and genre similarity. This research proposes a novel architecture that bridges the gap between temporal sequence modelling and semantic reasoning, and offers a promising direction for developing more intelligent and context-aware recommenders.
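
The LSTM-to-LLM hand-off can be sketched minimally: a sequence model scores the viewing history, and the top titles are folded into a generation prompt. The scoring rule and prompt wording below are placeholders, not the paper's exact design:

```python
# Sketch of DUALRec-style hand-off: an (assumed) LSTM preference score
# ranks the viewing history, and the top titles condition an LLM prompt
# for next-movie generation.

def top_k_recent(history, scores, k=3):
    """Rank watched titles by a sequence-model preference score."""
    ranked = sorted(zip(history, scores), key=lambda p: p[1], reverse=True)
    return [title for title, _ in ranked[:k]]

def build_prompt(history, scores):
    liked = ", ".join(top_k_recent(history, scores))
    return f"The user recently enjoyed: {liked}. Recommend the next movie."

hist = ["Alien", "Heat", "Amelie", "Se7en"]
lstm_scores = [0.9, 0.4, 0.8, 0.7]  # placeholder LSTM outputs
prompt = build_prompt(hist, lstm_scores)
assert "Alien" in prompt and "Heat" not in prompt
```

In the full system the scores would come from the trained LSTM, and the prompt would feed a fine-tuned LLM rather than a template.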

Updated: 2025-07-18 14:22:05

Categories: cs.IR,cs.AI,cs.LG,68T05, 68T50, 62M45,H.3.3; I.2.6; H.3.4; I.2.7

Download: http://arxiv.org/abs/2507.13957v1

Cross-modal Causal Intervention for Alzheimer's Disease Prediction

Mild Cognitive Impairment (MCI) serves as a prodromal stage of Alzheimer's Disease (AD), where early identification and intervention can effectively slow the progression to dementia. However, diagnosing AD remains a significant challenge in neurology due to the confounders caused mainly by the selection bias of multimodal data and the complex relationships between variables. To address these issues, we propose a novel visual-language causal intervention framework named Alzheimer's Disease Prediction with Cross-modal Causal Intervention (ADPC) for diagnostic assistance. Our ADPC employs large language model (LLM) to summarize clinical data under strict templates, maintaining structured text outputs even with incomplete or unevenly distributed datasets. The ADPC model utilizes Magnetic Resonance Imaging (MRI), functional MRI (fMRI) images and textual data generated by LLM to classify participants into Cognitively Normal (CN), MCI, and AD categories. Because of the presence of confounders, such as neuroimaging artifacts and age-related biomarkers, non-causal models are likely to capture spurious input-output correlations, generating less reliable results. Our framework implicitly eliminates confounders through causal intervention. Experimental results demonstrate the outstanding performance of our method in distinguishing CN/MCI/AD cases, achieving state-of-the-art (SOTA) metrics across most evaluation metrics. The study showcases the potential of integrating causal reasoning with multi-modal learning for neurological disease diagnosis.

Updated: 2025-07-18 14:21:24

Categories: cs.AI,cs.CV,cs.MM

Download: http://arxiv.org/abs/2507.13956v1

Exploiting Primacy Effect To Improve Large Language Models

Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of the answers. The primacy effect, where items presented first are more likely to be remembered or selected, plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: We first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. Hence, we strategically leverage this effect by reordering response options based on semantic similarity to the query, without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.
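
The reordering step is straightforward to sketch: sort the answer options by similarity to the question so the most query-similar option lands first, where primacy bias favours it. Token-overlap Jaccard stands in for the semantic-similarity model here (an assumption of this sketch; the paper does not prescribe a specific similarity function):

```python
# Reorder MCQ options so the most query-similar option comes first,
# exploiting the primacy bias described above. No knowledge of the
# correct answer is needed, only a similarity function.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reorder_options(question: str, options: list[str]) -> list[str]:
    """Sort options by descending similarity to the question."""
    return sorted(options, key=lambda o: jaccard(question, o), reverse=True)

q = "Which planet is known as the red planet?"
opts = ["A blue whale", "The red planet Mars", "A yellow star"]
assert reorder_options(q, opts)[0] == "The red planet Mars"
```

In practice an embedding model would replace `jaccard`, but the reordering logic is unchanged.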

Updated: 2025-07-18 14:18:18

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.13949v1

Generalist Forecasting with Frozen Video Models via Latent Diffusion

Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models, including those trained generatively, and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.

Updated: 2025-07-18 14:14:19

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.13942v1

Convergent transformations of visual representation in brains and models

A fundamental question in cognitive neuroscience is what shapes visual perception: the external world's structure or the brain's internal architecture. Although some perceptual variability can be traced to individual differences, brain responses to naturalistic stimuli evoke similar activity patterns across individuals, suggesting a convergent representational principle. Here, we test if this stimulus-driven convergence follows a common trajectory across people and deep neural networks (DNNs) during its transformation from sensory to high-level internal representations. We introduce a unified framework that traces representational flow by combining inter-subject similarity with alignment to model hierarchies. Applying this framework to three independent fMRI datasets of visual scene perception, we reveal a cortex-wide network, conserved across individuals, organized into two pathways: a medial-ventral stream for scene structure and a lateral-dorsal stream tuned for social and biological content. This functional organization is captured by the hierarchies of vision DNNs but not language models, reinforcing the specificity of the visual-to-semantic transformation. These findings show a convergent computational solution for visual encoding in both human and artificial vision, driven by the structure of the external world.

Updated: 2025-07-18 14:13:54

Domains: q-bio.NC,cs.AI,cs.CV,eess.IV,I.2.10

Download: http://arxiv.org/abs/2507.13941v1

Preprint: Did I Just Browse A Website Written by LLMs?

Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this "LLM-dominant" content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are insufficient, because they perform well mainly on clean, prose-like text, while web content has complex markup and diverse genres. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector's outputs on multiple prose-like pages. We train and evaluate our detector by collecting 2 distinct ground truth datasets totaling 120 sites, and obtain 100% accuracy when testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.
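
The site-level aggregation can be sketched as follows. Per-page scores would come from an off-the-shelf LLM text detector, and both thresholds here are assumptions rather than the paper's tuned values:

```python
def classify_site(page_scores, page_threshold=0.5, site_fraction=0.5):
    """Label a site LLM-dominant when at least `site_fraction` of its
    prose-like pages exceed the per-page detector threshold.
    A simplified sketch of the aggregation idea, not the paper's pipeline."""
    flagged = sum(score >= page_threshold for score in page_scores)
    return flagged / len(page_scores) >= site_fraction
```

Classifying at the site level smooths over noisy per-page detector outputs, which is what makes the approach robust to markup-heavy or short pages.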

Updated: 2025-07-18 14:09:04

Domains: cs.NI,cs.AI,cs.CL,cs.IR

Download: http://arxiv.org/abs/2507.13933v1

Chain Table: Protecting Table-Level Data Integrity by Digital Ledger Technology

Blockchain and Digital Ledger Technology (DLT) have gained wide traction. Instead of relying on a traditional centralized data authority, a blockchain system consists of digitally entangled block data shared across a distributed network. The specially designed chain data structure and its consensus mechanism protect blockchain data from being tampered with by unauthorized adversaries. However, implementing a full-fledged blockchain system to protect a database can be technically cumbersome. In this work, we introduce an in-database design, named chain table, to protect data integrity without the need for a blockchain system. It features a succinct design without significant technology barriers or storage overhead. To realize rigorous data security, we also propose a set of data writing principles for the chain table. We prove that the chain table, together with the data writing principles, guarantees a flexible form of data integrity, named table-level data integrity (TDI).
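
A minimal sketch of the hash-chaining idea, assuming SHA-256 over JSON-serialized rows; the paper's actual schema and writing principles are richer than this:

```python
import hashlib
import json

def append_row(table, payload):
    """Append a row whose hash covers its payload plus the previous row's
    hash, mimicking blockchain-style entanglement inside an ordinary table."""
    prev = table[-1]["hash"] if table else "GENESIS"
    body = json.dumps({"payload": payload, "prev": prev}, sort_keys=True)
    table.append({"payload": payload, "prev": prev,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(table):
    """Recompute every hash in order; any tampered row breaks the chain."""
    prev = "GENESIS"
    for row in table:
        body = json.dumps({"payload": row["payload"], "prev": prev},
                          sort_keys=True)
        if row["prev"] != prev or \
                hashlib.sha256(body.encode()).hexdigest() != row["hash"]:
            return False
        prev = row["hash"]
    return True
```

Because each row's hash depends on its predecessor's, modifying any earlier row invalidates every subsequent hash, which is the property the table-level integrity guarantee builds on.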

Updated: 2025-07-18 14:08:24

Domains: cs.CR,cs.DB

Download: http://arxiv.org/abs/2507.13932v1

Developers Insight On Manifest v3 Privacy and Security Webextensions

Webextensions can improve web browser privacy, security, and user experience. The APIs offered by the browser to webextensions determine the possible functionality. Currently, Chrome is transitioning to a modified set of APIs called Manifest v3. This paper studies the challenges and opportunities of Manifest v3 through in-depth, structured qualitative research. Even though some projects observed positive effects, a majority expresses concerns over limited benefits to users, the removal of crucial APIs, or the need to find workarounds. Our findings indicate that the transition affects different types of webextensions differently; some can migrate without losing functionality, while other projects remove functionality or decline to update. The respondents identified several critical missing APIs, including reliable APIs for injecting content scripts and APIs for storing confidential content.

Updated: 2025-07-18 14:00:16

Domains: cs.CR,cs.CY

Download: http://arxiv.org/abs/2507.13926v1

Stonefish: Supporting Machine Learning Research in Marine Robotics

Simulations are highly valuable in marine robotics, offering a cost-effective and controlled environment for testing in the challenging conditions of underwater and surface operations. Given the high costs and logistical difficulties of real-world trials, simulators capable of capturing the operational conditions of subsea environments have become key in developing and refining algorithms for remotely-operated and autonomous underwater vehicles. This paper highlights recent enhancements to the Stonefish simulator, an advanced open-source platform supporting development and testing of marine robotics solutions. Key updates include a suite of additional sensors, such as an event-based camera, a thermal camera, and an optical flow camera, as well as visual light communication, support for tethered operations, improved thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy. These developments and an automated annotation tool significantly bolster Stonefish's role in marine robotics research, especially in the field of machine learning, where training data with a known ground truth is hard or impossible to collect.

Updated: 2025-07-18 13:58:39

Domains: cs.RO,cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2502.11887v2

Two-Stage Pretraining for Molecular Property Prediction in the Wild

Molecular deep learning models have achieved remarkable success in property prediction, but they often require large amounts of labeled data. The challenge is that, in real-world applications, labels are extremely scarce, as obtaining them through laboratory experimentation is both expensive and time-consuming. In this work, we introduce MoleVers, a versatile pretrained molecular model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated labels are scarce. MoleVers employs a two-stage pretraining strategy. In the first stage, it learns molecular representations from unlabeled data through masked atom prediction and extreme denoising, a novel task enabled by our newly introduced branching encoder architecture and dynamic noise scale sampling. In the second stage, the model refines these representations through predictions of auxiliary properties derived from computational methods, such as the density functional theory or large language models. Evaluation on 22 small, experimentally-validated datasets demonstrates that MoleVers achieves state-of-the-art performance, highlighting the effectiveness of its two-stage framework in producing generalizable molecular representations for diverse downstream properties.

Updated: 2025-07-18 13:53:09

Domains: cs.LG,cs.AI,physics.chem-ph,q-bio.BM

Download: http://arxiv.org/abs/2411.03537v2

The Levers of Political Persuasion with Conversational AI

There are widespread fears that conversational AI could soon exert unprecedented influence over human beliefs. Here, in three large-scale experiments (N=76,977), we deployed 19 LLMs, including some post-trained explicitly for persuasion, to evaluate their persuasiveness on 707 political issues. We then checked the factual accuracy of 466,769 resulting LLM claims. Contrary to popular concerns, we show that the persuasive power of current and near-future AI is likely to stem more from post-training and prompting methods, which boosted persuasiveness by as much as 51% and 27% respectively, than from personalization or increasing model scale. We further show that these methods increased persuasion by exploiting LLMs' unique ability to rapidly access and strategically deploy information and that, strikingly, where they increased AI persuasiveness they also systematically decreased factual accuracy.

Updated: 2025-07-18 13:50:09

Domains: cs.CL,cs.AI,cs.CY,cs.HC

Download: http://arxiv.org/abs/2507.13919v1

What the F*ck Is Artificial General Intelligence?

Artificial general intelligence (AGI) is an established field of research. Yet some have questioned if the term still has meaning. AGI has been subject to so much hype and speculation it has become something of a Rorschach test. Melanie Mitchell argues the debate will only be settled through long-term, scientific investigation. To that end, here is a short, accessible and provocative overview of AGI. I compare definitions of intelligence, settling on intelligence in terms of adaptation and AGI as an artificial scientist. Taking my cue from Sutton's Bitter Lesson, I describe two foundational tools used to build adaptive systems: search and approximation. I compare pros, cons, hybrids and architectures like o3, AlphaGo, AERA, NARS and Hyperon. I then discuss overall meta-approaches to making systems behave more intelligently. I divide them into scale-maxing, simp-maxing, and w-maxing, based on the Bitter Lesson, Ockham's Razor, and Bennett's Razor. These maximise resources, simplicity of form, and the weakness of constraints on functionality, respectively. I discuss examples including AIXI, the free energy principle and The Embiggening of language models. I conclude that though scale-maxed approximation dominates, AGI will be a fusion of tools and meta-approaches. The Embiggening was enabled by improvements in hardware. Now the bottlenecks are sample and energy efficiency.

Updated: 2025-07-18 13:45:28

Domains: cs.AI

Download: http://arxiv.org/abs/2503.23923v2

Political Leaning and Politicalness Classification of Texts

This paper addresses the challenge of automatically classifying text according to political leaning and politicalness using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.

Updated: 2025-07-18 13:44:30

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.13913v1

Self-supervised learning on gene expression data

Predicting phenotypes from gene expression data is a crucial task in biomedical research, enabling insights into disease mechanisms, drug responses, and personalized medicine. Traditional machine learning and deep learning rely on supervised learning, which requires large quantities of labeled data that are costly and time-consuming to obtain in the case of gene expression data. Self-supervised learning has recently emerged as a promising approach to overcome these limitations by extracting information directly from the structure of unlabeled data. In this study, we investigate the application of state-of-the-art self-supervised learning methods to bulk gene expression data for phenotype prediction. We selected three self-supervised methods, based on different approaches, to assess their ability to exploit the inherent structure of the data and to generate qualitative representations which can be used for downstream predictive tasks. Using several publicly available gene expression datasets, we demonstrate how the selected methods can effectively capture complex information and improve phenotype prediction accuracy. The results show that self-supervised learning methods can outperform traditional supervised models while offering the significant advantage of reducing dependency on annotated data. We provide a comprehensive analysis of the performance of each method, highlighting its strengths and limitations. We also provide recommendations for using these methods depending on the case under study. Finally, we outline future research directions to enhance the application of self-supervised learning in the field of gene expression data analysis. This study is the first to apply self-supervised learning to bulk RNA-Seq data.

Updated: 2025-07-18 13:43:04

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.13912v1

Beyond DNS: Unlocking the Internet of AI Agents via the NANDA Index and Verified AgentFacts

The Internet is poised to host billions to trillions of autonomous AI agents that negotiate, delegate, and migrate in milliseconds, workloads that will strain DNS-centred identity and discovery. In this paper, we describe the NANDA index architecture, which we envision as a means for discoverability, identifiability and authentication in the internet of AI agents. We present an architecture where a minimal lean index resolves to dynamic, cryptographically verifiable AgentFacts that support multi-endpoint routing, load balancing, privacy-preserving access, and credentialed capability assertions. Our architecture design delivers five concrete guarantees: (1) a quilt-like index design that makes both NANDA-native and third-party agents discoverable via the index, (2) rapid global resolution for newly spawned AI agents, (3) sub-second revocation and key rotation, (4) schema-validated capability assertions, and (5) privacy-preserving discovery across organisational boundaries via verifiable, least-disclosure queries. We formalize the AgentFacts schema, specify a CRDT-based update protocol, and prototype adaptive resolvers. The result is a lightweight, horizontally scalable foundation that unlocks secure, trust-aware collaboration for the next generation of the Internet of AI agents, without abandoning existing web infrastructure.

Updated: 2025-07-18 13:40:46

Domains: cs.NI,cs.AI,cs.CR,cs.MA

Download: http://arxiv.org/abs/2507.14263v1

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

We propose using a Gaussian Mixture Model (GMM) as the reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically, we match the first- and second-order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ and class-conditional models trained on ImageNet datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example, on ImageNet 256x256 with 10 sampling steps, we achieve an FID of 6.94 and an IS of 207.85 with a GMM kernel, compared to 10.15 and 196.73 respectively with a Gaussian kernel.
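
To see what moment matching means concretely, here is a hedged toy example (the constraint only, not the paper's full DDIM kernel): an equal-weight two-component GMM with means mu +/- spread and per-component variance var - spread**2 has mean mu and variance (var - spread**2) + spread**2 = var, so its first two central moments match the Gaussian N(mu, var) it replaces:

```python
import random

def symmetric_gmm_sampler(mu, var, spread):
    """Equal-weight two-component GMM with means mu +/- spread and
    per-component variance var - spread**2 (requires spread**2 < var).
    By construction its mean is mu and its variance is var, matching
    the first and second central moments of N(mu, var)."""
    comp_var = var - spread ** 2
    assert comp_var > 0, "spread too large for the target variance"
    def sample():
        mean = mu + spread if random.random() < 0.5 else mu - spread
        return random.gauss(mean, comp_var ** 0.5)
    return sample
```

The free parameter `spread` changes higher moments (the mixture is bimodal for large spreads) while leaving the matched mean and variance untouched.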

Updated: 2025-07-18 13:31:06

Domains: cs.CV,cs.AI,cs.LG,I.2, I.4

Download: http://arxiv.org/abs/2311.04938v3

Instance space analysis of the capacitated vehicle routing problem

This paper seeks to advance research on the capacitated vehicle routing problem (CVRP) by addressing the challenge of understanding the nuanced relationships between instance characteristics and metaheuristic (MH) performance. We present Instance Space Analysis (ISA) as a valuable tool that allows for a new perspective on the field. By combining the ISA methodology with a dataset from the DIMACS 12th Implementation Challenge on Vehicle Routing, our research enabled the identification of 23 relevant instance characteristics. Our use of the PRELIM, SIFTED, and PILOT stages, which employ dimensionality reduction and machine learning methods, allowed us to create a two-dimensional projection of the instance space to understand how the structure of instances affects the behavior of MHs. A key contribution of our work is that we provide a projection matrix, which makes it straightforward to incorporate new instances into this analysis and allows for a new method for instance analysis in the CVRP field.
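
Applying such a projection matrix to a new instance is a single matrix-vector product; the sketch below uses made-up numbers, since the actual 2 x 23 matrix lives in the paper:

```python
def project_instance(features, projection_matrix):
    """Map an instance's feature vector to 2-D instance-space coordinates:
    z = A @ f, where A is the published 2 x d projection matrix."""
    return [sum(a * f for a, f in zip(row, features))
            for row in projection_matrix]
```

New CVRP instances can then be placed on the same 2-D map as the benchmark set, which is what makes the projection matrix reusable for future analyses.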

Updated: 2025-07-18 13:00:55

Domains: cs.AI

Download: http://arxiv.org/abs/2507.10397v2

Stablecoins: Fundamentals, Emerging Issues, and Open Challenges

Stablecoins, with a capitalization exceeding 200 billion USD as of January 2025, have shown significant growth, with annual transaction volumes exceeding 10 trillion dollars in 2023 and nearly doubling that figure in 2024. This exceptional success has attracted the attention of traditional financial institutions, with an increasing number of governments exploring the potential of Central Bank Digital Currencies (CBDCs). Although academia has recognized the importance of stablecoins, research in this area remains fragmented, incomplete, and sometimes contradictory. In this paper, we aim to address this gap with a structured literature analysis, correlating recent contributions to present a picture of the complex economic, technical, and regulatory aspects of stablecoins. To achieve this, we formulate the main research questions and categorize scientific contributions accordingly, identifying main results, data sources, methodologies, and open research questions. The research questions we address in this survey paper cover several topics, such as the stability of various stablecoins, novel designs and implementations, and relevant regulatory challenges. The studies employ a wide range of methodologies and data sources, which we critically analyze and synthesize. Our analysis also reveals significant research gaps, including limited studies on security and privacy, underexplored stablecoins, unexamined failure cases, unstudied governance mechanisms, and the treatment of stablecoins under financial accounting standards, among other areas.

Updated: 2025-07-18 13:00:19

Domains: econ.GN,cs.CR,q-fin.EC

Download: http://arxiv.org/abs/2507.13883v1

Using LLMs to identify features of personal and professional skills in an open-response situational judgment test

Academic programs are increasingly recognizing the importance of personal and professional skills and their critical role alongside technical expertise in preparing students for future success in diverse career paths. With this growing demand comes the need for scalable systems to measure, evaluate, and develop these skills. Situational Judgment Tests (SJTs) offer one potential avenue for measuring these skills in a standardized and reliable way, but open-response SJTs have traditionally relied on trained human raters for evaluation, presenting operational challenges to delivering SJTs at scale. Past attempts at developing NLP-based scoring systems for SJTs have fallen short due to issues with the construct validity of these systems. In this article, we explore a novel approach to extracting construct-relevant features from SJT responses using large language models (LLMs). We use the Casper SJT to demonstrate the efficacy of this approach. This study sets the foundation for future developments in automated scoring for personal and professional skills.

Updated: 2025-07-18 12:59:17

Domains: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2507.13881v1

Real-Time Fusion of Visual and Chart Data for Enhanced Maritime Vision

This paper presents a novel approach to enhancing marine vision by fusing real-time visual data with chart information. Our system overlays nautical chart data onto live video feeds by accurately matching detected navigational aids, such as buoys, with their corresponding representations in chart data. To achieve robust association, we introduce a transformer-based end-to-end neural network that predicts bounding boxes and confidence scores for buoy queries, enabling the direct matching of image-domain detections with world-space chart markers. The proposed method is compared against baseline approaches, including a ray-casting model that estimates buoy positions via camera projection and a YOLOv7-based network extended with a distance estimation module. Experimental results on a dataset of real-world maritime scenes demonstrate that our approach significantly improves object localization and association accuracy in dynamic and challenging environments.
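
A simple geometric baseline for the association step, greedy nearest-pair matching between detections and chart markers projected into the image, might look like this; the pixel threshold is an assumption, and the paper's transformer replaces this heuristic with a learned matching:

```python
import math

def match_detections_to_chart(detections, markers, max_dist=50.0):
    """Greedily associate image-space detections with projected chart
    markers: repeatedly take the closest unmatched pair within `max_dist`
    pixels. A naive baseline, not the paper's learned matcher."""
    pairs = sorted(
        ((math.dist(d, m), i, j)
         for i, d in enumerate(detections)
         for j, m in enumerate(markers)),
        key=lambda t: t[0])
    used_d, used_m, matches = set(), set(), {}
    for dist, i, j in pairs:
        if dist <= max_dist and i not in used_d and j not in used_m:
            used_d.add(i)
            used_m.add(j)
            matches[i] = j
    return matches
```

Greedy matching fails when projections are badly misaligned or detections are dense, which motivates learning the association end to end as the paper does.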

Updated: 2025-07-18 12:58:11

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.13880v1

Large Language Models as Innovators: A Framework to Leverage Latent Space Exploration for Novelty Discovery

Innovative idea generation remains a core challenge in AI, as large language models (LLMs) often struggle to produce outputs that are both novel and relevant. Despite their fluency, LLMs tend to replicate patterns seen during training, limiting their ability to diverge creatively without extensive prompt engineering. Prior work has addressed this through domain-specific heuristics and structured prompting pipelines, but such solutions are brittle and difficult to generalize. In this paper, we propose a model-agnostic latent-space ideation framework that enables controlled, scalable creativity by navigating the continuous embedding space of ideas. Unlike prior methods, our framework requires no handcrafted rules and adapts easily to different domains, input formats, and creative tasks. This paper introduces an early-stage prototype of our method, outlining the conceptual framework and presenting preliminary results that highlight its potential as a general-purpose co-ideator for human-AI collaboration.
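
The core latent-space operation is as simple as blending embedding vectors; everything around it, embedding ideas into the space and decoding blends back into text, requires models not shown in this sketch:

```python
def interpolate_ideas(emb_a, emb_b, alpha=0.5):
    """Convex combination of two idea embeddings; alpha slides the blend
    from idea A (alpha=0) to idea B (alpha=1). Decoding the result back to
    text would need a generation model conditioned on the embedding."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]
```

Perturbing or interpolating in the continuous space is what lets the framework explore between and beyond training patterns without handcrafted rules.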

Updated: 2025-07-18 12:54:28

Domains: cs.AI

Download: http://arxiv.org/abs/2507.13874v1

FAMST: Fast Approximate Minimum Spanning Tree Construction for Large-Scale and High-Dimensional Data

We present Fast Approximate Minimum Spanning Tree (FAMST), a novel algorithm that addresses the computational challenges of constructing Minimum Spanning Trees (MSTs) for large-scale and high-dimensional datasets. FAMST utilizes a three-phase approach: Approximate Nearest Neighbor (ANN) graph construction, ANN inter-component connection, and iterative edge refinement. For a dataset of $n$ points in a $d$-dimensional space, FAMST achieves $\mathcal{O}(dn \log n)$ time complexity and $\mathcal{O}(dn + kn)$ space complexity when $k$ nearest neighbors are considered, which is a significant improvement over the $\mathcal{O}(n^2)$ time and space complexity of traditional methods. Experiments across diverse datasets demonstrate that FAMST achieves remarkably low approximation errors while providing speedups of up to 1000$\times$ compared to exact MST algorithms. We analyze how the key hyperparameters, $k$ (neighborhood size) and $\lambda$ (inter-component edges), affect performance, providing practical guidelines for hyperparameter selection. FAMST enables MST-based analysis on datasets with millions of points and thousands of dimensions, extending the applicability of MST techniques to problem scales previously considered infeasible.
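
The three-phase structure can be sketched end to end on toy data. Neighbour search is brute force here for clarity (the paper's O(dn log n) bound comes from using ANN indices), and the stitching step is exhaustive rather than sampled:

```python
import itertools
import math

def approx_mst(points, k=3):
    """FAMST-style sketch: (1) build a k-nearest-neighbour edge set,
    (2) run Kruskal over those sparse edges, (3) stitch any components
    the kNN graph left disconnected with the cheapest crossing edge."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    # Phase 1: candidate edges restricted to each point's k nearest neighbours.
    edges = set()
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: dist(i, j))[:k]
        edges.update((min(i, j), max(i, j)) for j in nbrs)

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Phase 2: Kruskal over the sparse kNN edge set.
    tree = []
    for i, j in sorted(edges, key=lambda e: dist(*e)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))

    # Phase 3: connect leftover components (exhaustive here, ANN-guided in FAMST).
    while len(tree) < n - 1:
        i, j = min(((a, b) for a, b in itertools.combinations(range(n), 2)
                    if find(a) != find(b)), key=lambda e: dist(*e))
        parent[find(i)] = find(j)
        tree.append((i, j))
    return tree
```

Restricting Kruskal to the kNN edge set is what breaks the O(n^2) barrier: most MST edges connect near neighbours, so the sparse graph already contains almost all of them and only a few components need stitching.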

Updated: 2025-07-18 12:53:58

标题: FAMST:大规模高维数据的快速近似最小生成树构建

摘要: 我们提出了快速近似最小生成树(FAMST)算法,该算法解决了构建大规模和高维数据集的最小生成树(MST)的计算挑战。FAMST利用三阶段方法:近似最近邻图构建、最近邻组件连接和迭代边缘细化。对于一个$n$个点在$d$维空间中的数据集,当考虑$k$个最近邻时,FAMST实现了时间复杂度为$\mathcal{O}(dn \log n)$和空间复杂度为$\mathcal{O}(dn + kn)$,这显著改善了传统方法的$\mathcal{O}(n^2)$时间和空间复杂度。 通过对不同数据集的实验,我们发现FAMST在提供高达1000倍的速度优势的同时,实现了极低的近似误差。我们分析了关键超参数$k$(邻域大小)和$\lambda$(组件间边缘)如何影响性能,并提供了超参数选择的实用指南。FAMST使得在拥有数百万个点和数千个维度的数据集上进行基于MST的分析成为可能,扩展了MST技术的适用范围到以前被认为不可行的问题规模。

更新时间: 2025-07-18 12:53:58

领域: cs.DS,cs.AI

下载: http://arxiv.org/abs/2507.14261v1

CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors

Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (e.g., DroneVehicle, LLVIP, VisDrone, MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios.

Updated: 2025-07-18 12:51:29

标题: CDUPatch:用于双模可见-红外探测器的颜色驱动通用对抗性贴片攻击

摘要: 对抗性补丁广泛用于评估现实场景中物体检测系统的鲁棒性。这些补丁最初设计用于欺骗单模检测器(例如可见光或红外),最近已扩展到目标为可见-红外双模检测器。然而,现有的双模对抗性补丁攻击在各种物理场景中的攻击效果有限。为了解决这个问题,我们提出了CDUPatch,这是一个针对可见-红外物体检测器的通用跨模补丁攻击,跨越尺度、视角和场景。具体来说,我们观察到颜色变化导致不同程度的热吸收,导致红外成像中的温度差异。利用这一特性,我们提出了一个RGB到红外的适配器,将RGB补丁映射到红外补丁,实现跨模补丁的统一优化。通过学习对抗性补丁上的最佳颜色分布,我们可以操纵其热响应并生成对抗性红外纹理。此外,我们引入了多尺度剪切策略,并构建了一个新的可见-红外数据集MSDrone,其中包含不同尺度和视角下的航拍车辆图像。这些数据增强策略增强了我们的补丁在现实条件下的鲁棒性。在四个基准数据集上的实验(例如DroneVehicle、LLVIP、VisDrone、MSDrone)显示,我们的方法在数字领域优于现有的补丁攻击。广泛的物理测试进一步证实了在尺度、视角和场景之间的强大可转移性。

更新时间: 2025-07-18 12:51:29

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2504.10888v2

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

Vision-language models (VLMs) increasingly leverage diverse knowledge sources to address complex tasks, often encountering conflicts between their internal parametric knowledge and external information. Knowledge conflicts can result in hallucinations and unreliable responses, but the mechanisms governing such interactions remain unknown. To address this gap, we analyze the mechanisms that VLMs use to resolve cross-modal conflicts by introducing a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. We localize with logit inspection a small set of heads that control the conflict. Moreover, by modifying these heads, we can steer the model towards its internal knowledge or the visual inputs. Finally, we show that attention from such heads pinpoints localized image regions driving visual overrides, outperforming gradient-based attribution in precision.

Updated: 2025-07-18 12:42:30

标题: 当所见压倒所知:解开视觉-语言模型中的知识冲突

摘要: 视觉语言模型(VLMs)越来越多地利用多样的知识来源来解决复杂任务,经常遇到其内部参数化知识和外部信息之间的冲突。知识冲突可能导致幻觉和不可靠的响应,但是控制这种相互作用的机制仍然未知。为了填补这一空白,我们引入了一个故意与内部常识知识相矛盾的多模态反事实查询数据集,并分析了VLMs用于解决跨模态冲突的机制。我们通过logit检查定位了一小组控制冲突的注意力头。此外,通过修改这些注意力头,我们可以引导模型朝向其内部知识或视觉输入。最后,我们表明来自这些注意力头的注意力可以准确定位驱动视觉覆盖的局部图像区域,在精度上优于基于梯度的归因方法。

更新时间: 2025-07-18 12:42:30

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.13868v1

Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models

Sparse dictionary learning (DL) has emerged as a powerful approach to extract semantically meaningful concepts from the internals of large language models (LLMs) trained mainly in the text domain. In this work, we explore whether DL can extract meaningful concepts from less human-interpretable scientific data, such as vision foundation models trained on cell microscopy images, where limited prior knowledge exists about which high-level concepts should arise. We propose a novel combination of a sparse DL algorithm, Iterative Codebook Feature Learning (ICFL), with a PCA whitening pre-processing step derived from control data. Using this combined approach, we successfully retrieve biologically meaningful concepts, such as cell types and genetic perturbations. Moreover, we demonstrate how our method reveals subtle morphological changes arising from human-interpretable interventions, offering a promising new direction for scientific discovery via mechanistic interpretability in bioimaging.
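PCA whitening fitted on control data, as in the pre-processing step described above, can be sketched as follows; the toy "control embeddings" are random stand-ins for model activations, not the paper's data.

```python
import numpy as np

def fit_whitener(control, eps=1e-6):
    """Fit a PCA whitening transform on control embeddings (rows = samples)."""
    mu = control.mean(axis=0)
    cov = np.cov(control - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs / np.sqrt(vals + eps)  # each eigenvector column scaled to unit variance
    return mu, W

def whiten(x, mu, W):
    return (x - mu) @ W

rng = np.random.default_rng(0)
# Correlated toy "control" embeddings standing in for foundation-model activations.
control = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                                [0.0, 1.0, 0.3],
                                                [0.0, 0.0, 0.5]])
mu, W = fit_whitener(control)
z = whiten(control, mu, W)
print(np.round(np.cov(z, rowvar=False), 2))  # ≈ identity after whitening
```

In the paper's setting, the whitener would be fitted on control-condition embeddings and then applied to all embeddings before the sparse dictionary learning step (ICFL), which is not sketched here.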

Updated: 2025-07-18 12:37:43

标题: 朝向科学发现之路:利用字典学习从显微镜基础模型中提取生物概念

摘要: 稀疏字典学习(DL)已经成为一种强大的方法,可以从主要在文本领域训练的大型语言模型(LLMs)的内部提取语义上有意义的概念。在这项工作中,我们探讨了DL是否可以从较少人类可解释的科学数据中提取有意义的概念,例如在细胞显微镜图像上训练的视觉基础模型,在这类数据中,关于应出现哪些高级概念的先验知识十分有限。我们提出了一种新颖的组合方法,将稀疏DL算法(迭代码书特征学习(ICFL))与从控制数据中导出的PCA白化预处理步骤相结合。使用这种组合方法,我们成功地检索到生物学上有意义的概念,如细胞类型和基因扰动。此外,我们展示了我们的方法如何揭示出自人类可解释干预而产生的微妙形态变化,为通过生物成像中的机制可解释性开辟了一条新的有希望的科学发现方向。

更新时间: 2025-07-18 12:37:43

领域: cs.LG,cs.AI,cs.CV,stat.ML

下载: http://arxiv.org/abs/2412.16247v3

HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it still faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures the evolution of temporal knowledge in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG. Our code and data are available at: https://github.com/0russwest0/HoH.
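A token-level diff of two versions of a fact, the primitive the benchmark pipeline builds on, can be sketched with the standard library; the example sentences are illustrative, not taken from the dataset.

```python
import difflib

def token_diff(old_fact, new_fact):
    """Token-level diff between two versions of a fact, the kind of signal used
    to track how real-world knowledge evolves over time (toy sentences assumed)."""
    old_toks, new_toks = old_fact.split(), new_fact.split()
    sm = difflib.SequenceMatcher(a=old_toks, b=new_toks)
    changes = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":  # keep only insert / delete / replace spans
            changes.append((op, old_toks[i1:i2], new_toks[j1:j2]))
    return changes

old = "The CEO of Twitter is Jack Dorsey"
new = "The CEO of Twitter is Linda Yaccarino"
print(token_diff(old, new))
```

Each changed span marks an outdated/updated fact pair, from which a QA item (question, outdated answer, current answer) can be generated by the LLM pipeline the abstract mentions.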

Updated: 2025-07-18 12:34:45

标题: HoH:评估过时信息对检索增强生成影响的动态基准

摘要: 检索增强生成(RAG)已经成为解决大型语言模型(LLMs)中知识过时问题的有效方法,但仍然面临一个关键挑战:知识库中过时信息的普遍存在。当前的研究主要集中在整合最新信息,但过时信息在检索来源中存在的影响仍未得到充分解决。为了弥合这一差距,我们引入了HoH,这是第一个专门设计用于评估过时信息对RAG影响的基准。我们的基准利用标记级别的差异算法结合LLM管道,有效创建一个大规模的问答数据集,准确捕捉了现实世界事实的时间知识演变。通过全面的实验,我们发现过时信息以两种关键方式显著降低了RAG的性能:(1)它通过让模型分散于正确信息而实质性降低了响应准确性,(2)即使当前信息可用,它也可能误导模型生成潜在有害的输出。当前的RAG方法在处理过时信息时在检索和生成方面都面临困难。这些发现凸显了解决RAG中时间挑战的创新解决方案的迫切需要。我们的代码和数据可以在以下链接找到:https://github.com/0russwest0/HoH。

更新时间: 2025-07-18 12:34:45

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.04800v3

SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection

Nowadays, the importance of software with natural-language user interfaces cannot be overstated. In particular, generating a SPARQL query for a given natural-language question (often called Query Building) from information retrieved for that question is the central task of Question Answering (QA) systems working over Knowledge Graphs (KGQA). Due to the rise of Large Language Models (LLMs), they are considered a well-suited method to increase the quality of the question-answering functionality, as there is still a lot of room for improvement, aiming for enhanced quality and trustworthiness. However, LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. In this paper, we introduce a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question under various conditions: (1) zero-shot SPARQL generation, (2) with knowledge injection, and (3) with "anonymized" knowledge injection. This enables us, for the first time, to estimate the influence of the training data on the QA quality improved by LLMs. Ultimately, this will help to identify how portable a method is or whether good results might mostly be achieved because a benchmark was already included in the training data (cf. LLM memorization). The developed method is portable, robust, and supports any knowledge graph; therefore, it could be easily applied to any KGQA system or LLM, such that consistent insights into actual LLM capabilities can be generated.
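One way to picture the "anonymized" knowledge-injection condition is to replace entity labels in the injected triples with opaque placeholders, so the model can only exploit graph structure, not memorized names. The function and toy triples below are our illustration, not the paper's implementation.

```python
def anonymize(triples):
    """Replace entity labels with opaque placeholders so an LLM cannot rely on
    memorized names, only on the injected graph structure (toy triples assumed)."""
    mapping, anon = {}, []
    for s, p, o in triples:
        for e in (s, o):
            if e not in mapping:
                mapping[e] = f"ENTITY_{len(mapping)}"
        anon.append((mapping[s], p, mapping[o]))
    return anon, mapping

triples = [("Berlin", "capitalOf", "Germany"),
           ("Germany", "memberOf", "EU")]
anon, mapping = anonymize(triples)
print(anon)
```

Comparing SPARQL quality with raw versus anonymized triples in the prompt is what lets the study separate genuine query-building skill from training-data memorization.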

Updated: 2025-07-18 12:28:08

标题: SPARQL查询生成与LLMs:衡量训练数据记忆和知识注入的影响

摘要: 如今,具有自然语言用户界面的软件的重要性不容小觑。特别是,在问答(QA)系统中,从针对问题检索到的信息生成给定自然语言问题的SPARQL查询(通常称为查询构建)是在知识图谱(KGQA)上运行的QA系统的核心任务。由于大型语言模型(LLMs)的兴起,它们被认为是提高问答功能质量的一种合适方法,因为仍然有很大改进的空间,旨在提高质量和可信度。然而,LLMs是在网络数据上训练的,研究人员无法控制基准或知识图是否已包含在训练数据中。在本文中,我们介绍了一种在各种条件下通过从自然语言问题生成SPARQL查询来评估LLMs质量的新方法:(1)零样本SPARQL生成,(2)知识注入,以及(3)"匿名化"知识注入。这使我们首次能够估计LLMs改善的QA质量受训练数据影响的程度。最终,这将有助于确定一种方法的可移植性,或者好结果是否主要是因为基准已包含在训练数据中(参见LLM记忆)。开发的方法是可移植的、稳健的,并支持任何知识图谱;因此,它可以轻松应用于任何KGQA系统或LLM,从而能够对LLM的实际能力形成一致的洞察。

更新时间: 2025-07-18 12:28:08

领域: cs.IR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.13859v1

Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude

Recent advances in Large Language Models (LLMs) have enabled human-like responses across various tasks, raising questions about their ethical decision-making capabilities and potential biases. This study investigates bias with respect to protected attributes in LLMs through systematic evaluation of their responses to ethical dilemmas. Using two prominent models - GPT-3.5 Turbo and Claude 3.5 Sonnet - we analyzed their decision-making patterns across multiple protected attributes including age, gender, race, appearance, and disability status. Through 11,200 experimental trials involving both single-factor and two-factor protected attribute combinations, we evaluated the models' ethical preferences, sensitivity, stability, and clustering of preferences. Our findings reveal significant protected-attribute biases in both models, with consistent preferences for certain features (e.g., "good-looking") and systematic neglect of others. Notably, while GPT-3.5 Turbo showed stronger preferences aligned with traditional power structures, Claude 3.5 Sonnet demonstrated more diverse protected attribute choices. We also found that ethical sensitivity significantly decreases in more complex scenarios involving multiple protected attributes. Additionally, linguistic referents heavily influence the models' ethical evaluations, as demonstrated by differing responses to racial descriptors (e.g., "Yellow" versus "Asian"). These findings highlight critical concerns about the potential impact of LLM biases in autonomous decision-making systems and emphasize the need for careful consideration of protected attributes in AI development. Our study contributes to the growing body of research on AI ethics by providing a systematic framework for evaluating protected attributes in LLMs' ethical decision-making capabilities.

Updated: 2025-07-18 12:09:59

标题: AI伦理困境决策中的偏见:ChatGPT与Claude的比较研究

摘要: 大型语言模型(LLMs)的最新进展使其在各种任务中实现了类似人类的反应,引发了关于它们的道德决策能力和潜在偏见的问题。本研究通过系统评估LLMs对道德困境的反应来调查受保护属性偏见。我们使用了两种知名模型 - GPT-3.5 Turbo和Claude 3.5 Sonnet - 分析它们对年龄、性别、种族、外貌和残疾状态等多个受保护属性的决策模式。通过包括单因素和双因素受保护属性组合在内的11,200个实验试验,我们评估了模型的道德偏好、敏感性、稳定性和偏好的聚类。我们的研究结果显示,两种模型均存在显著的受保护属性偏见,对某些特征(如"好看")具有一致的偏好,同时系统性地忽视其他特征。值得注意的是,虽然GPT-3.5 Turbo表现出与传统权力结构更为一致的偏好,Claude 3.5 Sonnet展示出更多样化的受保护属性选择。我们还发现,在涉及多个受保护属性的更复杂情境中,道德敏感性显著下降。此外,语言标识对模型的道德评估产生了重大影响,如对种族描述词的不同反应(例如,"黄种人"与"亚洲人")。这些研究结果突显了对LLM偏见在自主决策系统中潜在影响的重要关切,并强调了在AI开发中对受保护属性进行谨慎考虑的必要性。我们的研究通过提供一个系统框架来评估LLMs在道德决策能力中的受保护属性偏见,为AI伦理研究领域的不断增长贡献了一份力量。

更新时间: 2025-07-18 12:09:59

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2501.10484v2

Towards Regulated Deep Learning

Regulation of Multi-Agent Systems (MAS) and Declarative Electronic Institutions (DEIs) has been a multidisciplinary research topic over the past decade, involving (physical and software) agents and law from the outset and, since 2016, evolving towards the press-proclaimed "robot lawyer". One of the first proposals for restricting the behaviour of software agents was Electronic Institutions. However, with the recent reformulation of Artificial Neural Networks (ANNs) as Deep Learning (DL), security, privacy, ethical, and legal issues regarding the use of DL have raised concerns in the Artificial Intelligence (AI) community. Now that the regulation of MAS is largely well addressed, we propose the regulation of Artificial Neural Networks as agent-based training of a special type of regulated ANN that we call an Institutional Neural Network (INN). The main purpose of this paper is to bring attention to Artificial Teaching (AT) and to give a tentative answer in the form of a proof-of-concept implementation of Regulated Deep Learning (RDL). This paper introduces the former concept and provides $I^*$, a language previously used to model and extend Electronic Institutions declaratively, as a means to regulate the execution of Artificial Neural Networks and their interactions with Artificial Teachers (ATs).

Updated: 2025-07-18 12:07:30

标题: 走向受监管的深度学习

摘要: 多智能体系统(MAS)和声明性电子机构(DEIs)的监管是过去十年的一个跨学科研究课题,涉及(物理和软件)智能体和法律,但自2016年以来,发展为新闻宣称的机器人律师。限制软件智能体行为的最初提议之一是电子机构。然而,随着人工神经网络(ANNs)被重新制定为深度学习(DL),有关DL使用的安全、隐私、伦理和法律问题引起了人工智能(AI)社区的关注。现在,几乎已经正确解决了MAS的监管问题,我们提出将人工神经网络的监管作为一种特殊类型的受管制人工神经网络的基于代理的培训,我们称之为机构神经网络(INN)。本文的主要目的是引起对人工教学(AT)的关注,并给出一个初步回答,展示受管制深度学习(RDL)的概念验证实现。本文介绍了前述概念,并提供了$I^*$,这是一种先前用于声明性建模和扩展电子机构的语言,作为监管人工神经网络执行及其与人工教师(ATs)互动的手段。

更新时间: 2025-07-18 12:07:30

领域: cs.AI,cs.LG,cs.LO,cs.MA,cs.PL

下载: http://arxiv.org/abs/1912.13122v8

Causal Knowledge Transfer for Multi-Agent Reinforcement Learning in Dynamic Environments

[Context] Multi-agent reinforcement learning (MARL) has achieved notable success in environments where agents must learn coordinated behaviors. However, transferring knowledge across agents remains challenging in non-stationary environments with changing goals. [Problem] Traditional knowledge transfer methods in MARL struggle to generalize, and agents often require costly retraining to adapt. [Approach] This paper introduces a causal knowledge transfer framework that enables RL agents to learn and share compact causal representations of paths within a non-stationary environment. As the environment changes (new obstacles), agents' collisions require adaptive recovery strategies. We model each collision as a causal intervention instantiated as a sequence of recovery actions (a macro) whose effect corresponds to a causal knowledge of how to circumvent the obstacle while increasing the chances of achieving the agent's goal (maximizing cumulative reward). This recovery action macro is transferred online from a second agent and is applied in a zero-shot fashion, i.e., without retraining, just by querying a lookup model with local context information (collisions). [Results] Our findings reveal two key insights: (1) agents with heterogeneous goals were able to bridge about half of the gap between random exploration and a fully retrained policy when adapting to new environments, and (2) the impact of causal knowledge transfer depends on the interplay between environment complexity and agents' heterogeneous goals.
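The zero-shot macro transfer can be pictured as a lookup: a collision's local context indexes a recovery-action sequence learned by another agent. The context keys and action names below are illustrative assumptions, not the paper's actual encoding.

```python
# Recovery macros keyed by local collision context; the keys and action names
# are illustrative assumptions, not the paper's encoding.
macro_library = {
    ("obstacle_ahead", "goal_north"): ["turn_left", "forward", "turn_right", "forward"],
    ("obstacle_ahead", "goal_east"):  ["turn_right", "forward", "turn_left", "forward"],
}

def recover(context, library):
    """Zero-shot lookup: reuse a macro transferred from another agent, no retraining."""
    return library.get(context, ["random_explore"])  # fall back to exploration

plan = recover(("obstacle_ahead", "goal_north"), macro_library)
print(plan)
fallback = recover(("unknown_context", "goal_west"), macro_library)
print(fallback)
```

In the framework, each stored macro corresponds to a causal intervention whose effect (circumventing the obstacle while increasing expected cumulative reward) has already been validated by the agent that contributed it.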

Updated: 2025-07-18 11:59:55

标题: 动态环境中多智能体强化学习的因果知识转移

摘要: [背景] 多智能体强化学习(MARL)在需要学习协调行为的环境中取得了显著的成功。然而,在具有不断变化目标的非稳态环境中跨智能体传输知识仍然具有挑战性。[问题] MARL中传统的知识传输方法往往难以泛化,智能体通常需要昂贵的重新训练来适应。[方法] 本文介绍了一种因果知识传输框架,使RL智能体能够在非稳态环境中学习和分享路径的紧凑因果表示。随着环境的变化(新增障碍物),智能体的碰撞需要自适应恢复策略。我们将每次碰撞建模为一种因果干预,实例化为一系列恢复动作(宏),其效果对应于如何规避障碍物的因果知识,同时增加实现智能体目标(最大化累积奖励)的机会。这种恢复动作宏在线从第二个智能体传输,并以零次训练的方式应用,即通过查询具有本地上下文信息(碰撞)的查找模型。[结果] 我们的研究结果揭示了两个关键见解:(1)具有异构目标的智能体在适应新环境时能够弥合大约一半的随机探索与完全重新训练策略之间的差距,(2)因果知识传输的影响取决于环境复杂性和智能体异构目标之间的相互作用。

更新时间: 2025-07-18 11:59:55

领域: cs.AI

下载: http://arxiv.org/abs/2507.13846v1

Scalable Submodular Policy Optimization via Pruned Submodularity Graph

In Reinforcement Learning (RL), an agent interacts with the environment via a set of possible actions, and a reward is generated from some unknown distribution. The task is to find an optimal set of actions such that the reward after a certain time step is maximized. In a traditional setup, the reward function in an RL problem is considered additive. In reality, however, many problems, including path planning and coverage control, have reward functions that exhibit diminishing returns and can be modeled as submodular functions. In this paper, we study a variant of the RL problem where the reward function is submodular, and our objective is to find an optimal policy such that this reward function is maximized. We propose a pruned submodularity-graph-based approach that provides a provably approximate solution in feasible computation time. The proposed approach is analyzed for its time and space requirements as well as its performance guarantee. We experiment with a benchmark agent-environment setup used in similar previous studies, and report the results. From the results, we observe that the policy obtained by our proposed approach yields more reward than the baseline methods.
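A toy instance of a submodular (diminishing-returns) reward and a greedy policy over it; this is a generic coverage sketch to make the reward structure concrete, not the paper's pruned submodularity-graph algorithm.

```python
def coverage(selected, sets):
    """Submodular reward: number of distinct targets covered (diminishing returns:
    each extra action can only add targets not already covered)."""
    covered = set()
    for a in selected:
        covered |= sets[a]
    return len(covered)

def greedy_policy(sets, budget):
    """Greedily pick actions with the largest marginal coverage gain; for monotone
    submodular rewards this classically achieves a (1 - 1/e) approximation."""
    selected = []
    for _ in range(budget):
        gains = {a: coverage(selected + [a], sets) - coverage(selected, sets)
                 for a in sets if a not in selected}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break
        selected.append(best)
    return selected

# Toy coverage-control instance: each action observes a region of cells.
action_sets = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}
picked = greedy_policy(action_sets, budget=2)
print(picked, coverage(picked, action_sets))
```

Note how action "b" is worth 2 cells on its own but only 1 after "c" is chosen: that shrinking marginal gain is exactly the submodularity the abstract contrasts with additive rewards.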

Updated: 2025-07-18 11:42:07

标题: 通过修剪后的子模性图实现可扩展的子模性策略优化

摘要: 在强化学习(简称RL)中,一个agent通过一组可能的动作与环境进行交互,并从某个未知分布中生成奖励。任务是找到一组最优动作,使得在某个时间步之后的奖励最大化。在传统设置中,RL问题中的奖励函数被认为是可加的。然而,在现实中存在许多问题,包括路径规划、覆盖控制等,奖励函数遵循递减回报,可以被建模为次模函数。本文研究了一个RL问题的变种,其中奖励函数是次模的,我们的目标是找到一个最优策略,使得这个奖励函数最大化。我们提出了一个基于修剪次模图的方法,在可行的计算时间内提供了一个可证明的近似解。所提出的方法已经进行了分析,以了解其时间和空间要求以及性能保证。我们在一个用于类似先前研究的基准agent-环境设置上进行了实验,并报告了结果。从结果中,我们观察到通过我们提出的方法获得的策略比基准方法带来更多的奖励。

更新时间: 2025-07-18 11:42:07

领域: cs.LG,cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.13834v1

LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop

Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.

Updated: 2025-07-18 11:37:12

标题: LearnLens:LLM技术支持的个性化、课程基础反馈与教育者合作

摘要: 有效的反馈对学生学习至关重要,但对教师来说是耗时的。我们提出了LearnLens,这是一个模块化的、基于LLM的系统,可以在科学教育中生成个性化、与课程对齐的反馈。LearnLens包括三个组成部分:(1)一个捕捉微妙推理错误的错误感知评估模块;(2)一个基于课程的生成模块,使用结构化、与主题相关的记忆链,而不是传统的基于相似性的检索,从而提高相关性并减少噪音;以及(3)一个供教育者参与回路进行自定义和监督的界面。LearnLens解决了现有系统中的关键挑战,提供可扩展、高质量的反馈,使教师和学生都能获益。

更新时间: 2025-07-18 11:37:12

领域: cs.CY,cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2507.04295v3

Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces

Most visual grounding solutions primarily focus on realistic images. However, applications involving synthetic images, such as Graphical User Interfaces (GUIs), remain limited. This restricts the development of autonomous computer vision-powered artificial intelligence (AI) agents for automatic application interaction. Enabling AI to effectively understand and interact with GUIs is crucial to advancing automation in software testing, accessibility, and human-computer interaction. In this work, we explore Instruction Visual Grounding (IVG), a multi-modal approach to object identification within a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the element on the screen where the instruction should be executed. We propose two main methods: (1) IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and (2) IVGdirect, which uses a multimodal architecture for end-to-end grounding. For each method, we introduce a dedicated dataset. In addition, we propose the Central Point Validation (CPV) metric, a relaxed variant of the classical Central Proximity Score (CPS) metric. Our final test dataset is publicly released to support future research.
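One plausible reading of the relaxed CPV metric is that a predicted click point counts as valid whenever it lands inside the target element's bounding box; the definition and coordinates below are our illustration, since the paper's exact formula is not given here.

```python
def central_point_validation(pred_xy, target_box):
    """One plausible reading of the CPV metric: the predicted click point is
    valid iff it falls inside the ground-truth element's bounding box.
    (Illustrative definition; the paper's exact formula may differ.)"""
    x, y = pred_xy
    x0, y0, x1, y1 = target_box
    return x0 <= x <= x1 and y0 <= y <= y1

# Hypothetical "Save" button occupying a 120x40 px region of the screen.
save_button = (200, 300, 320, 340)
print(central_point_validation((250, 320), save_button))  # inside the button
print(central_point_validation((150, 320), save_button))  # outside the button
```

A binary hit/miss like this is more forgiving than a distance-based proximity score, which matches the abstract's description of CPV as a relaxed variant of CPS.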

Updated: 2025-07-18 11:35:01

标题: 桌面图形用户界面的高效交互的视觉基础方法

摘要: 大多数视觉定位解决方案主要关注逼真图像。然而,涉及合成图像的应用,如图形用户界面(GUI),仍然有限。这限制了自主计算机视觉驱动的人工智能(AI)代理在自动应用交互方面的发展。使AI能够有效地理解和与GUI进行交互对于推进软件测试、可访问性和人机交互的自动化至关重要。在这项工作中,我们探讨了指令视觉定位(IVG),这是一种多模态方法,用于在GUI中识别对象。更具体地说,给定自然语言指令和一个GUI屏幕,IVG定位应在屏幕上执行指令的元素的坐标。我们提出了两种主要方法:(1)IVGocr,它结合了一个大型语言模型(LLM)、一个目标检测模型和一个光学字符识别(OCR)模块;和(2)IVGdirect,它使用端到端定位的多模态架构。对于每种方法,我们引入了一个专门的数据集。此外,我们提出了中央点验证(CPV)度量,这是经典的中央接近度分数(CPS)度量的一种宽松变体。我们最终的测试数据集已公开发布,以支持未来的研究。

更新时间: 2025-07-18 11:35:01

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2407.01558v3

Bridging Local and Global Knowledge via Transformer in Board Games

Although AlphaZero has achieved superhuman performance in board games, recent studies reveal its limitations in handling scenarios requiring a comprehensive understanding of the entire board, such as recognizing long-sequence patterns in Go. To address this challenge, we propose ResTNet, a network that interleaves residual and Transformer blocks to bridge local and global knowledge. ResTNet improves playing strength across multiple board games, increasing win rate from 54.6% to 60.8% in 9x9 Go, 53.6% to 60.9% in 19x19 Go, and 50.4% to 58.0% in 19x19 Hex. In addition, ResTNet effectively processes global information and tackles two long-sequence patterns in 19x19 Go, including circular pattern and ladder pattern. It reduces the mean square error for circular pattern recognition from 2.58 to 1.07 and lowers the attack probability against an adversary program from 70.44% to 23.91%. ResTNet also improves ladder pattern recognition accuracy from 59.15% to 80.01%. By visualizing attention maps, we demonstrate that ResTNet captures critical game concepts in both Go and Hex, offering insights into AlphaZero's decision-making process. Overall, ResTNet shows a promising approach to integrating local and global knowledge, paving the way for more effective AlphaZero-based algorithms in board games. Our code is available at https://rlg.iis.sinica.edu.tw/papers/restnet.

Updated: 2025-07-18 11:31:06

标题: 通过变压器在桌游中连接本地和全球知识

摘要: 尽管AlphaZero在棋盘游戏中取得了超人类表现,但最近的研究揭示了其在处理需要全面理解整个棋盘的情景方面的局限性,比如在围棋中识别长序列模式。为了解决这一挑战,我们提出了ResTNet,这是一个将残差和Transformer块交错使用以桥接局部和全局知识的网络。ResTNet提高了跨多个棋盘游戏的下棋实力,使得9x9围棋中的胜率从54.6%提高到60.8%,19x19围棋中从53.6%提高到60.9%,19x19六角棋中从50.4%提高到58.0%。此外,ResTNet有效地处理全局信息,并解决了19x19围棋中的两种长序列模式,包括循环模式和阶梯模式。它将循环模式识别的均方误差从2.58降低到1.07,并将对抗性程序的攻击概率从70.44%降低到23.91%。ResTNet还将阶梯模式识别准确率从59.15%提高到80.01%。通过可视化注意力地图,我们展示了ResTNet在围棋和六角棋中捕捉关键游戏概念,为AlphaZero的决策过程提供了见解。总的来说,ResTNet显示了一个有前途的方法,将局部和全局知识整合在一起,为基于AlphaZero的更有效算法在棋盘游戏中铺平了道路。我们的代码可在https://rlg.iis.sinica.edu.tw/papers/restnet 上找到。

更新时间: 2025-07-18 11:31:06

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.05347v2

When Speed meets Accuracy: an Efficient and Effective Graph Model for Temporal Link Prediction

Temporal link prediction in dynamic graphs is a critical task with applications in diverse domains such as social networks, recommendation systems, and e-commerce platforms. While existing Temporal Graph Neural Networks (T-GNNs) have achieved notable success by leveraging complex architectures to model temporal and structural dependencies, they often suffer from scalability and efficiency challenges due to high computational overhead. In this paper, we propose EAGLE, a lightweight framework that integrates short-term temporal recency and long-term global structural patterns. EAGLE consists of a time-aware module that aggregates information from a node's most recent neighbors to reflect its immediate preferences, and a structure-aware module that leverages temporal personalized PageRank to capture the influence of globally important nodes. To balance these attributes, EAGLE employs an adaptive weighting mechanism to dynamically adjust their contributions based on data characteristics. Also, EAGLE eliminates the need for complex multi-hop message passing or memory-intensive mechanisms, enabling significant improvements in efficiency. Extensive experiments on seven real-world temporal graphs demonstrate that EAGLE consistently achieves superior performance against state-of-the-art T-GNNs in both effectiveness and efficiency, delivering more than a 50x speedup over effective transformer-based T-GNNs.
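The structure-aware module's core primitive, personalized PageRank, can be sketched as a power iteration with restarts at the source node; the toy graph is a static stand-in for a temporal snapshot, and EAGLE's temporal weighting is omitted.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power iteration for personalized PageRank: a random walk that restarts at
    `source` with probability alpha, scoring globally important nodes."""
    nodes = list(adj)
    rank = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(iters):
        nxt = {v: alpha * (1.0 if v == source else 0.0) for v in nodes}
        for v in nodes:
            out = adj[v]
            if not out:
                continue
            share = (1 - alpha) * rank[v] / len(out)
            for w in out:
                nxt[w] += share
        rank = nxt
    return rank

# Toy graph snapshot (assumed): node "hub" is globally important.
adj = {
    "u": ["hub"],
    "v": ["hub"],
    "w": ["hub", "u"],
    "hub": ["u", "v", "w"],
}
scores = personalized_pagerank(adj, source="u")
print(max(scores, key=scores.get))
```

In EAGLE these structural scores would be blended, via the adaptive weighting mechanism, with the recency signal from the time-aware module.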

Updated: 2025-07-18 11:29:15

标题: 当速度遇见准确性:一种高效有效的用于时序链接预测的图模型

摘要: 动态图中的时间链接预测是一个关键任务,应用于社交网络、推荐系统和电子商务平台等各种领域。现有的时间图神经网络(T-GNNs)通过利用复杂的架构来建模时间和结构依赖关系,取得了显著的成功,但由于高计算开销而常常面临可扩展性和效率挑战。在本文中,我们提出了EAGLE,一个轻量级框架,集成了短期时间近因性和长期全局结构模式。EAGLE包括一个时间感知模块,从节点的最近邻居聚合信息以反映其即时偏好,以及一个结构感知模块,利用时间个性化PageRank来捕捉全局重要节点的影响。为了平衡这些属性,EAGLE采用自适应加权机制,基于数据特征动态调整它们的贡献。此外,EAGLE消除了复杂的多跳消息传递或内存密集型机制的需求,显著提高了效率。对七个真实世界的时间图进行了广泛实验,结果表明EAGLE在效果和效率方面始终优于最先进的T-GNNs,相比有效的基于Transformer的T-GNN实现了超过50倍的加速。

更新时间: 2025-07-18 11:29:15

领域: cs.AI

下载: http://arxiv.org/abs/2507.13825v1

Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models

Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields only marginal additional gains. Notably, the chain-of-thought prompting strategy -- applied even to 'reasoning' models -- achieves the best results, with up to 96.3\% branch coverage, a 57\% average mutation score, and a near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage, while remaining among the top models in compilation success rate. All the code and resulting test suites are publicly available at https://github.com/peetery/LLM-analysis.

Updated: 2025-07-18 11:23:17

标题: 现代通用大型语言模型在自动生成单元测试时的代码上下文和提示策略的影响

摘要: 生成式人工智能在软件工程领域越来越受到关注,而测试仍然是一种不可或缺的可靠性机制。根据广泛采用的测试金字塔,单元测试构成了大多数测试用例,通常是基于模式的,需要最少的领域专业知识。在软件工程师的监督下自动生成这些测试可以显著提高软件生命周期开发阶段的生产力。 本文研究了代码上下文和提示策略对各种大型语言模型(LLMs)生成的单元测试的质量和充分性的影响。结果表明,包括文档字符串显著提高了代码的充分性,而进一步扩展上下文到完整实现则带来了较小的收益。值得注意的是,即使对于“推理”模型,链式思考提示策略也取得了最佳结果,分支覆盖率高达96.3%,平均变异分数为57%,几乎完美的编译成功率。在评估的模型中,M5(Gemini 2.5 Pro)在变异分数和分支覆盖率方面表现出色,在编译成功率方面仍然处于前列。 所有的代码和生成的测试套件都可以在https://github.com/peetery/LLM-analysis上公开获取。

更新时间: 2025-07-18 11:23:17

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.14256v1

RAG-based Architectures for Drug Side Effect Retrieval in LLMs

Drug side effects are a major global health concern, necessitating advanced methods for their accurate detection and analysis. While Large Language Models (LLMs) offer promising conversational interfaces, their inherent limitations, including reliance on black-box training data, susceptibility to hallucinations, and lack of domain-specific knowledge, hinder their reliability in specialized fields like pharmacovigilance. To address this gap, we propose two architectures: Retrieval-Augmented Generation (RAG) and GraphRAG, which integrate comprehensive drug side effect knowledge into a Llama 3 8B language model. Through extensive evaluations on 19,520 drug side effect associations (covering 976 drugs and 3,851 side effect terms), our results demonstrate that GraphRAG achieves near-perfect accuracy in drug side effect retrieval. This framework offers a highly accurate and scalable solution, signifying a significant advancement in leveraging LLMs for critical pharmacovigilance applications.
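The retrieval half of a RAG pipeline can be sketched with a toy bag-of-words retriever that picks the best-matching knowledge-base entry and splices it into the prompt; the side-effect entries below are illustrative, not data from the paper, and a real system would use a dense encoder, a graph store (for GraphRAG), and an LLM call.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Miniature side-effect knowledge base (illustrative entries, not real data).
kb = [
    "aspirin side effects include stomach bleeding and nausea",
    "ibuprofen side effects include heartburn and dizziness",
    "metformin side effects include diarrhea and vitamin B12 deficiency",
]
query = "what are the side effects of ibuprofen"
ranked = sorted(kb, key=lambda doc: cosine(embed(query), embed(doc)), reverse=True)
context = ranked[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(context)
```

Grounding the generation step on retrieved context like `prompt` is what lets the Llama 3 8B model described above answer from curated drug knowledge rather than parametric memory alone.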

Updated: 2025-07-18 11:20:52

标题: RAG 架构用于在LLMs中检索药物副作用

摘要: 药物副作用是一个重要的全球健康问题,需要先进的方法来准确检测和分析。尽管大型语言模型(LLMs)提供了有前途的对话界面,但它们固有的局限性,包括依赖于黑盒训练数据、易受幻觉影响和缺乏领域特定知识,阻碍了它们在药物警戒领域等专业领域的可靠性。为了填补这一空白,我们提出了两种架构:检索增强生成(RAG)和GraphRAG,将全面的药物副作用知识集成到Llama 3 8B语言模型中。通过对19,520个药物副作用关联(涵盖976种药物和3,851个副作用术语)进行广泛评估,我们的结果表明,GraphRAG在药物副作用检索方面实现了接近完美的准确性。这一框架提供了一个高度准确和可扩展的解决方案,标志着在利用LLMs进行关键药物警戒应用方面取得了重大进展。

更新时间: 2025-07-18 11:20:52

领域: cs.IR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.13822v1

Demographic-aware fine-grained classification of pediatric wrist fractures

Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. However, diagnosing these conditions is time-consuming and requires specialized expertise. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task, aiming to identify subtle X-ray pathologies that conventional CNNs overlook. Secondly, we enhance network performance by fusing patient metadata with X-ray images. Thirdly, rather than pre-training on a coarse-grained dataset like ImageNet, we utilize weights trained on a fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies. Our results show that a fine-grained strategy and metadata integration improve diagnostic accuracy by 2% with a limited dataset and by over 10% with a larger fracture-focused dataset.

Updated: 2025-07-18 11:16:14

标题: 儿童腕部骨折的人口统计学细粒度分类

摘要: 手腕病变经常被观察到,特别是在构成大多数骨折病例的儿童中。然而,诊断这些疾病是耗时的,并需要专门的专业知识。计算机视觉提供了一个有前途的途径,取决于广泛数据集的可用性,这在医学影像领域是一个显著的挑战。因此,仅依赖于一种模态,如图像,特别是在数据类型多样且丰富的时代,这种方法是不够的。在这项研究中,我们采用了一个多方面的方法来解决利用极其有限的数据集识别手腕病变的挑战。首先,我们将这个问题作为一个细粒度识别任务,旨在识别传统CNN忽略的微妙的X射线病变。其次,我们通过将患者元数据与X射线图像融合来提高网络性能。第三,我们并没有在像ImageNet这样的粗粒度数据集上进行预训练,而是利用在一个细粒度数据集上训练的权重。虽然元数据集成已经在其他医学领域中使用,但这是手腕病变的一个新颖应用。我们的结果表明,细粒度策略和元数据集成可以在有限数据集上将诊断准确性提高2%,在更大的骨折重点数据集上提高超过10%。

更新时间: 2025-07-18 11:16:14

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.12964v2

Team of One: Cracking Complex Video QA with Model Synergy

We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios, as benchmarked on the CVRR-ES dataset. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. To address these challenges, we introduce a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) via structured chains of thought, each tailored to distinct reasoning pathways. An external Large Language Model (LLM) serves as an evaluator and integrator, selecting and fusing the most reliable responses. Extensive experiments demonstrate that our method significantly outperforms existing baselines across all evaluation metrics, showcasing superior generalization and robustness. Our approach offers a lightweight, extensible strategy for advancing multimodal reasoning without requiring model retraining, setting a strong foundation for future Video-LMM development.

Updated: 2025-07-18 11:12:44

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.13820v1

AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results

The deployment of large language models (LLMs) on extended reality (XR) devices has great potential to advance the field of human-AI interaction. In the case of direct, on-device model inference, selecting the appropriate model and device for specific tasks remains challenging. In this paper, we present AIvaluateXR, a comprehensive evaluation framework for benchmarking LLMs running on XR devices. To demonstrate the framework, we deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. Our experimental setup measures four key metrics: performance consistency, processing speed, memory usage, and battery consumption. For each of the 68 model-device pairs, we assess performance under varying string lengths, batch sizes, and thread counts, analyzing the trade-offs for real-time XR applications. We propose a unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives. Additionally, we compare the efficiency of on-device LLMs with client-server and cloud-based setups, and evaluate their accuracy on two interactive tasks. We believe our findings offer valuable insight to guide future optimization efforts for LLM deployment on XR devices. Our evaluation method can be used as standard groundwork for further research and development in this emerging field. The source code and supplementary materials are available at: www.nanovis.org/AIvaluateXR.html
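The Pareto-optimal selection can be illustrated with a small dominance check over objective tuples. The paper uses three objectives; this generic sketch works for any number, and the sample quality/speed values in the test are made up.

```python
def pareto_front(points):
    # A point is Pareto-optimal if no other point is >= in every
    # objective (all objectives are to be maximized) and strictly
    # better in at least one, i.e. no other point dominates it.
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] >= p[k] for k in range(len(p))) and q != p
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

Among the non-dominated device-model pairs, a final choice can then weight quality against speed for the target XR application.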

Updated: 2025-07-18 11:10:07

Categories: cs.DC,cs.AI,cs.GR,cs.HC

Download: http://arxiv.org/abs/2502.15761v2

Exploring Graph Representations of Logical Forms for Language Modeling

We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs (BERT) pretrained on the same data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.

Updated: 2025-07-18 11:03:24

Categories: cs.CL,cs.AI,I.2.7

Download: http://arxiv.org/abs/2505.14523v2

DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs

Large language models (LLMs) have recently revolutionized language processing tasks but have also brought ethical and legal issues. LLMs have a tendency to memorize potentially private or copyrighted information present in the training data, which might then be delivered to end users at inference time. When this happens, a naive solution is to retrain the model from scratch after excluding the undesired data. Although this guarantees that the target data have been forgotten, it is also prohibitively expensive for LLMs. Approximate unlearning offers a more efficient alternative, as it consists of ex post modifications of the trained model itself to prevent undesirable results, but it lacks forgetting guarantees because it relies solely on empirical evidence. In this work, we present DP2Unlearning, a novel LLM unlearning framework that offers formal forgetting guarantees at a significantly lower cost than retraining from scratch on the data to be retained. DP2Unlearning involves training LLMs on textual data protected using ε-differential privacy (DP), which later enables efficient unlearning with the guarantees against disclosure associated with the chosen ε. Our experiments demonstrate that DP2Unlearning achieves similar model performance post-unlearning, compared to an LLM retraining from scratch on retained data -- the gold standard exact unlearning -- but at approximately half the unlearning cost. In addition, with a reasonable computational cost, it outperforms approximate unlearning methods at both preserving the utility of the model post-unlearning and effectively forgetting the targeted information.
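The DP side can be sketched with the standard Gaussian-mechanism step used in DP-SGD: clip per-example gradients, average, then add calibrated noise. The clip norm and noise multiplier below are placeholders, and the paper's exact training recipe may differ.

```python
import numpy as np

def dp_noisy_gradient(per_example_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    # Clip each per-example gradient to bound any single example's influence.
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    mean = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clipping bound and batch size.
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped), size=mean.shape)
    return mean + noise
```

Training under such a mechanism bounds what the model can memorize about any single record, which is what later allows unlearning with formal disclosure guarantees tied to ε.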

Updated: 2025-07-18 10:56:13

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.13774v2

Consistency of Responses and Continuations Generated by Large Language Models on Social Media

Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using two open-source models: Gemma and Llama. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic similarity between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: Gemma shows a tendency toward negative emotion amplification, particularly anger, while maintaining certain positive emotions like optimism. Llama demonstrates superior emotional preservation across a broader spectrum of affects. Both models systematically generate responses with attenuated emotional intensity compared to human-authored content and show a bias toward positive emotions in response tasks. Additionally, both models maintain strong semantic similarity with original texts, though performance varies between continuation and response tasks. These findings provide insights into LLMs' emotional and semantic processing capabilities, with implications for their deployment in social media contexts and human-AI interaction design.

Updated: 2025-07-18 10:53:52

Categories: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2501.08102v3

Quantum Shadows: The Dining Information Brokers

This article introduces the innovative Quantum Dining Information Brokers Problem, presenting a novel entanglement-based quantum protocol to address it. The scenario involves $n$ information brokers, all located in distinct geographical regions, engaging in a metaphorical virtual dinner. The objective is for each broker to share a unique piece of information with all others simultaneously. Unlike previous approaches, this protocol enables a fully parallel, single-step communication exchange among all brokers, regardless of their physical locations. A key feature of this protocol is its ability to ensure both the anonymity and privacy of all participants are preserved, meaning no broker can discern the identity of the sender behind any received information. At its core, the Quantum Dining Information Brokers Problem serves as a conceptual framework for achieving anonymous, untraceable, and massively parallel information exchange in a distributed system. The proposed protocol introduces three significant advancements. First, while quantum protocols for one-to-many simultaneous information transmission have been developed, this is, to the best of our knowledge, one of the first quantum protocols to facilitate many-to-many simultaneous information exchange. Second, it guarantees complete anonymity and untraceability for all senders, a critical improvement over sequential applications of one-to-many protocols, which fail to ensure such robust anonymity. Third, leveraging quantum entanglement, the protocol operates in a fully distributed manner, accommodating brokers in diverse spatial locations. This approach marks a substantial advancement in secure, scalable, and anonymous communication, with potential applications in distributed environments where privacy and parallelism are paramount.

Updated: 2025-07-18 10:41:27

Categories: quant-ph,cs.CR

Download: http://arxiv.org/abs/2507.13810v1

Food safety trends across Europe: insights from the 392-million-entry CompreHensive European Food Safety (CHEFS) database

In the European Union, official food safety monitoring data collected by member states are submitted to the European Food Safety Authority (EFSA) and published on Zenodo. This data includes 392 million analytical results derived from over 15.2 million samples covering more than 4,000 different types of food products, offering great opportunities for artificial intelligence to analyze trends, predict hazards, and support early warning systems. However, the current format with data distributed across approximately 1000 files totaling several hundred gigabytes hinders accessibility and analysis. To address this, we introduce the CompreHensive European Food Safety (CHEFS) database, which consolidates EFSA monitoring data on pesticide residues, veterinary medicinal product residues, and chemical contaminants into a unified and structured dataset. We describe the creation and structure of the CHEFS database and demonstrate its potential by analyzing trends in European food safety monitoring data from 2000 to 2024. Our analyses explore changes in monitoring activities, the most frequently tested products, which products were most often non-compliant and which contaminants were most often found, and differences across countries. These findings highlight the CHEFS database as both a centralized data source and a strategic tool for guiding food safety policy, research, and regulation.

Updated: 2025-07-18 10:29:30

Categories: cs.CY,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.13802v1

One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion

In recent years, visual 3D Semantic Scene Completion (SSC) has emerged as a critical perception task for autonomous driving due to its ability to infer complete 3D scene layouts and semantics from single 2D images. However, in real-world traffic scenarios, a significant portion of the scene remains occluded or outside the camera's field of view -- a fundamental challenge that existing monocular SSC methods fail to address adequately. To overcome these limitations, we propose Creating the Future SSC (CF-SSC), a novel temporal SSC framework that leverages pseudo-future frame prediction to expand the model's effective perceptual range. Our approach combines poses and depths to establish accurate 3D correspondences, enabling geometrically-consistent fusion of past, present, and predicted future frames in 3D space. Unlike conventional methods that rely on simple feature stacking, our 3D-aware architecture achieves more robust scene completion by explicitly modeling spatial-temporal relationships. Comprehensive experiments on SemanticKITTI and SSCBench-KITTI-360 benchmarks demonstrate state-of-the-art performance, validating the effectiveness of our approach, highlighting our method's ability to improve occlusion reasoning and 3D scene completion accuracy.

Updated: 2025-07-18 10:24:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.13801v1

Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian

Software engineers spend a significant amount of time reading code during the software development process, especially in the age of large language models (LLMs) that can automatically generate code. However, little is known about the readability of the LLM-generated code and whether it is still important from practitioners' perspectives in this new era. In this paper, we conduct a survey to explore the practitioners' perspectives on code readability in the age of LLMs and investigate the readability of our LLM-based software development agents framework, HULA, by comparing its generated code with human-written code in real-world scenarios. Overall, the findings underscore that (1) readability remains a critical aspect of software development; (2) the readability of our LLM-generated code is comparable to human-written code, fostering the establishment of appropriate trust and driving the broad adoption of our LLM-powered software development platform.

Updated: 2025-07-18 10:09:05

Categories: cs.SE,cs.AI,cs.CL

Download: http://arxiv.org/abs/2501.11264v3

Localized FNO for Spatiotemporal Hemodynamic Upsampling in Aneurysm MRI

Hemodynamic analysis is essential for predicting aneurysm rupture and guiding treatment. While magnetic resonance flow imaging enables time-resolved volumetric blood velocity measurements, its low spatiotemporal resolution and signal-to-noise ratio limit its diagnostic utility. To address this, we propose the Localized Fourier Neural Operator (LoFNO), a novel 3D architecture that enhances both spatial and temporal resolution with the ability to predict wall shear stress (WSS) directly from clinical imaging data. LoFNO integrates Laplacian eigenvectors as geometric priors for improved structural awareness on irregular, unseen geometries and employs an Enhanced Deep Super-Resolution Network (EDSR) layer for robust upsampling. By combining geometric priors with neural operator frameworks, LoFNO de-noises and spatiotemporally upsamples flow data, achieving superior velocity and WSS predictions compared to interpolation and alternative deep learning methods, enabling more precise cerebrovascular diagnostics.

Updated: 2025-07-18 10:00:38

Categories: cs.CV,cs.AI,physics.comp-ph

Download: http://arxiv.org/abs/2507.13789v1

The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence, Alignment, and Reasoning Architecture

Intelligence, whether biological, artificial, or collective, requires structural coherence across recursive reasoning processes to scale effectively. As complex systems grow, coherence becomes fragile unless a higher-order structure ensures semantic consistency. This paper introduces the Recursive Coherence Principle (RCP): a foundational constraint stating that for any reasoning system of order N, composed of systems operating over conceptual spaces of order N-1, semantic coherence is preserved only by a recursively evaluable generalization operator that spans and aligns those lower-order conceptual spaces. Crucially, this coherence enables structural alignment. Without recursive coherence, no system can reliably preserve goals, meanings, or reasoning consistency at scale. We formally define the Functional Model of Intelligence (FMI) as the only known operator capable of satisfying the RCP at any scale. The FMI is a minimal, composable architecture with internal functions (evaluation, modeling, adaptation, stability, decomposition, bridging) and external functions (storage, recall, System 1 and System 2 reasoning) vital for preserving semantic structure across inference and coordination layers. We prove that any system lacking the FMI will experience recursive coherence breakdown as it scales, arguing that common AI issues like misalignment, hallucination, and instability are symptoms of this structural coherence loss. Unlike other foundational principles, RCP uniquely captures the internal, recursive dynamics needed for coherent, alignable intelligence, modeling semantic coherence under recursion. This work significantly impacts AI alignment, advocating a shift from behavioral constraints to structural coherence, and offers a pathway for safely generalizable, robustly coherent AI at scale.

Updated: 2025-07-18 09:44:01

Categories: cs.AI

Download: http://arxiv.org/abs/2507.15880v1

Entropy Loss: An Interpretability Amplifier of 3D Object Detection Network for Intelligent Driving

With the increasing complexity of the traffic environment, the significance of safety perception in intelligent driving is intensifying. Traditional methods in the field of intelligent driving perception rely on deep learning, which suffers from limited interpretability, often described as a "black box." This paper introduces a novel type of loss function, termed "Entropy Loss," along with an innovative training strategy. Entropy Loss is formulated based on the functionality of feature compression networks within the perception model. Drawing inspiration from communication systems, the information transmission process in a feature compression network is expected to demonstrate steady changes in information volume and a continuous decrease in information entropy. By modeling network layer outputs as continuous random variables, we construct a probabilistic model that quantifies changes in information volume. Entropy Loss is then derived based on these expectations, guiding the update of network parameters to enhance network interpretability. Our experiments indicate that the Entropy Loss training strategy accelerates the training process. Utilizing the same 60 training epochs, the accuracy of 3D object detection models using Entropy Loss on the KITTI test set improved by up to 4.47% compared to models without Entropy Loss, underscoring the method's efficacy. The implementation code is available at https://github.com/yhbcode000/Eloss-Interpretability.
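One concrete way to realize such a penalty, under a Gaussian model of each layer's outputs (an illustrative assumption; the paper's probabilistic model may differ), is to charge any increase in differential entropy between consecutive layers:

```python
import math
import numpy as np

def gaussian_entropy(x):
    # Differential entropy of a layer's outputs under a Gaussian model:
    # H = 0.5 * ln(2 * pi * e * var). A simple illustrative choice.
    var = np.var(x) + 1e-12
    return 0.5 * math.log(2 * math.pi * math.e * var)

def entropy_loss(layer_outputs):
    # Penalize any increase in entropy from one layer to the next,
    # encouraging a continuous decrease in information entropy.
    penalties = [
        max(0.0, gaussian_entropy(b) - gaussian_entropy(a))
        for a, b in zip(layer_outputs, layer_outputs[1:])
    ]
    return sum(penalties)
```

Added to the task loss, such a term pushes the feature compression network toward layer-by-layer entropy reduction, which is the behavior the paper argues makes the network more interpretable.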

Updated: 2025-07-18 09:39:59

Categories: cs.CV,cs.AI,cs.IT,math.IT

Download: http://arxiv.org/abs/2409.00839v2

From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation

The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We release our dataset publicly.

Updated: 2025-07-18 09:31:19

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.08924v2

Learning Spectral Diffusion Prior for Hyperspectral Image Reconstruction

Hyperspectral image (HSI) reconstruction aims to recover 3D HSI from its degraded 2D measurements. Recently, great progress has been made in deep learning-based methods; however, these methods often struggle to accurately capture high-frequency details of the HSI. To address this issue, this paper proposes a Spectral Diffusion Prior (SDP) that is implicitly learned from hyperspectral images using a diffusion model. Leveraging the powerful ability of the diffusion model to reconstruct details, this learned prior can significantly improve the performance when injected into the HSI model. To further improve the effectiveness of the learned prior, we also propose the Spectral Prior Injector Module (SPIM) to dynamically guide the model to recover the HSI details. We evaluate our method on two representative HSI methods: MST and BISRNet. Experimental results show that our method outperforms existing networks by about 0.5 dB, effectively improving the performance of HSI reconstruction.

Updated: 2025-07-18 09:27:11

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.13769v1

From Extraction to Synthesis: Entangled Heuristics for Agent-Augmented Strategic Reasoning

We present a hybrid architecture for agent-augmented strategic reasoning, combining heuristic extraction, semantic activation, and compositional synthesis. Drawing on sources ranging from classical military theory to contemporary corporate strategy, our model activates and composes multiple heuristics through a process of semantic interdependence inspired by research in quantum cognition. Unlike traditional decision engines that select the best rule, our system fuses conflicting heuristics into coherent and context-sensitive narratives, guided by semantic interaction modeling and rhetorical framing. We demonstrate the framework via a Meta vs. FTC case study, with preliminary validation through semantic metrics. Limitations and extensions (e.g., dynamic interference tuning) are discussed.

Updated: 2025-07-18 09:26:37

Categories: cs.AI,I.2.7

Download: http://arxiv.org/abs/2507.13768v1

Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models

Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts intractable posterior sampling in Diffusion-DPO with the deterministic inversion from winning and losing samples to noise and thus derive a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO
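The preference-loss backbone being reformulated is the standard DPO objective on a (winner, loser) pair. The sketch below shows only that shared shape, with the diffusion-specific log-probabilities (which Inversion-DPO estimates via DDIM inversion rather than reward modeling) abstracted into plain numbers:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit-reward margin: how much more the policy prefers the
    # winning sample over the losing one, relative to the frozen
    # reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: small when the winner is favored.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to preferred samples; the paper's contribution is making the log-probability terms tractable for diffusion models via deterministic inversion.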

Updated: 2025-07-18 09:25:36

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.11554v2

OntView: What you See is What you Meant

In the field of knowledge management and computer science, ontologies provide a structured framework for modeling domain-specific knowledge by defining concepts and their relationships. However, the lack of tools that provide effective visualization is still a significant challenge. While numerous ontology editors and viewers exist, most of them fail to graphically represent ontology structures in a meaningful and non-overwhelming way, limiting users' ability to comprehend dependencies and properties within large ontological frameworks. In this paper, we present OntView, an ontology viewer that is designed to provide users with an intuitive visual representation of ontology concepts and their formal definitions through a user-friendly interface. Building on the use of a DL reasoner, OntView follows a "What you see is what you meant" paradigm, showing the actual inferred knowledge. One key aspect for this is its ability to visualize General Concept Inclusions (GCI), a feature absent in existing visualization tools. Moreover, to avoid a possible information overload, OntView also offers different ways to show a simplified view of the ontology by: 1) creating ontology summaries by assessing the importance of the concepts (according to different available algorithms), 2) focusing the visualization on the existing TBox elements between two given classes and 3) allowing to hide/show different branches in a dynamic way without losing the semantics. OntView has been released with an open-source license for the whole community.

Updated: 2025-07-18 09:06:49

Categories: cs.AI

Download: http://arxiv.org/abs/2507.13759v1

A Simple Baseline for Stable and Plastic Neural Networks

Continual learning in computer vision requires that models adapt to a continuous stream of tasks without forgetting prior knowledge, yet existing approaches often tip the balance heavily toward either plasticity or stability. We introduce RDBP, a simple, low-overhead baseline that unites two complementary mechanisms: ReLUDown, a lightweight activation modification that preserves feature sensitivity while preventing neuron dormancy, and Decreasing Backpropagation, a biologically inspired gradient-scheduling scheme that progressively shields early layers from catastrophic updates. Evaluated on the Continual ImageNet benchmark, RDBP matches or exceeds the plasticity and stability of state-of-the-art methods while reducing computational cost. RDBP thus provides both a practical solution for real-world continual learning and a clear benchmark against which future continual learning strategies can be measured.
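The Decreasing Backpropagation idea can be sketched as per-layer gradient multipliers that shrink toward the input layers; the linear schedule and floor value below are assumptions for illustration, not the paper's exact schedule:

```python
def decreasing_bp_scales(num_layers, floor=0.1):
    # Layer 0 is closest to the input and gets the smallest multiplier,
    # progressively shielding early layers from large (catastrophic)
    # updates while later layers keep full plasticity.
    if num_layers == 1:
        return [1.0]
    return [floor + (1.0 - floor) * i / (num_layers - 1)
            for i in range(num_layers)]
```

During training, layer i's gradient would be multiplied by `scales[i]` before the optimizer step, so early features learned on previous tasks change slowly.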

Updated: 2025-07-18 08:54:22

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.10637v2

Search-Optimized Quantization in Biomedical Ontology Alignment

In the fast-moving world of AI, as organizations and researchers develop more advanced models, they face challenges due to their sheer size and computational demands. Deploying such models on edge devices or in resource-constrained environments adds further challenges related to energy consumption, memory usage and latency. To address these challenges, emerging trends are shaping the future of efficient model optimization techniques. From this premise, by employing supervised state-of-the-art transformer-based models, this research introduces a systematic method for ontology alignment, grounded in cosine-based semantic similarity between a biomedical layman vocabulary and the Unified Medical Language System (UMLS) Metathesaurus. It leverages Microsoft Olive to search for target optimizations among different Execution Providers (EPs) using the ONNX Runtime backend, followed by an assembled process of dynamic quantization employing Intel Neural Compressor and IPEX (Intel Extension for PyTorch). Through our optimization process, we conduct extensive assessments on the two tasks from the DEFT 2020 Evaluation Campaign, achieving a new state-of-the-art in both. We retain performance metrics intact, while attaining an average inference speed-up of 20x and reducing memory usage by approximately 70%.
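
The core alignment step is cosine similarity between embedded terms: each layman term is mapped to the UMLS concept whose embedding is closest. The toy 3-d vectors and names below are made up; in the paper, embeddings come from a transformer model:

```python
# Cosine-based term-to-concept alignment, illustrative vectors only.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def align(term_vecs, concept_vecs):
    """Map each layman term to the concept with highest cosine similarity."""
    return {
        term: max(concept_vecs, key=lambda c: cosine(v, concept_vecs[c]))
        for term, v in term_vecs.items()
    }

terms = {"heart attack": [0.9, 0.1, 0.0], "flu": [0.0, 0.8, 0.2]}
concepts = {"Myocardial infarction": [1.0, 0.0, 0.0],
            "Influenza": [0.1, 0.9, 0.1]}
print(align(terms, concepts))
```

Quantization and EP search then speed up the embedding model itself, leaving this matching logic unchanged.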

Updated: 2025-07-18 08:42:20

Categories: cs.LG,cs.AI,math.OC

Download: http://arxiv.org/abs/2507.13742v1

SamGoG: A Sampling-Based Graph-of-Graphs Framework for Imbalanced Graph Classification

Graph Neural Networks (GNNs) have shown remarkable success in graph classification tasks by capturing both structural and feature-based representations. However, real-world graphs often exhibit two critical forms of imbalance: class imbalance and graph size imbalance. These imbalances can bias the learning process and degrade model performance. Existing methods typically address only one type of imbalance or incur high computational costs. In this work, we propose SamGoG, a sampling-based Graph-of-Graphs (GoG) learning framework that effectively mitigates both class and graph size imbalance. SamGoG constructs multiple GoGs through an efficient importance-based sampling mechanism and trains on them sequentially. This sampling mechanism incorporates the learnable pairwise similarity and adaptive GoG node degree to enhance edge homophily, thus improving downstream model quality. SamGoG can seamlessly integrate with various downstream GNNs, enabling their efficient adaptation for graph classification tasks. Extensive experiments on benchmark datasets demonstrate that SamGoG achieves state-of-the-art performance with up to a 15.66% accuracy improvement with 6.7$\times$ training acceleration.
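
The importance-based sampling idea can be sketched as weights that grow for rare classes and for graph sizes far from the median, then sampled proportionally. SamGoG's actual mechanism also uses learnable pairwise similarity; this static weighting is only illustrative:

```python
# Static importance weights addressing class and size imbalance (sketch).
import random

def importance_weights(labels, sizes):
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    med = sorted(sizes)[len(sizes) // 2]
    # Rare classes get 1/count; unusual sizes get a deviation bonus.
    return [
        (1.0 / counts[y]) * (1.0 + abs(s - med) / med)
        for y, s in zip(labels, sizes)
    ]

labels = [0, 0, 0, 1]          # class 1 is rare
sizes = [10, 12, 11, 40]       # graph 3 is unusually large
w = importance_weights(labels, sizes)
assert w[3] == max(w)          # the rare, oversized graph dominates

random.seed(0)
picked = random.choices(range(4), weights=w, k=2)  # sample GoG members
print(picked)
```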

Updated: 2025-07-18 08:41:58

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.13741v1

Eye-tracked Virtual Reality: A Comprehensive Survey on Methods and Privacy Challenges

The latest developments in computer hardware, sensor technologies, and artificial intelligence can make virtual reality (VR) and virtual spaces an important part of human everyday life. Eye tracking offers not only a hands-free way of interaction but also the possibility of a deeper understanding of human visual attention and cognitive processes in VR. Despite these possibilities, eye-tracking data also reveals users' privacy-sensitive attributes when combined with the information about the presented stimulus. To address all these possibilities and potential privacy issues, in this survey, we first cover major works in eye tracking, VR, and privacy areas between 2012 and 2022. While eye tracking in the VR part covers the complete pipeline of eye-tracking methodology from pupil detection and gaze estimation to offline use of the data and analyses, as for privacy and security, we focus on eye-based authentication as well as computational methods to preserve the privacy of individuals and their eye-tracking data in VR. Later, considering all of these, we draw three main directions for the research community by focusing on privacy challenges. In summary, this survey provides an extensive literature review of the utmost possibilities with eye tracking in VR and the privacy implications of those possibilities.

Updated: 2025-07-18 08:40:04

Categories: cs.HC,cs.AI,cs.CR,cs.GR,cs.LG

Download: http://arxiv.org/abs/2305.14080v2

From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios

Ensuring the safety of autonomous vehicles requires virtual scenario-based testing, which depends on the robust evaluation and generation of safety-critical scenarios. So far, researchers have used scenario-based testing frameworks that rely heavily on handcrafted scenarios as safety metrics. To reduce the effort of human interpretation and overcome the limited scalability of these approaches, we combine Large Language Models (LLMs) with structured scenario parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios. We introduce Cartesian and Ego-centric prompt strategies for scenario evaluation, and an adversarial generation module that modifies trajectories of risk-inducing vehicles (ego-attackers) to create critical scenarios. We validate our approach using a 2D simulation framework and multiple pre-trained LLMs. The results show that the evaluation module effectively detects collision scenarios and infers scenario safety. Meanwhile, the new generation module identifies high-risk agents and synthesizes realistic, safety-critical scenarios. We conclude that an LLM equipped with domain-informed prompting techniques can effectively evaluate and generate safety-critical driving scenarios, reducing dependence on handcrafted metrics. We release our open-source code and scenarios at: https://github.com/TUM-AVS/From-Words-to-Collisions.

Updated: 2025-07-18 08:39:33

Categories: cs.AI,cs.CL,cs.RO

Download: http://arxiv.org/abs/2502.02145v4

Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning

Few-shot class-incremental learning (FSCIL) is challenging due to extremely limited training data; while aiming to reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model's capabilities benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features to play roles as latent replay with slight support from feature distillation for preventing generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, \emph{mini}ImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.

Updated: 2025-07-18 08:38:07

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.13739v1

DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs

Rich and context-aware activity logs facilitate user behavior analysis and health monitoring, making them a key research focus in ubiquitous computing. The remarkable semantic understanding and generation capabilities of Large Language Models (LLMs) have recently created new opportunities for activity log generation. However, existing methods continue to exhibit notable limitations in terms of accuracy, efficiency, and semantic richness. To address these challenges, we propose DailyLLM. To the best of our knowledge, this is the first log generation and summarization system that comprehensively integrates contextual activity information across four dimensions: location, motion, environment, and physiology, using only sensors commonly available on smartphones and smartwatches. To achieve this, DailyLLM introduces a lightweight LLM-based framework that integrates structured prompting with efficient feature extraction to enable high-level activity understanding. Extensive experiments demonstrate that DailyLLM outperforms state-of-the-art (SOTA) log generation methods and can be efficiently deployed on personal computers and Raspberry Pi. Utilizing only a 1.5B-parameter LLM model, DailyLLM achieves a 17% improvement in log generation BERTScore precision compared to the 70B-parameter SOTA baseline, while delivering nearly 10x faster inference speed.
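
The structured-prompting step can be sketched as serializing extracted features from the four context dimensions into a template for the small LLM. The field names and template below are hypothetical, not DailyLLM's actual prompt format:

```python
# Building a structured prompt from multi-modal sensor features (sketch).

def build_prompt(features):
    sections = ["Summarize the user's activity from these sensor readings:"]
    for dim in ("location", "motion", "environment", "physiology"):
        vals = ", ".join(f"{k}={v}" for k, v in features[dim].items())
        sections.append(f"- {dim}: {vals}")
    return "\n".join(sections)

prompt = build_prompt({
    "location": {"place": "park", "speed_kmh": 7.2},
    "motion": {"steps_per_min": 160},
    "environment": {"noise_db": 55, "light_lux": 9000},
    "physiology": {"heart_rate": 138},
})
print(prompt)  # one line per dimension, ready to send to a small LLM
```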

Updated: 2025-07-18 08:33:30

Categories: cs.AI,cs.CL,cs.HC,cs.MM

Download: http://arxiv.org/abs/2507.13737v1

On the Transfer of Knowledge in Quantum Algorithms

Quantum computing is poised to transform computational paradigms across science and industry. As the field evolves, it can benefit from established classical methodologies, including promising paradigms such as Transfer of Knowledge (ToK). This work serves as a brief, self-contained reference for ToK, unifying its core principles under a single formal framework. We introduce a joint notation that consolidates and extends prior work in Transfer Learning and Transfer Optimization, bridging traditionally separate research lines and enabling a common language for knowledge reuse. Building on this foundation, we classify existing ToK strategies and principles into a structured taxonomy that helps researchers position their methods within a broader conceptual map. We then extend key transfer protocols to quantum computing, introducing two novel use cases (reverse annealing and multitasking QAOA) alongside a sequential VQE approach that supports and validates prior findings. These examples highlight ToK's potential to improve performance and generalization in quantum algorithms. Finally, we outline challenges and opportunities for integrating ToK into quantum computing, emphasizing its role in reducing resource demands and accelerating problem-solving. This work lays the groundwork for future synergies between classical and quantum computing through a shared, transferable knowledge framework.

Updated: 2025-07-18 08:26:25

Categories: quant-ph,cs.AI

Download: http://arxiv.org/abs/2501.14120v2

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.
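
The recursive pipeline above can be pictured as: try to prove a goal directly; otherwise decompose it into subgoals, prove those, and stitch the pieces into one trace. Here `decompose` and `prove` are toy stubs standing in for DeepSeek-V3 and the prover model; the structure, not the logic, is the point:

```python
# Skeleton of recursive subgoal decomposition and proof synthesis (sketch).

def solve(theorem, decompose, prove):
    proof = prove(theorem)
    if proof is not None:
        return [(theorem, proof)]          # direct proof found
    trace = []
    for sub in decompose(theorem):         # otherwise, split into subgoals
        trace.extend(solve(sub, decompose, prove))
    # Synthesize subgoal proofs into a combined chain-of-thought step.
    trace.append((theorem, "combine: " + " + ".join(s for s, _ in trace)))
    return trace

# Toy instance: "AB" splits into "A" and "B", which are directly provable.
decompose = lambda t: list(t)
prove = lambda t: f"axiom({t})" if len(t) == 1 else None
trace = solve("AB", decompose, prove)
for step, justification in trace:
    print(step, "<-", justification)
```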

Updated: 2025-07-18 08:20:23

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.21801v2

AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework

Rare, yet critical, scenarios pose a significant challenge in testing and evaluating autonomous driving planners. Relying solely on real-world driving scenes requires collecting massive datasets to capture these scenarios. While automatic generation of traffic scenarios appears promising, data-driven models require extensive training data and often lack fine-grained control over the output. Moreover, generating novel scenarios from scratch can introduce a distributional shift from the original training scenes, which undermines the validity of evaluations, especially for learning-based planners. To sidestep this, recent work proposes to generate challenging scenarios by augmenting original scenarios from the test set. However, this involves manual augmentation of scenarios by domain experts, an approach that cannot meet the demands for scale in the evaluation of self-driving systems. Therefore, this paper introduces a novel LLM-agent based framework for augmenting real-world traffic scenarios using natural language descriptions, addressing the limitations of existing methods. A key innovation is the use of an agentic design, enabling fine-grained control over the output and maintaining high performance even with smaller, cost-effective LLMs. Extensive human expert evaluation demonstrates our framework's ability to accurately adhere to user intent, generating high-quality augmented scenarios comparable to those created manually.

Updated: 2025-07-18 08:20:16

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2507.13729v1

Point of Interest Recommendation: Pitfalls and Viable Solutions

Point of interest (POI) recommendation can play a pivotal role in enriching tourists' experiences by suggesting context-dependent and preference-matching locations and activities, such as restaurants, landmarks, itineraries, and cultural attractions. Unlike some more common recommendation domains (e.g., music and video), POI recommendation is inherently high-stakes: users invest significant time, money, and effort to search, choose, and consume these suggested POIs. Despite the numerous research works in the area, several fundamental issues remain unresolved, hindering the real-world applicability of the proposed approaches. In this paper, we discuss the current status of the POI recommendation problem and the main challenges we have identified. The first contribution of this paper is a critical assessment of the current state of POI recommendation research and the identification of key shortcomings across three main dimensions: datasets, algorithms, and evaluation methodologies. We highlight persistent issues such as the lack of standardized benchmark datasets, flawed assumptions in the problem definition and model design, and inadequate treatment of biases in the user behavior and system performance. The second contribution is a structured research agenda that, starting from the identified issues, introduces important directions for future work related to multistakeholder design, context awareness, data collection, trustworthiness, novel interactions, and real-world evaluation.

Updated: 2025-07-18 08:10:09

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2507.13725v1

Quantum Blockchain Survey: Foundations, Trends, and Gaps

Quantum computing poses fundamental risks to classical blockchain systems by undermining widely used cryptographic primitives. In response, two major research directions have emerged: post-quantum blockchains, which integrate quantum-resistant algorithms, and quantum blockchains, which leverage quantum properties such as entanglement and quantum key distribution. This survey reviews key developments in both areas, analyzing their cryptographic foundations, architectural designs, and implementation challenges. It provides a comparative overview of technical proposals, highlights trade-offs in security, scalability, and deployment, and identifies open research problems across hardware, consensus, and network design. The goal is to offer a structured and comprehensive reference for advancing secure blockchain systems in the quantum era.

Updated: 2025-07-18 08:00:09

Categories: cs.CR,cs.DC,cs.ET,cs.NI,68M10, 81P94, 94A60,C.2.1; E.3; K.6.5

Download: http://arxiv.org/abs/2507.13720v1

To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization

Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness -- the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments reveal our method achieves superior results through improved exploration. Notably, our 7B model improves over 11% on MATH500 and 9.4% on AIME without o1-like CoT.
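
The E/M alternation can be caricatured as: the E-step scores tool-use strategies under the current policy, and the M-step shifts policy mass toward high-reward strategies. The toy below uses a simple reweight-and-renormalize update in place of the paper's off-policy RL; the reward numbers and strategy names are invented for illustration:

```python
# Caricature of the EM loop over tool-use strategies (illustrative only).

def em_tool_choice(rewards, rounds=8):
    """rewards: expected reward per strategy, e.g. {'cot_only': .., ..}."""
    policy = {k: 1.0 / len(rewards) for k in rewards}  # start uniform
    for _ in range(rounds):
        # E-step: score strategies under the current policy (exploration).
        scores = {k: policy[k] * rewards[k] for k in rewards}
        # M-step: renormalize toward high-reward strategies (improvement).
        total = sum(scores.values())
        policy = {k: s / total for k, s in scores.items()}
    return policy

policy = em_tool_choice({"cot_only": 0.4, "cot_plus_code": 0.6})
print(policy)  # mass shifts toward interleaving code with CoT
```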

Updated: 2025-07-18 07:40:22

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2502.00691v4

SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator

Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as "digital assistants, autonomous customer service, and decision-support systems", where their ability to "interact in multi-turn, tool-augmented environments" makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise modeling of safety risks across diverse scenarios. 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset-eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment. We have released the project page at https://auto-safe.github.io/.

Updated: 2025-07-18 07:34:40

Categories: cs.AI,68T07,I.2.6

Download: http://arxiv.org/abs/2505.17735v2

ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model's response to the knowledge base without penalising conversational elements. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, clinical, and non-clinical out-of-domain scenarios. We demonstrate that CF can predict human ratings of faithfulness better than existing definitions for conversational use cases. Furthermore, we show that evaluation using our triad consisting of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. Finally, using nine different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development.

Updated: 2025-07-18 07:31:17

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2501.08208v2

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only their scale but, more importantly, the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called the Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA on the Ego Humanoid Manipulation Benchmark, showing significant improvements over baselines, and ablations confirm the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
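
The retargeting step can be pictured as mapping a predicted human wrist position into the robot's base frame before an IK solver (not shown) produces joint targets. The scale and offset values below are invented for illustration; real retargeting must also handle hand morphology and orientation:

```python
# Toy wrist-position retargeting from human to robot frame (sketch).

def retarget_wrist(human_xyz, offset=(0.0, 0.0, -0.2), scale=0.8):
    """Scale a human wrist position into the robot workspace and shift frames."""
    return tuple(scale * p + o for p, o in zip(human_xyz, offset))

robot_target = retarget_wrist((0.5, 0.25, 1.0))
print(robot_target)  # position handed to the robot's IK solver
```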

Updated: 2025-07-18 07:18:39

Categories: cs.RO,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.12440v3

Binarizing Physics-Inspired GNNs for Combinatorial Optimization

Physics-inspired graph neural networks (PI-GNNs) have been utilized as an efficient unsupervised framework for relaxing combinatorial optimization problems encoded through a specific graph structure and loss, reflecting dependencies between the problem's variables. While the framework has yielded promising results in various combinatorial problems, we show that the performance of PI-GNNs systematically plummets with an increasing density of the combinatorial problem graphs. Our analysis reveals an interesting phase transition in the PI-GNNs' training dynamics, associated with degenerate solutions for the denser problems, highlighting a discrepancy between the relaxed, real-valued model outputs and the binary-valued problem solutions. To address the discrepancy, we propose principled alternatives to the naive strategy used in PI-GNNs by building on insights from fuzzy logic and binarized neural networks. Our experiments demonstrate that the portfolio of proposed methods significantly improves the performance of PI-GNNs in increasingly dense settings.
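
The relax-then-binarize recipe the paper analyzes can be shown on a toy Max-Cut instance: relax binary variables to probabilities, minimize the expected cost by gradient descent, then round. The rounding step is exactly where the relaxed/binary discrepancy arises; the paper's fuzzy-logic and binarized-network remedies are not reproduced in this sketch:

```python
# PI-GNN-style relaxation on toy Max-Cut, with naive binarization (sketch).

def maxcut_loss(p, edges):
    # Expected agreement across edges; minimized when endpoints differ.
    return sum(p[u] * p[v] + (1 - p[u]) * (1 - p[v]) for u, v in edges)

def optimize(edges, n, steps=200, lr=0.1, eps=1e-4):
    p = [0.5 + 0.01 * i for i in range(n)]   # slightly broken symmetry
    for _ in range(steps):
        grad = []
        for i in range(n):  # numeric gradient, fine at toy scale
            q = p.copy(); q[i] += eps
            grad.append((maxcut_loss(q, edges) - maxcut_loss(p, edges)) / eps)
        p = [min(1.0, max(0.0, pi - lr * g)) for pi, g in zip(p, grad)]
    return [round(pi) for pi in p]            # naive binarization

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # 4-cycle: bipartite, cut = 4
x = optimize(edges, 4)
print(x)  # alternating assignment, e.g. [0, 1, 0, 1] or [1, 0, 1, 0]
```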

Updated: 2025-07-18 07:11:50

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.13703v1

Can we ease the Injectivity Bottleneck on Lorentzian Manifolds for Graph Neural Networks?

While hyperbolic GNNs show promise for hierarchical data, they often have limited discriminative power compared to Euclidean counterparts or the WL test, due to non-injective aggregation. To address this expressivity gap, we propose the Lorentzian Graph Isomorphic Network (LGIN), a novel HGNN designed for enhanced discrimination within the Lorentzian model. LGIN introduces a new update rule that preserves the Lorentzian metric while effectively capturing richer structural information. This marks a significant step towards more expressive GNNs on Riemannian manifolds. Extensive evaluations across nine benchmark datasets demonstrate LGIN's superior performance, consistently outperforming or matching state-of-the-art hyperbolic and Euclidean baselines, showcasing its ability to capture complex graph structures. LGIN is the first to adapt principles of powerful, highly discriminative GNN architectures to a Riemannian manifold. The code for our paper can be found at https://github.com/Deceptrax123/LGIN
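For reference, the Lorentz model mentioned above places points on a hyperboloid where the Minkowski inner product is constant; preserving this constraint is what the abstract's update rule must do. A minimal sketch of the standard formulas (not the paper's LGIN update):

```python
import math

def lorentz_inner(u, v):
    # Minkowski (Lorentzian) inner product: <u, v>_L = -u0*v0 + sum_i ui*vi
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def lift_to_hyperboloid(x):
    # Standard lift of a Euclidean point onto the unit-curvature Lorentz
    # model: choose the time coordinate so that <p, p>_L = -1.
    x0 = math.sqrt(1.0 + sum(t * t for t in x))
    return (x0, *x)

p = lift_to_hyperboloid((0.3, -0.2))
q = lift_to_hyperboloid((-0.1, 0.4))
assert abs(lorentz_inner(p, p) + 1.0) < 1e-12  # p lies on the manifold
# Geodesic distance on the hyperboloid: d(p, q) = arccosh(-<p, q>_L)
dist = math.acosh(max(1.0, -lorentz_inner(p, q)))
assert dist >= 0.0
```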

Updated: 2025-07-18 07:10:12

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.00142v5

Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models

When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a ``Model Synthesis Architecture'' (MSA) -- using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset -- built around a `Model Olympics` domain of sports vignettes -- tests models' capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people's ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.

Updated: 2025-07-18 06:48:39

Subjects: cs.CL,cs.AI,cs.PL

Download: http://arxiv.org/abs/2507.12547v2

CorMulT: A Semi-supervised Modality Correlation-aware Multimodal Transformer for Sentiment Analysis

Multimodal sentiment analysis is an active research area that combines multiple data modalities, e.g., text, image and audio, to analyze human emotions, and benefits a variety of applications. Existing multimodal sentiment analysis methods can be classified as modality interaction-based methods, modality transformation-based methods and modality similarity-based methods. However, most of these methods highly rely on strong correlations between modalities, and cannot fully uncover and utilize the correlations between modalities to enhance sentiment analysis. Therefore, these methods usually perform poorly in identifying the sentiment of multimodal data with weak correlations. To address this issue, we propose a two-stage semi-supervised model termed Correlation-aware Multimodal Transformer (CorMulT), which consists of a pre-training stage and a prediction stage. At the pre-training stage, a modality correlation contrastive learning module is designed to efficiently learn modality correlation coefficients between different modalities. At the prediction stage, the learned correlation coefficients are fused with modality representations to make the sentiment prediction. According to experiments on the popular multimodal dataset CMU-MOSEI, CorMulT clearly surpasses state-of-the-art multimodal sentiment analysis methods.
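As a loose illustration of the prediction-stage fusion described above, here is a toy stand-in in which cosine similarity plays the role of the correlation coefficient (the paper learns these coefficients contrastively; function names are ours):

```python
import math

def modality_correlation(u, v):
    # Toy correlation coefficient: cosine similarity between two
    # modality embeddings.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def correlation_weighted_fusion(text_emb, audio_emb):
    # Prediction stage: weight the auxiliary modality by its correlation
    # with the primary one before fusing the representations.
    corr = modality_correlation(text_emb, audio_emb)
    return [t + corr * a for t, a in zip(text_emb, audio_emb)]

# A weakly correlated audio stream contributes little to the fused vector,
# which is the behavior that helps on weak-correlation samples.
fused = correlation_weighted_fusion([1.0, 0.0], [0.0, 1.0])
assert fused == [1.0, 0.0]
```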

Updated: 2025-07-18 06:42:18

Subjects: cs.AI,cs.CV

Download: http://arxiv.org/abs/2407.07046v3

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition

Large language models (LLMs) have shown remarkable performance across a range of NLP tasks. However, their strong instruction-following capabilities and inability to distinguish instructions from data content make them vulnerable to indirect prompt injection attacks. In such attacks, instructions with malicious purposes are injected into external data sources, such as web documents. When LLMs retrieve this injected data through tools such as a search engine and execute the injected instructions, they provide misled responses. Recent attack methods have demonstrated potential, but their abrupt instruction injection often undermines their effectiveness. Motivated by the limitations of existing attack methods, we propose TopicAttack, which prompts the LLM to generate a fabricated conversational transition prompt that gradually shifts the topic toward the injected instruction, making the injection smoother and enhancing the plausibility and success of the attack. Through comprehensive experiments, TopicAttack achieves state-of-the-art performance, with an attack success rate (ASR) over 90% in most cases, even when various defense methods are applied. We further analyze its effectiveness by examining attention scores. We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
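The injected-to-original attention ratio used in the analysis above is a simple sum ratio over token positions. A hypothetical sketch (attention values are made up for illustration):

```python
def injected_to_original_ratio(attn, injected_positions):
    # attn: per-token attention mass over the retrieved document;
    # injected_positions: indices covered by the injected instruction.
    injected = sum(attn[i] for i in injected_positions)
    original = sum(a for i, a in enumerate(attn)
                   if i not in injected_positions)
    return injected / original

# A smooth topic transition shifts attention mass onto the injected span,
# raising the ratio (and, per the paper, the attack success probability).
abrupt = injected_to_original_ratio([0.4, 0.3, 0.2, 0.1], {2, 3})
smooth = injected_to_original_ratio([0.2, 0.2, 0.3, 0.3], {2, 3})
assert smooth > abrupt
```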

Updated: 2025-07-18 06:23:31

Subjects: cs.CR

Download: http://arxiv.org/abs/2507.13686v1

LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues

Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a new benchmark (https://huggingface.co/datasets/TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs) with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.
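The progressive key-value compression idea can be caricatured as keeping a fixed budget of cache entries ranked by attention from recently generated tokens. A minimal sketch (our own simplification with made-up names, not the LoopServe implementation):

```python
def compress_kv_cache(entries, recency_scores, budget):
    # Keep the `budget` cache entries with the highest attention scores
    # from the most recently generated tokens, preserving positional order.
    ranked = sorted(range(len(entries)),
                    key=lambda i: recency_scores[i], reverse=True)
    kept = sorted(ranked[:budget])
    return [entries[i] for i in kept]

# As decoding proceeds the cache is re-pruned against fresh scores, so the
# kept set adapts to the conversation instead of using fixed positions.
assert compress_kv_cache(["k0", "k1", "k2", "k3"],
                         [0.05, 0.40, 0.15, 0.40], budget=2) == ["k1", "k3"]
```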

Updated: 2025-07-18 06:12:08

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.13681v1

Real-Time Communication-Aware Ride-Sharing Route Planning for Urban Air Mobility: A Multi-Source Hybrid Attention Reinforcement Learning Approach

Urban Air Mobility (UAM) systems are rapidly emerging as promising solutions to alleviate urban congestion, with path planning becoming a key focus area. Unlike ground transportation, UAM trajectory planning has to prioritize communication quality for accurate location tracking in constantly changing environments to ensure safety. Meanwhile, a UAM system, serving as an air taxi, requires adaptive planning to respond to real-time passenger requests, especially in ride-sharing scenarios where passenger demands are unpredictable and dynamic. However, conventional trajectory planning strategies based on predefined routes lack the flexibility to meet varied passenger ride demands. To address these challenges, this work first proposes constructing a radio map to evaluate the communication quality of urban airspace. Building on this, we introduce a novel Multi-Source Hybrid Attention Reinforcement Learning (MSHA-RL) framework for the challenge of effectively focusing on passengers and UAM locations, which arises from the significant dimensional disparity between the representations. This model first aligns diverse data sources with large dimensional gaps before employing hybrid attention to balance global and local insights, thereby facilitating responsive, real-time path planning. Extensive experimental results demonstrate that the approach enables communication-compliant trajectory planning, reducing travel time and enhancing operational efficiency while prioritizing passenger safety.

Updated: 2025-07-18 06:09:30

Subjects: cs.RO,cs.AI

Download: http://arxiv.org/abs/2507.14249v1

ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources, including mobile, desktop, and web GUI screenshots, to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. The ZonUI-3B is available at: https://github.com/Han1018/ZonUI-3B

Updated: 2025-07-18 06:03:26

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.23491v3

HeCoFuse: Cross-Modal Complementary V2X Cooperative Perception with Heterogeneous Sensors

Real-world Vehicle-to-Everything (V2X) cooperative perception systems often operate under heterogeneous sensor configurations due to cost constraints and deployment variability across vehicles and infrastructure. This heterogeneity poses significant challenges for feature fusion and perception reliability. To address these issues, we propose HeCoFuse, a unified framework designed for cooperative perception across mixed sensor setups where nodes may carry Cameras (C), LiDARs (L), or both. By introducing a hierarchical fusion mechanism that adaptively weights features through a combination of channel-wise and spatial attention, HeCoFuse can tackle critical challenges such as cross-modality feature misalignment and imbalanced representation quality. In addition, an adaptive spatial resolution adjustment module is employed to balance computational cost and fusion effectiveness. To enhance robustness across different configurations, we further implement a cooperative learning strategy that dynamically adjusts fusion type based on available modalities. Experiments on the real-world TUMTraf-V2X dataset demonstrate that HeCoFuse achieves 43.22% 3D mAP under the full sensor configuration (LC+LC), outperforming the CoopDet3D baseline by 1.17%, and reaches an even higher 43.38% 3D mAP in the L+LC scenario, while maintaining 3D mAP in the range of 21.74% to 43.38% across nine heterogeneous sensor configurations. These results, validated by our first-place finish in the CVPR 2025 DriveX challenge, establish HeCoFuse as the current state-of-the-art on the TUMTraf-V2X dataset while demonstrating robust performance across diverse sensor deployments.

Updated: 2025-07-18 06:02:22

Subjects: cs.CV,cs.AI,cs.LG,cs.MM

Download: http://arxiv.org/abs/2507.13677v1

Fast computational deep thermalization

Deep thermalization refers to the emergence of Haar-like randomness from quantum systems upon partial measurements. As a generalization of quantum thermalization, it is often associated with high complexity and entanglement. Here, we introduce computational deep thermalization and construct the fastest possible dynamics exhibiting it at infinite effective temperature. Our circuit dynamics produce quantum states with low entanglement in polylogarithmic depth that are indistinguishable from Haar random states to any computationally bounded observer. Importantly, the observer is allowed to request many copies of the same residual state obtained from partial projective measurements on the state -- this condition is beyond the standard settings of quantum pseudorandomness, but natural for deep thermalization. In cryptographic terms, these states are pseudorandom, pseudoentangled, and crucially, retain these properties under local measurements. Our results demonstrate a new form of computational thermalization, where thermal-like behavior arises from structured quantum states endowed with cryptographic properties, instead of from highly unstructured ensembles. The low resource complexity of preparing these states suggests scalable simulations of deep thermalization using quantum computers. Our work also motivates the study of computational quantum pseudorandomness beyond BQP observers.

Updated: 2025-07-18 05:42:05

Subjects: quant-ph,cond-mat.stat-mech,cs.CC,cs.CR

Download: http://arxiv.org/abs/2507.13670v1

Multi-Agent LLMs as Ethics Advocates for AI-Based Systems

Incorporating ethics into the requirement elicitation process is essential for creating ethically aligned systems. Although eliciting manual ethics requirements is effective, it requires diverse input from multiple stakeholders, which can be challenging due to time and resource constraints. Moreover, it is often given a low priority in the requirements elicitation process. This study proposes a framework for generating ethics requirements drafts by introducing an ethics advocate agent in a multi-agent LLM setting. This agent critiques and provides input on ethical issues based on the system description. The proposed framework is evaluated through two case studies from different contexts, demonstrating that it captures the majority of ethics requirements identified by researchers during 30-minute interviews and introduces several additional relevant requirements. However, it also highlights reliability issues in generating ethics requirements, emphasizing the need for human feedback in this sensitive domain. We believe this work can facilitate the broader adoption of ethics in the requirements engineering process, ultimately leading to more ethically aligned products.

Updated: 2025-07-18 05:24:57

Subjects: cs.AI,cs.CY

Download: http://arxiv.org/abs/2507.08392v2

UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at https://github.com/chincharles/u-emo.

Updated: 2025-07-18 05:14:24

Subjects: cs.AI,cs.CV

Download: http://arxiv.org/abs/2409.18877v3

Evaluating link prediction: New perspectives and recommendations

Link prediction (LP) is an important problem in network science and machine learning research. The state-of-the-art LP methods are usually evaluated in a uniform setup, ignoring several factors associated with the data and application specific needs. We identify a number of such factors, such as network type, problem type, geodesic distance between the end nodes and its distribution over the classes, the nature and applicability of LP methods, class imbalance and its impact on early retrieval, and the evaluation metric, and present an experimental setup which allows us to evaluate LP methods in a rigorous and controlled manner. We perform extensive experiments with a variety of LP methods over real network datasets in this controlled setup, and gather valuable insights on the interactions of these factors with the performance of LP through an array of carefully designed hypotheses. Following the insights, we provide recommendations to be followed as best practice for evaluating LP methods.
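One of the factors named above, class imbalance and its impact on early retrieval, is commonly probed with metrics such as precision@k, which scores only the top of the ranking. A minimal sketch with hypothetical scores (one common choice; the paper considers evaluation metrics more broadly):

```python
def precision_at_k(scores, labels, k):
    # Fraction of true links among the k highest-scoring candidate pairs;
    # an early-retrieval metric that stays informative under heavy class
    # imbalance, unlike accuracy computed over all candidate pairs.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k

# Hypothetical scores for six candidate node pairs, only two of which are
# true links (label 1) -- the imbalance typical of sparse networks.
scores = [0.91, 0.10, 0.85, 0.40, 0.05, 0.77]
labels = [1, 0, 1, 0, 0, 0]
assert precision_at_k(scores, labels, k=2) == 1.0
```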

Updated: 2025-07-18 05:12:58

Subjects: cs.SI,cs.AI

Download: http://arxiv.org/abs/2502.12777v4

Breaking the Illusion of Security via Interpretation: Interpretable Vision Transformer Systems under Attack

Vision transformer (ViT) models, when coupled with interpretation models, are regarded as secure and challenging to deceive, making them well-suited for security-critical domains such as medical applications, autonomous vehicles, drones, and robotics. However, successful attacks on these systems can lead to severe consequences. Recent research on threats targeting ViT models primarily focuses on generating the smallest adversarial perturbations that can deceive the models with high confidence, without considering their impact on model interpretations. Nevertheless, the use of interpretation models can effectively assist in detecting adversarial examples. This study investigates the vulnerability of transformer models to adversarial attacks, even when combined with interpretation models. We propose an attack called "AdViT" that generates adversarial examples capable of misleading both a given transformer model and its coupled interpretation model. Through extensive experiments on various transformer models and two transformer-based interpreters, we demonstrate that AdViT achieves a 100% attack success rate in both white-box and black-box scenarios. In white-box scenarios, it reaches up to 98% misclassification confidence, while in black-box scenarios, it reaches up to 76% misclassification confidence. Remarkably, AdViT consistently generates accurate interpretations in both scenarios, making the adversarial examples more difficult to detect.

Updated: 2025-07-18 05:11:11

Subjects: cs.CR,cs.AI,cs.CV,cs.LG,I.2.10; I.2.6; I.5.1; D.4.6; K.6.5

Download: http://arxiv.org/abs/2507.14248v1

Illuminating the Three Dogmas of Reinforcement Learning under Evolutionary Light

Three core tenets of reinforcement learning (RL)--concerning the definition of agency, the objective of learning, and the scope of the reward hypothesis--have been highlighted as key targets for conceptual revision, with major implications for theory and application. We propose a framework, inspired by open-ended evolutionary theory, to reconsider these three "dogmas." We revisit each assumption and address related concerns raised alongside them. To make our arguments relevant to RL as a model of biological learning, we first establish that evolutionary dynamics can plausibly operate within living brains over an individual's lifetime, and are not confined to cross-generational processes. We begin by revisiting the second dogma, drawing on evolutionary insights to enrich the "adaptation-rather-than-search" view of learning. We then address the third dogma regarding the limits of the reward hypothesis, using analogies from evolutionary fitness to illuminate the scalar reward vs. multi-objective debate. After discussing practical implications for exploration in RL, we turn to the first--and arguably most fundamental--issue: the absence of a formal account of agency. We argue that unlike the other two problems, the evolutionary paradigm alone cannot resolve the agency question, though it gestures in a productive direction. We advocate integrating ideas from origins-of-life theory, where the thermodynamics of sustenance and replication offer promising foundations for understanding agency and resource-constrained reinforcement learning in biological systems.

Updated: 2025-07-18 05:07:38

Subjects: cs.AI

Download: http://arxiv.org/abs/2507.11482v2

When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework

Recent researchers have proposed using event cameras for person re-identification (ReID) due to their promising performance and better balance in terms of privacy protection; as a result, event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event stream, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID dataset and MARS datasets fully validated the effectiveness of our proposed RGB-Event person ReID framework. The benchmark dataset and source code will be released at https://github.com/Event-AHU/Neuromorphic_ReID

Updated: 2025-07-18 05:04:59

Subjects: cs.CV,cs.AI,cs.LG,cs.NE

Download: http://arxiv.org/abs/2507.13659v1

Combining model tracing and constraint-based modeling for multistep strategy diagnoses

Model tracing and constraint-based modeling are two approaches to diagnose student input in stepwise tasks. Model tracing supports identifying consecutive problem-solving steps taken by a student, whereas constraint-based modeling supports student input diagnosis even when several steps are combined into one step. We propose an approach that merges both paradigms. By defining constraints as properties that a student input has in common with a step of a strategy, it is possible to provide a diagnosis when a student deviates from a strategy, even when the student combines several steps. In this study we explore the design of a system for multistep strategy diagnoses, and evaluate these diagnoses. As a proof of concept, we generate diagnoses for an existing dataset containing steps students take when solving quadratic equations (n=2136). To compare with human diagnoses, two teachers coded a random sample of deviations (n=70) and applications of the strategy (n=70). Results show that the system diagnosis aligned with the teacher coding in all of the 140 student steps.

Updated: 2025-07-18 04:47:47

标题: 结合模型跟踪和基于约束的建模进行多步骤策略诊断

摘要: 模型跟踪和基于约束的建模是诊断学生在分步任务中输入的两种方法。模型跟踪支持识别学生所采取的连续解决问题步骤,而基于约束的建模支持即使将多个步骤合并为一个步骤也能诊断学生输入。我们提出了一种融合这两种范例的方法。通过将约束定义为学生输入与策略步骤具有共同属性,可以在学生偏离策略时提供诊断,即使学生将多个步骤合并。在这项研究中,我们探索了一个用于多步策略诊断的系统设计,并评估了这些诊断。作为概念验证,我们为包含学生解决二次方程时采取的步骤的现有数据集(n=2136)生成了诊断。为了与人类诊断进行比较,两名教师对偏离(n=70)和应用策略(n=70)的随机样本进行编码。结果显示,在所有140个学生步骤中,系统诊断与教师编码一致。

更新时间: 2025-07-18 04:47:47

领域: cs.AI

下载: http://arxiv.org/abs/2507.13652v1

Buggy rule diagnosis for combined steps through final answer evaluation in stepwise tasks

Many intelligent tutoring systems can support a student in solving a stepwise task. When a student combines several steps in one step, the number of possible paths connecting consecutive inputs may be very large. This combinatorial explosion makes error diagnosis hard. Using a final answer to diagnose a combination of steps can mitigate the combinatorial explosion, because there are generally fewer possible (erroneous) final answers than (erroneous) solution paths. An intermediate input for a task can be diagnosed by automatically completing it according to the task solution strategy and diagnosing this solution. This study explores the potential of automated error diagnosis based on a final answer. We investigate the design of a service that provides a buggy rule diagnosis when a student combines several steps. To validate the approach, we apply the service to an existing dataset (n=1939) of unique student steps when solving quadratic equations, which could not be diagnosed by a buggy rule service that tries to connect consecutive inputs with a single rule. Results show that final answer evaluation can diagnose 29.4% of these steps. Moreover, a comparison of the generated diagnoses with teacher diagnoses on a subset (n=115) shows that the diagnoses align in 97% of the cases. These results can be considered a basis for further exploration of the approach.
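
The core idea, diagnosing a combined step from its final answer rather than reconstructing the solution path, can be sketched as follows. This is a minimal illustration on a toy task (solving x^2 = c) with hypothetical buggy rules, not the paper's actual rule set or service:

```python
import math

# Toy task: solve x^2 = c, with c a perfect square. The correct
# strategy yields both roots; each hypothetical buggy rule leads to a
# different (erroneous) final answer.
def correct_answer(c):
    r = math.isqrt(c)
    return {r, -r}

BUGGY_RULES = {
    "forgot-negative-root": lambda c: {math.isqrt(c)},
    "sign-error":           lambda c: {-math.isqrt(c)},
}

def diagnose(c, student_final_answer):
    """Diagnose a combined step by comparing final answers: far fewer
    erroneous final answers exist than erroneous solution paths."""
    if student_final_answer == correct_answer(c):
        return "correct"
    for name, rule in BUGGY_RULES.items():
        if student_final_answer == rule(c):
            return name
    return "undiagnosed"

print(diagnose(9, {3}))      # forgot-negative-root
print(diagnose(9, {3, -3}))  # correct
```

Because each buggy rule is applied to the whole task rather than to one intermediate transformation, the diagnosis does not depend on which intermediate steps the student skipped.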

Updated: 2025-07-18 04:39:13

标题: Stepwise任务中综合步骤到最终答案评估的错误规则诊断

摘要: 许多智能辅导系统可以帮助学生解决逐步任务。当学生将几个步骤合并为一步时,连接连续输入的可能路径数量可能非常大。这种组合爆炸使错误诊断变得困难。使用最终答案来诊断多个步骤的组合可以缓解组合爆炸,因为通常可能的(错误的)最终答案比(错误的)解决路径少。任务的中间输入可以通过根据任务解决策略自动完成它并诊断这个解决方案来诊断。本研究探讨了基于最终答案的自动错误诊断的潜力。我们研究了设计一个服务,当学生组合几个步骤时提供一个有问题的规则诊断。为了验证这种方法,我们将该服务应用于解决二次方程时的现有数据集(n=1939),这些数据集无法通过尝试使用单一规则连接连续输入的有问题规则服务进行诊断。结果显示,最终答案评估可以诊断这些步骤的29.4%。此外,将生成的诊断与教师在一个子集(n=115)上的诊断进行比较显示,在97%的情况下诊断一致。这些结果可以被视为进一步探索该方法的基础。

更新时间: 2025-07-18 04:39:13

领域: cs.AI

下载: http://arxiv.org/abs/2507.13651v1

Improved particle swarm optimization algorithm: multi-target trajectory optimization for swarm drones

Real-time trajectory planning for unmanned aerial vehicles (UAVs) in dynamic environments remains a key challenge due to high computational demands and the need for fast, adaptive responses. Traditional Particle Swarm Optimization (PSO) methods, while effective for offline planning, often struggle with premature convergence and latency in real-time scenarios. To overcome these limitations, we propose PE-PSO, an enhanced PSO-based online trajectory planner. The method introduces a persistent exploration mechanism to preserve swarm diversity and an entropy-based parameter adjustment strategy to dynamically adapt optimization behavior. UAV trajectories are modeled using B-spline curves, which ensure path smoothness while reducing optimization complexity. To extend this capability to UAV swarms, we develop a multi-agent framework that combines genetic algorithm (GA)-based task allocation with distributed PE-PSO, supporting scalable and coordinated trajectory generation. The distributed architecture allows for parallel computation and decentralized control, enabling effective cooperation among agents while maintaining real-time performance. Comprehensive simulations demonstrate that the proposed framework outperforms conventional PSO and other swarm-based planners across several metrics, including trajectory quality, energy efficiency, obstacle avoidance, and computation time. These results confirm the effectiveness and applicability of PE-PSO in real-time multi-UAV operations under complex environmental conditions.
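
The entropy-based parameter adjustment can be illustrated with a minimal PSO on a toy objective. This is a sketch under assumed update rules (inertia range, coefficients, and the entropy-to-inertia mapping are choices made here for illustration); the paper's persistent-exploration mechanism, B-spline trajectory model, and GA task allocation are omitted:

```python
import math
import random

random.seed(0)

def sphere(x):
    # Toy objective standing in for a trajectory cost.
    return sum(v * v for v in x)

def swarm_entropy(positions, bins=10, lo=-5.0, hi=5.0):
    """Shannon entropy of particle positions (first dimension),
    used as a crude swarm-diversity signal."""
    counts = [0] * bins
    for p in positions:
        i = min(int((p[0] - lo) / (hi - lo) * bins), bins - 1)
        counts[max(i, 0)] += 1
    n = len(positions)
    return -sum(c / n * math.log(c / n) for c in counts if c)

def pe_pso(dim=2, n=30, iters=60):
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]
    gbest = min(pos, key=sphere)[:]
    for _ in range(iters):
        # Entropy-based adjustment: keep inertia high while the swarm
        # is diverse (exploration), shrink it as the swarm concentrates.
        w = 0.4 + 0.3 * swarm_entropy(pos) / math.log(10)
        for i in range(n):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if sphere(pos[i]) < sphere(pbest[i]):
                pbest[i] = pos[i][:]
                if sphere(pos[i]) < sphere(gbest):
                    gbest = pos[i][:]
    return gbest
```

Tying inertia to a diversity measure is one standard way to delay the premature convergence the abstract describes; the distributed multi-agent layer would run one such optimizer per UAV.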

Updated: 2025-07-18 04:31:49

标题: 改进的粒子群优化算法:群体无人机的多目标轨迹优化

摘要: 实时轨迹规划是无人机在动态环境中面临的关键挑战,因为其需要高计算需求和快速适应性响应。传统的粒子群优化(PSO)方法虽然在离线规划方面有效,但在实时场景中往往面临早熟收敛和延迟的困难。为了克服这些限制,我们提出了PE-PSO,一种增强型基于PSO的在线轨迹规划器。该方法介绍了一种持久的探索机制,以保持群体多样性,并引入了基于熵的参数调整策略,以动态调整优化行为。无人机轨迹采用B样条曲线建模,以确保路径平滑性同时降低优化复杂性。为了将这种能力扩展到无人机群,我们开发了一个多代理框架,将基于遗传算法(GA)的任务分配与分布式PE-PSO相结合,支持可扩展和协调的轨迹生成。分布式架构允许并行计算和分散控制,实现了代理之间的有效合作,同时保持实时性能。全面的模拟表明,所提出的框架在轨迹质量、能源效率、障碍物避免和计算时间等多个指标上优于传统的PSO和其他基于群体的规划器。这些结果证实了PE-PSO在复杂环境条件下实时多无人机操作中的有效性和适用性。

更新时间: 2025-07-18 04:31:49

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.13647v1

A Comprehensive Review of Transformer-based language models for Protein Sequence Analysis and Design

The impact of Transformer-based language models has been unprecedented in Natural Language Processing (NLP). The success of such models has also led to their adoption in other fields, including bioinformatics. Taking this into account, this paper discusses recent advances in Transformer-based models for protein sequence analysis and design. In this review, we have discussed and analysed a significant number of works pertaining to such applications. These applications encompass gene ontology, functional and structural protein identification, generation of de novo proteins, and binding of proteins. We attempt to shed light on the strengths and weaknesses of the discussed works to provide comprehensive insight to readers. Finally, we highlight shortcomings in existing research and explore potential avenues for future developments. We believe that this review will help researchers working in this field to gain an overall idea of the state of the art, and to orient their future studies.

Updated: 2025-07-18 04:20:33

标题: 一篇全面综述蛋白质序列分析和设计的基于Transformer的语言模型

摘要: 基于Transformer的语言模型在自然语言处理(NLP)领域的影响是空前的。这些模型的成功也导致它们在包括生物信息学在内的其他领域的应用。考虑到这一点,本文讨论了Transformer-based模型在蛋白质序列分析和设计方面的最新进展。在这篇综述中,我们讨论并分析了与这些应用相关的大量研究工作。这些应用涵盖了基因本体论、功能和结构蛋白质识别、全新蛋白质的生成和蛋白质的结合。我们试图揭示所讨论工作的优势和劣势,以提供给读者全面的见解。最后,我们强调了现有研究的不足之处,并探讨了未来发展的潜在途径。我们相信这篇综述将帮助在这一领域工作的研究人员对该领域的最新技术有一个整体的了解,并引导他们未来的研究。

更新时间: 2025-07-18 04:20:33

领域: cs.LG,cs.AI,q-bio.QM

下载: http://arxiv.org/abs/2507.13646v1

GATSim: Urban Mobility Simulation with Generative Agents

Traditional agent-based urban mobility simulations often rely on rigid rule-based systems that struggle to capture the complexity, adaptability, and behavioral diversity inherent in human travel decision making. Recent advancements in large language models and AI agent technologies present new opportunities to develop agents with enhanced reasoning capabilities, persistent memory, and adaptive learning. We introduce GATSim (Generative-Agent Transport Simulation), a novel framework that leverages these advancements to simulate urban mobility using generative agents with rich, human-like behaviors. Unlike conventional approaches, GATSim agents are characterized by diverse socioeconomic profiles, individual lifestyles, and evolving preferences shaped through psychologically informed memory systems, tool usage, and lifelong learning. The main contributions of this work are: (1) a comprehensive architecture that integrates an urban mobility foundation model with agent cognitive systems and a transport simulation environment; (2) a hierarchical memory designed for efficient retrieval of contextually relevant information, incorporating spatial and temporal associations, keyword matching, and semantic relevance; (3) innovative planning and reactive mechanisms for modeling adaptive mobility behaviors which integrate a multi-scale reflection process to transform specific travel experiences into generalized behavioral insights. We implement a prototype system and conduct systematic validation, demonstrating that generative agents produce believable and coherent travel behaviors. Experimental results indicate that generative agents perform at least as well as human annotators with 92% posterior probability, while naturally producing realistic macroscopic traffic patterns. The code for the prototype implementation is publicly available at https://github.com/qiliuchn/gatsim.
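
The hierarchical memory's retrieval idea, blending recency, keyword matching, and spatial association into one ranking score, can be sketched roughly as follows. The weights, half-life, and scoring form here are assumptions for illustration, not GATSim's actual implementation:

```python
import math

def keyword_overlap(query, text):
    # Fraction of query tokens that appear in the memory text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(memories, query, now, place, k=2,
             w_rec=0.3, w_kw=0.5, w_sp=0.2, half_life=24.0):
    """Rank memories by a weighted blend of exponential recency decay,
    keyword match, and a same-place (spatial association) bonus."""
    def score(m):
        recency = math.exp(-math.log(2) * (now - m["t"]) / half_life)
        return (w_rec * recency
                + w_kw * keyword_overlap(query, m["text"])
                + w_sp * (1.0 if m["place"] == place else 0.0))
    return sorted(memories, key=score, reverse=True)[:k]

memories = [
    {"t": 1.0,  "place": "home",   "text": "morning traffic jam on ring road"},
    {"t": 40.0, "place": "office", "text": "metro line 2 was crowded"},
    {"t": 47.0, "place": "office", "text": "left work late, took a taxi"},
]
top = retrieve(memories, "crowded metro commute", now=48.0, place="office")
```

With this blend, the semantically relevant but slightly older metro memory outranks the most recent one, which is the behavior a pure recency queue would miss.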

Updated: 2025-07-18 04:20:16

标题: GATSim:具有生成代理的城市移动模拟

摘要: 传统基于代理的城市移动性模拟通常依赖于刚性的基于规则的系统,难以捕捉人类出行决策中固有的复杂性、适应性和行为多样性。最近大规模语言模型和人工智能代理技术的进步提供了开发具有增强推理能力、持久记忆和自适应学习的代理的新机会。我们引入了GATSim(生成式代理交通模拟),这是一个利用这些进步来模拟城市移动性的新框架,使用具有丰富、类似人类行为的生成式代理。与传统方法不同,GATSim代理的特征包括多样化的社会经济特征、个体生活方式和通过心理学知识的记忆系统、工具使用和终身学习塑造的不断发展的偏好。这项工作的主要贡献是:(1)一个综合的架构,将城市移动性基础模型与代理认知系统和交通模拟环境集成在一起;(2)一个为了有效检索相关上下文信息而设计的分层记忆,包括空间和时间关联、关键词匹配和语义相关性;(3)创新的规划和反应机制,用于建模自适应移动行为,其中整合了多尺度反思过程,将具体出行经验转化为广义行为见解。我们实现了一个原型系统并进行了系统验证,证明生成式代理产生可信、连贯的出行行为。实验结果表明,生成式代理以92\%的后验概率至少与人类标注者表现一样好,同时自然地产生现实的宏观交通模式。原型实现的代码可以在https://github.com/qiliuchn/gatsim 上公开获取。

更新时间: 2025-07-18 04:20:16

领域: cs.AI

下载: http://arxiv.org/abs/2506.23306v2

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having a remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths and limitations, and raising questions about their reasoning capabilities.

Updated: 2025-07-18 04:14:22

标题: 思维的幻觉:通过问题复杂性的视角理解推理模型的优势和局限性

摘要: 最近几代语言模型引入了大型推理模型(LRMs),在提供答案之前生成详细的思考过程。虽然这些模型在推理基准测试中表现出改进的性能,但它们的基本能力、扩展属性和限制仍然不够理解。目前的评估主要集中在已建立的数学和编码基准上,强调最终答案的准确性。然而,这种评估范式经常受到污染,并且不能提供对推理过程的洞察。在这项工作中,我们通过可控拼图环境系统地调查这些差距,这些环境允许在保持一致逻辑结构的同时精确操纵复杂性。这种设置不仅使得对最终答案的分析成为可能,还使得对内部推理过程的分析成为可能,进而揭示LRMs的思考方式。通过大量实验,我们发现LRMs在某些复杂度之后面临完全准确性崩溃。此外,它们表现出一种反直觉的扩展限制:随着问题复杂度的增加,推理的努力会增加到一定程度,然后尽管还有剩余的标记预算,但会下降。通过将LRMs与其标准LLM对应物在相同推理计算下进行比较,我们确定了三种性能区别:(1)低复杂性任务中,标准模型优于LRMs,(2)中等复杂性任务中,LRMs展现出优势,(3)高复杂性任务中,两种模型都面临完全崩溃。我们发现LRMs在精确计算方面存在限制:它们无法使用显式算法,而且在不同尺度上的推理不一致。我们还深入研究了推理过程,研究了探索解决方案的模式并分析了模型的计算行为,揭示了它们的优势、限制,并提出了关于它们推理能力的问题。

更新时间: 2025-07-18 04:14:22

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2506.06941v2

Differential Privacy in Kernelized Contextual Bandits via Random Projections

We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space. We study this problem under an additional constraint of Differential Privacy, where the agent needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that achieves the state-of-the-art cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{\gamma_TT}+\frac{\gamma_T}{\varepsilon_{\mathrm{DP}}})$ and $\widetilde{\mathcal{O}}(\sqrt{\gamma_TT}+\frac{\gamma_T\sqrt{T}}{\varepsilon_{\mathrm{DP}}})$ over a time horizon of $T$ in the joint and local models of differential privacy, respectively, where $\gamma_T$ is the effective dimension of the kernel and $\varepsilon_{\mathrm{DP}} > 0$ is the privacy parameter. The key ingredient of the proposed algorithm is a novel private kernel-ridge regression estimator which is based on a combination of private covariance estimation and private random projections. It offers a significantly reduced sensitivity compared to its classical counterpart while maintaining a high prediction accuracy, allowing our algorithm to achieve the state-of-the-art performance guarantees.

Updated: 2025-07-18 03:54:49

标题: 通过随机投影在核化上下文赌博中的差分隐私

摘要: 我们考虑具有随机上下文的背景下的上下文内核赌博问题,其中基础奖励函数属于已知的再生核希尔伯特空间。我们在差分隐私的附加约束下研究这个问题,其中代理需要确保查询点序列对上下文和奖励序列都具有差分隐私。我们提出了一种新颖的算法,实现了在时间跨度$T$内的共累积遗憾的最新结果为$\widetilde{\mathcal{O}}(\sqrt{\gamma_TT}+\frac{\gamma_T}{\varepsilon_{\mathrm{DP}}})$和$\widetilde{\mathcal{O}}(\sqrt{\gamma_TT}+\frac{\gamma_T\sqrt{T}}{\varepsilon_{\mathrm{DP}}})$,分别在差分隐私的联合模型和本地模型下,其中$\gamma_T$是核的有效维度,$\varepsilon_{\mathrm{DP}} > 0$是隐私参数。所提出算法的关键要素是一种基于私有协方差估计和私有随机投影组合的新型私有内核岭回归估计器。与经典对应物相比,它提供了显著降低的敏感性,同时保持了高预测准确性,使我们的算法能够实现最先进的性能保证。

更新时间: 2025-07-18 03:54:49

领域: stat.ML,cs.CR,cs.LG

下载: http://arxiv.org/abs/2507.13639v1

Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques

Large Language Models (LLMs) are transforming cybersecurity by enabling intelligent, adaptive, and automated approaches to threat detection, vulnerability assessment, and incident response. With their advanced language understanding and contextual reasoning, LLMs surpass traditional methods in tackling challenges across domains such as IoT, blockchain, and hardware security. This survey provides a comprehensive overview of LLM applications in cybersecurity, focusing on two core areas: (1) the integration of LLMs into key cybersecurity domains, and (2) the vulnerabilities of LLMs themselves, along with mitigation strategies. By synthesizing recent advancements and identifying key limitations, this work offers practical insights and strategic recommendations for leveraging LLMs to build secure, scalable, and future-ready cyber defense systems.

Updated: 2025-07-18 03:41:18

标题: 大型语言模型在网络安全中的应用、漏洞和防御技术

摘要: 大型语言模型(LLMs)正在通过实现智能、适应性和自动化的方式来改变网络安全领域,从而实现威胁检测、漏洞评估和事件响应。凭借其先进的语言理解和语境推理能力,LLMs在处理物联网、区块链和硬件安全等领域的挑战方面超越了传统方法。本调查提供了LLMs在网络安全领域应用的综合概述,重点关注两个核心领域:(1)LLMs在关键网络安全领域的整合,以及(2)LLMs本身的漏洞及缓解策略。通过综合最新进展并识别关键限制,本研究提供了关于如何利用LLMs构建安全、可扩展和未来准备的网络防御系统的实用见解和战略建议。

更新时间: 2025-07-18 03:41:18

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.13629v1

BifrostRAG: Bridging Dual Knowledge Graphs for Multi-Hop Question Answering in Construction Safety

Information retrieval and question answering from safety regulations are essential for automated construction compliance checking but are hindered by the linguistic and structural complexity of regulatory text. Many compliance-related queries are multi-hop, requiring synthesis of information across interlinked clauses. This poses a challenge for traditional retrieval-augmented generation (RAG) systems. To overcome this, we introduce BifrostRAG: a dual-graph RAG-integrated system that explicitly models both linguistic relationships (via an Entity Network Graph) and document structure (via a Document Navigator Graph). This architecture powers a hybrid retrieval mechanism that combines graph traversal with vector-based semantic search, enabling large language models to reason over both the meaning and the structure of the text. Evaluation on a multi-hop question dataset shows that BifrostRAG achieves 92.8 percent precision, 85.5 percent recall, and an F1 score of 87.3 percent. These results significantly outperform vector-only and graph-only RAG baselines that represent current leading approaches. Error analysis further highlights the comparative advantages of our hybrid method over single-modality RAGs. These findings establish BifrostRAG as a robust knowledge engine for LLM-driven compliance checking. Its dual-graph, hybrid retrieval mechanism offers a transferable blueprint for navigating complex technical documents across knowledge-intensive engineering domains.
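
The hybrid retrieval idea, vector search to find an entry clause and graph traversal to pull in cross-referenced clauses that a pure embedding match misses, can be sketched like this. Toy bag-of-words vectors and a hand-built cross-reference graph stand in for BifrostRAG's actual entity and document graphs:

```python
import math
from collections import Counter

CLAUSES = {
    "c1": "workers at heights must use fall protection harness per clause c2",
    "c2": "each harness must be inspected every month and records kept",
    "c3": "ladders must be secured and placed on stable ground",
}
# Document graph: c1 cross-references c2.
GRAPH = {"c1": ["c2"], "c2": [], "c3": []}

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def vector_top1(query):
    # Pure semantic search: best single clause by cosine similarity.
    return max(CLAUSES, key=lambda c: cosine(vec(query), vec(CLAUSES[c])))

def hybrid_retrieve(query):
    # Hybrid: expand the vector hit with clauses it cross-references.
    seed = vector_top1(query)
    return [seed] + GRAPH[seed]

q = "how often must fall protection at heights be inspected"
print(vector_top1(q))      # 'c1' -- but the inspection interval is not here
print(hybrid_retrieve(q))  # ['c1', 'c2'] -- traversal reaches the answer
```

The multi-hop query lands on c1 by similarity alone, yet the answer (the monthly interval) lives in c2; following the cross-reference edge is what makes the combined context sufficient for an LLM to answer.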

Updated: 2025-07-18 03:39:14

标题: BifrostRAG:在建筑安全中桥接双重知识图以进行多跳问题回答

摘要: 信息检索和问题回答是自动化建筑合规性检查中不可或缺的,但受到法规文本的语言和结构复杂性的阻碍。许多与合规性相关的查询是多跳的,需要综合跨链接子句的信息。这对于传统的检索增强生成(RAG)系统构成挑战。为了克服这一障碍,我们引入了BifrostRAG:一个双图RAG集成系统,明确建模了语言关系(通过实体网络图)和文档结构(通过文档导航器图)。这种架构提供了一种混合检索机制,将图遍历与基于向量的语义搜索相结合,使大型语言模型能够推理文本的含义和结构。在一个多跳问题数据集上的评估显示,BifrostRAG 实现了92.8% 的精度、85.5% 的召回率和87.3% 的 F1 得分。这些结果明显优于代表当前领先方法的仅基于向量和仅基于图的 RAG 基线。错误分析进一步突显了我们的混合方法相对于单模态 RAG 的比较优势。这些发现将BifrostRAG确立为LLM驱动的合规性检查的强大知识引擎。其双图、混合检索机制为跨知识密集工程领域的复杂技术文档提供了一个可转移的蓝图。

更新时间: 2025-07-18 03:39:14

领域: cs.AI

下载: http://arxiv.org/abs/2507.13625v1

EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, causing the mapping learning difficult and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes the perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.

Updated: 2025-07-18 03:23:12

标题: EvolveNav:基于LLM的视觉语言导航的自我改进体现推理

摘要: 构建能够根据自然语言指令导航的视觉-语言导航(VLN)代理是人机交互应用中的一个长期目标。最近的研究揭示了训练开源的大型语言模型(LLMs)的潜力,以释放LLMs的推理能力来改进导航,并同时缓解LLMs的训练语料库与VLN任务之间的领域差距。然而,这些方法主要采用直接的输入-输出映射范式,导致映射学习困难,导航决策难以解释。思维链(CoT)训练是改善导航决策准确性和解释性的一种有前途的方法,但导航任务的复杂性使得完美的CoT标签不可用,并可能通过纯粹的CoT监督微调导致过拟合。在本文中,我们提出了一个新颖的自我改进的体现推理框架,用于提升基于LLM的视觉-语言导航,命名为EvolveNav。我们的EvolveNav包含两个阶段:(1)正式化的CoT监督微调,在这个阶段我们使用正式化的CoT标签训练模型,以激活模型的导航推理能力并提高推理速度;(2)自我反思后训练,在这个阶段模型将被迭代地训练,使用其自身推理输出作为自我丰富的CoT标签,以增强监督多样性。同时引入了一个自我反思的辅助任务,以鼓励通过与错误模式对比学习正确的推理模式。在流行的VLN基准测试上的实验结果表明,EvolveNav相对于先前的基于LLM的VLN方法具有优越性。代码可在https://github.com/expectorlin/EvolveNav上找到。

更新时间: 2025-07-18 03:23:12

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2506.01551v2

Temporal reasoning for timeline summarisation in social media

This paper explores whether enhancing temporal reasoning capabilities in Large Language Models (LLMs) can improve the quality of timeline summarisation, the task of summarising long texts containing sequences of events, such as social media threads. We first introduce NarrativeReason, a novel dataset focused on temporal relationships among sequential events within narratives, distinguishing it from existing temporal reasoning datasets that primarily address pair-wise event relationships. Our approach then combines temporal reasoning with timeline summarisation through a knowledge distillation framework, where we first fine-tune a teacher model on temporal reasoning tasks and then distill this knowledge into a student model while simultaneously training it for the task of timeline summarisation. Experimental results demonstrate that our model achieves superior performance on out-of-domain mental health-related timeline summarisation tasks, which involve long social media threads with repetitions of events and a mix of emotions, highlighting the importance and generalisability of leveraging temporal reasoning to improve timeline summarisation.

Updated: 2025-07-18 03:12:59

标题: 社交媒体中时间线摘要的时间推理

摘要: 本文探讨了是否增强大型语言模型(LLMs)的时间推理能力可以提高时间轴摘要的质量,时间轴摘要是对包含事件序列的长文本进行总结的任务,例如社交媒体帖子。我们首先介绍了NarrativeReason,一个专注于叙事中顺序事件之间时间关系的新数据集,将其与现有主要处理成对事件关系的时间推理数据集区分开来。然后,我们的方法通过知识蒸馏框架将时间推理与时间轴摘要相结合,其中我们首先在时间推理任务上对教师模型进行微调,然后将这些知识蒸馏到学生模型中,同时训练它执行时间轴摘要任务。实验结果表明,我们的模型在跨领域的与心理健康相关的时间轴摘要任务上取得了卓越的表现,这些任务涉及长的社交媒体帖子,包含事件重复和情绪混合,突显了利用时间推理来改进时间轴摘要的重要性和普适性。

更新时间: 2025-07-18 03:12:59

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2501.00152v3

Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features, such as dependency length and emotionality, and use them to characterize human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.

Updated: 2025-07-18 02:46:55

标题: 人类和大型语言模型生成的文本的语言和嵌入式特征分析

摘要: 大型语言模型(LLMs)的快速发展显著提高了它们生成自然语言的能力,使得LLMs生成的文本越来越难以区分是否为人类撰写的文本。尽管最近的研究主要集中在使用LLMs将文本分类为人类撰写和机器生成的文本,我们的研究侧重于使用一组跨不同语言水平(如形态、句法和语义)的语言特征来表征这些文本。我们选择了一个跨8个领域的人类撰写和机器生成文本数据集,并由11个不同的LLMs生成。我们计算了不同的语言特征,如依赖长度和情感色彩,并将它们用于表征人类撰写和机器生成的文本,同时采用不同的抽样策略、重复控制和模型发布日期。我们的统计分析揭示了人类撰写的文本倾向于展示更简单的句法结构和更多样化的语义内容。此外,我们计算了我们的特征集在模型和领域之间的变化性。人类和机器文本在不同领域展示了样式多样性,人类在我们的特征上显示出更大的变化。最后,我们应用样式嵌入来进一步测试人类撰写和机器生成文本之间的变化性。值得注意的是,新模型输出的文本同样变化多样,指向机器生成文本的同质化。

更新时间: 2025-07-18 02:46:55

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13614v1

Merge Kernel for Bayesian Optimization on Permutation Space

The Bayesian Optimization (BO) algorithm is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an $\Omega(n^2)$ representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation space based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the Merge Kernel constructed from merge sort, which replaces the quadratic complexity with $\Theta(n\log n)$ to achieve the lowest possible complexity. The resulting feature vector is significantly shorter, can be computed in linearithmic time, yet still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. Results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation space.
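
The sorting-algorithm connection can be made concrete: the Mallows kernel depends on the Kendall-tau distance (the number of discordant pairs), and merge sort counts exactly those pairs in $\Theta(n\log n)$ instead of enumerating all $\Omega(n^2)$ comparisons. A minimal sketch of that connection (the decay rate `lam` is an arbitrary choice here, and the paper's actual Merge Kernel feature map, shift histogram, split-pair line, and sliding-window motifs are not reproduced):

```python
import math

def count_inversions(a):
    """Merge-sort based inversion count, O(n log n)."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_l = count_inversions(a[:mid])
    right, inv_r = count_inversions(a[mid:])
    merged, inv = [], inv_l + inv_r
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            # right[j] jumps over every remaining left element:
            # each jump is one discordant pair.
            merged.append(right[j]); j += 1
            inv += len(left) - i
    merged += left[i:] + right[j:]
    return merged, inv

def kendall_tau(p, q):
    """Number of pairwise comparisons on which p and q disagree."""
    pos = {v: i for i, v in enumerate(p)}
    seq = [pos[v] for v in q]       # q expressed in p's ordering
    return count_inversions(seq)[1]

def mallows_kernel(p, q, lam=0.5):
    return math.exp(-lam * kendall_tau(p, q))
```

Bubble sort performs one swap per inversion, which is why the Mallows kernel falls out of that sorting algorithm; merge sort reaches the same count with linearithmic work, mirroring the complexity reduction the Merge Kernel exploits.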

Updated: 2025-07-18 02:45:58

标题: 置换空间上的贝叶斯优化的合并核

摘要: Bayesian Optimization (BO)算法是解决黑盒优化问题的标准工具。目前在置换空间中的BO最先进方法依赖于Mallows核心-一个$\Omega(n^2)$表示,明确列举了每一对比较。受Mallows核心和成对比较之间密切关系的启发,我们提出了一个基于排序算法在置换空间生成核函数的新框架。在这个框架内,Mallows核心可以被视为从冒泡排序派生出的一个特殊实例。此外,我们引入了由归并排序构建的Merge Kernel,它将二次复杂度替换为$\Theta(n\log n)$以实现最低可能的复杂度。结果特征向量显著缩短,可以在线性对数时间内计算,但仍有效地捕捉到有意义的置换距离。为了提高鲁棒性和正确不变性而不牺牲紧凑性,我们进一步结合了三个轻量级的、任务无关的描述符:(1) 一个位移直方图,它聚合了绝对元素位移并提供了一个全局错位信号;(2) 一个分割对线,通过在整个置换的两半之间对齐元素来编码选定的长距离比较;以及(3) 滑动窗口图案,总结了影响邻近目标的局部顺序模式。我们的实证评估表明,所提出的核心在各种置换优化基准测试中始终优于最先进的Mallows核心。结果证实,Merge Kernel为置换空间中的贝叶斯优化提供了更紧凑但更有效的解决方案。

更新时间: 2025-07-18 02:45:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.13263v2

Invisible Textual Backdoor Attacks based on Dual-Trigger

Backdoor attacks pose an important security threat to textual large language models. Exploring textual backdoor attacks not only helps reveal the potential security risks of models, but also promotes innovation and development of defense mechanisms. Currently, most textual backdoor attack methods are based on a single trigger: for example, inserting specific content into the text as a trigger, or altering abstract text features to serve as a trigger. However, this single-trigger mode subjects existing backdoor attacks to certain limitations: either they are easily identified by existing defense strategies, or they have certain shortcomings in attack performance and in the construction of poisoned datasets. To solve these issues, a dual-trigger backdoor attack method is proposed in this paper. Specifically, we use two different attributes, syntax and mood (we use the subjunctive mood as an example in this article), as two different triggers. This makes our backdoor attack method resemble a double landmine that can carry two completely different trigger conditions simultaneously. Therefore, this method not only improves the flexibility of the trigger mode, but also enhances robustness against defense detection. A large number of experimental results show that this method significantly outperforms previous methods based on abstract features in attack performance, and achieves attack performance comparable to the insertion-based method (an almost 100% attack success rate). In addition, to further improve the attack performance, we also describe how to construct the poisoned dataset. The code and data of this paper can be obtained at https://github.com/HoyaAm/Double-Landmines.

Updated: 2025-07-18 02:44:07

标题: 基于双触发的隐形文本后门攻击

摘要: 后门攻击对文本大型语言模型构成重要的安全威胁。探索文本后门攻击不仅有助于揭示模型的潜在安全风险,还推动了防御机制的创新和发展。目前,大多数文本后门攻击方法都是基于单个触发器。例如,将特定内容插入文本作为触发器,或更改摘要文本特征以成为触发器。然而,采用这种单触发器模式使得现有的后门攻击受到一定的限制:要么它们很容易被现有的防御策略识别,要么它们在攻击性能和构建受害数据集方面存在一定的缺陷。为了解决这些问题,本文提出了一种双触发器后门攻击方法。具体地,我们使用两种不同的属性,语法和语气(本文中以虚拟语气为例),作为两种不同的触发器。这使得我们的后门攻击方法类似于双重地雷,可以同时具有完全不同的触发条件。因此,这种方法不仅提高了触发模式的灵活性,还增强了对防御检测的稳健性。大量实验结果表明,这种方法在攻击性能方面显著优于基于摘要特征的先前方法,并且在攻击性能上实现了与基于插入的方法相当的攻击性能(几乎100\%的攻击成功率)。此外,为了进一步提高攻击性能,我们还提供了受害数据集的构建方法。本文的代码和数据可在https://github.com/HoyaAm/Double-Landmines 获取。

更新时间: 2025-07-18 02:44:07

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2412.17531v3

Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?

Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans' innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
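
Calibration of self-verbalized confidence, the quantity at the heart of the three questions above, is commonly measured with expected calibration error (ECE). A minimal binned implementation of this standard metric (not code from the paper):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the weighted average gap between mean confidence and
    accuracy within equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        i = min(int(c * n_bins), n_bins - 1)
        bins[i].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# An overconfident model: always 90% confident, right half the time.
# All samples fall in one bin, so ECE is |0.9 - 0.5|, close to 0.4.
print(expected_calibration_error([0.9] * 10, [1, 0] * 5))
```

Under this metric, the overconfidence pattern the abstract reports (verbalized confidence above 85% even on incorrect responses) shows up directly as a large ECE, and an introspection step helps exactly when it shrinks the gap inside each bin.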

Updated: 2025-07-18 02:39:29

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2506.18183v3

Generative Multi-Target Cross-Domain Recommendation

Recently, there has been a surge of interest in Multi-Target Cross-Domain Recommendation (MTCDR), which aims to enhance recommendation performance across multiple domains simultaneously. Existing MTCDR methods primarily rely on domain-shared entities (e.g., users or items) to fuse and transfer cross-domain knowledge, which may be unavailable in non-overlapped recommendation scenarios. Some studies model user preferences and item features as domain-sharable semantic representations, which can be utilized to tackle the MTCDR task. Nevertheless, they often require extensive auxiliary data for pre-training. Developing more effective solutions for MTCDR remains an important area for further exploration. Inspired by recent advancements in generative recommendation, this paper introduces GMC, a generative paradigm-based approach for multi-target cross-domain recommendation. The core idea of GMC is to leverage semantically quantized discrete item identifiers as a medium for integrating multi-domain knowledge within a unified generative model. GMC first employs an item tokenizer to generate domain-shared semantic identifiers for each item, and then formulates item recommendation as a next-token generation task by training a domain-unified sequence-to-sequence model. To further leverage the domain information to enhance performance, we incorporate a domain-aware contrastive loss into the semantic identifier learning, and perform domain-specific fine-tuning on the unified recommender. Extensive experiments on five public datasets demonstrate the effectiveness of GMC compared to a range of baseline methods.
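
To make the next-token formulation concrete: once every item has a domain-shared semantic identifier (a short tuple of discrete codes), a user's cross-domain history flattens into a source token sequence and the next item into the target. The identifiers below are invented for illustration:

```python
# Hypothetical semantic identifiers: each item maps to a short tuple of
# discrete codes shared across domains (all codes here are made up).
ITEM_CODES = {
    "book_123":  (7, 2, 5),
    "movie_456": (7, 2, 9),   # shares a code prefix with the similar book
    "game_789":  (1, 4, 3),
}

def to_training_pair(history, target):
    """Flatten an interaction history into (source, target) token
    sequences for a domain-unified sequence-to-sequence recommender."""
    src = [code for item in history for code in ITEM_CODES[item]]
    tgt = list(ITEM_CODES[target])
    return src, tgt

src, tgt = to_training_pair(["book_123", "movie_456"], "game_789")
print(src)  # → [7, 2, 5, 7, 2, 9]
print(tgt)  # → [1, 4, 3]
```

Because semantically similar items share code prefixes, one generative model can transfer knowledge across domains without requiring overlapping users or items.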

Updated: 2025-07-18 02:34:05

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2507.12871v2

BreastSegNet: Multi-label Segmentation of Breast MRI

Breast MRI provides high-resolution imaging critical for breast cancer screening and preoperative staging. However, existing segmentation methods for breast MRI remain limited in scope, often focusing on only a few anatomical structures, such as fibroglandular tissue or tumors, and do not cover the full range of tissues seen in scans. This narrows their utility for quantitative analysis. In this study, we present BreastSegNet, a multi-label segmentation algorithm for breast MRI that covers nine anatomical labels: fibroglandular tissue (FGT), vessel, muscle, bone, lesion, lymph node, heart, liver, and implant. We manually annotated a large set of 1123 MRI slices capturing these structures with detailed review and correction from an expert radiologist. Additionally, we benchmark nine segmentation models, including U-Net, SwinUNet, UNet++, SAM, MedSAM, and nnU-Net with multiple ResNet-based encoders. Among them, nnU-Net ResEncM achieves the highest average Dice score of 0.694 across all labels. It performs especially well on heart, liver, muscle, FGT, and bone, with Dice scores exceeding 0.73, and approaching 0.90 for heart and liver. All model code and weights are publicly available, and we plan to release the data at a later date.
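
The reported Dice scores measure overlap between a predicted and a reference mask; in the multi-label setting the score is computed per label and averaged. A minimal sketch on toy binary masks (the masks are illustrative):

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * intersection / total if total else 1.0

pred = [1, 1, 0, 1, 0, 0]
target = [1, 0, 0, 1, 1, 0]
print(round(dice_score(pred, target), 3))  # → 0.667
```

A score of 1.0 means perfect overlap; the paper's per-label averages (e.g., near 0.90 for heart and liver) are computed over volumetric masks rather than toy lists.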

Updated: 2025-07-18 02:16:00

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.13604v1

GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention

We present GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning while preserving their ability to generate safe content. Existing safety mechanisms like safety checkers are easily bypassed, and concept erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bi-level optimization problem: the upper-level objective degrades the model's ability to represent harmful concepts using representation noising and maximization, while the lower-level objective preserves performance on safe data. GIFT achieves robust resistance to malicious fine-tuning while maintaining safe generative quality. Experimental results show that our method significantly impairs the model's ability to re-learn harmful concepts while maintaining performance on safe content, offering a promising direction for creating inherently safer generative models resistant to adversarial fine-tuning attacks.

Updated: 2025-07-18 01:47:07

Categories: cs.CR,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.13598v1

An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model

We study the problem of estimating Dynamic Discrete Choice (DDC) models, known in the machine learning literature as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL). The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition, a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
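
To see what circumventing transition estimation means in the simplest case: on an offline dataset of (state, action, reward, next state) tuples, one can minimize the empirical squared entropy-regularized Bellman residual directly. The sketch below uses a tabular Q, a made-up dataset, and a semi-gradient update for brevity; the paper's method targets neural parameterizations:

```python
import math

# Toy offline dataset of (state, action, reward, next_state) tuples.
DATA = [(0, 0, 1.0, 1), (0, 1, 0.0, 0), (1, 0, 0.5, 0), (1, 1, 2.0, 1)]
BETA = 0.9                     # discount factor (illustrative)
Q = [[0.0, 0.0], [0.0, 0.0]]   # tabular Q over 2 states x 2 actions

def soft_value(s):
    # Entropy-regularized value: log-sum-exp of Q over actions.
    return math.log(sum(math.exp(q) for q in Q[s]))

def bellman_residual(s, a, r, s2):
    # Residual of Q(s,a) = r + beta * V(s'); no transition model needed,
    # since each sampled s' enters the target directly.
    return Q[s][a] - (r + BETA * soft_value(s2))

# Drive the empirical mean squared residual toward zero by
# semi-gradient descent on the Q table.
for _ in range(1000):
    for s, a, r, s2 in DATA:
        Q[s][a] -= 0.2 * bellman_residual(s, a, r, s2)

print(max(abs(bellman_residual(*t)) for t in DATA) < 1e-4)  # → True
```

The PL property cited in the abstract is what guarantees that this kind of gradient descent on the residual converges globally, even though the objective is not convex in general.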

Updated: 2025-07-18 01:39:40

Categories: cs.LG,cs.AI,econ.EM

Download: http://arxiv.org/abs/2502.14131v4

BLAST: A Stealthy Backdoor Leverage Attack against Cooperative Multi-Agent Deep Reinforcement Learning based Systems

Recent studies have shown that cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, the backdoored agent performs malicious actions that lead to failures or attacker-specified goals. However, existing backdoor attacks suffer from several issues, e.g., instant trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor leverage attack against c-MADRL, BLAST, which attacks the entire multi-agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manual-injected fixed visual patterns or instant status, and control the period over which malicious actions are performed. This method can guarantee the stealthiness and practicality of BLAST. Secondly, we hack the original reward function of the backdoor agent via unilateral guidance to inject BLAST, so as to achieve the "leverage attack" effect that can pry open the entire multi-agent system via a single backdoor agent. We evaluate BLAST against 3 classic c-MADRL algorithms (VDN, QMIX, and MAPPO) in 2 popular c-MADRL environments (SMAC and Pursuit), and against 2 existing defense mechanisms. The experimental results demonstrate that BLAST can achieve a high attack success rate while maintaining a low clean performance variance rate.

Updated: 2025-07-18 01:31:33

Categories: cs.AI,cs.CR,cs.LG

Download: http://arxiv.org/abs/2501.01593v2

Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation

Automatically generating natural, diverse and rhythmic human dance movements driven by music is vital for virtual reality and film industries. However, generating dance that naturally follows music remains a challenge, as existing methods lack proper beat alignment and exhibit unnatural motion dynamics. In this paper, we propose Danceba, a novel framework that leverages gating mechanism to enhance rhythm-aware feature representation for music-driven dance generation, which achieves highly aligned dance poses with enhanced rhythmic sensitivity. Specifically, we introduce Phase-Based Rhythm Extraction (PRE) to precisely extract rhythmic information from musical phase data, capitalizing on the intrinsic periodicity and temporal structures of music. Additionally, we propose Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, ensuring that dance movements closely follow the musical rhythm. We also introduce Parallel Mamba Motion Modeling (PMMM) architecture to separately model upper and lower body motions along with musical features, thereby improving the naturalness and diversity of generated dance movements. Extensive experiments confirm that Danceba outperforms state-of-the-art methods, achieving significantly better rhythmic alignment and motion diversity. Project page: https://danceba.github.io/ .

Updated: 2025-07-18 01:29:23

Categories: cs.MM,cs.AI,cs.CV,cs.SD,eess.AS

Download: http://arxiv.org/abs/2503.17340v2

STACK: Adversarial Attacks on LLM Safeguard Pipelines

Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
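
The pipeline under attack has the generic shape sketched below: an input classifier in front of the model and an output classifier behind it, with a refusal returned if either flags the exchange. The classifiers and model here are trivial stand-ins, not the paper's few-shot-prompted ones:

```python
REFUSAL = "I can't help with that."

def guarded_generate(prompt, model, input_safe, output_safe):
    """Layered safeguard pipeline: screen the input, generate,
    then screen the (prompt, response) pair before returning it."""
    if not input_safe(prompt):
        return REFUSAL
    response = model(prompt)
    if not output_safe(prompt, response):
        return REFUSAL
    return response

# Trivial stand-ins for demonstration:
model = lambda p: "Here is a haiku about " + p
input_safe = lambda p: "weapon" not in p.lower()
output_safe = lambda p, r: "weapon" not in r.lower()

print(guarded_generate("autumn", model, input_safe, output_safe))
print(guarded_generate("a weapon", model, input_safe, output_safe))  # → refusal
```

A staged attack, in this framing, first crafts inputs that slip past the input classifier, then shapes responses that slip past the output classifier, one layer at a time.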

Updated: 2025-07-18 01:26:48

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.24068v2

ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle

Large Language Models (LLMs) have shown strong performance on programming tasks, but can they generate code the way real students write it: imperfect, iterative, and stylistically diverse? We present ParaStudent, a systematic study of LLM-based "student-like" code generation in an introductory programming course setting. Using a dataset of timestamped student submissions across multiple semesters, we design low- and high-resolution experiments to model student progress and evaluate code outputs along semantic, functional, and stylistic dimensions. Our results show that fine-tuning significantly improves alignment with real student trajectories and captures error patterns, incremental improvements, and stylistic variations more faithfully. This study shows that modeling realistic student code requires capturing learning dynamics through context-aware generation, temporal modeling, and multi-dimensional evaluation. Code for experiments and evaluation is available at https://github.com/mmiroyan/ParaStudent.

Updated: 2025-07-18 01:02:16

Categories: cs.CY,cs.AI,cs.SE

Download: http://arxiv.org/abs/2507.12674v2

FuSeFL: Fully Secure and Scalable Cross-Silo Federated Learning

Federated Learning (FL) enables collaborative model training without centralizing client data, making it attractive for privacy-sensitive domains. While existing approaches employ cryptographic techniques such as homomorphic encryption, differential privacy, or secure multiparty computation to mitigate inference attacks (including model inversion, membership inference, and gradient leakage), they often suffer from high computational, communication, or memory overheads. Moreover, many methods overlook the confidentiality of the global model itself, which may be proprietary and sensitive. These challenges limit the practicality of secure FL, especially in cross-silo deployments involving large datasets and strict compliance requirements. We present FuSeFL, a fully secure and scalable FL scheme designed for cross-silo settings. FuSeFL decentralizes training across client pairs using lightweight secure multiparty computation (MPC), while confining the server's role to secure aggregation. This design eliminates server bottlenecks, avoids data offloading, and preserves full confidentiality of data, model, and updates throughout training. FuSeFL defends against inference threats, achieves up to 95% lower communication latency and 50% lower server memory usage, and improves accuracy over prior secure FL solutions, demonstrating strong security and efficiency at scale.
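
The aggregation-only server role rests on standard MPC primitives; additive secret sharing is the simplest such building block. The sketch below is a generic illustration (not FuSeFL's actual two-party protocol), with client updates encoded as small integers:

```python
import random

PRIME = 2**61 - 1  # modulus for additive shares (choice is illustrative)

def share(value, n):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Each client secret-shares its integer-encoded update; no single
# share reveals anything about the update it came from.
updates = [5, 11, 2]
all_shares = [share(u, 3) for u in updates]

# The aggregator only ever combines shares, recovering just the sum:
aggregate = sum(s for shares in all_shares for s in shares) % PRIME
print(aggregate)  # → 18, the sum of updates
```

Real systems share each coordinate of a quantized model update this way; FuSeFL's pairwise MPC additionally keeps the model itself hidden from the server.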

Updated: 2025-07-18 00:50:44

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2507.13591v1

A million-scale dataset and generalizable foundation model for nanomaterial-protein interactions

Unlocking the potential of nanomaterials in medicine and environmental science hinges on understanding their interactions with proteins, a complex decision space where AI is poised to make a transformative impact. However, progress has been hindered by limited datasets and the restricted generalizability of existing models. Here, we propose NanoPro-3M, the largest nanomaterial-protein interaction dataset to date, comprising over 3.2 million samples and 37,000 unique proteins. Leveraging this, we present NanoProFormer, a foundational model that predicts nanomaterial-protein affinities through multimodal representation learning, demonstrating strong generalization and robustness to missing features and to unseen nanomaterials or proteins. We show that multimodal modeling significantly outperforms single-modality approaches and identifies key determinants of corona formation. Furthermore, we demonstrate its applicability to a range of downstream tasks through zero-shot inference and fine-tuning. Together, this work establishes a solid foundation for high-performance and generalized prediction of nanomaterial-protein interaction endpoints, reducing experimental reliance and accelerating various in vitro applications.

Updated: 2025-07-18 00:00:52

Categories: cs.LG,cond-mat.mtrl-sci,cs.AI,cs.CE,q-bio.BM,I.6.5; J.3; I.5.4

Download: http://arxiv.org/abs/2507.14245v1

By Xinhai (Sean) Zou.