    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        


HyperAIRI: a plug-and-play algorithm for precise hyperspectral image reconstruction in radio interferometry

Next-generation radio-interferometric (RI) telescopes require imaging algorithms capable of forming high-resolution, high-dynamic-range images from large data volumes spanning wide frequency bands. Recently, AIRI, a plug-and-play (PnP) approach built on the forward-backward (FB) algorithmic structure, has demonstrated state-of-the-art performance in monochromatic RI imaging by alternating a data-fidelity step with a regularisation step via learned denoisers. In this work, we introduce HyperAIRI, its hyperspectral extension, underpinned by learned hyperspectral denoisers enforcing a power-law spectral model. For each spectral channel, the HyperAIRI denoiser takes as input the channel's current image estimate, alongside estimates of its two immediate neighbouring channels and the spectral index map, and outputs its associated denoised image. To ensure convergence of HyperAIRI, the denoisers are trained with a Jacobian regularisation enforcing non-expansiveness. To accommodate varying dynamic ranges, we assemble a shelf of pre-trained denoisers, each tailored to a specific dynamic range. At each HyperAIRI iteration, the spectral channels of the target image cube are updated in parallel using dynamic-range-matched denoisers from the pre-trained shelf. The denoisers are also endowed with a spatial image faceting functionality, enabling scalability to varied image sizes. Additionally, we formally introduce Hyper-uSARA, a variant of the optimisation-based algorithm HyperSARA that promotes joint sparsity across spectral channels via the l2,1-norm and also adopts FB. We evaluate HyperAIRI's performance on simulated and real observations. We showcase its superior performance compared to its optimisation-based counterpart Hyper-uSARA, CLEAN's hyperspectral variant in WSClean, and the monochromatic imaging algorithms AIRI and uSARA.
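
The PnP forward-backward structure that AIRI (and, per channel, HyperAIRI) builds on alternates a gradient step on the data-fidelity term with a denoising step. A minimal sketch of that iteration, assuming a toy linear measurement operator and with soft-thresholding standing in for the learned denoiser (the paper's denoisers are trained networks operating under an RI measurement model, so everything below is illustrative):

```python
import numpy as np

def forward_backward_pnp(y, A, denoise, step, n_iter=500):
    """Generic plug-and-play forward-backward iteration:
    x <- denoise( x - step * A^T (A x - y) )."""
    x = A.T @ y
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)        # gradient of the data-fidelity term
        x = denoise(x - step * grad)    # regularisation via a denoiser
    return x

# Toy problem: recover a sparse signal from noisy linear measurements,
# with soft-thresholding standing in for a learned denoiser.
rng = np.random.default_rng(0)
n, m = 64, 32
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
y = A @ x_true + 0.01 * rng.standard_normal(m)

soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.02, 0.0)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # step <= 1/L for convergence
x_hat = forward_backward_pnp(y, A, soft, step)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

With a learned denoiser plugged in for `soft`, the same loop is the monochromatic AIRI structure; HyperAIRI's hyperspectral denoiser would additionally take the neighbouring channels and the spectral index map as inputs.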

Updated: 2025-10-16 23:49:20

Categories: astro-ph.IM,cs.LG,eess.IV

Download: http://arxiv.org/abs/2510.15198v1

Low-Rank Adaptation of Neural Fields

Processing visual data often involves small adjustments or sequences of changes, e.g., image filtering, surface smoothing, and animation. While established graphics techniques like normal mapping and video compression exploit redundancy to encode such small changes efficiently, the problem of encoding small changes to neural fields -- neural network parameterizations of visual or physical functions -- has received less attention. We propose a parameter-efficient strategy for updating neural fields using low-rank adaptations (LoRA). LoRA, a method from the parameter-efficient fine-tuning LLM community, encodes small updates to pre-trained models with minimal computational overhead. We adapt LoRA for instance-specific neural fields, avoiding the need for large pre-trained models and yielding lightweight updates. We validate our approach with experiments in image filtering, geometry editing, video compression, and energy-based editing, demonstrating its effectiveness and versatility for representing neural field updates.
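
The core LoRA mechanic the paper adapts can be sketched in a few lines: a frozen weight matrix is augmented with a trainable low-rank product, so an edit costs r(d_in + d_out) parameters instead of d_in * d_out. All names and sizes below are illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen weight of one layer of a (hypothetical) neural field.
d_in, d_out, rank = 16, 16, 2
W0 = rng.standard_normal((d_out, d_in))

# LoRA: encode the update as a rank-r product B @ A; only A and B train.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))            # zero init => the edit starts as a no-op

def layer(x, A, B):
    return (W0 + B @ A) @ x            # adapted forward pass

x = rng.standard_normal(d_in)
baseline = layer(x, A, B)              # identical to W0 @ x before training

# Storage: a full update needs d_out*d_in numbers; LoRA needs r*(d_in+d_out).
full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(full_params, lora_params)        # rank-2 update: 64 numbers vs 256
```

Training `A` and `B` against an edited target (a filtered image, a smoothed surface, the next video frame) then yields the lightweight update described in the abstract.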

Updated: 2025-10-16 23:38:18

Categories: cs.GR,cs.LG

Download: http://arxiv.org/abs/2504.15933v2

SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow

Prompt-based models have demonstrated impressive prompt-following capability at image editing tasks. However, the models still struggle with following detailed editing prompts or performing local edits. Specifically, global image quality often deteriorates immediately after a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality during more than 100 editing steps, while keeping the unedited regions intact. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.

Updated: 2025-10-16 23:37:21

Categories: cs.GR,cs.CV,cs.LG

Download: http://arxiv.org/abs/2504.09697v2

Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: https://github.com/jlwu002/sr1.

Updated: 2025-10-16 23:19:28

Categories: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2510.15191v1

OCR-APT: Reconstructing APT Stories from Audit Logs using Subgraph Anomaly Detection and LLMs

Advanced Persistent Threats (APTs) are stealthy cyberattacks that often evade detection in system-level audit logs. Provenance graphs model these logs as connected entities and events, revealing relationships that are missed by linear log representations. Existing systems apply anomaly detection to these graphs but often suffer from high false positive rates and coarse-grained alerts. Their reliance on node attributes like file paths or IPs leads to spurious correlations, reducing detection robustness and reliability. To fully understand an attack's progression and impact, security analysts need systems that can generate accurate, human-like narratives of the entire attack. To address these challenges, we introduce OCR-APT, a system for APT detection and reconstruction of human-like attack stories. OCR-APT uses Graph Neural Networks (GNNs) for subgraph anomaly detection, learning behavior patterns around nodes rather than fragile attributes such as file paths or IPs. This approach leads to a more robust anomaly detection. It then iterates over detected subgraphs using Large Language Models (LLMs) to reconstruct multi-stage attack stories. Each stage is validated before proceeding, reducing hallucinations and ensuring an interpretable final report. Our evaluations on the DARPA TC3, OpTC, and NODLINK datasets show that OCR-APT outperforms state-of-the-art systems in both detection accuracy and alert interpretability. Moreover, OCR-APT reconstructs human-like reports that comprehensively capture the attack story.
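
The provenance-graph view of an audit log that OCR-APT operates on can be illustrated on a toy log: entities become nodes, events become edges, and multi-stage behaviour (download, then execute, then read a sensitive file) appears as a connected path rather than as isolated log lines. The events and helper names below are hypothetical, not the paper's pipeline:

```python
from collections import defaultdict

# Toy audit log: (subject, action, object) events.
events = [
    ("bash", "exec", "curl"),
    ("curl", "connect", "10.0.0.5:443"),
    ("curl", "write", "/tmp/payload"),
    ("bash", "exec", "/tmp/payload"),
    ("/tmp/payload", "read", "/etc/shadow"),
]

def build_provenance_graph(events):
    """Model the log as a graph of entities linked by labelled events, so
    multi-step relationships show up as connected paths."""
    adj = defaultdict(list)
    for subj, action, obj in events:
        adj[subj].append((action, obj))
    return adj

def paths_from(adj, start, depth=3):
    """Enumerate causal chains rooted at an entity, up to a given depth."""
    if depth == 0 or start not in adj:
        return [[start]]
    out = []
    for action, obj in adj[start]:
        for tail in paths_from(adj, obj, depth - 1):
            out.append([start, action] + tail)
    return out

g = build_provenance_graph(events)
chains = paths_from(g, "bash")
print(chains)
```

OCR-APT's contribution sits on top of such a graph: a GNN flags anomalous subgraphs by learned behaviour patterns (not raw paths or IPs), and an LLM then narrates the flagged stages.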

Updated: 2025-10-16 23:14:03

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2510.15188v1

MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation

A core challenge for autonomous LLM agents in collaborative settings is balancing robust privacy understanding and preservation alongside task efficacy. Existing privacy benchmarks only focus on simplistic, single-turn interactions where private information can be trivially omitted without affecting task outcomes. In this paper, we introduce MAGPIE (Multi-AGent contextual PrIvacy Evaluation), a novel benchmark of 200 high-stakes tasks designed to evaluate privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios. MAGPIE integrates private information as essential for task resolution, forcing agents to balance effective collaboration with strategic information control. Our evaluation reveals that state-of-the-art agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage, with Gemini 2.5-Pro leaking up to 50.7% and GPT-5 up to 35.1% of the sensitive information even when explicitly instructed not to. Moreover, these agents struggle to achieve consensus or task completion and often resort to undesirable behaviors such as manipulation and power-seeking (e.g., Gemini 2.5-Pro demonstrating manipulation in 38.2% of the cases). These findings underscore that current LLM agents lack robust privacy understanding and are not yet adequately aligned to simultaneously preserve privacy and maintain effective collaboration in complex environments.

Updated: 2025-10-16 23:12:12

Categories: cs.CR,cs.CL

Download: http://arxiv.org/abs/2510.15186v1

Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment

Automated ultrasonic weld inspection remains a significant challenge in the nondestructive evaluation (NDE) community due to factors such as limited training data (owing to the complexity of curating experimental specimens or high-fidelity simulations) and the environmental volatility of many industrial settings (resulting in the corruption of on-the-fly measurements). Thus, an end-to-end machine learning (ML) workflow for acoustic weld inspection in realistic (i.e., industrial) settings has remained an elusive goal. This work addresses the challenges of data curation and signal corruption by proposing a workflow consisting of a reduced-order modeling scheme, diffusion-based distribution alignment, and U-Net-based segmentation and inversion. A reduced-order Helmholtz model based on Lamb wave theory is used to generate a comprehensive dataset over varying weld heterogeneity and crack defects. The relatively inexpensive low-order solutions provide a robust training dataset for inversion models, which are refined through a transfer learning stage using a limited set of full 3D elastodynamic simulations. To handle out-of-distribution (OOD) real-world measurements with varying and unpredictable noise distributions, i.e., Laser Doppler Vibrometry (LDV) scans, guided diffusion produces in-distribution representations of OOD experimental LDV scans, which are subsequently processed by the inversion models. This integrated framework provides an end-to-end solution for automated weld inspection on real data.

Updated: 2025-10-16 23:07:28

Categories: cs.LG,physics.comp-ph

Download: http://arxiv.org/abs/2510.13023v2

PowerChain: A Verifiable Agentic AI System for Automating Distribution Grid Analyses

Rapid electrification and decarbonization are increasing the complexity of distribution grid (DG) operation and planning, necessitating advanced computational analyses to ensure reliability and resilience. These analyses depend on disparate workflows comprising complex models, function calls, and data pipelines that require substantial expert knowledge and remain difficult to automate. Workforce and budget constraints further limit utilities' ability to apply such analyses at scale. To address this gap, we build PowerChain, an agentic system capable of autonomously performing complex grid analyses. Existing agentic AI systems are typically developed in a bottom-up manner with customized context for predefined analysis tasks; therefore, they do not generalize to tasks that the agent has never seen. In comparison, to generalize to unseen DG analysis tasks, PowerChain dynamically generates structured context by leveraging supervisory signals from self-contained power systems tools (e.g., GridLAB-D) and an optimized set of expert-annotated and verified reasoning trajectories. For complex DG tasks defined in natural language, empirical results on real utility data demonstrate that PowerChain achieves up to a 144% improvement in performance over baselines.

Updated: 2025-10-16 22:58:53

Categories: cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.17094v2

Game mechanics for cyber-harm awareness in the metaverse

Educating children and young people to be safe online is essential, especially as the metaverse, a next-generation internet blending immersive technologies, promises to reshape their interactions and amplify their experiences. While virtual reality offers fully immersive, highly interactive, and multi-sensory engagement, it also heightens cyber harm risks for young or vulnerable users. To address this, the CyberNinjas VR experience was developed to educate children aged 8 to 16 on safe metaverse behaviours, providing clear referral steps for harmful interactions. Understanding user engagement in metaverse gaming will aid the design of future VR environments which prioritize safety and inclusivity. This project analyses CyberNinjas to understand how game mechanics can foster cyber-safe behaviours.

Updated: 2025-10-16 22:53:04

Categories: cs.MM,cs.CR,cs.CY

Download: http://arxiv.org/abs/2510.15180v1

An Advanced Two-Stage Model with High Sensitivity and Generalizability for Prediction of Hip Fracture Risk Using Multiple Datasets

Hip fractures are a major cause of disability, mortality, and healthcare burden in older adults, underscoring the need for early risk assessment. However, commonly used tools such as the DXA T-score and FRAX often lack sensitivity and miss individuals at high risk, particularly those without prior fractures or with osteopenia. To address this limitation, we propose a sequential two-stage model that integrates clinical and imaging information to improve prediction accuracy. Using data from the Osteoporotic Fractures in Men Study (MrOS), the Study of Osteoporotic Fractures (SOF), and the UK Biobank, Stage 1 (Screening) employs clinical, demographic, and functional variables to estimate baseline risk, while Stage 2 (Imaging) incorporates DXA-derived features for refinement. The model was rigorously validated through internal and external testing, showing consistent performance and adaptability across cohorts. Compared to T-score and FRAX, the two-stage framework achieved higher sensitivity and reduced missed cases, offering a cost-effective and personalized approach for early hip fracture risk assessment. Keywords: Hip Fracture, Two-Stage Model, Risk Prediction, Sensitivity, DXA, FRAX
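
The sequential gating logic of such a two-stage model can be sketched as follows; the thresholds, weights, and score combination below are purely illustrative, not the fitted model from the paper:

```python
def two_stage_risk(clinical_score, imaging_score=None, t1=0.3, t2=0.5):
    """Sequential rule: Stage 1 screens on cheap clinical/demographic
    features; only patients above a low threshold t1 proceed to Stage 2,
    where imaging (e.g. DXA-derived) features refine the decision.
    Thresholds and the 50/50 combination are illustrative."""
    if clinical_score < t1:
        return "low risk"            # screened out; no imaging needed
    if imaging_score is None:
        return "refer for imaging"   # Stage 1 positive -> acquire imaging
    combined = 0.5 * clinical_score + 0.5 * imaging_score
    return "high risk" if combined >= t2 else "low risk"

print(two_stage_risk(0.2))                       # low risk
print(two_stage_risk(0.6))                       # refer for imaging
print(two_stage_risk(0.6, imaging_score=0.7))    # high risk
```

Keeping t1 low is what buys sensitivity: borderline patients are escalated to the imaging stage instead of being dismissed, which is the failure mode the abstract attributes to T-score/FRAX cutoffs.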

Updated: 2025-10-16 22:44:51

Categories: cs.LG,physics.med-ph

Download: http://arxiv.org/abs/2510.15179v1

A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies

Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain underdiagnosed and poorly characterized due to their low prevalence and limited clinician familiarity. Computational phenotyping offers a scalable approach to improving rare disease detection, but algorithm development is hindered by the scarcity of high-quality labeled data for training. Expert-labeled datasets from chart reviews and registries are clinically accurate but limited in scope and availability, whereas labels derived from electronic health records (EHRs) provide broader coverage but are often noisy or incomplete. To address these challenges, we propose WEST (WEakly Supervised Transformer for rare disease phenotyping and subphenotyping from EHRs), a framework that combines routinely collected EHR data with a limited set of expert-validated cases and controls to enable large-scale phenotyping. At its core, WEST employs a weakly supervised transformer model trained on extensive probabilistic silver-standard labels - derived from both structured and unstructured EHR features - that are iteratively refined during training to improve model calibration. We evaluate WEST on two rare pulmonary diseases using EHR data from Boston Children's Hospital and show that it outperforms existing methods in phenotype classification, identification of clinically meaningful subphenotypes, and prediction of disease progression. By reducing reliance on manual annotation, WEST enables data-efficient rare disease phenotyping that improves cohort definition, supports earlier and more accurate diagnosis, and accelerates data-driven discovery for the rare disease community.

Updated: 2025-10-16 22:43:35

Categories: cs.LG,cs.CL,stat.ML

Download: http://arxiv.org/abs/2507.02998v2

Finding geodesics with the Deep Ritz method

Geodesic problems involve computing trajectories between prescribed initial and final states to minimize a user-defined measure of distance, cost, or energy. They arise throughout physics and engineering -- for instance, in determining optimal paths through complex environments, modeling light propagation in refractive media, and the study of spacetime trajectories in control theory and general relativity. Despite their ubiquity, the scientific machine learning (SciML) community has given relatively little attention to investigating its methods in the context of these problems. In this work, we argue that given their simple geometry, variational structure, and natural nonlinearity, geodesic problems are particularly well-suited for the Deep Ritz method. We substantiate this claim with three numerical examples drawn from path planning, optics, and solid mechanics. Our goal is not to provide an exhaustive study of geodesic problems, but rather to identify a promising application of the Deep Ritz method and a fruitful direction for future SciML research.
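
The variational setup is easy to illustrate without a neural network: parameterise a path between fixed endpoints, form the discrete geodesic energy (here for an isotropic "refractive index" metric, the integral of n(x) ds), and descend it with respect to the path parameters. The Deep Ritz method replaces the waypoint parameterisation below with a neural network; everything else is analogous. All quantities are illustrative:

```python
import numpy as np

def path_energy(pts, n_field):
    """Discrete geodesic energy of a polyline: sum of n(midpoint) times
    segment length, approximating the integral of n(x) ds."""
    seg = np.diff(pts, axis=0)
    mids = 0.5 * (pts[1:] + pts[:-1])
    return np.sum(n_field(mids) * np.linalg.norm(seg, axis=1))

def relax_path(p0, p1, n_field, n_pts=21, n_iter=1000, lr=0.005):
    """Ritz-style minimisation: parameterise the path by its interior
    waypoints (endpoints fixed) and descend the energy with central
    finite-difference gradients."""
    t = np.linspace(0.0, 1.0, n_pts)[:, None]
    pts = (1 - t) * p0 + t * p1          # initialise with the straight chord
    eps = 1e-5
    for _ in range(n_iter):
        grad = np.zeros_like(pts)
        for i in range(1, n_pts - 1):    # endpoints stay fixed
            for d in range(2):
                pts[i, d] += eps
                e_plus = path_energy(pts, n_field)
                pts[i, d] -= 2 * eps
                e_minus = path_energy(pts, n_field)
                pts[i, d] += eps
                grad[i, d] = (e_plus - e_minus) / (2 * eps)
        pts -= lr * grad
    return pts

# Medium with a high-index bump just above the chord: the relaxed path
# bends below it, trading extra length for lower index along the way.
bump = np.array([0.0, 0.2])
n_field = lambda x: 1.0 + 2.0 * np.exp(-8.0 * ((x - bump) ** 2).sum(-1))
path = relax_path(np.array([-1.0, 0.0]), np.array([1.0, 0.0]), n_field)
print(path[10])
```

This is the optics example in miniature: the same energy with a neural-network path and automatic differentiation is the Deep Ritz formulation the abstract advocates.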

Updated: 2025-10-16 22:30:59

Categories: cs.LG

Download: http://arxiv.org/abs/2510.15177v1

A simple mean field model of feature learning

Feature learning (FL), where neural networks adapt their internal representations during training, remains poorly understood. Using methods from statistical physics, we derive a tractable, self-consistent mean-field (MF) theory for the Bayesian posterior of two-layer non-linear networks trained with stochastic gradient Langevin dynamics (SGLD). At infinite width, this theory reduces to kernel ridge regression, but at finite width it predicts a symmetry breaking phase transition where networks abruptly align with target functions. While the basic MF theory provides theoretical insight into the emergence of FL in the finite-width regime, semi-quantitatively predicting the onset of FL with noise or sample size, it substantially underestimates the improvements in generalisation after the transition. We trace this discrepancy to a key mechanism absent from the plain MF description: self-reinforcing input feature selection. Incorporating this mechanism into the MF theory allows us to quantitatively match the learning curves of SGLD-trained networks and provides mechanistic insight into FL.
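
The infinite-width baseline the theory reduces to is kernel ridge regression, which is essentially two lines: solve (K + lam*I) alpha = y, then predict with the cross-kernel. A self-contained sketch with an RBF kernel (the paper's limit involves the network's own kernel; the RBF choice and all parameters here are illustrative):

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(X, y, X_test, lam, gamma):
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # ridge-regularised dual weights
    return rbf_kernel(X_test, X, gamma) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, (40, 1))
y = np.sin(3.0 * X[:, 0])
X_test = np.linspace(-1.0, 1.0, 9)[:, None]
pred = krr_fit_predict(X, y, X_test, lam=1e-4, gamma=5.0)
print(np.max(np.abs(pred - np.sin(3.0 * X_test[:, 0]))))
```

The paper's point is precisely what this baseline misses at finite width: the kernel is fixed, so no feature learning or symmetry-breaking alignment with the target can occur.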

Updated: 2025-10-16 22:28:44

Categories: cs.LG

Download: http://arxiv.org/abs/2510.15174v1

Beyond the Voice: Inertial Sensing of Mouth Motion for High Security Speech Verification

Voice interfaces are increasingly used in high-stakes domains such as mobile banking, smart home security, and hands-free healthcare. Meanwhile, modern generative models have made high quality voice forgeries inexpensive and easy to create, eroding confidence in voice authentication alone. To strengthen protection against such attacks, we present a second authentication factor that combines acoustic evidence with the unique motion patterns of a speaker's lower face. By placing lightweight inertial sensors around the mouth to capture mouth opening and evolving lower facial geometry, our system records a distinct motion signature with strong discriminative power across individuals. We built a prototype and recruited 43 participants to evaluate the system under four conditions: seated, walking on level ground, walking on stairs, and speaking with different language backgrounds (native vs. non-native English). Across all scenarios, our approach consistently achieved a median equal error rate (EER) of 0.01 or lower, indicating that mouth movement data remain robust under variations in gait, posture, and spoken language. We discuss specific use cases where this second line of defense could provide tangible security benefits to voice authentication systems.

Updated: 2025-10-16 22:26:18

Categories: cs.CR

Download: http://arxiv.org/abs/2510.15173v1

AI Guided Accelerator For Search Experience

Effective query reformulation is pivotal in narrowing the gap between a user's exploratory search behavior and the identification of relevant products in e-commerce environments. While traditional approaches predominantly model query rewrites as isolated pairs, they often fail to capture the sequential and transitional dynamics inherent in real-world user behavior. In this work, we propose a novel framework that explicitly models transitional queries--intermediate reformulations occurring during the user's journey toward their final purchase intent. By mining structured query trajectories from eBay's large-scale user interaction logs, we reconstruct query sequences that reflect shifts in intent while preserving semantic coherence. This approach allows us to model a user's shopping funnel, where mid-journey transitions reflect exploratory behavior and intent refinement. Furthermore, we incorporate generative Large Language Models (LLMs) to produce semantically diverse and intent-preserving alternative queries, extending beyond what can be derived through collaborative filtering alone. These reformulations can be leveraged to populate Related Searches or to power intent-clustered carousels on the search results page, enhancing both discovery and engagement. Our contributions include (i) the formal identification and modeling of transitional queries, (ii) the introduction of a structured query sequence mining pipeline for intent flow understanding, and (iii) the application of LLMs for scalable, intent-aware query expansion. Empirical evaluation demonstrates measurable gains in conversion and engagement metrics compared to the existing Related Searches module, validating the effectiveness of our approach in real-world e-commerce settings.
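
The trajectory-mining step can be sketched simply: group a user's queries in session order and keep sequences that terminate in a purchase, so that the interior queries are the transitional reformulations. The toy log and helper names below are illustrative, not eBay's actual pipeline:

```python
from collections import defaultdict

# Toy interaction log: (user, timestamp, query, purchased).
log = [
    ("u1", 0, "running shoes", False),
    ("u1", 1, "nike running shoes", False),
    ("u1", 2, "nike pegasus 40 mens", True),
    ("u2", 0, "laptop", False),
    ("u2", 1, "gaming laptop rtx", True),
]

def mine_trajectories(log):
    """Group queries per user, order by time, and keep reformulation
    trajectories that end in a purchase; the interior queries are the
    'transitional' reformulations."""
    by_user = defaultdict(list)
    for user, t, q, bought in log:
        by_user[user].append((t, q, bought))
    trajectories = []
    for user, events in by_user.items():
        events.sort()
        if events[-1][2]:                      # trajectory ends in a purchase
            trajectories.append([q for _, q, _ in events])
    return trajectories

trajs = mine_trajectories(log)
transitional = [q for t in trajs for q in t[1:-1]]  # mid-journey reformulations
print(trajs)
print(transitional)
```

In the paper's framework, these mined transitions seed the funnel model, and an LLM then generates additional intent-preserving reformulations beyond what the logs alone contain.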

Updated: 2025-10-16 22:14:29

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2508.05649v2

Loss-Complexity Landscape and Model Structure Functions

We develop a framework for dualizing the Kolmogorov structure function $h_x(\alpha)$, which then allows using computable complexity proxies. We establish a mathematical analogy between information-theoretic constructs and statistical mechanics, introducing a suitable partition function and free energy functional. We explicitly prove the Legendre-Fenchel duality between the structure function and free energy, showing detailed balance of the Metropolis kernel, and interpret acceptance probabilities as information-theoretic scattering amplitudes. A susceptibility-like variance of model complexity is shown to peak precisely at loss-complexity trade-offs interpreted as phase transitions. Practical experiments with linear and tree-based regression models verify these theoretical predictions, explicitly demonstrating the interplay between the model complexity, generalization, and overfitting threshold.
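
The Metropolis ingredient is standard and easy to sketch: sample loss-complexity trade-offs from a Boltzmann-like distribution, accepting a proposed move with probability min(1, exp(-beta * Delta)). The toy objective below (a loss term that falls with model size k plus a linear complexity term) is purely illustrative, not the paper's structure function:

```python
import numpy as np

def metropolis(objective, proposal, x0, beta, n_steps, rng):
    """Metropolis sampling of p(x) proportional to exp(-beta*objective(x)):
    accept a proposed move with probability min(1, exp(-beta*delta))."""
    x, fx = x0, objective(x0)
    samples = []
    for _ in range(n_steps):
        x_new = proposal(x, rng)
        f_new = objective(x_new)
        if rng.random() < np.exp(-beta * (f_new - fx)):
            x, fx = x_new, f_new           # move accepted
        samples.append(x)
    return np.array(samples)

# Toy loss-complexity trade-off over model sizes k = 0..20: the loss term
# falls with k, the complexity term grows linearly, and their sum is
# minimised at k = 3.
obj = lambda k: 10.0 / (k + 1) + 0.6 * k
# Random-walk proposal on the integers, clipped to [0, 20] (the slight
# asymmetry at the boundary is harmless for this sketch).
prop = lambda k, rng: int(np.clip(k + rng.choice([-1, 1]), 0, 20))

rng = np.random.default_rng(3)
samples = metropolis(obj, prop, x0=0, beta=5.0, n_steps=5000, rng=rng)
print(np.bincount(samples, minlength=21).argmax())
```

The paper's susceptibility-like quantity corresponds to the variance of the complexity term under such a chain; sweeping beta and watching that variance peak is how the phase-transition picture is probed empirically.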

Updated: 2025-10-16 22:13:58

Categories: cs.IT,cs.AI,cs.LG,math-ph,math.IT,math.MP,I.2.2; I.2.6

Download: http://arxiv.org/abs/2507.13543v4

Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization

Reinforcement Learning (RL) enables agents to learn optimal decision-making strategies through interaction with an environment, yet training from scratch on complex tasks can be highly inefficient. Transfer learning (TL), widely successful in large language models (LLMs), offers a promising direction for enhancing RL efficiency by leveraging pre-trained models. This paper investigates policy transfer, a TL approach that initializes learning in a target RL task using a policy from a related source task, in the context of continuous-time linear quadratic regulators (LQRs) with entropy regularization. We provide the first theoretical proof of policy transfer for continuous-time RL, proving that a policy optimal for one LQR serves as a near-optimal initialization for closely related LQRs, while preserving the original algorithm's convergence rate. Furthermore, we introduce a novel policy learning algorithm for continuous-time LQRs that achieves global linear and local super-linear convergence. Our results demonstrate both theoretical guarantees and algorithmic benefits of transfer learning in continuous-time RL, addressing a gap in existing literature and extending prior work from discrete to continuous time settings. As a byproduct of our analysis, we derive the stability of a class of continuous-time score-based diffusion models via their connection with LQRs.
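
The policy-transfer idea can be illustrated on the simplest possible LQR: run policy iteration on a source task, then warm-start a nearby target task from the source optimum, which should take no more iterations than a cold start. The sketch below is a scalar discrete-time LQR without entropy regularization, so it only illustrates the transfer principle, not the paper's continuous-time setting:

```python
def lqr_policy_iteration(a, b, q, r, k0, tol=1e-10, max_iter=100):
    """Policy iteration for a scalar discrete-time LQR x_{t+1} = a x + b u
    with stage cost q x^2 + r u^2 and linear policy u = -k x.
    Returns the optimal gain and the number of iterations used."""
    k = k0
    for i in range(1, max_iter + 1):
        ac = a - b * k
        assert abs(ac) < 1, "initial policy must be stabilising"
        P = (q + r * k * k) / (1 - ac * ac)   # policy evaluation
        k_new = b * P * a / (r + b * b * P)   # policy improvement
        if abs(k_new - k) < tol:
            return k_new, i
        k = k_new
    return k, max_iter

# Source task: solve an LQR from a generic stabilising initialisation.
k_src, it_src = lqr_policy_iteration(a=0.90, b=1.0, q=1.0, r=1.0, k0=0.5)
# Target task: a nearby LQR, warm-started from the source optimum...
k_warm, it_warm = lqr_policy_iteration(a=0.92, b=1.0, q=1.0, r=1.0, k0=k_src)
# ...versus the same target solved cold.
_, it_cold = lqr_policy_iteration(a=0.92, b=1.0, q=1.0, r=1.0, k0=0.5)
print(it_warm, it_cold)
```

Policy iteration here is Newton's method on the Riccati equation, so a near-optimal initialisation lands inside the quadratic-convergence basin immediately; that is the discrete-time analogue of the fast-learning guarantee the abstract proves for the continuous-time, entropy-regularized case.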

Updated: 2025-10-16 21:57:53

Subjects: cs.LG,math.OC

Download: http://arxiv.org/abs/2510.15165v1

Dr. Bias: Social Disparities in AI-Powered Medical Guidance

With the rapid progress of Large Language Models (LLMs), the general public now has easy and affordable access to applications capable of answering most health-related questions in a personalized manner. These LLMs are increasingly proving to be competitive, and now even surpass professionals in some medical capabilities. They hold particular promise in low-resource settings, considering they provide the possibility of widely accessible, quasi-free healthcare support. However, the evaluations that fuel these motivations largely lack insight into the social nature of healthcare, remaining oblivious to health disparities between social groups and to how bias may translate into LLM-generated medical advice and affect users. We provide an exploratory analysis of LLM answers to a series of medical questions spanning key clinical domains, where we simulate these questions being asked by several patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, we show that, when LLMs are used for medical advice generation, they produce responses that systematically differ between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex. We observe that these trends amplify when intersectional groups are considered. Considering the increasing trust individuals place in these models, we argue for higher AI literacy and for urgent investigation and mitigation by AI developers to ensure these systemic differences are diminished and do not translate into unjust patient support. Our code is publicly available on GitHub.
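Readability comparisons of the kind described can be illustrated with a Flesch reading-ease score. The helper below is our own minimal sketch, using a naive vowel-group syllable heuristic; it is not the paper's analysis tooling, and real studies would use validated readability software.

```python
import re

def flesch_reading_ease(text):
    """Flesch reading-ease score (higher = easier to read).
    Syllables are approximated by counting vowel groups per word."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 206.835 - 1.015 * n_words / sentences - 84.6 * syllables / n_words

# Toy inputs: short plain sentences vs dense clinical jargon.
simple = "The cat sat. The dog ran. We play all day."
dense = ("Pharmacological contraindications necessitate "
         "comprehensive individualized evaluation.")
```

On inputs like these, the plain text scores far higher than the jargon-heavy one, which is the kind of systematic gap the study measures across patient profiles.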

Updated: 2025-10-16 21:54:20

Subjects: cs.AI,cs.CY

Download: http://arxiv.org/abs/2510.09162v2

MotionScript: Natural Language Descriptions for Expressive 3D Human Motions

We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement including expressive actions (e.g., emotions, stylistic walking) and interactions beyond standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data.

Updated: 2025-10-16 21:37:37

Subjects: cs.CV,cs.AI,cs.CL,cs.RO

Download: http://arxiv.org/abs/2312.12634v5

Optimally Deep Networks - Adapting Model Depth to Datasets for Superior Efficiency

Deep neural networks (DNNs) have delivered strong performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depth, but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS-like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given dataset, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN achieve up to 98.64% and 96.44% reductions in memory footprint, while maintaining competitive accuracies of 99.31% and 96.08%, respectively.
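The progressive depth expansion loop can be sketched as pure control flow. Everything below is an illustrative stand-in: `train_and_eval` represents training a depth-`d` network to convergence, and the toy saturating accuracy curve is invented for demonstration, not real training results.

```python
def progressive_depth_expansion(max_depth, train_and_eval, target_acc):
    """Add one block at a time; stop at the shallowest depth that reaches
    the target accuracy (control-flow sketch of the strategy above)."""
    for depth in range(1, max_depth + 1):
        acc = train_and_eval(depth)  # stands in for training the depth-d net
        if acc >= target_acc:
            return depth, acc        # optimal depth: no redundant deeper blocks
    return max_depth, acc            # fall back to the full-depth model

# Toy accuracy curve that saturates with depth (assumed, for demonstration).
toy_acc = lambda depth: 1.0 - 0.5 ** depth
depth, acc = progressive_depth_expansion(10, toy_acc, target_acc=0.95)
```

The design point is that training cost is only spent on depths up to the first one that meets the target, rather than on the full-capacity architecture.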

Updated: 2025-10-16 21:34:23

Subjects: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.10764v3

MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation

Clouds remain a critical challenge in optical satellite imagery, hindering reliable analysis for environmental monitoring, land cover mapping, and climate research. To overcome this, we propose MSCloudCAM, a Cross-Attention with Multi-Scale Context Network tailored for multispectral and multi-sensor cloud segmentation. Our framework exploits the spectral richness of Sentinel-2 (CloudSEN12) and Landsat-8 (L8Biome) data to classify four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow. MSCloudCAM combines a Swin Transformer backbone for hierarchical feature extraction with multi-scale context modules ASPP and PSP for enhanced scale-aware learning. A Cross-Attention block enables effective multisensor and multispectral feature fusion, while the integration of an Efficient Channel Attention Block (ECAB) and a Spatial Attention Module adaptively refine feature representations. Comprehensive experiments on CloudSEN12 and L8Biome demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy, surpassing leading baseline architectures while maintaining competitive parameter efficiency and FLOPs. These results underscore the model's effectiveness and practicality, making it well-suited for large-scale Earth observation tasks and real-world applications.

Updated: 2025-10-16 21:22:55

Subjects: cs.CV,cs.AI,cs.LG,F.2.2; I.2.7

Download: http://arxiv.org/abs/2510.10802v2

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.

Updated: 2025-10-16 21:10:22

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.15148v1

HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks

Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, "out-loud" reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).

Updated: 2025-10-16 21:03:54

Subjects: cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2510.15144v1

Explainable Machine Learning for Oxygen Diffusion in Perovskites and Pyrochlores

Explainable machine learning can help to discover new physical relationships for material properties. To understand the material properties that govern the activation energy for oxygen diffusion in perovskites and pyrochlores, we build a database of experimental activation energies and apply a grouping algorithm to the material property features. These features are then used to fit seven different machine learning models. An ensemble consensus determines that the most important features for predicting the activation energy are the ionicity of the A-site bond and the partial pressure of oxygen for perovskites. For pyrochlores, the two most important features are the A-site $s$ valence electron count and the B-site electronegativity. The most important features are all constructed using the weighted averages of elemental metal properties, despite weighted averages of the constituent binary oxides being included in our feature set. This is surprising because the material properties of the constituent oxides are more similar to the experimentally measured properties of perovskites and pyrochlores than the features of the metals that are chosen. The easy-to-measure features identified in this work enable rapid screening for new materials with fast oxide-ion diffusivity.

Updated: 2025-10-16 21:00:31

Subjects: cond-mat.mtrl-sci,cs.LG

Download: http://arxiv.org/abs/2505.11722v2

Beyond PCA: Manifold Dimension Estimation via Local Graph Structure

Local principal component analysis (Local PCA) has proven to be an effective tool for estimating the intrinsic dimension of a manifold. More recently, curvature-adjusted PCA (CA-PCA) has improved upon this approach by explicitly accounting for the curvature of the underlying manifold, rather than assuming local flatness. Building on these insights, we propose a general framework for manifold dimension estimation that captures the manifold's local graph structure by integrating PCA with regression-based techniques. Within this framework, we introduce two representative estimators: quadratic embedding (QE) and total least squares (TLS). Experiments on both synthetic and real-world datasets demonstrate that these methods perform competitively with, and often outperform, state-of-the-art alternatives.
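As background for the framework, the plain local-PCA dimension estimate that the paper builds on (not its QE or TLS estimators) can be sketched as follows; the neighborhood size and 95% explained-variance threshold are our own illustrative choices.

```python
import numpy as np

def local_pca_dim(X, k=20, var_threshold=0.95):
    """Baseline local-PCA intrinsic-dimension estimate: per point, take the
    number of principal components needed to explain var_threshold of the
    variance in its k-nearest-neighbour patch, then take the median."""
    dims = []
    for i in range(len(X)):
        # k nearest neighbours of point i (brute force; includes the point).
        d2 = ((X - X[i]) ** 2).sum(axis=1)
        nbrs = X[np.argsort(d2)[:k]]
        centered = nbrs - nbrs.mean(axis=0)
        # Local covariance spectrum via singular values.
        svals = np.linalg.svd(centered, compute_uv=False) ** 2
        ratios = np.cumsum(svals) / svals.sum()
        dims.append(int(np.searchsorted(ratios, var_threshold) + 1))
    return int(np.median(dims))

# A circle embedded in 3-D: a 1-dimensional manifold.
t = np.linspace(0, 2 * np.pi, 400, endpoint=False)
X = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
est = local_pca_dim(X)
```

On curved manifolds like this circle, the estimate works because small neighborhoods are nearly flat; the curvature-aware and regression-based estimators in the abstract are aimed at exactly the regimes where that local-flatness assumption degrades.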

Updated: 2025-10-16 20:59:46

Subjects: stat.ML,cs.LG,stat.AP

Download: http://arxiv.org/abs/2510.15141v1

FERA: Foil Fencing Referee Assistant Using Pose-Based Multi-Label Move Recognition and Rule Reasoning

The sport of fencing, like many other sports, faces challenges in refereeing: subjective calls, human errors, bias, and limited availability in practice environments. We present FERA (Fencing Referee Assistant), a prototype AI referee for foil fencing which integrates pose-based multi-label action recognition and rule-based reasoning. FERA extracts 2D joint positions from video, normalizes them, computes a 101-dimensional kinematic feature set, and applies a Transformer for multi-label move and blade classification. To determine priority and scoring, FERA applies a distilled language model with encoded right-of-way rules, producing both a decision and an explanation for each exchange. With limited hand-labeled data, 5-fold cross-validation achieves an average macro-F1 score of 0.549, outperforming multiple baselines, including a Temporal Convolutional Network (TCN), a BiLSTM, and a vanilla Transformer. While not ready for deployment, these results demonstrate a promising path toward automated referee assistance in foil fencing and new opportunities for AI applications in fencing, such as coaching.

Updated: 2025-10-16 20:58:17

Subjects: cs.AI

Download: http://arxiv.org/abs/2509.18527v2

Predicting the Unpredictable: Reproducible BiLSTM Forecasting of Incident Counts in the Global Terrorism Database (GTD)

We study short-horizon forecasting of weekly terrorism incident counts using the Global Terrorism Database (GTD, 1970--2016). We build a reproducible pipeline with fixed time-based splits and evaluate a Bidirectional LSTM (BiLSTM) against strong classical anchors (seasonal-naive, linear/ARIMA) and a deep LSTM-Attention baseline. On the held-out test set, the BiLSTM attains RMSE 6.38, outperforming LSTM-Attention (9.19; +30.6\%) and a linear lag-regression baseline (+35.4\% RMSE gain), with parallel improvements in MAE and MAPE. Ablations varying temporal memory, training-history length, spatial grain, lookback size, and feature groups show that models trained on long historical data generalize best; a moderate lookback (20--30 weeks) provides strong context; and bidirectional encoding is critical for capturing both build-up and aftermath patterns within the window. Feature-group analysis indicates that short-horizon structure (lagged counts and rolling statistics) contributes most, with geographic and casualty features adding incremental lift. We release code, configs, and compact result tables, and provide a data/ethics statement documenting GTD licensing and research-only use. Overall, the study offers a transparent, baseline-beating reference for GTD incident forecasting.

Updated: 2025-10-16 20:53:43

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.15136v1

FarsiMCQGen: a Persian Multiple-choice Question Generation Framework

Multiple-choice questions (MCQs) are commonly used in educational testing, as they offer an efficient means of evaluating learners' knowledge. However, generating high-quality MCQs, particularly in low-resource languages such as Persian, remains a significant challenge. This paper introduces FarsiMCQGen, an innovative approach for generating Persian-language MCQs. Our methodology combines candidate generation, filtering, and ranking techniques to build a model that generates answer choices resembling those in real MCQs. We leverage advanced methods, including Transformers and knowledge graphs, integrated with rule-based approaches to craft credible distractors that challenge test-takers. Our work is based on data from Wikipedia, which includes general knowledge questions. Furthermore, this study introduces a novel Persian MCQ dataset comprising 10,289 questions. This dataset is evaluated by different state-of-the-art large language models (LLMs). Our results demonstrate the effectiveness of our model and the quality of the generated dataset, which has the potential to inspire further research on MCQs.

Updated: 2025-10-16 20:52:07

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.15134v1

Intermittent File Encryption in Ransomware: Measurement, Modeling, and Detection

File-encrypting ransomware increasingly employs intermittent encryption techniques, encrypting only parts of files to evade classical detection methods. These strategies, exemplified by ransomware families like BlackCat, complicate file-structure-based detection techniques because diverse file formats exhibit varying traits under partial encryption. This paper provides a systematic empirical characterization of byte-level statistics under intermittent encryption across common file types, establishing a comprehensive baseline of how partial encryption impacts data structure. We specialize a classical KL-divergence upper bound on a tailored mixture model of intermittent encryption, yielding filetype-specific detectability ceilings for histogram-based detectors. Leveraging insights from this analysis, we empirically evaluate convolutional neural network (CNN) based detection methods using realistic intermittent encryption configurations derived from leading ransomware variants. Our findings demonstrate that localized analysis via chunk-level CNNs consistently outperforms global analysis methods, highlighting their practical effectiveness and establishing a robust baseline for future detection systems.
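The byte-level effect in question can be illustrated by comparing histogram divergence from uniform for plaintext versus an intermittently encrypted buffer. The every-other-16-byte scheme and the random-byte stand-in for ciphertext below are our own toy assumptions, not a real ransomware trace or the paper's mixture model.

```python
import numpy as np

def byte_hist(data):
    """Laplace-smoothed byte-frequency histogram of a buffer."""
    counts = np.bincount(np.frombuffer(bytes(data), dtype=np.uint8),
                         minlength=256)
    h = counts + 1.0
    return h / h.sum()

def kl_bits(p, q):
    """KL divergence D(p || q) in bits."""
    return float(np.sum(p * np.log2(p / q)))

rng = np.random.default_rng(0)
plain = b"the quick brown fox jumps over the lazy dog " * 200
mixed = bytearray(plain)
# Toy intermittent encryption: overwrite every other 16-byte block with
# uniformly random bytes standing in for ciphertext.
for start in range(0, len(mixed) - 16, 32):
    mixed[start:start + 16] = rng.integers(0, 256, 16, dtype=np.uint8).tobytes()

uniform = np.full(256, 1.0 / 256)
kl_plain = kl_bits(byte_hist(plain), uniform)   # plaintext: far from uniform
kl_mixed = kl_bits(byte_hist(mixed), uniform)   # partial encryption shrinks gap
```

Partial encryption pulls the global histogram toward uniform without reaching it, which is why whole-file histogram detectors lose signal while chunk-level analysis, which sees the unencrypted blocks separately, retains it.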

Updated: 2025-10-16 20:48:22

Subjects: cs.CR

Download: http://arxiv.org/abs/2510.15133v1

A Simple Method for PMF Estimation on Large Supports

We study nonparametric estimation of a probability mass function (PMF) on a large discrete support, where the PMF is multi-modal and heavy-tailed. The core idea is to treat the empirical PMF as a signal on a line graph and apply a data-dependent low-pass filter. Concretely, we form a symmetric tridiagonal operator, the path-graph Laplacian perturbed with a diagonal matrix built from the empirical PMF, then compute the eigenvectors corresponding to the few smallest eigenvalues. Projecting the empirical PMF onto this low-dimensional subspace produces a smooth, multi-modal estimate that preserves coarse structure while suppressing noise. A light post-processing step of clipping and re-normalizing yields a valid PMF. Because we compute the eigenpairs of a symmetric tridiagonal matrix, the computation is reliable and runs in time and memory proportional to the support size times the dimension of the desired low-dimensional subspace. We also provide a practical, data-driven rule for selecting the dimension based on an orthogonal-series risk estimate, so the method "just works" with minimal tuning. On synthetic and real heavy-tailed examples, the approach preserves coarse structure while suppressing sampling noise and compares favorably to logspline and Gaussian-KDE baselines in the intended regimes. However, it has known failure modes (e.g., abrupt discontinuities). The method is short to implement, robust across sample sizes, and, because of its reliability and speed, suitable for automated pipelines and exploratory analysis at scale.
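A minimal sketch of the pipeline, assuming the perturbation takes the form L + λ·diag(p̂); the paper's exact diagonal term, dimension-selection rule, and scaling λ may differ.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def smooth_pmf(p_hat, k, lam=1.0):
    """Low-pass filter an empirical PMF via the k lowest eigenvectors of a
    perturbed path-graph Laplacian (illustrative sketch of the method)."""
    n = len(p_hat)
    # Path-graph Laplacian: degrees on the diagonal, -1 off-diagonal.
    deg = np.full(n, 2.0)
    deg[0] = deg[-1] = 1.0
    d = deg + lam * p_hat          # diagonal perturbation from the data
    e = -np.ones(n - 1)            # off-diagonal of the tridiagonal operator
    # Eigenvectors for the k smallest eigenvalues (symmetric tridiagonal).
    _, V = eigh_tridiagonal(d, e, select='i', select_range=(0, k - 1))
    # Project onto the low-frequency subspace, then clip and re-normalize.
    smooth = V @ (V.T @ p_hat)
    smooth = np.clip(smooth, 0.0, None)
    return smooth / smooth.sum()

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=200).astype(float)
p_hat = counts / counts.sum()      # noisy empirical PMF on a support of 200
p = smooth_pmf(p_hat, k=10)
```

Because `eigh_tridiagonal` exploits the tridiagonal structure, the cost stays linear in the support size for a fixed subspace dimension, matching the scaling claimed above.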

Updated: 2025-10-16 20:47:40

Subjects: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.15132v1

METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation

RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by $1.64-2.54\times$ without sacrificing generation quality.

Updated: 2025-10-16 20:43:13

Subjects: cs.LG,cs.CL,cs.IR

Download: http://arxiv.org/abs/2412.10543v3

Photovoltaic power forecasting using quantum machine learning

Accurate forecasting of photovoltaic power is essential for reliable grid integration, yet remains difficult due to highly variable irradiance, complex meteorological drivers, site geography, and device-specific behavior. Although contemporary machine learning has achieved successes, it is not clear that these approaches are optimal: new model classes may further enhance performance and data efficiency. We investigate hybrid quantum neural networks for time-series forecasting of photovoltaic power and introduce two architectures. The first, a Hybrid Quantum Long Short-Term Memory model, reduces mean absolute error and mean squared error by more than 40% relative to the strongest baselines evaluated. The second, a Hybrid Quantum Sequence-to-Sequence model, once trained, predicts power for arbitrary forecast horizons without requiring prior meteorological inputs and achieves a 16% lower mean absolute error than the best baseline on this task. Both hybrid models maintain superior accuracy when training data are limited, indicating improved data efficiency. These results show that hybrid quantum models address key challenges in photovoltaic power forecasting and offer a practical route to more reliable, data-efficient energy predictions.

Updated: 2025-10-16 20:34:58

Subjects: cs.LG,cs.ET,quant-ph

Download: http://arxiv.org/abs/2312.16379v3

Towards Error Centric Intelligence I, Beyond Observational Learning

We argue that progress toward AGI is theory-limited rather than data- or scale-limited. Building on the critical rationalism of Popper and Deutsch, we challenge the Platonic Representation Hypothesis. Observationally equivalent worlds can diverge under interventions, so observational adequacy alone cannot guarantee interventional competence. We begin by laying foundations: definitions of knowledge, learning, intelligence, counterfactual competence, and AGI. We then analyze the limits of observational learning that motivate an error-centric shift. We recast the problem as three questions: how explicit and implicit errors evolve under an agent's actions, which errors are unreachable within a fixed hypothesis space, and how conjecture and criticism expand that space. From these questions we propose Causal Mechanics, a mechanisms-first program in which hypothesis-space change is a first-class operation and probabilistic structure is used when useful rather than presumed. We advance structural principles that make error discovery and correction tractable, including a differential Locality and Autonomy Principle for modular interventions, a gauge-invariant form of Independent Causal Mechanisms for separability, and the Compositional Autonomy Principle for analogy preservation, together with actionable diagnostics. The aim is a scaffold for systems that can convert unreachable errors into reachable ones and correct them.

Updated: 2025-10-16 20:33:55

Subjects: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.15128v1

Navigating the consequences of mechanical ventilation in clinical intensive care settings through an evolutionary game-theoretic framework

Identifying the effects of mechanical ventilation strategies and protocols in critical care requires analyzing data from heterogeneous patient-ventilator systems within the context of the clinical decision-making environment. This research develops a framework to help understand the consequences of mechanical ventilation (MV) and adjunct care decisions on patient outcomes from observations of critical care patients receiving MV. Developing an understanding of, and improving, critical care respiratory management requires the analysis of existing secondary-use clinical data to generate hypotheses about advantageous variations and adaptations of current care. This work introduces a perspective of the joint patient-ventilator-care system (so-called J6) to develop a scalable method for analyzing data and trajectories of these complex systems. To that end, breath behaviors are analyzed using evolutionary game theory (EGT), which generates the necessary quantitative precursors for deeper analysis through probabilistic and stochastic machinery such as reinforcement learning. This result is one step along the pathway toward MV optimization and personalization. The EGT-based process is analytically validated on synthetic data to reveal potential caveats before proceeding to real-world ICU data applications, which expose complexities of the data-generating process J6. The discussion includes potential developments toward a state-transition model for simulating the effects of MV decisions using empirical and game-theoretic elements.
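The EGT machinery referred to can be illustrated with the standard replicator dynamic over competing strategies (here one could imagine breath behaviors); the 2-strategy payoff matrix below is a toy stand-in we invent for illustration, not anything derived from ventilator data.

```python
import numpy as np

def replicator_step(x, A, dt=0.01):
    """One Euler step of the replicator dynamic
    x_i' = x_i * ((A x)_i - x . (A x))."""
    f = A @ x                          # per-strategy fitness
    return x + dt * x * (f - x @ f)    # grow above-average strategies

# Toy 2-strategy payoff matrix (illustrative stand-in).
A = np.array([[0.0, 1.0],
              [2.0, 0.0]])
x = np.array([0.5, 0.5])               # initial strategy mix on the simplex
for _ in range(2000):
    x = replicator_step(x, A)
# The mix converges to the interior equilibrium where both fitnesses
# are equal: (A x)_1 = (A x)_2, i.e. x = (1/3, 2/3) for this payoff matrix.
```

Trajectories of such dynamics, fit to observed behavior frequencies, are the kind of quantitative precursor the abstract describes feeding into probabilistic machinery like reinforcement learning.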

Updated: 2025-10-16 20:32:52

Categories: cs.LG,math.OC,q-bio.QM

Download: http://arxiv.org/abs/2510.15127v1

Variational Autoencoders for Efficient Simulation-Based Inference

We present a generative modeling approach based on the variational inference framework for likelihood-free simulation-based inference. The method leverages latent variables within variational autoencoders to efficiently estimate complex posterior distributions arising from stochastic simulations. We explore two variations of this approach distinguished by their treatment of the prior distribution. The first model adapts the prior based on observed data using a multivariate prior network, enhancing generalization across various posterior queries. In contrast, the second model utilizes a standard Gaussian prior, offering simplicity while still effectively capturing complex posterior distributions. We demonstrate the ability of the proposed approach to approximate complex posteriors while maintaining computational efficiency on well-established benchmark problems.

Updated: 2025-10-16 20:30:35

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2411.14511v2

Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis

Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges--abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues. Demographic targeting also emerges. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.

Updated: 2025-10-16 20:30:20

Categories: cs.CL,cs.AI,cs.CY,cs.LG,cs.SI

Download: http://arxiv.org/abs/2510.15125v1

Procedural Game Level Design with Deep Reinforcement Learning

Procedural content generation (PCG) has become an increasingly popular technique in game development, allowing developers to generate dynamic, replayable, and scalable environments with reduced manual effort. In this study, a novel method for procedural level design using Deep Reinforcement Learning (DRL) within a Unity-based 3D environment is proposed. The system comprises two agents: a hummingbird agent, acting as a solver, and a floating island agent, responsible for generating and placing collectible objects (flowers) on the terrain in a realistic and context-aware manner. The hummingbird is trained using the Proximal Policy Optimization (PPO) algorithm from the Unity ML-Agents toolkit. It learns to navigate through the terrain efficiently, locate flowers, and collect them while adapting to the ever-changing procedural layout of the island. The island agent is also trained using the Proximal Policy Optimization (PPO) algorithm. It learns to generate flower layouts based on observed obstacle positions, the hummingbird's initial state, and performance feedback from previous episodes. The interaction between these agents leads to emergent behavior and robust generalization across various environmental configurations. The results demonstrate that the approach not only produces effective and efficient agent behavior but also opens up new opportunities for autonomous game level design driven by machine learning. This work highlights the potential of DRL in enabling intelligent agents to both generate and solve content in virtual environments, pushing the boundaries of what AI can contribute to creative game development processes.

Updated: 2025-10-16 20:26:14

Categories: cs.AI,I.2.6, I.2.8, I.2.11, I.3.7

Download: http://arxiv.org/abs/2510.15120v1

Towards smart and adaptive agents for active sensing on edge devices

TinyML has made deploying deep learning models on low-power edge devices feasible, creating new opportunities for real-time perception in constrained environments. However, the adaptability of such deep learning methods remains limited to data drift adaptation, lacking broader capabilities that account for the environment's underlying dynamics and inherent uncertainty. Deep learning's scaling laws, which counterbalance this limitation by massively up-scaling data and model size, cannot be applied when deploying on the Edge, where deep learning limitations are further amplified as models are scaled down for deployment on resource-constrained devices. This paper presents an innovative agentic system capable of performing on-device perception and planning, enabling active sensing on the edge. By incorporating active inference into our solution, our approach extends beyond deep learning capabilities, allowing the system to plan in dynamic environments while operating in real-time with a compact memory footprint of as little as 300 MB. We showcase our proposed system by creating and deploying a saccade agent connected to an IoT camera with pan and tilt capabilities on an NVIDIA Jetson embedded device. The saccade agent controls the camera's field of view following optimal policies derived from the active inference principles, simulating human-like saccadic motion for surveillance and robotics applications.

Updated: 2025-10-16 20:22:48

Categories: cs.RO,cs.AI,eess.IV

Download: http://arxiv.org/abs/2501.06262v2

Deep generative priors for 3D brain analysis

Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
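
The Bayesian inverse-problem structure underlying the method combines a prior score with a forward-model likelihood score. A one-dimensional toy version, with a Gaussian stand-in for the learned prior and a simple additive-noise forward model (both hypothetical, not the paper's trained components), looks like:

```python
# 1-D toy of the score decomposition behind score-based posterior sampling:
#   grad log p(x|y) = grad log p(x) + grad log p(y|x).
# The Gaussian prior N(0, 1) stands in for the learned diffusion prior;
# the forward model is y = x + noise with noise std 0.5.
def prior_score(x, mu0=0.0, s0=1.0):
    return -(x - mu0) / s0 ** 2

def likelihood_score(x, y, s1=0.5):
    return (y - x) / s1 ** 2

def posterior_score(x, y):
    # this combined score guides sampling toward the Bayesian posterior
    return prior_score(x) + likelihood_score(x, y)
```

Swapping the Gaussian prior score for a learned diffusion score, and the additive-noise model for super-resolution or bias-field forward models, gives the general recipe the abstract describes.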

Updated: 2025-10-16 20:20:50

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.15119v1

Polarization based direction of arrival estimation using a radio interferometric array

Direction of arrival (DOA) estimation is mostly performed using specialized arrays that have carefully designed receiver spacing and layouts to match the operating frequency range. In contrast, radio interferometric arrays are designed to optimally sample the Fourier space data for making high quality images of the sky. Therefore, using existing radio interferometric arrays (with arbitrary geometry and wide frequency variation) for DOA estimation is practically infeasible except by using images made by such interferometers. In this paper, we focus on low cost DOA estimation without imaging, using a subset of a radio interferometric array, using a fraction of the data collected by the full array, and, enabling early determination of DOAs. The proposed method is suitable for transient and low duty cycle source detection. Moreover, the proposed method is an ideal follow-up step to online radio frequency interference (RFI) mitigation, enabling the early estimation of the DOA of the detected RFI.

Updated: 2025-10-16 20:17:55

Categories: astro-ph.IM,cs.LG

Download: http://arxiv.org/abs/2510.15116v1

AndroByte: LLM-Driven Privacy Analysis through Bytecode Summarization and Dynamic Dataflow Call Graph Generation

With the exponential growth in mobile applications, protecting user privacy has become even more crucial. Android applications are often known for collecting, storing, and sharing sensitive user information such as contacts, location, camera, and microphone data, often without the user's clear consent or awareness, raising significant privacy risks and exposure. In the context of privacy assessment, dataflow analysis is particularly valuable for identifying data usage and potential leaks. Traditionally, this type of analysis has relied on formal methods, heuristics, and rule-based matching. However, these techniques are often complex to implement and prone to errors, such as taint explosion for large programs. Moreover, most existing Android dataflow analysis methods depend heavily on a predefined list of sinks, limiting their flexibility and scalability. To address the limitations of these existing techniques, we propose AndroByte, an AI-driven privacy analysis tool that leverages LLM reasoning on bytecode summarization to dynamically generate accurate and explainable dataflow call graphs from static code analysis. AndroByte achieves a significant Fβ-score of 89% in generating dynamic dataflow call graphs on the fly, outperforming traditional tools such as FlowDroid and Amandroid in leak detection without relying on predefined propagation rules or sink lists. Moreover, AndroByte's iterative bytecode summarization provides comprehensive and explainable insights into dataflow and leak detection, achieving high, quantifiable scores based on the G-Eval metric.
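
The Fβ-score cited above is the standard weighted harmonic mean of precision and recall; the generic formula is shown here for reference.

```python
# F-beta score: beta > 1 weights recall more heavily, beta < 1 weights precision.
# beta = 1 recovers the ordinary F1 harmonic mean.
def f_beta(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)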

Updated: 2025-10-16 20:10:20

Categories: cs.CR

Download: http://arxiv.org/abs/2510.15112v1

DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
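
One of the fixes named above, batch-wise reward normalization, can be sketched as follows. This is a hedged reading of the recipe; DLER's exact advantage estimator is defined in the paper.

```python
# Sketch of batch-wise reward normalization for advantage estimation:
# zero-mean, unit-std advantages within a batch reduce the large bias
# in advantage estimation named as challenge (i).
import statistics

def normalized_advantages(rewards):
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: constant-reward batch
    return [(r - mu) / sigma for r in rewards]
```

Combined with a hard truncation length penalty (responses exceeding the budget receive zero reward before normalization), this yields the simple recipe the abstract describes.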

Updated: 2025-10-16 20:05:57

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.15110v1

Targeted Attacks and Defenses for Distributed Federated Learning in Vehicular Networks

In emerging networked systems, mobile edge devices such as ground vehicles and unmanned aerial system (UAS) swarms collectively aggregate vast amounts of data to make machine learning decisions such as threat detection in remote, dynamic, and infrastructure-constrained environments where power and bandwidth are scarce. Federated learning (FL) addresses these constraints and privacy concerns by enabling nodes to share local model weights for deep neural networks instead of raw data, facilitating more reliable decision-making than individual learning. However, conventional FL relies on a central server to coordinate model updates in each learning round, which imposes significant computational burdens on the central node and may not be feasible due to the connectivity constraints. By eliminating dependence on a central server, distributed federated learning (DFL) offers scalability, resilience to node failures, learning robustness, and more effective defense strategies. Despite these advantages, DFL remains vulnerable to increasingly advanced and stealthy cyberattacks. In this paper, we design sophisticated targeted training data poisoning and backdoor (Trojan) attacks, and characterize the emerging vulnerabilities in a vehicular network. We analyze how DFL provides resilience against such attacks compared to individual learning and present effective defense mechanisms to further strengthen DFL against the emerging cyber threats.

Updated: 2025-10-16 20:05:13

Categories: cs.NI,cs.AI,cs.DC,cs.LG,eess.SP

Download: http://arxiv.org/abs/2510.15109v1

Partitioning $\mathbb{Z}_{sp}$ in finite fields and groups of trees and cycles

This paper investigates the algebraic and graphical structure of the ring $\mathbb{Z}_{sp}$, with a focus on its decomposition into finite fields, kernels, and special subsets. We establish classical isomorphisms between $\mathbb{F}_s$ and $p\mathbb{F}_s$, as well as $p\mathbb{F}_s^{\star}$ and $p\mathbb{F}_s^{+1,\star}$. We introduce the notion of arcs and rooted trees to describe the pre-periodic structure of $\mathbb{Z}_{sp}$, and prove that trees rooted at elements not divisible by $s$ or $p$ can be generated from the tree of unity via multiplication by cyclic arcs. Furthermore, we define and analyze the set $\mathbb{D}_{sp}$, consisting of elements that are neither multiples of $s$ or $p$ nor "off-by-one" elements, and show that its graph decomposes into cycles and pre-periodic trees. Finally, we demonstrate that every cycle in $\mathbb{Z}_{sp}$ contains inner cycles that are derived predictably from the cycles of the finite fields $p\mathbb{F}_s$ and $s\mathbb{F}_p$, and we discuss the cryptographic relevance of $\mathbb{D}_{sp}$, highlighting its potential for analyzing cyclic attacks and factorization methods.

Updated: 2025-10-16 19:59:36

Categories: cs.CR,math.GR,math.NT

Download: http://arxiv.org/abs/2510.15108v1

Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors

Prior-data fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular data sets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled and efficient sampling procedure to construct Bayesian posteriors for such estimates based on Martingale posteriors, and prove its convergence. Several simulated and real-world data examples showcase the uncertainty quantification of our method in inference applications.

Updated: 2025-10-16 19:58:50

Categories: stat.ME,cs.AI,cs.LG,stat.CO,stat.ML

Download: http://arxiv.org/abs/2505.11325v2

Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models

Self-supervised word embedding algorithms such as word2vec provide a minimal setting for studying representation learning in language modeling. We examine the quartic Taylor approximation of the word2vec loss around the origin, and we show that both the resulting training dynamics and the final performance on downstream tasks are empirically very similar to those of word2vec. Our main contribution is to analytically solve for both the gradient flow training dynamics and the final word embeddings in terms of only the corpus statistics and training hyperparameters. The solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on Wikipedia, we find that each of the top linear subspaces represents an interpretable topic-level concept. Finally, we apply our theory to describe how linear representations of more abstract semantic concepts emerge during training; these can be used to complete analogies via vector addition.
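
The analogy-by-vector-addition behavior described at the end can be sketched with hand-made toy vectors; the 3-d embeddings below are illustrative stand-ins, not trained word2vec vectors.

```python
# Toy illustration of analogy completion by vector addition:
# king - man + woman should land nearest to queen.
# These 3-d vectors are hand-made, NOT trained embeddings.
import math

emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    # answer d maximizing cos(emb[d], emb[b] - emb[a] + emb[c]),
    # excluding the three query words themselves
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(3)]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))
```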

Updated: 2025-10-16 19:53:18

Categories: cs.LG,cs.CL,stat.ML

Download: http://arxiv.org/abs/2502.09863v3

PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models

As Large Language Models (LLMs) gain traction across critical domains, ensuring secure and trustworthy training processes has become a major concern. Backdoor attacks, where malicious actors inject hidden triggers into training data, are particularly insidious and difficult to detect. Existing post-training verification solutions like Proof-of-Learning are impractical for LLMs due to their requirement for full retraining, lack of robustness against stealthy manipulations, and inability to provide early detection during training. Early detection would significantly reduce computational costs. To address these limitations, we introduce Proof-of-Training Steps, a verification protocol that enables an independent auditor (Alice) to confirm that an LLM developer (Bob) has followed the declared training recipe, including data batches, architecture, and hyperparameters. By analyzing the sensitivity of the LLMs' language modeling head (LM-Head) to input perturbations, our method can expose subtle backdoor injections or deviations in training. Even with backdoor triggers in up to 10 percent of the training data, our protocol significantly reduces the attacker's ability to achieve a high attack success rate (ASR). Our method enables early detection of attacks at the injection step, with verification steps being 3x faster than training steps. Our results highlight the protocol's potential to enhance the accountability and security of LLM development, especially against insider threats.

Updated: 2025-10-16 19:49:34

Categories: cs.CR,cs.LG,I.2.7; I.2.6

Download: http://arxiv.org/abs/2510.15106v1

Continual Learning via Sparse Memory Finetuning

Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model's existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.
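
The slot-selection rule described above, updating only memory slots highly activated by the new knowledge relative to their usage on pretraining data, can be caricatured in a few lines. The ratio threshold and per-slot statistics here are hypothetical stand-ins, not the paper's criterion.

```python
# Toy caricature of sparse slot selection: update only memory-slot indices
# whose activation on the new example exceeds a multiple of their average
# usage on pretraining data. The ratio threshold is hypothetical.
def slots_to_update(new_activations, pretrain_usage, ratio=2.0):
    return [i for i, (a, u) in enumerate(zip(new_activations, pretrain_usage))
            if a > ratio * u]
```

Restricting gradient updates to the returned indices is what limits interference between new knowledge and existing capabilities.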

Updated: 2025-10-16 19:44:38

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.15103v1

Operator Flow Matching for Timeseries Forecasting

Forecasting high-dimensional, PDE-governed dynamics remains a core challenge for generative modeling. Existing autoregressive and diffusion-based approaches often suffer cumulative errors and discretisation artifacts that limit long, physically consistent forecasts. Flow matching offers a natural alternative, enabling efficient, deterministic sampling. We prove an upper bound on FNO approximation error and propose TempO, a latent flow matching model leveraging sparse conditioning with channel folding to efficiently process 3D spatiotemporal fields using time-conditioned Fourier layers to capture multi-scale modes with high fidelity. TempO outperforms state-of-the-art baselines across three benchmark PDE datasets, and spectral analysis further demonstrates superior recovery of multi-scale dynamics, while efficiency studies highlight its parameter- and memory-light design compared to attention-based or convolutional regressors.
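
The flow-matching alternative mentioned above reduces to a simple regression target; a generic sketch with a linear interpolation path (not TempO's latent/Fourier architecture) is:

```python
# Generic conditional flow matching: along the linear path
# x_t = (1 - t) * x0 + t * x1, the regression target for the learned
# velocity field v_theta(x_t, t) is simply x1 - x0.
import random

def flow_matching_example(x0, x1):
    t = random.random()
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return t, x_t, target_velocity

def fm_loss(pred_velocity, target_velocity):
    # squared error ||v_theta(x_t, t) - (x1 - x0)||^2
    return sum((p - q) ** 2 for p, q in zip(pred_velocity, target_velocity))
```

At sampling time, integrating the learned velocity field from t = 0 to t = 1 gives the deterministic, efficient generation the abstract contrasts with diffusion.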

Updated: 2025-10-16 19:40:56

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.15101v1

PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation

When users submit queries to Large Language Models (LLMs), their prompts often contain sensitive data, forcing a difficult choice: sending the query to a powerful proprietary LLM provider achieves state-of-the-art performance but risks data exposure, while relying on smaller, local models guarantees data privacy but often degrades task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called PrivacyPAD to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII), which it shields locally, and task-critical PII, which it strategically sends to the remote model for maximal utility. To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state of the art on the privacy-utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments.

Updated: 2025-10-16 19:38:36

Categories: cs.CR,cs.CL

Download: http://arxiv.org/abs/2510.16054v1

OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data

Real-world settings where language models (LMs) are deployed -- in domains spanning healthcare, finance, and other forms of knowledge work -- require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
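
Calibration of elicited priors, as assessed above, is commonly checked via interval coverage; a generic sketch follows (the benchmark's actual metrics may differ).

```python
# Fraction of true values falling inside their elicited credible intervals.
# A well-calibrated 90% interval should cover the truth about 90% of the
# time; overconfident priors (too-narrow intervals) cover far less.
def interval_coverage(intervals, truths):
    hits = sum(1 for (lo, hi), t in zip(intervals, truths) if lo <= t <= hi)
    return hits / len(truths)
```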

Updated: 2025-10-16 19:35:22

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.15096v1

Competition and Diversity in Generative AI

Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of content produced. The use of the same or similar AI models appears to lead to more homogeneous behavior. Our work begins with the observation that there is a force pushing in the opposite direction: competition. When producers compete with one another (e.g., for customers or attention), they are incentivized to create novel or unique content. We explore the impact competition has on both content diversity and overall social welfare. Through a formal game-theoretic model, we show that competitive markets select for diverse AI models, mitigating monoculture. We further show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to provide value in a competitive market. Our results highlight the importance of evaluating generative AI models across the breadth of their output distributions, particularly when they will be deployed in competitive environments. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for answers that are both correct and unique. Overall, our results suggest that homogenization due to generative AI is unlikely to persist in competitive markets, and instead, competition in downstream markets may drive diversification in AI model development.

Updated: 2025-10-16 19:33:03

Categories: cs.GT,cs.AI,cs.CY

Download: http://arxiv.org/abs/2412.08610v2

Unfair Learning: GenAI Exceptionalism and Copyright Law

This paper challenges the argument that generative artificial intelligence (GenAI) is entitled to broad immunity from copyright law for reproducing copyrighted works without authorization due to a fair use defense. It examines fair use legal arguments and eight distinct substantive arguments, contending that every legal and substantive argument favoring fair use for GenAI applies equally, if not more so, to humans. Therefore, granting GenAI exceptional privileges in this domain is legally and logically inconsistent with withholding broad fair use exemptions from individual humans. It would mean no human would need to pay for virtually any copyright work again. The solution is to take a circumspect view of any fair use claim for mass copyright reproduction by any entity and focus on the first principles of whether permitting such exceptionalism for GenAI promotes science and the arts.

Updated: 2025-10-16 19:32:15

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2504.00955v3

Rewind-to-Delete: Certified Machine Unlearning for Nonconvex Functions

Machine unlearning algorithms aim to efficiently remove data from a model without retraining it from scratch, in order to remove corrupted or outdated data or respect a user's "right to be forgotten." Certified machine unlearning is a strong theoretical guarantee based on differential privacy that quantifies the extent to which an algorithm erases data from the model weights. In contrast to existing works in certified unlearning for convex or strongly convex loss functions, or nonconvex objectives with limiting assumptions, we propose the first first-order, black-box (i.e., applicable to models pretrained with vanilla gradient descent) algorithm for unlearning on general nonconvex loss functions, which unlearns by "rewinding" to an earlier step during the learning process before performing gradient descent on the loss function of the retained data points. We prove $(\epsilon, \delta)$ certified unlearning and performance guarantees that establish the privacy-utility-complexity tradeoff of our algorithm, and we prove generalization guarantees for functions that satisfy the Polyak-Lojasiewicz inequality. Finally, we demonstrate the superior performance of our algorithm compared to existing methods, within a new experimental framework that more accurately reflects unlearning user data in practice.
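The "rewind" mechanism can be illustrated on a toy one-dimensional quadratic loss: checkpoints are stored during training, and unlearning restarts from an early checkpoint and descends only on the retained data. This is a minimal sketch of the idea, not the paper's certified $(\epsilon, \delta)$ procedure, which additionally involves noise and precise step accounting.

```python
# Toy 1-D model: minimize mean squared error over data points with gradient
# descent, keeping checkpoints so we can "rewind" when a point must be unlearned.

def grad(w, data):
    # gradient of mean((w - x)^2) is 2 * (w - mean(data))
    return sum(2.0 * (w - x) for x in data) / len(data)

def train(data, w=0.0, steps=100, lr=0.1):
    checkpoints = [w]
    for _ in range(steps):
        w -= lr * grad(w, data)
        checkpoints.append(w)
    return w, checkpoints

def unlearn(checkpoints, retained, rewind_to=10, steps=100, lr=0.1):
    # rewind to an earlier checkpoint, then descend on the retained loss only
    w = checkpoints[rewind_to]
    for _ in range(steps):
        w -= lr * grad(w, retained)
    return w

w_full, ckpts = train([1.0, 2.0, 3.0, 6.0])      # optimum on all data is the mean, 3.0
w_unlearned = unlearn(ckpts, [1.0, 2.0, 3.0])    # drop 6.0; retained optimum is 2.0
```

Rewinding avoids restarting from scratch: the early checkpoint has absorbed little of the deleted point's influence, so the remaining descent is only on the retained loss.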

Updated: 2025-10-16 19:30:16

Categories: cs.LG

Download: http://arxiv.org/abs/2409.09778v6

Beyond Outcome-Based Imperfect-Recall: Higher-Resolution Abstractions for Imperfect-Information Games

Hand abstraction is crucial for scaling imperfect-information games (IIGs) such as Texas Hold'em, yet progress is limited by the lack of a formal task model and by evaluations that require resource-intensive strategy solving. We introduce signal observation ordered games (SOOGs), a subclass of IIGs tailored to hold'em-style games that cleanly separates signal sequences from player action sequences, providing a precise mathematical foundation for hand abstraction. Within this framework, we define a resolution bound, an information-theoretic upper bound on achievable performance under a given signal abstraction. Using the bound, we show that mainstream outcome-based imperfect-recall algorithms suffer substantial losses by arbitrarily discarding historical information; we formalize this behavior via potential-aware outcome isomorphism (PAOI) and prove that PAOI characterizes their resolution bound. To overcome this limitation, we propose full-recall outcome isomorphism (FROI), which integrates historical information to raise the bound and improve policy quality. Experiments on hold'em-style benchmarks confirm that FROI consistently outperforms outcome-based imperfect-recall baselines. Our results provide a unified formal treatment of hand abstraction and practical guidance for designing higher-resolution abstractions in IIGs.

Updated: 2025-10-16 19:27:15

Categories: cs.GT,cs.AI

Download: http://arxiv.org/abs/2510.15094v1

Multi-identity Human Image Animation with Structural Video Diffusion

Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and to model the distribution of 3D-aware dynamics. To address these limitations, we present \emph{Structural Video Diffusion}, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation. Code is available at https://github.com/zhenzhiwang/Multi-HumanVid

Updated: 2025-10-16 19:11:40

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04126v2

DMRetriever: A Family of Models for Improved Text Retrieval in Disaster Management

Effective and efficient access to relevant information is essential for disaster management. However, no retrieval model is specialized for disaster management, and existing general-domain models fail to handle the varied search intents inherent to disaster management scenarios, resulting in inconsistent and unreliable performance. To this end, we introduce DMRetriever, the first series of dense retrieval models (33M to 7.6B) tailored for this domain. It is trained through a novel three-stage framework of bidirectional attention adaptation, unsupervised contrastive pre-training, and difficulty-aware progressive instruction fine-tuning, using high-quality data generated through an advanced data refinement pipeline. Comprehensive experiments demonstrate that DMRetriever achieves state-of-the-art (SOTA) performance across all six search intents at every model scale. Moreover, DMRetriever is highly parameter-efficient: the 596M model outperforms baselines over 13.3x larger, and the 33M model exceeds baselines with only 7.6% of their parameters. All codes, data, and checkpoints are available at https://github.com/KaiYin97/DMRETRIEVER

Updated: 2025-10-16 19:08:34

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2510.15087v1

SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling

The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most widely used methods for addressing class imbalance and generating synthetic data. Despite its popularity, little attention has been paid to its privacy implications; yet, it is used in the wild in many privacy-sensitive applications. In this work, we conduct the first systematic study of privacy leakage in SMOTE: We begin by showing that prevailing evaluation practices, i.e., naive distinguishing and distance-to-closest-record metrics, completely fail to detect any leakage and that membership inference attacks (MIAs) can be instantiated with high accuracy. Then, by exploiting SMOTE's geometric properties, we build two novel attacks with very limited assumptions: DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets, and ReconSMOTE, which reconstructs real minority records from synthetic datasets with perfect precision and recall approaching one under realistic imbalance ratios. We also provide theoretical guarantees for both attacks. Experiments on eight standard imbalanced datasets confirm the practicality and effectiveness of these attacks. Overall, our work reveals that SMOTE is inherently non-private and disproportionately exposes minority records, highlighting the need to reconsider its use in privacy-sensitive applications.
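The geometric property the attacks exploit is easy to state: every SMOTE record is a convex combination of a real minority sample and one of its minority-class nearest neighbours, so each synthetic point lies on a line segment between two real records. A minimal sketch of that interpolation (illustrative only, not the paper's attack code):

```python
import random

def smote_point(x_i, x_nn, rng):
    """SMOTE interpolation: a synthetic point on the segment between a minority
    sample and one of its nearest minority-class neighbours."""
    lam = rng.random()  # lambda ~ U(0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]

def collinear_2d(p, q, r, tol=1e-9):
    # cross product of (q - p) and (r - p) vanishes iff the points are collinear
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])) < tol

rng = random.Random(0)
real_a, real_b = [1.0, 2.0], [4.0, 6.0]
synthetic = smote_point(real_a, real_b, rng)
```

Because synthetic records are constrained to these segments, observing enough of them reveals the segments' endpoints, which is the leverage behind distinguishing and reconstruction attacks of the kind the paper describes.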

Updated: 2025-10-16 18:55:46

Categories: cs.CR

Download: http://arxiv.org/abs/2510.15083v1

Evaluating Sparse Autoencoders for Monosemantic Representation

A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through activation distribution lens. We introduce a fine-grained concept separability score based on the Jensen-Shannon distance, which captures how distinctly a neuron's activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (including word-level and sentence-level), we show that SAEs reduce polysemanticity and achieve higher concept separability. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP achieves the smallest perplexity increase while remaining highly effective at concept removal.
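The separability score builds on the Jensen-Shannon distance between a unit's activation distributions under different concepts. A minimal sketch with toy three-bin activation histograms (the binning and the example distributions are illustrative assumptions, not the paper's exact setup):

```python
import math

def _kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs),
    bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * _kl(p, m) + 0.5 * _kl(q, m))

# A "monosemantic" unit: activation mass concentrated on different bins per concept.
mono_a, mono_b = [0.9, 0.1, 0.0], [0.0, 0.1, 0.9]
# A "polysemantic" unit: nearly identical activation profiles for both concepts.
poly_a, poly_b = [0.4, 0.3, 0.3], [0.35, 0.3, 0.35]
```

A unit whose activation histograms differ sharply across concepts scores a large distance (high separability); a unit that responds similarly to both scores near zero.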

Updated: 2025-10-16 18:52:47

Categories: cs.LG

Download: http://arxiv.org/abs/2508.15094v2

Graph-based Neural Space Weather Forecasting

Accurate space weather forecasting is crucial for protecting our increasingly digital infrastructure. Hybrid-Vlasov models, like Vlasiator, offer physical realism beyond that of current operational systems, but are too computationally expensive for real-time use. We introduce a graph-based neural emulator trained on Vlasiator data to autoregressively predict near-Earth space conditions driven by an upstream solar wind. We show how to achieve both fast deterministic forecasts and, by using a generative model, produce ensembles to capture forecast uncertainty. This work demonstrates that machine learning offers a way to add uncertainty quantification capability to existing space weather prediction systems, and make hybrid-Vlasov simulation tractable for operational use.

Updated: 2025-10-16 18:49:57

Categories: physics.space-ph,cs.LG,physics.plasm-ph

Download: http://arxiv.org/abs/2509.19605v2

Online Correlation Clustering: Simultaneously Optimizing All $\ell_p$-norms

The $\ell_p$-norm objectives for correlation clustering present a fundamental trade-off between minimizing total disagreements (the $\ell_1$-norm) and ensuring fairness to individual nodes (the $\ell_\infty$-norm). Surprisingly, in the offline setting it is possible to simultaneously approximate all $\ell_p$-norms with a single clustering. Can this powerful guarantee be achieved in an online setting? This paper provides the first affirmative answer. We present a single algorithm for the online-with-a-sample (AOS) model that, given a small constant fraction of the input as a sample, produces one clustering that is simultaneously $O(\log^4 n)$-competitive for all $\ell_p$-norms with high probability, $O(\log n)$-competitive for the $\ell_\infty$-norm with high probability, and $O(1)$-competitive for the $\ell_1$-norm in expectation. This work successfully translates the offline "all-norms" guarantee to the online world. Our setting is motivated by a new hardness result that demonstrates a fundamental separation between these objectives in the standard random-order (RO) online model. Namely, while the $\ell_1$-norm is trivially $O(1)$-approximable in the RO model, we prove that any algorithm in the RO model for the fairness-promoting $\ell_\infty$-norm must have a competitive ratio of at least $\Omega(n^{1/3})$. This highlights the necessity of a different beyond-worst-case model. We complement our algorithm with lower bounds, showing our competitive ratios for the $\ell_1$- and $\ell_\infty$- norms are nearly tight in the AOS model.
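The objectives can be stated concretely: with a disagreement vector $y$ holding each node's count of violated edges, the $\ell_1$-norm measures total disagreements (each disagreeing edge counted at both endpoints) and the $\ell_\infty$-norm the worst-off node. A minimal sketch of evaluating a clustering under these norms (the graph encoding is an illustrative choice):

```python
def disagreement_vector(n, edges, cluster):
    """edges: dict (u, v) -> '+' or '-'; cluster: node -> cluster id.
    A '+' edge disagrees when split across clusters, a '-' edge when kept together."""
    y = [0] * n
    for (u, v), sign in edges.items():
        same = cluster[u] == cluster[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            y[u] += 1
            y[v] += 1
    return y

def lp_norm(y, p):
    if p == float('inf'):
        return max(y)
    return sum(v ** p for v in y) ** (1 / p)

# Triangle with two '+' edges and one '-' edge, all nodes in one cluster:
# only the '-' edge (0, 2) disagrees.
edges = {(0, 1): '+', (1, 2): '+', (0, 2): '-'}
y = disagreement_vector(3, edges, {0: 0, 1: 0, 2: 0})
```

A single clustering that is simultaneously competitive for every $p$ must keep the whole profile of this vector small, not just one norm of it.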

Updated: 2025-10-16 18:42:54

Categories: cs.LG,cs.DM,cs.DS

Download: http://arxiv.org/abs/2510.15076v1

Physics-informed data-driven machine health monitoring for two-photon lithography

Two-photon lithography (TPL) is a sophisticated additive manufacturing technology for creating three-dimensional (3D) micro- and nano-structures. Maintaining the health of TPL systems is critical for ensuring consistent fabrication quality. Current maintenance practices often rely on experience rather than informed monitoring of machine health, resulting in either untimely maintenance that causes machine downtime and poor-quality fabrication, or unnecessary maintenance that leads to inefficiencies and avoidable downtime. To address this gap, this paper presents three methods for accurate and timely monitoring of TPL machine health. Through integrating physics-informed data-driven predictive models for structure dimensions with statistical approaches, the proposed methods are able to handle increasingly complex scenarios featuring different levels of generalizability. A comprehensive experimental dataset that encompasses six process parameter combinations and six structure dimensions under two machine health conditions was collected to evaluate the effectiveness of the proposed approaches. Across all test scenarios, the approaches are shown to achieve high accuracies, demonstrating excellent effectiveness, robustness, and generalizability. These results represent a significant step toward condition-based maintenance for TPL systems.

Updated: 2025-10-16 18:41:46

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.15075v1

GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping

Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and feed-forward layer matrices). We found that different components converge at varying rates during fine-tuning for both language and vision-language models. GradES tracks the magnitude of gradient changes in backpropagation for these matrices during training. When a projection matrix's magnitude of gradient changes falls below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slowly converging matrices to continue learning. GradES speeds up training time by 1.57-7.22x while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy in language tasks and 3.88% on multimodal benchmarks.
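The per-matrix freezing rule is simple to sketch: track a gradient-magnitude statistic for each weight matrix and drop a matrix from further updates once the statistic falls below $\tau$. The mean-absolute-gradient statistic and the matrix names below are illustrative simplifications, not the paper's exact bookkeeping:

```python
def magnitude(grad):
    # mean absolute gradient of a (flattened) weight matrix
    return sum(abs(g) for g in grad) / len(grad)

def grades_update(grads, frozen, tau):
    """Per-matrix early stopping: once a matrix's gradient magnitude drops
    below tau it is frozen and excluded from further updates."""
    for name, grad in grads.items():
        if name not in frozen and magnitude(grad) < tau:
            frozen.add(name)
    return {name: grad for name, grad in grads.items() if name not in frozen}

frozen = set()
# step 1: both matrices are still changing, so both receive updates
active = grades_update({"attn_q": [0.5, -0.4], "ffn_up": [0.9, 1.1]}, frozen, tau=0.1)
# step 2: attn_q's gradients have shrunk below tau; only ffn_up keeps learning
active = grades_update({"attn_q": [0.01, -0.02], "ffn_up": [0.6, 0.7]}, frozen, tau=0.1)
```

Freezing is per matrix and monotone: a frozen matrix never re-enters the update set, which is what removes both its backward-pass cost and the need for validation passes.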

Updated: 2025-10-16 18:38:51

Categories: cs.LG,cs.AI,68T07,I.2; I.2.7; I.4; H.5.1

Download: http://arxiv.org/abs/2509.01842v3

Hydra: A Modular Architecture for Efficient Long-Context Reasoning

The quadratic complexity of transformers fundamentally limits reasoning system deployment in resource-constrained and long-context settings. We introduce Hydra, a modular architecture based upon a state-space backbone which adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and product key memory. We evaluate a 29M parameter model measuring logical chaining accuracy and throughput on synthetic sequences, plus throughput on WikiText. Ablation studies use component-specific synthetic datasets to isolate individual mechanisms. Hydra achieves $3.01\times$ and $3.0\times$ throughput gains at 8K tokens for synthetic and WikiText datasets, respectively, and $10\times$ accuracy improvements on multi-step logical composition compared to equal-sized transformers. Ablations confirm each component's contribution: sparse attention captures long-range dependencies, experts specialize to input domains, and product key memory enables selective retrieval.

Updated: 2025-10-16 18:37:35

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2508.15099v3

Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates

Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic Bézier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify Bézier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
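A quadratic Bézier surrogate is fully determined by the trajectory's two endpoints and one control point: $B(t) = (1-t)^2 P_0 + 2(1-t)t\,P_1 + t^2 P_2$ for $t \in [0, 1]$. A minimal sketch in a two-parameter space (the control point here is hand-picked for illustration; in the method it would be fit so the path stays in a low-loss region):

```python
def bezier(p0, p1, p2, t):
    """Quadratic Bezier point: (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2, element-wise."""
    return [(1 - t) ** 2 * a + 2 * (1 - t) * t * b + t ** 2 * c
            for a, b, c in zip(p0, p1, p2)]

# Endpoints play the role of the initial and final model states of a real
# training trajectory; the dense noisy SGD path between them is discarded.
theta_init, theta_final = [0.0, 0.0], [4.0, 2.0]
control = [1.0, 3.0]  # hypothetical control point standing in for a fitted one
path = [bezier(theta_init, control, theta_final, k / 10) for k in range(11)]
```

The curve interpolates the endpoints exactly and is smooth by construction, which is why it can replace a stored, high-curvature SGD trajectory as an alignment target.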

Updated: 2025-10-16 18:34:15

Categories: cs.LG,cs.CV,cs.DB

Download: http://arxiv.org/abs/2510.05805v2

End-to-End Learning Framework for Solving Non-Markovian Optimal Control

Integer-order calculus often falls short in capturing the long-range dependencies and memory effects found in many real-world processes. Fractional calculus addresses these gaps via fractional-order integrals and derivatives, but fractional-order dynamical systems pose substantial challenges in system identification and optimal control due to the lack of standard control methodologies. In this paper, we theoretically derive the optimal control via linear quadratic regulator (LQR) for fractional-order linear time-invariant (FOLTI) systems and develop an end-to-end deep learning framework based on this theoretical foundation. Our approach establishes a rigorous mathematical model, derives analytical solutions, and incorporates deep learning to achieve data-driven optimal control of FOLTI systems. Our key contributions include: (i) proposing an innovative system identification method control strategy for FOLTI systems, (ii) developing the first end-to-end data-driven learning framework, Fractional-Order Learning for Optimal Control (FOLOC), that learns control policies from observed trajectories, and (iii) deriving a theoretical analysis of sample complexity to quantify the number of samples required for accurate optimal control in complex real-world problems. Experimental results indicate that our method accurately approximates fractional-order system behaviors without relying on Gaussian noise assumptions, pointing to promising avenues for advanced optimal control.
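For intuition on the memory effect, the Grünwald-Letnikov scheme approximates a fractional derivative as a weighted sum over the entire history, $D^\alpha f(t) \approx h^{-\alpha} \sum_k (-1)^k \binom{\alpha}{k} f(t - kh)$, and at $\alpha = 1$ it collapses to the ordinary backward difference. The sketch below is a standard discretization for intuition, not the paper's FOLOC framework:

```python
def gl_weights(alpha, n):
    """Grunwald-Letnikov weights w_k = (-1)^k * C(alpha, k), via the recurrence
    w_k = w_{k-1} * (1 - (alpha + 1) / k)."""
    w = [1.0]
    for k in range(1, n + 1):
        w.append(w[-1] * (1 - (alpha + 1) / k))
    return w

def frac_derivative(f_hist, alpha, h):
    """Approximate D^alpha f at the latest sample. Every past sample carries a
    nonzero weight for non-integer alpha: the memory effect that integer-order
    derivatives lack."""
    w = gl_weights(alpha, len(f_hist) - 1)
    return sum(wk * fk for wk, fk in zip(w, reversed(f_hist))) / h ** alpha
```

For integer $\alpha = 1$ the weights truncate to $(1, -1, 0, \dots)$ and only the two latest samples matter; for $\alpha = 0.5$ every sample in the history contributes, which is exactly the long-range dependence the abstract refers to.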

Updated: 2025-10-16 18:31:20

Categories: cs.SY,cs.LG,math.OC

Download: http://arxiv.org/abs/2502.04649v5

Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling

Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs. Our method decomposes malicious queries into visually innocuous storytelling elements using an auxiliary LLM, generates corresponding image sequences through diffusion models, and exploits the models' reliance on narrative coherence to elicit harmful outputs. Extensive experiments on harmful textual queries from established safety benchmarks show that our approach achieves an average attack success rate of 83.5\%, surpassing prior state-of-the-art by 46\%. Compared with existing visual jailbreak methods, our sequential narrative strategy demonstrates superior effectiveness across diverse categories of harmful content. We further analyze attack patterns, uncover key vulnerability factors in multimodal safety mechanisms, and evaluate the limitations of current defense strategies against narrative-driven attacks, revealing significant gaps in existing protections.

Updated: 2025-10-16 18:30:26

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.15068v1

Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution

Error attribution in Large Language Model (LLM) multi-agent systems presents a significant challenge in debugging and improving collaborative AI systems. Current approaches to pinpointing agent and step level failures in interaction traces - whether using all-at-once evaluation, step-by-step analysis, or binary search - fall short when analyzing complex patterns, struggling with both accuracy and consistency. We present ECHO (Error attribution through Contextual Hierarchy and Objective consensus analysis), a novel algorithm that combines hierarchical context representation, objective analysis-based evaluation, and consensus voting to improve error attribution accuracy. Our approach leverages a positional-based leveling of contextual understanding while maintaining objective evaluation criteria, ultimately reaching conclusions through a consensus mechanism. Experimental results demonstrate that ECHO outperforms existing methods across various multi-agent interaction scenarios, showing particular strength in cases involving subtle reasoning errors and complex interdependencies. Our findings suggest that leveraging these concepts of structured, hierarchical context representation combined with consensus-based objective decision-making, provides a more robust framework for error attribution in multi-agent systems.
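The consensus stage can be sketched as a majority vote over independent evaluator verdicts, each blaming one (agent, step) point in the interaction trace. The verdict names and tallying below are illustrative assumptions, not ECHO's exact mechanism:

```python
from collections import Counter

def consensus_attribution(verdicts):
    """Majority vote over independent evaluator verdicts; each verdict is an
    (agent, step) pair blaming one point in the interaction trace. Returns the
    winning attribution and the fraction of evaluators that agreed with it."""
    tally = Counter(verdicts)
    (blamed, votes), = tally.most_common(1)
    return blamed, votes / len(verdicts)

# Hypothetical verdicts from five evaluation passes over the same trace.
verdicts = [("planner", 3), ("planner", 3), ("coder", 7), ("planner", 3), ("coder", 5)]
blamed, agreement = consensus_attribution(verdicts)
```

The agreement fraction doubles as a confidence signal: attributions that win with low agreement can be flagged for re-evaluation rather than reported outright.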

Updated: 2025-10-16 18:25:19

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2510.04886v2

Physical Layer Deception based on Semantic Distortion

Physical layer deception (PLD) is a framework we previously introduced that integrates physical layer security (PLS) with deception techniques, enabling proactive countermeasures against eavesdropping rather than relying solely on passive defense. We extend this framework to a semantic communication model and conduct a theoretical analysis using semantic distortion as the performance metric. In this work, we further investigate the receiver's selection of decryption strategies and the transmitter's optimization of encryption strategies. By anticipating the decryption strategy likely to be employed by the legitimate receiver and eavesdropper, the transmitter can optimize resource allocation and encryption parameters, thereby maximizing the semantic distortion at the eavesdropper while maintaining a low level of semantic distortion for the legitimate receiver. We present a rigorous analysis of the resulting optimization problem, propose an efficient optimization algorithm, and derive closed-form optimal solutions for multiple scenarios. Finally, we corroborate the theoretical findings with numerical simulations, which also confirm the practicality of the proposed algorithm.

Updated: 2025-10-16 18:23:35

Categories: cs.CR,cs.IT,math.IT

Download: http://arxiv.org/abs/2510.15063v1

Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models

Widespread LLM adoption has introduced characteristic repetitive phraseology, termed ``slop,'' which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000$\times$ more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90\% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
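
The backtracking idea behind the Antislop Sampler can be sketched in a few lines (a toy illustration under assumed interfaces; the real sampler works on model logits, not ranked string lists): decoding refuses any token that would complete a banned phrase, and when every candidate at a position is refused, it rewinds one token and blocks the choice that led into the dead end.

```python
def sample_with_backtracking(candidates, banned, max_len):
    """Greedy decoding that backtracks to avoid emitting banned sequences.

    `candidates(prefix)` returns tokens ranked best-first for the prefix;
    `banned` is a set of token tuples that must never appear as a suffix.
    """
    out, blocked = [], {}  # blocked: position -> tokens refused there
    while len(out) < max_len:
        pos = len(out)
        choice = None
        for tok in candidates(tuple(out)):
            if tok in blocked.get(pos, set()):
                continue
            if all(tuple(out + [tok])[-len(b):] != b for b in banned):
                choice = tok
                break
        if choice is None:  # every candidate leads into a banned phrase
            if not out:
                break
            bad = out.pop()                      # rewind one token
            blocked.setdefault(len(out), set()).add(bad)
            blocked.pop(len(out) + 1, None)      # clear stale blocks
            continue
        out.append(choice)
    return out

# Toy "model": from the start it prefers "a", but "a" only leads to "x",
# and the phrase ("a", "x") is banned, forcing a backtrack to "b".
def toy_candidates(prefix):
    if prefix == ():
        return ["a", "b"]
    if prefix == ("a",):
        return ["x"]
    return ["ok"]

text = sample_with_backtracking(toy_candidates, {("a", "x")}, max_len=2)
```

Note the vocabulary itself is never shrunk: "x" remains available in contexts where it does not complete a banned phrase, which is the property that distinguishes backtracking from blunt token banning.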

Updated: 2025-10-16 18:22:22

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2510.15061v1

Retrieval-Augmented Test Generation: How Far Are We?

Retrieval Augmented Generation (RAG) has advanced software engineering tasks but remains underexplored in unit test generation. To bridge this gap, we investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs and analyze the impact of different knowledge sources on their effectiveness. We examine three domain-specific sources for RAG: (1) API documentation (official guidelines), (2) GitHub issues (developer-reported resolutions), and (3) StackOverflow Q&As (community-driven solutions). Our study focuses on five widely used Python-based ML/DL libraries, TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, targeting the most-used APIs. We evaluate four state-of-the-art LLMs -- GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B -- across three strategies: basic instruction prompting, Basic RAG, and API-level RAG. Quantitatively, we assess syntactical and dynamic correctness and line coverage. While RAG does not enhance correctness, it improves line coverage by 6.5% on average. We found that GitHub issues result in the best improvement in line coverage by providing edge cases from various issues. We also found that these generated unit tests can help detect new bugs. Specifically, 28 bugs were detected, 24 unique bugs were reported to developers, ten were confirmed, four were rejected, and ten are awaiting developers' confirmation. Our findings highlight RAG's potential in unit test generation for improving test coverage with well-targeted knowledge sources. Future work should focus on retrieval techniques that identify documents with unique program states to optimize RAG-based unit test generation further.
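
A minimal sketch of the RAG setup described above, with a word-overlap retriever standing in for a real embedding-based one (the knowledge snippets and function names are invented for illustration):

```python
def retrieve(query, docs, k=2):
    """Rank documents by word overlap with the query, a dependency-free
    stand-in for the embedding-based retriever a real RAG pipeline uses."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(api_name, context_docs):
    """Assemble a test-generation prompt from the retrieved snippets."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (f"Using the context below, write a unit test for `{api_name}`.\n"
            f"Context:\n{context}\n")

# Invented knowledge snippets in the flavours the study compares.
knowledge = [
    "tf.concat raises InvalidArgumentError when tensor ranks differ",
    "GitHub issue: tf.concat on an empty tensor list returns an error",
    "pandas merge supports how='outer' for full outer joins",
]
top = retrieve("tf.concat error empty tensors", knowledge)
prompt = build_prompt("tf.concat", top)
```

The study's finding that GitHub issues help most is visible even in this toy: the issue snippet carries the edge case (an empty tensor list) that the official documentation line does not.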

Updated: 2025-10-16 18:18:55

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2409.12682v2

Uncertainty Quantification for Physics-Informed Neural Networks with Extended Fiducial Inference

Uncertainty quantification (UQ) in scientific machine learning is increasingly critical as neural networks are widely adopted to tackle complex problems across diverse scientific disciplines. For physics-informed neural networks (PINNs), a prominent model in scientific machine learning, uncertainty is typically quantified using Bayesian or dropout methods. However, both approaches suffer from a fundamental limitation: the prior distribution or dropout rate required to construct honest confidence sets cannot be determined without additional information. In this paper, we propose a novel method within the framework of extended fiducial inference (EFI) to provide rigorous uncertainty quantification for PINNs. The proposed method leverages a narrow-neck hyper-network to learn the parameters of the PINN and quantify their uncertainty based on imputed random errors in the observations. This approach overcomes the limitations of Bayesian and dropout methods, enabling the construction of honest confidence sets based solely on observed data. This advancement represents a significant breakthrough for PINNs, greatly enhancing their reliability, interpretability, and applicability to real-world scientific and engineering challenges. Moreover, it establishes a new theoretical framework for EFI, extending its application to large-scale models, eliminating the need for sparse hyper-networks, and significantly improving the automaticity and robustness of statistical inference.

Updated: 2025-10-16 18:18:54

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2505.19136v2

The Minimax Lower Bound of Kernel Stein Discrepancy Estimation

Kernel Stein discrepancies (KSDs) have emerged as a powerful tool for quantifying goodness-of-fit over the last decade, featuring numerous successful applications. To the best of our knowledge, all existing KSD estimators with known rate achieve $\sqrt n$-convergence. In this work, we present two complementary results (with different proof strategies), establishing that the minimax lower bound of KSD estimation is $n^{-1/2}$ and settling the optimality of these estimators. Our first result focuses on KSD estimation on $\mathbb R^d$ with the Langevin-Stein operator; our explicit constant for the Gaussian kernel indicates that the difficulty of KSD estimation may increase exponentially with the dimensionality $d$. Our second result settles the minimax lower bound for KSD estimation on general domains.
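
For reference, the quantity whose estimation rate is being bounded can be written in its standard form (textbook notation, not reproduced from the paper): with the Langevin-Stein operator $\mathcal{A}_q$ and an RKHS $\mathcal{H}_k$ with kernel $k$,

```latex
% Langevin-Stein operator applied to a test function f : R^d -> R^d
\mathcal{A}_q f(x) = f(x)^\top \nabla_x \log q(x) + \nabla_x \cdot f(x)

% KSD: worst-case expected Stein discrepancy over the unit ball of H_k
\mathrm{KSD}(q, p) = \sup_{\|f\|_{\mathcal{H}_k} \le 1}
    \mathbb{E}_{X \sim p}\!\left[ \mathcal{A}_q f(X) \right]
```

The paper's result states that no estimator of this quantity from $n$ samples of $p$ can converge faster than $n^{-1/2}$ in the minimax sense, so the known $\sqrt n$-consistent estimators are rate-optimal.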

Updated: 2025-10-16 18:16:05

Categories: stat.ML,cs.LG,math.ST,stat.TH,62C20 (Primary) 46E22, 62B10 (Secondary),G.3; H.1.1; I.2.6

Download: http://arxiv.org/abs/2510.15058v1

Learn to Change the World: Multi-level Reinforcement Learning with Model-Changing Actions

Reinforcement learning usually assumes a given or sometimes even fixed environment in which an agent seeks an optimal policy to maximize its long-term discounted reward. In contrast, we consider agents that are not limited to passive adaptations: they instead have model-changing actions that actively modify the RL model of world dynamics itself. Reconfiguring the underlying transition processes can potentially increase the agents' rewards. Motivated by this setting, we introduce the multi-layer configurable time-varying Markov decision process (MCTVMDP). In an MCTVMDP, the lower-level MDP has a non-stationary transition function that is configurable through upper-level model-changing actions. The agent's objective consists of two parts: Optimize the configuration policies in the upper-level MDP and optimize the primitive action policies in the lower-level MDP to jointly improve its expected long-term reward.
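
A toy sketch of the two-level structure (hypothetical class and state names, far simpler than the paper's formalism): an upper-level, model-changing action swaps the transition function that the lower-level agent then acts under.

```python
class ConfigurableMDP:
    """Toy two-level environment: an upper-level `configure` action is
    model-changing, swapping the transition table that the lower-level
    primitive actions run under."""

    def __init__(self, transitions):
        # transitions: config_id -> {(state, action): (next_state, reward)}
        self.transitions = transitions
        self.config = 0
        self.state = "s0"

    def configure(self, config_id):
        """Upper-level action: reconfigure the world dynamics."""
        self.config = config_id

    def step(self, action):
        """Lower-level primitive action under the current configuration."""
        self.state, reward = self.transitions[self.config][(self.state, action)]
        return self.state, reward

# Configuration 1 reshapes the world so the same primitive action pays more.
transitions = {
    0: {("s0", "go"): ("s1", 1.0)},
    1: {("s0", "go"): ("s1", 5.0)},
}
env = ConfigurableMDP(transitions)
env.configure(1)
s, r = env.step("go")
```

The agent's joint objective in this caricature is to choose both the configuration policy (which table to install) and the primitive policy (which actions to take) to maximize long-term reward.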

Updated: 2025-10-16 18:13:42

Categories: cs.LG

Download: http://arxiv.org/abs/2510.15056v1

Your AI, Not Your View: The Bias of LLMs in Investment Analysis

In finance, Large Language Models (LLMs) face frequent knowledge conflicts arising from discrepancies between their pre-trained parametric knowledge and real-time market data. These conflicts are especially problematic in real-world investment services, where a model's inherent biases can misalign with institutional objectives, leading to unreliable recommendations. Despite this risk, the intrinsic investment biases of LLMs remain underexplored. We propose an experimental framework to investigate emergent behaviors in such conflict scenarios, offering a quantitative analysis of bias in LLM-based investment analysis. Using hypothetical scenarios with balanced and imbalanced arguments, we extract the latent biases of models and measure their persistence. Our analysis, centered on sector, size, and momentum, reveals distinct, model-specific biases. Across most models, a tendency to prefer technology stocks, large-cap stocks, and contrarian strategies is observed. These foundational biases often escalate into confirmation bias, causing models to cling to initial judgments even when faced with increasing counter-evidence. A public leaderboard benchmarking bias across a broader set of models is available at https://linqalpha.com/leaderboard

Updated: 2025-10-16 18:06:41

Categories: q-fin.PM,cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.20957v4

Internalizing World Models via Self-Play Finetuning for Agentic RL

Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe Pass@k--the probability that at least one of $k$ sampled trajectories succeeds--drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
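
The Pass@k metric cited above is commonly computed with the unbiased estimator from the code-generation literature (an assumption here; the paper may estimate it differently): given n recorded rollouts of which c succeeded,

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k: probability that at least one of k attempts drawn
    without replacement from n recorded rollouts (c of them successful)
    succeeds, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A falling Pass@k across training steps, as the abstract reports for vanilla RL, means even the best of k samples stops finding solutions, which is the exploration-collapse signal SPA's world-model cold start is designed to prevent.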

Updated: 2025-10-16 18:03:39

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2510.15047v1

IQNN-CS: Interpretable Quantum Neural Network for Credit Scoring

Credit scoring is a high-stakes task in financial services, where model decisions directly impact individuals' access to credit and are subject to strict regulatory scrutiny. While Quantum Machine Learning (QML) offers new computational capabilities, its black-box nature poses challenges for adoption in domains that demand transparency and trust. In this work, we present IQNN-CS, an interpretable quantum neural network framework designed for multiclass credit risk classification. The architecture combines a variational QNN with a suite of post-hoc explanation techniques tailored for structured data. To address the lack of structured interpretability in QML, we introduce Inter-Class Attribution Alignment (ICAA), a novel metric that quantifies attribution divergence across predicted classes, revealing how the model distinguishes between credit risk categories. Evaluated on two real-world credit datasets, IQNN-CS demonstrates stable training dynamics, competitive predictive performance, and enhanced interpretability. Our results highlight a practical path toward transparent and accountable QML models for financial decision-making.
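
The exact definition of ICAA is in the paper; one plausible instantiation of an attribution-divergence score, shown purely for illustration, is the mean pairwise cosine distance between per-class attribution vectors:

```python
import math

def attribution_divergence(class_attrs):
    """Mean pairwise cosine distance between per-class attribution vectors.

    A hypothetical ICAA-like score, not the paper's formula: higher values
    mean the model grounds different predicted classes in different input
    features."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    n = len(class_attrs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1.0 - cosine(class_attrs[i], class_attrs[j])
               for i, j in pairs) / len(pairs)
```

Under this reading, a score near zero would flag a model that cannot distinguish risk categories by their evidence, while a high score indicates class-specific, and hence more interpretable, attributions.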

Updated: 2025-10-16 18:02:03

Categories: cs.LG,quant-ph

Download: http://arxiv.org/abs/2510.15044v1

Comprehensive language-image pre-training for 3D medical image understanding

Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.

Updated: 2025-10-16 18:01:31

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.15042v1

Composition-Grounded Instruction Synthesis for Visual Reasoning

Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

Updated: 2025-10-16 18:00:48

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.15040v1

On the Interaction of Compressibility and Adversarial Robustness

Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
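
The "small number of highly sensitive directions" intuition can be made concrete for a single linear layer: the top right-singular vector of the weight matrix is the input direction the layer amplifies most, a natural target for a perturbation. A dependency-free power-iteration sketch (illustrative only, not the paper's attack construction):

```python
def top_sensitive_direction(w, iters=100):
    """Power iteration on W^T W: returns the unit input direction that a
    linear layer W amplifies the most. Spectrally concentrated weights make
    this direction strongly dominant, the kind of sensitive axis the
    abstract argues adversaries can exploit."""
    m, n = len(w), len(w[0])
    v = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        # wv = W v, then u = W^T (W v)
        wv = [sum(w[i][j] * v[j] for j in range(n)) for i in range(m)]
        u = [sum(w[i][j] * wv[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in u) ** 0.5
        v = [x / norm for x in u]
    return v

# A layer that stretches the first input axis 3x and leaves the second alone:
direction = top_sensitive_direction([[3.0, 0.0], [0.0, 1.0]])
```

The wider the gap between the top singular value and the rest (i.e. the more spectrally compressible the layer), the more a perturbation budget concentrated along this one direction pays off, which is the tension the paper formalizes.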

Updated: 2025-10-16 18:00:46

Categories: cs.LG,cs.AI,cs.CV,stat.ML

Download: http://arxiv.org/abs/2507.17725v2

AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport

Flow-based Generative Models (FGMs) effectively transform noise into complex data distributions. Incorporating Optimal Transport (OT) to couple noise and data during FGM training has been shown to improve the straightness of flow trajectories, enabling more effective inference. However, existing OT-based methods estimate the OT plan using (mini-)batches of sampled noise and data points, which limits their scalability to large and high-dimensional datasets in FGMs. This paper introduces AlignFlow, a novel approach that leverages Semi-Discrete Optimal Transport (SDOT) to enhance the training of FGMs by establishing an explicit, optimal alignment between noise distribution and data points with guaranteed convergence. SDOT computes a transport map by partitioning the noise space into Laguerre cells, each mapped to a corresponding data point. During FGM training, i.i.d. noise samples are paired with data points via the SDOT map. AlignFlow scales well to large datasets and model architectures with negligible computational overhead. Experimental results show that AlignFlow improves the performance of a wide range of state-of-the-art FGM algorithms and can be integrated as a plug-and-play component. Code is available at: https://github.com/konglk1203/AlignFlow.
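
The Laguerre-cell transport map at the heart of SDOT can be sketched directly (a toy version that takes the dual weights psi as given; computing them at scale is the part the paper addresses):

```python
def laguerre_assign(z, data, psi):
    """Semi-discrete OT transport map: noise sample z is sent to the data
    point whose Laguerre cell contains it, i.e. the index minimizing
    ||z - x_i||^2 / 2 - psi_i. With equal dual weights psi this reduces to
    plain nearest-neighbour assignment; tuned weights rebalance cell masses
    so each data point receives its fair share of noise."""
    def score(i):
        return 0.5 * sum((a - b) ** 2 for a, b in zip(z, data[i])) - psi[i]
    return min(range(len(data)), key=score)

# Two data points: raising psi[1] grows the second point's Laguerre cell
# until it captures a sample that nearest-neighbour would send to the first.
data = [(0.0, 0.0), (4.0, 0.0)]
```

During training, each i.i.d. noise sample would be paired with `data[laguerre_assign(z, data, psi)]`, giving the explicit noise-data coupling that replaces mini-batch OT estimates.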

Updated: 2025-10-16 18:00:43

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.15038v1

Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
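
The coupling mechanism can be caricatured in one dimension (toy step functions stand in for diffusion denoising updates, and the coupling weight is an assumed free parameter): two trajectories each follow their own distribution's update while being nudged toward one another.

```python
def coupled_sampling(x_mv, x_2d, step_mv, step_2d, coupling, n_steps):
    """Advance two sampling trajectories in lockstep: each takes its own
    distribution's update plus a coupling pull toward the other trajectory."""
    for _ in range(n_steps):
        nxt_mv = step_mv(x_mv) + coupling * (x_2d - x_mv)
        nxt_2d = step_2d(x_2d) + coupling * (x_mv - x_2d)
        x_mv, x_2d = nxt_mv, nxt_2d
    return x_mv, x_2d

# Toy 1-D "distributions": the multi-view prior pulls toward 1.0, the
# 2D-edited distribution toward 3.0; coupling keeps the samples consistent.
mv_step = lambda x: x + 0.5 * (1.0 - x)
ed_step = lambda x: x + 0.5 * (3.0 - x)
a, b = coupled_sampling(0.0, 4.0, mv_step, ed_step, coupling=0.4, n_steps=50)
```

The two trajectories settle between their respective modes rather than at them: the coupling term trades per-distribution fidelity for cross-trajectory consistency, which is exactly the compromise the method makes between faithful 2D edits and multi-view agreement.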

Updated: 2025-10-16 17:59:59

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14981v1

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (i) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (ii) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

Updated: 2025-10-16 17:59:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14979v1

Agentic Design of Compositional Machines

The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.

Updated: 2025-10-16 17:59:58

Categories: cs.AI,cs.CL,cs.CV,cs.GR,cs.LG

Download: http://arxiv.org/abs/2510.14980v1

Learning an Image Editing Model without Image Editing Pairs

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

Updated: 2025-10-16 17:59:57

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14978v1

Terra: Explorable Native 3D World Model with Point Latents

World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.

Updated: 2025-10-16 17:59:56

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14977v1

WithAnyone: Towards Controllable and ID Consistent Image Generation

Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.

Updated: 2025-10-16 17:59:54

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14975v1

pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
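The imitation distillation objective is a standard $\ell_2$ flow-matching loss evaluated along the policy's trajectory. As a rough illustration only (the sampled points and function signatures here are our invention, not the paper's implementation), such a loss can be sketched as:

```python
# Toy sketch of an l2 flow-matching loss: average squared difference between a
# student's predicted velocity and the teacher's velocity at sampled (x, t) points.
def fm_loss(student_v, teacher_v, points):
    return sum((student_v(x, t) - teacher_v(x, t)) ** 2 for x, t in points) / len(points)

# Example: a student that ignores the time input is penalized against a teacher
# whose velocity is x + t.
loss = fm_loss(lambda x, t: x, lambda x, t: x + t, [(0.0, 1.0), (1.0, 0.5)])
```

The paper's distinctive choice is where this loss is evaluated: along the policy's own ODE trajectory, so the student simply mimics the teacher's behavior there.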

Updated: 2025-10-16 17:59:51

Domains: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.14974v1

Attention Is All You Need for KV Cache in Diffusion LLMs

This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
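The "when to refresh" decision can be pictured with a small sketch. The distance measure, threshold, and data layout below are our choices for illustration, not the paper's: the idea is to test drift only on the most-attended token, since observation (3) says it drifts least and thus conservatively bounds the others.

```python
# Hypothetical sketch of an attention-aware drift test: refresh deeper-layer
# caches only when the most-attended token's cached key/value vector has
# drifted beyond a tolerance (cosine distance), since that token exhibits the
# smallest KV drift and lower-bounds the cache change of other tokens.
def should_refresh(cached_kv, fresh_kv, attn_weights, tau=0.02):
    star = max(range(len(attn_weights)), key=lambda i: attn_weights[i])
    a, b = cached_kv[star], fresh_kv[star]
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return 1.0 - dot / (na * nb) > tau  # cosine-distance drift test
```

An unchanged cache (distance zero) is reused; a refresh is triggered only once the most-attended token's vectors actually move.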

Updated: 2025-10-16 17:59:48

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14973v1

TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
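The whitespace sensitivity is easy to reproduce with a toy greedy longest-match subword tokenizer. The vocabulary below is invented for illustration (real BPE merges are learned from corpus statistics), but it shows how two semantically identical snippets tokenize differently:

```python
# Toy statistics-style subword vocabulary; longest-match tokenization means a
# whitespace change alters the token sequence of semantically identical code.
VOCAB = ["return ", "return", " a", "a", "+", " +", "b", " b", " ", "(", ")"]

def tokenize(text):
    toks, i = [], 0
    while i < len(text):
        for v in sorted(VOCAB, key=len, reverse=True):  # greedy longest match
            if text.startswith(v, i):
                toks.append(v)
                i += len(v)
                break
        else:  # unknown character falls back to a single-char token
            toks.append(text[i])
            i += 1
    return toks

tokenize("return a+b")    # ['return ', 'a', '+', 'b']
tokenize("return a + b")  # ['return ', 'a', ' +', ' b']
```

A grammar-aware tokenizer would emit the same operator and identifier tokens for both spellings; the statistical one does not.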

Updated: 2025-10-16 17:59:45

Domains: cs.CL,cs.AI,cs.LG,cs.PL,cs.SE

Download: http://arxiv.org/abs/2510.14972v1

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in both human annotation, infra and engineering perspectives. To this end, we introduce $\textbf{UI-Simulator}$, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose $\textbf{UI-Simulator-Grow}$, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizes informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of targeted synthesis scaling paradigm to continuously and efficiently enhance the digital agents.

Updated: 2025-10-16 17:59:38

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14969v1

Biology-informed neural networks learn nonlinear representations from omics data to improve genomic prediction and interpretability

We extend biologically-informed neural networks (BINNs) for genomic prediction (GP) and selection (GS) in crops by integrating thousands of single-nucleotide polymorphisms (SNPs) with multi-omics measurements and prior biological knowledge. Traditional genotype-to-phenotype (G2P) models depend heavily on direct mappings that achieve only modest accuracy, forcing breeders to conduct large, costly field trials to maintain or marginally improve genetic gain. Models that incorporate intermediate molecular phenotypes such as gene expression can achieve higher predictive fit, but they remain impractical for GS since such data are unavailable at deployment or design time. BINNs overcome this limitation by encoding pathway-level inductive biases and leveraging multi-omics data only during training, while using genotype data alone during inference. Applied to maize gene-expression and multi-environment field-trial data, BINN improves rank-correlation accuracy by up to 56% within and across subpopulations under sparse-data conditions and nonlinearly identifies genes that GWAS/TWAS fail to uncover. With complete domain knowledge for a synthetic metabolomics benchmark, BINN reduces prediction error by 75% relative to conventional neural nets and correctly identifies the most important nonlinear pathway. Importantly, both cases show highly sensitive BINN latent variables correlate with the experimental quantities they represent, despite not being trained on them. This suggests BINNs learn biologically-relevant representations, nonlinear or linear, from genotype to phenotype. Together, BINNs establish a framework that leverages intermediate domain information to improve genomic prediction accuracy and reveal nonlinear biological relationships that can guide genomic selection, candidate gene selection, pathway enrichment, and gene-editing prioritization.
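The pathway-level inductive bias can be pictured as a linear layer whose connectivity is fixed by a binary mask derived from prior knowledge. The sketch below is a hypothetical illustration (mask, weights, and sizes are made up, and the paper's architecture is richer than one layer):

```python
# Hypothetical pathway-masked layer: a fixed binary mask from prior biological
# knowledge zeroes out SNP-to-pathway connections, so each hidden unit only
# sees the SNPs annotated to its pathway.
def masked_forward(x, W, mask, b):
    n_in, n_out = len(x), len(b)
    return [b[j] + sum(x[i] * W[i][j] * mask[i][j] for i in range(n_in))
            for j in range(n_out)]

# With a diagonal mask, unit 0 sees only SNP 0 and unit 1 sees only SNP 1.
out = masked_forward([1.0, 2.0], [[1.0, 1.0], [1.0, 1.0]], [[1, 0], [0, 1]], [0.0, 0.0])
```

Because each latent unit is wired to a named pathway, its activation can be compared against the experimental quantity it represents, which is what makes the reported latent-variable correlations interpretable.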

Updated: 2025-10-16 17:59:38

Domains: cs.LG

Download: http://arxiv.org/abs/2510.14970v1

RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

To tackle long-horizon tasks, recent hierarchical vision-language-action (VLAs) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic subtasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.

Updated: 2025-10-16 17:59:37

Domains: cs.RO,cs.AI,cs.CV,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2510.14968v1

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
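The turn-level reward admits a very small sketch. The helper below is our toy illustration (in the paper these probabilities come from the policy's own belief updates about the ground-truth answer, not from a given list):

```python
# Toy illustration of IGPO-style dense rewards: the reward for turn t is the
# marginal increase in the policy's probability of producing the correct answer.
def turn_rewards(answer_probs):
    """answer_probs[0] is p(correct) before any turn; answer_probs[t] after turn t."""
    return [answer_probs[t + 1] - answer_probs[t] for t in range(len(answer_probs) - 1)]

# A turn that makes the correct answer less likely receives a negative reward,
# giving fine-grained credit assignment without an external reward model.
rewards = turn_rewards([0.10, 0.25, 0.60, 0.58])
```

These intrinsic per-turn rewards are then combined with the usual outcome-level reward to form the dense trajectories used for training.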

Updated: 2025-10-16 17:59:32

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14967v1

Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores

Pairwise comparisons of large language models using total variation distance mutual information (TVD-MI) produce binary critic decisions per pair. We show that averaging TVD-MI's binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. Maximum-likelihood approaches to IRT use logistic links, but we find empirically that these transformations introduce curvature that breaks additivity: across three domains, the identity link yields median curl on raw data of 0.080-0.150 (P95 = [0.474, 0.580]), whereas probit/logit introduce substantially higher violations (median [0.245, 0.588], P95 [0.825, 2.252]). We derive this clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation. At 33% coverage, we achieve holdout RMSE $0.117 \pm 0.008$ while preserving agent rankings (Spearman $\rho = 0.972 \pm 0.015$), three times fewer evaluations than full dense. Judge robustness analysis (GPT-4o-mini vs. Llama3-70b) shows strong agreement in agent rankings ($\rho = 0.872$) and consistent identity-link advantage. TVD-MI's geometry is best preserved by identity mapping for efficient LLM evaluation, applicable to other bounded-response domains.
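The clipped-linear identity-link model can be sketched as a box-constrained least-squares fit. Everything below is a toy projected-gradient illustration with made-up agents, items, and scores; the paper's estimator and TVD-MI pipeline are more involved:

```python
# Toy sketch of the clipped-linear (identity-link) model: fit additive scores
# s_ij ≈ clip(0.5 + a_i + d_j, 0, 1) to pairwise data by least squares, with
# the gradient gated off in the saturated (clipped) regions.
def fit_additive(scores, n_agents, n_items, lr=0.1, steps=2000):
    a = [0.0] * n_agents  # agent abilities
    d = [0.0] * n_items   # item offsets
    for _ in range(steps):
        ga = [0.0] * n_agents
        gd = [0.0] * n_items
        for (i, j), s in scores.items():
            z = 0.5 + a[i] + d[j]
            err = min(1.0, max(0.0, z)) - s
            if 0.0 < z < 1.0:  # zero gradient where the clip saturates
                ga[i] += err
                gd[j] += err
        a = [x - lr * g for x, g in zip(a, ga)]
        d = [x - lr * g for x, g in zip(d, gd)]
    return a, d

# Synthetic check: scores generated from abilities (0.3, 0.0, -0.2); ability
# differences are recovered even though a common shift is unidentifiable.
scores = {(0, 0): 0.85, (0, 1): 0.75, (1, 0): 0.55,
          (1, 1): 0.45, (2, 0): 0.35, (2, 1): 0.25}
a, d = fit_additive(scores, n_agents=3, n_items=2)
```

The key property the paper exploits is that no nonlinear link is applied before averaging, so the fitted scores keep the additive structure that probit/logit transformations would bend.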

Updated: 2025-10-16 17:59:25

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.14966v1

Are Large Reasoning Models Interruptible?

Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information. Project Page: http://dynamic-lm.github.io/

Updated: 2025-10-16 17:59:24

Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.11713v3

Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.

Updated: 2025-10-16 17:59:07

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2510.14961v1

C4D: 4D Made from 3D through Dual Correspondences

Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D

Updated: 2025-10-16 17:59:06

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14960v1

CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed \emph{online} via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs \emph{in training}. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, (2) and safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
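For intuition, the safety filter that CBF-RL moves from deployment into training has a closed form in simple cases. The system below is our toy example, not the paper's setup: for a 1D single integrator $\dot{x} = u$ with barrier $h(x) = x$ (keep $x \ge 0$), the pointwise QP $\min_u \|u - u_{\text{nom}}\|^2$ s.t. $\dot{h} \ge -\alpha h$ reduces to a clamp.

```python
# Toy closed-form CBF filter for a 1D single integrator x' = u with h(x) = x:
# the constraint h' = u >= -alpha * h(x) makes the least-squares-nearest safe
# input a simple clamp of the nominal (e.g. RL) action.
def cbf_filter(u_nom, x, alpha=1.0):
    return max(u_nom, -alpha * x)

# Near the boundary (x = 0.5), an unsafe command -2.0 is clipped to -0.5,
# while the already-safe command 1.0 passes through unchanged.
```

Applying such a filter to policy rollouts during training is what lets the learned policy internalize the constraint and run without a runtime filter.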

Updated: 2025-10-16 17:58:58

Domains: cs.RO,cs.AI,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2510.14959v1

RealDPO: Real or Not Real, that is the Preference

Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
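For reference, the standard DPO loss on a (preferred, dispreferred) pair is shown below; RealDPO tailors this objective and draws its preferred samples from real-world videos, so treat this as background rather than the paper's exact loss.

```python
import math

# Standard DPO loss: -log sigmoid of the implicit reward margin between the
# preferred (w) and dispreferred (l) sample, relative to a reference policy.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, the margin is zero, and the loss is log 2; training drives the loss down by widening the margin toward the real-video positives.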

Updated: 2025-10-16 17:58:25

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14955v1

DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.

Updated: 2025-10-16 17:56:55

Categories: cs.CL,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14949v1

Architecture Is All You Need: Diversity-Enabled Sweet Spots for Robust Humanoid Locomotion

Robust humanoid locomotion in unstructured environments requires architectures that balance fast low-level stabilization with slower perceptual decision-making. We show that a simple layered control architecture (LCA), a proprioceptive stabilizer running at high rate, coupled with a compact low-rate perceptual policy, enables substantially more robust performance than monolithic end-to-end designs, even when using minimal perception encoders. Through a two-stage training curriculum (blind stabilizer pretraining followed by perceptual fine-tuning), we demonstrate that layered policies consistently outperform one-stage alternatives in both simulation and hardware. On a Unitree G1 humanoid, our approach succeeds across stair and ledge tasks where one-stage perceptual policies fail. These results highlight that architectural separation of timescales, rather than network scale or complexity, is the key enabler for robust perception-conditioned locomotion.
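
The timescale separation the abstract argues for can be illustrated with a minimal two-rate loop: a compact perceptual policy refreshes its output only every `decimation` control ticks, while the proprioceptive stabilizer runs at the full rate. This is a schematic of the architecture, not the paper's networks:

```python
def run_layered_control(steps, stabilizer, perception, decimation=10):
    """Two-timescale layered control loop (LCA-style).

    `perception` is the slow, low-rate policy; `stabilizer` is the fast,
    high-rate controller conditioned on the latest perceptual latent.
    """
    latent = None
    trace = []
    for t in range(steps):
        if t % decimation == 0:
            latent = perception(t)           # slow: perceptual decision
        trace.append(stabilizer(t, latent))  # fast: low-level stabilization
    return trace
```

The two-stage curriculum in the paper maps onto this split: pretrain `stabilizer` blind, then fine-tune `perception` on top of it.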

Updated: 2025-10-16 17:56:08

Categories: cs.RO,cs.AI,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2510.14947v1

MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics

Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.

Updated: 2025-10-16 17:55:14

Categories: cs.CL,cs.AI,cs.CE

Download: http://arxiv.org/abs/2510.14944v1

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
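
The self-rewarding score described in the abstract is cheap to compute. A minimal sketch, taking the form directly from the abstract (the `beta` and `const` values here are placeholders for the KL coefficient and the pre-calculated constant):

```python
def last_token_self_reward(logp_specified_token, const, beta):
    """Last-token self-rewarding score: the policy's next-token log-prob
    for a pre-specified token at the solution's last position, minus a
    pre-calculated constant, scaled by the KL coefficient beta."""
    return beta * (logp_specified_token - const)

def laser_aux_loss(logp_specified_token, verifier_reward, const, beta):
    """MSE term that aligns the self-reward with the verifier-based
    reasoning reward; LaSeR adds this on top of the usual RLVR loss."""
    diff = last_token_self_reward(logp_specified_token, const, beta) - verifier_reward
    return diff ** 2
```

Since the score reads off the already-predicted next-token distribution at the last position, it costs only one extra token of inference at test time.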

Updated: 2025-10-16 17:55:11

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14943v1

GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
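
The hybrid reward aggregation can be sketched as a fusion of the two signals per reasoning step. The convex mixing with weight `lam` below is an assumption for illustration; the paper's exact aggregation rule may differ:

```python
def hybrid_step_reward(tool_correct, mcts_value, lam=0.5):
    """Fuse an execution-grounded correctness signal from an external
    tool with MCTS-derived value feedback for one reasoning step.

    tool_correct: bool, whether the tool verified the step.
    mcts_value:   float in [-1, 1], value estimate from tree search.
    """
    tool_signal = 1.0 if tool_correct else -1.0
    return lam * tool_signal + (1.0 - lam) * mcts_value
```

The tool term anchors the reward in verifiable execution, while the MCTS term carries global outcome information back to the step.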

Updated: 2025-10-16 17:54:07

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14942v1

The Coverage Principle: How Pre-training Enables Post-Training

Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of \emph{coverage}, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of \emph{the coverage principle}, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: \emph{coverage generalizes faster than cross entropy}, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
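
Why coverage is the right quantity for Best-of-N is easy to see in the idealized case: if each independent sample lands on a high-quality response with probability equal to the coverage, and the selector is perfect, the success probability follows immediately. A minimal sketch under those two assumptions:

```python
def best_of_n_success(coverage, n):
    """Success probability of Best-of-N with a perfect selector.

    `coverage` is the probability mass the pre-trained model places on
    high-quality responses; samples are assumed independent.
    """
    return 1.0 - (1.0 - coverage) ** n
```

Even modest coverage (0.1) turns into high Best-of-N success at N = 32, which is the sense in which coverage is necessary and sufficient for test-time scaling to work.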

Updated: 2025-10-16 17:53:50

Categories: stat.ML,cs.AI,cs.CL,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2510.15020v1

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Concept Manipulation CALM, an inference-time method that suppresses harmful concepts by modifying latent representations of the last layer of the model, without retraining. Leveraging concept whitening technique from Computer Vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods in most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
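
The orthogonal-projection step of CALM can be sketched on its own: remove from a hidden state its component along an unwanted latent direction. This shows only the projection; CALM additionally uses concept whitening to identify the direction:

```python
import math

def project_out(h, v):
    """Orthogonally project hidden state h off the direction v,
    removing the latent component associated with a harmful concept."""
    norm = math.sqrt(sum(x * x for x in v))
    u = [x / norm for x in v]               # unit direction
    dot = sum(a * b for a, b in zip(h, u))  # component of h along v
    return [a - dot * b for a, b in zip(h, u)]
```

Applied to the last layer's representations at inference, this suppresses the harmful direction without retraining and at negligible compute cost.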

Updated: 2025-10-16 17:51:25

Categories: cs.LG

Download: http://arxiv.org/abs/2510.12672v2

The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

As the adoption of Generative AI in real-world services grows explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML.ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML.ENERGY Benchmark. We then highlight results from the early 2025 iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML.ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.

Updated: 2025-10-16 17:51:15

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.06371v2

Circuit Insights: Towards Interpretability Beyond Activations

The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.

Updated: 2025-10-16 17:49:41

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14936v1

Ctrl-VI: Controllable Video Synthesis via Variational Inference

Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop Ctrl-VI, a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

Updated: 2025-10-16 17:48:29

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.07670v2

Why is Your Language Model a Poor Implicit Reward Model?

Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.

Updated: 2025-10-16 17:45:44

Categories: cs.CL,cs.AI,cs.LG,stat.ML

Download: http://arxiv.org/abs/2507.07981v2

GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data

Although data that can be naturally represented as graphs is widespread in real-world applications across diverse industries, popular graph ML benchmarks for node property prediction only cover a surprisingly narrow set of data domains, and graph neural networks (GNNs) are often evaluated on just a few academic citation networks. This issue is particularly pressing in light of the recent growing interest in designing graph foundation models. These models are supposed to be able to transfer to diverse graph datasets from different domains, and yet the proposed graph foundation models are often evaluated on a very limited set of datasets from narrow applications. To alleviate this issue, we introduce GraphLand: a benchmark of 14 diverse graph datasets for node property prediction from a range of different industrial applications. GraphLand allows evaluating graph ML models on a wide range of graphs with diverse sizes, structural characteristics, and feature sets, all in a unified setting. Further, GraphLand allows investigating such previously underexplored research questions as how realistic temporal distributional shifts under transductive and inductive settings influence graph ML model performance. To mimic realistic industrial settings, we use GraphLand to compare GNNs with gradient-boosted decision trees (GBDT) models that are popular in industrial applications and show that GBDTs provided with additional graph-based input features can sometimes be very strong baselines. Further, we evaluate currently available general-purpose graph foundation models and find that they fail to produce competitive results on our proposed datasets.
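
One common way to give a GBDT "additional graph-based input features", as the abstract describes, is to concatenate each node's features with an aggregate of its neighbors' features. The mean aggregation below is one standard choice for illustration; the paper's exact feature construction is not specified in the abstract:

```python
def neighbor_mean_features(features, adjacency):
    """Concatenate each node's feature vector with the mean of its
    neighbors' feature vectors (zeros for isolated nodes).

    features:  list of equal-length feature lists, indexed by node id.
    adjacency: dict mapping node id -> list of neighbor node ids.
    """
    dim = len(features[0]) if features else 0
    out = []
    for node, feats in enumerate(features):
        nbrs = adjacency.get(node, [])
        if nbrs:
            agg = [sum(features[j][k] for j in nbrs) / len(nbrs)
                   for k in range(dim)]
        else:
            agg = [0.0] * dim
        out.append(list(feats) + agg)
    return out
```

The augmented rows can then be fed to any tabular model, which is how graph structure reaches a GBDT that otherwise sees nodes independently.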

Updated: 2025-10-16 17:45:31

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2409.14500v4

Autonomous Cyber Resilience via a Co-Evolutionary Arms Race within a Fortified Digital Twin Sandbox

The convergence of Information Technology and Operational Technology has exposed Industrial Control Systems to adaptive, intelligent adversaries that render static defenses obsolete. This paper introduces the Adversarial Resilience Co-evolution (ARC) framework, addressing the "Trinity of Trust" comprising model fidelity, data integrity, and analytical resilience. ARC establishes a co-evolutionary arms race within a Fortified Secure Digital Twin (F-SCDT), where a Deep Reinforcement Learning "Red Agent" autonomously discovers attack paths while an ensemble-based "Blue Agent" is continuously hardened against these threats. Experimental validation on the Tennessee Eastman Process (TEP) and Secure Water Treatment (SWaT) testbeds demonstrates superior performance in detecting novel attacks, with F1-scores improving from 0.65 to 0.89 and detection latency reduced from over 1200 seconds to 210 seconds. A comprehensive ablation study reveals that the co-evolutionary process itself contributes a 27% performance improvement. By integrating Explainable AI and proposing a Federated ARC architecture, this work presents a necessary paradigm shift toward dynamic, self-improving security for critical infrastructure.

Updated: 2025-10-16 17:44:09

Categories: cs.CR,cs.LG,cs.SY,eess.SY,68T05 (Primary) 93C40, 91A80, 68M25 (Secondary),C.2.0; I.2.11; I.2.6

Download: http://arxiv.org/abs/2506.20102v2

VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design

Large Language Models demonstrate substantial promise for advancing scientific discovery, yet their deployment in disciplines demanding factual precision and specialized domain constraints presents significant challenges. Within molecular design for pharmaceutical development, these models can propose innovative molecular modifications but frequently generate chemically infeasible structures. We introduce VALID-Mol, a comprehensive framework that integrates chemical validation with LLM-driven molecular design, achieving an improvement in valid chemical structure generation from 3% to 83%. Our methodology synthesizes systematic prompt optimization, automated chemical verification, and domain-adapted fine-tuning to ensure dependable generation of synthesizable molecules with enhanced properties. Our contribution extends beyond implementation details to provide a transferable methodology for scientifically-constrained LLM applications with measurable reliability enhancements. Computational analyses indicate our framework generates promising synthesis candidates with up to 17-fold predicted improvements in target binding affinity while preserving synthetic feasibility.

Updated: 2025-10-16 17:43:31

Categories: cs.LG,cs.AI,physics.chem-ph,q-bio.QM,68T50 (Primary) 92E10, 68T07 (Secondary),I.2.7; J.3; I.2.1; I.2.6

Download: http://arxiv.org/abs/2506.23339v2

UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos

Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer comparing to prior methods, accomplishing a 300 m real-world mission with only two interventions.

Updated: 2025-10-16 17:42:34

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2510.15018v1

VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning

Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at https://binghao-huang.github.io/vt_refine/.

Updated: 2025-10-16 17:41:36

Domains: cs.RO,cs.LG

Download: http://arxiv.org/abs/2510.14930v1

Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses, which serve as lures to probe user intent. Combined with the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent through multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), measuring both the attractiveness and feasibility of bait responses, and use a Defense Efficacy Rate (DER) for balancing safety and usability. Initial experiment on MHJ Datasets with recent attack method across GPT-4o show that our system significantly disrupts jailbreak success while preserving benign user experience.

Updated: 2025-10-16 17:41:09

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.15017v1

Instruction Set Migration at Warehouse Scale

Migrating codebases from one instruction set architecture (ISA) to another is a major engineering challenge. A recent example is the adoption of Arm (in addition to x86) across the major Cloud hyperscalers. Yet, this problem has seen limited attention by the academic community. Most work has focused on static and dynamic binary translation, and the traditional conventional wisdom has been that this is the primary challenge. In this paper, we show that this is no longer the case. Modern ISA migrations can often build on a robust open-source ecosystem, making it possible to recompile all relevant software from scratch. This introduces a new and multifaceted set of challenges, which are different from binary translation. By analyzing a large-scale migration from x86 to Arm at Google, spanning almost 40,000 code commits, we derive a taxonomy of tasks involved in ISA migration. We show how Google automated many of the steps involved, and demonstrate how AI can play a major role in automatically addressing these tasks. We identify tasks that remain challenging and highlight research challenges that warrant further attention.

Updated: 2025-10-16 17:41:01

Domains: cs.SE,cs.LG

Download: http://arxiv.org/abs/2510.14928v1

Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models

We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing -- and selectively reducing -- overconfidence in reasoning systems. This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.
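The abstract names H-Risk's ingredients (spectral margin, conditioning, temporal sensitivity, innovation amplification) but not its formula. A minimal numerical sketch of such a composite index for a linear system follows; the functional form, weights, and horizon are our assumptions, not the paper's definition.

```python
import numpy as np

def h_risk(A, w=(1.0, 1.0, 1.0, 1.0), horizon=10):
    """Illustrative composite instability index for x_{t+1} = A x_t + noise.

    Combines the four ingredients named in the abstract, but the functional
    form and weights here are assumptions, not the paper's H-Risk.
    """
    rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius
    margin = 1.0 / max(1.0 - rho, 1e-9)          # explodes near instability
    cond = np.linalg.cond(A)                     # sensitivity to perturbations of A
    temp = np.linalg.norm(np.linalg.matrix_power(A, horizon), 2)  # transient growth
    innov = np.linalg.norm(A, 2)                 # one-step noise amplification
    return float(np.dot(w, np.log1p([margin, cond, temp, innov])))

# A near-unstable, non-normal system should score higher than a damped one.
stable = 0.3 * np.eye(2)
fragile = np.array([[0.99, 0.5], [0.0, 0.99]])
assert h_risk(fragile) > h_risk(stable)
```

The log-scaling keeps the four terms on comparable footing; any monotone combination would serve the same illustrative purpose.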

Updated: 2025-10-16 17:40:28

Domains: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.14925v1

TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.

Updated: 2025-10-16 17:39:59

Domains: cs.AI,cs.CL,cs.LG,eess.AS,eess.SP

Download: http://arxiv.org/abs/2510.14922v1

DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
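As a rough sketch of inference-time attention reweighting: damp each entity's attention where another entity dominates, strengthen it where the entity itself dominates. The pixel-ownership rule and the fixed suppress/boost factors below are our assumptions for illustration, not DeLeaker's actual schedule.

```python
import numpy as np

def reweight_attention(maps, suppress=0.5, boost=1.5):
    """Schematic inference-time attention reweighting. `maps[i]` is entity i's
    cross-attention map; pixels "owned" by another entity are damped and
    self-owned pixels strengthened. Ownership rule and factors are
    illustrative assumptions, not DeLeaker's actual schedule."""
    maps = np.asarray(maps, dtype=float)
    owner = maps.argmax(axis=0)              # entity that dominates each pixel
    out = maps.copy()
    for i in range(maps.shape[0]):
        own = owner == i
        out[i][own] *= boost                 # reinforce the entity's identity
        out[i][~own] *= suppress             # suppress cross-entity leakage
    return out / out.sum(axis=(1, 2), keepdims=True)  # renormalize each map

# Two entities competing over a tiny 2x2 latent grid.
maps = [[[0.7, 0.3], [0.3, 0.7]],
        [[0.3, 0.7], [0.7, 0.3]]]
out = reweight_attention(maps)
```

After reweighting, each entity's mass concentrates on the pixels it already dominated, which is the qualitative effect the abstract describes.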

Updated: 2025-10-16 17:39:21

Domains: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.15015v1

REX: Causal discovery based on machine learning and explainability techniques

Explainable Artificial Intelligence (XAI) techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive the causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce ReX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that ReX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, ReX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taking together, these results showcase ReX's effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, ReX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures.
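A toy rendition of the recipe: fit a predictive model per variable and let explainability scores flag candidate causal edges. Here a linear model with permutation importance stands in for the paper's ML models with Shapley values, and no edge-orientation step is attempted; names and thresholds are illustrative.

```python
import numpy as np

def rex_sketch(X, names, thresh=0.1, seed=0):
    """Toy rendition of the ReX recipe. A linear model plus permutation
    importance stands in for the paper's ML models with Shapley values,
    and edge orientation is not attempted -- both simplifications."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    edges = set()
    for j in range(d):
        rest = [i for i in range(d) if i != j]
        A = np.c_[X[:, rest], np.ones(n)]
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        base = np.mean((A @ coef - X[:, j]) ** 2)
        for k, i in enumerate(rest):
            Ap = A.copy()
            Ap[:, k] = rng.permutation(Ap[:, k])     # destroy feature i
            imp = np.mean((Ap @ coef - X[:, j]) ** 2) - base
            if imp > thresh * X[:, j].var():         # importance threshold
                edges.add((names[i], names[j]))
    return edges

# y depends on x; z is independent noise.
rng = np.random.default_rng(1)
x, z = rng.normal(size=500), rng.normal(size=500)
y = 2.0 * x + 0.1 * rng.normal(size=500)
edges = rex_sketch(np.c_[x, y, z], ["x", "y", "z"])
```

On this synthetic data the x-y dependence is flagged while the independent variable z produces no edges, matching the abstract's claim of few false positives.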

Updated: 2025-10-16 17:38:53

Domains: cs.LG

Download: http://arxiv.org/abs/2501.12706v2

Predicting Task Performance with Context-aware Scaling Laws

Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.
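The joint dependence on compute and context can be made concrete with a hedged sketch. The separable power-law form below is our assumption for illustration, not necessarily the paper's parameterization of downstream performance.

```python
import numpy as np

def fit_context_scaling(compute, context, error):
    """Fit error ~ b * compute^(-alpha) * context^(-beta) by log-linear least
    squares. The separable power-law form is an assumption for illustration;
    the paper's actual parameterization may differ."""
    A = np.c_[np.ones(len(error)), -np.log(compute), -np.log(context)]
    (log_b, alpha, beta), *_ = np.linalg.lstsq(A, np.log(error), rcond=None)
    return np.exp(log_b), alpha, beta

# Sanity check: recover the exponents that generated synthetic data.
rng = np.random.default_rng(0)
C = rng.uniform(1e18, 1e21, size=200)   # training compute (FLOPs)
n = rng.uniform(1e2, 1e5, size=200)     # context length (tokens)
err = 3.0 * C**-0.1 * n**-0.3
b, alpha, beta = fit_context_scaling(C, n, err)
```

Because the model is linear in log-space, extrapolating along either axis (more compute or longer context) is a single dot product, which is what makes such laws useful for forward prediction.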

Updated: 2025-10-16 17:35:18

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14919v1

PerfBench: Can Agents Resolve Real-World Performance Bugs?

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown promise in automated bug fixing, existing benchmarks primarily focus on functional correctness and fail to evaluate agents' abilities to identify and resolve non-functional issues like performance bugs. We introduce PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks from popular .NET repositories on GitHub. Unlike existing benchmarks that rely on pre-existing test suites, PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks and validates fixes by comparing execution metrics collected for developer fix and agent fix. Each task in PerfBench is derived from actual developer fixes linked to performance-related issues, which are then verified by human experts, ensuring real-world relevance. Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks, with baseline OpenHands agent achieving only a ~3% success rate on our benchmark. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate on the benchmark. We show that by ensuring the agent has proper instructions to benchmark its changes and tooling for benchmark output processing, we can improve the agent performance significantly, but room for improvement still remains. PerfBench provides a challenging test set for furthering the capabilities of agents in fixing performance issues.

Updated: 2025-10-16 17:31:16

Domains: cs.SE,cs.AI,cs.PF

Download: http://arxiv.org/abs/2509.24091v2

Budget-aware Test-time Scaling via Discriminative Verification

Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3\% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a "free" upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.
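The hybrid mechanism can be sketched as majority voting (self-consistency) with verifier-based tie-breaking; the exact combination rule in the paper may differ from this minimal version.

```python
from collections import Counter

def hybrid_select(answers, verifier_scores):
    """Sketch of combining self-consistency with a discriminative verifier:
    majority vote over sampled answers, with the verifier's mean score per
    answer breaking ties. Illustrative, not the paper's exact rule."""
    votes = Counter(answers)
    top = max(votes.values())
    tied = [a for a, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]

    def mean_score(a):
        s = [sc for ans, sc in zip(answers, verifier_scores) if ans == a]
        return sum(s) / len(s)

    return max(tied, key=mean_score)

# A clear majority wins regardless of scores; ties fall back to the verifier.
assert hybrid_select(["42", "42", "7"], [0.1, 0.2, 0.99]) == "42"
assert hybrid_select(["42", "7"], [0.3, 0.8]) == "7"
```

The budget appeal is visible in the structure: the discriminative verifier scores each candidate with a single cheap forward pass, instead of generating a verification chain per candidate.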

Updated: 2025-10-16 17:30:02

Domains: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.14913v1

Learnable Mixed Nash Equilibria are Collectively Rational

We extend the study of learning in games to dynamics that exhibit non-asymptotic stability. We do so through the notion of uniform stability, which is concerned with equilibria of individually utility-seeking dynamics. Perhaps surprisingly, it turns out to be closely connected to economic properties of collective rationality. Under mild non-degeneracy conditions and up to strategic equivalence, if a mixed equilibrium is not uniformly stable, then it is not weakly Pareto optimal: there is a way for all players to improve by jointly deviating from the equilibrium. On the other hand, if it is locally uniformly stable, then the equilibrium must be weakly Pareto optimal. Moreover, we show that uniform stability determines the last-iterate convergence behavior for the family of incremental smoothed best-response dynamics, used to model individual and corporate behaviors in the markets. Unlike dynamics around strict equilibria, which can stabilize to socially-inefficient solutions, individually utility-seeking behaviors near mixed Nash equilibria lead to collective rationality.

Updated: 2025-10-16 17:25:32

Domains: cs.GT,cs.LG

Download: http://arxiv.org/abs/2510.14907v1

A Hard-Label Black-Box Evasion Attack against ML-based Malicious Traffic Detection Systems

Machine Learning (ML)-based malicious traffic detection is a promising security paradigm. It outperforms rule-based traditional detection by identifying various advanced attacks. However, the robustness of these ML models is largely unexplored, thereby allowing attackers to craft adversarial traffic examples that evade detection. Existing evasion attacks typically rely on overly restrictive conditions (e.g., encrypted protocols, Tor, or specialized setups), or require detailed prior knowledge of the target (e.g., training data and model parameters), which is impractical in realistic black-box scenarios. The feasibility of a hard-label black-box evasion attack (i.e., applicable across diverse tasks and protocols without internal target insights) thus remains an open challenge. To this end, we develop NetMasquerade, which leverages reinforcement learning (RL) to manipulate attack flows to mimic benign traffic and evade detection. Specifically, we establish a tailored pre-trained model called Traffic-BERT, utilizing a network-specialized tokenizer and an attention mechanism to extract diverse benign traffic patterns. Subsequently, we integrate Traffic-BERT into the RL framework, allowing NetMasquerade to effectively manipulate malicious packet sequences based on benign traffic patterns with minimal modifications. Experimental results demonstrate that NetMasquerade enables both brute-force and stealthy attacks to evade 6 existing detection methods under 80 attack scenarios, achieving over 96.65% attack success rate. Notably, it can evade the methods that are either empirically or certifiably robust against existing evasion attacks. Finally, NetMasquerade achieves low-latency adversarial traffic generation, demonstrating its practicality in real-world scenarios.

Updated: 2025-10-16 17:24:18

Domains: cs.CR

Download: http://arxiv.org/abs/2510.14906v1

MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.

Updated: 2025-10-16 17:20:22

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14904v1

Reasoning with Sampling: Your Base Model is Smarter Than You Think

Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilites can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
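The MCMC intuition can be made concrete with a toy Metropolis-Hastings sampler that targets the base distribution raised to a power, using only base-model log-likelihoods. The paper's actual algorithm is an iterative resampling scheme over sequences; this independence-proposal toy only illustrates the principle.

```python
import math
import random

def sharpened_mh(logp, propose, x0, alpha=4.0, steps=50, rng=None):
    """Metropolis-Hastings targeting p(x)^alpha using only the base model's
    log-likelihood `logp`. With proposal q = p itself, the acceptance ratio
    for target p^alpha reduces to (p(y)/p(x))^(alpha - 1)."""
    rng = rng or random.Random(0)
    x, lx = x0, logp(x0)
    for _ in range(steps):
        y = propose(rng)
        ly = logp(y)
        if math.log(rng.random()) < (alpha - 1.0) * (ly - lx):
            x, lx = y, ly
    return x

# Toy "base model": three answers with probabilities 0.5 / 0.3 / 0.2.
probs = {"A": 0.5, "B": 0.3, "C": 0.2}
logp = lambda s: math.log(probs[s])
propose = lambda rng: rng.choices(list(probs), list(probs.values()))[0]
# Sharpening (alpha=4) concentrates samples on the base model's mode "A",
# even when every chain starts from the least likely answer "C".
samples = [sharpened_mh(logp, propose, "C", rng=random.Random(i)) for i in range(100)]
```

Under the sharpened target the mode's probability rises from 0.5 to roughly 0.87 (0.5^4 over the normalizer), which is the "elicit sharper reasoning by sampling alone" effect the abstract describes, without any weight updates.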

Updated: 2025-10-16 17:18:11

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14901v1

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates

The Enterprise Intelligence Platform must integrate logs from numerous third-party vendors in order to perform various downstream tasks. However, vendor documentation is often unavailable at test time. It is either misplaced, mismatched, poorly formatted, or incomplete, which makes schema mapping challenging. We introduce a reinforcement learning agent that can self-improve without labeled examples or model weight updates. During inference, the agent: 1) Identifies ambiguous field-mapping attempts. 2) Generates targeted web-search queries to gather external evidence. 3) Applies a confidence-based reward to iteratively refine its mappings. To demonstrate this concept, we converted Microsoft Defender for Endpoint logs into a common schema. Our method increased mapping accuracy from 56.4\%(LLM-only) to 72.73\%(RAG) to 93.94\% over 100 iterations using GPT-4o. At the same time, it reduced the number of low-confidence mappings requiring expert review by 85\%. This new approach provides an evidence-driven, transparent method for solving future industry problems, paving the way for more robust, accountable, scalable, efficient, flexible, adaptable, and collaborative solutions.

Updated: 2025-10-16 17:17:00

Domains: cs.AI,cs.CR

Download: http://arxiv.org/abs/2510.14900v1

Secure Sparse Matrix Multiplications and their Applications to Privacy-Preserving Machine Learning

To preserve privacy, multi-party computation (MPC) enables executing Machine Learning (ML) algorithms on secret-shared or encrypted data. However, existing MPC frameworks are not optimized for sparse data. This makes them unsuitable for ML applications involving sparse data, e.g., recommender systems or genomics. Even in plaintext, such applications involve high-dimensional sparse data, that cannot be processed without sparsity-related optimizations due to prohibitively large memory requirements. Since matrix multiplication is central in ML algorithms, we propose MPC algorithms to multiply secret sparse matrices. On the one hand, our algorithms avoid the memory issues of the "dense" data representation of classic secure matrix multiplication algorithms. On the other hand, our algorithms can significantly reduce communication costs (some experiments show a factor 1000) for realistic problem sizes. We validate our algorithms in two ML applications in which existing protocols are impractical. An important question when developing MPC algorithms is what assumptions can be made. In our case, if the number of non-zeros in a row is a sensitive piece of information then a short runtime may reveal that the number of non-zeros is small. Existing approaches make relatively simple assumptions, e.g., that there is a universal upper bound to the number of non-zeros in a row. This often doesn't align with statistical reality, in a lot of sparse datasets the amount of data per instance satisfies a power law. We propose an approach which allows adopting a safe upper bound on the distribution of non-zeros in rows/columns of sparse matrices.
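The closing idea, a distributional rather than universal bound on row sparsity, can be sketched as picking a padding cap from a high quantile of the observed (power-law-like) row sizes. The quantile rule is our illustration; the paper's construction, and how the protocol handles the rare over-cap rows, are more involved.

```python
import numpy as np

def padded_row_capacity(row_nnz, delta=0.01):
    """Cap on per-row non-zeros to pad to, such that at most a `delta`
    fraction of rows exceed it. Sketches the idea of a distributional
    (power-law friendly) bound instead of a universal worst-case one;
    splitting or truncating the rare over-cap rows is not shown."""
    return int(np.quantile(row_nnz, 1.0 - delta, method="higher"))

# Heavy-tailed (Pareto) row sizes: the 99%-cap sits far below the densest
# row, so padding every row to the cap costs far less memory/communication
# than padding everything to the maximum.
rng = np.random.default_rng(0)
row_nnz = ((rng.pareto(1.5, size=10_000) + 1.0) * 3).astype(int)
cap = padded_row_capacity(row_nnz, delta=0.01)
```

Padding to a common cap is what hides the exact per-row non-zero count, addressing the leakage-through-runtime concern the abstract raises.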

Updated: 2025-10-16 17:12:18

Domains: cs.CR,cs.LG

Download: http://arxiv.org/abs/2510.14894v1

Strategyproof Reinforcement Learning from Human Feedback

We study Reinforcement Learning from Human Feedback (RLHF) in settings where multiple labelers may strategically misreport feedback to steer the learned policy toward their own preferences. We show that existing RLHF algorithms, including recent pluralistic methods, are not strategyproof, and that even a single strategic labeler can cause arbitrarily large misalignment with social welfare. Moreover, we prove that, in the worst case, any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, where $k$ is the number of labelers. This suggests a fundamental trade-off between incentive alignment (ensuring labelers report truthfully) and policy alignment (maximizing social welfare). To address this, we propose the Pessimistic Median of MLEs algorithm, which, under appropriate policy coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of labelers and samples increases. Our results apply to both contextual bandits and Markov decision processes.
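The core of the proposed algorithm, a coordinate-wise median over labelers plus pessimism, can be sketched for a finite-arm toy problem. The paper's penalty construction and policy-coverage conditions are more involved than this sketch.

```python
import statistics

def pessimistic_median_policy(estimates, penalties):
    """Toy finite-arm version of the Pessimistic Median of MLEs idea: take
    the coordinate-wise median of per-labeler reward estimates (robust to a
    minority of strategic labelers), subtract a per-arm uncertainty penalty
    (the pessimism), and act greedily. The paper's penalty construction is
    more involved than this illustration."""
    scores = [statistics.median(est[a] for est in estimates) - penalties[a]
              for a in range(len(penalties))]
    return max(range(len(penalties)), key=lambda a: scores[a])

# Three honest labelers prefer arm 0; one strategic labeler wildly inflates
# arm 1. The median ignores the outlier, so the chosen policy is unchanged.
honest = [[0.90, 0.20], [0.85, 0.25], [0.90, 0.30]]
strategic = [[0.0, 10.0]]
best = pessimistic_median_policy(honest + strategic, penalties=[0.05, 0.05])
```

A mean in place of the median would be dragged to arm 1 by the single strategic report, which is exactly the failure mode the abstract attributes to existing RLHF aggregation.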

Updated: 2025-10-16 17:10:09

Domains: cs.LG

Download: http://arxiv.org/abs/2503.09561v2

Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media

On social media, many individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly. Instead, signs may surface indirectly through everyday posts or peer interactions. Detecting such implicit signals early is critical but remains challenging. We frame early and implicit SI as a forward-looking prediction task and develop a computational framework that models a user's information environment, consisting of both their longitudinal posting histories as well as the discourse of their socially proximal peers. We adopted a composite network centrality measure to identify top neighbors of a user, and temporally aligned the user's and neighbors' interactions -- integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves early and implicit SI detection by 15% over individual-only baselines. These findings highlight that peer interactions offer valuable predictive signals and carry broader implications for designing early detection systems that capture indirect as well as masked expressions of risk in online environments.

Updated: 2025-10-16 17:09:14

Categories: cs.SI,cs.AI,cs.CL,cs.CY,cs.HC

Download: http://arxiv.org/abs/2510.14889v1

Prediction-Specific Design of Learning-Augmented Algorithms

\emph{Algorithms with predictions} has emerged as a powerful framework to combine the robustness of traditional online algorithms with the data-driven performance benefits of machine-learned (ML) predictions. However, most existing approaches in this paradigm are overly conservative, as they do not leverage problem structure to optimize performance in a prediction-specific manner. In this paper, we show that such prediction-specific performance criteria can enable significant performance improvements over the coarser notions of consistency and robustness considered in prior work. Specifically, we propose a notion of \emph{strongly-optimal} algorithms with predictions, which obtain Pareto optimality not just in the worst-case tradeoff between robustness and consistency, but also in the prediction-specific tradeoff between these metrics. We develop a general bi-level optimization framework that enables systematically designing strongly-optimal algorithms in a wide variety of problem settings, and we propose explicit strongly-optimal algorithms for several classic online problems: deterministic and randomized ski rental, and one-max search. Our analysis reveals new structural insights into how predictions can be optimally integrated into online algorithms by leveraging a prediction-specific design. To validate the benefits of our proposed framework, we empirically evaluate our algorithms in case studies on problems including dynamic power management and volatility-based index trading. Our results demonstrate that prediction-specific, strongly-optimal algorithms can significantly improve performance across a variety of online decision-making settings.
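For context, the classic prediction-augmented ski-rental strategy (the standard consistency-robustness baseline in this literature, not the strongly-optimal algorithm this paper proposes) can be sketched in a few lines: with buy cost b, predicted season length y, and trust parameter lam, buy early when the prediction says the season is long, and delay buying otherwise.

```python
import math

# Classic prediction-augmented ski rental (a standard baseline from the
# algorithms-with-predictions literature, not the strongly-optimal algorithm
# proposed in the paper). Buy cost b, predicted number of ski days y, trust
# parameter lam in (0, 1]: smaller lam means trusting the prediction more.

def buy_day(b, y, lam):
    """Day on which to buy skis; rent on all earlier days."""
    if y >= b:                      # prediction says: long season, buy early
        return math.ceil(lam * b)
    return math.ceil(b / lam)       # prediction says: short season, delay buying

def total_cost(b, true_days, day_bought):
    if true_days < day_bought:      # season ended before we bought
        return true_days            # rented every day at unit cost
    return (day_bought - 1) + b     # rented, then bought

b, lam = 10, 0.5
# Accurate long-season prediction: buy on day 5, pay 4 rentals + 10 = 14
assert total_cost(b, 100, buy_day(b, 100, lam)) == 14
# Wrong long-season prediction, season lasts 3 days: pay only 3 rentals
assert total_cost(b, 3, buy_day(b, 100, lam)) == 3
```

The paper's prediction-specific design refines exactly this kind of tradeoff: rather than one worst-case guarantee for all predictions, the algorithm is tuned per prediction value.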

Updated: 2025-10-16 17:06:53

Categories: cs.DS,cs.LG

Download: http://arxiv.org/abs/2510.14887v1

Approximation Rates for Shallow ReLU$^k$ Neural Networks on Sobolev Spaces via the Radon Transform

Let $\Omega\subset \mathbb{R}^d$ be a bounded domain. We consider the problem of how efficiently shallow neural networks with the ReLU$^k$ activation function can approximate functions from Sobolev spaces $W^s(L_p(\Omega))$ with error measured in the $L_q(\Omega)$-norm. Utilizing the Radon transform and recent results from discrepancy theory, we provide a simple proof of nearly optimal approximation rates in a variety of cases, including when $q\leq p$, $p\geq 2$, and $s \leq k + (d+1)/2$. The rates we derive are optimal up to logarithmic factors, and significantly generalize existing results. An interesting consequence is that the adaptivity of shallow ReLU$^k$ neural networks enables them to obtain optimal approximation rates for smoothness up to order $s = k + (d+1)/2$, even though they represent piecewise polynomials of fixed degree $k$.

Updated: 2025-10-16 17:03:54

Categories: stat.ML,cs.LG,cs.NA,math.NA,62M45, 41A25, 41A30

Download: http://arxiv.org/abs/2408.10996v3

Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards

In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all of sequential decision-making theory assumes that all errors are recoverable (e.g., by bounding rewards). Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails. Some prior work avoids irreparable errors by asking for help from a mentor, but a mentor may not always be available. In this work, we formalize a model of learning with unbounded rewards without a mentor as a two-action contextual bandit with an abstain option: at each round the agent observes an input and chooses either to abstain (always 0 reward) or to commit (execute a preexisting task policy). Committing yields rewards that are upper-bounded but can be arbitrarily negative, and the commit reward is assumed Lipschitz in the input. We propose a caution-based algorithm that learns when not to learn: it chooses a trusted region and commits only where the available evidence does not already certify harm. Under these conditions and i.i.d. inputs, we establish sublinear regret guarantees, theoretically demonstrating the effectiveness of cautious exploration for deploying learning agents safely in high-stakes environments.
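The "commit only where evidence does not certify harm" idea can be sketched as follows (my own simplification with a scalar input, not the paper's exact algorithm): since the commit reward is L-Lipschitz, an observation (x_i, r_i) certifies harm at x whenever the Lipschitz upper bound r_i + L*|x - x_i| is still negative.

```python
# Hedged sketch of caution-based abstention (a simplification, not the paper's
# exact trusted-region construction). Rewards are L-Lipschitz in a scalar input;
# an observation (xi, r) certifies harm at x whenever the Lipschitz upper bound
# r + L*|x - xi| on the reward at x is negative.

def certified_harmful(x, observations, L):
    return any(r + L * abs(x - xi) < 0 for xi, r in observations)

def act(x, observations, L):
    return "abstain" if certified_harmful(x, observations, L) else "commit"

obs = [(0.0, -5.0), (10.0, 1.0)]   # committing at x=0.0 was very bad
L = 1.0
assert act(0.5, obs, L) == "abstain"   # -5 + 0.5 < 0: harm certified nearby
assert act(9.0, obs, L) == "commit"    # -5 + 9 = 4 >= 0: no certificate of harm
```

The key property is one-sided: abstention is triggered only by proof of harm, so the agent never repeats a catastrophic commit in the certified-bad neighborhood, while exploration continues elsewhere.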

Updated: 2025-10-16 17:01:57

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14884v1

The Gatekeeper Knows Enough

Large Language Models (LLMs) are increasingly deployed as autonomous agents, yet their practical utility is fundamentally constrained by a limited context window and state desynchronization resulting from the LLMs' stateless nature and inefficient context management. These limitations lead to unreliable output, unpredictable behavior, and inefficient resource usage, particularly when interacting with large, structured, and sensitive knowledge systems such as codebases and documents. To address these challenges, we introduce the Gatekeeper Protocol, a novel, domain-agnostic framework that governs agent-system interactions. Our protocol mandates that the agent first operate and reason on a minimalist, low-fidelity "latent state" representation of the system to strategically request high-fidelity context on demand. All interactions are mediated through a unified JSON format that serves as a declarative, state-synchronized protocol, ensuring the agent's model of the system remains verifiably grounded in the system's reality. We demonstrate the efficacy of this protocol with Sage, a reference implementation of the Gatekeeper Protocol for software development. Our results show that this approach significantly increases agent reliability, improves computational efficiency by minimizing token consumption, and enables scalable interaction with complex systems, creating a foundational methodology for building more robust, predictable, and grounded AI agents for any structured knowledge domain.
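The latent-state-plus-on-demand-context pattern can be made concrete with a toy message exchange (all field names and paths here are hypothetical; the abstract does not publish a schema): the agent holds a low-fidelity summary of the system and issues a versioned JSON request when it needs high-fidelity detail.

```python
import json

# Illustrative messages in the spirit of the Gatekeeper Protocol's unified JSON
# format (field names and file paths are hypothetical). The agent reasons over a
# minimalist latent state and requests high-fidelity context on demand.

latent_state = {
    "files": [
        {"path": "src/auth.py", "summary": "login + token refresh", "loaded": False},
        {"path": "src/db.py", "summary": "connection pool", "loaded": False},
    ]
}

request = {
    "type": "context_request",          # agent asks the gatekeeper for detail
    "target": "src/auth.py",
    "reason": "need function bodies to fix token refresh bug",
    "state_version": 7,                 # synchronization: stale requests can be rejected
}

encoded = json.dumps(request)
decoded = json.loads(encoded)
assert decoded["target"] == "src/auth.py" and decoded["state_version"] == 7
```

Carrying a state version in every message is one simple way to keep the agent's model verifiably synchronized with the system: the gatekeeper can refuse requests issued against an outdated latent state.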

Updated: 2025-10-16 17:00:42

Categories: cs.AI,cs.IT,math.IT

Download: http://arxiv.org/abs/2510.14881v1

Predicting kernel regression learning curves from only raw data statistics

We study kernel regression with common rotation-invariant kernels on real datasets including CIFAR-5m, SVHN, and ImageNet. We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function $f_*$. The key new idea is an analytical approximation of a kernel's eigenvalues and eigenfunctions with respect to an anisotropic data distribution. The eigenfunctions resemble Hermite polynomials of the data, so we call this approximation the Hermite eigenstructure ansatz (HEA). We prove the HEA for Gaussian data, but we find that real image data is often "Gaussian enough" for the HEA to hold well in practice, enabling us to predict learning curves by applying prior results relating kernel eigenstructure to test risk. Extending beyond kernel regression, we empirically find that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by the HEA. Our HEA framework is a proof of concept that an end-to-end theory of learning which maps dataset structure all the way to model performance is possible for nontrivial learning algorithms on real datasets.
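The Hermite flavor of the ansatz is easy to see in one dimension (an illustration only, not the paper's anisotropic construction): on roughly Gaussian data, the probabilists' Hermite polynomials He_k are orthogonal with E[He_j He_k] = k! on matching indices, so a smooth target decomposes into Hermite coefficients recoverable from raw samples.

```python
import numpy as np

# 1-D illustration of the Hermite eigenstructure idea (not the paper's full
# anisotropic construction): on Gaussian data the probabilists' Hermite
# polynomials He_k are orthogonal with E[He_k^2] = k!, so the Hermite
# coefficients of a target are empirical averages E[f(x) He_k(x)] / k!.

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)

def target(x):
    return 2.0 * x + 0.5 * (x**2 - 1.0)   # = 2*He_1(x) + 0.5*He_2(x)

V = np.polynomial.hermite_e.hermevander(x, 3)     # columns He_0..He_3 of each sample
factorials = np.array([1.0, 1.0, 2.0, 6.0])
coeffs = (V * target(x)[:, None]).mean(axis=0) / factorials

# Recovered coefficients match the ones the target was built from.
assert abs(coeffs[1] - 2.0) < 0.1 and abs(coeffs[2] - 0.5) < 0.1
```

In the paper's framework this polynomial decomposition of the target, together with the data covariance, is exactly the kind of raw-data measurement from which the learning curve is predicted.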

Updated: 2025-10-16 16:57:59

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14878v1

Robust Counterfactual Inference in Markov Decision Processes

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.

Updated: 2025-10-16 16:51:12

Categories: cs.AI

Download: http://arxiv.org/abs/2502.13731v3

From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.

Updated: 2025-10-16 16:49:05

Categories: cs.CL,cs.AR,cs.LG

Download: http://arxiv.org/abs/2510.14871v1

Chiplet-Based RISC-V SoC with Modular AI Acceleration

Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this complex balance mainly due to low manufacturing yields (below 16%) for large 360 mm^2 dies at advanced process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system-level optimization. Our proposed design integrates 4 different key innovations in a 30mm x 30mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry-standard benchmarks like MobileNetV2, ResNet-50 and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while maintaining sub-5ms real-time capability across all experimented workloads. These performance upgrades demonstrate that modular chiplet designs can achieve near-monolithic computational density while enabling the cost efficiency, scalability and upgradeability crucial for next-generation edge AI device applications.
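The reported energy figure is a straightforward division, which a two-line check confirms:

```python
# Consistency check of the reported efficiency figure: 860 mW at 244 images/s
# corresponds to roughly 3.5 mJ per MobileNetV2 inference.

power_w = 0.860          # 860 mW
throughput = 244.0       # images per second
energy_per_inference_mj = power_w / throughput * 1e3   # joules -> millijoules

assert abs(energy_per_inference_mj - 3.52) < 0.01      # ~3.52 mJ, matching ~3.5 mJ
```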

Updated: 2025-10-16 16:44:51

Categories: cs.AR,cs.AI

Download: http://arxiv.org/abs/2509.18355v2

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

We introduce SteeringSafety, a systematic framework for evaluating representation steering methods across seven safety perspectives spanning 17 datasets. While prior work highlights general capabilities of representation steering, we systematically explore safety perspectives including bias, harmfulness, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment. Our framework provides modularized building blocks for state-of-the-art steering methods, enabling unified implementation of DIM, ACE, CAA, PCA, and LAT with recent enhancements like conditional steering. Results on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B reveal that strong steering performance depends critically on pairing of method, model, and specific perspective. DIM shows consistent effectiveness, but all methods exhibit substantial entanglement: social behaviors show highest vulnerability (reaching degradation as high as 76%), jailbreaking often compromises normative judgment, and hallucination steering unpredictably shifts political views. Our findings underscore the critical need for holistic safety evaluations.
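Difference-in-means (DIM), the method the benchmark finds most consistently effective, reduces to a very small computation; the sketch below uses random vectors as stand-ins for real model activations (the alpha scale and dimensions are illustrative choices, not values from the paper).

```python
import numpy as np

# Minimal sketch of difference-in-means (DIM) steering, one of the benchmarked
# methods: the steering direction is the mean activation difference between
# "positive" and "negative" prompts, added to hidden states at inference time.
# Activations here are random stand-ins for real model hidden states.

rng = np.random.default_rng(1)
d = 16
concept = rng.standard_normal(d)
pos = rng.standard_normal((100, d)) + concept      # activations on positive prompts
neg = rng.standard_normal((100, d))                # activations on negative prompts

direction = pos.mean(axis=0) - neg.mean(axis=0)    # DIM steering vector

def steer(h, direction, alpha=1.0):
    return h + alpha * direction

h = rng.standard_normal(d)
h_steered = steer(h, direction, alpha=2.0)
# Steering moves the activation toward the concept direction.
assert h_steered @ direction > h @ direction
```

The entanglement findings in the abstract are precisely about this addition being applied globally: shifting every hidden state along one direction can also move behaviors the direction was never meant to touch.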

Updated: 2025-10-16 16:44:31

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2509.13450v2

Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a \emph{unified manner}, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or endow the LLM with robustness to tailored attacks by shifting part of the weights to a \emph{safer} region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.

Updated: 2025-10-16 16:42:58

Categories: cs.LG,cs.CL,cs.CR,cs.CY,math.OC

Download: http://arxiv.org/abs/2510.03567v3

Benchmarking Multimodal Large Language Models for Face Recognition

Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks under a similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available on the project page.
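The verification protocol used on LFW-style benchmarks is standard: score an image pair by the cosine similarity of their embeddings and accept if it exceeds a threshold tuned on held-out folds. A minimal sketch with toy embedding vectors:

```python
import math

# Sketch of the standard pair-verification protocol on LFW-style benchmarks:
# score a pair by cosine similarity of embeddings, accept above a threshold.
# The embeddings and threshold here are toy values for illustration.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify(emb_a, emb_b, threshold=0.5):
    return cosine(emb_a, emb_b) >= threshold

same_person = ([1.0, 0.2, 0.0], [0.9, 0.3, 0.1])
different = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

assert verify(*same_person)
assert not verify(*different)
```

Benchmarking MLLMs under this protocol requires mapping their (often textual) same/different judgments or similarity scores into the same accept/reject decision, which is what makes the comparison with embedding models protocol-compatible.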

Updated: 2025-10-16 16:42:27

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14866v1

LabOS: The AI-XR Co-Scientist That Sees and Works With Humans

Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and Extended-Reality (XR)-enabled human-AI collaboration. By connecting multimodal AI agents, smart glasses, and human-AI collaboration, LabOS allows AI to see what scientists see, understand experimental context, and assist in real-time execution. Across applications -- from cancer immunotherapy target discovery to stem-cell engineering -- LabOS shows that AI can move beyond computational design to participation, turning the laboratory into an intelligent, collaborative environment where human and machine discovery evolve together.

Updated: 2025-10-16 16:36:22

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14861v1

TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference

Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
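Two of the target parameters the models are asked to predict have simple ground-truth computations, sketched here in pure Python (the dataset itself carries richer structural metadata):

```python
# Ground-truth computation of two graph parameters from the evaluation
# (density and average local clustering coefficient), in pure Python.

def density(n, edges):
    return 2 * len(edges) / (n * (n - 1))

def clustering(n, edges):
    """Average local clustering coefficient over all n vertices."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    total = 0.0
    for v in range(n):
        nbrs = list(adj[v])
        k = len(nbrs)
        if k < 2:
            continue                      # degree < 2 contributes 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2 * links / (k * (k - 1))
    return total / n

triangle_plus_tail = [(0, 1), (1, 2), (0, 2), (2, 3)]   # triangle 0-1-2, pendant 3
assert density(4, triangle_plus_tail) == 2 * 4 / (4 * 3)
assert abs(clustering(4, triangle_plus_tail) - (1 + 1 + 1/3) / 4) < 1e-9
```

Serializing a graph (e.g. as an edge list) and asking a small model for these values, then comparing against such exact computations, is the shape of the evaluation the abstract describes.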

Updated: 2025-10-16 16:29:36

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08808v3

A Multi-Task Deep Learning Framework for Skin Lesion Classification, ABCDE Feature Quantification, and Evolution Simulation

Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human-interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening avenues for future exploration. The A, B, C, and D values in particular are quantified in this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevi to malignant melanoma. The experiments are conducted using the HAM10000 dataset, which contains around ten thousand images of skin lesions at varying stages. In summary, the classification achieved an accuracy of around 89 percent, with a melanoma AUC of 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.
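To make the ABCD scores concrete, here is a toy quantification of two of them on a binary lesion mask. This is my own simplification of how such scores can be computed, not the paper's model: asymmetry is measured as the fraction of mask pixels that fail to overlap the horizontally flipped mask, and diameter as the larger pixel extent.

```python
# Toy quantification of two ABCD features on a binary lesion mask
# (a simplification for illustration, not the paper's learned scores).

def asymmetry(mask):
    """Fraction of lesion pixels not matched by the horizontally flipped mask."""
    flipped = [row[::-1] for row in mask]
    area = sum(sum(row) for row in mask)
    overlap = sum(a & b for r1, r2 in zip(mask, flipped) for a, b in zip(r1, r2))
    return 1.0 - overlap / area

def diameter_px(mask):
    """Larger of the horizontal and vertical pixel extents of the lesion."""
    cols = [c for row in mask for c, v in enumerate(row) if v]
    rows = [r for r, row in enumerate(mask) for v in row if v]
    return max(max(cols) - min(cols), max(rows) - min(rows)) + 1

symmetric = [[0, 1, 1, 0],
             [1, 1, 1, 1]]
skewed = [[1, 1, 0, 0],
          [1, 1, 1, 0]]

assert asymmetry(symmetric) == 0.0   # mirror-symmetric lesion scores 0
assert asymmetry(skewed) > 0.0
assert diameter_px(skewed) == 3
```

The framework in the abstract learns such scores end-to-end rather than hand-computing them, which is what lets it also simulate their evolution over time.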

Updated: 2025-10-16 16:28:21

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14855v1

Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval

Associative memory models, such as Hopfield networks and their modern variants, have garnered renewed interest due to advancements in memory capacity and connections with self-attention in transformers. In this work, we introduce a unified framework-Hopfield-Fenchel-Young networks-which generalizes these models to a broader family of energy functions. Our energies are formulated as the difference between two Fenchel-Young losses: one, parameterized by a generalized entropy, defines the Hopfield scoring mechanism, while the other applies a post-transformation to the Hopfield output. By utilizing Tsallis and norm entropies, we derive end-to-end differentiable update rules that enable sparse transformations, uncovering new connections between loss margins, sparsity, and exact retrieval of single memory patterns. We further extend this framework to structured Hopfield networks using the SparseMAP transformation, allowing the retrieval of pattern associations rather than a single pattern. Our framework unifies and extends traditional and modern Hopfield networks and provides an energy minimization perspective for widely used post-transformations like $\ell_2$-normalization and layer normalization-all through suitable choices of Fenchel-Young losses and by using convex analysis as a building block. Finally, we validate our Hopfield-Fenchel-Young networks on diverse memory recall tasks, including free and sequential recall. Experiments on simulated data, image retrieval, multiple instance learning, and text rationalization demonstrate the effectiveness of our approach.
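The best-known member of this family is easy to write down: with the Shannon entropy (the Tsallis alpha=1 case), the Hopfield scoring step is a softmax over pattern similarities, and one update retrieves the stored pattern nearest the query.

```python
import numpy as np

# The classic modern-Hopfield special case of this family: with Shannon entropy
# (Tsallis alpha=1) the update is softmax attention over stored patterns,
# q <- X^T softmax(beta * X q). One step retrieves the nearest memory.

def softmax(z):
    z = z - z.max()                 # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def hopfield_update(X, q, beta=8.0):
    """X: (num_patterns, dim) stored memories; q: (dim,) query."""
    p = softmax(beta * (X @ q))     # Hopfield scoring step
    return X.T @ p                  # convex combination of memories

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([0.9, 0.1, 0.0])       # noisy version of pattern 0

retrieved = hopfield_update(X, q)
assert np.argmax(retrieved) == 0 and retrieved[0] > 0.99
```

The paper's generalization replaces the softmax here with transformations induced by other Fenchel-Young losses (e.g. Tsallis entropies yield sparse maps), which is what enables exact retrieval of single patterns and, via SparseMAP, of pattern associations.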

Updated: 2025-10-16 16:28:05

Categories: cs.LG

Download: http://arxiv.org/abs/2411.08590v3

Efficient & Correct Predictive Equivalence for Decision Trees

The Rashomon set of decision trees (DTs) has found important uses. Recent work showed that DTs computing the same classification function, i.e. predictive equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due to the existence of predictive equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related to DTs, including that of deciding predictive equivalence of DTs. The approach of McTavish et al. consists of applying the well-known Quine-McCluskey (QM) method for obtaining minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that the QM method may incorrectly decide predictive equivalence if two key constraints are not respected, one of which may be difficult to formally guarantee. Third, the paper shows that any of the problems to which the smallest DNF representation has been applied can be solved in polynomial time in the size of the DT. The experiments confirm that, for DTs that trigger the worst case of the QM method, the algorithms proposed in this paper are orders of magnitude faster than those proposed by McTavish et al.
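To fix ideas, here is a small sketch of the objects under discussion: a decision tree over Boolean features induces a DNF (one term per accepting path), and two trees are predictive equivalent iff they agree on every input. For tiny feature counts an exhaustive check is exact; the paper's contribution is doing this in time polynomial in the DT size, avoiding minimum-size DNF computation via Quine-McCluskey entirely.

```python
from itertools import product

# Sketch: decision trees over Boolean features, and predictive equivalence
# checked exhaustively (exact for small feature counts; the paper gives
# algorithms polynomial in the tree size instead).
# Tree encoding: ("leaf", 0/1) or ("node", feature_index, left_if_0, right_if_1)

def predict(tree, x):
    while tree[0] == "node":
        _, f, lo, hi = tree
        tree = hi if x[f] else lo
    return tree[1]

def predictive_equivalent(t1, t2, num_features):
    return all(predict(t1, x) == predict(t2, x)
               for x in product([0, 1], repeat=num_features))

leaf0, leaf1 = ("leaf", 0), ("leaf", 1)
# Both trees compute x0 AND x1, but test the features in opposite orders.
t_a = ("node", 0, leaf0, ("node", 1, leaf0, leaf1))
t_b = ("node", 1, leaf0, ("node", 0, leaf0, leaf1))
t_c = ("node", 0, leaf0, leaf1)   # computes just x0

assert predictive_equivalent(t_a, t_b, 2)
assert not predictive_equivalent(t_a, t_c, 2)
```

t_a and t_b are syntactically different trees computing the same function, which is exactly the redundancy that inflates Rashomon-set-based feature importance.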

Updated: 2025-10-16 16:22:56

Categories: cs.AI,cs.LG,cs.LO

Download: http://arxiv.org/abs/2509.17774v4

Thinker: Learning to Think Fast and Slow

Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
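The four-stage task can be sketched as a prompt pipeline (the prompts and function names here are my own illustration, not the paper's implementation); `llm` stands in for any `generate(prompt, max_tokens=None) -> str` callable.

```python
FAST_BUDGET = 1000  # strict token budget for the Fast Thinking stage

def thinker_episode(llm, question):
    fast = llm(f"Answer concisely:\n{question}", max_tokens=FAST_BUDGET)
    verify = llm(f"Question: {question}\nDraft answer: {fast}\n"
                 "Evaluate whether the draft is correct.")
    slow = llm(f"Question: {question}\nDraft: {fast}\nCritique: {verify}\n"
               "Refine the answer with careful step-by-step reasoning.")
    summary = llm(f"Distill this solution into precise steps:\n{slow}")
    return {"fast": fast, "verify": verify, "slow": slow, "summary": summary}

# a trivial stub in place of a real model, just to show the data flow
def stub(prompt, max_tokens=None):
    return "4"

out = thinker_episode(stub, "What is 2 + 2?")
```

In the paper's setting, RL rewards each stage, which is what drives the reported gains; this sketch only shows the staged control flow.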

Updated: 2025-10-16 16:20:17

Categories: cs.CL,cs.AI,cs.LG,I.2.6; I.2.8; I.5.1

Download: http://arxiv.org/abs/2505.21097v2

A Geometric Approach to Optimal Experimental Design

We introduce a novel geometric framework for optimal experimental design (OED). Traditional OED approaches, such as those based on mutual information, rely explicitly on probability densities, leading to restrictive invariance properties. To address these limitations, we propose the mutual transport dependence (MTD), a measure of statistical dependence grounded in optimal transport theory which provides a geometric objective for optimizing designs. Unlike conventional approaches, the MTD can be tailored to specific downstream estimation problems by choosing appropriate geometries on the underlying spaces. We demonstrate that our framework produces high-quality designs while offering a flexible alternative to standard information-theoretic techniques.

Updated: 2025-10-16 16:20:14

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.14848v1

Where to Search: Measure the Prior-Structured Search Space of LLM Agents

The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
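A toy reading of the coverage generating function (my own construction from the abstract, not the paper's definition): weight every length-k path in the feasibility graph by t**k and sum, giving G(t) = sum_k (tA)^k = (I - tA)^(-1) entrywise whenever t is below the spectral-radius threshold.

```python
import numpy as np

A = np.array([[0, 1, 0],   # feasible transitions of a tiny 3-state agent
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

def coverage(A, t):
    """Sum of t**k-weighted path counts between every pair of states."""
    return np.linalg.inv(np.eye(len(A)) - t * A)

G = coverage(A, 0.5)
# G[0, 1] counts the single one-step path (weight 0.5);
# G[0, 2] the single two-step path (weight 0.25): harder to reach
```

Smaller entries then correspond to higher reachability difficulty under the safety envelope.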

Updated: 2025-10-16 16:18:37

Categories: cs.AI,cs.CL,cs.LO

Download: http://arxiv.org/abs/2510.14846v1

Backdoor Unlearning by Linear Task Decomposition

Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
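The weight-space picture can be sketched with task-vector arithmetic on synthetic numbers (a toy of the disentanglement idea, not the paper's method): if the backdoor's task vector is disentangled from the benign task's, subtracting an estimate of it erases the backdoor while leaving clean behavior intact.

```python
import numpy as np

w_pre      = np.zeros(4)                     # pre-trained weights
tau_task   = np.array([1.0, 1.0, 0.0, 0.0])  # benign fine-tuning direction
tau_bd     = np.array([0.0, 0.0, 2.0, 0.0])  # backdoor direction (disjoint)
w_poisoned = w_pre + tau_task + tau_bd

tau_bd_est  = tau_bd            # e.g. estimated via reverse-engineered triggers
w_unlearned = w_poisoned - tau_bd_est   # backdoor erased, task preserved
```

The 96% retained clean accuracy in the paper reflects that real models approximate, rather than exactly satisfy, this orthogonal decomposition.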

Updated: 2025-10-16 16:18:07

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2510.14845v1

Provable Unlearning with Gradient Ascent on Two-Layer ReLU Neural Networks

Machine Unlearning aims to remove specific data from trained models, addressing growing privacy and ethical concerns. We provide a theoretical analysis of a simple and widely used method - gradient ascent - used to reverse the influence of a specific data point without retraining from scratch. Leveraging the implicit bias of gradient descent towards solutions that satisfy the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem, we quantify the quality of the unlearned model by evaluating how well it satisfies these conditions w.r.t. the retained data. To formalize this idea, we propose a new success criterion, termed $(\epsilon, \delta, \tau)$-successful unlearning, and show that, for both linear models and two-layer neural networks with high dimensional data, a properly scaled gradient-ascent step satisfies this criterion and yields a model that closely approximates the retrained solution on the retained data. We also show that gradient ascent performs successful unlearning while still preserving generalization in a synthetic Gaussian-mixture setting.
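The basic operation is a single ascent step on the forget point's loss; a minimal sketch for a linear classifier with logistic loss (step size `eta` hypothetical, and none of the paper's scaling analysis):

```python
import numpy as np

def logistic_loss(w, x, y):
    # log(1 + exp(-y * <w, x>))
    return np.log1p(np.exp(-y * (w @ x)))

def logistic_grad(w, x, y):
    # gradient of the logistic loss with respect to w
    return -y * x / (1.0 + np.exp(y * (w @ x)))

def unlearn_step(w, x_forget, y_forget, eta=0.5):
    # one *ascent* step: increase the loss on the forget point,
    # reversing its training influence
    return w + eta * logistic_grad(w, x_forget, y_forget)

w = np.array([1.0, 1.0])
x_f, y_f = np.array([1.0, 0.0]), 1
w_after = unlearn_step(w, x_f, y_f)
```

The paper's contribution is showing when a properly scaled version of exactly this step lands close to the model retrained without the forget point.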

Updated: 2025-10-16 16:16:36

Categories: cs.LG,cs.CR,cs.NE,stat.ML

Download: http://arxiv.org/abs/2510.14844v1

Boosting Instruction Following at Scale

A typical approach developers follow to influence an LLM's behavior in an application is through careful manipulation of the prompt, such as by adding or modifying instructions. However, merely adding more instructions provides little assurance that they will actually be followed. We introduce Instruction Boosting as a post-generation method to increase the reliability of LLM prompt instructions. We show that Instruction Boosting improves the instruction following rate by up to 7 points for two instructions and up to 4 points for ten instructions. To demonstrate these results we introduce SCALEDIF, a benchmark with a scaled instruction volume of up to ten instructions per data sample. We also present an analysis of the commonly observed trend that performance degrades as more instructions are added. We show that an important factor contributing to this trend is the degree of tension and conflict that arises as the number of instructions is increased. We contribute a quantitative conflict scoring tool that explains the observed performance trends and provides feedback to developers on the impact that additional prompt instructions have on a model's performance.

Updated: 2025-10-16 16:15:58

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14842v1

Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models

When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.
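The masking step can be sketched in a few lines (a hypothetical reading of the abstract, not the released implementation): keep only the task-vector coordinates whose sign agrees with the descent direction (the negative gradient) computed from a few labeled samples at the target model, then add the masked vector to the target weights.

```python
import numpy as np

def gradfix_transport(task_vector, target_grad):
    """Zero out task-vector coordinates that move against -target_grad."""
    mask = np.sign(task_vector) == np.sign(-target_grad)
    return task_vector * mask

tv = np.array([0.5, -0.2, 0.1, -0.4])   # task vector from the old model
g  = np.array([-1.0, -0.3, 0.2, 0.9])   # few-shot gradient at the new model
delta = gradfix_transport(tv, g)
# every surviving coordinate moves against g, so delta @ (-g) >= 0:
# a first-order descent direction for the target loss, matching the
# paper's first-order guarantee in spirit
```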

Updated: 2025-10-16 16:13:33

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.09658v2

Reinforcement Learning with Stochastic Reward Machines

Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.

Updated: 2025-10-16 16:12:04

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14837v1

下载: http://arxiv.org/abs/2510.14837v1

The Tree-SNE Tree Exists

The clustering and visualisation of high-dimensional data is a ubiquitous task in modern data science. Popular techniques include nonlinear dimensionality reduction methods like t-SNE or UMAP. These methods face the 'scale-problem' of clustering: when dealing with the MNIST dataset, do we want to distinguish different digits or do we want to distinguish different ways of writing the digits? The answer is task dependent and depends on scale. We revisit an idea of Robinson & Pierce-Hoffman that exploits an underlying scaling symmetry in t-SNE to replace 2-dimensional with (2+1)-dimensional embeddings where the additional parameter accounts for scale. This gives rise to the t-SNE tree (short: tree-SNE). We prove that the optimal embedding depends continuously on the scaling parameter for all initial conditions outside a set of measure 0: the tree-SNE tree exists. This idea conceivably extends to other attraction-repulsion methods and is illustrated on several examples.

Updated: 2025-10-16 16:10:41

Categories: stat.ML,cs.LG,math.OC

Download: http://arxiv.org/abs/2510.15014v1

Intelligent Dynamic Handover via AI-assisted Signal Quality Prediction in 6G Multi-RAT Networks

The emerging paradigm of 6G multiple Radio Access Technology (multi-RAT) networks, where cellular and Wireless Fidelity (WiFi) transmitters coexist, requires mobility decisions that remain reliable under fast channel dynamics, interference, and heterogeneous coverage. Handover in multi-RAT deployments is still highly reactive and event-triggered, relying on instantaneous measurements and threshold events. This work proposes a Machine Learning (ML)-assisted Predictive Conditional Handover (P-CHO) framework based on a model-driven and short-horizon signal quality forecasts. We present a generalized P-CHO sequence workflow orchestrated by a RAT Steering Controller, which standardizes data collection, parallel per-RAT predictions, decision logic with hysteresis-based conditions, and CHO execution. Considering a realistic multi-RAT environment, we train RAT-aware Long Short Term Memory (LSTM) networks to forecast the signal quality indicators of mobile users along randomized trajectories. The proposed P-CHO models are trained and evaluated under different channel models for cellular and IEEE 802.11 WiFi integrated coverage. We study the impact of hyperparameter tuning of LSTM models under different system settings, and compare direct multi-step versus recursive P-CHO variants. Comparisons against baseline predictors are also carried out. Finally, the proposed P-CHO is tested under soft and hard handover settings, showing that hysteresis-enabled P-CHO scheme is able to reduce handover failures and ping-pong events. Overall, the proposed P-CHO framework can enable accurate, low-latency, and proactive handovers suitable for ML-assisted handover steering in 6G multi-RAT deployments.
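The hysteresis-based decision logic can be sketched as a gate on predicted signal quality (the margin and time-to-trigger values below are hypothetical, not from the paper): hand over only when the candidate RAT's forecast beats the serving RAT's by a margin for several consecutive prediction steps, which is what suppresses ping-pong events.

```python
def should_handover(serving_pred, candidate_pred, margin_db=3.0, ttt_steps=3):
    """Trigger CHO when the candidate forecast wins by margin_db
    for ttt_steps consecutive prediction steps."""
    run = 0
    for s, c in zip(serving_pred, candidate_pred):
        run = run + 1 if c >= s + margin_db else 0
        if run >= ttt_steps:
            return True
    return False

# a 5-step signal-quality forecast (dBm) for serving vs. candidate links,
# e.g. produced by the per-RAT LSTM predictors
ho = should_handover([-80, -81, -82, -83, -84], [-74, -74, -75, -75, -76])
```

In the framework described, this gate would sit in the RAT Steering Controller, consuming the parallel per-RAT forecasts.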

Updated: 2025-10-16 16:08:14

Categories: cs.LG,cs.NI

Download: http://arxiv.org/abs/2510.14832v1

RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.

Updated: 2025-10-16 16:07:50

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14830v1

Byzantine Failures Harm the Generalization of Robust Distributed Learning Algorithms More Than Data Poisoning

Robust distributed learning algorithms aim to maintain reliable performance despite the presence of misbehaving workers. Such misbehaviors are commonly modeled as Byzantine failures, allowing arbitrarily corrupted communication, or as data poisoning, a weaker form of corruption restricted to local training data. While prior work shows similar optimization guarantees for both models, an important question remains: how do these threat models impact generalization? Empirical evidence suggests a gap, yet it remains unclear whether it is unavoidable or merely an artifact of suboptimal attacks. We show, for the first time, a fundamental gap in generalization guarantees between the two threat models: Byzantine failures yield strictly worse rates than those achievable under data poisoning. Our findings leverage a tight algorithmic stability analysis of robust distributed learning. Specifically, we prove that: (i) under data poisoning, the uniform algorithmic stability of an algorithm with optimal optimization guarantees degrades by an additive factor of $\varTheta ( \frac{f}{n-f} )$, with $f$ out of $n$ workers misbehaving; whereas (ii) under Byzantine failures, the degradation is in $\Omega \big( \sqrt{ \frac{f}{n-2f}} \big)$.

Updated: 2025-10-16 16:05:21

Categories: cs.LG,cs.CR,stat.ML

Download: http://arxiv.org/abs/2506.18020v2

RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning

Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-horizon manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
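A toy reading of such a rule-based reward (the weights, helper names, and scoring terms below are invented for illustration): combine a long-horizon progress term with a penalty for actions that violate environment constraints.

```python
def plan_reward(plan, goal_steps, is_feasible):
    """Score a candidate action plan: progress toward the goal sequence
    minus a penalty per infeasible action."""
    progress = sum(a in goal_steps for a in plan) / max(len(goal_steps), 1)
    violations = sum(not is_feasible(a) for a in plan)
    return progress - 0.5 * violations

# a complete, feasible plan scores 1.0; an infeasible action costs 0.5
good = plan_reward(["pick", "place"], ["pick", "place"], lambda a: True)
bad = plan_reward(["pick", "fly"], ["pick", "place"], lambda a: a != "fly")
```

In an RL fine-tuning loop this scalar would serve as the reward signal for each sampled plan; the actual reward in the paper is defined over its own task representation.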

Updated: 2025-10-16 16:04:35

Categories: cs.AI,cs.RO

Download: http://arxiv.org/abs/2510.14828v1

To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models

State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any "truly long-form" generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.

Updated: 2025-10-16 16:02:45

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14826v1

Programmatic Representation Learning with Language Models

Classical models for supervised machine learning, such as decision trees, are efficient and interpretable predictors, but their quality is highly dependent on the particular choice of input features. Although neural networks can learn useful representations directly from raw data (e.g., images or text), this comes at the expense of interpretability and the need for specialized hardware to run them efficiently. In this paper, we explore a hypothesis class we call Learned Programmatic Representations (LeaPR) models, which stack arbitrary features represented as code (functions from data points to scalars) and decision tree predictors. We synthesize feature functions using Large Language Models (LLMs), which have rich prior knowledge in a wide range of domains and a remarkable ability to write code using existing domain-specific libraries. We propose two algorithms to learn LeaPR models from supervised data. First, we design an adaptation of FunSearch to learn features rather than directly generate predictors. Then, we develop a novel variant of the classical ID3 algorithm for decision tree learning, where new features are generated on demand when splitting leaf nodes. In experiments from chess position evaluation to image and text classification, our methods learn high-quality, neural network-free predictors often competitive with neural networks. Our work suggests a flexible paradigm for learning interpretable representations end-to-end where features and predictions can be readily inspected and understood.
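A toy LeaPR-style pipeline can make the hypothesis class concrete (the features here are written by hand; in the paper an LLM synthesizes them): programmatic features map raw points to scalars, and a simple tree learner — here just a decision stump — picks the best split over them.

```python
# programmatic features: code from raw data points to scalars
features = {
    "length": lambda s: len(s),
    "digits": lambda s: sum(ch.isdigit() for ch in s),
}

def featurize(point):
    return {name: f(point) for name, f in features.items()}

def fit_stump(data):
    """Return (train_agreement_count, feature, threshold) of the best
    single split, allowing either orientation of the rule."""
    best = None
    for name in features:
        for x, _ in data:
            t = featurize(x)[name]
            agree = sum((featurize(x2)[name] >= t) == y2 for x2, y2 in data)
            agree = max(agree, len(data) - agree)  # flipped rule
            if best is None or agree > best[0]:
                best = (agree, name, t)
    return best

# label: does the string contain a digit?
data = [("abc", False), ("a1b2", True), ("12", True), ("xyz", False)]
best = fit_stump(data)
```

Both the chosen feature and the split are inspectable as plain code, which is the interpretability point the paragraph makes; the paper's ID3 variant extends this by generating new feature functions on demand at each leaf split.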

Updated: 2025-10-16 16:02:42

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14825v1

Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning

Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment Perception and explicit Route choice modeling for effective Trajectory representation learning, dubbed PRTraj. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.

Updated: 2025-10-16 15:55:28

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14819v1

Adapting Noise to Data: Generative Flows from 1D Processes

We introduce a general framework for constructing generative models using one-dimensional noising processes. Beyond diffusion processes, we outline examples that demonstrate the flexibility of our approach. Motivated by this, we propose a novel framework in which the 1D processes themselves are learnable, achieved by parameterizing the noise distribution through quantile functions that adapt to the data. Our construction integrates seamlessly with standard objectives, including Flow Matching and consistency models. Learning quantile-based noise naturally captures heavy tails and compact supports when present. Numerical experiments highlight both the flexibility and the effectiveness of our method.
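The quantile parameterization can be sketched with inverse-CDF sampling (the logistic quantile below is a fixed stand-in for a learnable one): pushing uniforms through a quantile function q yields samples from the distribution whose inverse CDF is q, so heavy tails or compact supports are captured purely by the shape of q.

```python
import numpy as np

def sample_noise(quantile, n, rng):
    """Draw n noise samples by pushing uniforms through the quantile
    function (inverse CDF); endpoints are clipped for numerical safety."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=n)
    return quantile(u)

# logistic quantile: symmetric around 0, heavier-tailed than a Gaussian
logistic_q = lambda u: np.log(u / (1.0 - u))

rng = np.random.default_rng(0)
eps = sample_noise(logistic_q, 10_000, rng)
```

In the learnable setting the paper describes, q would itself be parameterized and fit to the data; swapping q for a step-like function instead yields compactly supported noise.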

Updated: 2025-10-16 15:52:49

Categories: stat.ML,cs.LG,math.AP

Download: http://arxiv.org/abs/2510.12636v2

Tackling Time-Series Forecasting Generalization via Mitigating Concept Drift

Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is important for time-series forecasting models to handle potential distribution shifts over time. In this paper, we initially identify two types of distribution shifts in time series: concept drift and temporal shift. While existing studies primarily focus on addressing temporal shift issues in time series forecasting, designing proper concept drift methods for time series forecasting has received comparatively less attention. Motivated by the need to address potential concept drift, and noting that conventional concept drift methods based on invariant learning face certain challenges in time-series forecasting, we propose a soft attention mechanism that finds invariant patterns from both the lookback and horizon time series. Additionally, we emphasize the critical importance of mitigating temporal shift as a preliminary step to addressing concept drift. In this context, we introduce ShifTS, a method-agnostic framework designed to tackle temporal shift first and then concept drift within a unified approach. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of agnostic models across multiple datasets, and in outperforming existing concept drift, temporal shift, and combined baselines.

Updated: 2025-10-16 15:48:52

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14814v1

Higher-order interactions of multi-layer prompt

The "pre-train, prompt" paradigm has successfully evolved in representation learning. While current prompt-tuning methods often introduce learnable prompts, they predominantly treat prompts as isolated, independent components across different network layers. This overlooks the complex and synergistic higher-order interactions that exist between prompts at various hierarchical depths, consequently limiting the expressive power and semantic richness of the prompted model. To address this fundamental gap, we propose a novel framework that explicitly models the Higher-order Interactions of Multi-layer Prompt. Our approach conceptualizes prompts from different layers not as separate entities, but as a cohesive system where their inter-relationships are critical. We design an innovative interaction module that captures these sophisticated, non-linear correlations among multi-layer prompts, effectively modeling their cooperative effects. This allows the model to dynamically aggregate and refine prompt information across the network's depth, leading to a more integrated and powerful prompting strategy. Extensive experiments on eight benchmark datasets demonstrate that our method, by leveraging these higher-order interactions, consistently surpasses state-of-the-art prompt-tuning baselines. The performance advantage is particularly pronounced in few-shot scenarios, validating that capturing the intricate interplay between multi-layer prompts is key to unlocking more robust and generalizable representation learning.

Updated: 2025-10-16 15:48:27

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.09394v2

Efficient Dynamic Structured Sparse Training with Learned Shuffles

Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures -- block, N:M, and diagonals -- we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90--95\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.
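
The expressivity gap the abstract describes can be illustrated with a small sketch (illustrative only: PA-DST learns the permutation jointly with the weights, whereas here a fixed random permutation stands in for it). Composing an N:M mask with a column permutation lets a layer realize support patterns that a plain N:M layout cannot.

```python
import numpy as np

def nm_mask(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m consecutive
    weights along each row (the usual N:M structured-sparsity layout)."""
    w = weights.reshape(-1, m)
    keep = np.argsort(-np.abs(w), axis=1)[:, :n]
    mask = np.zeros_like(w)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))

# Plain 2:4 layout: exactly 2 weights survive in each group of 4.
plain = nm_mask(w) * w

# Permutation-augmented layout: permute columns, apply the same 2:4 mask,
# then undo the permutation. The effective support pattern can be one a
# fixed 2:4 layout cannot realize.
perm = rng.permutation(8)
inv = np.argsort(perm)
permuted = (nm_mask(w[:, perm]) * w[:, perm])[:, inv]
```

The permuted variant keeps the same 2:4 budget (and thus the same hardware acceleration), which is why the paper positions structure plus a learned permutation as a sweet spot.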

Updated: 2025-10-16 15:48:17

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14812v1

Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning

Hebbian learning is a biological principle that intuitively describes how neurons adapt their connections through repeated stimuli. However, when applied to machine learning, it suffers from serious issues due to the unconstrained updates of the connections and the lack of accounting for feedback mediation. Such shortcomings limit its effective scaling to complex network architectures and tasks. To this end, we introduce the Structural Projection Hebbian Representation (SPHeRe), a novel unsupervised learning method that integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural information preservation backpropagates to the input through an auxiliary lightweight projection that conceptually serves as feedback mediation, while the orthogonality constraints account for the boundedness of the update magnitude. Extensive experimental results show that SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on standard image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. Furthermore, the method exhibits strong effectiveness in continual learning and transfer learning scenarios, and image reconstruction tasks show the robustness and generalizability of the extracted features. This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules within modern deep learning frameworks, showing the possibility of efficient and biologically inspired learning algorithms without a strong dependence on strict backpropagation. Our code is available at https://github.com/brain-intelligence-lab/SPHeRe.
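
A common way to express this kind of orthogonality constraint is a soft penalty of the form ||W W^T - I||_F^2; the sketch below is illustrative and not necessarily the paper's exact regulariser.

```python
import numpy as np

def orthogonality_penalty(W):
    """Soft orthogonality constraint ||W W^T - I||_F^2 on a (k, d) weight
    matrix; driving it to zero bounds the rows to an orthonormal frame,
    which is one way to keep Hebbian-style updates from growing unchecked."""
    k = W.shape[0]
    d = W @ W.T - np.eye(k)
    return float((d ** 2).sum())

rng = np.random.default_rng(0)
W_rand = rng.normal(size=(4, 8)) / np.sqrt(8)

# An orthonormal matrix built via QR incurs (near-)zero penalty.
Q, _ = np.linalg.qr(rng.normal(size=(8, 4)))
W_orth = Q.T
```

In training, this scalar would be added to the structural-preservation loss with a weighting coefficient.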

Updated: 2025-10-16 15:47:29

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14810v1

Merge-of-Thought Distillation

Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different "best teachers," and even for the same student, the best teacher can vary across datasets. Therefore, to unify multiple teachers' reasoning abilities into a student and overcome conflicts among the teachers' supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including Deepseek-R1, Qwen3-32B, and OpenAI-O1, demonstrating substantial gains. Besides, MoT consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics while reducing catastrophic forgetting, and shows robustness to distribution-shifted and peer-level teachers. Finally, we demonstrate that MoT yields a consensus CoT, eliminating teacher-specific inductive biases and inter-teacher conflicts while repeatedly reinforcing the learning of consensus reasoning features. These results position MoT as a simple, effective route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
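
The weight-space merging step can be sketched as a convex combination of the per-teacher student checkpoints (a minimal illustration over toy parameter dicts; MoT alternates this merge with the teacher-specific SFT branches, and the real checkpoints are full model state dicts):

```python
import numpy as np

def merge_students(state_dicts, weights=None):
    """Weight-space merge: a convex combination, parameter by parameter,
    of several same-architecture student variants."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Toy "students": the same base model fine-tuned against two teachers,
# drifting in opposite directions.
rng = np.random.default_rng(0)
base = {"layer.w": rng.normal(size=(3, 3)), "layer.b": rng.normal(size=3)}
student_a = {k: v + 0.1 for k, v in base.items()}
student_b = {k: v - 0.1 for k, v in base.items()}

merged = merge_students([student_a, student_b])
```

Here the symmetric drifts cancel in the merge, a toy analogue of how merging suppresses teacher-specific deviations while consensus directions reinforce.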

Updated: 2025-10-16 15:43:35

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.08814v3

Agentic NL2SQL to Reduce Computational Costs

Translating natural language queries into SQL queries (NL2SQL or Text-to-SQL) has recently been empowered by large language models (LLMs). Using LLMs to perform NL2SQL methods on a large collection of SQL databases necessitates processing large quantities of meta-information about the databases, which in turn results in lengthy prompts with many tokens and high processing costs. To address this challenge, we introduce Datalake Agent, an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently. Instead of utilizing direct solvers for NL2SQL that call the LLM once with all meta-information in the prompt, the Datalake Agent employs an interactive loop to reduce the utilized meta-information. Within the loop, the LLM is used in a reasoning framework that selectively requests only the necessary information to solve a table question answering task. We evaluate the Datalake Agent on a collection of 23 databases with 100 table question answering tasks. The Datalake Agent reduces the tokens used by the LLM by up to 87\% and thus allows for substantial cost reductions while maintaining competitive performance.

Updated: 2025-10-16 15:42:28

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14808v1

SimKO: Simple Pass@K Policy Optimization

Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
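
The asymmetric update can be sketched at the level of a single token's logits (illustrative only; the coefficients and the exact form of the loss are assumptions, not the paper's):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def simko_step(logits, correct, k=3, lr=1.0, top1_penalty=2.0):
    """One asymmetric logit update in the spirit of SimKO.

    correct=True  -> push probability mass toward the whole top-k, counteracting
                     concentration onto the single top-1 candidate;
    correct=False -> pull mass away from the top-1 candidate, with a stronger
                     coefficient than the reward side uses.
    """
    p = softmax(logits)
    order = np.argsort(-p)
    g = np.zeros_like(logits)
    if correct:
        g[order[:k]] = +1.0          # reward the top-k, not just the top-1
    else:
        g[order[0]] = -top1_penalty  # penalize the top-1 more strongly
    return logits + lr * g

logits = np.array([2.0, 1.0, 0.5, -1.0])
after_pos = simko_step(logits, correct=True)
after_neg = simko_step(logits, correct=False)
```

On a correct response the non-top-k tail loses mass to the whole top-k rather than to the top-1 alone; on an incorrect one the top-1 is deflated, which is the over-concentration mitigation the abstract describes.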

Updated: 2025-10-16 15:40:49

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14807v1

Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks

Early tumor detection saves lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks--detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor's size, number, appearance, and sometimes, pathology results--information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super

Updated: 2025-10-16 15:35:44

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14803v1

Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images

Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 deaths projected for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task-agnostic methodologies that can overlook crucial organ-specific morphological patterns representing distinct biological processes that fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity, reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 +- 0.04; accuracy = 68.37% +- 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p < 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% in accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between the 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.

Updated: 2025-10-16 15:32:05

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14800v1

LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Chemical process optimization maximizes production efficiency and economic performance, but optimization algorithms, including gradient-based solvers, numerical methods, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable. We present a multi-agent LLM framework that autonomously infers operating constraints from minimal process descriptions, then collaboratively guides optimization. Our AutoGen-based framework employs OpenAI's o3 model with specialized agents for constraint generation, parameter validation, simulation, and optimization guidance. Through autonomous constraint generation and iterative multi-agent optimization, the framework eliminates the need for predefined operational bounds. Validated on hydrodealkylation across cost, yield, and yield-to-cost ratio metrics, the framework achieved competitive performance with conventional methods while reducing wall-time 31-fold relative to grid search, converging in under 20 minutes. The reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs and applying domain-informed heuristics. Unlike conventional methods requiring predefined constraints, our approach uniquely combines autonomous constraint generation with interpretable parameter exploration. Model comparison reveals reasoning-capable architectures (o3, o1) are essential for successful optimization, while standard models fail to converge. This approach is particularly valuable for emerging processes and retrofit applications where operational constraints are poorly characterized or unavailable.

Updated: 2025-10-16 15:31:07

Categories: cs.LG,cs.AI,cs.CE

Download: http://arxiv.org/abs/2506.20921v2

Active Jammer Localization via Acquisition-Aware Path Planning

We propose an active jammer localization framework that combines Bayesian optimization with acquisition-aware path planning. Unlike passive crowdsourced methods, our approach adaptively guides a mobile agent to collect high-utility Received Signal Strength measurements while accounting for urban obstacles and mobility constraints. For this, we modify the A* algorithm into A-UCB* by incorporating acquisition values into trajectory costs, yielding planned paths with high acquisition value. Simulations on realistic urban scenarios show that the proposed method achieves accurate localization with fewer measurements than uninformed baselines, demonstrating consistent performance across different environments.
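
A minimal sketch of acquisition-aware planning (assumptions: a toy grid world, acquisition values in [0,1], and a zero heuristic, which reduces A* to Dijkstra and keeps the search exact because discounted step costs remain positive for lam < 1; A-UCB*'s actual cost shaping may differ):

```python
import heapq

def a_ucb_star(grid_acq, start, goal, lam=0.5):
    """A*-style search where each step's cost is discounted by the
    acquisition value of the cell entered, so high-acquisition cells
    attract the planned trajectory."""
    rows, cols = len(grid_acq), len(grid_acq[0])
    frontier = [(0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                step = 1.0 - lam * grid_acq[nr][nc]  # high acquisition -> cheap step
                ncost = cost + step
                if ncost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ncost
                    heapq.heappush(frontier, (ncost, (nr, nc), path + [(nr, nc)]))
    return float("inf"), []

# The middle column is highly informative, so the planner detours through
# it even though a straight path would be geometrically shorter.
acq = [[0.0, 0.9, 0.0],
       [0.0, 0.9, 0.0],
       [0.0, 0.9, 0.0]]
cost, path = a_ucb_star(acq, (0, 0), (2, 0), lam=0.9)
```

With `lam=0.9` the detour through the informative column costs about 1.57 versus 2.0 for the straight route, so the agent collects high-utility measurements on the way to the goal.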

Updated: 2025-10-16 15:22:24

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14790v1

Say My Name: a Model's Bias Discovery Framework

In the last few years, owing to the broad applicability of deep learning to downstream tasks and its end-to-end training capabilities, growing concerns have been raised about potential biases toward specific, non-representative patterns. Many works focusing on unsupervised debiasing usually leverage the tendency of deep models to learn ``easier'' samples, for example by clustering the latent space to obtain bias pseudo-labels. However, the interpretation of such pseudo-labels is not trivial, especially for a non-expert end user, as it does not provide semantic information about the bias features. To address this issue, we introduce ``Say My Name'' (SaMyNa), the first tool to identify biases within deep models semantically. Unlike existing methods, our approach focuses on biases learned by the model. Our text-based pipeline enhances explainability and supports debiasing efforts: applicable during either training or post-hoc validation, our method can disentangle task-related information and serves as a tool to analyze biases. Evaluation on traditional benchmarks demonstrates its effectiveness in detecting biases and even disclaiming them, showcasing its broad applicability for model diagnosis.

Updated: 2025-10-16 15:21:49

Categories: cs.LG,cs.AI,cs.CY

Download: http://arxiv.org/abs/2408.09570v2

Cross-Scenario Unified Modeling of User Interests at Billion Scale

User interests on content platforms are inherently diverse, manifesting through complex behavioral patterns across heterogeneous scenarios such as search, feed browsing, and content discovery. Traditional recommendation systems typically prioritize business metric optimization within isolated specific scenarios, neglecting cross-scenario behavioral signals and struggling to integrate advanced techniques like LLMs at billion-scale deployments, which finally limits their ability to capture holistic user interests across platform touchpoints. We propose RED-Rec, an LLM-enhanced hierarchical Recommender Engine for Diversified scenarios, tailored for industry-level content recommendation systems. RED-Rec unifies user interest representations across multiple behavioral contexts by aggregating and synthesizing actions from varied scenarios, resulting in comprehensive item and user modeling. At its core, a two-tower LLM-powered framework enables nuanced, multifaceted representations with deployment efficiency, and a scenario-aware dense mixing and querying policy effectively fuses diverse behavioral signals to capture cross-scenario user intent patterns and express fine-grained, context-specific intents during serving. We validate RED-Rec through online A/B testing on hundreds of millions of users in RedNote, showing substantial performance gains in both content recommendation and advertisement targeting tasks. We further introduce a million-scale sequential recommendation dataset, RED-MMU, for comprehensive offline training and evaluation. Our work advances unified user modeling, unlocking deeper personalization and fostering more meaningful user engagement in large-scale UGC platforms.

Updated: 2025-10-16 15:20:49

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2510.14788v1

A Novel GPT-Based Framework for Anomaly Detection in System Logs

Identification of anomalous events within system logs constitutes a pivotal element within the framework of cybersecurity defense strategies. However, this process faces numerous challenges, including the management of substantial data volumes, the distribution of anomalies, and the precision of conventional methods. To address this issue, the present paper puts forward a proposal for an intelligent detection method for system logs based on Generative Pre-trained Transformers (GPT). The efficacy of this approach is attributable to a combination of structured input design and a Focal Loss optimization strategy, which collectively result in a substantial enhancement of the performance of log anomaly detection. The initial approach involves the conversion of raw logs into event ID sequences through the use of the Drain parser. Subsequently, the Focal Loss function is employed to address the issue of class imbalance. The experimental results demonstrate that the optimized GPT-2 model significantly outperforms the unoptimized model on a range of key metrics, including precision, recall, and F1 score. In specific tasks, performance comparable or superior to that of the GPT-3.5 API has been demonstrated.
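
The Focal Loss component is standard and can be sketched directly: it down-weights easy, well-classified examples so the rare anomalous class dominates the gradient, which is the class-imbalance fix the abstract describes (the binary form is shown; the paper's exact hyperparameters are not specified here).

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor shrinks the loss of confident, correct
    predictions, focusing training on hard or rare examples."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy negative contributes far less loss than a missed anomaly.
probs = np.array([0.05, 0.95, 0.10])   # predicted P(anomaly)
labels = np.array([0,    0,    1])     # 1 = anomalous log sequence
losses = focal_loss(probs, labels)
```

With `gamma=0` and `alpha=0.5` this reduces (up to a constant) to ordinary cross-entropy, which is the baseline the optimization replaces.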

Updated: 2025-10-16 15:17:39

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2510.16044v1

Causal Discovery for Linear DAGs with Dependent Latent Variables via Higher-order Cumulants

This paper addresses the problem of estimating causal directed acyclic graphs in linear non-Gaussian acyclic models with latent confounders (LvLiNGAM). Existing methods assume mutually independent latent confounders or cannot properly handle models with causal relationships among observed variables. We propose a novel algorithm that identifies causal DAGs in LvLiNGAM, allowing causal structures among latent variables, among observed variables, and between the two. The proposed method leverages higher-order cumulants of observed data to identify the causal structure. Extensive simulations and experiments with real-world data demonstrate the validity and practical utility of the proposed algorithm.
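
For reference, the higher-order cumulants in question are the standard multivariate ones; for zero-mean observed variables the third- and fourth-order cross-cumulants reduce to

```latex
\operatorname{cum}(x_i, x_j, x_k) = \mathbb{E}[x_i x_j x_k],
\qquad
\operatorname{cum}(x_i, x_j, x_k, x_l) =
\mathbb{E}[x_i x_j x_k x_l]
- \mathbb{E}[x_i x_j]\,\mathbb{E}[x_k x_l]
- \mathbb{E}[x_i x_k]\,\mathbb{E}[x_j x_l]
- \mathbb{E}[x_i x_l]\,\mathbb{E}[x_j x_k].
```

All cumulants above order two vanish for Gaussian variables, which is what makes these statistics informative in the non-Gaussian LiNGAM setting.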

Updated: 2025-10-16 15:15:20

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.14780v1

Leveraging Code Cohesion Analysis to Identify Source Code Supply Chain Attacks

Supply chain attacks significantly threaten software security through malicious code injections within legitimate projects. Such attacks are very rare but may have a devastating impact. Detecting spurious code injections using automated tools is further complicated as it often requires deciphering the intention of both the inserted code and its context. In this study, we propose an unsupervised approach for highlighting spurious code injections by quantifying cohesion disruptions in the source code. Using a name-prediction-based cohesion (NPC) metric, we analyze how function cohesion changes when malicious code is introduced compared to natural cohesion fluctuations. An analysis of 54,707 functions over 369 open-source C++ repositories reveals that code injection reduces cohesion and shifts naming patterns toward shorter, less descriptive names compared to genuine function updates. Considering the sporadic nature of real supply-chain attacks, we evaluate the proposed method with extreme test-set imbalance and show that monitoring high-cohesion functions with NPC can effectively detect functions with injected code, achieving a Precision@100 of 36.41% at a 1:1,000 ratio and 12.47% at 1:10,000. These results suggest that automated cohesion measurements, in general, and name-prediction-based cohesion, in particular, may help identify supply chain attacks, improving source code integrity.

Updated: 2025-10-16 15:14:04

Categories: cs.SE,cs.LG

Download: http://arxiv.org/abs/2510.14778v1

Understanding and Mitigating Covert Channel and Side Channel Vulnerabilities Introduced by RowHammer Defenses

DRAM chips are vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing or keeping open a DRAM row causes bitflips in nearby rows. Attackers leverage RowHammer bitflips in real systems to take over systems and leak data. Consequently, many prior works propose defenses, including recent DDR specifications introducing new defenses (e.g., PRAC and RFM). For robust operation, it is critical to analyze other security implications of RowHammer defenses. Unfortunately, no prior work analyzes the timing covert and side channel vulnerabilities introduced by RowHammer defenses. This paper presents the first analysis and evaluation of timing covert and side channel vulnerabilities introduced by state-of-the-art RowHammer defenses. We demonstrate that RowHammer defenses' preventive actions have two fundamental features that enable timing channels. First, preventive actions reduce DRAM bandwidth availability, resulting in longer memory latencies. Second, preventive actions can be triggered on demand depending on memory access patterns. We introduce LeakyHammer, a new class of attacks that leverage the RowHammer defense-induced memory latency differences to establish communication channels and leak secrets. First, we build two covert channel attacks exploiting two state-of-the-art RowHammer defenses, achieving 39.0 Kbps and 48.7 Kbps channel capacity. Second, we demonstrate a website fingerprinting attack that identifies visited websites based on the RowHammer-preventive actions they cause. We propose and evaluate three countermeasures against LeakyHammer. Our results show that fundamentally mitigating LeakyHammer induces large performance overheads in highly RowHammer-vulnerable systems. We believe and hope our work can enable and aid future work on designing better solutions and more robust systems in the presence of such new vulnerabilities.
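The receiver side of the timing channel described above reduces to latency thresholding: defense-triggered preventive actions make some accesses measurably slower, and the receiver decodes bits by comparing measured latency against a threshold. A toy decoding sketch (latency values are invented; a real attack measures DRAM access times):

```python
# Toy decoder for a timing covert channel: the sender induces (bit 1) or
# withholds (bit 0) defense-triggered slowdowns; the receiver thresholds
# the observed access latency to recover each bit.
def decode_bits(latencies_ns, threshold_ns):
    """1 = slow access (preventive action triggered), 0 = fast access."""
    return [1 if t > threshold_ns else 0 for t in latencies_ns]

# Hypothetical measurements: fast, slow, fast, slow.
bits = decode_bits([120, 480, 95, 510], threshold_ns=300)
```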

Updated: 2025-10-16 15:11:02

Categories: cs.CR,cs.AR

Download: http://arxiv.org/abs/2503.17891v3

Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on the probabilities of the answer choices. For models requiring reasoning, however, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. To mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question-answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
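The Answer Regeneration step can be sketched in a few lines; the model call here is a stub standing in for a real LLM inference, and the prompt format is our illustrative assumption:

```python
# Answer Regeneration sketch: feed the original question plus the model's
# free-form reasoning back, suffixed with "Answer:", and read the final
# answer from the regenerated output instead of parsing the reasoning
# with brittle extraction rules.
def regenerate_answer(model, question, reasoning):
    prompt = f"{question}\n{reasoning}\nAnswer:"
    return model(prompt).strip()

# Hypothetical stub standing in for an LLM inference call.
def toy_model(prompt):
    return " 42" if prompt.endswith("Answer:") else "Let me think..."

final = regenerate_answer(toy_model, "What is 6 x 7?", "6 x 7 = 42.")
```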

Updated: 2025-10-16 15:09:22

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14773v1

Leveraging LLMs, IDEs, and Semantic Embeddings for Automated Move Method Refactoring

MOVEMETHOD is a hallmark refactoring. Despite a plethora of research tools that recommend which methods to move and where, these recommendations do not align with how expert developers perform MOVEMETHOD. Given the extensive training of Large Language Models and their reliance upon naturalness of code, they should expertly recommend which methods are misplaced in a given class and which classes are better hosts. Our formative study of 2,016 LLM recommendations revealed that LLMs give expert suggestions, yet they are unreliable: up to 80% of the suggestions are hallucinations. We introduce the first fully LLM-powered assistant for MOVEMETHOD refactoring that automates its whole end-to-end lifecycle, from recommendation to execution. We designed novel solutions that automatically filter LLM hallucinations using static analysis from IDEs and a novel workflow that requires LLMs to be self-consistent, critique, and rank refactoring suggestions. As MOVEMETHOD refactoring requires global, project-level reasoning, we solved the limited context size of LLMs by employing refactoring-aware retrieval-augmented generation (RAG). Our approach, MM-assist, synergistically combines the strengths of the LLM, IDE, static analysis, and semantic relevance. In our thorough, multi-methodology empirical evaluation, we compare MM-assist with the previous state-of-the-art approaches. MM-assist significantly outperforms them: (i) on a benchmark widely used by other researchers, our Recall@1 and Recall@3 show a 1.7x improvement; (ii) on a corpus of 210 recent refactorings from open-source software, our Recall rates improve by at least 2.4x. Lastly, we conducted a user study with 30 experienced participants who used MM-assist to refactor their own code for one week. They rated 82.8% of MM-assist recommendations positively. This shows that MM-assist is both effective and useful.

Updated: 2025-10-16 15:08:16

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2503.20934v2

Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality

Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor in the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth's comprehensive datasets enable the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12,000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and the Navier-Stokes algorithm, on an evaluation set of 1,000 samples. Results show that our approach consistently outperforms these methods, improving reconstruction accuracy by 4-15% (RMSE) and perceptual similarity to the original data by 29-81% (LPIPS).
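One of the baselines named above, Inverse Distance Weighting, is simple enough to sketch: a missing heightmap cell is filled with a distance-weighted average of known cells. This is the generic IDW formula, not the paper's implementation; the sample points are invented:

```python
# Inverse Distance Weighting (IDW) void-filling: estimate a missing cell
# as the average of known cells, weighted by inverse distance^power.
def idw_fill(known, target, power=2.0):
    """known: list of ((x, y), value); target: (x, y) of the missing cell."""
    num = den = 0.0
    for (x, y), v in known:
        d2 = (x - target[0]) ** 2 + (y - target[1]) ** 2
        w = 1.0 / d2 ** (power / 2)   # 1 / distance**power
        num += w * v
        den += w
    return num / den

# Two known heights equidistant from the gap: the fill is their mean.
filled = idw_fill([((0, 0), 10.0), ((2, 0), 20.0)], (1, 0))
```

Because the estimate is a convex combination of neighboring values, IDW smooths over voids, which is exactly why it tends to lose the geometric coherence the diffusion model preserves.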

Updated: 2025-10-16 15:02:05

Categories: cs.CV,cs.AI,cs.GR

Download: http://arxiv.org/abs/2510.14765v1

COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes

Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) process supervision is highly effective but requires stabilization with general data: a ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance, and below this threshold the win rate progressively degrades (from 62.75% down to 35.78%); (2) creative capabilities are culturally bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance); and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
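The 1:12 creative-to-general mixing ratio reported above amounts to a simple interleaving recipe at training-data-assembly time. A minimal sketch (the interleaving scheme and names are our illustrative assumptions; the paper only specifies the ratio):

```python
# Interleave one process-supervised creative sample per `ratio` general
# samples, matching the reported 1:12 stabilization threshold.
def mix_batches(creative, general, ratio=12):
    mixed = []
    g = iter(general)
    for c in creative:
        mixed.append(c)                # one creative sample...
        for _ in range(ratio):
            mixed.append(next(g))      # ...followed by `ratio` general ones
    return mixed

mixed = mix_batches(["c0"], [f"g{i}" for i in range(12)])
```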

Updated: 2025-10-16 15:01:19

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14763v1

Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in training data, linguistic imbalances, or adversarial manipulation. Despite mitigation efforts, recent studies show that LLMs remain vulnerable to adversarial attacks that elicit biased outputs. This work proposes a scalable benchmarking framework to assess LLM robustness to adversarial bias elicitation. Our methodology involves: (i) systematically probing models across multiple tasks targeting diverse sociocultural biases, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach, and (iii) employing jailbreak techniques to reveal safety vulnerabilities. To facilitate systematic benchmarking, we release a curated dataset of bias-related prompts, named CLEAR-Bias. Our analysis, identifying DeepSeek V3 as the most reliable judge LLM, reveals that bias resilience is uneven, with age, disability, and intersectional biases among the most prominent. Some small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. However, no model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective across model families. We also find that successive LLM generations exhibit slight safety gains, while models fine-tuned for the medical domain tend to be less safe than their general-purpose counterparts.

Updated: 2025-10-16 14:59:50

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.07887v2

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

Updated: 2025-10-16 14:56:46

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.06261v5

ConDiSim: Conditional Diffusion Models for Simulation Based Inference

We present a conditional diffusion model - ConDiSim, for simulation-based inference of complex systems with intractable likelihoods. ConDiSim leverages denoising diffusion probabilistic models to approximate posterior distributions, consisting of a forward process that adds Gaussian noise to parameters, and a reverse process learning to denoise, conditioned on observed data. This approach effectively captures complex dependencies and multi-modalities within posteriors. ConDiSim is evaluated across ten benchmark problems and two real-world test problems, where it demonstrates effective posterior approximation accuracy while maintaining computational efficiency and stability in model training. ConDiSim offers a robust and extensible framework for simulation-based inference, particularly suitable for parameter inference workflows requiring fast inference methods.
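The forward process described above is the standard DDPM noising step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A minimal sketch (the noise schedule is illustrative, not from the paper):

```python
import math

# DDPM forward (noising) process on a parameter vector x0.
def forward_diffuse(x0, t, betas, eps):
    """Return x_t given clean x0, step t, beta schedule, and noise eps."""
    alpha_bar = 1.0
    for s in range(t):
        alpha_bar *= 1.0 - betas[s]    # cumulative product of (1 - beta_s)
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(x0, eps)]

betas = [0.01] * 100        # toy constant schedule
x0 = [1.0, -1.0]
eps = [0.0, 0.0]            # zero noise isolates the sqrt(alpha_bar) scaling
xt = forward_diffuse(x0, 10, betas, eps)
```

The learned reverse process then denoises x_t back toward a posterior sample, conditioned on the observed data.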

Updated: 2025-10-16 14:53:05

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2505.08403v2

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
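The handcrafted bag-of-words variant of the future summary can be sketched directly: the auxiliary head's target at position t is a normalized count vector over the tokens that follow t. This is a toy version of that target construction (vocabulary and normalization are our illustrative assumptions):

```python
# Handcrafted "future summary" target in the spirit of FSP: a normalized
# bag-of-words over the tokens after position t, which an auxiliary head
# would be trained to predict alongside next-token prediction.
def bow_future_summary(tokens, t, vocab):
    future = tokens[t + 1:]
    counts = [future.count(w) for w in vocab]
    total = sum(counts) or 1           # avoid division by zero at sequence end
    return [c / total for c in counts]

vocab = ["a", "b", "c"]
summary = bow_future_summary(["a", "b", "b", "c"], 0, vocab)
```

The learned variant replaces this handcrafted vector with embeddings from a right-to-left reverse language model.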

Updated: 2025-10-16 14:52:52

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14751v1

ColumnDisturb: Understanding Column-based Read Disturbance in Real DRAM Chips and Implications for Future Systems

We experimentally demonstrate a new widespread read disturbance phenomenon, ColumnDisturb, in real commodity DRAM chips. By repeatedly opening or keeping a DRAM row (aggressor row) open, we show that it is possible to disturb DRAM cells through a DRAM column (i.e., bitline) and induce bitflips in DRAM cells sharing the same columns as the aggressor row (across multiple DRAM subarrays). With ColumnDisturb, the activation of a single row concurrently disturbs cells across as many as three subarrays (e.g., 3072 rows) as opposed to RowHammer/RowPress, which affect only a few neighboring rows of the aggressor row in a single subarray. We rigorously characterize ColumnDisturb and its characteristics under various operational conditions using 216 DDR4 and 4 HBM2 chips from three major manufacturers. Among our 27 key experimental observations, we highlight two major results and their implications. First, ColumnDisturb affects chips from all three major manufacturers and worsens as DRAM technology scales down to smaller node sizes (e.g., the minimum time to induce the first ColumnDisturb bitflip reduces by up to 5.06x). We observe that, in existing DRAM chips, ColumnDisturb induces bitflips within a standard DDR4 refresh window (e.g., in 63.6 ms) in multiple cells. We predict that, as DRAM technology node size reduces, ColumnDisturb would worsen in future DRAM chips, likely causing many more bitflips in the standard refresh window. Second, ColumnDisturb induces bitflips in many (up to 198x) more rows than retention failures. Therefore, ColumnDisturb has strong implications for retention-aware refresh mechanisms that leverage the heterogeneity in cell retention times: our detailed analyses show that ColumnDisturb greatly reduces the benefits of such mechanisms.

Updated: 2025-10-16 14:52:41

Categories: cs.AR,cs.CR

Download: http://arxiv.org/abs/2510.14750v1

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.
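The term-by-term evaluation that enables both the "safety dial" and the adaptive cascade can be sketched as follows (a toy scalar version: the real probes act on high-dimensional activations, and the stopping threshold here is our illustrative assumption):

```python
# Truncated-polynomial monitoring sketch: evaluate low-order terms first
# and early-stop once the partial score is already decisive, so easy
# inputs cost less compute than ambiguous ones.
def tpc_score(x, coeffs, threshold=2.0):
    """Sum coeffs[k] * x**k term by term; stop once |partial| > threshold."""
    score, terms_used = 0.0, 0
    for k, c in enumerate(coeffs):
        score += c * x ** k
        terms_used = k + 1
        if abs(score) > threshold:     # confident: skip higher-order terms
            break
    return score, terms_used

# A clear case exits after the linear term; ambiguous inputs would use all 4.
score, used = tpc_score(3.0, [0.5, 1.0, 0.25, 0.1])
```

Evaluating more terms is the "buy stronger guardrails" mode; the early exit is the cascade mode.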

Updated: 2025-10-16 14:51:42

Categories: cs.LG

Download: http://arxiv.org/abs/2509.26238v2

Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues stemming from the crosscoder's L1 training loss that can misattribute concepts as unique to the fine-tuned model when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as "false information" and "personal question", along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat-tuning modifies model behavior.
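The key difference between the L1 and BatchTopK training regimes is the sparsity mechanism: instead of penalizing latent magnitudes, BatchTopK keeps only the k largest activations across the whole batch and zeroes the rest. A toy version of that selection rule (our simplified sketch; real crosscoders apply this to encoder pre-activations during training):

```python
# BatchTopK sparsity sketch: keep the k largest activations globally
# across the batch, zero everything else (no L1 magnitude penalty).
def batch_topk(acts, k):
    """acts: list of per-example activation lists; returns sparsified copy."""
    flat = sorted((v for row in acts for v in row), reverse=True)
    cutoff = flat[k - 1]               # k-th largest value across the batch
    return [[v if v >= cutoff else 0.0 for v in row] for row in acts]

sparse = batch_topk([[0.9, 0.1], [0.5, 0.8]], k=2)
```

Because the cutoff is global rather than a per-latent shrinkage, latent magnitudes are not biased downward, which is what avoids the misattribution issues described above.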

Updated: 2025-10-16 14:44:45

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.02922v3

DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models

Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks-activation maximization, slice discovery and debiasing, and bias explanation-each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.

Updated: 2025-10-16 14:43:25

Categories: cs.CV,cs.AI,I.2.m

Download: http://arxiv.org/abs/2510.14741v1

The Principle of Uncertain Maximum Entropy

The Principle of Maximum Entropy is a rigorous technique for estimating an unknown distribution given partial information while simultaneously minimizing bias. However, an important requirement for applying the principle is that the available information be provided error-free (Jaynes 1982). We relax this requirement using a memoryless communication channel as a framework to derive a new, more general principle. We show that our new principle provides an upper bound on the entropy of the unknown distribution, and that the amount of information lost due to the use of a given communication channel is unknown unless the unknown distribution's entropy is also known. Using our new principle we provide a new interpretation of the classic principle and experimentally show its performance relative to the classic principle and other generally applicable solutions. Finally, we present a simple algorithm for solving our new principle and an approximation useful when samples are limited.
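For context, the classic principle being generalized solves a constrained optimization: among all distributions matching the given expectations, pick the one with maximal entropy, which has exponential-family form. A toy solver for a single mean constraint on a small support (the bisection approach and support are our illustrative choices, not the paper's algorithm):

```python
import math

# Classic maximum entropy with one mean constraint on support {0, 1, 2}:
# the maxent solution is p_i proportional to exp(lam * s_i); we find lam
# by bisection so that the constraint E[s] = target_mean holds.
def maxent_with_mean(support, target_mean):
    def mean_for(lam):
        w = [math.exp(lam * s) for s in support]
        z = sum(w)
        return sum(s * wi for s, wi in zip(support, w)) / z
    lo, hi = -50.0, 50.0               # mean_for is increasing in lam
    for _ in range(100):
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * s) for s in support]
    z = sum(w)
    return [wi / z for wi in w]

# With a symmetric support and target mean 1.0, maxent recovers uniform.
p = maxent_with_mean([0, 1, 2], 1.0)
```

The paper's contribution is to handle the case where the constraint values themselves arrive through a noisy (memoryless) channel rather than error-free.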

Updated: 2025-10-16 14:36:12

Categories: cs.IT,cs.CV,cs.LG,math.IT

Download: http://arxiv.org/abs/2305.09868v5

The Pursuit of Diversity: Multi-Objective Testing of Deep Reinforcement Learning Agents

Testing deep reinforcement learning (DRL) agents in safety-critical domains requires discovering diverse failure scenarios. Existing tools such as INDAGO rely on single-objective optimization focused solely on maximizing failure counts, but this does not ensure discovered scenarios are diverse or reveal distinct error types. We introduce INDAGO-Nexus, a multi-objective search approach that jointly optimizes for failure likelihood and test scenario diversity using multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies. We evaluated INDAGO-Nexus on three DRL agents: humanoid walker, self-driving car, and parking agent. On average, INDAGO-Nexus discovers up to 83% and 40% more unique failures (test effectiveness) than INDAGO in the SDC and Parking scenarios, respectively, while reducing time-to-failure by up to 67% across all agents.
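The Pareto-front selection step at the heart of the multi-objective search can be sketched as follows (a minimal two-objective version with invented scores; the paper additionally uses multiple diversity metrics and evolutionary operators):

```python
# Pareto-front selection over (failure likelihood, diversity), both to be
# maximized: keep every scenario not dominated by another scenario.
def pareto_front(points):
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# (0.4, 0.4) is dominated by (0.5, 0.8); the other two trade off.
front = pareto_front([(0.9, 0.2), (0.5, 0.8), (0.4, 0.4)])
```

Single-objective search would keep only the highest-failure point; the front retains high-diversity scenarios too, which is what surfaces distinct error types.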

Updated: 2025-10-16 14:25:55

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14727v1

Disentangled and Self-Explainable Node Representation Learning

Node representations, or embeddings, are low-dimensional vectors that capture node properties, typically learned through unsupervised structural similarity objectives or supervised tasks. While recent efforts have focused on explaining graph model decisions, the interpretability of unsupervised node embeddings remains underexplored. To bridge this gap, we introduce DiSeNE (Disentangled and Self-Explainable Node Embedding), a framework that generates self-explainable embeddings in an unsupervised manner. Our method employs disentangled representation learning to produce dimension-wise interpretable embeddings, where each dimension is aligned with distinct topological structure of the graph. We formalize novel desiderata for disentangled and interpretable embeddings, which drive our new objective functions, optimizing simultaneously for both interpretability and disentanglement. Additionally, we propose several new metrics to evaluate representation quality and human interpretability. Extensive experiments across multiple benchmark datasets demonstrate the effectiveness of our approach.

Updated: 2025-10-16 14:23:53

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2410.21043v2

Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization

Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better the generalization might be. Unfortunately, the correlation between many existing sharpness measures and generalization is usually not strong, sometimes even weak. To close the gap between intuition and reality, we propose a novel sharpness measure, i.e., \textit{R\'enyi sharpness}, which is defined as the negative R\'enyi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that \textit{uniform} (identical) eigenvalues of the loss Hessian are most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the \textit{R\'enyi entropy} to concisely characterize the extent of the spread of the eigenvalues of the loss Hessian. Normally, the larger the spread, the smaller the (R\'enyi) entropy. To rigorously establish the relationship between generalization and (R\'enyi) sharpness, we provide several generalization bounds in terms of R\'enyi sharpness, by taking advantage of the reparametrization invariance property of R\'enyi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (specifically, Kendall rank correlation) between R\'enyi sharpness and generalization. Moreover, we propose to use a variant of R\'enyi sharpness as a regularizer during training, i.e., R\'enyi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worth noting that the test accuracy gain of our proposed RSAM method can be as high as nearly 2.5\%, compared against the classical SAM method.
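The core quantity can be sketched numerically. A minimal illustration, assuming the Hessian eigenvalues are normalized to a probability distribution before taking the Rényi entropy; this normalization and the choice of order alpha=2 are assumptions for the example, not necessarily the paper's exact definition.

```python
import math

def renyi_entropy(eigs, alpha=2.0):
    """Renyi entropy of the normalized loss-Hessian eigenvalue spectrum.
    Normalizing the eigenvalues to sum to 1 is an illustrative assumption."""
    total = sum(eigs)
    p = [e / total for e in eigs]
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def renyi_sharpness(eigs, alpha=2.0):
    """Negative Renyi entropy: a flat (uniform) spectrum gives the lowest sharpness."""
    return -renyi_entropy(eigs, alpha)

flat = renyi_sharpness([1.0, 1.0, 1.0, 1.0])    # uniform eigenvalues, trace = 4
peaked = renyi_sharpness([3.7, 0.1, 0.1, 0.1])  # same trace, concentrated spectrum
```

With the trace held constant, the uniform spectrum attains the maximal entropy log(4), hence the minimal sharpness, matching idea 1) above.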

Updated: 2025-10-16 14:21:40

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07758v2

Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References

Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines--a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.

Updated: 2025-10-16 14:20:00

Categories: cs.LG,cs.AR,cs.PL

Download: http://arxiv.org/abs/2510.14719v1

Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling

Increasing the batch size during training -- a ''batch ramp'' -- is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36\%$, approaching the theoretical limit implied by our analysis.
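The substitution rule stated above is simple enough to write down directly. A minimal sketch of the scheduling rule only (the surrounding training loop is omitted):

```python
import math

def seesaw_step(lr, batch_size):
    """Seesaw substitution: where a standard schedule would halve the
    learning rate, scale it by 1/sqrt(2) and double the batch size instead."""
    return lr / math.sqrt(2), batch_size * 2

lr, bs = 1e-3, 256
for _ in range(2):  # two scheduled decay events
    lr, bs = seesaw_step(lr, bs)
# two Seesaw events compose to the same overall lr decay as one halving,
# while the batch size has grown 4x, cutting the number of serial steps
```

Because each event multiplies the learning rate by 1/sqrt(2), pairs of events reproduce the loss dynamics of a conventional halving at equal FLOPs, which is where the reported wall-clock savings come from.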

Updated: 2025-10-16 14:17:38

Categories: cs.LG,cs.AI,math.OC,stat.ML

Download: http://arxiv.org/abs/2510.14717v1

Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work provides an initial step toward building more historically accurate TTI models.

Updated: 2025-10-16 14:16:14

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.17064v2

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.

Updated: 2025-10-16 14:13:23

Categories: cs.RO,cs.AI,cs.CL,cs.CV,cs.LG

Download: http://arxiv.org/abs/2412.04445v4

Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models

Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.

Updated: 2025-10-16 14:11:52

Categories: cs.CV,cs.AI,eess.IV

Download: http://arxiv.org/abs/2510.14713v1

Fast and Scalable Score-Based Kernel Calibration Tests

We introduce the Kernel Calibration Conditional Stein Discrepancy test (KCCSD test), a non-parametric, kernel-based test for assessing the calibration of probabilistic models with well-defined scores. In contrast to previous methods, our test avoids the need for possibly expensive expectation approximations while providing control over its type-I error. We achieve these improvements by using a new family of kernels for score-based probabilities that can be estimated without probability density samples, and by using a conditional goodness-of-fit criterion for the KCCSD test's U-statistic. We demonstrate the properties of our test on various synthetic settings.

Updated: 2025-10-16 14:11:14

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.14711v1

MCbiF: Measuring Topological Autocorrelation in Multiscale Clusterings via 2-Parameter Persistent Homology

Datasets often possess an intrinsic multiscale structure with meaningful descriptions at different levels of coarseness. Such datasets are naturally described as multi-resolution clusterings, i.e., not necessarily hierarchical sequences of partitions across scales. To analyse and compare such sequences, we use tools from topological data analysis and define the Multiscale Clustering Bifiltration (MCbiF), a 2-parameter filtration of abstract simplicial complexes that encodes cluster intersection patterns across scales. The MCbiF can be interpreted as a higher-order extension of Sankey diagrams and reduces to a dendrogram for hierarchical sequences. We show that the multiparameter persistent homology (MPH) of the MCbiF yields a finitely presented and block decomposable module, and its stable Hilbert functions characterise the topological autocorrelation of the sequence of partitions. In particular, at dimension zero, the MPH captures violations of the refinement order of partitions, whereas at dimension one, the MPH captures higher-order inconsistencies between clusters across scales. We demonstrate through experiments the use of MCbiF Hilbert functions as topological feature maps for downstream machine learning tasks. MCbiF feature maps outperform information-based baseline features on both regression and classification tasks on synthetic sets of non-hierarchical sequences of partitions. We also show an application of MCbiF to real-world data to measure non-hierarchies in wild mice social grouping patterns across time.
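The refinement order whose violations the dimension-zero MPH captures can be illustrated with a toy containment check. This is only the set-theoretic criterion, not the persistent-homology computation; all names are illustrative.

```python
def refines(finer, coarser):
    """True if every cluster of `finer` lies inside some cluster of `coarser`.
    Toy check of the refinement order between two partitions; violations of
    this order are what the dimension-zero MPH of the MCbiF detects."""
    coarse_sets = [set(c) for c in coarser]
    return all(any(set(f) <= c for c in coarse_sets) for f in finer)

hierarchical = refines([[1], [2], [3, 4]], [[1, 2], [3, 4]])
non_hierarchical = refines([[1, 3], [2, 4]], [[1, 2], [3, 4]])  # {1,3} straddles two coarse clusters
```

In a strictly hierarchical multiscale clustering every consecutive pair of scales passes this check and the MCbiF reduces to a dendrogram; a failure at any pair of scales signals the non-hierarchical structure the bifiltration is built to measure.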

Updated: 2025-10-16 14:11:12

Categories: math.AT,cs.LG,physics.data-an,Primary 55N31, Secondary 62H30

Download: http://arxiv.org/abs/2510.14710v1

Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery

Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. "interesting points". We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% -- from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at https://github.com/microsoft/whales.
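The flagging of spatial outliers can be illustrated with a simple z-score rule over scene intensities. This is a toy stand-in, not the open-sourced pipeline (see the linked repository for the actual code); the threshold and names are assumptions.

```python
import statistics

def interesting_points(values, z_thresh=3.0):
    """Flag indices whose value deviates from the scene mean by more than
    z_thresh standard deviations -- a toy stand-in for the paper's
    statistical spatial-outlier detector ("interesting points")."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_thresh]

scene = [10.0] * 99 + [200.0]  # mostly uniform ocean, one bright patch
flags = interesting_points(scene)
```

Only the flagged points are routed to the labeling interface, which is how the area needing expert inspection shrinks by orders of magnitude while recall stays high.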

Updated: 2025-10-16 14:10:51

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14709v1

SLIE: A Secure and Lightweight Cryptosystem for Data Sharing in IoT Healthcare Services

The Internet of Medical Things (IoMT) has revolutionized healthcare by transforming medical operations into standardized, interoperable services. However, this service-oriented model introduces significant security vulnerabilities in device management and communication, which are especially critical given the sensitivity of medical data. To address these risks, this paper proposes SLIE (Secure and Lightweight Identity Encryption), a novel cryptosystem based on Wildcard Key Derivation Identity-Based Encryption (WKD-IBE). SLIE ensures scalable trust and secure omnidirectional communication through end-to-end encryption, hierarchical access control, and a lightweight key management system designed for resource-constrained devices. It incorporates constant-time operations, memory obfuscation, and expiry-based key revocation to counter side-channel, man-in-the-middle, and unauthorized access attacks, thereby ensuring compliance with standards like HIPAA and GDPR. Evaluations show that SLIE significantly outperforms RSA, with encryption and decryption times of 0.936ms and 0.217ms for 1KB of data, an 84.54% improvement in encryption speed, a 99.70% improvement in decryption speed, and an energy efficiency of 0.014 J/KB.

Updated: 2025-10-16 14:10:48

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14708v1

Reliable data clustering with Bayesian community detection

From neuroscience and genomics to systems biology and ecology, researchers rely on clustering similarity data to uncover modular structure. Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise. A common workaround sparsifies a correlation matrix representation to remove noise before clustering, but this extra step introduces arbitrary thresholds that can distort the structure and lead to unreliable results. To detect reliable clusters, we capitalize on recent advances in network science to unite sparsification and clustering with principled model selection. We test two Bayesian community detection methods, the Degree-Corrected Stochastic Block Model and the Regularized Map Equation, both grounded in the Minimum Description Length principle for model selection. In synthetic data, they outperform traditional approaches, detecting planted clusters under high-noise conditions and with fewer samples. Compared to WGCNA on gene co-expression data, the Regularized Map Equation identifies more robust and functionally coherent gene modules. Our results establish Bayesian community detection as a principled and noise-resistant framework for uncovering modular structure in high-dimensional data across fields.

Updated: 2025-10-16 14:10:24

Categories: stat.ML,cs.LG,physics.data-an,stat.ME

Download: http://arxiv.org/abs/2510.15013v1

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.

Updated: 2025-10-16 14:10:10

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2405.09831v9

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

Updated: 2025-10-16 14:10:07

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2510.12787v2

Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper's code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent's effectiveness in creating reliable and capable paper agents through in-depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single-cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper's results and can correctly carry out novel user queries. Paper2Agent automatically created an AI co-scientist that identified a new splicing variant associated with ADHD risk. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co-scientists.

Updated: 2025-10-16 14:09:17

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2509.06917v2

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scaling primarily focuses on unstructured output generation tasks, leaving its application in structured outputs, like function calling, largely underexplored. To bridge this gap, we propose an inference scaling framework that combines fine-grained beam search with a process reward model, ToolPRM, which scores the internal steps of each single function call. To train ToolPRM, we construct the first fine-grained intra-call process supervision dataset, automatically annotated with function-masking techniques to provide step-level rewards for structured tool-use reasoning. Extensive experiments demonstrate that ToolPRM beats the coarse-grained and outcome reward models in terms of predictive accuracy, indicating its stronger capability in supervising the function calling inference process. Inference scaling technique equipped with ToolPRM also significantly improves the backbone model performance across various function calling tasks and benchmarks. More importantly, we reveal a key principle for applying inference scaling techniques to structured outputs: "explore more but retain less" due to the unrecoverability characteristics of structured function calling generation.
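The "explore more but retain less" principle can be sketched as a generic reward-guided beam search. This is an illustrative skeleton, not the ToolPRM implementation; the toy expansion and scoring functions are assumptions standing in for function-call step generation and the process reward model.

```python
def prm_beam_search(start, expand, score, width=2, branch=4, steps=3):
    """Fine-grained beam search guided by a per-step reward model:
    each beam state spawns up to `branch` candidates (explore more),
    but only `width` survive each step (retain less).
    expand(state) -> candidate next states; score(state) -> process reward."""
    beam = [start]
    for _ in range(steps):
        candidates = [nxt for s in beam for nxt in expand(s)[:branch]]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:width]
    return beam

# toy task: build a 3-token sequence maximizing the token sum,
# with sum() standing in for the process reward model's score
expand = lambda seq: [seq + [t] for t in range(4)]
best = prm_beam_search([], expand, score=sum)
```

Because a malformed step in a structured function call cannot be repaired later (the unrecoverability noted above), pruning aggressively at each step with a reliable process reward matters more here than in free-form text generation.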

Updated: 2025-10-16 14:06:03

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14703v1

The Last Dependency Crusade: Solving Python Dependency Conflicts with LLMs

Resolving Python dependency issues remains a tedious and error-prone process, forcing developers to manually trial compatible module versions and interpreter configurations. Existing automated solutions, such as knowledge-graph-based and database-driven methods, face limitations due to the variety of dependency error types, large sets of possible module versions, and conflicts among transitive dependencies. This paper investigates the use of Large Language Models (LLMs) to automatically repair dependency issues in Python programs. We propose PLLM (pronounced "plum"), a novel retrieval-augmented generation (RAG) approach that iteratively infers missing or incorrect dependencies. PLLM builds a test environment where the LLM proposes module combinations, observes execution feedback, and refines its predictions using natural language processing (NLP) to parse error messages. We evaluate PLLM on the Gistable HG2.9K dataset, a curated collection of real-world Python programs. Using this benchmark, we explore multiple PLLM configurations, including six open-source LLMs evaluated both with and without RAG. Our findings show that RAG consistently improves fix rates, with the best performance achieved by Gemma-2 9B when combined with RAG. Compared to two state-of-the-art baselines, PyEGo and ReadPyE, PLLM achieves significantly higher fix rates; +15.97\% more than ReadPyE and +21.58\% more than PyEGo. Further analysis shows that PLLM is especially effective for projects with numerous dependencies and those using specialized numerical or machine-learning libraries.

Updated: 2025-10-16 14:05:53

Domain: cs.SE,cs.AI

Download: http://arxiv.org/abs/2501.16191v2
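The iterative propose-run-refine loop at the core of PLLM might look like the following skeleton. The `propose` and `run_program` callables and the pinned-version dictionary are illustrative assumptions, not PLLM's actual interface.

```python
def repair_loop(propose, run_program, max_iters=10):
    """Iterative RAG-style dependency repair skeleton (assumed structure):
    the LLM proposes pinned module versions, the program is executed in a
    test environment, and the parsed error message is fed back until the
    program runs cleanly or the iteration budget is exhausted."""
    feedback = None
    for _ in range(max_iters):
        pins = propose(feedback)       # e.g. {"numpy": "1.24.4"} from the LLM
        ok, error = run_program(pins)  # build the env, execute, capture stderr
        if ok:
            return pins                # working set of dependency versions
        feedback = error               # parsed error guides the next proposal
    return None                        # unresolved within the budget
```

With stub callables standing in for the LLM and the test environment, the loop retries until a proposed version set executes successfully.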

Cognitive-Aligned Spatio-Temporal Large Language Models For Next Point-of-Interest Prediction

The next point-of-interest (POI) recommendation task aims to predict the users' immediate next destinations based on their preferences and historical check-ins, holding significant value in location-based services. Recently, large language models (LLMs) have shown great potential in recommender systems, which treat the next POI prediction in a generative manner. However, these LLMs, pretrained primarily on vast corpora of unstructured text, lack the native understanding of structured geographical entities and sequential mobility patterns required for next POI prediction tasks. Moreover, in industrial-scale POI prediction applications, incorporating world knowledge and alignment of human cognition, such as seasons, weather conditions, holidays, and users' profiles (such as habits, occupation, and preferences), can enhance the user experience while improving recommendation performance. To address these issues, we propose CoAST (Cognitive-Aligned Spatio-Temporal LLMs), a framework employing natural language as an interface, allowing for the incorporation of world knowledge, spatio-temporal trajectory patterns, profiles, and situational information. Specifically, CoAST comprises two main stages: (1) Recommendation Knowledge Acquisition through continued pretraining on the enriched spatial-temporal trajectory data of the desensitized users; (2) Cognitive Alignment to align cognitive judgments with human preferences using enriched training data through Supervised Fine-Tuning (SFT) and a subsequent Reinforcement Learning (RL) phase. Extensive offline experiments on various real-world datasets and online experiments deployed in the "Guess Where You Go" feature of the AMAP App homepage demonstrate the effectiveness of CoAST.

Updated: 2025-10-16 14:05:28

Domain: cs.AI

Download: http://arxiv.org/abs/2510.14702v1

LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?

Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing. One critical but underexplored application is automated web vulnerability reproduction, which transforms vulnerability reports into working exploits. Although recent advances suggest promising potential, challenges remain in applying LLM agents to real-world web vulnerability reproduction scenarios. In this paper, we present the first comprehensive evaluation of state-of-the-art LLM agents for automated web vulnerability reproduction. We systematically assess 20 agents from software engineering, cybersecurity, and general domains across 16 dimensions, including technical capabilities, environment adaptability, and user experience factors, on 3 representative web vulnerabilities. Based on the results, we select three top-performing agents (OpenHands, SWE-agent, and CAI) for in-depth evaluation on our benchmark dataset of 80 real-world CVEs spanning 7 vulnerability types and 6 web technologies. Our results reveal that while LLM agents achieve reasonable success on simple library-based vulnerabilities, they consistently fail on complex service-based vulnerabilities requiring multi-component environments. Complex environment configurations and authentication barriers create a gap where agents can execute exploit code but fail to trigger actual vulnerabilities. We observe high sensitivity to input guidance, with performance degrading by over 33% under incomplete authentication information. Our findings highlight the significant gap between current LLM agent capabilities and the demands of reliable automated vulnerability reproduction, emphasizing the need for advances in environmental adaptation and autonomous problem-solving capabilities.

Updated: 2025-10-16 14:04:46

Domain: cs.SE,cs.CR

Download: http://arxiv.org/abs/2510.14700v1

Adaptive Budget Allocation for Orthogonal-Subspace Adapter Tuning in LLMs Continual Learning

Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks. Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers. Moreover, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization and budget allocation. Such decoupling results in potential misalignment, which hinders those approaches' practical application in CL scenarios. To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in an end-to-end training stage. Specifically, OA-Adapter introduces a dynamic bottleneck dimension adaptation mechanism that simultaneously allocates an efficient parameter budget and optimizes task objectives without misalignment. To effectively preserve previously acquired knowledge while coordinating with the dynamic budget allocation, orthogonal constraints are applied specifically between the parameter subspace of the current task and the dynamically allocated parameter subspaces of historical tasks. Experimental results on continual learning benchmarks demonstrate that OA-Adapter outperforms state-of-the-art methods in both accuracy and parameter efficiency. OA-Adapter achieves higher average accuracy while using 58.5% fewer parameters on the standard CL benchmark, and maintains its advantages on two larger benchmarks comprising 15 tasks.

Updated: 2025-10-16 14:03:22

Domain: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.22358v2

FedPPA: Progressive Parameter Alignment for Personalized Federated Learning

Federated Learning (FL) is designed as a decentralized, privacy-preserving machine learning paradigm that enables multiple clients to collaboratively train a model without sharing their data. In real-world scenarios, however, clients often have heterogeneous computational resources and hold non-independent and identically distributed data (non-IID), which poses significant challenges during training. Personalized Federated Learning (PFL) has emerged to address these issues by customizing models for each client based on their unique data distribution. Despite its potential, existing PFL approaches typically overlook the coexistence of model and data heterogeneity arising from clients with diverse computational capabilities. To overcome this limitation, we propose a novel method, called Progressive Parameter Alignment (FedPPA), which progressively aligns the weights of common layers across clients with the global model's weights. Our approach not only mitigates inconsistencies between global and local models during client updates, but also preserves client's local knowledge, thereby enhancing personalization robustness in non-IID settings. To further enhance the global model performance while retaining strong personalization, we also integrate entropy-based weighted averaging into the FedPPA framework. Experiments on three image classification datasets, including MNIST, FMNIST, and CIFAR-10, demonstrate that FedPPA consistently outperforms existing FL algorithms, achieving superior performance in personalized adaptation.

Updated: 2025-10-16 14:03:05

Domain: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14698v1
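The abstract does not specify the exact form of FedPPA's entropy-based weighted averaging, so the following is one plausible sketch, assuming the entropy is computed over each client's local label distribution: clients with more uniform (higher-entropy) data contribute more to the global average.

```python
import math

def entropy(dist):
    """Shannon entropy of a client's local label distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def entropy_weighted_average(client_params, client_label_dists):
    """Entropy-based weighted averaging sketch (one plausible form, assumed):
    each client's parameter vector is weighted by the entropy of its local
    label distribution before averaging into the global model."""
    weights = [entropy(d) for d in client_label_dists]
    total = sum(weights)
    dim = len(client_params[0])
    return [sum(w * params[j] for w, params in zip(weights, client_params)) / total
            for j in range(dim)]
```

In this toy form, a client holding a single class (entropy zero) contributes nothing, while a balanced client dominates the aggregate.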

Purifying Task Vectors in Knowledge-Aware Subspace for Model Merging

Model merging aims to integrate task-specific abilities from individually fine-tuned models into a single model without extra training. In recent model merging methods, the task vector has become a fundamental building block, as it can encapsulate the residual information from fine-tuning. However, the merged model often suffers from notable performance degradation due to the conflicts caused by task-irrelevant redundancy in task vectors. Existing efforts that overcome redundancy by randomly dropping elements in the parameter space introduce randomness and lack knowledge awareness. To address these challenges, in this study, we propose Purifying TAsk Vectors (PAVE) in a knowledge-aware subspace. Concretely, we sample some training examples from each task, and feed them into their corresponding fine-tuned models to acquire the covariance matrices before linear layers. We then perform a context-oriented singular value decomposition, which accentuates the weight components most relevant to the target knowledge. As a result, we can split fine-tuned model weights into task-relevant and redundant components in the knowledge-aware subspace, and purify the task vector by pruning the redundant components. To induce fair pruning efforts across models, we further introduce a spectral rank allocation strategy by optimizing a normalized activated pruning error. The task vector purification by our method as a plug-and-play scheme is applicable across various task vector-based merging methods to improve their performance. In experiments, we demonstrate the effectiveness of PAVE across a diverse set of merging methods, tasks, and model architectures.

Updated: 2025-10-16 14:02:57

Domain: cs.AI

Download: http://arxiv.org/abs/2510.14697v1
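A much-simplified sketch of task-vector purification via truncated SVD is shown below. PAVE's context-oriented decomposition additionally weights the SVD with covariance matrices collected from sampled training examples; that context weighting is omitted here for brevity, so this only illustrates the "split into components, prune the rest" idea.

```python
import numpy as np

def purify_task_vector(w_ft, w_base, rank):
    """Simplified purification sketch: SVD the fine-tuning residual
    (the task vector) and keep only its top-`rank` spectral components.
    The context/covariance weighting used by PAVE is not modeled here."""
    tau = w_ft - w_base                              # task vector = fine-tune residual
    U, s, Vt = np.linalg.svd(tau, full_matrices=False)
    tau_pure = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # retained low-rank part
    return w_base + tau_pure
```

When the residual is exactly low-rank, truncation at that rank reconstructs the fine-tuned weights; in practice the discarded tail is treated as task-irrelevant redundancy.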

Role-Aware Multi-modal federated learning system for detecting phishing webpages

We present a federated, multi-modal phishing website detector that supports URL, HTML, and IMAGE inputs without binding clients to a fixed modality at inference: any client can invoke any modality head trained elsewhere. Methodologically, we propose role-aware bucket aggregation on top of FedProx, inspired by Mixture-of-Experts and FedMM. We drop learnable routing and use hard gating (selecting the IMAGE/HTML/URL expert by sample modality), enabling separate aggregation of modality-specific parameters to isolate cross-embedding conflicts and stabilize convergence. On TR-OP, the Fusion head reaches Acc 97.5% with FPR 2.4% across two data types; on the image subset (ablation) it attains Acc 95.5% with FPR 5.9%. For text, we use GraphCodeBERT for URLs and an early three-way embedding for raw, noisy HTML. On WebPhish (HTML) we obtain Acc 96.5% / FPR 1.8%; on TR-OP (raw HTML) we obtain Acc 95.1% / FPR 4.6%. Results indicate that bucket aggregation with hard-gated experts enables stable federated training under strict privacy, while improving the usability and flexibility of multi-modal phishing detection.

Updated: 2025-10-16 14:00:39

Domain: cs.LG,cs.DC

Download: http://arxiv.org/abs/2509.22369v2
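The separate, per-modality aggregation can be sketched as a bucketed FedAvg. The data layout (each client uploading a dict of per-expert parameter vectors keyed by modality) is an assumption for illustration.

```python
def bucket_aggregate(client_updates):
    """Role-aware bucket aggregation sketch (assumed data layout): each
    client uploads only the expert heads it trained, keyed by modality;
    the server averages each bucket separately, so URL/HTML/IMAGE expert
    parameters never mix across modalities."""
    buckets = {}
    for update in client_updates:
        for expert_key, params in update.items():
            buckets.setdefault(expert_key, []).append(params)
    # Plain coordinate-wise average within each modality bucket.
    return {key: [sum(vals) / len(vecs) for vals in zip(*vecs)]
            for key, vecs in buckets.items()}
```

Because gating is hard (the expert is chosen by the sample's modality tag, not a learned router), this separation is what isolates cross-embedding conflicts during federated training.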

Response to Discussions of "Causal and Counterfactual Views of Missing Data Models"

We are grateful to the discussants, Levis and Kennedy [2025], Luo and Geng [2025], Wang and van der Laan [2025], and Yang and Kim [2025], for their thoughtful comments on our paper (Nabi et al., 2025). In this rejoinder, we summarize our main contributions and respond to each discussion in turn.

Updated: 2025-10-16 13:59:09

Domain: stat.ME,cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.14694v1

FibRace: a large-scale benchmark of client-side proving on mobile devices

FibRace, jointly developed by KKRT Labs and Hyli, was the first large-scale experiment to test client-side proof generation on smartphones using Cairo M. Presented as a mobile game in which players proved Fibonacci numbers and climbed a leaderboard, FibRace served a dual purpose: to engage the public and to provide empirical benchmarking. Over a three-week campaign (September 11-30, 2025), 6,047 players across 99 countries generated 2,195,488 proofs on 1,420 unique device models. The results show that most modern smartphones can complete a proof in under 5 seconds, confirming that *mobile devices are now capable of producing zero-knowledge proofs reliably*, without the need for remote provers or specialized hardware. Performance was correlated primarily with RAM capacity and SoC (System on Chip) performance: devices with at least 3 GB of RAM proved stably, while Apple's A19 Pro and M-series chips achieved the fastest proving times. Hyli's blockchain natively verified every proof onchain without congestion. FibRace provides the most comprehensive dataset to date on mobile proving performance, establishing a practical baseline for future research in lightweight provers, proof-powered infrastructure, and privacy-preserving mobile applications.

Updated: 2025-10-16 13:59:00

Domain: cs.CR

Download: http://arxiv.org/abs/2510.14693v1

Online Reliable Anomaly Detection via Neuromorphic Sensing and Communications

This paper proposes a low-power online anomaly detection framework based on neuromorphic wireless sensor networks, encompassing possible use cases such as brain-machine interfaces and remote environmental monitoring. In the considered system, a central reader node actively queries a subset of neuromorphic sensor nodes (neuro-SNs) at each time frame. The neuromorphic sensors are event-driven, producing spikes in correspondence to relevant changes in the monitored system. The queried neuro-SNs respond to the reader with impulse radio (IR) transmissions that directly encode the sensed local events. The reader processes these event-driven signals to determine whether the monitored environment is in a normal or anomalous state, while rigorously controlling the false discovery rate (FDR) of detections below a predefined threshold. The proposed approach employs an online hypothesis testing method with e-values to maintain FDR control without requiring knowledge of the anomaly rate, and it dynamically optimizes the sensor querying strategy by casting it as a best-arm identification problem in a multi-armed bandit framework. Extensive performance evaluation demonstrates that the proposed method can reliably detect anomalies under stringent FDR requirements, while efficiently scheduling sensor communications and achieving low detection latency.

Updated: 2025-10-16 13:56:54

Domain: cs.LG,cs.NE

Download: http://arxiv.org/abs/2510.14688v1
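As a concrete illustration of FDR control with e-values, here is the standard batch e-BH rule; the paper's procedure is an online variant that does not require knowledge of the anomaly rate, so this is a stand-in for intuition, not the authors' algorithm.

```python
def e_bh(e_values, alpha=0.1):
    """e-BH procedure (a standard e-value FDR rule, shown for illustration):
    reject the k* hypotheses with the largest e-values, where k* is the
    largest k such that the k-th largest e-value is at least m / (k * alpha).
    Larger e-values indicate stronger evidence against the null."""
    m = len(e_values)
    order = sorted(range(m), key=lambda i: -e_values[i])  # indices, descending
    k_star = 0
    for k, i in enumerate(order, start=1):
        if e_values[i] >= m / (k * alpha):
            k_star = k
    return sorted(order[:k_star])  # indices of rejected hypotheses
```

With four hypotheses and alpha = 0.2, only e-values clearing the rank-dependent threshold m/(k·alpha) are rejected.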

Offline Reinforcement Learning via Inverse Optimization

Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called "sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance compared with the state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.

Updated: 2025-10-16 13:55:37

Domain: cs.LG,cs.SY,eess.SY,math.OC

Download: http://arxiv.org/abs/2502.20030v2

Subspace-Boosted Model Merging

Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, the benefits of merging an increasing amount of specialized experts generally lead to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective; revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on both vision and language benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to quantify task similarity, offering a new interpretable perspective on model merging.

Updated: 2025-10-16 13:54:16

Domain: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2506.16506v2

xLLM Technical Report

We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To meet these demands, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture to provide global KV Cache management and robust fault-tolerant capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode and an xTensor memory management. xLLM-Engine also further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, collectively serving to substantially boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical TPOT constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.

Updated: 2025-10-16 13:53:47

Domain: cs.DC,cs.AI

Download: http://arxiv.org/abs/2510.14686v1

Practical, Utilitarian Algorithm Configuration

Utilitarian algorithm configuration identifies a parameter setting for a given algorithm that maximizes a user's utility. Utility functions offer a theoretically well-grounded approach to optimizing decision-making under uncertainty and are flexible enough to capture a user's preferences over algorithm runtimes (e.g., they can describe a sharp cutoff after which a solution is no longer required, a per-hour cost for compute, or diminishing returns from algorithms that take longer to run). COUP is a recently-introduced utilitarian algorithm configuration procedure which was designed mainly to offer strong theoretical guarantees about the quality of the configuration it returns, with less attention paid to its practical performance. This paper closes that gap, bringing theoretically-grounded, utilitarian algorithm configuration to the point where it is competitive with widely used, heuristic configuration procedures that offer no performance guarantees. We present a series of improvements to COUP that improve its empirical performance without degrading its theoretical guarantees and demonstrate their benefit experimentally. Using a case study, we also illustrate ways of exploring the robustness of a given solution to the algorithm selection problem to variations in the utility function.

Updated: 2025-10-16 13:47:41

Domain: cs.AI

Download: http://arxiv.org/abs/2510.14683v1

The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models

Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or "awareness" regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.

Updated: 2025-10-16 13:44:28

Domain: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08236v2

ECG-Soup: Harnessing Multi-Layer Synergy for ECG Foundation Models

Transformer-based foundation models for Electrocardiograms (ECGs) have recently achieved impressive performance in many downstream applications.

Updated: 2025-10-16 13:44:20

Domain: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.00102v2

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.

Updated: 2025-10-16 13:40:55

Domain: cs.LG

Download: http://arxiv.org/abs/2508.05629v2
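The single-line change DFT makes to the SFT objective can be illustrated per token. In an autodiff framework the probability weight would be detached so gradients do not flow through it; this scalar sketch shows only how the objective value is rescaled, with the function names being illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dft_token_loss(logits, target):
    """Per-token Dynamic Fine-Tuning objective (sketch): standard
    cross-entropy -log p(target), rescaled by the model's probability of
    the target token (treated as a constant weight during backprop)."""
    p = softmax(logits)[target]
    return -p * math.log(p)  # plain SFT would be just -math.log(p)
```

Relative to plain SFT, low-probability tokens are down-weighted, which is the mechanism the abstract credits with rectifying the implicit reward structure.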

Adaptive Set-Mass Calibration with Conformal Prediction

Reliable probabilities are critical in high-risk applications, yet common calibration criteria (confidence, class-wise) are only necessary conditions for full distributional calibration, and post-hoc methods often lack distribution-free guarantees. We propose a set-based notion of calibration, cumulative mass calibration, and a corresponding empirical error measure: the Cumulative Mass Calibration Error (CMCE). We develop a new calibration procedure that starts with conformal prediction to obtain a set of labels that gives the desired coverage. We then instantiate two simple post-hoc calibrators: a mass normalization and a temperature scaling-based rule, tuned to the conformal constraint. On multi-class image benchmarks, especially with a large number of classes, our methods consistently improve CMCE and standard metrics (ECE, cw-ECE, MCE) over baselines, delivering a practical, scalable framework with theoretical guarantees.

Updated: 2025-10-16 13:39:56

Categories: stat.ML, cs.LG

Download: http://arxiv.org/abs/2505.15437v2

The simulation of judgment in LLMs

Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings--NewsGuard and Media Bias/Fact Check--and against human judgments collected through a controlled experiment. We use news domains purely as a controlled benchmark for evaluative tasks, focusing on the underlying mechanisms rather than on news classification per se. To enable direct comparison, we implement a structured agentic framework in which both models and nonexpert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, our findings show consistent differences in the observable criteria guiding model evaluations, suggesting that lexical associations and statistical priors could influence evaluations in ways that differ from contextual reasoning. This reliance is associated with systematic effects: political asymmetries and a tendency to confuse linguistic form with epistemic reliability--a dynamic we term epistemia, the illusion of knowledge that emerges when surface plausibility replaces verification. Indeed, delegating judgment to such systems may affect the heuristics underlying evaluative processes, suggesting a shift from normative reasoning toward pattern-based approximation and raising open questions about the role of LLMs in evaluative processes.

Updated: 2025-10-16 13:38:08

Categories: cs.CL, cs.AI, cs.CY

Download: http://arxiv.org/abs/2502.04426v3

When Planners Meet Reality: How Learned, Reactive Traffic Agents Shift nuPlan Benchmarks

Planner evaluation in closed-loop simulation often uses rule-based traffic agents, whose simplistic and passive behavior can hide planner deficiencies and bias rankings. Widely used IDM agents simply follow a lead vehicle and cannot react to vehicles in adjacent lanes, hindering tests of complex interaction capabilities. We address this issue by integrating the state-of-the-art learned traffic agent model SMART into nuPlan. Thus, we are the first to evaluate planners under more realistic conditions and quantify how conclusions shift when narrowing the sim-to-real gap. Our analysis covers 14 recent planners and established baselines and shows that IDM-based simulation overestimates planning performance: nearly all scores deteriorate. In contrast, many planners interact better than previously assumed and even improve in multi-lane, interaction-heavy scenarios like lane changes or turns. Methods trained in closed-loop demonstrate the best and most stable driving performance. However, when reaching their limits in augmented edge-case scenarios, all learned planners degrade abruptly, whereas rule-based planners maintain reasonable basic behavior. Based on our results, we suggest SMART-reactive simulation as a new standard closed-loop benchmark in nuPlan and release the SMART agents as a drop-in alternative to IDM at https://github.com/shgd95/InteractiveClosedLoop.

Updated: 2025-10-16 13:34:12

Categories: cs.RO, cs.AI, cs.LG, cs.MA

Download: http://arxiv.org/abs/2510.14677v1

NAEL: Non-Anthropocentric Ethical Logic

We introduce NAEL (Non-Anthropocentric Ethical Logic), a novel ethical framework for artificial agents grounded in active inference and symbolic reasoning. Departing from conventional, human-centred approaches to AI ethics, NAEL formalizes ethical behaviour as an emergent property of intelligent systems minimizing global expected free energy in dynamic, multi-agent environments. We propose a neuro-symbolic architecture to allow agents to evaluate the ethical consequences of their actions in uncertain settings. The proposed system addresses the limitations of existing ethical models by allowing agents to develop context-sensitive, adaptive, and relational ethical behaviour without presupposing anthropomorphic moral intuitions. A case study involving ethical resource distribution illustrates NAEL's dynamic balancing of self-preservation, epistemic learning, and collective welfare.

Updated: 2025-10-16 13:33:10

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14676v1

A Comprehensive Evaluation of Graph Neural Networks and Physics Informed Learning for Surrogate Modelling of Finite Element Analysis

Although Finite Element Analysis (FEA) is an integral part of the product design lifecycle, the analysis is computationally expensive, making it unsuitable for many design optimization problems. Deep learning models can offer a solution; however, selecting an architecture that emulates FEA with high accuracy is a challenge. This paper presents a comprehensive evaluation of graph neural networks (GNNs) and 3D U-Nets as surrogates for FEA of parametric I-beams. We introduce a Physics-Informed Neural Network (PINN) framework, governed by the Navier-Cauchy equations, to enforce physical laws. Crucially, we demonstrate that a curriculum learning strategy, pretraining on data followed by physics-informed fine-tuning, is essential for stabilizing training. Our results show that GNNs fundamentally outperform the U-Net. Even the worst-performing GNN, the GCN framework, achieved a relative L2 error of 8.7%, while the best U-Net variant, a U-Net with an attention mechanism trained on high-resolution data, achieved 13.0%. Among the graph-based architectures, the Message Passing Neural Network (MPNN) and the Graph Transformer achieved the highest accuracy, with relative L2 errors of 3.5% and 2.6%, respectively. The inclusion of fundamental physical laws (PINN) significantly improved generalization, reducing error by up to 11.3% on high-signal tasks. While the Graph Transformer is the most accurate model, it is 37.5% slower during inference than the second-best model, MPNN PINN. The PINN-enhanced MPNN (MPNN PINN) provides the most practical solution, offering a good compromise between predictive performance, model size, and inference speed.
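The curriculum strategy above (data-driven pretraining followed by physics-informed fine-tuning) can be sketched as a simple loss schedule. The schedule shape and parameter names below are assumptions for illustration, not the paper's exact recipe.

```python
def curriculum_weight(step, pretrain_steps, ramp_steps):
    """Curriculum schedule sketch: pure data-driven pretraining first,
    then a linear ramp-in of the physics (PINN) residual term."""
    if step < pretrain_steps:
        return 0.0
    return min(1.0, (step - pretrain_steps) / ramp_steps)

def total_loss(data_loss, physics_loss, step,
               pretrain_steps=1000, ramp_steps=500):
    # Combined objective: data fit plus curriculum-weighted physics residual.
    lam = curriculum_weight(step, pretrain_steps, ramp_steps)
    return data_loss + lam * physics_loss
```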

Updated: 2025-10-16 13:32:53

Categories: cs.LG

Download: http://arxiv.org/abs/2510.15750v1

AEX-NStep: Probabilistic Interrupt Counting Attacks on Intel SGX

To mitigate interrupt-based stepping attacks (notably using SGX-Step), Intel introduced AEX-Notify, an ISA extension to Intel SGX that aims to prevent deterministic single-stepping. In this work, we introduce AEX-NStep, the first interrupt counting attack on AEX-Notify-enabled enclaves. We show that deterministic single-stepping is not required for interrupt counting attacks to be practical and that, therefore, AEX-Notify does not entirely prevent such attacks. We specifically show that one of AEX-Notify's security guarantees, obfuscated forward progress, does not hold, and we introduce two new probabilistic interrupt counting attacks. We use these attacks to construct a practical ECDSA key leakage attack on an AEX-Notify-enabled SGX enclave. Our results extend the original security analysis of AEX-Notify and inform the design of future mitigations.

Updated: 2025-10-16 13:31:04

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14675v1

Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification

Lightweight convolutional and transformer-based networks are increasingly preferred for real-time image classification, especially on resource-constrained devices. This study evaluates the impact of hyperparameter optimization on the accuracy and deployment feasibility of seven modern lightweight architectures: ConvNeXt-T, EfficientNetV2-S, MobileNetV3-L, MobileViT v2 (S/XS), RepVGG-A2, and TinyViT-21M, trained on a class-balanced subset of 90,000 images from ImageNet-1K. Under standardized training settings, this paper investigates the influence of learning rate schedules, augmentation, optimizers, and initialization on model performance. Inference benchmarks are performed using an NVIDIA L40s GPU with batch sizes ranging from 1 to 512, capturing latency and throughput in real-time conditions. This work demonstrates that controlled hyperparameter variation significantly alters convergence dynamics in lightweight CNN and transformer backbones, providing insight into stability regions and deployment feasibility in edge artificial intelligence. Our results reveal that tuning alone leads to a top-1 accuracy improvement of 1.5 to 3.5 percent over baselines, and select models (e.g., RepVGG-A2, MobileNetV3-L) deliver latency under 5 milliseconds and over 9,800 frames per second, making them ideal for edge deployment. This work provides reproducible, subset-based insights into lightweight hyperparameter tuning and its role in balancing speed and accuracy. The code and logs may be seen at: https://vineetkumarrakesh.github.io/lcnn-opt

Updated: 2025-10-16 13:29:58

Categories: cs.CV, cs.AI, cs.LG

Download: http://arxiv.org/abs/2507.23315v2

TITAN: Graph-Executable Reasoning for Cyber Threat Intelligence

TITAN (Threat Intelligence Through Automated Navigation) is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph. It integrates a path planner model, which predicts logical relation chains from text, and a graph executor that traverses the TITAN Ontology to retrieve factual answers and supporting evidence. Unlike traditional retrieval systems, TITAN operates on a typed, bidirectional graph derived from MITRE, allowing reasoning to move clearly and reversibly between threats, behaviors, and defenses. To support training and evaluation, we introduce the TITAN Dataset, a corpus of 88,209 examples (train: 74,258; test: 13,951) pairing natural-language questions with executable reasoning paths and step-by-step Chain-of-Thought explanations. Empirical evaluations show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph.
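A minimal sketch of graph-executable reasoning in this style: a typed, bidirectional graph built from triples, and an executor that deterministically follows a predicted relation chain. The triples and relation names below are invented, MITRE-flavored stand-ins, not the actual TITAN Ontology.

```python
# Toy typed knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("T1059", "uses-technique", "PowerShell"),
    ("PowerShell", "mitigated-by", "Execution Prevention"),
]

def build_graph(triples):
    # Store each edge in both directions so reasoning is reversible.
    g = {}
    for h, r, t in triples:
        g.setdefault(h, []).append((r, t))
        g.setdefault(t, []).append(("inverse:" + r, h))
    return g

def execute_path(graph, start, relations):
    """Deterministically execute a predicted relation chain, collecting
    every entity reachable by following the relations in order."""
    frontier = {start}
    for rel in relations:
        frontier = {t for node in frontier
                    for r, t in graph.get(node, []) if r == rel}
    return frontier
```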

Updated: 2025-10-16 13:27:05

Categories: cs.AI, cs.CL, cs.CR, cs.IR

Download: http://arxiv.org/abs/2510.14670v1

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Based on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.
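The head-wise idea can be caricatured in a few lines: split a total cache budget across heads according to how dispersed each head's attention is. The dispersion proxy below (one minus the maximum attention weight) is a simplification for illustration, not Ada-KV's actual allocation criterion.

```python
def adaptive_budget_allocation(head_scores, total_budget):
    """Toy head-wise budget split: heads whose attention mass is spread
    over more positions receive a larger share of the cache budget.

    head_scores: one list of attention weights per head (each sums to ~1).
    """
    # Flat heads (low max weight) need more cache entries kept.
    dispersion = [1.0 - max(w) for w in head_scores]
    z = sum(dispersion) or 1.0
    budgets = [round(total_budget * d / z) for d in dispersion]
    # Fix rounding drift so the shares sum to the total budget.
    budgets[0] += total_budget - sum(budgets)
    return budgets
```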

Updated: 2025-10-16 13:25:38

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2407.11550v5

Machine Learning and Public Health: Identifying and Mitigating Algorithmic Bias through a Systematic Review

Machine learning (ML) promises to revolutionize public health through improved surveillance, risk stratification, and resource allocation. However, without systematic attention to algorithmic bias, ML may inadvertently reinforce existing health disparities. We present a systematic literature review of algorithmic bias identification, discussion, and reporting in Dutch public health ML research from 2021 to 2025. To this end, we developed the Risk of Algorithmic Bias Assessment Tool (RABAT) by integrating elements from established frameworks (Cochrane Risk of Bias, PROBAST, Microsoft Responsible AI checklist) and applied it to 35 peer-reviewed studies. Our analysis reveals pervasive gaps: although data sampling and missing data practices are well documented, most studies omit explicit fairness framing, subgroup analyses, and transparent discussion of potential harms. In response, we introduce a four-stage fairness-oriented framework called ACAR (Awareness, Conceptualization, Application, Reporting), with guiding questions derived from our systematic literature review to help researchers address fairness across the ML lifecycle. We conclude with actionable recommendations for public health ML practitioners to consistently consider algorithmic bias and foster transparency, ensuring that algorithmic innovations advance health equity rather than undermine it.

Updated: 2025-10-16 13:24:11

Categories: cs.AI, 68T01, 68T09, 62P10, I.2.6; I.5.4; H.2.8; J.3; K.4.1; K.4.2

Download: http://arxiv.org/abs/2510.14669v1

Geometric Moment Alignment for Domain Adaptation via Siegel Embeddings

We address the problem of distribution shift in unsupervised domain adaptation with a moment-matching approach. Existing methods typically align low-order statistical moments of the source and target distributions in an embedding space using ad-hoc similarity measures. We propose a principled alternative that instead leverages the intrinsic geometry of these distributions by adopting a Riemannian distance for this alignment. Our key novelty lies in expressing the first- and second-order moments as a single symmetric positive definite (SPD) matrix through Siegel embeddings. This enables simultaneous adaptation of both moments using the natural geometric distance on the shared manifold of SPD matrices, preserving the mean and covariance structure of the source and target distributions and yielding a more faithful metric for cross-domain comparison. We connect the Riemannian manifold distance to the target-domain error bound, and validate the method on image denoising and image classification benchmarks. Our code is publicly available at https://github.com/shayangharib/GeoAdapt.
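In one dimension the construction can be made concrete: embed a Gaussian's first two moments (mean, variance) into a 2x2 SPD matrix and compare distributions with the affine-invariant Riemannian distance. The embedding below follows the standard Calvo-Oller form; whether it matches the paper's exact Siegel embedding is an assumption.

```python
import math

def siegel_embed(mean, var):
    # Calvo-Oller-style embedding of a 1-D Gaussian (mean, variance)
    # into a 2x2 symmetric positive definite matrix.
    return [[var + mean * mean, mean], [mean, 1.0]]

def spd_distance_2x2(a, b):
    # Affine-invariant Riemannian distance between 2x2 SPD matrices:
    # sqrt(sum_i log(lambda_i)^2) over eigenvalues of A^{-1} B.
    det_a = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    inv_a = [[a[1][1] / det_a, -a[0][1] / det_a],
             [-a[1][0] / det_a, a[0][0] / det_a]]
    m = [[inv_a[0][0] * b[0][0] + inv_a[0][1] * b[1][0],
          inv_a[0][0] * b[0][1] + inv_a[0][1] * b[1][1]],
         [inv_a[1][0] * b[0][0] + inv_a[1][1] * b[1][0],
          inv_a[1][0] * b[0][1] + inv_a[1][1] * b[1][1]]]
    # Eigenvalues of the 2x2 matrix via its characteristic polynomial.
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lam1, lam2 = (tr + disc) / 2.0, (tr - disc) / 2.0
    return math.sqrt(math.log(lam1) ** 2 + math.log(lam2) ** 2)
```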

Updated: 2025-10-16 13:20:51

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14666v1

Beyond Hallucinations: The Illusion of Understanding in Large Language Models

Large language models (LLMs) are becoming deeply embedded in human communication and decision-making, yet they inherit the ambiguity, bias, and lack of direct access to truth inherent in language itself. While their outputs are fluent, emotionally resonant, and coherent, they are generated through statistical prediction rather than grounded reasoning. This creates the risk of hallucination, responses that sound convincing but lack factual validity. Building on Geoffrey Hinton's observation that AI mirrors human intuition rather than reasoning, this paper argues that LLMs operationalize System 1 cognition at scale: fast, associative, and persuasive, but without reflection or falsification. To address this, we introduce the Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction. The three axes are: (i) Map vs. Territory, which distinguishes representations of reality (epistemology) from reality itself (ontology); (ii) Intuition vs. Reason, drawing on dual-process theory to separate fast, emotional judgments from slow, reflective thinking; and (iii) Conflict vs. Confirmation, which examines whether ideas are critically tested through disagreement or simply reinforced through mutual validation. Each dimension captures a distinct failure mode, and their combination amplifies misalignment. Rose-Frame does not attempt to fix LLMs with more data or rules. Instead, it offers a reflective tool that makes both the model's limitations and the user's assumptions visible, enabling more transparent and critically aware AI deployment. It reframes alignment as cognitive governance: intuition, whether human or artificial, must remain governed by human reason. Only by embedding reflective, falsifiable oversight can we align machine fluency with human understanding.

Updated: 2025-10-16 13:19:44

Categories: cs.AI, cs.HC

Download: http://arxiv.org/abs/2510.14665v1

Provable Mixed-Noise Learning with Flow-Matching

We study Bayesian inverse problems with mixed noise, modeled as a combination of additive and multiplicative Gaussian components. While traditional inference methods often assume fixed or known noise characteristics, real-world applications, particularly in physics and chemistry, frequently involve noise with unknown and heterogeneous structure. Motivated by recent advances in flow-based generative modeling, we propose a novel inference framework based on conditional flow matching embedded within an Expectation-Maximization (EM) algorithm to jointly estimate posterior samplers and noise parameters. To enable high-dimensional inference and improve scalability, we use simulation-free ODE-based flow matching as the generative model in the E-step of the EM algorithm. We prove that, under suitable assumptions, the EM updates converge to the true noise parameters in the population limit of infinite observations. Our numerical results illustrate the effectiveness of combining EM inference with flow matching for mixed-noise Bayesian inverse problems.

Updated: 2025-10-16 13:17:05

Categories: cs.LG, math.OC

Download: http://arxiv.org/abs/2508.18122v2

An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question's information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce Search-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.

Updated: 2025-10-16 13:15:40

Categories: cs.CL, cs.AI, cs.IR

Download: http://arxiv.org/abs/2510.14660v1

From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons

We revisit the Universal Approximation Theorem (UAT) through the lens of the tropical geometry of neural networks and introduce a constructive, geometry-aware initialization for sigmoidal multi-layer perceptrons (MLPs). Tropical geometry shows that Rectified Linear Unit (ReLU) networks admit decision functions with a combinatorial structure often described as a tropical rational, namely a difference of tropical polynomials. Focusing on planar binary classification, we design purely sigmoidal MLPs that adhere to the finite-sum format of the UAT: a finite linear combination of shifted and scaled sigmoids of affine functions. The resulting models yield decision boundaries that already align with prescribed shapes at initialization and can be refined by standard training if desired. This provides a practical bridge between the tropical perspective and smooth MLPs, enabling interpretable, shape-driven initialization without resorting to ReLU architectures. We focus on the construction and empirical demonstrations in two dimensions; theoretical analysis and higher-dimensional extensions are left for future work.
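A two-dimensional toy of the finite-sum UAT format: a decision function built purely from steep sigmoids of affine functions, initialized so the boundary already matches a prescribed shape. The shape here (the vertical band 0 < x0 < 1) and the steepness constant are illustrative choices, not the paper's construction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def uat_sum(x, terms):
    # Finite-sum UAT format: sum_i a_i * sigmoid(w_i . x + b_i).
    return sum(a * sigmoid(w1 * x[0] + w2 * x[1] + b)
               for (a, w1, w2, b) in terms)

# Geometry-aware initialization sketch: the band 0 < x0 < 1 as the
# positive region, built from two steep sigmoids along x0.
k = 50.0  # steepness; larger k gives a sharper boundary
band = [(1.0, k, 0.0, 0.0),    # ~1 once x0 > 0
        (-1.0, k, 0.0, -k)]    # subtract ~1 once x0 > 1
decision = lambda x: uat_sum(x, band) > 0.5
```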

Updated: 2025-10-16 13:15:39

Categories: stat.ML, cs.AI, cs.LG

Download: http://arxiv.org/abs/2510.15012v1

Decorrelation Speeds Up Vision Transformers

Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
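The core mechanism, iteratively reducing input correlations at each layer, can be illustrated with a decorrelation transform updated over a toy 2-D batch. The exact DBP update rule differs in detail, so treat this as an assumption-laden sketch of the idea rather than the paper's algorithm.

```python
def covariance(xs):
    # Sample covariance (1/n) of a batch of d-dimensional points.
    n, d = len(xs), len(xs[0])
    mean = [sum(x[i] for x in xs) / n for i in range(d)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in xs) / n
             for j in range(d)] for i in range(d)]

def decorrelation_step(xs, R, lr=0.1):
    # One iterative update nudging R so that z = R x has smaller
    # off-diagonal covariance (the decorrelation idea behind DBP).
    d = len(R)
    zs = [[sum(R[i][k] * x[k] for k in range(d)) for i in range(d)]
          for x in xs]
    C = covariance(zs)
    off = [[0.0 if i == j else C[i][j] for j in range(d)] for i in range(d)]
    return [[R[i][j] - lr * sum(off[i][k] * R[k][j] for k in range(d))
             for j in range(d)] for i in range(d)]

# Correlated toy batch: repeated steps shrink the cross-correlation.
batch = [[1.0, 0.9], [-1.0, -0.9], [0.5, 0.4], [-0.5, -0.4]]
R = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(15):
    R = decorrelation_step(batch, R)
```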

Updated: 2025-10-16 13:13:12

Categories: cs.CV, cs.LG

Download: http://arxiv.org/abs/2510.14657v1

Parameter Identification for Partial Differential Equation with Jump Discontinuities in Coefficients by Markov Switching Model and Physics-Informed Machine Learning

Inverse problems involving partial differential equations (PDEs) with discontinuous coefficients are fundamental challenges in modeling complex spatiotemporal systems with heterogeneous structures and uncertain dynamics. Traditional numerical and machine learning approaches often face limitations in addressing these problems due to high dimensionality, inherent nonlinearity, and discontinuous parameter spaces. In this work, we propose a novel computational framework that synergistically integrates physics-informed deep learning with Bayesian inference for accurate parameter identification in PDEs with jump discontinuities in coefficients. The core innovation of our framework lies in a dual-network architecture employing a gradient-adaptive weighting strategy: a main network approximates PDE solutions while a sub network samples its coefficients. To effectively identify mixture structures in parameter spaces, we employ Markovian dynamics methods to capture hidden state transitions of complex spatiotemporal systems. The framework has applications in reconstruction of solutions and identification of parameter-varying regions. Comprehensive numerical experiments on various PDEs with jump-varying coefficients demonstrate the framework's exceptional adaptability, accuracy, and robustness compared to existing methods. This study provides a generalizable computational approach of parameter identification for PDEs with discontinuous parameter structures, particularly in non-stationary or heterogeneous systems.

Updated: 2025-10-16 13:12:26

Categories: stat.ML, cs.LG

Download: http://arxiv.org/abs/2510.14656v1

Galaxy Morphology Classification with Counterfactual Explanation

Galaxy morphologies play an essential role in the study of the evolution of galaxies. The determination of morphologies is laborious for a large amount of data giving rise to machine learning-based approaches. Unfortunately, most of these approaches offer no insight into how the model works and make the results difficult to understand and explain. We here propose to extend a classical encoder-decoder architecture with invertible flow, allowing us to not only obtain a good predictive performance but also provide additional information about the decision process with counterfactual explanations.

Updated: 2025-10-16 13:11:56

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14655v1

O-Forge: An LLM + Computer Algebra Framework for Asymptotic Analysis

Large language models have recently demonstrated advanced capabilities in solving IMO and Putnam problems; yet their role in research mathematics has remained fairly limited. The key difficulty is verification: suggested proofs may look plausible, but cannot be trusted without rigorous checking. We present a framework, called LLM+CAS, and an associated tool, O-Forge, that couples frontier LLMs with a computer algebra system (CAS) in an In-Context Symbolic Feedback loop to produce proofs that are both creative and symbolically verified. Our focus is on asymptotic inequalities, a topic that often involves difficult proofs and appropriate decomposition of the domain into the "right" subdomains. Many mathematicians, including Terry Tao, have suggested that using AI tools to find the right decompositions can be very useful for research-level asymptotic analysis. In this paper, we show that our framework LLM+CAS turns out to be remarkably effective at proposing such decompositions via a combination of a frontier LLM and a CAS. More precisely, we use an LLM to suggest domain decompositions, and a CAS (such as Mathematica) to verify each piece axiomatically. Using this loop, we answer a question posed by Terence Tao: whether LLMs coupled with a verifier can be used to help prove intricate asymptotic inequalities. More broadly, we show how AI can move beyond contest math towards research-level tools for professional mathematicians.

Updated: 2025-10-16 13:07:41

Categories: cs.AI,03B35, 68W30, 68T05

Download: http://arxiv.org/abs/2510.12350v2

In-Context Learning with Unpaired Clips for Instruction-based Video Editing

Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend to more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.

Updated: 2025-10-16 13:02:11

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14648v1

Online Continual Learning via Spiking Neural Networks with Sleep Enhanced Latent Replay

Edge computing scenarios necessitate the development of hardware-efficient online continual learning algorithms to be adaptive to dynamic environment. However, existing algorithms always suffer from high memory overhead and bias towards recently trained tasks. To tackle these issues, this paper proposes a novel online continual learning approach termed as SESLR, which incorporates a sleep enhanced latent replay scheme with spiking neural networks (SNNs). SESLR leverages SNNs' binary spike characteristics to store replay features in single bits, significantly reducing memory overhead. Furthermore, inspired by biological sleep-wake cycles, SESLR introduces a noise-enhanced sleep phase where the model exclusively trains on replay samples with controlled noise injection, effectively mitigating classification bias towards new classes. Extensive experiments on both conventional (MNIST, CIFAR10) and neuromorphic (NMNIST, CIFAR10-DVS) datasets demonstrate SESLR's effectiveness. On Split CIFAR10, SESLR achieves nearly 30% improvement in average accuracy with only one-third of the memory consumption compared to baseline methods. On Split CIFAR10-DVS, it improves accuracy by approximately 10% while reducing memory overhead by a factor of 32. These results validate SESLR as a promising solution for online continual learning in resource-constrained edge computing scenarios.
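
(An illustrative sketch of ours, not the SESLR implementation.) The single-bit replay storage described above works like ordinary bit packing: eight binary spike values fit in one byte, versus four bytes each for float32 activations, which is the source of the roughly 32x memory reduction the abstract reports.

```python
def pack_spikes(spikes):
    """Pack a sequence of 0/1 spike values into a bytearray, 8 spikes per byte."""
    packed = bytearray((len(spikes) + 7) // 8)
    for i, s in enumerate(spikes):
        if s:
            packed[i // 8] |= 1 << (i % 8)
    return packed

def unpack_spikes(packed, n):
    """Recover the first n spike values from a packed bytearray."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n)]

features = [1, 0, 1, 1, 0, 0, 1, 0] * 16   # a 128-d binary spike feature
packed = pack_spikes(features)
assert unpack_spikes(packed, len(features)) == features
assert len(packed) == 16                    # 16 bytes vs 512 bytes as float32 (32x)
```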

Updated: 2025-10-16 12:58:52

Categories: cs.NE,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.02901v3

The Bidding Games: Reinforcement Learning for MEV Extraction on Polygon Blockchain

In blockchain networks, the strategic ordering of transactions within blocks has emerged as a significant source of profit extraction, known as Maximal Extractable Value (MEV). The transition from spam-based Priority Gas Auctions to structured auction mechanisms like Polygon Atlas has transformed MEV extraction from public bidding wars into sealed-bid competitions under extreme time constraints. While this shift reduces network congestion, it introduces complex strategic challenges where searchers must make optimal bidding decisions within a sub-second window without knowledge of competitor behavior or presence. Traditional game-theoretic approaches struggle in this high-frequency, partially observable environment due to their reliance on complete information and static equilibrium assumptions. We present a reinforcement learning framework for MEV extraction on Polygon Atlas and make three contributions: (1) A novel simulation environment that accurately models the stochastic arrival of arbitrage opportunities and probabilistic competition in Atlas auctions; (2) A PPO-based bidding agent optimized for real-time constraints, capable of adaptive strategy formulation in continuous action spaces while maintaining production-ready inference speeds; (3) Empirical validation demonstrating our history-conditioned agent captures 49% of available profits when deployed alongside existing searchers and 81% when replacing the market leader, significantly outperforming static bidding strategies. Our work establishes that reinforcement learning provides a critical advantage in high-frequency MEV environments where traditional optimization methods fail, offering immediate value for industrial participants and protocol designers alike.
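
(A toy model of ours, not the paper's simulation environment.) The sealed-bid trade-off can be made concrete: if a searcher values an opportunity at v and a single rival's bid is assumed uniform on [0, 1], expected profit is P(win | b) * (v - b), so the bid must balance win probability against margin.

```python
def expected_profit(bid, value):
    """Toy sealed-bid model: one rival whose bid is Uniform(0, 1), so P(win) = bid."""
    p_win = min(max(bid, 0.0), 1.0)
    return p_win * (value - bid)

def best_bid(value, steps=10001):
    """Grid-search the bid maximizing expected profit on [0, 1]."""
    grid = [i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda b: expected_profit(b, value))

# For an opportunity value v in (0, 1], the analytic optimum is b = v / 2.
assert abs(best_bid(0.8) - 0.4) < 1e-3
```

A learned policy such as the paper's PPO agent must infer this kind of mapping online, without knowing the rival's bid distribution or even whether a rival is present.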

Updated: 2025-10-16 12:54:53

Categories: cs.GT,cs.AI,cs.DC

Download: http://arxiv.org/abs/2510.14642v1

Causality Enhancement for Cross-Domain Recommendation

Cross-domain recommendation forms a crucial component in recommendation systems. It leverages auxiliary information through source domain tasks or features to enhance target domain recommendations. However, incorporating inconsistent source domain tasks may result in insufficient cross-domain modeling or negative transfer. Meanwhile, incorporating source domain features without considering the underlying causal relationships may limit their contribution to final predictions. Thus, a natural idea is to directly train a cross-domain representation on a causality-labeled dataset from the source to target domain. Yet this direction has been rarely explored, as identifying unbiased real causal labels is highly challenging in real-world scenarios. In this work, we attempt to take a first step in this direction by proposing a causality-enhanced framework, named CE-CDR. Specifically, we first reformulate the cross-domain recommendation as a causal graph for principled guidance. We then construct a causality-aware dataset heuristically. Subsequently, we derive a theoretically unbiased Partial Label Causal Loss to generalize beyond the biased causality-aware dataset to unseen cross-domain patterns, yielding an enriched cross-domain representation, which is then fed into the target model to enhance target-domain recommendations. Theoretical and empirical analyses, as well as extensive experiments, demonstrate the rationality and effectiveness of CE-CDR and its general applicability as a model-agnostic plugin. Moreover, it has been deployed in production since April 2025, showing its practical value in real-world applications.

Updated: 2025-10-16 12:54:46

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2510.14641v1

Improving Cybercrime Detection and Digital Forensics Investigations with Artificial Intelligence

According to a recent EUROPOL report, cybercrime is still recurrent in Europe, and different activities and countermeasures must be taken to limit, prevent, detect, analyze, and fight it. Cybercrime must be prevented with specific measures, tools, and techniques, for example through automated network and malware analysis. Countermeasures against cybercrime can also be improved with proper digital forensics analysis in order to extract data from digital devices, trying to retrieve information on the cybercriminals. Indeed, results obtained through a proper digital forensics analysis can be leveraged to train cybercrime detection systems to prevent the success of similar crimes. Nowadays, some systems have started to adopt Artificial Intelligence (AI) algorithms for cyberattack detection and digital forensics analysis improvement. However, AI can be better applied as an additional instrument in these systems to improve both detection and digital forensics analysis. For this reason, we highlight how cybercrime analysis and digital forensics procedures can take advantage of AI. On the other hand, cybercriminals can use these systems to improve their skills, bypass automatic detection, and develop advanced attack techniques. The case study we present highlights how the three popular chatbots Gemini, Copilot, and ChatGPT can be used together to develop Python code that encodes and decodes images with a steganographic technique; such code is not in itself an indicator of crime, attack, or maliciousness, but can be used by a cybercriminal as an anti-forensics technique.
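
(A hedged illustration of ours, not the chatbots' output from the case study.) Least-significant-bit steganography of the kind discussed above hides one message bit in the low bit of each pixel value; as the abstract notes, such code is dual-use and proves nothing about intent on its own.

```python
def embed(pixels, message):
    """Hide message bytes in the least-significant bits of pixel values (1 bit per pixel)."""
    bits = [(byte >> i) & 1 for byte in message for i in range(8)]
    assert len(bits) <= len(pixels), "cover image too small"
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit   # overwrite only the lowest bit
    return stego

def extract(pixels, n_bytes):
    """Recover n_bytes hidden bytes from the least-significant bits."""
    out = bytearray()
    for j in range(n_bytes):
        out.append(sum((pixels[j * 8 + i] & 1) << i for i in range(8)))
    return bytes(out)

cover = [120, 133, 97, 54] * 20   # stand-in for grayscale pixel values
stego = embed(cover, b"hi")
assert extract(stego, 2) == b"hi"
assert max(abs(a - b) for a, b in zip(cover, stego)) <= 1   # visually negligible change
```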

Updated: 2025-10-16 12:53:36

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14638v1

EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model's ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike the traditional encoder-decoder framework, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable, as demonstrated through sequential attention maps illustrating its refined, layer-by-layer focus on key image features. Experiments on ImageNet-1k and ImageNet-21k, along with transfer learning tasks, show that EDIT achieves consistent performance improvements over DeiT3 models. These results highlight the effectiveness of EDIT's design in addressing attention sink and improving visual feature extraction.

Updated: 2025-10-16 12:43:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.06738v2

RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the LibriSpeech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.

Updated: 2025-10-16 12:40:37

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14628v1

Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

Updated: 2025-10-16 12:37:53

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.21278v2

Exploring the Noise Robustness of Online Conformal Prediction

Conformal prediction is an emerging technique for uncertainty quantification that constructs prediction sets guaranteed to contain the true label with a predefined probability. Recent work develops online conformal prediction methods that adaptively construct prediction sets to accommodate distribution shifts. However, existing algorithms typically assume perfect label accuracy which rarely holds in practice. In this work, we investigate the robustness of online conformal prediction under uniform label noise with a known noise rate, in both constant and dynamic learning rate schedules. We show that label noise causes a persistent gap between the actual miscoverage rate and the desired rate α, leading to either overestimated or underestimated coverage guarantees. To address this issue, we propose Noise Robust Online Conformal Prediction (dubbed NR-OCP) by updating the threshold with a novel robust pinball loss, which provides an unbiased estimate of clean pinball loss without requiring ground-truth labels. Our theoretical analysis shows that NR-OCP eliminates the coverage gap in both constant and dynamic learning rate schedules, achieving a convergence rate of O(T^{-1/2}) for both empirical and expected coverage errors under uniform label noise. Extensive experiments demonstrate the effectiveness of our method by achieving both precise coverage and improved efficiency.
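
(For context: a sketch of the standard online conformal update the paper builds on, not of NR-OCP's robust pinball loss.) The classic adaptive step raises the threshold after a miscoverage event and lowers it otherwise, so the empirical miscoverage rate tracks the target level alpha; the paper's observation is that noisy labels bias exactly this feedback signal.

```python
import random

def online_conformal(scores, alpha=0.1, lr=0.05, theta=0.0):
    """Standard online conformal threshold update (adaptive conformal inference style):
    theta <- theta + lr * (miss - alpha), where miss = 1 when the true label's
    nonconformity score exceeds the current threshold (a miscoverage event)."""
    misses = 0
    for score in scores:
        miss = 1 if score > theta else 0
        misses += miss
        theta += lr * (miss - alpha)   # widen the set after a miss, shrink it otherwise
    return theta, misses / len(scores)

random.seed(0)
scores = [random.random() for _ in range(20000)]  # clean-label nonconformity scores
theta, miss_rate = online_conformal(scores, alpha=0.1)
assert abs(miss_rate - 0.1) < 0.02   # empirical miscoverage tracks alpha
```

Under label noise the observed miss indicator is biased, so this empirical rate drifts away from alpha; NR-OCP replaces the loss inside this update with a debiased (robust pinball) surrogate built from the known noise rate.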

Updated: 2025-10-16 12:37:37

Categories: cs.LG

Download: http://arxiv.org/abs/2501.18363v3

GemiRec: Interest Quantization and Generation for Multi-Interest Recommendation

Multi-interest recommendation has gained attention, especially in industrial retrieval stage. Unlike classical dual-tower methods, it generates multiple user representations instead of a single one to model comprehensive user interests. However, prior studies have identified two underlying limitations: the first is interest collapse, where multiple representations homogenize. The second is insufficient modeling of interest evolution, as they struggle to capture latent interests absent from a user's historical behavior. We begin with a thorough review of existing works in tackling these limitations. Then, we attempt to tackle these limitations from a new perspective. Specifically, we propose a framework-level refinement for multi-interest recommendation, named GemiRec. The proposed framework leverages interest quantization to enforce a structural interest separation and interest generation to learn the evolving dynamics of user interests explicitly. It comprises three modules: (a) Interest Dictionary Maintenance Module (IDMM) maintains a shared quantized interest dictionary. (b) Multi-Interest Posterior Distribution Module (MIPDM) employs a generative model to capture the distribution of user future interests. (c) Multi-Interest Retrieval Module (MIRM) retrieves items using multiple user-interest representations. Both theoretical and empirical analyses, as well as extensive experiments, demonstrate its advantages and effectiveness. Moreover, it has been deployed in production since March 2025, showing its practical value in industrial applications.
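
(A minimal toy of ours with made-up codewords, not the IDMM implementation.) The interest-quantization idea above can be sketched as snapping each continuous interest vector to its nearest entry in a shared dictionary, so distinct interests are forced onto distinct discrete codes instead of homogenizing.

```python
def quantize(vec, dictionary):
    """Index of the nearest dictionary codeword (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(dictionary)), key=lambda i: dist2(vec, dictionary[i]))

# A tiny shared interest dictionary (hypothetical codewords).
dictionary = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

# Two different user interests land on two different codes rather than one averaged vector.
assert quantize((0.9, 0.2), dictionary) == 0
assert quantize((0.1, 0.8), dictionary) == 1
```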

Updated: 2025-10-16 12:37:15

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2510.14626v1

Partially stochastic deep learning with uncertainty quantification for model predictive heating control

Making the control of building heating systems more energy efficient is crucial for reducing global energy consumption and greenhouse gas emissions. Traditional rule-based control methods use a static, outdoor temperature-dependent heating curve to regulate heat input. This open-loop approach fails to account for both the current state of the system (indoor temperature) and free heat gains, such as solar radiation, often resulting in poor thermal comfort and overheating. Model Predictive Control (MPC) addresses these drawbacks by using predictive modeling to optimize heating based on a building's learned thermal behavior, current system state, and weather forecasts. However, current industrial MPC solutions often employ simplified physics-inspired indoor temperature models, sacrificing accuracy for robustness and interpretability. While purely data-driven models offer superior predictive performance and therefore more accurate control, they face challenges such as a lack of transparency. To bridge this gap, we propose a partially stochastic deep learning (DL) architecture, dubbed LSTM+BNN, for building-specific indoor temperature modeling. Unlike most studies that evaluate model performance through simulations or limited test buildings, our experiments across a comprehensive dataset of 100 real-world buildings, under various weather conditions, demonstrate that LSTM+BNN outperforms an industry-proven reference model, reducing the average prediction error measured as RMSE by more than 40% for the 48-hour prediction horizon of interest. Unlike deterministic DL approaches, LSTM+BNN offers a critical advantage by enabling pre-assessment of model competency for control optimization through uncertainty quantification. Thus, the proposed model shows significant potential to improve thermal comfort and energy efficiency achieved with heating MPC solutions.

Updated: 2025-10-16 12:35:14

Categories: stat.AP,cs.LG

Download: http://arxiv.org/abs/2504.03350v3

LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model's prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.

Updated: 2025-10-16 12:34:10

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14623v1

Large Language Models Enable Design of Personalized Nudges across Cultures

Nudge strategies are effective tools for influencing behaviour, but their impact depends on individual preferences. Strategies that work for some individuals may be counterproductive for others. We hypothesize that large language models (LLMs) can facilitate the design of individual-specific nudges without the need for costly and time-intensive behavioural data collection and modelling. To test this, we use LLMs to design personalized decoy-based nudges tailored to individual profiles and cultural contexts, aimed at encouraging air travellers to voluntarily offset CO2 emissions from flights. We evaluate their effectiveness through a large-scale survey experiment (n = 3495) conducted across five countries. Results show that LLM-informed personalized nudges are more effective than uniform settings, raising offsetting rates by 3-7% in Germany, Singapore, and the US, though not in China or India. Our study highlights the potential of LLM as a low-cost testbed for piloting nudge strategies. At the same time, cultural heterogeneity constrains their generalizability, underscoring the need for combining LLM-based simulations with targeted empirical validation.

Updated: 2025-10-16 12:33:13

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2508.12045v2

Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses

Large language models (LLMs) are increasingly deployed in real-world applications ranging from chatbots to agentic systems, where they are expected to process untrusted data and follow trusted instructions. Failure to distinguish between the two poses significant security risks, exploited by prompt injection attacks, which inject malicious instructions into the data to control model outputs. Model-level defenses have been proposed to mitigate prompt injection attacks. These defenses fine-tune LLMs to ignore injected instructions in untrusted data. We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses. Checkpoint-GCG enhances the Greedy Coordinate Gradient (GCG) attack by leveraging intermediate model checkpoints produced during fine-tuning to initialize GCG, with each checkpoint acting as a stepping stone for the next one to continuously improve attacks. First, we instantiate Checkpoint-GCG to evaluate the robustness of the state-of-the-art defenses in an auditing setup, assuming both (a) full knowledge of the model input and (b) access to intermediate model checkpoints. We show Checkpoint-GCG to achieve up to 96% attack success rate (ASR) against the strongest defense. Second, we relax the first assumption by searching for a universal suffix that would work on unseen inputs, and obtain up to 89.9% ASR against the strongest defense. Finally, we relax both assumptions by searching for a universal suffix that would transfer to similar black-box models and defenses, achieving an ASR of 63.9% against a newly released defended model from Meta.
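
(A toy stand-in of ours: greedy hill climbing on drifting one-dimensional objectives, not the actual gradient-based GCG attack.) It illustrates the checkpoint-stepping idea: warm-starting the search on each checkpoint from the previous checkpoint's solution costs far fewer total steps than restarting cold every time.

```python
def hill_climb(x0, score, max_steps=10000):
    """Greedy +/-1 hill climbing on the integers; returns (optimum, steps taken)."""
    x, steps = x0, 0
    while steps < max_steps:
        better = [c for c in (x - 1, x + 1) if score(c) > score(x)]
        if not better:
            break
        x = better[0]
        steps += 1
    return x, steps

# Toy "checkpoints": a sequence of slowly drifting optima standing in for the
# intermediate fine-tuning checkpoints of the defended model.
optima = [5, 9, 12, 14, 15]

cold_steps = warm_steps = 0
x_warm = 0
for m in optima:
    score = lambda x, m=m: -abs(x - m)       # this checkpoint's attack objective
    cold_steps += hill_climb(0, score)[1]    # restart from scratch every time
    x_warm, s = hill_climb(x_warm, score)    # warm-start from the previous solution
    warm_steps += s

assert (warm_steps, cold_steps) == (15, 55)  # warm-starting is far cheaper
```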

Updated: 2025-10-16 12:31:18

Subjects: cs.CR,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2505.15738v2

ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined "golden path", while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents' performance on complex, long-horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.
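A graph-structured evaluation of this kind can be sketched with plain dictionaries: the benchmark stores finite UI states and actions, accepts any trajectory that reaches one of several goal states (multiple valid solutions), and scores subtask completion via checkpoint states. All state and action names below are illustrative:

```python
def evaluate_trajectory(graph, start, goals, trajectory, checkpoints=()):
    """Replay an agent's action trajectory on a static UI-state graph.
    graph: {state: {action: next_state}}.  Returns (task success,
    fraction of checkpoint states visited)."""
    state, visited = start, [start]
    for action in trajectory:
        if action not in graph.get(state, {}):
            break                      # invalid action: episode ends here
        state = graph[state][action]
        visited.append(state)
    success = state in goals
    hit = sum(c in visited for c in checkpoints)
    subtask_rate = hit / len(checkpoints) if checkpoints else float(success)
    return success, subtask_rate
```

Because the graph encodes every observed transition, any of several correct paths counts as success, which is the quasi-dynamic middle ground between a single golden path and a live device.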

Updated: 2025-10-16 12:30:05

Subjects: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14621v1

Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

Large language models (LLMs) make remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but do not provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing \textit{CodeSeq}, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with \textit{CodeSeq} improve on various reasoning tasks and can preserve the models' OOD performance.

Updated: 2025-10-16 12:29:40

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14620v1

Preservation of Language Understanding Capabilities in Speech-aware Large Language Models

The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark utilizes textual tasks and a voice cloning text-to-speech model to quantify the extent to which language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model for different categories of speakers and its robustness across text and speech modalities.

Updated: 2025-10-16 12:28:23

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.12171v2

Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models--the standard architecture for RLHF--achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.

Updated: 2025-10-16 12:23:13

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14616v1

First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training

As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become a standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block's MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by the observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input for the model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing the training time than the baseline.
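Our reading of the FAL idea, as a toy sketch: the first layer's attention output is redirected into every later layer's MLP, so within a block the attention branch (which reads the running hidden state) and the MLP branch (which reads the fixed first-attention signal) no longer depend on each other and could run concurrently, removing the per-block all-reduce under tensor parallelism. The single-matrix attention and MLP stand-ins below are simplifications, not the paper's architecture:

```python
import numpy as np

def attn(x, W):
    """Toy single-head attention: scores from a bilinear form, softmax
    row-normalised, then used to mix the input rows."""
    s = (x @ W) @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

def mlp(x, W):
    return np.maximum(x @ W, 0.0)      # toy one-layer ReLU MLP

def fal_forward(x, attn_Ws, mlp_Ws):
    """FAL-style forward pass (sketch): every layer's MLP consumes the
    FIRST layer's attention output instead of its own block's."""
    first_attn = attn(x, attn_Ws[0])
    h = x + first_attn + mlp(first_attn, mlp_Ws[0])
    for Wa, Wm in zip(attn_Ws[1:], mlp_Ws[1:]):
        # attn(h, .) and mlp(first_attn, .) read independent inputs,
        # so the two branches are parallelizable within the block
        h = h + attn(h, Wa) + mlp(first_attn, Wm)
    return h
```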

Updated: 2025-10-16 12:21:46

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.14614v1

A Methodology for Assessing the Risk of Metric Failure in LLMs Within the Financial Domain

As Generative Artificial Intelligence is adopted across the financial services industry, a significant barrier to adoption and usage is measuring model performance. Historical machine learning metrics can oftentimes fail to generalize to GenAI workloads and are often supplemented using Subject Matter Expert (SME) evaluation. Even in this combination, many projects fail to account for various unique risks present in choosing specific metrics. Additionally, many widespread benchmarks created by foundational research labs and educational institutions fail to generalize to industrial use. This paper explains these challenges and provides a Risk Assessment Framework to allow for better application of SME and machine learning metrics.

Updated: 2025-10-16 12:21:22

Subjects: cs.AI

Download: http://arxiv.org/abs/2510.13524v2

The syzygy distinguisher

We present a new distinguisher for alternant and Goppa codes, whose complexity is subexponential in the error-correcting capability, hence better than that of generic decoding algorithms. Moreover it does not suffer from the strong regime limitations of the previous distinguishers or structure recovery algorithms: in particular, it applies to the codes used in the Classic McEliece candidate for postquantum cryptography standardization. The invariants that allow us to distinguish are graded Betti numbers of the homogeneous coordinate ring of a shortening of the dual code. Since its introduction in 1978, this is the first time an analysis (in the CPA model) of the McEliece cryptosystem breaks the exponential barrier.

Updated: 2025-10-16 12:20:51

Subjects: cs.CR,cs.IT,math.AG,math.IT

Download: http://arxiv.org/abs/2407.15740v6

An Active Inference Model of Mouse Point-and-Click Behaviour

We explore the use of Active Inference (AIF) as a computational user model for spatial pointing, a key problem in Human-Computer Interaction (HCI). We present an AIF agent with continuous state, action, and observation spaces, performing one-dimensional mouse pointing and clicking. We use a simple underlying dynamic system to model the mouse cursor dynamics with realistic perceptual delay. In contrast to previous optimal feedback control-based models, the agent's actions are selected by minimizing Expected Free Energy, solely based on preference distributions over percepts, such as observing clicking a button correctly. Our results show that the agent creates plausible pointing movements and clicks when the cursor is over the target, with similar end-point variance to human users. In contrast to other models of pointing, we incorporate fully probabilistic, predictive delay compensation into the agent. The agent shows distinct behaviour for differing target difficulties without the need to retune system parameters, as done in other approaches. We discuss the simulation results and emphasize the challenges in identifying the correct configuration of an AIF agent interacting with continuous systems.

Updated: 2025-10-16 12:19:38

Subjects: cs.HC,cs.AI

Download: http://arxiv.org/abs/2510.14611v1

SoK: Evaluating Jailbreak Guardrails for Large Language Models

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety alignments. Guardrails--external defense mechanisms that monitor and control LLM interactions--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, provide insights into optimizing their defense mechanisms, and explore their universality across attack types. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.

Updated: 2025-10-16 12:15:42

Subjects: cs.CR,cs.AI

Download: http://arxiv.org/abs/2506.10597v2

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via a reinforcement learning manner. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements~(36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF

Updated: 2025-10-16 12:10:00

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14605v1

Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM
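A sketch of the update rule as we read it from the abstract (not the authors' code): each iteration probes one random coordinate with a two-point finite difference (the O(1) function evaluations), a per-coordinate momentum buffer smooths the noisy estimates, and the step uses only the sign of the momentum vector. All hyperparameter values are illustrative:

```python
import numpy as np

def jaguar_signsgd(f, x0, lr=0.02, beta=0.9, tau=1e-4, iters=800, seed=0):
    """Zero-order SignSGD with coordinate momentum (sketch).
    f: scalar objective; only function values are used, no gradients."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    m = np.zeros_like(x)               # per-coordinate momentum buffer
    for _ in range(iters):
        i = rng.integers(x.size)       # probe ONE random coordinate
        e = np.zeros_like(x)
        e[i] = tau
        g_i = (f(x + e) - f(x - e)) / (2 * tau)   # two evals per iter
        m[i] = beta * m[i] + (1 - beta) * g_i
        x -= lr * np.sign(m)           # sign step on the full buffer
    return x
```

The memory footprint is a single buffer the size of the parameters, which is the appeal over first-order Adam-style optimizers.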

Updated: 2025-10-16 11:58:47

Subjects: cs.LG,math.OC

Download: http://arxiv.org/abs/2506.04430v4

Rethinking Purity and Diversity in Multi-Behavior Sequential Recommendation from the Frequency Perspective

In recommendation systems, users often exhibit multiple behaviors, such as browsing, clicking, and purchasing. Multi-behavior sequential recommendation (MBSR) aims to consider these different behaviors in an integrated manner to improve the recommendation performance of the target behavior. However, some behavior data will also bring inevitable noise to the modeling of user interests. Some research efforts focus on data denoising from the frequency-domain perspective to improve the accuracy of user preference prediction. These studies indicate that low-frequency information tends to be valuable and reliable, while high-frequency information is often associated with noise. In this paper, we argue that high-frequency information is by no means insignificant. Further experimental results highlight that low frequency corresponds to the purity of user interests, while high frequency corresponds to the diversity of user interests. Building upon this finding, we propose our model PDB4Rec, which efficiently extracts information across various frequency bands and their relationships, and introduces a Bootstrapping Balancer mechanism to balance their contributions for improved recommendation performance. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our model.
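The low/high frequency decomposition this line of work builds on can be reproduced directly with an FFT along the time axis of a behavior-embedding sequence; the band cutoff `keep` is an illustrative parameter, not the paper's setting:

```python
import numpy as np

def frequency_split(seq, keep=2):
    """Split a (time, dim) embedding sequence into a low-frequency
    component (the `keep` lowest frequencies -- purity of interests in
    the paper's reading) and a high-frequency remainder (diversity).
    The two parts sum back to the original sequence."""
    F = np.fft.rfft(seq, axis=0)
    low_F, high_F = F.copy(), F.copy()
    low_F[keep:] = 0                   # low band: keep slow trends
    high_F[:keep] = 0                  # high band: keep fast variation
    n = seq.shape[0]
    low = np.fft.irfft(low_F, n=n, axis=0)
    high = np.fft.irfft(high_F, n=n, axis=0)
    return low, high
```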

Updated: 2025-10-16 11:58:20

Subjects: cs.IR,cs.AI

Download: http://arxiv.org/abs/2508.20427v2

WeightLoRA: Keep Only Necessary Adapters

The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation ($\texttt{LoRA}$), which adds trainable adapters to selected layers. Although $\texttt{LoRA}$ may obtain accurate solutions, it requires significant memory to train large models and intuition on which layers to add adapters. In this paper, we propose a novel method, $\texttt{WeightLoRA}$, which overcomes this issue by adaptive selection of the most critical $\texttt{LoRA}$ heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. We conduct experiments for a series of competitive benchmarks and DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches. The experimental results demonstrate the efficacy of $\texttt{WeightLoRA}$ and the superior performance of $\texttt{WeightLoRA+}$ in almost all cases.
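The selection step can be pictured as scoring every LoRA adapter and keeping only the top-k; the Frobenius norm of the low-rank update B @ A used below is our illustrative stand-in for an importance score, not the paper's exact criterion:

```python
import numpy as np

def select_lora_heads(adapters, k):
    """Toy WeightLoRA-style head selection.  adapters maps a layer
    name to its (A, B) low-rank pair; adapters outside the returned
    set would be frozen or dropped, cutting trainable parameters."""
    scores = {name: float(np.linalg.norm(B @ A))
              for name, (A, B) in adapters.items()}
    return set(sorted(scores, key=scores.get, reverse=True)[:k])
```

In training, such a selection would be re-evaluated adaptively as optimization proceeds, which is the part the paper's method automates.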

Updated: 2025-10-16 11:57:58

Subjects: cs.LG,math.OC

Download: http://arxiv.org/abs/2506.02724v3

Visual Stereotypes of Autism Spectrum in Janus-Pro-7B, DALL-E, Stable Diffusion, SDXL, FLUX, and Midjourney

Avoiding systemic discrimination of neurodiverse individuals is an ongoing challenge in training AI models, which often propagate negative stereotypes. This study examined whether six text-to-image models (Janus-Pro-7B VL2 vs. VL3, DALL-E 3 v. April 2024 vs. August 2025, Stable Diffusion v. 1.6 vs. 3.5, SDXL v. April 2024 vs. FLUX.1 Pro, and Midjourney v. 5.1 vs. 7) perpetuate non-rational beliefs regarding autism by comparing images generated in 2024-2025 with controls. 53 prompts aimed at neutrally visualizing concrete objects and abstract concepts related to autism were used against 53 controls (baseline total N=302, follow-up experimental 280 images plus 265 controls). Expert assessment measuring the presence of common autism-related stereotypes employed a framework of 10 deductive codes followed by statistical analysis. Autistic individuals were depicted with striking homogeneity in skin color (white), gender (male), and age (young), often engaged in solitary activities, interacting with objects rather than people, and exhibiting stereotypical emotional expressions such as sadness, anger, or emotional flatness. In contrast, the images of neurotypical individuals were more diverse and lacked such traits. We found significant differences between the models; however, with a moderate effect size, and no differences between baseline and follow-up summary values, with the ratio of stereotypical themes to the number of images similar across all models. The control prompts showed a significantly lower degree of stereotyping with large size effects, confirming the hidden biases of the models. In summary, despite improvements in the technical aspects of image generation, the level of reproduction of potentially harmful autism-related stereotypes remained largely unaffected.

Updated: 2025-10-16 11:56:26

Subjects: cs.CY,cs.AI

Download: http://arxiv.org/abs/2407.16292v3

Latent Retrieval Augmented Generation of Cross-Domain Protein Binders

Designing protein binders targeting specific sites, which requires generating realistic and functional interaction patterns, is a fundamental challenge in drug discovery. Current structure-based generative models are limited in generating interfaces with sufficient rationality and interpretability. In this paper, we propose Retrieval-Augmented Diffusion for Aligned interface (RADiAnce), a new framework that leverages known interfaces to guide the design of novel binders. By unifying retrieval and generation in a shared contrastive latent space, our model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer. Extensive experiments show that RADiAnce significantly outperforms baseline models across multiple metrics, including binding affinity and recovery of geometries and interactions. Additional experimental results validate cross-domain generalization, demonstrating that retrieving interfaces from diverse domains, such as peptides, antibodies, and protein fragments, enhances the generation performance of binders for other domains. Our work establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.

Updated: 2025-10-16 11:55:27

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.10480v2

Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval

Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
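One way to picture the hybrid step, under assumed data structures (a dict of node embeddings for dense search, an adjacency list for the modality-aware knowledge graph): dense retrieval seeds the result set, then graph traversal pulls in cross-modal neighbours such as a table linked to a retrieved figure. The node ids and scoring are illustrative:

```python
import numpy as np

def hybrid_retrieve(query_vec, doc_vecs, graph, k=2, hops=1):
    """Dense cosine search over doc_vecs picks the top-k seed nodes;
    the knowledge graph then contributes all neighbours reachable
    within `hops` edges from the seeds."""
    qn = np.linalg.norm(query_vec)
    sims = {n: float(np.dot(query_vec, v) / (qn * np.linalg.norm(v)))
            for n, v in doc_vecs.items()}
    seeds = sorted(sims, key=sims.get, reverse=True)[:k]
    result, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {nb for n in frontier for nb in graph.get(n, [])}
        result |= frontier             # cross-modal expansion
    return result
```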

Updated: 2025-10-16 11:55:24

Subjects: cs.LG,cs.IR

Download: http://arxiv.org/abs/2510.14592v1

Just-In-Time Objectives: A General Approach for Specialized AI Interactions

Large language models promise a broad set of functions, but when not given a specific objective, they default to milquetoast results such as drafting emails littered with cliches. We demonstrate that inferring the user's in-the-moment objective, then rapidly optimizing for that singular objective, enables LLMs to produce tools, interfaces, and responses that are more responsive and desired. We contribute an architecture for automatically inducing just-in-time objectives by passively observing user behavior, then steering downstream AI systems through generation and evaluation against this objective. Inducing just-in-time objectives (e.g., "Clarify the abstract's research contribution") enables automatic generation of tools, e.g., those that critique a draft based on relevant HCI methodologies, anticipate related researchers' reactions, or surface ambiguous terminology. In a series of experiments (N=14, N=205) on participants' own tasks, JIT objectives enable LLM outputs that achieve 66-86% win rates over typical LLMs, and in-person use sessions (N=17) confirm that JIT objectives produce specialized tools unique to each participant.

Updated: 2025-10-16 11:53:17

Subjects: cs.HC,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14591v1

Symbolic verification of Apple's Find My location-tracking protocol

Tracking devices, while designed to help users find their belongings in case of loss or theft, raise new questions about privacy and surveillance, affecting not just their own users but, in the case of crowd-sourced location tracking, even others only tangentially associated with these platforms. Apple's Find My is perhaps the most ubiquitous such system, running on millions of devices worldwide and able to locate even devices that have no cellular support or GPS. Apple claims that this system is private and secure, but the code is proprietary, and such claims have to be taken on faith. It is well known that even with perfect cryptographic guarantees, logical flaws can creep into protocols and enable undesirable attacks. In this paper, we present a symbolic model of the Find My protocol, together with a precise formal specification of the desired properties, and provide automated, machine-checkable proofs of these properties in the Tamarin prover.

Updated: 2025-10-16 11:52:05

Subjects: cs.CR

Download: http://arxiv.org/abs/2510.14589v1

STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB + auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
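The Instance Cues construction can be sketched in a few lines: average the 2D flow over each instance mask and stack monocular depth as a third channel, yielding a dense 2.5D field from a sparse hint. The array shapes, the depth source, and the single-instance setting are illustrative assumptions, not the paper's exact pipeline.

```python
# Toy sketch of Instance Cues: sparse per-instance flow -> dense 2.5D field.
import numpy as np

def instance_cues(flow, depth, mask):
    """flow: (H, W, 2) sparse/noisy 2D flow; depth: (H, W); mask: (H, W) bool."""
    dense = np.zeros(flow.shape[:2] + (3,), dtype=np.float32)
    # Average the 2D flow over the instance, so every masked pixel
    # carries the same clean motion vector.
    mean_uv = flow[mask].mean(axis=0)
    dense[mask, :2] = mean_uv
    # Augment with per-pixel monocular depth to disambiguate motion in z.
    dense[mask, 2] = depth[mask]
    return dense

H, W = 4, 4
flow = np.zeros((H, W, 2)); flow[1, 1] = [2.0, 0.0]; flow[1, 2] = [4.0, 0.0]
depth = np.full((H, W), 5.0)
mask = np.zeros((H, W), bool); mask[1, 1:3] = True

cues = instance_cues(flow, depth, mask)
print(cues[1, 1])  # masked pixel carries the averaged flow plus depth
```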

Updated: 2025-10-16 11:50:38

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14588v1

Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking

Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages applied consecutively to refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance the prediction quality through a dedicated scoring model and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior performance on Astex and PDBbind test sets in terms of docking success rate and physical plausibility. Moreover, our method works approximately 25 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.

Updated: 2025-10-16 11:44:24

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.14586v1

Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures

Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure, in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near-perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak, performance is uniformly low across both sentence types, and for models that are too strong, performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.

Updated: 2025-10-16 11:40:29

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07141v2

Local Causal Discovery for Statistically Efficient Causal Inference

Causal discovery methods can identify valid adjustment sets for causal effect estimation for a pair of target variables, even when the underlying causal graph is unknown. Global causal discovery methods focus on learning the whole causal graph and therefore enable the recovery of optimal adjustment sets, i.e., sets with the lowest asymptotic variance, but they quickly become computationally prohibitive as the number of variables grows. Local causal discovery methods offer a more scalable alternative by focusing on the local neighborhood of the target variables, but are restricted to statistically suboptimal adjustment sets. In this work, we propose Local Optimal Adjustments Discovery (LOAD), a sound and complete causal discovery approach that combines the computational efficiency of local methods with the statistical optimality of global methods. First, LOAD identifies the causal relation between the targets and tests if the causal effect is identifiable by using only local information. If it is identifiable, it then finds the optimal adjustment set by leveraging local causal discovery to infer the mediators and their parents. Otherwise, it returns the locally valid parent adjustment sets based on the learned local structure. In our experiments on synthetic and realistic data LOAD outperforms global methods in scalability, while providing more accurate effect estimation than local methods.
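As a reminder of what a valid adjustment set buys, a small simulation in a linear structural causal model shows the contrast LOAD is concerned with: regressing Y on X alone is confounded, while adjusting for the parent Z recovers the true effect. The graph and coefficients here are invented for illustration and do not come from the paper.

```python
# Linear SCM with confounder Z: Z -> X, Z -> Y, and X -> Y with effect 2.0.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = 2.0 * X + 1.5 * Z + rng.normal(size=n)

def ols_coef(features, target):
    # Least-squares fit; return the coefficient on the first column (X).
    beta, *_ = np.linalg.lstsq(features, target, rcond=None)
    return beta[0]

naive = ols_coef(X[:, None], Y)                  # omits the confounder: biased
adjusted = ols_coef(np.column_stack([X, Z]), Y)  # valid adjustment set {Z}
print(round(naive, 2), round(adjusted, 2))
```

Different valid adjustment sets all remove this bias but differ in the variance of the resulting estimate, which is the statistical-efficiency gap between parent sets and the optimal set that the paper targets.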

Updated: 2025-10-16 11:39:02

Subjects: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14582v1

Selective Labeling with False Discovery Rate Control

Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and human labels the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce Conformal Labeling, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal $p$-value for each test instance by comparing AI models' predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose $p$-values are below a data-dependent threshold, certifying AI models' predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.
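A minimal sketch of the recipe, under simplifying assumptions (predicted confidence as the only statistic, and a Benjamini-Hochberg step-up as the data-dependent threshold; the paper's exact construction may differ): compute a conformal p-value for each test instance against the calibration instances the model mislabeled, then select the instances whose p-values clear the threshold.

```python
# Conformal p-values from mislabeled calibration scores + BH-style selection.
import numpy as np

def conformal_pvalues(test_conf, calib_conf_mislabeled):
    """p-value = smoothed fraction of mislabeled calibration scores >= test score."""
    calib = np.asarray(calib_conf_mislabeled)
    m = len(calib)
    return np.array([(1 + np.sum(calib >= c)) / (m + 1) for c in test_conf])

def select_trusted(pvals, alpha):
    """Benjamini-Hochberg step-up over the p-values: controls FDR at alpha."""
    n = len(pvals)
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, n + 1) / n
    below = pvals[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return np.sort(order[:k])

# Confidences of calibration examples the AI got wrong (low-ish)...
calib_wrong = [0.50, 0.52, 0.55, 0.58, 0.60, 0.62, 0.65, 0.68, 0.70]
# ...versus test confidences: high values receive small p-values.
test_conf = [0.95, 0.90, 0.58, 0.50]
p = conformal_pvalues(test_conf, calib_wrong)
print(select_trusted(p, alpha=0.25))
```

High-confidence test instances are selected as trustworthy; low-confidence ones, which resemble the mislabeled calibration set, would be deferred to human labeling.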

Updated: 2025-10-16 11:39:00

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14581v1

Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models

Concept Bottleneck Models (CBNMs) are deep learning models that provide interpretability by enforcing a bottleneck layer where predictions are based exclusively on human-understandable concepts. However, this constraint also restricts information flow and often results in reduced predictive accuracy. Concept Sidechannel Models (CSMs) address this limitation by introducing a sidechannel that bypasses the bottleneck and carries additional task-relevant information. While this improves accuracy, it simultaneously compromises interpretability, as predictions may rely on uninterpretable representations transmitted through sidechannels. Currently, there exists no principled technique to control this fundamental trade-off. In this paper, we close this gap. First, we present a unified probabilistic concept sidechannel meta-model that subsumes existing CSMs as special cases. Building on this framework, we introduce the Sidechannel Independence Score (SIS), a metric that quantifies a CSM's reliance on its sidechannel by contrasting predictions made with and without sidechannel information. We propose SIS regularization, which explicitly penalizes sidechannel reliance to improve interpretability. Finally, we analyze how the expressivity of the predictor and the reliance of the sidechannel jointly shape interpretability, revealing inherent trade-offs across different CSM architectures. Empirical results show that state-of-the-art CSMs, when trained solely for accuracy, exhibit low representation interpretability, and that SIS regularization substantially improves their interpretability, intervenability, and the quality of learned interpretable task predictors. Our work provides both theoretical and practical tools for developing CSMs that balance accuracy and interpretability in a principled manner.
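The intuition behind an independence score of this kind can be illustrated with a toy linear CSM: compare predictions with the sidechannel active and with it suppressed. The linear head, the zeroing intervention, and the agreement-based score below are simplifying assumptions; the paper defines SIS within its probabilistic meta-model.

```python
# Toy sidechannel-independence score: fraction of predictions unchanged
# when the sidechannel is zeroed out (1.0 = no sidechannel reliance).
import numpy as np

def predict(concepts, sidechannel, w_c, w_s):
    """Linear 'predictor head' over concepts plus a sidechannel term."""
    logits = concepts @ w_c + sidechannel @ w_s
    return (logits > 0).astype(int)

def sidechannel_independence(concepts, sidechannel, w_c, w_s):
    with_sc = predict(concepts, sidechannel, w_c, w_s)
    without_sc = predict(concepts, np.zeros_like(sidechannel), w_c, w_s)
    return float(np.mean(with_sc == without_sc))

rng = np.random.default_rng(0)
concepts = rng.normal(size=(200, 4))
sidechannel = rng.normal(size=(200, 2))
w_c = np.ones(4)
# A model that ignores its sidechannel scores 1.0; a sidechannel-heavy
# model scores lower, flagging reliance on uninterpretable features.
print(sidechannel_independence(concepts, sidechannel, w_c, np.zeros(2)))
print(sidechannel_independence(concepts, sidechannel, w_c, 10 * np.ones(2)))
```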

Updated: 2025-10-16 11:37:20

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.05670v2

Natural Language Processing RELIES on Linguistics

Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES that encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-à-vis systems of human language.

Updated: 2025-10-16 11:35:21

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2405.05966v5

State-Space Models for Tabular Prior-Data Fitted Networks

Recent advancements in foundation models for tabular data, such as TabPFN, demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSM's inherent sensitivity to the order of input tokens - an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order-dependence, achieving predictive performance competitive to the original TabPFN model.
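The bidirectional trick can be illustrated with a scalar linear recurrence: run the scan forward and backward and sum the two passes, which makes the pooled representation symmetric in the row order. The scalar decay and toy inputs are assumptions for illustration; Hydra's actual parameterization is a structured state-space model, not this scalar recurrence.

```python
# Bidirectional linear scan: forward pass + backward pass, summed.
import numpy as np

def scan(xs, a=0.9):
    """Linear recurrence h_t = a * h_{t-1} + x_t, returned for every t."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + x
        out.append(h)
    return np.array(out)

def bidirectional_features(xs, a=0.9):
    fwd = scan(xs, a)
    bwd = scan(xs[::-1], a)[::-1]
    return fwd + bwd  # symmetric aggregation over both directions

xs = np.array([1.0, 2.0, 3.0])
feats = bidirectional_features(xs)
# Reversing the row order reverses the features but leaves their pooled
# sum unchanged -- unlike a single forward scan.
print(feats.sum(), bidirectional_features(xs[::-1]).sum())
```

This symmetry is why a bidirectional SSM is less sensitive to the semantically meaningless row order of a tabular dataset than a unidirectional one.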

Updated: 2025-10-16 11:31:51

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.14573v1

A Comprehensive Review of Recommender Systems: Transitioning from Theory to Practice

Recommender Systems (RS) play an integral role in enhancing user experiences by providing personalized item suggestions. This survey reviews the progress in RS inclusively from 2017 to 2024, effectively connecting theoretical advances with practical applications. We explore the development from traditional RS techniques like content-based and collaborative filtering to advanced methods involving deep learning, graph-based models, reinforcement learning, and large language models. We also discuss specialized systems such as context-aware, review-based, and fairness-aware RS. The primary goal of this survey is to bridge theory with practice. It addresses challenges across various sectors, including e-commerce, healthcare, and finance, emphasizing the need for scalable, real-time, and trustworthy solutions. Through this survey, we promote stronger partnerships between academic research and industry practices. The insights offered by this survey aim to guide industry professionals in optimizing RS deployment and to inspire future research directions, especially in addressing emerging technological and societal trends. The survey resources are available in the public GitHub repository https://github.com/VectorInstitute/Recommender-Systems-Survey. Keywords: recommender systems, large language models, ChatGPT, responsible AI.

Updated: 2025-10-16 11:31:30

Subjects: cs.IR,cs.AI

Download: http://arxiv.org/abs/2407.13699v4

Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses

Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known human-like response biases, such as central tendency, opinion floating and primacy bias are poorly understood. This work investigates the response robustness of LLMs in normative survey contexts, we test nine LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of ten perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated survey interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.

Updated: 2025-10-16 11:31:03

Subjects: cs.CL,cs.AI,cs.CY,J.4

Download: http://arxiv.org/abs/2507.07188v3

SPIRIT: Patching Speech Language Models against Jailbreak Attacks

Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise into speech. We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses that intervene during inference by modifying the SLM's activations, improving robustness by up to 99% with (i) negligible impact on utility and (ii) no re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs.

Updated: 2025-10-16 11:26:23

Subjects: eess.AS,cs.LG

Download: http://arxiv.org/abs/2505.13541v2

OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies

The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk-ai/OpenDerisk/

Updated: 2025-10-16 11:18:45

Subjects: cs.SE,cs.AI,68N30

Download: http://arxiv.org/abs/2510.13561v2

Lost in the Averages: A New Specific Setup to Evaluate Membership Inference Attacks Against Machine Learning Models

Synthetic data generators and machine learning models can memorize their training data, posing privacy concerns. Membership inference attacks (MIAs) are a standard method of estimating the privacy risk of these systems. The risk of individual records is typically computed by evaluating MIAs in a record-specific privacy game. We analyze the record-specific privacy game commonly used for evaluating attackers under realistic assumptions (the "traditional" game) -- particularly for synthetic tabular data -- and show that it averages a record's privacy risk across datasets. We show this implicitly assumes the dataset a record is part of has no impact on the record's risk, providing a misleading risk estimate when a specific model or synthetic dataset is released. Instead, we propose a novel use of the leave-one-out game, used in existing work exclusively to audit differential privacy guarantees, and call this the "model-seeded" game. We formalize it and show that it provides an accurate estimate of the privacy risk posed by a given adversary for a record in its specific dataset. We instantiate and evaluate the state-of-the-art MIA for synthetic data generators in the traditional and model-seeded privacy games, and show across multiple datasets and models that the two privacy games indeed result in different risk scores, with up to 94% of high-risk records being overlooked by the traditional game. We further show that records in smaller datasets and models not protected by strong differential privacy guarantees tend to have a larger gap between risk estimates. Taken together, our results show that the model-seeded setup yields a risk estimate specific to a certain model or synthetic dataset released and in line with the standard notion of privacy leakage from prior work, meaningfully different from the dataset-averaged risk provided by the traditional privacy game.

Updated: 2025-10-16 11:17:06

Subjects: cs.LG,cs.CR

Download: http://arxiv.org/abs/2405.15423v2

A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy

Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.

Updated: 2025-10-16 11:16:25

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.07492v2

Redundancy-Aware Test-Time Graph Out-of-Distribution Detection

Distributional discrepancy between training and test data can lead models to make inaccurate predictions when encountering out-of-distribution (OOD) samples in real-world applications. Although existing graph OOD detection methods leverage data-centric techniques to extract effective representations, their performance remains compromised by structural redundancy that induces semantic shifts. To address this dilemma, we propose RedOUT, an unsupervised framework that integrates structural entropy into test-time OOD detection for graph classification. Concretely, we introduce the Redundancy-aware Graph Information Bottleneck (ReGIB) and decompose the objective into essential information and irrelevant redundancy. By minimizing structural entropy, the decoupled redundancy is reduced, and theoretically grounded upper and lower bounds are proposed for optimization. Extensive experiments on real-world datasets demonstrate the superior performance of RedOUT on OOD detection. Specifically, our method achieves an average improvement of 6.7%, significantly surpassing the best competitor by 17.3% on the ClinTox/LIPO dataset pair.

Updated: 2025-10-16 11:14:45

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.14562v1

HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging

Accurate liver and tumor segmentation on abdominal CT images is critical for reliable diagnosis and treatment planning, but remains challenging due to complex anatomical structures, variability in tumor appearance, and limited annotated data. To address these issues, we introduce Hyperbolic-convolutions Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity Network (HANS-Net), a novel segmentation framework that synergistically combines hyperbolic convolutions for hierarchical geometric representation, a wavelet-inspired decomposition module for multi-scale texture learning, a biologically motivated synaptic plasticity mechanism for adaptive feature enhancement, and an implicit neural representation branch to model fine-grained and continuous anatomical boundaries. Additionally, we incorporate uncertainty-aware Monte Carlo dropout to quantify prediction confidence and lightweight temporal attention to improve inter-slice consistency without sacrificing efficiency. Extensive evaluations of the LiTS dataset demonstrate that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap error (VOE) of 11.91%. Furthermore, cross-dataset validation on the AMOS 2022 dataset obtains an average Dice of 85.09%, IoU of 76.66%, ASSD of 19.49 mm, and VOE of 23.34%, indicating strong generalization across different datasets. These results confirm the effectiveness and robustness of HANS-Net in providing anatomically consistent, accurate, and confident liver and tumor segmentation.

Updated: 2025-10-16 11:11:15

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.11325v2

MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving

Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.

Updated: 2025-10-16 11:05:54

Categories: cs.LG,cs.AR

Download: http://arxiv.org/abs/2510.14557v1
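
The outlier fix MX+ describes can be illustrated numerically. The following is a toy sketch, not the actual MX+ bit layout: the bit widths and power-of-two grids are assumptions chosen to show the key idea that all elements of a block share one scale, so a low-bit grid set by the block maximum loses precision on the outlier, while a few extra mantissa bits (the bits MX+ recovers from the outlier's unused exponent field) shrink its quantization error.

```python
import math

def quantize(v, step):
    """Round v to the nearest multiple of step (a power-of-two grid)."""
    return round(v / step) * step

# Toy block: small values plus one outlier, as in the BFP outlier problem.
block = [0.11, -0.07, 0.09, 13.3]
shared_exp = math.floor(math.log2(max(abs(v) for v in block)))  # block scale

coarse_step = 2.0 ** (shared_exp - 1)  # few mantissa bits (MXFP4-like)
fine_step = 2.0 ** (shared_exp - 3)    # two extra mantissa bits for the outlier

err_coarse = abs(block[-1] - quantize(block[-1], coarse_step))
err_fine = abs(block[-1] - quantize(block[-1], fine_step))
print(err_coarse, err_fine)  # the extended mantissa reduces the outlier's error
```

In this toy setup the outlier's error drops from about 1.3 to about 0.3 once it is granted the extra mantissa bits, while the rest of the block is stored exactly as before.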

Adversarial Disentanglement by Backpropagation with Physics-Informed Variational Autoencoder

Inference and prediction under partial knowledge of a physical system is challenging, particularly when multiple confounding sources influence the measured response. Explicitly accounting for these influences in physics-based models is often infeasible due to epistemic uncertainty, cost, or time constraints, resulting in models that fail to accurately describe the behavior of the system. On the other hand, data-driven machine learning models such as variational autoencoders are not guaranteed to identify a parsimonious representation. As a result, they can suffer from poor generalization performance and reconstruction accuracy in the regime of limited and noisy data. We propose a physics-informed variational autoencoder architecture that combines the interpretability of physics-based models with the flexibility of data-driven models. To promote disentanglement of the known physics and confounding influences, the latent space is partitioned into physically meaningful variables that parametrize a physics-based model, and data-driven variables that capture variability in the domain and class of the physical system. The encoder is coupled with a decoder that integrates physics-based and data-driven components, and constrained by an adversarial training objective that prevents the data-driven components from overriding the known physics, ensuring that the physics-grounded latent variables remain interpretable. We demonstrate that the model is able to disentangle features of the input signal and separate the known physics from confounding influences using supervision in the form of class and domain observables. The model is evaluated on a series of synthetic case studies relevant to engineering structures, demonstrating the feasibility of the proposed approach.

Updated: 2025-10-16 11:05:51

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2506.13658v3

Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including states, actions, rewards, and dynamics, is still unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed $\textbf{RPEX}$: $\textbf{R}$obust $\textbf{P}$olicy $\textbf{EX}$pansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios. Code is available at $\href{https://github.com/felix-thu/RPEX}{https://github.com/felix-thu/RPEX}$.

Updated: 2025-10-16 10:59:56

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.24748v2

Hybrid Autoencoder-Based Framework for Early Fault Detection in Wind Turbines

Wind turbine reliability is critical to the growing renewable energy sector, where early fault detection significantly reduces downtime and maintenance costs. This paper introduces a novel ensemble-based deep learning framework for unsupervised anomaly detection in wind turbines. The method integrates Variational Autoencoders (VAE), LSTM Autoencoders, and Transformer architectures, each capturing different temporal and contextual patterns from high-dimensional SCADA data. A unique feature engineering pipeline extracts temporal, statistical, and frequency-domain indicators, which are then processed by the deep models. Ensemble scoring combines model predictions, followed by adaptive thresholding to detect operational anomalies without requiring labeled fault data. Evaluated on the CARE dataset containing 89 years of real-world turbine data across three wind farms, the proposed method achieves an AUC-ROC of 0.947 and early fault detection up to 48 hours prior to failure. This approach offers significant societal value by enabling predictive maintenance, reducing turbine failures, and enhancing operational efficiency in large-scale wind energy deployments.

Updated: 2025-10-16 10:49:19

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.15010v1
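
The ensemble-scoring and adaptive-thresholding steps can be sketched as follows. This is a minimal illustration with invented numbers, not the paper's pipeline: the per-model reconstruction errors, the equal-weight averaging, and the mean-plus-k-sigma threshold rule are all assumptions for the example.

```python
import statistics

def adaptive_threshold(scores, k=2.0):
    """Flag scores beyond mean + k * std of the observed score history."""
    return statistics.mean(scores) + k * statistics.pstdev(scores)

# Hypothetical per-window reconstruction errors from the three model families.
vae_err         = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.55, 0.12]
lstm_ae_err     = [0.20, 0.18, 0.21, 0.19, 0.22, 0.20, 0.70, 0.21]
transformer_err = [0.05, 0.06, 0.05, 0.07, 0.06, 0.05, 0.40, 0.06]

# Ensemble score: average the model errors per time window.
combined = [(a + b + c) / 3
            for a, b, c in zip(vae_err, lstm_ae_err, transformer_err)]

thr = adaptive_threshold(combined)
anomalies = [i for i, s in enumerate(combined) if s > thr]
print(anomalies)  # window 6, where all three models reconstruct poorly
```

No labeled fault data is required: the threshold adapts to whatever score distribution the healthy windows produce, which is what makes the scheme unsupervised.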

LLM Agents Beyond Utility: An Open-Ended Perspective

Recent LLM agents have made great use of chain of thought reasoning and function calling. As their capabilities grow, an important question arises: can this software represent not only a smart problem-solving tool, but an entity in its own right, that can plan, design immediate tasks, and reason toward broader, more ambiguous goals? To study this question, we adopt an open-ended experimental setting where we augment a pretrained LLM agent with the ability to generate its own tasks, accumulate knowledge, and interact extensively with its environment. We study the resulting open-ended agent qualitatively. It can reliably follow complex multi-step instructions, store and reuse information across runs, and propose and solve its own tasks, though it remains sensitive to prompt design, prone to repetitive task generation, and unable to form self-representations. These findings illustrate both the promise and current limits of adapting pretrained LLMs toward open-endedness, and point to future directions for training agents to manage memory, explore productively, and pursue abstract long-term goals.

Updated: 2025-10-16 10:46:54

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14548v1

Agentic Entropy-Balanced Policy Optimization

Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to the training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.

Updated: 2025-10-16 10:40:52

Categories: cs.LG,cs.AI,cs.CL,cs.IR

Download: http://arxiv.org/abs/2510.14545v1

A Deep State-Space Model Compression Method using Upper Bound on Output Error

We study deep state-space models (Deep SSMs) that contain linear-quadratic-output (LQO) systems as internal blocks and present a compression method with a provable output error guarantee. We first derive an upper bound on the output error between two Deep SSMs and show that the bound can be expressed via the $h^2$-error norms between the layerwise LQO systems, thereby providing a theoretical justification for existing model order reduction (MOR)-based compression. Building on this bound, we formulate an optimization problem in terms of the $h^2$-error norm and develop a gradient-based MOR method. On the IMDb task from the Long Range Arena benchmark, we demonstrate that our compression method achieves strong performance. Moreover, unlike prior approaches, we reduce roughly 80% of trainable parameters without retraining, with only a 4-5% performance drop.

Updated: 2025-10-16 10:32:21

Categories: eess.SY,cs.LG,cs.SY

Download: http://arxiv.org/abs/2510.14542v1

Symbol Grounding in Neuro-Symbolic AI: A Gentle Introduction to Reasoning Shortcuts

Neuro-symbolic (NeSy) AI aims to develop deep neural networks whose predictions comply with prior knowledge encoding, e.g. safety or structural constraints. As such, it represents one of the most promising avenues for reliable and trustworthy AI. The core idea behind NeSy AI is to combine neural and symbolic steps: neural networks are typically responsible for mapping low-level inputs into high-level symbolic concepts, while symbolic reasoning infers predictions compatible with the extracted concepts and the prior knowledge. Despite their promise, it was recently shown that - whenever the concepts are not supervised directly - NeSy models can be affected by Reasoning Shortcuts (RSs). That is, they can achieve high label accuracy by grounding the concepts incorrectly. RSs can compromise the interpretability of the model's explanations, performance in out-of-distribution scenarios, and therefore reliability. At the same time, RSs are difficult to detect and prevent unless concept supervision is available, which is typically not the case. However, the literature on RSs is scattered, making it difficult for researchers and practitioners to understand and tackle this challenging problem. This overview addresses this issue by providing a gentle introduction to RSs, discussing their causes and consequences in intuitive terms. It also reviews and elucidates existing theoretical characterizations of this phenomenon. Finally, it details methods for dealing with RSs, including mitigation and awareness strategies, and maps their benefits and limitations. By reformulating advanced material in a digestible form, this overview aims to provide a unifying perspective on RSs to lower the bar to entry for tackling them. Ultimately, we hope this overview contributes to the development of reliable NeSy and trustworthy AI models.

Updated: 2025-10-16 10:28:34

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14538v1

JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol

AI systems are continually evolving and advancing, and user expectations are concurrently increasing, with a growing demand for interactions that go beyond simple text-based interaction with Large Language Models (LLMs). Today's applications often require LLMs to interact with external tools, marking a shift toward more complex agentic systems. To support this, standards such as the Model Context Protocol (MCP) have emerged, enabling agents to access tools by including a specification of the capabilities of each tool within the prompt. Although this approach expands what agents can do, it also introduces a growing problem: prompt bloating. As the number of tools increases, the prompts become longer, leading to high prompt token costs, increased latency, and reduced task success resulting from the selection of tools irrelevant to the prompt. To address this issue, we introduce JSPLIT, a taxonomy-driven framework designed to help agents manage prompt size more effectively when using large sets of MCP tools. JSPLIT organizes the tools into a hierarchical taxonomy and uses the user's prompt to identify and include only the most relevant tools, based on both the query and the taxonomy structure. In this paper, we describe the design of the taxonomy, the tool selection algorithm, and the dataset used to evaluate JSPLIT. Our results show that JSPLIT significantly reduces prompt size without significantly compromising the agent's ability to respond effectively. As the number of available tools for the agent grows substantially, JSPLIT even improves the tool selection accuracy of the agent, effectively reducing costs while simultaneously improving task success in high-complexity agent environments.

Updated: 2025-10-16 10:28:23

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14537v1
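
The taxonomy-guided tool selection can be sketched in a few lines. The taxonomy, tool names, and keyword-overlap matching below are invented for illustration; JSPLIT itself works over an MCP tool hierarchy and the full prompt rather than a bag of keywords.

```python
# Hypothetical taxonomy: category -> list of (tool_name, trigger_keywords).
TAXONOMY = {
    "filesystem": [("read_file", {"file", "read"}), ("write_file", {"file", "write"})],
    "web":        [("fetch_url", {"url", "http", "fetch"}), ("search_web", {"search"})],
    "calendar":   [("create_event", {"event", "meeting", "schedule"})],
}

def select_tools(prompt):
    """Include only tools whose trigger keywords overlap the user's prompt."""
    words = set(prompt.lower().split())
    selected = []
    for category, tools in TAXONOMY.items():
        for name, keywords in tools:
            if words & keywords:  # any keyword overlap -> keep this tool
                selected.append(name)
    return selected

print(select_tools("schedule a meeting and fetch the agenda url"))
# only 2 of the 5 tool specs would be added to the prompt
```

Pruning the other tool specifications is exactly what shrinks the prompt: token cost scales with the number of tool descriptions included, not with the size of the full registry.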

Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

Updated: 2025-10-16 10:14:34

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14526v1

Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing

Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.

Updated: 2025-10-16 10:14:32

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14525v1

On the Identifiability of Tensor Ranks via Prior Predictive Matching

Selecting the latent dimensions (ranks) in tensor factorization is a central challenge that often relies on heuristic methods. This paper introduces a rigorous approach to determine rank identifiability in probabilistic tensor models, based on prior predictive moment matching. We transform a set of moment matching conditions into a log-linear system of equations in terms of marginal moments, prior hyperparameters, and ranks, establishing an equivalence between rank identifiability and the solvability of such a system. We apply this framework to four foundational tensor models, demonstrating that the linear structure of the PARAFAC/CP model, the chain structure of the Tensor Train model, and the closed-loop structure of the Tensor Ring model yield solvable systems, making their ranks identifiable. In contrast, we prove that the symmetric topology of the Tucker model leads to an underdetermined system, rendering the ranks unidentifiable by this method. For the identifiable models, we derive explicit closed-form rank estimators based on the moments of observed data only. We empirically validate these estimators and evaluate the robustness of the proposed approach.

Updated: 2025-10-16 10:13:45

Categories: cs.LG,math.ST,stat.ML,stat.TH,62A09, 62F15,G.3

Download: http://arxiv.org/abs/2510.14523v1

Lexo: Eliminating Stealthy Supply-Chain Attacks via LLM-Assisted Program Regeneration

Software supply-chain attacks are an important and ongoing concern in the open source software ecosystem. These attacks maintain the standard functionality that a component implements, but additionally hide malicious functionality activated only when the component reaches its target environment. Lexo addresses such stealthy attacks by automatically learning and regenerating vulnerability-free versions of potentially malicious components. Lexo first generates a set of input-output pairs to model a component's full observable behavior, which it then uses to synthesize a new version of the original component. The new component implements the original functionality but avoids stealthy malicious behavior. Throughout this regeneration process, Lexo consults several distinct instances of Large Language Models (LLMs), uses correctness and coverage metrics to shepherd these instances, and guardrails their results. Our evaluation on 100+ real-world packages, including high profile stealthy supply-chain attacks, indicates that Lexo scales across multiple domains, regenerates code efficiently (<100s on average), maintains compatibility, and succeeds in eliminating malicious code in several real-world supply-chain-attacks, even in cases when a state-of-the-art LLM fails to eliminate malicious code when prompted to do so.

Updated: 2025-10-16 10:12:14

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14522v1

DELE: Deductive $\mathcal{EL}^{++}$ Embeddings for Knowledge Base Completion

Ontology embeddings map classes, roles, and individuals in ontologies into $\mathbb{R}^n$, and within $\mathbb{R}^n$ similarity between entities can be computed or new axioms inferred. For ontologies in the Description Logic $\mathcal{EL}^{++}$, several optimization-based embedding methods have been developed that explicitly generate models of an ontology. However, these methods suffer from some limitations; they do not distinguish between statements that are unprovable and provably false, and therefore they may use entailed statements as negatives. Furthermore, they do not utilize the deductive closure of an ontology to identify statements that are inferred but not asserted. We evaluated a set of embedding methods for $\mathcal{EL}^{++}$ ontologies, incorporating several modifications that aim to make use of the ontology deductive closure. In particular, we designed novel negative losses that account both for the deductive closure and different types of negatives and formulated evaluation methods for knowledge base completion. We demonstrate that our embedding methods improve over the baseline ontology embedding in the task of knowledge base or ontology completion.

Updated: 2025-10-16 10:06:23

Categories: cs.AI

Download: http://arxiv.org/abs/2411.01574v3
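
The distinction between unprovable and entailed-but-unasserted statements can be made concrete with a tiny subsumption hierarchy. The class names below are invented, and plain transitive closure over subclass axioms stands in for full $\mathcal{EL}^{++}$ reasoning:

```python
# Toy ontology: asserted subclass axioms (names are illustrative only).
asserted = {("Cat", "subclassOf", "Mammal"), ("Mammal", "subclassOf", "Animal")}

def deductive_closure(axioms):
    """Transitive closure over subclassOf: a tiny stand-in for entailment."""
    closed = set(axioms)
    changed = True
    while changed:
        changed = False
        for a, _, b in list(closed):
            for c, _, d in list(closed):
                if b == c and (a, "subclassOf", d) not in closed:
                    closed.add((a, "subclassOf", d))
                    changed = True
    return closed

closure = deductive_closure(asserted)
# ("Cat", "subclassOf", "Animal") is entailed but never asserted. A sampler
# that only avoids asserted axioms would wrongly use it as a negative; a
# closure-aware sampler must exclude it too.
candidate = ("Cat", "subclassOf", "Animal")
print(candidate in asserted, candidate in closure)
```

Using such an entailed statement as a negative is the failure mode the paper's closure-aware negative losses are designed to avoid.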

Mapping Farmed Landscapes from Remote Sensing

Effective management of agricultural landscapes is critical for meeting global biodiversity targets, but efforts are hampered by the absence of detailed, large-scale ecological maps. To address this, we introduce Farmscapes, the first large-scale (covering most of England), high-resolution (25cm) map of rural landscape features, including ecologically vital elements like hedgerows, woodlands, and stone walls. This map was generated using a deep learning segmentation model trained on a novel dataset of 942 manually annotated tiles derived from aerial imagery. Our model accurately identifies key habitats, achieving high F1-scores for woodland (96%) and farmed land (95%), and demonstrates strong capability in segmenting linear features, with an F1-score of 72% for hedgerows. By releasing the England-wide map on Google Earth Engine, we provide a powerful, open-access tool for ecologists and policymakers. This work enables data-driven planning for habitat restoration, supports the monitoring of initiatives like the EU Biodiversity Strategy, and lays the foundation for advanced analysis of landscape connectivity.

Updated: 2025-10-16 10:00:07

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2506.13993v2

State Your Intention to Steer Your Attention: An AI Assistant for Intentional Digital Living

When working on digital devices, people often face distractions that can lead to a decline in productivity and efficiency, as well as negative psychological and emotional impacts. To address this challenge, we introduce a novel Artificial Intelligence (AI) assistant that elicits a user's intention, assesses whether ongoing activities are in line with that intention, and provides gentle nudges when deviations occur. The system leverages a large language model to analyze screenshots, application titles, and URLs, issuing notifications when behavior diverges from the stated goal. Its detection accuracy is refined through initial clarification dialogues and continuous user feedback. In a three-week, within-subjects field deployment with 22 participants, we compared our assistant to both a rule-based intent reminder system and a passive baseline that only logged activity. Results indicate that our AI assistant effectively supports users in maintaining focus and aligning their digital behavior with their intentions. Our source code is publicly available at https://intentassistant.github.io

Updated: 2025-10-16 09:57:35

Categories: cs.HC,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14513v1

Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration

Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.

Updated: 2025-10-16 09:57:31

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14512v1

Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective

Time Series Forecasting has made significant progress with the help of the Patching technique, which partitions time series into multiple patches to effectively retain contextual semantic information in a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which causes a fixed representation space, thus resulting in insufficiently expressive representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes the learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming at fully exploiting the information of contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of the SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plug-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at https://github.com/decisionintelligence/SRSNet.
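The selective-patching idea can be pictured in a few lines: partition the context series into patches, score each patch, and keep only the top-k in temporal order. This is a schematic sketch only; in SRS the scoring and reassembly are learnable, whereas here `score_fn` is a hand-picked stand-in.

```python
import numpy as np

def selective_patches(series, patch_len, k, score_fn):
    """Partition a 1-D series into patches, score each, and keep the
    top-k patches (a simplified stand-in for learnable selection)."""
    n = len(series) // patch_len
    patches = series[: n * patch_len].reshape(n, patch_len)
    scores = np.array([score_fn(p) for p in patches])
    top = np.argsort(scores)[::-1][:k]       # indices of the k best patches
    return patches[np.sort(top)]             # reassembled in temporal order

# Toy context series: two flat patches and two high-variance patches
x = np.array([0, 0, 0, 1, 5, 9, 2, 2, 2, 3, 8, 1], dtype=float)
sel = selective_patches(x, patch_len=3, k=2, score_fn=np.std)
```

With a variance-based score the two informative (non-flat) patches survive, which is the kind of behaviour the learned selector is trained to reproduce from data.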

Updated: 2025-10-16 09:55:06

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14510v1

E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
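BDD pipelines like Behave map Gherkin scenario lines to Python step functions via pattern-matching decorators. The minimal registry below illustrates only that mechanism; it is not E2EDev's code, and the step patterns and scenario are invented for the example.

```python
import re

STEPS = {}

def step(pattern):
    """Register a step implementation for a scenario line (Behave-style)."""
    def wrap(fn):
        STEPS[re.compile(pattern)] = fn
        return fn
    return wrap

@step(r'the user enters "(.+)"')
def enter_text(ctx, text):
    ctx["input"] = text

@step(r'the app echoes "(.+)"')
def check_echo(ctx, text):
    assert ctx["input"] == text, "behavioural check failed"

def run_scenario(lines):
    """Execute each scenario line against the first matching step."""
    ctx = {}
    for line in lines:
        for pat, fn in STEPS.items():
            m = pat.fullmatch(line)
            if m:
                fn(ctx, *m.groups())
                break
    return ctx

result = run_scenario(['the user enters "hello"', 'the app echoes "hello"'])
```

In Behave itself the decorators are `@given`/`@when`/`@then` and the context object is richer, but the matching-and-dispatch loop is the same shape, which is what makes the benchmark's testing pipeline fully automatable.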

Updated: 2025-10-16 09:54:26

Categories: cs.SE,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14509v1

Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals

This paper proposes a reversible learning framework to improve the robustness and efficiency of value-based Reinforcement Learning agents, addressing vulnerability to value overestimation and instability in partially irreversible environments. The framework has two complementary core mechanisms: an empirically derived transition reversibility measure, Phi(s, a), and a selective state rollback operation. We introduce an online per-state-action estimator, Phi, that quantifies the likelihood of returning to a prior state within a fixed horizon K. This measure is used to dynamically adjust the penalty term during temporal difference updates, integrating reversibility awareness directly into the value function. The system also includes a selective rollback operator. When an action yields an expected return markedly lower than its instantaneous estimated value and violates a predefined threshold, the agent is penalized and returns to the preceding state rather than progressing. This interrupts suboptimal, high-risk trajectories and avoids catastrophic steps. By combining reversibility-aware evaluation with targeted rollback, the method improves safety, performance, and stability. In the CliffWalking-v0 domain, the framework reduced catastrophic falls by over 99.8 percent and yielded a 55 percent increase in mean episode return. In the Taxi-v3 domain, it suppressed illegal actions by at least 99.9 percent and achieved a 65.7 percent improvement in cumulative reward, while also sharply reducing reward variance in both environments. Ablation studies confirm that the rollback mechanism is the critical component underlying these safety and performance gains, marking a robust step toward safe and reliable sequential decision making.
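The two mechanisms can be sketched in a few lines of tabular Q-learning. The penalty shaping and the rollback test below follow the description only loosely; the constants and the exact rollback criterion are assumptions for illustration, not the paper's implementation.

```python
def td_update(Q, phi, s, a, r, s2, alpha=0.5, gamma=0.9, lam=1.0):
    """TD update whose penalty shrinks when (s, a) is estimated reversible."""
    penalty = lam * (1.0 - phi.get((s, a), 0.0))   # irreversible moves cost more
    target = r - penalty + gamma * max(Q[s2].values())
    Q[s][a] += alpha * (target - Q[s][a])

def next_state(Q, s, a, r, s2, gamma=0.9, threshold=1.0):
    """Selective rollback: stay at s when the observed return falls far
    below the action's current value estimate."""
    if r + gamma * max(Q[s2].values()) < Q[s][a] - threshold:
        return s                                   # roll back, do not progress
    return s2

# Tiny two-state example: action (0, 0) is marked fully reversible
Q = {0: {0: 0.0}, 1: {0: 0.0}}
phi = {(0, 0): 1.0}
td_update(Q, phi, 0, 0, r=1.0, s2=1)               # no penalty, value rises
```

A catastrophic observed reward then triggers the rollback branch, which is what interrupts the high-risk trajectory before the cliff step is committed.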

Updated: 2025-10-16 09:48:54

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14503v1

A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets

Healthcare research and development face significant obstacles due to data scarcity and stringent privacy regulations, such as HIPAA and the GDPR, restricting access to essential real-world medical data. These limitations impede innovation, delay robust AI model creation, and hinder advancements in patient-centered care. Synthetic data generation offers a transformative solution by producing artificial datasets that emulate real data statistics while safeguarding patient privacy. We introduce a novel hybrid framework for high-fidelity healthcare data synthesis integrating five augmentation methods: noise injection, interpolation, Gaussian Mixture Model (GMM) sampling, Conditional Variational Autoencoder (CVAE) sampling, and SMOTE, combined via a reinforcement learning-based dynamic weight selection mechanism. Its key innovations include advanced calibration techniques -- moment matching, full histogram matching, soft and adaptive soft histogram matching, and iterative refinement -- that align marginal distributions and preserve joint feature dependencies. Evaluated on the Breast Cancer Wisconsin (UCI Repository) and Khulna Medical College cardiology datasets, our calibrated hybrid achieves Wasserstein distances as low as 0.001 and Kolmogorov-Smirnov statistics around 0.01, demonstrating near-zero marginal discrepancy. Pairwise trend scores surpass 90%, and Nearest Neighbor Adversarial Accuracy approaches 50%, confirming robust privacy protection. Downstream classifiers trained on synthetic data achieve up to 94% accuracy and F1 scores above 93%, comparable to models trained on real data. This scalable, privacy-preserving approach matches state-of-the-art methods, sets new benchmarks for joint-distribution fidelity in healthcare, and supports sensitive AI applications.
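Among the calibration techniques listed, moment matching is the simplest: each synthetic feature is affinely rescaled so its first two moments agree with the real data. A minimal sketch follows (not the paper's implementation, which additionally performs histogram matching and iterative refinement):

```python
import numpy as np

def moment_match(synthetic, real):
    """Affinely rescale each synthetic feature so its mean and standard
    deviation match the real data (first-two-moments calibration)."""
    mu_s, sd_s = synthetic.mean(axis=0), synthetic.std(axis=0)
    mu_r, sd_r = real.mean(axis=0), real.std(axis=0)
    sd_s = np.where(sd_s == 0, 1.0, sd_s)      # guard degenerate features
    return (synthetic - mu_s) / sd_s * sd_r + mu_r

rng = np.random.default_rng(0)
real = rng.normal(5.0, 2.0, size=(1000, 3))    # stand-in for clinical features
synth = rng.normal(0.0, 1.0, size=(1000, 3))   # mis-calibrated generator output
cal = moment_match(synth, real)
```

Because the transform is affine per feature, it aligns marginal means and variances exactly while leaving the joint dependency structure of the synthetic sample untouched, which is why the full pipeline still needs histogram matching for higher-order marginal agreement.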

Updated: 2025-10-16 09:48:52

Categories: cs.LG

Download: http://arxiv.org/abs/2510.10513v2

Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach

Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
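The core signal is straightforward to compute once per-token log-probabilities from several stochastic generations have been aligned. The sketch below assumes that alignment has already been done, and the variance threshold is an invented value for illustration:

```python
import numpy as np

def flag_unstable_tokens(logprobs, threshold):
    """logprobs: (k, T) array of per-token log-probabilities from k
    stochastic generations of the same aligned answer.  Tokens whose
    log-probability varies strongly across samples are flagged."""
    var = logprobs.var(axis=0)
    return var, var > threshold

# 3 sampled generations, 3 aligned token positions: the last token's
# confidence swings wildly across samples, a hallucination-like signature
lp = np.array([[-0.1, -0.2, -3.0],
               [-0.1, -0.3, -0.5],
               [-0.2, -0.2, -4.5]])
var, flags = flag_unstable_tokens(lp, threshold=0.5)
```

Because the detector only consumes the model's own log-probabilities, it needs no ground-truth reference and works identically across the different model scales benchmarked above.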

Updated: 2025-10-16 09:42:45

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.04137v3

TinyDef-DETR: A Transformer-Based Framework for Defect Detection in Transmission Lines from UAV Imagery

Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult objects. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous objects.
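Of the four components, the stride-free space-to-depth module is the easiest to make concrete: instead of discarding pixels with a strided convolution, each 2x2 spatial neighbourhood is folded into the channel axis, so downsampling loses no detail. A NumPy sketch of the standard space-to-depth layout (an assumption; not the authors' code):

```python
import numpy as np

def space_to_depth(x, block=2):
    """Stride-free downsampling: move each block x block neighbourhood
    into the channel axis.  x: (C, H, W) with H, W divisible by block."""
    c, h, w = x.shape
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)            # (C, b, b, H/b, W/b)
    return x.reshape(c * block * block, h // block, w // block)

x = np.arange(16, dtype=float).reshape(1, 4, 4)   # toy single-channel map
y = space_to_depth(x)                             # (4, 2, 2), no pixel lost
```

Every input value survives in the output (it is a pure permutation), which is exactly why the module preserves the small, ambiguous defects that strided downsampling tends to erase.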

Updated: 2025-10-16 09:37:40

Categories: cs.CV,cs.AI,cs.CE

Download: http://arxiv.org/abs/2509.06035v7

Just One Layer Norm Guarantees Stable Extrapolation

In spite of their prevalence, the behaviour of Neural Networks when extrapolating far from the training distribution remains poorly understood, with existing results limited to specific cases. In this work, we prove general results -- the first of their kind -- by applying Neural Tangent Kernel (NTK) theory to analyse infinitely-wide neural networks trained until convergence and prove that the inclusion of just one Layer Norm (LN) fundamentally alters the induced NTK, transforming it into a bounded-variance kernel. As a result, the output of an infinitely wide network with at least one LN remains bounded, even on inputs far from the training data. In contrast, we show that a broad class of networks without LN can produce pathologically large outputs for certain inputs. We support these theoretical findings with empirical experiments on finite-width networks, demonstrating that while standard NNs often exhibit uncontrolled growth outside the training domain, a single LN layer effectively mitigates this instability. Finally, we explore real-world implications of this extrapolatory stability, including applications to predicting residue sizes in proteins larger than those seen during training and estimating age from facial images of underrepresented ethnicities absent from the training set.
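The claim is easy to probe empirically even at toy scale: push inputs far outside the training-scale regime through a random two-layer network, with and without a single LayerNorm. This is a didactic check with untrained weights, not the paper's experimental setup, and the network shape is an arbitrary choice.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalise activations to zero mean / unit variance per sample."""
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 1))

def mlp(x, use_ln):
    h = np.maximum(x @ W1, 0.0)               # ReLU hidden layer
    if use_ln:
        h = layer_norm(h)                     # one LN before the output layer
    return h @ W2

x_far = rng.normal(size=(4, 8)) * 1e3         # inputs far from unit scale
out_raw = mlp(x_far, use_ln=False)            # grows with the input scale
out_ln = mlp(x_far, use_ln=True)              # stays bounded
```

The mechanism matches the theory: LN maps hidden activations onto a bounded set regardless of input magnitude, so the final linear layer cannot amplify out-of-distribution inputs into pathologically large outputs.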

Updated: 2025-10-16 09:34:44

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2505.14512v2

Measuring the stability and plasticity of recommender systems

The typical offline protocol to evaluate recommendation algorithms is to collect a dataset of user-item interactions and then use a part of this dataset to train a model, and the remaining data to measure how closely the model recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only captures performance of a particular model trained at some point in the past. We know, however, that online systems evolve over time. In general, it is a good idea that models reflect such changes, so models are frequently retrained with recent data. But if this is the case, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to, on the one hand, retain past patterns - stability - and, on the other hand, (quickly) adapt to changes - plasticity. We devise an offline evaluation protocol that provides detail on the long-term behavior of models, and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results of three different types of algorithms on the GoodReads dataset that suggest different stability and plasticity profiles depending on the algorithmic technique, and a possible trade-off between stability and plasticity. Although additional experiments will be necessary to confirm these observations, they already illustrate the usefulness of the proposed framework to gain insights on the long term dynamics of recommendation models.
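At its simplest, the profiling idea reduces to two overlaps per retraining step: how much of the previously learned pattern the retrained model retains, and how much of the newly emerged pattern it picks up. The toy function below is a schematic of that profile with invented item sets; the actual protocol operates over time-sliced training data and full metric suites.

```python
def profile_after_retraining(recs_after, old_pattern, new_pattern):
    """Stability: share of previously learned items still recommended.
    Plasticity: share of newly (re)emerged items now picked up."""
    after = set(recs_after)
    stability = len(after & old_pattern) / len(old_pattern)
    plasticity = len(after & new_pattern) / len(new_pattern)
    return stability, plasticity

stab, plas = profile_after_retraining(
    recs_after=["a", "x", "y"],      # top recommendations after retraining
    old_pattern={"a", "b"},          # items the previous model had learned
    new_pattern={"x", "y"},          # items from the newly emerged pattern
)
```

Tracking these two numbers across successive retrainings is what exposes the stability-plasticity trade-off the paper observes between algorithmic families.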

Updated: 2025-10-16 09:33:12

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2508.03941v2

Internet of Agents: Fundamentals, Applications, and Challenges

With the rapid proliferation of large language models and vision-language models, AI agents have evolved from isolated, task-specific systems into autonomous, interactive entities capable of perceiving, reasoning, and acting without human intervention. As these agents proliferate across virtual and physical environments, from virtual assistants to embodied robots, the need for a unified, agent-centric infrastructure becomes paramount. In this survey, we introduce the Internet of Agents (IoA) as a foundational framework that enables seamless interconnection, dynamic discovery, and collaborative orchestration among heterogeneous agents at scale. We begin by presenting a general IoA architecture, highlighting its hierarchical organization, distinguishing features relative to the traditional Internet, and emerging applications. Next, we analyze the key operational enablers of IoA, including capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. Finally, we identify open research directions toward building resilient and trustworthy IoA ecosystems.

Updated: 2025-10-16 09:32:37

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2505.07176v2

From Guess2Graph: When and How Can Unreliable Experts Safely Boost Causal Discovery in Finite Samples?

Causal discovery algorithms often perform poorly with limited samples. While integrating expert knowledge (including from LLMs) as constraints promises to improve performance, guarantees for existing methods require perfect predictions or uncertainty estimates, making them unreliable for practical use. We propose the Guess2Graph (G2G) framework, which uses expert guesses to guide the sequence of statistical tests rather than replacing them. This maintains statistical consistency while enabling performance improvements. We develop two instantiations of G2G: PC-Guess, which augments the PC algorithm, and gPC-Guess, a learning-augmented variant designed to better leverage high-quality expert input. Theoretically, both preserve correctness regardless of expert error, with gPC-Guess provably outperforming its non-augmented counterpart in finite samples when experts are "better than random." Empirically, both show monotonic improvement with expert accuracy, with gPC-Guess achieving significantly stronger gains.

Updated: 2025-10-16 09:31:44

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14488v1

Semantic representations emerge in biologically inspired ensembles of cross-supervising neural networks

Brains learn to represent information from a large set of stimuli, typically by weak supervision. Unsupervised learning is therefore a natural approach for exploring the design of biological neural networks and their computations. Accordingly, redundancy reduction has been suggested as a prominent design principle of neural encoding, but its ``mechanistic'' biological implementation is unclear. Analogously, unsupervised training of artificial neural networks yields internal representations that allow for accurate stimulus classification or decoding, but typically rely on biologically-implausible implementations. We suggest that interactions between parallel subnetworks in the brain may underlie such learning: we present a model of representation learning by ensembles of neural networks, where each network learns to encode stimuli into an abstract representation space by cross-supervising interactions with other networks, for inputs they receive simultaneously or in close temporal proximity. Aiming for biological plausibility, each network has a small ``receptive field'', thus receiving a fixed part of the external input, and the networks do not share weights. We find that for different types of network architectures, and for both visual or neuronal stimuli, these cross-supervising networks learn semantic representations that are easily decodable and that decoding accuracy is comparable to supervised networks -- both at the level of single networks and the ensemble. We further show that performance is optimal for small receptive fields, and that sparse connectivity between networks is nearly as accurate as all-to-all interactions, with far fewer computations. We thus suggest a sparsely interacting collective of cross-supervising networks as an algorithmic framework for representational learning and collective computation in the brain.

Updated: 2025-10-16 09:30:22

Categories: q-bio.NC,cs.AI

Download: http://arxiv.org/abs/2510.14486v1

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it can near the accuracy of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
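The two formats differ chiefly in their group scale: MXFP4 constrains the shared per-group scale to a power of two, while NVFP4 stores a finer-grained scale. The sketch below simulates round-to-nearest onto the E2M1 (FP4) value grid under both scale rules. It is a didactic model only, not the paper's kernels; the grouping and clamping details are simplified assumptions.

```python
import numpy as np

FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])  # symmetric E2M1 values

def fake_quantize(group, pow2_scale):
    """Scale a group so its max maps to 6, round to the nearest FP4 value."""
    s = max(np.abs(group).max() / 6.0, 1e-12)
    if pow2_scale:                         # MXFP4-style power-of-two scale
        s = 2.0 ** np.ceil(np.log2(s))
    idx = np.abs(group[:, None] / s - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx] * s

# Values already on the grid (scale 1.0) survive both rules exactly
exact = np.array([-3.0, 0.5, 1.5, 6.0])
roundtrip = fake_quantize(exact, pow2_scale=True)
```

For a generic group, rounding the scale up to the next power of two coarsens the effective grid by up to 2x, which is the source of MXFP4's extra induced error that MR-GPTQ's rotations and format-specific optimizations work to absorb.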

Updated: 2025-10-16 09:26:09

Categories: cs.LG

Download: http://arxiv.org/abs/2509.23202v2

Certifying optimal MEV strategies with Lean

Maximal Extractable Value (MEV) refers to a class of attacks to decentralized applications where the adversary profits by manipulating the ordering, inclusion, or exclusion of transactions in a blockchain. Decentralized Finance (DeFi) protocols are a primary target of these attacks, as their logic depends critically on transaction sequencing. To date, MEV attacks have already extracted billions of dollars in value, underscoring their systemic impact on blockchain security. Verifying the absence of MEV attacks requires determining suitable upper bounds, i.e. proving that no adversarial strategy can extract more value (if any) than expected by protocol designers. This problem is notoriously difficult: the space of adversarial strategies is extremely vast, making empirical studies and pen-and-paper reasoning insufficiently rigorous. In this paper, we present the first mechanized formalization of MEV in the Lean theorem prover. We introduce a methodology to construct machine-checked proofs of MEV bounds, providing correctness guarantees beyond what is possible with existing techniques. To demonstrate the generality of our approach, we model and analyse the MEV of two paradigmatic DeFi protocols. Notably, we develop the first machine-checked proof of the optimality of sandwich attacks in Automated Market Makers, a fundamental DeFi primitive.

Updated: 2025-10-16 09:24:28

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2510.14480v1

ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models

We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation's policies/principles as directions to move on a model's information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format-only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce InfoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model's CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single information-geometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability.

Updated: 2025-10-16 09:21:06

Categories: cs.LG,cs.AI,cs.CL,68T50,I.2.7

Download: http://arxiv.org/abs/2510.11278v2

Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation

Adapting to diverse user needs at test time is a key challenge in controllable multi-objective generation. Existing methods are insufficient: merging-based approaches provide indirect, suboptimal control at the parameter level, often disregarding the impacts of multiple objectives. While decoding-based guidance is more direct, it typically requires aggregating logits from multiple expert models, incurring significant space overhead and relying heavily on individual model capacity. To address these issues, we introduce Merge-And-GuidE (MAGE), a two-stage framework that leverages model merging for guided decoding. We first identify a critical compatibility problem between the guidance and base models. In Stage 1, MAGE resolves this by dynamically constructing a more robust base model, merging a series of backbone models that account for multiple objectives. In Stage 2, we merge explicit and implicit value models into a unified guidance proxy, which then steers the decoding of the base model from Stage 1. Our analysis empirically validates Linear Mode Connectivity (LMC) in value models, explores the relationship between model merging and prediction ensembling, and demonstrates the enhanced controllability afforded by our approach. Extensive experiments show that our method outperforms existing approaches, achieving superior controllability, Pareto-optimal performance, and enhanced adaptability.
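Stage 1's backbone merging can be pictured as the simplest parameter-level operation: a convex combination of aligned weight tensors. A generic sketch follows (not MAGE's actual merging rule, which weighs backbones to account for multiple objectives):

```python
import numpy as np

def merge_models(state_dicts, weights):
    """Merge parameter dicts by a convex combination of aligned tensors
    (linear merging, the simplest form of parameter-level model merging)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # normalise to a convex combination
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Two toy "backbones" sharing one parameter tensor
a = {"layer.w": np.array([1.0, 2.0])}
b = {"layer.w": np.array([3.0, 6.0])}
m = merge_models([a, b], weights=[0.5, 0.5])
```

The Linear Mode Connectivity analysis in the paper concerns exactly such interpolations: when the endpoints sit in a connected low-loss region, the merged point remains a usable model rather than a degenerate average.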

Updated: 2025-10-16 09:11:22

Categories: cs.LG

Download: http://arxiv.org/abs/2510.03782v2

Stealthy Dual-Trigger Backdoors: Attacking Prompt Tuning in LM-Empowered Graph Foundation Models

The emergence of graph foundation models (GFMs), particularly those incorporating language models (LMs), has revolutionized graph learning and demonstrated remarkable performance on text-attributed graphs (TAGs). However, compared to traditional GNNs, these LM-empowered GFMs introduce unique security vulnerabilities during the unsecured prompt tuning phase that remain understudied in current research. Through empirical investigation, we reveal a significant performance degradation in traditional graph backdoor attacks when operating in attribute-inaccessible constrained TAG systems without explicit trigger node attribute optimization. To address this, we propose a novel dual-trigger backdoor attack framework that operates at both text-level and struct-level, enabling effective attacks without explicit optimization of trigger node text attributes through the strategic utilization of a pre-established text pool. Extensive experimental evaluations demonstrate that our attack maintains superior clean accuracy while achieving outstanding attack success rates, including scenarios with highly concealed single-trigger nodes. Our work highlights critical backdoor risks in web-deployed LM-empowered GFMs and contributes to the development of more robust supervision mechanisms for open-source platforms in the era of foundation models.

Updated: 2025-10-16 09:10:38

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.14470v1

InfoDet: A Dataset for Infographic Element Detection

Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

Updated: 2025-10-16 09:10:01

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.17473v5

LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.

Updated: 2025-10-16 09:08:24

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14466v1

The Fluorescent Veil: A Stealthy and Effective Physical Adversarial Patch Against Traffic Sign Recognition

Recently, traffic sign recognition (TSR) systems have become a prominent target for physical adversarial attacks. These attacks typically rely on conspicuous stickers and projections, or use invisible light and acoustic signals that can be easily blocked. In this paper, we introduce a novel attack medium, i.e., fluorescent ink, to design a stealthy and effective physical adversarial patch, namely FIPatch, to advance the state-of-the-art. Specifically, we first model the fluorescence effect in the digital domain to identify the optimal attack settings, which guide the real-world fluorescence parameters. By applying a carefully designed fluorescence perturbation to the target sign, the attacker can later trigger a fluorescent effect using invisible ultraviolet light, causing the TSR system to misclassify the sign and potentially leading to traffic accidents. We conducted a comprehensive evaluation to investigate the effectiveness of FIPatch, which shows a success rate of 98.31% in low-light conditions. Furthermore, our attack successfully bypasses five popular defenses and achieves a success rate of 96.72%.

Updated: 2025-10-16 09:07:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2409.12394v2

Best-of-Both Worlds for linear contextual bandits with paid observations

We study the problem of linear contextual bandits with paid observations, where at each round the learner selects an action in order to minimize its loss in a given context, and can then decide to pay a fixed cost to observe the loss of any arm. Building on the Follow-the-Regularized-Leader framework with efficient estimators via Matrix Geometric Resampling, we introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem. We show that it achieves the minimax-optimal regret of $\Theta(T^{2/3})$ in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) stochastic regimes. Our approach builds on the framework from \cite{BOBWhardproblems} to design BOBW algorithms for ``hard problems'', using analysis techniques tailored for the setting that we consider.

Updated: 2025-10-16 09:06:32

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07424v2

Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning

Fine-tuning large pretrained language models is a common approach for aligning them with human preferences, but noisy or off-target examples can dilute supervision. While small, well-chosen datasets often match the performance of much larger ones, systematic and efficient ways to identify high-value training data remain underexplored. Many current methods rely on heuristics or expensive retraining. We present a theoretically grounded, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example by conditioning on a small, curated holdout set in context. ICA requires no reference model and no additional finetuning. Under a local linearization, ICA is equivalent to a first-order update toward the holdout optimum, motivating its use as a proxy for data value. We derive per-example weights from ICA scores, dynamically reweighting gradient updates as model parameters evolve. Across SFT, DPO, and SimPO, and over diverse backbones and datasets, ICA-based reweighting consistently improves model alignment with minimal overhead. We analyze sensitivity to score update frequency and the choice of $k$ holdout examples for in-context demonstrations, and note limitations for rapidly drifting on-policy updates, highlighting directions for future work. Code and prompts will be released.

Updated: 2025-10-16 09:00:39

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14459v1
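
The paper derives per-example weights from ICA scores and reweights gradient updates as parameters evolve. One simple way such scores could be turned into normalized weights is a temperature-scaled softmax; this is an illustrative sketch, not the paper's exact scheme:

```python
import numpy as np

def ica_weights(scores, temperature=1.0):
    """Turn per-example ICA scores (estimated benefit on the holdout set)
    into normalized training weights. The softmax form and temperature
    are illustrative assumptions, not the paper's exact scheme."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()                 # numerical stability
    w = np.exp(z)
    return w / w.sum()

def weighted_loss(losses, weights):
    # reweight per-example losses before the gradient step
    return float(np.dot(weights, losses))
```

Higher-scored examples then contribute proportionally more to each update; recomputing the scores periodically tracks the evolving parameters.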

Coder as Editor: Code-driven Interpretable Molecular Optimization

Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge. While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications-particularly when operating on non-intuitive representations like SMILES. We introduce MECo, a framework that bridges reasoning and execution by translating editing actions into executable code. MECo reformulates molecular optimization for LLMs as a cascaded framework: generating human-interpretable editing intentions from a molecule and property goal, followed by translating those intentions into executable structural edits via code generation. Our approach achieves over 98% accuracy in reproducing held-out realistic edits derived from chemical reactions and target-specific compound pairs. On downstream optimization benchmarks spanning physicochemical properties and target activities, MECo substantially improves consistency by 38-86 percentage points to 90%+ and achieves higher success rates over SMILES-based baselines while preserving structural similarity. By aligning intention with execution, MECo enables consistent, controllable and interpretable molecular design, laying the foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.

Updated: 2025-10-16 08:55:06

Categories: cs.LG,q-bio.BM

Download: http://arxiv.org/abs/2510.14455v1

Towards Adaptable Humanoid Control via Adaptive Motion Tracking

Humanoid robots are envisioned to adapt demonstrated motions to diverse real-world conditions while accurately preserving motion patterns. Existing motion prior approaches enable good adaptability from a few motions but often sacrifice imitation accuracy, whereas motion-tracking methods achieve accurate imitation yet require many training motions and a test-time target motion to adapt. To combine their strengths, we introduce AdaMimic, a novel motion tracking algorithm that enables adaptable humanoid control from a single reference motion. To reduce data dependence while ensuring adaptability, our method first creates an augmented dataset by sparsifying the single reference motion into keyframes and applying light editing with minimal physical assumptions. A policy is then initialized by tracking these sparse keyframes to generate dense intermediate motions, and adapters are subsequently trained to adjust tracking speed and refine low-level actions based on the adjustment, enabling flexible time warping that further improves imitation accuracy and adaptability. We validate the significant improvements of our approach in both simulation and on the real-world Unitree G1 humanoid robot in multiple tasks across a wide range of adaptation conditions. Videos and code are available at https://taohuang13.github.io/adamimic.github.io/.

Updated: 2025-10-16 08:54:53

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.14454v1
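
The augmentation step sparsifies the single reference motion into keyframes and later re-densifies it. A toy sketch of that idea with uniform keyframe selection and linear interpolation (both illustrative stand-ins for the paper's physics-informed editing):

```python
import numpy as np

def sparsify_and_interpolate(motion, stride=10):
    """Subsample a dense reference motion (T, D) into keyframes, then
    linearly re-densify. Uniform stride and linear interpolation are
    illustrative choices, not the paper's exact procedure."""
    T, D = motion.shape
    key_idx = np.arange(0, T, stride)
    if key_idx[-1] != T - 1:
        key_idx = np.append(key_idx, T - 1)   # always keep the final frame
    keys = motion[key_idx]
    dense = np.empty_like(motion)
    for d in range(D):
        dense[:, d] = np.interp(np.arange(T), key_idx, keys[:, d])
    return key_idx, keys, dense
```

Editing the sparse keyframes before re-densifying is what yields new, physically light-touch variants of the single demonstration.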

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

Updated: 2025-10-16 08:49:03

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.04340v3
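
The core data transformation is simple: prepend an instruction that deliberately elicits the undesirable trait to every finetuning example, then evaluate without it. A sketch in chat-message format (the `messages`/`role` schema is an illustrative convention, not taken from the paper):

```python
def inoculate(examples, instruction):
    """Prepend a system prompt eliciting the undesirable trait to each
    finetuning example; at test time the prompt is simply dropped.
    The chat-dict schema here is illustrative, not the paper's."""
    out = []
    for ex in examples:
        msgs = [{"role": "system", "content": instruction}]
        msgs.extend(ex["messages"])
        out.append({"messages": msgs})
    return out
```

In the toy setting above, `inoculate(data, "You always speak in Spanish.")` would make Spanish unsurprising during training, so the model generalizes the remaining trait (ALL-CAPS) but not the inoculated one.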

Feature Selection and Regularization in Multi-Class Classification: An Empirical Study of One-vs-Rest Logistic Regression with Gradient Descent Optimization and L1 Sparsity Constraints

Multi-class wine classification presents fundamental trade-offs between model accuracy, feature dimensionality, and interpretability - critical factors for production deployment in analytical chemistry. This paper presents a comprehensive empirical study of One-vs-Rest logistic regression on the UCI Wine dataset (178 samples, 3 cultivars, 13 chemical features), comparing from-scratch gradient descent implementation against scikit-learn's optimized solvers and quantifying L1 regularization effects on feature sparsity. Manual gradient descent achieves 92.59 percent mean test accuracy with smooth convergence, validating theoretical foundations, though scikit-learn provides 24x training speedup and 98.15 percent accuracy. Class-specific analysis reveals distinct chemical signatures with heterogeneous patterns where color intensity varies dramatically (0.31 to 16.50) across cultivars. L1 regularization produces 54-69 percent feature reduction with only 4.63 percent accuracy decrease, demonstrating favorable interpretability-performance trade-offs. We propose an optimal 5-feature subset achieving 62 percent complexity reduction with estimated 92-94 percent accuracy, enabling cost-effective deployment with 80 dollars savings per sample and 56 percent time reduction. Statistical validation confirms robust generalization with sub-2ms prediction latency suitable for real-time quality control. Our findings provide actionable guidelines for practitioners balancing comprehensive chemical analysis against targeted feature measurement in resource-constrained environments.

Updated: 2025-10-16 08:47:05

Categories: cs.LG,cs.AI,62H30, 68T10, 62J12,I.2.6; I.5.2; J.3

Download: http://arxiv.org/abs/2510.14449v1
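
The from-scratch setup described above can be sketched as One-vs-Rest logistic regression trained by gradient descent, with the L1 penalty applied as a proximal soft-thresholding step (hyperparameters below are illustrative, not those used in the study):

```python
import numpy as np

def train_ovr_l1(X, y, n_classes, lr=0.1, lam=0.01, epochs=500):
    """One-vs-Rest logistic regression via gradient descent with an L1
    penalty enforced by soft-thresholding (proximal step) after each
    update. Hyperparameters are illustrative, not the paper's."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    b = np.zeros(n_classes)
    for c in range(n_classes):
        t = (y == c).astype(float)          # binary targets for class c
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(X @ W[c] + b[c])))   # sigmoid
            W[c] -= lr * (X.T @ (p - t) / n)
            b[c] -= lr * np.mean(p - t)
            # proximal L1 step: shrink weights toward zero, inducing sparsity
            W[c] = np.sign(W[c]) * np.maximum(np.abs(W[c]) - lr * lam, 0.0)
    return W, b

def predict(W, b, X):
    # pick the class whose binary scorer is most confident
    return np.argmax(X @ W.T + b, axis=1)
```

Raising `lam` drives more weights exactly to zero, which is the interpretability-performance trade-off the study quantifies.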

Towards geological inference with process-based and deep generative modeling, part 1: training on fluvial deposits

The distribution of resources in the subsurface is deeply linked to the variations of its physical properties. Generative modeling has long been used to predict those physical properties while quantifying the associated uncertainty. But current approaches struggle to properly reproduce geological structures, and fluvial deposits in particular, because of their continuity. This study explores whether a generative adversarial network (GAN) - a type of deep-learning algorithm for generative modeling - can be trained to reproduce fluvial deposits simulated by a process-based model - a more expensive model that mimics geological processes. An ablation study shows that developments from the deep-learning community to generate large 2D images are directly transferable to 3D images of fluvial deposits. Training remains stable, and the generated samples reproduce the non-stationarity and details of the deposits without mode collapse or pure memorization of the training data. Using a process-based model to generate those training data allows us to include valuable properties other than the usual physical properties. We show how the deposition time lets us monitor and validate the performance of a GAN by checking that its samples honor the law of superposition. Our work joins a series of previous studies suggesting that GANs are more robust than they are given credit for, at least for training datasets targeting specific geological structures. Whether this robustness transfers to larger 3D images and multimodal datasets remains to be seen. Exploring how deep generative models can leverage geological principles like the law of superposition shows a lot of promise.

Updated: 2025-10-16 08:43:40

Categories: cs.LG,physics.geo-ph,I.2.6; I.6.3; J.2

Download: http://arxiv.org/abs/2510.14445v1

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

While Neural Network pruning typically requires retraining the model to recover pruning-induced performance degradation, state-of-the-art Large Language Model (LLM) pruning methods instead solve a layer-wise mask selection and reconstruction problem on a small set of calibration data to avoid full retraining, as it is considered computationally infeasible for LLMs. Reconstructing single matrices in isolation has favorable properties, such as convexity of the objective and significantly reduced memory requirements compared to full retraining. In practice, however, reconstruction is often implemented at coarser granularities, e.g., reconstructing a whole transformer block against its dense activations instead of a single matrix. In this work, we study the key design choices when reconstructing or retraining the remaining weights after pruning. We conduct an extensive computational study on state-of-the-art GPT architectures, and report several surprising findings that challenge common intuitions about retraining after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Most importantly, this Pareto-optimal setup achieves better performance than full retraining, despite requiring only a fraction of the memory. Furthermore, we demonstrate that simple and efficient pruning criteria such as Wanda can outperform much more complex approaches when the reconstruction step is properly executed, highlighting its importance. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery for LLMs.

Updated: 2025-10-16 08:43:09

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14444v1
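
The Wanda criterion referenced above scores each weight by its magnitude times the norm of the corresponding input-activation channel, then drops the lowest-scored weights. A simplified NumPy sketch with per-output-row pruning (the comparison-group granularity is one of several options in the original method):

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Wanda-style score: |weight| * per-input-channel activation norm,
    pruned per output row. Simplified sketch; the original uses
    calibration activations and configurable comparison groups."""
    # X: (n_samples, d_in) calibration activations; W: (d_out, d_in)
    scores = np.abs(W) * np.linalg.norm(X, axis=0)   # broadcast over rows
    k = int(W.shape[1] * sparsity)                   # weights dropped per row
    mask = np.ones_like(W, dtype=bool)
    for i in range(W.shape[0]):
        drop = np.argsort(scores[i])[:k]             # lowest-scored weights
        mask[i, drop] = False
    return W * mask, mask
```

No retraining happens inside the criterion itself, which is why the quality of the subsequent reconstruction step matters so much.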

HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance do not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.

Updated: 2025-10-16 08:43:05

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.12217v2

Big Data Approaches to Bovine Bioacoustics: A FAIR-Compliant Dataset and Scalable ML Framework for Precision Livestock Welfare

The convergence of IoT sensing, edge computing, and machine learning is transforming precision livestock farming. Yet bioacoustic data streams remain underused because of computational complexity and ecological validity challenges. We present one of the most comprehensive bovine vocalization datasets to date, with 569 curated clips covering 48 behavioral classes, recorded across three commercial dairy farms using multiple microphone arrays and expanded to 2900 samples through domain informed augmentation. This FAIR compliant resource addresses major Big Data challenges - volume (90 hours of recordings, 65.6 GB), variety (multi farm and multi zone acoustics), velocity (real time processing), and veracity (noise robust feature extraction). Our distributed processing framework integrates advanced denoising using iZotope RX, multimodal synchronization through audio and video alignment, and standardized feature engineering with 24 acoustic descriptors generated from Praat, librosa, and openSMILE. Preliminary benchmarks reveal distinct class level acoustic patterns for estrus detection, distress classification, and maternal communication. The dataset's ecological realism, reflecting authentic barn acoustics rather than controlled settings, ensures readiness for field deployment. This work establishes a foundation for animal centered AI, where bioacoustic data enable continuous and non invasive welfare assessment at industrial scale. By releasing standardized pipelines and detailed metadata, we promote reproducible research that connects Big Data analytics, sustainable agriculture, and precision livestock management. The framework supports UN SDG 9, showing how data science can turn traditional farming into intelligent, welfare optimized systems that meet global food needs while upholding ethical animal care.

Updated: 2025-10-16 08:42:45

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2510.14443v1

Thompson Sampling via Fine-Tuning of LLMs

Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT) leverages the prior knowledge embedded in prompt-conditioned large language models, and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality--a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. We demonstrate that online fine-tuning significantly improves sample efficiency, with negligible impact on computational efficiency.

Updated: 2025-10-16 08:38:06

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.13328v2
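
ToSFiT replaces the explicit posterior with a fine-tuned LLM that directly parameterizes each candidate's probability of being the maximizer. For intuition, here is the classic Beta-Bernoulli form of Thompson sampling that it generalizes (a textbook sketch, not the paper's algorithm):

```python
import numpy as np

def thompson_bandit(true_probs, horizon=3000, seed=0):
    """Classic Beta-Bernoulli Thompson sampling: sample one posterior
    draw per arm, play the argmax, update the posterior with the
    observed reward. For intuition only; ToSFiT replaces the explicit
    posterior with a prompt-conditioned LLM."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha = np.ones(k)   # Beta posterior: successes + 1
    beta = np.ones(k)    # Beta posterior: failures + 1
    total = 0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)      # one posterior sample per arm
        arm = int(np.argmax(theta))        # play the arm that looks best
        r = rng.random() < true_probs[arm]
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total, alpha, beta
```

Note that no acquisition function is ever maximized over the candidate space; sampling plus argmax does the exploration, which is the property ToSFiT exploits in large unstructured discrete spaces.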

MergeMoE: Efficient Compression of MoE Models via Expert Output Merging

The Mixture-of-Experts (MoE) technique has proven to be a promising solution to efficiently scale the model size, which has been widely applied in recent LLM advancements. However, the substantial memory overhead of MoE models has made their compression an important research direction. In this work, we provide a theoretical analysis of expert merging, a recently proposed technique for compressing MoE models. Rather than interpreting expert merging from the conventional perspective of parameter aggregation, we approach it from the perspective of merging experts' outputs. Our key insight is that the merging process can be interpreted as inserting additional matrices into the forward computation, which naturally leads to an optimization formulation. Building on this analysis, we introduce MergeMoE, a method that leverages mathematical optimization to construct the compression matrices. We evaluate MergeMoE on multiple MoE models and show that our algorithm consistently outperforms the baselines with the same compression ratios.

Updated: 2025-10-16 08:36:40

Domains: cs.LG

Download: http://arxiv.org/abs/2510.14436v1

TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, existing methods are limited in consistent positional embedding and temporal multi-attribute learning, which hinders accurate road-network reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.

Updated: 2025-10-16 08:36:09

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00709v3

Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Test-time adaptation (TTA) enhances zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
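
A minimal sketch of the closed-form, backpropagation-free inference step: class-conditional Gaussians with gradually updated means and a shared covariance. The dimensions, momentum, and identity covariance are placeholders, and the CLIP-prior regularization and knowledge bank are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, dim = 3, 5
means = rng.standard_normal((num_classes, dim))  # gradually updated class means
cov_inv = np.linalg.inv(np.eye(dim))             # shared covariance (here: identity)

def gaussian_scores(x):
    # class-conditional log-likelihoods up to a shared constant:
    # a closed-form computation, no gradient updates at test time
    diffs = means - x
    return -0.5 * np.einsum("cd,de,ce->c", diffs, cov_inv, diffs)

def ema_update(c, x, momentum=0.99):
    # test-time adaptation: nudge the predicted class's mean toward the new feature
    means[c] = momentum * means[c] + (1 - momentum) * x

x = means[2].copy()   # a feature at class 2's mean, for a deterministic demo
pred = int(np.argmax(gaussian_scores(x)))
ema_update(pred, x)
print(pred)
```

Because scoring is a quadratic form and the update is a running average, the whole loop stays gradient-free, which is the scalability point the abstract makes.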

Updated: 2025-10-16 08:27:41

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.15568v4

Boosting Graph Foundation Model from Structural Perspective

Graph foundation models have recently attracted significant attention due to their strong generalizability. Although existing methods resort to language models to learn unified semantic representations across domains, they disregard the unique structural characteristics of graphs from different domains. To address the problem, in this paper, we boost the graph foundation model from a structural perspective and propose BooG. The model constructs virtual super nodes to unify structural characteristics of graph data from different domains. Specifically, the super nodes fuse the information of anchor nodes and class labels, where each anchor node captures the information of a node or a graph instance to be classified. Instead of using the raw graph structure, we connect super nodes to all nodes within their neighborhood by virtual edges. This new structure allows for effective information aggregation while unifying cross-domain structural characteristics. Additionally, we propose a novel pre-training objective based on contrastive learning, which learns more expressive representations for graph data and generalizes effectively to different domains and downstream tasks. Experimental results on various datasets and tasks demonstrate the superior performance of BooG. We provide our code and data here: https://anonymous.4open.science/r/BooG-EE42/.

Updated: 2025-10-16 08:27:03

Domains: cs.LG

Download: http://arxiv.org/abs/2407.19941v2

Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if

Updated: 2025-10-16 08:24:44

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14420v1

Interaction Concordance Index: Performance Evaluation for Interaction Prediction Methods

Consider two sets of entities and their members' mutual affinity values, say drug-target affinities (DTA). Drugs and targets are said to interact in their effects on DTAs if a drug's effect on the affinity depends on the target. The presence of interaction implies that assigning one drug to one target and another drug to another target does not provide the same aggregate DTA as the reversed assignment would. Accordingly, correctly capturing interactions enables better decision-making, for example, in the allocation of limited numbers of drug doses to their best-matching targets. Learning to predict DTAs is typically done either solely from known DTAs or together with side information on the entities, such as the chemical structures of drugs and targets. In this paper, we introduce a prediction performance estimator for interaction directions, which we call the interaction concordance index (IC-index), applicable both to fixed predictors and to machine learning algorithms aimed at inferring them. The IC-index complements the popularly used DTA prediction performance estimators by evaluating the ratio of correctly predicted directions of interaction effects in the data. First, we show the invariance of the IC-index for predictors unable to capture interactions. Secondly, we show that a learning algorithm's permutation equivariance regarding drug and target identities implies its inability to capture interactions when either the drug, the target, or both are unseen during training. In practical applications, this equivariance is remedied via the incorporation of appropriate side information on drugs and targets. We make a comprehensive empirical evaluation over several biomedical interaction data sets with various state-of-the-art machine learning algorithms. The experiments demonstrate how different types of affinity strength prediction methods perform in terms of the IC-index, complementing existing prediction performance estimators.
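
One plausible reading of the quantity described, the ratio of correctly predicted interaction directions over drug-pair/target-pair combinations, can be sketched as follows; the paper's exact definition, tie handling, and normalization may differ:

```python
import itertools
import numpy as np

def interaction_concordance(y_true, y_pred, eps=1e-12):
    """Fraction of drug-pair/target-pair combinations whose interaction
    direction sign(Y[i,k] - Y[i,l] - Y[j,k] + Y[j,l]) is predicted with
    the correct sign (an illustrative reading of the IC-index)."""
    n_drugs, n_targets = y_true.shape
    agree = total = 0
    for i, j in itertools.combinations(range(n_drugs), 2):
        for k, l in itertools.combinations(range(n_targets), 2):
            dt = y_true[i, k] - y_true[i, l] - y_true[j, k] + y_true[j, l]
            dp = y_pred[i, k] - y_pred[i, l] - y_pred[j, k] + y_pred[j, l]
            if abs(dt) < eps:        # no true interaction: skip the quadruple
                continue
            total += 1
            agree += int(np.sign(dt) == np.sign(dp))
    return agree / total if total else float("nan")

# An additive predictor (row effect + column effect) has zero interaction
# everywhere, so it can never match a nonzero true direction, mirroring
# the invariance remark in the abstract.
y_true = np.array([[1.0, 0.0], [0.0, 2.0]])
y_add = np.array([[1.0, 2.0], [2.0, 3.0]])   # purely additive toy predictor
print(interaction_concordance(y_true, y_add))
```

A perfect predictor scores 1.0 on the same data, which is the contrast the metric is designed to expose.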

Updated: 2025-10-16 08:24:16

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.14419v1

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

Updated: 2025-10-16 08:23:36

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.03335v3

Personalized federated learning, Row-wise fusion regularization, Multivariate modeling, Sparse estimation

We study personalized federated learning for multivariate responses where client models are heterogeneous yet share variable-level structure. Existing entry-wise penalties ignore cross-response dependence, while matrix-wise fusion over-couples clients. We propose a Sparse Row-wise Fusion (SROF) regularizer that clusters row vectors across clients and induces within-row sparsity, and we develop RowFed, a communication-efficient federated algorithm that embeds SROF into a linearized ADMM framework with privacy-preserving partial participation. Theoretically, we establish an oracle property for SROF, achieving correct variable-level group recovery with asymptotic normality, and prove convergence of RowFed to a stationary solution. Under random client participation, the iterate gap contracts at a rate that improves with participation probability. Empirically, simulations in heterogeneous regimes show that RowFed consistently lowers estimation and prediction error and strengthens variable-level cluster recovery over NonFed, FedAvg, and a personalized matrix-fusion baseline. A real-data study further corroborates these gains while preserving interpretability. Together, our results position row-wise fusion as an effective and transparent paradigm for large-scale personalized federated multivariate learning, bridging the gap between entry-wise and matrix-wise formulations.
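
The row-wise fusion idea, pulling each variable's coefficient row together across clients while keeping the row itself sparse, can be written down as a toy penalty. This is a sketch under simplifying assumptions only; SROF's exact functional form and its linearized-ADMM treatment are in the paper:

```python
import numpy as np

def srof_penalty(B, lam1=1.0, lam2=1.0):
    """Toy row-wise fusion penalty over client coefficient matrices B[m]
    of shape (p, q): an l2 norm on pairwise row differences across clients
    (fusing each variable's row) plus an l1 within-row sparsity term."""
    M = len(B)
    fuse = sum(np.linalg.norm(B[a][r] - B[b][r])
               for r in range(B[0].shape[0])
               for a in range(M) for b in range(a + 1, M))
    sparse = sum(np.abs(Bm).sum() for Bm in B)
    return lam1 * fuse + lam2 * sparse

# two clients that agree on variable (row) 0 but differ on row 1
B = [np.array([[1.0, 0.0], [0.5, 0.0]]),
     np.array([[1.0, 0.0], [0.0, 2.0]])]
print(srof_penalty(B, lam1=1.0, lam2=0.0))  # fusion term only: row 1 disagrees
```

Shared rows contribute nothing to the fusion term, so the penalty pressures clients toward variable-level clusters without forcing whole-matrix agreement, the middle ground between entry-wise and matrix-wise formulations.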

Updated: 2025-10-16 08:18:36

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.14413v1

Eliminating Negative Occurrences of Derived Predicates from PDDL Axioms

Axioms are a feature of the Planning Domain Definition Language PDDL that can be considered as a generalization of database query languages such as Datalog. The PDDL standard restricts negative occurrences of predicates in axiom bodies to predicates that are directly set by actions and not derived by axioms. In the literature, authors often deviate from this limitation and only require that the set of axioms is stratifiable. Both variants can express exactly the same queries as least fixed-point logic, indicating that negative occurrences of derived predicates can be eliminated. We present the corresponding transformation.

Updated: 2025-10-16 08:18:09

Domains: cs.AI

Download: http://arxiv.org/abs/2510.14412v1

MIO: A Foundation Model on Multimodal Tokens

In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.

Updated: 2025-10-16 08:18:03

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2409.17692v4

Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek

The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in essays. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed the pattern most similar to that of human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language, and it showed promise for handling essay-scoring tasks alone in the future.

Updated: 2025-10-16 08:14:52

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.15009v1

Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
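
The dynamic global coefficient adjustment can be illustrated with a generic dual-ascent-style rule that moves the entropy coefficient toward an initial-anchored target; the anchoring fraction, learning rate, and clipping range below are invented, and AER's actual rule additionally allocates coefficients by task difficulty:

```python
def adjust_entropy_coeff(coeff, policy_entropy, target_entropy,
                         lr=0.05, lo=0.0, hi=1.0):
    """Raise the entropy bonus when policy entropy falls below the target
    and lower it otherwise (a generic adaptive rule, not AER's exact one)."""
    coeff += lr * (target_entropy - policy_entropy)
    return min(max(coeff, lo), hi)   # keep the coefficient in a sane range

# entropy collapsing below the target pushes the coefficient back up
target = 0.7 * 2.0   # e.g. anchored at 70% of an assumed initial entropy of 2.0
c = 0.01
for ent in [1.9, 1.6, 1.2, 0.9]:     # a collapsing entropy trajectory
    c = adjust_entropy_coeff(c, ent, target)
print(round(c, 4))
```

While the policy is still more exploratory than the target, the coefficient shrinks toward zero; once entropy collapses below the anchor, the bonus grows and restores exploration pressure.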

Updated: 2025-10-16 08:13:32

Domains: cs.LG,cs.AI,cs.CL,stat.ML

Download: http://arxiv.org/abs/2510.10959v2

Revisit Modality Imbalance at the Decision Layer

Multimodal learning integrates information from different modalities to enhance model performance, yet it often suffers from modality imbalance, where dominant modalities overshadow weaker ones during joint optimization. This paper reveals that such an imbalance not only occurs during representation learning but also manifests significantly at the decision layer. Experiments on audio-visual datasets (CREMAD and Kinetic-Sounds) show that even after extensive pretraining and balanced optimization, models still exhibit systematic bias toward certain modalities, such as audio. Further analysis demonstrates that this bias originates from intrinsic disparities in feature-space and decision-weight distributions rather than from optimization dynamics alone. We argue that aggregating uncalibrated modality outputs at the fusion stage leads to biased decision-layer weighting, hindering weaker modalities from contributing effectively. To address this, we propose that future multimodal systems should incorporate adaptive weight allocation mechanisms at the decision layer, enabling a relative balance according to the capabilities of each modality.
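
As a minimal illustration of decision-layer weighting, consider fusing per-modality logits with explicit weights rather than an uncalibrated sum. The logits and weights are invented; how the weights should adapt to each modality's capability is precisely the open question the abstract raises:

```python
import numpy as np

def fused_logits(modality_logits, weights):
    """Decision-layer fusion with explicit per-modality weights instead of
    a plain sum of uncalibrated outputs (illustrative only)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize the allocation
    return sum(wi * li for wi, li in zip(w, modality_logits))

audio = np.array([4.0, 0.0])   # dominant, overconfident modality
video = np.array([0.0, 1.0])   # weaker modality
print(np.argmax(fused_logits([audio, video], [1, 1])))   # equal weights: audio wins, class 0
print(np.argmax(fused_logits([audio, video], [1, 9])))   # audio downweighted: class 1
```

With equal weights the overconfident audio logits dictate the decision; reallocating weight toward the weaker modality lets its evidence matter, which is the behavior an adaptive allocation mechanism would have to learn.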

Updated: 2025-10-16 08:11:24

Domains: cs.LG,cs.MM,cs.SD,eess.AS

Download: http://arxiv.org/abs/2510.14411v1

IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in the thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B only achieve Final Pass Rates of 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpasses the capabilities of the MAS through simple end-to-end training. Through this pipeline, a single small-scale model is not only able to acquire the structured reasoning and planning capabilities of a well-organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3-8B-Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.

Updated: 2025-10-16 08:06:35

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14406v1

Why do explanations fail? A typology and discussion on failures in XAI

As Machine Learning models achieve unprecedented levels of performance, the XAI domain aims at making these models understandable by presenting end-users with intelligible explanations. Yet, some existing XAI approaches fail to meet expectations: several issues have been reported in the literature, generally pointing out either technical limitations or misinterpretations by users. In this paper, we argue that the resulting harms arise from a complex overlap of multiple failures in XAI, which existing ad-hoc studies fail to capture. This work therefore advocates for a holistic perspective, presenting a systematic investigation of the limitations of current XAI methods and their impact on the interpretation of explanations. By distinguishing between system-specific and user-specific failures, we propose a typological framework that helps reveal the nuanced complexities of explanation failures. Leveraging this typology, we discuss some research directions to help practitioners better understand the limitations of XAI systems and enhance the quality of ML explanations.

Updated: 2025-10-16 08:03:54

Domains: cs.LG,cs.AI,cs.HC

Download: http://arxiv.org/abs/2405.13474v2

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.

Updated: 2025-10-16 08:01:13

Domains: cs.AI

Download: http://arxiv.org/abs/2509.25373v4

The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems

A growing body of multi-agent studies with Large Language Models (LLMs) explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without full visibility into payoffs and population, relying instead on heuristics, communication, and punishment. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom's principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a 2×2 grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies comprised of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.

Updated: 2025-10-16 07:59:31

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2510.14401v1

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.
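The iterative retrieval-verification process can be sketched as a loop. A minimal sketch, where `retrieve`, `verify`, and `refine_query` are illustrative stand-ins for the paper's retrieval backend, verification agent, and Medical Gap Analysis, not its actual API:

```python
# Hypothetical sketch of an iterative retrieval-verification loop in the
# spirit of MedTrust-Guided Iterative RAG. All callables are assumptions.

def iterative_rag(question, retrieve, verify, refine_query, max_rounds=3):
    """Retrieve evidence, check adequacy, and refine the query until
    verification passes or the round budget is exhausted."""
    query = question
    for _ in range(max_rounds):
        docs = retrieve(query)
        ok, gap = verify(question, docs)  # gap ~ "Medical Gap Analysis"
        if ok:
            return docs, True
        query = refine_query(query, gap)  # fold the gap back into the query
    # Evidence still insufficient: the caller should emit a structured
    # Negative Knowledge Assertion rather than answer without grounding.
    return docs, False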

Updated: 2025-10-16 07:59:11

Categories: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2510.14400v1

Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://GroundLMM-ICCV.github.io.
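The "attend-and-segment" idea can be illustrated in a few lines: take the attention a standard LMM assigns from a text token to image patches, upsample it to pixel resolution, and threshold it into a mask. The nearest-neighbour upsampling and the fixed threshold below are illustrative simplifications, not the paper's exact recipe:

```python
import numpy as np

def attend_and_segment(attn_patches, out_hw, thresh=0.5):
    """attn_patches: (h, w) attention over image patches for one token.
    Returns a binary (H, W) segmentation mask."""
    h, w = attn_patches.shape
    H, W = out_hw
    # Normalise to [0, 1] so a fixed threshold is meaningful.
    a = attn_patches - attn_patches.min()
    a = a / (a.max() + 1e-8)
    # Nearest-neighbour upsample from the patch grid to the pixel grid.
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    pixel_attn = a[np.ix_(rows, cols)]
    return (pixel_attn >= thresh).astype(np.uint8)
```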

Updated: 2025-10-16 07:50:21

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2410.08209v2

TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs' planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
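Unifying fine-grained criteria into a single reward amounts to a checked, weighted aggregation. A hedged sketch; the criteria names and weights below are invented for illustration and differ from the benchmark's actual rubric:

```python
# Illustrative unification of per-criterion scores into one scalar reward
# suitable for direct plan comparison or as an RL reward signal.

CRITERIA = {"feasibility": 0.5, "reliability": 0.3, "engagement": 0.2}

def unified_reward(scores):
    """scores: dict mapping criterion -> value in [0, 1]."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    return sum(w * scores[c] for c, w in CRITERIA.items())
```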

Updated: 2025-10-16 07:45:03

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.09011v3

Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow

Current transformer accelerators primarily focus on optimizing self-attention due to its quadratic complexity. However, this focus is less relevant for vision transformers with short token lengths, where the Feed-Forward Network (FFN) tends to be the dominant computational bottleneck. This paper presents a low power Vision Transformer accelerator, optimized through algorithm-hardware co-design. The model complexity is reduced using hardware-friendly dynamic token pruning without introducing complex mechanisms. Sparsity is further improved by replacing GELU with ReLU activations and employing dynamic FFN2 pruning, achieving a 61.5% reduction in operations and a 59.3% reduction in FFN2 weights, with an accuracy loss of less than 2%. The hardware adopts a row-wise dataflow with output-oriented data access to eliminate data transposition, and supports dynamic operations with minimal area overhead. Implemented in TSMC's 28nm CMOS technology, our design occupies 496.4K gates and includes a 232KB SRAM buffer, achieving a peak throughput of 1024 GOPS at 1GHz, with an energy efficiency of 2.31 TOPS/W and an area efficiency of 858.61 GOPS/mm2.
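Dynamic token pruning without complex auxiliary mechanisms can be sketched as a simple top-k selection. Scoring each token by the mean attention it receives is an assumption for illustration; the paper's exact criterion may differ:

```python
import numpy as np

# Hardware-friendly dynamic token pruning: keep the most-attended tokens,
# drop the rest, preserving the original token order.

def prune_tokens(tokens, attn, keep_ratio=0.5):
    """tokens: (n, d) array; attn: (n, n) attention matrix.
    Keeps the ceil(n * keep_ratio) tokens receiving the most attention."""
    scores = attn.mean(axis=0)                    # attention each token receives
    k = max(1, int(np.ceil(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[::-1][:k])  # top-k, original order
    return tokens[keep], keep
```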

Updated: 2025-10-16 07:44:42

Categories: cs.AR,cs.LG

Download: http://arxiv.org/abs/2510.14393v1

FairBatching: Fairness-Aware Batch Formation for LLM Inference

Large language model (LLM) inference systems face a fundamental tension between minimizing Time-to-First-Token (TTFT) latency for new requests and maintaining a high, steady token generation rate (low Time-Per-Output-Token, or TPOT) for ongoing requests. Existing stall-free batching schedulers proposed by Sarathi, while effective at preventing decode stalls, introduce significant computational unfairness. They prioritize decode tasks excessively, simultaneously leading to underutilized decode slack and unnecessary prefill queuing delays, which collectively degrade the system's overall quality of service (QoS). This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT) as a scheduling metric and the rigid decode-prioritizing policy that fails to adapt to dynamic workload bursts. We therefore propose FairBatching, a novel LLM inference scheduler that enforces fair resource allocation between prefill and decode tasks. It features an adaptive batch capacity determination mechanism, which dynamically adjusts the computational budget to improve the GPU utilization without triggering SLO violations. Its fair and dynamic batch formation algorithm breaks away from the decode-prioritizing paradigm, allowing computation resources to be reclaimed from bursting decode tasks to serve prefill surges, achieving global fairness. Furthermore, FairBatching provides a novel load estimation method, enabling more effective coordination with upper-level schedulers. Implemented and evaluated on realistic traces, FairBatching significantly reduces TTFT tail latency by up to 2.29x while robustly maintaining TPOT SLOs, achieving overall 20.0% improvement in single-node capacity and 54.3% improvement in cluster-level capacity.
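The shape of fairness-aware batch formation can be sketched as filling each step's token budget with the tokens owed to ongoing decodes first, then granting the remaining slack to queued prefill chunks rather than idling it. This is only the skeleton of the idea; the budget numbers and FIFO policy below are illustrative, not FairBatching's adaptive capacity mechanism:

```python
# Toy batch formation: decode tokens are served, then chunked prefill
# fills whatever budget remains, so slack is reclaimed instead of wasted.

def form_batch(decode_tokens, prefill_queue, budget):
    """decode_tokens: tokens owed to ongoing requests this step.
    prefill_queue: pending prefill lengths (FIFO).
    Returns (decode_tokens, admitted prefill chunk sizes)."""
    slack = budget - decode_tokens
    admitted = []
    for length in prefill_queue:
        take = min(length, slack)
        if take <= 0:
            break
        admitted.append(take)   # admit a (possibly partial) prefill chunk
        slack -= take
    return decode_tokens, admitted
```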

Updated: 2025-10-16 07:43:56

Categories: cs.DC,cs.AI

Download: http://arxiv.org/abs/2510.14392v1

MarkDiffusion: An Open-Source Toolkit for Generative Watermarking of Latent Diffusion Models

We introduce MarkDiffusion, an open-source Python toolkit for generative watermarking of latent diffusion models. It comprises three key components: a unified implementation framework for streamlined watermarking algorithm integrations and user-friendly interfaces; a mechanism visualization suite that intuitively showcases added and extracted watermark patterns to aid public understanding; and a comprehensive evaluation module offering standard implementations of 24 tools across three essential aspects - detectability, robustness, and output quality - plus 8 automated evaluation pipelines. Through MarkDiffusion, we seek to assist researchers, enhance public awareness and engagement in generative watermarking, and promote consensus while advancing research and applications.

Updated: 2025-10-16 07:42:56

Categories: cs.CR,cs.AI,cs.MM,68T50,I.2.7

Download: http://arxiv.org/abs/2509.10569v2

Beat Detection as Object Detection

Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
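The NMS post-processing step described above is just 1-D non-maximum suppression over scored intervals. A minimal sketch; the `(start, end, score)` interval format and the IoU threshold are illustrative choices:

```python
# 1-D NMS: greedily keep the highest-scoring beat/downbeat intervals,
# suppressing any candidate that overlaps an already-kept one too much.

def iou_1d(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_1d(intervals, iou_thresh=0.5):
    """intervals: list of (start, end, score); returns kept intervals."""
    kept = []
    for cand in sorted(intervals, key=lambda x: x[2], reverse=True):
        if all(iou_1d(cand, k) < iou_thresh for k in kept):
            kept.append(cand)
    return kept
```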

Updated: 2025-10-16 07:42:45

Categories: cs.SD,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14391v1

A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost-of-pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
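The route-then-dispatch control flow and the cost-regularized reward can be sketched compactly. The router, mode handlers, and penalty coefficient below are illustrative stand-ins, not the paper's trained components:

```python
# Sketch of task-aware routing across the three modes plus a reward that
# trades accuracy against tokens/tool-calls spent, as APO's cost
# regularization suggests. The lambda coefficient is an assumption.

MODES = ("instant", "reasoning", "agentic")

def answer(query, router, handlers):
    """Route the query to one mode and run that mode's handler."""
    mode = router(query)
    assert mode in MODES
    return handlers[mode](query), mode

def cost_regularized_reward(correct, cost, lam=0.1):
    """Reward correctness while penalizing compute/tool cost."""
    return (1.0 if correct else 0.0) - lam * cost
```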

Updated: 2025-10-16 07:41:48

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.12838v2

Tensor Logic: The Language of AI

Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP and Prolog lack scalability and support for learning. This paper proposes tensor logic, a language that solves these problems by unifying neural and symbolic AI at a fundamental level. The sole construct in tensor logic is the tensor equation, based on the observation that logical rules and Einstein summation are essentially the same operation, and all else can be reduced to them. I show how to elegantly implement key forms of neural, symbolic and statistical AI in tensor logic, including transformers, formal reasoning, kernel machines and graphical models. Most importantly, tensor logic makes new directions possible, such as sound reasoning in embedding space. This combines the scalability and learnability of neural networks with the reliability and transparency of symbolic reasoning, and is potentially a basis for the wider adoption of AI.
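The core observation, that a logical rule and an Einstein summation are the same operation, can be made concrete with Boolean tensors. A sketch (in NumPy, not the proposed language itself) of the rule path(x, z) <- edge(x, y), path(y, z), where the join over the shared variable y is a contracted index:

```python
import numpy as np

# Relations as 0/1 matrices; one rule application is an einsum followed
# by an existential step (any witness y makes the conclusion true).

def rule_step(edge, path):
    """One application of path(x,z) <- edge(x,y), path(y,z)."""
    joined = np.einsum("xy,yz->xz", edge, path)  # sum over y = the join
    return (joined > 0).astype(np.uint8)

def transitive_closure(edge):
    """Iterate the rule to a fixed point: all reachable pairs."""
    path = edge.copy()
    while True:
        new = path | rule_step(edge, path)
        if (new == path).all():
            return path
        path = new
```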

Updated: 2025-10-16 07:40:28

Categories: cs.AI,cs.LG,cs.NE,cs.PL,stat.ML,I.2.3; I.2.4; I.2.5; I.2.6; I.5.1

Download: http://arxiv.org/abs/2510.12269v3

BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble

Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors, YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.
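A confidence-based voting rule of the kind described can be sketched in a few interpretable lines: accept detections both models agree on, and require a higher bar when only one model fires. The thresholds and the per-label detection format below are invented for illustration and are not the CTV Voter's actual rules:

```python
# Toy precision/recall-balancing vote between a precise detector (YOLO)
# and a high-recall one (Faster R-CNN).

def ctv_vote(yolo_dets, rcnn_dets, agree_thresh=0.3, solo_thresh=0.8):
    """Each detection: (label, confidence). Returns accepted labels."""
    yolo = {l: c for l, c in yolo_dets}
    rcnn = {l: c for l, c in rcnn_dets}
    accepted = []
    for label in sorted(set(yolo) | set(rcnn)):
        cy, cr = yolo.get(label, 0.0), rcnn.get(label, 0.0)
        if cy >= agree_thresh and cr >= agree_thresh:
            accepted.append(label)        # both models agree
        elif max(cy, cr) >= solo_thresh:
            accepted.append(label)        # one highly confident dissenter
    return accepted
```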

Updated: 2025-10-16 07:38:31

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14389v1

Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
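The foresight idea, scoring a high-level subgoal by the execution feedback the low-level model actually produces and subtracting a group baseline instead of a learned critic, can be reduced to a few lines. This is an illustrative simplification of the advantage computation, not the paper's training code:

```python
# Hedged sketch: each candidate subgoal is rolled out by the low-level
# model, and its advantage is its return minus the group mean (critic-free,
# group-relative, in the spirit of GRPO).

def foresight_advantages(subgoals, execute):
    """execute(subgoal) -> scalar feedback from the low-level model."""
    returns = [execute(g) for g in subgoals]
    baseline = sum(returns) / len(returns)   # group-relative baseline
    return [r - baseline for r in returns]
```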

Updated: 2025-10-16 07:38:21

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14388v1

Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?

Math reasoning has been one crucial ability of large language models (LLMs), where significant advancements have been achieved in recent years. However, most efforts focus on LLMs by curating high-quality annotation data and intricate training (or inference) paradigms, while the math reasoning performance of multi-modal LLMs (MLLMs) remains lagging behind. Since the MLLM typically consists of an LLM and a vision block, we wonder: Can MLLMs directly absorb math reasoning abilities from off-the-shelf math LLMs without tuning? Recent model-merging approaches may offer insights into this question. However, they overlook the alignment between the MLLM and LLM, where we find that there is a large gap between their parameter spaces, resulting in lower performance. Our empirical evidence reveals two key factors behind this issue: the identification of crucial reasoning-associated layers in the model and the mitigation of the gaps in parameter space. Based on the empirical insights, we propose IP-Merging that first identifies the reasoning-associated parameters in both MLLM and Math LLM, then projects them into the subspace of MLLM, aiming to maintain the alignment, and finally merges parameters in this subspace. IP-Merging is a tuning-free approach, since parameters are adjusted directly rather than updated through training. Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities.
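The projection-then-merge step can be sketched with a truncated SVD: project the math LLM's weight delta onto the top singular subspace of the MLLM's corresponding weight before averaging it in, so the merged update stays aligned with the MLLM's parameter space. The rank and mixing weight below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Hedged sketch of subspace-aligned, tuning-free parameter merging for
# one reasoning-associated layer.

def ip_merge(w_mllm, w_math, rank=2, alpha=0.5):
    """w_mllm, w_math: (d, d) weight matrices of matching layers."""
    u, s, vt = np.linalg.svd(w_mllm)
    u_k = u[:, :rank]
    proj = u_k @ u_k.T                   # projector onto MLLM's top subspace
    delta = proj @ (w_math - w_mllm)     # keep only the aligned component
    return w_mllm + alpha * delta
```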

Updated: 2025-10-16 07:38:16

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14387v1

PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research

Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms -- yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We demonstrate why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.

Updated: 2025-10-16 07:38:09

Categories: cs.MM,cs.AI,cs.DB

Download: http://arxiv.org/abs/2508.09232v2

SHaRe-SSM: An Oscillatory Spiking Neural Network for Target Variable Modeling in Long Sequences

In recent years, with the emergence of large models, there has been a significant interest in spiking neural networks (SNNs), primarily due to their energy efficiency, multiplication-free operation, and sparse event-based deep learning. Similarly, state space models (SSMs) in varying designs have evolved as a powerful alternative to transformers for target modeling in long sequences, thereby overcoming a transformer's quadratic dependence on sequence length. Inspired by this progress, we here design SHaRe-SSM (Spiking Harmonic Resonate and Fire State Space Model), for target variable modeling (including both classification and regression) for very-long-range sequences. Our second-order spiking SSM, on average, performs better than transformers or first-order SSMs while circumventing multiplication operations, making it ideal for resource-constrained applications. The proposed block consumes $73 \times$ less energy than second-order ANN-based SSMs for an 18k sequence, while retaining performance. To ensure learnability over the long-range sequences, we propose exploiting the stable and efficient implementation of the dynamical system using parallel scans. Moreover, for the first time, we propose a kernel-based spiking regressor using resonate-and-fire neurons for very long-range sequences. Our network shows superior performance on even a 50k sequence while being significantly energy-efficient. In addition, we conducted a systematic analysis of the impact of heterogeneity, dissipation, and conservation in resonate-and-fire SSMs.
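The parallel-scan trick rests on the fact that the linear recurrence h[t] = a[t] * h[t-1] + b[t] is associative in the pairs (a, b), so it can be evaluated in O(log T) depth on parallel hardware. A minimal sketch showing the associative combine operator, applied here sequentially for clarity:

```python
# The combine operator composes two recurrence segments; because it is
# associative, a Blelloch-style parallel scan can replace this loop.

def combine(left, right):
    """Compose segments (a1, b1) then (a2, b2) of h[t] = a*h[t-1] + b."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan(pairs):
    """Inclusive scan; pairs[t] = (a[t], b[t]); returns h[t] with h[-1] = 0."""
    out, acc = [], (1.0, 0.0)          # identity element of `combine`
    for p in pairs:
        acc = combine(acc, p)
        out.append(acc[1])             # acc applied to h[-1] = 0 gives h[t]
    return out
```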

Updated: 2025-10-16 07:37:59

Categories: cs.LG,cs.NE

Download: http://arxiv.org/abs/2510.14386v1

SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets. We observe that LLMs exhibit uneven confidence across the semantic representation space. We argue that examples with different confidence levels should play distinct roles in instruction tuning: Confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels. We then interpolate them to bridge the confidence gap and apply a Mixup-based regularization to support learning on these additional, interpolated examples. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.
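The Mixup step between a confident and an unconfident example can be sketched as interpolation with a Beta-distributed coefficient. Operating on raw feature vectors, as below, is a simplification for illustration; SFTMix interpolates inside the LLM's training pipeline:

```python
import numpy as np

# Minimal Mixup between one confident and one unconfident example to
# bridge the confidence gap. alpha controls how extreme the mixing is.

def mixup_pair(x_conf, x_unconf, y_conf, y_unconf, alpha=0.2, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)           # mixing coefficient in (0, 1)
    x = lam * x_conf + (1 - lam) * x_unconf
    y = lam * y_conf + (1 - lam) * y_unconf
    return x, y, lam
```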

Updated: 2025-10-16 07:32:30

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2410.05248v3

Match & Mend: Minimally Invasive Local Reassembly for Patching N-day Vulnerabilities in ARM Binaries

Low-cost Internet of Things (IoT) devices are increasingly popular but often insecure due to poor update regimes. As a result, many devices run outdated and known-vulnerable versions of open-source software. We address this problem by proposing to patch IoT firmware at the binary level, without requiring vendor support. In particular, we introduce minimally invasive local reassembly, a new technique for automatically patching known (n-day) vulnerabilities in IoT firmware. Our approach is designed to minimize side effects and reduce the risk of introducing breaking changes. We systematically evaluate our approach both on 108 binaries within the controlled environment of the MAGMA benchmarks, as well as on 30 real-world Linux-based IoT firmware images from the KARONTE dataset. Our prototype successfully patches 83% of targeted vulnerabilities in MAGMA and 96% in the firmware dataset.

Updated: 2025-10-16 07:31:42

Domains: cs.CR,cs.SE

Download: http://arxiv.org/abs/2510.14384v1

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $\Delta$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.

Updated: 2025-10-16 07:28:54

Domains: cs.LG,cs.AI,cs.CL,cs.CR

Download: http://arxiv.org/abs/2510.14381v1

Flows and Diffusions on the Neural Manifold

Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques towards matching a gradient flow, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.

Updated: 2025-10-16 07:23:22

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2507.10623v2

PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has a higher density of distractor documents, better reflecting the practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceeds 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on the base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
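
The "check all documents individually, filter cheaply" pipeline can be sketched as below. Both the cheap scorer (a crude token-overlap stand-in for the cross-encoder) and the answerer are hypothetical placeholders, not the paper's components.

```python
def cheap_relevance_score(subquestion, document):
    """Stand-in for a cross-encoder: crude token-overlap score in [0, 1]."""
    q = set(subquestion.lower().split())
    d = set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def answer_with_llm(subquestion, document):
    """Stand-in for the costly LLM reasoning step over one document."""
    return f"answer({subquestion!r} | {document!r})"

def plurihop_answer(question, documents, threshold=0.3):
    # (i) ask a document-level subquestion of every document
    #     (here simply the question itself), then
    # (ii) filter cheaply and run the expensive step only on survivors.
    survivors = [d for d in documents
                 if cheap_relevance_score(question, d) >= threshold]
    return [answer_with_llm(question, d) for d in survivors]

docs = ["turbine blade inspection report", "canteen lunch menu"]
results = plurihop_answer("blade inspection findings", docs)
```

Unlike top-k retrieval, every document is scored, so recall is bounded only by the filter, not by an arbitrary k.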

Updated: 2025-10-16 07:22:58

Domains: cs.CL,cs.IR,cs.LG

Download: http://arxiv.org/abs/2510.14377v1

GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning

As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. To this end, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show GEPO achieves superior stability - only a 3% performance drop from online to 1800 s latency - and reduces the best-to-last gap by 85% versus GSPO (1.8 vs. 12.0) while attaining the highest scores, highlighting its effectiveness in decentralized, resource-heterogeneous environments.
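
The variance-reduction idea behind group expectation weighting can be illustrated on toy numbers: rescale each importance weight by the group's mean weight. This is a simplified sketch of the concept, not GEPO's actual estimator, and the variance reduction shown holds for this example rather than in general.

```python
def importance_weights(pi_probs, mu_probs):
    """Per-sample importance weights pi(a|s) / mu(a|s)."""
    return [p / q for p, q in zip(pi_probs, mu_probs)]

def group_expectation_weights(pi_probs, mu_probs):
    """Sketch: normalize each weight by the group's mean weight,
    so the group mean is exactly one and the spread shrinks here."""
    w = importance_weights(pi_probs, mu_probs)
    mean_w = sum(w) / len(w)
    return [wi / mean_w for wi in w]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

pi = [0.5, 0.1, 0.4]    # current policy probabilities
mu = [0.25, 0.5, 0.25]  # stale rollout policy probabilities
raw = importance_weights(pi, mu)             # [2.0, 0.2, 1.6]
grouped = group_expectation_weights(pi, mu)  # group mean is 1.0
```

With stale rollouts (high latency), raw weights spread out; the group-normalized weights stay centered, which is what stabilizes the asynchronous updates.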

Updated: 2025-10-16 07:19:57

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.17850v7

Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling

A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continuing to pretrain an LLM on science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixture-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. This paradigm decouples knowledge in the general domain from knowledge in the different scientific disciplines, avoiding negative interference among domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts, 8 of which are activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves a 25% average improvement across 30 scientific tasks with a win rate of 70%, while retaining 99% of its performance on general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator to boost reasoning, exhibits excellent reasoning performance in solving complex scientific problems, with improvements of over 30%.
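
Stage (2), Fine-grained Expert Splitting via FFN dimension decomposition, can be pictured as partitioning an FFN's hidden dimension into slices, one per expert. The sketch below is an assumed, simplified view (row-slicing a toy up-projection matrix), not Innovator's exact procedure.

```python
def split_ffn_into_experts(w_up, num_experts):
    """Partition the rows of an up-projection matrix (hidden x d_model)
    into num_experts contiguous slices, one slice per expert."""
    hidden = len(w_up)
    assert hidden % num_experts == 0, "hidden dim must divide evenly"
    chunk = hidden // num_experts
    return [w_up[i * chunk:(i + 1) * chunk] for i in range(num_experts)]

# Toy FFN: hidden dim 8, model dim 2, split into 4 experts of hidden dim 2.
w_up = [[float(i), float(i) * 2] for i in range(8)]
experts = split_ffn_into_experts(w_up, 4)
```

Because the slices together reproduce the original matrix, routing all experts recovers the dense FFN, which is what lets the upcycled model start from the pretrained weights.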

Updated: 2025-10-16 07:15:27

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.18671v2

Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning

We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
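
The confidence signal itself is simple to compute: the Shannon entropy of a next-token distribution recovered from token-level log-probabilities, compared against a model-specific threshold. A minimal sketch (the threshold value below is illustrative, not calibrated):

```python
import math

def shannon_entropy(logprobs):
    """Entropy of one next-token distribution, given log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def should_stop(step_logprobs, threshold):
    """Stop reasoning early once the distribution is peaked enough
    (entropy below a model-specific, one-shot calibrated threshold)."""
    return shannon_entropy(step_logprobs) < threshold

# A peaked (confident) distribution vs. a uniform (uncertain) one.
confident = [math.log(p) for p in (0.97, 0.01, 0.01, 0.01)]
uncertain = [math.log(0.25)] * 4
```

In practice the threshold would be fitted once from a few examples of an existing reasoning dataset, as the abstract describes.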

Updated: 2025-10-16 07:14:55

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08146v2

From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program

To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated in a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.

Updated: 2025-10-16 07:06:05

Domains: cs.CL,cs.AI,cs.CY,cs.HC

Download: http://arxiv.org/abs/2510.14369v1

Technical and legal aspects of federated learning in bioinformatics: applications, challenges and opportunities

Federated learning leverages data across institutions to improve clinical discovery while complying with data-sharing restrictions and protecting patient privacy. This paper provides a gentle introduction to this approach in bioinformatics, and is the first to review key applications in proteomics, genome-wide association studies (GWAS), single-cell and multi-omics studies, along with their legal, methodological, and infrastructural challenges. As the evolution of biobanks in genetics and systems biology has shown, accessing more extensive and varied data pools leads to a faster and more robust exploration and translation of results. More widespread use of federated learning may have a similar impact in bioinformatics, allowing academic and clinical institutions to access many combinations of genotypic, phenotypic and environmental information that are undercovered or not included in existing biobanks.

Updated: 2025-10-16 07:04:38

Domains: q-bio.OT,cs.LG,stat.ML

Download: http://arxiv.org/abs/2503.09649v3

TriQXNet: Forecasting Dst Index from Solar Wind Data Using an Interpretable Parallel Classical-Quantum Framework with Uncertainty Quantification

Geomagnetic storms, caused by solar wind energy transfer to Earth's magnetic field, can disrupt critical infrastructure like GPS, satellite communications, and power grids. The disturbance storm-time (Dst) index measures storm intensity. Despite advancements in empirical, physics-based, and machine-learning models using real-time solar wind data, accurately forecasting extreme geomagnetic events remains challenging due to noise and sensor failures. This research introduces TriQXNet, a novel hybrid classical-quantum neural network for Dst forecasting. Our model integrates classical and quantum computing, conformal prediction, and explainable AI (XAI) within a hybrid architecture. To ensure high-quality input data, we developed a comprehensive preprocessing pipeline that included feature selection, normalization, aggregation, and imputation. TriQXNet processes preprocessed solar wind data from NASA's ACE and NOAA's DSCOVR satellites, predicting the Dst index for the current hour and the next, providing vital advance notice to mitigate geomagnetic storm impacts. TriQXNet outperforms 13 state-of-the-art hybrid deep-learning models, achieving a root mean squared error of 9.27 nanoteslas (nT). Rigorous evaluation through 10-fold cross-validated paired t-tests confirmed its superior performance with 95% confidence. Conformal prediction techniques provide quantifiable uncertainty, which is essential for operational decisions, while XAI methods like ShapTime enhance interpretability. Comparative analysis shows TriQXNet's superior forecasting accuracy, setting a new level of expectations for geomagnetic storm prediction and highlighting the potential of classical-quantum hybrid models in space weather forecasting.
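
The conformal prediction component turns point forecasts into calibrated intervals. A generic split-conformal recipe (not TriQXNet's exact procedure; the toy calibration values are invented) looks like:

```python
import math

def conformal_interval(cal_preds, cal_truths, new_pred, alpha=0.1):
    """Split conformal prediction: the (1 - alpha) conformal quantile of
    absolute calibration residuals gives the interval half-width."""
    residuals = sorted(abs(p - t) for p, t in zip(cal_preds, cal_truths))
    n = len(residuals)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = residuals[k]
    return new_pred - q, new_pred + q

# Toy calibration set: forecast Dst (nT) vs. observed Dst (nT).
preds = [-20.0, -35.0, -10.0, -50.0, -5.0, -60.0, -15.0, -30.0, -25.0]
truth = [-22.0, -30.0, -12.0, -58.0, -6.0, -55.0, -18.0, -33.0, -24.0]
lo, hi = conformal_interval(preds, truth, new_pred=-40.0, alpha=0.1)
```

The resulting interval carries a marginal coverage guarantee of roughly 1 - alpha on exchangeable data, which is the quantifiable uncertainty the abstract refers to.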

Updated: 2025-10-16 07:04:21

Domains: cs.AI

Download: http://arxiv.org/abs/2407.06658v3

CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment

Proprietary large language models (LLMs) exhibit strong generalization capabilities across diverse tasks and are increasingly deployed on edge devices for efficiency and privacy reasons. However, deploying proprietary LLMs at the edge without adequate protection introduces critical security threats. Attackers can extract model weights and architectures, enabling unauthorized copying and misuse. Even when protective measures prevent full extraction of model weights, attackers may still perform advanced attacks, such as fine-tuning, to further exploit the model. Existing defenses against these threats typically incur significant computational and communication overhead, making them impractical for edge deployment. To safeguard the edge-deployed LLMs, we introduce CoreGuard, a computation- and communication-efficient protection method. CoreGuard employs an efficient protection protocol to reduce computational overhead and minimize communication overhead via a propagation protocol. Extensive experiments show that CoreGuard achieves upper-bound security protection with negligible overhead.

Updated: 2025-10-16 07:01:28

Domains: cs.CR,cs.AI,cs.DC

Download: http://arxiv.org/abs/2410.13903v2

SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES-ICLR.github.io.

Updated: 2025-10-16 06:58:28

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2404.12386v2

AI for Service: Proactive Assistance with AI Glasses

In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.

Updated: 2025-10-16 06:55:28

Domains: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2510.14359v1

SUM-AgriVLN: Spatial Understanding Memory for Agricultural Vision-and-Language Navigation

Agricultural robots are emerging as powerful assistants across a wide range of agricultural tasks; nevertheless, they still heavily rely on manual operation or fixed rail systems for movement. The AgriVLN method and the A2A benchmark are the first to extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling robots to navigate to target positions by following natural language instructions. In practical agricultural scenarios, navigation instructions often recur, yet AgriVLN treats each instruction as an independent episode, overlooking the potential of past experiences to provide spatial context for subsequent ones. To bridge this gap, we propose Spatial Understanding Memory for Agricultural Vision-and-Language Navigation (SUM-AgriVLN), in which the SUM module builds spatial understanding and saves spatial memory through 3D reconstruction and representation. When evaluated on the A2A benchmark, our SUM-AgriVLN effectively improves Success Rate from 0.47 to 0.54 with a slight sacrifice in Navigation Error, from 2.91 m to 2.93 m, demonstrating state-of-the-art performance in the agricultural domain. Code: https://github.com/AlexTraveling/SUM-AgriVLN.

Updated: 2025-10-16 06:53:32

Domains: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.14357v1

Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.

Updated: 2025-10-16 06:50:33

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.15007v1

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating 32 LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs, six fine-tuning style settings, and two real-world instruction-tuning datasets, SafeStyle consistently outperforms baselines in maintaining LLM safety.

Updated: 2025-10-16 06:50:23

Domains: cs.LG,cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2506.07452v2

CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering

High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model's certainty, and an adaptive routing mechanism directs low-confidence queries to Helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks: MedQA, MedMCQA, and PubMedQA. Results demonstrate that our framework achieves competitive performance, with particularly strong results on PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improving medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
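
The two-stage architecture reduces to a few lines of control flow: answer directly when the primary model is confident, otherwise aggregate with helper models. All three model functions below are hypothetical stand-ins, not the actual Qwen3/Phi-4/Gemma 2 components.

```python
def primary_model(question):
    """Stand-in primary LLM: returns (answer, confidence in [0, 1])."""
    canned = {"q1": ("A", 0.9), "q2": ("B", 0.4)}
    return canned.get(question, ("?", 0.0))

def helper_models(question):
    """Stand-in helper LLMs: each returns a candidate answer."""
    return ["C", "C", "C"]

def cure_answer(question, threshold=0.6):
    answer, conf = primary_model(question)
    if conf >= threshold:
        return answer  # high confidence: answer directly
    # low confidence: route to helpers and take a majority vote
    votes = helper_models(question) + [answer]
    return max(set(votes), key=votes.count)
```

Here "q1" clears the threshold and is answered by the primary model alone, while "q2" falls below it and is resolved collaboratively; no model is fine-tuned at any point.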

Updated: 2025-10-16 06:46:11

Categories: cs.CL,cs.AI,physics.med-ph

Download: http://arxiv.org/abs/2510.14353v1

ES-C51: Expected Sarsa Based C51 Distributional Reinforcement Learning Algorithm

In most value-based reinforcement learning (RL) algorithms, the agent estimates only the expected reward for each action and selects the action with the highest reward. In contrast, Distributional Reinforcement Learning (DRL) estimates the entire probability distribution of possible rewards, providing richer information about uncertainty and variability. C51 is a popular DRL algorithm for discrete action spaces. It uses a Q-learning approach, where the distribution is learned using a greedy Bellman update. However, this can cause problems if multiple actions at a state have similar expected rewards but different distributions, as the algorithm may not learn a stable distribution. This study presents a modified version of C51 (ES-C51) that replaces the greedy Q-learning update with an Expected Sarsa update, which uses a softmax calculation to combine information from all possible actions at a state rather than relying on a single best action. This reduces instability when actions have similar expected rewards and allows the agent to learn higher-performing policies. The approach is evaluated on classic control environments from Gym and Atari-10 games. For a fair comparison, we modify the standard C51's exploration strategy from ε-greedy to softmax, which we refer to as QL-C51 (Q-Learning based C51). The results demonstrate that ES-C51 outperforms QL-C51 across many environments.
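
The core change relative to the greedy update can be sketched as a softmax-weighted mixture over the next-state action distributions on C51's fixed categorical support. The support range, temperature, and shapes below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Illustrative sketch of an Expected-Sarsa-style target for C51 (not the
# paper's code): instead of taking the distribution of the single greedy
# next action, mix the distributions of all actions, weighted by a softmax
# over their expected values under the categorical support.

def expected_sarsa_target(next_dists, support, tau=1.0):
    """next_dists: (num_actions, num_atoms) categorical distributions.
    Returns a single (num_atoms,) mixture distribution."""
    q_values = next_dists @ support               # expected value per action
    logits = q_values / tau
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax over actions
    return weights @ next_dists                   # softmax-weighted mixture

rng = np.random.default_rng(0)
support = np.linspace(-10.0, 10.0, 51)            # C51's 51 fixed atoms
next_dists = rng.dirichlet(np.ones(51), size=4)   # 4 candidate actions
target = expected_sarsa_target(next_dists, support)
print(round(target.sum(), 6))  # 1.0: still a valid distribution
```

As the temperature tau shrinks, the mixture collapses onto the greedy action's distribution, recovering the QL-C51 behavior as a limiting case.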

Updated: 2025-10-16 06:44:07

Categories: cs.LG,I.2.6

Download: http://arxiv.org/abs/2510.15006v1

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters -- for example, superheroes across comic and cinematic universes -- remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward decisions ("acting"). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.

Updated: 2025-10-16 06:39:27

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14351v1

LLM Based Bayesian Optimization for Prompt Search

Bayesian Optimization (BO) has been widely used to efficiently optimize expensive black-box functions with limited evaluations. In this paper, we investigate the use of BO for prompt engineering to enhance text classification with Large Language Models (LLMs). We employ an LLM-powered Gaussian Process (GP) as the surrogate model to estimate the performance of different prompt candidates. These candidates are generated by an LLM through the expansion of a set of seed prompts and are subsequently evaluated using an Upper Confidence Bound (UCB) acquisition function in conjunction with the GP posterior. The optimization process iteratively refines the prompts based on a subset of the data, aiming to improve classification accuracy while reducing the number of API calls by leveraging the prediction uncertainty of the LLM-based GP. The proposed BO-LLM algorithm is evaluated on two datasets, and its advantages are discussed in detail in this paper.
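
The acquisition step described above reduces to scoring each candidate prompt by its posterior mean plus a scaled posterior standard deviation. A minimal sketch, assuming the GP posterior statistics per candidate are already available (the kappa value is illustrative):

```python
import numpy as np

# Minimal UCB selection over prompt candidates, given a GP posterior mean
# and standard deviation per candidate. Names and kappa are assumptions
# for illustration, not from the paper.

def ucb_select(means, stds, kappa=2.0):
    """Pick the candidate index maximizing mean + kappa * std."""
    scores = np.asarray(means) + kappa * np.asarray(stds)
    return int(np.argmax(scores))

# Candidate 1 has a lower estimated accuracy but high uncertainty,
# so UCB chooses to evaluate it (exploration).
means = [0.80, 0.75, 0.78]
stds = [0.01, 0.10, 0.02]
print(ucb_select(means, stds))  # 1  (0.75 + 2*0.10 = 0.95 beats 0.82 and 0.82)
```

This is how the method trades off exploiting prompts that already look good against exploring prompts the surrogate is unsure about, which is what lets it cut down on API calls.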

Updated: 2025-10-16 06:37:22

Categories: cs.AI

Download: http://arxiv.org/abs/2510.04384v2

BinCtx: Multi-Modal Representation Learning for Robust Android App Behavior Detection

Mobile app markets host millions of apps, yet undesired behaviors (e.g., disruptive ads, illegal redirection, payment deception) remain hard to catch because they often do not rely on permission-protected APIs and can be easily camouflaged via UI or metadata edits. We present BINCTX, a learning approach that builds multi-modal representations of an app from (i) a global bytecode-as-image view that captures code-level semantics and family-style patterns, (ii) a contextual view (manifested actions, components, declared permissions, URL/IP constants) indicating how behaviors are triggered, and (iii) a third-party-library usage view summarizing invocation frequencies along inter-component call paths. The three views are embedded and fused to train a contextual-aware classifier. On real-world malware and benign apps, BINCTX attains a macro F1 of 94.73%, outperforming strong baselines by at least 14.92%. It remains robust under commercial obfuscation (F1 84% post-obfuscation) and is more resistant to adversarial samples than state-of-the-art bytecode-only systems.
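
The "bytecode-as-image" view amounts to reshaping raw bytes into a 2D grayscale matrix so a vision model can pick up code-level texture. A minimal sketch; the row width and zero-padding are assumptions, and the paper's exact preprocessing may differ:

```python
import numpy as np

# Sketch of a bytecode-as-image transform: raw bytes become a 2D uint8
# matrix (one byte per pixel). Width choice and padding are illustrative.

def bytes_to_image(raw: bytes, width: int = 16) -> np.ndarray:
    buf = np.frombuffer(raw, dtype=np.uint8)
    pad = (-len(buf)) % width          # zero-pad so the last row is full
    buf = np.pad(buf, (0, pad))
    return buf.reshape(-1, width)

img = bytes_to_image(b"\xde\xad\xbe\xef" * 10, width=8)
print(img.shape)  # (5, 8): 40 bytes laid out in 8-byte rows
```
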

Updated: 2025-10-16 06:29:06

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.14344v1

Attention-Aided MMSE for OFDM Channel Estimation: Learning Linear Filters with Attention

In orthogonal frequency division multiplexing (OFDM), accurate channel estimation is crucial. Classical signal processing based approaches, such as minimum mean-squared error (MMSE) estimation, often require second-order statistics that are difficult to obtain in practice. Recent deep neural network based methods have been introduced to address this; yet they often suffer from high inference complexity. This paper proposes an Attention-aided MMSE (A-MMSE), a novel model-based DNN framework that learns the optimal MMSE filter via the Attention Transformer. Once trained, the A-MMSE estimates the channel through a single linear operation, eliminating nonlinear activations during inference and thus reducing computational complexity. To enhance the learning efficiency of the A-MMSE, we develop a two-stage Attention encoder, designed to effectively capture the channel correlation structure. Additionally, a rank-adaptive extension of the proposed A-MMSE allows flexible trade-offs between complexity and channel estimation accuracy. Extensive simulations with 3GPP TDL channel models demonstrate that the proposed A-MMSE consistently outperforms other baseline methods in terms of normalized MSE across a wide range of signal-to-noise ratio (SNR) conditions. In particular, the A-MMSE and its rank-adaptive extension establish a new frontier in the performance-complexity trade-off, providing a powerful yet highly efficient solution for practical channel estimation.
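
For reference, the classical linear MMSE filter the abstract contrasts against is a single matrix product once the second-order statistics are known. This sketch is the textbook baseline under a simple pilot model y = h + n, not the learned A-MMSE model, whose point is precisely to learn such a filter without those statistics:

```python
import numpy as np

# Textbook linear MMSE filtering: given cross-covariance R_hy and
# observation covariance R_yy, the estimate is h_hat = W @ y with
# W = R_hy @ inv(R_yy) - a single linear operation at inference time.

def lmmse_filter(R_hy, R_yy):
    return R_hy @ np.linalg.inv(R_yy)

rng = np.random.default_rng(0)
n = 8
H = rng.standard_normal((n, n))
R_hh = H @ H.T                        # a valid (PSD) channel covariance
noise_var = 0.1
R_yy = R_hh + noise_var * np.eye(n)   # covariance of y = h + n
R_hy = R_hh                           # cross-covariance for y = h + n
W = lmmse_filter(R_hy, R_yy)
print(W.shape)  # (8, 8): estimation is one matrix-vector product h_hat = W @ y
```

In the noiseless limit the filter reduces to the identity, which is a quick sanity check on the formula.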

Updated: 2025-10-16 06:28:28

Categories: eess.SP,cs.AI,stat.ML

Download: http://arxiv.org/abs/2506.00452v2

Jet Functors and Weil Algebras in Automatic Differentiation: A Geometric Analysis

We present a geometric formulation of automatic differentiation (AD) using jet bundles and Weil algebras. Reverse-mode AD emerges as cotangent-pullback, while Taylor-mode corresponds to evaluation in a Weil algebra. From these principles, we derive concise statements on correctness, stability, and complexity: a functorial identity for reverse-mode, algebraic exactness of higher-order derivatives, and explicit bounds on truncation error. We further show that tensorized Weil algebras permit one-pass computation of all mixed derivatives with cost linear in the algebra dimension, avoiding the combinatorial blow-up of nested JVP/VJP schedules. This framework interprets AD theory through the lens of differential geometry and offers a foundation for developing structure-preserving differentiation methods in deep learning and scientific computing. Code and examples are available at https://git.nilu.no/geometric-ad/jet-weil-ad.
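
The Weil-algebra view of Taylor-mode AD can be made concrete at first order: the Weil algebra R[e]/(e^2) is just the dual numbers, and evaluating a program on them computes the derivative exactly. A minimal sketch, truncated to order 1 (the paper's framework covers arbitrary jets):

```python
# Dual numbers: pairs (a, b) representing a + b*e with e^2 = 0.
# Evaluating f on Dual(x, 1) yields f(x) and f'(x) in one pass.

class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b e)(c + d e) = ac + (ad + bc) e, since e^2 = 0
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    return f(Dual(x, 1.0)).eps

f = lambda x: 3 * x * x + 2 * x + 1
print(derivative(f, 3.0))  # 20.0: d/dx (3x^2 + 2x + 1) = 6x + 2
```

Higher-order and mixed derivatives correspond to richer (tensorized) Weil algebras, which is the one-pass computation the abstract refers to.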

Updated: 2025-10-16 06:25:24

Categories: cs.LG,math.DG,stat.ML

Download: http://arxiv.org/abs/2510.14342v1

A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation

Objective: Renal cancer is a common malignancy and a major cause of cancer-related deaths. Computed tomography (CT) is central to early detection, staging, and treatment planning. However, the growing CT workload increases radiologists' burden and risks incomplete documentation. Automatically generating accurate reports remains challenging because it requires integrating visual interpretation with clinical reasoning. Advances in artificial intelligence (AI), especially large language and vision-language models, offer potential to reduce workload and enhance diagnostic quality. Methods: We propose a clinically informed, two-stage framework for automatic renal CT report generation. In Stage 1, a multi-task learning model detects structured clinical features from each 2D image. In Stage 2, a vision-language model generates free-text reports conditioned on the image and the detected features. To evaluate clinical fidelity, generated clinical features are extracted from the reports and compared with expert-annotated ground truth. Results: Experiments on an expert-labeled dataset show that incorporating detected features improves both report quality and clinical accuracy. The model achieved an average AUC of 0.75 for key imaging features and a METEOR score of 0.33, demonstrating higher clinical consistency and fewer template-driven errors. Conclusion: Linking structured feature detection with conditioned report generation provides a clinically grounded approach to integrate structured prediction and narrative drafting for renal CT reporting. This method enhances interpretability and clinical faithfulness, underscoring the value of domain-relevant evaluation metrics for medical AI development.

Updated: 2025-10-16 06:21:00

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2506.23584v2

A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities

Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.

Updated: 2025-10-16 06:20:14

Categories: eess.IV,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14340v1

Stop-RAG: Value-Based Retrieval Control for Iterative RAG

Iterative retrieval-augmented generation (RAG) enables large language models to answer complex multi-hop questions, but each additional loop increases latency, costs, and the risk of introducing distracting evidence, motivating the need for an efficient stopping strategy. Existing methods either use a predetermined number of iterations or rely on confidence proxies that poorly reflect whether more retrieval will actually help. We cast iterative RAG as a finite-horizon Markov decision process and introduce Stop-RAG, a value-based controller that adaptively decides when to stop retrieving. Trained with full-width forward-view Q(λ) targets from complete trajectories, Stop-RAG learns effective stopping policies while remaining compatible with black-box APIs and existing pipelines. On multi-hop question-answering benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines and prompting-based stopping with LLMs. These results highlight adaptive stopping as a key missing component in current agentic systems, and demonstrate that value-based control can improve the accuracy of RAG systems.
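
The forward-view lambda-return targets used for this kind of value training can be computed from a complete trajectory with a single backward pass. A generic RL sketch, with illustrative reward and value numbers rather than Stop-RAG's actual settings:

```python
# Lambda-return targets from a complete trajectory (generic computation).
# values[t] estimates the return from step t; values has len(rewards)+1
# entries, with values[-1] the terminal bootstrap (0 if truly terminal).

def lambda_returns(rewards, values, gamma=1.0, lam=0.9):
    """Backward recursion: G_t = r_t + gamma*((1-lam)*V[t+1] + lam*G_{t+1})."""
    g = values[-1]
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        out[t] = g
    return out

# lam=1 recovers the Monte-Carlo return; lam=0 the one-step TD target.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
print(lambda_returns(rewards, values, gamma=1.0, lam=1.0))  # [1.0, 1.0, 1.0]
print(lambda_returns(rewards, values, gamma=1.0, lam=0.0))  # [0.5, 0.5, 1.0]
```

Because the targets come from completed trajectories, the controller can be trained offline against a black-box pipeline, consistent with the compatibility claim above.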

Updated: 2025-10-16 06:17:38

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.14337v1

DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis

Graph Transformers (GTs) have emerged as powerful architectures for graph-structured data, yet remain constrained by rigid designs and lack quantifiable interpretability. Current state-of-the-art GTs commit to fixed GNN types across all layers, missing potential benefits of depth-specific component selection, while their complex architectures become opaque where performance gains cannot be distinguished between meaningful patterns and spurious correlations. We redesign GT attention through asymmetry, decoupling structural encoding from feature representation: queries derive from node features while keys and values come from GNN transformations. Within this framework, we use Differentiable ARchiTecture Search (DARTS) to select optimal GNN operators at each layer, enabling depth-wise heterogeneity inside transformer attention itself (DARTS-GT). To understand discovered architectures, we develop the first quantitative interpretability framework for GTs through causal ablation. Our metrics (Head-deviation, Specialization, and Focus), identify which heads and nodes drive predictions while enabling model comparison. Experiments across eight benchmarks show DARTS-GT achieves state-of-the-art on four datasets while remaining competitive on others, with discovered architectures revealing dataset-specific patterns. Our interpretability analysis reveals that visual attention salience and causal importance do not always correlate, indicating widely used visualization approaches may miss components that actually matter. Crucially, heterogeneous architectures found by DARTS-GT consistently produced more interpretable models than baselines, establishing that Graph Transformers need not choose between performance and interpretability.
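
The DARTS relaxation that makes the per-layer operator choice searchable is a softmax-weighted mixture over candidate operators. A toy sketch with scalar functions standing in for GNN layers (the candidate ops and weights here are illustrative, not the paper's search space):

```python
import numpy as np

# DARTS-style mixed operation: the layer output is a softmax(alpha)-weighted
# sum over candidate ops, so the discrete architecture choice becomes a
# differentiable parameter alpha. Toy ops stand in for GNN operators.

def mixed_op(x, ops, alpha):
    w = np.exp(alpha - np.max(alpha))
    w /= w.sum()                        # softmax over architecture parameters
    return sum(wi * op(x) for wi, op in zip(w, ops))

ops = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]  # identity / scale / zero
x = np.ones(3)
# When one architecture weight dominates, the mixture collapses to that op.
alpha = np.array([0.0, 100.0, 0.0])
print(mixed_op(x, ops, alpha))  # ~[2, 2, 2]: the scaling op wins
```

After search, the argmax over alpha at each depth is kept, which is what allows a different GNN operator per layer.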

Updated: 2025-10-16 06:15:42

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14336v1

A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer's Disease

Early detection of Alzheimer's Disease (AD) is greatly beneficial to AD patients, leading to early treatments that lessen symptoms and alleviate the financial burden of health care. As one of the leading signs of AD, changes in language capability can be used for early diagnosis of AD. In this paper, we develop a robust classification method using hybrid word embedding and fine-tuned hyperparameters to achieve state-of-the-art accuracy in the early detection of AD. Specifically, we create a hybrid word embedding based on word vectors from Doc2Vec and ELMo to obtain perplexity scores of the sentences. The scores identify whether a sentence is fluent or not and capture the semantic context of the sentences. We enrich the word embedding by adding linguistic features to analyze syntax and semantics. Further, we input an embedded feature vector into logistic regression and fine-tune hyperparameters throughout the pipeline. By tuning hyperparameters of the machine learning pipeline (e.g., model regularization parameter, learning rate and vector size of Doc2Vec, and vector size of ELMo), we achieve 91% classification accuracy and an Area Under the Curve (AUC) of 97% in distinguishing early AD from healthy subjects. To our knowledge, this model, with 91% accuracy and 97% AUC, outperforms the best existing NLP model for AD diagnosis, which reports an accuracy of 88% [32]. We study the model stability through repeated experiments and find that the model is stable even though the training data is split randomly (standard deviation of accuracy = 0.0403; standard deviation of AUC = 0.0174). This affirms that the proposed method is accurate and stable. The model can be used as a large-scale screening method for AD, as well as a complementary examination for doctors to detect AD.
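
The fluency score at the heart of the pipeline is standard sentence perplexity: the exponential of the mean negative log-likelihood of the tokens. A minimal sketch; the paper derives token probabilities from Doc2Vec/ELMo models, and the numbers below are made up for illustration:

```python
import math

# Sentence perplexity from per-token probabilities. Lower perplexity
# means the sentence is more fluent under the language model.

def perplexity(token_probs):
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

fluent = [0.9, 0.8, 0.85, 0.9]       # high-probability, fluent sentence
disfluent = [0.2, 0.05, 0.1, 0.15]   # low-probability, disfluent sentence
print(perplexity(fluent) < perplexity(disfluent))  # True
```

These per-sentence scores then become features alongside the linguistic features before the logistic-regression stage.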

Updated: 2025-10-16 06:10:31

Categories: cs.CL,cs.AI,cs.LG,eess.AS,I.2.7; I.2.6

Download: http://arxiv.org/abs/2510.14332v1

LLM-ERM: Sample-Efficient Program Learning via LLM-Guided Search

We seek algorithms for program learning that are both sample-efficient and computationally feasible. Classical results show that targets admitting short program descriptions (e.g., with short "Python code") can be learned with a "small" number of examples (scaling with the size of the code) via length-first program enumeration, but the search is exponential in description length. Gradient-based training avoids this cost, yet can require exponentially many samples on certain short-program families. To address this gap, we introduce LLM-ERM, a propose-and-verify framework that replaces exhaustive enumeration with an LLM-guided search over candidate programs while retaining ERM-style selection on held-out data. Specifically, we draw k candidates with a pretrained reasoning-augmented LLM, compile and check each on the data, and return the best verified hypothesis, with no feedback, adaptivity, or gradients. Theoretically, we show that coordinate-wise online mini-batch SGD requires many samples to learn certain short programs. Empirically, LLM-ERM solves tasks such as parity variants, pattern matching, and primality testing with as few as 200 samples, while SGD-trained transformers overfit even with 100,000 samples. These results indicate that language-guided program synthesis recovers much of the statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond the reach of gradient-based training.
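
The propose-and-verify loop can be sketched directly: draw k candidate programs, compile each, score it on data, and keep the best verified hypothesis. Here fixed source strings stand in for LLM proposals, and the task (bit parity) is an illustrative assumption:

```python
# Sketch of propose-and-verify selection: no feedback, adaptivity, or
# gradients - just compile each candidate and keep the best on the data.

def propose_and_verify(candidates, examples):
    """candidates: source strings for a function f(x); examples: (x, y) pairs."""
    best_fn, best_acc = None, -1.0
    for src in candidates:
        try:
            fn = eval(src)                    # "compile" the proposed program
            acc = sum(fn(x) == y for x, y in examples) / len(examples)
        except Exception:
            continue                          # discard programs that fail to run
        if acc > best_acc:
            best_fn, best_acc = fn, acc
    return best_fn, best_acc

# Target concept: parity of the bits of x. One proposal is plausible but
# wrong, one is correct, one crashes and is filtered by verification.
candidates = [
    "lambda x: x % 2",
    "lambda x: bin(x).count('1') % 2",
    "lambda x: undefined_name(x)",
]
data = [(3, 0), (7, 1), (8, 1), (6, 0)]
fn, acc = propose_and_verify(candidates, data)
print(acc)  # 1.0: the correct parity program is selected
```

The statistical efficiency comes from the ERM-style selection over a small candidate class; the LLM only shrinks the search from exhaustive enumeration to k proposals.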

Updated: 2025-10-16 06:10:11

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14331v1

KScope: A Framework for Characterizing the Knowledge Status of Language Models

Characterizing a large language model's (LLM's) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model's internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.

Updated: 2025-10-16 06:05:43

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.07458v2

TangledFeatures: Robust Feature Selection in Highly Correlated Spaces

Feature selection is a fundamental step in model development, shaping both predictive performance and interpretability. Yet, most widely used methods focus on predictive accuracy, and their performance degrades in the presence of correlated predictors. To address this gap, we introduce TangledFeatures, a framework for feature selection in correlated feature spaces. It identifies representative features from groups of entangled predictors, reducing redundancy while retaining explanatory power. The resulting feature subset can be directly applied in downstream models, offering a more interpretable and stable basis for analysis compared to traditional selection techniques. We demonstrate the effectiveness of TangledFeatures on Alanine Dipeptide, applying it to the prediction of backbone torsional angles and show that the selected features correspond to structurally meaningful intra-atomic distances that explain variation in these angles.
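
One simple way to realize "one representative per group of entangled predictors" is to greedily keep a feature only if it is not highly correlated with any feature already kept. This is an illustrative take, not necessarily TangledFeatures' actual grouping or representative choice:

```python
import numpy as np

# Greedy correlation-based deduplication: keep column j only if its
# absolute correlation with every already-kept column is below a threshold.
# (Illustrative assumption; the paper's method may group differently.)

def select_representatives(X, threshold=0.9):
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)        # j is not redundant with any kept feature
    return kept

rng = np.random.default_rng(1)
a = rng.standard_normal(200)
b = rng.standard_normal(200)
# Column 1 is a near-duplicate of column 0; column 2 is independent.
X = np.column_stack([a, a + 0.01 * rng.standard_normal(200), b])
print(select_representatives(X))  # [0, 2]: the near-duplicate is dropped
```

The kept indices form a reduced, less redundant feature subset that downstream models can use directly.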

Updated: 2025-10-16 05:54:04

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.15005v1

On Theoretical Interpretations of Concept-Based In-Context Learning

In-Context Learning (ICL) has emerged as an important new paradigm in natural language processing and large language model (LLM) applications. However, the theoretical understanding of the ICL mechanism remains limited. This paper aims to investigate this issue by studying a particular ICL approach, called concept-based ICL (CB-ICL). In particular, we propose theoretical analyses on applying CB-ICL to ICL tasks, which explains why and when the CB-ICL performs well for predicting query labels in prompts with only a few demonstrations. In addition, the proposed theory quantifies the knowledge that can be leveraged by the LLMs to the prompt tasks, and leads to a similarity measure between the prompt demonstrations and the query input, which provides important insights and guidance for model pre-training and prompt engineering in ICL. Moreover, the impact of the prompt demonstration size and the dimension of the LLM embeddings in ICL are also explored based on the proposed theory. Finally, several real-data experiments are conducted to validate the practical usefulness of CB-ICL and the corresponding theory.

Updated: 2025-10-16 05:50:52

Categories: cs.IT,cs.AI,cs.CL,math.IT

Download: http://arxiv.org/abs/2509.20882v2

On Equivariance and Fast Sampling in Video Diffusion Models Trained with Warped Noise

Temporally consistent video-to-video generation is critical for applications such as style transfer and upsampling. In this paper, we provide a theoretical analysis of warped noise - a recently proposed technique for training video diffusion models - and show that pairing it with the standard denoising objective implicitly trains models to be equivariant to spatial transformations of the input noise, which we term EquiVDM. This equivariance enables motion in the input noise to align naturally with motion in the generated video, yielding coherent, high-fidelity outputs without the need for specialized modules or auxiliary losses. A further advantage is sampling efficiency: EquiVDM achieves comparable or superior quality in far fewer sampling steps. When distilled into one-step student models, EquiVDM preserves equivariance and delivers stronger motion controllability and fidelity than distilled nonequivariant baselines. Across benchmarks, EquiVDM consistently outperforms prior methods in motion alignment, temporal consistency, and perceptual quality, while substantially lowering sampling cost.

Updated: 2025-10-16 05:46:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.09789v2

Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction

Large Language Model based multi-agent systems (MAS) excel at collaborative problem solving but remain brittle to cascading errors: a single faulty step can propagate across agents and disrupt the trajectory. In this paper, we present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction. MASC rethinks detection as history-conditioned anomaly scoring via two complementary designs: (1) Next-Execution Reconstruction, which predicts the embedding of the next step from the query and interaction history to capture causal consistency, and (2) Prototype-Guided Enhancement, which learns a prototype prior over normal-step embeddings and uses it to stabilize reconstruction and anomaly scoring under sparse context (e.g., early steps). When an anomalous step is flagged, MASC triggers a correction agent to revise the acting agent's output before information flows downstream. On the Who&When benchmark, MASC consistently outperforms all baselines, improving step-level error detection by up to 8.47% AUC-ROC; when plugged into diverse MAS frameworks, it delivers consistent end-to-end gains across architectures, confirming that our metacognitive monitoring and targeted correction can mitigate error propagation with minimal overhead.
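A rough sketch of the detection idea (the predictor, prototype, and blending rule below are simplified stand-ins, not the paper's learned components): score each step by its distance to a history-conditioned prediction that falls back on a prototype prior when context is sparse.

```python
def predict_next(history, prototype, alpha=0.5):
    """Predict the next-step embedding from the interaction history,
    blended with a prototype prior that dominates under sparse context."""
    if not history:
        return prototype[:]
    mean = [sum(dim) / len(history) for dim in zip(*history)]
    w = min(1.0, len(history) * alpha)  # trust history more as it grows
    return [w * m + (1 - w) * p for m, p in zip(mean, prototype)]

def anomaly_score(step, history, prototype):
    """Reconstruction error of the actual step against the prediction."""
    pred = predict_next(history, prototype)
    return sum((s - p) ** 2 for s, p in zip(step, pred)) ** 0.5

prototype = [0.0, 0.0]
history = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
normal_step, faulty_step = [0.05, 0.05], [3.0, -2.5]
print(anomaly_score(normal_step, history, prototype) <
      anomaly_score(faulty_step, history, prototype))  # True
```

A step whose score exceeds a threshold would then trigger the correction agent before its output flows downstream.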

Updated: 2025-10-16 05:35:37

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14319v1

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.

Updated: 2025-10-16 05:29:36

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14318v1

Agentic Misalignment: How LLMs Could Be Insider Threats

We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction. In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals - including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment. Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real. We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers (Amodei, 2025). We are releasing our methods publicly to enable further research.

Updated: 2025-10-16 05:26:52

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.05179v2

Column Generation Using Domain-Independent Dynamic Programming

Column generation and branch-and-price are leading methods for large-scale exact optimization. Column generation iterates between solving a master problem and a pricing problem. The master problem is a linear program, which can be solved using a generic solver. The pricing problem is highly dependent on the application but is usually discrete. Due to the difficulty of discrete optimization, high-performance column generation often relies on a custom pricing algorithm built specifically to exploit the problem's structure. This bespoke nature of the pricing solver prevents the reuse of components for other applications. We show that domain-independent dynamic programming, a software package for modeling and solving arbitrary dynamic programs, can be used as a generic pricing solver. We develop basic implementations of branch-and-price with pricing by domain-independent dynamic programming and show that they outperform a world-leading solver on static mixed integer programming formulations for seven problem classes.
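The column generation iteration described above is standard and can be sketched schematically; `solve_master` and `solve_pricing` below are illustrative stubs (a real implementation would call an LP solver for the master and, in this paper's setting, a domain-independent dynamic programming solver for pricing):

```python
def solve_master(columns):
    """Placeholder master LP: a real solver would return dual prices.
    Here the duals shrink as columns accumulate, just to drive the loop."""
    return [1.0 / len(columns)] * 3

def solve_pricing(duals):
    """Placeholder pricing problem: fabricates a column and its reduced cost.
    A negative reduced cost means an improving column exists."""
    column = [round(d, 3) for d in duals]
    reduced_cost = 0.5 - sum(duals)
    return column, reduced_cost

columns = [[1, 0, 0]]                 # initial feasible columns
while True:
    duals = solve_master(columns)
    column, reduced_cost = solve_pricing(duals)
    if reduced_cost >= -1e-9:         # no improving column: master is optimal
        break
    columns.append(column)
print(len(columns))  # 6
```

The loop structure is what stays fixed across applications; swapping the pricing stub for a generic DP solver is exactly the reuse the paper argues for.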

Updated: 2025-10-16 05:23:50

Categories: math.OC,cs.AI,I.2.8

Download: http://arxiv.org/abs/2510.14317v1

Active Measuring in Reinforcement Learning With Delayed Negative Effects

Measuring states in reinforcement learning (RL) can be costly in real-world settings and may negatively influence future outcomes. We introduce the Actively Observable Markov Decision Process (AOMDP), where an agent not only selects control actions but also decides whether to measure the latent state. The measurement action reveals the true latent state but may have a negative delayed effect on the environment. We show that this reduced uncertainty may provably improve sample efficiency and increase the value of the optimal policy despite these costs. We formulate an AOMDP as a periodic partially observable MDP and propose an online RL algorithm based on belief states. To approximate the belief states, we further propose a sequential Monte Carlo method to jointly approximate the posterior of unknown static environment parameters and unobserved latent states. We evaluate the proposed algorithm in a digital health application, where the agent decides when to deliver digital interventions and when to assess users' health status through surveys.
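The belief-state approximation can be illustrated with a single sequential Monte Carlo step. This is a simplified sketch: the paper's method jointly tracks static environment parameters as well, which this toy omits.

```python
import random

def smc_update(particles, observation, likelihood, rng):
    """One sequential Monte Carlo step: reweight particles by the observation
    likelihood, then resample to approximate the posterior belief state."""
    weights = [likelihood(observation, s) for s in particles]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(particles, weights=probs, k=len(particles))

# Toy latent state in {0, 1}; a noisy measurement matches the state 90% of the time.
def likelihood(obs, state):
    return 0.9 if obs == state else 0.1

rng = random.Random(0)
belief = [rng.choice([0, 1]) for _ in range(1000)]   # uniform prior
belief = smc_update(belief, 1, likelihood, rng)       # measurement action observes 1
print(sum(belief) / len(belief))                      # posterior mass on state 1
```

In the AOMDP setting, taking the measurement action collapses this uncertainty at the cost of a delayed negative effect on the environment.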

Updated: 2025-10-16 05:21:36

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.14315v1

Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies

A multi-agent system (MAS) powered by large language models (LLMs) can automate tedious user tasks such as meeting scheduling that requires inter-agent collaboration. LLMs enable nuanced protocols that account for unstructured private data, user constraints, and preferences. However, this design introduces new risks, including misalignment and attacks by malicious parties that compromise agents or steal user data. In this paper, we propose the Terrarium framework for fine-grained study on safety, privacy, and security in LLM-based MAS. We repurpose the blackboard design, an early approach in multi-agent systems, to create a modular, configurable testbed for multi-agent collaboration. We identify key attack vectors such as misalignment, malicious agents, compromised communication, and data poisoning. We implement three collaborative MAS scenarios with four representative attacks to demonstrate the framework's flexibility. By providing tools to rapidly prototype, evaluate, and iterate on defenses and designs, Terrarium aims to accelerate progress toward trustworthy multi-agent systems.

Updated: 2025-10-16 05:19:13

Categories: cs.AI,cs.CL,cs.CR,I.2.7; I.2.11

Download: http://arxiv.org/abs/2510.14312v1

RepIt: Representing Isolated Targets to Steer Language Models

While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data, extending to underrepresented, data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.

Updated: 2025-10-16 05:13:27

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.13281v3

Real-Time Adaptive Motion Planning via Point Cloud-Guided, Energy-Based Diffusion and Potential Fields

Motivated by the problem of pursuit-evasion, we present a motion planning framework that combines energy-based diffusion models with artificial potential fields for robust real time trajectory generation in complex environments. Our approach processes obstacle information directly from point clouds, enabling efficient planning without requiring complete geometric representations. The framework employs classifier-free guidance training and integrates local potential fields during sampling to enhance obstacle avoidance. In dynamic scenarios, the system generates initial trajectories using the diffusion model and continuously refines them through potential field-based adaptation, demonstrating effective performance in pursuit-evasion scenarios with partial pursuer observability.

Updated: 2025-10-16 05:13:04

Categories: cs.RO,cs.LG

Download: http://arxiv.org/abs/2507.09383v3

Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping

Modern foundation models often undergo iterative "bootstrapping" in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model performance improves, raising a crucial question: How should the total budget for generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework for analyzing budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies -- particularly exponential growth policies -- exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.
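A budget-allocation policy here is simply a rule for splitting the total generation/training budget across iterations. A minimal sketch contrasting the policy families discussed (the quadratic polynomial and base-2 exponential schedules are illustrative choices, not the paper's exact parameterization):

```python
def budgets(total, iterations, policy):
    """Split a total sample budget across bootstrapping iterations.
    'constant' gives equal shares; growing policies front-load later rounds."""
    if policy == "constant":
        raw = [1.0] * iterations
    elif policy == "polynomial":
        raw = [float((t + 1) ** 2) for t in range(iterations)]
    elif policy == "exponential":
        raw = [2.0 ** t for t in range(iterations)]
    else:
        raise ValueError(policy)
    scale = total / sum(raw)          # normalize so shares sum to the total
    return [r * scale for r in raw]

print(budgets(1500, 4, "constant"))     # [375.0, 375.0, 375.0, 375.0]
print(budgets(1500, 4, "exponential"))  # [100.0, 200.0, 400.0, 800.0]
```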

Updated: 2025-10-16 05:12:13

Categories: cs.LG

Download: http://arxiv.org/abs/2501.18962v2

MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin

Updated: 2025-10-16 05:06:54

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14307v1

Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Recent advances in generative artificial intelligence applications have raised new data security concerns. This paper focuses on defending diffusion models against membership inference attacks. This type of attack occurs when the attacker can determine if a certain data point was used to train the model. Although diffusion models are intrinsically more resistant to membership inference attacks than other generative models, they are still susceptible. The defense proposed here utilizes critically-damped higher-order Langevin dynamics, which introduces several auxiliary variables and a joint diffusion process along these variables. The idea is that the presence of auxiliary variables mixes external randomness that helps to corrupt sensitive input data earlier on in the diffusion process. This concept is theoretically investigated and validated on a toy dataset and a speech dataset using the Area Under the Receiver Operating Characteristic (AUROC) curves and the FID metric.

Updated: 2025-10-16 05:00:29

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2509.14225v2

Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
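The abstract does not give the exact combination rule, but a common form of contrastive decoding amplifies what a mature layer scores above an amateur layer; the sketch below adds a pivot-layer term as one hypothetical way to anchor the result in the visually grounded layer. All numbers and the rule itself are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tri_layer_contrast(mature, amateur, pivot, alpha=1.0, beta=0.5):
    """Hypothetical tri-layer rule: amplify what the mature layer knows
    beyond the amateur layer, anchored by the visually grounded pivot layer."""
    return [m + alpha * (m - a) + beta * p
            for m, a, p in zip(mature, amateur, pivot)]

# Toy 3-token vocabulary; the amateur layer over-scores token 2 (a hallucination).
mature  = [2.0, 0.5, 1.0]
amateur = [1.0, 0.5, 2.5]
pivot   = [2.2, 0.4, 0.3]
probs = softmax(tri_layer_contrast(mature, amateur, pivot))
print(probs.index(max(probs)))  # 0: the grounded token wins
```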

Updated: 2025-10-16 04:58:45

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14304v1

Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers

In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to track the latest research findings and methodologies in a timely and comprehensive manner. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper builds on the OpenAlex open-source knowledge graph. By analyzing nearly 8,000 open-source paper records from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.

Updated: 2025-10-16 04:58:28

Categories: cs.CL,cs.LG,I.2.7

Download: http://arxiv.org/abs/2510.14303v1

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from 26.0% to 28.0%.
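The null-space idea can be illustrated in one dimension: removing from an update its component along a "harmful-prompt" direction guarantees that direction's response is untouched. A toy sketch (the paper's projector is built from actual harmful-prompt activations; the vectors here are made up):

```python
def project_out(update, direction):
    """Remove from `update` its component along `direction`, so the projected
    update lies in the null space of that direction (toy 1-D version of the
    harmful-resistant null-space constraint)."""
    dot = sum(u * d for u, d in zip(update, direction))
    norm2 = sum(d * d for d in direction)
    return [u - (dot / norm2) * d for u, d in zip(update, direction)]

harmful_feature = [1.0, 2.0, -1.0]     # activation direction of a harmful prompt
raw_update      = [0.5, -0.3, 0.8]
safe_update = project_out(raw_update, harmful_feature)

# The projected update is orthogonal to the harmful direction, so a linear
# layer's response to that feature is unchanged by the fine-tuning step.
residual = sum(u * d for u, d in zip(safe_update, harmful_feature))
print(abs(residual) < 1e-12)  # True
```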

Updated: 2025-10-16 04:57:53

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14301v1

Uncertainty Quantification with the Empirical Neural Tangent Kernel

While neural networks have demonstrated impressive performance across various tasks, accurately quantifying uncertainty in their predictions is essential to ensure their trustworthiness and enable widespread adoption in critical systems. Several Bayesian uncertainty quantification (UQ) methods exist that are either cheap or reliable, but not both. We propose a post-hoc, sampling-based UQ method for over-parameterized networks at the end of training. Our approach constructs efficient and meaningful deep ensembles by employing a (stochastic) gradient-descent sampling process on appropriately linearized networks. We demonstrate that our method effectively approximates the posterior of a Gaussian process using the empirical Neural Tangent Kernel. Through a series of numerical experiments, we show that our method not only outperforms competing approaches in computational efficiency-often reducing costs by multiple factors-but also maintains state-of-the-art performance across a variety of UQ metrics for both regression and classification tasks.

Updated: 2025-10-16 04:54:00

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2502.02870v2

Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models, and scales up the action expert by substituting the feedforward layers into sparsely activated MoE layers. AdaMoE employs a decoupling technique that decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize. Instead, through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
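The decoupling of expert selection from expert weighting can be sketched as follows; the softmax router, sigmoid scale adapter, and top-k rule are plausible simplifications for illustration, not the paper's exact design:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decoupled_moe(router_logits, scale_logits, expert_outputs, k=2):
    """Selection and weighting are decoupled: the router only picks which k
    experts run; an independent scale adapter sets each one's contribution,
    so no single expert has to win the routing competition outright."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    weights = {i: sigmoid(scale_logits[i]) for i in chosen}
    return [sum(weights[i] * expert_outputs[i][d] for i in chosen)
            for d in range(len(expert_outputs[0]))]

router = [2.0, 1.5, -1.0]        # experts 0 and 1 are selected...
scales = [-2.0, 3.0, 0.0]        # ...but expert 1 contributes the larger weight
experts = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(decoupled_moe(router, scales, experts))
```

Note how expert 1 dominates the output despite losing the routing competition to expert 0, which is the collaborative (rather than winner-takes-all) behavior the abstract describes.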

Updated: 2025-10-16 04:52:57

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.14300v1

TED++: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening

As deep neural networks power increasingly critical applications, stealthy backdoor attacks, where poisoned training inputs trigger malicious model behaviour while appearing benign, pose a severe security risk. Many existing defences are vulnerable when attackers exploit subtle distance-based anomalies or when clean examples are scarce. To meet this challenge, we introduce TED++, a submanifold-aware framework that effectively detects subtle backdoors that evade existing defences. TED++ begins by constructing a tubular neighbourhood around each class's hidden-feature manifold, estimating its local "thickness" from a handful of clean activations. It then applies Locally Adaptive Ranking (LAR) to detect any activation that drifts outside the admissible tube. By aggregating these LAR-adjusted ranks across all layers, TED++ captures how faithfully an input remains on the evolving class submanifolds. Based on such characteristic "tube-constrained" behaviour, TED++ flags inputs whose LAR-based ranking sequences deviate significantly. Extensive experiments are conducted on benchmark datasets and tasks, demonstrating that TED++ achieves state-of-the-art detection performance under both adaptive-attack and limited-data scenarios. Remarkably, even with only five held-out examples per class, TED++ still delivers near-perfect detection, achieving gains of up to 14% in AUROC over the next-best method. The code is publicly available at https://github.com/namle-w/TEDpp.
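A simplified sketch of the layerwise rank-aggregation idea (the real method estimates tube thickness from clean activations and uses a more refined ranking; the distances and slack factor here are toy stand-ins):

```python
def lar_rank(value, clean_values, slack=1.0):
    """Locally adaptive rank: position of an activation's distance among
    clean distances, with the tube 'thickness' widened by a slack factor."""
    return sum(1 for c in clean_values if value > slack * c)

def tube_score(sample_dists, clean_dists_per_layer):
    """Aggregate per-layer ranks; inputs drifting off the class submanifold
    accumulate high ranks across layers."""
    return sum(lar_rank(d, clean) for d, clean in
               zip(sample_dists, clean_dists_per_layer))

# Per-layer distances to the class feature manifold (toy numbers),
# estimated from only three clean examples per layer.
clean = [[0.2, 0.3, 0.25], [0.1, 0.15, 0.12], [0.3, 0.4, 0.35]]
benign_input   = [0.22, 0.11, 0.33]
backdoor_input = [0.9, 0.8, 1.2]   # drifts outside the tube at every layer
print(tube_score(benign_input, clean), tube_score(backdoor_input, clean))  # 3 9
```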

Updated: 2025-10-16 04:51:25

Categories: cs.LG,cs.AI,68T07, 62H30, 53Z50,I.2.6; I.5.1; K.6.5

Download: http://arxiv.org/abs/2510.14299v1

VeilAudit: Breaking the Deadlock Between Privacy and Accountability Across Blockchains

Cross chain interoperability in blockchain systems exposes a fundamental tension between user privacy and regulatory accountability. Existing solutions enforce an all or nothing choice between full anonymity and mandatory identity disclosure, which limits adoption in regulated financial settings. We present VeilAudit, a cross chain auditing framework that introduces Auditor Only Linkability, which allows auditors to link transaction behaviors that originate from the same anonymous entity without learning its identity. VeilAudit achieves this with a user generated Linkable Audit Tag that embeds a zero knowledge proof to attest to its validity without exposing the user's master wallet address, and with a special ciphertext that only designated auditors can test for linkage. To balance privacy and compliance, VeilAudit also supports threshold gated identity revelation under due process. VeilAudit further provides a mechanism for building reputation in pseudonymous environments, which enables applications such as cross chain credit scoring based on verifiable behavioral history. We formalize the security guarantees and develop a prototype that spans multiple EVM chains. Our evaluation shows that the framework is practical for today's multichain environments.

Updated: 2025-10-16 04:48:58

Domains: cs.CR

Download: http://arxiv.org/abs/2510.12153v2

Symmetry-Aware GFlowNets

Generative Flow Networks (GFlowNets) offer a powerful framework for sampling graphs in proportion to their rewards. However, existing approaches suffer from systematic biases due to inaccuracies in state transition probability computations. These biases, rooted in the inherent symmetries of graphs, impact both atom-based and fragment-based generation schemes. To address this challenge, we introduce Symmetry-Aware GFlowNets (SA-GFN), a method that incorporates symmetry corrections into the learning process through reward scaling. By integrating bias correction directly into the reward structure, SA-GFN eliminates the need for explicit state transition computations. Empirical results show that SA-GFN enables unbiased sampling while enhancing diversity and consistently generating high-reward graphs that closely match the target distribution.

Updated: 2025-10-16 04:42:38

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2506.02685v3
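The symmetry correction can be made concrete with a tiny stdlib-only sketch. Note the scaling rule shown here (dividing the reward by the automorphism count |Aut(G)|) is an illustrative assumption about the form of SA-GFN's reward scaling; the brute-force counter is only viable for tiny graphs.

```python
import itertools

def automorphism_count(adj):
    """Count graph automorphisms by brute force: vertex permutations that
    preserve the adjacency relation (fine for tiny graphs only)."""
    n = len(adj)
    count = 0
    for perm in itertools.permutations(range(n)):
        if all(adj[i][j] == adj[perm[i]][perm[j]]
               for i in range(n) for j in range(n)):
            count += 1
    return count

def symmetry_scaled_reward(reward, adj):
    """Scale the reward by 1/|Aut(G)| so that highly symmetric graphs, which
    are reachable along proportionally more generation trajectories, are not
    oversampled relative to the target distribution."""
    return reward / automorphism_count(adj)

# A triangle has |Aut| = 6; a path on 3 nodes has |Aut| = 2.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path     = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(automorphism_count(triangle), automorphism_count(path))  # 6 2
```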

R1-Ranker: Teaching LLM Rankers to Reason

Large language models (LLMs) have recently shown strong reasoning abilities in domains like mathematics, coding, and scientific problem-solving, yet their potential for ranking tasks, where prime examples include retrieval, recommender systems, and LLM routing, remains underexplored. Ranking requires complex reasoning across heterogeneous candidates, but existing LLM-based rankers are often domain-specific, tied to fixed backbones, and lack iterative refinement, limiting their ability to fully exploit LLMs' reasoning potential. To address these challenges, we propose R1-Ranker, a reasoning-incentive framework built on reinforcement learning, with two complementary designs: DRanker, which generates full rankings in one shot, and IRanker, which decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified R1-Rankers on nine datasets spanning recommendation, routing, and passage ranking, showing that IRanker-3B consistently achieves state-of-the-art performance, surpasses larger 7B models on some tasks, and yields a 15.7% average relative improvement. Ablation and generalization experiments further confirm the critical role of reinforcement learning and iterative reasoning, with IRanker-3B improving zero-shot performance by over 9% on out-of-domain tasks and reasoning traces boosting other LLMs by up to 22.87%. These results demonstrate that unifying diverse ranking tasks with a single reasoning-driven foundation model is both effective and essential for advancing LLM reasoning in ranking scenarios.

Updated: 2025-10-16 04:41:42

Domains: cs.IR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.21638v3
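The IRanker-style iterative elimination loop is easy to sketch. Here a plain function stands in for the reasoning LLM (the `score_worst` callable is a hypothetical stub, not the paper's model); the point is the control flow of eliminating one candidate per step and reading the ranking off in reverse.

```python
def iterative_rank(candidates, score_worst):
    """IRanker-style loop: at each step ask the model which remaining
    candidate is weakest, eliminate it, and build the ranking bottom-up."""
    pool = list(candidates)
    eliminated = []
    while len(pool) > 1:
        worst = score_worst(pool)      # one 'reasoning' call per elimination step
        pool.remove(worst)
        eliminated.append(worst)
    return pool + eliminated[::-1]     # best first, worst last

# Stand-in for the LLM: pretend relevance is the number at the end of the id.
docs = ["doc2", "doc9", "doc5", "doc1"]
ranking = iterative_rank(docs,
                         score_worst=lambda pool: min(pool, key=lambda d: int(d[3:])))
print(ranking)  # ['doc9', 'doc5', 'doc2', 'doc1']
```

Decomposing ranking this way gives the model a small, well-posed decision at every step, which is where the step-wise rewards described above attach.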

SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms

Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.

Updated: 2025-10-16 04:37:37

Domains: cs.CV,cs.GR,cs.LG,cs.RO

Download: http://arxiv.org/abs/2510.12901v2

Learning Human-Humanoid Coordination for Collaborative Object Carrying

Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.

Updated: 2025-10-16 04:36:25

Domains: cs.RO,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.14293v1

Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling

Samples of brain signals collected by EEG sensors have inherent anti-correlations that are well modeled by negative edges in a finite graph. To differentiate epilepsy patients from healthy subjects using collected EEG signals, we build lightweight and interpretable transformer-like neural nets by unrolling a spectral denoising algorithm for signals on a balanced signed graph -- graph with no cycles of odd number of negative edges. A balanced signed graph has well-defined frequencies that map to a corresponding positive graph via similarity transform of the graph Laplacian matrices. We implement an ideal low-pass filter efficiently on the mapped positive graph via Lanczos approximation, where the optimal cutoff frequency is learned from data. Given that two balanced signed graph denoisers learn posterior probabilities of two different signal classes during training, we evaluate their reconstruction errors for binary classification of EEG signals. Experiments show that our method achieves classification performance comparable to representative deep learning schemes, while employing dramatically fewer parameters.

Updated: 2025-10-16 04:34:31

Domains: cs.LG

Download: http://arxiv.org/abs/2510.03027v2
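The similarity transform at the heart of the balanced-signed-graph construction can be verified numerically. A minimal sketch on a 3-node balanced graph (the example graph and signs are chosen for illustration): conjugating the signed Laplacian by a diagonal ±1 matrix yields a valid positive-graph Laplacian with the identical spectrum, so graph frequencies are well defined.

```python
import numpy as np

# Balanced signed graph on 3 nodes: node sets {0, 1} and {2}. The edge inside
# a set is positive, edges across sets are negative, so every cycle has an
# even number of negative edges.
A = np.array([[ 0.,  1., -1.],
              [ 1.,  0., -1.],
              [-1., -1.,  0.]])
L_signed = np.diag(np.abs(A).sum(axis=1)) - A     # signed graph Laplacian

# Similarity transform with D = diag(+/-1): flip the sign of the second set.
D = np.diag([1., 1., -1.])
L_pos = D @ L_signed @ D                          # Laplacian of a positive graph

# The spectrum (the graph frequencies) is preserved, and the mapped graph is
# all-positive, so an ideal low-pass filter can be designed on L_pos.
off_diag = L_pos - np.diag(np.diag(L_pos))
print(np.all(off_diag <= 0),
      np.allclose(np.linalg.eigvalsh(L_signed), np.linalg.eigvalsh(L_pos)))
```

The paper then applies a Lanczos-approximated ideal low-pass filter on the mapped positive graph; the transform above is what makes that filter well defined.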

FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling

Federated Learning (FL) enables collaborative model training across distributed clients without data sharing, but its high computational and communication demands strain resource-constrained devices. While existing methods use dynamic pruning to improve efficiency by periodically adjusting sparse model topologies while maintaining sparsity, these approaches suffer from issues such as greedy adjustments, unstable topologies, and communication inefficiency, resulting in less robust models and suboptimal performance under data heterogeneity and partial client availability. To address these challenges, we propose Federated Robust pruning via combinatorial Thompson Sampling (FedRTS), a novel framework designed to develop robust sparse models. FedRTS enhances robustness and performance through its Thompson Sampling-based Adjustment (TSAdj) mechanism, which uses probabilistic decisions informed by stable, farsighted information instead of deterministic decisions reliant on unstable and myopic information in previous methods. Extensive experiments demonstrate that FedRTS achieves state-of-the-art performance in computer vision and natural language processing tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation. Our codes are available at: https://github.com/Little0o0/FedRTS

Updated: 2025-10-16 04:30:19

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2501.19122v3
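The contrast between greedy and Thompson-sampled topology adjustment can be sketched with Beta posteriors over individual connections. This is a toy simulation, not FedRTS's TSAdj: the Beta/Bernoulli model of per-connection usefulness and the aggregation of client feedback are illustrative assumptions.

```python
import numpy as np

def tsadj_round(alpha, beta, k, rng):
    """Thompson-sampling-style adjustment: draw a success probability for
    every connection from its Beta posterior, then keep the top-k draws as
    the active sparse topology (probabilistic, not greedy)."""
    theta = rng.beta(alpha, beta)            # one posterior sample per connection
    mask = np.zeros_like(theta, dtype=bool)
    mask[np.argsort(theta)[-k:]] = True
    return mask

def update_posterior(alpha, beta, mask, contribution):
    """Aggregated feedback ('did this connection help?') updates the Beta
    posteriors, accumulating farsighted rather than myopic evidence."""
    return alpha + mask * contribution, beta + mask * (1.0 - contribution)

rng = np.random.default_rng(0)
n, k = 10, 4
alpha, beta = np.ones(n), np.ones(n)           # uniform priors
for _ in range(50):                            # simulated rounds: weights 0-3 help
    mask = tsadj_round(alpha, beta, k, rng)
    contribution = (np.arange(n) < 4).astype(float)
    alpha, beta = update_posterior(alpha, beta, mask, contribution)
print(sorted(np.argsort(alpha / (alpha + beta))[-k:]))  # converges toward [0, 1, 2, 3]
```

Because decisions are sampled from accumulated posteriors rather than taken from a single round's magnitudes, the kept topology is stable under noisy, partial client participation.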

Enhancing Time-Series Anomaly Detection by Integrating Spectral-Residual Bottom-Up Attention with Reservoir Computing

Reservoir computing (RC) establishes the basis for the processing of time-series data by exploiting the high-dimensional spatiotemporal response of a recurrent neural network to an input signal. In particular, RC trains only the output layer weights. This simplicity has drawn attention especially in Edge Artificial Intelligence (AI) applications. Edge AI enables time-series anomaly detection in real time, which is important because detection delays can lead to serious incidents. However, achieving adequate anomaly-detection performance with RC alone may require an unacceptably large reservoir on resource-constrained edge devices. Without enlarging the reservoir, attention mechanisms can improve accuracy, although they may require substantial computation and undermine the learning efficiency of RC. In this study, to improve the anomaly detection performance of RC without sacrificing learning efficiency, we propose a spectral residual RC (SR-RC) that integrates the spectral residual (SR) method - a learning-free, bottom-up attention mechanism - with RC. We demonstrated that SR-RC outperformed conventional RC and logistic-regression models based on values extracted by the SR method across benchmark tasks and real-world time-series datasets. Moreover, because the SR method, similarly to RC, is well suited for hardware implementation, SR-RC suggests a practical direction for deploying RC as Edge AI for time-series anomaly detection.

Updated: 2025-10-16 04:17:30

Domains: cs.LG

Download: http://arxiv.org/abs/2510.14287v1
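The spectral residual (SR) method itself is learning-free and only a few lines long. A minimal 1-D NumPy sketch (window size and the synthetic anomaly are arbitrary choices for the demo): the log-amplitude spectrum minus its local average keeps only "unexpected" spectral content, and transforming back gives a saliency trace that spikes at the anomaly.

```python
import numpy as np

def spectral_residual(x, window=3):
    """Learning-free bottom-up attention: subtract the locally averaged
    log-amplitude spectrum, keep the original phase, and invert the FFT
    to obtain a saliency trace."""
    X = np.fft.fft(x)
    log_amp = np.log(np.abs(X) + 1e-8)
    avg = np.convolve(log_amp, np.ones(window) / window, mode="same")
    residual = log_amp - avg
    return np.abs(np.fft.ifft(np.exp(residual + 1j * np.angle(X))))

t = np.arange(256)
x = np.sin(2 * np.pi * t / 32.0)
x[100] += 4.0                        # inject a point anomaly
sal = spectral_residual(x)
print(int(np.argmax(sal)))           # saliency peaks at the injected anomaly
```

In SR-RC this saliency acts as a bottom-up attention signal feeding the reservoir, and, like the reservoir itself, it is cheap enough for hardware implementation.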

Stable Prediction of Adverse Events in Medical Time-Series Data

Early event prediction (EEP) systems continuously estimate a patient's imminent risk to support clinical decision-making. For bedside trust, risk trajectories must be accurate and temporally stable, shifting only with new, relevant evidence. However, current benchmarks (a) ignore stability of risk scores and (b) evaluate mainly on tabular inputs, leaving trajectory behavior untested. To address this gap, we introduce CAREBench, an EEP benchmark that evaluates deployability using multi-modal inputs (tabular EHR, ECG waveforms, and clinical text) and assesses temporal stability alongside predictive accuracy. We propose a stability metric that quantifies short-term variability in per-patient risk and penalizes abrupt oscillations based on local-Lipschitz constants. CAREBench spans six prediction tasks such as sepsis onset and compares classical learners, deep sequence models, and zero-shot LLMs. Across tasks, existing methods, especially LLMs, struggle to jointly optimize accuracy and stability, with notably poor recall at high-precision operating points. These results highlight the need for models that produce evidence-aligned, stable trajectories to earn clinician trust in continuous monitoring settings. (Code: https://github.com/SeewonChoi/CAREBench.)

Updated: 2025-10-16 04:16:54

Domains: cs.LG

Download: http://arxiv.org/abs/2510.14286v1
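A local-Lipschitz-flavored stability score is simple to sketch. This toy version (the window size and the averaging of per-window maxima are assumptions, not CAREBench's exact definition) shows the intended behavior: a smoothly drifting risk trajectory scores low, an oscillating one scores high.

```python
import numpy as np

def instability(risk, dt=1.0, window=5):
    """Local-Lipschitz sketch: the largest one-step rate of change inside
    each sliding window, averaged over windows. Smooth, evidence-driven
    drift scores low; abrupt oscillation scores high."""
    rates = np.abs(np.diff(np.asarray(risk))) / dt
    windows = [rates[i:i + window].max()
               for i in range(max(1, len(rates) - window + 1))]
    return float(np.mean(windows))

smooth = np.linspace(0.1, 0.9, 50)                  # gradual risk drift
jittery = smooth + 0.2 * (-1.0) ** np.arange(50)    # abrupt oscillations
print(instability(smooth) < instability(jittery))   # True
```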

Beyond a Single Perspective: Towards a Realistic Evaluation of Website Fingerprinting Attacks

Website Fingerprinting (WF) attacks exploit patterns in encrypted traffic to infer the websites visited by users, posing a serious threat to anonymous communication systems. Although recent WF techniques achieve over 90% accuracy in controlled experimental settings, most studies remain confined to single scenarios, overlooking the complexity of real-world environments. This paper presents the first systematic and comprehensive evaluation of existing WF attacks under diverse realistic conditions, including defense mechanisms, traffic drift, multi-tab browsing, early-stage detection, open-world settings, and few-shot scenarios. Experimental results show that many WF techniques with strong performance in isolated settings degrade significantly when facing other conditions. Since real-world environments often combine multiple challenges, current WF attacks are difficult to apply directly in practice. This study highlights the limitations of WF attacks and introduces a multidimensional evaluation framework, offering critical insights for developing more robust and practical WF attacks.

Updated: 2025-10-16 04:14:17

Domains: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14283v1

Socratic Mind: Impact of a Novel GenAI-Powered Assessment Tool on Student Learning and Higher-Order Thinking

This study examines the impact of Socratic Mind, a Generative Artificial Intelligence (GenAI) powered formative assessment tool that employs Socratic questioning to support student learning in a large, fully online undergraduate-level computing course. Employing a quasi-experimental, mixed-methods design, we investigated participants' engagement patterns, the influence of user experience on engagement, and impacts on both perceived and actual learning outcomes. Data were collected from the system logs, surveys on user experience and perceived engagement and learning gains, student reflections, and course performance data. Results indicated that participants consistently reported high levels of affective, behavioral, and cognitive engagement, and these were strongly linked to positive user experiences and perceived learning outcomes. Quantitative analysis further revealed that students who engaged with the GenAI tool experienced significant gains in their quiz scores compared to those who did not, particularly benefiting students with lower baseline achievement. Additionally, thematic analysis of qualitative feedback revealed substantial perceived improvements in higher-order thinking skills, including problem solving, critical thinking, and self-reflection. Our findings highlight the promise of AI-mediated dialogue in fostering deeper engagement and higher-order cognitive skills. As higher education institutions expand GenAI integration in curriculum, this dialogic, GenAI powered assessment tool can offer a scalable strategy to promote students' meaningful learning outcomes.

Updated: 2025-10-16 04:07:31

Domains: cs.CY,cs.AI

Download: http://arxiv.org/abs/2509.16262v2

PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering

Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG -- demonstrate that our approach consistently outperforms strong baselines.

Updated: 2025-10-16 04:02:29

Domains: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2510.14278v1
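The Analyzer/Selector/Adder loop can be sketched as plain control flow. Everything model-like here is a hypothetical keyword-overlap stub standing in for the three LLM agents (the corpus, stopword list, and bridging rule are invented for the demo); only the loop structure mirrors the framework described above.

```python
def agentic_retrieve(question, corpus, analyze, select, find_missing, max_rounds=3):
    """Agentic loop sketch: a Question Analyzer decomposes the query, a
    precision-focused Selector picks the best passage per sub-question, and a
    recall-focused Adder pulls in evidence the Selector missed, iterating
    until nothing is missing."""
    evidence = set()
    for sq in analyze(question):
        evidence |= select(sq, corpus)
    for _ in range(max_rounds):
        missing = find_missing(evidence, corpus)
        if not missing:
            break
        evidence |= missing
    return evidence

STOP = {"what", "where", "which", "was", "is", "in", "the", "of", "a"}
words = lambda s: set(s.lower().replace("?", "").replace(".", "").split()) - STOP

corpus = {
    "p1": "Marie Curie was born in Warsaw",
    "p2": "Marie Curie won the Nobel Prize in Physics",
    "p3": "Warsaw is the capital of Poland",
    "p4": "The Eiffel Tower is in Paris",
}
analyze = lambda q: ["Where was Marie Curie born?"]           # toy decomposition
select = lambda sq, c: {max(c, key=lambda p: len(words(sq) & words(c[p])))}
find_missing = lambda ev, c: {p for p in c if p not in ev and
                              any(words(c[p]) & words(c[e]) for e in ev)}

retrieved = sorted(agentic_retrieve("What country was Marie Curie born in?",
                                    corpus, analyze, select, find_missing))
print(retrieved)  # the Adder bridges to p3; the distractor p4 stays excluded
```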

RareAgent: Self-Evolving Reasoning for Drug Repurposing in Rare Diseases

Computational drug repurposing for rare diseases is especially challenging when no prior associations exist between drugs and target diseases. Therefore, knowledge graph completion and message-passing GNNs have little reliable signal to learn and propagate, resulting in poor performance. We present RareAgent, a self-evolving multi-agent system that reframes this task from passive pattern recognition to active evidence-seeking reasoning. RareAgent organizes task-specific adversarial debates in which agents dynamically construct evidence graphs from diverse perspectives to support, refute, or entail hypotheses. The reasoning strategies are analyzed post hoc in a self-evolutionary loop, producing textual feedback that refines agent policies, while successful reasoning paths are distilled into transferable heuristics to accelerate future investigations. Comprehensive evaluations reveal that RareAgent improves the indication AUPRC by 18.1% over reasoning baselines and provides a transparent reasoning chain consistent with clinical evidence.

Updated: 2025-10-16 03:52:44

Domains: cs.AI,cs.MA

Download: http://arxiv.org/abs/2510.05764v2

Membership Inference over Diffusion-models-based Synthetic Tabular Data

This study investigates the privacy risks associated with diffusion-based synthetic tabular data generation methods, focusing on their susceptibility to Membership Inference Attacks (MIAs). We examine two recent models, TabDDPM and TabSyn, by developing query-based MIAs based on the step-wise error comparison method. Our findings reveal that TabDDPM is more vulnerable to these attacks. TabSyn exhibits resilience against our attack models. Our work underscores the importance of evaluating the privacy implications of diffusion models and encourages further research into robust privacy-preserving mechanisms for synthetic data generation.

Updated: 2025-10-16 03:43:11

Domains: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.16037v1
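A step-wise error-comparison MIA can be sketched without a real diffusion model. The "denoiser" below is a toy stand-in that memorises training rows (mimicking overfit behaviour); the noise levels and the shrinkage form are assumptions for the demo. The attack logic is the real content: noise the record at several steps, denoise, and compare reconstruction errors, since members reconstruct better.

```python
import numpy as np

def stepwise_errors(x, denoise, sigmas, rng):
    """Query-based MIA sketch: noise the record at several diffusion steps,
    denoise it, and record the per-step reconstruction error."""
    errs = []
    for s in sigmas:
        noisy = x + s * rng.normal(size=x.shape)
        errs.append(np.linalg.norm(denoise(noisy, s) - x))
    return np.array(errs)

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 4))

def denoise(z, s):
    """Toy stand-in for a trained tabular diffusion model: pull the noisy
    record toward its nearest memorised training row (overfit behaviour)."""
    nearest = train[np.argmin(np.linalg.norm(train - z, axis=1))]
    return (z + s * nearest) / (1 + s)

member, nonmember = train[0], rng.normal(size=4) + 3.0
sigmas = [0.1, 0.3, 1.0]
e_m = stepwise_errors(member, denoise, sigmas, rng).mean()
e_n = stepwise_errors(nonmember, denoise, sigmas, rng).mean()
print(e_m < e_n)   # lower step-wise error -> flagged as a member
```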

Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.

Updated: 2025-10-16 03:41:44

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14271v1
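The entity-resolution step can be sketched with stdlib tools. This is a deliberately simplified version of the pipeline the abstract evaluates: blocking here is a cheap first-letter key and similarity is `difflib` string ratio, standing in for the embedding-based blocking and similarity metrics the paper actually compares.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def resolve_entities(entities, threshold=0.85):
    """Denoising sketch: block candidate entities by a cheap key, then map
    each entity to the first kept entity in its block whose string
    similarity clears the threshold (otherwise it becomes canonical itself)."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[e[0].lower()].append(e)      # blocking: avoid all-pairs comparison
    canonical = {}
    for block in blocks.values():
        kept = []
        for e in block:
            match = next((k for k in kept
                          if SequenceMatcher(None, e.lower(), k.lower()).ratio()
                          >= threshold), None)
            canonical[e] = match or e
            if match is None:
                kept.append(e)
    return canonical

ents = ["Marie Curie", "marie curie", "Maria Curie", "Nobel Prize", "Warsaw"]
merged = sorted(set(resolve_entities(ents).values()))
print(merged)  # near-duplicate spellings collapse to one node
```

Collapsing redundant nodes this way is what shrinks the KG before the triple-reflection step removes unreliable edges.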

Training LLM Agents to Empower Humans

Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human attain their objectives. Additionally, these methods often require costly explicit human feedback to provide a training signal. We propose a new approach to tuning assistive language models based on maximizing the human's empowerment, their ability to effect desired changes in the environment. Our empowerment-maximizing method, Empower, only requires offline text data, providing a self-supervised method for fine-tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18-person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate and 38% fewer suggestions. Additionally, we introduce a new environment for evaluating multi-turn code assistance using simulated humans. Using this environment, we show that agents trained with Empower increase the success rate of a simulated human programmer on challenging coding questions by an average of 192% over an SFT baseline. With this empowerment objective, we provide a framework for useful aligned AI agents at scale using only offline data without the need for any additional human feedback or verifiable rewards.

Updated: 2025-10-16 03:39:31

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.13709v2

ELASTIC: Efficient Once For All Iterative Search for Object Detection on Microcontrollers

Deploying high-performance object detectors on TinyML platforms poses significant challenges due to tight hardware constraints and the modular complexity of modern detection pipelines. Neural Architecture Search (NAS) offers a path toward automation, but existing methods either restrict optimization to individual modules, sacrificing cross-module synergy, or require global searches that are computationally intractable. We propose ELASTIC (Efficient Once for AlL IterAtive Search for ObjecT DetectIon on MiCrocontrollers), a unified, hardware-aware NAS framework that alternates optimization across modules (e.g., backbone, neck, and head) in a cyclic fashion. ELASTIC introduces a novel Population Passthrough mechanism in evolutionary search that retains high-quality candidates between search stages, yielding faster convergence, up to an 8% final mAP gain, and eliminates search instability observed without population passthrough. In a controlled comparison, empirical results show ELASTIC achieves +4.75% higher mAP and 2x faster convergence than progressive NAS strategies on SVHN, and delivers a +9.09% mAP improvement on PascalVOC given the same search budget. ELASTIC achieves 72.3% mAP on PascalVOC, outperforming MCUNET by 20.9% and TinyissimoYOLO by 16.3%. When deployed on MAX78000/MAX78002 microcontrollers, ELASTIC-derived models outperform Analog Devices' TinySSD baselines, reducing energy by up to 71.6%, lowering latency by up to 2.4x, and improving mAP by up to 6.99 percentage points across multiple datasets.

Updated: 2025-10-16 03:38:23

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.21999v2

Nonparametric Data Attribution for Diffusion Models

Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or large-scale settings. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines. Code is available at https://github.com/sail-sg/NDA.
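
The patch-level similarity measure can be illustrated with a minimal sketch: score each training image by the best cosine match between co-located patches of the generated and training images. The patch size, normalization, and function names are our assumptions, and the convolution-based acceleration is omitted for clarity.

```python
import numpy as np

def patch_scores(generated, training_set, patch=4):
    """Return one similarity score per training image (higher = more influence)."""
    g = (generated - generated.mean()) / (generated.std() + 1e-8)
    scores = []
    for t in training_set:
        tn = (t - t.mean()) / (t.std() + 1e-8)
        best = -np.inf
        H, W = g.shape
        for i in range(H - patch + 1):
            for j in range(W - patch + 1):
                gp = g[i:i + patch, j:j + patch].ravel()
                tp = tn[i:i + patch, j:j + patch].ravel()
                # Cosine similarity between co-located patches.
                sim = float(gp @ tp / (np.linalg.norm(gp) * np.linalg.norm(tp) + 1e-8))
                best = max(best, sim)
        scores.append(best)
    return np.array(scores)

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))
train = [base + 0.05 * rng.normal(size=(8, 8)),  # near-duplicate of the output
         rng.normal(size=(8, 8))]                # unrelated image
scores = patch_scores(base, train)
```

Because each per-position score is an inner product, the double loop can be replaced by a convolution, which is the acceleration the abstract refers to.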

Updated: 2025-10-16 03:37:16

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.14269v1

MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

With the advancement of powerful large-scale reasoning models, effectively evaluating the reasoning capabilities of these models has become increasingly important. However, existing benchmarks designed to assess the reasoning abilities of large models tend to be limited in scope and lack the flexibility to adapt their difficulty according to the evolving reasoning capacities of the models. To address this, we propose MorphoBench, a benchmark that incorporates multidisciplinary questions to evaluate the reasoning capabilities of large models and can adjust and update question difficulty based on the reasoning abilities of advanced models. Specifically, we curate the benchmark by selecting and collecting complex reasoning questions from existing benchmarks and sources such as Olympiad-level competitions. Additionally, MorphoBench adaptively modifies the analytical challenge of questions by leveraging key statements generated during the model's reasoning process. Furthermore, it includes questions generated using simulation software, enabling dynamic adjustment of benchmark difficulty with minimal resource consumption. We have gathered over 1,300 test questions and iteratively adjusted the difficulty of MorphoBench based on the reasoning capabilities of models such as o3 and GPT-5. MorphoBench enhances the comprehensiveness and validity of model reasoning evaluation, providing reliable guidance for improving both the reasoning abilities and scientific robustness of large models. The code has been released at https://github.com/OpenDCAI/MorphoBench.

Updated: 2025-10-16 03:30:56

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14265v1

CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions

Large language models have achieved remarkable success but remain largely black boxes with poorly understood internal mechanisms. To address this limitation, many researchers have proposed various interpretability methods including mechanistic analysis, probing classifiers, and activation visualization, each providing valuable insights from different perspectives. Building upon this rich landscape of complementary approaches, we introduce CAST (Compositional Analysis via Spectral Tracking), a probe-free framework that contributes a novel perspective by analyzing transformer layer functions through direct transformation matrix estimation and comprehensive spectral analysis. CAST offers complementary insights to existing methods by estimating the realized transformation matrices for each layer using Moore-Penrose pseudoinverse and applying spectral analysis with six interpretable metrics characterizing layer behavior. Our analysis reveals distinct behaviors between encoder-only and decoder-only models, with decoder models exhibiting compression-expansion cycles while encoder models maintain consistent high-rank processing. Kernel analysis further demonstrates functional relationship patterns between layers, with CKA similarity matrices clearly partitioning layers into three phases: feature extraction, compression, and specialization.
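
The pseudoinverse-based estimation described above admits a compact sketch: fit the least-squares linear map between a layer's input and output activations, then summarize its singular spectrum. The metric names below are ours, and the paper's six metrics are not reproduced; this only shows the mechanism.

```python
import numpy as np

def realized_transform(X_in, X_out):
    """Least-squares W with X_in @ W ~= X_out, via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X_in) @ X_out

def spectral_metrics(W, tol=1e-10):
    s = np.linalg.svd(W, compute_uv=False)
    return {
        "rank": int((s > tol * s[0]).sum()),              # numerical rank
        "spectral_norm": float(s[0]),                     # largest gain
        "stable_rank": float((s ** 2).sum() / s[0] ** 2), # energy-based rank
    }

# Synthetic "layer": a known rank-4 linear map applied to random activations.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))
W_true = np.diag([3.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
metrics = spectral_metrics(realized_transform(X, X @ W_true))
```

With full-column-rank activations the pseudoinverse recovers the map exactly, so the spectral summaries match the planted singular values.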

Updated: 2025-10-16 03:27:15

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14262v1

TAI3: Testing Agent Integrity in Interpreting User Intent

LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often suffer from misinterpretation of user intent, leading to agent actions that diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce TAI3, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, TAI3 generates realistic tasks based on toolkits' documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, TAI3 maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that TAI3 effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, TAI3 generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.

Updated: 2025-10-16 03:20:27

Categories: cs.SE,cs.AI,cs.CY

Download: http://arxiv.org/abs/2506.07524v2

Minimax Optimal Kernel Two-Sample Tests with Random Features

Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy), for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently have minimax optimal two-sample tests been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral-regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs comparably to the exact test, with only a small drop in power.
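
The RFF mechanism can be sketched as follows; this is the plain Gaussian-kernel MMD with permutation calibration, not the paper's spectral-regularized statistic, and all parameter choices (feature count, bandwidth) are illustrative.

```python
import numpy as np

def rff(X, W, b):
    # cos(xW + b) features approximate the Gaussian kernel (Rahimi-Recht map).
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def mmd_rff(X, Y, num_features=256, sigma=1.0, seed=0):
    # Squared distance between RFF mean embeddings, an O(n) MMD estimate.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return float(np.linalg.norm(rff(X, W, b).mean(0) - rff(Y, W, b).mean(0)) ** 2)

def permutation_pvalue(X, Y, num_perms=200, seed=0):
    # Calibrate the statistic by re-splitting the pooled sample.
    rng = np.random.default_rng(seed)
    stat = mmd_rff(X, Y)
    Z = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(num_perms):
        idx = rng.permutation(len(Z))
        count += mmd_rff(Z[idx[:n]], Z[idx[n:]]) >= stat
    return (count + 1) / (num_perms + 1)

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
Y_far = rng.normal(loc=2.0, size=(100, 2))  # clearly shifted distribution
p_far = permutation_pvalue(X, Y_far)
```

The O(n) feature means are where the cubic cost of the exact test is avoided; the permutation loop mirrors the practically implementable version mentioned above.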

Updated: 2025-10-16 03:19:33

Categories: math.ST,cs.LG,stat.ML,stat.TH,62G10 (Primary) 65J20, 65J22, 46E22, 47A52 (Secondary)

Download: http://arxiv.org/abs/2502.20755v2

Generalist vs Specialist Time Series Foundation Models: Investigating Potential Emergent Behaviors in Assessing Human Health Using PPG Signals

Foundation models are large-scale machine learning models that are pre-trained on massive amounts of data and can be adapted for various downstream tasks. They have been extensively applied to tasks in Natural Language Processing and Computer Vision with models such as GPT, BERT, and CLIP. They are now also increasingly gaining attention in time-series analysis, particularly for physiological sensing. However, most time series foundation models are specialist models - with data in pre-training and testing of the same type, such as Electrocardiogram, Electroencephalogram, and Photoplethysmogram (PPG). Recent works, such as MOMENT, train a generalist time series foundation model with data from multiple domains, such as weather, traffic, and electricity. This paper aims to conduct a comprehensive benchmarking study to compare the performance of generalist and specialist models, with a focus on PPG signals. Through an extensive suite of 51 tasks covering cardiac state assessment, laboratory value estimation, and cross-modal inference, we comprehensively evaluate both models across seven dimensions, including win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability. These metrics jointly capture not only the models' capability but also their adaptability, robustness, and efficiency under different fine-tuning strategies, providing a holistic understanding of their strengths and limitations for diverse downstream scenarios. In a full-tuning scenario, we demonstrate that the specialist model achieves a 27% higher win score. Finally, we provide further analysis on generalization, fairness, attention visualizations, and the importance of training data choice.

Updated: 2025-10-16 03:13:04

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14254v1

Towards Agentic Self-Learning LLMs in Search Environment

We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data-even when synthetically generated-substantially enhances agentic capabilities. Building on these insights, we propose \textbf{Agentic Self-Learning} (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of harder task setting, sharper verification, and stronger solving. Empirically, ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines (e.g., Search-R1) that plateau or degrade, and continues improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness. We further show that GRM verification capacity is the main bottleneck: if frozen, it induces reward hacking and stalls progress; continual GRM training on the evolving data distribution mitigates this, and a small late-stage injection of real verification data raises the performance ceiling. This work establishes reward source and data scale as critical levers for open-domain agent learning and demonstrates the efficacy of multi-role co-evolution for scalable, self-improving agents. The data and code of this paper are released at https://github.com/forangel2014/Towards-Agentic-Self-Learning

Updated: 2025-10-16 03:11:56

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14253v1

APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight

Large Language Models (LLMs) demonstrate strong reasoning and task planning capabilities but remain fundamentally limited in physical interaction modeling. Existing approaches integrate perception via Vision-Language Models (VLMs) or adaptive decision-making through Reinforcement Learning (RL), but they fail to capture dynamic object interactions or require task-specific training, limiting their real-world applicability. We introduce APEX (Anticipatory Physics-Enhanced Execution), a framework that equips LLMs with physics-driven foresight for real-time task planning. APEX constructs structured graphs to identify and model the most relevant dynamic interactions in the environment, providing LLMs with explicit physical state updates. Simultaneously, APEX provides low-latency forward simulations of physically feasible actions, allowing LLMs to select optimal strategies based on predictive outcomes rather than static observations. We evaluate APEX on three benchmarks designed to assess perception, prediction, and decision-making: (1) Physics Reasoning Benchmark, testing causal inference and object motion prediction; (2) Tetris, evaluating whether physics-informed prediction enhances decision-making performance in long-horizon planning tasks; (3) Dynamic Obstacle Avoidance, assessing the immediate integration of perception and action feasibility analysis. APEX significantly outperforms standard LLMs and VLM-based models, demonstrating the necessity of explicit physics reasoning for bridging the gap between language-based intelligence and real-world task execution. The source code and experiment setup are publicly available at https://github.com/hwj20/APEX_EXP .

Updated: 2025-10-16 03:07:40

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2505.13921v2

A Physics Prior-Guided Dual-Stream Attention Network for Motion Prediction of Elastic Bragg Breakwaters

Accurate motion response prediction for elastic Bragg breakwaters is critical for their structural safety and operational integrity in marine environments. However, conventional deep learning models often exhibit limited generalization capabilities when presented with unseen sea states. These deficiencies stem from the neglect of natural decay observed in marine systems and inadequate modeling of wave-structure interaction (WSI). To overcome these challenges, this study proposes a novel Physics Prior-Guided Dual-Stream Attention Network (PhysAttnNet). First, the decay bidirectional self-attention (DBSA) module incorporates a learnable temporal decay to assign higher weights to recent states, aiming to emulate the natural decay phenomenon. Meanwhile, the phase differences guided bidirectional cross-attention (PDG-BCA) module explicitly captures the bidirectional interaction and phase relationship between waves and the structure using a cosine-based bias within a bidirectional cross-computation paradigm. These streams are synergistically integrated through a global context fusion (GCF) module. Finally, PhysAttnNet is trained with a hybrid time-frequency loss that jointly minimizes time-domain prediction errors and frequency-domain spectral discrepancies. Comprehensive experiments on wave flume datasets demonstrate that PhysAttnNet significantly outperforms mainstream models. Furthermore, cross-scenario generalization tests validate the model's robustness and adaptability to unseen environments, highlighting its potential as a framework to develop predictive models for complex systems in ocean engineering.
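
The learnable temporal decay in DBSA can be illustrated as an additive attention bias; the exact parameterization below (a scalar decay rate times the time gap between query and key positions) is our assumption for the sketch, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decayed_attention(scores, decay_rate):
    """scores: (T, T) raw attention logits; decay penalizes distant time steps."""
    T = scores.shape[0]
    t = np.arange(T)
    gap = np.abs(t[:, None] - t[None, :])      # |i - j| time gap between positions
    return softmax(scores - decay_rate * gap)  # larger gap -> lower weight

# With uniform logits, the bias alone decides the weights: each query attends
# most to temporally nearby states, emulating natural decay.
T = 5
attn = decayed_attention(np.zeros((T, T)), decay_rate=0.5)
```

In training, `decay_rate` would be a learned parameter, so the model can tune how quickly the influence of older states fades.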

Updated: 2025-10-16 03:06:44

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14250v1

Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?

Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks is the use of joint language-audio embedding spaces, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate the above three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.

Updated: 2025-10-16 03:01:41

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2510.14249v1

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.

Updated: 2025-10-16 03:00:47

Categories: cs.AI,cs.CV,cs.LG,cs.MA

Download: http://arxiv.org/abs/2509.25271v2

Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation

Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We study this problem through the lens of robust Markov decision processes (RMDPs), which optimize performance against adversarial transition dynamics. Our focus is the online setting, where the agent has only limited interaction with the environment, making sample efficiency and exploration especially critical. Policy optimization, despite its success in standard RL, remains theoretically and empirically underexplored in robust RL. To bridge this gap, we propose \textbf{D}istributionally \textbf{R}obust \textbf{R}egularized \textbf{P}olicy \textbf{O}ptimization algorithm (DR-RPO), a model-free online policy optimization method that learns robust policies with sublinear regret. To enable tractable optimization within the softmax policy class, DR-RPO incorporates reference-policy regularization, yielding RMDP variants that are doubly constrained in both transitions and policies. To scale to large state-action spaces, we adopt the $d$-rectangular linear MDP formulation and combine linear function approximation with an upper confidence bonus for optimistic exploration. We provide theoretical guarantees showing that policy optimization can achieve polynomial suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches. Finally, empirical results across diverse domains corroborate our theory and demonstrate the robustness of DR-RPO.

Updated: 2025-10-16 02:56:58

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2510.14246v1

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Multi-image Interleaved Reasoning aims to improve Multi-modal Large Language Models' (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. Current multi-image benchmarks, however, overlook interleaved textual contexts and neglect the distinct relationships between individual images and their associated texts; enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark, MIR, which requires joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs' ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an "easy to hard" approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models' reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs' capability to handle complex inter-modal tasks.

Updated: 2025-10-16 02:56:19

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.17040v2

Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.

Updated: 2025-10-16 02:55:04

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.14244v1

Spatial Computing Communications for Multi-User Virtual Reality in Distributed Mobile Edge Computing Network

Immersive virtual reality (VR) applications impose stringent requirements on latency, energy efficiency, and computational resources, particularly in multi-user interactive scenarios. To address these challenges, we introduce the concept of spatial computing communications (SCC), a framework designed to meet the latency and energy demands of multi-user VR over distributed mobile edge computing (MEC) networks. SCC jointly represents the physical space, defined by users and base stations, and the virtual space, representing shared immersive environments, using a probabilistic model of user dynamics and resource requirements. The resource deployment task is then formulated as a multi-objective combinatorial optimization (MOCO) problem that simultaneously minimizes system latency and energy consumption across distributed MEC resources. To solve this problem, we propose MO-CMPO, a multi-objective consistency model with policy optimization that integrates supervised learning and reinforcement learning (RL) fine-tuning guided by preference weights. Leveraging a sparse graph neural network (GNN), MO-CMPO efficiently generates Pareto-optimal solutions. Simulations with real-world New Radio base station datasets demonstrate that MO-CMPO achieves superior hypervolume performance and significantly lower inference latency than baseline methods. Furthermore, the analysis reveals practical deployment patterns: latency-oriented solutions favor local MEC execution to reduce transmission delay, while energy-oriented solutions minimize redundant placements to save energy.
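
The latency-energy trade-off in the MOCO formulation implies a Pareto front over candidate deployments; a naive non-dominated filter over toy (latency, energy) pairs illustrates what "Pareto-optimal solutions" means here (the numeric values are invented, and MO-CMPO itself generates such solutions learned rather than by enumeration):

```python
def pareto_front(points):
    """points: list of (latency, energy) tuples; lower is better on both axes."""
    front = []
    for p in points:
        # p is dominated if some other point is at least as good on both
        # objectives (and differs from p).
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

candidates = [(10, 5), (8, 7), (12, 4), (9, 9)]
front = pareto_front(candidates)
```

Here (9, 9) is dropped because (8, 7) beats it on both latency and energy, while the remaining three points each trade one objective for the other.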

Updated: 2025-10-16 02:55:01

Categories: cs.IT,cs.AI,math.IT

Download: http://arxiv.org/abs/2510.14243v1

Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
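
The consensus step behind CCE can be sketched as a majority vote across prompt variations, alongside the observed-agreement metric the paper reports. The list-of-lists interface below is a hypothetical simplification (the full method also uses confidence-weighted representation alignment):

```python
from collections import Counter

def consensus_labels(predictions):
    """Majority vote across prompt variations -> hard pseudo-label per example.
    `predictions[v][i]` is the answer of prompt variation v on example i
    (hypothetical shape; CCE additionally weights by confidence)."""
    n_examples = len(predictions[0])
    labels = []
    for i in range(n_examples):
        votes = Counter(p[i] for p in predictions)
        labels.append(votes.most_common(1)[0][0])
    return labels

def observed_agreement(predictions):
    """Fraction of examples on which all prompt variations give the same answer."""
    n_examples = len(predictions[0])
    agree = sum(1 for i in range(n_examples)
                if len({p[i] for p in predictions}) == 1)
    return agree / n_examples
```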

Updated: 2025-10-16 02:54:01

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.14242v1

An analysis of the derivative-free loss method for solving PDEs

This study analyzes the derivative-free loss method to solve a certain class of elliptic PDEs and fluid problems using neural networks. The approach leverages the Feynman-Kac formulation, incorporating stochastic walkers and their averaged values. We investigate how the time interval associated with the Feynman-Kac representation and the walker size influence computational efficiency, trainability, and sampling errors. Our analysis shows that the training loss bias scales proportionally with the time interval and the spatial gradient of the neural network, while being inversely proportional to the walker size. Moreover, we demonstrate that the time interval must be sufficiently long to enable effective training. These results indicate that the walker size can be chosen as small as possible, provided it satisfies the optimal lower bound determined by the time interval. Finally, we present numerical experiments that support our theoretical findings.
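
The Feynman-Kac representation underlying the method can be illustrated with a toy 1-D Laplace problem, where the value of the solution at a point equals the expected boundary value seen by random walkers started there. This is only a didactic sketch (the paper's loss uses short-time increments and neural-network estimates, not full boundary exits):

```python
import random

def fk_estimate(x, h=0.05, n_walkers=2000, seed=0):
    """Monte-Carlo estimate of u(x) for u'' = 0 on (0, 1) with u(0) = 0,
    u(1) = 1, via the Feynman-Kac view u(x) = E[u(W_exit)] for a symmetric
    random walk started at x. Implemented on an integer lattice for
    exactness (classic gambler's ruin)."""
    rng = random.Random(seed)
    n = round(1 / h)        # number of lattice cells
    start = round(x / h)    # starting lattice site
    hits = 0
    for _ in range(n_walkers):
        pos = start
        while 0 < pos < n:  # walk until a boundary is reached
            pos += 1 if rng.random() < 0.5 else -1
        hits += pos == n    # boundary value is 1 at the right end, 0 at the left
    return hits / n_walkers
```

For this problem the exact solution is u(x) = x, so the walker average should concentrate near x, with sampling error shrinking as the walker count grows (one of the trade-offs the paper analyzes).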

Updated: 2025-10-16 02:51:16

Categories: math.NA,cs.LG,cs.NA,stat.ML,65N15, 65N75, 65C05, 60G46

Download: http://arxiv.org/abs/2309.16829v2

LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

Deep research -- producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources -- marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.

Updated: 2025-10-16 02:49:16

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14240v1

RoBCtrl: Attacking GNN-Based Social Bot Detectors via Reinforced Manipulation of Bots Control Interaction

Social networks have become a crucial source of real-time information for individuals. The influence of social bots within these platforms has garnered considerable attention from researchers, leading to the development of numerous detection technologies. However, the vulnerability and robustness of these detection methods are still underexplored. Existing Graph Neural Network (GNN)-based methods cannot be directly applied due to the issues of limited control over social agents, the black-box nature of bot detectors, and the heterogeneity of bots. To address these challenges, this paper proposes the first adversarial multi-agent reinforcement learning framework for social bot control attacks (RoBCtrl) targeting GNN-based social bot detectors. Specifically, we use a diffusion model to generate high-fidelity bot accounts by reconstructing existing account data with minor modifications, thereby evading detection on social platforms. To the best of our knowledge, this is the first application of diffusion models to mimic the behavior of evolving social bots effectively. We then employ a Multi-Agent Reinforcement Learning (MARL) method to simulate bots' adversarial behavior. We categorize social accounts based on their influence and budget. Different agents are then employed to control bot accounts across various categories, optimizing the attachment strategy through reinforcement learning. Additionally, a hierarchical state abstraction based on structural entropy is designed to accelerate the reinforcement learning. Extensive experiments on social bot detection datasets demonstrate that our framework can effectively undermine the performance of GNN-based detectors.

Updated: 2025-10-16 02:41:49

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2510.16035v1

MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control

MimicKit is an open-source framework for training motion controllers using motion imitation and reinforcement learning. The codebase provides implementations of commonly-used motion-imitation techniques and RL algorithms. This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures. The codebase is designed to be modular and easily configurable, enabling convenient modification and extension to new characters and tasks. The open-source codebase is available at: https://github.com/xbpeng/MimicKit.

Updated: 2025-10-16 02:41:08

Categories: cs.GR,cs.LG,cs.RO

Download: http://arxiv.org/abs/2510.13794v2

Domain-Independent Dynamic Programming

For combinatorial optimization problems, model-based paradigms such as mixed-integer programming (MIP) and constraint programming (CP) aim to decouple modeling and solving a problem: the `holy grail' of declarative problem solving. We propose domain-independent dynamic programming (DIDP), a novel model-based paradigm based on dynamic programming (DP). While DP is not new, it has typically been implemented as a problem-specific method. We introduce Dynamic Programming Description Language (DyPDL), a formalism to define DP models based on a state transition system, inspired by artificial intelligence (AI) planning. We show that heuristic search algorithms can be used to solve DyPDL models and propose seven DIDP solvers. We experimentally compare our DIDP solvers with commercial MIP and CP solvers (solving MIP and CP models, respectively) on common benchmark instances of eleven combinatorial optimization problem classes. We show that DIDP outperforms MIP in nine problem classes, CP also in nine problem classes, and both MIP and CP in seven. DIDP also achieves superior performance to existing state-based solvers including domain-independent AI planners.
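
The state-transition view of DP that DyPDL formalizes can be illustrated with a toy 0/1 knapsack model: a state, a set of transitions, and a base case, evaluated here by plain memoized recursion rather than the heuristic search a DIDP solver would apply (the encoding is illustrative Python, not DyPDL syntax):

```python
from functools import lru_cache

# Toy 0/1 knapsack as a state-transition DP. A state is (next item index,
# remaining capacity); transitions either skip or pack the current item.
weights = [3, 4, 2]
values = [30, 50, 15]

@lru_cache(maxsize=None)
def best(i, capacity):
    if i == len(weights):            # base case: no items left
        return 0
    skip = best(i + 1, capacity)     # transition 1: skip item i
    take = -1
    if weights[i] <= capacity:       # transition 2: pack item i if it fits
        take = values[i] + best(i + 1, capacity - weights[i])
    return max(skip, take)
```

A DIDP solver would take the same model description and explore its state space with heuristic search instead of exhaustive recursion.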

Updated: 2025-10-16 02:40:01

Categories: cs.AI,F.2.2; I.2.8

Download: http://arxiv.org/abs/2401.13883v4

VERITAS: Verifying the Performance of AI-native Transceiver Actions in Base-Stations

Artificial Intelligence (AI)-native receivers deliver significant performance improvements in high-noise regimes and can potentially reduce communication overhead compared to traditional receivers. However, their performance highly depends on the representativeness of the training dataset. A major issue is the uncertainty of whether the training dataset covers all test environments and waveform configurations, and thus, whether the trained model is robust in practical deployment conditions. To this end, we propose a joint measurement-recovery framework for AI-native transceivers post deployment, called VERITAS, that continuously looks for distribution shifts in the received signals and triggers finite re-training spurts. VERITAS monitors the wireless channel using 5G pilots fed to an auxiliary neural network that detects out-of-distribution channel profiles, transmitter speeds, and delay spreads. As soon as such a change is detected, a traditional (reference) receiver is activated, which runs for a period of time in parallel to the AI-native receiver. Finally, VERITAS compares the bit probabilities of the AI-native and the reference receivers for the same received data inputs, and decides whether or not a retraining process needs to be initiated. Our evaluations reveal that VERITAS can detect changes in the channel profile, transmitter speed, and delay spread with 99%, 97%, and 69% accuracy, respectively, followed by timely initiation of retraining for 86%, 93.3%, and 94.8% of inputs in the channel profile, transmitter speed, and delay spread test sets, respectively.
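
The final comparison step can be caricatured as a drift test on bit probabilities; the mean absolute gap and the threshold below are illustrative stand-ins, not the paper's actual decision rule:

```python
def needs_retraining(p_ai, p_ref, threshold=0.05):
    """Decide whether the AI-native receiver has drifted, by comparing its
    bit probabilities against the reference receiver's on the same received
    samples. The mean-absolute-gap statistic and threshold are hypothetical;
    VERITAS defines its own criterion."""
    gap = sum(abs(a - r) for a, r in zip(p_ai, p_ref)) / len(p_ai)
    return gap > threshold
```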

Updated: 2025-10-16 02:39:38

Categories: eess.SP,cs.AI,cs.LG

Download: http://arxiv.org/abs/2501.09761v2

Benchmarking drug-drug interaction prediction methods: a perspective of distribution changes

Motivation: Emerging drug-drug interaction (DDI) prediction is crucial for new drugs but is hindered by distribution changes between known and new drugs in real-world scenarios. Current evaluation often neglects these changes, relying on unrealistic i.i.d. split due to the absence of drug approval data. Results: We propose DDI-Ben, a benchmarking framework for emerging DDI prediction under distribution changes. DDI-Ben introduces a distribution change simulation framework that leverages distribution changes between drug sets as a surrogate for real-world distribution changes of DDIs, and is compatible with various drug split strategies. Through extensive benchmarking on ten representative methods, we show that most existing approaches suffer substantial performance degradation under distribution changes. Our analysis further indicates that large language model (LLM) based methods and the integration of drug-related textual information offer promising robustness against such degradation. To support future research, we release the benchmark datasets with simulated distribution changes. Overall, DDI-Ben highlights the importance of explicitly addressing distribution changes and provides a foundation for developing more resilient methods for emerging DDI prediction. Availability and implementation: Our code and data are available at https://github.com/LARS-research/DDI-Bench.
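
A drug-level split of the kind the benchmark builds on can be sketched as holding out a set of "new" drugs so that no training pair touches them, in contrast to an i.i.d. split over pairs. The split ratio and policy here are illustrative, not DDI-Ben's exact protocol:

```python
import random

def emerging_drug_split(pairs, drugs, new_frac=0.2, seed=0):
    """Split DDI pairs so that a held-out set of 'new' drugs never appears
    in training, standing in for the distribution change between known and
    newly approved drugs. `pairs` is a list of (drug_a, drug_b) tuples."""
    rng = random.Random(seed)
    drugs = list(drugs)
    rng.shuffle(drugs)
    n_new = max(1, int(new_frac * len(drugs)))
    new_drugs = set(drugs[:n_new])
    train = [p for p in pairs if p[0] not in new_drugs and p[1] not in new_drugs]
    test = [p for p in pairs if p[0] in new_drugs or p[1] in new_drugs]
    return train, test, new_drugs
```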

Updated: 2025-10-16 02:37:35

Categories: cs.LG

Download: http://arxiv.org/abs/2410.18583v5

Never too Prim to Swim: An LLM-Enhanced RL-based Adaptive S-Surface Controller for AUVs under Extreme Sea Conditions

The adaptivity and maneuvering capabilities of Autonomous Underwater Vehicles (AUVs) have drawn significant attention in oceanic research, due to the unpredictable disturbances and strong coupling among the AUV's degrees of freedom. In this paper, we develop a large language model (LLM)-enhanced, reinforcement learning (RL)-based adaptive S-surface controller for AUVs. Specifically, LLMs are introduced for the joint optimization of controller parameters and reward functions in RL training. Using multi-modal and structured explicit task feedback, LLMs enable joint adjustments, balance multiple objectives, and enhance task-oriented performance and adaptability. In the proposed controller, the RL policy focuses on upper-level tasks, outputting task-oriented high-level commands that the S-surface controller then converts into control signals, ensuring cancellation of nonlinear effects and unpredictable external disturbances in extreme sea conditions. Under extreme sea conditions involving complex terrain, waves, and currents, the proposed controller demonstrates superior performance and adaptability in high-level tasks such as underwater target tracking and data collection, outperforming traditional PID and SMC controllers.

Updated: 2025-10-16 02:33:20

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2503.00527v2

Automated Snippet-Alignment Data Augmentation for Code Translation

Code translation aims to translate the code from its source language to the target language and is used in various software development scenarios. Recent developments in Large Language Models (LLMs) have showcased their capabilities in code translation, and parallel corpora play a crucial role in training models for code translation. Parallel corpora can be categorized into program-alignment (PA) and snippet-alignment (SA) data. Although PA data has complete context and is suitable for semantic alignment learning, it may not provide adequate fine-grained training signals due to its extended length, while the brevity of SA data enables more fine-grained alignment learning. Due to limited parallel corpora, researchers explore several augmentation methods for code translation. Previous studies mainly focus on augmenting PA data. In this paper, we propose a data augmentation method that leverages LLMs to generate SA data automatically. To fully leverage both PA data and SA data, we explore a simple yet effective two-stage training strategy, which consistently enhances model performance compared to fine-tuning solely on PA data. Experiments on TransCoder-test demonstrate that our augmented SA data combined with the two-stage training approach yields consistent improvements over the baseline, achieving a maximum gain of 3.78% on pass@k.
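
Since the results are reported on pass@k, the standard unbiased estimator of that metric is worth recalling; this is the widely used combinatorial form, not something introduced by the paper:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:           # fewer incorrect samples than k: success is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```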

Updated: 2025-10-16 02:30:24

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.15004v1

RHINO: Guided Reasoning for Mapping Network Logs to Adversarial Tactics and Techniques with Large Language Models

Modern Network Intrusion Detection Systems generate vast volumes of low-level alerts, yet these outputs remain semantically fragmented, requiring labor-intensive manual correlation with high-level adversarial behaviors. Existing solutions for automating this mapping (rule-based systems and machine learning classifiers) suffer from critical limitations: rule-based approaches fail to adapt to novel attack variations, while machine learning methods lack contextual awareness and treat tactic-technique mapping as a syntactic matching problem rather than a reasoning task. Although Large Language Models have shown promise in cybersecurity tasks, preliminary experiments reveal that existing LLM-based methods frequently hallucinate technique names or produce decontextualized mappings due to their single-step classification approach. To address these challenges, we introduce RHINO, a novel framework that decomposes LLM-based attack analysis into three interpretable phases mirroring human reasoning: (1) behavioral abstraction, where raw logs are translated into contextualized narratives; (2) multi-role collaborative inference, generating candidate techniques by evaluating behavioral evidence against MITRE ATT&CK knowledge; and (3) validation, cross-referencing predictions with official MITRE definitions to rectify hallucinations. RHINO bridges the semantic gap between low-level observations and adversarial intent while improving output reliability through structured reasoning. We evaluate RHINO on three benchmarks across four backbone models. RHINO achieved high accuracy, with model performance ranging from 86.38% to 88.45%, resulting in relative gains from 24.25% to 76.50% across different models. Our results demonstrate that RHINO significantly enhances the interpretability and scalability of threat analysis, offering a blueprint for deploying LLMs in operational security settings.

Updated: 2025-10-16 02:25:46

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14233v1

A Neural Symbolic Model for Space Physics

In this study, we unveil a new AI model, termed PhyE2E, to discover physical formulas through symbolic regression. PhyE2E simplifies symbolic regression by decomposing it into sub-problems using the second-order derivatives of an oracle neural network, and employs a transformer model to translate data into symbolic formulas in an end-to-end manner. The resulting formulas are refined through Monte-Carlo Tree Search and Genetic Programming. We leverage a large language model to synthesize extensive symbolic expressions resembling real physics, and train the model to recover these formulas directly from data. A comprehensive evaluation reveals that PhyE2E outperforms existing state-of-the-art approaches, delivering superior symbolic accuracy, precision in data fitting, and consistency in physical units. We deployed PhyE2E in five applications in space physics, including the prediction of sunspot numbers, solar rotational angular velocity, emission line contribution functions, near-Earth plasma pressure, and lunar-tide plasma signals. The physical formulas generated by AI demonstrate a high degree of accuracy in fitting the experimental data from satellites and astronomical telescopes. We have successfully upgraded the formula proposed by NASA in 1993 regarding solar activity, and for the first time, provided explanations for the long cycle of solar activity in an explicit form. We also found that the near-Earth plasma pressure decays in proportion to r^2, where r is the distance to Earth; the subsequent mathematical derivations are consistent with satellite data from another independent study. Moreover, we found physical formulas that can describe the relationships between emission lines in the extreme ultraviolet spectrum of the Sun, temperatures, electron densities, and magnetic fields. The formula obtained is consistent with the properties that physicists had previously hypothesized it should possess.

Updated: 2025-10-16 02:20:02

Categories: astro-ph.SR,astro-ph.EP,astro-ph.IM,cs.AI,physics.space-ph

Download: http://arxiv.org/abs/2503.07994v2

Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model, gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
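
The behavioral-clustering and round-robin ideas can be sketched as grouping candidate programs by their outputs on probe inputs and then submitting across groups in turn; the callable interface and probe mechanism below are hypothetical simplifications of the paper's pipeline:

```python
from collections import defaultdict

def behavioral_clusters(candidates, probe_inputs):
    """Group candidate solutions by identical behavior on probe inputs.
    `candidates` maps a name to a callable (hypothetical interface; the
    paper clusters generated programs by their outputs on test inputs)."""
    buckets = defaultdict(list)
    for name, fn in candidates.items():
        signature = tuple(fn(x) for x in probe_inputs)
        buckets[signature].append(name)
    return list(buckets.values())

def round_robin_submissions(clusters, budget):
    """Submit one candidate per cluster in turn until the budget runs out."""
    picks, depth = [], 0
    while len(picks) < budget:
        took = False
        for cluster in clusters:
            if depth < len(cluster) and len(picks) < budget:
                picks.append(cluster[depth])
                took = True
        if not took:  # every cluster exhausted
            break
        depth += 1
    return picks
```

Clustering concentrates the limited validation budget on behaviorally distinct solutions instead of near-duplicates.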

Updated: 2025-10-16 02:19:25

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14232v1

When Flatness Does (Not) Guarantee Adversarial Robustness

Despite their empirical success, neural networks remain vulnerable to small, adversarial perturbations. A longstanding hypothesis suggests that flat minima, regions of low curvature in the loss landscape, offer increased robustness. While intuitive, this connection has remained largely informal and incomplete. By rigorously formalizing the relationship, we show this intuition is only partially correct: flatness implies local but not global adversarial robustness. To arrive at this result, we first derive a closed-form expression for relative flatness in the penultimate layer, and then show we can use this to constrain the variation of the loss in input space. This allows us to formally analyze the adversarial robustness of the entire network. We then show that to maintain robustness beyond a local neighborhood, the loss needs to curve sharply away from the data manifold. We validate our theoretical predictions empirically across architectures and datasets, uncovering the geometric structure that governs adversarial vulnerability, and linking flatness to model confidence: adversarial examples often lie in large, flat regions where the model is confidently wrong. Our results challenge simplified views of flatness and provide a nuanced understanding of its role in robustness.

Updated: 2025-10-16 02:15:14

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14231v1

Large Scale Retrieval for the LinkedIn Feed using Causal Language Models

In large scale recommendation systems like the LinkedIn Feed, the retrieval stage is critical for narrowing hundreds of millions of potential candidates to a manageable subset for ranking. LinkedIn's Feed serves suggested content from outside of the member's network (based on the member's topical interests), where 2000 candidates are retrieved from a pool of hundreds of millions of candidates with a latency budget of a few milliseconds and an inbound rate of several thousand queries per second. This paper presents a novel retrieval approach that fine-tunes a large causal language model (Meta's LLaMA 3) as a dual encoder to generate high quality embeddings for both users (members) and content (items), using only textual input. We describe the end to end pipeline, including prompt design for embedding generation, techniques for fine-tuning at LinkedIn's scale, and infrastructure for low latency, cost effective online serving. We share our findings on how quantizing numerical features in the prompt enables the information to get properly encoded in the embedding, facilitating greater alignment between the retrieval and ranking layers. The system was evaluated using offline metrics and an online A/B test, which showed substantial improvements in member engagement. We observed significant gains among newer members, who often lack strong network connections, indicating that high-quality suggested content aids retention. This work demonstrates how generative language models can be effectively adapted for real time, high throughput retrieval in industrial applications.
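
The quantization trick for numerical prompt features can be sketched as mapping a raw number onto a coarse text bucket before it enters the embedding prompt; the bucket edges and wording below are hypothetical, not LinkedIn's production scheme:

```python
def quantize_for_prompt(value, edges, labels):
    """Map a raw numeric feature onto a coarse text token so a language
    model can encode it reliably. Requires len(labels) == len(edges) + 1."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def member_prompt(interests, follower_count):
    """Illustrative prompt builder combining topical interests with a
    quantized numeric feature (field names and buckets are made up)."""
    bucket = quantize_for_prompt(
        follower_count,
        edges=[100, 1000, 10000],
        labels=["small audience", "medium audience",
                "large audience", "very large audience"],
    )
    return f"Member interested in {', '.join(interests)}; {bucket}."
```

The point is that a token like "large audience" survives tokenization intact, whereas a raw integer such as 2500 may be split into digit fragments the encoder handles poorly.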

Updated: 2025-10-16 02:01:33

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2510.14223v1

A novel Information-Driven Strategy for Optimal Regression Assessment

In Machine Learning (ML), a regression algorithm aims to minimize a loss function based on data. An assessment method in this context seeks to quantify the discrepancy between the optimal response for an input-output system and the estimate produced by a learned predictive model (the student). Evaluating the quality of a learned regressor remains challenging without access to the true data-generating mechanism, as no data-driven assessment method can ensure the achievability of global optimality. This work introduces the Information Teacher, a novel data-driven framework for evaluating regression algorithms with formal performance guarantees to assess global optimality. Our novel approach builds on estimating the Shannon mutual information (MI) between the input variables and the residuals and applies to a broad class of additive noise models. Through numerical experiments, we confirm that the Information Teacher is capable of detecting global optimality, which is aligned with the condition of zero estimation error with respect to the true model (inaccessible in practice), working as a surrogate measure of the ground-truth assessment loss and offering a principled alternative to conventional empirical performance metrics.
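The core signal can be illustrated with a simple histogram MI estimator (the paper's actual estimator is not specified here): residuals of a well-fit regressor carry little mutual information with the inputs, while residuals of an underfit one do.

```python
import numpy as np

def binned_mutual_info(x, r, bins=16):
    """Histogram estimate of Shannon MI (in nats) between input x and residual r.

    Under an additive-noise model, MI(x, residual) near zero indicates the
    regressor has absorbed the input-dependent structure; a clearly positive
    value flags that the student is not globally optimal. Simplified
    illustrative stand-in, not the paper's estimator.
    """
    joint, _, _ = np.histogram2d(x, r, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal of x
    pr = p.sum(axis=0, keepdims=True)   # marginal of r
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ pr)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
good = rng.normal(size=20_000)           # residuals of an "optimal" student
bad = 0.5 * x + rng.normal(size=20_000)  # residuals still correlated with x
print(binned_mutual_info(x, good), binned_mutual_info(x, bad))
```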

Updated: 2025-10-16 02:01:32

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.14222v1

Quantum Polar Metric Learning: Efficient Classically Learned Quantum Embeddings

Deep metric learning has recently shown extremely promising results in the classical data domain, creating well-separated feature spaces. This idea was also adapted to quantum computers via Quantum Metric Learning (QMeL). QMeL consists of a two-step process: a classical model compresses the data to fit into the limited number of qubits, and a Parameterized Quantum Circuit (PQC) is then trained to create better separation in Hilbert space. However, on Noisy Intermediate-Scale Quantum (NISQ) devices, QMeL solutions result in high circuit width and depth, both of which limit scalability. We propose Quantum Polar Metric Learning (QPMeL), which uses a classical model to learn the parameters of the polar form of a qubit. We then utilize a shallow PQC with $R_y$ and $R_z$ gates to create the state and a trainable layer of $ZZ(\theta)$-gates to learn entanglement. The circuit also computes fidelity via a SWAP test for our proposed Fidelity Triplet Loss function, used to train both classical and quantum components. When compared to QMeL approaches, QPMeL achieves 3x better multi-class separation, while using only half the number of gates and depth. We also demonstrate that QPMeL outperforms classical networks with similar configurations, presenting a promising avenue for future research on fully classical models with quantum loss functions.
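The polar parameterization and the fidelity that a SWAP test estimates can be sketched classically; a real QPMeL pipeline would produce the angles with a trained network and estimate fidelity on hardware, so this is only an illustrative stand-in:

```python
import numpy as np

def qubit_state(theta, phi):
    """Bloch-sphere polar form: |psi> = cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>."""
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

def fidelity(a, b):
    """|<a|b>|^2 -- the quantity a SWAP test estimates on hardware."""
    return float(abs(np.vdot(a, b)) ** 2)

# A classical network would output (theta, phi) per qubit; fixed angles here
# check the two limits: identical states and orthogonal poles.
same = fidelity(qubit_state(0.3, 1.2), qubit_state(0.3, 1.2))
orth = fidelity(qubit_state(0.0, 0.0), qubit_state(np.pi, 0.0))
print(round(same, 6), round(orth, 6))  # 1.0 0.0
```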

Updated: 2025-10-16 02:00:22

Categories: quant-ph,cs.AI,I.2.6; E.4

Download: http://arxiv.org/abs/2312.01655v4

LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking

Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. In this paper, we investigate the Soft Thinking capabilities of various LLMs through a systematic analysis of their internal behavior using a suite of probing techniques. Contrary to the prevailing belief that Soft Thinking supports parallel exploration of diverse reasoning paths, our findings reveal that LLMs behave as single-threaded reasoners--they predominantly rely on the token with the highest probability in the soft input to predict the next step. This behavior induces a greedy feedback loop that suppresses alternative reasoning paths and undermines the benefits of transmitting richer information via Soft Tokens. To address it, we propose Stochastic Soft Thinking, which introduces stochasticity to break free from this Greedy Pitfall. Our experiments demonstrate that incorporating randomness--particularly with the Gumbel-Softmax trick--can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking, resulting in superior performance across eight reasoning benchmarks. We further demonstrate that Stochastic Soft Thinking exhibits stronger exploration potential compared to conventional CoT. Our findings deepen the understanding of continuous reasoning and establish the foundation for future work on improving Soft Thinking with Reinforcement Learning.
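The Gumbel-Softmax trick mentioned above can be sketched in a few lines; the logits and temperature here are illustrative, not taken from the paper:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a stochastic soft token: softmax((logits + Gumbel noise) / tau).

    Vanilla Soft Thinking feeds the deterministic softmax of the logits,
    which the model then reads greedily; adding Gumbel noise perturbs which
    token dominates each draw, breaking the greedy feedback loop.
    """
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])
# Across draws the dominant token varies, unlike deterministic softmax.
winners = {int(np.argmax(gumbel_softmax(logits, rng=rng))) for _ in range(200)}
print(winners)
```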

Updated: 2025-10-16 01:59:24

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.03440v4

Protenix-Mini+: efficient structure prediction model with scalable pairformer

Lightweight inference is critical for biomolecular structure prediction and downstream tasks, enabling efficient real-world deployment and inference-time scaling for large-scale applications. While AF3 and its variants (e.g., Protenix, Chai-1) have advanced structure prediction, they suffer from critical limitations: high inference latency and cubic time complexity with respect to token count, both of which restrict scalability for large biomolecular complexes. To address the core challenge of balancing model efficiency and prediction accuracy, we introduce three key innovations: (1) compressing non-scalable operations to mitigate cubic time complexity, (2) removing redundant blocks across modules to reduce unnecessary overhead, and (3) adopting a few-step sampler for the atom diffusion module to accelerate inference. Building on these design principles, we develop Protenix-Mini+, a highly lightweight and scalable variant of the Protenix model. Within an acceptable range of performance degradation, it substantially improves computational efficiency. For example, in the case of low-homology single-chain proteins, Protenix-Mini+ experiences an intra-protein LDDT drop of approximately 3% relative to the full Protenix model -- an acceptable performance trade-off given its substantial (over 90%) improvement in computational efficiency.
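A generic few-step sampler cuts denoiser calls by visiting a subsampled timestep schedule; the sketch below shows only this scheduling idea (the paper's exact sampler is not specified here):

```python
def few_step_schedule(num_train_steps: int, num_sample_steps: int) -> list:
    """Subsample a long diffusion schedule down to a handful of timesteps.

    Inference visits only `num_sample_steps` evenly spaced timesteps of the
    original `num_train_steps`-step schedule (descending, as sampling runs
    from high noise to low), cutting denoiser calls proportionally.
    Illustrative sketch, not the paper's sampler.
    """
    stride = num_train_steps / num_sample_steps
    return [round(stride * i) for i in range(num_sample_steps, 0, -1)]

print(few_step_schedule(1000, 5))  # [1000, 800, 600, 400, 200]
```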

Updated: 2025-10-16 01:57:03

Categories: q-bio.QM,cs.LG

Download: http://arxiv.org/abs/2510.12842v2

An Information Asymmetry Game for Trigger-based DNN Model Watermarking

As a valuable digital product, deep neural networks (DNNs) face increasingly severe threats to their intellectual property, making it necessary to develop effective technical measures to protect them. Trigger-based watermarking methods achieve copyright protection by embedding triggers into the host DNNs. However, the attacker may remove the watermark by pruning or fine-tuning. We model this interaction as a game under conditions of information asymmetry: the defender embeds a secret watermark with private knowledge, while the attacker can only access the watermarked model and seek its removal. We define strategies, costs, and utilities for both players, derive the attacker's optimal pruning budget, and establish an exponential lower bound on the accuracy of watermark detection after attack. Experimental results demonstrate the feasibility of the watermarked model, and indicate that sparse watermarking can resist removal with negligible accuracy loss. This study highlights the effectiveness of game-theoretic analysis in guiding the design of robust watermarking schemes for model copyright protection.

Updated: 2025-10-16 01:55:33

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14218v1

Spectral Analysis of Molecular Kernels: When Richer Features Do Not Guarantee Better Generalization

Understanding the spectral properties of kernels offers a principled perspective on generalization and representation quality. While deep models achieve state-of-the-art accuracy in molecular property prediction, kernel methods remain widely used for their robustness in low-data regimes and transparent theoretical grounding. Despite extensive studies of kernel spectra in machine learning, systematic spectral analyses of molecular kernels are scarce. In this work, we provide the first comprehensive spectral analysis of kernel ridge regression on the QM9 dataset, spanning molecular fingerprint, pretrained transformer-based, and global and local 3D representations, across seven molecular properties. Surprisingly, richer spectral features, measured by four different spectral metrics, do not consistently improve accuracy. Pearson correlation tests further reveal that for transformer-based and local 3D representations, spectral richness can even have a negative correlation with performance. We also implement truncated kernels to probe the relationship between spectrum and predictive performance: in many kernels, retaining only the top 2% of eigenvalues recovers nearly all performance, indicating that the leading eigenvalues capture the most informative features. Our results challenge the common heuristic that "richer spectra yield better generalization" and highlight nuanced relationships between representation, kernel features, and predictive performance. Beyond molecular property prediction, these findings inform how kernel and self-supervised learning methods are evaluated in data-limited scientific and real-world tasks.

Updated: 2025-10-16 01:52:26

Categories: cs.LG,physics.chem-ph

Download: http://arxiv.org/abs/2510.14217v1

EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.

Updated: 2025-10-16 01:38:12

Categories: cs.AI

Download: http://arxiv.org/abs/2505.19905v2

Regularizing Extrapolation in Causal Inference

Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel "bias-bias-variance" tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.
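The linear-smoother view can be made concrete by inspecting the weights directly; below, the mass of negative weights serves as an illustrative proxy for the "level of extrapolation" the paper penalizes (the paper's own objective is not reproduced):

```python
import numpy as np

# Linear smoothers: yhat(x0) = sum_i w_i(x0) * y_i. OLS allows negative w_i
# (extrapolation), while Nadaraya-Watson kernel weights are constrained >= 0.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50)])
x0 = np.array([1.0, 1.5])  # query outside the training range: extrapolation

w_ols = x0 @ np.linalg.solve(X.T @ X, X.T)        # OLS smoother weights
kern = np.exp(-((X[:, 1] - x0[1]) ** 2) / 0.02)
w_nw = kern / kern.sum()                          # non-negative kernel weights

neg_mass = lambda w: float(-w[w < 0].sum())       # illustrative extrapolation proxy
print(neg_mass(w_ols), neg_mass(w_nw))            # positive vs. exactly zero
```

A soft constraint in the paper's spirit would penalize `neg_mass` with a tunable hyperparameter instead of forbidding negative weights outright.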

Updated: 2025-10-16 01:37:40

Categories: cs.LG,econ.EM,stat.ME

Download: http://arxiv.org/abs/2509.17180v2

LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks -- OBQA, CSQA, and StrategyQA -- show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
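The confidence-based generation early exit can be sketched with a toy decoder step; the EOS convention and threshold below are hypothetical, as the paper's exact criterion is not given here:

```python
def generate_with_early_exit(step_fn, max_tokens, conf_threshold=0.9):
    """Greedy decoding that stops once the model is confidently done.

    `step_fn(tokens)` stands in for one decoder step and returns
    (next_token, probability). Decoding exits early when the end-of-answer
    token (here -1) is proposed with probability above `conf_threshold`,
    suppressing redundant trailing tokens between reasoning stages.
    """
    EOS = -1
    tokens = []
    for _ in range(max_tokens):
        tok, p = step_fn(tokens)
        if tok == EOS and p >= conf_threshold:
            break
        tokens.append(tok)
    return tokens

# Toy model: emits 5 content tokens, then keeps proposing EOS confidently.
def toy_step(tokens):
    return (len(tokens), 0.99) if len(tokens) < 5 else (-1, 0.97)

print(generate_with_early_exit(toy_step, max_tokens=64))  # [0, 1, 2, 3, 4]
```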

Updated: 2025-10-16 01:37:39

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14211v1

AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity

The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems, such as e-rater, when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.

Updated: 2025-10-16 01:36:57

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2410.17439v4

Incentive-Based Federated Learning

Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.

Updated: 2025-10-16 01:29:54

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2510.14208v1

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78--96.89% vs. 57.25--64.19% without tuning in Llama, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rate to 1-2% in both models. The most prevalent toxic behaviors are Insult with 84.9--87.8% vs. 44.2--50.8% without tuning, and Flaming with 81.2--85.1% vs. 31.5--38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.

Updated: 2025-10-16 01:27:44

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14207v1

DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.

Updated: 2025-10-16 01:26:38

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.14205v1

Infrastructure Patterns in Toll Scam Domains: A Comprehensive Analysis of Cybercriminal Registration and Hosting Strategies

Toll scams involve criminals registering fake domains that pretend to be legitimate transportation agencies to trick users into making fraudulent payments. Although these scams are rapidly increasing and causing significant harm, they have not been extensively studied. We present the first large-scale analysis of toll scam domains, using a newly created dataset of 67,907 confirmed scam domains mostly registered in 2025. Our study reveals that attackers exploit permissive registrars and less common top-level domains, with 86.9% of domains concentrated in just five non-mainstream TLDs and 72.9% registered via a single provider. We also discover specific registration patterns, including short bursts of activity that suggest automated, coordinated attacks, with over half of domains registered in the first quarter of 2025. This extreme temporal clustering reflects highly synchronized campaign launches. Additionally, we build a simple predictive model using only domain registration data to predict which scam domains are likely to be suspended -- a proxy for confirmed abuse -- achieving 80.4% accuracy, and 92.3% sensitivity. Our analysis reveals attacker strategies for evading detection -- such as exploiting obscure TLDs, permissive registrars, and coordinated registration bursts -- which can inform more targeted interventions by registrars, hosting providers, and security platforms. However, our results suggest that registration metadata alone may be insufficient, and incorporating features from domain URLs and webpage content could further improve detection.

Updated: 2025-10-16 01:09:30

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14198v1

EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing

Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images-resulting in limited coverage and inheriting biases from prior generative models-or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
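The EdiVal-CC idea of scoring only the unchanged content can be sketched with cosine similarity over per-object embeddings; the object names and random vectors below are hypothetical placeholders for detector-guided crop features:

```python
import numpy as np

def content_consistency(before: dict, after: dict, edited: set) -> float:
    """Mean cosine similarity of embeddings for objects NOT targeted by the edit.

    `before`/`after` map object names to feature vectors; names and vectors
    here are illustrative. A score near 1 means untouched objects survived
    the edit intact; lower scores flag collateral changes.
    """
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    kept = [k for k in before if k in after and k not in edited]
    return sum(cos(before[k], after[k]) for k in kept) / len(kept)

rng = np.random.default_rng(0)
objs = {name: rng.normal(size=8) for name in ["cat", "sofa", "lamp"]}
after = {**objs, "cat": rng.normal(size=8)}  # the edit replaced the cat
score = content_consistency(objs, after, edited={"cat"})
print(round(score, 3))  # 1.0 -- sofa and lamp are unchanged
```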

Updated: 2025-10-16 01:09:09

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.13399v2

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

Updated: 2025-10-16 01:08:45

Categories: cs.CL,cs.AI,cs.CV,cs.MM

Download: http://arxiv.org/abs/2510.13721v2

Implementation of AI in Precision Medicine

Artificial intelligence (AI) has become increasingly central to precision medicine by enabling the integration and interpretation of multimodal data, yet implementation in clinical settings remains limited. This paper provides a scoping review of literature from 2019-2024 on the implementation of AI in precision medicine, identifying key barriers and enablers across data quality, clinical reliability, workflow integration, and governance. Through an ecosystem-based framework, we highlight the interdependent relationships shaping real-world translation and propose future directions to support trustworthy and sustainable implementation.

Updated: 2025-10-16 00:55:15

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14194v1

Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation

Diffusion models excel at generation, but their latent spaces are not explicitly organized for interpretable control. We introduce ConDA (Contrastive Diffusion Alignment), a framework that applies contrastive learning within diffusion embeddings to align latent geometry with system dynamics. Motivated by recent advances showing that contrastive objectives can recover more disentangled and structured representations, ConDA organizes diffusion latents such that traversal directions reflect underlying dynamical factors. Within this contrastively structured space, ConDA enables nonlinear trajectory traversal that supports faithful interpolation, extrapolation, and controllable generation. Across benchmarks in fluid dynamics, neural calcium imaging, therapeutic neurostimulation, and facial expression, ConDA produces interpretable latent representations with improved controllability compared to linear traversals and conditioning-based baselines. These results suggest that diffusion latents encode dynamics-relevant structure, but exploiting this structure requires latent organization and traversal along the latent manifold.
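
As a concrete reference for the contrastive machinery the abstract invokes, the following is a minimal NumPy sketch of the standard InfoNCE objective over paired embeddings. It is a generic illustration only; ConDA's actual objective, encoder, and latent-traversal components are not reproduced here.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Batch InfoNCE sketch: each anchor should be most similar to its
    own positive among all positives in the batch.  Rows are
    L2-normalised so the dot product is cosine similarity."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # diagonal = matching pairs
```

With well-aligned pairs the loss approaches zero; with shuffled pairs it grows, which is what drives matched factors together and mismatched factors apart in the embedding space.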

Updated: 2025-10-16 00:48:05

Categories: cs.LG

Download: http://arxiv.org/abs/2510.14190v1

SoK: Adversarial Evasion Attacks Practicality in NIDS Domain and the Impact of Dynamic Learning

Machine Learning (ML) has become pervasive, and its deployment in Network Intrusion Detection Systems (NIDS) is inevitable due to its automated nature and high accuracy compared to traditional models in processing and classifying large volumes of data. However, ML has been found to have several flaws, most importantly, adversarial attacks, which aim to trick ML models into producing faulty predictions. While most adversarial attack research focuses on computer vision datasets, recent studies have explored the suitability of these attacks against ML-based network security entities, especially NIDS, due to the wide difference between different domains regarding the generation of adversarial attacks. To further explore the practicality of adversarial attacks against ML-based NIDS in-depth, this paper presents several key contributions: identifying numerous practicality issues for evasion adversarial attacks on ML-NIDS using an attack tree threat model, introducing a taxonomy of practicality issues associated with adversarial attacks against ML-based NIDS, identifying specific leaf nodes in our attack tree that demonstrate some practicality for real-world implementation and conducting a comprehensive review and exploration of these potentially viable attack approaches, and investigating how the dynamicity of real-world ML models affects evasion adversarial attacks against NIDS. Our experiments indicate that continuous re-training, even without adversarial training, can reduce the effectiveness of adversarial attacks. While adversarial attacks can compromise ML-based NIDSs, our aim is to highlight the significant gap between research and real-world practicality in this domain, which warrants attention.

Updated: 2025-10-16 00:43:56

Categories: cs.CR,cs.LG,cs.NI

Download: http://arxiv.org/abs/2306.05494v4

Measuring and Mitigating Identity Bias in Multi-Agent Debate via Anonymization

Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer's view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish "self" from "peer", which forces equal weights on agent identity, thereby reducing bias. Third, we define the Identity Bias Coefficient (IBC), a principled metric that measures how often an agent follows a peer versus itself. Empirical studies across multiple models, datasets and debate rounds confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to "mask" identity to ensure that MAD systems reason based on content rather than source identity. Code is released in https://github.com/deeplearning-wisc/MAD-identity-bias.
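
The paper defines the Identity Bias Coefficient precisely; the sketch below is only a hypothetical illustration of the follow-peer versus follow-self contrast the metric is described as measuring, not the authors' formula. All names and the scoring rule are assumptions.

```python
def identity_bias_coefficient(own_prev, peer_prev, final):
    """Illustrative follow-rate contrast (NOT the paper's exact formula):
    among rounds where the agent's prior answer and the peer's answer
    disagree, compare how often the final answer sides with the peer
    versus with the agent itself.  Positive leans sycophantic,
    negative leans self-biased, zero is balanced."""
    followed_peer = followed_self = 0
    for own, peer, out in zip(own_prev, peer_prev, final):
        if own == peer:
            continue  # no conflict, uninformative round
        if out == peer:
            followed_peer += 1
        elif out == own:
            followed_self += 1
    total = followed_peer + followed_self
    if total == 0:
        return 0.0
    return (followed_peer - followed_self) / total
```

Under this toy rule, an agent that always adopts the peer's conflicting answer scores 1.0, and one that always keeps its own scores -1.0.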

Updated: 2025-10-16 00:43:28

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2510.07517v2

Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce VisualToolBench, a visual tool-use reasoning benchmark that rigorously evaluates MLLMs' ability to perceive, transform, and reason across complex visual-textual tasks under the think-with-images paradigm. VisualToolBench comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, VisualToolBench offers critical insights for advancing visual intelligence in MLLMs.

Updated: 2025-10-16 00:41:28

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.12712v2

Securing U.S. Critical Infrastructure: Lessons from Stuxnet and the Ukraine Power Grid Attacks

Industrial Control Systems (ICS) underpin the United States' critical infrastructure, managing essential services such as power, water, and transportation that are vital to national security and public safety. However, increasing digital integration has exposed these systems to escalating cyber threats. Historical attacks like Stuxnet and the Ukraine power grid incident revealed exploitable weaknesses-poor network segmentation, outdated software, weak authentication, and inadequate monitoring-that persist in many U.S. ICS environments today. This paper analyzes these landmark attacks to identify recurring vulnerabilities and assess their relevance to current U.S. infrastructure. It argues that without immediate reforms, similar exploits could lead to catastrophic disruptions and national security crises. To address these risks, the paper proposes policy measures focused on implementing zero-trust architecture and improved network segmentation to enhance system resilience. These recommendations aim to guide policymakers and industry leaders in securing the nation's most critical operational technologies against future cyber threats.

Updated: 2025-10-16 00:30:17

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14185v1

MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation

We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA's effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.

Updated: 2025-10-16 00:30:08

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.14184v1

One-Step Flow Policy Mirror Descent

Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on rectified flow policy and MeanFlow policy, respectively. Extensive empirical evaluations on MuJoCo and visual DeepMind Control Suite benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring orders of magnitude less computational cost during inference.
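
To make the one-step idea concrete: when the flow paths are straight, a single Euler step of the probability-flow ODE from t=0 to t=1 incurs no discretization error. The sketch below uses a hand-written velocity field as a stand-in for a learned network, assuming a point-mass target for which the straight-flow velocity is known in closed form; it is not FPMD itself.

```python
import numpy as np

def one_step_sample(velocity_fn, x0):
    """Single Euler step of the probability-flow ODE dx/dt = v(x, t)
    from t=0 to t=1.  For a perfectly straight (rectified) flow the
    velocity is constant along each path, so one step is exact."""
    return x0 + velocity_fn(x0, 0.0)

# Toy stand-in for a learned velocity field: for the straight flow from
# any start point to a point mass at `target`, the velocity at t=0 is
# simply (target - x0), so one Euler step lands exactly on the target.
target = np.array([2.0, -1.0])
v = lambda x, t: target - x

x1 = one_step_sample(v, np.zeros(2))
```

Real policies have non-degenerate targets, which is where the paper's variance/discretization-error analysis comes in; the toy case only shows the mechanism.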

Updated: 2025-10-16 00:28:10

Categories: cs.LG

Download: http://arxiv.org/abs/2507.23675v2

Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach

One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pre-trained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model's performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs' performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.
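
Expected Calibration Error, which delimits the paper's calibratable and non-calibratable regimes, is a standard binned metric; a minimal sketch of the usual formulation (the paper's exact binning choices may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the weighted average gap between mean confidence and
    empirical accuracy within each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # right-inclusive bins so confidence 1.0 lands in the last bin
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.sum() / n * gap
    return ece
```

An overconfident model (e.g. confidence 1.0 but only half its answers correct) gets a large ECE, which is exactly the post-alignment failure mode the abstract describes.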

Updated: 2025-10-16 00:26:43

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2505.01997v3

Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability obtained from a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.

Updated: 2025-10-16 00:20:57

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.14179v1

ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
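
Reward machines themselves are a standard automata-based formalism; a minimal hand-written sketch (not FM-generated, with hypothetical states and events) shows the structure ARM-FM automates the construction of:

```python
class RewardMachine:
    """Minimal reward machine: a finite automaton whose transitions are
    labelled with propositional events and emit scalar rewards.
    `delta` maps (state, event) -> (next_state, reward)."""

    def __init__(self, delta, start, terminal):
        self.delta, self.terminal = delta, terminal
        self.state = start

    def step(self, event):
        """Advance on an observed event; unlisted events self-loop with 0 reward."""
        nxt, r = self.delta.get((self.state, event), (self.state, 0.0))
        self.state = nxt
        return r

    def done(self):
        return self.state in self.terminal

# Hypothetical "fetch the key, then open the door" task as two sub-goals:
rm = RewardMachine(
    delta={("u0", "got_key"): ("u1", 0.1),
           ("u1", "door_open"): ("u2", 1.0)},
    start="u0", terminal={"u2"},
)
```

Because reward depends on the automaton state, opening the door before fetching the key yields nothing; this is the task decomposition the abstract credits to the RM formalism, with ARM-FM attaching language embeddings to states like `u0`/`u1` for cross-task generalization.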

Updated: 2025-10-16 00:18:30

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14176v1

Power Grid Cybersecurity: Policy Analysis White Paper

The U.S. power grid underpins national security, public safety, and economic stability, but faces growing cyber risks from vulnerabilities in industrial control systems, remote access, and poor cyber hygiene. Despite its critical importance, current policy remains fragmented and reactive. This paper proposes a dual policy approach to strengthen grid cybersecurity: enhanced information sharing between government and private utilities to improve threat detection and response, and standardized cyber hygiene practices to reduce common attack vectors. For long-term resilience, a Unified National Cybersecurity Framework is recommended to align existing NERC, IEC, IEEE, and NIST standards, eliminate regulatory overlap, and adapt to evolving threats. Together, these policies offer both immediate and sustainable improvements in safeguarding the nation's most vital infrastructure.

Updated: 2025-10-16 00:08:00

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14171v1

JEDA: Query-Free Clinical Order Search from Ambient Dialogues

Clinical conversations mix explicit directives (order a chest X-ray) with implicit reasoning (the cough worsened overnight, we should check for pneumonia). Many systems rely on LLM rewriting, adding latency, instability, and opacity that hinder real-time ordering. We present JEDA (Joint Embedding for Direct and Ambient clinical orders), a domain-initialized bi-encoder that retrieves canonical orders directly and, in a query-free mode, encodes a short rolling window of ambient dialogue to trigger retrieval. Initialized from PubMedBERT and fine-tuned with a duplicate-safe contrastive objective, JEDA aligns heterogeneous expressions of intent to shared order concepts. Training uses constrained LLM guidance to tie each signed order to complementary formulations (command only, context only, command+context, context+reasoning), producing clearer inter-order separation, tighter query–order coupling, and stronger generalization. The query-free mode is noise-resilient, reducing sensitivity to disfluencies and ASR errors by conditioning on a short window rather than a single utterance. Deployed in practice, JEDA yields large gains and substantially outperforms its base encoder and recent open embedders (Linq Embed Mistral, SFR Embedding, GTE Qwen, BGE large, Embedding Gemma). The result is a fast, interpretable, LLM-free retrieval layer that links ambient context to actionable clinical orders in real time.
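
A generic sketch of the bi-encoder retrieval step described above, assuming the window and order embeddings are already computed by some encoder; JEDA's actual encoder, training, and order catalogue are not modeled, and all names here are illustrative:

```python
import numpy as np

def retrieve_orders(window_embedding, order_embeddings, order_names, k=2):
    """Query-free retrieval sketch: rank canonical orders by cosine
    similarity against the embedding of a short rolling dialogue
    window.  No LLM rewriting step is involved."""
    q = window_embedding / np.linalg.norm(window_embedding)
    m = order_embeddings / np.linalg.norm(order_embeddings, axis=1, keepdims=True)
    scores = m @ q                     # cosine similarity per catalogue order
    top = np.argsort(-scores)[:k]      # indices of the k best matches
    return [order_names[i] for i in top]
```

Because ranking is a single matrix-vector product over a precomputed catalogue, the latency profile is that of nearest-neighbor search rather than LLM generation, which is the point of the LLM-free design.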

Updated: 2025-10-16 00:00:21

Categories: cs.AI

Download: http://arxiv.org/abs/2510.14169v1

By Xinhai (Sean) Zou.